Objective To systematically review the accuracy and consistency of large language models (LLMs) in assessing risk of bias in analytical studies.
Methods Cohort and case-control studies related to COVID-19 were drawn from the team's published systematic review of the clinical characteristics of COVID-19. Two researchers independently screened the studies, extracted data, and assessed the risk of bias of the included studies; the LLM-based BiasBee model (Non-RCT version) was used for automated assessment. Kappa statistics and score differences were used to analyze agreement between the LLM and human evaluations, with subgroup analyses for Chinese- and English-language studies.
Results A total of 210 studies were included. Meta-analysis showed that LLM scores were generally higher than those of human evaluators, particularly for representativeness of the exposed cohort (Δ=0.764) and selection of external controls (Δ=0.109). Kappa analysis indicated slight agreement on items such as exposure assessment (κ=0.059) and adequacy of follow-up (κ=0.093), but marked discrepancies on more subjective items such as control selection (κ=−0.112) and non-response rate (κ=−0.115). Subgroup analysis showed higher scoring consistency for the LLM in English-language studies than in Chinese-language studies.
Conclusion LLMs show potential for risk of bias assessment; however, notable differences from human raters remain on more subjective items. Future research should focus on optimizing prompt engineering and model fine-tuning to improve LLM accuracy and consistency on complex tasks.
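The agreement statistics reported above (Cohen's kappa and score differences Δ) can be illustrated with a minimal sketch. The data below are hypothetical and assume binary item-level judgments (criterion met or not) paired between a human rater and the LLM for a single risk-of-bias item; only the use of Cohen's kappa and the mean LLM-minus-human difference mirrors the abstract, not the authors' actual pipeline.

```python
# Minimal sketch of item-level agreement analysis (hypothetical data).
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired ratings for one risk-of-bias item (1 = criterion met, 0 = not met).
human_ratings = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
llm_ratings   = np.array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1])

# Chance-corrected agreement between the two raters for this item.
kappa = cohen_kappa_score(human_ratings, llm_ratings)

# Mean score difference (LLM minus human), analogous to the reported Δ values.
delta = llm_ratings.mean() - human_ratings.mean()

print(f"kappa = {kappa:.3f}, Δ (LLM − human) = {delta:.3f}")
```

In practice this computation would be repeated per item and per language subgroup before pooling, but the abstract does not describe those steps in detail.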
Objective To compare the performance of ChatGPT-4.5 and DeepSeek-V3 across five key domains of physical therapy for knee osteoarthritis (KOA), evaluating the accuracy, completeness, reliability, and readability of their responses and exploring their potential for clinical application.
Methods Twenty-one core questions were extracted from 10 authoritative KOA rehabilitation guidelines published between September 2011 and January 2024, covering five task categories: rehabilitation assessment, physical agent modalities, exercise therapy, assistive device use, and patient education. Responses were generated with both ChatGPT-4.5 and DeepSeek-V3 and rated by four physical therapists, each with more than five years of clinical experience, on Likert scales (accuracy and completeness: 5 points; reliability: 7 points). Scale scores were compared between the two large language models, and language-style clustering was performed as an additional analysis.
Results Most scale scores were not normally distributed and are presented as median (lower quartile, upper quartile). ChatGPT-4.5 scored higher than DeepSeek-V3 in accuracy [4.75 (4.75, 4.75) vs. 4.75 (4.50, 5.00), P=0.018], completeness [4.75 (4.50, 5.00) vs. 4.25 (4.00, 4.50), P=0.006], and reliability [5.75 (5.50, 6.00) vs. 5.50 (5.50, 5.50), P=0.015]. Clustering of language styles showed that ChatGPT-4.5 had a more diverse linguistic style, whereas DeepSeek-V3 responses were more standardized. ChatGPT-4.5 scored higher than DeepSeek-V3 in lexical richness [4.792 (4.720, 4.912) vs. 4.564 (4.409, 4.653), P<0.001] but lower in syntactic richness [2.133 (2.072, 2.154) vs. 2.187 (2.154, 2.206), P=0.003].
Conclusions ChatGPT-4.5 performed better in accuracy, completeness, and reliability, indicating a stronger capacity for task execution, and its more varied vocabulary reflects greater flexibility in language generation. DeepSeek-V3 showed greater syntactic richness and more standardized language. ChatGPT-4.5 is better suited to content-rich tasks that require detailed explanation, whereas DeepSeek-V3 is more appropriate for standardized question-answering applications.
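The abstract reports medians with quartiles and P values but does not name the statistical test used for the paired comparisons. The sketch below assumes a Wilcoxon signed-rank test on hypothetical per-question scores (mean of the four raters) for one domain; the scores, question count, and test choice are illustrative assumptions, not the authors' actual analysis.

```python
# Minimal sketch of a paired, nonparametric comparison of Likert scores (hypothetical data).
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-question accuracy scores (mean of four raters, 5-point Likert scale).
chatgpt_45  = np.array([4.75, 5.00, 4.50, 4.75, 5.00, 4.25, 4.75, 4.50])
deepseek_v3 = np.array([4.50, 4.75, 4.25, 4.25, 4.50, 4.00, 4.25, 4.50])

# Paired comparison across the same questions answered by both models.
stat, p_value = wilcoxon(chatgpt_45, deepseek_v3)

def summarize(scores):
    """Return median (Q1, Q3), matching the abstract's reporting format."""
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return f"{med:.2f} ({q1:.2f}, {q3:.2f})"

print(f"ChatGPT-4.5: {summarize(chatgpt_45)}  DeepSeek-V3: {summarize(deepseek_v3)}")
print(f"Wilcoxon W = {stat:.1f}, P = {p_value:.3f}")
```

With small samples containing ties, SciPy falls back to a normal approximation for the P value, which is acceptable for an illustration but would need to be reported explicitly in a formal analysis.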