West China Medical Publishers

Keyword search for "Large language models": 2 results
  • Evaluation of the accuracy of the large language model for risk of bias assessment in analytical studies

    Objective To systematically review the accuracy and consistency of large language models (LLMs) in assessing risk of bias in analytical studies. Methods Cohort and case-control studies on COVID-19 were drawn from the team's published systematic review of the clinical characteristics of COVID-19. Two researchers independently screened the studies, extracted data, and assessed the risk of bias of the included studies; the LLM-based BiasBee model (Non-RCT version) was used for automated evaluation. Kappa statistics and score differences were used to analyze the agreement between LLM and human evaluations, with subgroup analysis for Chinese and English studies. Results A total of 210 studies were included. Meta-analysis showed that LLM scores were generally higher than those of human evaluators, particularly in representativeness of exposed cohorts (Δ=0.764) and selection of external controls (Δ=0.109). Kappa analysis indicated slight agreement in items such as exposure assessment (κ=0.059) and adequacy of follow-up (κ=0.093), while more subjective items, such as control selection (κ=−0.112) and non-response rate (κ=−0.115), showed significant discrepancies. Subgroup analysis revealed higher scoring consistency for the LLM in English-language studies than in Chinese-language studies. Conclusion LLMs demonstrate potential in risk of bias assessment; however, notable differences remain in more subjective tasks. Future research should focus on optimizing prompt engineering and model fine-tuning to enhance LLM accuracy and consistency in complex tasks.
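The kappa agreement reported in the abstract above can be illustrated with a minimal Python sketch. It assumes the unweighted Cohen's kappa for two raters; the abstract does not state which kappa variant was used. The function name `cohen_kappa` and the sample ratings are hypothetical and are not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items.

    Assumption: this is one common kappa variant; the source abstract
    does not specify the exact statistic used.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: share of items given the same rating by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal rating frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical item-level ratings (1 = criterion met, 0 = not met).
human_ratings = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
llm_ratings = [1, 1, 1, 1, 1, 0, 1, 1, 1, 0]
kappa = cohen_kappa(human_ratings, llm_ratings)
print(f"kappa = {kappa:.3f}")
```

A kappa near 0 means agreement is no better than chance, and a negative kappa (as reported for control selection and non-response rate) means the raters agreed less often than chance would predict.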

  • Performance comparison of ChatGPT-4.5 and DeepSeek-V3 in rehabilitation guidance for knee osteoarthritis

    Objective To compare the performance of ChatGPT-4.5 and DeepSeek-V3 across five key domains of physical therapy for knee osteoarthritis (KOA), evaluating the accuracy, completeness, reliability, and readability of their responses and exploring their clinical application potential. Methods Twenty-one core questions were extracted from 10 authoritative KOA rehabilitation guidelines published between September 2011 and January 2024, covering five task categories: rehabilitation assessment, physical agent modalities, exercise therapy, assistive device use, and patient education. Responses were generated using both the ChatGPT-4.5 and DeepSeek-V3 models and evaluated on Likert scales (accuracy and completeness: 5 points; reliability: 7 points) by four physical therapists with over five years of clinical experience. The scale scores were compared between the two large language models. Additional assessment included language style clustering. Results Most of the scale scores did not follow a normal distribution and were presented as median (lower quartile, upper quartile). ChatGPT-4.5 outperformed DeepSeek-V3 with higher scores in accuracy [4.75 (4.75, 4.75) vs. 4.75 (4.50, 5.00), P=0.018], completeness [4.75 (4.50, 5.00) vs. 4.25 (4.00, 4.50), P=0.006], and reliability [5.75 (5.50, 6.00) vs. 5.50 (5.50, 5.50), P=0.015]. Clustering analysis of language styles revealed that ChatGPT-4.5 demonstrated a more diverse linguistic style, whereas DeepSeek-V3 responses were more standardized. ChatGPT-4.5 scored higher than DeepSeek-V3 in lexical richness [4.792 (4.720, 4.912) vs. 4.564 (4.409, 4.653), P<0.001], but lower in syntactic richness [2.133 (2.072, 2.154) vs. 2.187 (2.154, 2.206), P=0.003]. Conclusions ChatGPT-4.5 demonstrates superior performance in accuracy, completeness, and reliability, indicating a stronger capacity for task execution. It uses more diverse vocabulary and shows greater flexibility in language generation. DeepSeek-V3 exhibits greater syntactic richness and more normative language. ChatGPT-4.5 is better suited for content-rich tasks that require detailed explanation, while DeepSeek-V3 is more appropriate for standardized question-answering applications.
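The "median (lower quartile, upper quartile)" summary used in the abstract above can be sketched in Python as follows. This is only an illustration: the abstract does not say which quartile convention was used, so the `inclusive` method of `statistics.quantiles` is an assumption, and the helper names and sample scores are hypothetical rather than study data.

```python
from statistics import median, quantiles


def median_iqr(scores):
    """Return (median, lower quartile, upper quartile) for a list of scores.

    Assumption: quartiles use the "inclusive" interpolation method; other
    conventions can give slightly different values on small samples.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to compute quartiles")
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q1, q3


def format_median_iqr(scores, digits=2):
    """Format scores in the 'median (lower quartile, upper quartile)' style."""
    med, q1, q3 = median_iqr(scores)
    return f"{med:.{digits}f} ({q1:.{digits}f}, {q3:.{digits}f})"


# Hypothetical per-question Likert scores for one model.
example_scores = [4.5, 4.75, 5.0, 4.25, 4.75, 5.0, 4.5]
print(format_median_iqr(example_scores))
```

This style of summary is typically preferred over mean and standard deviation when, as the abstract notes, the scores are not normally distributed.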
