ZHOU Lijun 1,2,3, LIAN Hao 4, LIU Sijia 1,2
  • 1. Rehabilitation Medicine Center and Institute of Rehabilitation Medicine, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, P. R. China;
  • 2. Key Laboratory of Rehabilitation Medicine in Sichuan Province, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, P. R. China;
  • 3. West China School of Medicine, Sichuan University, Chengdu, Sichuan 610041, P. R. China;
  • 4. West China School of Public Health, Sichuan University, Chengdu, Sichuan 610041, P. R. China
LIU Sijia, Email: 137595879@qq.com

Objective  To compare the performance of ChatGPT-4.5 and DeepSeek-V3 across five key domains of physical therapy for knee osteoarthritis (KOA), evaluating the accuracy, completeness, reliability, and readability of their responses, and to explore their potential for clinical application.

Methods  Twenty-one core questions were extracted from 10 authoritative KOA rehabilitation guidelines published between September 2011 and January 2024, covering five task categories: rehabilitation assessment, physical agent modalities, exercise therapy, assistive device use, and patient education. Responses were generated by both ChatGPT-4.5 and DeepSeek-V3 and rated by four physical therapists, each with more than five years of clinical experience, on Likert scales (5-point scales for accuracy and completeness; a 7-point scale for reliability). Scale scores were compared between the two large language models, and language-style clustering was also performed.

Results  Most scale scores did not follow a normal distribution and are therefore presented as median (lower quartile, upper quartile). ChatGPT-4.5 outperformed DeepSeek-V3 in accuracy [4.75 (4.75, 4.75) vs. 4.75 (4.50, 5.00), P=0.018], completeness [4.75 (4.50, 5.00) vs. 4.25 (4.00, 4.50), P=0.006], and reliability [5.75 (5.50, 6.00) vs. 5.50 (5.50, 5.50), P=0.015]. Clustering analysis of language styles showed that ChatGPT-4.5 had a more diverse linguistic style, whereas DeepSeek-V3 responses were more standardized. ChatGPT-4.5 scored higher than DeepSeek-V3 in lexical richness [4.792 (4.720, 4.912) vs. 4.564 (4.409, 4.653), P<0.001] but lower in syntactic richness [2.133 (2.072, 2.154) vs. 2.187 (2.154, 2.206), P=0.003].

Conclusions  ChatGPT-4.5 demonstrated superior accuracy, completeness, and reliability, indicating a stronger capacity for task execution, along with a more diverse vocabulary and greater flexibility in language generation. DeepSeek-V3 exhibited greater syntactic richness and more standardized language. ChatGPT-4.5 is better suited to content-rich tasks that require detailed explanation, whereas DeepSeek-V3 is more appropriate for standardized question-answering applications.
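The abstract reports non-normally distributed scores as median (lower quartile, upper quartile) and quantifies lexical richness. A minimal Python sketch of both computations, assuming a type-token ratio as the lexical-richness proxy (the paper does not specify its exact metric, and the rater scores below are hypothetical illustrations, not study data):

```python
import statistics

def summarize(scores):
    """Return median, Q1, Q3 — the median (lower quartile, upper quartile)
    reporting style used for non-normally distributed scores."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return statistics.median(scores), q1, q3

def type_token_ratio(text):
    """A simple lexical-richness proxy: unique words / total words.
    (Assumption — the paper's actual richness metric is not specified.)"""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical Likert ratings from four raters across several questions
scores = [4.75, 4.50, 5.00, 4.75, 4.25, 4.75, 5.00, 4.50]
med, q1, q3 = summarize(scores)
print(f"{med:.2f} ({q1:.2f}, {q3:.2f})")

print(type_token_ratio("exercise therapy improves knee function and exercise adherence"))
```

Because the score distributions are skewed, a nonparametric test such as the Mann-Whitney U test (e.g., `scipy.stats.mannwhitneyu`) is the conventional choice for the between-model P values reported above, though the paper does not name the test in the abstract.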

Citation: ZHOU Lijun, LIAN Hao, LIU Sijia. Performance comparison of ChatGPT-4.5 and DeepSeek-V3 in rehabilitation guidance for knee osteoarthritis. West China Medical Journal, 2025, 40(6): 870-875. doi: 10.7507/1002-0179.202504066

Copyright © the editorial department of West China Medical Journal of West China Medical Publisher. All rights reserved