• 1. Department of Clinical Pharmacy, General Hospital of Southern Theater Command of PLA, Guangzhou 510010, P. R. China;
  • 2. Guangdong Pharmaceutical Association, Guangzhou 510080, P. R. China;
  • 3. School of Pharmaceutical Sciences, Southern Medical University, Guangzhou 510515, P. R. China;
  • 4. College of Pharmacy, Guangzhou University of Traditional Chinese Medicine, Guangzhou 510006, P. R. China;
WAN Ning, Email: dela0811@163.com

Objective This study proposes employing large language models (LLMs) for medical literature quality assessment, exploring their potential to establish a standardized and scalable intelligent evaluation framework for off-label drug use (OLDU).

Methods The study used two LLM platforms freely available in China, DeepSeek-R1 and Doubao. Following the medical literature quality assessment tools recommended in the evidence-based evaluation specification for off-label drug use issued by the Guangdong Pharmaceutical Association, we selected the Jadad scale and the MINORS criteria. These tools were used to assess the quality of the two most prevalent types of medical literature in OLDU evidence evaluation: randomized controlled trials (RCTs) and non-randomized controlled trials (non-RCTs). Using chain-of-thought (CoT) prompting techniques, we developed standardized evaluation templates (see the sketch following this abstract). The quality scores generated by the LLMs were then compared against those reported in systematic reviews or assigned by clinical pharmacists.

Results For RCTs, DeepSeek-R1 demonstrated consistency with human assessments in quality appraisal. Doubao, however, diverged from the manual evaluation results: three repeated evaluations yielded inconsistent outcomes, and the "allocation concealment" item was identified inaccurately. For non-RCTs, both models produced quality assessments concordant with those of human evaluators, while also demonstrating the capacity to detect systematic evaluation inaccuracies attributable to human subjective bias.

Conclusion This study demonstrates that prompt engineering-driven LLMs can efficiently conduct quality assessments of medical literature. However, model selection requires rigorous validation against domain-specific benchmarks, and scoring outputs require mandatory expert review. Our findings further reveal the need to refine current quality appraisal criteria with granular operational definitions, thereby facilitating standardized automation. This approach not only enhances the efficiency and transparency of evidence-based decision-making for OLDU but also extends to systematic reviews and rapid health technology assessments; by replacing traditional manual literature quality evaluation with automated scoring mechanisms, it enables a paradigm shift in the efficiency of evidence processing.
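The study's actual CoT templates are not reproduced in this abstract. The following is a minimal sketch of what a template-driven Jadad evaluation loop with repeated runs could look like, assuming an OpenAI-compatible chat endpoint; the prompt wording, endpoint URL, model identifier, JSON schema, and helper names are illustrative assumptions, not the authors' published template.

```python
# Illustrative sketch only: prompt text, endpoint, and model name are assumptions.
import json
from openai import OpenAI

# A CoT-style template: the model is asked to reason item by item,
# quote supporting text, and only then assign Jadad scores.
JADAD_PROMPT = """You are a clinical pharmacist appraising a randomized
controlled trial with the Jadad scale. Think step by step before scoring.

1. Randomization: Is the study described as randomized? Is the sequence
   generation method described and appropriate? (0-2 points)
2. Blinding: Is the study described as double-blind? Is the blinding
   method described and appropriate? (0-2 points)
3. Withdrawals and dropouts: Are they described? (0-1 point)

For each item, quote the supporting sentence from the article, explain
your reasoning, then assign the score. Finish with a JSON object:
{"randomization": x, "blinding": y, "withdrawals": z, "total": t}

Article text:
"""

client = OpenAI(base_url="https://api.deepseek.com",  # assumed endpoint
                api_key="YOUR_API_KEY")

def jadad_score(article_text: str, runs: int = 3) -> list[dict]:
    """Score the same article several times to check run-to-run
    consistency, mirroring the study's three repeated evaluations."""
    results = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model="deepseek-reasoner",  # assumed model identifier
            messages=[{"role": "user",
                       "content": JADAD_PROMPT + article_text}],
        )
        text = resp.choices[0].message.content
        # Take the trailing JSON object as the structured score.
        results.append(json.loads(text[text.rfind("{"):]))
    return results

scores = jadad_score(open("rct_fulltext.txt").read())
print(scores, "consistent across runs:", all(s == scores[0] for s in scores))
```

Agreement between such machine-generated scores and the human reference scores (from systematic reviews or clinical pharmacists) could then be summarized with a standard concordance statistic such as Cohen's kappa.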

Copyright © the editorial department of Chinese Journal of Evidence-Based Medicine of West China Medical Publisher. All rights reserved