Objective To systematically review the accuracy and consistency of large language models (LLMs) in assessing risk of bias in analytical studies. Methods Cohort and case-control studies on COVID-19 were drawn from the team's published systematic review of the clinical characteristics of COVID-19. Two researchers independently screened the studies, extracted data, and assessed the risk of bias of the included studies; the LLM-based BiasBee model (Non-RCT version) was used for the automated evaluation. Kappa statistics and score differences were used to analyze agreement between LLM and human evaluations, with subgroup analyses for Chinese- and English-language studies. Results A total of 210 studies were included. Meta-analysis showed that LLM scores were generally higher than those of human evaluators, particularly for representativeness of exposed cohorts (Δ=0.764) and selection of external controls (Δ=0.109). Kappa analysis indicated slight agreement on items such as exposure assessment (κ=0.059) and adequacy of follow-up (κ=0.093), but notable discrepancies on more subjective items such as control selection (κ=−0.112) and non-response rate (κ=−0.115). Subgroup analysis revealed higher scoring consistency for the LLM on English-language studies than on Chinese-language studies. Conclusion LLMs show potential for risk of bias assessment; however, notable differences from human raters remain on the more subjective items. Future research should focus on prompt engineering and model fine-tuning to enhance LLM accuracy and consistency in complex tasks.
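For readers who wish to reproduce the agreement analysis described above, the following is a minimal sketch, assuming the Newcastle-Ottawa-style items mentioned in the abstract are scored 0/1 per study by one human reviewer and by the LLM. The example rating arrays are hypothetical and the κ and Δ values they yield are illustrative only; Cohen's kappa comes from scikit-learn's cohen_kappa_score.

```python
# Minimal sketch of the item-level agreement analysis: Cohen's kappa plus the
# mean score difference (Δ = LLM minus human). Ratings below are hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement(human: np.ndarray, llm: np.ndarray) -> dict:
    """Return kappa and Δ for one risk-of-bias item across studies."""
    return {
        "kappa": cohen_kappa_score(human, llm),
        "delta": float(np.mean(llm) - np.mean(human)),  # Δ > 0: LLM rates higher
    }

# Hypothetical 0/1 ratings for one item across ten studies.
human = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
llm = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])
print(agreement(human, llm))  # {'kappa': 0.2857..., 'delta': 0.3}
```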
Nonrandomized studies are an important means of evaluating the effects of exposures (including environmental, occupational, and behavioral exposures) on human health. The Risk Of Bias In Non-randomized Studies of Exposures (ROBINS-E) tool is used to evaluate the risk of bias in observational studies of natural or occupational exposures. This paper introduces the main contents of ROBINS-E 2022, including its background, seven domains, signalling questions, and assessment process.
The COSMIN-RoB checklist comprises three sections with a total of 10 boxes and is used to evaluate the risk of bias of studies on content validity, internal structure, and other measurement properties. COSMIN classifies reliability, measurement error, criterion validity, hypothesis testing for construct validity, and responsiveness as "other measurement properties"; these focus primarily on the quality of the (sub)scale as a whole rather than on the item level. Among these five measurement properties, reliability, measurement error, and criterion validity are the most widely studied. This paper therefore interprets the COSMIN-RoB checklist with examples to guide researchers in evaluating the risk of bias of studies on the reliability, measurement error, and criterion validity of PROMs.
The COSMIN community updated the COSMIN-RoB checklist on reliability and measurement error in 2021. The updated checklist can be applied to studies of all types of outcome measurement instruments, including clinician-reported outcome measures (ClinPOMs), performance-based outcome measurement instruments (PerFOMs), and laboratory values. To help readers better understand and apply the updated checklist, and to provide methodological references for conducting systematic reviews of ClinPOMs, PerFOMs, and laboratory values, this paper interprets the updated COSMIN-RoB checklist for studies on reliability and measurement error.
Objective To realize automatic risk of bias assessment for randomized controlled trial (RCT) literature using BERT (Bidirectional Encoder Representations from Transformers) for feature representation and text classification. Methods We first searched The Cochrane Library to obtain risk of bias assessments and detailed information on RCTs, and constructed datasets for text classification. We assigned 80% of the data as the training set, 10% as the test set, and 10% as the validation set. We then used BERT to extract features and build text classification models that assess seven types of risk of bias (high vs. low). The results were compared with a traditional machine learning baseline combining n-gram and TF-IDF features with a linear SVM classifier. Precision (P), recall (R), and F1 scores were used to evaluate model performance. Results The BERT-based model achieved F1 scores of 78.5% to 95.2% on the seven risk of bias assessment tasks, 14.7% higher than the traditional machine learning method. It achieved F1 scores of 85.7% to 92.8% on the task of extracting bias descriptions for the six bias types other than "other sources of bias", 18.2% higher than the traditional machine learning method. Conclusions The BERT-based automatic risk of bias assessment model achieves higher accuracy in risk of bias assessment of RCT literature and improves the efficiency of assessment.
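As an illustration of the two pipelines compared in this abstract, the sketch below pairs a TF-IDF n-gram plus linear SVM baseline with a BERT sequence classifier for one binary risk of bias domain (high vs. low). The example sentences, labels, and the "bert-base-uncased" checkpoint are assumptions, and the fine-tuning loop is omitted, so the BERT classification head here is untrained; the sketch shows only the shape of the comparison, evaluated with precision, recall, and F1 as in the paper.

```python
# Sketch: TF-IDF + LinearSVC baseline vs. a BERT classifier for one bias
# domain. Texts and labels are hypothetical; 1 = high risk, 0 = low risk.
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from transformers import AutoModelForSequenceClassification, AutoTokenizer

train_texts = ["randomization was computer generated",
               "no blinding of outcome assessors was reported"]
train_labels = [0, 1]
test_texts = ["allocation was concealed using sealed opaque envelopes"]
test_labels = [0]

# Traditional baseline: word n-grams weighted by TF-IDF, linear SVM classifier.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
baseline.fit(train_texts, train_labels)
baseline_pred = baseline.predict(test_texts)

# BERT-based model: one binary classifier per risk-of-bias domain. The
# classification head is randomly initialized here; a real run would
# fine-tune on the 80% training split before evaluating.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
batch = tok(test_texts, padding=True, truncation=True,
            max_length=512, return_tensors="pt")
with torch.no_grad():
    bert_pred = bert(**batch).logits.argmax(dim=-1).tolist()

# Evaluate both with precision (P), recall (R), and F1, as in the paper.
for name, pred in [("tfidf+svm", baseline_pred), ("bert", bert_pred)]:
    p, r, f1, _ = precision_recall_fscore_support(
        test_labels, pred, average="binary", zero_division=0)
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```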
Accurately assessing the risk of bias is a critical challenge in network meta-analysis (NMA). By integrating direct and indirect evidence, NMA enables comparison of multiple interventions, but its results are often affected by risk of bias, particularly the propagation of bias through complex evidence networks. This paper systematically reviews the risk of bias assessment tools commonly used in NMA, highlighting their applications, limitations, and challenges across interventional trials, observational studies, diagnostic tests, and animal experiments. To address tool misapplication, mixed usage, and the lack of a comprehensive tool for overall bias assessment in NMA, we propose strategies such as simplifying tool operation, enhancing usability, and standardizing evaluation processes. Furthermore, advances in artificial intelligence (AI) and large language models (LLMs) offer promising opportunities to streamline risk of bias assessment and reduce subjective human influence. The development of specialized tools and the integration of intelligent technologies will enhance the rigor and reliability of NMA studies, providing robust evidence to support medical research and clinical decision-making.
In the GRADE approach, randomized trials, which start as high-quality evidence, and observational studies, which start as low-quality evidence, can both be rated down if most of the relevant evidence comes from studies at high risk of bias. Established limitations of randomized trials include lack of allocation concealment, lack of blinding, failure to report loss to follow-up, and failure to appropriately apply the intention-to-treat principle. More recently recognized limitations include stopping trials early because of apparent benefit and selective reporting of outcomes based on the results. Key limitations of observational studies include use of inappropriate controls and failure to adequately adjust for prognostic imbalance. Risk of bias may vary across outcomes (e.g., loss to follow-up is often far less of a problem for all-cause mortality than for quality of life), a point many systematic reviews overlook. In deciding whether to rate down for risk of bias, whether for randomized trials or observational studies, authors should not average across studies; rather, for any given outcome, when studies at both high and low risk of bias are available, they should consider restricting the analysis to the studies at lower risk of bias.
Studies on the measurement properties of patient-reported outcome measures (PROMs) aim to validate those measurement properties. Flaws in the design or statistical analysis of such studies introduce bias, which in turn affects the quality of PROMs. The COSMIN (consensus-based standards for the selection of health measurement instruments) team has therefore developed the COSMIN risk of bias (COSMIN-RoB) checklist to evaluate the risk of bias of studies on the measurement properties of PROMs. The checklist can be used in systematic reviews of PROM measurement properties, and PROM developers can also use it to guide study design and reduce bias during instrument development. Similar assessment tools are currently lacking in China. Therefore, this article introduces the main contents of the COSMIN-RoB checklist and uses examples to interpret how to evaluate the risk of bias of studies on the internal structure of PROMs.
GRADE (Grading of Recommendations Assessment, Development and Evaluation) provides guidance for rating the quality of evidence and grading the strength of recommendations in health care. It is of particular importance for those who summarize evidence for systematic reviews, health technology assessments, and clinical practice guidelines. GRADE offers a systematic and transparent framework for clarifying questions, determining the outcomes of interest, summarizing the evidence that addresses a question, and moving from the evidence to a recommendation or decision. The wide dissemination and uptake of GRADE, endorsed by more than 50 organizations worldwide, many of them highly influential (http://www.gradeworkinggroup.org/), attests to the importance of this work. This article introduces a 20-part series to be published in the Journal of Clinical Epidemiology that provides guidance on how to use the GRADE approach.
With the rapid development of artificial intelligence (AI) and machine learning, AI-based prediction models have become increasingly prevalent in medicine. However, the PROBAST tool used to evaluate prediction models has shown growing limitations when applied to models built with AI methods. Moons and colleagues therefore updated and expanded PROBAST into the PROBAST+AI tool, which is suitable for evaluating prediction model studies based on either AI or regression methods. It covers four domains: participants and data sources, predictors, outcomes, and analysis, allowing systematic assessment of the quality of model development, the risk of bias in model evaluation, and applicability. This article interprets the content and evaluation process of PROBAST+AI, aiming to provide a reference and guidance for researchers in China who use this tool.