Spin in research reports refers to reporting practices that distort the presentation of results. This is particularly critical in medicine, where spin is present in more than 50% of randomized controlled trials (RCTs) that fail to reach the threshold of statistical significance. Comparing declared and reported outcomes is crucial for detecting various types of spin, such as selective reporting. We developed a system for automatic detection of spin in clinical trials. We used 300 pairs of outcomes labeled for semantic similarity. We evaluated baseline statistical models, masked language models (MLMs), and generative large language models (LLMs). We generated similarity scores and used the Youden index to determine the classification threshold. The proposed approach to comparing outcomes using LLMs involves prompt engineering, generating similarity scores based on token probabilities, and majority voting. The results on the test set of 2500 examples, with 90% accuracy and an F1 score of 78%, outperform dedicated models for semantic similarity evaluation but trail behind fine-tuned versions of the BERT model. An advantage of our approach is the ability to generate explanations for the classified examples.
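To illustrate the thresholding step, the following is a minimal sketch (not the authors' released code) of how a classification threshold can be chosen with the Youden index, assuming LLM-derived similarity scores and binary gold labels for the outcome pairs are already available; the function name, variable names, and data values are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(labels: np.ndarray, scores: np.ndarray) -> float:
    """Return the score threshold maximizing Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    j = tpr - fpr  # Youden's J statistic at each candidate threshold
    return float(thresholds[np.argmax(j)])

# Hypothetical example: similarity scores in [0, 1] for outcome pairs,
# with 1 = "same outcome" and 0 = "different outcome" as gold labels.
gold = np.array([1, 1, 0, 0, 1, 0, 0, 1])
sim = np.array([0.92, 0.81, 0.35, 0.60, 0.74, 0.20, 0.55, 0.88])
threshold = youden_threshold(gold, sim)
predictions = (sim >= threshold).astype(int)
```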