Expert of Experts Verification and Alignment (EVAL) Framework for Large Language Models Safety in Gastroenterology

Giuffrè, Mauro; You, Kisung; Pang, Ziteng; …; Shung, Dennis L.
2025
  • journal article

Journal
NPJ DIGITAL MEDICINE
Abstract
Large language models generate plausible text responses to medical questions, but inaccurate responses pose significant risks in medical decision-making. Grading LLM outputs to determine the best model or answer is time-consuming and impractical in clinical settings; therefore, we introduce EVAL (Expert-of-Experts Verification and Alignment) to streamline this process and enhance LLM safety for upper gastrointestinal bleeding (UGIB). We evaluated OpenAI's GPT-3.5/4/4o/o1-preview, Anthropic's Claude-3-Opus, Meta's LLaMA-2 (7B/13B/70B), and Mistral AI's Mixtral (7B) across 27 configurations, including zero-shot baseline, retrieval-augmented generation, and supervised fine-tuning. EVAL uses similarity-based ranking and a reward model trained on human-graded responses for rejection sampling. Among the employed similarity metrics, Fine-Tuned ColBERT achieved the highest alignment with human performance across three separate datasets (ρ = 0.81–0.91). The reward model replicated human grading in 87.9% of cases across temperature settings, and rejection sampling with it significantly improved accuracy by 8.36% overall. EVAL offers scalable potential to assess accuracy for high-stakes medical decision-making.
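The rejection-sampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `rejection_sample` and `toy_reward`, the acceptance threshold, and the keyword-based scorer are all hypothetical stand-ins, assuming only that a reward model maps a candidate response to a numeric score and that low-scoring candidates are discarded.

```python
from typing import Callable, Optional, Sequence


def rejection_sample(
    candidates: Sequence[str],
    reward_fn: Callable[[str], float],
    threshold: float,
) -> Optional[str]:
    """Return the best candidate scoring at or above `threshold`.

    Returns None when every candidate is rejected. `reward_fn` stands in
    for a trained reward model (an assumption of this sketch).
    """
    scored = [(reward_fn(text), text) for text in candidates]
    accepted = [(score, text) for score, text in scored if score >= threshold]
    if not accepted:
        return None
    # Keep the highest-scoring response among those that passed.
    return max(accepted, key=lambda pair: pair[0])[1]


def toy_reward(response: str) -> float:
    """Hypothetical scorer: fraction of placeholder key terms present.

    A real reward model would be learned from human-graded responses;
    these terms are illustrative placeholders only.
    """
    key_terms = ("endoscopy", "transfusion", "proton pump inhibitor")
    text = response.lower()
    return sum(term in text for term in key_terms) / len(key_terms)


# Placeholder candidate strings, e.g. several samples from one model.
candidates = [
    "Consider discharge without workup.",
    "Start a proton pump inhibitor and arrange endoscopy.",
    "Arrange endoscopy.",
]

best = rejection_sample(candidates, toy_reward, threshold=0.5)
none_pass = rejection_sample(candidates, toy_reward, threshold=0.9)
```

Returning `None` when no candidate clears the threshold is a design choice of this sketch; it leaves the caller to decide whether to resample or defer to a human grader.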
DOI
10.1038/s41746-025-01589-z
WOS
WOS:001480768200004
Archive
https://hdl.handle.net/11368/3120720
info:eu-repo/semantics/altIdentifier/scopus/2-s2.0-105004050808
https://www.nature.com/articles/s41746-025-01589-z
Rights
open access
license:creative commons
license uri:http://creativecommons.org/licenses/by-nc-nd/4.0/
Subjects
  • Best model

  • Clinical setting

  • Fine tuning

  • Gastrointestinal blee...

  • Human performance

  • Language model

  • Medical decision maki...

  • Rejection sampling

  • Similarity metric

  • Upper gastrointestina...

  • artificial intelligen...

  • benchmarking

  • gastroenterology

  • human

  • large language model

  • multiple choice test

  • reinforcement (psycho...

  • retrieval augmented g...

  • safety

  • supervised fine tunin...

  • upper gastrointestina...

  • zero shot prompting
