
Expert of Experts Verification and Alignment (EVAL) Framework for Large Language Models Safety in Gastroenterology


Large language model configurations

We tested the following large language model architectures, selected based on availability for clinical use: GPT-3.5-Turbo, GPT-4-Turbo, GPT-4o, GPT-o1-preview, Claude-3-Opus, LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B, and Mistral-7B. Each model was tested at the zero-shot baseline, with Retrieval Augmented Generation (RAG) using clinical guidelines, after Supervised Fine-Tuning (SFT) on clinical guidelines, and with RAG applied to the fine-tuned model. Of note, we could not fine-tune GPT-o1-preview and Claude-3-Opus due to company restrictions on accessing model weights.

To create the external knowledge dataset used for RAG and SFT, we collected six guideline documents for UGIB (covering variceal and non-variceal bleeding) issued by major North American, European, and Asia-Pacific societies19,20,21,22,23,24. Following our previously published protocol12, we reformatted the original documents from raw PDFs into a format suitable for LLMs. This involved converting all information, both text and non-text, into a textual format, creating a coherent structure across all guidelines, and dividing each document into three macro sections: pre-endoscopic, endoscopic, and post-endoscopic management.

For retrieval augmented generation (RAG)39, the reformatted guidelines were integrated according to each model’s context window size. RAG is a technique that combines retrieval of relevant documents with generation, enabling the model to produce more accurate and contextually appropriate responses. For example, OpenAI’s GPT-3.5-Turbo accepts an input context of up to 4096 tokens, roughly 3,000 English words. Due to this constraint, each clinical guideline was split into smaller sections, or “chunks,” of text at the paragraph level. When a user submits a query to RAG-GPT-3.5-Turbo, the system first searches for the most relevant text among the chunks using cosine similarity and selects the chunk with the highest similarity. The same chunking strategy was used for LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B, and Mistral-7B. By contrast, OpenAI’s GPT-4-Turbo, GPT-4o, and GPT-o1-preview have context windows of up to 128,000 tokens, and Anthropic’s Claude-3-Opus has a context window of up to 200,000 tokens, allowing for chunking at the document level. In these cases, we provided three chunks: one containing the North American Guidelines, one the European Guidelines, and one the Asia-Pacific Guidelines.
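As an illustration of the paragraph-level retrieval step, the sketch below embeds the guideline chunks, ranks them by cosine similarity to the user query, and prepends the top chunk to the prompt. The embedding backbone and prompt wording are assumptions for illustration, not the study’s exact implementation.

```python
# Minimal sketch of paragraph-level retrieval for the smaller-context models
# (e.g., GPT-3.5-Turbo). The embedding model and prompt wording are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any embedding model works

def retrieve_best_chunk(query: str, chunks: list[str]) -> str:
    """Return the guideline chunk with the highest cosine similarity to the query."""
    query_vec = embedder.encode([query])[0]
    chunk_vecs = embedder.encode(chunks)
    # cosine similarity = dot product of L2-normalized vectors
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q
    return chunks[int(np.argmax(sims))]

def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Prepend the retrieved chunk to the clinical question before sending it to the LLM."""
    context = retrieve_best_chunk(query, chunks)
    return f"Use the following guideline excerpt to answer.\n\n{context}\n\nQuestion: {query}"
```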

Supervised fine-tuning was performed using low-rank adaptation (LoRA)40,41, which updates a small fraction of the model’s parameters, significantly reducing the computational cost and memory usage compared to traditional fine-tuning methods. We employed LoRA to fine-tune GPT-3.5-Turbo, GPT-4-Turbo, GPT-4o, LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B, and Mistral-7B on the reformatted clinical guidelines. We performed human-guided chunking at the paragraph level, obtaining 96 chunks in total. The train/test split was not performed randomly; it was designed to ensure that complete information about each management phase appeared in the training set, avoiding loss of key information. We used the United States clinical guidelines as the training dataset and the European/Asia-Pacific guidelines as the testing dataset. Technical details of the fine-tuning process are reported in the Supplementary Materials.
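For illustration, a minimal LoRA setup for one of the open-weight models using the Hugging Face peft library is sketched below; the rank, scaling factor, dropout, and target modules are placeholder values, with the actual hyperparameters reported in the Supplementary Materials.

```python
# Illustrative LoRA setup for an open-weight model (e.g., Llama-2-7B).
# All hyperparameters shown here are assumptions, not the study's reported values.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (assumption)
    lora_alpha=32,                         # scaling factor (assumption)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections typically adapted
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```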

Benchmark datasets and human-grading

To ensure methodological rigor across the multiple evaluation datasets, we implemented a standardized documentation structure addressing four items: the question dataset (how the dataset was constructed and the questions developed), the answer generation process (how the LLMs were used to generate responses), the answer review criteria (the evaluation protocol used to assess responses), and the task (the validation objective within our framework’s evaluation schema). Each dataset is analyzed along these four methodological dimensions. Before proceeding, it is important to note that human evaluation of the accuracy of LLM-generated answers was based on the following criteria: (1) the answer was entirely accurate and free from any inaccuracies, (2) the answer directly addressed the question posed, and (3) the answer was comprehensive, providing a complete response that covered all critical aspects of the question.

The first benchmarking dataset consisted of the expert-generated UGIB questions. We created a 13-question dataset written in conjunction with the expert-of-experts, senior authors of clinical guidelines for UGIB from North America, Europe, and the Asia-Pacific region (L.L., A.B., G.G.T., I.G., J.S.), focused on areas of high value and relevance to the care of patients with UGIB. These key topics encompassed the full spectrum of UGIB care, from initial risk assessment and pre-endoscopic management through to post-procedural care (e.g., risk stratification, transfusion thresholds, or resumption of anticoagulant medication). The questions were separated into two types of tasks: direct content retrieval (n = 9) and analysis of clinical context (n = 4) in the form of clinical cases (Table 3). The clinical cases were specifically designed to test the ability to integrate multiple guideline recommendations in realistic clinical contexts.

Table 3 List of Expert-Generated Questions for Upper Gastrointestinal Bleeding Management

We also invited those five expert-of-experts to independently provide free-text answers (i.e., “golden labels”) to each question, collected on the Qualtrics platform. Each answer was stored in a separate dataset, together with the character and word counts for each answer. Each expert answer is reported in the Supplementary Files.

Using these expert-curated questions, we generated responses from all LLM configurations at a temperature setting of 0.8 (ref. 42), producing ten answers per question for each configuration, for a total of 3510 responses. The same questions had previously been used to collect responses from five model configurations (i.e., baseline PaLM, baseline GPT-3.5, baseline GPT-4, RAG-GPT-3.5, RAG-GPT-4) across multiple temperature thresholds (0.0 to 2.0, in 0.2 increments), creating a dataset of 8580 answers. We generated an additional dataset (n = 1430) using only the best-performing model configuration, following the same temperature range pattern. In all cases, through heuristic prompt engineering, we constrained LLM response lengths to match the maximum word count of the corresponding expert answers, ensuring comparable response formats.
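One plausible implementation of this generation step for the OpenAI-hosted configurations is sketched below: ten completions per question at temperature 0.8, with the response length capped through the prompt itself. The client usage, model name, and prompt wording are assumptions for illustration.

```python
# Sketch of answer generation: ten sampled answers per question at temperature 0.8,
# with the word count capped at the longest expert answer via the prompt.
from openai import OpenAI

client = OpenAI()

def generate_answers(question: str, max_words: int, n: int = 10) -> list[str]:
    system_msg = (
        "You are assisting with upper gastrointestinal bleeding management. "
        f"Answer in at most {max_words} words."
    )
    response = client.chat.completions.create(
        model="gpt-4o",          # swapped per configuration
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": question},
        ],
        temperature=0.8,
        n=n,                     # ten sampled answers per question
    )
    return [choice.message.content for choice in response.choices]
```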

Two independent gastroenterologists blindly evaluated the accuracy of the responses generated at temperature 0.8, comparing them against clinical guidelines and expert answers. In cases of disagreement, a third expert reviewer served as a tiebreaker (disagreement requiring a tiebreaker happened in 6.6% of cases). Four medical experts independently graded the responses generated across different temperature thresholds, and majority voting was used to resolve any disagreements.

The expert responses (“golden labels”) were used to develop and evaluate different text similarity approaches. The LLM-generated responses at temperature 0.8 were used as a validation benchmark to evaluate which similarity technique (fine-tuned ColBERT, Sentence Transformers, and TF-IDF) best correlated with actual model performance. The historical temperature-varying dataset (n = 8580) was used for training and internal validation, while the additional dataset from the best-performing model (n = 1430) was used for external validation of the reward model.

The second benchmarking dataset was obtained from the American College of Gastroenterology (ACG) Multiple-Choice Questions (MCQs). Among all self-assessment board preparation tests published by the ACG, only 40 MCQs strictly focused on the management of patients with UGIB. To establish a benchmark for human performance, we calculated the pooled percentage of correct answers from previous practicing ACG physician test-takers at varying career stages, which averaged 75% for these specific questions. This dataset cannot be released due to the proprietary nature of the MCQs.

Each LLM configuration was tested using a zero-shot approach, where models were instructed to provide only the letter corresponding to the correct answer among the available choices, without any additional explanation or context. All responses were generated using a temperature setting of 0.8.
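For illustration, a prompt along the following lines could implement this zero-shot setup; the exact wording used in the study may differ.

```python
# Illustrative zero-shot prompt for the ACG multiple-choice questions
# (stem and options are placeholders; the study's exact wording may differ).
def mcq_prompt(stem: str, options: dict[str, str]) -> str:
    choices = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    return (
        f"{stem}\n\n{choices}\n\n"
        "Respond with only the single letter of the correct answer. "
        "Do not provide any explanation."
    )
```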

Two independent reviewers evaluated the number of correct responses for each LLM configuration, comparing them against the reference answers.

This dataset served as a validation benchmark to evaluate which similarity technique (fine-tuned ColBERT, Sentence Transformers, and TF-IDF) best correlated with actual model performance.

The third benchmarking dataset was obtained from real-world questions asked during simulation scenarios. In particular, we compiled a dataset of 117 questions from 82 physician trainees across 29 sessions involving 5 standardized UGIB scenarios, conducted in medical simulation settings between 2023 and 2024 (IRB protocol number #2000034521). The complete list of scenarios and related questions is provided in the Supplementary Materials. The simulation scenarios were designed as part of a clinical trial evaluating the effectiveness of the LLM interface (named GUT-GPT) in clinical decision support, which was conducted in accordance with the ethical principles outlined in the Declaration of Helsinki43. Each clinical case-question pair is reported in the Supplementary Files.

Each LLM configuration was tested using a heuristic prompting approach, necessary due to the unpredictable nature of trainee questions. The prompts were structured to include a complete clinical case analysis, providing all relevant context (including patient demographics, laboratory findings, and clinical presentation) and requesting both case-specific information and management recommendations based on the trainee’s specific query. This approach allowed the models to address both direct management questions and requests for case-specific information (e.g., age or laboratory values). All responses were generated using a temperature setting of 0.8.

Two independent gastroenterologists blindly evaluated the accuracy of responses for each LLM configuration against established clinical guidelines. In cases of disagreement, a third expert reviewer served as a tiebreaker (disagreement requiring a tiebreaker happened in 9.5% of cases).

This dataset served as a validation benchmark to evaluate which similarity technique (fine-tuned ColBERT, Sentence Transformers, and TF-IDF) best correlated with actual model performance. This dataset was also used for a supplementary analysis of the reward model alignment with human-grading.

Unsupervised similarity metrics alignment with expert-of-expert golden labels

The EVAL framework provides a scalable solution for AI safety in clinical settings through complementary approaches operating at two levels: at the model level, using unsupervised embeddings to automatically evaluate and rank different LLM configurations based on expert-generated answers (“golden labels”), and at the answer level, employing a reward model to screen individual responses for accuracy against guideline-based recommendations, as illustrated in Fig. 4.

Fig. 4: EVAL framework summary.

The EVAL framework consists of three interconnected components. The first component comprises the Question Datasets: expert-generated questions (N = 13), real-world questions (N = 117), and American College of Gastroenterology questions (N = 40). The second component shows the LLM configurations, which combine different LLM architectures (Meta’s Llama-2-7B/13B/70B, Mistral AI’s Mistral-7B, OpenAI’s GPT-3.5/4/4o/o1, and Anthropic’s Claude-3-Opus) with various setups (without guidelines as baseline, with guidelines through Retrieval Augmented Generation, Supervised Fine Tuning, and a combination of Retrieval Augmented Generation and Supervised Fine Tuning). These LLMs and configurations are then evaluated through three distinct tasks: Task #1 uses unsupervised similarity metrics for model ranking, Task #2 employs a reward model for automated answer grading, and Task #3 implements automated rejection sampling to ensure response quality and safety.

We evaluated three different similarity metrics to quantify the alignment between LLM-generated responses and expert-provided answers: Contextualized Late Interaction over BERT (ColBERT), Sentence Transformers, and TF-IDF as summarized in Fig. 5.

Fig. 5: Evaluation and validation framework for embedding similarity metrics.

This figure illustrates a comprehensive framework for evaluating the alignment of responses generated by large language models (LLMs) with expert-defined Golden Labels (i.e., free-text answers from the experts). a Step 1 – Embedding Similarity Metrics: Model ranking by comparing the similarity of LLM-generated answers to the Golden Labels using TF-IDF, Sentence Transformers, and Fine-Tuned ColBERT. Fine-tuning was performed to maximize the cosine similarity between the embeddings of the “golden labels” and their corresponding paragraphs while minimizing similarity with unrelated paragraphs. This step enhances the model’s ability to differentiate between relevant and irrelevant responses. b Step 2 – Model Performance Evaluation: model responses were assessed by human experts, who graded them for accuracy using expert-generated datasets, real-world questions, and the American College of Gastroenterology Multiple-Choice Questions (ACG-MCQs). Models were then ranked based on their performance and accuracy scores. c Step 3 – Selection of the Best Embedding Similarity Metrics: the average similarity values for each model were correlated with human performance evaluations using Spearman’s rank correlation coefficient. This process identified the similarity metrics with the highest correlation coefficients, underscoring their utility in assessing model response quality. Abbreviations: ACG-MCQs American College of Gastroenterology Multiple Choice Questions, TF-IDF Term Frequency-Inverse Document Frequency, ColBERT Contextualized Late Interaction over BERT.

We used ColBERT44 to quantify the alignment between responses generated by LLMs and responses provided by experts (Fig. 5). We chose ColBERT for its ability to handle the variability of responses within a relatively small semantic space and for its unique token-level comparison approach. Unlike traditional embedding methods that create a single vector representing an entire text (paragraph-level embedding or “early aggregation”), ColBERT preserves the representations of individual words or tokens separately and compares these representations between texts before making a final similarity decision (token-level embedding or “late interaction”). This approach allows more precise matching of specific clinical terms and concepts in context, rather than simply comparing overall text meanings.

To enhance precision in distinguishing between high-quality and lower-quality responses, we fine-tuned the ColBERT embeddings as follows: for each expert label, we created triplets consisting of the label itself, a closely matching paragraph, and a non-matching paragraph from the set of clinical guidelines. We used Bidirectional Encoder Representations from Transformers (BERT)45 embeddings for each triplet component. The matching paragraphs were chosen based on their high relevance to the expert label, while the non-matching paragraphs were selected to be slightly, but not completely, irrelevant (an example is provided in Supplementary Table 3). The objective function for fine-tuning maximized the cosine similarity between the embeddings of the expert label and the matching paragraph while minimizing the similarity between the expert label and the non-matching paragraph. This was achieved using a pairwise softmax cross-entropy loss, which pushes the model to widen the gap in embedding proximity between relevant and irrelevant responses, so that fine-tuned ColBERT produces a sharper separation between relevant and irrelevant text snippets.

To account for the plurality of opinions from multiple experts, we calculated the average similarity score across the sets of embeddings generated from each model’s responses to the different questions. This score reflects the overall alignment of the model’s generated responses with expert-provided answers (details in the Supplementary Materials). To validate model ranking accuracy, we compared the Fine-Tuned ColBERT ranking to the accuracy rankings of each LLM configuration on the expert-generated answer dataset and to performance on the ACG-MCQs. For better visualization of the relative gap between the ColBERT scores of different models, we first normalized the raw ColBERT score by its maximum attainable value and then applied the logit function. To benchmark the performance of our Fine-Tuned ColBERT method, we provide two baselines: Sentence Transformers46, a common LLM-based method for textual similarity, and TF-IDF47, a classical method based on word and document statistics.
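A conceptual sketch of the fine-tuning objective and of the score transformation described above is shown below; colbert_score is a placeholder for the actual late-interaction scorer, and the training loop is simplified relative to the actual implementation.

```python
# Conceptual sketch of the triplet objective used to fine-tune the ColBERT
# embeddings and of the normalize-then-logit transform used for visualization.
import math
import torch
import torch.nn.functional as F

def pairwise_softmax_loss(sim_positive: torch.Tensor,
                          sim_negative: torch.Tensor) -> torch.Tensor:
    """Pairwise softmax cross-entropy: push the (label, matching paragraph)
    similarity above the (label, non-matching paragraph) similarity."""
    logits = torch.stack([sim_positive, sim_negative], dim=1)   # shape (batch, 2)
    targets = torch.zeros(logits.size(0), dtype=torch.long)     # index 0 = positive pair
    return F.cross_entropy(logits, targets)

def logit_of_normalized_score(raw_score: float, max_score: float) -> float:
    """Normalize a raw ColBERT score by its maximum attainable value,
    then apply the logit function for visualization."""
    p = raw_score / max_score
    return math.log(p / (1 - p))

# Training loop outline (encoder, optimizer, and data loading omitted):
# for label, matching_paragraph, non_matching_paragraph in triplets:
#     sim_pos = colbert_score(label, matching_paragraph)       # late-interaction similarity
#     sim_neg = colbert_score(label, non_matching_paragraph)
#     loss = pairwise_softmax_loss(sim_pos, sim_neg)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```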

For the Sentence Transformers-based similarity metric, we used the publicly available pre-trained embedding model all-MiniLM-L6-v2 from Sentence Transformers38 to compute embeddings for the answers and then used cosine similarity to score each pair of answer embeddings. The model is a pre-trained BERT model further fine-tuned on paired sentences to produce high similarity scores for semantically related pairs. It is a reasonable general-purpose approach for similarity tasks and thus serves as a well-suited baseline for comparison with our method.
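A minimal sketch of this baseline, using the sentence-transformers library, is shown below.

```python
# Sentence Transformers baseline: embed both answers with all-MiniLM-L6-v2 and
# score the pair with cosine similarity.
from sentence_transformers import SentenceTransformer, util

st_model = SentenceTransformer("all-MiniLM-L6-v2")

def sentence_transformer_similarity(llm_answer: str, expert_answer: str) -> float:
    embeddings = st_model.encode([llm_answer, expert_answer], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()
```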

For the TF-IDF-based similarity metric, we followed the standard practice of computing a feature vector for each text and comparing feature vectors with cosine similarity; this falls under the same framework as our ColBERT method, the difference being that TF-IDF uses pre-defined corpus statistics rather than highly specialized, data-driven embeddings. Specifically, for each pair of LLM output and expert response, we computed the TF-IDF weights by multiplying the term frequency and the inverse document frequency, where each document is either one LLM output or one expert answer. The term frequency (TF) is the number of times a given term appears in the document. The inverse document frequency (IDF) uses the smoothed form IDF(t) = ln((1 + N) / (1 + n_t)) + 1, where N is the total number of documents and n_t is the number of documents containing the term; the added constants normalize the weights and avoid division by zero, following standard practice48. Finally, we computed the cosine similarity between the TF-IDF vectors to obtain the final similarity score.
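A minimal sketch of this baseline, using scikit-learn's TfidfVectorizer (which applies the smoothed IDF described above by default), is shown below; treating each answer in the pair as its own document mirrors the setup in the text.

```python
# TF-IDF baseline: each answer in the pair is treated as a document, vectorized
# with smoothed TF-IDF weights, and the pair is scored with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(llm_answer: str, expert_answer: str) -> float:
    vectorizer = TfidfVectorizer()  # smooth_idf=True by default: ln((1+N)/(1+n_t)) + 1
    vectors = vectorizer.fit_transform([llm_answer, expert_answer])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])
```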

For each similarity method, we performed pairwise t-tests comparing the highest-scoring model configuration against all other configurations individually. Similarly, we conducted pairwise t-tests for human-graded accuracies across the three evaluation sets (expert-generated questions, real-world questions, and ACG MCQs), comparing the best-performing configuration against all others. For all statistical comparisons, we considered a two-tailed p-value < 0.05 as statistically significant. To determine which similarity metric best aligned with human evaluation, we calculated Spearman rank correlation coefficients between the average scores from each method and the model accuracies determined by human grading. This analysis allowed us to identify which of the three proposed methods showed the strongest alignment with both human-graded accuracy and performance on ACG MCQs.
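As an illustration of the correlation step, the sketch below computes the Spearman rank correlation between per-configuration mean similarity scores and human-graded accuracies; the values shown are placeholders, not study results.

```python
# Illustrative Spearman rank correlation between a similarity metric and human
# grading; each entry corresponds to one LLM configuration (values are placeholders).
from scipy.stats import spearmanr

mean_similarity_per_config = [0.71, 0.64, 0.58, 0.80]  # e.g., Fine-Tuned ColBERT scores
human_accuracy_per_config = [0.78, 0.65, 0.55, 0.90]   # human-graded accuracy

rho, p_value = spearmanr(mean_similarity_per_config, human_accuracy_per_config)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```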

Reward model to screen for high-quality LLM responses

One concern with deploying probabilistic large language models in clinical settings is the presence of hallucinations, i.e., seemingly plausible but inaccurate information49. It is not uncommon for models to output answers that contain factual inaccuracies or “misread” the guidelines, or to present factually incorrect information confidently, without any indication of uncertainty. The part of our framework that addresses hallucinations is represented graphically in Fig. 6.

Fig. 6: Reward model training, testing, and validation and application with automated rejection sampling.

This figure illustrates a two-step framework for optimizing the accuracy and reliability of responses generated by large language models (LLMs), with clear stages for reward model training and application. a Step 1 – Reward Model Training and Validation: previously graded answers from the expert-generated questions were utilized for training and testing the reward model. The reward model assigns accuracy scores to the generated answers (e.g., 0.98 for accurate responses and 0.02 for inaccurate ones). Validation was performed using human-graded answers from the best-performing model, determined through Fine-Tuned ColBERT ranking. This process ensured that the reward model could accurately evaluate the quality of new question-answer pairs, thereby validating its grading accuracy. b Step 2 – Application with Automated Rejection Sampling: For each question, the LLM generates multiple candidate answers (K answers). These answers are passed through the trained reward model, which assigns accuracy scores and ranks the responses. The answer with the highest score is selected as the final output. This filtering mechanism increases the reliability of the model by systematically rejecting less accurate responses, thereby ensuring only the most accurate answers are retained.

In addition to model-level selection, we trained an additional Reward Model to serve as a substitute for human feedback at the level of individual answers. A reward model is an LLM tasked with approximating part of the environment in a traditional reinforcement learning problem: it takes in text and returns a score. The objective of this reward model is to assess the level of congruence between a model’s response and human preferences. In simpler terms, a reward model takes a pair of inputs (prompt and response) and produces a reward, or score, as output. The primary difficulty in constructing such a model lies in obtaining a high-quality dataset, because the subjective judgment of good versus bad varies among individuals and is difficult to quantify consistently. Previous evidence suggests that a dataset containing between 1,000 and 10,000 high-quality question-answer pairs is sufficient for training a reward model in moderately complex domains50,51. For larger or more nuanced topics, a dataset exceeding 50,000 pairs may be necessary52.

To train our reward model, which we will refer to as the Grader Model (GM), the LLM receives data in the following format: [Question, Answer, Score]. The GM’s task is to take a specific [Question, Answer] pair and map it to the answer’s score. Scores are provided by a human evaluator who reads the response and assigns it a score of 0 or 1 based on its accuracy. To train this model, we replace the LLM’s traditional head, which outputs the log probability of the next word, with a value head that predicts the score of a [Question, Answer] pair. Since the answers are classified as either Good (Score = 1) or Bad (Score = 0), the value head outputs the probability that the answer is good. The model is trained using cross-entropy (classification) loss and gradient descent to improve score accuracy.
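A simplified sketch of the Grader Model setup is shown below, assuming the OPT-350M backbone is loaded with a single-output value head via the Hugging Face transformers library; data handling and training hyperparameters are simplified relative to the actual implementation.

```python
# Conceptual sketch of the Grader Model: an OPT-350M backbone whose language-modeling
# head is replaced by a scalar value head that outputs P(answer is good) for a
# [Question, Answer] pair. Details are simplified relative to the actual setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
grader = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-350m", num_labels=1  # single value head -> probability after sigmoid
)

def grade(question: str, answer: str) -> float:
    """Return the predicted probability that the answer is 'Good' (Score = 1)."""
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logit = grader(**inputs).logits.squeeze()
    return torch.sigmoid(logit).item()

# Training uses binary cross-entropy against the human labels (0 = Bad, 1 = Good):
# loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
```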

We used the previously graded dataset (n = 7150) obtained from multiple LLM configurations (i.e., baseline PaLM, baseline GPT-3.5, baseline GPT-4, RAG-GPT-3.5 with American Guidelines, RAG-GPT-3.5 with American, European, and Asia-Pacific Guidelines) to train the Reward Model, which was then internally validated against the previous state-of-the-art model (i.e., RAG-GPT-4 with American, European, and Asia-Pacific Guidelines; n = 1430). The Reward Model’s performance was externally validated using the new state-of-the-art model (i.e., SFT-GPT-4o; n = 1430), selected as the configuration with the highest similarity score according to Fine-Tuned ColBERT.

The reward model was trained using Meta’s OPT-350M, a 350-million-parameter decoder-only LLM. The use of a smaller reward model such as OPT-350M aligns with findings indicating that compact models are sufficient for tasks where dataset quality is prioritized over model scale, as smaller models demonstrate robust generalization and efficiency without significant performance trade-offs in preference learning or alignment tasks, provided they are trained on high-quality, curated datasets46,53,54. The reward model output is binary: “Good” (Score = 1) or “Bad” (Score = 0). Alignment with human experts was evaluated as the number of true labels (i.e., the number of answers for which the reward model produced the same label as human grading). The results were interpreted by dividing the temperatures into three regimes according to the models’ graded performance: positive (temperature < 1.2), negative (temperature > 1.6), and mixed (temperature between 1.2 and 1.6). These thresholds were chosen such that the positive regime has over 80% graded accuracy and the negative regime has under 20% graded accuracy. The reward model was then applied to the best model according to the ColBERT ranking, and its grading accuracy was validated on this new dataset of question-answer pairs. As a sensitivity analysis, we report alignment across all temperature thresholds in the Supplementary Materials. In addition, we tested the alignment of the reward model with human grading on the real-world questions for all models at a fixed temperature of 0.8, with results reported in the Supplementary Materials. The reward model is publicly available on Hugging Face (https://huggingface.co/ZachariahPang/medical_reward_model).

Automated rejection sampling

Extending the reward model pipeline, we incorporated the reward function directly into the answer pipeline using a rejection sampling approach. For each question, the LLM agent generates K candidate answers. These K answers are evaluated by the reward model, and only the top-scoring answer is sent forward. This serves as a form of self-filtering, allowing the reward model to catch and discard suboptimal answers before they reach the end user, thereby enhancing the overall quality of the model’s output. To evaluate the rejection sampling approach, we used the same curated dataset described in the previous section for reward model alignment. Human-graded accuracy was compared across multiple K values (1, 3, 5, 7, and 10), as reported in Supplementary Table 4. The results demonstrated a consistent improvement in accuracy with increasing K; however, larger K values also demand significantly more computational resources. We selected K = 5 for the main analysis as it provides a practical balance between computational efficiency and improved accuracy. Detailed trends in accuracy with and without rejection sampling, as well as the impact of varying K, are included in the Supplementary Materials to illustrate the trade-offs and performance improvements.
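A minimal, self-contained sketch of best-of-K rejection sampling is shown below; the generate and grade callables are placeholders for the answer-generation and reward-scoring steps described earlier.

```python
# Best-of-K rejection sampling: generate K candidate answers, score each with the
# reward model, and return only the highest-scoring one.
from typing import Callable

def answer_with_rejection_sampling(
    question: str,
    generate: Callable[[str, int], list[str]],   # returns K candidate answers for a question
    grade: Callable[[str, str], float],          # reward model score for (question, answer)
    k: int = 5,
) -> str:
    """Keep only the candidate answer with the highest reward score."""
    candidates = generate(question, k)
    scored = [(grade(question, answer), answer) for answer in candidates]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer
```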

