The healthcare and medical fields are highly complex, requiring in-depth expertise and knowledge. Despite the growing demand for healthcare services, a significant gap persists between this demand and the availability of skilled professionals1,2,3. While previous artificial intelligence (AI) technologies have partially alleviated this gap, they have either lacked linguistic capabilities or been limited to specialized, single tasks4,5. As a result, these technologies are not easily adaptable to the diverse and dynamic tasks encountered in healthcare settings, where language-based interaction among patients, medical staff, and healthcare professionals is essential.
Recent advancements in large language models (LLMs) suggest a promising future for the application of AI in healthcare and medicine, with these models serving as efficient and rapid decision-making assistants for professionals6,7. These foundation models are capable of handling a wide range of language tasks in a generalizable manner. Several models have demonstrated this capability by surpassing the 60% passing threshold on United States Medical Licensing Examination (USMLE) questions8,9,10,11,12, recently reaching a remarkable accuracy of 95.0%13. Furthermore, their effectiveness has been highlighted in addressing real-world clinical challenges, including responding to clinical inquiries arising in daily practice, engaging in conversational history-taking, and diagnosing complex clinical cases14,15,16.
Despite these advancements, the deployment of LLMs in the medical domain faces significant challenges in two key respects. First, the use of commercial LLMs in healthcare, such as OpenAI’s GPT-3.517 and GPT-418, is constrained by privacy and security concerns due to their closed-source nature6,19,20,21. These models rely on web-based APIs for data transmission, making the management of sensitive patient data particularly challenging in the absence of well-defined regulatory frameworks. Second, the high computational demands of LLMs pose a significant barrier to local deployment. Recent releases of open-source LLMs22,23,24 have made on-premises deployment on in-house servers possible, offering a privacy-preserving alternative by eliminating the need to transmit sensitive data to external platforms. However, their substantial hardware requirements (often multiple 80 GB A100/H100 GPUs) far exceed the capacity of standard desktop systems, rendering on-premises deployment impractical for many hospitals and clinical research laboratories.
A practical solution likely involves deploying open-weight small language models (SLMs) with fewer than ten billion parameters locally25,26,27,28,29,30,31. These models can typically be run on high-end PCs (e.g., with an RTX 3090 GPU), making them accessible to a wider range of users and environments. However, a key challenge remains: these models lack the multi-step reasoning capabilities necessary to solve complex problems. In medicine, strong reasoning skills are particularly crucial for analyzing problems systematically, constructing logical paths, and accurately predicting answers. Thanks to their vast number of parameters, often exceeding several hundred billion, LLMs naturally exhibit this “chain-of-thought” (CoT) reasoning ability32, enabling them to provide step-by-step explanations that lead to a conclusion for complex problems. In contrast, SLMs do not consistently acquire these abilities during pre-training33,34,35.
Unfortunately, effective and efficient training methods for improving medical reasoning remain understudied. Existing medical SLMs are commonly initialized from general-domain SLMs and further trained on millions of domain-specific documents using a basic continuous pre-training method28,29,30,31. This approach not only demands substantial computational resources but also yields only limited gains in medical reasoning. For instance, PMC-Llama-7B required 32 A100 GPUs for around 7 days to complete training, totaling around 5376 GPU hours28. With new and improved general-domain models being released on a monthly basis, this approach of continuously training models on medical corpora is becoming increasingly unsustainable. Furthermore, the performance improvements achieved often fail to justify the resource demands. For example, PMC-Llama-7B demonstrated only a 1.02% improvement over its counterpart, Llama-2-7B, on the MedQA benchmark.
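As a back-of-the-envelope check, the quoted total follows directly from the reported training setup:

$$32\ \text{GPUs} \times 7\ \text{days} \times 24\ \text{h/day} = 5376\ \text{GPU-hours}.$$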
We seek to address the following research questions: (1) What training method can be implemented to effectively enhance the limited reasoning capabilities of SLMs? (2) In light of the rapid release of new SLMs by both industry and research institutions, can this method be efficiently adapted to evolving models?
In this study, we introduce Meerkat, a new family of on-premises medical AI systems with enhanced reasoning skills acquired from textbooks. Our models are built upon current state-of-the-art LMs, such as Mistral-7B26 and Llama-3-8B26, and fine-tuned using a diverse set of carefully crafted data. Specifically, we employed an LLM to extract CoT reasoning paths for 9.3K USMLE-style questions from the MedQA dataset36. To enhance diversity in reasoning, we further synthesized 78K additional questions along with their CoT reasoning paths from authoritative resources, leveraging 18 textbooks that comprehensively span 16 medical disciplines. Furthermore, we aggregated existing instruction-following and chat datasets, authorized for research use, to address a broad range of applications in this domain. In total, the model was fine-tuned on 460K examples. The fine-tuning process took only around 1 day on eight A100 GPUs, making it significantly more efficient than traditional continuous pre-training approaches.
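To make the first step concrete, the following is a minimal sketch of how CoT reasoning paths can be elicited from a teacher LLM for existing USMLE-style questions. It is illustrative rather than our exact released pipeline; the teacher model name, prompt wording, and helper names are assumptions.

```python
# Minimal sketch: elicit a chain-of-thought rationale from a teacher LLM for a
# question whose correct answer is already known. Assumes the OpenAI Python client;
# the model name, prompt, and example item are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_cot(question: str, options: dict[str, str], answer: str) -> str:
    """Ask the teacher LLM for a step-by-step rationale ending at the known answer."""
    option_text = "\n".join(f"{label}. {text}" for label, text in options.items())
    prompt = (
        "You are a medical expert. Explain step by step how to arrive at the correct "
        "answer to the following USMLE-style question.\n\n"
        f"Question: {question}\n\nOptions:\n{option_text}\n\n"
        f"Correct answer: {answer}\n\nStep-by-step reasoning:"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder teacher model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content


# Example: build one fine-tuning record from a single (hypothetical) MedQA item.
item = {
    "question": "A 45-year-old man presents with sudden-onset chest pain ...",
    "options": {"A": "Aspirin", "B": "Heparin", "C": "Nitroglycerin", "D": "Alteplase"},
    "answer": "D",
}
record = {
    "instruction": item["question"],
    "output": extract_cot(item["question"], item["options"], item["answer"]),
}
```

For scale, eight A100 GPUs for roughly 24 hours corresponds to about 192 GPU-hours of fine-tuning, compared with the roughly 5376 GPU-hours reported above for continuous pre-training of PMC-Llama-7B.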
Meerkat-7B and Meerkat-8B achieved average accuracies of 64.5% and 66.7%, respectively, across six benchmarks, surpassing the previous leading general- and medical-domain models, including Mistral-7B (41.2%), Llama-3-8B (56.1%), MediTron-7B (51.0%)29, BioMistral-7B (55.4%)30, and GPT-3.5 (54.8%)17. Notably, Meerkat-7B achieved a score of 77.1% on MedQA36, marking the first time a 7B model has surpassed the USMLE’s passing threshold of 60% accuracy and exceeding the previous best open-weight performance of 70.2% set by MediTron-70B. On the NEJM Case Challenges, Meerkat-8B accurately diagnosed 20 cases, surpassing the human average of 13.8 and nearly matching GPT-4’s score of 21.8. In human evaluations of model-generated rationales, Meerkat-8B outperformed its counterpart, Llama-3-8B, across all four metrics: completeness, factuality, clarity, and logical consistency. We underscore that our Meerkat models, along with the CoT fine-tuning approach, substantially narrowed the performance gap with commercial LMs, enabling smaller models to tackle challenging reasoning tasks. Our contributions are summarized as follows:
We introduce Meerkat, a cutting-edge series of on-premises medical models with high-level reasoning capabilities. Meerkat is the first medical AI system trained on CoT data synthesized from raw textbooks, and we demonstrate the effectiveness of this approach. Our fine-tuning approach is significantly more efficient than continuous pre-training, requiring approximately 28 times less GPU time for a 7B model. Moreover, it consistently proves effective regardless of the initial LM, meaning that our method can enhance performance even for newly released models through fine-tuning.
Our models surpassed general and domain-specific open-weight models on six medical benchmarks. Meerkat-7B is the first 7B model to exceed the USMLE passing threshold, setting the standard as the leading open-weight model in its class. Additionally, Meerkat-8B surpassed the average human score by 6.3 on the NEJM Case Challenges. In expert evaluations, Meerkat-8B significantly outperformed Llama-3-8B across all four fine-grained metrics.
We plan to release all associated artifacts, including our model weights and training data. The released data includes the new MedQA-CoT and MedBooks-18-CoT datasets, which comprise CoT reasoning paths extracted for a USMLE-style dataset and synthetic question-answer pairs with CoT reasoning paths generated from 18 textbooks, respectively. These datasets can serve as a valuable resource for fine-tuning new models in the medical domain.