A recent study published in Nature by Bedi et al. (2026) examines the capacity of Large Language Models (LLMs) to improve daily healthcare practices and tasks. These models have achieved near-perfect scores on medical licensing exams; nevertheless, such benchmarks do not reflect real clinical complexity or day-to-day healthcare tasks. To enable real-world evaluation of LLMs for medical use, the authors introduce MedHELM, a holistic evaluation framework designed to assess LLM performance across diverse clinical tasks beyond exam-style questions.
How does MedHELM work?
MedHELM comprises three key components:
- A clinician-validated taxonomy organizing medical AI applications into five major categories reflecting actual clinical work: clinical decision support, clinical note generation, patient communication, medical research assistance, and administration and workflow. This taxonomy spans 22 subcategories and 121 discrete tasks.
- A benchmark suite encompassing 35 evaluations that cover the full taxonomy, including both existing and newly formulated benchmarks representative of real clinical tasks.
- A systematic comparison of nine state-of-the-art LLMs – including Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, GPT-4o, GPT-4o mini, Llama 3.3, and o3-mini – using an automated LLM-jury evaluation method, in which multiple AI evaluators assess model outputs against expert-defined clinical criteria (a minimal sketch of this idea follows the list).
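To make the LLM-jury idea more concrete, here is a minimal sketch of how such an evaluation loop could look. The juror model names, the rubric, and the `query_llm()` helper are hypothetical placeholders for illustration; this is not the actual MedHELM implementation.

```python
# Minimal sketch of an LLM-jury evaluation loop (illustrative only).
# The juror names, rubric wording, and query_llm() helper are hypothetical
# stand-ins, not the study's actual implementation.
from statistics import mean

JURY_MODELS = ["judge-model-a", "judge-model-b", "judge-model-c"]  # assumed jurors

RUBRIC = (
    "Rate the response from 1 (poor) to 5 (excellent) on each criterion:\n"
    "- Accuracy: is the clinical content correct?\n"
    "- Completeness: does it address the full task?\n"
    "- Clarity: is it understandable to the intended reader?\n"
    "Reply with three integers separated by spaces."
)

def query_llm(model: str, prompt: str) -> str:
    """Placeholder for an API call to `model`; expected to return e.g. '4 5 4'."""
    raise NotImplementedError

def jury_score(task_prompt: str, model_output: str) -> float:
    """Average the rubric scores assigned by each juror model."""
    per_juror_scores = []
    for judge in JURY_MODELS:
        prompt = f"{RUBRIC}\n\nTask:\n{task_prompt}\n\nResponse:\n{model_output}"
        scores = [int(s) for s in query_llm(judge, prompt).split()]
        per_juror_scores.append(mean(scores))
    return mean(per_juror_scores)  # consensus score across the jury
```

The key design point this sketch illustrates is aggregation: rather than trusting a single automated judge, several evaluators score the same output against the same expert-defined criteria and their scores are averaged into a consensus.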
LLM strengths and limitations across clinical tasks
The main results showed substantial variation in performance across tasks and categories:
- Stronger performance was observed in tasks such as clinical note generation and patient communication and education, where models scored relatively high on normalized accuracy scales (0.74–0.85).
- Moderate performance emerged for medical research assistance and clinical decision support tasks.
- Lower performance was seen in administration and workflow tasks, indicating challenges for real-world process-oriented tasks.
- Advanced reasoning models (e.g., DeepSeek R1, o3-mini) achieved the highest overall win rates across benchmarks, while models like Claude 3.5 Sonnet delivered competitive performance at lower computational cost.
- The LLM-jury evaluation method aligned with clinician judgment at acceptable levels, outperforming traditional automatic metrics such as ROUGE-L and BERTScore in correlating with expert ratings (an illustrative comparison follows the list).
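As a rough illustration of how this kind of alignment can be checked, the sketch below computes the rank correlation between clinician ratings and two automated scores. The numbers are invented example data, and the study's own correlation analysis may use a different metric or scale.

```python
# Illustrative check of how well automated metrics track clinician ratings.
# All scores below are made-up example data, not results from the study.
from scipy.stats import spearmanr

clinician = [4.5, 3.0, 2.5, 5.0, 3.5, 4.0]        # expert ratings per output (hypothetical)
llm_jury  = [4.3, 3.2, 2.8, 4.9, 3.6, 4.1]        # LLM-jury consensus scores (hypothetical)
rouge_l   = [0.42, 0.40, 0.38, 0.45, 0.41, 0.39]  # ROUGE-L scores vs. a reference text (hypothetical)

rho_jury, _ = spearmanr(clinician, llm_jury)
rho_rouge, _ = spearmanr(clinician, rouge_l)
print(f"LLM-jury vs. clinicians: rho = {rho_jury:.2f}")
print(f"ROUGE-L  vs. clinicians: rho = {rho_rouge:.2f}")
```

A metric whose rankings closely follow the clinicians' rankings yields a correlation near 1.0; the study's claim is that the LLM-jury approach sits closer to that ideal than surface-overlap metrics like ROUGE-L or BERTScore.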
Towards task-specific evaluation metrics and integration into healthcare workflows and systems
The main findings of this study demonstrate that strong performance on licensing exams does not guarantee real-world clinical utility. MedHELM’s comprehensive taxonomy and benchmark suite reveal nuanced strengths and weaknesses across tasks, highlighting the need for task-specific, clinician-informed evaluation metrics. The framework sheds light on where current LLMs are effective (e.g., communication and documentation tasks) and where they fall short (e.g., structured decision support and administrative workflows), guiding practical selection and deployment of medical AI systems.
MedHELM establishes a practical, clinician-validated standard for evaluating LLMs on real-world medical tasks, moving beyond conventional exam-based benchmarks. Its open framework enables evidence-based assessment, encourages continuous improvement of medical AI, and supports safer, more reliable integration of LLM technology into healthcare workflows and systems.
Did you find this study interesting? Do you think these LLMs could be used in complex contexts such as mental health practice?


