Evaluating Large Language Models for Technical MRI Expertise: A New Study from MIMRTL

A new arXiv preprint by Alan B. McMillan, PI of MIMRTL, investigates the performance of large language models (LLMs) in answering technical MRI questions, assessing their potential to provide expert-level guidance in real-world clinical settings.

The Challenge: Variability in MRI Expertise

Magnetic resonance imaging (MRI) is a powerful but technically complex imaging modality. Operator skill levels vary widely, leading to inconsistencies in image quality, protocol adherence, and artifact management—all of which can impact diagnostic accuracy and patient outcomes. While extensive training programs exist, access to expert MRI technologists and physicists is not uniformly available, especially in geographically isolated or resource-limited settings.

With the rapid advancements in AI-driven natural language processing, LLMs present an opportunity to augment MRI expertise, offering real-time technical support, troubleshooting guidance, and reinforcement of best practices. However, no systematic evaluation of these models in the context of MRI technical knowledge had been conducted—until now.

The Study: Assessing AI’s Ability to Answer MRI Questions

McMillan’s study systematically evaluated the ability of closed-source and open-source LLMs to accurately answer 570 MRI-related questions. The questions covered a broad range of MRI topics, including Basic Principles, Image Production, Artifacts and Corrections, Pulse Sequences, Instrumentation, and Safety.

The tested LLMs included state-of-the-art closed-source models (such as GPT-4o, GPT-4 Turbo, and Claude 3.5 Haiku) and open-source models (such as Phi 3.5 Mini, Llama 3.1, and SmolLM2). Each model was queried using standardized prompts via the LangChain framework, and responses were graded using an automated scoring protocol based on exact matches and semantic similarity.
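
The preprint’s exact prompts and scoring code are not reproduced here, but a minimal sketch of such a pipeline, assuming the langchain-openai package, an OpenAI chat model, and a simple letter-match grader, might look like the following. The model settings, prompt wording, and sample question are illustrative assumptions, not the study’s materials:

    # Minimal sketch (not the paper's code): query a chat model with a
    # standardized multiple-choice prompt via LangChain, then grade the reply.
    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate

    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are an expert MRI physicist. Reply with the letter "
                   "of the single best answer."),
        ("human", "{question}\n\nChoices:\n{choices}"),
    ])
    llm = ChatOpenAI(model="gpt-4o", temperature=0)  # hypothetical settings
    chain = prompt | llm  # LangChain runnable composition

    def grade(reply_text: str, answer_letter: str) -> bool:
        # Simplified exact-match check after light normalization; the study
        # also used semantic similarity for non-literal matches (not shown).
        return reply_text.strip().upper().startswith(answer_letter.upper())

    reply = chain.invoke({
        "question": "Which parameter is primarily adjusted to control "
                    "T1 weighting in a spin-echo sequence?",
        "choices": "A. Echo time (TE)\nB. Repetition time (TR)\n"
                   "C. Receiver bandwidth\nD. Field of view",
    })
    print(reply.content, grade(reply.content, "B"))

In an evaluation like this one, a loop over the full question bank and over each model under test would replace the single invoke call shown above.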

Key Findings: AI Models Can Deliver Expert-Level Accuracy

The study found that LLMs are highly capable of providing accurate technical MRI guidance, with some models achieving near-perfect performance:

  • OpenAI’s o1 Preview model achieved the highest accuracy (94%), significantly outperforming the random-guess baseline of 26.5% (see the note after this list for how such a baseline can arise).
  • GPT-4o and o1 Mini followed closely with 88% accuracy.
  • GPT-4 Turbo and Claude 3.5 Haiku scored 84%, reinforcing the strong performance of closed-source models.
  • The best-performing open-source model, Phi 3.5 Mini, achieved 78% accuracy, demonstrating that smaller, open-access models can still be highly effective.
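
A brief note on the 26.5% baseline: under uniform random guessing, a question with k answer options is answered correctly with probability 1/k, so the expected score over the bank is the average of 1/k across all questions. The option-count mix below is purely hypothetical, chosen only to show how a baseline near 26.5% can arise from a mostly four-option exam:

    # Hypothetical illustration of the random-guess baseline: expected
    # accuracy is the mean of 1/k over questions with k answer options.
    # This 3-option/4-option split is invented for the example.
    option_counts = [3] * 103 + [4] * 467   # 570 questions total
    baseline = sum(1 / k for k in option_counts) / len(option_counts)
    print(f"{baseline:.1%}")  # -> 26.5%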

Performance Across MRI Categories

The study also analyzed model performance across different MRI subcategories:

  • Best performance was observed in Basic Principles and Instrumentation, with top models exceeding 95% accuracy.
  • Lower accuracy was seen in Image Weighting and Contrast, Artifacts, and MRI History, indicating challenges in grasping nuanced concepts related to signal generation, contrast mechanisms, and historical developments in MRI technology.

Implications: AI as an On-Demand MRI Consultant

These findings highlight AI’s potential to standardize and enhance MRI practice by providing real-time, reliable technical guidance. Potential applications include:

  • AI-powered troubleshooting assistants to help technologists adjust protocols and mitigate artifacts.
  • Training augmentation tools that reinforce best practices and support continuing education.
  • Standardization of MRI workflows, particularly in settings where expert oversight is limited.

If successfully integrated, AI-driven assistance could reduce imaging errors, optimize protocol adherence, and improve diagnostic quality—benefiting patients, technologists, and radiologists alike.

Challenges and Future Directions

While the results are promising, the study also identifies key limitations:

  1. Closed-source models, while powerful, lack transparency—their internal decision-making processes remain proprietary.
  2. No models were fine-tuned on MRI-specific data—further domain-specific training could enhance their accuracy.
  3. Ethical concerns remain regarding AI overreliance and potential misinformation in high-stakes clinical settings.

Future work should explore:

  • Fine-tuning LLMs on curated MRI datasets for enhanced domain specialization (a minimal sketch follows this list).
  • Human-in-the-loop validation to ensure AI recommendations align with expert judgment.
  • Clinical integration studies to evaluate real-world usability and effectiveness.
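
As a concrete illustration of the first item, here is one way an open model from the study could be adapted to MRI Q&A with LoRA via Hugging Face peft. Nothing in this sketch comes from the preprint: the dataset file mri_qa.jsonl, the hyperparameters, and the training setup are placeholder assumptions.

    # Illustrative LoRA fine-tuning sketch (not from the paper). Assumes a
    # JSONL file of {"question": ..., "answer": ...} records; the file name,
    # model choice, and hyperparameters are placeholders.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "microsoft/Phi-3.5-mini-instruct"  # best open model in the study
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.pad_token or tok.eos_token  # ensure padding works
    model = get_peft_model(
        AutoModelForCausalLM.from_pretrained(base),
        LoraConfig(r=8, lora_alpha=16, target_modules="all-linear"),
    )

    def tokenize(example):
        # Format each record as a simple Q/A completion for causal LM loss.
        return tok(f"Q: {example['question']}\nA: {example['answer']}",
                   truncation=True, max_length=512)

    train_ds = load_dataset("json", data_files="mri_qa.jsonl")["train"].map(tokenize)
    Trainer(
        model=model,
        args=TrainingArguments(output_dir="phi35-mri-lora",
                               per_device_train_batch_size=2,
                               num_train_epochs=1),
        train_dataset=train_ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()

A human-in-the-loop evaluation of the adapted model, as described in the second item, would then be needed before any clinical use.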

Conclusion: The Future of AI in MRI Expertise

This study provides strong evidence that AI can support and enhance MRI technical expertise, particularly in settings where human expertise is inconsistent or inaccessible. As AI models continue to improve, they may become an integral part of clinical workflows, providing on-demand guidance, optimizing imaging protocols, and reducing variability in MRI practice.

Read the full preprint on arXiv: Performance of Large Language Models in Technical MRI Question Answering: A Comparative Study.