The Molecular Imaging/Magnetic Resonance Technology Laboratory (MIMRTL) is excited to share a new publication in the journal Tomography, BAE-ViT: An Efficient Multimodal Vision Transformer for Bone Age Estimation, led by Jinnian Zhang, Weijie Chen, and a team of MIMRTL collaborators. This study introduces a novel vision transformer model that integrates both image and demographic data to enhance bone age estimation (BAE).
📄 Read the full paper here: PMC Article
📹 Watch a video about the paper here on LinkedIn: Video
Advancing Bone Age Estimation with AI
Bone age estimation is a critical radiological assessment used to evaluate skeletal maturity, growth disorders, and endocrine abnormalities. Traditional methods, such as Greulich–Pyle and Tanner–Whitehouse, rely on radiologist expertise, making them time-intensive and subject to variability.
AI-driven approaches, particularly convolutional neural networks (CNNs), have improved efficiency and accuracy in BAE. However, most existing models do not fully integrate demographic data (such as biological sex), despite its known impact on bone maturation rates. BAE-ViT (Bone Age Estimation Vision Transformer) addresses this gap by introducing a multimodal fusion method that allows detailed interaction between image and non-image data.
Why a Vision Transformer?
Unlike CNNs, which rely on spatially localized feature extraction, vision transformers (ViTs) use self-attention so that every image patch (token) can attend to every other patch in the image; a minimal sketch follows this list. This enables:
✅ Better feature learning across the entire image.
✅ More flexibility in processing diverse data types.
✅ A scalable approach for multimodal fusion (e.g., combining images with patient demographics).
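To make the self-attention idea concrete, here is a minimal, single-head PyTorch sketch. It is illustrative only; the layer names and token dimensions are assumptions, not the BAE-ViT implementation.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal single-head self-attention: every token attends to every other token."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)   # queries, keys, values
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim); tokens may be image patches or demographic tokens
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale        # (batch, tokens, tokens)
        attn = attn.softmax(dim=-1)                          # each token weighs all others
        return self.proj(attn @ v)

# Example: 196 patch tokens (a 14 x 14 grid) with a 192-dim embedding
tokens = torch.randn(1, 196, 192)
print(SelfAttention(192)(tokens).shape)                      # torch.Size([1, 196, 192])
```

Because attention operates over a flat sequence of tokens, appending a non-image token (such as biological sex) requires no architectural change, which is what makes the multimodal fusion described below natural.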
How Does BAE-ViT Work?
🔹 Tokenized Fusion: Instead of simply concatenating sex information with image features (as CNN-based models do), BAE-ViT treats non-visual data as tokens and integrates them directly within the transformer blocks. This allows the model to learn richer relationships between sex and skeletal maturity (a simplified sketch follows this list).
🔹 Pre-Trained Transformer Architecture: BAE-ViT builds on TinyViT-21M, a compact and highly efficient vision transformer backbone, adapted here for the medical imaging task of bone age estimation.
🔹 Patch-Based Image Processing: Hand X-ray images are divided into patches, which are encoded and processed alongside demographic tokens using a hierarchical transformer architecture.
🔹 Improved Generalization: By incorporating a diverse dataset from the RSNA Pediatric Bone Age Challenge 2017 (over 14,000 images) and an external validation set, BAE-ViT achieves robust performance across different patient populations.
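The sketch below is a rough, simplified illustration of the tokenized-fusion idea described above, not the authors' code: biological sex is embedded as one extra token and mixed with the patch tokens inside a standard transformer encoder. All module names, dimensions, and the flat (non-hierarchical) encoder are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class ToyTokenFusionBAE(nn.Module):
    """Toy multimodal fusion: sex is embedded as a token and mixed with image patches
    inside the transformer, rather than concatenated to a pooled feature vector."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=4):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)  # grayscale X-ray
        self.sex_embed = nn.Embedding(2, dim)                   # 0 = female, 1 = male
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)                           # regress bone age (months)

    def forward(self, xray, sex):
        p = self.patch_embed(xray).flatten(2).transpose(1, 2)   # (B, n_patches, dim)
        s = self.sex_embed(sex).unsqueeze(1)                    # (B, 1, dim) demographic token
        tokens = torch.cat([s, p], dim=1) + self.pos            # fuse at the token level
        enc = self.encoder(tokens)
        return self.head(enc.mean(dim=1)).squeeze(-1)           # predicted bone age in months

model = ToyTokenFusionBAE()
pred = model(torch.randn(2, 1, 224, 224), torch.tensor([0, 1]))
print(pred.shape)  # torch.Size([2])
```

The actual BAE-ViT replaces this flat toy encoder with the hierarchical TinyViT-21M backbone (available in recent versions of the timm library, e.g. timm.create_model("tiny_vit_21m_224", pretrained=True), if your timm version includes it) and predicts bone age in months.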
Key Findings: BAE-ViT Outperforms CNN-Based Approaches
The study compared BAE-ViT against multiple CNN-based models, including Inception-V3, ResNet50, and EfficientNet-B5.
📊 Performance Highlights (Mean Absolute Error, MAE in months):
✅ BAE-ViT achieved the lowest MAE (4.1 months) across datasets.
✅ The RSNA Challenge 2017 winning model had an MAE of 4.2 months—BAE-ViT outperformed this state-of-the-art approach.
✅ Traditional CNN-based models had significantly higher MAEs (~5.0–6.8 months).
✅ BAE-ViT was more robust to image distortions, handling low-quality X-rays better than CNN-based models.
🔬 Demographic Sensitivity Experiment:
The researchers also tested the importance of accurate sex labels by intentionally mislabeling biological sex in the dataset. The results showed a dramatic increase in prediction error (MAE jumped from 4.1 to 21.5 months), emphasizing the crucial role of demographic integration in BAE models.
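A sensitivity check of this kind is straightforward to reproduce once a trained multimodal model is in hand; the sketch below outlines the protocol (evaluate twice, once with correct and once with inverted sex codes). Here model and val_loader are placeholders, not artifacts from the paper.

```python
import torch

@torch.no_grad()
def mae_months(model, loader, flip_sex: bool = False) -> float:
    """Mean absolute error in months, optionally with intentionally mislabeled sex."""
    model.eval()
    errors = []
    for xray, sex, age_months in loader:       # placeholder loader yielding (image, sex, age)
        if flip_sex:
            sex = 1 - sex                       # invert the binary sex code (0 <-> 1)
        pred = model(xray, sex)
        errors.append((pred - age_months).abs())
    return torch.cat(errors).mean().item()

# print("correct sex:", mae_months(model, val_loader))
# print("flipped sex:", mae_months(model, val_loader, flip_sex=True))
```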
Why This Matters: A New Standard for Multimodal AI in Medical Imaging
This research demonstrates that vision transformers provide a powerful alternative to traditional CNN-based bone age estimation models. By leveraging tokenized fusion of image and demographic data, BAE-ViT offers:
🚀 Higher Accuracy – Reduces errors in bone age prediction compared to leading deep learning models.
💡 Improved Interpretability – Uses ScoreCAM heatmaps to highlight the skeletal regions that drive each prediction (a brief sketch follows this list).
⚡ Greater Computational Efficiency – TinyViT architecture ensures fast and scalable deployment in clinical settings.
🌍 Robustness to Data Variability – Performs well on multi-institutional datasets and low-quality images.
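For readers who want to try the interpretability step themselves, ScoreCAM-style heatmaps can be generated with the open-source grad-cam package (pip install grad-cam). The snippet below is only a hedged sketch of that workflow, not the authors' pipeline: bae_vit, sex_code, and the choice of target_layer are placeholders, the multimodal model is wrapped so the CAM library sees a single image input, and transformer token outputs may additionally need a reshape_transform as described in the library's documentation.

```python
import torch
import torch.nn as nn
from pytorch_grad_cam import ScoreCAM   # pip install grad-cam

class ImageOnlyWrapper(nn.Module):
    """Binds a fixed demographic code so the CAM library can call model(image) alone."""
    def __init__(self, bae_vit: nn.Module, sex_code: int):
        super().__init__()
        self.bae_vit = bae_vit
        self.register_buffer("sex", torch.tensor([sex_code]))

    def forward(self, xray: torch.Tensor) -> torch.Tensor:
        return self.bae_vit(xray, self.sex.expand(xray.size(0)))

# wrapped = ImageOnlyWrapper(bae_vit, sex_code=1)             # placeholders, not from the paper
# cam = ScoreCAM(model=wrapped, target_layers=[target_layer])  # pick a late feature layer
# heatmap = cam(input_tensor=xray_batch)                       # (B, H, W) saliency over the X-ray
```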
Future Directions & Clinical Impact
🔹 Expanding to Other Modalities: Future work will explore BAE-ViT in MRI- and CT-based skeletal assessment.
🔹 Integration with Hospital AI Systems: The model is compatible with existing AI orchestration platforms like NVIDIA Clara.
🔹 Enhancing Generalization: Further fine-tuning on diverse global datasets will improve model adaptability to different populations.
By reducing manual effort and inter-radiologist variability, BAE-ViT could become an essential tool for automated skeletal maturity assessment in pediatric endocrinology, orthopedics, and forensic medicine.
Conclusion
The MIMRTL team’s research on BAE-ViT marks a major step forward in multimodal AI for medical imaging. By harnessing vision transformers for bone age estimation, this approach achieves state-of-the-art performance while improving efficiency, robustness, and generalizability.
🔗 Read the full study: PMC Article