The
Molecular Imaging Technology Research Program (MITRP) is excited to share a new publication in the journal Tomography,
BAE-ViT: An Efficient Multimodal Vision Transformer for Bone Age Estimation, led by
Jinnian Zhang, Weijie Chen, and a team of MITRP collaborators. This study introduces
a novel vision transformer model that integrates both image and demographic data to enhance bone age estimation (BAE).
📄
Read the full paper here: PMC Article
📹 Watch a video about the paper here on LinkedIn: Video
Advancing Bone Age Estimation with AI
Bone age estimation is a critical radiological assessment used to evaluate
skeletal maturity, growth disorders, and endocrine abnormalities. Traditional methods, such as
Greulich–Pyle and Tanner–Whitehouse, rely on radiologist expertise, making them
time-intensive and subject to variability.
AI-driven approaches, particularly
convolutional neural networks (CNNs), have improved efficiency and accuracy in BAE. However,
most existing models do not fully integrate demographic data (such as biological sex), despite its known impact on bone maturation rates.
BAE-ViT (Bone Age Estimation Vision Transformer) addresses this gap by introducing a multimodal fusion method that allows detailed interaction between image and non-image data.
Why a Vision Transformer?
Unlike CNNs, which rely on
spatially localized feature extraction, vision transformers (ViTs) use
self-attention mechanisms that allow every image patch (token) to attend to every other patch in the image. This enables: ✅
Better feature learning across the entire image.
✅
More flexibility in processing diverse data types.
✅
A scalable approach for multimodal fusion (e.g., combining images with patient demographics).
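The "every token attends to every other token" idea can be sketched in a few lines of plain Python. This is a toy single-head attention, not the paper's model: real ViTs use learned query/key/value projections and many heads, whereas here the raw token vectors stand in for Q, K, and V to keep the mechanism visible.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Toy single-head self-attention: each output token is a
    weighted blend of ALL input tokens, so information can flow
    across the entire sequence in one step (unlike a local
    convolution, which only mixes nearby positions)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product similarity of this token to every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)  # positive, sums to 1
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

# Three 2-d tokens; each output row mixes information from all three.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens)
```

Because the attention weights form a convex combination, each output stays within the range of the inputs while still depending on the whole sequence.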
How Does BAE-ViT Work?
🔹
Tokenized Fusion: Instead of simply concatenating sex information with image features (as CNN-based models do),
BAE-ViT treats non-visual data as tokens and integrates them directly within the transformer blocks. This allows the model to learn richer relationships between sex and skeletal maturity.
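A minimal sketch of the tokenized-fusion idea, under assumptions: `embed` is a stand-in for a learned embedding table, and the 4-d vectors are made up for illustration. The point is only that the demographic signal enters as a token of the same shape as the patch tokens, so downstream self-attention can relate it to every patch, rather than being concatenated onto a feature vector at the end.

```python
def tokenize_with_demographics(patch_tokens, sex, embed):
    """Hypothetical tokenized fusion: map the sex label to a vector
    with the same dimensionality as the image patch tokens and
    prepend it to the sequence, so self-attention in later
    transformer blocks can mix it with every patch."""
    sex_token = embed[sex]  # lookup in a (learned) embedding table
    return [sex_token] + patch_tokens

# Toy 4-d patch tokens and a made-up 4-d embedding per sex label.
patches = [[0.2, 0.1, 0.0, 0.5], [0.9, 0.3, 0.4, 0.1]]
embed = {"F": [1.0, 0.0, 0.0, 0.0], "M": [0.0, 1.0, 0.0, 0.0]}
seq = tokenize_with_demographics(patches, "F", embed)
# The sequence now holds 3 tokens: one demographic, two visual.
```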
🔹
Pre-Trained Transformer Architecture: BAE-ViT leverages
TinyViT-21M, a highly efficient vision transformer model optimized for medical imaging tasks.
🔹
Patch-Based Image Processing: Hand X-ray images are divided into
patches, which are
encoded and processed alongside demographic tokens using a hierarchical transformer architecture.
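The patching step itself is simple to illustrate. This sketch splits a tiny image (a list of pixel rows) into non-overlapping p×p patches and flattens each one, as in standard ViT preprocessing; the 4×4 "image" and patch size 2 are made-up values, and a real pipeline would then project each patch to an embedding.

```python
def patchify(image, p):
    """Split an H x W image (list of rows) into non-overlapping
    p x p patches, each flattened row-major into a vector."""
    h, w = len(image), len(image[0])
    assert h % p == 0 and w % p == 0, "image must tile evenly"
    patches = []
    for r in range(0, h, p):
        for c in range(0, w, p):
            patches.append([image[r + i][c + j]
                            for i in range(p) for j in range(p)])
    return patches

# A 4x4 toy image split into 2x2 patches -> 4 patches of length 4.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(img, 2)
```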
🔹
Improved Generalization: By incorporating
a diverse dataset from the
RSNA Pediatric Bone Age Challenge 2017 (over 14,000 images) and an
external validation set, BAE-ViT achieves
robust performance across different patient populations.
Key Findings: BAE-ViT Outperforms CNN-Based Approaches
The study compared
BAE-ViT against multiple CNN-based models, including
Inception-V3, ResNet50, and EfficientNet-B5.
📊
Performance Highlights (Mean Absolute Error, MAE in months):
✅
BAE-ViT achieved the lowest MAE (4.1 months) across datasets.
✅ The
RSNA Challenge 2017 winning model had an MAE of 4.2 months; BAE-ViT
outperformed this state-of-the-art approach.
✅ Traditional
CNN-based models had significantly higher MAEs (~5.0–6.8 months).
✅
BAE-ViT was more robust to image distortions, handling low-quality X-rays better than CNN-based models.
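For readers unfamiliar with the metric: mean absolute error (MAE) is the average of the absolute differences between predicted and true bone ages, in months, so "MAE of 4.1 months" means predictions are off by about four months on average. A minimal computation (the predictions and ground-truth values below are made up, not from the study):

```python
def mean_absolute_error(predicted, actual):
    """MAE in the same unit as the inputs (months here)."""
    assert len(predicted) == len(actual) and predicted
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Made-up bone-age predictions vs. ground truth, in months.
pred = [120.5, 98.0, 145.2, 60.1]
true = [118.0, 101.0, 144.0, 63.0]
mae = mean_absolute_error(pred, true)  # average of |2.5|, |3.0|, |1.2|, |2.9|
```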
🔬
Demographic Sensitivity Experiment:
The researchers also tested the
importance of accurate sex labels by
intentionally mislabeling biological sex in the dataset. The results showed a
dramatic increase in prediction error (MAE jumped from 4.1 to 21.5 months), emphasizing the
crucial role of demographic integration in BAE models.
Why This Matters: A New Standard for Multimodal AI in Medical Imaging
This research demonstrates that
vision transformers provide a powerful alternative to traditional CNN-based bone age estimation models. By leveraging
tokenized fusion of image and demographic data, BAE-ViT offers:
🚀
Higher Accuracy – Reduces errors in bone age prediction compared to leading deep learning models.
💡
Improved Interpretability – Uses
ScoreCAM heatmaps to highlight key skeletal regions influencing predictions.
⚡
Greater Computational Efficiency – TinyViT architecture ensures
fast and scalable deployment in clinical settings.
🌍
Robustness to Data Variability – Performs well on
multi-institutional datasets and low-quality images.
Future Directions & Clinical Impact
🔹
Expanding to Other Modalities: Future work will explore
BAE-ViT in MRI- and CT-based skeletal assessment.
🔹
Integration with Hospital AI Systems: The model is compatible with existing
AI orchestration platforms like NVIDIA Clara.
🔹
Enhancing Generalization: Further
fine-tuning on diverse global datasets will improve model adaptability to different populations.
By reducing
manual effort and inter-radiologist variability, BAE-ViT could become an essential tool for
automated skeletal maturity assessment in pediatric endocrinology, orthopedics, and forensic medicine.
Conclusion
The
MITRP team’s research on BAE-ViT marks a step forward in multimodal AI for medical imaging. By
harnessing vision transformers for bone age estimation, this approach achieves
state-of-the-art performance while improving efficiency, robustness, and generalizability.
🔗 Read the full study: PMC Article