Vision, Language, and Vision-Language Modeling in Radiology
About this resource
In this talk from the Machine Learning for Medical Imaging (ML4MI) community, Tyler Bradshaw, PhD, traces the historical context (e.g., CNN, VGG) leading up to the new era of multimodal learning (e.g., vision-language models) and explores how these models are currently being used in radiology.
Video (requires UW netID): View 24-09-16 ML4MI Recording.
Key points
Vision models
Vision timeline: CNN (vision, ~1989-2018) → UNET (segmentation, 2015) → Transformer (2017) → Vision Transformer (ViT, 2020) → ConvNeXt (2022) → Segment Anything Model (SAM, 2023)
ConvNeXt: A convolutional network that modernizes the standard CNN design by adopting design and training choices from vision transformers. It acts as a bridge between traditional CNNs and transformers, keeping CNNs competitive while incorporating elements that help transformers excel on large datasets.
Segment Anything Model (SAM): A promptable segmentation model that generalizes zero-shot across a wide range of segmentation tasks, i.e., it can segment objects it was not specifically trained on.
CNNs remain competitive with more modern models like Vision Transformers (ViTs). ViTs tend to outperform CNNs only when trained on very large datasets. For mid-sized datasets, which are common in medical imaging, CNNs may still hold an advantage.
Future of vision models: The future may include foundation models, i.e., very large models trained on extensive and diverse datasets. These models aim to acquire world knowledge across various domains, potentially outperforming models trained specifically for individual tasks. However, how "foundational" these models will actually become is still an open research question.
Medical context example: Tumor presence can be predicted more reliably from PET images when both pre-treatment and post-treatment images are used. Having both time points supports tumor prediction when contrast is low, and ViTs are well-suited to model these longitudinal effects.
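To make the longitudinal idea concrete, here is a minimal sketch of one way scans from two time points could be turned into a single token sequence for a ViT-style model. It is an illustration under stated assumptions, not the method described in the talk: the function names (`patchify`, `longitudinal_tokens`), the 4x4 toy "scans", and the choice to tag each patch with a time index are all hypothetical.

```python
def patchify(image, patch):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches


def longitudinal_tokens(scans, patch):
    """Build one token sequence from scans taken at different time points.

    Each token is (time_index, patch_index, flattened_patch). A real ViT would
    project each patch to an embedding and add learned position/time
    embeddings; here the indices are simply carried alongside the raw values.
    """
    tokens = []
    for time_index, scan in enumerate(scans):
        for patch_index, flat in enumerate(patchify(scan, patch)):
            tokens.append((time_index, patch_index, flat))
    return tokens


# Hypothetical 4x4 "scans"; the values are made up for illustration only.
pre_treatment = [[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [2, 2, 3, 3],
                 [2, 2, 3, 3]]
post_treatment = [[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [2, 2, 9, 9],
                  [2, 2, 9, 9]]

tokens = longitudinal_tokens([pre_treatment, post_treatment], patch=2)
```

Because patches from both time points sit in the same sequence, an attention layer could in principle compare the same region before and after treatment directly, which is one plausible reading of why ViTs suit longitudinal data.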
Language models
Language timeline: TF-IDF → Word2Vec (2013) & GloVe (2014) → Transformer (2017) → BERT and GPT (2018)
Transformers and language models generate embeddings that represent words/tokens. Words with similar meanings are placed closer together in the embedding space.
In transformer models, these embeddings are context-aware: the self-attention mechanism lets the same word receive a different embedding depending on the words around it, so words whose meaning changes with context are encoded appropriately.
LLMs can learn patterns in sequential data, which makes them broadly applicable (e.g., text as sequences of tokens, images as sequences of pixels, videos as sequences of frames).
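The two embedding ideas above (similar words sit close together, and self-attention makes embeddings depend on context) can be sketched in a few lines. This is a toy illustration, not a trained model: the 2-dimensional vectors in `EMBED` are invented by hand, and the `self_attention` function is a simplified single-head version that uses the input vectors directly as queries, keys, and values instead of learned projections.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        scores = [sum(a * b for a, b in zip(query, key)) / scale
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all vectors in the sequence.
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(len(query))])
    return outputs


# Hand-made toy embeddings (assumption): axis 0 ~ "nature", axis 1 ~ "finance".
EMBED = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "cash": [0.1, 0.9],
    "bank": [0.5, 0.5],  # ambiguous on its own
}

# Similar meanings are closer together in the embedding space.
sim_money_cash = cosine(EMBED["money"], EMBED["cash"])
sim_money_river = cosine(EMBED["money"], EMBED["river"])

# The same word gets a different contextual vector in different sentences.
bank_near_river = self_attention([EMBED["river"], EMBED["bank"]])[1]
bank_near_money = self_attention([EMBED["money"], EMBED["bank"]])[1]
```

In this toy setup, "money" is more similar to "cash" than to "river", and the contextual vector for "bank" is pulled toward the "nature" axis next to "river" and toward the "finance" axis next to "money". Real models learn both the embeddings and the query/key/value projections from data.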
See also
- Model: UNET: Learn more about the UNET model.