Clustering the BioTrove Dataset

Categories: Projects, ML Marathon, MLM25, Computer vision, Clustering, Unsupervised learning, Biodiversity, Image data, Deep learning, CLIP
Author: Chris Endemann
Published: September 11, 2025

Clustering the BioTrove Dataset was featured in the 2025 Machine Learning Marathon (MLM25). This challenge asks participants to discover genus- and species-level structure in biodiversity images using unsupervised and self-supervised learning methods.

Challenge design

  • Task: Cluster biodiversity images to recover taxonomic structure (genus and species groupings) without explicit labels.
  • Domain: Biodiversity and ecology – automated species identification can support pest control, crop monitoring, biodiversity assessment, and environmental conservation.
  • Data: Images drawn from BioTrove, the largest publicly accessible biodiversity image dataset (161.9 million images, ~366K species), curated from iNaturalist with research-grade annotations.
  • Methods: Contrastive learning, autoencoders, CLIP-based embeddings, and other unsupervised/semi-supervised approaches.
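
To make the task concrete, here is a minimal sketch of one possible pipeline: cluster image embeddings without labels, then check the clusters against held-out genus labels. It is not the challenge's official method. The embeddings are synthetic unit vectors standing in for CLIP-style features, the genus names are placeholders, and every function name (make_toy_embeddings, spherical_kmeans, purity) is hypothetical. Spherical k-means with cosine similarity is used here only as an assumed baseline.

```python
import math
import random
from collections import Counter


def normalize(v):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def farthest_first_init(points, k):
    """Deterministic init: repeatedly pick the point least similar to all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        nxt = min(points, key=lambda p: max(cosine(p, c) for c in centers))
        centers.append(nxt)
    return centers


def spherical_kmeans(points, k, n_iter=20):
    """Cluster unit vectors by cosine similarity; returns one cluster index per point."""
    centers = farthest_first_init(points, k)
    assignments = []
    for _ in range(n_iter):
        new_assignments = [
            max(range(k), key=lambda j: cosine(p, centers[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[j] = normalize(mean)
    return assignments


def purity(assignments, labels):
    """Fraction of points whose cluster's majority label matches their own label."""
    total = 0
    for j in set(assignments):
        members = [lab for a, lab in zip(assignments, labels) if a == j]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels)


def make_toy_embeddings(n_per_genus=30, seed=0):
    """Synthetic stand-in for image embeddings: noisy unit vectors around 3 prototypes."""
    rng = random.Random(seed)
    prototypes = {  # placeholder genus names, not taken from BioTrove
        "genus_a": [1.0, 0.0, 0.0],
        "genus_b": [0.0, 1.0, 0.0],
        "genus_c": [0.0, 0.0, 1.0],
    }
    points, labels = [], []
    for name, proto in prototypes.items():
        for _ in range(n_per_genus):
            noisy = [x + rng.uniform(-0.05, 0.05) for x in proto]
            points.append(normalize(noisy))
            labels.append(name)
    return points, labels


if __name__ == "__main__":
    embeddings, genus = make_toy_embeddings()
    clusters = spherical_kmeans(embeddings, k=3)  # labels are never seen here
    print(f"purity vs. held-out genus labels: {purity(clusters, genus):.2f}")
```

The labels are used only for evaluation, never for clustering, which mirrors the challenge setup. Real embeddings will be far less cleanly separated than this toy data, and the number of clusters would have to be chosen or estimated rather than known in advance.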

Questions

If you have questions about this project, please post them to the Nexus Q&A on GitHub. We will update the materials on this website as new questions come in.