Clustering the BioTrove Dataset
Projects
ML Marathon
MLM25
Computer vision
Clustering
Unsupervised learning
Biodiversity
Image data
Deep learning
CLIP
Clustering the BioTrove Dataset was a featured challenge in the 2025 Machine Learning Marathon (MLM25). Participants are asked to discover genus- and species-level structure in biodiversity images using unsupervised and self-supervised learning methods.
Challenge design
- Task: Cluster biodiversity images to recover taxonomic structure (genus and species groupings) without explicit labels.
- Domain: Biodiversity and ecology. Automated species identification can support pest control, crop monitoring, biodiversity assessment, and environmental conservation.
- Data: Images drawn from BioTrove, the largest publicly accessible biodiversity image dataset (161.9 million images, ~366K species), curated from iNaturalist with research-grade annotations.
- Methods: Contrastive learning, autoencoders, CLIP-based embeddings, and other unsupervised/semi-supervised approaches.
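Because the task is to recover genus and species groupings without labels, one way to check a clustering is to compare it against held-out taxonomic labels afterwards. The sketch below shows two common label-agreement scores, purity and normalized mutual information (NMI, arithmetic-mean normalization). It is an illustration only: this page does not state the challenge's actual scoring metric, the function names `purity` and `nmi` are hypothetical, and the genus labels and cluster ids are made-up toy data.

```python
import math
from collections import Counter


def purity(true_labels, cluster_ids):
    """Fraction of items that carry the majority label of their cluster."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    per_cluster = {}
    for label, cluster in zip(true_labels, cluster_ids):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(true_labels)


def _entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def nmi(true_labels, cluster_ids):
    """Mutual information divided by the mean of the two label entropies."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_ids))
    label_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_ids)
    mutual_info = sum(
        (k / n) * math.log(k * n / (label_counts[t] * cluster_counts[c]))
        for (t, c), k in joint.items()
    )
    mean_entropy = (_entropy(true_labels) + _entropy(cluster_ids)) / 2
    # If both partitions are a single group, treat them as in perfect agreement.
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


# Toy data (invented for illustration): six images, three genera,
# and cluster ids as they might come out of k-means on image embeddings.
genus = ["Apis", "Apis", "Bombus", "Bombus", "Danaus", "Danaus"]
clusters = [0, 0, 1, 1, 1, 2]

print(f"purity = {purity(genus, clusters):.3f}")  # 5 of 6 items match their cluster's majority genus
print(f"NMI    = {nmi(genus, clusters):.3f}")
```

Both scores ignore how clusters are numbered, so they apply to any method in the list above. Purity rises trivially as the number of clusters grows, which is one reason NMI is often reported alongside it.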
Links
- Kaggle challenge: Clustering the BioTrove Dataset
- Winning writeup: 1st place: It All Depends on a Good Embedding
- BioTrove project: baskargroup.github.io/BioTrove
Questions
If you have questions about this project, please post them to the Nexus Q&A on GitHub. We will improve the materials on this website as additional questions come in.