Clustering the BioTrove Dataset

Projects

ML Marathon

MLM25

Computer vision

Clustering

Unsupervised learning

Biodiversity

Image data

Deep learning

CLIP

Author

Chris Endemann

Published

September 11, 2025

Clustering the BioTrove Dataset was featured in the 2025 Machine Learning Marathon (MLM25). This challenge asks participants to discover genus- and species-level structure in biodiversity images using unsupervised and self-supervised learning methods.

Challenge design

Task: Cluster biodiversity images to recover taxonomic structure (genus and species groupings) without explicit labels.
Domain: Biodiversity and ecology – automated species identification can support pest control, crop monitoring, biodiversity assessment, and environmental conservation.
Data: Images drawn from BioTrove, the largest publicly accessible biodiversity image dataset (161.9 million images, ~366K species), curated from iNaturalist with research-grade annotations.
Methods: Contrastive learning, autoencoders, CLIP-based embeddings, and other unsupervised/semi-supervised approaches.
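
As a rough illustration of the task described above, the following sketch clusters image embeddings and then scores the clusters against taxonomic labels that are held out until evaluation. It is not the challenge's official pipeline or metric. The embeddings are synthetic stand-ins for what a CLIP-style encoder might produce, the genus names are hypothetical, and plain k-means with farthest-point initialization plus a simple purity score are assumptions chosen to keep the example dependency-free.

```python
import math
import random
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length, as is common for CLIP-style embeddings."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from all centers chosen so far (deterministic)."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def kmeans(points, k, iters=10):
    """Plain k-means: alternate nearest-center assignment and mean updates."""
    centers = farthest_point_init(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def purity(assign, labels):
    """Fraction of items whose label matches their cluster's majority label."""
    matched = 0
    for c in set(assign):
        counts = Counter(l for a, l in zip(assign, labels) if a == c)
        matched += counts.most_common(1)[0][1]
    return matched / len(labels)


# Synthetic stand-in data (assumption): three hypothetical genera, each a
# noisy cloud around a different axis of an 8-dimensional embedding space.
rng = random.Random(0)
DIM = 8
genera = ["genus_a", "genus_b", "genus_c"]
embeddings, labels = [], []
for axis, genus in enumerate(genera):
    for _ in range(30):
        vec = [rng.gauss(0.0, 0.05) for _ in range(DIM)]
        vec[axis] += 1.0
        embeddings.append(l2_normalize(vec))
        labels.append(genus)

# Labels are never passed to the clustering step; they are used only to score it.
assignments = kmeans(embeddings, k=len(genera))
score = purity(assignments, labels)
print(f"cluster purity vs. held-out genus labels: {score:.2f}")
```

On real BioTrove images the number of clusters is not known in advance and embeddings are far less cleanly separated, so purity alone would be a weak measure. It is used here only because it is easy to read.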
