¹Dept. of Chemistry and Bioscience, Aalborg University · ²Dept. of Architecture, Design and Media Technology, Visual Analysis and Perception Lab, Aalborg University
Manual labeling of animal images remains a significant bottleneck in ecological research, limiting the scale and efficiency of biodiversity monitoring. This study investigates whether state-of-the-art Vision Transformer (ViT) foundation models can organize thousands of unlabeled animal images directly into species-level clusters. We present a comprehensive benchmarking framework evaluating five ViT models combined with five dimensionality reduction techniques and four clustering algorithms across 60 species (30 mammals and 30 birds), with each test using a random subset of 200 validated images per species. Our results demonstrate near-perfect species-level clustering (V-measure: 0.958) using DINOv3 embeddings with t-SNE and supervised hierarchical clustering. Unsupervised approaches achieve competitive performance (V-measure: 0.943) while requiring no prior species knowledge, rejecting only 1.14% of images as outliers. We further demonstrate robustness to realistic long-tailed distributions and show that intentional over-clustering can reliably extract intra-specific variation, including age classes, sexual dimorphism, and pelage differences.
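The embed → reduce → cluster → score pipeline described above can be sketched with scikit-learn on synthetic stand-in embeddings (the real study uses DINOv3 features; the dimensions, class counts, and noise level below are illustrative assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)

# Stand-in for ViT embeddings: 3 "species", 50 images each, 768-D
n_species, per_class, dim = 3, 50, 768
centers = rng.normal(size=(n_species, dim))
X = np.vstack([c + 0.1 * rng.normal(size=(per_class, dim)) for c in centers])
y_true = np.repeat(np.arange(n_species), per_class)

# Reduce to 2-D with t-SNE, then "supervised" hierarchical clustering:
# the species count is supplied, not discovered
Z = TSNE(n_components=2, random_state=0, perplexity=30).fit_transform(X)
labels = AgglomerativeClustering(n_clusters=n_species).fit_predict(Z)

# V-measure compares predicted clusters against ground-truth species
print(f"V-measure: {v_measure_score(y_true, labels):.3f}")
```

Here "supervised" refers only to fixing the number of clusters to the known species count; a fully unsupervised variant would instead estimate the cluster count from the data.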
State-of-the-art models like SpeciesNet provide predictions across thousands of species but use conservative rollup strategies, leaving many animals labeled at high taxonomic levels. We present a hierarchical re-classification system combining SpeciesNet predictions with CLIP embeddings and metric learning to refine labels toward species-level identification, achieving 96.5% accuracy on re-classified detections.
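The refinement idea above, pushing a coarse taxonomic label toward a species via embedding similarity, can be sketched as nearest-prototype matching (the prototypes, species names, and threshold here are hypothetical placeholders, not the system's actual metric-learning model):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 512  # CLIP-like embedding size

# Hypothetical species prototypes, e.g. averaged from verified exemplars
species = ["red_fox", "arctic_fox"]
prototypes = {s: rng.normal(size=dim) for s in species}

def refine(embedding, candidates, prototypes, threshold=0.5):
    """Refine a genus-level detection to a species label when cosine
    similarity to one candidate prototype is sufficiently high;
    otherwise keep the conservative high-level label (return None)."""
    sims = {
        s: embedding @ prototypes[s]
        / (np.linalg.norm(embedding) * np.linalg.norm(prototypes[s]))
        for s in candidates
    }
    best = max(sims, key=sims.get)
    return best if sims[best] >= threshold else None

# A detection close to the red_fox prototype gets refined to species level
near = prototypes["red_fox"] + 0.1 * rng.normal(size=dim)
print(refine(near, species, prototypes))
```

A detection whose embedding sits far from every prototype returns `None` and keeps its original high-level label, mirroring the conservative rollup behaviour.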
We evaluate zero-shot approaches for organizing unlabeled camera trap imagery using self-supervised vision transformers (CLIP, DINOv2, MegaDescriptor) with clustering and continuous 1D similarity ordering. DINOv2 with UMAP and GMM achieves 88.6% accuracy, while 1D t-SNE sorting reaches 95.2% coherence for fish and 88.2% for mammals/birds. The pipeline is deployed in production on Animal Detect to accelerate annotation workflows.
Works best on camera trap images of mammals and birds
Supports JPG, PNG, WebP formats (max 5MB)
Wildlife Detection, MegaDetector, Camera Trap Analysis
TensorFlow, PyTorch, Wildlife Classification Models
Python, Wildlife APIs, LILA BC Datasets
Trail Cameras, IR Systems, Wildlife Monitoring
ROS, Hexapod Robots, Search & Rescue Systems
MSc Robotics, AI:EcoNet (PhD) - ongoing
Co-founded in 2023, Animal Detect is revolutionizing wildlife conservation through AI. Born from my passion for merging robotics with nature, this platform helps researchers process millions of camera trap images, turning wildlife data into actionable conservation insights.
Founded with Eugene Galaxy during my MSc at AAU, Really A Robot has been my 5+ year journey from robotics student to conservation technology entrepreneur. From hexapod search-and-rescue robots to wildlife cameras, it's where Animal Detect was born and my passion for technology-driven conservation flourished.