Wild Animal Species Classification From Camera Traps Using Metadata Analysis
Abstract—Camera trap imaging has emerged as a valuable tool for modern wildlife surveillance, enabling researchers to monitor and study wild animals and their behaviours. However, a significant challenge in camera trap data analysis is the labour-intensive task of species classification from the captured images. This study proposes a novel approach to species classification by leveraging metadata associated with camera trap images. By developing predictive models using metadata alone, we demonstrate that accurate species classification can be achieved without accessing the image data. Our approach reduces the computational burden and offers potential benefits in scenarios where image access is restricted or limited. Our findings highlight the valuable role of metadata in complementing the species classification process and present new opportunities for efficient and scalable wildlife monitoring using camera trap technology.

Index Terms—Metadata, Camera trap imaging, Neural networks, Data fusion, Scene recognition.
I. INTRODUCTION

Human-induced influences like climate change [1], [2], deforestation [3], and trafficked roads [4], [5] have placed dramatic strain on wildlife, ushering in an era termed the "Anthropocene" [6]. Monitoring such habitats [7], [8] is crucial, as shown by the 2019-20 Australian wildfires [9]. Camera traps offer rich insights [10]–[12], but growing data volumes necessitate robust filtering [13], [14]. Databases such as LILA BC and the Snapshot Serengeti (SS) dataset [15] exist; this paper utilizes a smaller dataset from the Norwegian Institute for Nature Research [16]. Past studies mainly employed image analysis for species identification [13], [14], [17], with few incorporating metadata [18]–[20]. Our study emphasizes metadata's significance, defining explicit metadata as data accompanying the image (such as temperature, date, and location) and implicit metadata as indirect information about the image itself (such as scene descriptors and attributes), extracted using models pre-trained on the Places365 dataset [21]. We advance species classification by using metadata alongside image data, enhancing accuracy in camera trap research. The paper proceeds as follows: Section II covers related work, Section III discusses the methodology for data acquisition and classification, Section IV presents the results and discussion, and Section V concludes our findings.
II. RELATED WORKS

Although there are numerous papers discussing various aspects of metadata usage, limited attention has been given to its direct application for classification purposes. In this section, we explore related works concerning image classification, focusing explicitly on animals. For example, Norouzzadeh et al. [13] suggest that image classification is enhanced by object detection, which filters out irrelevant background data without requiring additional resources. They used an existing pre-trained model for object detection, achieving an accuracy of 91.71%, precision of 84.47%, and recall of 84.24%. Animals in each scene were counted via bounding boxes, and the kind of animal in non-empty images was identified. Despite an imbalanced dataset, they achieved high accuracy for the majority of classes and an overall accuracy of 91.37%. The paper also explores active learning methods. Norouzzadeh et al. [14] focus on animal classification, object counting [22], action recognition [23], and detecting the presence of young. Their multi-stage fusion network outperforms a full classifier model, tackling four objectives: animal species classification [24], social interaction [25], animal count [26], and attribute addition [27]. They achieved 96.8% accuracy with a VGG [28] network for the first task, and top-1 accuracy of 94.9% and top-5 accuracy of 99.1% for the second. Binned animal count achieved 62.8% accuracy, and 83.6% when counting to within one bin. Action detection yielded 75.6% accuracy, 84.5% precision, and 80.9% recall. Similarly, Schindler et al. [29] propose a two-stage fusion network using Mask R-CNN for animal classification and action determination. Temporal data from the video were used for action recognition, with variations of ResNet-18 handling 3 × T × H × W frame input; the SlowFast network proposed by [30] underperformed. The authors also present their own accuracy metrics for segmentation, with the best segmentation method achieving 63.8% average precision and 94.1% action detection accuracy.
III. METHODOLOGY

A. Acquisition

The acquisition of the NINA Viltkamera dataset metadata is a complex task. All images and their corresponding metadata are publicly available on the Norwegian Institute for Nature Research (NINA) website. However, direct downloading is not feasible due to the extensive number of potential unique URLs. Therefore, we resorted to web scraping to acquire the necessary data. Within the website's interactive map, each camera trap pin held specific metadata. By creating a script, we automated the extraction process of these URLs and their corresponding metadata. Each URL was linked to a JSON object under the "VM" entity on the website. This JSON object
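A rough sketch of such a scraping script is given below; the pin URL and the structure of the returned JSON are assumptions for illustration, as only the existence of per-pin JSON objects under the "VM" entity is described above.

```python
# Hypothetical sketch of the per-pin metadata scraping described above.
# The placeholder URL and JSON layout are assumed, not a documented API.
import requests

def fetch_pin_metadata(pin_url: str) -> dict:
    """Fetch the JSON object linked to a single camera-trap pin."""
    response = requests.get(pin_url, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # In practice, pin URLs are extracted from the interactive map;
    # this placeholder stands in for one such script-extracted URL.
    pin_url = "https://viltkamera.nina.no/..."  # placeholder pin URL
    print(fetch_pin_metadata(pin_url))
```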
Here, the elements x_{i,j} constitute the observed response matrix M. Finally, the Cohen Kappa Score (κ) is calculated using these probabilities:

κ = (p_o − p_e) / (1 − p_e)

This score provides a more robust measure than accuracy, as it considers both the class imbalance and the probability of a correct prediction occurring by chance, offering a more nuanced view of our model's performance.
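As a concrete reference, the formula transcribes directly into a few lines of Python; the toy labels below are illustrative only.

```python
import numpy as np

def cohen_kappa(y_true, y_pred, n_classes):
    """Direct transcription of kappa = (p_o - p_e) / (1 - p_e)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p_o = np.mean(y_true == y_pred)  # observed agreement
    # Chance agreement: summed products of the marginal label frequencies.
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c)
              for c in range(n_classes))
    return (p_o - p_e) / (1 - p_e)

# Toy three-class example; scikit-learn's cohen_kappa_score agrees.
print(cohen_kappa([0, 1, 2, 2, 1, 0], [0, 1, 2, 1, 1, 0], n_classes=3))  # 0.75
```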
E. Classification

To properly evaluate what effect metadata has on classification, we need to perform an exhaustive search of the classes and features available. This involves classifying n classes using m features, where n ≥ 2 and m ≥ 1. Running all of these combinations would give a total of 1,040,186,586 individual cases to test, an amount of computation that is currently unrealistic. Instead, we opted to look at a subset of the classes. The classes we decided to investigate were: 'Fox', 'Deer', 'Mustelidae', 'Bird', 'Lynx', 'Cat', 'Sheep', 'Rodent', and 'Wolf'. We also combined temperature and position into one feature, reasoning that the single data point of temperature would likely not be a useful classifier on its own. This left us with nine classes and four features that could be included or excluded, giving a more manageable 7529 combinations that we classified exhaustively.
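A quick back-of-the-envelope count reproduces the scale of this reduced search space, under the assumed reading that every subset of at least two of the nine classes is paired with every non-empty subset of the four features:

```python
from math import comb

# Subsets of >= 2 classes drawn from 9, paired with non-empty feature subsets.
n_class_subsets = sum(comb(9, k) for k in range(2, 10))  # 502
n_feature_subsets = 2 ** 4 - 1                           # 15
print(n_class_subsets * n_feature_subsets)               # 7530, the scale reported above
```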
We focused on the quantitative study of all permutations of animals and metadata information. We used a 4-layer fully connected network, with batch normalization and dropout between each layer to combat overfitting. The hidden layers were static, having 64 and 32 neurons, respectively. The input layer had a dynamic number of neurons equal to the number of input features currently selected. Likewise, the output layer was set to the current number of classes to be classified.
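A minimal PyTorch sketch of this architecture is given below; the ReLU activations and the dropout rate are assumptions, as neither is specified above.

```python
import torch.nn as nn

class MetadataClassifier(nn.Module):
    """Fully connected classifier: dynamic input and output widths,
    static hidden layers of 64 and 32 neurons, with batch normalization
    and dropout between layers."""

    def __init__(self, n_features: int, n_classes: int, p_drop: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),  # input width = selected features
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(64, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(32, n_classes),   # output width = classes in this run
        )

    def forward(self, x):
        return self.net(x)
```

Each (class subset, feature subset) case then instantiates a fresh model with the matching input and output widths.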
F. Data Visualization

Another efficient way of assessing whether metadata can be used to classify different species is the use of data visualization tools. Each sample in our data consists of 538 features, meaning we could map the data in a 538-dimensional space and assess what groupings are present. As no currently known technique exists for viewing visual information above three dimensions (four if you include temporal information), we had to rely on dimensionality reduction techniques instead. Dimensionality reduction, in general, aims to preserve the structure of the data as much as possible while reducing the overall information saved for each data point. Our paper utilizes a new approach to dimensionality reduction proposed by [32]. Uniform Manifold Approximation and Projection, or UMAP for short, utilizes topology, higher-dimensional manifolds, and graph theory to project high-dimensional data down to a lower dimension while minimizing the cross-entropy between the original projection and the re-projection. The algorithm has been demonstrated to equal or outperform other popular dimensionality reduction techniques such as t-SNE [33], LargeVis [34], and Laplacian eigenmaps [35]. The theory behind UMAP is quite involved, requiring a good understanding of topology; however, an excellent summary is given by [36], who break the process down into two major steps, each with several minor steps:

1. Learn the manifold structure
   1.1 Finding nearest neighbours
   1.2 Constructing the neighbours graph
       1.2.1 Varying distance
       1.2.2 Local connectivity
       1.2.3 Fuzzy area
       1.2.4 Merging of edges
2. Find the low-dimensional representation
   2.1 Minimum distance
   2.2 Minimizing the cost function

Utilizing UMAP, we can investigate whether any patterns emerge in animal clusters. If we find local clusters in the dimensionality-reduced space, we can expect those same patterns to hold in the original 538-dimensional space that we cannot inspect directly.
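A minimal sketch of this visualization step using the umap-learn package; the random feature matrix and labels below are placeholders standing in for the real 538-dimensional metadata vectors and their species labels.

```python
import numpy as np
import umap  # the umap-learn package
import matplotlib.pyplot as plt

# Placeholder data standing in for the 538-dimensional metadata vectors.
X = np.random.rand(1000, 538)
y = np.random.randint(0, 9, size=1000)  # nine species labels

# Project to two dimensions and inspect whether species form local clusters.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap="tab10")
plt.title("UMAP projection of metadata features")
plt.show()
```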
G. Implementation Details

To create and run the models, we used the Python programming language, with the PyTorch [37] framework for creating, importing, and training models. The models primarily used categorical cross-entropy [38] as the loss function and the Adam optimizer [39]. The networks were mainly created and trained on a Linux computer with an Intel i9-12900KF CPU, 128 GB of RAM, and an RTX 3080 Ti GPU. All weights were randomly initialized, with the optimizer set with an initial
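A minimal sketch of this training configuration, reusing the MetadataClassifier sketch above; the learning rate, epoch count, and synthetic tensors are assumptions for illustration.

```python
import torch
import torch.nn as nn

model = MetadataClassifier(n_features=538, n_classes=9)
criterion = nn.CrossEntropyLoss()  # categorical cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed initial learning rate

X = torch.randn(256, 538)        # placeholder metadata features
y = torch.randint(0, 9, (256,))  # placeholder species labels

for epoch in range(100):         # assumed epoch count
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
```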
Classes     | Features used                                                 | Acc   | κ
------------|---------------------------------------------------------------|-------|------
4, 6        | Scene attributes                                              | 0.948 | 0.894
6, 12       | Position and temperature, Scene attributes                    | 0.982 | 0.945
4, 6        | Places, Position and temperature, Scene attributes            | 0.967 | 0.932
6, 12       | Datetime, Places, Position and temperature, Scene attributes  | 0.989 | 0.964
3, 4, 6     | Scene attributes                                              | 0.87  | 0.779
3, 4, 6     | Position and temperature, Scene attributes                    | 0.869 | 0.782
3, 4, 6     | Datetime, Places, Scene attributes                            | 0.866 | 0.775
3, 4, 6     | Datetime, Places, Position and temperature, Scene attributes  | 0.878 | 0.796
2, 3, 4, 6  | Scene attributes                                              | 0.696 | 0.552
3, 4, 6, 12 | Position and temperature, Scene attributes                    | 0.731 | 0.603
3, 4, 6, 12 | Datetime, Position and temperature, Scene attributes          | 0.729 | 0.614
3, 4, 6, 12 | Datetime, Places, Position and temperature, Scene attributes  | 0.746 | 0.63