Wild Animal Species Classification From Camera Traps Using Metadata Analysis
Abstract—Camera trap imaging has emerged as a valuable tool for modern wildlife surveillance, enabling researchers to monitor and study wild animals and their behaviours. However, a significant challenge in camera trap data analysis is the labour-intensive task of species classification from the captured images. This study proposes a novel approach to species classification by leveraging metadata associated with camera trap images. By developing predictive models using metadata alone, we demonstrate that accurate species classification can be achieved without accessing the image data. Our approach reduces the computational burden and offers potential benefits in scenarios where image access is restricted or limited. Our findings highlight the valuable role of metadata in complementing the species classification process and present new opportunities for efficient and scalable wildlife monitoring using camera trap technology.

Index Terms—Metadata, Camera trap imaging, Neural networks, Data fusion, Scene recognition.
I. INTRODUCTION

Human-induced influences like climate change [1], [2], deforestation [3], and trafficked roads [4], [5] have placed dramatic strain on wildlife, ushering in an era termed the "Anthropocene" [6]. Monitoring such habitats [7], [8] is crucial, as shown by the 2019-20 Australian wildfires [9]. Camera traps offer rich insights [10]–[12], but growing data volumes necessitate robust filtering [13], [14]. Databases such as LILA BC and the Snapshot Serengeti (SS) dataset [15] exist; this paper utilizes a smaller dataset from the Norwegian Institute for Nature Research [16]. Past studies mainly employed image analysis for species identification [13], [14], [17], with few incorporating metadata [18]–[20]. Our study emphasizes metadata's significance, defining explicit metadata as data accompanying the image (such as temperature, date, and location) and implicit metadata as indirect information about the image itself (such as scene descriptors and attributes), extracted using models pre-trained on the Places365 dataset [21]. We advance species classification by using metadata alongside image data, enhancing accuracy in camera trap research. The paper proceeds as follows: Section II covers related work, Section III discusses the methodology for data acquisition and classification, Section IV presents the results and discussion, and Section V concludes our findings.
II. RELATED WORKS

Although there are numerous papers discussing various aspects of metadata usage, limited attention has been given to its direct application for classification purposes. In this section, we explore related works concerning image classification, focusing explicitly on animals. For example, Norouzzadeh et al. [13] suggest that image classification is enhanced by object detection, which filters out irrelevant background data without requiring additional resources. They used an existing pre-trained model for object detection, achieving an accuracy of 91.71%, precision of 84.47%, and recall of 84.24%. Animals in each scene were counted via bounding boxes, and the kind of animal in non-empty images was identified. Despite an imbalanced dataset, they achieved high accuracy for the majority of classes and an overall accuracy of 91.37%. The paper also explores active learning methods. Norouzzadeh et al. [14] focus on animal classification, object counting [22], action recognition [23], and detecting the presence of young. Their multi-stage fusion network outperforms a full classifier model, tackling four objectives: animal species classification [24], social interaction [25], animal count [26], and attribute addition [27]. They achieved 96.8% accuracy with a VGG [28] network for the first task, and top-1 accuracy of 94.9% and top-5 accuracy of 99.1% for the second. Binned animal count achieved 62.8% accuracy, and 83.6% when counting to within one bin. Action detection yielded 75.6% accuracy, 84.5% precision, and 80.9% recall. Similarly, Schindler et al. [29] propose a two-stage fusion network using Mask R-CNN for animal classification and action determination. Temporal data from the video were used for action recognition, with variations of ResNet-18 handling 3 × T × H × W frame input; the SlowFast network proposed by [30] underperformed. The authors also present their own accuracy metrics for segmentation, with the best segmentation method achieving 63.8% average precision and 94.1% action detection accuracy.
III. METHODOLOGY

A. Acquisition

The acquisition of the NINA Viltkamera dataset metadata is a complex task. All images and their corresponding metadata are publicly available on the Norwegian Institute for Nature Research (NINA) website. However, direct downloading is not feasible due to the extensive number of potential unique URLs. Therefore, we resorted to web scraping to acquire the necessary data. Within the website's interactive map, each camera trap pin held specific metadata. By creating a script, we automated the extraction process of these URLs and their corresponding metadata. Each URL was linked to a JSON object under the "VM" entity on the website. This JSON object
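A rough sketch of such a scraping script is given below; the pin URL and the structure of the returned JSON are assumptions for illustration, as only the existence of per-pin JSON objects under the "VM" entity is described above.

```python
# Hypothetical sketch of the per-pin metadata scraping described above.
# The placeholder URL and JSON layout are assumed, not a documented API.
import requests

def fetch_pin_metadata(pin_url: str) -> dict:
    """Fetch the JSON object linked to a single camera-trap pin."""
    response = requests.get(pin_url, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # In practice, pin URLs are extracted from the interactive map;
    # this placeholder stands in for one such script-extracted URL.
    pin_url = "https://viltkamera.nina.no/..."  # placeholder pin URL
    print(fetch_pin_metadata(pin_url))
```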
Here, the elements x_{i,j} constitute the observed response matrix M. Finally, the Cohen Kappa Score (κ) is calculated using these probabilities:

κ = (p_o − p_e) / (1 − p_e)

This score provides a more robust measure than accuracy, as it considers both the class imbalance and the probability of a correct prediction occurring by chance, offering a more nuanced view of our model's performance.
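As a concrete reference, the formula transcribes directly into a few lines of Python; the toy labels below are illustrative only.

```python
import numpy as np

def cohen_kappa(y_true, y_pred, n_classes):
    """Direct transcription of kappa = (p_o - p_e) / (1 - p_e)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p_o = np.mean(y_true == y_pred)  # observed agreement
    # Chance agreement: summed products of the marginal label frequencies.
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c)
              for c in range(n_classes))
    return (p_o - p_e) / (1 - p_e)

# Toy three-class example; scikit-learn's cohen_kappa_score agrees.
print(cohen_kappa([0, 1, 2, 2, 1, 0], [0, 1, 2, 1, 1, 0], n_classes=3))  # 0.75
```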
E. Classification

To properly evaluate what effect metadata has on classification, we need to perform an exhaustive search of the classes and features available. This involves classifying n classes using m features, where n ≥ 2 and m ≥ 1. Running all of these combinations would give a total of 1,040,186,586 individual cases to test, an amount of computation that is currently unrealistic. Instead, we opted to look at a subset of the classes. The classes we decided to investigate were: 'Fox', 'Deer', 'Mustelidae', 'Bird', 'Lynx', 'Cat', 'Sheep', 'Rodent', and 'Wolf'. We also combined temperature and position into one feature, reasoning that the single data point of temperature would likely not be a useful classifier on its own. This left us with nine classes and four features that could be included or excluded, giving a more manageable 7529 combinations that we classified exhaustively.
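A quick back-of-the-envelope count reproduces the scale of this reduced search space, under the assumed reading that every subset of at least two of the nine classes is paired with every non-empty subset of the four features:

```python
from math import comb

# Subsets of >= 2 classes drawn from 9, paired with non-empty feature subsets.
n_class_subsets = sum(comb(9, k) for k in range(2, 10))  # 502
n_feature_subsets = 2 ** 4 - 1                           # 15
print(n_class_subsets * n_feature_subsets)               # 7530, the scale reported above
```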
We focused on the quantitative study of all permutations of animals and metadata information. We used a 4-layer fully connected network, with batch normalization and dropout between each layer to combat overfitting. The hidden layers were static, having 64 and 32 neurons, respectively. The input layer had a dynamic number of neurons equal to the number of input features currently selected. Likewise, the output layer was set to the current number of classes to be classified.
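A minimal PyTorch sketch of this architecture is given below; the ReLU activations and the dropout rate are assumptions, as neither is specified above.

```python
import torch.nn as nn

class MetadataClassifier(nn.Module):
    """Fully connected classifier: dynamic input and output widths,
    static hidden layers of 64 and 32 neurons, with batch normalization
    and dropout between layers."""

    def __init__(self, n_features: int, n_classes: int, p_drop: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),  # input width = selected features
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(64, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(32, n_classes),   # output width = classes in this run
        )

    def forward(self, x):
        return self.net(x)
```

Each (class subset, feature subset) case then instantiates a fresh model with the matching input and output widths.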
F. Data Visualization

Another efficient way of assessing whether metadata can be used to classify different species is the use of data visualization tools. Each sample in our data consists of 538 features, meaning we could map the data in a 538-dimensional space and assess what groupings are present. As no currently known technique exists for viewing visual information above three dimensions (four if you include temporal information), we had to rely on dimensionality reduction techniques instead. Dimensionality reduction, in general, aims to preserve the structure of the data as much as possible while reducing the overall information saved for each data point. Our paper utilizes a new approach to dimensionality reduction proposed by [32]. Uniform Manifold Approximation and Projection, or UMAP for short, utilizes topology, higher-dimensional manifolds, and graph theory to project high-dimensional data down to a lower dimension while minimizing the cross-entropy between the original projection and the re-projection. The algorithm has been demonstrated to equal or outperform other popular dimensionality reduction techniques such as t-SNE [33], LargeVis [34], and Laplacian eigenmaps [35]. The theory behind UMAP is quite involved, requiring a good understanding of topology; however, an excellent summary is given by [36], who break the process down into two major steps, each with several minor steps:

1. Learn the manifold structure
   1.1 Finding nearest neighbours
   1.2 Constructing the neighbours graph
       1.2.1 Varying distance
       1.2.2 Local connectivity
       1.2.3 Fuzzy area
       1.2.4 Merging of edges
2. Find the low-dimensional representation
   2.1 Minimum distance
   2.2 Minimizing the cost function

Utilizing UMAP, we can investigate whether any patterns emerge in animal clusters. If we find local clusters in the dimensionality-reduced space, we can expect those same patterns to hold in the original 538-dimensional space that we cannot inspect directly.
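A minimal sketch of this visualization step using the umap-learn package; the random feature matrix and labels below are placeholders standing in for the real 538-dimensional metadata vectors and their species labels.

```python
import numpy as np
import umap  # the umap-learn package
import matplotlib.pyplot as plt

# Placeholder data standing in for the 538-dimensional metadata vectors.
X = np.random.rand(1000, 538)
y = np.random.randint(0, 9, size=1000)  # nine species labels

# Project to two dimensions and inspect whether species form local clusters.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap="tab10")
plt.title("UMAP projection of metadata features")
plt.show()
```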
G. Implementation Details

To create and run the models, we used the Python programming language, with the PyTorch [37] framework for creating, importing, and training models. The models primarily used categorical cross-entropy [38] as the loss function and the Adam optimizer [39]. The networks were mainly created and trained on a Linux computer with an Intel i9-12900KF CPU, 128 GB of RAM, and an RTX 3080 Ti GPU. All weights were randomly initialized, with the optimizer set with an initial
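A minimal sketch of this training configuration, reusing the MetadataClassifier sketch above; the learning rate, epoch count, and synthetic tensors are assumptions for illustration.

```python
import torch
import torch.nn as nn

model = MetadataClassifier(n_features=538, n_classes=9)
criterion = nn.CrossEntropyLoss()  # categorical cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed initial learning rate

X = torch.randn(256, 538)        # placeholder metadata features
y = torch.randint(0, 9, (256,))  # placeholder species labels

for epoch in range(100):         # assumed epoch count
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
```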
Classes     | Features used                                                 | Acc   | κ
------------|---------------------------------------------------------------|-------|------
4, 6        | Scene attributes                                              | 0.948 | 0.894
6, 12       | Position and temperature, Scene attributes                    | 0.982 | 0.945
4, 6        | Places, Position and temperature, Scene attributes            | 0.967 | 0.932
6, 12       | Datetime, Places, Position and temperature, Scene attributes  | 0.989 | 0.964
3, 4, 6     | Scene attributes                                              | 0.87  | 0.779
3, 4, 6     | Position and temperature, Scene attributes                    | 0.869 | 0.782
3, 4, 6     | Datetime, Places, Scene attributes                            | 0.866 | 0.775
3, 4, 6     | Datetime, Places, Position and temperature, Scene attributes  | 0.878 | 0.796
2, 3, 4, 6  | Scene attributes                                              | 0.696 | 0.552
3, 4, 6, 12 | Position and temperature, Scene attributes                    | 0.731 | 0.603
3, 4, 6, 12 | Datetime, Position and temperature, Scene attributes          | 0.729 | 0.614
3, 4, 6, 12 | Datetime, Places, Position and temperature, Scene attributes  | 0.746 | 0.63