Research Title: Data Collection Techniques with Big Data and AI for Animal Preservation
Trimester: 1 / 2019
DECLARATION OF ORIGINALITY

The thesis paper, titled “Data Collection Techniques with Big Data and AI for Animal Preservation”, is being submitted to the research supervisor, Dr. Mahira M. Mowjoon, at Victoria University, Sydney. This submission is made as a course requirement for the partial fulfillment of the degree of Master of Applied Information Technology.
This declaration of originality is hereby made to affirm that the content of this document has been prepared and compiled through my own evidence-based research on the subject, after going through the relevant sources and related scholarly articles. Any and all references to the documents consulted during the research have been cited where applicable, and all the original authors of the references have been given their due credit. To the best of my knowledge, the information presented in this thesis has never been submitted before with the same intent and purpose in fulfillment of any other degree requirements.
______________________ Date:
Signed,
Mridha Faysal Hossain
ACKNOWLEDGMENTS
I, Mridha Faysal Hossain, am dedicating this section of the paper to thank and
acknowledge all the people around me, who have helped me not only in my
research but also aided me in the compilation of this thesis and inspired me to
take on the challenges of life. I thank all the people who helped me come this
far, ranging from my parents who raised me, to the person who shared trivia
with me during an idle conversation. I might not be mentioning each and every
one of such persons by name, but I understand that I am a product of each of
their actions and influences, bringing me where I am right now.
The biggest thanks for this goes to my Thesis Supervisor, Dr. Mahira M.
Mowjoon, for his thorough guidance during my research, till the completion of
my thesis. His timely and earnest support and supervision inspired me to
continue this research with diligence and helped me walk an extra mile. I thank
him for the time he gave me and for his meticulous evaluation of my progress. I
thank him for giving me helpful advice and for providing me with leads and
sources wherever necessary. Without him checking my paper and the mentoring
he gave me, I believe that this thesis wouldn’t have been completed the way it
has been.
I acknowledge the influence that all the lecturers and professors of Victoria
University had in my advancement and extend my sincere thanks to all the
teachers who taught me anything, whether it was the alphabet or how to drive a car. Through all their teachings and mentorship, I have gotten to this point
where I can conduct my own research and write a complete thesis.
I thank you, mother, and I thank you, father, for raising me to be who I am today.
I thank all my family and relatives for being so supportive of me over all these
years and forgiving me for all the minor and major mistakes that I made and for
accepting me despite my countless flaws.
ABSTRACT
Large datasets that go beyond simple ASCII-encoded data fields, such as images, videos and surveillance footage, are commonly referred to as Big Data. Using such data, collected from CCTV cameras and camera traps, can help humans surpass their physical limitations and conduct surveys around the clock. With this, it is now possible to build a comprehensive database that was only a distant dream in the past.
However, with bigger data comes the need for a bigger workforce to process it. Given the same human limitations that these technologies allow us to surpass, it is nearly impossible for people to analyze all of the collected data manually. Enter Artificial Intelligence: it can remove the need to manually analyze terabytes of data, as the machines do the analyzing for us. The AI can go through the Big Data and create graphs and charts, as well as inform us when something is wrong and needs our attention.
This is exactly what the world needs right now in order to protect wildlife. Biologists and conservationists need to be aware of animal populations more than ever, and need to be ready to act whenever one of those populations reaches a dangerous low. Therefore, this research intends to collect information on the methodologies of Big Data collection and the Artificial Intelligence tools used to analyze the collected data. With proper knowledge of these technologies, conservationists can work on filling the gaps in their knowledge base and create a complete database, while future researchers can work on further improving these technologies.
TABLE OF CONTENTS
DECLARATION OF ORIGINALITY...............................................................i
ACKNOWLEDGMENTS...............................................................................ii
ABSTRACT................................................................................................. iii
TABLE OF CONTENTS..............................................................................iv
LIST OF FIGURES AND TABLES...............................................................iv
INTRODUCTION......................................................................................... 1
Collecting Wildlife Data...........................................................................1
Purpose of this Research.........................................................................1
Scope of the Research.............................................................................2
LITERATURE REVIEW................................................................................4
Techniques and Methodology of Big Data Collection and Analysis.........4
Available Tools for Experimentation, Analysis and Data Collection......14
Statistics and Correlation......................................................................18
RESEARCH METHODOLOGY...................................................................22
Research Methodology Process.............................................................22
Effectiveness of the Methodology..........................................................22
Hardware and Software Requirements for the Research......................24
CONCLUSION.......................................................................................... 40
REFERENCES........................................................................................... 41
LIST OF FIGURES AND TABLES
Figure 1: Data Flow Diagram of Big Data Mining from Public Bioassays for
Research............................................................................................................... 5
Figure 2: The Data Volume Challenge..................................................................8
Figure 3: (a) MLM Analysis (b) PCA...................................................................14
Figure 4: HADOOP Cluster – Data Load Performance........................................16
Figure 5: Number of studies and publications on Big Data................................19
Figure 6: Fine-grained big data could cause over-fitting....................................20
Figure 7: Protected Areas in Namibia.................................................................30
Figure 8: Biodiversity Index of The United States of America............................31
Figure 9: Protected Areas around the World......................................................31
Figure 10: Summary of the Red List Found on the IUCN Red List Website.......32
Figure 11: The IUCN Red List of Threatened Species........................................33
Figure 12: Data Analyzed and Collected by IUCN using SMART.............................34
Figure 13: Evaluation Framework Matrix for Animal Conservation.........................40
INTRODUCTION

Collecting Wildlife Data

After the recent disaster in the Amazon forest (1) (Howell, R. 2019, Forbes), there is no doubt that the world needs a wake-up call and must start focusing more on the preservation of wildlife. While different conservation movements have been taking place for a long time now, our dependency on technology has, debatably, been destroying the world that we live in (2) (Confino, J. 2013, The Guardian).
However, with its numerous uses, technology might just be the thing that can
be used to improve the conservation efforts in the end.
Some say that the most powerful tool in the world is knowledge (3) (Aksari, M. 2018, Medium). After all, one cannot hope to solve a problem one knows nothing about. Thus, the first step towards preserving wildlife is to collect information about its numbers. One cannot prioritize the preservation of a species without truly understanding its level of endangerment.
Needless to say, all conservation efforts need to start with collecting data on wildlife populations. In the modern world, the resources required to do just that are at our fingertips. Using CCTV surveillance cameras, satellite imagery, camera traps, motion sensors and a lot of other technology, it is possible to keep counting the animals and surveying their actions and movements with minimal effort and human involvement. Regardless of the ‘how’s, the bottom line is: the collection of wildlife data is of the utmost importance. However, without the know-how, it is impossible to reach that end.
Purpose of this Research
This research aims to take a look at the modern technologies available for the
collection of Big Data and the tools available for analyzing them using Artificial
Intelligence. The research will be collecting information from the research documents prepared by the foremost authorities on the subject, to find the know-how required not only for using these technologies in the field, but also to help future researchers work on them and improve their efficiency.
Furthermore, the research will attempt to take an objective look at the current methodologies of these systems and refer to statistical data to analyze their efficiency. The thesis will include a comparison of statistical data from different eras to pinpoint the improvements in the efficiency of these data collection systems and the accuracy of their collected data, respectively.
Scope of the Research

This research will therefore not revisit the foundational material already covered in the previous thesis, such as the definition of Big Data and the core principles behind these systems. Instead, it will focus on the practical application of the systems and compile all the information regarding Big Data collection and analysis methods and techniques, as well as the use of the software and hardware required for the said collection and analysis.
Big Data Collection and Analysis is something that is done even by large corporations like Facebook and Google, but researching their systems would be too big an undertaking and is unnecessary for two reasons:
1. They do not need any further research by third parties, since they have dedicated teams to improve their systems.
2. It would be impossible to collect information about their Big Data Analysis systems properly, due to business confidentiality.
As such, this research will not cover Big Data Collection and Analysis techniques in general, but will only focus on how they can be and are being used in the field of Wildlife Biology by individual groups of researchers and independent conservationists. Above all, this information is mostly open for public use and analysis and should be accessible for research, making this research scope viable with a proper range of information.
LITERATURE REVIEW
As the primary source of information, the research papers and documents from
some of the foremost authorities on the subjects are being reviewed in this
section. The research papers are being compiled under three specific topics, to
provide a more in-depth understanding of each topic.
Techniques and Methodology of Big Data Collection and Analysis

Data mining is one of the most effective techniques for collecting big data from a large data source in order to create a new dataset. As such, the researchers accessed the large amounts of bioassays available for public use, from sources such as the National Center for Biotechnology Information, ChEMBL and the Molecular Libraries Program of the National Institutes of Health. They used these sources of data to mine for the Big Data required for their research into animal toxicity. The data mining techniques used in the research are depicted in the diagram below:
Figure 1: Data Flow Diagram of Big Data Mining from Public Bioassays for Research
However, after the initial data collection, the researchers used automated filters and manual screening techniques to narrow down their results, instead of using Artificial Intelligence to do the screening. Using a tool called Entrez Utilities(6) (Entrez Programming Utilities, NCBI), they removed most of the data used for profiling (from the 739,000 assays; see the diagram above) and narrowed the results down to 4,841 unique compounds.
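To make the workflow above more concrete, the following is a minimal sketch of how a public bioassay source can be queried programmatically through the Entrez Programming Utilities mentioned above, here via the Biopython wrapper. The search term, the contact address and the number of records retrieved are illustrative assumptions, not the filters used by the reviewed study.

```python
# A minimal sketch (assumption): querying NCBI bioassay records through the
# Entrez Programming Utilities using the Biopython wrapper.
from Bio import Entrez

Entrez.email = "researcher@example.org"   # NCBI asks for a contact address

# Search the PubChem BioAssay database ("pcassay") for toxicity-related assays.
handle = Entrez.esearch(db="pcassay", term="toxicity", retmax=20)
record = Entrez.read(handle)
handle.close()

print("matching assays:", record["Count"])
for assay_id in record["IdList"]:
    print("assay id:", assay_id)
```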
After screening the compounds using different filters, the final data was presented in a graphical format using Artificial Intelligence. Notably, this was a tool developed solely for the purpose of that research, which also appears to be a viable alternative to off-the-shelf software. This automatic virtual profiling tool was created to analyze the data and produce the heat graphs and dot charts used in the research.
The paper titled “Design Principles for Effective Knowledge Discovery from Big
Data” has been prepared by Edmon Begoli and James Horey at the Oak Ridge
National Laboratory in Tennessee(7) (Begoli, E. 2012, Oak Ridge National
Laboratory). This research is a compilation of some of the Big Data analysis
principles that can aid in the processing of Big Data and collect useful
information from them.
According to the document, the principles are derived from different real-life projects that have been conducted at the Oak Ridge National Laboratory in collaboration with different state and federal agencies. After this brief introduction establishing their authority and citing references, the paper moves on to explaining the actual design principles that have proven to be effective in Big Data processing. These principles mainly deal with maximizing the controlling factors, thus allowing the researchers to work with the data with relative ease. The principles mentioned are as follows:
1. Different Analysis Techniques Need to Be Supported: Some of the data analysis methods are outlined next, namely Statistical Analysis (summarizing large datasets and defining prediction models), Data Mining (using AI to automatically mine for useful datasets from among Big Data), Machine Learning (combined with Data Mining and Statistical Analysis, the machine attempts to understand patterns in order to improve the mining and analysis techniques automatically) and Visualization (large datasets are presented in visual formats to help researchers discover interesting relationships).
2. One Size Fits All Solutions Do Not Exist: A single drive or a local file system might be sufficient for a small amount of data, but when dealing with Big Data, it is not a viable solution. While many of the large datasets in the past employed heavy relational databases, different types of intermediate database structures and analyses require special architectures tailored to their needs. Many of the other authorities agree that the time of a single database structure as a solution for all is past.
3. Make the Data Accessible: First of all, data accessibility could be improved by using open-source and popular frameworks. With most research requiring a complex set of software, using the popular options can minimize complexity and improve data accessibility. Next, the paper emphasizes the need for a lightweight architecture to provide rich, interactive experiences. While technologies like J2EE would be enough to deliver the results to the users, newer technologies like Node.js make it easier for users to access the data by catering to their demands. Finally, the results need to be exposed through an API, so that users can consume the data in any way they want, whether they want to download the results in different formats or just get a visual representation.
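As a small illustration of the last principle, exposing results through an API, the sketch below serves a pre-computed summary as JSON from a lightweight Python endpoint. The endpoint path, the results file and its fields are invented for illustration; the reviewed paper itself mentions Node.js, and any comparable lightweight framework would do.

```python
# A minimal sketch (assumption) of the "expose the results using an API" principle:
# a lightweight endpoint that serves pre-computed analysis results as JSON.
import json
from flask import Flask, jsonify

app = Flask(__name__)

# Pretend this summary file was produced by an earlier analysis step (hypothetical).
with open("population_summary.json") as fh:
    RESULTS = json.load(fh)

@app.route("/api/v1/population-summary")
def population_summary():
    """Return the latest population summary so clients can render or download it."""
    return jsonify(RESULTS)

if __name__ == "__main__":
    app.run(port=8080)
```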
Andrej Kalicky, under the supervision of Eng. Vladimir Kyjonka, prepared the thesis titled “High Performance Analytics” at the Charles University in Prague(8) (Kalický, A. 2013, Charles University). The thesis deals with the challenges and problems of pioneering advanced analytics systems and summarizes the information pertaining to High Performance Analytics on Big Data. As such, this paper has been a sufficient resource to compile the information necessary for the current thesis, eliminating the need to review multiple sources of redundant information.
According to the research paper, one problem that exists in Big Data analysis is the disparity between the knowledge gap and the execution gap. The knowledge gap is defined by the limitations of the analysis techniques, such as poor algorithms and data mining methods, while the execution gap arises from the limitations of hardware and other resources. The major problem in this disparity is that hardware is improving at a stable rate while the data is increasing exponentially with time. This, in turn, is widening the knowledge gap and slowing down Big Data analysis capabilities. The figure below shows a graph of this phenomenon.
Figure 2: The Data Volume Challenge
In order to overcome the knowledge gap, the analysis algorithms need to be improved. That is where high performance analytics comes into play. The main aim of high performance analytics is to optimize the available hardware to maximize the analysis potential, in order to facilitate an effective allocation of the available computing resources. There are four types of data analysis:
Descriptive Analytics: These analytics summarize the collected data to describe what has happened.
Diagnostic Analytics: These analytics try to find the root causes behind the information gathered from the descriptive analysis and explain why it is happening.
Predictive Analytics: This analysis tries to figure out, from the trends and the causes, how things might turn out in the future.
Prescriptive Analytics: This analysis suggests which actions should be taken based on the predicted outcomes.
The concept of HPA, then, comes down to using these high-performance computing techniques with analytics as their goal. While this may seem to be only a concept, it is based on using Advanced Analytics as the core infrastructure of a Big Data analysis system.
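The following toy sketch illustrates the basic idea behind high performance analytics, spreading a simple summary computation across the available CPU cores rather than scanning the data serially. The synthetic observations and the chunking scheme are assumptions made purely for illustration.

```python
# A toy sketch (assumption) of the high-performance-analytics idea: spread a simple
# summary computation over the available CPU cores instead of scanning the data on
# a single core. The "observations" are synthetic stand-ins for sensor readings.
from multiprocessing import Pool, cpu_count
import random

def chunk_summary(chunk):
    """Return (count, total) for one chunk so partial results can be merged."""
    return len(chunk), sum(chunk)

if __name__ == "__main__":
    observations = [random.random() for _ in range(100_000)]
    n = cpu_count()
    chunks = [observations[i::n] for i in range(n)]

    with Pool(processes=n) as pool:
        partials = pool.map(chunk_summary, chunks)

    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    print("mean of", count, "observations:", total / count)
```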
The next research paper, titled “Geospatial Big Data Handling Theory and Methods: A Review and Research Challenges”, has been prepared by Songnian Li, Suzana Dragicevic, Francesc Antón Castro, Monika Sester, Stephan Winter, Arzu Coltekin, Chris Pettit, Bin Jiang, James Haworth, Alfred Stein and Tao Cheng(9) (Li, S. 2015, ISPRS Journal of Photogrammetry and Remote Sensing). This paper was published in the ISPRS Journal of Photogrammetry and Remote Sensing in October 2015.
This research paper is particularly important for the current topic because geospatial data plays an important role in determining wildlife habitats and is, as such, just as important in monitoring animal populations as the data on their numbers itself. Thus, the review of this paper can and does provide some valuable insight into handling Geospatial Big Data.
According to the thesis, Geospatial Big Data can be characterized on the basis
of Volume, Variety, Velocity, Veracity, Visualization and Visibility. The paper
then proceeds to tackle each one of them, pointing out the problems and suggesting Big Data handling methods as solutions to the said problems. The findings of the paper are summarized in the table below:
Characteristic: Variety
Definition: Different types of data, such as map data, geotagged text data, imagery data, raster data, vector data, etc.
Problems: The larger the variety of data, the more challenging it is to combine them in Big Data analytics, challenging the error propagation models based on functional relationships.
Solutions: Open-source solutions such as OpenStreetMap have proven to be effective at managing crowd-sourced data, but variety still remains a concern at the moment.

Characteristic: Velocity
Definition: The speed at which the data can be accessed, such as continuous streaming of sensor observations, frequently revisited imagery data at high resolutions, real-time GNSS trajectories, etc.
Problems: Real-time analysis for future predictions does not allow for setting up proper error propagation, since disturbances in the sensors/channels can only be detected through prediction models, throwing it into a catch-22 situation.
Solutions: One proposed solution is to produce a locally distributed stream sensing, processing and telecommunication paradigm. The other solution is to focus on new processing algorithms that handle large volumes of data, for example by using functional programming languages to design new streaming algorithms.

Characteristic: Veracity
Definition: The accuracy of the data: much of the geospatial data comes from unverified sources, and the accuracy of the collected data is a major player in characterizing Geospatial Big Data.
Problems: The unreliability of the data is often unpredictable. For example, the accuracy of satellite imagery is dependent on location (canyons can prove to be a challenge), making its accuracy unpredictable.
Solutions: Such unreliable data is used to predict and estimate only the specific sets of data that allow for some “messiness” and inaccuracy. Example: while a satellite image might not be able to show all the narrow lanes and streets, it is still enough to provide an estimate of the ratio of big city blocks to small blocks.

Characteristic: Visualization
Definition: The ability to visualize Big Data in a human-readable format, helping researchers identify patterns, draw statistics and understand the information within the data.
Problems: Analytics may result in trying to display “too much data” in one small window, making the results seem fuzzy and difficult to read. This may lead to information overload, making it difficult for the human mind to process the information.
Solutions: Many possible methods are suggested but have yet to be empirically tested for efficiency. Suggested solutions are multiple-linked views, focus-context visualization and foveation.

Characteristic: Visibility
Definition: The accessibility of data through different media and how easily the data can be accessed by the people who need it.
Problems: Large amounts of data needing to be accessed by different researchers from different places and media is a challenge.
Solutions: Cloud computing has essentially solved the problem of visibility and, as of now, may not prove to be a challenge for the scientific community.
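As a toy illustration of the streaming-algorithm idea suggested for the Velocity characteristic above, the sketch below keeps a running summary of a simulated sensor feed instead of storing the whole stream before analysing it. The simulated readings and their interpretation are assumptions for illustration only.

```python
# A toy sketch (assumption) of the "new streaming algorithms" idea from the Velocity
# row: process sensor observations one at a time and keep a running summary, instead
# of storing the full stream before analysing it. The sensor feed is simulated.
import random

def running_mean(stream):
    """Yield the mean of all observations seen so far, one value per observation."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

def simulated_sensor(n):
    for _ in range(n):
        yield random.gauss(20.0, 2.0)   # e.g. temperature readings from a camera site

for i, mean in enumerate(running_mean(simulated_sensor(5)), start=1):
    print(f"after {i} observations, running mean = {mean:.2f}")
```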
Available Tools for Experimentation, Analysis and Data
Collection
After taking a look at the most important design principles and high
performance analytics concepts, it was now time to take a better look into the
available tools for utilizing these methods. To do that, the following research
documents are being reviewed:
A New Tool Called DISSECT For Analyzing Large Genomic Data Sets
The first document to be reviewed in this section would be the article titled, “A
New Tool Called DISSECT For Analyzing Large Genomic Data Sets Using a Big
Data Approach”, by Oriol Canela-Xandri, Andy Law, Alan Gray, John A.
Woolliams and Albert Tenesa(10) (Canela-Xandri, O. 2015, Nature
Communications Journal). Published at Nature Communications journal in
November 2015, this paper discusses one of the AI tools used in modern
computing to analyze Big Data.
This article mainly presents the researchers' findings after using this tool, including its features and shortcomings, as well as its technical specifications and requirements. After taking a look at its technical specifications and computational capabilities, the research moves on to testing its performance with MLM (Mixed Linear Model) and PCA (Principal Component Analysis) analyses. These two analyses were selected because they are very computationally demanding. The results are shown below:
Figure 3: (a) MLM Analysis (b) PCA. The blue lines and the left axis represent the computational time, and the red lines and the right axis represent the number of processor cores used.
Next, DISSECT was tested for its prediction results with large data samples. While the predictions were not accurate on their own, after using a large data sample to train its prediction algorithm, DISSECT showed high accuracy (up to 86%). This shows that DISSECT's machine learning capabilities only shine when used in conjunction with Big Data.
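The general pattern reported here, that prediction accuracy improves as the amount of training data grows, can be illustrated with the following sketch on synthetic data using scikit-learn. This is an assumption-laden illustration of the effect, not a reproduction of DISSECT's own algorithms or results.

```python
# An illustrative sketch (assumption): prediction accuracy generally improves as more
# training examples become available. Synthetic data and scikit-learn, not DISSECT.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=20000, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

for n in (100, 1000, 10000):
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"trained on {n:>6} samples -> test accuracy {acc:.2f}")
```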
The thesis prepared by Ketaki Subhash Raste at San Diego State University, titled “Big Data Analytics – HADOOP Performance Analysis”, gives us a great in-depth look at HADOOP and how it is used with Big Data(11) (Raste, K. 2014, San Diego State University). Approved by the Faculty Committee in April 2014, this document is a credible source of information and worthy of a review in this thesis.
The thesis starts off with the aim and objectives until it reaches Chapter 2, where it discusses and defines Big Data. Since that part has already been covered in the previous research into wildlife preservation with Big Data, reviewing that section would be redundant. So, moving on to Chapter 3, the thesis explains the architecture of HADOOP. Quoting the Apache HADOOP documentation(12), the framework is designed to scale up from single servers to thousands of machines, each offering local computation and storage, and “to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.”
Chapter 4 of the thesis introduces some of the technologies that can be used seamlessly with Big Data on the HADOOP framework. The first is a NoSQL database, such as MongoDB and CouchDB as document databases, Neo4J as a graph database, Redis and MemcacheDB as key-value stores, and HBase/Cassandra as column databases. The second option is the use of cloud computing for remote processing and storage, saving on cost.
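As a brief illustration of the document-database option mentioned above, the sketch below stores and aggregates hypothetical camera-trap records in MongoDB. The database name, collection name and record fields are invented; it assumes a MongoDB instance running locally.

```python
# A minimal sketch (assumption): storing processed camera-trap records in MongoDB,
# one of the document databases mentioned above. Names and fields are illustrative.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB instance
db = client["wildlife_monitoring"]                   # hypothetical database name
detections = db["camera_trap_detections"]            # hypothetical collection name

# Insert one processed detection record (schema-less document).
detections.insert_one({
    "camera_id": "CAM-07",
    "species": "panthera_tigris",
    "timestamp": datetime(2019, 8, 1, 5, 42),
    "location": {"lat": -19.23, "lon": 16.91},
})

# Count detections per species, analogous to the aggregate queries run on HADOOP.
for row in detections.aggregate([
    {"$group": {"_id": "$species", "count": {"$sum": 1}}}
]):
    print(row["_id"], row["count"])
```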
Chapter 5 then moves on to describe Amazon Web Services(14) and the practical methods of setting up HADOOP on AWS. Chapter 6 explains in detail the hardware and system the researcher used to test the performance of HADOOP, with a bit of detail on how the whole system was set up. Finally, the results of the performance analysis are shown below:
Data size | Number of nodes | Query 1 time: select count(movieid) from ratings where rating like '5' | Query 2 time: select count(movieid) from ratings group by userid
4GB | 2 | 5.94 | 4.56
4GB | 4 | 4.43 | 3.29
4GB | 6 | 3.64 | 2.95
4GB | 8 | 3.03 | 2.58
4GB | 10 | 3.11 | 2.36
6GB | 2 | 7.60 | 6.17
6GB | 4 | 5.26 | 4.37
6GB | 6 | 4.31 | 3.55
6GB | 8 | 4.25 | 3.13
6GB | 10 | 3.46 | 2.83
Table 2: Average Time in Minutes to Analyse Data on a HADOOP Cluster with an Increasing Number of Nodes
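To make the benchmark queries above more tangible, the following is a hedged sketch of how a count like the first query (ratings equal to 5) could be expressed as a Hadoop Streaming job written in Python. The tab-separated layout of the ratings file and the job invocation shown in the comment are assumptions for illustration.

```python
# A hedged sketch of the first benchmark query ("how many ratings equal 5") as a
# Hadoop Streaming job. The input layout (userid, movieid, rating, tab-separated)
# is an assumption about the ratings file, not taken from the reviewed thesis.
import sys

def mapper():
    """Emit one ("rating_5", 1) pair per record whose rating field equals 5."""
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "5":
            print("rating_5\t1")

def reducer():
    """Sum the counts emitted by the mappers for the single key."""
    total = 0
    for line in sys.stdin:
        _, value = line.rstrip("\n").split("\t")
        total += int(value)
    print(f"rating_5\t{total}")

if __name__ == "__main__":
    # In practice mapper and reducer would be two separate scripts passed to the
    # Hadoop Streaming jar, e.g.:
    #   hadoop jar hadoop-streaming.jar -input ratings -output out \
    #       -mapper mapper.py -reducer reducer.py
    mapper() if sys.argv[1:] == ["map"] else reducer()
```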
The most relevant part of the next reviewed paper, “Software-Defined Networking for Big-Data Science”(15) (Monga, I. 2012), discusses different proposed architectural models to be used with Big Data that can be implemented anywhere, whether on a college campus, in a laboratory or in a data center. The two presented models are:
ECSEL: Providing a recap of previous research on this topic, the paper details the End-to-End Service at Layer 2, or ECSEL architecture for short. It is a proposed solution for a circuit with no loss, stable latency and guaranteed bandwidth, which has always been a challenge for moving data over a WAN. ECSEL is an implementation of an IDC (Inter-Domain Controller), which negotiates with the remote and the local resources while keeping the administrative boundaries intact.
OpenFlow/SDN ScienceDMZ Architectural Model: This architecture is
suitable for multiple science centers, which employ the use of different
supercomputers for data hosting and analysis. This architecture proposes that a
DTN redirector is placed at the Science DMZ so that whenever the DTN
redirector receives a data transfer request, it is redirected to the right DTN
following the correct security protocols using the data flow rules.
Estimating Animal Density Using Camera Traps without the Need for
Individual Recognition
While the other papers in this section tackled the software, web server frameworks and networking models suitable for processing Big Data in animal conservation, this paper, titled “Estimating Animal Density Using Camera Traps without the Need for Individual Recognition”, deals with the hardware required to collect the said Big Data; and the key pieces of hardware for automatically monitoring and collecting Big Data on animals are camera traps(16) (Rowcliffe, J. 2008, Journal of Applied Ecology).
According to the paper, estimating the animal density in an area is the most important part of wildlife management. Until now, using camera traps to estimate density has been restricted by capture-recapture analysis to species with individually identifiable markings. To eliminate that restriction, this paper suggested a solution that removes the need for individual recognition by modelling the underlying process of contact between the cameras and the animals.
The paper describes the practical field tests conducted with the proposed solution to identify animals, including the hardware and system used for the experiments. The experiment was conducted using six DeerCam DC300 camera traps(17), which were used to capture images of animals on the move. By calculating the distance and the time taken for the animals to move past a specific point, the system could get an idea of the speed of an animal. The speed was then used to estimate density, with the results checked against a census carried out through manual counting of the animal population by 12 scientists. While some manual involvement was still required in this experiment, the proposed solution still seems viable in contrast to the limited recognition capabilities of the current systems.
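As a toy calculation of the idea behind this approach, the sketch below estimates density from camera-trap rates, assuming the random encounter model in its commonly cited form D = (y/t) · π / (v · r · (2 + θ)). The parameter values are invented and are not taken from the reviewed experiment.

```python
# A toy calculation (assumption): the random encounter model is commonly stated as
# D = (y / t) * pi / (v * r * (2 + theta)), where y/t is the trapping rate, v the
# animal speed, r the camera detection distance and theta the detection angle.
import math

def rem_density(photos, camera_days, speed_km_per_day, radius_km, angle_rad):
    """Estimate animal density (individuals per km^2) from camera-trap rates."""
    trap_rate = photos / camera_days
    return trap_rate * math.pi / (speed_km_per_day * radius_km * (2 + angle_rad))

# Example: 120 photographs over 600 camera-days, animals moving 4 km/day,
# detection distance of 0.01 km (10 m) and detection angle of about 0.175 rad.
print(round(rem_density(120, 600, 4.0, 0.01, 0.175), 1), "individuals per km^2")
```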
Statistics and Correlation

After taking a look at the design principles of a Big Data system, its collection and handling methods and a concept of high performance analytics on the said data, the review looked into the software and system architectures available for the relevant field. It was now time to take a look at the statistics and at how all of this relates to the current research, through two reviews:
Rethinking Big Data: A Review on the Data Quality and Usage Issues
“Rethinking Big Data: A Review on the Data Quality and Usage Issues” is an article from the ISPRS Journal of Photogrammetry and Remote Sensing(18) (Liu, J. 2015, ISPRS Journal of Photogrammetry and Remote Sensing). Prepared by Jianzheng Liu, Jie Li, Weifeng Li and Jiansheng Wu, and published in December 2015, this document takes a good look at the issues that have already been discussed in the previous reviews and gives an overall summary of the challenges. Although that makes it the perfect document with which to conclude the literature review section, to avoid redundancy, only the statistical data from this document will be reviewed and correlated with the existing findings from the previous reviews.
The article begins with a figure (Figure 5: Number of studies and publications on Big Data) showing the number of studies and publications made on Big Data since the year 2004. While this figure lacks more recent data, namely from the past four years, it still serves to show the growing interest and research into Big Data and Artificial Intelligence. As such, it reinforces the idea that not only is Big Data gaining momentum, but also that research is ongoing into the many different challenges that come with handling and analyzing Big Data.
Table 3: Summary of some of the Big Data Research on Spatial Information Sciences
The table above shows a summary of a few of the countless research efforts aimed at improving Big Data. Even with all of that, the modern world is still struggling with the challenges of Big Data and Artificial Intelligence. One such problem is mentioned further into the article, when it moves to discussing the potential errors brought about by Big Data. These errors, as per the article, are caused by inauthentic data collection, information incorrectness, and the “noise” in Big Data. An example of the “noise” and the incorrect correlations it produces during analysis is depicted in the figure below:
Figure 6: Fine-grained big data could cause over-fitting when wrong models or analysis methods
are applied
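The over-fitting effect illustrated in Figure 6 can be reproduced in miniature with the following sketch: an overly flexible model fitted to noisy, fine-grained data matches the training sample closely but predicts held-out data poorly. The synthetic data and model choices are assumptions for illustration.

```python
# A small sketch (assumption) of the over-fitting effect: a model that is too
# flexible for noisy data fits the training sample well but generalizes poorly.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 60)   # noisy observations
x_train, y_train, x_test, y_test = x[:40], y[:40], x[40:], y[40:]

for degree in (3, 25):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(f"degree {degree:>2}: "
          f"train error {mean_squared_error(y_train, model.predict(x_train)):.3f}, "
          f"test error {mean_squared_error(y_test, model.predict(x_test)):.3f}")
```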
The article also covers further data sources, such as social media, which are currently irrelevant to the subject topic and are therefore excluded from this review.
So, how do all these reviews relate to the current topic? To examine that, the chapter “How Technology Can Transform Wildlife Conservation” from IntechOpen, prepared by Xareni P. Pacheco, is being reviewed as a conclusion to this literature review section(19) (Pacheco, X. P. 2018, IntechOpen).
Some of the technologies mentioned in the chapter belong to a different field of research and are not relevant to the current research, but were worth a mention due to their potential for future research.
RESEARCH METHODOLOGY
This research has been conducted with a proper procedure and strict guidelines, ensuring that all of the information presented is evidence-based and properly referenced. In order to do that, the research had to follow a specific, effective methodology, which is detailed below.
Research Methodology Process

The initial inspiration for the research came from the previous research on the subject, titled “Wildlife Preservation with Big Data and AI”. After conducting that research and preparing its thesis, it became apparent that it only focused on the theoretical aspects and that a more in-depth look at the practical side was needed.
Thus, the preliminary research for this project was conducted. It started by collecting all the relevant research documents related to this field of study. With a collection of the available resource materials in hand, it became apparent that this was a viable research project that could be conducted within a limited scope. With a brief understanding of the subject matter, a research proposal was created and submitted. Once the proposal was approved, the actual research process began.
Next, all the collected documents were skimmed through in order to find the most relevant documents, those that could provide a complete, one-stop source of information for future researchers and users of Big Data and AI. An in-depth literature review was then made of the final short-listed documents. This literature review served as the primary research data for the research.
For the secondary data, real-life statistics from the foremost conservationists and organizations were collected for analysis. Furthermore, they were compared to data collected before the advent of the researched technology, which allowed the research to come to a conclusive understanding of the subject matter.
Effectiveness of the Methodology
After the previous research project on Big Data and AI had been conducted and its thesis prepared, the foundation for this research was already in place. This new research methodology, following a similar pattern to the already successfully completed project, has therefore been deemed efficient. Not only has a similar method demonstrated a history of success, it is also effective for the following reasons:
The primary data has been collected from published research papers, books and other documents. This means that all the materials referred to during this evidence-based research are themselves evidence-based and can thus be considered accurate.
After going through the documents and collecting all the necessary information, the most relevant sources were selected for the literature review. This ensured that unnecessary information was weeded out and that documents which would have provided overlapping information were not reviewed. Thus, this thesis helps future researchers save the time they would otherwise have spent reading duplicate information, had they gone through all of the documents manually.
The statistical data collected during the research helped the thesis by backing up the facts found from the evidence-based research done during the literature review. Furthermore, because these statistics are used for practical purposes, they can help in determining the efficiency of the systems accurately.
Hardware and Software Requirements for the Research

Since the research was conducted with published research papers as the primary data, it was deemed unnecessary to purchase any additional hardware or software for this research. However, if this research were to include any first-hand research data, it would definitely need some specific hardware. Since this research is being conducted to help future researchers conduct their own experiments, it is important to list the hardware they can be expected to buy for their first-hand research.
Camera Traps
A set of camera traps should definitely be the first thing a researcher needs in
order to collect data from the fields. As such, a researcher is suggested to buy a
Spypoint Force-20 camera trap as a budget option or a collection of Browning
Strike Force Pro XD if budget is not a problem. The latter would help the
researcher to set up camera traps all across a designated region, increasing the
number of snapshots of wildlife out in the open.
Surveillance Cameras
This hardware may or may not be necessary, depending on the type of data the researcher wants to collect on the animals. If the researcher is a biologist aiming to study the behavioural patterns of an animal, this hardware is a must. However, a computer engineer looking to engineer newer Big Data capture solutions might not need this hardware.
Workstation
Software
If a researcher needs to survey the available software on the market in order to reverse engineer and improve upon it, they really should invest in a service like SMART. Other open-source solutions are available on a HADOOP base, with DISSECT as the analytical tool. Regardless, if the researcher is looking to create an improved system from scratch and has adequate resources, the software requirements may change depending on the language the researcher prefers to work in.
ANALYSIS AND RESULTS
After the primary information was compiled in the literature review section, it was time to analyze the information and the results to find out what the research has to offer and teach its readers. To do so, the analyzed problems are first summarized with their underlying causes. Then the secondary data is used to correlate them with the primary information, and finally the possible solutions to these problems are presented.
From the literature review, it is clear that some problems exist in the modern world in terms of the basic design concepts and the data collection and analysis methods and principles. These problems and their underlying causes are shown in the table below:
Problem: Different analysis techniques need to be supported.
Cause: While there are still ongoing attempts at creating a single solution that incorporates Big Data and AI across all analysis techniques, the modern world still employs different software and hardware for different analyses.

Problem: Different types of data require different storage and processing architectures.
Cause: As noted in the design principles reviewed earlier, no one-size-fits-all solution exists at the moment.

Table 4: Problems and Causes in Big Data Collection and Analysis Methods
The second part of the literature review examined the technologies currently available for performing wildlife conservation work. The technologies reviewed included a tool called DISSECT, the distributed data processing framework HADOOP, some Software-Defined Networking solutions and camera trap hardware.
The primary problem with DISSECT is that, despite its performance in analyzing genomics and genetics, it still lacks sufficient support for camera traps and AVEDS (Animal-Borne Video and Imaging Devices), making it unsuitable for wildlife monitoring purposes on its own. Looking at HADOOP, it turns out to be the only viable open-source solution to the problems, but it still leaves a lot of room for improvement in terms of performance. When implementing this system in different science and data centers, a few inefficiencies in the system were revealed in the Software-Defined Networking research paper. Two of the most prominent problems with these systems were the use of multiple point-to-point circuits for rerouting on each site, increasing the manual involvement of management, and the use of different types of supercomputers in different science centers when they could have benefited from sharing resources.
Finally, the problems with capture-recapture using camera traps were mentioned: their function is restricted to identifying animals with distinctive markings, meaning they cannot be used to monitor all species of animals without manual involvement. When dealing with Big Data, manual involvement of that kind is impractical. Above all, Big Data introduces errors in terms of the noise generated by the variety of data, leading to some impractical predictions.
While there are many different techniques and methods of Big Data analysis, with different major corporations having their own confidential models, following the solutions presented in the reviewed documents can potentially fill the gaps in the analysis techniques. Addressing the problems mentioned in the previous section in order, the respective solutions are offered below:
Problem: Different analysis techniques need to be supported.
Solution: Frameworks such as HADOOP can accommodate these different solutions in the form of technologies such as NoSQL databases.

Problem: The knowledge gap and the execution gap keep increasing.
Solution: Using a High Performance Analytics model can help solve this problem and utilize the available hardware in the most optimized way.

Problem: The multiple point-to-point circuits on any specific site have to be managed manually.
Solution: Software-Defined Networking architectures such as ECSEL, built around an Inter-Domain Controller, allow these circuits to be negotiated and managed easily with minimal effort.
With animal populations declining significantly, along with the recent catastrophe in the Amazon jungle, more and more species of animals are becoming endangered. As such, many natural habitats are being turned into conservation areas to protect the wildlife. The diagrams below show some of the protected regions around the world:
Figure 7: Protected Areas in Namibia
Figure 8: Biodiversity Index of The United States of America
Figure 9: Protected Areas around the World
Figures 7 and 8 show some promise with their initiatives to preserve wildlife; however, Figure 9 reveals a more troubling truth. None of the countries in the world are in the green, meaning none of them are nearly as protected as they are supposed to be. This makes it far more important to implement digital solutions to keep a close eye on animal and plant populations and to use the analysis methods to watch for any danger signals.
As such, different organizations such as the World Wildlife Fund (WWF) and the International Union for Conservation of Nature (IUCN) have already been engaged in doing just that. Using modern technologies, such as the Spatial Monitoring and Reporting Tool (SMART)(23), they are focused on collecting more data. As part of this, the IUCN maintains a “Red List” of threatened animals.
Figure 10: Summary of the Red List Found on the IUCN Red List Website
Taking a look at the statistics of the data collected by the IUCN since 2000 reveals an exponential growth in its collection efficiency over the last ten years. The figure below demonstrates this:
Figure 11: Increase in the number of species assessed for The IUCN Red List of Threatened
Species
From the diagram above, it is apparent that the number of species assessed since 2009 has increased significantly, marking a stark improvement in the efficiency of the data collection methods. Incidentally, 2009 was the year when the use of Big Data began to flourish, according to Eric Schmidt(21). While we also see a great increase in efficiency from 2002 to 2004, growth more or less stalled after that. That is because the data collected back in 2000 was more or less an estimation of the animal population using the mark-recapture method, and conservation purposes required a more detailed record(20). Since 2004, complete counts have been conducted by biologists in the field and collected by the IUCN from different sources, as seen in Table 1a published by the IUCN(24).
Figure 12 (Data Analyzed and Collected by IUCN using SMART) shows the real-life data collected and analyzed using the automated tool SMART, which automatically counts the extant species and their proportion of endangerment. Here, EW refers to “Extinct in the Wild”, denoted in purple, CR is Critically Endangered, EN is Endangered and VU is Vulnerable. What this research is more concerned about is the gray area, denoted as DD or “Data Deficient”. Most of these data are still not available, mostly due to a lack of field work. These deficiencies can be reduced with further improvements to the modern technologies, a bigger budget for field work, and employing more of the modern solutions until the improved solutions arrive(22).
The research was conducted solely through the review of different approved sources of information and some of the available statistics, as mentioned in the research methodology. However, a more thorough research effort would require first-hand experimentation using the said technologies, in order to provide a better perspective on the subject.
However, due to constraints on budget and time and a lack of access to the proper geographic locations, the research has been limited to literature reviews and statistics only. Even with such limitations, thanks to the evidence-based approach, the information in the thesis can be deemed accurate due to the validity of the source materials. Nevertheless, the constraints of the research and the limitations of its scope needed to be mentioned in the thesis for the purpose of clarity. Should a researcher be willing to refer to this guide in order to conduct his/her own research, he/she would need to conduct first-hand experimentation in order to obtain first-hand data.
Below is a framework matrix showing the steps and techniques that individuals or groups of specialists in a field can follow in order to operate a conservation project using Big Data and AI:
Specialist: Software Engineers
Tools Used: Camera traps and AVEDS, networking tools, analysis software and hardware.
Activities: Set up the whole system for Big Data mining from the field, including setting up the cloud networking services if any other project sites are to be collaborated with. Provide the necessary training to the Forest Rangers, Biologists and Zoologists in using the software and setting up the devices.
Results: A software engineer would hold the key responsibilities before starting a project. He/she will be responsible for training the Forest Rangers and Zoologists on how to set up the cameras and AVEDS, and will then connect all the devices in a network to set up a data mining rig from these devices. Once the whole project has been set up, he/she will also be in charge of maintaining and monitoring the system and troubleshooting any problems that might occur. In the end, the Software Engineer will aid the conservationists and Biologists by training them in the use of the analysis software, should they require it.

Specialist: Zoologists
Tools Used: Camera traps and AVEDS, monitoring hardware, communication hardware.
Activities: Set up camera traps in the required regions. Capture animals and attach AVEDS to them before setting them back into the wild. Monitor the collected data from the field. Perform the manual tasks on the data, including but not limited to counting animal populations, screening repetitive data and communicating with different specialists. Contact and communicate with the respective authorities if any anomalies are found among the data, including animal casualties, abnormal behaviour or abnormal travelling patterns.
Results: Zoologists would take the necessary training from the Software Engineers if needed and then, with the help of Forest Rangers, choose suitable candidates for the AVEDS placements. They would capture the animals and attach AVEDS to them before returning them to the wild. They would also set up camera traps, with information on the terrain and region from the Forest Rangers as well as with their protection. Once everything has been set up and the data starts flowing into the servers, Zoologists would monitor the data and count animal populations, making entries in the database for the Wildlife Biologists' analysis. They would observe animal behavioural patterns and keep an eye out for any abnormalities. Any animal casualties due to natural causes or because of predators would be noted and the database updated by the Zoologists.

Table 6: Framework Matrix for conducting successful Animal Conservation Projects using Big Data and AI
Figure 13: Evaluation Framework Matrix for Animal Conservation Efforts Using Big Data and AI
Below is a table showing different processes and tools and how their efficiency can be measured and evaluated by the users of a system, given that the setup is complete and the system is up and running:
Work Process/Tools: Data Collection
Methods/Techniques: Raw Camera Trap Images
Evaluation: The images captured by the camera traps are automatically stored in the databases, in case any manual supervision needs to be conducted.

Work Process/Tools: Data Collection
Methods/Techniques: Processed Camera Trap Images
Evaluation: The captured images are processed using AI, and the collected data is stored in the database in a text format that should be easily readable and understandable by the users.

Work Process/Tools: Data Collection
Methods/Techniques: Geospatial Data from GPS
Evaluation: The GPS systems should collect and keep track of the geospatial data: for example, a specific camera at a specific Area of Operations should record its geographic location in addition to the time when a certain picture was taken.

Work Process/Tools: Automatic Processing
Methods/Techniques: Individual Identification
Evaluation: Artificial Intelligence should be able to recognize individual animals and thus avoid taking redundant photographs of the same individuals. However, since these technologies are at the prototype stage, a margin of error should be acceptable.

Work Process/Tools: Automatic Processing
Methods/Techniques: Grouping by Geolocation
Evaluation: Animal groups could be identified with respect to their herds, individual families and/or even their individual habitats. For example, one Area of Operations can have 3 different families of tigers with a population of 12; however, each individual can be identified as part of its specific family and not be associated with other families, using the geospatial data of their habitats. This way, the biologists can determine whether a casualty was caused by rival tigers or by poachers.

Work Process/Tools: Automatic Processing
Methods/Techniques: DISSECT
Evaluation: With DISSECT's core algorithms modified to monitor animal densities, it can be used to effectively collect data from the different devices.
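As a minimal sketch of the "Grouping by Geolocation" row above, the snippet below assigns each detection to a coarse geographic cell so that individuals seen in the same habitat can be grouped together. The detection records and the cell size are invented for illustration.

```python
# A minimal sketch (assumption) of grouping detections by geolocation: assign each
# detection to a coarse grid cell so individuals in the same habitat are grouped.
from collections import defaultdict

detections = [
    {"individual": "tiger_A", "lat": -19.231, "lon": 16.912},
    {"individual": "tiger_B", "lat": -19.233, "lon": 16.910},
    {"individual": "tiger_C", "lat": -19.480, "lon": 17.205},
]

def grid_cell(lat, lon, cell_deg=0.05):
    """Round coordinates down to a grid cell roughly a few kilometres across."""
    return (round(lat // cell_deg * cell_deg, 3), round(lon // cell_deg * cell_deg, 3))

groups = defaultdict(set)
for det in detections:
    groups[grid_cell(det["lat"], det["lon"])].add(det["individual"])

for cell, members in groups.items():
    print("habitat cell", cell, "->", sorted(members))
```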
CONCLUSION
The world now only needs to address the few remaining problems with errors in the collected Big Data, and to come up with viable solutions to the hardware performance issues caused by such large amounts of data, in order to reduce the cost of these systems. With these challenges out of the way, Big Data and AI could truly become the strongest of weapons against the extinction of species and help in the preservation efforts.
This thesis, in conjunction with the previous thesis titled “Aiding Wildlife with Big Data and AI”, is here to provide just the motivation and information that future researchers need to get started on addressing these issues, coming up with viable solutions to these challenges, and helping mankind take a leap towards a better environment.
REFERENCES
10. Canela-Xandri, O., Law, A., Gray, A., Woolliams, J. A. and Tenesa, A. 2015. “A new tool called DISSECT for analysing large genomic data sets using a Big Data approach”. Nature Communications, 6:10162. https://doi.org/10.1038/ncomms10162
11. Raste, K. 2014. “Big Data Analytics – HADOOP Performance Analysis” [D]. San Diego: San Diego State University.
12.http://hadoop.apache.org/
13.Ghemawat, S., Gobioff, H. and Leung, S. 2003. “The Google File System”.
Google AI, Google Corporation. https://research.google.com/archive/gfs-
sosp2003.pdf
14.http://aws.amazon.com/
15.Monga, Inder & Pouyoul, Eric & Guok, Chin. (2012). “Software-Defined
Networking for Big-Data Science - Architectural Models from Campus to the
WAN”. 1629-1635. 10.1109/SC.Companion.2012.341.
16.Rowcliffe, J. M., Field, J., Turvey, S. T. and Carbone, C. (2008), “Estimating
animal density using camera traps without the need for individual
recognition”. Journal of Applied Ecology, 45: 1228-1236. doi:10.1111/j.1365-
2664.2008.01473.x
17.https://www.researchgate.net/figure/4-A-picture-of-DeerCam-DC300-camera-
trap-The-muddy-appearance-was-result-of-wild-boar_fig4_326773887
18.Liu, Jianzheng & Li, Jie & Li, Weifeng & Wu, Jiansheng. (2015). “Rethinking
big data: A review on the data quality and usage issues”. ISPRS Journal of
Photogrammetry and Remote Sensing. 115. 10.1016/j.isprsjprs.2015.11.006.
19.Xareni P. Pacheco (December 10th 2018). How Technology Can Transform
Wildlife Conservation, Green Technologies to Improve the Environment on
Earth, Marquidia Pacheco, IntechOpen, DOI: 10.5772/intechopen.82359.
Available from: https://www.intechopen.com/books/green-technologies-to-
improve-the-environment-on-earth/how-technology-can-transform-wildlife-
conservation
20.https://projects.ncsu.edu/cals/course/fw353/Estimate.htm
21.https://datafloq.com/read/big-data-history/239
22.https://www.iucn.org/news/secretariat/201704/experts-call-more-
collaboration-and-investment-biodiversity-monitoring
23.https://smartconservationtools.org/
24. https://nc.iucnredlist.org/redlist/content/attachment_files/
2019_2_RL_Table_1a_v2.pdf
25. Winters, J., August 15, 2018. How Conservationists Are Using AI And Big
Data To Aid Wildlife. NWPB News.
26. https://www.datacamp.com/community/tutorials/r-packages-guide