Research Title: Data Collection Techniques with Big Data and AI for Animal Preservation
Trimester: 1 / 2019
DECLARATION OF ORIGINALITY

The thesis paper, titled “Data Collection Techniques with Big Data and AI for Animal Preservation”, is being submitted to the research supervisor, Dr. Mahira M. Mowjoon, at Victoria University, Sydney. This submission is made as a course requirement for the partial fulfillment of the degree of Master of Applied Information Technology.
This declaration of originality is hereby made to affirm that the content of this document has been prepared and compiled through my own evidence-based research on the subject, after going through the relevant sources and related scholarly articles. Any and all references to the documents consulted during the research have been cited where applicable, and all the original authors of the references have been given their due credit. To the best of my knowledge, the information presented in this thesis has never been submitted before with the same intent and purpose in fulfillment of any other degree requirements.
______________________ Date:
Signed,
Mridha Faysal Hossain
ACKNOWLEDGMENTS
I, Mridha Faysal Hossain, am dedicating this section of the paper to thank and
acknowledge all the people around me, who have helped me not only in my
research but also aided me in the compilation of this thesis and inspired me to
take on the challenges of life. I thank all the people who helped me come this
far, ranging from my parents who raised me, to the person who shared trivia
with me during an idle conversation. I might not be mentioning each and every
one of such persons by name, but I understand that I am a product of each of
their actions and influences, bringing me where I am right now.
The biggest thanks for this goes to my Thesis Supervisor, Dr. Mahira M.
Mowjoon, for his thorough guidance during my research, till the completion of
my thesis. His timely and earnest support and supervision inspired me to
continue this research with diligence and helped me walk an extra mile. I thank
him for the time he gave me and for his meticulous evaluation of my progress. I
thank him for giving me helpful advice and for providing me with leads and
sources wherever necessary. Without him checking my paper and the mentoring
he gave me, I believe that this thesis wouldn’t have been completed the way it
has been.
I acknowledge the influence that all the lecturers and professors of Victoria
University had in my advancement and extend my sincere thanks to all the
teachers who taught me anything, whether it was the alphabet or how to drive a car. Through all their teachings and mentorship, I have gotten to this point
where I can conduct my own research and write a complete thesis.
I thank you, mother, and I thank you, father, for raising me to be who I am today.
I thank all my family and relatives for being so supportive of me over all these
years and forgiving me for all the minor and major mistakes that I made and for
accepting me despite my countless flaws.
ABSTRACT
Large datasets that go beyond simple ASCII-encoded data fields, such as images, videos and surveillance footage, are commonly referred to as Big Data. Using such data, collected from CCTV cameras and camera traps, can help humans surpass their physical limitations and conduct surveys around the clock. With this, it is now possible to build a comprehensive database that was only a distant dream in the past.
However, with bigger data comes the need for a bigger workforce to process it. Given the same human limitations that these technologies allow us to surpass, it is nearly impossible for people to analyze all of the collected data manually. Enter Artificial Intelligence: it can remove the need to manually analyze terabytes of data, as the machines do the analyzing for us. The AI can go through the Big Data and create graphs and charts, as well as inform us when something is wrong and needs our attention.
This is exactly what the world needs right now in order to protect wildlife. Biologists and conservationists need to be aware of animal populations more than ever, and need to be ready to act whenever one of those populations reaches a dangerous low. Therefore, this research intends to collect information on the methodologies of Big Data collection and the Artificial Intelligence tools used to analyze the collected data. With proper knowledge of these technologies, conservationists can work on filling the gaps in their knowledge base and create a complete database, while future researchers can work on further improving these technologies.
TABLE OF CONTENTS
DECLARATION OF ORIGINALITY...............................................................i
ACKNOWLEDGMENTS...............................................................................ii
ABSTRACT................................................................................................. iii
TABLE OF CONTENTS..............................................................................iv
LIST OF FIGURES AND TABLES...............................................................iv
INTRODUCTION......................................................................................... 1
Collecting Wildlife Data...........................................................................1
Purpose of this Research.........................................................................1
Scope of the Research.............................................................................2
LITERATURE REVIEW................................................................................4
Techniques and Methodology of Big Data Collection and Analysis.........4
Available Tools for Experimentation, Analysis and Data Collection......14
Statistics and Correlation......................................................................18
RESEARCH METHODOLOGY...................................................................22
Research Methodology Process.............................................................22
Effectiveness of the Methodology..........................................................22
Hardware and Software Requirements for the Research......................24
CONCLUSION.......................................................................................... 40
REFERENCES........................................................................................... 41
LIST OF FIGURES AND TABLES
Figure 1: Data Flow Diagram of Big Data Mining from Public Bioassays for
Research............................................................................................................... 5
Figure 2: The Data Volume Challenge..................................................................8
Figure 3: (a) MLM Analysis (b) PCA...................................................................14
Figure 4: HADOOP Cluster – Data Load Performance........................................16
Figure 5: Number of studies and publications on Big Data................................19
Figure 6: Fine-grained big data could cause over-fitting....................................20
Figure 7: Protected Areas in Namibia.................................................................30
Figure 8: Biodiversity Index of The United States of America............................31
Figure 9: Protected Areas around the World......................................................31
Figure 10: Summary of the Red List Found on the IUCN Red List Website.......32
Figure 11: The IUCN Red List of Threatened Species........................................33
Figure 12: Data Analyzed and Collected by IUCN using SMART.............................34
Figure 13: Evaluation Framework Matrix for Animal Conservation.........................40
INTRODUCTION

Collecting Wildlife Data

After the recent disaster in the Amazon forest (1) (Howell, R. 2019, Forbes), there is no doubt that the world needs a wake-up call and must start focusing more on the preservation of wildlife. While different conservation movements have been taking place for a long time now, our dependency on technology has, debatably, been destroying the world that we live in (2) (Confino, J. 2013, The Guardian).
However, with its numerous uses, technology might just be the thing that can
be used to improve the conservation efforts in the end.
Some say that the most powerful tool in the world is knowledge (3) (Aksari, M. 2018, Medium). After all, one cannot hope to solve a problem one knows nothing about. Thus, the first step towards preserving wildlife is to collect information about its numbers. One cannot prioritize the preservation of a species without truly understanding its level of endangerment.
Needless to say, all conservation efforts need to start with collecting data on wildlife populations. In the modern world, the resources required to do just that are at our fingertips. Using CCTV surveillance cameras, satellite imagery, camera traps, motion sensors and a lot of other technology, it is possible to keep counting the animals and surveying their actions and movements with minimal effort and human involvement. Regardless of the ‘how’s, the bottom line is: the collection of wildlife data is of the utmost importance. However, without the know-how, it is impossible to reach that end.
Purpose of this Research
This research aims to take a look at the modern technologies available for the
collection of Big Data and the tools available for analyzing them using Artificial
Intelligence. The research will be collecting information from the research documents prepared by the foremost authorities on the subject, to find the know-how required not only for using these technologies in the field, but also to help future researchers work on them and improve their efficiency.
Furthermore, the research will attempt to take an objective look at the current methodologies of these systems and refer to statistical data to analyze their efficiency. The thesis will include a comparison of statistical data from different eras to pinpoint the improvements in the efficiency of these data collection systems and the accuracy of their collected data, respectively.
Scope of the Research

This research will therefore not revisit the foundational material already covered in the previous thesis, such as the definition of Big Data and the core principles behind these systems. Instead, it will focus on the practical application of the systems and compile all the information regarding Big Data collection and analysis methods and techniques, as well as the use of the software and hardware required for the said collection and analysis.
Big Data Collection and Analysis is something that is done even by large corporations like Facebook and Google, but researching their systems would be too big an undertaking and is unnecessary for two reasons:
1. They do not need any further research by third parties, since they have dedicated teams to improve their systems.
2. It would be impossible to collect information about their Big Data Analysis systems properly, due to business confidentiality.
As such, this research will not cover Big Data Collection and Analysis techniques in general, but will only focus on how they can be and are being used in the field of Wildlife Biology by individual groups of researchers and independent conservationists. Above all, this information is mostly open for public use and analysis and should be accessible for research, making this research scope viable with a proper range of information.
LITERATURE REVIEW
As the primary source of information, the research papers and documents from
some of the foremost authorities on the subjects are being reviewed in this
section. The research papers are being compiled under three specific topics, to
provide a more in-depth understanding of each topic.
Techniques and Methodology of Big Data Collection and Analysis

Data mining is one of the most effective techniques for collecting big data from a large data source in order to create a new dataset. As such, the researchers accessed the large amounts of bioassays available for public use, from sources such as the National Center for Biotechnology Information, ChEMBL and the Molecular Libraries Program of the National Institutes of Health. They used these sources of data to mine for the Big Data required for their research into animal toxicity. The data mining techniques used in the research are depicted in the diagram below:
Figure 1: Data Flow Diagram of Big Data Mining from Public Bioassays for Research
However, after the initial data collection, the researchers used automated filters and manual screening techniques to narrow down their results, instead of using Artificial Intelligence to do the screening. Using a tool called Entrez Utilities(6) (Entrez Programming Utilities, NCBI), they removed most of the data used for profiling (from the 739,000 assays; see the diagram above) and narrowed the results down to 4,841 unique compounds.
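To make the workflow above more concrete, the following is a minimal sketch of how a public bioassay source can be queried programmatically through the Entrez Programming Utilities mentioned above, here via the Biopython wrapper. The search term, the contact address and the number of records retrieved are illustrative assumptions, not the filters used by the reviewed study.

```python
# A minimal sketch (assumption): querying NCBI bioassay records through the
# Entrez Programming Utilities using the Biopython wrapper.
from Bio import Entrez

Entrez.email = "researcher@example.org"   # NCBI asks for a contact address

# Search the PubChem BioAssay database ("pcassay") for toxicity-related assays.
handle = Entrez.esearch(db="pcassay", term="toxicity", retmax=20)
record = Entrez.read(handle)
handle.close()

print("matching assays:", record["Count"])
for assay_id in record["IdList"]:
    print("assay id:", assay_id)
```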
After screening the compounds using different filters, the final data was presented in a graphical format using Artificial Intelligence. Notably, this was a tool developed solely for the purpose of that research, which also appears to be a viable alternative to off-the-shelf software. This automatic virtual profiling tool was created to analyze the data and produce the heat graphs and dot charts used in the research.
The paper titled “Design Principles for Effective Knowledge Discovery from Big
Data” has been prepared by Edmon Begoli and James Horey at the Oak Ridge
National Laboratory in Tennessee(7) (Begoli, E. 2012, Oak Ridge National
Laboratory). This research is a compilation of some of the Big Data analysis
principles that can aid in the processing of Big Data and collect useful
information from them.
According to the document, the principles are derived from different real-life projects that have been conducted at the Oak Ridge National Laboratory in collaboration with different state and federal agencies. After this brief introduction establishing their authority and citing references, the paper moves on to explaining the actual design principles that have proven to be effective in Big Data processing. These principles mainly deal with maximizing the controlling factors, thus allowing the researchers to work with the data with relative ease. The principles mentioned are as follows:
1. Different Analysis Techniques Need to Be Supported: Some of the data analysis methods are outlined next, namely Statistical Analysis (summarizing large datasets and defining prediction models), Data Mining (using AI to automatically mine for useful datasets from among Big Data), Machine Learning (combined with Data Mining and Statistical Analysis, the machine attempts to understand patterns in order to improve the mining and analysis techniques automatically) and Visualization (large datasets are presented in visual formats to help researchers discover interesting relationships).
2. One Size Fits All Solutions Do Not Exist: A single drive or a local file system might be sufficient for a small amount of data, but when dealing with Big Data, it is not a viable solution. While many of the large datasets in the past employed heavy relational databases, different types of intermediate database structures and analyses require special architectures tailored to their needs. Many of the other authorities agree that the time of a single database structure as a solution for all is past.
3. Make the Data Accessible: First of all, data accessibility could be improved by using open-source and popular frameworks. With most research requiring a complex set of software, using the popular options can minimize complexity and improve data accessibility. Next, the paper emphasizes the need for a lightweight architecture to provide rich, interactive experiences. While technologies like J2EE would be enough to deliver the results to the users, newer technologies like Node.js make it easier for users to access the data by catering to their demands. Finally, the results need to be exposed through an API, so that users can consume the data in any way they want, whether they want to download the results in different formats or just get a visual representation.
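As a small illustration of the last principle, exposing results through an API, the sketch below serves a pre-computed summary as JSON from a lightweight Python endpoint. The endpoint path, the results file and its fields are invented for illustration; the reviewed paper itself mentions Node.js, and any comparable lightweight framework would do.

```python
# A minimal sketch (assumption) of the "expose the results using an API" principle:
# a lightweight endpoint that serves pre-computed analysis results as JSON.
import json
from flask import Flask, jsonify

app = Flask(__name__)

# Pretend this summary file was produced by an earlier analysis step (hypothetical).
with open("population_summary.json") as fh:
    RESULTS = json.load(fh)

@app.route("/api/v1/population-summary")
def population_summary():
    """Return the latest population summary so clients can render or download it."""
    return jsonify(RESULTS)

if __name__ == "__main__":
    app.run(port=8080)
```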
Andrej Kalicky, under the supervision of Eng. Vladimir Kyjonka, prepared the thesis titled “High Performance Analytics” at the Charles University in Prague(8) (Kalický, A. 2013, Charles University). The thesis deals with the challenges and problems of pioneering advanced analytics systems and summarizes the information pertaining to High Performance Analytics on Big Data. As such, this paper has been a sufficient resource to compile the information necessary for the current thesis, eliminating the need to review multiple sources of redundant information.
According to the research paper, one problem that exists in Big Data analysis is the disparity between the knowledge gap and the execution gap. The knowledge gap is defined by the limitations of the analysis techniques, such as poor algorithms and data mining methods, while the execution gap arises from the limitations of hardware and other resources. The major problem in this disparity is that hardware is improving at a stable rate while the data is increasing exponentially with time. This, in turn, is widening the knowledge gap and slowing down Big Data analysis capabilities. The figure below shows a graph of this phenomenon.
Figure 2: The Data Volume Challenge
In order to overcome the knowledge gap, the analysis algorithms need to be improved. That is where high performance analytics comes into play. The main aim of high performance analytics is to optimize the available hardware to maximize the analysis potential, in order to facilitate an effective allocation of the available computing resources. There are four types of data analysis:
Descriptive Analytics: These analytics summarize the collected data to describe what has happened.
Diagnostic Analytics: These analytics try to find the root causes behind the information gathered from the descriptive analysis and explain why it is happening.
Predictive Analytics: This analysis tries to figure out, from the trends and the causes, how things might turn out in the future.
Prescriptive Analytics: This analysis suggests which actions should be taken based on the predicted outcomes.
The concept of HPA, then, comes down to using these high-performance computing techniques with analytics as their goal. While this may seem to be only a concept, it is based on using Advanced Analytics as the core infrastructure of a Big Data analysis system.
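The following toy sketch illustrates the basic idea behind high performance analytics, spreading a simple summary computation across the available CPU cores rather than scanning the data serially. The synthetic observations and the chunking scheme are assumptions made purely for illustration.

```python
# A toy sketch (assumption) of the high-performance-analytics idea: spread a simple
# summary computation over the available CPU cores instead of scanning the data on
# a single core. The "observations" are synthetic stand-ins for sensor readings.
from multiprocessing import Pool, cpu_count
import random

def chunk_summary(chunk):
    """Return (count, total) for one chunk so partial results can be merged."""
    return len(chunk), sum(chunk)

if __name__ == "__main__":
    observations = [random.random() for _ in range(100_000)]
    n = cpu_count()
    chunks = [observations[i::n] for i in range(n)]

    with Pool(processes=n) as pool:
        partials = pool.map(chunk_summary, chunks)

    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    print("mean of", count, "observations:", total / count)
```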
The next research paper, titled “Geospatial Big Data Handling Theory and Methods: A Review and Research Challenges”, has been prepared by Songnian Li, Suzana Dragicevic, Francesc Antón Castro, Monika Sester, Stephan Winter, Arzu Coltekin, Chris Pettit, Bin Jiang, James Haworth, Alfred Stein and Tao Cheng(9) (Li, S. 2015, ISPRS Journal of Photogrammetry and Remote Sensing). This paper was published in the ISPRS Journal of Photogrammetry and Remote Sensing in October 2015.
This research paper is particularly important for the current topic because geospatial data plays an important role in determining wildlife habitats and is, as such, just as important in monitoring animal populations as the data on their numbers itself. Thus, the review of this paper can and does provide some valuable insight into handling Geospatial Big Data.
According to the thesis, Geospatial Big Data can be characterized on the basis
of Volume, Variety, Velocity, Veracity, Visualization and Visibility. The paper
then proceeds to tackle each one of them, pointing out the problems and suggesting Big Data handling methods as solutions to the said problems. The findings of the paper are summarized in the table below:
Characteristic: Variety
Definition: Different types of data, such as map data, geotagged text data, imagery data, raster data, vector data, etc.
Problems: The larger the variety of data, the more challenging it is to combine them in Big Data analytics, challenging the error propagation models based on functional relationships.
Solutions: Open-source solutions such as OpenStreetMap have proven to be effective at managing crowd-sourced data, but variety still remains a concern at the moment.

Characteristic: Velocity
Definition: The speed at which the data can be accessed, such as continuous streaming of sensor observations, frequently revisited imagery data at high resolutions, real-time GNSS trajectories, etc.
Problems: Real-time analysis for future predictions does not allow for setting up proper error propagation, since disturbances in the sensors/channels can only be detected through prediction models, throwing it into a catch-22 situation.
Solutions: One proposed solution is to produce a locally distributed stream sensing, processing and telecommunication paradigm. The other solution is to focus on new processing algorithms that handle large volumes of data, for example by using functional programming languages to design new streaming algorithms.

Characteristic: Veracity
Definition: The accuracy of the data: much of the geospatial data comes from unverified sources, and the accuracy of the collected data is a major player in characterizing Geospatial Big Data.
Problems: The unreliability of the data is often unpredictable. For example, the accuracy of satellite imagery is dependent on location (canyons can prove to be a challenge), making its accuracy unpredictable.
Solutions: Such unreliable data is used to predict and estimate only the specific sets of data that allow for some “messiness” and inaccuracy. Example: while a satellite image might not be able to show all the narrow lanes and streets, it is still enough to provide an estimate of the ratio of big city blocks to small blocks.

Characteristic: Visualization
Definition: The ability to visualize Big Data in a human-readable format, helping researchers identify patterns, draw statistics and understand the information within the data.
Problems: Analytics may result in trying to display “too much data” in one small window, making the results seem fuzzy and difficult to read. This may lead to information overload, making it difficult for the human mind to process the information.
Solutions: Many possible methods are suggested but have yet to be empirically tested for efficiency. Suggested solutions are multiple-linked views, focus-context visualization and foveation.

Characteristic: Visibility
Definition: The accessibility of data through different media and how easily the data can be accessed by the people who need it.
Problems: Large amounts of data needing to be accessed by different researchers from different places and media is a challenge.
Solutions: Cloud computing has essentially solved the problem of visibility and, as of now, may not prove to be a challenge for the scientific community.
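As a toy illustration of the streaming-algorithm idea suggested for the Velocity characteristic above, the sketch below keeps a running summary of a simulated sensor feed instead of storing the whole stream before analysing it. The simulated readings and their interpretation are assumptions for illustration only.

```python
# A toy sketch (assumption) of the "new streaming algorithms" idea from the Velocity
# row: process sensor observations one at a time and keep a running summary, instead
# of storing the full stream before analysing it. The sensor feed is simulated.
import random

def running_mean(stream):
    """Yield the mean of all observations seen so far, one value per observation."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

def simulated_sensor(n):
    for _ in range(n):
        yield random.gauss(20.0, 2.0)   # e.g. temperature readings from a camera site

for i, mean in enumerate(running_mean(simulated_sensor(5)), start=1):
    print(f"after {i} observations, running mean = {mean:.2f}")
```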
Available Tools for Experimentation, Analysis and Data
Collection
After taking a look at the most important design principles and high
performance analytics concepts, it was now time to take a better look into the
available tools for utilizing these methods. To do that, the following research
documents are being reviewed:
A New Tool Called DISSECT For Analyzing Large Genomic Data Sets
The first document to be reviewed in this section would be the article titled, “A
New Tool Called DISSECT For Analyzing Large Genomic Data Sets Using a Big
Data Approach”, by Oriol Canela-Xandri, Andy Law, Alan Gray, John A.
Woolliams and Albert Tenesa(10) (Canela-Xandri, O. 2015, Nature
Communications Journal). Published at Nature Communications journal in
November 2015, this paper discusses one of the AI tools used in modern
computing to analyze Big Data.
This article mainly presents the researchers' findings after using this tool, including its features and shortcomings, as well as its technical specifications and requirements. After taking a look at its technical specifications and computational capabilities, the research moves on to testing its performance with MLM (Mixed Linear Model) and PCA (Principal Component Analysis) analyses. These two analyses were selected because they are very computationally demanding. The results are shown below:
Figure 3: (a) MLM Analysis (b) PCA. The blue lines and the left axis represent the computational time, and the red lines and the right axis represent the number of processor cores used.
Next, DISSECT was tested for its prediction results with large data samples. While the predictions were not accurate on their own, after using a large data sample to train its prediction algorithm, DISSECT showed high accuracy (up to 86%). This shows that DISSECT's machine learning capabilities only shine when used in conjunction with Big Data.
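The general pattern reported here, that prediction accuracy improves as the amount of training data grows, can be illustrated with the following sketch on synthetic data using scikit-learn. This is an assumption-laden illustration of the effect, not a reproduction of DISSECT's own algorithms or results.

```python
# An illustrative sketch (assumption): prediction accuracy generally improves as more
# training examples become available. Synthetic data and scikit-learn, not DISSECT.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=20000, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

for n in (100, 1000, 10000):
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"trained on {n:>6} samples -> test accuracy {acc:.2f}")
```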
The thesis prepared by Ketaki Subhash Raste at San Diego State University, titled “Big Data Analytics – HADOOP Performance Analysis”, gives us a great in-depth look at HADOOP and how it is used with Big Data(11) (Raste, K. 2014, San Diego State University). Approved by the Faculty Committee in April 2014, this document is a credible source of information and worthy of a review in this thesis.
The thesis starts off with the aim and objectives until it reaches Chapter 2, where it discusses and defines Big Data. Since that part has already been covered in the previous research into wildlife preservation with Big Data, reviewing that section would be redundant. So, moving on to Chapter 3, the thesis explains the architecture of HADOOP. Quoting the Apache HADOOP documentation(12), the framework is designed to scale up from single servers to thousands of machines, each offering local computation and storage, and “to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.”
Chapter 4 of the thesis introduces some of the technologies that can be used seamlessly with Big Data on the HADOOP framework. The first is a NoSQL database, such as MongoDB and CouchDB as document databases, Neo4J as a graph database, Redis and MemcacheDB as key-value stores, and HBase/Cassandra as column databases. The second option is the use of cloud computing for remote processing and storage, saving on cost.
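As a brief illustration of the document-database option mentioned above, the sketch below stores and aggregates hypothetical camera-trap records in MongoDB. The database name, collection name and record fields are invented; it assumes a MongoDB instance running locally.

```python
# A minimal sketch (assumption): storing processed camera-trap records in MongoDB,
# one of the document databases mentioned above. Names and fields are illustrative.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB instance
db = client["wildlife_monitoring"]                   # hypothetical database name
detections = db["camera_trap_detections"]            # hypothetical collection name

# Insert one processed detection record (schema-less document).
detections.insert_one({
    "camera_id": "CAM-07",
    "species": "panthera_tigris",
    "timestamp": datetime(2019, 8, 1, 5, 42),
    "location": {"lat": -19.23, "lon": 16.91},
})

# Count detections per species, analogous to the aggregate queries run on HADOOP.
for row in detections.aggregate([
    {"$group": {"_id": "$species", "count": {"$sum": 1}}}
]):
    print(row["_id"], row["count"])
```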
Chapter 5 then moves on to describe Amazon Web Services(14) and the practical methods of setting up HADOOP on AWS. Chapter 6 explains in detail the hardware and system the researcher used to test the performance of HADOOP, with a bit of detail on how the whole system was set up. Finally, the results of the performance analysis are shown below:
Data size | Number of nodes | Query 1 time: select count(movieid) from ratings where rating like '5' | Query 2 time: select count(movieid) from ratings group by userid
4GB | 2 | 5.94 | 4.56
4GB | 4 | 4.43 | 3.29
4GB | 6 | 3.64 | 2.95
4GB | 8 | 3.03 | 2.58
4GB | 10 | 3.11 | 2.36
6GB | 2 | 7.60 | 6.17
6GB | 4 | 5.26 | 4.37
6GB | 6 | 4.31 | 3.55
6GB | 8 | 4.25 | 3.13
6GB | 10 | 3.46 | 2.83
Table 2: Average Time in Minutes to Analyse Data on a HADOOP Cluster with an Increasing Number of Nodes
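To make the benchmark queries above more tangible, the following is a hedged sketch of how a count like the first query (ratings equal to 5) could be expressed as a Hadoop Streaming job written in Python. The tab-separated layout of the ratings file and the job invocation shown in the comment are assumptions for illustration.

```python
# A hedged sketch of the first benchmark query ("how many ratings equal 5") as a
# Hadoop Streaming job. The input layout (userid, movieid, rating, tab-separated)
# is an assumption about the ratings file, not taken from the reviewed thesis.
import sys

def mapper():
    """Emit one ("rating_5", 1) pair per record whose rating field equals 5."""
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "5":
            print("rating_5\t1")

def reducer():
    """Sum the counts emitted by the mappers for the single key."""
    total = 0
    for line in sys.stdin:
        _, value = line.rstrip("\n").split("\t")
        total += int(value)
    print(f"rating_5\t{total}")

if __name__ == "__main__":
    # In practice mapper and reducer would be two separate scripts passed to the
    # Hadoop Streaming jar, e.g.:
    #   hadoop jar hadoop-streaming.jar -input ratings -output out \
    #       -mapper mapper.py -reducer reducer.py
    mapper() if sys.argv[1:] == ["map"] else reducer()
```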
The most relevant part of the next reviewed paper, “Software-Defined Networking for Big-Data Science”(15) (Monga, I. 2012), discusses different proposed architectural models to be used with Big Data that can be implemented anywhere, whether on a college campus, in a laboratory or in a data center. The two presented models are:
ECSEL: Providing a recap of previous research on this topic, the paper details the End-to-End Service at Layer 2, or ECSEL architecture for short. It is a proposed solution for a circuit with no loss, stable latency and guaranteed bandwidth, which has always been a challenge for moving data over a WAN. ECSEL is an implementation of an IDC (Inter-Domain Controller), which negotiates with the remote and the local resources while keeping the administrative boundaries intact.
OpenFlow/SDN ScienceDMZ Architectural Model: This architecture is
suitable for multiple science centers, which employ the use of different
supercomputers for data hosting and analysis. This architecture proposes that a
DTN redirector is placed at the Science DMZ so that whenever the DTN
redirector receives a data transfer request, it is redirected to the right DTN
following the correct security protocols using the data flow rules.
Estimating Animal Density Using Camera Traps without the Need for
Individual Recognition
While the other papers in this section tackled the software, web server frameworks and networking models suitable for processing Big Data in animal conservation, this paper, titled “Estimating Animal Density Using Camera Traps without the Need for Individual Recognition”, deals with the hardware required to collect the said Big Data; and the key pieces of hardware for automatically monitoring and collecting Big Data on animals are camera traps(16) (Rowcliffe, J. 2008, Journal of Applied Ecology).
According to the paper, estimating the animal density in an area is the most important part of wildlife management. Until now, using camera traps to estimate density has been restricted by capture-recapture analysis to species with individually identifiable markings. To eliminate that restriction, this paper suggested a solution that removes the need for individual recognition by modelling the underlying process of contact between the cameras and the animals.
The paper describes the practical field tests conducted with the proposed solution to identify animals, including the hardware and system used for the experiments. The experiment was conducted using six DeerCam DC300 camera traps(17), which were used to capture images of animals on the move. By calculating the distance and the time taken for the animals to move past a specific point, the system could get an idea of the speed of an animal. The speed was then used to estimate density, with the results checked against a census carried out through manual counting of the animal population by 12 scientists. While some manual involvement was still required in this experiment, the proposed solution still seems viable in contrast to the limited recognition capabilities of the current systems.
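As a toy calculation of the idea behind this approach, the sketch below estimates density from camera-trap rates, assuming the random encounter model in its commonly cited form D = (y/t) · π / (v · r · (2 + θ)). The parameter values are invented and are not taken from the reviewed experiment.

```python
# A toy calculation (assumption): the random encounter model is commonly stated as
# D = (y / t) * pi / (v * r * (2 + theta)), where y/t is the trapping rate, v the
# animal speed, r the camera detection distance and theta the detection angle.
import math

def rem_density(photos, camera_days, speed_km_per_day, radius_km, angle_rad):
    """Estimate animal density (individuals per km^2) from camera-trap rates."""
    trap_rate = photos / camera_days
    return trap_rate * math.pi / (speed_km_per_day * radius_km * (2 + angle_rad))

# Example: 120 photographs over 600 camera-days, animals moving 4 km/day,
# detection distance of 0.01 km (10 m) and detection angle of about 0.175 rad.
print(round(rem_density(120, 600, 4.0, 0.01, 0.175), 1), "individuals per km^2")
```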
Statistics and Correlation

After taking a look at the design principles of a Big Data system, its collection and handling methods and a concept of high performance analytics on the said data, the review looked into the software and system architectures available for the relevant field. It was now time to take a look at the statistics and at how all of this relates to the current research, through two reviews:
Rethinking Big Data: A Review on the Data Quality and Usage Issues
“Rethinking Big Data: A Review on the Data Quality and Usage Issues” is an article from the ISPRS Journal of Photogrammetry and Remote Sensing(18) (Liu, J. 2015, ISPRS Journal of Photogrammetry and Remote Sensing). Prepared by Jianzheng Liu, Jie Li, Weifeng Li and Jiansheng Wu, and published in December 2015, this document takes a good look at the issues that have already been discussed in the previous reviews and gives an overall summary of the challenges. Although that makes it the perfect document with which to conclude the literature review section, to avoid redundancy, only the statistical data from this document will be reviewed and correlated with the existing findings from the previous reviews.
The article begins with a figure (Figure 5: Number of studies and publications on Big Data) showing the number of studies and publications made on Big Data since the year 2004. While this figure lacks more recent data, namely from the past four years, it still serves to show the growing interest and research into Big Data and Artificial Intelligence. As such, it reinforces the idea that not only is Big Data gaining momentum, but also that research is ongoing into the many different challenges that come with handling and analyzing Big Data.
Table 3: Summary of some of the Big Data Research on Spatial Information Sciences
The table above shows a summary of a few of the countless research efforts aimed at improving Big Data. Even with all of that, the modern world is still struggling with the challenges of Big Data and Artificial Intelligence. One such problem is mentioned further into the article, when it moves to discussing the potential errors brought about by Big Data. These errors, as per the article, are caused by inauthentic data collection, information incorrectness, and the “noise” in Big Data. An example of the “noise” and the incorrect correlations it produces during analysis is depicted in the figure below:
Figure 6: Fine-grained big data could cause over-fitting when wrong models or analysis methods
are applied
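The over-fitting effect illustrated in Figure 6 can be reproduced in miniature with the following sketch: an overly flexible model fitted to noisy, fine-grained data matches the training sample closely but predicts held-out data poorly. The synthetic data and model choices are assumptions for illustration.

```python
# A small sketch (assumption) of the over-fitting effect: a model that is too
# flexible for noisy data fits the training sample well but generalizes poorly.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 60)   # noisy observations
x_train, y_train, x_test, y_test = x[:40], y[:40], x[40:], y[40:]

for degree in (3, 25):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(f"degree {degree:>2}: "
          f"train error {mean_squared_error(y_train, model.predict(x_train)):.3f}, "
          f"test error {mean_squared_error(y_test, model.predict(x_test)):.3f}")
```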
The article also covers further data sources, such as social media, which are currently irrelevant to the subject topic and are therefore excluded from this review.
So, how do all these reviews relate to the current topic? To examine that, the chapter “How Technology Can Transform Wildlife Conservation” from IntechOpen, prepared by Xareni P. Pacheco, is being reviewed as a conclusion to this literature review section(19) (Pacheco, X. P. 2018, IntechOpen).
Some of the technologies mentioned in the chapter belong to a different field of research and are not relevant to the current research, but were worth a mention due to their potential for future research.
RESEARCH METHODOLOGY
This research has been conducted with a proper procedure and strict guidelines, ensuring that all of the information presented is evidence-based and properly referenced. In order to do that, the research had to follow a specific, effective methodology, which is detailed below.
Research Methodology Process

The initial inspiration for the research came from the previous research on the subject, titled “Wildlife Preservation with Big Data and AI”. After conducting that research and preparing its thesis, it became apparent that it only focused on the theoretical aspects and that a more in-depth look at the practical side was needed.
Thus, the preliminary research for this project was conducted. It started by collecting all the relevant research documents related to this field of study. With a collection of the available resource materials in hand, it became apparent that this was a viable research project that could be conducted within a limited scope. With a brief understanding of the subject matter, a research proposal was created and submitted. Once the proposal was approved, the actual research process began.
Next, all the collected documents were skimmed through in order to find the most relevant documents, those that could provide a complete, one-stop source of information for future researchers and users of Big Data and AI. An in-depth literature review was then made of the final short-listed documents. This literature review served as the primary research data for the research.
For the secondary data, real-life statistics from the foremost conservationists and organizations were collected for analysis. Furthermore, they were compared to data collected before the advent of the researched technology, which allowed the research to come to a conclusive understanding of the subject matter.
Effectiveness of the Methodology
After the previous research project on Big Data and AI had been conducted and its thesis prepared, the foundation for this research was already in place. This new research methodology, following a similar pattern to the already successfully completed project, has therefore been deemed efficient. Not only has a similar method demonstrated a history of success, it is also effective for the following reasons:
The primary data has been collected from published research papers, books and other documents. This means that all the materials referred to during this evidence-based research are themselves evidence-based and can thus be considered accurate.
After going through the documents and collecting all the necessary information, the most relevant sources were selected for the literature review. This ensured that unnecessary information was weeded out and that documents which would have provided overlapping information were not reviewed. Thus, this thesis helps future researchers save the time they would otherwise have spent reading duplicate information, had they gone through all of the documents manually.
The statistical data collected during the research helped the thesis by backing up the facts found from the evidence-based research done during the literature review. Furthermore, because these statistics are used for practical purposes, they can help in determining the efficiency of the systems accurately.
Hardware and Software Requirements for the Research

Since the research was conducted with published research papers as the primary data, it was deemed unnecessary to purchase any additional hardware or software for this research. However, if this research were to include any first-hand research data, it would definitely need some specific hardware. Since this research is being conducted to help future researchers conduct their own experiments, it is important to list the hardware they can be expected to buy for their first-hand research.
Camera Traps
A set of camera traps should definitely be the first thing a researcher needs in
order to collect data from the fields. As such, a researcher is suggested to buy a
Spypoint Force-20 camera trap as a budget option or a collection of Browning
Strike Force Pro XD if budget is not a problem. The latter would help the
researcher to set up camera traps all across a designated region, increasing the
number of snapshots of wildlife out in the open.
Surveillance Cameras
This hardware may or may not be necessary, depending on the type of data the researcher wants to collect on the animals. If the researcher is a biologist aiming to study the behavioural patterns of an animal, this hardware is a must. However, a computer engineer looking to engineer newer Big Data capture solutions might not need this hardware.
Workstation
Software
If a researcher needs to survey the available software on the market in order to reverse engineer and improve upon it, they really should invest in a service like SMART. Other open-source solutions are available on a HADOOP base, with DISSECT as the analytical tool. Regardless, if the researcher is looking to create an improved system from scratch and has adequate resources, the software requirements may change depending on the language the researcher prefers to work in.
ANALYSIS AND RESULTS
After the primary information was compiled in the literature review section, it was time to analyze the information and the results to find out what the research has to offer and teach its readers. To do so, the analyzed problems are first summarized with their underlying causes. Then the secondary data is used to correlate them with the primary information, and finally the possible solutions to these problems are presented.
From the literature review, it is clear that some problems exist in the modern world in terms of the basic design concepts and the data collection and analysis methods and principles. These problems and their underlying causes are shown in the table below:
Problem: Different analysis techniques need to be supported.
Cause: While there are still ongoing attempts at creating a single solution that incorporates Big Data and AI across all analysis techniques, the modern world still employs different software and hardware for different analyses.

Problem: Different types of data require different storage and processing architectures.
Cause: As noted in the design principles reviewed earlier, no one-size-fits-all solution exists at the moment.

Table 4: Problems and Causes in Big Data Collection and Analysis Methods
The second part of the literature review examined the technologies currently available for performing wildlife conservation work. The technologies reviewed included a tool called DISSECT, the distributed data processing framework HADOOP, some Software-Defined Networking solutions and camera trap hardware.
The primary problem with DISSECT is that, despite its performance in analyzing genomics and genetics, it still lacks sufficient support for camera traps and AVEDS (Animal-Borne Video and Imaging Devices), making it unsuitable for wildlife monitoring purposes on its own. Looking at HADOOP, it turns out to be the only viable open-source solution to the problems, but it still leaves a lot of room for improvement in terms of performance. When implementing this system in different science and data centers, a few inefficiencies in the system were revealed in the Software-Defined Networking research paper. Two of the most prominent problems with these systems were the use of multiple point-to-point circuits for rerouting on each site, increasing the manual involvement of management, and the use of different types of supercomputers in different science centers when they could have benefited from sharing resources.
Finally, the problems with capture-recapture using camera traps were mentioned: their function is restricted to identifying animals with distinctive markings, meaning they cannot be used to monitor all species of animals without manual involvement. When dealing with Big Data, manual involvement of that kind is impractical. Above all, Big Data introduces errors in terms of the noise generated by the variety of data, leading to some impractical predictions.
While there are many different techniques and methods of Big Data analysis, with different major corporations having their own confidential models, following the solutions presented in the reviewed documents can potentially fill the gaps in the analysis techniques. Addressing the problems mentioned in the previous section in order, the respective solutions are offered below:
Problem: Different analysis techniques need to be supported.
Solution: Frameworks such as HADOOP can accommodate these different solutions in the form of technologies such as NoSQL databases.

Problem: The knowledge gap and the execution gap keep increasing.
Solution: Using a High Performance Analytics model can help solve this problem and utilize the available hardware in the most optimized way.

Problem: The multiple point-to-point circuits on any specific site have to be managed manually.
Solution: Software-Defined Networking architectures such as ECSEL, built around an Inter-Domain Controller, allow these circuits to be negotiated and managed easily with minimal effort.
With animal populations declining significantly, along with the recent catastrophe in the Amazon jungle, more and more species of animals are becoming endangered. As such, many natural habitats are being turned into conservation areas to protect the wildlife. The diagrams below show some of the protected regions around the world:
Figure 7: Protected Areas in Namibia
Figure 8: Biodiversity Index of The United States of America
Figure 9: Protected Areas around the World
Figures 7 and 8 show some promise with their initiatives to preserve wildlife; however, Figure 9 reveals a more troubling truth. None of the countries in the world are in the green, meaning none of them are nearly as protected as they are supposed to be. This makes it far more important to implement digital solutions to keep a close eye on animal and plant populations and to use the analysis methods to watch for any danger signals.
As such, different organizations such as the World Wildlife Fund (WWF) and the International Union for Conservation of Nature (IUCN) have already been engaged in doing just that. Using modern technologies, such as the Spatial Monitoring and Reporting Tool (SMART)(23), they are focused on collecting more data. As part of this, the IUCN maintains a “Red List” of threatened animals.
Figure 10: Summary of the Red List Found on the IUCN Red List Website
Taking a look at the statistics of the data collected by the IUCN since 2000 reveals an exponential growth in its collection efficiency over the last ten years. The figure below demonstrates this:
Figure 11: Increase in the number of species assessed for The IUCN Red List of Threatened
Species
From the diagram above, it is apparent that the number of species assessed since 2009 has increased significantly, marking a stark improvement in the efficiency of the data collection methods. Incidentally, 2009 was the year when the use of Big Data began to flourish, according to Eric Schmidt(21). While we also see a great increase in efficiency from 2002 to 2004, growth more or less stalled after that. That is because the data collected back in 2000 was more or less an estimation of the animal population using the mark-recapture method, and conservation purposes required a more detailed record(20). Since 2004, complete counts have been conducted by biologists in the field and collected by the IUCN from different sources, as seen in Table 1a published by the IUCN(24).
Figure 12 (Data Analyzed and Collected by IUCN using SMART) shows the real-life data collected and analyzed using the automated tool SMART, which automatically counts the extant species and their proportion of endangerment. Here, EW refers to “Extinct in the Wild”, denoted in purple, CR is Critically Endangered, EN is Endangered and VU is Vulnerable. What this research is more concerned about is the gray area, denoted as DD or “Data Deficient”. Most of these data are still not available, mostly due to a lack of field work. These deficiencies can be reduced with further improvements to the modern technologies, a bigger budget for field work, and employing more of the modern solutions until the improved solutions arrive(22).
The research was conducted solely through the review of different approved sources of information and some of the available statistics, as mentioned in the research methodology. However, a more thorough research effort would require first-hand experimentation using the said technologies, in order to provide a better perspective on the subject.
However, due to constraints on budget and time and a lack of access to the proper geographic locations, the research has been limited to literature reviews and statistics only. Even with such limitations, thanks to the evidence-based approach, the information in the thesis can be deemed accurate due to the validity of the source materials. Nevertheless, the constraints of the research and the limitations of its scope needed to be mentioned in the thesis for the purpose of clarity. Should a researcher be willing to refer to this guide in order to conduct his/her own research, he/she would need to conduct first-hand experimentation in order to obtain first-hand data.
Below is a framework matrix showing the steps and techniques that individuals or groups of specialists in a field can follow in order to operate a conservation project using Big Data and AI:
Specialist: Software Engineers
Tools Used: Camera traps and AVEDS, networking tools, analysis software and hardware.
Activities: Set up the whole system for Big Data mining from the field, including setting up the cloud networking services if any other project sites are to be collaborated with. Provide the necessary training to the Forest Rangers, Biologists and Zoologists in using the software and setting up the devices.
Results: A software engineer would hold the key responsibilities before starting a project. He/she will be responsible for training the Forest Rangers and Zoologists on how to set up the cameras and AVEDS, and will then connect all the devices in a network to set up a data mining rig from these devices. Once the whole project has been set up, he/she will also be in charge of maintaining and monitoring the system and troubleshooting any problems that might occur. In the end, the Software Engineer will aid the conservationists and Biologists by training them in the use of the analysis software, should they require it.

Specialist: Zoologists
Tools Used: Camera traps and AVEDS, monitoring hardware, communication hardware.
Activities: Set up camera traps in the required regions. Capture animals and attach AVEDS to them before setting them back into the wild. Monitor the collected data from the field. Perform the manual tasks on the data, including but not limited to counting animal populations, screening repetitive data and communicating with different specialists. Contact and communicate with the respective authorities if any anomalies are found among the data, including animal casualties, abnormal behaviour or abnormal travelling patterns.
Results: Zoologists would take the necessary training from the Software Engineers if needed and then, with the help of Forest Rangers, choose suitable candidates for the AVEDS placements. They would capture the animals and attach AVEDS to them before returning them to the wild. They would also set up camera traps, with information on the terrain and region from the Forest Rangers as well as with their protection. Once everything has been set up and the data starts flowing into the servers, Zoologists would monitor the data and count animal populations, making entries in the database for the Wildlife Biologists' analysis. They would observe animal behavioural patterns and keep an eye out for any abnormalities. Any animal casualties due to natural causes or because of predators would be noted and the database updated by the Zoologists.

Table 6: Framework Matrix for conducting successful Animal Conservation Projects using Big Data and AI
Figure 13: Evaluation Framework Matrix for Animal Conservation Efforts Using Big Data and AI
Below is a table showing different processes and tools and how their efficiency can be measured and evaluated by the users of a system, given that the setup is complete and the system is up and running:
Work Process/Tools: Data Collection
Methods/Techniques: Raw Camera Trap Images
Evaluation: The images captured by the camera traps are automatically stored in the databases, in case any manual supervision needs to be conducted.

Work Process/Tools: Data Collection
Methods/Techniques: Processed Camera Trap Images
Evaluation: The captured images are processed using AI, and the collected data is stored in the database in a text format that should be easily readable and understandable by the users.

Work Process/Tools: Data Collection
Methods/Techniques: Geospatial Data from GPS
Evaluation: The GPS systems should collect and keep track of the geospatial data: for example, a specific camera at a specific Area of Operations should record its geographic location in addition to the time when a certain picture was taken.

Work Process/Tools: Automatic Processing
Methods/Techniques: Individual Identification
Evaluation: Artificial Intelligence should be able to recognize individual animals and thus avoid taking redundant photographs of the same individuals. However, since these technologies are at the prototype stage, a margin of error should be acceptable.

Work Process/Tools: Automatic Processing
Methods/Techniques: Grouping by Geolocation
Evaluation: Animal groups could be identified with respect to their herds, individual families and/or even their individual habitats. For example, one Area of Operations can have 3 different families of tigers with a population of 12; however, each individual can be identified as part of its specific family and not be associated with other families, using the geospatial data of their habitats. This way, the biologists can determine whether a casualty was caused by rival tigers or by poachers.

Work Process/Tools: Automatic Processing
Methods/Techniques: DISSECT
Evaluation: With DISSECT's core algorithms modified to monitor animal densities, it can be used to effectively collect data from the different devices.
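As a minimal sketch of the "Grouping by Geolocation" row above, the snippet below assigns each detection to a coarse geographic cell so that individuals seen in the same habitat can be grouped together. The detection records and the cell size are invented for illustration.

```python
# A minimal sketch (assumption) of grouping detections by geolocation: assign each
# detection to a coarse grid cell so individuals in the same habitat are grouped.
from collections import defaultdict

detections = [
    {"individual": "tiger_A", "lat": -19.231, "lon": 16.912},
    {"individual": "tiger_B", "lat": -19.233, "lon": 16.910},
    {"individual": "tiger_C", "lat": -19.480, "lon": 17.205},
]

def grid_cell(lat, lon, cell_deg=0.05):
    """Round coordinates down to a grid cell roughly a few kilometres across."""
    return (round(lat // cell_deg * cell_deg, 3), round(lon // cell_deg * cell_deg, 3))

groups = defaultdict(set)
for det in detections:
    groups[grid_cell(det["lat"], det["lon"])].add(det["individual"])

for cell, members in groups.items():
    print("habitat cell", cell, "->", sorted(members))
```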
CONCLUSION
The world now only needs to address the few remaining problems with errors in the collected Big Data, and to come up with viable solutions to the hardware performance issues caused by such large amounts of data, in order to reduce the cost of these systems. With these challenges out of the way, Big Data and AI could truly become the strongest of weapons against the extinction of species and help in the preservation efforts.
This thesis, in conjunction with the previous thesis titled “Aiding Wildlife with Big Data and AI”, is here to provide just the motivation and information that future researchers need to get started on addressing these issues, coming up with viable solutions to these challenges, and helping mankind take a leap towards a better environment.
REFERENCES
10. Canela-Xandri, O., Law, A., Gray, A., Woolliams, J. A. and Tenesa, A. 2015. “A new tool called DISSECT for analysing large genomic data sets using a Big Data approach”. Nature Communications, 6:10162. https://doi.org/10.1038/ncomms10162
11. Raste, K. 2014. “Big Data Analytics – HADOOP Performance Analysis” [D]. San Diego: San Diego State University.
12.http://hadoop.apache.org/
13.Ghemawat, S., Gobioff, H. and Leung, S. 2003. “The Google File System”.
Google AI, Google Corporation. https://research.google.com/archive/gfs-
sosp2003.pdf
14.http://aws.amazon.com/
15.Monga, Inder & Pouyoul, Eric & Guok, Chin. (2012). “Software-Defined
Networking for Big-Data Science - Architectural Models from Campus to the
WAN”. 1629-1635. 10.1109/SC.Companion.2012.341.
16.Rowcliffe, J. M., Field, J., Turvey, S. T. and Carbone, C. (2008), “Estimating
animal density using camera traps without the need for individual
recognition”. Journal of Applied Ecology, 45: 1228-1236. doi:10.1111/j.1365-
2664.2008.01473.x
17.https://www.researchgate.net/figure/4-A-picture-of-DeerCam-DC300-camera-
trap-The-muddy-appearance-was-result-of-wild-boar_fig4_326773887
18.Liu, Jianzheng & Li, Jie & Li, Weifeng & Wu, Jiansheng. (2015). “Rethinking
big data: A review on the data quality and usage issues”. ISPRS Journal of
Photogrammetry and Remote Sensing. 115. 10.1016/j.isprsjprs.2015.11.006.
19.Xareni P. Pacheco (December 10th 2018). How Technology Can Transform
Wildlife Conservation, Green Technologies to Improve the Environment on
Earth, Marquidia Pacheco, IntechOpen, DOI: 10.5772/intechopen.82359.
Available from: https://www.intechopen.com/books/green-technologies-to-
improve-the-environment-on-earth/how-technology-can-transform-wildlife-
conservation
20.https://projects.ncsu.edu/cals/course/fw353/Estimate.htm
21.https://datafloq.com/read/big-data-history/239
22.https://www.iucn.org/news/secretariat/201704/experts-call-more-
collaboration-and-investment-biodiversity-monitoring
23.https://smartconservationtools.org/
24. https://nc.iucnredlist.org/redlist/content/attachment_files/
2019_2_RL_Table_1a_v2.pdf
25. Winters, J., August 15, 2018. How Conservationists Are Using AI And Big
Data To Aid Wildlife. NWPB News.
26. https://www.datacamp.com/community/tutorials/r-packages-guide