Big Data Forensics - Learning Hadoop Investigations - Sample Chapter
firm. He conducts digital investigations and advises clients on complex data and
investigative issues. He has worked on some of the largest civil litigation and
corporate fraud investigations, including issues involving Ponzi schemes, stock
option backdating, and mortgage-backed security fraud. He is a member of the
Association of Certified Fraud Examiners and the Sedona Conference.
Preface
Forensics is an important topic for law enforcement, civil litigators, corporate
investigators, academics, and other professionals who deal with complex digital
investigations. Digital forensics has played a major role in some of the largest
criminal and civil investigations of the past two decades, most notably the Enron
investigation in the early 2000s. Forensics has been used in many different situations.
From criminal cases, to civil litigation, to organization-initiated internal investigations,
digital forensics is the way data becomes evidence (sometimes the most important
evidence), and that evidence is how many types of modern investigations are solved.
The increased usage of Big Data solutions, such as Hadoop, has required new
approaches to how forensics is conducted, and with the rise in popularity of Big Data
across a wide number of organizations, forensic investigators need to understand how
to work with these solutions. The number of organizations that have implemented Big Data solutions has surged in the past decade. These systems house critical data that can provide information on an organization's operations and strategies, which are key areas of interest in different types of investigations. Hadoop has been the most popular of the Big Data solutions, and its distributed architecture, in-memory data storage, and voluminous data storage capabilities present new challenges to forensic investigators.
A new area within forensics, called Big Data forensics, focuses on the forensics of Big
Data systems. These systems are unique in their scale, how they store data, and the
practical limitations that can prevent an investigator from using traditional forensic
means. The field of digital forensics has expanded from primarily dealing with desktop
computers and servers to include mobile devices, tablets, and large-scale data systems.
Forensic investigators have kept pace with the changes in technologies by utilizing
new techniques, software, and hardware to collect, preserve, and analyze digital
evidence. Big Data solutions, likewise, require different approaches to analyze the
collected data.
In this book, the processes, tools, and techniques for performing a forensic
investigation of Hadoop are described and explored in detail. Many of the concepts
covered in this book can be applied to other Big Data systems, not just Hadoop.
The processes for identifying and collecting forensic evidence are covered, and the
processes for analyzing the data as part of an investigation and presenting the findings
are detailed. Practical examples are given by using LightHadoop and Amazon Web
Services to develop test Hadoop environments and perform forensics against them.
By the end of the book, you will be able to work with the Hadoop command line
and forensic software packages and understand the forensic process.
Understanding Hadoop
Internals and Architecture
Hadoop is currently the most widely adopted Big Data platform, with a diverse
ecosystem of applications and data sources for forensic evidence. An Apache Foundation framework, Hadoop has been developed and tested in enterprise systems as a Big Data solution. Hadoop is virtually synonymous
with Big Data and has become the de facto standard in the industry.
As a new Big Data solution, Hadoop has experienced a high adoption rate by many
types of organizations and users. Developed by Yahoo! in the mid-2000s and released to the Apache Foundation as one of the first major open source Big Data frameworks, Hadoop is designed to enable the distributed processing of large,
complex data sets across a set of clustered computers. Hadoop's distributed
architecture and open source ecosystem of software packages make it ideal for speed,
scalability, and flexibility. Hadoop's adoption by large-scale technology companies is
well publicized, and many other types of organizations and users have come to adopt
Hadoop as well. These include scientific researchers, healthcare corporations, and
data-driven marketing firms. Understanding how Hadoop works and how to perform
forensics on Hadoop enables investigators to apply that same understanding to
other Big Data solutions, such as PyTables.
Performing Big Data forensic investigations requires knowledge of Hadoop's internals
and architecture. Just as knowing how the NTFS filesystem works is important for
performing forensics in Windows, knowing the layers within a Hadoop solution is
vital for properly identifying, collecting, and analyzing evidence in Hadoop. Moreover,
Hadoop is rapidly changing; new software packages are being added and updates
to Hadoop are being applied on a regular basis. Having a foundational knowledge
of Hadoop's architecture and how it functions will enable an investigator to perform
forensics on Hadoop as it continues to expand and evolve.
With its own filesystem, databases, and application layers, Hadoop can store data
(that is, evidence) in various forms and in different locations. Hadoop's multilayer
architecture runs on top of the host operating system, which means evidence may need
to be collected from the host operating system or from within the Hadoop ecosystem.
Evidence can reside in each of the layers. This may require performing forensic
collection and analysis in a manner specific to each layer.
This chapter explores how Hadoop works. The following topics are covered in detail:
Hadoop's architecture, files, and data input/output (I/O). This is done to provide
an understanding of the technical underpinnings of Hadoop. The key components
of the Hadoop forensic evidence ecosystem are mapped out, and how to locate
evidence within a Hadoop solution is covered. Finally, this chapter concludes with
instructions on how to set up and run LightHadoop and Amazon Web Services. These
are introduced as the Hadoop instances that serve as the basis for the examples used
in this book. If you are interested in performing forensic investigations, you should
follow the instructions on how to install LightHadoop and set up an Amazon Web
Services instance at the end of this chapter. These systems are necessary to follow
the examples presented throughout this book.
The Hadoop layers are an abstraction for how the various components are organized
and how those components relate to one another. The following are the various
Hadoop layers:
The Operating System layer: The first layer is the Operating System on the
host machine. Hadoop is installed on top of the operating system and runs
the same regardless of the host operating system (for example, Windows
or Linux).
The Hadoop layer: This is the base installation of Hadoop, which includes
the file system and MapReduce components.
The DBMS layer: On top of Hadoop, the various Hadoop DBMS and related
applications are installed. Typically, Hadoop installations include a data
warehousing or database package, such as Hive or HBase.
The Application layer: The Application layer is the top layer, which includes
the tools that provide data management, analysis, and other capabilities.
Some tools, such as Pig, can interact directly with the operating system and
Hadoop layers. Other tools only interact with the database layer or other
application-layer tools.
Hadoop uses the Hadoop Distributed File System (HDFS) to logically store data for use by Hadoop's applications. HDFS is designed to store data on commodity storage hardware in a distributed fashion. The NameNode controls the tasks of storing and managing the data across each of the DataNodes. When data is stored in Hadoop, the NameNode automatically stores and replicates the data in multiple blocks (64 MB or 128 MB by default) across the various DataNodes. This is done to ensure fault tolerance and high availability. HDFS is covered in more detail in the next section.
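For example, on a Hadoop 2.x node, the configured block size and replication factor can be confirmed from the command line with the hdfs getconf utility (the property names shown are the Hadoop 2.x names):
$ hdfs getconf -confKey dfs.blocksize      # block size, in bytes
$ hdfs getconf -confKey dfs.replication    # replication factor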
MapReduce is a key concept and framework for how Hadoop processes data.
Using Hadoop's distributed processing model, MapReduce enables large jobs to
be divided into Map() procedures and Reduce() procedures. Map() procedures
are filtering and sorting operations, whereas Reduce() procedures are summary
operations (for example, summation or counting). A single query can be divided
into Map() and Reduce() procedures, with a Master Node distributing the tasks to each of the Slave Nodes. The Slave Nodes perform their discrete tasks and transmit the results back to the Master Node for compilation and reporting.
The first step of MapReduce is to run a Map() function on the initial data. This
creates data subsets that can be distributed to one or more nodes for processing. In
this example, the data consists of information about widget sales quantity and price
information, with each node receiving information about one widget. Each node that
receives a record performs an operation on the record. In this case, the nodes calculate
the total sales amounts. Finally, the Reduce() function computes the total sales
amount for all widgets.
MapReduce programs can be written and executed in a number of different ways.
First, programs can be written natively in Java using the org.apache.hadoop.mapred
library. A MapReduce program is compiled using a Java compiler; it is then run in
Hadoop using the Java runtime. Alternatively, additional Hadoop packages offer
abstractions of MapReduce that can implement the Map() and Reduce() functions
without using Java (for example, Pig).
For more information about programming in MapReduce, visit http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.
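As a rough illustration of that workflow, a natively written MapReduce program is compiled against the Hadoop libraries, packaged into a JAR, and then submitted to the cluster; the class and path names below are hypothetical placeholders:
$ javac -classpath "$(hadoop classpath)" WidgetSales.java
$ jar cf widget-sales.jar WidgetSales*.class
$ hadoop jar widget-sales.jar WidgetSales /input/sales /output/sales_totals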
The layers above the Hadoop layer are the add-on functionality for process and
resource management. These layers store, retrieve, convert, and analyze data. The
following table provides examples of tools found in these layers:
Tool     Description
HBase    This is a column-based data warehouse for high-speed execution of operations over large data sets.
Hive     This is a data warehouse that offers SQL-like access to data in HDFS.
Sqoop    This is a data transfer tool for moving data to and from relational database systems.
Pig      This is the framework for executing MapReduce on HDFS data using its own scripting language.
Flume    This harvests, aggregates, and moves large amounts of log data in and out of Hadoop.
Data is imported into HDFS and then stored in blocks for distributed storage. Files
and data can be imported into HDFS in a number of ways, but all data stored in
HDFS is split into a series of blocks. The blocks are split by size only. A file may
contain record information, and the splits may occur within an individual record
if that record spans a block size boundary. By default, blocks are 64 MB or 128 MB,
but the size can be set to a different number by a system administrator. Hadoop is
designed to work with terabytes and petabytes of data. The metadata about each block is stored centrally on a server, so Hadoop cannot afford to track metadata for small 4 KB blocks. Thus, Hadoop's block size is significantly larger than the block size in a traditional filesystem.
After the data has been split, it is stored in a number of DataNodes. By default, the replication level is set to three DataNodes per block, but that setting can also be changed by a system administrator. Mapping information indicating where the data blocks are stored, along with other metadata, is contained in the NameNode, which is located on the Master Node. The following figure illustrates this process:
Files are stored logically in HDFS, and they can be accessed through HDFS just like
a file in any other filesystem. Files may be stored in data blocks across a number of DataNodes, but the files still retain their filenames and can be accessed in a number of ways. The NameNode stores the information necessary to perform a lookup on a
filename, identifies where the various blocks reside that comprise the file, and
provides file-level security. When a file request is made in HDFS, Hadoop retrieves
data blocks and provides access to the data as a file.
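When an investigator needs to see how a logical file maps to physical blocks and DataNodes, the HDFS fsck utility can report the block list and block locations for a path (the path shown is illustrative):
$ hdfs fsck /user/hadoop/file1 -files -blocks -locations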
Once stored in HDFS, files can be accessed through a number of mechanisms. Files
can be accessed via the Hadoop shell command line. The standard ways to locate
and access files through the command line are the ls and cp commands, which are
available through Hadoop. For example, the following commands can be executed
to perform a folder listing and a file copy for HDFS data, respectively:
$ hdfs dfs -ls /user/hadoop/file1
$ hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
Files can also be accessed through the HDFS web interface. The HDFS web interface
provides information about the status of a Hadoop cluster. The interface enables
browsing through directories and files in HDFS. By default, the web interface can
be accessed at http://namenode-name:50070/.
These commands are possible because of the way information is stored in HDFS.
Whether the Hadoop cluster is a single node or distributed across multiple nodes,
the files are logically accessible in the same manner.
Information stored in NameNode is stored in memory, but it is also written to
the filesystem for storage and disaster recovery. The in-memory data stored by
NameNode is the active information used to locate data blocks and pull metadata.
Because NameNode can have issues, or may need to be rebooted, the filesystem
information stored in memory is also written to two files: fsimage and edits. The
fsimage file is a recent checkpoint of the data stored in the memory of NameNode.
The fsimage file is a complete backup of the contents and is sufficient to bring
NameNode back online in the event of a system restart or failure. The edits file
stores all changes from the last fsimage checkpoint process. This is similar to a
database that utilizes a differential backup. The NameNode does not use these files except when it is started, at which point the contents of the files are brought back into memory by restoring the fsimage data and then applying all updates from the edits file in sequential order.
The fsimage file is similar in filesystem structure to a Windows File Allocation
Table (FAT). The file stores information about pointers to file locations; file locations
are called inodes. Each inode has associated metadata about the file, including the
number of blocks, permissions, modification and access times, and user and group
ownership. The fsimage file can be useful in a forensic investigation when questions
arise about metadata. The fsimage file is covered in more detail in later chapters.
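In Hadoop 2.x, the fsimage and edits files can be examined without a running NameNode by using the Offline Image Viewer and Offline Edits Viewer; the following is a sketch, and the input filenames are illustrative:
$ hdfs oiv -i fsimage_0000000000000012345 -o fsimage.xml -p XML
$ hdfs oev -i edits_0000000000000012346-0000000000000012400 -o edits.xml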
For example, if the Hadoop system is offline and cannot be brought back online, the
Hadoop nodes need to be identified in order to collect data from each node.
Hadoop stores its configuration settings in a set of XML files. The following table lists the primary configuration files:

File                   Description
hadoop-default.xml     This contains the general default system variables and data locations
hadoop-site.xml        This contains the site-specific version of hadoop-default.xml
mapred-default.xml     This contains the MapReduce parameters
job.xml                This contains the configuration parameters for an individual MapReduce job
These configuration files contain information that may be of value to a forensic investigation, such as the locations of data directories and the addresses of the cluster's nodes.
Hadoop daemons
Hadoop daemons are the processes that provide the core Hadoop functionality, such as the NameNode and DataNode services. They run in the background and form the backbone of Hadoop's operations, similar to the daemons that provide operating system-level and other functionality within Linux and other Unix variants.
Hadoop runs several daemons in the host operating system's Java Virtual Machine
(JVM). The primary daemons are:
NameNode
DataNode
SecondaryNameNode
JobTracker
TaskTracker
The daemons run as processes in the host operating system, so the status of the
daemons can be monitored from the host operating system, not only within Hadoop.
Because Hadoop is a Java-based system, the daemons are written in Java and the
tool jps can be used to test whether there are active daemons. jps is the Java Virtual
Machine Process Status Tool and it can be run from any host operating system with
Java installed. If Hadoop is running, the jps output will list the active Hadoop daemons. This is an excellent tool for investigators to use when working with
a system suspected of running Hadoop. The following is an example of running jps
and its output:
$ jps
The response from jps shows the process identifier (pid) and process name as
follows:
1986 Jps
1359 ResourceManager
1223 RunJar
1353 NodeManager
1383 JobHistoryServer
1346 DataNode
1345 NameNode
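Because the daemons are ordinary operating system processes, they can also be checked from the host side without touching Hadoop itself, for example:
$ jps -l                          # Java processes with their full class names
$ ps aux | grep -i '[n]amenode'   # NameNode process, if any, at the OS level
$ ps aux | grep -i '[d]atanode'   # DataNode process, if any, at the OS level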
Hive
Hive is a data warehousing solution developed to store and manage large volumes
of data. It offers an SQL-like language for analysis. Hive is a general purpose system
that can be scaled to extremely large data sets. As a data warehousing system, data is
imported into Hive data stores that can be accessed via an SQL-like query language
called HiveQL.
The Hive service is the engine that manages the data storage and query operations.
Hive queries are passed through the service, converted into jobs, and then executed
with the results returned to the query interface. Hive stores two types of data: table
data and metadata. Table data is stored in HDFS, and the metadata indicating where
the partitions and data tables are stored is located in the Hive metastore. The metastore
is a service and storage component that connects to a relational database (for example,
MySQL or Oracle) for storage of the metadata. This enables Hive to retrieve data and
table structure information. The following figure shows an overview of the
Hive environment:
Depending on the data volume, Hive data is stored in the local HDFS filesystem.
By default, data is stored in the /user/hive/warehouse directory. Hive can
be configured to store data in other locations by way of modifying the
hive-default.xml file's hive.metastore.warehouse.dir variable.
The following Hive query loads data to a new Hive table:
LOAD DATA LOCAL INPATH '/home/data/import.txt'
OVERWRITE INTO TABLE sampletable
This query imports the records from the import.txt file into a Hive table named
sampletable. Since the default data location is /user/hive/warehouse, the data
is stored in a new directory called sampletable. The metastore is also updated
with metadata related to the new table and data location. The following Hadoop
command shows the imported file:
$ hadoop fs -ls /user/hive/warehouse/sampletable/
This example, however, only shows how Hive stores data when the local HDFS is
used. Other options exist, so investigators should be aware that data can be stored
in other locations. Hive table data can be stored in remote locations, such as cloud
storage as well as on local nodes. Likewise, the metastore and its database can either
be on the local machine or a remote machine. If the metastore is a required piece of
an investigation, the location of the metastore should be identified.
Hive provides logging for critical events and errors. By default, Hive logs errors
to /tmp/$USER/hive.log. The error log location can be specified for a different
directory in the Hive log configuration file conf/hive-log4j.properties. The
primary configuration file for Hive is the hive-default.xml file.
The alternative to searching all of these additional sources in an investigation is to
extract data from Hive via queries. With the potential for multiple remote systems,
a large metastore, and various system configuration and log files, a simpler solution
to extract the data is required. This can be done by running HiveQL queries to retrieve
the contents from all tables and store the results in flat files. This approach enables the
investigator to retrieve the entire set of contents from Hive; it is useful when metadata
or questions about data removal are not relevant.
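A minimal sketch of that approach, using the sampletable example from earlier (the output filenames are arbitrary), is to run the Hive CLI in silent mode and redirect the query results to flat files:
$ hive -S -e "SHOW TABLES;" > hive_tables.txt
$ hive -S -e "SELECT * FROM sampletable;" > sampletable_export.tsv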
HBase
HBase is currently the most popular NoSQL database for Hadoop. HBase is a
column-oriented, distributed database that is built on top of HDFS. This database is
commonly used for large-scale analysis across sparsely-populated datasets. HBase
does not support SQL, and data is organized by columns instead of the familiar
relational sets of tables.
HBase's data model is unique and requires understanding before data is collected
by an investigator. HBase makes use of the following concepts:
Table: HBase organizes data into tables, with each table having a
unique name.
Row: Data is stored in rows, and each row is identified by its unique row key.
Column Family: Columns within a row are grouped into column families, which are defined as part of the table schema.
Column Qualifier: The individual row columns are specified by the column
qualifier. In the previous example, location:city is part of the location
column family and its qualifier is city.
Cell: The unique identification of a value within a row is a cell. Cells are
identified by a combination of the table, row key, column family, and
column qualifier.
The following figure shows a sample set of data within HBase. The table contains
two column families: name and location. Each of the families has two qualifiers. A
combination of the unique row key, column family, and column qualifier represents
a cell. For example, the cell value for row key 00001 + name:first is John:
HBase stores all column family members together in HDFS. HBase is considered
a column-oriented database, but the physical storage is actually performed by
grouping columns and storing those together. Because of this storage methodology,
column families are expected to have similar data size and content characteristics
to enable faster sorting and analysis.
Tables are partitioned horizontally into sections of fixed-size chunks called regions.
When a table is first created, the entire contents of the table are stored in a single
region. As the number of rows reaches a certain size threshold, a new region is
created for the additional rows. The new region is typically stored on a separate
machine, enabling the data to scale without compromising the speed of storage
and analysis.
HBase utilizes a set of servers and a database log file for running its distributed
database. The region servers store the data contents and are the data analysis
engines. Each region server has HFile data and a memstore. The region servers
share a write-ahead log (WAL) that stores all changes to the data, primarily for
disaster recovery. Each HBase instance has a master server, which is responsible
for assigning regions to region servers, recovering from region server failure,
and bootstrapping. Unlike the MapReduce process, master servers do not control
operations for analysis. Large-scale HBase instances typically have a backup master
server for failover purposes.
HBase also uses and depends on a tool called ZooKeeper to maintain the HBase
cluster. ZooKeeper is a software package used for the maintenance of configuration
information and performing synchronization across distributed servers. At a
minimum, HBase uses a ZooKeeper leader server and a ZooKeeper follower server to
assign tasks to HBase nodes and track progress. These servers also provide disaster
recovery services.
The following figure highlights the configuration of an HBase and ZooKeeper
environment:
The data file format used by HBase is HFile. The files are written in 64 KB blocks by
default. HFile blocks are not to be confused with HDFS blocks. HFiles are divided
into four regions as follows:
Scanned Block Section: The data content (that is, key and value pairs) and
pointer information that is scanned; multiple data and leaf index blocks
can be stored in this section.
Non-Scanned Block Section: The meta information that is not scanned; multiple blocks can be stored in this section.
Load-on-open Section: The index and file metadata blocks that are loaded into memory when the HFile is opened.
Trailer: A fixed-size section that stores the offsets of the other sections and summary information about the file.
The following file layout figure shows the structure of an HFile that is stored on a
region server:
Like Hive and other Hadoop applications, the HBase settings can be found in
its configuration files. The two configuration files are hbase-default.xml and
hbase-site.xml. By default, hbase-site.xml contains information about where
HBase and ZooKeeper write data.
HBase data can be accessed in a number of ways. The following is a list of means
by which HBase data can be accessed:
Java program: HBase is a Java-based database that has its own object library that can be implemented in custom Java programs for querying
HBase shell: The interactive command-line shell that ships with HBase and supports scans, gets, and administrative commands
Extracting data from HBase requires using one of these methods. This makes data
collection more difficult, but the alternative is to identify all region servers and use
configuration files and knowledge of HFiles to carve out the relevant data. These and
other HBase data collection and analysis issues are covered in later chapters.
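As one illustration, the HBase shell can be scripted from the command line to enumerate tables and preview their contents; the table name here is hypothetical:
$ echo "list" | hbase shell
$ echo "scan 'sales_data', {LIMIT => 10}" | hbase shell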
Pig
Pig is a tool that creates an abstraction layer on top of MapReduce to enable simpler
and faster analysis. Pig is a scripting language designed to facilitate query-like data
operations that can be executed with just several lines of code. Native MapReduce
applications written in Java are effective and powerful tools, but developing and testing those applications is time-consuming and complex. Pig solves this problem
by offering a simpler development and testing process that takes advantage of the
power of MapReduce, without the need to build large Java applications. Whereas
Java programs may require 50-100 lines, Pig scripts often have ten lines of code or
less. Pig comprises two elements, as follows:
A scripting language, called Pig Latin, in which the data operations are expressed
An execution environment that runs Pig scripts against Hadoop data sets
Pig is not a database or a data storage tool. Unlike HBase, Pig does not require data
to be loaded into a data repository. Pig can read data directly from HDFS at script
runtime, which makes Pig very flexible and useful for analyzing data across HDFS
in real time.
Pig scripts typically have a .pig extension. If the Pig scripts
may be relevant or useful, investigators should collect the
scripts to help understand the data and how that data was
analyzed on the source system.
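The following is a minimal sketch of what such a script might look like, using the widget sales example from earlier in this chapter; the file paths and field names are hypothetical:
$ cat > sales_totals.pig <<'EOF'
-- Hypothetical example: total sales per widget from a CSV file already in HDFS
sales   = LOAD '/user/hadoop/sales.csv' USING PigStorage(',')
          AS (widget:chararray, qty:int, price:double);
amounts = FOREACH sales GENERATE widget, qty * price AS amount;
grouped = GROUP amounts BY widget;
totals  = FOREACH grouped GENERATE group AS widget, SUM(amounts.amount) AS total;
STORE totals INTO '/user/hadoop/sales_totals';
EOF
$ pig sales_totals.pig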
File permissions
HDFS uses a standard file permission approach. The three types of permissions for files and directories are:
read (r)
write (w)
execute (x)
Each file and directory has an associated owner, group, and mode. The owner is the account that created or was assigned the file or directory, and the group is the group associated with that account. The mode is the list of permissions for the owner, the members of the group, and all others (that is, users who are neither the owner nor members of the group). There are also superuser accounts in HDFS, and superuser accounts can access any file or directory, regardless of permissions.
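Permissions, owners, and groups can be reviewed directly from the Hadoop command line; the chown and chmod commands below show how an administrator would set them (the group name is hypothetical):
$ hdfs dfs -ls /user/hive/warehouse                        # mode, owner, and group for each entry
$ hdfs dfs -chown hive:analysts /user/hive/warehouse/sampletable
$ hdfs dfs -chmod 750 /user/hive/warehouse/sampletable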
File permissions in HDFS are not as useful for determining the actual people and login locations behind an action as they are in traditional operating systems. Client
accounts in Hadoop run under process accounts. So rather than each individual
having a login to the Hadoop instance, the clients access HDFS via an application
that has its own account. For example, an HBase client has an associated account,
HBase, by default and that account would be the one running analysis. While tools
such as ZooKeeper provide Access Control Lists (ACLs) to manage such community
accounts, one can see that having processes that act as user accounts can create
difficulties for identifying which person or location performed specific actions.
Some Hadoop packages contain access control mechanisms that
enable more granular user access control. HBase, for example,
has an Access Controller coprocessor that can be added to the
hbase-site.xml configuration file to control which users can
access individual tables or perform specific HBase actions. The
ACL is stored in the HBase table _acl_.
Trash
Hadoop has a trash feature that stores deleted files for a specific amount of time.
All Hadoop users have a .Trash folder, where deleted files are stored. When a file
is deleted in Hadoop, a subdirectory mirroring the original file path is created under the .Trash folder in the user's $HOME directory, and the file is moved there. All files stored in trash are permanently deleted when one of the following events happens:
The periodic trash deletion process is run by Hadoop. This occurs after a fixed, user-configured amount of time.
A user runs an expunge job. This can be performed from the Hadoop
command line as follows:
% hadoop fs -expunge
Files are only moved to the trash when deleted by a user from the
Hadoop command line. Files deleted programmatically bypass the
trash and are permanently deleted immediately.
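The following commands sketch how the trash behaves in practice; the paths are illustrative:
$ hdfs dfs -ls -R /user/hadoop/.Trash/Current    # files awaiting permanent deletion
$ hdfs dfs -rm -skipTrash /user/hadoop/file2     # a command-line delete that bypasses the trash
$ hadoop fs -expunge                             # forces the trash checkpoint and purge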
Log files
Log files are valuable sources of forensic evidence. They store information about where
data was stored, where data inputs originated, jobs that have been run, the locations of
other nodes, and other event-based information. As in any forensic investigation, the
logs may not contain directly relevant evidence; however, the information in logs can
be useful for identifying other locations and sources of evidence.
The following types of logs can be found on machines running a Hadoop cluster:
Hadoop daemon logs: Stored in the host operating system, these .log files
contain error and warning information. By default, these log files will have
a hadoop prefix in the filename.
log4j: These logs store information from the log4j process. The log4j
application is an Apache logging interface that is used by many Hadoop
applications. These logs are stored in the /var/log/hadoop directory.
Job statistics: The Hadoop JobTracker creates these logs to store information
about the number of job step attempts and the job runtime for each job.
Log file retention varies across implementation and administrator settings. Some logs,
such as log4j, can grow very quickly and may only have a retention period of several
hours. Even if logs have been purged from the cluster, archived copies may exist; a best practice for many types of logs is to archive them in an offline system for diagnostics and job performance tracking.
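Log locations vary by distribution and administrator settings, so a quick survey of the host filesystem is often the first step; the paths below are common defaults rather than guarantees:
$ ls /var/log/hadoop*/ 2>/dev/null
$ sudo find / -name 'hadoop-*-*.log*' 2>/dev/null   # daemon logs follow a hadoop-<user>-<daemon>-<host>.log naming pattern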
Compression
Hadoop supports a number of compression codecs for data stored in HDFS, including the following:
bzip2
DEFLATE
gzip
LZO
LZ4
Snappy
Files compressed with DEFLATE in Hadoop have a .deflate
file extension.
While compressed files can be transmitted more easily, sending out one compressed
file to multiple nodes is not always an efficient option. Hadoop's MapReduce is
designed with a framework to enable sending out smaller jobs to multiple nodes.
Each node does not need to receive the complete data set if it is only tasked with
a subset of the data. Instead, the data should be split into subsets, with each node
receiving only the subset it needs. For this reason, compression formats whose files can be split are preferred. DEFLATE and gzip do not support splitting; bzip2 does, LZO can be made splittable by building an index, and LZ4 and Snappy are typically made splittable by storing the data in a container format, such as a SequenceFile.
A forensic investigator should be aware of split files that can be stored on node
machines. These files may require forensic collection of the individual split data
files on the various nodes to fully reconstruct the complete, original data container.
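A quick, non-authoritative way to spot compressed (and potentially split) files is to search HDFS for the common codec extensions:
$ hdfs dfs -ls -R / 2>/dev/null | grep -E '\.(gz|bz2|deflate|lzo|lz4|snappy)$'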
Hadoop SequenceFile
SequenceFiles are Hadoop's persistent data structure for key-value pair data used by MapReduce functions. These files are both an input and an output format for MapReduce. They contain key-value pairs and have a defined structure. SequenceFiles are a common file format in Hadoop, and they facilitate the splitting of data during MapReduce jobs. There are three formats of SequenceFiles:
Uncompressed
Record-compressed
Block-compressed
The three formats have a common file header format. The following table lists the fields found in the file header:

Field                Description
Version              This holds SEQ4 or SEQ6, depending on the SequenceFile version
keyClassName         This holds the name of the key class
valueClassName       This holds the name of the value class
Compression          This is used for key/value pairs: 1 if compressed, 0 if uncompressed
blockCompression     This is used for key/value pair blocks: 1 if compressed, 0 if uncompressed
Compression Codec    This holds the compression codec name value
Metadata             This is user-defined metadata
Sync                 This is a marker to denote the end of the header
The header segment defines the type of SequenceFile and the summary information
for the file.
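Two simple checks can confirm that a file is a SequenceFile and reveal its contents; the path shown is illustrative:
$ hdfs dfs -cat /user/hadoop/output/part-00000 | head -c 4 | od -c   # the header begins with the bytes S E Q
$ hadoop fs -text /user/hadoop/output/part-00000 | head              # -text decodes SequenceFiles into readable key-value text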
Both uncompressed and record-compressed SequenceFiles have record and sync
blocks. The only difference between the two is that the value within the record
segment is compressed in the record-compressed format. The block-compressed
format is comprised of alternating sync and block segments. Within the block
segments, the keys and values are combined and compressed together.
The following figure illustrates the contents of each of the three SequenceFile formats:
SequenceFiles are the base data structure for several variants. MapFiles are a directory structure that contains /index and /data files. The key information is stored in /index, and the key/value pairs are stored in /data. SetFiles and ArrayFiles are MapFile variants that add functionality to the MapFile structure. Finally, BloomMapFiles are extensions of MapFiles that have a /bloom file for storing bloom filter information. All of these MapFile variants can be readily identified by the presence of these files.
Hadoop archive (HAR) files
HAR files are multiple small files stored in a single, uncompressed container file.
HDFS has an interface that enables the individual files within a HAR file to be accessed
in parallel. Similar to the TAR container file format that is common in UNIX, multiple
files are combined into a single archive. Unlike TAR, however, HAR files are designed
such that individual files can be accessed from inside the container. HAR files can be
accessed by virtually all Hadoop components and applications, such as MapReduce,
the Hadoop command line, and Pig.
While HAR files offer several advantages for Hadoop, they also have limitations.
The advantage of HAR files is the capability to access multiple small files in parallel,
which reduces the NameNode file management load. In addition, HAR files work
well in MapReduce jobs because the individual files can be accessed in parallel.
HAR files also have their disadvantages. For example, because they are permanent
structures, they cannot be modified after they are created. This means new HAR files
have to be created any time new files should be added to a HAR file. Accessing a file
within a HAR file also requires an index lookup, which adds an extra step to file access.
The HAR file format has several elements. The following three elements comprise the HAR format:
_index
_masterindex
part-* files
The file data is stored in multiple part files based on block allocation, and the content location is stored in the _masterindex element. The _index element stores the file statuses and the original directory structure.
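As a sketch (the paths and archive name are illustrative), a HAR file is created with the hadoop archive tool, and its internal layout can then be listed through the har:// scheme:
$ hadoop archive -archiveName logs.har -p /user/hadoop logs /user/hadoop/archives
$ hadoop fs -ls -R har:///user/hadoop/archives/logs.har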
Individual files from within a HAR file can be accessed via a har:// prefix. The
following command copies a file called testFile, originally stored in a directory
called testDir, from a HAR file stored on NameNode called foo to the local
filesystem:
% hadoop fs -get har://namenode/foo.har#testDir/testFile localdir
HAR files are unique to Hadoop. When forensically analyzing HAR data, investigators
should export the data from Hadoop to a local filesystem for analysis.
Data serialization
Hadoop supports several data serialization frameworks. Data serialization is
a framework for storing data in a common format for transmission to other
applications or systems. For Hadoop, data serialization is primarily used for
transmitting data for MapReduce-related tasks. The three most common data
serialization frameworks in Hadoop are:
Apache Avro
Apache Thrift
Google Protocol Buffers
Data serialization frameworks are designed to transmit data that is read and stored in
memory, but the data files used for storage and transmission can be relevant forensic
evidence. The frameworks are fairly similar in overall structure and forensic artifacts.
The forensic artifacts for all three would be the data schema file that defines the data
structure and the text- or binary-encoded data files that store the data contents.
Avro is currently the most common data serialization framework in use for Hadoop.
An .avro container file is the artifact created when data is serialized. The .avro file includes a schema, which is a plaintext (JSON) definition of the data structure; it also includes the data content in either binary or text form. For the data format, Avro supports both its
own binary encoding and JSON text-based encoding. Avro files can be extracted
either directly through Avro or through Avro's Java methods.
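For example, the avro-tools JAR that ships with Avro can dump the embedded schema and convert the records to JSON for review (the JAR version and filenames shown are illustrative):
$ java -jar avro-tools-1.7.7.jar getschema weblog_data.avro
$ java -jar avro-tools-1.7.7.jar tojson weblog_data.avro > weblog_data.json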
The following figure illustrates the Avro container file format:
To extract the full set of contents from jarTestFile, the JAR extract command can
be run from the following Java command line:
$ jar xf jarTestFile
Supporting information: This is evidence, such as log files, configuration files, and other system information, that helps identify where data is stored and how the Hadoop cluster operated.
Record evidence: This is any data that is analyzed in Hadoop, whether that is HBase data, text files for MapReduce jobs, or Pig output.
User and application evidence: This is the third form of forensic data of
interest. This evidence includes the log and configuration files, analysis
scripts, MapReduce logic, metadata, and other forms of customization and
logic that act on the data. This form of evidence is useful in investigations
when questions arise about how the data was analyzed or generated.
The following figure lists the most common form of data for each type of
forensic evidence:
The difficulty facing a forensic investigator working with a Big Data system such as Hadoop is the volume of data as well as the fact that data is stored across multiple nodes.
An investigator cannot simply image a single hard drive and expect to have all
data from that system. Instead, forensic investigators working in Hadoop need to
first identify the relevant data, and only then can the actual evidence be collected.
Some forms of data such as log and configuration files are valuable for identifying
where evidence is stored and whether data archives exist. This type of evidence is
categorized as supporting information. It is valuable in both the early and late stages
of an investigation for identifying information and piecing together how the Hadoop
cluster operated.
Record evidence is the most common form of evidence for Hadoop investigations.
In non-Big Data investigations, the files and e-mails from employees can be the most
valuable form of evidence; however, most organizations' employees do not interact
much with Hadoop. Rather, Hadoop is managed and operated by IT and data
analysis staff. The value of Hadoop is the data stored and analyzed in Hadoop. That
data is the transactional and unstructured data stored in Hadoop for analysis as well
as the outputs from the analyses. These structured and unstructured data are forms
of record evidence. The challenging aspect of Hadoop investigations is identifying
all potentially relevant sources of record evidence, as record evidence can exist in
multiple forms and in multiple applications within Hadoop.
User and application evidence is any type of evidence that shows how the system operates or the logic used to run analyses, as these directly relate to the record evidence.
In some investigations, questions arise about what was done to data or how operations
were performed. While these questions can sometimes be answered by analyzing the
record evidence, user and application evidence provides a simpler and more powerful
way to answer such questions. User and application evidence ranges from the scripts
used to import and analyze data to the configuration and log files within Hadoop.
Running Hadoop
Hadoop can be run from a number of different platforms. Hadoop can be installed
and run from a single desktop, from a distributed network of systems, or as a
cloud-based service. Investigators should be aware of the differences and versed
in the various architectures. Hadoop runs in the same manner on all three setups;
however, collecting evidence may require different steps depending on how the
data is stored. For instance, a cloud-based Hadoop server may require a different collection approach because of the lack of physical access to the servers.
This section details how to set up and run Hadoop using a free virtual machine
instance (LightHadoop) and a cloud-based service (Amazon Web Services). Both
LightHadoop and Amazon Web Services are used in the examples throughout this
book. They serve as testbed environments to highlight how Big Data forensics is
performed against different setups.
LightHadoop
Many of the examples in this book are intended to be hands-on exercises using
LightHadoop. LightHadoop is a freely distributed CentOS Hadoop virtual machine
instance. Unlike larger distributions, such as Cloudera, LightHadoop requires less hardware and fewer storage resources. This makes LightHadoop ideal for learning Hadoop on a single machine and enables one to create
many virtual machines for testing purposes without requiring large storage volumes
or multiple nodes. Due to the small virtual machine (VM) size, LightHadoop does
not include all of the major Hadoop-related Apache packages. However, it does
include the main ones required for learning about Hadoop and running a database
management system (DBMS). The following Apache packages are currently included
in LightHadoop:
Hadoop
Hive
Pig
Sqoop
4. In the EMR console, click Create cluster and follow these steps:
1. Name the cluster and set the S3 folder name. S3 is the storage folder
and must be uniquely named across AWS. The following screenshot
shows an example:
3. Under Security and Access, select the EC2 key pair just created.
4. Leave all other items with their default settings, and click
Create cluster.
5. After a few minutes, the cluster will be created. This can be accessed from
the EMR Cluster List menu once the cluster's status is Running.
The cluster can be accessed from an SSH terminal program such as PuTTY. To access
the cluster using PuTTY, follow these steps:
1. Convert the .pem key file created previously into a .ppk file.
2. Locate the instance's master public Domain Name System (DNS) in the EMR
Cluster List. The following screenshot illustrates an example of configuration:
3. Using PuTTY, provide the location of the .ppk key file, and enter the
host name as ec2-user@<Master Public DNS value>.
4. Connect with those parameters, and the EMR instance will load the
Linux session.
HDFS has built-in commands that can be used to copy data from the local filesystem into HDFS. The two commands are as follows:
hdfs dfs -put
hdfs dfs -copyFromLocal
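For example, assuming a file shared into the virtual machine (the paths are illustrative), either command copies the file into HDFS, and -ls verifies the result:
$ hdfs dfs -put /media/sf_share/sample_data.csv /user/hadoop/
$ hdfs dfs -ls /user/hadoop/sample_data.csv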
Sqoop is an Apache Foundation tool designed to transfer bulk data sets between
Hadoop and structured databases. Sqoop can either directly import data into HDFS,
or it can import data indirectly by way of a Hive store that is stored in HDFS. Sqoop
has the ability to connect to a number of different data sources, such as MySQL and
Oracle databases. Sqoop connects to the data source and then efficiently imports the
data either directly into HDFS or into HDFS via Hive.
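As a sketch of both import paths (the connection string, credentials, and table names are hypothetical):
$ sqoop import --connect jdbc:mysql://dbhost/salesdb --table transactions \
      --username analyst -P --target-dir /user/hadoop/transactions
$ sqoop import --connect jdbc:mysql://dbhost/salesdb --table transactions \
      --username analyst -P --hive-import --hive-table transactions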
The third most common method for importing data into Hadoop is the use of a
Hadoop connector application. Hadoop's Java-based design and supporting libraries
provide developers with opportunities to directly connect to Hadoop for data
management, including importing data into Hadoop. Some data providers offer
Hadoop connector applications. Google, MongoDB, and Oracle are three examples
of Hadoop connector application providers.
The fourth method is to use a Hadoop data or file manager. Several major Hadoop
distributions offer their own file and data managers that can be used to import and
export data from Hadoop. Currently, the most popular Hadoop packages that offer
this are Hortonworks and Cloudera.
Methods for exporting or extracting data from HDFS are covered in the
subsequent chapters.
To load the data into LightHadoop, access the file from the local filesystem via the
mounted drive inside of VirtualBox. Repeat steps 3 and 4, running the hdfs dfs -put command and verifying that the file was copied with the hdfs dfs -ls command.
The data is now loaded into HDFS. It can be accessed by Hadoop MapReduce and
analysis tools. In subsequent chapters, the data from this exercise is loaded into
analysis tools to demonstrate how to extract data from those tools.
Summary
This chapter covered many primary Hadoop concepts that a forensic investigator
needs to understand. Successful forensic investigations involve properly identifying
and collecting data, which requires the investigator to know how to locate the sources
of information in Hadoop as well as understand data structures and the methods for
extracting that information. Forensic investigations also involve analyzing the data that
has been collected, which in turn requires knowing how to extract information from
the Hadoop file structures.
The next chapter discusses how to identify evidence. This process involves standard
investigative skills such as conducting interviews as well as applying technical
knowledge about Hadoop to identify relevant evidence.