
UNIT II

The Design of HDFS, HDFS Concepts, Command Line Interface, Hadoop file system interfaces,
Data flow, Data Ingest with Flume and Sqoop and Hadoop archives, Hadoop I/O: Compression,
Serialization, File-Based Data structures.

HDFS

When a dataset outgrows the storage capacity of a single physical machine, it becomes
necessary to partition it across a number of separate machines. Filesystems that manage the
storage across a network of machines are called distributed filesystems.

Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed
Filesystem.

The Design of HDFS :

HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware.

Very large files:

“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes
in size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access :

HDFS is built around the idea that the most efficient data processing pattern is a write-once,
read-many-times pattern. A dataset is typically generated or copied from source, then various
analyses are performed on that dataset over time.

Commodity hardware :

Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on
clusters of commodity hardware (commonly available hardware from multiple vendors) for
which the chance of node failure across the cluster is high, at least for large clusters. HDFS is
designed to carry on working without a noticeable interruption to the user in the face of such
failure.

These are areas where HDFS is not a good fit today:

Low-latency data access :

Applications that require low-latency access to data, in the tens of milliseconds range, will not
work well with HDFS.

Lots of small files :

Since the namenode holds filesystem metadata in memory, the limit to the number of files
in a filesystem is governed by the amount of memory on the namenode.
Multiple writers, arbitrary file modifications:

Files in HDFS may be written to by a single writer. Writes are always made at the end of
the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the
file.

HDFS Concepts

Blocks:

HDFS has the concept of a block, but it is a much larger unit—64 MB by default. Files in
HDFS are broken into block-sized chunks, which are stored as independent units.

Having a block abstraction for a distributed filesystem brings several benefits.

The first benefit:

A file can be larger than any single disk in the network. There’s nothing that requires the
blocks from a file to be stored on the same disk, so they can take advantage of any of the disks
in the cluster.

Second:

Making the unit of abstraction a block rather than a file simplifies the storage subsystem. The
storage subsystem deals with blocks, simplifying storage management (since blocks are a fixed
size, it is easy to calculate how many can be stored on a given disk) and eliminating metadata
concerns.

Third:

Blocks fit well with replication for providing fault tolerance and availability. To insure against
corrupted blocks and disk and machine failure, each block is replicated to a small number of
physically separate machines (typically three).

Why Is a Block in HDFS So Large?

HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
By making a block large enough, the time to transfer the data from the disk can be made to be
significantly larger than the time to seek to the start of the block. Thus the time to transfer a
large file made of multiple blocks operates at the disk transfer rate.

A quick calculation shows that if the seek time is around 10 ms, and the transfer rate is 100
MB/s, then to make the seek time 1% of the transfer time, we need to make the block size
around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB
blocks. This figure will continue to be revised upward as transfer speeds grow with new
generations of disk drives.
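To see the block size in effect on a cluster and how a particular file has been split into blocks, commands like the following can be used (a sketch; the property name shown is the one used on current Hadoop releases, and the file path is a hypothetical example):

# Print the configured block size
hdfs getconf -confKey dfs.blocksize

# Show how a (hypothetical) file is split into blocks and where the replicas live
hdfs fsck /user/ubuntu/data/bigfile.dat -files -blocks -locations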

Namenodes and Datanodes:

An HDFS cluster has two types of node operating in a master-worker pattern: a namenode
(the master) and a number of datanodes (workers). The namenode manages the filesystem
namespace. It maintains the filesystem tree and the metadata for all the files and directories in
the tree. This information is stored persistently on the local disk in the form of two files: the
namespace image and the edit log. The namenode also knows the datanodes on which all the
blocks for a given file are located.
Apache Hadoop is designed to have a master-slave architecture:
Master: NameNode, JobTracker
Slaves: {DataNode, TaskTracker}, ….. {DataNode, TaskTracker}

HDFS is one of the primary components of a Hadoop cluster, and it is designed to have a
master-slave architecture.
Master: NameNode
Slaves: {DataNode}…..{DataNode}
- The Master (NameNode) manages the filesystem namespace operations, such as opening,
closing, and renaming files and directories; determines the mapping of blocks to
DataNodes; and regulates access to files by clients.
- The Slaves (DataNodes) are responsible for serving read and write requests from the
filesystem’s clients, and they perform block creation, deletion, and replication upon
instruction from the Master (NameNode).

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they
are told to (by clients or the namenode), and they report back to the namenode periodically
with lists of blocks that they are storing.
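The datanodes and the storage they report can be inspected from the command line; for example (a sketch, assuming administrative access to the cluster):

# List the datanodes in the cluster together with their capacity, usage, and last report
hdfs dfsadmin -report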

NameNode failure: if the machine running the namenode failed, all the files on the filesystem
would be lost since there would be no way of knowing how to reconstruct the files from the
blocks on the datanodes.

What precautions does HDFS take to recover the filesystem in case of namenode failure?
The first way is to back up the files that make up the persistent state of the filesystem metadata.
Hadoop can be configured so that the namenode writes its persistent state to multiple
filesystems. These writes are synchronous and atomic. The usual configuration choice is
to write to local disk as well as a remote NFS mount.
Second way:
It is also possible to run a secondary namenode, which, despite its name, does not act as a
namenode. Its main role is to periodically merge the namespace image with the edit log to
prevent the edit log from becoming too large. Because it keeps a copy of the merged namespace
image, it can be pressed into service as the primary namenode if the primary fails.
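As a sketch of the first approach, an administrator can also force a consistent on-disk copy of the namenode's metadata before taking a backup (this requires putting HDFS into safe mode first):

# Temporarily block writes, persist the current namespace to disk, then resume
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave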

HDFS Federation :
The namenode keeps a reference to every file and block in the filesystem in memory, which
means that on very large clusters with many files, memory becomes the limiting factor for
scaling.
HDFS Federation, introduced in the 0.23 release series, allows a cluster to scale by adding
namenodes, each of which manages a portion of the filesystem namespace. For example, one
namenode might manage all the files rooted under /user, say, and a second namenode might
handle files under /share.
The namespace volumes managed by each namenode are independent of each other, which
means namenodes do not communicate with one another, and furthermore the failure of one
namenode does not affect the availability of the namespaces managed by other namenodes.
Block pool storage is not partitioned, however, so datanodes register with each namenode in
the cluster and store blocks from multiple block pools.

HDFS High-Availability:
The namenode is still a single point of failure (SPOF), since if it did fail, all clients—
including MapReduce jobs—would be unable to read, write, or list files, because the namenode
is the sole repository of the metadata and the file-to-block mapping. In such an event the whole
Hadoop system would effectively be out of service until a new namenode could be brought
online.
To recover from a failed namenode in this situation, an administrator starts a new primary
namenode with one of the filesystem metadata replicas, and configures datanodes and
clients to use this new namenode.
The new namenode is not able to serve requests until it has i) loaded its namespace image
into memory, ii) replayed its edit log, and iii) received enough block reports from the
datanodes to leave safe mode. On large clusters with many files and blocks, the time it takes
for a namenode to start from cold can be 30 minutes or more.

The 0.23 release series of Hadoop remedies this situation by adding support for HDFS
high-availability (HA). In this implementation there is a pair of namenodes in an active-standby
configuration. In the event of the failure of the active namenode, the standby takes over its
duties to continue servicing client requests without a significant interruption.
A few architectural changes are needed to allow this to happen:
• The namenodes must use highly-available shared storage to share the edit log.
• Datanodes must send block reports to both namenodes since the block mappings are
stored in a namenode’s memory, and not on disk.
• Clients must be configured to handle namenode failover, which uses a mechanism
that is transparent to users.
Failover and fencing:
The transition from the active namenode to the standby is managed by a new entity in the
system called the failover controller. Failover controllers are pluggable, but the first
implementation uses ZooKeeper to ensure that only one namenode is active.

Failover may also be initiated manually by an administrator, in the case of routine
maintenance, for example. This is known as a graceful failover, since the failover controller
arranges an orderly transition for both namenodes to switch roles.
In the case of an ungraceful failover, the HA implementation goes to great lengths to ensure
that the previously active namenode is prevented from doing any damage and causing
corruption—a method known as fencing.
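On an HA-enabled cluster, the state of each namenode can be checked and a graceful failover triggered from the command line; a sketch (the namenode IDs nn1 and nn2 are hypothetical and come from the cluster configuration):

# Check which namenode is currently active and which is standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Initiate a graceful failover from nn1 to nn2 (for example, before maintenance on nn1)
hdfs haadmin -failover nn1 nn2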

Command Line Interface

• The HDFS can be manipulated through a Java API or through a command-line interface.
• The File System (FS) shell includes various shell-like commands that directly interact
with the Hadoop Distributed File System (HDFS) as well as other file systems that
Hadoop supports.
• Below are the commands supported :
• appendToFile: Append the content of one or more local files to a file in HDFS.
• cat: Copies source paths to stdout.
• checksum: Returns the checksum information of a file.
• chgrp : Change group association of files. The user must be the owner of files, or else a
super-user.
• chmod : Change the permissions of files. The user must be the owner of the file, or else
a super-user.
• chown: Change the owner of files. The user must be a super-user.
• copyFromLocal: Copy files from the local filesystem (for example, a folder on the edge
node) to HDFS; similar to put, except that the source must be a local file reference.
• copyToLocal: Copy files from HDFS to the local filesystem; similar to get, except that
the destination must be a local file reference.
• count: Count the number of directories, files and bytes under the paths that match the
specified file pattern.
• cp: Copy files from source to destination. This command allows multiple sources as
well in which case the destination must be a directory.
• createSnapshot: HDFS Snapshots are read-only point-in-time copies of the file system.
Snapshots can be taken on a subtree of the file system or the entire file system. Some
common use cases of snapshots are data backup, protection against user errors and
disaster recovery.
• deleteSnapshot: Delete a snapshot from a snapshot table directory. This operation
requires the owner privilege of the snapshottable directory.
• df: Displays free space
• du: Displays sizes of files and directories contained in the given directory, or the
length of a file in case it’s just a file.
• expunge: Empty the Trash.
• find: Finds all files that match the specified expression and applies selected actions
to them. If no path is specified then defaults to the current working directory. If no
expression is specified then defaults to -print.
• get: Copy files to the local file system.
• getfacl: Displays the Access Control Lists (ACLs) of files and directories. If a
directory has a default ACL, then getfacl also displays the default ACL.
• getfattr: Displays the extended attribute names and values for a file or directory.
• getmerge : Takes a source directory and a destination file as input and concatenates files
in src into the destination local file.
• help: Return usage output.
• ls: list files
• lsr: Recursive version of ls.
• mkdir: Takes path URI’s as argument and creates directories.
• moveFromLocal: Similar to put command, except that the source localsrc is deleted
after it’s copied.
• moveToLocal: Displays a “Not implemented yet” message.
• mv: Moves files from source to destination. This command allows multiple sources
as well in which case the destination needs to be a directory.
• put : Copy single src, or multiple srcs from local file system to the destination file
system. Also reads input from stdin and writes to destination file system.
• renameSnapshot : Rename a snapshot. This operation requires the owner privilege
of the snapshottable directory.
• rm : Delete files specified as args.
• rmdir : Delete a directory.
• rmr : Recursive version of delete.
• setfacl : Sets Access Control Lists (ACLs) of files and directories.
• setfattr : Sets an extended attribute name and value for a file or directory.
• setrep: Changes the replication factor of a file. If the path is a directory then the
command recursively changes the replication factor of all files under the directory
tree rooted at the path.
• stat : Print statistics about the file/directory at <path> in the specified format.
• tail: Displays the last kilobyte of the file to stdout.
• test : Hadoop fs -test -[defsz] URI.
• text: Takes a source file and outputs the file in text format. The allowed formats are
zip and TextRecordInputStream.
• touchz: Create a file of zero length.
• truncate: Truncate all files that match the specified file pattern to the specified length.
• usage: Return the help for an individual command.
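A few of these shell commands in action (the paths below are hypothetical examples):

# Create a directory, upload a local file, then list and display it
hadoop fs -mkdir -p /user/ubuntu/demo
hadoop fs -put localfile.txt /user/ubuntu/demo/
hadoop fs -ls /user/ubuntu/demo
hadoop fs -cat /user/ubuntu/demo/localfile.txt

# Change the replication factor of the file, then remove it
hadoop fs -setrep 2 /user/ubuntu/demo/localfile.txt
hadoop fs -rm /user/ubuntu/demo/localfile.txt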

HDFS Interfaces
Features of HDFS interfaces are:

1. Create new file
2. Upload files/folder
3. Set Permission
4. Copy
5. Move
6. Rename
7. Delete
8. Drag and Drop
9. HDFS File viewer

Data Flow :

• MapReduce is used to compute a huge amount of data.
• To handle the incoming data in a parallel and distributed form, the data has to flow
through various phases:

Input Reader:
• The input reader reads the incoming data and splits it into data blocks of the
appropriate size (64 MB to 128 MB).
• Once the input reader reads the data, it generates the corresponding key-value pairs.
• The input files reside in HDFS.
Map Function :
The map function processes the incoming key-value pairs and generates the corresponding
output key-value pairs.
The map input and output types may be different from each other.

Partition Function :
The partition function assigns the output of each Map function to the appropriate reducer.
Given a key and value, it returns the index of the reducer that should receive the pair.

Shuffling and Sorting :

The data is shuffled between nodes so that it moves out of the map phase and is ready to be
processed by the reduce function. A sorting operation is performed on the input data for the
Reduce function.

Reduce Function :
The Reduce function is invoked for each unique key. The keys are already arranged in
sorted order. The values associated with a key are iterated over by the Reduce function to
generate the corresponding output.
Output Writer :
Once the data has flowed through all the above phases, the Output Writer executes.
The role of the Output Writer is to write the Reduce output to stable storage.
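As an illustration of this end-to-end flow, a stock example job can be run from the command line; a sketch, assuming the Hadoop examples jar shipped with the installation (its exact path varies by distribution) and hypothetical input/output paths:

# Run the bundled word-count example: input is read from HDFS, mapped, shuffled,
# reduced, and the result written back to HDFS by the output writer
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount \
 /user/ubuntu/demo/input /user/ubuntu/demo/output

# Inspect the reducer output
hadoop fs -cat /user/ubuntu/demo/output/part-r-00000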

Data Ingestion

Introduction

Apache Hadoop is synonymous with big data for its cost-effectiveness and its attribute of
scalability for processing petabytes of data. Data analysis using Hadoop is just half the battle
won. Getting data into the Hadoop cluster plays a critical role in any big data deployment.
Data ingestion is important in any big data project because the volume of data is generally in
petabytes or exabytes. Sqoop and Flume are the two tools in Hadoop that are used to gather
data from different sources and load it into HDFS. Sqoop is mostly used to extract structured
data from databases like Teradata, Oracle, etc., while Flume is used to source data stored in
various systems and deals mostly with unstructured data.
Apache Sqoop and Apache Flume are two popular open source tools for Hadoop that help
organizations overcome the challenges encountered in data ingestion.
While working with Hadoop, one question always comes up: if both Sqoop and Flume are used
to gather data from different sources and load it into HDFS, why do we need both of them?
This section answers that question. First, we give a brief introduction to each tool; afterward,
we compare Apache Flume and Sqoop to understand where each tool fits.

What is Apache Sqoop?


Apache Sqoop is a lifesaver in moving data from a data warehouse into the Hadoop
environment. Interestingly, its name is a contraction of SQL-to-Hadoop. Apache Sqoop is an
effective Hadoop tool for importing data from RDBMSs such as MySQL and Oracle into
HBase, Hive, or HDFS.

What is Apache Flume?


Apache Flume is a service designed for streaming logs into the Hadoop environment. It is a
distributed and reliable service for collecting and aggregating huge amounts of log data.

Sqoop
Apache Sqoop (a portmanteau of “sql-to-hadoop”) is an open source tool that allows
users to extract data from a structured data store into Hadoop for further processing. This
processing can be done with MapReduce programs or other higher-level tools such as Hive, Pig,
or Spark.
Sqoop can automatically create Hive tables from data imported from an RDBMS (Relational
Database Management System) table.
Sqoop can also be used to send data from Hadoop to a relational database, which is useful for
sending results processed in Hadoop to an operational transaction processing system.
Sqoop includes tools for the following operations:

• Listing databases and tables on a database system


• Importing a single table from a database system, including specifying which columns to import
and specifying which rows to import using a WHERE clause
• Importing data from one or more tables using a SELECT statement
• Incremental imports from a table on a database system (importing only what has changed since
a known previous state)
• Exporting of data from HDFS to a table on a remote database system
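For example, the listing and incremental-import operations above can be invoked as follows (a sketch reusing the connection details from the import examples in this section; the --last-value shown is a hypothetical starting point):

# List the databases and tables visible through the connection
sqoop list-databases \
--connect jdbc:mysql://172.31.26.67:3306 \
--username ubuntu \
--password ubuntu

sqoop list-tables \
--connect jdbc:mysql://172.31.26.67:3306/movielens \
--username ubuntu \
--password ubuntu

# Incremental import: only rows with id greater than the last imported value
sqoop import \
--connect jdbc:mysql://172.31.26.67:3306/movielens \
--table genres \
--username ubuntu \
--password ubuntu \
--incremental append \
--check-column id \
--last-value 5 \
--target-dir /user/ubuntu/sqoop/movielens/genres_incr
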
Sqoop Connectors
Sqoop has an extension framework that makes it possible to import data from — and export data
to — any external storage system that has bulk data transfer capabilities.
A Sqoop connector is a modular component that uses this framework to enable Sqoop imports
and exports.
Sqoop ships with connectors for working with a range of popular databases, including MySQL,
PostgreSQL, Oracle, SQL Server, DB2, and Netezza.
As well as the built-in Sqoop connectors, various third-party connectors are available for data
stores, ranging from enterprise data warehouses (such as Teradata) to NoSQL stores (such as
Couchbase).
There is also a generic JDBC (Java Database Connectivity) connector for connecting to any
database that supports Java’s JDBC protocol.
# movielens DB must be created in MySQL using movielens.sql

sqoop import \
--connect jdbc:mysql://172.31.26.67:3306/movielens \
--table genres \
--username ubuntu \
--password ubuntu \
--target-dir /user/ubuntu/sqoop/movielens/genres

sqoop import \
--connect jdbc:mysql://172.31.26.67:3306/movielens \
--table genres \
-m 2 \
--username ubuntu \
--password ubuntu \
--target-dir /user/ubuntu/sqoop/movielens/genres2
Sqoop is capable of importing into a few different file formats.
By default, Sqoop will generate comma-delimited text files for our imported data. Delimiters
can be specified explicitly.
Sqoop also supports SequenceFiles, Avro datafiles, and Parquet files.
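The file format can be selected with a flag on the import command; a minimal sketch (the target directory is hypothetical, and --as-parquetfile requires a sufficiently recent Sqoop release):

# Import as Avro datafiles instead of delimited text
sqoop import \
--connect jdbc:mysql://172.31.26.67:3306/movielens \
--table genres \
-m 1 \
--username ubuntu \
--password ubuntu \
--as-avrodatafile \
--target-dir /user/ubuntu/sqoop/movielens/genres_avro

# Alternatives: --as-sequencefile, --as-parquetfile, --as-textfile (the default)
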
sqoop import \
--connect jdbc:mysql://172.31.26.67:3306/movielens \
--table genres \
-m 1 \
--username ubuntu \
--password ubuntu \
--target-dir /user/ubuntu/sqoop/movielens/genres3 \
--fields-terminated-by '\t' \
--enclosed-by '"'

sqoop import \
--connect jdbc:mysql://172.31.26.67:3306/movielens \
--table genres \
--columns "id, name" \
--where "id > 5" \
-m 1 \
--username ubuntu \
--password ubuntu \
--target-dir /user/ubuntu/sqoop/movielens/genres4
We can specify many more options when importing a Database using Sqoop, such as:

• --fields-terminated-by
• --lines-terminated-by
• --null-non-string
• --null-string "NA"

Create a new table in MySQL using the following steps:


mysql -u root -p
# Create table in movielens

use movielens;

CREATE TABLE users_replica AS
select u.id, u.age, u.gender, u.occupation_id, o.name as occupation
from users u LEFT JOIN occupations o
ON u.occupation_id = o.id;

select * from users_replica limit 10;


# Alter table
alter table users_replica add primary key (id);
alter table users_replica add column (salary int, generation varchar(100));
update users_replica set salary = 120000 where occupation = 'Lawyer';
update users_replica set salary = 100000 where occupation = 'Engineer';
update users_replica set salary = 80000 where occupation = 'Programmer';
update users_replica set salary = 0 where occupation = 'Student';
update users_replica set generation = 'Millenial' where age<35;
update users_replica set generation = 'Boomer' where age>55;
exit;
Then run the following Sqoop job:
sqoop import \
--connect jdbc:mysql://172.31.26.67:3306/movielens \
--username ubuntu \
--password ubuntu \
--table users_replica \
--target-dir /user/ubuntu/sqoop/movielens/users \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
-m 3 \
--where "id between 1 and 300" \
--null-non-string -1 \
--null-string "NA"

How Sqoop works


Sqoop is an abstraction for MapReduce, meaning it takes a command, such as a request to
import a table from an RDBMS into HDFS, and implements this using a MapReduce processing
routine. Specifically, Sqoop implements a Map-only MapReduce process.
Sqoop performs the following steps to complete an import operation:

1. Connect to the database system using JDBC or a custom connector.

2. Examine the table to be imported.

3. Create a Java class to represent the structure (schema) of the specified table. This
class can then be reused for future import operations.

4. Execute a Map-only MapReduce job with a specified number of tasks (mappers) to
connect to the database system and import data from the specified table in parallel.

When importing data to HDFS, it is important that you ensure access to a consistent snapshot of
the source data.
We need to ensure that any processes that update existing rows of a table are disabled during the
import.
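The Java class mentioned in step 3 can also be generated on its own with Sqoop's codegen tool; a sketch using the same connection details as the imports above:

# Generate (but do not run) the record class Sqoop would use for the genres table
sqoop codegen \
--connect jdbc:mysql://172.31.26.67:3306/movielens \
--table genres \
--username ubuntu \
--password ubuntu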
Imported Data and Hive
Using a system such as Hive to handle relational operations can dramatically ease the
development of the analytic pipeline.
Sqoop can generate a Hive table based on a table from an existing relational data source.
## Create database movielens in Hive
#3 steps import into Hive:
# 1. import data to HDFS
sqoop import \
--connect jdbc:mysql://172.31.43.67:3306/movielens \
--table genres -m 1 \
--username ubuntu \
--password ubuntu \
--target-dir /user/ubuntu/sqoop/movielens/genres

# 2. create table in Hive

sqoop create-hive-table \
--connect jdbc:mysql://172.31.26.67:3306/movielens \
--table genres \
--hive-table movielens.genres \
--username ubuntu \
--password ubuntu \
--fields-terminated-by ','
# 3. import data from HDFS to Hive

#Run commands in Hive:


Hive
hive> show databases;
hive> use movielens;
hive> show tables;
hive> select * from genres;
hive> LOAD DATA INPATH "/user/ubuntu/sqoop/movielens/genres" OVERWRITE INTO
TABLE genres;
hive> select * from genres;
hive> exit;
#run commands in Terminal
hadoop fs -ls /user/ubuntu/sqoop/movielens/genres
hadoop fs -ls /user/hive/warehouse/movielens.db/genres
hadoop fs -cat /user/hive/warehouse/movielens.db/genres/part-m-00000
# Direct import into Hive

sqoop import \
--connect jdbc:mysql://172.31.26.67:3306/movielens \
--table genres -m 1 \
--hive-import \
--hive-overwrite \
--hive-table movielens.genres2 \
--username ubuntu \
--password ubuntu \
--fields-terminated-by ','
#run commands in Terminal
hadoop fs -ls /user/hive/warehouse/movielens.db/genres
hadoop fs -cat /user/hive/warehouse/movielens.db/genres/part-m-00000
Sqoop Export
In Sqoop, an import refers to the movement of data from a database system into HDFS. By
contrast, an export uses HDFS as the source of data and a remote database as the destination.
We can, for example, export the results of an analysis to a database for consumption by other
tools.
Before exporting a table from HDFS to a database, we must prepare the database to receive the
data by creating the target table. Although Sqoop can infer which Java types are appropriate to
hold SQL data types, this translation does not work in both directions. You must determine
which types are most appropriate.
When reading the tables directly from files, we need to tell Sqoop which delimiters to use.
Sqoop assumes records are newline-delimited by default, but needs to be told about the field
delimiters.
# Create table in MySQL
mysql -u ubuntu -p

mysql> use movielens;


mysql> create table genres_export (id INT, name VARCHAR(255));
mysql> exit;
# Export data from hive warehouse to mysql

sqoop export \
--connect jdbc:mysql://172.31.26.67:3306/movielens -m 1 \
--table genres_export \
--export-dir /user/hive/warehouse/movielens.db/genres \
--username ubuntu \
--password ubuntu
# Check the exported table in mysql

mysql -u root -p
use movielens;
show tables;
select * from genres_export limit 10;
The way Sqoop performs exports is very similar in nature to how Sqoop performs imports.
Before performing the export, Sqoop picks a strategy based on the database connect string.
Sqoop then generates a Java class based on the target table definition. This generated class has
the ability to parse records from text files and insert values of the appropriate types into a table.
A MapReduce job is then launched that reads the source datafiles from HDFS, parses the
records using the generated class, and executes the chosen export strategy.
Exercise Sqoop
Install crime-data-la database in MySQL
Copy the datasets crime_data_la.csv and crime_data_area_name.csv into /user/ubuntu/crime_la
in HDFS.
From the console, copy crime_data_la.csv and crime_data_area_name.csv to the local
filesystem using:
hadoop fs -copyToLocal /user/ubuntu/crime_la/crime_data_la.csv
hadoop fs -copyToLocal /user/ubuntu/crime_la/crime_data_area_name.csv
ls -a
Create a new database crime_la in MySQL
mysql -u root -p
CREATE DATABASE crime_la;
USE crime_la;
Create table crime_data_la and area_lookup:
CREATE TABLE crime_data_la
(dr_number INT, date_reported VARCHAR(255), date_occured VARCHAR(255),
tm_occured INT, area_id INT, reporting_district INT, crime_code INT, victime_age INT,
victim_sex VARCHAR(255), victim_descent VARCHAR(255), coord_lat FLOAT, coord_long
FLOAT);

alter table crime_data_la add primary key (dr_number);

CREATE TABLE area_lookup (area_id INT, area_name VARCHAR(255));

alter table area_lookup add primary key (area_id);


Load data into crime_la database:
LOAD DATA LOCAL INFILE 'crime_data_la.csv'
INTO TABLE crime_data_la
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(dr_number, date_reported, date_occured,tm_occured,area_id, reporting_district, crime_code,
victime_age,
victim_sex, victim_descent, coord_lat, coord_long);
LOAD DATA LOCAL INFILE 'crime_data_area_name.csv'
INTO TABLE area_lookup
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(area_id, area_name);
Check the data has been correctly loaded:
SELECT * FROM crime_data_la LIMIT 10;

SELECT * FROM area_lookup;


Grant all privileges on crime_la
GRANT ALL PRIVILEGES
ON crime_la.*
TO 'ubuntu'@'%'
IDENTIFIED BY 'ubuntu';
Flume
Introduction to Flume
Apache Flume is a Hadoop ecosystem project originally developed by Cloudera designed to
capture, transform, and ingest data into HDFS using one or more agents.
Apache Flume is an ideal fit for streams of data that we would like to aggregate, store, and
analyze using Hadoop.
Flume is designed for high-volume ingestion into Hadoop of event-based data.
The initial use case was based upon capturing log files, or web logs, from a source system like a
web server, and routing these files and their messages into HDFS as they are generated.
The usual destination (or sink in Flume parlance) is HDFS. However, Flume is flexible enough
to write to other systems, like HBase or Solr.
Flume Agents
To use Flume, we need to run a Flume agent, which is a long-lived Java process that runs
sources and sinks, connected by channels.
Agents can connect a data source directly to HDFS or to other downstream agents.
Agents can also perform in-flight data operations, including basic transformations, compression,
encryption, batching of events and more.
A Flume installation is made up of a collection of connected agents running in a distributed
topology.
Agents on the edge of the system (co-located on web server machines, for example) collect data
and forward it to agents that are responsible for aggregating and then storing the data in its final
destination.
Agents are configured to run a collection of particular sources and sinks, so using Flume is
mainly a configuration exercise in wiring the pieces together.

A Flume agent source instructs the agent where the data is to be received from.
A Flume agent sink tells the agent where to send data.

• Often the destination is HDFS, which was the original intention for the project.
• However, the destination could be another agent that will do some further in-flight processing,
or another filesystem such as S3.

The Flume agent channel is a queue between the agent’s source and sink.

• Flume implements a transactional architecture for added reliability. This enables rollback and
retry operations if required.

A source in Flume produces events and delivers them to the channel, which stores the events
until they are forwarded to the sink.
You can think of the source-channel-sink combination as a basic Flume building block.

Flume Agent example


# Flume Components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Source
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -f logfile.log
agent1.sources.source1.channels = channel1

# Sink
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = channel1

# Channel
agent1.channels.channel1.type = memory
## First-tier agent

# Flume Components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Source
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F logfile.log
agent1.sources.source1.channels = channel1

# Sink (an Avro sink that sends events to the second-tier agent's Avro source)
agent1.sinks.sink1.channel = channel1
agent1.sinks.sink1.type = avro
agent1.sinks.sink1.hostname = 172.31.43.67
agent1.sinks.sink1.port = 14000

# Channels
agent1.channels.channel1.type = memory
## Second-tier agent

# Flume Components
agent2.sources = source2
agent2.sinks = sink2
agent2.channels = channel2

# Source as a sink
agent2.sources.source2.channels = channel2
agent2.sources.source2.type = avro
agent2.sources.source2.bind = 172.31.43.67
agent2.sources.source2.port = 14000
# Sink
agent2.sinks.sink2.type = hdfs
agent2.sinks.sink2.hdfs.path = flume/agent_tiers
agent2.sinks.sink2.hdfs.fileType = DataStream
agent2.sinks.sink2.hdfs.filePrefix = events
agent2.sinks.sink2.hdfs.fileSuffix = .log
agent2.sinks.sink2.channel = channel2

# Channel
agent2.channels.channel2.type = memory
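
The two agents above can be started with the flume-ng command; a sketch, assuming the configurations have been saved to hypothetical files agent1.conf and agent2.conf:

# Start the second-tier (collector) agent first so its Avro source is listening
flume-ng agent --conf conf --conf-file agent2.conf --name agent2 -Dflume.root.logger=INFO,console

# Then start the first-tier agent that tails the log file and forwards events over Avro
flume-ng agent --conf conf --conf-file agent1.conf --name agent1 -Dflume.root.logger=INFO,console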

Difference Between Apache Sqoop vs Flume

Conclusion
As you have learned above, Sqoop and Flume are primarily the two data ingestion tools used
in the big data world. If you need to ingest textual log data into Hadoop/HDFS, then Flume is
the right choice. If your data is not generated regularly, Flume will still work, but it will be
overkill for that situation.
Similarly, Sqoop is not the best fit for event-driven data handling.
Hadoop I/O: Compression, Serialization, File-Based Data structures.
I/O Compression

• In the Hadoop framework, where large data sets are stored and processed, you will
need storage for large files.
• These files are divided into blocks and those blocks are stored in different nodes across
the cluster so lots of I/O and network data transfer is also involved.
• In order to reduce the storage requirements and the time spent in network transfer, you can
use data compression in the Hadoop framework.
• Using data compression in Hadoop, you can compress files at various steps; at each of these
steps it helps to reduce the storage used and the quantity of data transferred.
• You can compress the input file itself.
• That will help you reduce storage space in HDFS.
• You can also configure the output of a MapReduce job to be compressed.
• That helps in reducing storage space if you are archiving the output or sending it to some
other application for further processing.
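As a sketch of compressing MapReduce output from the command line (the property names shown are the current Hadoop 2.x names and may differ on older releases; the jar path and HDFS paths are hypothetical):

# Run the bundled word-count example with gzip-compressed output
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount \
 -D mapreduce.output.fileoutputformat.compress=true \
 -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
 /user/ubuntu/demo/input /user/ubuntu/demo/output_gz
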
I/O Serialization

• Serialization refers to the conversion of structured objects into byte streams for
transmission over the network or permanent storage on a disk.
• Deserialization refers to the conversion of byte streams back to structured objects.
• Serialization is mainly used in two areas of distributed data processing :
• Interprocess communication
• Permanent storage
• We require I/O serialization because:
• It allows records to be processed faster (time-bound processing).
• It maintains a proper data format when transmitting data to an endpoint that has no
schema support.
• Without a defined structure or format, processing the data later can lead to complex errors.
• Serialization offers data validation over transmission.
• To maintain the proper format of data serialization, the system must have the following
four properties -
• Compact - helps in the best use of network bandwidth
• Fast - reduces the performance overhead
• Extensible - can match new requirements
• Inter-operable - not language-specific
File-based Data Structure
