Big data is characterized by 3Vs - volume, velocity, and variety. Hadoop is a framework for distributed processing of large datasets across clusters of computers. It provides HDFS for storage, MapReduce for batch processing, and YARN for resource management. Additional tools like Spark, Mahout, and Zeppelin can be used for real-time processing, machine learning, and data visualization respectively on Hadoop. Benefits of Hadoop include ease of scaling to large data, high performance via parallel processing, reliability through data protection and failover.
Having trouble distinguishing Big Data, Hadoop, and NoSQL, or seeing how they connect? This slide deck from the Savvycom team can help.
Enjoy reading!
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses limitations in traditional RDBMS for big data by allowing scaling to large clusters of commodity servers, high fault tolerance, and distributed processing. The core components of Hadoop are HDFS for distributed storage and MapReduce for distributed processing. Hadoop has an ecosystem of additional tools like Pig, Hive, HBase and more. Major companies use Hadoop to process and gain insights from massive amounts of structured and unstructured data.
Guest Lecture: Introduction to Big Data at Indian Institute of Technology, by Nishant Gandhi
This document provides an introduction to big data, including definitions of big data and why it is important. It discusses characteristics of big data like volume, velocity, variety and veracity. It provides examples of big data applications in various industries like GE, Boeing, social media, finance, CERN, journalism, politics and more. It also introduces NoSQL and the CAP theorem, and concludes that big data is changing business and technology by enabling new insights from data to reduce costs and optimize operations.
The document provides an overview of big data analytics using Hadoop. It discusses how Hadoop allows for distributed processing of large datasets across computer clusters. The key components of Hadoop discussed are HDFS for storage, and MapReduce for parallel processing. HDFS provides a distributed, fault-tolerant file system where data is replicated across multiple nodes. MapReduce allows users to write parallel jobs that process large amounts of data in parallel on a Hadoop cluster. Examples of how companies use Hadoop for applications like customer analytics and log file analysis are also provided.
This document provides an agenda for a presentation on big data and big data analytics using R. The presentation introduces the presenter and has sections on defining big data, discussing tools for storing and analyzing big data in R like HDFS and MongoDB, and presenting case studies analyzing social network and customer data using R and Hadoop. The presentation also covers challenges of big data analytics, existing case studies using tools like SAP Hana and Revolution Analytics, and concerns around privacy with large-scale data analysis.
The document summarizes the key components of the big data stack, from the presentation layer where users interact, through various processing and storage layers, down to the physical infrastructure of data centers. It provides examples like Facebook's petabyte-scale data warehouse and Google's globally distributed database Spanner. The stack aims to enable the processing and analysis of massive datasets across clusters of servers and data centers.
This document provides an overview of big data and Hadoop. It defines big data as large volumes of structured, semi-structured and unstructured data that is growing exponentially and is too large for traditional databases to handle. It discusses the 4 V's of big data - volume, velocity, variety and veracity. The document then describes Hadoop as an open-source framework for distributed storage and processing of big data across clusters of commodity hardware. It outlines the key components of Hadoop including HDFS, MapReduce, YARN and related modules. The document also discusses challenges of big data, use cases for Hadoop and provides a demo of configuring an HDInsight Hadoop cluster on Azure.
The document provides an overview of Hadoop and the Hadoop ecosystem. It discusses the history of Hadoop, how big data is defined in terms of volume, velocity, variety and veracity. It then explains what Hadoop is, the core components of HDFS and MapReduce, how Hadoop is used for distributed processing of large datasets, and how Hadoop compares to traditional RDBMS. The document also outlines other tools in the Hadoop ecosystem like Pig, Hive, HBase and gives a brief demo.
The document discusses big data, including what it is, sources of big data like social media and stock exchange data, and the three Vs of big data - volume, velocity, and variety. It then discusses Hadoop, the open-source framework for distributed storage and processing of large datasets across clusters of computers. Key components of Hadoop include HDFS for distributed storage, MapReduce for distributed computation, and YARN which manages computing resources. The document also provides overviews of Pig and Jaql, programming languages used for analyzing data in Hadoop.
Enough talking about Big Data and Hadoop; let's see how Hadoop works in action.
We will locate a real dataset, ingest it into our cluster, connect it to a database, apply some queries and data transformations, save the result, and present it via a BI tool.
This document introduces big data concepts and Microsoft's solutions for big data. It defines big data as large, complex datasets that are difficult to process using traditional systems. It also describes the 3Vs of big data: volume, velocity, and variety. The document then outlines Microsoft's offerings for big data including HDInsight, .NET SDK for Hadoop, ODBC driver for Hive, and integrations with Excel, SharePoint, and SQL Server. It provides overviews of Hadoop, HDFS, MapReduce, and the Hadoop ecosystem.
This document provides an overview of big data concepts including what big data is, how it is used, and common tools involved. It defines big data as a cluster of technologies like Hadoop, HDFS, and HCatalog used for fetching, processing, and visualizing large datasets. MapReduce and Hadoop clusters are described as common processing techniques. Example use cases mentioned include business intelligence. Resources for getting started with tools like Hortonworks, CloudEra, and examples of MapReduce jobs are also provided.
Introductory Big Data presentation given during one of our Sizing Servers Lab user group meetings. The presentation is targeted at an audience of about 20 SME employees. It also contains a short description of the work packages for our Big Data project proposal that was submitted in March.
This document is a presentation on big data and Hadoop. It introduces big data, how it is growing exponentially, and the challenges of storing and analyzing unstructured data. It discusses how Sears moved to Hadoop to gain insights from all of its customer data. The presentation explains why Hadoop is in high demand, as it allows distributed processing of large datasets across commodity hardware. It provides an overview of the Hadoop ecosystem including HDFS, MapReduce, Hive, HBase and more. Finally, it discusses job opportunities and salaries in big data which are high and growing significantly.
This document provides an introduction to big data and Hadoop. It discusses what big data is, characteristics of big data like volume, velocity and variety. It then introduces Hadoop as a framework for storing and analyzing big data, describing its main components like HDFS and MapReduce. The document outlines a typical big data workflow and gives examples of big data use cases. It also provides an overview of setting up Hadoop on a single node, including installing Java, configuring SSH, downloading and extracting Hadoop files, editing configuration files, formatting the namenode, starting Hadoop daemons and testing the installation.
This document discusses large-scale data processing using Apache Hadoop at SARA and BiG Grid. It provides an introduction to Hadoop and MapReduce, noting that data is easier to collect, store, and analyze in large quantities. Examples are given of projects using Hadoop at SARA, including analyzing Wikipedia data and structural health monitoring. The talk outlines the Hadoop ecosystem and timeline of its adoption at SARA. It discusses how scientists are using Hadoop for tasks like information retrieval, machine learning, and bioinformatics.
Big Data raises challenges about how to process such a vast pool of raw data and how to add value to our lives. To address these demands, an ecosystem of tools named Hadoop was conceived.
The document provides an overview of Hadoop and its core components. It discusses:
- Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers.
- The two core components of Hadoop are HDFS for distributed storage, and MapReduce for distributed processing. HDFS stores data reliably across machines, while MapReduce processes large amounts of data in parallel.
- Hadoop can operate in three modes - standalone, pseudo-distributed and fully distributed. The document focuses on setting up Hadoop in standalone mode for development and testing purposes on a single machine.
This presentation describes the company where I did my summer training, and covers what big data is, why we use big data, big data challenges, issues in big data and their solutions, Hadoop, Docker, Ansible, etc.
Workshop
December 9, 2015
LBS College of Engineering
www.sarithdivakar.info | www.csegyan.org
http://sarithdivakar.info/2015/12/09/wordcount-program-in-python-using-apache-spark-for-data-stored-in-hadoop-hdfs/
One of the most common technologies used to store metadata and large databases. It has numerous applications in the real world and is very useful for creating new database-oriented apps.
Big data refers to large volumes of diverse data that traditional data processing systems are unable to handle. Hadoop is an open-source software framework for distributed storage and processing of big data across clusters of commodity hardware. It allows for the reliable, scalable, and distributed processing of large data sets across clusters of commodity servers. Hadoop features include scalable and reliable data storage with HDFS and distributed processing of large data sets with MapReduce. Popular companies that use Hadoop include Google, Facebook, and Amazon for its abilities to process massive amounts of data in a cost-effective manner.
The document provides an introduction to big data and Hadoop. It describes the concepts of big data, including the four V's of big data: volume, variety, velocity and veracity. It then explains Hadoop and how it addresses big data challenges through its core components. Finally, it describes the various components that make up the Hadoop ecosystem, such as HDFS, HBase, Sqoop, Flume, Spark, MapReduce, Pig and Hive. The key takeaways are that the reader will now be able to describe big data concepts, explain how Hadoop addresses big data challenges, and describe the components of the Hadoop ecosystem.
This document provides an overview of Hadoop and Big Data. It begins with introducing key concepts like structured, semi-structured, and unstructured data. It then discusses the growth of data and need for Big Data solutions. The core components of Hadoop like HDFS and MapReduce are explained at a high level. The document also covers Hadoop architecture, installation, and developing a basic MapReduce program.
IRJET: Systematic Review: Progression Study on Big Data Articles (IRJET Journal)
This document provides a systematic review of research articles on big data analysis. It analyzed 64 articles published between 2014-2018 from IEEE Explorer and Google Scholar databases. Key findings include: the number of published articles has increased each year, reflecting the growing importance of big data; experimental and case study articles accounted for 25 of the analyzed papers; 19 articles were ultimately selected for review, with 11 from Google Scholar and 8 from IEEE Explorer. The review aims to provide an overview of current research progress on big data analysis techniques.
This document discusses security issues with Hadoop and available solutions. It identifies vulnerabilities in Hadoop including lack of authentication, unsecured data in transit, and unencrypted data at rest. It describes current solutions like Kerberos for authentication, SASL for encrypting data in motion, and encryption zones for encrypting data at rest. However, it notes limitations of encryption zones for processing encrypted data efficiently with MapReduce. It proposes a novel method for large scale encryption that can securely process encrypted data in Hadoop.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable and distributed processing of large datasets. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. HDFS stores data reliably across machines in a Hadoop cluster and MapReduce processes data in parallel by breaking the job into smaller fragments of work executed across cluster nodes.
A Comprehensive Study on Big Data Applications and Challenges (ijcisjournal)
Big Data has gained much interest from academia and the IT industry. In the digital and computing world, information is generated and collected at a rate that quickly exceeds processing capacity. As information is transferred and shared at light speed over optical fiber and wireless networks, the volume of data and the speed of market growth increase. Conversely, the fast growth rate of such large data generates copious challenges, such as the rapid growth of data, transfer speed, diverse data, and security. Even so, Big Data is still in its early stage, and the domain has not been reviewed in general. Hence, this study expansively surveys and classifies an assortment of attributes of Big Data, including its nature, definitions, rapid growth rate, volume, management, analysis, and security. This study also proposes a data life cycle that uses the technologies and terminologies of Big Data. Map/Reduce is a programming model for efficient distributed computing that works well with semi-structured and unstructured data; it is a simple model, but good for many applications such as log processing and web index building.
This document provides an introduction to big data and Hadoop. It discusses how the volume of data being generated is growing rapidly and exceeding the capabilities of traditional databases. Hadoop is presented as a solution for distributed storage and processing of large datasets across clusters of commodity hardware. Key aspects of Hadoop covered include MapReduce for parallel processing, the Hadoop Distributed File System (HDFS) for reliable storage, and how data is replicated across nodes for fault tolerance.
This document provides an overview of big data and Apache Hadoop. It defines big data as large and complex datasets that are difficult to process using traditional database management tools. It discusses the sources and growth of big data, as well as the challenges of capturing, storing, searching, sharing, transferring, analyzing and visualizing big data. It describes the characteristics and categories of structured, unstructured and semi-structured big data. The document also provides examples of big data sources and uses Hadoop as a solution to the challenges of distributed systems. It gives a high-level overview of Hadoop's core components and characteristics that make it suitable for scalable, reliable and flexible distributed processing of big data.
1. The document discusses the evolution of computing from mainframes to smaller commodity servers and PCs. It then introduces cloud computing as an emerging technology that is changing the technology landscape, with examples like Google File System and Amazon S3.
2. It discusses the need for large data processing due to increasing amounts of data from sources like the stock exchange, Facebook, genealogy sites, and scientific experiments.
3. Hadoop is introduced as a framework for distributed computing and reliable shared storage and analysis of large datasets using its Hadoop Distributed File System (HDFS) for storage and MapReduce for analysis.
This document provides an overview of Big Data and Hadoop. It defines Big Data as large volumes of structured, semi-structured, and unstructured data that is too large to process using traditional databases and software. It provides examples of the large amounts of data generated daily by organizations. Hadoop is presented as a framework for distributed storage and processing of large datasets across clusters of commodity hardware. Key components of Hadoop including HDFS for distributed storage and fault tolerance, and MapReduce for distributed processing, are described at a high level. Common use cases for Hadoop by large companies are also mentioned.
2. CONTENTS (1/2)
Big Data Definition
Areas of Challenges
Big Data Attributes
Big Data Sources
Sample Events Generating Data
New Tools for Generating Data
Big Data Applications
Getting Value from Big Data
Big Data Security
Comparing Hadoop with RDBMS
Hadoop
4. WHAT IS BIG DATA?
BIG DATA is data so large that it becomes difficult to process using traditional systems.
SOURCE: PLANNING FOR BIG DATA, EDD DUMBILL, PP. 1-4
6. DIFFICULT TO PROCESS BY TRADITIONAL SYSTEMS
A 200 MB document: unable to send.
A 150 GB image: unable to view.
A 200 TB video: unable to edit.
It depends on the capabilities of the system.
7. ORGANIZATION SPECIFIC
500 TB of text, audio, and video data per day: BIG DATA for Company 1, NOT BIG DATA for Company 2.
It depends on the capabilities of the organization.
13. DATA GENERATION POINT
EXAMPLES
MOBILE DEVICES
MICROPHONES
READERS/SCANNERS
CAMERAS
MACHINE SENSORS
SOCIAL MEDIA
PROGRAMS/SOFTWARE
SCIENCE FACILITIES
14. SAMPLE DATA TYPES
VIDEOS
AUDIOS
IMAGES
PHOTOS
LOGS
CLICK TRAILS
TEXT MESSAGES
EMAILS
DOCUMENTS
BOOKS
TRANSACTIONS
PUBLIC RECORDS
15. SAMPLE EVENTS GENERATING DATA
1) Airbus:
Airbus generates 10 TB every 30 minutes.
About 640 TB is generated in one flight.
2) Smart Meters:
A smart meter reads the usage every 15 minutes.
Smart meters record 350 billion transactions every year.
In 2009, there were 76 million smart meters.
By 2014, there will be 200 million smart meters.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
16. 3) Camera Phones:
5 million camera phones are in use worldwide.
Most of them have location awareness (GPS).
22% of them are smartphones.
By the end of 2013, the number of smartphones will exceed the number of PCs.
4) Internet Users:
2+ billion people use the internet.
By 2014, Cisco estimates internet traffic of 4.8 zettabytes per year.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
17. 5) Blogs:
There are 200 billion blog entries in the world.
6) Emails:
300 million emails are sent every day.
7) RFID:
In 2005, there were around 1.5 million RFIDs.
In 2012, there are 30 million RFIDs.
Walmart has played the major role.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
18. 8) Facebook:
Facebook generates 25 TB of data daily.
9) Twitter:
Twitter generates 12 TB of data daily.
200 million users generate 230 million tweets daily.
97,000 tweets are sent every second.
10) Trading:
NYSE produces 1 TB per trading day.
11) Experiment:
The CERN atomic facility generates 40 TB per second.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
19. SAMPLE EVENTS GENERATING DATA
Big Data:
In 2009, the total data was estimated to be 1 ZB.
In 2020, it is estimated to be 35 ZB.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
20. New Tools For Big Data
Over time, traditional systems (e.g., RDBMS), which are not able to handle Big Data, have given way to Big Data tools (e.g., Hadoop), which were created to handle it.
21. Big Data Applications
Companies gain an edge by collecting, analyzing, and understanding information.
Governments forecast events and take proactive actions.
23. Big Data Security Issues
Security and privacy issues are magnified by the V attributes: velocity, volume, and variety.
Traditional security mechanisms, which are tailored to securing small-scale static data, are inadequate.
SOURCE: CLOUD SECURITY ALLIANCE
24. Top Five Security Challenges
1) Secure Computation in Distributed Programming Frameworks:
A distributed programming framework utilizes parallelism in computation and storage to process massive amounts of data.
Example: the MapReduce framework:
Splits input files into multiple chunks.
These chunks are read by the mapper, which outputs key/value pairs.
The reducer combines the values belonging to each distinct key and outputs the result.
OPPORTUNITY 1: Two major prevention measures arise:
1) Securing the mapper
2) Securing the data in the presence of an untrusted mapper
SOURCE: CLOUD SECURITY ALLIANCE
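The split/map/shuffle/reduce flow described on this slide can be sketched in Python. This is a minimal in-memory simulation, not the real Hadoop API; the function names are illustrative only:

```python
from collections import defaultdict

def mapper(chunk):
    # Read one input chunk and emit key/value pairs: (word, 1).
    for word in chunk.split():
        yield word.lower(), 1

def reducer(key, values):
    # Combine all values belonging to one distinct key.
    return key, sum(values)

def map_reduce(chunks):
    # Shuffle phase: group mapper output by key before reducing.
    groups = defaultdict(list)
    for chunk in chunks:
        for key, value in mapper(chunk):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

chunks = ["big data big", "data tools"]
print(map_reduce(chunks))  # {'big': 2, 'data': 2, 'tools': 1}
```

In real Hadoop the chunks would be HDFS blocks and the mapper and reducer would run in parallel on different nodes; the grouping dictionary here stands in for the shuffle between them.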
25. 2) Input Validation/Filtering
Input validation:
What kind of data is untrusted?
What are the untrusted data sources?
Data filtering:
Filter rogue or malicious data.
Challenges/opportunities:
GBs or TBs of continuous data.
Signature-based data filtering has limitations.
SOURCE: CLOUD SECURITY ALLIANCE
26. 3) Secure Data Storage
With data at various nodes, authentication, authorization, and encryption are challenging.
Auto-tiering moves cold data onto less secure media.
o What if the cold data is sensitive?
o Auto-tiering does not keep track of where the data is stored (a new challenge).
Encryption of real-time data can have a performance impact.
Challenges/opportunities:
24/7 availability of data.
Unauthorized access.
SOURCE: CLOUD SECURITY ALLIANCE
27. 4) Privacy Concerns in Data Mining
Sharing of results involves multiple challenges:
o Invasion of privacy.
o Invasive marketing.
o Unintentional disclosure of information.
Example: data held by companies and government agencies is constantly mined and analyzed by inside analysts, and potentially also by outside contractors.
Challenges/opportunities: robust and scalable privacy-preserving mining algorithms.
SOURCE: CLOUD SECURITY ALLIANCE
28. 5) Cryptographically Enforced Access Control and Secure Communication
To keep private data secure end to end, it must be accessible only to authorized entities; hence cryptographically enforced access control has to be implemented.
Challenges/opportunities: the main problem with encrypting data, especially large data sets, is the all-or-nothing retrieval policy, which prevents users from easily searching or sharing data.
SOURCE: CLOUD SECURITY ALLIANCE
30. Comparing Hadoop with RDBMS
Until recently, many applications utilized relational database management systems (RDBMS) for batch processing:
- Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
- Hadoop doesn't fully replace relational products; many architectures would benefit from both Hadoop and relational product(s).
Scale-out vs scale-up:
- RDBMS products scale up.
  Expensive to scale for large installations.
  Hits a ceiling when storage reaches 100s of terabytes.
- Hadoop clusters can scale out to 100s of machines and petabytes of storage.
31. Structured (relational) vs semi-structured vs unstructured
- RDBMS works well for structured data: tables that conform to a predefined schema.
- Hadoop works best on semi-structured and unstructured data.
  Semi-structured data may have a schema that is loosely followed.
  Unstructured data has no structure whatsoever and is usually blocks of text (or, for example, images).
  At processing time, the types for keys and values are chosen by the implementer.
- Certain types of input data, such as JSON and XML, will not easily fit into a relational schema.
Comparing Hadoop with RDBMS (contd.)
32. Offline batch vs online transactions
- Hadoop was not designed for real-time, low-latency queries.
- Products that do provide low-latency queries, such as HBase, have limited query functionality.
- Hadoop performs best for offline batch processing on large amounts of data.
- RDBMS is best for online transactions and low-latency queries.
- Hadoop is designed to stream large files and large amounts of data.
- RDBMS works best with small records.
Comparing Hadoop with RDBMS (contd.)
34. A framework for running applications on large clusters of commodity hardware.
Scale: petabytes of data on thousands of nodes.
Includes:
  Storage: HDFS
  Processing: MapReduce
Supports the Map/Reduce programming model.
Requirements:
  Economy: use clusters of commodity computers.
  Easy to use:
    Users need not deal with the complexity of distributed computing.
  Reliable: can handle node failures automatically.
35. What's Hadoop? (contd.)
Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Here's what makes Hadoop especially useful:
  Scalable
  Economical
  Efficient
  Reliable
36. Hadoop, Why?
Need to process multi-petabyte datasets.
Expensive to build reliability into each application.
Nodes fail every day:
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
Need common infrastructure:
– Efficient, reliable, open source (Apache License).
The above goals are the same as Condor's, but the workloads are I/O-bound, not CPU-bound.
37. Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
39. HDFS
Hadoop implements MapReduce using the Hadoop Distributed File System (HDFS) (see figure below). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
Hadoop has been demonstrated on clusters with 2,000 nodes. The current design target is 10,000-node clusters.
40. Goals of HDFS
• Very large distributed file system
– 10K nodes, 100 million files, 10 PB
• Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for batch processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth
• Runs in user space, on heterogeneous OSes
41. Hadoop at Facebook
• Production cluster
– 4,800 cores, 600 machines, 16 GB per machine (April 2009)
– 8,000 cores, 1,000 machines, 32 GB per machine (July 2009)
– 4 SATA disks of 1 TB each per machine
– 2-level network hierarchy, 40 machines per rack
– Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
• Test cluster
– 800 cores, 16 GB each
42. Hadoop Architecture
[Diagram: input data is split into DFS blocks, each replicated three times across the Hadoop cluster; MAP tasks process the blocks where they are stored, and a Reduce task combines their output into the results.]
44. Map/Reduce Processes
Launching application
– User application code
– Submits a specific kind of Map/Reduce job
JobTracker
– Handles all jobs
– Makes all scheduling decisions
TaskTracker
– Manager for all tasks on a given node
Task
– Runs an individual map or reduce fragment for a given job
– Forks from the TaskTracker
45. Map/Reduce Processes (cont'd)
Hadoop Map/Reduce goals:
• Process large data sets
• Cope with hardware failure
• High throughput
46. Hadoop Map-Reduce Architecture
Master-slave architecture
Map-Reduce master: "JobTracker"
– Accepts MR jobs submitted by users
– Assigns map and reduce tasks to TaskTrackers
– Monitors task and TaskTracker status; re-executes tasks upon failure
Map-Reduce slaves: "TaskTrackers"
– Run map and reduce tasks upon instruction from the JobTracker
– Manage storage and transmission of intermediate output
48. NameNode Metadata
• Metadata in memory
– The entire metadata is kept in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A transaction log
– Records file creations, file deletions, etc.
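The metadata categories listed above can be pictured as a small in-memory structure. This is an illustrative Python sketch only, not HDFS's actual Java implementation; all paths, block IDs, and node names are made up:

```python
# Namespace: file -> attributes and list of blocks (kept in NameNode memory).
namespace = {
    "/logs/app.log": {
        "attrs": {"created": "2013-01-15", "replication": 3},
        "blocks": ["blk_1", "blk_2"],
    },
}

# For each block, the DataNodes currently holding a replica.
# (Rebuilt from DataNode block reports, not stored in the transaction log.)
block_locations = {
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],
    "blk_2": ["datanode-b", "datanode-c", "datanode-d"],
}

# The transaction log records namespace mutations such as creations/deletions.
transaction_log = [("create", "/logs/app.log")]

def locate(path):
    # Resolve a file path to the DataNodes holding each of its blocks.
    return [block_locations[b] for b in namespace[path]["blocks"]]

print(locate("/logs/app.log"))
```

Because the whole structure lives in main memory with no demand paging, lookups like `locate` are fast, which is exactly the design trade-off the slide describes.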
49. DataNode
• A block server
– Stores data in the local file system (e.g. ext3)
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to clients
• Block report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates pipelining of data
– Forwards data to other specified DataNodes
50. Block Placement
• Current strategy
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed
• Clients read from the nearest replica
• The developers would like to make this policy pluggable
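The placement strategy on this slide can be sketched as follows. This is a simplified Python illustration of the stated policy, not HDFS's real placement code; the rack and node names are invented for the example:

```python
import random

def place_replicas(local_node, nodes_by_rack, replication=3):
    """Pick DataNodes per the strategy above: first replica on the
    local node, second on a remote rack, third on that same remote
    rack, any extras placed at random."""
    local_rack = next(r for r, ns in nodes_by_rack.items() if local_node in ns)
    remote_rack = random.choice([r for r in nodes_by_rack if r != local_rack])
    replicas = [local_node]
    # Second and third replicas on the same remote rack.
    replicas += [n for n in nodes_by_rack[remote_rack] if n != local_node][:2]
    all_nodes = [n for ns in nodes_by_rack.values() for n in ns]
    while len(replicas) < replication:  # random additional replicas
        pick = random.choice(all_nodes)
        if pick not in replicas:
            replicas.append(pick)
    return replicas[:replication]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", racks))  # ['n1', 'n3', 'n4'] with this two-rack layout
```

Keeping two replicas on one remote rack limits cross-rack traffic on write while still surviving the loss of an entire rack.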
51. Data Correctness
• Use checksums to validate data
– Uses CRC32
• File creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksums
• File access
– Client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
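The per-512-byte CRC32 scheme can be illustrated with Python's standard `zlib.crc32`. This is a sketch of the idea only; HDFS's actual on-disk checksum format differs:

```python
import zlib

CHUNK = 512  # the client checksums each 512-byte chunk, per the slide

def checksum_chunks(data: bytes):
    # Compute one CRC32 per 512-byte chunk, as the client does on write.
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored):
    # On read, recompute and compare; a mismatch means this replica is
    # corrupt, and the client would fall back to another replica.
    return checksum_chunks(data) == stored

data = b"x" * 1300                     # spans three chunks (512 + 512 + 276)
stored = checksum_chunks(data)
assert verify(data, stored)
corrupted = data[:600] + b"y" + data[601:]
assert not verify(corrupted, stored)   # flip in the second chunk is detected
```

Checksumming small fixed-size chunks means a single-bit flip pinpoints which chunk is bad, so only that portion needs to be re-read from another replica.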
52. NameNode Failure
• A single point of failure
• The transaction log is stored in multiple directories:
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• A real HA solution still needs to be developed
53. Data Pipelining
• Client retrieves a list of DataNodes on which to place
replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next DataNode
in the Pipeline
• When all replicas are written, the Client moves on to write
the next block in the file
59. Hadoop Web Interface
• MapReduce Job Tracker Web Interface
The job tracker web UI provides information about general job statistics of
the Hadoop cluster, running/completed/failed jobs and a job history log file.
It also gives access to the Hadoop log files of the local machine (the machine on
which the web UI is running).
By default, it's available at http://localhost:50030/
• Task Tracker Web Interface
The task tracker web UI shows you running and non-running tasks. It also
gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50060/
• HDFS Name Node Web Interface
The name node web UI shows you a cluster summary including information
about total/remaining capacity, live and dead nodes. Additionally, it allows
you to browse the HDFS namespace and view the contents of its files in the
web browser. It also gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50070/
61. HBASE
HBase is a database: the Hadoop database. It is indexed by row key,
column key, and timestamp.
HBase stores structured and semistructured data naturally so you can
load it with tweets and parsed log files and a catalog of all your products
right along with their customer reviews.
It can store unstructured data too, as long as it's not too large
HBase is designed to run on a cluster of computers instead of a single
computer. The cluster can be built using commodity hardware; HBase
scales horizontally as you add more machines to the cluster.
62. HBASE (Contd…)
Each node in the cluster provides a bit of storage, a bit of cache,
and a bit of computation as well.
This makes HBase incredibly flexible and forgiving. No node is
unique, so if one of those machines breaks down, you simply
replace it with another.
This adds up to a powerful, scalable approach to data that, until
now, hasn't been commonly available to mere mortals.
63. HBASE DATA MODEL:
HBase data model – these six concepts form the foundation of HBase.
Table:
HBase organizes data into tables. Table names are Strings and composed of characters
that are safe for use in a file system path.
Row:
Within a table, data is stored according to its row. Rows are identified uniquely by
their row key. Row keys don't have a data type and are always treated as a byte[].
Column family:
Data within a row is grouped by column family. Column families also impact the
physical arrangement of data stored in HBase.
For this reason, they must be defined up front and aren't easily modified. Every row
in a table has the same column families, although a row need not store data in all its
families. Column family names are Strings and composed of characters that are safe for
use in a file system path.
64. Column qualifier:
Data within a column family is addressed via its column qualifier, or column.
Column qualifiers need not be specified in advance. Column qualifiers need not be
consistent between rows.
Like row keys, column qualifiers don't have a data type and are always treated as a
byte[].
Cell:
A combination of row key, column family, and column qualifier uniquely identifies
a cell. The data stored in a cell is referred to as that cell's value. Values also don't
have a data type and are always treated as a byte[].
Version:
Values within a cell are versioned. Versions are identified by their timestamp, a
long. When a version isn't specified, the current timestamp is used as the basis for the
operation. The number of cell value versions retained by HBase is configured via the
column family. The default number of cell versions is three.
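The six concepts fit together as nested maps; a toy model (illustrative only, not the HBase API):

```python
import time
from collections import defaultdict

class HBaseTableSketch:
    """Toy model of the HBase data model: row key -> column family ->
    qualifier -> {timestamp: value}, keeping at most max_versions."""
    def __init__(self, families, max_versions=3):  # default: three versions
        self.families = set(families)  # families are fixed when the table is defined
        self.max_versions = max_versions
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts=None):
        assert family in self.families, "column families must be declared up front"
        ts = ts if ts is not None else time.time_ns()
        cell = self.rows[row][family].setdefault(qualifier, {})
        cell[ts] = value
        while len(cell) > self.max_versions:  # drop the oldest versions
            del cell[min(cell)]

    def get(self, row, family, qualifier):
        cell = self.rows[row][family][qualifier]
        return cell[max(cell)]  # newest timestamp wins when none is specified

t = HBaseTableSketch(families=["info"])
for i in range(4):
    t.put(b"row1", "info", "name", "v%d" % i, ts=i)
print(t.get(b"row1", "info", "name"))  # v3
```

Note how the row key and values are plain byte strings with no data type, matching the byte[] semantics above.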
66. HBase Tables and Regions
Table is made up of any number of regions.
Region is specified by its startKey and endKey.
Empty table: (Table, NULL, NULL)
Two-region table: (Table, NULL, “com.ABC.www”) and
(Table, “com.ABC.www”, NULL)
Each region may live on a different node and is made up of
several HDFS files and blocks, each of which is replicated by
Hadoop
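Locating the region responsible for a row key is a range lookup over (startKey, endKey) pairs; a minimal sketch using the two-region example above:

```python
def find_region(regions, row_key):
    """regions: list of (start_key, end_key); None means open-ended.
    Returns the index of the region whose [start, end) range holds row_key."""
    for i, (start, end) in enumerate(regions):
        if (start is None or row_key >= start) and (end is None or row_key < end):
            return i
    raise KeyError(row_key)

# Two-region table split at "com.ABC.www", as on the slide:
regions = [(None, "com.ABC.www"), ("com.ABC.www", None)]
print(find_region(regions, "com.AAA.www"))  # 0: sorts before the split point
print(find_region(regions, "com.XYZ.www"))  # 1: sorts at or after the split point
```

Start keys are inclusive and end keys exclusive here, which is why the split key itself lands in the second region.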
68. Why Next Generation MR
Reliability
Availability
Scalability - Clusters of 10,000 machines and 200,000
cores, and beyond.
Backward (and Forward) Compatibility
Ensure customers’ MapReduce applications run
unchanged in the next version of the framework.
Evolution – Ability for customers to control upgrades to
the Hadoop software stack.
Predictable Latency – A major customer concern.
Cluster utilization
69. Why Next Generation MR
Secondary Requirements
–Support for alternate programming
paradigms to MapReduce.
–Support for short-lived services
73. Resource Manager (RM)
• A pure Scheduler
• No monitoring or tracking of
application status
• No guarantee on restarting
failed tasks.
74. Resource Manager (RM)
• Each client/application may
request multiple resources
– Memory
– Network
– CPU
– Disk, …
• This is a significant change
from static Mapper /
Reducer model
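The move from fixed map/reduce slots to multi-dimensional resource requests can be sketched as a first-fit allocation over per-node capacities (names and numbers are illustrative):

```python
def allocate(request, nodes):
    """First-fit: grant the container to the first node that can satisfy
    every requested dimension (memory in MB, vcores, ...)."""
    for name, capacity in nodes.items():
        if all(capacity.get(dim, 0) >= amount for dim, amount in request.items()):
            for dim, amount in request.items():
                capacity[dim] -= amount  # reserve the resources on that node
            return name
    return None  # no node can host this container right now

nodes = {"node1": {"memory_mb": 4096, "vcores": 2},
         "node2": {"memory_mb": 8192, "vcores": 8}}
print(allocate({"memory_mb": 6144, "vcores": 4}, nodes))  # node2
print(allocate({"memory_mb": 1024, "vcores": 1}, nodes))  # node1
```

A container is just such a granted bundle of resources; nothing in the request says "mapper" or "reducer", which is exactly the change from the static model.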
75. Application Master
• A per-application
ApplicationMaster (AM) that
manages the application's life
cycle (scheduling and
coordination).
• An application is either a single
job in the classic MapReduce
sense, or a DAG of such jobs.
76. Application Master
A per-application
ApplicationMaster (AM) that
manages the application's life
cycle.
77. Application Master
• Application Master has the
responsibility of
– negotiating appropriate resource
containers from the Scheduler
– launching tasks
– tracking their status
– monitoring for progress
– handling task-failures.
78. Node Manager
• The NodeManager is the per-machine
framework agent
– responsible for launching the
applications' containers,
monitoring their resource usage
(cpu, memory, disk, network) and
reporting the same to the
Scheduler.
79. Gain with New Architecture
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
• Support for programming paradigms other than MapReduce
80. Gain with New Architecture
• RM and Job manager segregated
• The Hadoop MapReduce JobTracker
spends a very significant portion of
time and effort managing the life
cycle of applications
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
81. Gain with New Architecture
• ResourceManager
– Uses ZooKeeper for fail-over.
– When primary fails, secondary can
quickly start using the state stored
in ZK
• Application Master
– MapReduce NextGen supports
application specific checkpoint
capabilities for the
ApplicationMaster.
– MapReduce ApplicationMaster can
recover from failures by restoring
itself from state saved in HDFS.
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
82. Gain with New Architecture
• MapReduce NextGen uses wire-
compatible protocols to allow
different versions of servers and
clients to communicate with
each other.
• Rolling upgrades for the cluster
in future.
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
83. Gain with New Architecture
• New framework is generic.
– Can support non-MR parallel
computing techniques
– Different versions of MR running in
parallel
– End users can upgrade to MR versions
on their own schedule
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
84. Gain with New Architecture
• MRv2 uses a general concept of a
resource for scheduling and allocating to
individual applications.
• A container can host a mapper, a
reducer, or any other task
• The rigid notion of fixed Mapper /
Reducer slots is abolished
• Better cluster utilization
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
87. When Hadoop 1.0.0 was released by Apache in 2011, comprising
mainly HDFS and MapReduce, it soon became clear that Hadoop
was not simply another application or service, but a platform
around which an entire ecosystem of capabilities could be built.
Since then, dozens of self-standing software projects have sprung
into being around Hadoop, each addressing a variety of problem
spaces and meeting different needs.
Many of these projects were begun by the same people or
companies who were the major developers and early users of
Hadoop; others were initiated by commercial Hadoop distributors.
The majority of these projects now share a home with Hadoop at
the Apache Software Foundation, which supports open-source
software development and encourages the development of the
communities surrounding these projects.
89. SQOOP
Data Import/ Export.
Sqoop is a tool designed to help
users import data from existing
relational databases into their Hadoop
clusters.
Automatic data import.
Easily imports data from many
databases into Hadoop.
Generates code for use in MapReduce
applications.
Source: Big Data Analytics with Hadoop
90. Sqoop is a tool designed to transfer data between Hadoop
and relational databases.
You can use Sqoop to import data from a relational database
management system (RDBMS) such as MySQL or Oracle into
the Hadoop Distributed File System (HDFS), transform the
data in Hadoop MapReduce, and then export the data back
into an RDBMS.
What is Sqoop?
93. HIVE
Hive is a data warehouse infrastructure built on top of
Hadoop for providing data summarization, query, and
analysis.
– ETL.
– Structure.
– Access to different storage.
– Query execution via MapReduce.
While initially developed by Facebook, Apache Hive is now
used and developed by other companies such as Netflix.
Key Building Principles:
– SQL is a familiar language
– Extensibility – Types, Functions, Formats, Scripts
– Performance
95. Hive, Why?
• Need a Multi Petabyte Warehouse
• Files are insufficient data abstractions
– Need tables, schemas, partitions, indices
• SQL is highly popular
• Need for an open data format
– RDBMSs have closed data formats
– Need a flexible schema
• Hive is a Hadoop subproject!
96. Hadoop & Hive History
• Dec 2004 – Google GFS paper published
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Becomes Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000 node test cluster
• Sept 2008 – Hive becomes a Hadoop subproject
98. Data model
• Hive structures data into well-understood database concepts
such as tables, rows, columns, and partitions
• It supports primitive types: integers, floats, doubles, and
strings
• Hive also supports:
– associative arrays: map<key-type, value-type>
– lists: list<element type>
– structs: struct<field name: field type…>
• SerDe: a serialize/deserialize API used to move data
in and out of tables
99. Query Language (HiveQL)
• Subset of SQL
• Meta-data queries
• Limited equality and join predicates
• No inserts into existing tables (to preserve the
write-once, read-many property)
– Can overwrite an entire table
101. Hive - DDL
Alter table
hive> ALTER TABLE customer ADD COLUMNS ( age INT) ;
Drop table
hive> DROP TABLE customer;
102. HiveQL Examples
HiveQL, an SQL-like language
hive> SELECT a.age FROM customer a WHERE a.sdate = '2008-08-15';
selects the age column for one partition of the table but does not store it
hive> INSERT OVERWRITE DIRECTORY '/data/hdfs_file'
SELECT a.* FROM customer a WHERE a.sdate = '2008-08-15';
writes the selected customer rows to an HDFS directory
103. Wordcount in Hive
FROM (
MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
FROM docs
CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
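The streaming scripts themselves are not shown on the slide; a minimal sketch of what wc_mapper.py and wc_reduce.py might contain (hypothetical implementations, written as testable functions rather than raw stdin loops):

```python
from itertools import groupby

def wc_mapper(lines):
    # wc_mapper.py would read rows of doctext and emit one
    # (word, 1) pair per word, tab-separated on stdout.
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def wc_reduce(pairs):
    # wc_reduce.py receives its input clustered by word (CLUSTER BY word),
    # so a single grouped pass can sum the counts per word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(cnt for _, cnt in group)

docs = ["big data is big"]
counts = dict(wc_reduce(sorted(wc_mapper(docs))))
print(counts)  # {'big': 2, 'data': 1, 'is': 1}
```

The `sorted` call stands in for the clustering Hive performs between the MAP and REDUCE clauses.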
104. Hive Usage in Facebook
• Hive and Hadoop are extensively used in Facebook for
different kinds of operations.
• 700 TB = 2.1 PB after (3x) replication!
• Think of other application model that can leverage
Hadoop MR.
105. Hive – Related Projects
Apache Flume – move large data sets to Hadoop
Apache Sqoop – cmd line, move rdbms data to Hadoop
Apache Hbase – Non relational database
Apache Pig – analyse large data sets
Apache Oozie – work flow scheduler
Apache Mahout – machine learning and data mining
Apache Hue – Hadoop user interface
Apache ZooKeeper – distributed coordination / configuration
107. Introduction
• What is Pig?
– An open-source high-level dataflow system
– Provides a simple language for queries and data
manipulation, Pig Latin, that is compiled into map-reduce
jobs that are run on Hadoop
– Pig Latin combines the high-level data manipulation
constructs of SQL with the procedural programming of
map-reduce
• Why is it important?
– Companies and organizations like Yahoo, Google and
Microsoft are collecting enormous data sets in the form of
click streams, search logs, and web crawls
– Some form of ad-hoc processing and analysis of all of this
information is required
108. Existing Solutions
• Parallel database products (ex: Teradata)
– Expensive at web scale
– Data analysis programmers find the declarative SQL
queries to be unnatural and restrictive
• Raw map-reduce
– Complex n-stage dataflows are not supported; joins
and related tasks require workarounds or custom
implementations
– Resulting code is difficult to reuse and maintain; shifts
focus and attention away from data analysis
109. Language Features
• Several options for user-interaction
– Interactive mode (console)
– Batch mode (prepared script files containing Pig Latin commands)
– Embedded mode (execute Pig Latin commands within a Java program)
• Built primarily for scan-centric workloads and read-only data
analysis
– Easily operates on both structured and schema-less, unstructured data
– Transactional consistency and index-based lookups not required
– Data curation and schema management can be overkill
• Flexible, fully nested data model
• Extensive UDF support
– Currently must be written in Java
– Can be written for filtering, grouping, per-tuple processing, loading and
storing
110. Pig Latin vs. SQL
• Pig Latin is procedural (dataflow programming model)
– Step-by-step query style is much cleaner and easier to write and
follow than trying to wrap everything into a single block of
SQL
Source: http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html
111. Pig Latin vs. SQL (continued)
• Lazy evaluation (data not processed prior to STORE command)
• Data can be stored at any point during the pipeline
• An execution plan can be explicitly defined
– No need to rely on the system to choose the desired plan via optimizer hints
• Pipeline splits are supported
– SQL requires the join to be run twice or materialized as an intermediate result
Source: http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html
112. Data Model
• Supports four basic types
– Atom: a simple atomic value (int, long, double, string)
• ex: 'Peter'
– Tuple: a sequence of fields that can be any of the data types
• ex: ('Peter', 14)
– Bag: a collection of tuples of potentially varying structures,
can contain duplicates
• ex: {('Peter'), ('Bob', (14, 21))}
– Map: an associative array, the key must be a chararray but
the value can be any type
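The four types map onto rough Python analogues (illustrative only; Pig's own types are defined by Pig Latin, not Python):

```python
# Pig's four basic types as Python stand-ins:
atom = "Peter"                          # Atom: a simple atomic value
tup = ("Peter", 14)                     # Tuple: ordered fields of any type
bag = [("Peter",), ("Bob", (14, 21))]   # Bag: tuples of varying structure,
                                        # duplicates allowed (so a list, not a set)
amap = {"age": 14, "name": "Peter"}     # Map: chararray key, any value type

# The model is fully nested: a tuple field can itself be a bag or a map.
nested = ("Peter", [("cat",), ("dog",)], {"city": "Pune"})
print(nested[1][0][0])  # cat
```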
113. Data Model (continued)
• By default Pig treats undeclared fields as bytearrays
(collection of uninterpreted bytes)
• Can infer a field's type based on:
– Use of operators that expect a certain type of field
– UDFs with a known or explicitly set return type
– Schema information provided by a LOAD function
or explicitly declared using an AS clause
• Type conversion is lazy
114. Pig problem
• Fragment-replicate; skewed; merge join
• User has to know when to use which join
• Because… Pig is a domestic animal;
it does whatever you tell it to do.
– Alan Gates
Images from http://wiki.apache.org/pig/PigTalksPapers
116. Hue – What is it?
Hue = Hadoop User Experience
Hue is an open-source Web interface that supports Apache
Hadoop and its ecosystem, licensed under the Apache v2 license.
Its main goal is to have the users "just use" Hadoop without
worrying about the underlying complexity or using a command
line
An open source Hadoop GUI
Developed by Cloudera
Web based
Many functions
117. Hue – Why ???
It is widely used
It ships with Hadoop
It integrates with Hadoop tools, e.g.
Hive
Oozie
HDFS
It has an API for app creation
118. Hue Features
HDFS file browser
Job browser / designer
Hive / Pig query editor
Oozie app for work flows
Has Hadoop API
Access to shell
User Admin
App for Solr searches
125. What is Apache Flume?
● It is a distributed data collection service that gets
flows of data (like logs) from their source and
aggregates them to where they have to be processed.
● Goals: reliability, scalability, extensibility,
manageability.
Exactly what I needed!
126. The Flume Model: Flows and Nodes
● A flow corresponds to a type of data source (server
logs, machine monitoring metrics...).
● Flows are comprised of nodes chained together.
127. The Flume Model: Flows and Nodes
● In a Node, data come in through a source, are optionally processed
by one or more decorators, and are then transmitted out via a sink.
● Source examples: Console, Exec, Syslog, IRC, Twitter, other nodes...
● Decorator examples: wire batching, compression, sampling, projection, extraction...
● Sink examples: Console, local files, HDFS, S3, other nodes...
128. The Flume Model: Agent, Processor and Collector Nodes
● Agent: receives data from an application.
● Processor (optional): intermediate processing.
● Collector: writes data to permanent storage.
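The source → decorators → sink pipeline of a node can be sketched in a few lines (a toy model, not Flume's Java Source/Sink API):

```python
def make_node(source, sink, decorators=()):
    """A Flume-style node: pull events from a source, pass them through
    the decorators in order, then hand each result to a sink."""
    def run():
        for event in source():
            for decorate in decorators:
                event = decorate(event)
            sink(event)
    return run

collected = []
agent = make_node(
    source=lambda: iter(["ERROR disk full", "INFO ok"]),  # e.g. tailing a log
    decorators=(str.lower,),                              # e.g. normalization
    sink=collected.append,                                # e.g. forward to a collector
)
agent()
print(collected)  # ['error disk full', 'info ok']
```

Chaining nodes is just wiring one node's sink to the next node's source, which is how agent, processor and collector tiers compose.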
129. The Flume Model: Data and Control
Path (1/2)
Nodes are in the data path.
130. The Flume Model: Data and Control
Path (2/2)
Masters are in the control path.
● Centralized point of configuration; multiple masters coordinate via ZooKeeper.
● Specify sources, sinks and control data flows.
134. Flume Goals: Extensibility
Simple Source and Sink API
Event streaming and composition of simple
operation
Plug in Architecture
Add your own sources, sinks, decorators
136. Conclusion
Big data is here to stay. It is impossible to imagine
the next generation of applications without them consuming data,
producing new forms of data, and containing data-driven
algorithms.
As compute environments become cheaper, application
environments become networked over the cloud. So security,
access control, compression and encryption introduce
challenges that have to be addressed in a systematic manner.
137. References
[1] Chris Eaton, Dirk Deroos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big
Data: Analytics for Enterprise Class Hadoop and Streaming Data, pp. 3-49.
[2] Mike Barlow, Real-Time Data Analytics: Emerging Architecture, February 2013, first edition,
pp. 1-21.
[3] Sachidanand Singh, Nirmala Singh, Big Data Analytics, 2012 International Conference on
Communication, Information & Computing Technology (ICCICT), Oct. 19-20, Mumbai,
India
[4] Big Data Introduction, www.youtube.com/watch?v=e6kovHZ6FVc
[5] Hadoop Video, www.youtube.com/watch?v=OoEpfbbyga8
[6] Cloud Security Alliance, Big Data Security and privacy issues, November 2012.
[7] http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
[8] http://public.yahoo.com/gogate/hadoop-tutorial/start-tutorial.html
[9] http://www.youtube.com/watch?v=5Eib_H_zCEY&feature=related
[10] http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=related
[11] http://labs.google.com/papers/gfs-sosp2003.pdf
[12] http://hadoop.apache.org/core/docs/current/hdfs_design.html
[13] http://hadoop.apache.org/core/docs/current/api/
[14] http://hadoop.apache.org/hive/
[15] http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsieh_hadoop_log_processing
[16] http://www.slideshare.net/cloudera/inside-flume
[17] http://www.slideshare.net/cloudera/flume-intro100715
[18] http://www.slideshare.net/cloudera/flume-austin-hug-21711