Introduction to Big Data
By: Haluan Mohammad Irsad
Definition
Big data can be defined by the 3Vs (three Vs):
– Volume, starts as low as 1 terabyte and has no upper limit.
– Velocity, data volume per unit time, should be at least 30 KB/sec.
– Variety, adds unstructured and semi-structured data to structured data.
Volume
Big data is composed of huge numbers of very small transactions that come in a variety of
formats.
Data produces true value only after it has been aggregated and analyzed.
Velocity
Required latency is less than 100 ms, measured from the time the data is created to the time
the response is produced.
The throughput requirement can easily be as high as 1,000 messages per second.
Variety
Composed of a combination of datasets with differing underlying structures (structured,
semi-structured, or unstructured).
Heterogeneous formats: graphics, JSON, XML, CSV, and log files.
Identifying by the Sources
Data nowadays is generated by:
• Humans
• Machines
• Sensors
Typical sources:
• Social media
• Financial transactions
• Health records
• Click streams
• Log files
• Internet of Things
Problem
■ Managing the volume of data, caused by the overwhelming volume of incoming data
■ Maintaining system performance, caused by the low velocity of data access
■ Avoiding disjunction of data, caused by the variety of data structures and formats
How do we accomplish these?
Managing Volume
Use a scalable database: a NoSQL DBMS (MongoDB, Cassandra DB, Titan DB)
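For illustration, a minimal sketch of storing a semi-structured event in MongoDB with the Java sync driver; the connection URI, database name, and collection name are placeholder assumptions. Because documents are schemaless, records of differing shapes can sit in the same collection, which is what lets the store grow with the data.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class EventStore {
    public static void main(String[] args) {
        // Placeholder URI; point it at your own MongoDB deployment.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("bigdata_demo");
            MongoCollection<Document> events = db.getCollection("events");

            // Schemaless document: structured fields plus whatever else the source emits.
            events.insertOne(new Document("source", "clickstream")
                    .append("userId", 42)
                    .append("url", "/products/123")
                    .append("ts", System.currentTimeMillis()));

            System.out.println("events stored: " + events.countDocuments());
        }
    }
}
```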
Maintaining Performance
■ For Batch Processing, use Hadoop MapReduce
■ For Stream Processing, use Apache Spark, Apache Storm, Apache Drill
Avoiding Disjunction
■ Use a flat storage architecture / data lake to hold huge volumes of multi-structured
data.
■ Use the Hadoop Distributed File System (HDFS) to distribute that storage across machines.
Digging deeper into Hadoop
What is Hadoop?
A framework that allows for the distributed processing of large data sets across clusters
of computers using simple programming models.
Why Hadoop?
■ The most proven framework in the industry today
■ Open Source
■ Rich features & functionalities
■ Rich plugin support in the ecosystem
=> (https://hadoopecosystemtable.github.io/)
When to use Hadoop?
■ For processing large volumes of data
■ For parallel data processing
■ For storing a diverse set of data
When not to use Hadoop?
■ For a relational database system
■ For a general network file system
■ For non-parallel data processing
Core Functions
■ Data Storage
■ Data Processing
■ Resource Management
Data Storage
Hadoop uses HDFS (Hadoop Distributed File System) to store the data.
HDFS is a distributed file system designed to be fault tolerant and to be deployed on low-cost
hardware.
HDFS’ Goals
■ Hardware Failure, detection of faults and quick automatic recovery.
■ Streaming Data Access, emphasis on high throughput of data access.
■ Large Data Sets, provide high aggregate data bandwidth and scale to hundreds of
nodes in a single cluster and support tens of millions of files in a single instance.
■ Simple Coherency Model, HDFS applications need a write-once-read-many access
model for files.
■ Portability, designed to be easily moved from one platform to another.
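To make the write-once-read-many model concrete, a minimal sketch of writing one file into HDFS with the Java FileSystem API; the NameNode URI and the data-lake path are placeholder assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsWriteOnce {
    public static void main(String[] args) throws Exception {
        // Normally picked up from core-site.xml; the URI here is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/datalake/raw/events/2017-01-01.log");

            // Write once: create the file, stream the bytes, close it.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("user=42,action=click,url=/products/123\n"
                        .getBytes(StandardCharsets.UTF_8));
            }

            // Read many: any node in the cluster can now open and scan the file.
            System.out.println("stored " + fs.getFileStatus(file).getLen() + " bytes in HDFS");
        }
    }
}
```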
Data Processing
Data processing can be done in two ways: batch and real-time.
■ Batch processing, execution of a series of jobs.
– Use Hadoop MapReduce
■ Real-time processing, near-instantaneous execution of jobs as data arrives.
– Use Apache Spark
Hadoop MapReduce
• MapReduce is a parallel, distributed processing model that can be used to process
large amounts of data in batches and transform them into manageable-size data.
• This work is done in two steps:
1. Map the Data: the input data is divided into fragments, the fragments are
assigned to map tasks, and the data is emitted as key-value pairs.
2. Reduce the Data: this stage is the combination of the Shuffle stage and the
Reduce stage; its goal is to process the output of the map tasks, then produce a
new set of output, which is stored in HDFS.
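A minimal sketch of these two steps is the classic word count job, written here against the Hadoop MapReduce Java API; the input and output HDFS paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: split each input line into (word, 1) key-value pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: after the shuffle groups pairs by word, sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```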
Apache Spark
Is a compute engine for Hadoop data that provides an
expressive programming model (Spark SQL), stream
processing, machine learning (MLlib), and graph
computation (GraphX).
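For illustration, a small Spark job in Java using the DataFrame/Spark SQL API; the HDFS path and the `url` column are placeholder assumptions about the log format.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;

public class SparkLogSummary {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("log-summary")
                .getOrCreate();

        // Read JSON log records from a placeholder data-lake path into a DataFrame.
        Dataset<Row> logs = spark.read().json("hdfs:///datalake/raw/events/");

        // Spark SQL-style aggregation: number of hits per URL, highest first.
        logs.groupBy(col("url"))
            .agg(count("*").alias("hits"))
            .orderBy(col("hits").desc())
            .show(10);

        spark.stop();
    }
}
```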
Resource Management
Manages all resources in the Hadoop cluster: monitors for faults, schedules jobs, and
performs quick automatic recovery.
Hadoop uses YARN.
Hadoop YARN
• The ResourceManager is the ultimate authority
that arbitrates resources among all the
applications in the system (cluster).
• The NodeManager is the per-machine framework
agent that is responsible for containers,
monitoring their resource usage (CPU, memory,
disk, network) and reporting the same to the
ResourceManager.
YARN (cont'd)
■ The Scheduler is responsible for allocating resources to the various running
applications subject to familiar constraints of capacities, queues, etc.
– Performs no monitoring or tracking of status for the application
– No guarantees about restarting failed tasks either due to application failure or
hardware failures
– Performs its scheduling function based on the resource requirements of the
applications
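For illustration, a minimal sketch that asks the ResourceManager for its application reports through the YARN client API; it assumes a yarn-site.xml pointing at your cluster is on the classpath.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterStatus {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // List the applications the ResourceManager is arbitrating resources for.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.printf("%s  %s  state=%s  queue=%s%n",
                    app.getApplicationId(), app.getName(),
                    app.getYarnApplicationState(), app.getQueue());
        }

        yarn.stop();
    }
}
```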
Analyzing Data
Process of inspecting, cleansing, transforming, and modeling data with the goal of
discovering useful information, suggesting conclusions, and supporting decision-
making.
The goal of analyzing data is to help your business grow.
Hadoop supports this activity with help from Apache Mahout.
Apache Mahout
A library that helps create machine learning
applications.
Its main functions help solve:
1. Classification, assigning a set of data to known
categories.
2. Clustering, grouping a set of objects based on
their similarity.
3. Recommendation, giving a list of
recommendations based on statistical analysis.
Mahout provides algorithms to solve all of the problems
above and allows them to be customized on demand.
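For illustration, a minimal sketch of the recommendation case using Mahout's Taste user-based collaborative filtering API; the ratings file, neighbourhood size, and user ID are placeholder assumptions.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ProductRecommender {
    public static void main(String[] args) throws Exception {
        // ratings.csv holds "userId,itemId,preference" lines; the file name is a placeholder.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Measure how alike two users' ratings are, then keep the 10 nearest neighbours.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // Recommend items that similar users liked but this user has not rated yet.
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```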
Visualization
Hadoop by default doesn't support data visualization.
To visualize the data, use Apache Zeppelin (http://zeppelin.apache.org/).
Apache Zeppelin
Apache Zeppelin runs on top of Apache Spark, but
provides pluggable interpreter APIs to support other
data processing systems.
Benefits
Hadoop gives some benefits:
■ Ease of scaling
Hadoop is designed as a distributed system
■ Performance
Hadoop is designed for distributed and parallel processing
■ Availability & Reliability
The Hadoop platform provides data protection and automatic failover
configuration
Conclusion
■ Big data is not a barrier, but simply data that needs to be managed properly.
■ Use the proper tools to manage it.
■ Prepare a strategy for processing the data (batch or stream).
■ Manage and maintain the system carefully.
■ Use the plugins needed by your functional requirements.
■ Grow your business with a data-driven approach.
FIN

Editor's Notes

  1. An RDBMS can scale on read operations, but to scale write operations you need to drop ACID requirements, which violates RDBMS core rules.
  2. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
  3. Batch: “where data is collected and then processed as one unit with processing completion times on the order of hours or days”