Introduction to Big Data
By: Haluan Mohammad Irsad
Definition
Big data can be defined by the 3Vs (three Vs):
– Volume, starts as low as 1 terabyte and has no upper limit.
– Velocity, data volume per unit time, should be at least 30 KB/sec.
– Variety, adds unstructured and semi-structured data to structured data.
Volume
Big data is composed of huge numbers of very small transactions that come in a variety of
formats.
Data produces true value only after it has been aggregated and analyzed.
Velocity
Required latency is less than 100 ms, measured from the time the data is created to the time
the response is produced.
The throughput requirement can easily be as high as 1,000 messages per second.
Variety
Composed of a combination of datasets with differing underlying structures (structured,
semi-structured, or unstructured).
Heterogeneous formats: graphics, JSON, XML, CSV, and log files.
Identifying by the Sources
Data nowadays is generated by:
• Humans
• Machines
• Sensors
Typical sources:
• Social media
• Financial transactions
• Health records
• Click streams
• Log files
• Internet of Things
Problem
■ Managing the volume of data, caused by the overwhelming volume of incoming data
■ Maintaining system performance, caused by the low velocity of data access
■ Avoiding disjunction of data, caused by the variety of data structures and formats
How do we accomplish these?
Managing Volume
Use a scalable database: a NoSQL DBMS (MongoDB, Cassandra DB, Titan DB)
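For illustration, a minimal sketch of storing a semi-structured event in MongoDB with the Java sync driver; the connection URI, database name, and collection name are placeholder assumptions. Because documents are schemaless, records of differing shapes can sit in the same collection, which is what lets the store grow with the data.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class EventStore {
    public static void main(String[] args) {
        // Placeholder URI; point it at your own MongoDB deployment.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("bigdata_demo");
            MongoCollection<Document> events = db.getCollection("events");

            // Schemaless document: structured fields plus whatever else the source emits.
            events.insertOne(new Document("source", "clickstream")
                    .append("userId", 42)
                    .append("url", "/products/123")
                    .append("ts", System.currentTimeMillis()));

            System.out.println("events stored: " + events.countDocuments());
        }
    }
}
```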
Maintaining Performance
■ For Batch Processing, use Hadoop MapReduce
■ For Stream Processing, use Apache Spark, Apache Storm, Apache Drill
Avoiding Disjunction
■ Use a flat storage architecture / data lake to hold huge volumes of multi-structured
data.
■ Use the Hadoop Distributed File System (HDFS) to distribute that storage across machines.
Digging deeper into Hadoop
What is Hadoop?
A framework that allows for the distributed processing of large data sets across clusters
of computers using simple programming models.
Why Hadoop?
■ The most proven framework in the industry today
■ Open Source
■ Rich features & functionalities
■ Rich plugin support in the ecosystem
=> (https://hadoopecosystemtable.github.io/)
When to use Hadoop?
■ For processing large volumes of data
■ For parallel data processing
■ For storing a diverse set of data
When not to use Hadoop?
■ For a relational database system
■ For a general network file system
■ For non-parallel data processing
Core Functions
■ Data Storage
■ Data Processing
■ Resource Management
Data Storage
Hadoop uses HDFS (Hadoop Distributed File System) to store the data.
HDFS is a distributed file system designed to be fault tolerant and to be deployed on low-cost
hardware.
HDFS’ Goals
■ Hardware Failure, detection of faults and quick automatic recovery.
■ Streaming Data Access, emphasis on high throughput of data access.
■ Large Data Sets, provide high aggregate data bandwidth and scale to hundreds of
nodes in a single cluster and support tens of millions of files in a single instance.
■ Simple Coherency Model, HDFS applications need a write-once-read-many access
model for files.
■ Portability, designed to be easily moved from one platform to another.
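To make the write-once-read-many model concrete, a minimal sketch of writing one file into HDFS with the Java FileSystem API; the NameNode URI and the data-lake path are placeholder assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsWriteOnce {
    public static void main(String[] args) throws Exception {
        // Normally picked up from core-site.xml; the URI here is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/datalake/raw/events/2017-01-01.log");

            // Write once: create the file, stream the bytes, close it.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("user=42,action=click,url=/products/123\n"
                        .getBytes(StandardCharsets.UTF_8));
            }

            // Read many: any node in the cluster can now open and scan the file.
            System.out.println("stored " + fs.getFileStatus(file).getLen() + " bytes in HDFS");
        }
    }
}
```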
Data Processing
Data processing can be done in two ways: batch and real-time.
■ Batch processing, execution of a series of jobs.
– Use Hadoop MapReduce
■ Real-time processing, near-instantaneous execution of jobs as data arrives.
– Use Apache Spark
Hadoop MapReduce
• MapReduce is a parallel, distributed processing model that can be used to process
large amounts of data in batches and transform them into manageable-size data.
• This work is done in two steps:
1. Map the Data: the input data is divided into fragments, the fragments are
assigned to map tasks, and the data is emitted as key-value pairs.
2. Reduce the Data: this stage is the combination of the Shuffle stage and the
Reduce stage; its goal is to process the output of the map tasks, then produce a
new set of output, which is stored in HDFS.
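A minimal sketch of these two steps is the classic word count job, written here against the Hadoop MapReduce Java API; the input and output HDFS paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: split each input line into (word, 1) key-value pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: after the shuffle groups pairs by word, sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```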
Apache Spark
Is a compute engine for Hadoop data that provides an
expressive programming model (Spark SQL), stream
processing, machine learning (MLlib), and graph
computation (GraphX).
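For illustration, a small Spark job in Java using the DataFrame/Spark SQL API; the HDFS path and the `url` column are placeholder assumptions about the log format.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;

public class SparkLogSummary {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("log-summary")
                .getOrCreate();

        // Read JSON log records from a placeholder data-lake path into a DataFrame.
        Dataset<Row> logs = spark.read().json("hdfs:///datalake/raw/events/");

        // Spark SQL-style aggregation: number of hits per URL, highest first.
        logs.groupBy(col("url"))
            .agg(count("*").alias("hits"))
            .orderBy(col("hits").desc())
            .show(10);

        spark.stop();
    }
}
```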
Resource Management
Manages all resources in the Hadoop cluster: monitors for faults, schedules jobs, and
performs quick automatic recovery.
Hadoop uses YARN.
Hadoop YARN
• The ResourceManager is the ultimate authority
that arbitrates resources among all the
applications in the system (cluster).
• The NodeManager is the per-machine framework
agent that is responsible for containers,
monitoring their resource usage (CPU, memory,
disk, network) and reporting the same to the
ResourceManager.
YARN (cont'd)
■ The Scheduler is responsible for allocating resources to the various running
applications subject to familiar constraints of capacities, queues, etc.
– Performs no monitoring or tracking of status for the application
– No guarantees about restarting failed tasks either due to application failure or
hardware failures
– Performs its scheduling function based on the resource requirements of the
applications
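For illustration, a minimal sketch that asks the ResourceManager for its application reports through the YARN client API; it assumes a yarn-site.xml pointing at your cluster is on the classpath.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterStatus {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // List the applications the ResourceManager is arbitrating resources for.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.printf("%s  %s  state=%s  queue=%s%n",
                    app.getApplicationId(), app.getName(),
                    app.getYarnApplicationState(), app.getQueue());
        }

        yarn.stop();
    }
}
```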
Analyzing Data
Process of inspecting, cleansing, transforming, and modeling data with the goal of
discovering useful information, suggesting conclusions, and supporting decision-
making.
The goal of analyzing data is to help your business grow.
Hadoop supports this activity with help from Apache Mahout.
Apache Mahout
A library that helps create machine learning
applications.
Its main functions help solve:
1. Classification, assigning a set of data to known
categories.
2. Clustering, grouping a set of objects based on
their similarity.
3. Recommendation, giving a list of
recommendations based on statistical analysis.
Mahout provides algorithms to solve all of the problems
above and allows them to be customized on demand.
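For illustration, a minimal sketch of the recommendation case using Mahout's Taste user-based collaborative filtering API; the ratings file, neighbourhood size, and user ID are placeholder assumptions.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ProductRecommender {
    public static void main(String[] args) throws Exception {
        // ratings.csv holds "userId,itemId,preference" lines; the file name is a placeholder.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Measure how alike two users' ratings are, then keep the 10 nearest neighbours.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // Recommend items that similar users liked but this user has not rated yet.
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```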
Visualization
Hadoop by default doesn't support data visualization.
To visualize the data, use Apache Zeppelin (http://zeppelin.apache.org/).
Apache Zeppelin
Apache Zeppelin runs on top of Apache Spark, but
provides pluggable interpreter APIs to support other
data processing systems.
Benefits
Hadoop gives some benefits:
■ Ease of scaling
Hadoop is designed as a distributed system
■ Performance
Hadoop is designed for distributed and parallel processing
■ Availability & Reliability
The Hadoop platform provides data protection and automatic failover
configuration
Conclusion
■ Big data is not a barrier, but simply data that needs to be managed properly.
■ Use the proper tools to manage it.
■ Prepare a strategy for processing the data (batch or stream).
■ Manage and maintain the system carefully.
■ Use the plugins needed by your functional requirements.
■ Grow your business with a data-driven approach.
FIN

Editor's Notes

  1. An RDBMS can scale on read operations, but to scale write operations you need to drop ACID requirements, which violates RDBMS core rules.
  2. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
  3. Batch: “where data is collected and then processed as one unit with processing completion times on the order of hours or days”