Big data introduction, Hadoop in details

Introduction to Big data
When: Tuesday 15-11-2016, 06:00 PM - 08:00 PM
Where: Badir Program for Technology Incubators
#DataRiyadh DataGeeks DataGeeksarabia
An in-depth introduction to big data, along with practical advice on how to start a career in this hot field.
Be ready to digest a concentrated big data tablet that will put you on the right track.
Presented by Mahmoud Yassin

Agenda:
Data nowadays:
-Data types
-Fun facts about data nowadays
-Where we generate data
-Effect of a lack of data on business decisions
-Future of data size
Big Data:
-What’s big data?
-How big is big data?
-The famous Vs of big data
-Challenges of dealing with such amounts of data
-Why consider a career in big data?
Unlocking Big data solutions:
-Hadoop
-Hadoop ecosystem zoo
-Big data landscape
-Top big data companies
-How to start a career in big data
-Questions

Data unit measures:

Data Types:
Structured data: information with a degree of organization that is readily searchable and quickly consolidated into facts.
Examples: RDBMS, spreadsheets
Unstructured data: information with a lack of structure that is time- and energy-consuming to search, find, and consolidate into facts.
Examples: email, documents, images, reports
Semi-structured data: XML data

Challenges for Unstructured data:
How do you store billions of files?
How long does it take to migrate hundreds of terabytes of data every 3-5 years?
Data has no structure
Resource limitations, data redundancy, data backup

Sources of data generation:
Social media, sensors, cell phones, GPS, purchases, WWW, e-mails, media streaming, healthcare, IoT

Facts about data:
70% of data is created by individuals, but enterprises are responsible for storing and managing 80% of it.
52% of travelers use social media to plan their vacations.
35% of purchases on Amazon come through recommendations.
75% of what people watch on Netflix comes from recommendations.

Lack of data and business decisions:

Future of data generation:

Can a traditional DBMS solve this?
Size:
First, data size has increased tremendously, into the range of petabytes (one petabyte = 1,024 terabytes). An RDBMS finds it challenging to handle such huge data volumes. To cope, an RDBMS scales up vertically by adding more central processing units (CPUs) or more memory to the database management system.
Data types:
Second, the majority of the data comes in a semi-structured or unstructured format from social media, audio, video, texts, and emails. Unstructured data is outside the purview of an RDBMS because relational databases can’t readily categorize it. They’re designed to accommodate structured data such as weblog, sensor, and financial data.
Velocity:
Third, “big data” is generated at a very high velocity. An RDBMS struggles with high velocity because it’s designed for steady data retention rather than rapid growth.
Cost:
Finally, even if an RDBMS is used to handle and store “big data,” it turns out to be very expensive.
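To make the scale in the Size point concrete, here is a minimal Python sketch that converts between data units using the binary convention quoted there (one petabyte = 1,024 terabytes). The helper name `convert_units` and the unit list are illustrative assumptions, not something from the talk.

```python
# Binary data units: each one is 1,024 times larger than the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def convert_units(value, from_unit, to_unit):
    """Convert a quantity between binary data units (hypothetical helper)."""
    # Positive steps: converting to a smaller unit, so the number grows.
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * (1024 ** steps)


if __name__ == "__main__":
    print(convert_units(1, "PB", "TB"))    # terabytes in one petabyte
    print(convert_units(2.5, "PB", "GB"))  # gigabytes in 2.5 petabytes
```

The same helper works in either direction, so it can also show how small a terabyte-scale database is relative to a petabyte-scale data set.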

What is Big data:
Big data is a term that describes the large volume of data, both structured and unstructured, that is generated on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.

The Vs:
Velocity Volume Variety
Veracity Variability Visualization Value

Big data in action:
UPS stores a large amount of data, much of which comes from GPS sensors in its vehicles.
ORION (On-Road Integrated Optimization and Navigation)
Data analytics and data science
The world's largest operations research project
Savings of more than 8.4 million gallons of fuel by cutting 85 million miles off of daily routes
Saved $30 million/Day

Big data in action:
Walmart collects 2.5 petabytes of information from 1 million customers across 6,000 stores.
Big data system (Kosmix)
Pricing strategies
Advertising campaigns
30% on their online sales
Revenue increased by 40%
“We want to know what every product in the world is. We want to know who every person in the world is. And we want to have the ability to connect them together in a transaction.”
-Neil Ashe, CEO of Global E-commerce at Walmart

Big data in action:
Based on data analysis on a big data platform:
- What users purchased in the past
- Which items they have in their shopping cart
- Which items customers rated and liked
- What influence the ratings had on other customers’ purchases
Personalization of the online store
based on your previous searches

Big data in action:
Zynga collects over 25 Terabytes / Day from FarmVille to drive higher in-game purchases.

Big data in quotes:
“Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.” – Geoffrey Moore, management consultant and author
“Data is the new science. Big Data holds the answers.” – Pat Gelsinger, Chief Executive Officer of VMware
“With too little data, you won’t be able to make any conclusions that you trust. With loads of data you will find relationships that aren’t real… Big data isn’t about bits, it’s about talent.” – Douglas Merrill, CEO and founder of ZestFinance.com
“The world is one big data problem.” – Andrew McAfee, MIT

Big data market forecast:
The “big data” market is expected to cross $50 billion by 2017.

Big data jobs trend:
The median advertised salary for professionals with big data expertise is $124,000 a year.
IBM, Cisco and Oracle together advertised 26,488 open positions that required big data expertise in the last twelve months.
$124,000 ≈ 465,012 SAR per year
465,012 / 12 = 38,751 SAR per month
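The salary conversion above can be reproduced with a few lines of Python. This is only a sketch: the exchange rate of 3.7501 SAR per USD is an assumption inferred from the slide's own figures (465,012 / 124,000), and the function name is hypothetical.

```python
# Assumption: rate implied by the slide's figures (465,012 SAR / 124,000 USD).
SAR_PER_USD = 3.7501


def annual_usd_to_monthly_sar(annual_usd, rate=SAR_PER_USD):
    """Return (annual salary in SAR, monthly salary in SAR), both rounded."""
    annual_sar = round(annual_usd * rate)
    monthly_sar = round(annual_sar / 12)
    return annual_sar, monthly_sar


if __name__ == "__main__":
    annual, monthly = annual_usd_to_monthly_sar(124_000)
    print(f"{annual:,} SAR/year, {monthly:,} SAR/month")
```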

How to solve big data
Hadoop is a big data analysis engine.

What is Hadoop
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models.
Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware. It provides massive storage for any
kind of data, enormous processing power and the ability to handle virtually limitless
concurrent tasks or jobs.

Hadoop history
Nutch is a mature, production-ready web crawler that enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are great for batch processing.

Why is Hadoop important?
Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.
Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
Horizontal scaling means that you scale by adding more machines to your pool of resources.
Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine.

How is Hadoop being used?
Going beyond its original goal of searching millions (or billions) of web pages and returning
relevant results, many organizations are looking to Hadoop as their next big data platform.
Popular uses today include:

Hadoop family:
Data Storage:
Data Ingestion:
Data Processing:
Cluster Management:

Hadoop ecosystem
Cluster monitoring, provisioning and management

Hadoop | Data Ingestion
Apache Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and structured data stores such as relational databases.

Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data.
It is robust and fault tolerant with tunable reliability mechanisms and many
failover and recovery mechanisms.

Storm is a real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.
A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.
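The idea of a topology, a source feeding a chain of processing stages, can be sketched with plain Python generators. This is not the Storm API: the names `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical and only borrow Storm's spout/bolt vocabulary. The sketch also runs in a single process, whereas a real topology distributes its stages across a cluster.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source stage: emits one sentence at a time (stands in for a spout)."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Processing stage: splits each sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Processing stage: emits a running (word, count) pair for each word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


if __name__ == "__main__":
    source = ["big data is big", "data streams never end"]
    # Stages are chained lazily, so each item flows through as it arrives.
    for word, count in count_bolt(split_bolt(sentence_spout(source))):
        print(word, count)
```

Because generators are lazy, each word is counted as soon as it is emitted, which mirrors how a stream is processed item by item rather than as a finished batch.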

Apache NiFi is an easy-to-use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic in a web-based user interface.

Kafka™ is used for building real-time data pipelines and streaming apps. It is
horizontally scalable, fault-tolerant, wicked fast, and runs in production in
thousands of companies.

Large-scale log aggregation and analytics.
Fluentd is an open source data collector for a unified logging layer. It allows you to unify data collection and consumption for better use and understanding of data.
Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

Hadoop | Data Storage Layer
Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Its design was derived from the Google File System (GFS) paper.
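As a rough illustration of storing one large file across multiple machines, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The 128 MB block size and replication factor of 3 match HDFS defaults. The round-robin placement, the function name `place_blocks`, and the node names are simplifying assumptions, since real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size
REPLICATION = 3      # HDFS default replication factor


def place_blocks(file_size_mb, datanodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} using toy round-robin placement."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Each replica of a block goes to a different node.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
    for block, replicas in place_blocks(300, nodes).items():
        print(f"block {block}: {replicas}")
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block available.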

HBase is a distributed, column-oriented database. It uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
Unlike an RDBMS, it doesn’t natively support SQL.
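The access pattern described here, looking up a row by key and reading individual columns rather than running SQL, can be pictured with a toy in-memory model. This is not the HBase client API: the class `ToyColumnStore` and its methods are hypothetical, and the nested dictionary only mimics the row key to "family:qualifier" to value layout.

```python
from collections import defaultdict


class ToyColumnStore:
    """Hypothetical in-memory model: row key -> "family:qualifier" -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Point query: return one cell, or the whole row if column is None."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start_row, stop_row):
        """Range scan over sorted row keys, stop_row excluded."""
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key, dict(self._rows[key])


if __name__ == "__main__":
    store = ToyColumnStore()
    store.put("user#001", "info:name", "Sara")
    store.put("user#001", "info:city", "Riyadh")
    store.put("user#002", "info:name", "Omar")
    print(store.get("user#001", "info:city"))
    print(list(store.scan("user#001", "user#002")))
```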

HCatalog is a metadata and table management system for Hadoop. It shares metadata with other tools such as MapReduce, Pig and Hive, and provides one consistent data model for all Hadoop tools along with a shared schema.

Hadoop | Data Processing Layer
MapReduce is the heart of Hadoop. It is this programming paradigm that
allows for massive scalability across hundreds or thousands of servers in a
Hadoop cluster with a parallel, distributed algorithm.
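The paradigm itself fits in a few lines. The following single-process Python sketch counts words with separate map, shuffle, and reduce steps. It is an illustration of the programming model, not Hadoop's Java API, and the function names are assumptions. On a real cluster the map and reduce calls would run in parallel on different servers.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Each map call depends only on its own record and each reduce call only on its own key, which is what allows the work to be spread across many machines.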

Pig is a high-level scripting language and execution environment for creating complex MapReduce transformations. Scripts are written in Pig Latin (the language) and translated into executable MapReduce jobs. Pig also allows the user to create user-defined functions (UDFs) using Java.

Apache Spark is an in-memory cluster computing framework for data analytics, originally developed in the AMPLab at UC Berkeley.
It claims to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Hadoop | Data Querying Layer
Hive is a distributed data warehouse built on top of HDFS to manage and organize large amounts of data. It provides a query language based on SQL semantics (HiveQL), which the runtime engine translates into MapReduce jobs for querying the data.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.

Hadoop | Management Layer
Apache Ambari is an intuitive, easy-to-use Hadoop management web UI. It was contributed by the Hortonworks team and provides a powerful, friendly interface for Hadoop and other typical applications from the Hadoop ecosystem.

Big data existing solutions:
Data Sources
YARN: A framework for job scheduling and cluster resource management.
HDFS: A distributed file system that stores large files across multiple machines.
MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
Pig: A platform for manipulating data stored in HDFS via a high-level language called Pig Latin. It does data extractions, transformations and loading, and basic analysis in batch mode.
Hive: A data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming.
Spark: An open-source cluster computing framework with in-memory analytics.
HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries.
HCatalog: A table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce, and Hive) to read and write data in a tabular form as opposed to files.
Mahout: A scalable machine learning and data mining library.
ZooKeeper: A high-performance coordination service for distributed applications.
Oozie: A Java web application used to schedule Apache Hadoop jobs.
Chukwa: A data collection system for monitoring large distributed systems.
Avro: A data serialization system.
Ambari: A web-based tool for provisioning, managing, configuring, testing, and monitoring Apache Hadoop clusters, services and components. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance.

Other apache projects:
Apache Flink: an open source platform for distributed stream and batch data processing.
Apache Falcon: a feed management and data processing platform.
Apache Ranger: a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
Apache Tez: a framework for developing generic applications that process complex data-processing tasks.
Apache Tika: a toolkit that detects and extracts metadata and text from over a thousand different file types.
Apache Parquet: a columnar storage format available to any project in the Hadoop ecosystem.
Apache Zeppelin: a web-based notebook that enables interactive data analytics.
Apache Drill: a schema-free SQL query engine for Hadoop, NoSQL and cloud storage.
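Apache Parquet is described in this list as a columnar storage format. The small Python sketch below shows what "columnar" means by pivoting row-oriented records into one list per column. It does not read or write real Parquet files. The helper `rows_to_columns` and the sample records are hypothetical, and the sketch assumes every row has the same fields.

```python
def rows_to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


if __name__ == "__main__":
    rows = [
        {"user": "sara", "city": "Riyadh", "purchases": 3},
        {"user": "omar", "city": "Jeddah", "purchases": 5},
    ]
    columns = rows_to_columns(rows)
    # An aggregate over one field only touches that column's values.
    print(sum(columns["purchases"]))
```

Keeping each column's values together is what lets analytical queries read only the fields they need.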

Top Leading Big data companies
The Apache Software Foundation (ASF) is an American non-profit corporation that supports Apache projects.

How to start
1. Identify business use cases tied to business outcomes, metrics and your big data roadmap
2. Identify big data champions from both the business and IT sides of your organization
3. Select infrastructure, tools and architecture for your big data POC/implementation
4. Staff the project with the right big data skills or a strategic big data implementation partner
5. Run your project/POC in sprints or short projects with tangible and measurable outcomes
6. Try to scale your successful POC up to test your logic implementation against the big dataset

What can I do now?

Certification Path | Administration
Cloudera, Hortonworks

Certification Path | Development

Certification Path | Data Science

Questions
http://www.mahmoud-yassin.com/big-data
