L8 Big Data Management en
Introduction to
Data Science
(IT4142E)
Contents
• Lecture 1: Overview of Data Science
• Lecture 2: Data crawling and preprocessing
• Lecture 3: Data cleaning and integration
• Lecture 4: Exploratory data analysis
• Lecture 5: Data visualization
• Lecture 6: Multivariate data visualization
• Lecture 7: Machine learning
• Lecture 8: Big data analysis
• Lecture 9: Capstone Project guidance
• Lecture 10+11: Text, image, graph analysis
• Lecture 12: Evaluation of analysis results
Big data: the 5 V's
Big data technology stack
Scalable data management
• Scalability
• Able to manage increasingly large volumes of data
• Accessibility
• Able to maintain efficiency when reading from and writing to
data storage systems (I/O)
• Transparency
• In a distributed environment, users should be able to access
data over the network as easily as if the data were stored
locally.
• Users should not have to know the physical location of data to
access it.
• Availability
• Fault tolerance
• The number of users, system failures, or other consequences
of distribution should not compromise availability.
Data I/O landscape
[Figure: data I/O landscape. Recovered labels: CPUs: 10 GB/s; network: 1 Gb/s or 125 MB/s; 0.1 Gb/s; nodes in another rack]
Scalable data ingestion and processing
• Data ingestion
• Data from different, complementary information systems must be
combined to provide a more comprehensive basis for meeting
information needs
• How to ingest data efficiently from various, distributed
heterogeneous sources?
• Different data formats
• Different data models and schemas
• Security and privacy
• Data processing
• How to process massive volumes of data in a timely fashion?
• How to process massive streams of data in real time?
• Traditional parallel and distributed processing (OpenMP, MPI)
• Steep learning curve
• Limited scalability
• Fault tolerance is hard to achieve
• Expensive, high-performance computing infrastructure
Scalable analytic algorithms
• Challenges
• Big volume
• Big dimensionality
• Real-time processing
• Scaling up machine learning algorithms
• Adapting the algorithm to handle big data on a single machine
• E.g., sub-sampling
• E.g., principal component analysis
• E.g., feature extraction and feature selection
• Scaling up algorithms by parallelism
• E.g., k-NN classification based on MapReduce
• E.g., scaling up support vector machines (SVM) by a
divide-and-conquer approach
• Novel real-time processing architectures
• E.g., mini-batches in Spark Streaming
• E.g., complex event processing in Apache Flink
E.g., the curse of dimensionality
• The number of samples required (to achieve the same
accuracy) grows exponentially with the number of
variables!
• In practice, the number of training examples is fixed!
=> the classifier's performance will usually degrade for a large
number of features!
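To make the exponential growth concrete, here is a rough sketch. It assumes each variable is discretized into a fixed number of bins and that "same accuracy" is approximated by wanting at least one sample per grid cell. The function name `samples_needed` and the default of 10 bins are illustrative assumptions, not part of the lecture.

```python
def samples_needed(num_variables: int, bins_per_variable: int = 10) -> int:
    """Number of cells in a regular grid over the feature space.

    Assumption for illustration: one sample per cell is the target density,
    so the cell count stands in for the required number of samples.
    """
    return bins_per_variable ** num_variables


# The requirement multiplies by 10 for every extra variable.
for d in (1, 2, 3, 10):
    print(d, samples_needed(d))
```

With a fixed training set, adding variables therefore leaves most cells empty, which is one way to read the degradation described above.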
Utilization and interpretability of big data
Privacy and security
Big data job trends
Talent shortage in big data
Big data skill set
How to land big data related jobs
• Learn to code
• Coursera
• Udacity
• Freecodecamp
• Codecademy
• Math, Stats and machine learning
• Kaggle
• Hadoop, NoSQL, Spark
• Visualization and Reporting
• Tableau
• Pentaho
• Meetup & Share
• Find a mentor
• Internships, projects
Data science method
1. Formulate a question
2. Gather data
3. Analyze data
4. Product
[Figure: precision (0% to 80%) versus % answered (0% to 100%); curves labeled 12/2007, 5/2008, 8/2008, 12/2008, 5/2009, 10/2009, 4/2010, and Baseline]
Cleaning big data: most time-consuming, least
enjoyable data science task
source: https://www.forbes.com/
Online courses
• https://www.coursera.org/learn/nosql-database-systems
• https://who.rocq.inria.fr/Vassilis.Christophides/Big/index.htm
• https://www.coursera.org/learn/big-data-introduction?specialization=big-data
• https://www.coursera.org/learn/big-data-integration-processing?specialization=big-data
• https://www.coursera.org/learn/big-data-management?specialization=big-data
• https://www.coursera.org/learn/hadoop
• https://www.coursera.org/learn/scala-spark-big-data
Hadoop ecosystem
We need a system that scales
• Traditional tools are overwhelmed
• Slow disks, unreliable machines, parallelism is not easy
• 3 challenges
• Reliable storage
• Powerful data processing
• Efficient visualization
What is Apache Hadoop?
• Scalable and economical data storage and
processing
• The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models. It is designed to scale out from single servers to
thousands of machines, each offering local computation and
storage. Rather than rely on hardware to deliver
high availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a
highly available service on top of a cluster of computers, each of
which may be prone to failures (commodity hardware).
• Heavily inspired by Google's data architecture
Hadoop main components
• Storage: Hadoop distributed file system
(HDFS)
• Processing: MapReduce framework
• System utilities:
• Hadoop Common: The common utilities that
support the other Hadoop modules.
• Hadoop YARN: A framework for job scheduling and
cluster resource management.
Scalability
• Distributed by design
• Hadoop runs on a cluster of machines
• Individual servers within a cluster are called
nodes
• Each node may both store and process data
• Scale out by adding more nodes to increase
capacity
• Up to several thousand nodes
Fault tolerance
• Cluster of commodity servers
• Hardware failure is the norm rather than the exception
• Built with redundancy
• Files loaded into HDFS are replicated across nodes in
the cluster
• If a node fails, its data is re-replicated from one of the
remaining copies
• Data processing jobs are broken into individual tasks
• Each task takes a small amount of data as input
• Tasks execute in parallel
• Failed tasks are rescheduled elsewhere
• Routine failures are handled automatically without any
loss of data
Hadoop distributed file system
• Provides inexpensive and reliable storage for massive
amounts of data
• Optimized for big files (file sizes from 100 MB to
several TB)
• Hierarchical UNIX style file system
• (e.g., /hust/soict/hello.txt)
• UNIX style file ownership and permissions
• There are also some major deviations from UNIX
• Append only
• Write once read many times
HDFS Architecture
• Master/slave architecture
• HDFS master: namenode
• Manages the namespace and
metadata
• Monitors the datanodes
• HDFS slave: datanode
• Handles reads and writes of the actual
data
HDFS main design principles
• I/O pattern
• Append only → reduces synchronization
• Data distribution
• Files are split into big chunks (64 MB)
→ reduces metadata size
→ reduces network communication
• Data replication
• Each chunk is usually replicated on 3 different nodes
• Fault tolerance
• Data node: re-replication
• Name node
• Secondary namenode
• Query the data nodes instead of using a complex checkpointing scheme
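The chunking and replication figures above translate into simple arithmetic. The sketch below is only an illustration: the function name `hdfs_footprint` is hypothetical, the 64 MB chunk size and replication factor of 3 are the values quoted on the slide, and it assumes the last chunk of a file occupies only as much space as the data it holds.

```python
import math

CHUNK_SIZE_MB = 64       # chunk size quoted on the slide
REPLICATION_FACTOR = 3   # usual replication quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return (chunks, stored chunk replicas, raw storage in MB) for one file.

    Assumption: a partially filled last chunk uses only the space of its data,
    so raw storage is simply file size times the replication factor.
    """
    chunks = math.ceil(file_size_mb / CHUNK_SIZE_MB)
    replicas = chunks * REPLICATION_FACTOR
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return chunks, replicas, raw_mb


chunks, replicas, raw_mb = hdfs_footprint(1024)  # a 1 GB file
print(chunks, replicas, raw_mb)  # 16 48 3072
```

The namenode only needs metadata for 16 chunks in this case, which is why large chunks keep the metadata small.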
Data processing: MapReduce
• MapReduce framework is the Hadoop default data
processing engine
• MapReduce is a programming model for data
processing
• It is not a language but a style of processing data, created by
Google
• The beauty of MapReduce
• Simplicity
• Flexibility
• Scalability
An MR job = {isolated tasks}^n
• MapReduce divides the workload into multiple
independent tasks and schedules them across cluster
nodes
• The work performed by each task is done in isolation
from the others for scalability reasons
• The communication overhead required to keep the data on the
nodes synchronized at all times would prevent the model from
performing reliably and efficiently at large scale
Data Distribution
• In a MapReduce cluster, data is usually managed by a
distributed file system (e.g., HDFS)
• Move code to data and not data to code
Keys and Values
• The programmer in MapReduce has to specify two
functions, the map function and the reduce function
that implement the Mapper and the Reducer in a
MapReduce program
• In MapReduce, data elements are always structured as
key-value (i.e., (K, V)) pairs
• The map and reduce functions receive and emit (K, V)
pairs
[Figure: input splits, intermediate outputs, final outputs]
Partitions
• A different subset of the intermediate key space is
assigned to each Reducer
• These subsets are known as partitions
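One common way to assign intermediate keys to partitions is to hash the key and take the result modulo the number of reducers. The sketch below illustrates that idea only. The function name `partition_for` and the choice of CRC32 as the hash are assumptions made so that the example is deterministic; it is not Hadoop's own partitioner code.

```python
import zlib


def partition_for(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index: stable hash (CRC32) modulo reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


keys = ["Alice", "Bob", "Carol", "Alice"]
# Every occurrence of the same key lands in the same partition,
# so a single reducer sees all values for that key.
assignments = [(key, partition_for(key, 3)) for key in keys]
print(assignments)
```

Because the assignment depends only on the key, no coordination between map tasks is needed to route records consistently.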
MapReduce example
• Input: text file containing order ID, employee name,
and sale amount
• Output: sum of all sales per employee
Map phase
• Hadoop splits a job into many individual map tasks
• Number of map tasks is determined by the amount of input data
• Each map task receives a portion of the overall job input to process
• Mappers process one input record at a time
• For each input record, they emit zero or more records as output
• In this case, the map task simply parses the input record
• and then emits the name and sale amount fields of each record as output
Shuffle and sort
• Hadoop automatically sorts and merges output from all
map tasks
• This intermediate process is known as the shuffle and sort
• The result is supplied to reduce tasks
Reduce phase
• Reducer input comes from the shuffle and sort process
• As with map, the reduce function receives one record at a time
• A given reducer receives all records for a given key
• For each input record, reduce can emit zero or more output records
• Our reduce function simply sums the total per person
• and emits the employee name (key) and total (value) as output
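The three phases of the sales example can be simulated on a single machine. This is a minimal sketch, not Hadoop code: the comma-separated record layout (`order_id,employee,amount`), the sample records, and all function names are assumptions made for illustration. Here the reduce function receives a key together with the list of its values, which is a simplification of the record-at-a-time view above.

```python
from collections import defaultdict


def map_record(line):
    """Parse one input record and emit (employee name, sale amount)."""
    order_id, name, amount = line.split(",")
    yield name, float(amount)


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_records(name, amounts):
    """Sum all sale amounts for one employee."""
    yield name, sum(amounts)


def run_job(lines):
    """Run map, then shuffle and sort, then reduce, sequentially."""
    mapped = [pair for line in lines for pair in map_record(line)]
    results = []
    for name, amounts in shuffle_and_sort(mapped):
        results.extend(reduce_records(name, amounts))
    return results


records = ["1001,Alice,250.0", "1002,Bob,100.0", "1003,Alice,50.0"]
print(run_job(records))  # [('Alice', 300.0), ('Bob', 100.0)]
```

In a real cluster the map calls run in parallel on different input splits and the grouping step moves data across the network; the logic of each function stays the same.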
Data flow for the entire MapReduce job
Word Count Dataflow
MapReduce - Dataflow
MapReduce life cycle
Hadoop ecosystem
• Many related tools integrate with Hadoop
• Data analysis
• Database integration
• Workflow management
• These are not considered ‘core Hadoop’
• Rather, they are part of the ‘Hadoop ecosystem’
• Many are also open source Apache projects
Apache Pig
• Apache Pig builds on Hadoop to offer high-level data processing
• Pig is especially good at joining and transforming data
• The Pig interpreter runs on the client machine
• Turns Pig Latin scripts into MapReduce jobs
• Submits those jobs to the cluster
Apache Hive
• Another abstraction on top of MapReduce
• Reduces development time
• HiveQL: SQL-like language
• The Hive interpreter runs on the client machine
• Turns HiveQL scripts into MapReduce jobs
• Submits those jobs to the cluster
Apache HBase
• HBase is a distributed column-oriented data store built on top of
HDFS
• Often described as the Hadoop database
• Data is logically organized into tables, rows and columns
• Terabytes, and even petabytes, of data in a table
• Tables can have many thousands of columns
• Scales to provide very high write throughput
• Hundreds of thousands of inserts per second
• Fairly primitive when compared to an RDBMS
• NoSQL: there is no high-level query language
• Use the API to scan, get, and put values based on keys
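To show what key-based access without a query language looks like, here is a toy in-memory table with put, get, and scan operations over sorted row keys. It is only a stand-in to illustrate the access pattern: the class `ToyTable`, its method signatures, and the row keys and column names are all hypothetical, and this is not the HBase client API.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a key-sorted table; NOT the HBase client API."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, column, value):
        """Insert or overwrite one cell."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return all columns of one row, or None if the row is absent."""
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Yield (row_key, columns) for start_key <= row_key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = ToyTable()
table.put("user#002", "info:name", "Bob")
table.put("user#001", "info:name", "Alice")
table.put("user#003", "info:name", "Carol")
print(table.get("user#002"))
print(list(table.scan("user#001", "user#003")))
```

Anything beyond these key-based operations, such as joins or aggregations, has to be written by the application or delegated to another tool.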
Apache Sqoop
• Sqoop is a tool designed for efficiently
transferring bulk data between Apache
Hadoop and structured datastores such
as relational databases.
• It can import all tables, a single table, or
a portion of a table into HDFS
• Via a map-only MapReduce job
• Result is a directory in HDFS containing
comma-delimited text files
• Sqoop can also export data from HDFS
back to the database
Apache Kafka
• Kafka decouples data streams
• Producers don't know about consumers
• Flexible message consumption
• The Kafka broker delegates the log partition offset (location) to
consumers (clients)
[Figure: producers, Kafka cluster, ZooKeeper, consumers, offsets]
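The idea that each consumer tracks its own offset can be sketched with a toy append-only log. The classes `ToyPartitionLog` and `ToyConsumer` and their methods are hypothetical names for illustration only; this is not the Kafka client API and it ignores brokers, partitions across machines, and persistence.

```python
class ToyPartitionLog:
    """Append-only list standing in for one log partition (not the Kafka API)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Producer side: add a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset, max_messages=10):
        """Return up to max_messages starting at the given offset."""
        return self._messages[offset:offset + max_messages]


class ToyConsumer:
    """Each consumer keeps its own offset, so consumers read independently."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self._log.read_from(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = ToyPartitionLog()
for event in ["e0", "e1", "e2"]:
    log.append(event)  # the producer never references any consumer

fast, slow = ToyConsumer(log), ToyConsumer(log)
print(fast.poll())                # ['e0', 'e1', 'e2']
print(slow.poll(max_messages=1))  # ['e0']
```

Because the log itself holds no per-consumer state, a slow consumer does not hold back a fast one, and a consumer can re-read old messages by resetting its offset.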
Apache Zookeeper
• Apache ZooKeeper is a highly reliable
distributed coordination service
• Group membership
• Leader election
• Dynamic Configuration
• Status monitoring
• All of these kinds of services are used in some
form or another by distributed applications
PAXOS algorithm
https://www.youtube.com/watch?v=d7nAGI_NZPk
YARN – Yet Another Resource Negotiator
• Nodes have "resources" (memory and CPU cores),
which are allocated to applications when requested
• Moving beyond MapReduce
• MR and non-MR applications run on the same cluster
• Most JobTracker functions moved to application masters
HADOOP 1.0
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0
• MapReduce (data processing), Others (data processing)
• YARN (cluster resource management)
• HDFS (redundant, reliable storage)
YARN execution
Big data platform: Hadoop ecosystem
Big data management
Thank you
for your
attention!!!