Big data is characterized by 3Vs - volume, velocity, and variety. Hadoop is a framework for distributed processing of large datasets across clusters of computers. It provides HDFS for storage, MapReduce for batch processing, and YARN for resource management. Additional tools like Spark, Mahout, and Zeppelin can be used for real-time processing, machine learning, and data visualization respectively on Hadoop. Benefits of Hadoop include ease of scaling to large data, high performance via parallel processing, reliability through data protection and failover.
Having trouble distinguishing Big Data, Hadoop, and NoSQL, or seeing how they connect? This slide deck from the Savvycom team can help.
Enjoy reading!
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses limitations in traditional RDBMS for big data by allowing scaling to large clusters of commodity servers, high fault tolerance, and distributed processing. The core components of Hadoop are HDFS for distributed storage and MapReduce for distributed processing. Hadoop has an ecosystem of additional tools like Pig, Hive, HBase and more. Major companies use Hadoop to process and gain insights from massive amounts of structured and unstructured data.
Guest Lecture: Introduction to Big Data at Indian Institute of Technology, by Nishant Gandhi
This document provides an introduction to big data, including definitions of big data and why it is important. It discusses characteristics of big data like volume, velocity, variety and veracity. It provides examples of big data applications in various industries like GE, Boeing, social media, finance, CERN, journalism, politics and more. It also introduces NoSQL and the CAP theorem, and concludes that big data is changing business and technology by enabling new insights from data to reduce costs and optimize operations.
The document provides an overview of big data analytics using Hadoop. It discusses how Hadoop allows for distributed processing of large datasets across computer clusters. The key components of Hadoop discussed are HDFS for storage, and MapReduce for parallel processing. HDFS provides a distributed, fault-tolerant file system where data is replicated across multiple nodes. MapReduce allows users to write parallel jobs that process large amounts of data in parallel on a Hadoop cluster. Examples of how companies use Hadoop for applications like customer analytics and log file analysis are also provided.
This document provides an agenda for a presentation on big data and big data analytics using R. The presentation introduces the presenter and has sections on defining big data, discussing tools for storing and analyzing big data in R like HDFS and MongoDB, and presenting case studies analyzing social network and customer data using R and Hadoop. The presentation also covers challenges of big data analytics, existing case studies using tools like SAP Hana and Revolution Analytics, and concerns around privacy with large-scale data analysis.
The document summarizes the key components of the big data stack, from the presentation layer where users interact, through various processing and storage layers, down to the physical infrastructure of data centers. It provides examples like Facebook's petabyte-scale data warehouse and Google's globally distributed database Spanner. The stack aims to enable the processing and analysis of massive datasets across clusters of servers and data centers.
This document provides an overview of big data and Hadoop. It defines big data as large volumes of structured, semi-structured and unstructured data that is growing exponentially and is too large for traditional databases to handle. It discusses the 4 V's of big data - volume, velocity, variety and veracity. The document then describes Hadoop as an open-source framework for distributed storage and processing of big data across clusters of commodity hardware. It outlines the key components of Hadoop including HDFS, MapReduce, YARN and related modules. The document also discusses challenges of big data, use cases for Hadoop and provides a demo of configuring an HDInsight Hadoop cluster on Azure.
The document provides an overview of Hadoop and the Hadoop ecosystem. It discusses the history of Hadoop, how big data is defined in terms of volume, velocity, variety and veracity. It then explains what Hadoop is, the core components of HDFS and MapReduce, how Hadoop is used for distributed processing of large datasets, and how Hadoop compares to traditional RDBMS. The document also outlines other tools in the Hadoop ecosystem like Pig, Hive, HBase and gives a brief demo.
The document discusses big data, including what it is, sources of big data like social media and stock exchange data, and the three Vs of big data - volume, velocity, and variety. It then discusses Hadoop, the open-source framework for distributed storage and processing of large datasets across clusters of computers. Key components of Hadoop include HDFS for distributed storage, MapReduce for distributed computation, and YARN which manages computing resources. The document also provides overviews of Pig and Jaql, programming languages used for analyzing data in Hadoop.
Enough talking about Big Data and Hadoop; let's see how Hadoop works in action.
We will locate a real dataset, ingest it into our cluster, connect it to a database, apply some queries and data transformations, save the result, and present it via a BI tool.
This document introduces big data concepts and Microsoft's solutions for big data. It defines big data as large, complex datasets that are difficult to process using traditional systems. It also describes the 3Vs of big data: volume, velocity, and variety. The document then outlines Microsoft's offerings for big data including HDInsight, .NET SDK for Hadoop, ODBC driver for Hive, and integrations with Excel, SharePoint, and SQL Server. It provides overviews of Hadoop, HDFS, MapReduce, and the Hadoop ecosystem.
This document provides an overview of big data concepts including what big data is, how it is used, and common tools involved. It defines big data as a cluster of technologies like Hadoop, HDFS, and HCatalog used for fetching, processing, and visualizing large datasets. MapReduce and Hadoop clusters are described as common processing techniques. Example use cases mentioned include business intelligence. Resources for getting started with tools like Hortonworks, CloudEra, and examples of MapReduce jobs are also provided.
Introductory Big Data presentation given during one of our Sizing Servers Lab user group meetings. The presentation is targeted at an audience of about 20 SME employees. It also contains a short description of the work packages for our Big Data project proposal that was submitted in March.
This document is a presentation on big data and Hadoop. It introduces big data, how it is growing exponentially, and the challenges of storing and analyzing unstructured data. It discusses how Sears moved to Hadoop to gain insights from all of its customer data. The presentation explains why Hadoop is in high demand, as it allows distributed processing of large datasets across commodity hardware. It provides an overview of the Hadoop ecosystem including HDFS, MapReduce, Hive, HBase and more. Finally, it discusses job opportunities and salaries in big data which are high and growing significantly.
This document provides an introduction to big data and Hadoop. It discusses what big data is, characteristics of big data like volume, velocity and variety. It then introduces Hadoop as a framework for storing and analyzing big data, describing its main components like HDFS and MapReduce. The document outlines a typical big data workflow and gives examples of big data use cases. It also provides an overview of setting up Hadoop on a single node, including installing Java, configuring SSH, downloading and extracting Hadoop files, editing configuration files, formatting the namenode, starting Hadoop daemons and testing the installation.
This document discusses large-scale data processing using Apache Hadoop at SARA and BiG Grid. It provides an introduction to Hadoop and MapReduce, noting that data is easier to collect, store, and analyze in large quantities. Examples are given of projects using Hadoop at SARA, including analyzing Wikipedia data and structural health monitoring. The talk outlines the Hadoop ecosystem and timeline of its adoption at SARA. It discusses how scientists are using Hadoop for tasks like information retrieval, machine learning, and bioinformatics.
Big Data raises challenges about how to process such a vast pool of raw data and how to add value to our lives. To address these demands, an ecosystem of tools named Hadoop was conceived.
The document provides an overview of Hadoop and its core components. It discusses:
- Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers.
- The two core components of Hadoop are HDFS for distributed storage, and MapReduce for distributed processing. HDFS stores data reliably across machines, while MapReduce processes large amounts of data in parallel.
- Hadoop can operate in three modes - standalone, pseudo-distributed and fully distributed. The document focuses on setting up Hadoop in standalone mode for development and testing purposes on a single machine.
This presentation describes the company where I did my summer training, and covers what big data is, why we use big data, big data challenges, issues in big data and their solutions, Hadoop, Docker, Ansible, etc.
Workshop
December 9, 2015
LBS College of Engineering
www.sarithdivakar.info | www.csegyan.org
http://sarithdivakar.info/2015/12/09/wordcount-program-in-python-using-apache-spark-for-data-stored-in-hadoop-hdfs/
One of the most common technologies used to store metadata and large databases. It has numerous applications in the real world and is very useful for creating new database-oriented apps.
Big data refers to large volumes of diverse data that traditional data processing systems are unable to handle. Hadoop is an open-source software framework for distributed storage and processing of big data across clusters of commodity hardware. It allows for the reliable, scalable, and distributed processing of large data sets across clusters of commodity servers. Hadoop features include scalable and reliable data storage with HDFS and distributed processing of large data sets with MapReduce. Popular companies that use Hadoop include Google, Facebook, and Amazon for its abilities to process massive amounts of data in a cost-effective manner.
The document provides an introduction to big data and Hadoop. It describes the concepts of big data, including the four V's of big data: volume, variety, velocity and veracity. It then explains Hadoop and how it addresses big data challenges through its core components. Finally, it describes the various components that make up the Hadoop ecosystem, such as HDFS, HBase, Sqoop, Flume, Spark, MapReduce, Pig and Hive. The key takeaways are that the reader will now be able to describe big data concepts, explain how Hadoop addresses big data challenges, and describe the components of the Hadoop ecosystem.
This document provides an overview of Hadoop and Big Data. It begins with introducing key concepts like structured, semi-structured, and unstructured data. It then discusses the growth of data and need for Big Data solutions. The core components of Hadoop like HDFS and MapReduce are explained at a high level. The document also covers Hadoop architecture, installation, and developing a basic MapReduce program.
IRJET: Systematic Review: Progression Study on Big Data Articles (IRJET Journal)
This document provides a systematic review of research articles on big data analysis. It analyzed 64 articles published between 2014-2018 from IEEE Explorer and Google Scholar databases. Key findings include: the number of published articles has increased each year, reflecting the growing importance of big data; experimental and case study articles accounted for 25 of the analyzed papers; 19 articles were ultimately selected for review, with 11 from Google Scholar and 8 from IEEE Explorer. The review aims to provide an overview of current research progress on big data analysis techniques.
This document discusses security issues with Hadoop and available solutions. It identifies vulnerabilities in Hadoop including lack of authentication, unsecured data in transit, and unencrypted data at rest. It describes current solutions like Kerberos for authentication, SASL for encrypting data in motion, and encryption zones for encrypting data at rest. However, it notes limitations of encryption zones for processing encrypted data efficiently with MapReduce. It proposes a novel method for large scale encryption that can securely process encrypted data in Hadoop.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable and distributed processing of large datasets. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. HDFS stores data reliably across machines in a Hadoop cluster and MapReduce processes data in parallel by breaking the job into smaller fragments of work executed across cluster nodes.
A Comprehensive Study on Big Data Applications and Challenges (ijcisjournal)
Big Data has gained much interest from academia and the IT industry. In the digital and computing world, information is generated and collected at a rate that quickly exceeds processing capacity. As information is transferred and shared at light speed over optical fiber and wireless networks, the volume of data and the speed of market growth increase. Conversely, the fast growth rate of such large data generates copious challenges, such as the rapid growth of data, transfer speed, diverse data, and security. Even so, Big Data is still in its early stage, and the domain has not been reviewed in general. Hence, this study expansively surveys and classifies an assortment of attributes of Big Data, including its nature, definitions, rapid growth rate, volume, management, analysis, and security. This study also proposes a data life cycle that uses the technologies and terminologies of Big Data. Map/Reduce is a programming model for efficient distributed computing that works well with semi-structured and unstructured data; it is a simple model, but good for many applications such as log processing and web index building.
This document provides an introduction to big data and Hadoop. It discusses how the volume of data being generated is growing rapidly and exceeding the capabilities of traditional databases. Hadoop is presented as a solution for distributed storage and processing of large datasets across clusters of commodity hardware. Key aspects of Hadoop covered include MapReduce for parallel processing, the Hadoop Distributed File System (HDFS) for reliable storage, and how data is replicated across nodes for fault tolerance.
This document provides an overview of big data and Apache Hadoop. It defines big data as large and complex datasets that are difficult to process using traditional database management tools. It discusses the sources and growth of big data, as well as the challenges of capturing, storing, searching, sharing, transferring, analyzing and visualizing big data. It describes the characteristics and categories of structured, unstructured and semi-structured big data. The document also provides examples of big data sources and uses Hadoop as a solution to the challenges of distributed systems. It gives a high-level overview of Hadoop's core components and characteristics that make it suitable for scalable, reliable and flexible distributed processing of big data.
1. The document discusses the evolution of computing from mainframes to smaller commodity servers and PCs. It then introduces cloud computing as an emerging technology that is changing the technology landscape, with examples like Google File System and Amazon S3.
2. It discusses the need for large data processing due to increasing amounts of data from sources like the stock exchange, Facebook, genealogy sites, and scientific experiments.
3. Hadoop is introduced as a framework for distributed computing and reliable shared storage and analysis of large datasets using its Hadoop Distributed File System (HDFS) for storage and MapReduce for analysis.
This document provides an overview of Big Data and Hadoop. It defines Big Data as large volumes of structured, semi-structured, and unstructured data that is too large to process using traditional databases and software. It provides examples of the large amounts of data generated daily by organizations. Hadoop is presented as a framework for distributed storage and processing of large datasets across clusters of commodity hardware. Key components of Hadoop including HDFS for distributed storage and fault tolerance, and MapReduce for distributed processing, are described at a high level. Common use cases for Hadoop by large companies are also mentioned.
2. CONTENTS (1/2)
Big Data Definition
Areas of Challenges
Big Data Attributes
Big Data Sources
Sample Events Generating Data
New Tools for Generating Data
Big Data Applications
Getting Value from Big Data
Big Data Security
Comparing Hadoop with RDBMS
Hadoop
4. WHAT IS BIG DATA?
BIG DATA is data so large that it becomes difficult to process using traditional systems.
SOURCE: PLANNING FOR BIG DATA, EDD DUMBILL, PP. 1-4
6. DIFFICULT TO PROCESS BY TRADITIONAL SYSTEMS
A 200 MB document: unable to send.
A 150 GB image: unable to view.
A 200 TB video: unable to edit.
It depends on the capabilities of the system.
7. ORGANIZATION SPECIFIC
500 TB of text, audio, and video data per day: BIG DATA for Company 1, NOT BIG DATA for Company 2.
It depends on the capabilities of the organization.
13. DATA GENERATION POINT
EXAMPLES
MOBILE DEVICES
MICROPHONES
READERS/SCANNERS
CAMERAS
MACHINE SENSORS
SOCIAL MEDIA
PROGRAMS/SOFTWARE
SCIENCE FACILITIES
14. SAMPLE DATA TYPES
VIDEOS
AUDIOS
IMAGES
PHOTOS
LOGS
CLICK TRAILS
TEXT MESSAGES
EMAILS
DOCUMENTS
BOOKS
TRANSACTIONS
PUBLIC RECORDS
15. SAMPLE EVENTS GENERATING DATA
1) Airbus:
Airbus generates 10 TB every 30 minutes.
About 640 TB is generated in one flight.
2) Smart Meters:
A smart meter reads the usage every 15 minutes.
Smart meters record 350 billion transactions every year.
In 2009, there were 76 million smart meters.
By 2014, there will be 200 million smart meters.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
16. 3) Camera Phones:
5 million camera phones are in use worldwide.
Most of them have location awareness (GPS).
22% of them are smartphones.
By the end of 2013, the number of smartphones will exceed the number of PCs.
4) Internet Users:
2+ billion people use the internet.
By 2014, Cisco estimates internet traffic of 4.8 zettabytes per year.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
17. 5) Blogs:
There are 200 billion blog entries in the world.
6) Emails:
300 million emails are sent every day.
7) RFID:
In 2005, there were around 1.5 million RFIDs.
In 2012, there are 30 million RFIDs.
Walmart has played the major role.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
18. 8) Facebook:
Facebook generates 25 TB of data daily.
9) Twitter:
Twitter generates 12 TB of data daily.
200 million users generate 230 million tweets daily.
97,000 tweets are sent every second.
10) Trading:
NYSE produces 1 TB per trading day.
11) Experiment:
The CERN atomic facility generates 40 TB per second.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
19. SAMPLE EVENTS GENERATING DATA
Big Data:
In 2009, the total data was estimated to be 1 ZB.
In 2020, it is estimated to be 35 ZB.
SOURCE: HADOOP: THE DEFINITIVE GUIDE, 3rd EDITION, PP. 1-4
20. New Tools For Big Data
Over time, traditional systems (e.g., RDBMS), which are not able to handle Big Data, have given way to Big Data tools (e.g., Hadoop), which were created to handle it.
21. Big Data Applications
Companies gain an edge by collecting, analyzing, and understanding information.
Governments forecast events and take proactive actions.
23. Big Data Security Issues
Security and privacy issues are magnified by the V attributes: velocity, volume, and variety.
Traditional security mechanisms, which are tailored to securing small-scale static data, are inadequate.
SOURCE: CLOUD SECURITY ALLIANCE
24. Top Five Security Challenges
1) Secure Computation in Distributed Programming Frameworks:
A distributed programming framework utilizes parallelism in computation and storage to process massive amounts of data.
Example: the MapReduce framework:
Splits input files into multiple chunks.
These chunks are read by the mapper, which outputs key/value pairs.
The reducer combines the values belonging to each distinct key and outputs the result.
OPPORTUNITY 1: Two major prevention measures arise:
1) Securing the mapper
2) Securing the data in the presence of an untrusted mapper
SOURCE: CLOUD SECURITY ALLIANCE
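The split/map/shuffle/reduce flow described on this slide can be sketched in Python. This is a minimal in-memory simulation, not the real Hadoop API; the function names are illustrative only:

```python
from collections import defaultdict

def mapper(chunk):
    # Read one input chunk and emit key/value pairs: (word, 1).
    for word in chunk.split():
        yield word.lower(), 1

def reducer(key, values):
    # Combine all values belonging to one distinct key.
    return key, sum(values)

def map_reduce(chunks):
    # Shuffle phase: group mapper output by key before reducing.
    groups = defaultdict(list)
    for chunk in chunks:
        for key, value in mapper(chunk):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

chunks = ["big data big", "data tools"]
print(map_reduce(chunks))  # {'big': 2, 'data': 2, 'tools': 1}
```

In real Hadoop the chunks would be HDFS blocks and the mapper and reducer would run in parallel on different nodes; the grouping dictionary here stands in for the shuffle between them.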
25. 2) Input Validation/Filtering
Input validation:
What kind of data is untrusted?
What are the untrusted data sources?
Data filtering:
Filter rogue or malicious data.
Challenges/opportunities:
GBs or TBs of continuous data.
Signature-based data filtering has limitations.
SOURCE: CLOUD SECURITY ALLIANCE
26. 3) Secure Data Storage
With data at various nodes, authentication, authorization, and encryption are challenging.
Auto-tiering moves cold data onto less secure media.
o What if the cold data is sensitive?
o Auto-tiering does not keep track of where the data is stored (a new challenge).
Encryption of real-time data can have a performance impact.
Challenges/opportunities:
24/7 availability of data.
Unauthorized access.
SOURCE: CLOUD SECURITY ALLIANCE
27. 4) Privacy Concerns in Data Mining
Sharing of results involves multiple challenges:
o Invasion of privacy.
o Invasive marketing.
o Unintentional disclosure of information.
Example: data held by companies and government agencies is constantly mined and analyzed by inside analysts, and potentially also by outside contractors.
Challenges/opportunities: robust and scalable privacy-preserving mining algorithms.
SOURCE: CLOUD SECURITY ALLIANCE
28. 5) Cryptographically Enforced Access Control and Secure Communication
To keep private data secure end to end, it must be accessible only to authorized entities; hence cryptographically enforced access control has to be implemented.
Challenges/opportunities: the main problem with encrypting data, especially large data sets, is the all-or-nothing retrieval policy, which prevents users from easily searching or sharing data.
SOURCE: CLOUD SECURITY ALLIANCE
30. Comparing Hadoop with RDBMS
Until recently, many applications utilized relational database management systems (RDBMS) for batch processing:
- Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
- Hadoop doesn't fully replace relational products; many architectures would benefit from both Hadoop and relational product(s).
Scale-out vs scale-up:
- RDBMS products scale up.
  Expensive to scale for large installations.
  Hits a ceiling when storage reaches 100s of terabytes.
- Hadoop clusters can scale out to 100s of machines and petabytes of storage.
31. Structured (relational) vs semi-structured vs unstructured
- RDBMS works well for structured data: tables that conform to a predefined schema.
- Hadoop works best on semi-structured and unstructured data.
  Semi-structured data may have a schema that is loosely followed.
  Unstructured data has no structure whatsoever and is usually blocks of text (or, for example, images).
  At processing time, the types for keys and values are chosen by the implementer.
- Certain types of input data, such as JSON and XML, will not easily fit into a relational schema.
Comparing Hadoop with RDBMS (contd.)
32. Offline batch vs online transactions
- Hadoop was not designed for real-time, low-latency queries.
- Products that do provide low-latency queries, such as HBase, have limited query functionality.
- Hadoop performs best for offline batch processing on large amounts of data.
- RDBMS is best for online transactions and low-latency queries.
- Hadoop is designed to stream large files and large amounts of data.
- RDBMS works best with small records.
Comparing Hadoop with RDBMS (contd.)
34. A framework for running applications on large clusters of commodity hardware.
Scale: petabytes of data on thousands of nodes.
Includes:
  Storage: HDFS
  Processing: MapReduce
Supports the Map/Reduce programming model.
Requirements:
  Economy: use clusters of commodity computers.
  Easy to use:
    Users need not deal with the complexity of distributed computing.
  Reliable: can handle node failures automatically.
35. What's Hadoop? (contd.)
Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Here's what makes Hadoop especially useful:
  Scalable
  Economical
  Efficient
  Reliable
36. Hadoop, Why?
Need to process multi-petabyte datasets.
Expensive to build reliability into each application.
Nodes fail every day:
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
Need common infrastructure:
– Efficient, reliable, open source (Apache License).
The above goals are the same as Condor's, but the workloads are I/O-bound, not CPU-bound.
37. Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
39. HDFS
Hadoop implements MapReduce using the Hadoop Distributed File System (HDFS) (see figure below). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
Hadoop has been demonstrated on clusters with 2,000 nodes. The current design target is 10,000-node clusters.
40. Goals of HDFS
• Very large distributed file system
– 10K nodes, 100 million files, 10 PB
• Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for batch processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth
• Runs in user space, on heterogeneous OSes
41. Hadoop at Facebook
• Production cluster
– 4,800 cores, 600 machines, 16 GB per machine (April 2009)
– 8,000 cores, 1,000 machines, 32 GB per machine (July 2009)
– 4 SATA disks of 1 TB each per machine
– 2-level network hierarchy, 40 machines per rack
– Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
• Test cluster
– 800 cores, 16 GB each
42. Hadoop Architecture
[Diagram: input data is split into DFS blocks, each replicated three times across the Hadoop cluster; MAP tasks process the blocks where they are stored, and a Reduce task combines their output into the results.]
44. Map/Reduce Processes
Launching application
– User application code
– Submits a specific kind of Map/Reduce job
JobTracker
– Handles all jobs
– Makes all scheduling decisions
TaskTracker
– Manager for all tasks on a given node
Task
– Runs an individual map or reduce fragment for a given job
– Forks from the TaskTracker
45. Map/Reduce Processes (cont'd)
Hadoop Map/Reduce goals:
• Process large data sets
• Cope with hardware failure
• High throughput
46. Hadoop Map-Reduce Architecture
Master-slave architecture
Map-Reduce master: "JobTracker"
– Accepts MR jobs submitted by users
– Assigns map and reduce tasks to TaskTrackers
– Monitors task and TaskTracker status; re-executes tasks upon failure
Map-Reduce slaves: "TaskTrackers"
– Run map and reduce tasks upon instruction from the JobTracker
– Manage storage and transmission of intermediate output
48. NameNode Metadata
• Metadata in memory
– The entire metadata is kept in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A transaction log
– Records file creations, file deletions, etc.
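The metadata categories listed above can be pictured as a small in-memory structure. This is an illustrative Python sketch only, not HDFS's actual Java implementation; all paths, block IDs, and node names are made up:

```python
# Namespace: file -> attributes and list of blocks (kept in NameNode memory).
namespace = {
    "/logs/app.log": {
        "attrs": {"created": "2013-01-15", "replication": 3},
        "blocks": ["blk_1", "blk_2"],
    },
}

# For each block, the DataNodes currently holding a replica.
# (Rebuilt from DataNode block reports, not stored in the transaction log.)
block_locations = {
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],
    "blk_2": ["datanode-b", "datanode-c", "datanode-d"],
}

# The transaction log records namespace mutations such as creations/deletions.
transaction_log = [("create", "/logs/app.log")]

def locate(path):
    # Resolve a file path to the DataNodes holding each of its blocks.
    return [block_locations[b] for b in namespace[path]["blocks"]]

print(locate("/logs/app.log"))
```

Because the whole structure lives in main memory with no demand paging, lookups like `locate` are fast, which is exactly the design trade-off the slide describes.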
49. DataNode
• A block server
– Stores data in the local file system (e.g. ext3)
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to clients
• Block report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates pipelining of data
– Forwards data to other specified DataNodes
50. Block Placement
• Current strategy
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed
• Clients read from the nearest replica
• The developers would like to make this policy pluggable
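The placement strategy on this slide can be sketched as follows. This is a simplified Python illustration of the stated policy, not HDFS's real placement code; the rack and node names are invented for the example:

```python
import random

def place_replicas(local_node, nodes_by_rack, replication=3):
    """Pick DataNodes per the strategy above: first replica on the
    local node, second on a remote rack, third on that same remote
    rack, any extras placed at random."""
    local_rack = next(r for r, ns in nodes_by_rack.items() if local_node in ns)
    remote_rack = random.choice([r for r in nodes_by_rack if r != local_rack])
    replicas = [local_node]
    # Second and third replicas on the same remote rack.
    replicas += [n for n in nodes_by_rack[remote_rack] if n != local_node][:2]
    all_nodes = [n for ns in nodes_by_rack.values() for n in ns]
    while len(replicas) < replication:  # random additional replicas
        pick = random.choice(all_nodes)
        if pick not in replicas:
            replicas.append(pick)
    return replicas[:replication]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", racks))  # ['n1', 'n3', 'n4'] with this two-rack layout
```

Keeping two replicas on one remote rack limits cross-rack traffic on write while still surviving the loss of an entire rack.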
51. Data Correctness
• Use checksums to validate data
– Uses CRC32
• File creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksums
• File access
– Client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
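The per-512-byte CRC32 scheme can be illustrated with Python's standard `zlib.crc32`. This is a sketch of the idea only; HDFS's actual on-disk checksum format differs:

```python
import zlib

CHUNK = 512  # the client checksums each 512-byte chunk, per the slide

def checksum_chunks(data: bytes):
    # Compute one CRC32 per 512-byte chunk, as the client does on write.
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored):
    # On read, recompute and compare; a mismatch means this replica is
    # corrupt, and the client would fall back to another replica.
    return checksum_chunks(data) == stored

data = b"x" * 1300                     # spans three chunks (512 + 512 + 276)
stored = checksum_chunks(data)
assert verify(data, stored)
corrupted = data[:600] + b"y" + data[601:]
assert not verify(corrupted, stored)   # flip in the second chunk is detected
```

Checksumming small fixed-size chunks means a single-bit flip pinpoints which chunk is bad, so only that portion needs to be re-read from another replica.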
52. NameNode Failure
• A single point of failure
• The transaction log is stored in multiple directories:
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• A real HA solution still needs to be developed
53. Data Pipelining
• Client retrieves a list of DataNodes on which to place
replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next DataNode
in the Pipeline
• When all replicas are written, the Client moves on to write
the next block in the file
59. Hadoop Web Interface
• MapReduce Job Tracker Web Interface
The job tracker web UI provides information about general job statistics of
the Hadoop cluster, running/completed/failed jobs and a job history log file.
It also gives access to the Hadoop log files of the local machine (the machine on
which the web UI is running).
By default, it's available at http://localhost:50030/
• Task Tracker Web Interface
The task tracker web UI shows you running and non-running tasks. It also
gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50060/
• HDFS Name Node Web Interface
The name node web UI shows you a cluster summary including information
about total/remaining capacity, live and dead nodes. Additionally, it allows
you to browse the HDFS namespace and view the contents of its files in the
web browser. It also gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50070/
61. HBASE
HBase is a database: the Hadoop database. It is indexed by row key,
column key, and timestamp.
HBase stores structured and semistructured data naturally so you can
load it with tweets and parsed log files and a catalog of all your products
right along with their customer reviews.
It can store unstructured data too, as long as it's not too large
HBase is designed to run on a cluster of computers instead of a single
computer. The cluster can be built using commodity hardware; HBase
scales horizontally as you add more machines to the cluster.
62. HBASE (Contd…)
Each node in the cluster provides a bit of storage, a bit of cache,
and a bit of computation as well.
This makes HBase incredibly flexible and forgiving. No node is
unique, so if one of those machines breaks down, you simply
replace it with another.
This adds up to a powerful, scalable approach to data that, until
now, hasn't been commonly available to mere mortals.
63. HBASE DATA MODEL:
HBase data model – these six concepts form the foundation of HBase.
Table:
HBase organizes data into tables. Table names are Strings and composed of characters
that are safe for use in a file system path.
Row:
Within a table, data is stored according to its row. Rows are identified uniquely by
their row key. Row keys don't have a data type and are always treated as a byte[].
Column family:
Data within a row is grouped by column family. Column families also impact the
physical arrangement of data stored in HBase.
For this reason, they must be defined up front and aren't easily modified. Every row
in a table has the same column families, although a row need not store data in all its
families. Column family names are Strings and composed of characters that are safe for
use in a file system path.
64. Column qualifier:
Data within a column family is addressed via its column qualifier, or column.
Column qualifiers need not be specified in advance. Column qualifiers need not be
consistent between rows.
Like row keys, column qualifiers don't have a data type and are always treated as a
byte[].
Cell:
A combination of row key, column family, and column qualifier uniquely identifies
a cell. The data stored in a cell is referred to as that cell's value. Values also don't
have a data type and are always treated as a byte[].
Version:
Values within a cell are versioned. Versions are identified by their timestamp, a
long. When a version isn't specified, the current timestamp is used as the basis for the
operation. The number of cell value versions retained by HBase is configured via the
column family. The default number of cell versions is three.
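The six concepts fit together as nested maps; a toy model (illustrative only, not the HBase API):

```python
import time
from collections import defaultdict

class HBaseTableSketch:
    """Toy model of the HBase data model: row key -> column family ->
    qualifier -> {timestamp: value}, keeping at most max_versions."""
    def __init__(self, families, max_versions=3):  # default: three versions
        self.families = set(families)  # families are fixed when the table is defined
        self.max_versions = max_versions
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts=None):
        assert family in self.families, "column families must be declared up front"
        ts = ts if ts is not None else time.time_ns()
        cell = self.rows[row][family].setdefault(qualifier, {})
        cell[ts] = value
        while len(cell) > self.max_versions:  # drop the oldest versions
            del cell[min(cell)]

    def get(self, row, family, qualifier):
        cell = self.rows[row][family][qualifier]
        return cell[max(cell)]  # newest timestamp wins when none is specified

t = HBaseTableSketch(families=["info"])
for i in range(4):
    t.put(b"row1", "info", "name", "v%d" % i, ts=i)
print(t.get(b"row1", "info", "name"))  # v3
```

Note how the row key and values are plain byte strings with no data type, matching the byte[] semantics above.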
66. HBase Tables and Regions
Table is made up of any number of regions.
Region is specified by its startKey and endKey.
Empty table: (Table, NULL, NULL)
Two-region table: (Table, NULL, “com.ABC.www”) and
(Table, “com.ABC.www”, NULL)
Each region may live on a different node and is made up of
several HDFS files and blocks, each of which is replicated by
Hadoop
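Locating the region responsible for a row key is a range lookup over (startKey, endKey) pairs; a minimal sketch using the two-region example above:

```python
def find_region(regions, row_key):
    """regions: list of (start_key, end_key); None means open-ended.
    Returns the index of the region whose [start, end) range holds row_key."""
    for i, (start, end) in enumerate(regions):
        if (start is None or row_key >= start) and (end is None or row_key < end):
            return i
    raise KeyError(row_key)

# Two-region table split at "com.ABC.www", as on the slide:
regions = [(None, "com.ABC.www"), ("com.ABC.www", None)]
print(find_region(regions, "com.AAA.www"))  # 0: sorts before the split point
print(find_region(regions, "com.XYZ.www"))  # 1: sorts at or after the split point
```

Start keys are inclusive and end keys exclusive here, which is why the split key itself lands in the second region.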
68. Why Next Generation MR
Reliability
Availability
Scalability - Clusters of 10,000 machines and 200,000
cores, and beyond.
Backward (and Forward) Compatibility
Ensure customers’ MapReduce applications run
unchanged in the next version of the framework.
Evolution – Ability for customers to control upgrades to
the Hadoop software stack.
Predictable Latency – A major customer concern.
Cluster utilization
69. Why Next Generation MR
Secondary Requirements
–Support for alternate programming
paradigms to MapReduce.
–Support for short-lived services
73. Resource Manager (RM)
• A pure Scheduler
• No monitoring or tracking of
application status
• No guarantee on restarting
failed tasks.
74. Resource Manager (RM)
• Each client/application may
request multiple resources
– Memory
– Network
– CPU
– Disk, …
• This is a significant change
from static Mapper /
Reducer model
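The move from fixed map/reduce slots to multi-dimensional resource requests can be sketched as a first-fit allocation over per-node capacities (names and numbers are illustrative):

```python
def allocate(request, nodes):
    """First-fit: grant the container to the first node that can satisfy
    every requested dimension (memory in MB, vcores, ...)."""
    for name, capacity in nodes.items():
        if all(capacity.get(dim, 0) >= amount for dim, amount in request.items()):
            for dim, amount in request.items():
                capacity[dim] -= amount  # reserve the resources on that node
            return name
    return None  # no node can host this container right now

nodes = {"node1": {"memory_mb": 4096, "vcores": 2},
         "node2": {"memory_mb": 8192, "vcores": 8}}
print(allocate({"memory_mb": 6144, "vcores": 4}, nodes))  # node2
print(allocate({"memory_mb": 1024, "vcores": 1}, nodes))  # node1
```

A container is just such a granted bundle of resources; nothing in the request says "mapper" or "reducer", which is exactly the change from the static model.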
75. Application Master
• A per-application
ApplicationMaster (AM) that
manages the application's life
cycle (scheduling and
coordination).
• An application is either a single
job in the classic MapReduce
sense, or a DAG of such jobs.
76. Application Master
A per-application
ApplicationMaster (AM) that
manages the application's life
cycle.
77. Application Master
• Application Master has the
responsibility of
– negotiating appropriate resource
containers from the Scheduler
– launching tasks
– tracking their status
– monitoring for progress
– handling task-failures.
78. Node Manager
• The NodeManager is the per-machine
framework agent
– responsible for launching the
applications' containers,
monitoring their resource usage
(cpu, memory, disk, network) and
reporting the same to the
Scheduler.
79. Gain with New Architecture
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
• Support for programming paradigms other than MapReduce
80. Gain with New Architecture
• RM and Job manager segregated
• The Hadoop MapReduce JobTracker
spends a very significant portion of
time and effort managing the life
cycle of applications
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
81. Gain with New Architecture
• ResourceManager
– Uses ZooKeeper for fail-over.
– When primary fails, secondary can
quickly start using the state stored
in ZK
• Application Master
– MapReduce NextGen supports
application specific checkpoint
capabilities for the
ApplicationMaster.
– MapReduce ApplicationMaster can
recover from failures by restoring
itself from state saved in HDFS.
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
82. Gain with New Architecture
• MapReduce NextGen uses wire-
compatible protocols to allow
different versions of servers and
clients to communicate with
each other.
• Rolling upgrades for the cluster
in future.
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
83. Gain with New Architecture
• New framework is generic.
– Can support non-MR parallel
computing techniques
– Different versions of MR running in
parallel
– End users can upgrade to MR versions
on their own schedule
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
84. Gain with New Architecture
• MRv2 uses a general concept of a
resource for scheduling and allocating to
individual applications.
• A container can host a mapper, a
reducer, or any other task
• The rigid notion of fixed Mapper /
Reducer slots is abolished
• Better cluster utilization
• Scalability
• Availability
• Wire-compatibility
• Innovation & Agility
• Cluster Utilization
87. When Hadoop 1.0.0 was released by Apache in 2011, comprising
mainly HDFS and MapReduce, it soon became clear that Hadoop
was not simply another application or service, but a platform
around which an entire ecosystem of capabilities could be built.
Since then, dozens of self-standing software projects have sprung
into being around Hadoop, each addressing a variety of problem
spaces and meeting different needs.
Many of these projects were begun by the same people or
companies who were the major developers and early users of
Hadoop; others were initiated by commercial Hadoop distributors.
The majority of these projects now share a home with Hadoop at
the Apache Software Foundation, which supports open-source
software development and encourages the development of the
communities surrounding these projects.
89. SQOOP
Data Import/ Export.
Sqoop is a tool designed to help
users import data from existing
relational databases into their Hadoop
clusters.
Automatic data import.
Easily imports data from many
databases into Hadoop.
Generates code for use in MapReduce
applications.
Source: Big Data Analytics with Hadoop
90. Sqoop is a tool designed to transfer data between Hadoop
and relational databases.
You can use Sqoop to import data from a relational database
management system (RDBMS) such as MySQL or Oracle into
the Hadoop Distributed File System (HDFS), transform the
data in Hadoop MapReduce, and then export the data back
into an RDBMS.
What is Sqoop?
93. HIVE
Hive is a data warehouse infrastructure built on top of
Hadoop for providing data summarization, query, and
analysis.
– ETL.
– Structure.
– Access to different storage.
– Query execution via MapReduce.
While initially developed by Facebook, Apache Hive is now
used and developed by other companies such as Netflix.
Key Building Principles:
– SQL is a familiar language
– Extensibility – Types, Functions, Formats, Scripts
– Performance
95. Hive, Why?
• Need a Multi Petabyte Warehouse
• Files are insufficient data abstractions
– Need tables, schemas, partitions, indices
• SQL is highly popular
• Need for an open data format
– RDBMSs have closed data formats
– Need a flexible schema
• Hive is a Hadoop subproject!
96. Hadoop & Hive History
• Dec 2004 – Google GFS paper published
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Becomes Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000 node test cluster
• Sept 2008 – Hive becomes a Hadoop subproject
98. Data model
• Hive structures data into well-understood database concepts
such as tables, rows, columns, and partitions
• It supports primitive types: integers, floats, doubles, and
strings
• Hive also supports:
– associative arrays: map<key-type, value-type>
– lists: list<element type>
– structs: struct<field name: field type…>
• SerDe: a serialize/deserialize API used to move data
in and out of tables
99. Query Language (HiveQL)
• Subset of SQL
• Meta-data queries
• Limited equality and join predicates
• No inserts into existing tables (to preserve the
write-once, read-many property)
– Can overwrite an entire table
101. Hive - DDL
Alter table
hive> ALTER TABLE customer ADD COLUMNS ( age INT) ;
Drop table
hive> DROP TABLE customer;
102. HiveQL Examples
HiveQL, an SQL-like language
hive> SELECT a.age FROM customer a WHERE a.sdate = '2008-08-15';
selects the age column for one partition of the table but does not store it
hive> INSERT OVERWRITE DIRECTORY '/data/hdfs_file'
SELECT a.* FROM customer a WHERE a.sdate = '2008-08-15';
writes the selected customer rows to an HDFS directory
103. Wordcount in Hive
FROM (
MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
FROM docs
CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
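The streaming scripts themselves are not shown on the slide; a minimal sketch of what wc_mapper.py and wc_reduce.py might contain (hypothetical implementations, written as testable functions rather than raw stdin loops):

```python
from itertools import groupby

def wc_mapper(lines):
    # wc_mapper.py would read rows of doctext and emit one
    # (word, 1) pair per word, tab-separated on stdout.
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def wc_reduce(pairs):
    # wc_reduce.py receives its input clustered by word (CLUSTER BY word),
    # so a single grouped pass can sum the counts per word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(cnt for _, cnt in group)

docs = ["big data is big"]
counts = dict(wc_reduce(sorted(wc_mapper(docs))))
print(counts)  # {'big': 2, 'data': 1, 'is': 1}
```

The `sorted` call stands in for the clustering Hive performs between the MAP and REDUCE clauses.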
104. Hive Usage in Facebook
• Hive and Hadoop are extensively used in Facebook for
different kinds of operations.
• 700 TB = 2.1 PB after (3x) replication!
• Think of other application model that can leverage
Hadoop MR.
105. Hive – Related Projects
Apache Flume – move large data sets to Hadoop
Apache Sqoop – cmd line, move rdbms data to Hadoop
Apache Hbase – Non relational database
Apache Pig – analyse large data sets
Apache Oozie – work flow scheduler
Apache Mahout – machine learning and data mining
Apache Hue – Hadoop user interface
Apache ZooKeeper – distributed coordination / configuration
107. Introduction
• What is Pig?
– An open-source high-level dataflow system
– Provides a simple language for queries and data
manipulation, Pig Latin, that is compiled into map-reduce
jobs that are run on Hadoop
– Pig Latin combines the high-level data manipulation
constructs of SQL with the procedural programming of
map-reduce
• Why is it important?
– Companies and organizations like Yahoo, Google and
Microsoft are collecting enormous data sets in the form of
click streams, search logs, and web crawls
– Some form of ad-hoc processing and analysis of all of this
information is required
108. Existing Solutions
• Parallel database products (ex: Teradata)
– Expensive at web scale
– Data analysis programmers find the declarative SQL
queries to be unnatural and restrictive
• Raw map-reduce
– Complex n-stage dataflows are not supported; joins
and related tasks require workarounds or custom
implementations
– Resulting code is difficult to reuse and maintain; shifts
focus and attention away from data analysis
109. Language Features
• Several options for user-interaction
– Interactive mode (console)
– Batch mode (prepared script files containing Pig Latin commands)
– Embedded mode (execute Pig Latin commands within a Java program)
• Built primarily for scan-centric workloads and read-only data
analysis
– Easily operates on both structured and schema-less, unstructured data
– Transactional consistency and index-based lookups not required
– Data curation and schema management can be overkill
• Flexible, fully nested data model
• Extensive UDF support
– Currently must be written in Java
– Can be written for filtering, grouping, per-tuple processing, loading and
storing
110. Pig Latin vs. SQL
• Pig Latin is procedural (dataflow programming model)
– Step-by-step query style is much cleaner and easier to write and
follow than trying to wrap everything into a single block of
SQL
Source: http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html
111. Pig Latin vs. SQL (continued)
• Lazy evaluation (data not processed prior to STORE command)
• Data can be stored at any point during the pipeline
• An execution plan can be explicitly defined
– No need to rely on the system to choose the desired plan via optimizer hints
• Pipeline splits are supported
– SQL requires the join to be run twice or materialized as an intermediate result
Source: http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html
112. Data Model
• Supports four basic types
– Atom: a simple atomic value (int, long, double, string)
• ex: 'Peter'
– Tuple: a sequence of fields that can be any of the data types
• ex: ('Peter', 14)
– Bag: a collection of tuples of potentially varying structures,
can contain duplicates
• ex: {('Peter'), ('Bob', (14, 21))}
– Map: an associative array, the key must be a chararray but
the value can be any type
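The four types map onto rough Python analogues (illustrative only; Pig's own types are defined by Pig Latin, not Python):

```python
# Pig's four basic types as Python stand-ins:
atom = "Peter"                          # Atom: a simple atomic value
tup = ("Peter", 14)                     # Tuple: ordered fields of any type
bag = [("Peter",), ("Bob", (14, 21))]   # Bag: tuples of varying structure,
                                        # duplicates allowed (so a list, not a set)
amap = {"age": 14, "name": "Peter"}     # Map: chararray key, any value type

# The model is fully nested: a tuple field can itself be a bag or a map.
nested = ("Peter", [("cat",), ("dog",)], {"city": "Pune"})
print(nested[1][0][0])  # cat
```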
113. Data Model (continued)
• By default Pig treats undeclared fields as bytearrays
(collection of uninterpreted bytes)
• Can infer a field's type based on:
– Use of operators that expect a certain type of field
– UDFs with a known or explicitly set return type
– Schema information provided by a LOAD function
or explicitly declared using an AS clause
• Type conversion is lazy
114. Pig problem
• Fragment-replicate; skewed; merge join
• User has to know when to use which join
• Because… Pig is a domestic animal;
it does whatever you tell it to do.
– Alan Gates
Images from http://wiki.apache.org/pig/PigTalksPapers
116. Hue – What is it?
Hue = Hadoop User Experience
Hue is an open-source Web interface that supports Apache
Hadoop and its ecosystem, licensed under the Apache v2 license.
Its main goal is to have the users "just use" Hadoop without
worrying about the underlying complexity or using a command
line
An open source Hadoop GUI
Developed by Cloudera
Web based
Many functions
117. Hue – Why ???
It is widely used
It ships with Hadoop
It integrates with Hadoop tools, e.g.
Hive
Oozie
HDFS
It has an API for app creation
118. Hue Features
HDFS file browser
Job browser / designer
Hive / Pig query editor
Oozie app for work flows
Has Hadoop API
Access to shell
User Admin
App for Solr searches
125. What is Apache Flume?
● It is a distributed data collection service that gets
flows of data (like logs) from their source and
aggregates them to where they have to be processed.
● Goals: reliability, scalability, extensibility,
manageability.
Exactly what I needed!
126. The Flume Model: Flows and Nodes
● A flow corresponds to a type of data source (server
logs, machine monitoring metrics...).
● Flows are comprised of nodes chained together.
127. The Flume Model: Flows and Nodes
● In a Node, data come in through a source, are optionally processed
by one or more decorators, and are then transmitted out via a sink.
● Source examples: Console, Exec, Syslog, IRC, Twitter, other nodes...
● Decorator examples: wire batching, compression, sampling, projection, extraction...
● Sink examples: Console, local files, HDFS, S3, other nodes...
128. The Flume Model: Agent, Processor and Collector Nodes
● Agent: receives data from an application.
● Processor (optional): intermediate processing.
● Collector: writes data to permanent storage.
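The source → decorators → sink pipeline of a node can be sketched in a few lines (a toy model, not Flume's Java Source/Sink API):

```python
def make_node(source, sink, decorators=()):
    """A Flume-style node: pull events from a source, pass them through
    the decorators in order, then hand each result to a sink."""
    def run():
        for event in source():
            for decorate in decorators:
                event = decorate(event)
            sink(event)
    return run

collected = []
agent = make_node(
    source=lambda: iter(["ERROR disk full", "INFO ok"]),  # e.g. tailing a log
    decorators=(str.lower,),                              # e.g. normalization
    sink=collected.append,                                # e.g. forward to a collector
)
agent()
print(collected)  # ['error disk full', 'info ok']
```

Chaining nodes is just wiring one node's sink to the next node's source, which is how agent, processor and collector tiers compose.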
129. The Flume Model: Data and Control
Path (1/2)
Nodes are in the data path.
130. The Flume Model: Data and Control
Path (2/2)
Masters are in the control path.
● Centralized point of configuration; multiple masters coordinate via ZooKeeper.
● Specify sources, sinks and control data flows.
134. Flume Goals: Extensibility
Simple Source and Sink API
Event streaming and composition of simple
operation
Plug in Architecture
Add your own sources, sinks, decorators
136. Conclusion
Big data is here to stay. It is impossible to imagine
the next generation of applications without them consuming data,
producing new forms of data, and containing data-driven
algorithms.
As compute environments become cheaper, application
environments become networked over the cloud. So security,
access control, compression and encryption introduce
challenges that have to be addressed in a systematic manner.
137. References
[1] Chris Eaton, Dirk Deroos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big
Data: Analytics for Enterprise Class Hadoop and Streaming Data, pp. 3-49.
[2] Mike Barlow, Real-Time Data Analytics: Emerging Architecture, February 2013, first edition,
pp. 1-21.
[3] Sachidanand Singh, Nirmala Singh, Big Data Analytics, 2012 International Conference on
Communication, Information & Computing Technology (ICCICT), Oct. 19-20, Mumbai,
India
[4] Big Data Introduction, www.youtube.com/watch?v=e6kovHZ6FVc
[5] Hadoop Video, www.youtube.com/watch?v=OoEpfbbyga8
[6] Cloud Security Alliance, Big Data Security and privacy issues, November 2012.
[7] http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
[8] http://public.yahoo.com/gogate/hadoop-tutorial/start-tutorial.html
[9] http://www.youtube.com/watch?v=5Eib_H_zCEY&feature=related
[10] http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=related
[11] http://labs.google.com/papers/gfs-sosp2003.pdf
[12] http://hadoop.apache.org/core/docs/current/hdfs_design.html
[13] http://hadoop.apache.org/core/docs/current/api/
[14] http://hadoop.apache.org/hive/
[15] http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsieh_hadoop_log_processing
[16] http://www.slideshare.net/cloudera/inside-flume
[17] http://www.slideshare.net/cloudera/flume-intro100715
[18] http://www.slideshare.net/cloudera/flume-austin-hug-21711