
Big Data: Technology Stack
By: Khalid Imran

Agenda
Big Data Stack: In a Nutshell
Data Layer
Data Processing Layer
Data Ingestion Layer
Data Presentation Layer
Operations & Scheduling Layer
Security & Governance

Big Data Technology Stack: In a Nutshell

Data Layer
Hadoop Distributed File System (HDFS)
HDFS is a scalable, fault-tolerant, Java-based distributed file system used for storing
large volumes of data on inexpensive commodity hardware.
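
For illustration, a minimal Java sketch of writing a file to HDFS through the FileSystem API; the namenode address and path are hypothetical placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS is normally picked up from core-site.xml;
        // set explicitly here with a placeholder namenode address.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        // Create (or overwrite) a file and write a line into it.
        try (FSDataOutputStream out = fs.create(new Path("/data/sample.txt"))) {
            out.writeBytes("hello, hdfs\n");
        }
        fs.close();
    }
}
```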

Amazon Simple Storage Service (S3)
S3 is a scalable, distributed, cloud-based storage service from Amazon. It can be
utilized as the data layer in big data applications, coupled with other required
components.

IBM General Parallel File System (GPFS) / Spectrum Scale

GPFS is a high-performance clustered file system developed by IBM.

Data Processing Layer


Hadoop MapReduce
Hadoop MapReduce is a software framework for distributed processing of large data sets on
compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The
framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. A
MapReduce job usually splits the input data set into independent chunks, which are processed by
the map tasks in a completely parallel manner. The framework sorts the outputs of the maps,
which are then input to the reduce tasks. Typically, both the input and the output of the job are
stored in a file system.
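
The canonical word-count job illustrates the model. This sketch follows the standard Hadoop MapReduce Java API; input and output paths are passed as arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner reuses the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```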
Apache Pig
Pig is a high-level platform for creating MapReduce programs used with Hadoop. Apache Pig
allows Apache Hadoop users to write complex MapReduce transformations in a simple
scripting language called Pig Latin. Pig translates the Pig Latin script into MapReduce so that it can
be executed on the data. Pig Latin can be extended with UDFs (User Defined Functions), which the
user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
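
As a sketch of that UDF extension point, a minimal Java EvalFunc that upper-cases its input (the class name is hypothetical):

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial Pig UDF: takes one chararray field and returns it upper-cased.
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```

Once packaged into a jar, the function is registered from a Pig Latin script with REGISTER and then invoked like any built-in function.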

Data Processing Layer


Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. Hive is used to explore, structure and analyze data, then turn
it into actionable business insight. Apache Hive supports analysis of large datasets stored in
Hadoop's HDFS and compatible file systems such as Amazon S3. It provides an SQL-like
language called HiveQL with schema-on-read, and transparently converts queries into MapReduce,
Apache Tez or Apache Spark jobs.
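
One common way to run HiveQL from an application is through Hive's JDBC driver. A minimal sketch, assuming a HiveServer2 instance at a placeholder host and a hypothetical pageviews table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, database and credentials are placeholders.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL; Hive compiles it into MapReduce/Tez/Spark jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM pageviews GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```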

Apache HBase
HBase is an open-source NoSQL database that provides real-time read/write access to large
datasets with extremely low latency as well as fault tolerance. HBase runs on top of HDFS. HBase
provides a strong consistency model and range-based partitioning. Reads, including range-based
reads, tend to scale much better on HBase, whereas writes do not scale as well as they do on
Cassandra.
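
A minimal sketch of the HBase Java client API; the table, column family and row key names are hypothetical, and connection settings (ZooKeeper quorum etc.) come from hbase-site.xml:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("metrics"))) {

            // Write one cell under column family "d", qualifier "temp".
            Put put = new Put(Bytes.toBytes("sensor-42#2016-01-01"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
            table.put(put);

            // Read the same row back.
            Result result = table.get(new Get(Bytes.toBytes("sensor-42#2016-01-01")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```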

Data Processing Layer


Apache Cassandra
Cassandra is another open-source distributed NoSQL database. It is highly scalable, fault tolerant
and can be used to manage huge volumes of data. Cassandra's consistency model is based on
Amazon's Dynamo: it provides eventual consistency. This is very appealing for some applications
where you want to guarantee the availability of writes. Similarly, Cassandra tends to provide very
good write scaling.
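
A minimal sketch using the DataStax Java driver (3.x-style API); the contact point, keyspace and table are hypothetical:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // CQL: create a keyspace and table, insert a row, read it back.
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class':'SimpleStrategy','replication_factor':1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.events "
                + "(id uuid PRIMARY KEY, payload text)");
            session.execute("INSERT INTO demo.events (id, payload) VALUES (uuid(), 'hello')");
            ResultSet rs = session.execute("SELECT id, payload FROM demo.events");
            for (Row row : rs) {
                System.out.println(row.getUUID("id") + " -> " + row.getString("payload"));
            }
        }
    }
}
```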
Apache Storm
Storm is a distributed real-time computation system for processing large volumes of high-velocity
data. Storm makes it easy to reliably process unbounded streams of data, doing for real-time
processing what Hadoop did for batch processing.
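
A minimal topology sketch in the Storm Java API (1.x-style packages): a spout that emits sentences, wired to a bolt that splits them into words. The spout and its sample data are hypothetical stand-ins for a real stream source:

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordTopology {

    // Hypothetical source: emits one sentence per tuple, forever.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] sentences = {"the quick brown fox", "big data in motion"};
        private int index = 0;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        @Override
        public void nextTuple() {
            Utils.sleep(100); // throttle the demo stream
            collector.emit(new Values(sentences[index]));
            index = (index + 1) % sentences.length;
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Splits each incoming sentence tuple into individual word tuples.
    public static class SplitSentenceBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getString(0).split("\\s+")) {
                collector.emit(new Values(word));
            }
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitSentenceBolt(), 2).shuffleGrouping("sentences");
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-topology", new Config(), builder.createTopology());
        Utils.sleep(10_000); // let the topology run briefly in local mode
        cluster.shutdown();
    }
}
```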

Apache Solr
Apache Solr is an open-source platform for searching data stored in Hadoop's HDFS. Solr
powers the search and navigation features of many of the world's largest Internet sites, enabling
powerful full-text search and near-real-time indexing. Apache Solr can be used for rapidly finding
tabular, text, geo-location or sensor data stored in Hadoop.
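
A minimal query sketch with the SolrJ client; the Solr URL, collection name and field names are placeholders, and the builder-style construction assumes a recent SolrJ version:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrQueryExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build()) {
            // Full-text match against a hypothetical "message" field.
            SolrQuery query = new SolrQuery("message:hadoop");
            query.setRows(10);
            QueryResponse response = client.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id") + " -> "
                    + doc.getFieldValue("message"));
            }
        }
    }
}
```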

Data Processing Layer


Apache Spark
Apache Spark is an open-source cluster computing framework for large-scale data processing.
Spark's own benchmarks claim programs can run up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk. It provides in-memory computation for faster data processing
than MapReduce. It runs on top of an existing Hadoop cluster and can access the Hadoop data
store (HDFS), as well as process structured data from Hive and streaming data from HDFS,
Flume, Kafka, Twitter and other sources.
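
For comparison with the MapReduce version above, a minimal Java sketch of the same word count on Spark's RDD API (Spark 2.x-style; the HDFS paths are placeholders). Intermediate data stays in memory rather than being written between stages:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///data/sample.txt");
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // words
                .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1)
                .reduceByKey(Integer::sum);                                    // sum per word
            counts.saveAsTextFile("hdfs:///data/wordcounts");
        }
    }
}
```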

Apache Mahout
Apache Mahout is a library of scalable machine-learning algorithms that can be implemented on
top of Apache Hadoop using the MapReduce paradigm. Machine learning is a discipline of
artificial intelligence focused on enabling machines to learn without being explicitly programmed,
and it is commonly used to improve future performance based on previous outcomes. Mahout
provides the tools and algorithms to automatically find meaningful patterns in big data sets
stored in HDFS.
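
As one concrete example, Mahout's "Taste" collaborative-filtering API can build a user-based recommender in a few lines. A sketch, assuming a hypothetical ratings.csv of userID,itemID,preference rows:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // Each line of ratings.csv: userID,itemID,preference (hypothetical data file).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " @ " + item.getValue());
        }
    }
}
```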

Data Ingestion Layer


Apache Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming data (e.g. application logs, sensor and
machine data, geo-location data and social media) into HDFS. It has a simple and flexible
architecture based on streaming data flows, and is robust and fault tolerant, with
configurable reliability mechanisms for failover and recovery.
Apache Kafka
Kafka is a high-throughput distributed messaging system. Kafka maintains feeds of messages
in categories called topics. Producers are processes that publish messages to a Kafka topic.
Consumers are processes that subscribe to topics and process the feed of published
messages. Kafka runs as a cluster comprising one or more servers, each of which is called
a broker.
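
A minimal Java producer sketch; the broker address, topic name, key and value are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // one or more brokers
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to the "events" topic; the key controls partitioning.
            producer.send(new ProducerRecord<>("events", "sensor-42", "temp=21.5"));
        }
    }
}
```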

Apache Sqoop
Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases or
mainframes. Sqoop can be used to import data from an RDBMS or a mainframe into HDFS,
transform the data using Hadoop MapReduce, and then export the data back into an RDBMS.

Data Presentation Layer


Kibana
Kibana is an open-source analytics and visualization platform that works with Elasticsearch. It
provides real-time summarization and charting of streaming data. Its visualization capabilities
allow users to create a variety of charts, plots and maps of large volumes of data.


Operations & Scheduling Layer


Ambari
Ambari is an open framework that helps provision, manage and monitor Apache Hadoop
clusters. It simplifies the deployment and maintenance of hosts. Ambari also includes an
intuitive web interface that allows one to easily provision, configure and test all the Hadoop
services and core components. It also comes with the powerful Ambari Blueprints API, which
can be utilized to automate cluster installations without any user intervention.
Apache Oozie
Apache Oozie provides operational service capabilities for a Hadoop cluster, specifically around
job scheduling within the cluster. Oozie is a Java-based web application that is primarily used to
schedule Apache Hadoop jobs. Oozie can combine multiple jobs sequentially into one logical unit
of work. It integrates with the Hadoop stack and supports Hadoop jobs for various Apache
tools such as MapReduce, Apache Pig, Apache Hive, and Apache Sqoop.
Apache ZooKeeper
Apache ZooKeeper provides operational services for a Hadoop cluster. It provides a distributed
configuration service, a synchronization service and a naming registry; distributed systems can
use ZooKeeper to store and mediate updates to important configuration information.
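
A minimal sketch of storing and reading a configuration value through the ZooKeeper Java client; the connection string and znode paths are placeholders:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait for the session to be established

        // Create parent and child znodes, storing a config value in the child.
        zk.create("/app", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/app/config", "maxConnections=100".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any node in the cluster can now read the value back.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```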


About Me: Khalid Imran

A tester by passion, I've spent the past 16+ years testing disparate systems, learning new domains, developing
innovative solutions, designing test strategies, challenging conventional methods, proving new techniques and
embracing emerging tools & technologies. The breadth and depth of my experience cuts across functional testing,
non-functional testing, manual and automation testing, test and project execution methodologies, licensed and
open-stack tools, platforms, devices, programming languages, custom-built test harnesses and utilities, delivery
management, client engagement, on-site/off-shore team dynamics and more.
I currently head the 1,400+ strong QA testing practice at Cybage as a QA Evangelist. I manage the Testing
Centre of Excellence (TCoE), lead a team of architects and specialists, and assist in deliveries across the organization,
pre-sales and business development, solutioning and consultancy, and the training and process improvement group. I
hold multiple certifications, namely CSQA, CSM and CPISI.
I welcome any questions or feedback you may have on this presentation.

