A Modern Data Architecture with Apache Hadoop
The Journey to a Data Lake
2014 Hortonworks
www.hortonworks.com
Executive Summary
Apache Hadoop didn't disrupt the datacenter, the data did.
Shortly after corporate IT functions within enterprises adopted large-scale systems to manage data, the Enterprise Data Warehouse (EDW) emerged as the logical home of all enterprise data. Today, every enterprise has a Data Warehouse that serves to model and capture the essence of the business from its enterprise systems.
The explosion of new types of data in recent years, from inputs such as the web and connected devices, or just sheer volumes of records, has put tremendous pressure on the EDW.
In response to this disruption, an increasing number of organizations have turned to
Apache Hadoop to help manage the enormous increase in data whilst maintaining
coherence of the Data Warehouse.
This paper discusses Apache Hadoop and its capabilities as a data platform: how the core of Hadoop and its surrounding ecosystem of solution vendors provide the enterprise requirements to integrate alongside the Data Warehouse and other enterprise data systems as part of a modern data architecture, and as a step on the journey toward delivering an enterprise Data Lake.
An enterprise data lake provides the following core benefits to an enterprise:
New efficiencies for data architecture through a significantly lower cost of storage,
and through optimization of data processing workloads such as data transformation
and integration.
New types of data are placing unprecedented demands on the data systems within the enterprise. These new types of data stem from systems of engagement: clickstream, social media, server logs, geolocation and sensor data. The data from these sources has a number of features that make it a challenge for a data warehouse:
Exponential Growth. An estimated 2.8ZB of data in 2012 is expected to grow to 40ZB by 2020. 85% of this data growth is expected to come from new types, with machine-generated data a principal driver.
Varied Nature. The incoming data can have little or no structure, or structure that changes too frequently for reliable schema creation at time of ingest.
Value at High Volumes. The incoming data can have little or no value as individual records, or small groups of records. But at high volumes and over longer historical perspectives, it can be inspected for patterns and used for advanced analytic applications.
The value within these new types of data is being proven by many enterprises across many industries, from retail to healthcare to telecommunications.
What is Hadoop?
The technology that has emerged as the way to tackle this challenge and realize the value in big data is Apache Hadoop, whose momentum was described as "unstoppable" by Forrester Research in The Forrester Wave: Big Data Hadoop Solutions, Q1 2014.
The maturation of Apache Hadoop in recent years has broadened its capabilities from simple data processing of large data sets to a fully-fledged data platform, with the services the enterprise needs to store, process, access and govern its digital data.
Fig. 1 A Modern Data Architecture with Apache Hadoop integrated with existing data systems. Sources (OLTP, ERP and CRM systems; documents & emails; web logs and clickstreams; social networks; machine-generated sensor data; geolocation data) feed both Hadoop, with its data management, data access, governance & integration, security and operations layers, and the existing data systems (RDBMS, MPP repositories), which together serve applications for statistical analysis, BI and reporting, ad hoc analysis, and interactive web & mobile and enterprise applications.
Hortonworks is dedicated to enabling Hadoop as a key component of the data center, and having partnered deeply with some of the largest data warehouse vendors, we have observed several key opportunities and efficiencies that Hadoop brings to the enterprise.
Fig. 2 Where a SQL system requires structured questions to be asked of data structured at load time, Hadoop collects data in any form and supports batch, interactive, real-time and in-memory processing over it.
Multi-use, Multi-workload Data Processing. By supporting multiple access methods (batch, real-time, streaming, in-memory, etc.) to a common data set, Hadoop enables analysts to transform and view data in multiple ways (across various schemas), obtaining closed-loop analytics by bringing time-to-insight closer to real time than ever before.
For example, a manufacturing plant may choose to react to incoming sensor data with real-time data processing, enable data analysts to review logs during the day with interactive processing, and run a series of batch processes overnight. Hadoop enables this scenario to happen on a single cluster of shared resources and a single version of the data.
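The idea of multiple workloads over one copy of the data can be sketched in plain Python. This is illustrative only, not Hadoop code: the sensor names and threshold are hypothetical, and the two functions stand in for a streaming engine and a batch engine sharing a data set.

```python
# Illustrative sketch: two access patterns, record-at-a-time ("real-time")
# and whole-history ("batch"), run against the same shared dataset,
# mirroring how YARN lets multiple engines share one copy of data in HDFS.

sensor_readings = [  # hypothetical plant sensor data
    {"sensor": "press-01", "temp_c": 71.2},
    {"sensor": "press-01", "temp_c": 98.6},
    {"sensor": "weld-07", "temp_c": 64.0},
]

def realtime_alerts(readings, threshold=95.0):
    """React to individual records as they arrive (stream-style)."""
    return [r["sensor"] for r in readings if r["temp_c"] > threshold]

def batch_average(readings):
    """Overnight batch job: aggregate over the full history."""
    return sum(r["temp_c"] for r in readings) / len(readings)

alerts = realtime_alerts(sensor_readings)   # immediate reaction
avg = batch_average(sensor_readings)        # overnight aggregate
```

Both functions read the same list; neither needs its own copy of the data, which is the point of the shared-cluster scenario above.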
Fig. 3 Comparative cost per raw terabyte across storage options (Hadoop, NAS, engineered systems, MPP, SAN), on a scale from $0 to $180,000, with Hadoop the lowest-cost option by a wide margin.
Source: Juergen Urbanski, Board Member Big Data & Analytics, BITKOM
Data Warehouse Workload Optimization. The scope of tasks being executed by the EDW has grown considerably across ETL, analytics and operations. The ETL function is a relatively low-value computing workload that can be performed at much lower cost. Many users off-load this function to Hadoop, where data is extracted and transformed, and then the results are loaded into the data warehouse.
The result: critical CPU cycles and storage space are freed up in the data warehouse, enabling it to perform the truly high-value functions, analytics and operations, that best leverage its advanced capabilities.
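The offloaded ETL pattern described above can be sketched as follows. This is a minimal, hypothetical example: the log format, field names and aggregation are invented for illustration; in practice the transform step would run as a Hadoop job over far larger volumes, with only the compact result loaded into the warehouse.

```python
# Sketch of the ETL-offload pattern: raw records are parsed and aggregated
# on the (hypothetical) Hadoop side, and only the small aggregate result is
# loaded into the data warehouse.

from collections import defaultdict

raw_log_lines = [  # illustrative raw sales records: date,store,amount
    "2014-03-01,store-12,89.50",
    "2014-03-01,store-12,12.25",
    "2014-03-01,store-40,5.00",
]

def transform(lines):
    """Aggregate raw sales lines into per-store daily totals."""
    totals = defaultdict(float)
    for line in lines:
        day, store, amount = line.split(",")
        totals[(day, store)] += float(amount)
    return totals

def load_rows(totals):
    """Shape the aggregate for a warehouse load (e.g. a bulk insert)."""
    return [{"day": d, "store": s, "total": round(t, 2)}
            for (d, s), t in sorted(totals.items())]

rows = load_rows(transform(raw_log_lines))
```

Three raw lines collapse to two warehouse rows here; at production scale that reduction is what frees the warehouse's CPU cycles and storage for analytics.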
Fig. 4 EDW workload mix before and after offload: a warehouse spending 50% on operations, 30% on the ETL process and 20% on analytics shifts, once ETL moves to Hadoop, to 50% operations and 50% analytics.
Fig. 5 Enterprise Hadoop capabilities: Data Management, Data Access, Governance & Integration, Security (a layered approach through authentication, authorization, accounting and data protection) and Operations (deploy and effectively manage the platform), with deployment choice across physical, virtual and cloud infrastructure.
These Enterprise Hadoop capabilities are aligned to the following functional areas that are a foundational
requirement for any platform technology:
Data Management. Store and process vast quantities of data in a scale out storage layer.
Data Access. Access and interact with your data in a wide variety of ways spanning batch,
interactive, streaming, and real-time use cases.
Data Governance & Integration. Quickly and easily load data, and manage according to policy.
Security. Address requirements of Authentication, Authorization, Accounting and Data Protection.
Operations. Provision, manage, monitor and operate Hadoop clusters at scale.
The Apache projects that perform this set of functions are detailed in the following diagram. This set of
projects and technologies represent the core of Enterprise Hadoop. Key technology powerhouses such
as Microsoft, SAP, Teradata, Yahoo!, Facebook, Twitter, LinkedIn and many others are continually
contributing to enhance the capabilities of the open source platform, each bringing their unique capabilities and use cases. As a result, the innovation of Enterprise Hadoop has continued to outpace all
proprietary efforts.
Fig. 6 The components of Enterprise Hadoop.
Data Management: HDFS (the Hadoop Distributed File System) with YARN for resource management.
Data Access: batch (MapReduce), script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), stream (Storm), search (Solr), plus in-memory analytics and other ISV engines.
Governance & Integration: data workflow, lifecycle and governance through Falcon, with ingestion via Sqoop, Flume, NFS and WebHDFS.
Security: authentication, authorization, accounting and data protection enforced at the storage (HDFS), resource (YARN), access (Hive), pipeline (Falcon) and cluster perimeter (Knox) layers.
Operations: provisioning, management and monitoring with Ambari, coordination with ZooKeeper, and job scheduling with Oozie.
Deployment choice: Linux or Windows, on-premise or cloud.
Data Management: The Hadoop Distributed File System (HDFS) is the core technology for the efficient scale-out storage layer, and is designed to run across low-cost commodity hardware. Apache Hadoop YARN is the prerequisite for Enterprise Hadoop: it provides the resource management and pluggable architecture that enable a wide variety of data access methods to operate on data stored in Hadoop with predictable performance and service levels.
Data Access: Apache Hive is the most widely adopted data access technology, though there are
many specialized engines. For instance, Apache Pig provides scripting capabilities, Apache Storm
offers real-time processing, Apache HBase offers columnar NoSQL storage and Apache Accumulo
offers cell-level access control. All of these engines can work across one set of data and resources
thanks to YARN. YARN also provides flexibility for new and emerging data access methods, for
instance Search and programming frameworks such as Cascading.
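The batch engine mentioned above, MapReduce, is often driven from scripting languages via Hadoop Streaming, where a mapper emits key-value pairs and a reducer aggregates them after the framework's shuffle/sort. The sketch below shows the classic word-count logic as plain Python functions, with the shuffle simulated locally so the flow can be followed end to end; on a cluster the mapper and reducer would read stdin and write stdout.

```python
# Word count in the Hadoop Streaming style: the mapper emits (word, 1)
# pairs, the framework sorts them by key (simulated here with sorted()),
# and the reducer sums the counts for each word.

from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (word, 1) for every word in an input line."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(pairs):
    """Sum counts per word; pairs arrive sorted by key after the shuffle."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the data did", "the data grew"]
shuffled = sorted(kv for line in lines for kv in mapper(line))
counts = dict(reducer(shuffled))
```

The same mapper/reducer pair, fed through Hadoop Streaming, would run unchanged over terabytes of input; only the shuffle here is a stand-in for the framework.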
Data Governance & Integration: Apache Falcon provides policy-based workflows for governance,
while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS
interfaces to HDFS.
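The WebHDFS interface mentioned above exposes HDFS operations over plain HTTP, so any language with an HTTP client can ingest or read data. The helper below only builds WebHDFS v1 request URLs; the host, port and path are placeholders, and no cluster is contacted.

```python
# Build WebHDFS REST URLs of the form
#   http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>&...
# The namenode host and HDFS path below are hypothetical examples.

from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS v1 URL for an operation such as LISTSTATUS,
    OPEN (read a file) or CREATE (write a file)."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

url = webhdfs_url("namenode.example.com", 50070, "/data/raw/logs",
                  "LISTSTATUS")
# A client would then issue an HTTP GET against this URL to list the
# directory, receiving a JSON FileStatuses response.
```

Because the interface is plain HTTP plus JSON, the same pattern covers ingestion (CREATE, APPEND) as well as reads, which is why WebHDFS sits alongside Flume, Sqoop and NFS as an integration path.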
Security: Security is provided at every layer of the Hadoop stack from HDFS and YARN to Hive
and the other Data Access components on up through the entire perimeter of the cluster via
Apache Knox.
Operations: Apache Ambari offers the necessary interface and APIs to provision, manage and
monitor Hadoop clusters and integrate with other management console software.
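Ambari's management console integration happens through its REST API, rooted at /api/v1. The sketch below prepares, but does not send, an authenticated request for the cluster list; the host and credentials are placeholders, and the default Ambari server port of 8080 is assumed.

```python
# Prepare an authenticated Ambari REST API request (not sent here).
# Host and credentials are illustrative placeholders.

import base64
from urllib.request import Request

def ambari_request(host, user, password, resource="clusters"):
    """Build a GET request for an Ambari v1 resource, e.g. 'clusters'
    or 'clusters/<name>/services', with HTTP Basic authentication."""
    url = f"http://{host}:8080/api/v1/{resource}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(url, headers={
        "Authorization": f"Basic {token}",
        "X-Requested-By": "ambari",  # header Ambari requires on write calls
    })

req = ambari_request("ambari.example.com", "admin", "admin")
# urllib.request.urlopen(req) would return the cluster list as JSON.
```

The same URL scheme is what other management consoles use when they integrate with Ambari rather than managing Hadoop hosts directly.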
A Thriving Ecosystem
Beyond these core components, and as a result of innovation such as YARN, Apache Hadoop has a
thriving ecosystem of vendors providing additional capabilities and/or integration points. These partners
contribute to and augment Hadoop with given functionality, and this combination of core and ecosystem
provides compelling solutions for enterprises, whatever their use case. Partners include HP, Microsoft, Rackspace, Red Hat, SAP, Teradata and many others. Examples of partner integrations include:
Business Intelligence and Analytics: All of the major BI vendors offer Hadoop integration, and specialized analytics vendors offer niche solutions for specific data types and use cases.
Data Management and Tools: Many partners offer vertical and horizontal data management solutions alongside Hadoop, and there are numerous tool sets, from SDKs to full IDE experiences, for developing Hadoop solutions.
Infrastructure: While Hadoop is designed for commodity hardware, it can also run as an appliance, and it can be easily integrated into other storage, data and management solutions, both on-premise and in the cloud.
Systems Integrators: Naturally, as Hadoop becomes a component of the enterprise data architecture, SIs of all sizes are building skills to assist with integration and solution development.
Because many of these vendors are already prevalent within the enterprise, providing similar capabilities for the EDW, implementation risk is mitigated: teams are able to leverage existing tools and skills from their EDW workloads.
There is also a thriving ecosystem of new vendors emerging on top of the Enterprise Hadoop platform. These new companies are taking advantage of open APIs and new platform capabilities to create an entirely new generation of applications. The applications they're building leverage both existing and new types of data, and they perform new types of processing and analysis that weren't technologically or financially feasible before the emergence of Hadoop. The result is that these new businesses are harnessing the massive growth in data, creating opportunities for improved insight into customers, better medical research and healthcare delivery, more efficient energy exploration and production, predictive policing, and much more.
Fig. 7 Enterprise Hadoop use cases span industries (financial services, telecommunications, retail, manufacturing, healthcare, pharmaceuticals, advertising, government) and data types, structured and unstructured (clickstream, geographic, text, social, server logs, sensor data).
Fig. 8 The scope of Enterprise Hadoop capabilities (data management, data access, governance & integration, security, operations) relative to the existing EDW, MPP and RDBMS systems it complements.
Hadoop is a key part of the enterprise data strategy, for both efficiency in a modern data architecture and opportunity in an enterprise data lake.
Fig. 9 A Modern Data Architecture with Apache Hadoop integrated with existing data systems: Hadoop sits alongside the RDBMS and MPP repositories, fed by the same sources and serving the same applications shown in Fig. 1.
Case Study 1: A Unified View of the Household
Subscribers often do business with a particular service provider for different types of products and across many different channels, and in doing so, they expect that the service provider be aware of what's happening across every interaction. For this provider, the rapid growth in the volume and type of customer data it was receiving proved too challenging, and as a result, it lacked a unified view of the households it served across all the different data channels.
Fig. 10 The service provider's architecture: traditional sources (CRM, ERP, billing data, subscriber data, product catalog, network data) and new sources (online chat, sensor data, social media, server logs, call detail records, merchant listings, DMP) land in HDFS; batch, script and SQL processing, under governance & integration, security and operations, feed the EDW, MPP systems and data repositories, supporting analyses such as infrastructure investment, bandwidth allocation and product development.
Case Study 2: A 360° View of the Customer
What this large retailer needed was a golden record that unified customer data across all time periods and across all channels, something that had previously been technically difficult. Hadoop made that unified record feasible, delivering key insights that the retailer's marketing team then used to act on customer behavior. The team is still discovering unexpected and unique uses for its 360° view of customer buying behavior.
Fig. 11 The retailer's architecture: traditional sources (CRM, ERP, web transactions, POS transactions, product catalog, staffing, inventory, stores) and new sources (WiFi logs, social media, sensor/RFID and location data) land in HDFS; batch, script and SQL processing, under governance & integration, security and operations, feed the EDW, RDBMS and data repositories, supporting analyses such as a recommendation engine, brand health, price sensitivity and top-down clustering.
Fig. 12 The Hortonworks Data Platform packages the core components of Enterprise Hadoop shown in Fig. 6: the data access engines (MapReduce, Pig, Hive/Tez, HCatalog, HBase, Accumulo, Storm, Solr, in-memory analytics and ISV engines) running on HDFS and YARN, with security applied across storage, resources, access, pipeline and cluster perimeter (Knox), and operations handled by Ambari, ZooKeeper and Oozie.
Completely Open
HDP incorporates the most current community innovation and is rigorously tested for enterprise readiness. HDP is developed and supported by engineers with the deepest and broadest knowledge of Apache Hadoop.
Fundamentally Versatile
HDP supports all big data scenarios: from batch, to interactive, to real-time and streaming. HDP offers a versatile data access layer through YARN at the core of Enterprise Hadoop that allows new processing engines to be incorporated as they become ready for enterprise consumption. HDP provides the comprehensive enterprise capabilities of security, governance and operations required for enterprise implementations of Hadoop.
Wholly Integrated
HDP is designed to run in any data center and integrates with any existing system. HDP can be deployed in any scenario, from Linux to Windows and from on-premise to the cloud, and is deeply integrated with key technology vendor platforms: Red Hat, Microsoft, SAP, Teradata and more.
On-Premise. HDP is the only Hadoop platform that works across Linux and Windows.
In-Cloud. HDP can be run as part of IaaS, and also powers Rackspace's Big Data Cloud, Microsoft's HDInsight Service, CSC and many others.
Appliance. HDP runs on commodity hardware by default, and can also be purchased as an appliance from Teradata.
Read more about the individual components of Enterprise Hadoop: Data Management (HDFS, YARN, MapReduce); Data Access (Pig, Hive, Tez, HBase, Accumulo, Storm, HCatalog); Data Governance & Integration (Falcon, Flume); Security (Knox); Operations (Ambari).
Hortonworks' approach rests on open leadership through the Apache Software Foundation process, broad ecosystem endorsement, and enterprise rigor proven in production deployments of the platform.
For an independent analysis of Hortonworks Data Platform, you can download the Forrester Wave:
Big Data Hadoop Solutions, Q1 2014 from Forrester Research.
About Hortonworks
Hortonworks develops, distributes and supports the only 100% open source Apache Hadoop data
platform. Our team comprises the largest contingent of builders and architects within the Hadoop
ecosystem who represent and lead the broader enterprise requirements within these communities. The
Hortonworks Data Platform provides an open platform that deeply integrates with existing IT investments
and upon which enterprises can build and deploy Hadoop-based applications. Hortonworks has deep
relationships with the key strategic data center partners that enable our customers to unlock the broadest
opportunities from Hadoop. For more information, visit www.hortonworks.com.