November 2011
How Apache Hadoop is Revolutionizing
Business Intelligence and Data Analytics
Dr. Amr Awadallah | Founder, CTO, VP of Engineering
aaa@cloudera.com, twitter: @awadallah
Business Intelligence Before Adopting Apache Hadoop


[Diagram: data stack before Hadoop, top to bottom]
•   BI Reports + Interactive Apps
•   RDBMS (processed data)
•   ETL Compute Grid
•   Storage Only Grid (original raw data; mostly append)
•   Collection
•   Instrumentation

Problems called out on the diagram:
•   Can’t explore original high-fidelity raw data.
•   Moving data to compute doesn’t scale.
•   Archiving = premature data death.


                            ©2011 Cloudera, Inc. All Rights Reserved.
Business Intelligence After Adopting Apache Hadoop

[Diagram: data stack after Hadoop, top to bottom]
•   BI Reports + Interactive Apps, plus Data Exploration & Advanced Analytics
•   RDBMS
•   Hadoop: Storage + Compute Grid (ETL and Aggregations; Complex Data Processing)
•   Collection
•   Instrumentation

Keep Data Alive Forever.


So What is Apache Hadoop?
• A scalable fault-tolerant distributed system for data storage
  and processing (open source under the Apache license).

• Core Hadoop has two main components:
     – Hadoop Distributed File System: self-healing high-bandwidth
       clustered storage.
     – MapReduce: fault-tolerant distributed processing.

• Key business values:
     –   Flexible – Store any data, Run any analysis (Mine First, Govern Later).
     –   Scalable – Start at 1TB/3-nodes then grow to petabytes/1000s of nodes.
     –   Affordable – Cost per TB at a fraction of traditional options.
     –   Open Source – No Lock-In, Rich Ecosystem, Large developer community.
     –   Broadly adopted – A large and active ecosystem, Proven to run at scale.



The Main Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
•   Schema must be created before data is loaded.
•   An explicit load operation has to take place, which transforms the
    data into the DB's internal structure.
•   New columns must be added explicitly before data for such columns
    can be loaded into the database.
•   Benefits: Read is Fast; Standards/Governance.

Schema-on-Read (Hadoop):
•   Data is simply copied to the file store; no transformation is needed.
•   A SerDe (Serializer/Deserializer) is applied at read time to extract
    the required columns.
•   New data can start flowing anytime and will appear retroactively once
    the SerDe is updated to parse it.
•   Benefits: Load is Fast; Flexibility/Agility.
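
The schema-on-read idea can be illustrated with a minimal Python sketch. This is not Hive's SerDe interface: the names `RAW_LINES`, `serde_v1`, `serde_v2`, and `read_with`, as well as the log format, are hypothetical and chosen only for illustration. The raw lines are stored untouched, and a parser applied at read time decides which columns exist, so switching to an updated parser makes new columns appear for data that was already stored.

```python
# Schema-on-read sketch: raw records are stored as-is and only parsed when read.
# All names and the log format are illustrative assumptions, not a Hadoop or Hive API.

RAW_LINES = [
    "2011-11-08 alice /home 200",
    "2011-11-08 bob /cart 500",
]


def serde_v1(line):
    """First 'schema': extract only the date and user columns."""
    date, user, _rest = line.split(" ", 2)
    return {"date": date, "user": user}


def serde_v2(line):
    """Updated 'schema': also extract the path and status columns."""
    date, user, path, status = line.split(" ")
    return {"date": date, "user": user, "path": path, "status": int(status)}


def read_with(serde, lines):
    """Apply a parser at read time; the stored lines are never rewritten."""
    return [serde(line) for line in lines]


rows_v1 = read_with(serde_v1, RAW_LINES)
rows_v2 = read_with(serde_v2, RAW_LINES)
```

Both reads run over the same stored lines; only the parser changed, which is the retroactive behavior described in the Schema-on-Read column.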


What is Complex Data Processing?
1. Java MapReduce: Most flexibility and performance, but tedious
   development cycle (the “assembly language” of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in
   any programming language of your choice, but slightly lower
   performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java.
4. Pig Latin: A high-level language out of Yahoo, suitable for batch
   data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore
   mapping files to their schemas and associated SerDes.
6. Oozie: An XML-based (hPDL) workflow engine that enables creating a
   workflow of jobs composed of any of the above.
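
As a rough illustration of the Streaming MapReduce option in the list above, here is a word count simulated locally in Python. It is a sketch under assumptions: on a real cluster the mapper and reducer would be separate scripts reading standard input and writing standard output, with the framework sorting records by key in between; here they are plain functions, and `run_local` is a hypothetical stand-in for the framework.

```python
# Word count in the style of a Streaming MapReduce job, simulated locally.
# Assumption: on a cluster, mapper and reducer would be separate executables
# connected through stdin/stdout; here they are plain functions for clarity.
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; input must already be sorted by key."""
    keyed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _word, count in group)


def run_local(lines):
    """Stand-in for the framework: map, sort by key (the shuffle), then reduce."""
    return dict(reducer(sorted(mapper(lines))))


counts = run_local(["Hadoop stores data", "Hadoop processes data"])
```

The sort between the two phases is what lets the reducer process one key at a time without holding every record in memory.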


What This Means For You: Agility

Up Front Design vs. Just in Time




What This Means For You: Innovation

Data Committee vs. Data Scientist




What This Means For You: Consolidation

Silos vs. Sharing




What This Means For You: Extract Value from Latent Data


Archive to Tape vs. Keep Data Alive




What This Means For You: Ability to Grow Fluidly




What This Means For You: Data Beats Algorithm


Smarter Algos vs. More Data




Where Does Hadoop Fit in the Enterprise Data Stack?


[Diagram: the enterprise data stack]
•   Users and tools: Data Scientists (IDEs; Development Tools), Analysts
    (BI, Analytics) and Business Users (Enterprise Reporting) through
    Business Intelligence Tools, System Operators (Cloudera Mgmt Suite),
    Data Architects (ETL Tools).
•   Connected systems: Enterprise Data Warehouse, Relational Databases,
    Low-Latency Serving Systems (Web Application, Customers).
•   Data sources: Logs, Files, Web Data.


Use The Right Tool For The Right Job
Relational Databases, use when:
•   Interactive OLAP Analytics (<1sec)
•   Multistep ACID Transactions
•   100% SQL Compliance

Hadoop, use when:
•   Structured or Not (Flexibility)
•   Scalability of Storage/Compute
•   Complex Data Processing


Two Core Use Cases Common Across Many Industries


Industry          Use Case: ADVANCED ANALYTICS    Use Case: DATA PROCESSING
Web               Social Network Analysis         Clickstream Sessionization
Media             Content Optimization            Clickstream Sessionization
Telco             Network Analytics               Mediation
Retail            Loyalty & Promotions            Data Factory
Financial         Fraud Analysis                  Trade Reconciliation
Federal           Entity Analysis                 SIGINT
Bioinformatics    Sequencing Analysis             Genome Mapping
Manufacturing     Product Quality                 Mfg Process Tracking
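
Clickstream Sessionization appears twice in the table, so a small Python sketch of what it involves may help. The 30-minute inactivity timeout, the `(user, timestamp)` record shape, and the function name `sessionize` are assumptions made for illustration; the slides do not specify them. The idea is to group clicks by user and start a new session whenever the gap between consecutive clicks exceeds the timeout.

```python
# Clickstream sessionization sketch: split each user's clicks into sessions
# whenever the gap between consecutive clicks exceeds a timeout.
# The 30-minute timeout and the (user, timestamp) record shape are assumptions.
from collections import defaultdict

SESSION_TIMEOUT_SECONDS = 30 * 60


def sessionize(clicks, timeout=SESSION_TIMEOUT_SECONDS):
    """Return {user: [[timestamps of session 1], [timestamps of session 2], ...]}."""
    by_user = defaultdict(list)
    for user, timestamp in clicks:
        by_user[user].append(timestamp)

    sessions = {}
    for user, timestamps in by_user.items():
        timestamps.sort()
        user_sessions = [[timestamps[0]]]
        for previous, current in zip(timestamps, timestamps[1:]):
            if current - previous > timeout:
                user_sessions.append([current])    # gap too long: start a new session
            else:
                user_sessions[-1].append(current)  # same session continues
        sessions[user] = user_sessions
    return sessions


clicks = [("alice", 0), ("alice", 600), ("bob", 100), ("alice", 4000)]
result = sessionize(clicks)
```

In a MapReduce setting, the grouping by user would plausibly correspond to the shuffle step and the per-user loop to the reduce step, which is why this workload fits the data processing column.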



CDH: Cloudera’s Distribution Including Apache Hadoop

The #1 commercial and non-commercial Apache Hadoop distribution.
Component                 Project(s)
File System Mount         FUSE-DFS
UI Framework/SDK          HUE
Data Mining               APACHE MAHOUT
Workflow                  APACHE OOZIE
Scheduling                APACHE OOZIE
Metadata                  APACHE HIVE
Languages / Compilers     APACHE PIG, APACHE HIVE
Data Integration          APACHE FLUME, APACHE SQOOP
Fast Read/Write Access    APACHE HBASE
Coordination              APACHE ZOOKEEPER

•     Open Source – 100% Apache licensed, 100% Open Source, 100% Free, No Forks.
•     Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA.
•     Proven at Scale – Deployed at hundreds of enterprises across many industries.
•     Integrated – All required component versions & dependencies are managed for you.
•     Industry Standard – Existing RDBMS, ETL and BI systems work best with it.
•     Many Form Factors – Public Cloud, Private Cloud, RHEL, Ubuntu, 32/64bit, etc.


CDH Integrates with Existing IT Infrastructure

[Diagram: integration categories (BI/Analytics, ETL, Databases, Cloud/OS, Hardware) sitting on top of Cloudera’s Distribution including Apache Hadoop]



What is Cloudera Enterprise?
Cloudera Enterprise makes open source Apache Hadoop enterprise-easy:
•   Simplify and Accelerate Hadoop Deployment
•   Reduce Adoption Costs and Risks
•   Lower the Cost of Administration
•   Increase the Transparency & Control of Hadoop
•   Leverage the Experience of Our Experts

CLOUDERA ENTERPRISE COMPONENTS
•   Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
•   Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs

3 of the top 5 telecommunications, mobile services, defense &
intelligence, banking, media and retail organizations depend on Cloudera.

EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production



SCM Express: Simplifies Installation and Configuration

Service & Configuration Manager (SCM) Express takes the complexity out
of deploying and configuring CDH.

•   Provision a complete Hadoop stack in minutes
•   Centrally manage system services through a user-friendly interface
•   Manages services for up to 50 nodes
•   FREE to download


KEY FEATURES
1. Automated, wizard-based installation of the complete Hadoop stack.
2. Central, real-time dashboard for configuration management.
3. Ability to configure the cluster while it’s running.
4. Incorporates comprehensive validation and error checking.
5. Automates the expansion of services to new nodes when they come online.
What I Would Like You To Remember:
• The Key Benefits of the Apache Hadoop Data Platform:
   – Agility/Flexibility (Enables Exploration/Innovation).
   – Complex Data Processing (Any Language, Any Problem).
   – Scalability of Storage/Compute (Freedom to Grow).
   – Economical Active Archive (Keep All Your Data Alive).

• Cloudera Enterprise enables:
   – Lower the Cost of Management and Administration.
   – Simplify and Accelerate Hadoop Deployment.
   – Increase the Transparency & Control of Hadoop.
   – Firm SLAs on Issue Resolution.

Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanced Data Analytics at Yahoo - Amr Awadallah, Cloudera
