Greenplum Analytics Workbench

APURVA DESAI




© Copyright 2012 EMC Corporation. All rights reserved.
Overview




What is Hadoop?
 What is Hadoop?
        –    Distributed computing paradigm
        –    File system – HDFS
        –    Processing framework – MapReduce
        –    Languages – Pig, Hive
        –    Key-value store – HBase
 Why is it important?
        – BIG Data is everywhere
        – BIG Data is mostly unstructured
        – Need affordable, scalable NoSQL processing


Analytics Workbench - Motivation
 Open source
        – Hadoop industry is nascent
        – BIG Data development needs scale


 Greenplum
        – Innovation & Experimentation platform
        – Contribute to the community
        – GPDB & GPHD - Mixed mode environment




Greenplum Vision




Buildout Pre-requisites
 Hardware systems integration


 Hadoop experience


 Program Management


 Partner ecosystem

          Greenplum has in-house expertise

Team Introduction
 System Integration
        – Greg, Eric, Don, Dave, Patrick

 Program Management
        – Mike, Joe

 Hadoop
        – Apurva, Judes, Clinton, Chandra, Ashwin




Partners
 Intel
        – 2,000 Westmere CPUs

 Mellanox
        – 1,000+ NICs
        – 72 IB switches

 Micron
        – 6,000 8GB DRAM

 Seagate
        – 12,000 2TB Drives

 Supermicro
        – 1,000 Chassis/MB


Partners
 Switch
        – Hosting Facilities

 VMware
        – Operational Support
        – Rubicon




Peek @ the Cluster




Cluster Statistics
 Largest cluster for Apache Hadoop validation!

 # Of Physical Hosts : > 1,000 (> 10,000 with VMs)
 # Of Racks : 54 (50 just for the DataNodes)
 # Of Processors : > 24,000
 Amount Of RAM : > 48TB
 Amount of Disk Capacity : > 24PB
        – “Equivalent to nearly half of the entire written works of
          mankind from the beginning of recorded history”
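
The disk and RAM figures above appear to follow from the part counts on the Partners slide. The sketch below reproduces that arithmetic in decimal units. The per-CPU thread count is an assumption (6 cores with two hardware threads each); the slides do not state how "processors" was counted.

```python
# Part counts as listed on the Partners slide.
DRIVES = 12_000      # Seagate drives
DRIVE_TB = 2         # capacity per drive, in TB
DIMMS = 6_000        # Micron DRAM parts
DIMM_GB = 8          # capacity per DRAM part, in GB
CPUS = 2_000         # Intel Westmere CPUs

# ASSUMPTION (not stated on the slides): each CPU exposes 12 hardware
# threads, i.e. 6 cores x 2 threads per core.
THREADS_PER_CPU = 12


def cluster_totals():
    """Return nominal cluster totals using decimal units (1 PB = 1000 TB)."""
    disk_pb = DRIVES * DRIVE_TB / 1000
    ram_tb = DIMMS * DIMM_GB / 1000
    processors = CPUS * THREADS_PER_CPU
    return {"disk_pb": disk_pb, "ram_tb": ram_tb, "processors": processors}


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions the nominal totals come to 24 PB of disk, 48 TB of RAM, and 24,000 hardware threads.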



Namenode




Job Tracker




CPU




Use Cases




Hadoop Review




Hadoop Shuffle




Initial Use Cases
 Apache Hadoop Validation
 Mellanox UDA
 TeraSort Benchmark




Apache Hadoop Validation
 Purpose
        – Run Apache Hadoop Validation at Scale
        – Validate cluster configuration


 Various Configurations Validated
        – Standard Out Of The Box Configs
        – Configs Modified For IO Intensive Processing




Apache Hadoop Preliminary Results
[Chart: Apache Hadoop-1.0.0 validation. Y-axis: Execution Time (Min), 0 to 1.2. Series: 1000 Nodes.]




Apache Hadoop Findings
 Apache BigTop for integration tests
 Functional validation passed as expected


 Next Steps
        – Identify integration cases
        – Contribute back to BigTop
        – Stabilize Hadoop 0.23




Mellanox UDA - Overview
 RDMA in Hadoop Shuffle stage
 Register map and reduce task buffers
 Hadoop JobTracker (JT) for task completion
 Copy sorted map task output to reduce input
 Perform in-memory merge at reduce
 Avoid disk spills for large inputs
 Reduce CPU load for sort & merge
 GP + Mellanox collaboration
        – Open Sourcing UDA
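
The in-memory merge step can be pictured with a generic k-way merge of already-sorted map outputs. This is only an illustrative sketch using Python's standard `heapq.merge`, not UDA's implementation. It does not model RDMA transfer or buffer registration, and the function name `merge_map_outputs` is hypothetical.

```python
import heapq


def merge_map_outputs(sorted_runs):
    """Merge already-sorted (key, value) runs into one sorted list in memory.

    Each run stands in for the sorted output of one map task. Only keys are
    compared, so values never need to be orderable.
    """
    return list(heapq.merge(*sorted_runs, key=lambda kv: kv[0]))


if __name__ == "__main__":
    # Three hypothetical map outputs, each already sorted by key.
    runs = [
        [("a", 1), ("d", 4)],
        [("b", 2), ("c", 3)],
        [("a", 9)],
    ]
    for key, value in merge_map_outputs(runs):
        print(key, value)
```

Because every run is already sorted, the merge streams through the inputs once and never needs an intermediate sort or spill file, which is the property the slide highlights.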




Mellanox UDA Preliminary Results
 Preliminary UDA results provided by Mellanox
 Results show improvement with UDA vs. vanilla Hadoop
 Better CPU utilization
 Reduced execution time


 Next Steps
        – Run on Analytics Workbench, scheduled for June 2012
        – Add a workbench configuration option to turn UDA on/off




TeraSort Benchmark
 Industry standard benchmark
 Good validation of configuration
 3 Steps
        – TeraGen – Generate 1TB of data
        – TeraSort – Sort generated data
        – TeraValidate – Validate the sort
 Measure time for each step
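
The three steps and their timing can be sketched as a small driver around the `teragen`, `terasort`, and `teravalidate` programs in the Hadoop examples jar. The jar file name and HDFS paths below are assumptions chosen for illustration, and the helper names are hypothetical. TeraGen takes a row count rather than a byte count, and each row is 100 bytes.

```python
import subprocess
import time

ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def terasort_commands(total_bytes, jar, base_dir):
    """Build the argv lists for the three benchmark steps, in run order."""
    rows = total_bytes // ROW_BYTES
    gen_dir = f"{base_dir}/gen"
    sorted_dir = f"{base_dir}/sorted"
    report_dir = f"{base_dir}/report"
    return [
        ("teragen", ["hadoop", "jar", jar, "teragen", str(rows), gen_dir]),
        ("terasort", ["hadoop", "jar", jar, "terasort", gen_dir, sorted_dir]),
        ("teravalidate", ["hadoop", "jar", jar, "teravalidate", sorted_dir, report_dir]),
    ]


def run_timed(commands, runner=subprocess.run):
    """Run each step and return wall-clock seconds per step."""
    timings = {}
    for name, argv in commands:
        start = time.monotonic()
        runner(argv, check=True)  # raises if a step exits non-zero
        timings[name] = time.monotonic() - start
    return timings


if __name__ == "__main__":
    # ASSUMPTIONS: jar name and HDFS base directory are placeholders.
    steps = terasort_commands(10**12, "hadoop-examples-1.0.0.jar", "/benchmarks/terasort")
    for step, seconds in run_timed(steps).items():
        print(f"{step}: {seconds:.1f} s")
```

Passing `runner` as a parameter keeps the timing logic testable without a cluster; on a real cluster the default `subprocess.run` launches each job and blocks until it finishes.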




TeraSort Benchmark Preliminary Results
[Chart: Apache Hadoop-1.0.0 validation - TeraSort. Y-axis: Execution Time in Sec, 0 to 9. X-axis: # of TB Generated and Sorted (1 TB, 10 TB). Series: TeraGen, TeraSort.]




TeraSort Benchmark Findings
 Minimal tuning of configuration
 Results are within the expected range
 Next Steps
        – Tune the cluster for optimal performance
        – Use the benchmark for every new release




Lessons Learnt




Buildout Progress
[Chart: Number of nodes (0 to 1200) by Month, Dec '11 to April '12. Series: racked, ready.]




“Real” Hadoop Cluster




Categories
 Racking & Stacking
 Networking
 Non-Hadoop Hosts
 Base OS Setup
 Hadoop Deployment
 Post-deployment
 Process




In Closing




Upcoming work
 Workbench Tasks
        –    Load various data sets
        –    Load GPDB, Hive, HBase, ZooKeeper, etc.
        –    Load Chorus, Command Center, UAP stack
        –    VM provisioning
        –    Various audits
 On-boarding candidates
        –    HD Education
        –    Apache Hadoop Build & Validate
        –    Mellanox UDA
        –    Intel HiBench
        –    Big data benchmarking
        –    High-resolution image processing, etc.



A day in the life @ Switch




Q&A




Other Relevant Greenplum Sessions
Session                                                  Presenter          Times
Unified Analytics Platform Introduction                  Brian Wilson       Tues 10:00-11:00   Thurs 1:00-2:00
Greenplum Database Overview                              Michael Crutcher   Mon 8:30-9:30      Wed 10:00-11:00
Greenplum Hadoop Overview                                Susheel Kaushik    Mon 10:00-11:00    Wed 4:15-5:15
Greenplum DCA Overview                                   Hanxi Chen         Mon 4:00-5:00      Thurs 10:00-11:00
Greenplum Analytics Workbench                            Apurva Desai       Wed 8:30-9:30      Thurs 10:00-11:00
Analytics on Hadoop                                      Don Miner          Tues 11:30-12:30   Thurs 8:30-9:30
Optimizing Greenplum Database on VMware                  Kevin O’Leary      Mon 4:00-5:00      Tues 4:15-5:15
Virtualized Infrastructure
Big Data Driven Businesses in Action:                    Mike Maxey         Wed 4:15-5:15      Thurs 11:30-12:30
Creating Real Business Value Using
Greenplum UAP (Panel w/4 Customers)
Analytics for Business Value: Collaboration              Josh Klahr         Mon 10:00-11:00    Wed 2:45-3:45
Disruptive Data Science — How Data                       Annika Jimenez     Tues 4:15-5:15     Thurs 11:30-12:30
Science and Big Data are Transforming                    David Dietrich
Business, IT and People




Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
