BIG DATA Defined:
  Data Stack 3.0
   Anand Deshpande
   Persistent Systems
   December 2012
Congratulations to the Pune Chapter
Best Chapter Award at CSI 2012 Kolkata
COMAD 2012
14-16 December, Pune

Coming to India: Delhi 2016
The Data Revolution is Happening Now

The growing need for large-volume, multi-structured “Big Data” analytics, as well as … “Fast Data”, have positioned the industry at the cusp of the most radical revolution in database architectures in 20 years.

We believe that the economics of data will increasingly drive competitive advantage.

Source: Credit Suisse Research, Sept 2011
What Data Can Do For You

Organizational leaders want analytics to exploit their growing data and computational power to get smart, and get innovative, in ways they never could before.

Source: MIT Sloan Management Review, The New Intelligent Enterprise: “Big Data, Analytics and the Path From Insights to Value” by Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz, December 21, 2010
Determining Shopping Patterns
British Grocer Tesco Uses Big Data by Applying Weather Results to Predict Demand and Increase Sales

Britain often conjures images of unpredictable weather, with downpours sometimes followed by sunshine within the same hour — several times a day. Such randomness has prompted Tesco, the country’s largest grocery chain, to create…its own software that calculates how shopping patterns change “for every degree of temperature and every hour of sunshine.”

Source: New York Times, September 2, 2009. “Tesco, British Grocer, Uses Weather to Predict Sales” by Julia Werdigier
http://www.nytimes.com/2009/09/02/business/global/02weather.html
Tracking Customers in Social Media
GlaxoSmithKline Uses Big Data to Efficiently Target Customers

GlaxoSmithKline is aiming to build direct relationships with 1 million consumers in a year using social media as a base for research and multichannel marketing. Targeted offers and promotions will drive people to particular brand websites where external data is integrated with information already held by the marketing teams.

Source: “Big data: Embracing the elephant in the room” by Steve Hemsley
http://www.marketingweek.co.uk/big-data-embracing-the-elephant-in-the-room/3030939.article
What does India Think?
Persistent enabled Aamir Khan Productions and Star Plus to use Big Data to learn how people react to some of the most pressing social issues.
http://www.satyamevjayate.in/

Satyamev Jayate, Aamir Khan’s pioneering, interactive socio-cultural TV show, has caught the interest of the entire nation. It has already generated ~7.5M responses in 4 weeks over SMS, Facebook, Twitter, phone calls and discussion forums from viewers the world over. This data is analyzed and delivered in real time to allow the producers to understand the pulse of the viewers, to gauge the appreciation for the show and, most importantly, to spread the message. Harnessing the truth from all this data is a key component of the show’s success.
WE ALREADY HAVE DATABASES. WHY DO WE NEED TO DO ANYTHING DIFFERENT?
Relational Database Systems for the Operational Store
● Transaction processing capabilities ideally suited for transaction-oriented operational stores.
● Data types – numbers, text, etc.
● SQL as the query language
● De-facto standard as the operational store for ERP and mission-critical systems.
● Interface through application programs and query tools
Data Stack 1.0: Online Transaction Processing (OLTP)
● High throughput for transactions (writes).
● Focus on reliability – ACID properties.
● Highly normalized schema.
● Interface through application programs and query tools
Data Stack 2.0: Enterprise Data Warehouse for Decision Support
● Operational data stores store on-line transactions – many writes, some reads.
● Large fact table, multiple dimension tables
● Schema has a specific pattern – star schema
● Joins are also very standard and create cubes
● Queries focus on aggregates.
● Users access data through tools such as Cognos, Business Objects, Hyperion etc.
Data Stack 2.0: Enterprise Data Warehouse
[Diagram: data staging and ETL load a data store and OLAP layer in the data warehouse; users consume it through reports and ad hoc analysis, alerts and dashboards, what-if analysis/EPM, predictive analytics and data visualization.]
Standard Enterprise Data Architecture
[Diagram: Data Stack 1.0 (operational data systems: presentation layer, application logic, relational databases) feeds Data Stack 2.0 (enterprise data warehouse systems). ERP systems, purchased data and legacy data pass through extraction and cleansing (ETL) and an optimized loader into the data warehouse engine, backed by a metadata repository, where data is analyzed and queried.]
Who are the players (ETL, DWH, OLAP, Reporting and Predictive Analytics offerings by vendor)
ETL: Oracle Data Integrator (Oracle); SQL Server Integration Services, SSIS (Microsoft); IBM InfoSphere DataStage (IBM); Business Objects Data Integrator (SAP); Kettle (open source); Enterprise Data Integration Server (SAS); Informatica PowerCenter (pure play)
DWH: Oracle 11g/Exadata (Oracle); Parallel Data Warehouse, PDW (Microsoft); Netezza/PureData (IBM); Sybase IQ (SAP); Postgres/MySQL (open source); Teradata, Greenplum/EMC (pure play)
OLAP: Hyperion/Essbase (Oracle); SQL Server Analysis Services, SSAS (Microsoft); Cognos PowerPlay (IBM); SAP HANA (SAP); Mondrian (open source); OLAP Viewer (SAS)
Reporting: Oracle BI (OBIEE) & Exalytics (Oracle); SQL Server Reporting Services, SSRS (Microsoft); Cognos BI (IBM); Business Objects, BO Dashboard Builder (SAP); BIRT, Pentaho, Jasper (open source); Enterprise Guide, Web Report Studio (SAS); MicroStrategy, QlikTech, Tableau (pure play)
Predictive Analytics: Oracle Data Mining, ODM (Oracle); SQL Server Data Mining, SSDM (Microsoft); SPSS (IBM); SAP HANA + R (SAP); R/Weka (open source); SAS Enterprise Miner (SAS)
Despite the two data stacks ..
One in two business executives believes that they do not have sufficient information across their organization to do their job.
Source: IBM Institute for Business Value
Data has Variety: it doesn’t fit
Less than 40% of the Enterprise Data makes its way to Data Stack 1.0 or Data Stack 2.0.
Beyond the Operational Systems, data required for decision making is scattered within and beyond the enterprise
[Diagram: structured, unstructured, cloud and public data sources, including ERP systems, CRM systems, the enterprise data warehouse, organizational workflow, expense management systems, email systems, collaboration/wiki sites, employee surveys, document repositories, customer call center records, project artifacts, vendor collaboration systems, supply chain systems, sensor data, location and presence data, weather forecasts, demographic data, maps, economic data, Twitter feeds and social networking data.]
Data Volumes are Growing
“5 Exabytes of information was created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing.”
Eric Schmidt at the Techonomy Conference, August 4, 2010
(1 exabyte = 10^18 bytes)
The Continued Explosion of Data in the Enterprise and Beyond
80% of new information growth is unstructured content; 90% of that is currently unmanaged.
[Chart: 44x as much data and content over the coming decade, from roughly 800,000 petabytes in 2009 to 35 zettabytes in 2020.]
Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
What comes first -- Structure or data?
[Diagram: Data vs. Schema/Structure]
Structure First is Constraining
Time to create a new data stack for unstructured data: Data Stack 3.0.
Time-out!
Internet companies have already addressed the same problems.
Internet Companies have to deal with large volumes of unstructured real-time data.
● Twitter has 140 million active users and more than 400 million tweets per day.
● Facebook has over 900 million active users, and an average of 3.2 billion Likes and Comments are generated by Facebook users per day.
● 3.1 billion email accounts in 2011, expected to rise to over 4 billion by 2015.
● There were 2.3 billion internet users (2,279,709,629) worldwide in the first quarter of 2012, according to Internet World Stats data updated 31st March 2012.
Their data loads and pricing requirements do not fit traditional relational systems
● Hosted service
● Large cluster (1000s of nodes) of low-cost commodity servers.
● Very large amounts of data -- indexing billions of documents, video, images etc.
● Batch updates.
● Fault tolerance.
● Hundreds of millions of users.
● Billions of queries every day.
They built their own systems
● It is the platform that distinguishes them from everyone else.
● They required:
      –   high reliability across data centers
      –   scalability to thousands of network nodes
      –   huge read/write bandwidth requirements
      –   support for large blocks of data which are gigabytes in size.
      –   efficient distribution of operations across nodes to reduce bottlenecks

Relational databases were not suitable and would have been
cost prohibitive.
Internet Companies have open-sourced the source code they created for their own use.
Companies have created business models to support and enhance this software.
What did the Internet Companies build?
And how did they get there?


      They started with a clean slate!
What features from the relational database can be compromised?

Do we need ..
● transaction support?
● rigid schemas?
● joins?
● SQL?
● on-line, live updates?

Must have
● Scale
● Ability to handle unstructured data
● Ability to process large volumes of data without having to start with structure first
● Leverage distributed computing
Rethinking ACID properties
ACID: Atomicity, Consistency, Isolation, Durability
For the internet workload, with distributed computing, ACID properties are too strong.
Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state -- BASE: Basic Availability, Soft-state, Eventual consistency.
Brewer’s CAP Theorem for Distributed Systems
● Consistent – Reads always pick up the latest write.
● Available – Can always read and write.
● Partition tolerant – The system can be split across multiple machines and datacenters.
A distributed system can do at most two of these three (CA, CP or AP).
Essential Building Blocks for Internet Data Systems
[Stack diagram: developers submit Map Reduce jobs to the Job Tracker; jobs run on the Hadoop Map-Reduce layer, which sits on the Hadoop Distributed File System (HDFS), which in turn runs on a cluster of machines.]
“For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the ‘divide-and-conquer using lots of cheap hardware’ approach to breaking down large problems is the only way to scale, doing so is not easy.”
- Jeremy Zawodny @ Yahoo!
Challenges with Distributed
Computing
● Cheap nodes fail, especially if you have many
      Mean time between failures for 1 node = 3 years
      Mean time between failures for 1000 nodes = 1 day
   – Solution: Build fault-tolerance into system

● Commodity network = low bandwidth
   – Solution: Push computation to the data

● Programming distributed systems is hard
   – Solution: Data-parallel programming model: users write “map” &
     “reduce” functions, system distributes work and handles faults
The Hadoop Ecosystem
 ● HDFS – distributed, fault tolerant file system
 ● MapReduce – framework for writing/executing distributed, fault tolerant
   algorithms
 ● Hive & Pig – SQL-like declarative languages
 ● Sqoop – package for moving data between HDFS and relational DB systems
 ● + Others…
[Diagram: the Hadoop stack – HDFS at the bottom; HBase and Map/Reduce above it; Hive & Pig on top; ZooKeeper coordinating across the stack; Avro for serialization; Sqoop moving data to and from an RDBMS; BI/reporting and ETL tools at the top.]
Reliable Storage is Essential
● Google GFS, Hadoop HDFS and Kosmix KFS are large distributed log-structured file systems that store all types of data.
● Provides a global file namespace
● Typical usage pattern
   – Huge files (100s of GB to TB)
   – Data is rarely updated in place
   – Reads and appends are common
● A new application coming on line can use an existing GFS cluster or set up its own.
● The file system can be tuned to fit individual application needs.
http://highscalability.com/google-architecture
Distributed File System
● Chunk Servers
   –   File is split into contiguous chunks
   –   Typically each chunk is 16-64MB
   –   Each chunk replicated (usually 2x or 3x)
   –   Try to keep replicas in different racks
● Master node
   – a.k.a. Name Nodes in HDFS
   – Stores metadata
   – Might be replicated
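
To make the file-system layer concrete, here is a minimal sketch of writing and then reading a file through HDFS's Java API; the path /user/demo/hello.txt and the cluster configuration picked up from the classpath are assumptions for illustration only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml / hdfs-site.xml from the classpath to locate the NameNode.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/hello.txt");   // hypothetical path

            // Write: the file is split into large blocks and replicated across DataNodes.
            FSDataOutputStream out = fs.create(file, true);
            out.writeUTF("hello hdfs");
            out.close();

            // Read: the client asks the master (NameNode) for block locations, then
            // streams the data directly from the chunk servers (DataNodes).
            FSDataInputStream in = fs.open(file);
            System.out.println(in.readUTF());
            in.close();

            fs.close();
        }
    }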
Now that you have storage,
how would you manipulate this data?


● Why use MapReduce?
  – Nice way to partition tasks across lots of machines.
  – Handle machine failure
  – Works across different application types, like search and ads.
  – You can pre-compute useful data, find word counts, sort TBs of
    data, etc.
  – Computation can automatically move closer to the IO source.
Hadoop is the Apache
implementation of MapReduce
● The Apache™ Hadoop™ project develops open-source software
  for reliable, scalable, distributed computing.

● The Apache Hadoop software library is a framework that allows:
   – distributed processing of large data sets across clusters of computers
     using a simple programming model.
   – It is designed to scale up from single servers to thousands of
     machines, each offering local computation and storage.
   – Rather than rely on hardware to deliver high-availability, the library
     itself is designed to detect and handle failures at the application
     layer, so delivering a highly-available service on top of a cluster of
     computers, each of which may be prone to failures.
Hadoop MapReduce Flow
Word Count – Distributed Solution
[Figure: three input lines (“the quick brown fox”, “the fox ate the mouse”, “how now the brown cow”) flow through Map tasks that emit (word, 1) pairs, a shuffle & sort phase that groups the values by word (e.g. the → [1,1,1,1], brown → [1,1]), and Reduce tasks that sum each list, producing the output counts: the 4, brown 2, fox 2, ate 1, cow 1, how 1, mouse 1, now 1, quick 1.]
Word Count in Map-Reduce

map:
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);      // emit (word, 1) for every token
      }
    }

reduce:
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) { sum += val.get(); }
      result.set(sum);
      context.write(key, result);      // emit (word, total count)
    }
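
The slide shows only the map and reduce methods; a complete job also needs a driver that wires them together. A minimal, self-contained sketch follows; the class names (WordCount, TokenizerMapper, IntSumReducer) and the input/output paths are assumptions, not from the talk.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: holds the 'word' and 'one' fields used by the map() method above.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: holds the 'result' field used by the reduce() method above.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) { sum += val.get(); }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

With Hadoop installed, this would typically be packaged as a jar and launched with something like: hadoop jar wordcount.jar WordCount /input /output (paths hypothetical).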
Pig and Hive
● Pig and Hive provide a wrapper to make it easier to write MapReduce jobs (a small example follows below).
● The raw data is stored in Hadoop's HDFS.
● These scripting languages provide
   – Ease of programming.
   – Optimization opportunities.
   – Extensibility.
Pig is a data flow scripting language; Hive is a SQL-like language.
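For illustration, the same word count expressed as a Pig Latin data flow and driven from Java through Pig's embedded PigServer API; the HDFS paths are hypothetical and this is only a sketch of the idea, not code from the talk.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class WordCountPig {
        public static void main(String[] args) throws Exception {
            // Run against the Hadoop cluster; ExecType.LOCAL can be used for quick local tests.
            PigServer pig = new PigServer(ExecType.MAPREDUCE);

            // Each statement is one step of the data flow; Pig compiles them into MapReduce jobs.
            pig.registerQuery("lines   = LOAD '/input/text' AS (line:chararray);");
            pig.registerQuery("words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;");

            // store() triggers execution and writes the (word, count) pairs back to HDFS.
            pig.store("counts", "/output/wordcount");
        }
    }

Compared with the Java MapReduce version above, the whole job fits in four relational-style statements, which is the ease-of-programming point the slide makes.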
Other Hadoop-related projects at Apache include:
● Avro™: A data serialization system.
● Cassandra™: A scalable multi-master database with no single points of failure.
● Chukwa™: A data collection system for managing large distributed systems.
● HBase™: A scalable, distributed database that supports structured data storage for large tables.
● Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
● Mahout™: A scalable machine learning and data mining library.
● Pig™: A high-level data-flow language and execution framework for parallel computation.
● ZooKeeper™: A high-performance coordination service for distributed applications.
http://hadoop.apache.org/
Powered by Hadoop
                             http://wiki.apache.org/hadoop/PoweredBy
● Facebook                   (more than 100+ Companies are listed)
   – 1100-machine cluster with 8800 cores
   – store copies of internal log and dimension data sources and use it as
     a source for reporting/analytics and machine learning
● Yahoo
   – Biggest cluster: 4000 nodes
   – Search Marketing, People you may know, Search Assist, and many
     more…
● eBay
   – 532-node cluster (8 * 532 cores, 5.3PB).
   – Using it for search optimization and research
Hadoop is not a relational database
● Hadoop is best suited for batch processing of large volumes of unstructured data.
   – Lack of schemas
   – Lack of indexes
   – Updates – pretty much absent!
   – Not designed for joins
   – Lack of support for integrity constraints
   – Limited support for data analysis tools
But what are your data analysis needs?
Hadoop is not a Relational Database: if these are important, stick to an RDBMS
OLTP, data integrity, data independence, SQL, ad-hoc queries, complex relationships, maturity and stability
Do you need SQL and full relational systems? If not, consider NoSQL databases for your needs:
key-value, tabular, document, graph
http://nosql-database.org/
The Key-Value In-Memory DBs
● In memory DBs are simpler and faster than their on-disk counterparts.
● Key value stores offer a simple interface with no schema. Really a
  giant, distributed hash table.
● Often used as caches for on-disk DB systems.
● Advantages:
    – Relatively simple
    – Practically no server to server talk.
    – Linear scalability
● Disadvantages:
    – Doesn’t understand data – no server side operations. The key and value are always
      strings.
    – It’s really meant to only be a cache – no more, no less.
    – No recovery, limited elasticity.
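
As a concrete illustration of the cache usage described above, here is a minimal Java sketch of the cache-aside pattern, with a plain in-process hash map standing in for a distributed key-value cache; the DataStore interface and all names are hypothetical.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Stand-in for the on-disk system of record (e.g. an RDBMS); hypothetical interface.
    interface DataStore {
        String load(String key);             // slow read from disk
        void save(String key, String value);
    }

    public class CacheAsideStore {
        // The "giant, distributed hash table": here just an in-process map keyed by string.
        private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<String, String>();
        private final DataStore backingStore;

        public CacheAsideStore(DataStore backingStore) {
            this.backingStore = backingStore;
        }

        // Read-through: serve from memory when possible, otherwise fall back to the database.
        public String get(String key) {
            String value = cache.get(key);
            if (value == null) {
                value = backingStore.load(key);   // cache miss: hit the on-disk DB
                if (value != null) {
                    cache.put(key, value);        // populate the cache for next time
                }
            }
            return value;
        }

        // Write-through: update the database, then refresh the cached copy.
        public void put(String key, String value) {
            backingStore.save(key, value);
            cache.put(key, value);
        }
    }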
Voldemort is a distributed key-value storage system
● Data is automatically replicated over multiple servers and partitioned so each server contains only a subset of the total data.
● Data items are versioned.
● Server failure is handled transparently.
● Each node is independent of other nodes, with no central point of failure or coordination.
● Support for pluggable data placement strategies to support things like distribution across data centers that are geographically far apart.
● Good single node performance: you can expect 10-20k operations per second, depending on the machines, the network, the disk system, and the data replication factor.
● Voldemort is not a relational database: it does not attempt to satisfy arbitrary relations while satisfying ACID properties. Nor is it an object database that attempts to transparently map object reference graphs. Nor does it introduce a new abstraction such as document-orientation.
● It is basically just a big, distributed, persistent, fault-tolerant hash table.
http://project-voldemort.com/
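
For illustration, a minimal sketch of reading and writing through Voldemort's Java client; the bootstrap URL tcp://localhost:6666 and the store name "test" are assumptions about a local single-node setup.

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;
    import voldemort.versioning.Versioned;

    public class VoldemortExample {
        public static void main(String[] args) {
            // Bootstrap URL of one node in the cluster (assumption: local single-node setup).
            StoreClientFactory factory = new SocketStoreClientFactory(
                    new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));

            // "test" is a store assumed to be defined in the cluster's store configuration.
            StoreClient<String, String> client = factory.getStoreClient("test");

            // Writes are versioned; put() resolves the current version for us.
            client.put("hello", "world");

            // Reads return the value together with its version information.
            Versioned<String> value = client.get("hello");
            System.out.println(value.getValue());
        }
    }

The Versioned wrapper carries the vector-clock version that lets Voldemort reconcile concurrent writes across replicas.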
Tabular stores
● The original: Google’s BigTable
    – Proprietary, not open source.
● The open source elephant alternative – Hadoop with HBase.
● A top level Apache Project.
● Large number of users.
● Contains a distributed file system, MapReduce, a database server (HBase), and more.
● Rack aware.
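
For illustration, a minimal sketch of writing and reading one row through the HBase Java client of that era; the table "test", the column family "cf" and the local configuration are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath to locate the cluster.
            Configuration conf = HBaseConfiguration.create();

            // Assumes a table 'test' with a column family 'cf' already exists.
            HTable table = new HTable(conf, "test");

            // Write one cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
            table.put(put);

            // Read the row back by key, the basic BigTable-style access pattern.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"));
            System.out.println(Bytes.toString(value));

            table.close();
        }
    }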
What is Google’s Big Table
● BigTable is a large-scale, fault-tolerant, self-managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
● BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL-type queries.
● It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.
● Commercial databases simply don't scale to this level, and they don't work across 1000s of machines.
Document Stores
● As the name implies, these databases store documents.
● Usually schema-free. The same database can store
  multiple documents.
● Allow indexing based on document content.
● Prominent examples: CouchDB, MongoDB.
Why MongoDB?
● Document-oriented
   – Documents (objects) map nicely to programming language data types
   – Embedded documents and arrays reduce need for joins
   – Dynamically-typed (schemaless) for easy schema evolution
   – No joins and no multi-document transactions for high performance and easy scalability
● High availability
   – Replicated servers with automatic master failover
● Rich query language
● Easy scalability
   – Automatic sharding (auto-partitioning of data across servers)
   – Eventually-consistent reads can be distributed over replicated servers
● High performance
   – No joins and embedding makes reads and writes fast
   – Indexes, including indexing of keys from embedded documents and arrays
   – Optional streaming writes (no acknowledgements)
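
A minimal sketch of storing and querying a document with the MongoDB Java driver of that era; the database "test", the collection "customers" and the field names are assumptions for illustration.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.Mongo;

    public class MongoExample {
        public static void main(String[] args) throws Exception {
            // Connect to a local mongod instance (assumption: default host/port).
            Mongo mongo = new Mongo("localhost", 27017);
            DB db = mongo.getDB("test");
            DBCollection customers = db.getCollection("customers");

            // Documents are schemaless: just nested key-value structures.
            DBObject doc = new BasicDBObject("name", "Asha")
                    .append("city", "Pune")
                    .append("purchases", 12);
            customers.insert(doc);

            // Query by example; an index on "city" could speed this up.
            DBObject found = customers.findOne(new BasicDBObject("city", "Pune"));
            System.out.println(found);

            mongo.close();
        }
    }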
Mapping Systems to the CAP Theorem
Consistency: all clients have the same view of the data.
Availability: each client can always read and write.
Partition Tolerance: the system works well despite physical network partitions.
CA: RDBMS (MySQL, Postgres etc.), AsterData, Greenplum, Vertica
AP: Dynamo, Cassandra, Voldemort, SimpleDB, Tokyo Cabinet, CouchDB, KAI, Riak
CP: BigTable, MongoDB, BerkeleyDB, Hypertable, Terrastore, MemcachedDB, HBase, Scalaris, Redis
NoSQL Use cases: Important to align data model to the requirements
Bigness; massive write performance; fast key-value access; flexible schema and flexible data types; schema migration; write availability; no single point of failure; generally available; ease of programming
Mapping new Internet Data Management
   Technologies to the Enterprise
Enterprise data strategy is getting inclusive
Open Source Rules! (Hadoop infrastructure)
What about support?
The Path to Data Stack 3.0: Must support Variety, Volume and Velocity

Data Stack 1.0 | Data Stack 2.0 | Data Stack 3.0
Relational Database Systems | Enterprise Data Warehouse | Dynamic Data Platform
Recording Business Events | Support for Decision Making | Uncovering Key Insights
Highly Normalized Data | Un-normalized Dimensional Model | Schema-less Approach
GBs of Data | TBs of Data | PBs of Data
End User Access through Ent Apps | End User Access Through Reports | End User Direct Access
Structured | Structured | Structured + Semi-Structured
Can Data Stack 3.0 Address Real Problems?
● Large data volume at low price
● Diverse data beyond structured data
● Queries that are difficult to answer
● Answer queries that no one dare ask
How does one go about the Big Data Expedition?
PERSISTENT SYSTEMS AND BIG DATA
Persistent Systems has an experienced team of Big Data experts that has created the technology building blocks to help you implement a Big Data solution that offers a direct path to unlock the value in your data.
Big Data Expertise at Persistent
● 10+ projects executed with leading ISVs and enterprise customers
● Dedicated group for MapReduce, Hadoop and the Big Data ecosystem (formed 3 years ago)
● Engaged with the Big Data ecosystem, including leading ISVs and experts
   • Preferred Big Data services partner of IBM and Microsoft
Big Data Leadership and
Contributions
● Code Contributions to Big Data Open Source Projects, including:
      – Hadoop, Hive, and SciDB
●   Dedicated Hadoop cluster in Persistent
●   Created PeBAL – Persistent Big Data Analytics Library
●   Created Visual Programming Environment for Hadoop
●   Created Data Connectors for Moving Data
●   Pre-built Solutions to Accelerate Big Data Projects




Persistent’s Big Data Offerings
1. Setting up and Maintaining Big Data Platform
2. Data Analytics on Big Data Platform
3. Building Applications on Big Data
[Diagram: Technology assets include the foundational infrastructure and platform (built upon selected 3rd-party Big Data platforms and technologies on a cluster of commodity hardware), Persistent platform enhancement IP (PeBAL analytics library, data connectors, visual programming tools), Persistent pre-built horizontal solutions (email, text, IT analytics, …) and Persistent pre-built industry solutions for retail, banking and telco. People assets include Big Data custom services, extension of your team, discovery workshops, training for your team, methodology, team formation process and cluster sizing/configuration.]
Persistent Next Generation Data Architecture
[Architecture diagram: a connector framework ingests data from email servers, web proxies, IBM Tivoli, BBCA, social sources (Twitter, Facebook) and data warehouses into MapReduce and HDFS with cluster monitoring; on top sit PIG/Jaql, Hive, text analytics (GATE/SystemT) and the Persistent Analytics Library (PeBAL) with graph, set and text analytics functions; workflow integration, an admin app and solutions expose results to BI tools, reports and alerts; a connector framework also moves data out to NoSQL stores, RDBMSs and data warehouses. Components are a mix of commercial/open source products, Persistent IP and external data sources.]
Persistent Big Data Analytics Library (PeBAL)
WHY PEBAL
• Lots of common problems – not all of them are solved in Map Reduce
• PigLatin, Hive, JAQL are languages and not libraries – something is needed to run on top that is not tied to SQL-like interfaces
FEATURES
• Organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms.
• PeBAL functions are schema agnostic.
• All PeBAL functions are tried and tested against well defined use cases.
BENEFITS OF A READY MADE SOLUTION
• Proven – well written and tested
• Reuse across multiple applications
• Quicker implementation of map reduce applications
• High performance
[Diagram: PeBAL function areas – web analytics, text analytics, inverted lists, set, graph and statistics.]
Visual Programming Environment
                  ADOPTION BARRIERS
                   • Steep Learning Curve
                   • Difficult to Code
                   • Ad-hoc reporting can’t always be done by writing programs
                   • Limited tooling available

                  VISUAL PROGRAMMING ENVIRONMENT
                   • Use Standard ETL tool as the UI environment for generating PIG scripts

                  BENEFITS
                   • ETL Tools are widely used in Enterprises
                   • Can leverage large pool of skilled people who are experts in ETL and BI
                     tools
                   • UI helps in iterative and rapid data analysis
                   • More people will start using it
Visual Programming Environment for Hadoop
[Diagram: data sources and metadata feed a standard ETL tool that provides the data-flow UI; a PIG converter (Persistent IP) generates PIG code that runs, together with a UDF library, on the Big Data platform, reading and writing data in HDFS/Hive.]
Persistent Connector Framework
WHY CONNECTOR FRAMEWORK
• Pluggable architecture
OUT OF THE BOX
• Database, Data Warehouse
• Microsoft Exchange
• Web proxy
• IBM Tivoli
• BBCA
• Generic Push connector for *any* content
FEATURES
• Bi-directional connector (as applicable)
• Supports Push/Pull mechanism
• Stores data on HDFS in an optimized format
• Supports masking of data
Persistent Data Connectors




Persistent’s Breadth of Big Data Capabilities
[Diagram: Persistent IP for Big Data solutions layered over Big Data platform components – horizontal and vertical pre-built solutions; Big Data platform (PeBAL) analytics libraries and connectors; IT management; Big Data application programming; distributed file systems (HDFS, IBM GPFS); cluster layer (platform setup on multi-node clusters, monitoring, VM-based setup, product deployment).]
Tooling
• RDBMS/DWH to import/export data
• Text analytics libraries
• Data visualization using Web 2.0 and reporting tools – Cognos, MicroStrategy
• Ecosystem tools like Nutch, Katta, Lucene
• Job configuration, management and monitoring with BigInsights’ job scheduler (MetaTracker); job failure and recovery management
• Deep JAQL expertise – JAQL programming, extending JAQL using UDFs, integration of third-party tools/libraries, performance tuning, ETL using JAQL
• Expertise in MR programming – PIG, Hive, Java MR
• Deep expertise in analytics – text analytics – IBM’s text extraction solution (AQL + SystemT)
• Statistical analytics – R, SPSS, BigInsights integration with R
Persistent Roadmap to Big Data
1. Learn – discover and define use cases
2. Initiate – validate with a POC
3. Scale – upgrade to production if successful
4. Measure – measure effectiveness and business value
5. Manage – improve knowledge base and shared Big Data platform
Customer Analytics: identifying your most influential customers
1. Build a social graph of all customers (70 million customers).
2. Overlay sales data on the graph (> 1 billion transactions over twenty years).
3. Identify influential customers using network analysis (a few thousand influential customers).
4. Target these customers for promotions.
Targeting influential customers is the best way to improve campaign ROI!
Overview of Email Analytics
● Key Business Needs
   – Ensure compliance with respect to a variety of business and IT communications and information sharing guidelines.
   – Provide an ongoing analysis of customer sentiment through email communications.
● Use Cases
   – Quickly identify if there has been an information breach or if information is being shared in ways that are not in compliance with organizational guidelines.
   – Identify if a particular customer is not being appropriately managed.
● Benefits
   – Ability to proactively manage email analytics and communications across the organization in a cost-effective way.
   – Reduce the response time to manage a breach and proactively address issues that emerge through ongoing analysis of email.
Using Email to Analyze Customer
   Sentiment




Sense the mood of your customers
through their emails

Carry out detailed analysis on customer
team interactions and response times


Analyzing Prescription Data

 1.5 million patients are
 harmed by medication
 errors every year




Identifying erroneous prescriptions can save lives!
Source: Center for Medication Safety & Clinical Improvement
Overview of IT Analytics
●   Key Business Needs
      –   Troubleshooting issues in the world of advanced and cloud based systems is highly complex, requiring
          analysis of data from various systems.
      –   Information may be in different formats, locations, granularity, data stores.
      –   System outages have a negative impact on short-term revenue, as well as long-term credibility and
          reliability.
      –   The ability to quickly identify if a particular system is unstable and take corrective action is imperative.

●   Use Cases
      –   Identify security threats and isolate the corresponding external factors quickly.
      –   Identify if an email server is unstable, determine the priority and take preventative action before a
          complete failure occurs.

●   Benefits
      –   Reduced maintenance cost
   – Higher reliability and SLA compliance
Consumer Insight from Social Media
Find out what customers are saying about your organization or product on social media.
Insights for Satyamev Jayate – Variety of sources
[Diagram: Web/TV viewer responses arrive through the web, social media, SMS, IVR/calls, emails, individual blogs, social widgets and videos.
1. Structured Analysis – responses to the pledge and multiple-choice questions (SMS, IVR, structured social media).
2. Unstructured Analysis – free-form responses (“Share your story”, “Ask a question to Aamir”, “Send a message of hope”, “Share your solution”) processed by a Content Filtering Rating Tagging System (CFRTS) with L0, L1, L2 phased analytics.
3. Impact Analysis – crawling the general internet to measure the before & after scenario on a particular topic.]
Rigorous Weekly Operation Cycle producing instant analytics
Killer combo of Human+Software to analyze the data efficiently
[Cycle diagram: topic opens on Sunday → data capture from SMS, phone calls, social media and the website → system runs L0 analysis; L1, L2 analysts continue → JSONs are created for the external and internal dashboards → featured content is delivered thrice a day all through the week → episode tags are refined and messages are re-ingested for another pass → live analytics report is sent during the show.]
Thank you

                  Anand Deshpande (anand@persistent.co.in)
                  http://in.linkedin.com/in/ananddeshpande
                           Persistent Systems Limited
                              www.persistentsys.com



Enterprise Value is Shifting to Data
[Chart: enterprise value moving up the stack over time – from mainframes (1975) to operating systems, databases, ERP and apps, and now to data (2013).]

More Related Content

From the Big Data keynote at InCSIghts 2012

  • 1. BIG DATA Defined: Data Stack 3.0 Anand Deshpande Persistent Systems December 2012 8 December 2012 1
  • 2. Congratulations to the Pune Chapter Best Chapter Award at CSI 2012 Kolkata 8 December 2012 2
  • 3. COMAD 2012 14-16 December Pune Coming to India Delhi 2016 8 December 2012 3
  • 4. The Data Revolution is Happening Now The growing need for large-volume, multi- structured “Big Data” analytics, as well as … “Fast Data”, have positioned the industry at the cusp of the most radical revolution in database architectures in 20 years. We believe that the economics of data will increasingly drive competitive advantage. Source: Credit Suisse Research, Sept 2011 8 December 2012 4
  • 5. What Data Can Do For You Organizational leaders want analytics to exploit their growing data and computational power to get smart, and get innovative, in ways they never could before. Source - MIT Sloan Management Review- The New Intelligent Enterprise Big Data, Analytics and the Path From Insights to Value By Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz December 21, 2010 8 December 2012 5
  • 6. Determining Shopping Patterns British Grocer, Tesco Uses Big Data by Applying Weather Results to Predict Demand and Increase Sales Britain often conjures images of unpredictable weather, with downpours sometimes followed by sunshine within the same hour — several times a day. Such randomness has prompted Tesco, the country’s largest grocery chain, to create…its own software that calculates how shopping patterns change ―for every degree of temperature and every hour of sunshine.‖ Source: New York Times, September 2, 2009. Tesco, British Grocer, Uses Weather to Predict Sales By Julia Werdigier http://www.nytimes.com/2009/09/02/business/global/02weather.html 8 December 2012 6
  • 7. Tracking Customers in Social Media Glaxo Smith Kline Uses Big Data to Efficiently Target Customers GlaxoSmithKline is aiming to build direct relationships with 1 million consumers in a year using social media as a base for research and multichannel marketing. Targeted offers and promotions will drive people to particular brand websites where external data is integrated with information already held by the marketing teams. Source: Big data: Embracing the elephant in the room By Steve Hemsley http://www.marketingweek.co.uk/big-data-embracing-the-elephant-in-the-room/3030939.article 8 December 2012 7
  • 8. What does India Think? Persistent enabled Aamir Khan Productions and Star Plus use Big Data to know how people react to some of the most excruciating social issues. http://www.satyamevjayate.in/ Satyamev Jayate - Aamir Khan’s pioneering, interactive socio-cultural TV show - has caught the interest of the entire nation. It has already generated ~7.5M responses in 4 weeks over SMS, Facebook, Twitter, Phone Calls and Discussion Forums by its viewers across the world over. This data is being analyzed and delivered in real-time to allow the producers to understand the pulse of the viewers, to gauge the appreciation for the show and most importantly to spread the message. Harnessing the truth from all this data is a key component of the show’s success. 8 December 2012 8
  • 10. WE ALREADY HAVE DATABASES. WHY DO WE NEED TO DO ANYTHING DIFFERENT? 8 December 2012 10
  • 11. Relational Database Systems for Operational Store ● Transaction processing capabilities ideally suited for transaction-oriented operational stores. ● Data types – numbers, text, etc. ● SQL as the Query language ● De-facto standard as the operational store for ERP and mission critical systems. ● Interface through application programs and query tools 8 December 2012 11
  • 12. Data Stack 1.0: Online Transactions Processing (OLTP) ● High throughput for transactions (writes). ● Focus on reliability – ACID Properties. ● Highly normalized Schema. ● Interface through application programs and query tools Data Stack 1.0 8 December 2012 12
  • 13. Data Stack 2.0: Enterprise Data Warehouse for Decision Support ● Operational data stores store on-line transactions – Many writes, some reads. ● Large fact table, multiple dimension tables ● Schema has a specific pattern – star schema ● Joins are also very standard and create cubes ● Queries focus on aggregates. ● Users access data through tools such as Cognos, Business Objects, Hyperion etc. 8 December 2012 13
  • 14. Data Stack 2.0: Enterprise Data Warehouse (architecture diagram) – ETL loads a Staging Data Store and the Data Warehouse; OLAP and Data Visualization sit on top; users consume Reports & Dashboards, Ad hoc Analysis, Alerts, What-if Analysis, EPM and Predictive Analytics. 8 December 2012 14
  • 15. Standard Enterprise Data Architecture (diagram) – ERP systems, purchased data and legacy data held in relational databases feed Extraction/Cleansing (ETL) and an Optimized Loader into the Data Warehouse Engine and its Metadata Repository; Application Logic, Analyze/Query and the Presentation Layer sit above. Data Stack 1.0: Operational Data Systems; Data Stack 2.0: Enterprise Data Warehouse Systems. 8 December 2012 15
  • 16. Who are the players (by layer):
  ETL – Oracle Data Integrator, SQL Server Integration Services (SSIS), IBM InfoSphere DataStage, Business Objects Data Integrator, Informatica PowerCenter, Kettle (open source).
  DWH – Oracle 11g/Exadata, SQL Server Parallel Data Warehouse (PDW), IBM Netezza (PureData), Teradata, Sybase IQ, Greenplum (EMC), Postgres/MySQL (open source).
  OLAP – Hyperion/Essbase, SQL Server Analysis Services (SSAS), Cognos PowerPlay, SAP HANA, Mondrian OLAP Viewer (open source).
  Reporting – Oracle BI (OBIEE) and Exalytics, SQL Server Reporting Services (SSRS), Cognos BI Report Studio, Business Objects Web Intelligence, MicroStrategy, QlikTech, Tableau, BIRT/Pentaho/Jasper (open source).
  Predictive Analytics – Oracle Data Mining (ODM), SQL Server Data Mining (SSDM), SPSS, SAP HANA + R, SAS Enterprise Miner, R/Weka (open source).
  8 December 2012 16
  • 17. Despite the two data stacks… one in two business executives believe that they do not have sufficient information across their organization to do their job. Source: IBM Institute for Business Value. 8 December 2012 17
  • 18. Data has Variety: it doesn’t fit Less than 40% of the Enterprise Data makes its way to Data Stack 1.0 or Data Stack 2.0. 8 December 2012 18
  • 19. Beyond the Operational Systems, data required for decision making is scattered within and beyond the enterprise: email systems, expense management, Twitter feeds, weather forecasts, collaboration systems/wiki sites, vendor collaboration, demographic data, organizational maps, employee surveys, workflow systems, document repositories, supply chain systems, economic data, ERP systems, CRM systems, customer call center records, social networking, location and presence data, the enterprise data warehouse, sensor data, project artifacts – spanning structured, unstructured, cloud and public data sources. 8 December 2012 19
  • 20. Data Volumes are Growing. “5 Exabytes of information was created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing” – Eric Schmidt at the Techonomy Conference, August 4, 2010 (1 exabyte = 10^18 bytes). 8 December 2012 20
  • 21. The Continued Explosion of Data in the Enterprise and Beyond: 80% of new information growth is unstructured content – 90% of that is currently unmanaged. Data and content will grow 44x over the coming decade, from 800,000 petabytes in 2009 to 35 zettabytes by 2020. Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010. 8 December 2012 21
  • 22. What comes first – structure or data? Schema/Structure vs. Data: “structure first” is constraining. 8 December 2012 22
  • 23. Time to create a new data stack for unstructured data. Data Stack 3.0. 8 December 2012 23
  • 24. Time-out! Internet companies have already addressed the same problems. 8 December 2012 24
  • 25. Internet Companies have to deal with large volumes of unstructured real-time data. ● Twitter has 140 million active users and more than 400 million tweets per day. ● Facebook has over 900 million active users and an average of 3.2 billion Likes and Comments are generated by Facebook users per day. ● 3.1 billion email accounts in 2011, expected to rise to over 4 billion by 2015. ● There were 2.3 billion internet users (2,279,709,629) worldwide in the first quarter of 2012, according to Internet World Stats data updated 31st March 2012. 8 December 2012 25
  • 26. Their data loads and pricing requirements do not fit traditional relational systems ● Hosted service ● Large cluster (1000s of nodes) of low-cost commodity servers ● Very large amounts of data – indexing billions of documents, videos, images etc. ● Batch updates ● Fault tolerance ● Hundreds of millions of users ● Billions of queries every day. 8 December 2012 26
  • 27. They built their own systems ● It is the platform that distinguishes them from everyone else. ● They required: – high reliability across data centers – scalability to thousands of network nodes – huge read/write bandwidth requirements – support for large blocks of data which are gigabytes in size. – efficient distribution of operations across nodes to reduce bottlenecks Relational databases were not suitable and would have been cost prohibitive. 8 December 2012 27
  • 28. Internet Companies have open-sourced the source code they created for their own use. Companies have created business models to support and enhance this software. 8 December 2012 28
  • 29. What did the Internet Companies build? And how did they get there? They started with a clean slate!
  • 30. What features from the relational database can be compromised? Do we need transaction support, rigid schemas, joins, SQL, or on-line live updates? Must have: scale; the ability to handle unstructured data; the ability to process large volumes of data without having to start with structure first; the ability to leverage distributed computing.
  • 31. Rethinking ACID properties (Atomicity, Consistency, Isolation, Durability): for the internet workload, with distributed computing, ACID properties are too strong. Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state – BASE (Basic Availability, Soft-state, Eventual consistency).
  • 32. Brewer’s CAP Theorem for Distributed Systems ● Consistent – reads always pick up the latest write ● Available – can always read and write ● Partition tolerant – the system can be split across multiple machines and datacenters. You can do at most two of these three (CA, CP or AP).
  • 33. Essential Building Blocks for Internet Data Systems (diagram): developers submit Map-Reduce jobs to the Job Tracker; the Hadoop Map-Reduce layer runs over the Hadoop Distributed File System (HDFS) on a cluster.
  • 34. “For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the ‘divide-and-conquer using lots of cheap hardware’ approach to breaking down large problems is the only way to scale, doing so is not easy” – Jeremy Zawodny @ Yahoo!
  • 35. Challenges with Distributed Computing ● Cheap nodes fail, especially if you have many: mean time between failures for 1 node = 3 years; mean time between failures for 1000 nodes = 1 day (3 years is roughly 1,000 days, so with 1,000 independent nodes you expect about one failure per day) – Solution: build fault-tolerance into the system ● Commodity network = low bandwidth – Solution: push computation to the data ● Programming distributed systems is hard – Solution: a data-parallel programming model: users write “map” & “reduce” functions, the system distributes work and handles faults
  • 36. The Hadoop Ecosystem ● HDFS – distributed, fault tolerant file system ● MapReduce – framework for writing/executing distributed, fault tolerant algorithms ● Hive & Pig – SQL-like declarative languages ● Sqoop – package for moving data between HDFS and relational DB systems ● + Others… (stack diagram: BI Reporting, ETL Tools and RDBMS on top; Hive & Pig; Map/Reduce, HBase and Sqoop; Avro (serialization) and Zookeeper alongside; HDFS at the base) 36
  • 37. Reliable Storage is Essential ● Google GFS, Hadoop HDFS, Kosmix KFS – large distributed log-structured file systems that store all types of data ● Provides a global file namespace ● Typical usage pattern – Huge files (100s of GB to TB) – Data is rarely updated in place – Reads and appends are common ● A new application coming on line can use an existing GFS cluster or set up its own ● The file system can be tuned to fit individual application needs. http://highscalability.com/google-architecture
  • 38. Distributed File System ● Chunk Servers – File is split into contiguous chunks – Typically each chunk is 16-64MB – Each chunk replicated (usually 2x or 3x) – Try to keep replicas in different racks ● Master node – a.k.a. Name Nodes in HDFS – Stores metadata – Might be replicated
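  The HDFS client side of this is small: a minimal Java sketch, assuming a reachable namenode (the URI, path and file contents below are placeholders), that writes a file and reads it back while HDFS handles chunking and replication underneath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the name node; the URI is a placeholder
            conf.set("fs.default.name", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/example.txt");

            // Write: splitting into chunks and replicating happens behind this one call
            FSDataOutputStream out = fs.create(file);
            out.writeUTF("hello hdfs");
            out.close();

            // Read it back
            FSDataInputStream in = fs.open(file);
            System.out.println(in.readUTF());
            in.close();

            fs.close();
        }
    }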
  • 39. Now that you have storage, how would you manipulate this data? ● Why use MapReduce? – Nice way to partition tasks across lots of machines. – Handle machine failure – Works across different application types, like search and ads. – You can pre-compute useful data, find word counts, sort TBs of data, etc. – Computation can automatically move closer to the IO source.
  • 40. Hadoop is the Apache implementation of MapReduce ● The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing. ● The Apache Hadoop software library is a framework that allows: – distributed processing of large data sets across clusters of computers using a simple programming model. – It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. – Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
  • 42. Word Count – Distributed Solution
  Input: “the quick brown fox” | “the fox ate the mouse” | “how now brown cow”
  Map: each mapper tokenizes its split and emits (word, 1) pairs, e.g. (the,1) (quick,1) (brown,1) (fox,1) …
  Shuffle & Sort: pairs are grouped by key – the:[1,1,1,1], brown:[1,1], fox:[1,1], quick:[1], ate:[1], mouse:[1], how:[1], now:[1], cow:[1]
  Reduce: each reducer sums the list for its keys
  Output: the 4, brown 2, fox 2, quick 1, ate 1, mouse 1, how 1, now 1, cow 1
  • 43. Word Count in Map-Reduce
    // map: tokenize the input value and emit (word, 1) for every token
    public void map(Object key, Text value, … ) {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
    // reduce: sum the counts received for each word
    public void reduce(Text key, Iterable<IntWritable> values, … ) {
      int sum = 0;
      for (IntWritable val : values) { sum += val.get(); }
      result.set(sum);
      context.write(key, result);
    }
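  The slide shows only the map() and reduce() bodies; the elided parameter in both signatures is the MapReduce Context. A sketch of the driver that wires them into a job is below – TokenizerMapper and IntSumReducer are assumed names for classes that wrap the two methods above, and the input/output paths come from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");        // Job.getInstance(conf, ...) in later versions
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);    // assumed wrapper for the map() on the slide
            job.setCombinerClass(IntSumReducer.class);    // optional local aggregation before the shuffle
            job.setReducerClass(IntSumReducer.class);     // assumed wrapper for the reduce() on the slide
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

  A typical invocation would look like: hadoop jar wordcount.jar WordCount /input /output (the jar name and paths are placeholders).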
  • 44. Pig and Hive ● Pig and Hive provide a wrapper to make it easier to write MapReduce jobs: Pig is a data flow scripting language, Hive is an SQL-like language ● The raw data is stored in Hadoop's HDFS ● These scripting languages provide – Ease of programming – Optimization opportunities – Extensibility
  • 45. Other Hadoop-related projects at Apache include: ● Avro™: a data serialization system ● Cassandra™: a scalable multi-master database with no single points of failure ● Chukwa™: a data collection system for managing large distributed systems ● HBase™: a scalable, distributed database that supports structured data storage for large tables ● Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying ● Mahout™: a scalable machine learning and data mining library ● Pig™: a high-level data-flow language and execution framework for parallel computation ● ZooKeeper™: a high-performance coordination service for distributed applications. http://hadoop.apache.org/
  • 46. Powered by Hadoop http://wiki.apache.org/hadoop/PoweredBy ● Facebook (more than 100+ Companies are listed) – 1100-machine cluster with 8800 cores – store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning ● Yahoo – Biggest cluster: 4000 nodes – Search Marketing, People you may know, Search Assist, and many more… ● Ebay – 532 nodes cluster (8 * 532 cores, 5.3PB). – Using it for Search optimization and Research
  • 47. Hadoop is not a relational database ● Hadoop is best suited for batch processing of large volumes of unstructured data. – Lack of schemas – Lack of indexes – Lack of updates – pretty much absent! – Not designed for joins. – Support for Integrity Constraints – Limited support for data analysis tools But what are your data analysis needs?
  • 48. Hadoop is not a Relational Database: if these are important, stick to an RDBMS – OLTP, data integrity, data independence, SQL, ad-hoc queries, complex relationships, maturity and stability.
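  As a reminder of what "stick to an RDBMS" buys you, here is a minimal JDBC sketch of an ACID transfer between two accounts; the connection URL and the accounts table are assumptions for illustration only.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class TransferExample {
        public static void transfer(String url, int fromId, int toId, long amount) throws SQLException {
            Connection con = DriverManager.getConnection(url);   // e.g. "jdbc:postgresql://host/db" (placeholder)
            try {
                con.setAutoCommit(false);                         // start a transaction
                PreparedStatement debit =
                    con.prepareStatement("UPDATE accounts SET balance = balance - ? WHERE id = ?");
                debit.setLong(1, amount);
                debit.setInt(2, fromId);
                debit.executeUpdate();

                PreparedStatement credit =
                    con.prepareStatement("UPDATE accounts SET balance = balance + ? WHERE id = ?");
                credit.setLong(1, amount);
                credit.setInt(2, toId);
                credit.executeUpdate();

                con.commit();                                     // both updates become visible atomically
            } catch (SQLException e) {
                con.rollback();                                   // neither update survives a failure
                throw e;
            } finally {
                con.close();
            }
        }
    }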
  • 49. Do you need SQL and full relational systems? If not, consider NoSQL databases for your needs Key-value Tabular Document Graph http://nosql-database.org/
  • 50. The Key-Value In-Memory DBs ● In memory DBs are simpler and faster than their on-disk counterparts. ● Key value stores offer a simple interface with no schema. Really a giant, distributed hash table. ● Often used as caches for on-disk DB systems. ● Advantages: – Relatively simple – Practically no server to server talk. – Linear scalability ● Disadvantages: – Doesn’t understand data – no server side operations. The key and value are always strings. – It’s really meant to only be a cache – no more, no less. – No recovery, limited elasticity.
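  The "cache in front of an on-disk DB" usage described above is usually the cache-aside pattern. A toy Java sketch, with an in-process map standing in for the distributed key-value store and hypothetical loadFromDatabase/writeToDatabase calls standing in for the on-disk system:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class CacheAside {
        // An in-process map stands in for the key-value store; keys and values are plain strings
        private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<String, String>();

        public String get(String key) {
            String value = cache.get(key);
            if (value == null) {
                value = loadFromDatabase(key);   // cache miss: fall through to the on-disk system
                if (value != null) {
                    cache.put(key, value);       // populate the cache for later reads
                }
            }
            return value;
        }

        public void update(String key, String value) {
            writeToDatabase(key, value);         // the database stays the system of record
            cache.remove(key);                   // invalidate; the next read repopulates the cache
        }

        // Hypothetical backing-store calls; a real system would hit an RDBMS or document store here
        private String loadFromDatabase(String key) { return null; }
        private void writeToDatabase(String key, String value) { }
    }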
  • 51. Voldemort is a distributed key-value storage system ● Data is automatically – replicated over multiple servers – partitioned so each server contains only a subset of the total data ● Data items are versioned ● Server failure is handled transparently ● Each node is independent of other nodes with no central point of failure or coordination ● Support for pluggable data placement strategies to support things like distribution across data centers that are geographically far apart ● Good single node performance: you can expect 10-20k operations per second depending on the machines, the network, the disk system, and the data replication factor ● Voldemort is not a relational database – it does not attempt to satisfy arbitrary relations while satisfying ACID properties. Nor is it an object database that attempts to transparently map object reference graphs. Nor does it introduce a new abstraction such as document-orientation. ● It is basically just a big, distributed, persistent, fault-tolerant hash table. http://project-voldemort.com/
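  Client code against Voldemort is small. A sketch along the lines of the project's published quick-start (class names from the Voldemort client library; the bootstrap URL and store name are placeholders and may vary by version):

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;
    import voldemort.versioning.Versioned;

    public class VoldemortExample {
        public static void main(String[] args) {
            // Bootstrap from one node; the client discovers the rest of the cluster from it
            StoreClientFactory factory = new SocketStoreClientFactory(
                    new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
            StoreClient<String, String> client = factory.getStoreClient("test");

            client.put("some_key", "some_value");          // versioned write, replicated per store config
            Versioned<String> value = client.get("some_key");
            System.out.println(value.getValue());
        }
    }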
  • 52. Tabular stores ● The original: Google’s BigTable – Proprietary, not open source. ● The open source elephant alternative – Hadoop with HBase. ● A top level Apache Project. ● Large number of users. ● Contains a distributed file system, MapReduce, a database server (Hbase), and more. ● Rack aware.
  • 53. What is Google’s Big Table ● BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second. ● BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries. ● It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure. ● Commercial databases simply don't scale to this level and they don't work across 1000s of machines.
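  HBase, the open-source alternative mentioned above, exposes exactly this key-based access pattern. A sketch using the HBase Java client of that era (HTable/Put/Get); the table name, column family and values are placeholders and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "customers");          // table and column family must already exist

            // Write one cell: row key -> column family "profile", qualifier "name"
            Put put = new Put(Bytes.toBytes("row-001"));
            put.add(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Key-based lookup, the access pattern BigTable-style stores are built for
            Get get = new Get(Bytes.toBytes("row-001"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));

            table.close();
        }
    }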
  • 54. Document Stores ● As the name implies, these databases store documents. ● Usually schema-free. The same database can store multiple documents. ● Allow indexing based on document content. ● Prominent examples: CouchDB, MongoDB.
  • 55. Why MongoDB? ● Document-oriented – Documents (objects) map nicely to programming language data types – Embedded documents and arrays reduce the need for joins – Dynamically-typed (schemaless) for easy schema evolution – No joins and no multi-document transactions for high performance and easy scalability ● High performance – No joins, and embedding makes reads and writes fast – Indexes, including indexing of keys from embedded documents and arrays – Optional streaming writes (no acknowledgements) ● Rich query language ● Easy scalability – Automatic sharding (auto-partitioning of data across servers) – Eventually-consistent reads can be distributed over replicated servers ● High availability – Replicated servers with automatic master failover
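  A sketch of the document model using the MongoDB 2.x Java driver; host, database, collection and field names are placeholders:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class MongoExample {
        public static void main(String[] args) throws Exception {
            MongoClient mongo = new MongoClient("localhost", 27017);
            DB db = mongo.getDB("crm");
            DBCollection customers = db.getCollection("customers");

            // Schema-less insert: the embedded "address" document avoids a join
            BasicDBObject doc = new BasicDBObject("name", "Asha")
                    .append("segment", "retail")
                    .append("address", new BasicDBObject("city", "Pune").append("country", "IN"));
            customers.insert(doc);

            // Query by field, including a field inside the embedded document
            DBObject found = customers.findOne(new BasicDBObject("address.city", "Pune"));
            System.out.println(found);

            mongo.close();
        }
    }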
  • 56. Mapping Systems to the CAP Theorem – Consistency: all clients have the same view of the data; Availability: each client can always read and write; Partition Tolerance: the system works well despite physical network partitions. CA: RDBMS (MySQL, Postgres etc.), AsterData, Greenplum, Vertica. CP: BigTable, Hypertable, HBase, MongoDB, Terrastore, Scalaris, BerkeleyDB, MemcachedDB, Redis. AP: Dynamo, Cassandra, Voldemort, SimpleDB, Tokyo Cabinet, CouchDB, KAI, Riak.
  • 57. NoSQL use cases – it is important to align the data model to the requirements: bigness; massive write performance; fast key-value access; flexible schema and flexible data types; schema migration; write availability; no single point of failure; generally available parallel computing; ease of programming.
  • 58. Mapping new Internet Data Management Technologies to the Enterprise
  • 59. Enterprise data strategy is getting inclusive
  • 60. Open Source Rules ! Hadoop Infrastructure
  • 62. The Path to Data Stack 3.0: must support Variety, Volume and Velocity. Data Stack 1.0 → Data Stack 2.0 → Data Stack 3.0: Relational Database Systems → Enterprise Data Warehouse → Dynamic Data Platform; Recording Business Events → Support for Decision Making → Uncovering Key Insights; Highly Normalized Data → Un-normalized Dimensional Model → Schema-less Approach; GBs of Data → TBs of Data → PBs of Data; End User Access through Enterprise Apps → End User Access through Reports → End User Direct Access; Structured → Structured → Structured + Semi-Structured. 8 December 2012 62
  • 63. Can Data Stack 3.0 Address Real Problems? Large data volume at low price; diverse data beyond structured data; queries that are difficult to answer; answers to queries that no one dared ask. 8 December 2012 63
  • 64. How does one go about the Big Data Expedition?
  • 65. PERSISTENT SYSTEMS AND BIG DATA 8 December 2012 65
  • 66. Persistent Systems has an experienced team of Big Data Experts that has created the technology building blocks to help you implement a Big Data Solution that offers a direct path to unlock the value in your data.
  • 67. Big Data Expertise at Persistent ● 10+ projects executed with Leading ISVs and Enterprise Customers ● Dedicated group to MapReduce, Hadoop and Big Data Ecosystem (formed 3 years ago) ● Engaged with the Big Data Ecosystem, including leading ISVs and experts • Preferred Big Data Services Partner of IBM and Microsoft 8 December 2012
  • 68. Big Data Leadership and Contributions ● Code Contributions to Big Data Open Source Projects, including: – Hadoop, Hive, and SciDB ● Dedicated Hadoop cluster in Persistent ● Created PeBAL – Persistent Big Data Analytics Library ● Created Visual Programming Environment for Hadoop ● Created Data Connectors for Moving Data ● Pre-built Solutions to Accelerate Big Data Projects 8 December 2012 68
  • 69. Persistent's Big Data Offerings: 1. Setting up and maintaining a Big Data platform 2. Data analytics on a Big Data platform 3. Building applications on Big Data. Technology assets: pre-built industry solutions (Retail, Banking, Telco), pre-built horizontal solutions (Email, Text, IT Analytics, …), Visual Programming, IP tools (PeBAL Analytics Library, Data Connectors), and a foundational infrastructure and platform built upon selected 3rd-party Big Data platforms and technologies on a cluster of commodity hardware. People assets: Big Data custom services, extension of your team, discovery workshop, training for your team, platform enhancement, methodology, team formation, process, cluster sizing/config. 8 December 2012 69
  • 70. Persistent Next Generation Data Architecture (diagram): connector frameworks for email servers, media servers, web proxies, IBM Tivoli, BBCA, social sources (Twitter, Facebook), RDBMS and data warehouses feed a MapReduce and HDFS cluster; on top sit the Persistent Analytics Library (PeBAL) – NoSQL, graph, set and text-analytics functions – plus text analytics (GATE/SystemT), Hive and PIG/JAQL; results are consumed through reports & alerts, BI tools, admin apps, solutions and workflow integration, with cluster monitoring throughout. (A mix of commercial/open source products, Persistent IP and external data sources.) 8 December 2012 70
  • 71. Persistent Big Data Analytics Library. WHY PEBAL • Lots of common problems – not all of them are solved in Map Reduce • PigLatin, Hive and JAQL are languages, not libraries – something is needed to run on top that is not tied to SQL-like interfaces. FEATURES • Organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms • PeBAL functions are schema agnostic • All PeBAL functions are tried and tested against well defined use cases. BENEFITS OF A READY MADE SOLUTION • Proven – well written and tested • Reuse across multiple applications • Quicker implementation of map reduce applications • High performance. 8 December 2012 71
  • 72. PeBAL function areas: Web Analytics, Text Analytics, Inverted Lists, Set, Graph, Statistics. 8 December 2012 72
  • 73. Visual Programming Environment ADOPTION BARRIERS • Steep Learning Curve • Difficult to Code • Ad-hoc reporting can’t always be done by writing programs • Limited tooling available VISUAL PROGRAMMING ENVIRONMENT • Use Standard ETL tool as the UI environment for generating PIG scripts BENEFITS • ETL Tools are widely used in Enterprises • Can leverage large pool of skilled people who are experts in ETL and BI tools • UI helps in iterative and rapid data analysis • More people will start using it 8 December 2012 73
  • 74. Visual Programming Environment for Hadoop (diagram): data sources are modeled in a standard ETL tool (data flow UI); the tool's metadata is fed to a PIG converter that generates PIG code, backed by a PIG UDF library, which runs against HDFS/Hive data on the Big Data platform. (The PIG converter and UDF library are Persistent IP.) 8 December 2012 74
  • 75. Persistent Connector Framework. WHY A CONNECTOR FRAMEWORK • Pluggable architecture. OUT OF THE BOX • Database, data warehouse • Microsoft Exchange • Web proxy • IBM Tivoli • BBCA • Generic push connector for *any* content. FEATURES • Bi-directional connector (as applicable) • Supports push/pull mechanisms • Stores data on HDFS in an optimized format • Supports masking of data. 8 December 2012 75
  • 76. Persistent Data Connectors 8 December 2012 76
  • 77. Persistent's Breadth of Big Data Capabilities. Tooling: RDBMS/DWH connectors to import/export data; text analytics libraries; data visualization using Web 2.0 and reporting tools (Cognos, MicroStrategy); Big Data Platform (PeBAL) analytics libraries and connectors; ecosystem tools like Nutch, Katta, Lucene. IT Management: job configuration, management and monitoring with BigInsights' job scheduler (MetaTracker); job failure and recovery management. Big Data Application Programming: deep JAQL expertise – JAQL programming, extending JAQL using UDFs, integration of third-party tools/libraries, performance tuning, ETL using JAQL; expertise in MR programming – PIG, Hive, Java MR; deep expertise in analytics – text analytics using IBM's text extraction solution (AQL + SystemT); statistical analytics – R, SPSS, BigInsights integration with R. Distributed File Systems: HDFS, IBM GPFS. Cluster Layer: platform setup on multi-node clusters, monitoring, VM-based setup, product deployment. Plus horizontal and vertical pre-built solutions and Persistent IP for Big Data solutions across the Big Data platform components. 8 December 2012 77
  • 78. Persistent Roadmap to Big Data: 1. Learn – discover and define use cases; 2. Initiate – validate with a POC; 3. Scale – upgrade to production if successful; 4. Measure – measure effectiveness and business value; 5. Manage – improve the knowledge base and the shared Big Data platform. 8 December 2012 78
  • 79. Customer Analytics – identifying your most influential customers: build a social graph of all transactions (> 1 billion transactions over twenty years, 70 million customers); identify influential customers using network analysis (a few thousand influential customers); overlay sales data on the graph and target these customers for promotions. Targeting influential customers is the best way to improve campaign ROI! 8 December 2012 79
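  The slide does not spell out the algorithm, so the following is purely illustrative and not Persistent's implementation: one simple proxy for influence is the number of distinct counterparties a customer transacts with, which maps naturally onto a MapReduce job over transaction edges. The input format (tab-separated customerId and counterpartyId per line) is an assumption.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InfluenceByDegree {

        // Emits (customer, counterparty) for every transaction edge in the input
        public static class EdgeMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\t");
                if (fields.length == 2) {
                    context.write(new Text(fields[0]), new Text(fields[1]));
                }
            }
        }

        // Counts distinct counterparties per customer; a high count is the toy "influence" score
        public static class DegreeReducer extends Reducer<Text, Text, Text, IntWritable> {
            @Override
            protected void reduce(Text customer, Iterable<Text> counterparties, Context context)
                    throws IOException, InterruptedException {
                Set<String> distinct = new HashSet<String>();
                for (Text c : counterparties) {
                    distinct.add(c.toString());
                }
                context.write(customer, new IntWritable(distinct.size()));
            }
        }
    }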
  • 80. Overview of Email Analytics ● Key Business Needs – Ensure compliance with respect to a variety of business and IT communications and information sharing guidelines. – Provide an ongoing analysis of customer sentiment through email communications. ● Use Cases – Quickly identify if there has been an information breach or if information is being shared in ways that are not in compliance with organizational guidelines. – Identify if a particular customer is not being appropriately managed. ● Benefits – Ability to proactively manage email analytics and communications across the organization in a cost-effective way. – Reduce the response time to manage a breach and proactively address issues that emerge through ongoing analysis of email. 8 December 2012 80
  • 81. Using Email to Analyze Customer Sentiment: sense the mood of your customers through their emails; carry out detailed analysis on customer team interactions and response times. 8 December 2012 81
  • 82. Analyzing Prescription Data 1.5 million patients are harmed by medication errors every year Identifying erroneous prescriptions can save lives! 8 December 2012 Source: Center for Medication Safety & Clinical Improvement 82
  • 83. Overview of IT Analytics ● Key Business Needs – Troubleshooting issues in the world of advanced and cloud based systems is highly complex, requiring analysis of data from various systems. – Information may be in different formats, locations, granularity, data stores. – System outages have a negative impact on short-term revenue, as well as long-term credibility and reliability. – The ability to quickly identify if a particular system is unstable and take corrective action is imperative. ● Use Cases – Identify security threats and isolate the corresponding external factors quickly. – Identify if an email server is unstable, determine the priority and take preventative action before a complete failure occurs. ● Benefits – Reduced maintenance cost – Higher reliability and SLA compliance 8 December 2012 83
  • 84. Consumer Insight from Social Media: find out what customers are saying about your organization or product on social media. 8 December 2012 84
  • 85. Insights for Satyamev Jayate – variety of sources. 1. Structured Analysis: Web/TV viewer responses to the pledge and multiple-choice questions, via Web, SMS, IVR and social widgets. 2. Unstructured Analysis: responses to open questions (share your story, ask a question to Aamir, send a message of hope, share your solution) from Web, social media, emails, IVR/calls, individual blogs and videos, passed through a Content Filtering Rating Tagging System (CFRTS) and L0, L1, L2 phased analytics. 3. Impact Analysis: crawling the general internet to measure the before & after scenario on a particular topic.
  • 86. Rigorous weekly operation cycle producing instant analytics – a killer combo of human + software to analyze the data efficiently: the topic opens on Sunday with the episode; data capture from SMS, phone calls, social media and the website runs all through the week; featured content is delivered thrice a day; tags are refined and messages re-ingested for another pass; the system runs L0, L1, L2 analysis and analysts continue; JSONs are created for the external and internal dashboards; a live analytics report is sent during the show.
  • 88. Thank you Anand Deshpande (anand@persistent.co.in) http://in.linkedin.com/in/ananddeshpande Persistent Systems Limited www.persistentsys.com 8 December 2012 88
  • 89. Enterprise Value is Shifting to Data. Value has moved up the stack over time: Mainframe → Operating Systems → Database → ERP → Apps → Data (1975, 1985, 1995, 2006, 2013). 8 December 2012 89

Editor's Notes

  1. LONDON — Britain often conjures images of unpredictable weather, with downpours sometimes followed by sunshine within the same hour — several times a day. Such randomness has prompted Tesco, the country’s largest grocery chain, to create its own weather team in hopes of better forecasting temperatures and how consumer demand changes with them. After three years of research, the six-person team has created its own software that calculates how shopping patterns change “for every degree of temperature and every hour of sunshine,” Tesco said last month. “Rapidly changing weather can be a real challenge,” Jonathan Church, a Tesco spokesman, said in a statement. “The system successfully predicted temperature drops during July that led to a major increase in demand for soup, winter vegetables and cold-weather puddings.”
  2. Big data: Embracing the elephant in the room (Case study – GlaxoSmithKline) By Steve Hemsley. GlaxoSmithKline's (GSK) CRM consultant James Parker believes the benefits of big data must be sold into every department within an organisation, rather than being seen purely as a marketing function. GSK, which owns brands from Sensodyne to Lucozade, is already tracking consumers online and repurposing the data to benefit particular brands. "Consumers are providing signals of what they are doing, seeing, thinking and sharing at a point in time and this is valuable to help drive relationships," he says. The company is working with St Ives Group's data management division Occam, which says GSK is aiming to build direct relationships with 1 million consumers in a year using social media as a base for research and multichannel marketing. Targeted offers and promotions will drive people to particular brand websites where external data is integrated with information already held by the marketing teams. "We identify people who are mentioning a particular brand of ours but also track everything else they are talking about in the public domain to build a customer profile," says Parker.
  3. Poor Use Cases. OLTP: outside VoltDB, complex multi-object transactions are generally not supported. Programmers are supposed to denormalize, use documents, or use other complex strategies like compensating transactions. Data integrity: most of the NoSQL systems rely on applications to enforce data integrity, where SQL uses a declarative approach. Relational databases are still the winner for data integrity. Data independence: data outlasts applications. In NoSQL, applications drive everything about the data. One argument for the relational model is as a repository of facts that can last for the entire lifetime of the enterprise, far past the expected life-time of any individual application. SQL: if you require SQL then very few NoSQL systems will provide a SQL interface, but more systems are starting to provide SQL-ish interfaces. Ad-hoc queries: if you need to answer real-time questions about your data that you can't predict in advance, relational databases are generally still the winner. Complex relationships: some NoSQL systems support relationships, but a relational database is still the winner at relating. Maturity and stability: relational databases still have the edge here. People are familiar with how they work, what they can do, and have confidence in their reliability. There are also more programmers and toolsets available for relational databases. So when in doubt, this is the road that will be traveled.
  4. Bigness: ability to support large volumes of data at a reasonable price on commodity hardware. Massive write performance: this is probably the canonical usage based on Google's influence. High volume: Facebook needs to store 135 billion messages a month; Twitter, for example, has the problem of storing 7 TB of data per day, and at 80 MB/s it takes a day to store 7 TB, so writes need to be distributed over a cluster. Fast key-value access: hashing on a key in main memory is the fastest way to access a value. Flexible schema and flexible datatypes: NoSQL products support a whole range of new data types – column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program- and programmer-friendly compatible datatypes like JSON. Schema migration: schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic, because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema. Write availability: do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency and all that jazz. Easier maintainability, administration and operations: this is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs, as special code doesn't have to be written to scale a system that was never intended to be used that way. No single point of failure: not every product is delivering on this, but we are seeing a definite convergence on relatively easy to configure and manage high availability with automatic load balancing and cluster sizing. A perfect cloud partner. Generally available parallel computing: we are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future. Programmer ease of use: accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer-led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc.; programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue: if it costs a lot to scale a product, won't you go with the cheaper product that you control, that's easier to use, and that's easier to scale? Use the right data model for the right problem: different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and a solution.
  5. Microsoft: we have built the SQL Server to Hadoop connector, which is in production; we are also testing MongoDB on Azure. Salesforce Radian6: done some Salesforce integration work for them, I think. EMC: we have done some work on Greenplum, but that was before the buyout – we are trying to start an engagement with them. They are a partner on the content side (Documentum). HBase: several projects have used HBase – email analytics, American Airlines. MongoDB: performance testing on Azure, data store for a social media analytics platform (Content Savvy), multimedia data store for Qyuki. PIG: used heavily at Yahoo. SciDB: doing QA and code contributions; we will be using it on Vo as well. AsterData: done some BI work for them.
  6. Persistent has intellectual assets from the very core of the stack up through the solution-specific layers. Big Data Consulting: roadmap for deploying Big Data in the enterprise; workshop for use case discovery; sample proofs of concept. Hadoop Programming: developing custom applications to fit the MR paradigm exposed by Hadoop; porting existing applications to utilize MR on Hadoop; tuning existing MR jobs to improve performance. Big Data Application Development: knowledge of tools based on the Hadoop platform – Nutch, Hive, Pig, JAQL; experience in building applications using these tools; writing connectors for loading data onto HDFS.
  7. PeBAL is Persistent's big data analytics library. It enables quicker implementation of map reduce applications. Organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms. PeBAL functions are schema agnostic. All PeBAL functions are tried and tested against well defined use cases. There are lots of common problems; many of them are solved – but not in Map Reduce. Don't reinvent the wheel: use a ready-made solution which is well written, well tested, well used in multiple applications, and performs well. Would PigLatin, Hive, or JAQL solve this for you? No – all of them are programming languages, not libraries, and their language features are very tied to a SQL-like interface. We need something on top of them.
  8. A large US airline wanted to identify influencers for targeting marketing programs as well as for user segmentation.
  9. [Too Tedious to Define and Answer] Could be extended to ticketing systems, call center recordings, live communities. Areas that this could help: customer service, training, product management.
  10. [Data Available but not Warehoused] 75 thousand people die in the US because of correctable flaws in the health care system. Data is in unstructured format, volume is huge, and analysis has to be done quickly – solution built using IBM BigInsights.
  11. [Don't believe it's possible] Intended users: marketing team, product development. Requires: Persistent's DataSift to BigInsights connector, Persistent Big Data Analytics Library (PeBAL), Persistent Text Analytics Framework, IBM BigInsights. Compatible with: Twitter, Facebook. Data volumes: in terabytes. Also do work on Cognos Consumer Insight: a social media monitoring and analytics platform that allows several use cases – brand reputation management, campaign tracking, user intent tracking – to be implemented on data available on social media as well as Internet sources like blogs, news, message forums etc. Technology used: Hadoop (IBM BigInsights), JAQL, text analytics using SystemT. Domain: digital marketing, social media. Business need for using Big Data: cost effective for large datasets, ability to handle unstructured/semi-structured data.