The document discusses the emergence of big data and new data architectures needed to handle large, diverse datasets. It notes that internet companies built their own data systems like Hadoop to process massive amounts of unstructured data across thousands of servers in a fault-tolerant, scalable way. These systems use a map-reduce programming model and distributed file systems like HDFS to store and process data in a parallel, distributed manner.
From the Big Data keynote at InCSIghts 2012
1. BIG DATA Defined:
Data Stack 3.0
Anand Deshpande
Persistent Systems
December 2012
2. Congratulations to the Pune Chapter
Best Chapter Award at CSI 2012 Kolkata
3. COMAD 2012: 14-16 December, Pune
Coming to India: Delhi 2016
4. The Data Revolution is Happening Now
"The growing need for large-volume, multi-structured 'Big Data' analytics, as well as … 'Fast Data', have positioned the industry at the cusp of the most radical revolution in database architectures in 20 years. We believe that the economics of data will increasingly drive competitive advantage."
Source: Credit Suisse Research, Sept 2011
5. What Data Can Do For You
Organizational leaders want analytics to exploit their growing data and computational power to get smart, and get innovative, in ways they never could before.
Source: MIT Sloan Management Review, "The New Intelligent Enterprise: Big Data, Analytics and the Path From Insights to Value," by Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz, December 21, 2010
6. Determining Shopping Patterns
British Grocer Tesco Uses Big Data by Applying Weather Results to Predict Demand and Increase Sales
Britain often conjures images of unpredictable weather, with downpours sometimes followed by sunshine within the same hour — several times a day. Such randomness has prompted Tesco, the country's largest grocery chain, to create its own software that calculates how shopping patterns change "for every degree of temperature and every hour of sunshine."
Source: New York Times, September 2, 2009. "Tesco, British Grocer, Uses Weather to Predict Sales," by Julia Werdigier. http://www.nytimes.com/2009/09/02/business/global/02weather.html
7. Tracking Customers in Social Media
GlaxoSmithKline Uses Big Data to Efficiently Target Customers
GlaxoSmithKline is aiming to build direct relationships with 1 million consumers in a year using social media as a base for research and multichannel marketing. Targeted offers and promotions will drive people to particular brand websites where external data is integrated with information already held by the marketing teams.
Source: "Big data: Embracing the elephant in the room," by Steve Hemsley. http://www.marketingweek.co.uk/big-data-embracing-the-elephant-in-the-room/3030939.article
8. What does India Think?
Persistent enabled Aamir Khan Productions and Star Plus to use Big Data to learn how people react to some of the most excruciating social issues.
http://www.satyamevjayate.in/
Satyamev Jayate, Aamir Khan's pioneering, interactive socio-cultural TV show, has caught the interest of the entire nation. It has already generated ~7.5M responses in 4 weeks over SMS, Facebook, Twitter, phone calls and discussion forums from viewers across the world. This data is being analyzed and delivered in real time to allow the producers to understand the pulse of the viewers, to gauge the appreciation for the show and, most importantly, to spread the message. Harnessing the truth from all this data is a key component of the show's success.
10. WE ALREADY HAVE DATABASES. WHY DO WE NEED TO DO ANYTHING DIFFERENT?
11. Relational Database Systems for the Operational Store
● Transaction processing capabilities ideally suited for transaction-oriented operational stores.
● Data types – numbers, text, etc.
● SQL as the query language.
● De-facto standard as the operational store for ERP and mission-critical systems.
● Interface through application programs and query tools.
12. Data Stack 1.0: Online Transaction Processing (OLTP)
● High throughput for transactions (writes).
● Focus on reliability – ACID properties.
● Highly normalized schema.
● Interface through application programs and query tools.
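To make the OLTP interface concrete, here is a minimal JDBC sketch of an ACID transaction. The bank schema, connection URL and credentials are illustrative, not from the deck:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class TransferFunds {
        public static void main(String[] args) throws Exception {
            // Hypothetical operational store
            Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/bank", "app", "secret");
            con.setAutoCommit(false);           // group the two writes into one transaction
            try (PreparedStatement debit = con.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setInt(1, 100); debit.setInt(2, 1); debit.executeUpdate();
                credit.setInt(1, 100); credit.setInt(2, 2); credit.executeUpdate();
                con.commit();                   // atomic: both updates, or neither
            } catch (Exception e) {
                con.rollback();                 // failure leaves the store consistent
                throw e;
            } finally {
                con.close();
            }
        }
    }

If either UPDATE fails, the rollback leaves the store exactly as it was: that is the atomicity and reliability the slide refers to.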
13. Data Stack 2.0: Enterprise Data Warehouse for Decision Support
● Operational data stores record online transactions – many writes, some reads.
● Large fact table, multiple dimension tables.
● Schema has a specific pattern – star schema.
● Joins are also very standard and create cubes.
● Queries focus on aggregates.
● Users access data through tools such as Cognos, Business Objects, Hyperion, etc.
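A sketch of a typical Data Stack 2.0 query: an aggregate over a star schema, joining a fact table to its dimensions. The table and column names are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class StarSchemaQuery {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/dwh", "bi", "secret");
            // Fact table joined to dimension tables in the standard star pattern;
            // the query asks for an aggregate, not individual transactions
            String sql = "SELECT d.year, s.region, SUM(f.amount) AS revenue "
                       + "FROM sales_fact f "
                       + "JOIN date_dim d ON f.date_key = d.date_key "
                       + "JOIN store_dim s ON f.store_key = s.store_key "
                       + "GROUP BY d.year, s.region";
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.printf("%d %s %.2f%n", rs.getInt("year"),
                            rs.getString("region"), rs.getDouble("revenue"));
                }
            }
            con.close();
        }
    }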
14. Data Stack 2.0: Enterprise Data Warehouse
[Diagram: ETL moves data from the staging area and data store into the data warehouse; on top sit reports & ad hoc analysis, alerts & dashboards, what-if analysis, EPM, predictive analytics and OLAP, delivered to the user through data visualization.]
15. Standard Enterprise Data Architecture
[Diagram: Data Stack 1.0 (operational data systems: presentation layer, application logic, relational databases) plus ERP systems, purchased data and legacy data feed an extraction/cleansing (ETL) stage with an optimized loader into Data Stack 2.0, the enterprise data warehouse engine with its metadata repository, where users analyze and query.]
16. Who are the players
ETL: Oracle Data Integrator (Oracle); SQL Server Integration Services (SSIS) (Microsoft); IBM InfoSphere DataStage (IBM); Business Objects Data Integrator (SAP); Kettle (open source); Informatica PowerCenter enterprise data integration server (pure play).
DWH: Oracle 11g/Exadata (Oracle); Parallel Data Warehouse (PDW) and SQL Server (Microsoft); Netezza (PureData) (IBM); Sybase IQ (SAP); Postgres/MySQL (open source); Teradata and Greenplum (EMC) (pure play).
OLAP: Hyperion/Essbase (Oracle); SQL Server Analysis Services (SSAS) (Microsoft); Cognos PowerPlay (IBM); SAP HANA (SAP); Mondrian (open source); OLAP Viewer (pure play).
Reporting & Dashboard: Oracle BI (OBIEE) & Exalytics (Oracle); SQL Server Reporting Services (SSRS) and Report Builder (Microsoft); Cognos BI Report Studio (IBM); Business Objects, BO Enterprise (SAP); BIRT, Pentaho, Jasper (open source); MicroStrategy, Qliktech, Tableau (pure play).
Predictive Analytics: Oracle Data Mining (ODM) (Oracle); SQL Server Data Mining (SSDM) (Microsoft); SPSS (IBM); SAP HANA + R (SAP); R/Weka (open source); SAS Enterprise Miner (pure play).
17. Despite the two data stacks…
One in two business executives believe that they do not have sufficient information across their organization to do their job.
Source: IBM Institute for Business Value, December 2012
18. Data has Variety: it doesn't fit
Less than 40% of the enterprise data makes its way to Data Stack 1.0 or Data Stack 2.0.
19. Beyond the operational systems, data required for decision making is scattered within and beyond the enterprise
[Diagram: sources span structured, unstructured, cloud and public data sources: ERP systems, CRM systems, supply chain systems and the enterprise data warehouse; email systems, collaboration/wiki sites, document repositories, organizational workflow, employee surveys and project artifacts; expense management, vendor collaboration systems, social networking, and location and presence data; weather forecasts, Twitter feeds, demographic data, maps, economic data and sensor data.]
20. Data Volumes are Growing
5 exabytes of information was created between the dawn of civilization and 2003, but that much information is now created every 2 days, and the pace is increasing.
(1 exabyte = 10^18 bytes)
– Eric Schmidt at the Techonomy Conference, August 4, 2010
21. The Continued Explosion of Data in the Enterprise and Beyond
80% of new information growth is unstructured content; 90% of that is currently unmanaged.
[Chart: from 800,000 petabytes in 2009 to 35 zettabytes in 2020 – 44x as much data and content over the coming decade.]
Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
22. What comes first: structure or data?
[Diagram: schema/structure on one side, data on the other.]
Structure First is Constraining
23. Time to create a new data stack for unstructured data: Data Stack 3.0.
24. Time-out! Internet companies have already addressed the same problems.
25. Internet companies have to deal with large volumes of unstructured real-time data
● Twitter has 140 million active users and more than 400 million tweets per day.
● Facebook has over 900 million active users, and an average of 3.2 billion Likes and Comments are generated by Facebook users per day.
● There were 3.1 billion email accounts in 2011, expected to rise to over 4 billion by 2015.
● There were 2.3 billion internet users (2,279,709,629) worldwide in the first quarter of 2012, according to Internet World Stats data updated 31st March 2012.
26. Their data loads and pricing requirements do not fit traditional relational systems
● Hosted service.
● Large cluster (1000s of nodes) of low-cost commodity servers.
● Very large amounts of data – indexing billions of documents, video, images, etc.
● Batch updates.
● Fault tolerance.
● Hundreds of millions of users.
● Billions of queries every day.
27. They built their own systems
● It is the platform that distinguishes them from everyone else.
● They required:
– high reliability across data centers
– scalability to thousands of network nodes
– huge read/write bandwidth
– support for large blocks of data, gigabytes in size
– efficient distribution of operations across nodes to reduce bottlenecks
Relational databases were not suitable and would have been cost prohibitive.
28. Internet companies have open-sourced the source code they created for their own use.
Companies have created business models to support and enhance this software.
29. What did the Internet Companies build?
And how did they get there?
They started with a clean slate!
30. What features from the relational database can be compromised?
Do we need…
● transaction support?
● rigid schemas?
● joins?
● SQL?
● on-line, live updates?
Must have:
● scale
● ability to handle unstructured data
● ability to process large volumes of data without having to start with structure first
● leverage of distributed computing
31. Rethinking ACID properties
ACID: Atomicity, Consistency, Isolation, Durability.
For the internet workload, with distributed computing, ACID properties are too strong. Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state.
BASE: Basic Availability, Soft-state, Eventual consistency.
32. Brewer's CAP Theorem for Distributed Systems
● Consistent – reads always pick up the latest write.
● Available – can always read and write.
● Partition tolerant – the system can be split across multiple machines and datacenters.
[Venn diagram: consistency, availability and partition tolerance overlap pairwise as CA, CP and AP.]
A system can do at most two of these three.
33. Essential Building Blocks for Internet Data Systems
[Diagram: developers submit Map-Reduce jobs to a job tracker; the Hadoop Map-Reduce layer runs on top of the Hadoop Distributed File System (HDFS), which spans the cluster.]
34. "For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the 'divide-and-conquer using lots of cheap hardware' approach to breaking down large problems is the only way to scale, doing so is not easy."
– Jeremy Zawodny @ Yahoo!
35. Challenges with Distributed Computing
● Cheap nodes fail, especially if you have many.
– Mean time between failures for 1 node = 3 years; for 1000 nodes = 1 day (see the arithmetic sketched below).
– Solution: build fault tolerance into the system.
● Commodity network = low bandwidth.
– Solution: push computation to the data.
● Programming distributed systems is hard.
– Solution: a data-parallel programming model: users write "map" & "reduce" functions, the system distributes work and handles faults.
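The failure arithmetic in the first bullet is worth spelling out. Assuming independent node failures, and approximating 3 years as roughly 1,000 days:

    MTBF_cluster ≈ MTBF_node / N ≈ 1000 days / 1000 nodes = 1 day

So at a thousand nodes, some machine fails roughly every day, which is why fault tolerance has to live in the software rather than the hardware.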
36. The Hadoop Ecosystem
● HDFS – distributed, fault-tolerant file system.
● MapReduce – framework for writing/executing distributed, fault-tolerant algorithms.
● Hive & Pig – SQL-like declarative languages.
● Sqoop – package for moving data between HDFS and relational DB systems.
● + Others…
[Diagram: BI reporting tools and the ETL/RDBMS layer connect through Hive & Pig and Sqoop to Map/Reduce and HBase, with Avro (serialization) and ZooKeeper alongside, all on top of HDFS.]
37. Reliable Storage is Essential
● Google GFS; Hadoop HDFS; Kosmix KFS: large, distributed, log-structured file systems that store all types of data.
● Provides a global file namespace.
● Typical usage pattern:
– huge files (100s of GB to TB)
– data is rarely updated in place
– reads and appends are common
● A new application coming online can use an existing GFS cluster or set up its own.
● The file system can be tuned to fit individual application needs.
http://highscalability.com/google-architecture
38. Distributed File System
● Chunk servers
– The file is split into contiguous chunks.
– Typically each chunk is 16-64 MB.
– Each chunk is replicated (usually 2x or 3x).
– Try to keep replicas in different racks.
● Master node
– a.k.a. the Name Node in HDFS.
– Stores metadata.
– Might be replicated.
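To ground this, a minimal sketch against the Hadoop FileSystem API, which hides the chunked, replicated design behind ordinary create/read calls; the cluster address and file path are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsLogWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical cluster address
            FileSystem fs = FileSystem.get(conf);

            // Write a large file once; HDFS splits it into replicated blocks ("chunks")
            Path log = new Path("/logs/events.log");
            FSDataOutputStream out = fs.create(log);
            out.writeBytes("event-1\n");
            out.close();

            // Reads and appends are the common access pattern; in-place updates are not
            System.out.println("Block size: " + fs.getFileStatus(log).getBlockSize());
        }
    }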
39. Now that you have storage, how would you manipulate this data?
● Why use MapReduce?
– A nice way to partition tasks across lots of machines.
– Handles machine failure.
– Works across different application types, like search and ads.
– You can pre-compute useful data, find word counts, sort TBs of data, etc.
– Computation can automatically move closer to the IO source.
40. Hadoop is the Apache implementation of MapReduce
● The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.
● The Apache Hadoop software library is a framework that allows:
– distributed processing of large data sets across clusters of computers using a simple programming model.
– It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
– Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
42. Word Count – Distributed Solution
[Diagram: each input split ("the quick brown fox", "the fox ate the mouse", "how now brown cow") goes to a Map task that emits (word, 1) pairs; shuffle & sort groups the pairs by word (e.g. the → [1, 1, 1], brown → [1, 1], fox → [1, 1]); Reduce tasks sum each group and emit the final counts: the 3, brown 2, fox 2, quick 1, ate 1, mouse 1, how 1, now 1, cow 1.]
43. Word Count in Map-Reduce

    // Mapper: emits (word, 1) for every token in the input line
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

    // Reducer: sums the 1s emitted for each word
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) { sum += val.get(); }
        result.set(sum);
        context.write(key, result);
    }
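For completeness, a driver in the style of the stock Hadoop word-count example. It assumes the map and reduce methods above are wrapped in Mapper/Reducer classes named TokenizerMapper and IntSumReducer (those names are assumptions, not from the deck):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);  // wraps map() above
            job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each node
            job.setReducerClass(IntSumReducer.class);   // wraps reduce() above
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }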
44. Pig and Hive
● Pig and Hive provide a wrapper to make it easier to write MapReduce jobs.
● The raw data is stored in Hadoop's HDFS.
● These scripting languages provide:
– ease of programming
– optimization opportunities
– extensibility
● Pig is a data-flow scripting language; Hive is a SQL-like language. (A Pig sketch follows below.)
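As a sketch of what such a wrapper looks like in practice, here is word count expressed in Pig Latin and submitted through Pig's Java PigServer API; the input and output paths are illustrative:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
        public static void main(String[] args) throws Exception {
            // Each registerQuery line is Pig Latin; PigServer compiles the
            // whole data flow down to MapReduce jobs on the cluster
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            pig.registerQuery("lines = LOAD '/data/input' AS (line:chararray);");
            pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
            pig.store("counts", "/data/wordcounts");
        }
    }

Compare this handful of lines with the Java mapper, reducer and driver above: that gap in programming effort is the point of the slide.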
45. Other Hadoop-related projects at Apache include:
● Avro™: A data serialization system.
● Cassandra™: A scalable multi-master database with no single points of failure.
● Chukwa™: A data collection system for managing large distributed systems.
● HBase™: A scalable, distributed database that supports structured data storage for large tables.
● Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
● Mahout™: A scalable machine learning and data mining library.
● Pig™: A high-level data-flow language and execution framework for parallel computation.
● ZooKeeper™: A high-performance coordination service for distributed applications.
http://hadoop.apache.org/
46. Powered by Hadoop
http://wiki.apache.org/hadoop/PoweredBy
● Facebook (more than 100 companies are listed)
– 1100-machine cluster with 8800 cores.
– Stores copies of internal log and dimension data sources; used as a source for reporting/analytics and machine learning.
● Yahoo
– Biggest cluster: 4000 nodes.
– Search Marketing, People You May Know, Search Assist, and many more.
● eBay
– 532-node cluster (8 × 532 cores, 5.3 PB).
– Used for search optimization and research.
47. Hadoop is not a relational database
● Hadoop is best suited for batch processing of large volumes of unstructured data:
– no schemas
– no indexes
– updates are pretty much absent
– not designed for joins
– no support for integrity constraints
– limited support for data analysis tools
But what are your data analysis needs?
48. Hadoop is not a relational database: if these are important, stick to an RDBMS
● OLTP
● Data integrity
● SQL
● Data independence
● Ad-hoc queries
● Complex relationships
● Maturity and stability
49. Do you need SQL and full relational systems? If not, consider NoSQL databases for your needs
Key-value, tabular, document and graph stores.
http://nosql-database.org/
50. The Key-Value In-Memory DBs
● In-memory DBs are simpler and faster than their on-disk counterparts.
● Key-value stores offer a simple interface with no schema: really a giant, distributed hash table.
● Often used as caches for on-disk DB systems.
● Advantages:
– relatively simple
– practically no server-to-server talk
– linear scalability
● Disadvantages:
– doesn't understand the data – no server-side operations; the key and value are always strings
– it's really meant to be only a cache – no more, no less
– no recovery, limited elasticity
51. Voldemort is a distributed key-value storage system
● Data is automatically
– replicated over multiple servers
– partitioned so each server contains only a subset of the total data
● Data items are versioned.
● Server failure is handled transparently.
● Each node is independent of other nodes, with no central point of failure or coordination.
● Support for pluggable data placement strategies, e.g. distribution across data centers that are geographically far apart.
● Good single-node performance: you can expect 10-20k operations per second, depending on the machines, the network, the disk system, and the data replication factor.
● Voldemort is not a relational database: it does not attempt to satisfy arbitrary relations while satisfying ACID properties. Nor is it an object database that attempts to transparently map object reference graphs. Nor does it introduce a new abstraction such as document-orientation.
● It is basically just a big, distributed, persistent, fault-tolerant hash table.
http://project-voldemort.com/
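A minimal client sketch in the spirit of Voldemort's documented quick-start, using its Java API; the bootstrap URL and store name are illustrative:

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;
    import voldemort.versioning.Versioned;

    public class VoldemortExample {
        public static void main(String[] args) {
            // Bootstrap from any node; cluster metadata tells the client where keys live
            StoreClientFactory factory = new SocketStoreClientFactory(
                    new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
            StoreClient<String, String> client = factory.getStoreClient("test");

            client.put("user:42", "anand");        // versioned write
            Versioned<String> value = client.get("user:42");
            System.out.println(value.getValue());  // prints "anand"
        }
    }

Note how the entire interface is get/put on a key: exactly the "big, distributed hash table" the slide describes.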
52. Tabular stores
● The original: Google's BigTable.
– Proprietary, not open source.
● The open source elephant alternative: Hadoop with HBase.
● A top-level Apache project.
● Large number of users.
● Contains a distributed file system, MapReduce, a database server (HBase), and more.
● Rack aware.
53. What is Google's BigTable?
● BigTable is a large-scale, fault-tolerant, self-managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
● BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database; it doesn't support joins or SQL-type queries.
● It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.
● Commercial databases simply don't scale to this level, and they don't work across thousands of machines.
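The open-source counterpart in code: a minimal sketch against the HBase client API of that era (key-based put and get, no joins, no SQL); the table, column family and row key are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "tweets");   // hypothetical table

            // Row-key lookup is the primary access path
            Put put = new Put(Bytes.toBytes("user42#1354924800"));
            put.add(Bytes.toBytes("t"), Bytes.toBytes("text"), Bytes.toBytes("hello"));
            table.put(put);

            Result row = table.get(new Get(Bytes.toBytes("user42#1354924800")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("t"), Bytes.toBytes("text"))));
            table.close();
        }
    }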
54. Document Stores
● As the name implies, these databases store documents.
● Usually schema-free: the same database can store documents with different structures.
● Allow indexing based on document content.
● Prominent examples: CouchDB, MongoDB.
55. Why MongoDB?
● Document-oriented
– Documents (objects) map nicely to programming language data types.
– Embedded documents and arrays reduce the need for joins.
– Dynamically typed (schemaless) for easy schema evolution.
– No joins and no multi-document transactions, for high performance and easy scalability.
● High availability
– Replicated servers with automatic master failover.
● Rich query language
● Easy scalability
– Automatic sharding (auto-partitioning of data across servers).
– Eventually-consistent reads can be distributed over replicated servers.
● High performance
– No joins, and embedding makes reads and writes fast.
– Indexes, including indexing of keys from embedded documents and arrays.
– Optional streaming writes (no acknowledgements).
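A minimal document-store sketch using the MongoDB Java driver of that era; database, collection and field names are illustrative:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class MongoExample {
        public static void main(String[] args) throws Exception {
            MongoClient mongo = new MongoClient("localhost", 27017);
            DB db = mongo.getDB("crm");                  // hypothetical database
            DBCollection customers = db.getCollection("customers");

            // Schemaless insert: an embedded array instead of a join table
            DBObject doc = new BasicDBObject("name", "Tesco")
                    .append("tags", new String[] { "retail", "uk" });
            customers.insert(doc);

            DBObject found = customers.findOne(new BasicDBObject("name", "Tesco"));
            System.out.println(found);
            mongo.close();
        }
    }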
56. Mapping Systems to the CAP Theorem
● Consistency: all clients have the same view of the data.
● Availability: each client can always read and write.
● Partition tolerance: the system works well despite physical network partitions.
CA: RDBMS (MySQL, Postgres, etc.), AsterData, Greenplum, Vertica.
CP: BigTable, Hypertable, HBase, MongoDB, Terrastore, Scalaris, BerkeleyDB, MemcachedDB, Redis.
AP: Dynamo, Cassandra, Voldemort, SimpleDB, Tokyo Cabinet, CouchDB, KAI, Riak.
57. NoSQL use cases: important to align the data model to the requirements
● Bigness
● Massive write performance
● Fast key-value access
● Flexible schema and flexible data types
● Schema migration
● Write availability
● No single point of failure
● Generally available parallel computing
● Ease of programming
62. The Path to Data Stack 3.0: must support Variety, Volume and Velocity
● Data Stack 1.0 (Relational Database Systems): recording business events; highly normalized data; GBs of data; end-user access through enterprise apps; structured data.
● Data Stack 2.0 (Enterprise Data Warehouse): support for decision making; un-normalized dimensional model; TBs of data; end-user access through reports; structured data.
● Data Stack 3.0 (Dynamic Data Platform): uncovering key insights; schema-less approach; PBs of data; direct end-user access; structured + semi-structured data.
63. Can Data Stack 3.0 Address Real Problems?
● Large data volume at low price.
● Diverse data beyond structured data.
● Queries that are difficult to answer.
● Answers to queries that no one dared ask.
66. Persistent Systems has an experienced team of Big Data experts that has created the technology building blocks to help you implement a Big Data solution that offers a direct path to unlock the value in your data.
67. Big Data Expertise at Persistent
● 10+ projects executed with leading ISVs and enterprise customers.
● Dedicated group for MapReduce, Hadoop and the Big Data ecosystem (formed 3 years ago).
● Engaged with the Big Data ecosystem, including leading ISVs and experts.
● Preferred Big Data services partner of IBM and Microsoft.
68. Big Data Leadership and Contributions
● Code contributions to Big Data open source projects, including Hadoop, Hive, and SciDB.
● Dedicated Hadoop cluster in Persistent.
● Created PeBAL – the Persistent Big Data Analytics Library.
● Created a visual programming environment for Hadoop.
● Created data connectors for moving data.
● Pre-built solutions to accelerate Big Data projects.
69. Persistent's Big Data Offerings
1. Setting up and maintaining a Big Data platform.
2. Data analytics on a Big Data platform.
3. Building applications on Big Data.
[Diagram – Technology assets: pre-built industry solutions (retail banking, telco), pre-built horizontal solutions (email, text, IT analytics, …), visual programming tools, and platform enhancement IP (PeBAL analytics library, data connectors), all on a foundational infrastructure and platform built upon selected 3rd-party Big Data platforms and technologies and a cluster of commodity hardware. People assets: custom Big Data services, extension of your team, discovery workshops, training for your team, and methodology (team formation process, cluster sizing/config).]
70. Persistent Next Generation Data Architecture
[Diagram: a connector framework ingests data from email servers, media servers, web proxies, IBM Tivoli, BBCA, social sources (Twitter, Facebook via a social connector), RDBMSs and the data warehouse into a MapReduce and HDFS cluster. On top sit the Persistent Analytics Library (PeBAL: graph, set and text analytics functions), text analytics (GATE/SystemT), Hive, PIG/JAQL, NoSQL stores, cluster monitoring and an admin app, feeding BI tools, reports & alerts, and solutions. The components mix Persistent IP, commercial/open source products and external data sources.]
71. Persistent Big Data Analytics Library
WHY PEBAL
• Lots of common problems – not all of them are solved in Map Reduce.
• PigLatin, Hive and JAQL are languages, not libraries – something is needed to run on top that is not tied to SQL-like interfaces.
FEATURES
• Organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms.
• PeBAL functions are schema agnostic.
• All PeBAL functions are tried and tested against well-defined use cases.
BENEFITS OF A READY-MADE SOLUTION
• Proven – well written and tested.
• Reuse across multiple applications.
• Quicker implementation of map reduce applications.
• High performance.
72. [Diagram: PeBAL function areas – web analytics, text analytics, inverted lists, sets, graphs, statistics.]
73. Visual Programming Environment
ADOPTION BARRIERS
• Steep learning curve.
• Difficult to code.
• Ad-hoc reporting can't always be done by writing programs.
• Limited tooling available.
VISUAL PROGRAMMING ENVIRONMENT
• Use a standard ETL tool as the UI environment for generating PIG scripts.
BENEFITS
• ETL tools are widely used in enterprises.
• Can leverage a large pool of skilled people who are experts in ETL and BI tools.
• The UI helps in iterative and rapid data analysis.
• More people will start using it.
74. Visual Programming Environment for Hadoop
[Diagram: data sources feed an ETL tool with a data-flow UI; the tool's metadata goes to a PIG converter that generates PIG code, backed by a PIG UDF library, and the code runs against HDFS/Hive data on the Big Data platform. The PIG converter and UDF library are Persistent IP.]
75. Persistent Connector Framework
WHY A CONNECTOR FRAMEWORK
• Pluggable architecture.
OUT OF THE BOX
• Database, data warehouse
• Microsoft Exchange
• Web proxy
• IBM Tivoli
• BBCA
• Generic push connector for *any* content
FEATURES
• Bi-directional connector (as applicable).
• Supports push/pull mechanisms.
• Stores data on HDFS in an optimized format.
• Supports masking of data.
77. Persistent's Breadth of Big Data Capabilities
● Horizontal and vertical pre-built solutions.
● Big Data platform (PeBAL) analytics libraries and connectors.
● Tooling: RDBMS/DWH import/export of data; text analytics libraries; data visualization using Web 2.0 and reporting tools (Cognos, MicroStrategy); ecosystem tools like Nutch, Katta, Lucene.
● IT management: job configuration, management and monitoring with BigInsights' job scheduler (MetaTracker); job failure and recovery management.
● Big Data application programming: deep JAQL expertise (JAQL programming, extending JAQL using UDFs, integration of third-party tools/libraries, performance tuning, ETL using JAQL); expertise in MR programming (PIG, Hive, Java MR); deep expertise in text analytics (IBM's text extraction solution, AQL + SystemT); statistical analytics (R, SPSS, BigInsights integration with R).
● Distributed file systems: HDFS, IBM GPFS.
● Cluster layer: platform setup on multi-node clusters, monitoring, VM-based setup, product deployment.
● Persistent IP for Big Data solutions on top of Big Data platform components.
78. Persistent Roadmap to Big Data
1. Learn: discover and define use cases.
2. Initiate: validate with a POC.
3. Scale: upgrade to production if successful.
4. Measure: measure effectiveness and business value.
5. Manage: improve the knowledge base and shared Big Data platform.
79. Customer Analytics: identifying your most influential customers
[Diagram: build a social graph of all 70 million customers from more than 1 billion transactions over twenty years; overlay sales data on the graph; identify influential customers using network analysis (a few thousand influential customers); target these customers for promotions.]
Targeting influential customers is the best way to improve campaign ROI!
80. Overview of Email Analytics
● Key business needs
– Ensure compliance with respect to a variety of business and IT communications and information-sharing guidelines.
– Provide an ongoing analysis of customer sentiment through email communications.
● Use cases
– Quickly identify whether there has been an information breach, or whether information is being shared in ways that are not in compliance with organizational guidelines.
– Identify whether a particular customer is not being appropriately managed.
● Benefits
– Ability to proactively manage email analytics and communications across the organization in a cost-effective way.
– Reduced response time to manage a breach, and proactive handling of issues that emerge through ongoing analysis of email.
81. Using Email to Analyze Customer Sentiment
Sense the mood of your customers through their emails. Carry out detailed analysis of customer-team interactions and response times.
82. Analyzing Prescription Data
1.5 million patients are harmed by medication errors every year. Identifying erroneous prescriptions can save lives!
Source: Center for Medication Safety & Clinical Improvement
83. Overview of IT Analytics
● Key business needs
– Troubleshooting issues in the world of advanced and cloud-based systems is highly complex, requiring analysis of data from various systems.
– Information may be in different formats, locations, granularities and data stores.
– System outages have a negative impact on short-term revenue, as well as long-term credibility and reliability.
– The ability to quickly identify whether a particular system is unstable and take corrective action is imperative.
● Use cases
– Identify security threats and isolate the corresponding external factors quickly.
– Identify whether an email server is unstable, determine the priority, and take preventative action before a complete failure occurs.
● Benefits
– Reduced maintenance cost.
– Higher reliability and SLA compliance.
84. Consumer Insight from Social Media
Find out what customers are saying about your organization or product in social media.
85. Insights for Satyamev Jayate – variety of sources
[Diagram: 1. Structured analysis – web/TV viewer responses to the pledge and multiple-choice questions over the web, SMS, IVR and social widgets. 2. Unstructured analysis – responses to open-ended prompts ("Share your story", "Ask a question to Aamir", "Send a message of hope", "Share your solution") from the web, emails, IVR/calls, individual blogs, social media and videos, processed by the Content Filtering Rating Tagging System (CFRTS) with L0, L1, L2 phased analytics. 3. Impact analysis – crawling the general internet to measure the before & after scenario on a particular topic.]
86. Rigorous weekly operation cycle producing instant analytics
A killer combo of human + software to analyze the data efficiently.
[Cycle: the topic opens on Sunday → data is captured from SMS, phone calls, social media and the website → the system runs L0 analysis, then L1 and L2 while analysts continue → JSONs are created for the external and internal dashboards → featured content is delivered three times a day throughout the week → episode tags are refined and messages are re-ingested for another pass → a live analytics report is sent during the show.]
88. Thank you
Anand Deshpande (anand@persistent.co.in)
http://in.linkedin.com/in/ananddeshpande
Persistent Systems Limited
www.persistentsys.com
89. Enterprise Value is Shifting to Data
[Timeline, 1975-2013: value moves up the stack from mainframes (1975) to operating systems, databases (1985), ERP (1995), apps (2006), and finally data (2013).]
Editor's Notes
LONDON — Britain often conjures images of unpredictable weather, with downpours sometimes followed by sunshine within the same hour — several times a day. Such randomness has prompted Tesco, the country’s largest grocery chain, to create its own weather team in hopes of better forecasting temperatures and how consumer demand changes with them. After three years of research, the six-person team has created its own software that calculates how shopping patterns change “for every degree of temperature and every hour of sunshine,” Tesco said last month. “Rapidly changing weather can be a real challenge,” Jonathan Church, a Tesco spokesman, said in a statement. “The system successfully predicted temperature drops during July that led to a major increase in demand for soup, winter vegetables and cold-weather puddings.”
Big data: Embracing the elephant in the room (case study – GlaxoSmithKline), by Steve Hemsley. GlaxoSmithKline's (GSK) CRM consultant James Parker believes the benefits of big data must be sold into every department within an organisation, rather than being seen purely as a marketing function. GSK, which owns brands from Sensodyne to Lucozade, is already tracking consumers online and repurposing the data to benefit particular brands. "Consumers are providing signals of what they are doing, seeing, thinking and sharing at a point in time and this is valuable to help drive relationships," he says. The company is working with St Ives Group's data management division Occam, which says GSK is aiming to build direct relationships with 1 million consumers in a year using social media as a base for research and multichannel marketing. Targeted offers and promotions will drive people to particular brand websites where external data is integrated with information already held by the marketing teams. "We identify people who are mentioning a particular brand of ours but also track everything else they are talking about in the public domain to build a customer profile," says Parker.
Poor use cases:
- OLTP. Outside VoltDB, complex multi-object transactions are generally not supported. Programmers are supposed to denormalize, use documents, or use other complex strategies like compensating transactions.
- Data integrity. Most of the NoSQL systems rely on applications to enforce data integrity, where SQL uses a declarative approach. Relational databases are still the winner for data integrity.
- Data independence. Data outlasts applications. In NoSQL, applications drive everything about the data. One argument for the relational model is as a repository of facts that can last for the entire lifetime of the enterprise, far past the expected lifetime of any individual application.
- SQL. If you require SQL, then very few NoSQL systems will provide a SQL interface, but more systems are starting to provide SQL-ish interfaces.
- Ad-hoc queries. If you need to answer real-time questions about your data that you can't predict in advance, relational databases are generally still the winner.
- Complex relationships. Some NoSQL systems support relationships, but a relational database is still the winner at relating.
- Maturity and stability. Relational databases still have the edge here. People are familiar with how they work, what they can do, and have confidence in their reliability. There are also more programmers and toolsets available for relational databases. So when in doubt, this is the road that will be traveled.
- Bigness. Ability to support large volumes of data at a reasonable price on commodity hardware.
- Massive write performance. This is probably the canonical usage, based on Google's influence. High volume: Facebook needs to store 135 billion messages a month. Twitter, for example, has the problem of storing 7 TB of data per day; at 80 MB/s it takes a day to store 7 TB, so writes need to be distributed over a cluster.
- Fast key-value access. Hashing on a key in main memory is the fastest way to access a value.
- Flexible schema and flexible datatypes. NoSQL products support a whole range of new data types: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program- and programmer-friendly compatible datatypes like JSON.
- Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic, because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema.
- Write availability. Do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency and all that jazz.
- Easier maintainability, administration and operations. This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs, as special code doesn't have to be written to scale a system that was never intended to be used that way.
- No single point of failure. Not every product is delivering on this, but we are seeing a definite convergence on relatively easy-to-configure and easy-to-manage high availability, with automatic load balancing and cluster sizing. A perfect cloud partner.
- Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future.
- Programmer ease of use. Accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, JavaScript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer-led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc.; programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue: if it costs a lot to scale a product, then won't you go with the cheaper product, that you control, that's easier to use, and that's easier to scale?
- Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and a solution.
- Microsoft: We have built the SQL Server to Hadoop connector, which has been in production; we are also testing MongoDB on Azure.
- Salesforce Radian6: Done some Salesforce integration work for them, I think.
- EMC: We have done some work on Greenplum, but that was before the buyout – we are trying to start an engagement with them. They are a partner, though, on the content side (Documentum).
- HBase: Several projects have used HBase – email analytics, American Airlines.
- MongoDB: Performance testing on Azure; data store for a social media analytics platform (Content Savvy); multi-media data store for Qyuki.
- PIG: Used heavily at Yahoo.
- SciDB: Doing QA and code contributions; we will be using it on Vo as well.
- AsterData: Done some BI work for them.
Persistent has intellectual assets from the very core of the stack up through the solution-specific layers.
- Big Data consulting: roadmap for deploying Big Data in the enterprise; workshops for use case discovery; sample proofs of concept.
- Hadoop programming: developing custom applications to fit the MR paradigm exposed by Hadoop; porting existing applications to utilize MR on Hadoop; tuning existing MR jobs to improve performance.
- Big Data application development: knowledge of tools based on the Hadoop platform (Nutch, Hive, Pig, JAQL); experience building applications using these tools; writing connectors for loading data onto HDFS.
PeBAL is Persistent's big data analytics library. It enables quicker implementation of map reduce applications. Organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms. PeBAL functions are schema agnostic. All PeBAL functions are tried and tested against well-defined use cases. There are lots of common problems, many of them solved – but not in Map Reduce. Don't reinvent the wheel: use a ready-made solution that is well written, well tested, well used in multiple applications, and performs well. Would PigLatin, Hive or JAQL solve this for you? No – all of them are programming languages, not libraries, and their language features are very tied to a SQL-like interface. We need something on top of them.
A large US airline wanted to identify influencers for targeting marketing programs, as well as for user segmentation.
[Too Tedious to Define and Answer] Could be extended to ticketing systems, call center recordings, live communities. Areas where this could help: customer service, training, product management.
[Data Available but not Warehoused] 75 thousand people die in the US because of correctable flaws in the health care system. The data is in unstructured format, the volume is huge, and the analysis has to be done quickly – the solution was built using IBM BigInsights.
[Don't believe it's possible] Intended users: marketing team, product development. Requires: Persistent's Datasift-to-BigInsights connector, Persistent Big Data Analytics Library (PeBAL), Persistent Text Analytics Framework, IBM BigInsights. Compatible with: Twitter, Facebook. Data volumes: in terabytes. Also work on Cognos Consumer Insight: a social media monitoring and analytics platform that supports use cases like brand reputation management, campaign tracking and user intent tracking on data from social media as well as internet sources like blogs, news and message forums. Technology used: Hadoop (IBM BigInsights), JAQL, text analytics using SystemT. Domain: digital marketing, social media. Business need for using Big Data: cost effective for large datasets; ability to handle unstructured/semi-structured data.