Practical Hadoop by Example: For Relational Database Professionals
Alex Gorbachev
12-Mar-2013
New York, NY
Alex Gorbachev
Chief Technology Officer at Pythian
Blogger
OakTable Network member
Oracle ACE Director
Founder of BattleAgainstAnyGuess.com
Founder of Sydney Oracle Meetup
IOUG Director of Communities
EVP, Ottawa Oracle User Group
© 2012 Pythian
Why Companies Trust Pythian
Recognized Leader:
Global industry leader in remote database administration services and consulting for Oracle, Oracle Applications, MySQL, and SQL Server
Works with over 150 multinational companies, such as Forbes.com, Fox Interactive Media, and MDS Inc., to help manage their complex IT deployments
Expertise:
One of the world's largest concentrations of dedicated, full-time DBA expertise
Global Reach & Scalability:
24/7/365 global remote support for DBA and consulting, systems administration, special projects, or emergency response
Agenda
What is Big Data?
What is Hadoop?
Hadoop use cases
Moving data in and out of Hadoop
Avoiding major pitfalls
What is Big Data?
Doesn't matter.
What Does Matter?
Given enough skill and money, Oracle can do anything.
Let's talk about efficient solutions.
When Does an RDBMS Make No Sense?
Storing images and video
Processing images and video
Storing and processing other large files
PDFs, Excel files
Processing large blocks of natural language text
Blog posts, job ads, product descriptions
Processing semi-structured data
CSV, JSON, XML, log files
Sensor data
When Does an RDBMS Make No Sense?
Ad-hoc, exploratory analytics
Integrating data from external sources
Data cleanup tasks
Very advanced analytics (machine learning)
New Data Sources
Blog posts
Social media
Images
Videos
Logs from web applications
Sensors
Big Problems with Big Data
It is:
Unstructured
Unprocessed
Un-aggregated
Un-filtered
Repetitive
Low quality
And generally messy.
Technical Challenges
Storage capacity, storage throughput → scalable storage
Pipeline throughput, processing power → parallel processing (massively parallel processing)
System integration
Data analysis → ready-to-use tools
Big Data Solutions
What is Hadoop?
Hadoop Principles
Bring code to the data; share nothing
Hadoop in a Nutshell
HDFS architecture (simplified view)
Files are split into large blocks
Each block is replicated on write
Files can only be created and deleted by one client
Uploading new data? => new file
Append supported in recent versions
Update data? => recreate the file
No concurrent writes to a file
Clients transfer blocks directly to and from data nodes
Data nodes use cheap local disks
Local reads are efficient
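The block-and-replica model above can be sketched in a few lines of Python. This is an illustrative simulation, not actual HDFS code: the block size, node count, and round-robin placement are assumptions (real HDFS placement is rack-aware).

```python
# Illustrative sketch of HDFS-style block placement (not real HDFS code).
# A file is split into fixed-size blocks; each block is written to
# `replication` distinct data nodes.
import itertools

def place_blocks(file_size_mb, block_size_mb, data_nodes, replication=3):
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    nodes = itertools.cycle(range(data_nodes))
    placement = []
    for _ in range(num_blocks):
        # pick `replication` distinct nodes round-robin
        replicas = [next(nodes) for _ in range(replication)]
        placement.append(replicas)
    return placement

# A 350 MB file with 128 MB blocks => 3 blocks, each on 3 of 6 nodes
layout = place_blocks(350, 128, data_nodes=6, replication=3)
```

Because each block lands on several nodes, any single-disk failure leaves the file fully readable, which is why cheap local disks are good enough.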
HDFS design principles
MapReduce example: histogram calculation
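The histogram flow can be mimicked in plain Python: the mapper emits (key, 1) pairs, the shuffle groups intermediate pairs by key, and the reducer sums each group. A local sketch of the MapReduce stages only; the sample records and the choice of word length as the histogram key are illustrative, and in Hadoop the shuffle happens over the network between nodes.

```python
from collections import defaultdict

def mapper(record):
    # emit (key, 1) for the value being histogrammed, e.g. word length
    for word in record.split():
        yield (len(word), 1)

def shuffle(mapped):
    # group intermediate pairs by key (Hadoop does this across the network)
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # sum the counts for one key
    return (key, sum(values))

records = ["hadoop stores big data", "mapreduce counts things"]
mapped = [pair for r in records for pair in mapper(r)]
histogram = dict(reducer(k, v) for k, v in shuffle(mapped))
# histogram maps word length -> number of words of that length
```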
MapReduce pros & cons
Advantages:
Very simple
Flexible
Highly scalable
Good fit for HDFS: mappers read locally
Fault tolerant
Pitfalls:
Low efficiency
Lots of intermediate data
Lots of network traffic on shuffle
Complex manipulation requires a pipeline of multiple jobs
No high-level language
Only mappers leverage local reads on HDFS
Main components of Hadoop ecosystem
Hive: HiveQL, an SQL-like query language
Generates MapReduce jobs
Pig: a data set manipulation language (like creating your own query execution plan)
Generates MapReduce jobs
ZooKeeper: distributed cluster manager
Oozie: workflow scheduler service
Sqoop: transfers data between Hadoop and relational databases
Non-MR processing on Hadoop
HBase: column-oriented key-value store (NoSQL)
SQL without MapReduce:
Impala (Cloudera)
Drill (MapR)
Phoenix (Salesforce.com)
Hadapt (commercial)
Shark (Spark): in-memory analytics on Hadoop
Hadoop Benefits
Reliable solution based on unreliable hardware
Designed for large files
Load data first, structure later
Designed to maximize throughput of large scans
Designed to leverage parallelism
Designed to scale
Flexible development platform
Solution Ecosystem
Hadoop Limitations
Hadoop is scalable but not fast
Some assembly required
Batteries not included
Instrumentation not included either
DIY mindset (remember MySQL?)
How much does it cost?
$300K DIY on SuperMicro
100 data nodes
2 name nodes
3 racks
800 Sandy Bridge CPU cores
6.4 TB RAM
600 x 2TB disks
1.2 PB of raw disk capacity
400 TB usable (triple mirror)
Open-source software
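The capacity figures above follow from simple arithmetic: triple replication divides raw capacity by three. A quick back-of-envelope check:

```python
# Back-of-envelope check of the cluster capacity figures above
disks = 600
disk_tb = 2          # 2 TB per disk
replication = 3      # triple replication ("triple mirror")

raw_tb = disks * disk_tb            # 1200 TB = 1.2 PB raw
usable_tb = raw_tb // replication   # 400 TB usable after replication
```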
Hadoop Use Cases
Use Cases for Big Data
Top-line contributions:
Analyze customer behavior
Optimize ad placements
Customized promotions, etc.
Recommendation systems
Netflix, Pandora, Amazon
Improve connection with your customers
Know your customers' patterns and responses
Bottom-line contributions:
Cheap archive storage
ETL-layer transformation engine, data cleansing
Typical Initial Use Cases for Hadoop in Modern Enterprise IT
Transformation engine (part of ETL)
Scales easily
Inexpensive processing capacity
Any data source and destination
Data Landfill
Stop throwing away any data
Don't know how to use the data today? Maybe tomorrow you will
Hadoop is very inexpensive, yet very reliable
Advanced: Data Science Platform
A data warehouse is good when the questions are known and the data domain and structure are defined
Hadoop is great for seeking new meaning in data, new types of insights
Unique information parsing and interpretation
Huge variety of data sources and domains
Pythian Internal Hadoop Use
OCR of screen video capture from Pythian's privileged-access surveillance system
Input: raw frames from video capture
A MapReduce job runs OCR on the frames and produces text
A MapReduce job identifies text changes from frame to frame and produces a text stream with a timestamp of when each text was on screen
Other MapReduce jobs mine the text (and keystrokes) for insights:
Credit card patterns
Sensitive commands (like DROP TABLE)
Root access
Unusual activity patterns
Merge with monitoring and documentation systems
Hadoop in the Data Warehouse
Use Cases and Customer Stories
ETL for Unstructured Data
(diagram: unstructured sources → Hadoop → DWH → BI, batch reports)
ETL for Structured Data
(diagram: OLTP sources (Oracle, MySQL, Informix) flow via Sqoop and Perl into Hadoop for transformation, aggregation, and long-term storage, then into the DWH for BI and batch reports)
Bring the World into Your Datacenter
Rare Historical Report
Find Needle in Haystack
Hadoop for Oracle DBAs?
alert.log repository
listener.log repository
Statspack/AWR/ASH repository
trace repository
DB Audit repository
Web logs repository
SAR repository
SQL and execution plans repository
Database jobs execution logs
Connecting the (big) Dots
(diagram: Sqoop moves data between the RDBMS and Hadoop; queries run against both)
Sqoop Import is Flexible
SELECT <columns> FROM <table> WHERE <condition>
Or <write your own query>
Split column
Parallel
Incremental
File formats
Sqoop Import Examples
sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
  --username hr --table emp \
  --where "start_date > '01-01-2012'"

sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
  --username myuser --table shops \
  --split-by shop_id --num-mappers 16

Note: the split-by column must be indexed or partitioned to avoid 16 full table scans.
Less Flexible Export
100-row batch inserts
Commit every 100 batches
Parallel export
Merge vs. insert
Example:
sqoop export --connect jdbc:mysql://db.example.com/foo \
  --table bar --export-dir /results/bar_data
FUSE-DFS
Mount HDFS on Oracle server:
sudo yum install hadoop-0.20-fuse
hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>
Use external tables to load data into Oracle
File Formats may vary
All ETL best practices apply
Oracle Loader for Hadoop
Load data from Hadoop into Oracle
Map-Reduce job inside Hadoop
Converts data types, partitions and sorts
Direct path loads
Reduces CPU utilization on database
NEW:
Support for Avro
Support for compression codecs
Oracle Direct Connector for HDFS
Create external tables over files in HDFS
PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
All the features of external tables
Tested (by Oracle) as 5 times faster (GB/s) than FUSE-DFS
Oracle SQL Connector for HDFS
Map-Reduce Java program
Creates an external table
Can use Hive Metastore for schema
Optimized for parallel queries
Supports Avro and compression
How not to Fail
Keep Data That Belongs in an RDBMS
Prepare for Migration
Use Hadoop Efficiently
Understand your bottlenecks: CPU, storage, or network?
Reduce use of temporary data: it all travels over the network and is written to disk in triplicate
Eliminate unbalanced workloads
Offload work to the RDBMS
Fine-tune optimization with MapReduce
Your Data
is NOT
as BIG
as you think
Getting started
Analytics track:
Pick a business problem
Acquire data
Get the tools: Hadoop, R, ETL, Hive, Pig, Tableau
Get a platform: can start cheap
Analyze data
Need data analysts, a.k.a. data scientists

Operational track:
Pick an operational problem
Data store
Get the tools: Hadoop, Sqoop, Hive, Pig, Oracle connectors
Get a platform: ops-suitable
Operational team
Continue Your Education
www.collaborate13.ioug.org
Thank you & Q&A
To contact us
sales@pythian.com
1-866-PYTHIAN
To follow us
http://www.pythian.com/news/
http://www.facebook.com/pages/The-Pythian-Group/
http://twitter.com/pythian
http://www.linkedin.com/company/pythian