Getting to 1.5M Ads/sec: How DataXu manages Big Data

Getting to 1.5M Ads per Second
How DataXu Manages Big Data
AWS, DataXu, Qubole
March 30th, 2015

Today’s speakers
Yekesa Kosuru, VP of Engineering, DataXu
Ashish Dubey, Solutions Architect, Qubole
Scott Ward, Solutions Architect, AWS

Agenda
• AWS: Big Data, Technologies & Techniques for working
productively with Data at any scale
• Qubole: Big Data Delivered as a Service
• DataXu: Leveraging Big Data to Understand & Engage Customers

Housekeeping
• A link to the recording will be emailed to all registrants next week, after the webinar
• Please submit your questions and comments using the Chat with Presenters box at the bottom left corner of your screen

Big Data
Technologies and techniques for working productively with data, at any scale.

Creating Value from Data Assets
Recommendations, Collective Intelligence
Machine Learning
Visualization
Dashboards
Business Intelligence
Measuring Functionality and Services
Ad Hoc Queries
A/B Testing
Hypothesis Testing & Predictions
Statistical Analysis
Learning from Social Media Conversations
Sentiment Analysis

Big Data Lifecycle
• Generation
• Collection & storage
• Analytics & computation
• Collaboration & sharing

Big Data + AWS
Big Data | AWS Cloud
Potentially massive data sets | Massive, virtually unlimited capacity
Iterative, experimental style of data manipulation and analysis | Iterative, experimental style of infrastructure deployment/usage
Frequently not a steady-state workload; peaks and valleys | Efficient with highly variable workloads
Time to results is key | Parallel compute clusters from single data source
Hard to configure/manage | Managed services for data storage and analysis

AWS Data Services
Data
• Variety: Structured, Unstructured, Text, Binary
• Volume: Gigabytes, Terabytes, Petabytes
• Velocity: Millisecond, Second, Minute, Hour, Day
Services
• EC2, EBS: Instance Storage
• Redshift, RDS: SQL Stores
• EMR: Hadoop
• DynamoDB: NoSQL
• Kinesis: Stream
• S3, CloudFront, Glacier: Storage Services
• ElastiCache: Caching
• Data Pipeline: Orchestrate

Amazon Elastic MapReduce (EMR)
Hosted Hadoop Framework
• Easy to use and fully managed
• Secure
• Resizable clusters to support processing needs
• Support for EC2 Spot Instances
• Use many query tools to support analysis of your data
– Hive, Pig, HBase, Spark, BI tools, etc.
• EMRFS for an S3-backed data store
• Direct integration with other AWS data stores
– S3, Redshift, DynamoDB
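
As a rough illustration of resizable clusters with Spot support, the sketch below builds a list of instance group descriptions in the shape accepted by the `Instances.InstanceGroups` parameter of the boto3 EMR `run_job_flow` call. The function name, instance type, node counts, and bid price are assumptions made for this example and do not come from the talk; the function only builds the data structure and does not call AWS.

```python
from typing import Dict, List


def build_instance_groups(core_count: int, task_count: int,
                          instance_type: str = "m3.xlarge",
                          spot_bid_usd: str = "0.10") -> List[Dict]:
    """Describe master, core, and task instance groups for an EMR cluster.

    Instance type, counts, and bid price are illustrative assumptions.
    """
    groups = [
        # One on-demand master node coordinates the cluster.
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes are kept on-demand in this sketch.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes add compute only, so this sketch requests them as Spot.
        groups.append(
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "BidPrice": spot_bid_usd, "InstanceType": instance_type,
             "InstanceCount": task_count})
    return groups


groups = build_instance_groups(core_count=2, task_count=4)
```

Keeping Spot capacity in the task group means that losing Spot instances reduces compute capacity without removing the on-demand core nodes.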

Amazon EMR Architecture
Master instance group
Core instance group
Task instance group
HDFS
Amazon S3
Amazon Redshift
Amazon DynamoDB

EMR Security
• Security groups for master and slave instances
• Instances launch in your VPC
• Encrypt data in S3
• Control who can access S3 data
• API requests require a signed key

Amazon Redshift
Petabyte Scale Data Warehouse
• Fully managed data warehouse solution
• Able to achieve petabyte scale at $1,000 per TB per year
• Integrates with existing data warehouse tools
• Scales through columnar storage and parallel query execution
• Data load directly from S3
• Integration with Amazon EMR
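
To make "data load directly from S3" concrete, here is a minimal sketch that assembles a Redshift `COPY` statement for gzip-compressed CSV files. The table name, bucket path, and IAM role ARN are hypothetical placeholders, and the helper itself is an illustration, not something shown in the talk. It does no escaping, so it is only suitable for trusted, hard-coded values.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads gzip-compressed CSV from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "GZIP;"
    )


# All identifiers below are hypothetical examples.
statement = build_copy_statement(
    table="impressions",
    s3_prefix="s3://example-bucket/impressions/2015/03/30/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole",
)
print(statement)
```

The resulting statement would be run through any SQL client connected to the cluster's JDBC/ODBC endpoint.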

Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB, Amazon EMR, Amazon S3, HDFS/SSH
• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB
[Diagram labels: JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]

Amazon Redshift Security
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM/CloudHSM
• No direct access to compute nodes
• Amazon VPC support
[Diagram labels: Customer VPC, Internal Security Group, JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]

2014 Usage Statistics for Qubole on AWS:
• Total QCUH processed in 2014 = 40.6 million
• Total nodes managed in 2014 = 2.5 million
• Total PB processed in 2014 = 519
Users: Operations Analyst, Marketing Ops Analyst, Data Architects, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers
Platform: Developer Tools, Service Management, Data Workbench, Cloud Data Platform, BI & DW Systems
• SDK
• API
• Analysis
• Security
• Job Scheduler
• Data Governance
• Analytics templates
• Monitoring
• Support
• Collaboration
• Workflow & Map/Reduce
• Auto Scaling
• Cloud Optimization
• Data Connectors
Hadoop Ecosystem (Apache Open Source)
• YARN
• Presto & Hive
• Spark & Pig

Qubole Cluster Setup with AWS Credentials

Agenda Slide
• AWS: Big Data, Technologies & Techniques for working
productively with Data at any scale
• Qubole: Big Data Delivered as a Service
• DataXu: Leveraging Big Data to Understand & Engage Customers

DataXu Introduction
• Disruptive on-demand software platform relied upon by the world’s leading brands
• A petabyte scale marketing cloud that enables Fortune 500 brands to manage data, insight and action to maximize Marketing ROI
• The industry’s #1 rated programmatic marketing technology, spun out of MIT by the founders
• One of the fastest growing companies in the Inc. 500

DataXu Quick Statistics
Big data + Real time decisions
Big Data Processing
• 13 petabytes of data
• 20 terabytes/day consumer data intake
Real-Time Decisioning
• 42 billion decisions per second
• 1,500,000 inbound queries per second
• Dozens of algorithms across mobile, social, native, display, video and TV
Predictive Modeling
• Executing 10,000+ investments simultaneously
• 10M variables considered per investment decision using next gen machine learning
Enterprise-Cloud Infrastructure
• 14 data centers
• 35,000+ CPU cores
Patent portfolio for real-time decision systems
Exclusive license from MIT to Algebra Of Systems IPR
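
To put two of these figures on a common scale, the short calculation below converts the inbound query rate to a daily total and the daily data intake to a sustained per-second rate. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly even load across the day, which real traffic will not have; the helper names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_day(rate_per_second: int) -> int:
    """Scale a steady per-second rate to a full day."""
    return rate_per_second * SECONDS_PER_DAY


def bytes_per_second(terabytes_per_day: float) -> float:
    """Convert a daily volume in decimal terabytes to an average byte rate."""
    return terabytes_per_day * 10**12 / SECONDS_PER_DAY


queries_per_day = per_day(1_500_000)            # 129,600,000,000 queries
intake_mb_per_s = bytes_per_second(20) / 10**6  # about 231.48 MB/s

print(f"{queries_per_day:,} inbound queries per day")
print(f"{intake_mb_per_s:.2f} MB/s average intake")
```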

Programmatic buying exploits real-time signals to drive greater ROI.
• Analyze: Analyze the attributes available at bidding time (Context, Geo, O.S., Time, Demo, etc.)
• Appraise: Assess the value of each impression to determine a bid price and the creative to serve
• Optimize: Learn from served impressions to adjust future bidding and creative delivery
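
The analyze, appraise, optimize cycle can be sketched as a small feedback loop. This is a toy illustration only, not DataXu's algorithm: the attribute set, the smoothed conversion-rate estimate, and every constant are assumptions chosen for the example, and creative selection is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Attributes = Tuple[str, str, str]  # (context, geo, os): an assumed feature set


@dataclass
class Bidder:
    value_per_conversion: float = 2.0  # assumed advertiser value in USD
    prior_rate: float = 0.01           # assumed conversion rate before any data
    prior_weight: float = 100.0        # pseudo-impressions backing the prior
    stats: Dict[Attributes, Tuple[int, int]] = field(default_factory=dict)

    def analyze(self, request: Dict[str, str]) -> Attributes:
        """Extract the attributes available at bidding time."""
        return (request.get("context", "unknown"),
                request.get("geo", "unknown"),
                request.get("os", "unknown"))

    def appraise(self, attrs: Attributes) -> float:
        """Estimate the impression's value and use it as the bid price."""
        served, converted = self.stats.get(attrs, (0, 0))
        rate = ((converted + self.prior_rate * self.prior_weight)
                / (served + self.prior_weight))
        return rate * self.value_per_conversion

    def optimize(self, attrs: Attributes, converted: bool) -> None:
        """Record the outcome of a served impression to adjust future bids."""
        served, conversions = self.stats.get(attrs, (0, 0))
        self.stats[attrs] = (served + 1, conversions + int(converted))


bidder = Bidder()
attrs = bidder.analyze({"context": "news", "geo": "US", "os": "iOS"})
first_bid = bidder.appraise(attrs)
bidder.optimize(attrs, converted=True)
next_bid = bidder.appraise(attrs)  # higher, because the impression converted
```

The prior keeps early bids stable: a single observed conversion raises the bid for that attribute combination without letting one data point dominate the estimate.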

DataXu in the Cloud
• On-premise and Cloud
• Why Cloud/AWS
– Automation, API driven
– All Data in One Place
– Improved Testability
– Deep Security
– Breadth and Depth of Services
– Costs, Pay As You Go
– Auto Scaling (Scalability, Elasticity)
– Disaster Recovery and Business Continuity

DataXu Data Flows in AWS
[Diagram stages: Producers, Continuous Processing, Storage, Analytics]
[Diagram labels: CDN, Real Time Bidding, Retargeting Platform, Qubole, Kinesis, S3, Redshift, Machine Learning, Streaming, Data Collection]
[Diagram users: Analysts, Data Scientists, Engineers]

Why Qubole
Managed Service
• Auto Scaling
• Spot Pricing
• No Opex
• Redundant Clusters
• Data Security
Single Unified Interface
• Rich Unified Experience
• Data Discovery tool
• Query Templates
• Administration and Monitoring
Performance Optimizations
• Overall better performance than other Hadoop clusters in the cloud
Automation
• Workflow, Scheduler
• SDK
Support
• 24x7 support with deep expertise

Unified Experience
Operations Analyst
Marketing Ops Analyst
Data Architect
Business Users
Product Support
Customer Support
Developer
Sales Ops
Product Managers
Ease of use for anyone

Qubole Cluster Best Practices
• Use a VPC, and pick AZs that match your reservations
• Use a hybrid spot pricing strategy
• Use tags for better reporting
• Seek Qubole help for cluster tuning
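
The slide does not define the hybrid spot strategy. One plausible reading is a cluster that runs a fixed fraction of its nodes on Spot and the rest on-demand. The sketch below estimates the blended hourly cost under that assumption; the function name and the prices are made up for illustration.

```python
def blended_hourly_cost(nodes: int, spot_fraction: float,
                        on_demand_price: float, spot_price: float) -> float:
    """Hourly cost of a cluster that mixes on-demand and Spot nodes."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    spot_nodes = round(nodes * spot_fraction)
    on_demand_nodes = nodes - spot_nodes
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price


# Hypothetical prices in USD per node-hour.
all_on_demand = blended_hourly_cost(20, 0.0, on_demand_price=0.30, spot_price=0.10)
half_spot = blended_hourly_cost(20, 0.5, on_demand_price=0.30, spot_price=0.10)
print(f"on-demand only: ${all_on_demand:.2f}/h, half Spot: ${half_spot:.2f}/h")
```

The on-demand share sets a floor on capacity if Spot instances are reclaimed, which is the usual reason for mixing the two rather than running entirely on Spot.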

Data Security & Privacy
• AWS offers comprehensive data security
• Security & Privacy
– VPC
– IAM Policies, Users, Roles
– S3 Buckets, Bucket Policies & HTTPS
– Security Groups, Whitelist IP CIDR
– Key Management Service & CloudHSM
– Server Side and Client Side Encryption
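
As one example of combining bucket policies with HTTPS, the sketch below generates an S3 bucket policy that denies any request not made over TLS, using the `aws:SecureTransport` condition key. The bucket name and function name are hypothetical, and this shows a common pattern, not DataXu's actual policy.

```python
import json


def https_only_bucket_policy(bucket: str) -> str:
    """Return a bucket policy (JSON) that denies requests not sent over HTTPS."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both the bucket itself and every object in it.
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }
    return json.dumps(policy, indent=2)


print(https_only_bucket_policy("example-bucket"))  # hypothetical bucket name
```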

Right tool for right workload
Use Case / Technology
• Large scale ETL
• Interactive Discovery Queries
• Machine Learning/Real time queries
• High Performance DW Queries/Reporting backend

Questions?
DataXu
Yekesa Kosuru
ykosuru@dataxu.com
www.dataxu.com
Qubole
Ashish Dubey
adubey@qubole.com
www.qubole.com
AWS
Scott Ward
scotward@amazon.com
aws.amazon.com
