Getting to 1.5M Ads per Second
How DataXu Manages Big Data
AWS, DataXu, Qubole
March 30th, 2015
Today’s speakers
Yekesa Kosuru
VP of Engineering,
DataXu
Ashish Dubey
Solutions Architect,
Qubole
Scott Ward
Solutions Architect,
AWS
Agenda
• AWS: Big Data, Technologies & Techniques for working
productively with Data at any scale
• Qubole: Big Data Delivered as a Service
• DataXu: Leveraging Big Data to Understand & Engage Customers
Housekeeping
• The recording link will be distributed to all registrants via email after
the webinar next week
• Please submit your questions and comments using the Chat with
Presenters box located at the bottom left corner of your screen
Technologies and techniques for working
productively with data, at any scale.
Big Data
Creating Value from Data Assets
Recommendations,
Collective Intelligence
Machine Learning
Visualization
Dashboards
Business Intelligence
Measuring Functionality
and Services
Ad Hoc Queries
A/B Testing
Hypothesis Testing &
Predictions
Statistical
Analysis
Learning from Social
Media Conversations
Sentiment Analysis
SOCIAL
BIG DATA
Big Data Lifecycle
• Generation
• Collection & storage
• Analytics & computation
• Collaboration & sharing
Big Data + AWS
Big Data | AWS Cloud
Potentially massive data sets | Massive, virtually unlimited capacity
Iterative, experimental style of data manipulation and analysis | Iterative, experimental style of infrastructure deployment/usage
Frequently not a steady-state workload; peaks and valleys | Efficient with highly variable workloads
Time to results is key | Parallel compute clusters from single data source
Hard to configure/manage | Managed services for data storage and analysis
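The "peaks and valleys" point can be made concrete with a small calculation. The sketch below compares a cluster sized for the peak all day against one resized every hour; the demand profile is invented for illustration and does not come from the presentation.

```python
# Toy comparison of fixed vs. elastic provisioning for a spiky workload.
# The demand numbers are hypothetical, chosen only to show the effect.

def fixed_node_hours(demand):
    """Node-hours when the cluster is sized for the peak hour all day."""
    return max(demand) * len(demand)


def elastic_node_hours(demand):
    """Node-hours when the cluster is resized to match each hour's demand."""
    return sum(demand)


# Nodes needed per hour over one day: mostly quiet, with a 4-hour peak.
hourly_demand = [2] * 20 + [20] * 4

fixed = fixed_node_hours(hourly_demand)
elastic = elastic_node_hours(hourly_demand)
print(f"fixed: {fixed} node-hours, elastic: {elastic} node-hours")
```

With this made-up profile the peak-sized cluster uses four times the node-hours of the elastic one; real savings depend on how spiky the workload is and on resize latency.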
AWS Data Services
Data dimensions
• Variety: Structured, Unstructured, Text, Binary
• Volume: Gigabytes, Terabytes, Petabytes
• Velocity: Millisecond, Second, Minute, Hour, Day
Services
• Instance Storage: EC2, EBS
• SQL Stores: Redshift, RDS
• Hadoop: EMR
• NoSQL: DynamoDB
• Stream: Kinesis
• Storage Services: S3, CloudFront, Glacier
• Caching: ElastiCache
• Orchestrate: Data Pipeline
Amazon Elastic MapReduce
Hosted Hadoop Framework
• Easy to use and fully managed
• Secure
• Resizable clusters to support processing needs
• Support for EC2 spot instances
• Use many query tools to support analysis of
your data
– Hive, Pig, HBase, Spark, BI tools, etc.
• EMRFS for an S3-backed data store
• Direct integration with other AWS data stores
– S3, Redshift, DynamoDB
Amazon EMR Architecture
[Diagram labels: Master instance group; Core instance group; Task instance group; HDFS; Amazon S3; Amazon Redshift; Amazon DynamoDB]
EMR Security
• Security groups for master and
slave instances
• Instances launch in your VPC
• Encrypt data in S3
• Control who can access S3 data
• API requests require a signed key
Amazon Redshift
Petabyte Scale Data Warehouse
• Fully managed data warehouse solution
• Scales to petabytes at $1,000 per TB per year
• Integrates with existing data warehouse
tools
• Scales through columnar storage and
parallel query execution
• Loads data directly from S3
• Integration with Amazon EMR
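To put the per-terabyte price in perspective, here is a back-of-the-envelope calculation. It uses only the $1,000 per TB per year figure from the slide and assumes decimal units (1 PB = 1,000 TB); actual pricing depends on node type, region, and reservation terms.

```python
# Yearly cost at the slide's quoted rate. Decimal units are an assumption.
PRICE_PER_TB_YEAR_USD = 1_000  # figure quoted on the slide
TB_PER_PB = 1_000              # assumption: 1 PB = 1,000 TB


def yearly_cost_usd(terabytes):
    """Yearly warehouse cost in USD for the given capacity in terabytes."""
    return terabytes * PRICE_PER_TB_YEAR_USD


one_petabyte_cost = yearly_cost_usd(1 * TB_PER_PB)
print(f"1 PB for one year: ${one_petabyte_cost:,}")
```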
Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB,
Amazon EMR, Amazon S3, HDFS/SSH
• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB
[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion; Backup; Restore]
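Loading from Amazon S3 is normally done with Redshift's COPY command. The sketch below only builds the SQL text; the table name, bucket path, and IAM role ARN are placeholders, the file format (gzip-compressed CSV) is an assumption, and running the statement would need a database connection that is not shown here.

```python
def build_copy_statement(table, s3_path, iam_role_arn):
    """Return a Redshift COPY statement for gzip-compressed CSV files in S3.

    The arguments are interpolated directly, so pass trusted values only.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "GZIP;"
    )


# Placeholder names for illustration only.
statement = build_copy_statement(
    table="impressions",
    s3_path="s3://example-bucket/impressions/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

Pointing COPY at a key prefix rather than a single file lets Redshift load the matching files in parallel, which fits the parallel-load design described on the slide.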
Amazon Redshift Security
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM/CloudHSM
• No direct access to compute nodes
• Amazon VPC support
[Diagram labels: JDBC/ODBC; Customer VPC; Internal Security Group; 10 GigE (HPC); Ingestion; Backup; Restore]
Agenda
• AWS: Big Data, Technologies & Techniques for working
productively with Data at any scale
• Qubole: Big Data Delivered as a Service
• DataXu: Leveraging Big Data to Understand & Engage Customers
2014 Usage Statistics for Qubole on AWS:
• Total QCUH processed in 2014 = 40.6 million
• Total nodes managed in 2014 = 2.5 million
• Total PB processed in 2014 = 519
Operations Analyst
Marketing Ops Analyst
Data Architects
Business Users
Product Support
Customer Support
Developer
Sales Ops
Product Managers
Developer Tools
Service Management
Data Workbench
Cloud Data Platform
BI & DW Systems
• SDK
• API
• Analysis
• Security
• Job Scheduler
• Data Governance
• Analytics templates
• Monitoring
• Support
• Collaboration
• Workflow & Map/Reduce
• Auto Scaling
• Cloud Optimization
• Data Connectors
• YARN
• Presto & Hive
• Spark & Pig
Hadoop Ecosystem (Apache Open Source)
Qubole Cluster Settings
Qubole Cluster Set up with AWS Credentials
Qubole Query Types
Qubole Dashboard
Agenda
• AWS: Big Data, Technologies & Techniques for working
productively with Data at any scale
• Qubole: Big Data Delivered as a Service
• DataXu: Leveraging Big Data to Understand & Engage Customers
DataXu Introduction
• Disruptive on-demand software platform relied upon by the world’s leading brands
• A petabyte scale marketing cloud that enables Fortune 500 brands to manage data, insight and action to maximize Marketing ROI
• The industry’s #1 rated programmatic marketing technology spun out of MIT by the founders
• One of the fastest growing companies in the Inc. 500
DataXu Quick Statistics
Big data + real-time decisions
Big Data Processing
• 13 petabytes of data
• 20 terabytes/day consumer data intake
Real-Time Decisioning
• 42 billion decisions per second
• 1,500,000 inbound queries per second
• Dozens of algorithms across mobile, social, native, display, video and TV
Predictive Modeling
• Executing 10,000+ investments simultaneously
• 10M variables considered per investment decision using next-gen machine learning
Enterprise-Cloud Infrastructure
• 14 data centers
• 35,000+ CPU cores
Patent portfolio for real-time decision systems
Exclusive license from MIT to Algebra of Systems IPR
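Two of these figures are easier to grasp as daily and per-second rates. The arithmetic below assumes the 1,500,000 queries per second is sustained around the clock and that "terabytes" means decimal terabytes; both are assumptions, so treat the results as rough orders of magnitude.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

queries_per_second = 1_500_000      # from the slide
intake_bytes_per_day = 20 * 10**12  # 20 TB/day, assuming decimal terabytes

# Daily query volume if the per-second rate were sustained all day.
queries_per_day = queries_per_second * SECONDS_PER_DAY

# Average ingest rate implied by the daily intake figure.
avg_intake_mb_per_second = intake_bytes_per_day / SECONDS_PER_DAY / 10**6

print(f"{queries_per_day:,} queries/day")
print(f"{avg_intake_mb_per_second:.1f} MB/s average intake")
```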
Programmatic buying exploits real-time signals to drive greater ROI.
• Analyze: analyze the attributes available at bidding time (Context, Geo, O.S., Time, Demo, etc.)
• Appraise: assess the value of each impression to determine a bid price and the creative to serve
• Optimize: learn from served impressions to adjust future bidding and creative delivery
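The analyze, appraise, optimize loop can be sketched as a toy bidder. This is an illustration of the general idea only, not DataXu's algorithm: the class, the segment key (geo and OS), the prior conversion rate, and the value per conversion are all hypothetical.

```python
from collections import defaultdict


class ToyBidder:
    """Analyze -> Appraise -> Optimize in miniature (illustrative only)."""

    def __init__(self, value_per_conversion, prior_rate=0.01):
        self.value_per_conversion = value_per_conversion
        self.prior_rate = prior_rate       # used until a segment has data
        self.served = defaultdict(int)     # impressions served per segment
        self.converted = defaultdict(int)  # conversions observed per segment

    def analyze(self, request):
        """Reduce a bid request to a segment key from bid-time attributes."""
        return (request["geo"], request["os"])

    def appraise(self, request):
        """Bid the expected value: estimated conversion rate times value."""
        segment = self.analyze(request)
        if self.served[segment] == 0:
            rate = self.prior_rate
        else:
            rate = self.converted[segment] / self.served[segment]
        return rate * self.value_per_conversion

    def optimize(self, request, converted):
        """Record the outcome of a served impression to adjust future bids."""
        segment = self.analyze(request)
        self.served[segment] += 1
        if converted:
            self.converted[segment] += 1


bidder = ToyBidder(value_per_conversion=2.0)
request = {"geo": "US", "os": "iOS"}
cold_bid = bidder.appraise(request)  # no history yet, falls back to the prior
for outcome in (True, False, False, False):
    bidder.optimize(request, outcome)
warm_bid = bidder.appraise(request)  # now based on observed outcomes
print(cold_bid, warm_bid)
```

A production system would use far richer features and models; the point here is only the feedback loop from served impressions back into bid prices.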
DataXu in the Cloud
• On-premises and cloud
• Why Cloud/AWS
– Automation, API driven
– All Data in One Place
– Improved Testability
– Deep Security
– Breadth and Depth of Services
– Costs, Pay As You Go
– Auto Scaling (Scalability, Elasticity)
– Disaster Recovery and Business Continuity
DataXu Data Flows in AWS
[Diagram stages: Producers; Continuous Processing; Storage; Analytics]
[Diagram labels: CDN; Real Time Bidding; Retargeting; Platform; Qubole; Kinesis; S3; Redshift; Machine Learning; Streaming Data Collection]
[Diagram users: Analysts; Data Scientists; Engineers]
Why Qubole
Managed Service
• Auto Scaling
• Spot Pricing
• No Opex
• Redundant Clusters
• Data Security
Single Unified Interface
• Rich Unified Experience
• Data Discovery tool
• Query Templates
• Administration and Monitoring
Performance Optimizations
• Overall better performance than other Hadoop clusters in the cloud
Automation
• Workflow, Scheduler
• SDK
Support
• 24x7 support with deep expertise
Unified Experience
Operations Analyst
Marketing Ops Analyst
Data Architect
Business Users
Product Support
Customer Support
Developer
Sales Ops
Product Managers
Ease of use for anyone
Qubole Cluster Best Practices
• Use a VPC and pick AZs that match your reservations
• Use a hybrid spot pricing strategy
• Use tags for better reporting
• Seek Qubole's help for cluster tuning
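One way to read "hybrid spot pricing strategy" is a cluster that mixes on-demand nodes (stable) with spot nodes (cheap but reclaimable). The sketch below is a generic illustration under that assumption, not Qubole's actual policy or API; the function names, the 70% spot share, and the prices in cents are all hypothetical.

```python
def plan_cluster(total_nodes, spot_percent, min_on_demand=1):
    """Split a cluster into on-demand and spot nodes.

    spot_percent is the target share of spot nodes (0 to 100). At least
    min_on_demand nodes stay on-demand so the cluster survives a spot reclaim.
    """
    if not 0 <= spot_percent <= 100:
        raise ValueError("spot_percent must be between 0 and 100")
    spot = total_nodes * spot_percent // 100
    on_demand = max(total_nodes - spot, min(min_on_demand, total_nodes))
    return {"on_demand": on_demand, "spot": total_nodes - on_demand}


def hourly_cost_cents(plan, on_demand_cents, spot_cents):
    """Hourly cluster cost in cents for the given per-node prices."""
    return plan["on_demand"] * on_demand_cents + plan["spot"] * spot_cents


# Hypothetical prices: 50 cents/hour on-demand, 15 cents/hour spot.
hybrid = plan_cluster(total_nodes=10, spot_percent=70)
all_on_demand = plan_cluster(total_nodes=10, spot_percent=0)
print(hybrid, hourly_cost_cents(hybrid, 50, 15))
print(all_on_demand, hourly_cost_cents(all_on_demand, 50, 15))
```

Keeping a floor of on-demand nodes trades some savings for stability; the right split depends on how tolerant the workload is of losing spot capacity mid-job.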
Data Security & Privacy
• AWS offers comprehensive data security
• Security & Privacy
– VPC
– IAM Policies, Users, Roles
– S3 Buckets, Bucket Policies & HTTPS
– Security Groups, Whitelist IP CIDR
– Key Management Service & CloudHSM
– Server Side and Client Side Encryption
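As an example of the "Bucket Policies & HTTPS" item, a bucket policy can deny any request that does not arrive over HTTPS by testing the `aws:SecureTransport` condition key. The sketch below builds such a policy document as a Python dictionary; the bucket name is a placeholder, and attaching the policy to a bucket is left out.

```python
import json


def https_only_bucket_policy(bucket_name):
    """Build an S3 bucket policy that denies requests not sent over HTTPS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# "example-bucket" is a placeholder name.
policy = https_only_bucket_policy("example-bucket")
print(json.dumps(policy, indent=2))
```

Because this is an explicit Deny, it applies even to principals that are otherwise allowed access by IAM policies.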
Right tool for right workload
Use Case / Technology
• Large scale ETL
• Interactive Discovery Queries
• Machine Learning/Real time queries
• High Performance DW Queries/Reporting backend
Questions?
DataXu
Yekesa Kosuru
ykosuru@dataxu.com
www.dataxu.com
Qubole
Ashish Dubey
adubey@qubole.com
www.qubole.com
AWS
Scott Ward
scotward@amazon.com
aws.amazon.com


Editor's Notes

  1. 1.5 million ad requests per second. Billions of impressions per month and petabytes of data. ~10 ms average round-trip response time, 100 ms max. Serving in 50+ countries around the world. Over 20 TB of data collected per day. Integrated with over 30 exchanges around the world.
  2. No HDFS: with HDFS there is no reliable way to auto scale.
  3. Pretty innovative use of spot instances: Qubole has put thought into cost-effective spot pricing and auto scaling. Talk about auto scaling and cost optimization. HDFS does not make sense in Qubole; don't rely on it.