© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pavan Pothukuchi
June 17, 2015
Amazon Redshift
Getting Started
Introduction
Petabyte scale
Massively parallel
Relational data warehouse
Fully managed; zero admin
Amazon Redshift
a lot faster
a lot cheaper
a whole lot simpler
(Diagram: the AWS big data pipeline. Collect with AWS Direct Connect and Amazon Kinesis; store in Amazon S3, Amazon DynamoDB, and Amazon Glacier; analyze with Amazon Redshift, Amazon EMR, and Amazon EC2; AWS Data Pipeline moves data between stages.)
Selected Amazon Redshift Customers
Rapidly Growing Ecosystem
Benefits
Amazon Redshift Architecture
Leader Node
• SQL endpoint, JDBC/ODBC
• Stores metadata
• Coordinates query execution
Compute Nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3
• Load from Amazon DynamoDB or SSH
Two hardware platforms
• Optimized for data processing
• DS2: HDD; scale from 2TB to 2PB
• DC1: SSD; scale from 160GB to 326TB
(Diagram: clients connect to the leader node over JDBC/ODBC; compute nodes are linked by a 10 GigE (HPC) network; ingestion, backup, and restore go through Amazon S3.)
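Because the leader node stores the catalog, table metadata can be inspected with ordinary SQL. A minimal sketch using the SVV_TABLE_INFO system view (the tables returned depend on what you have created):

-- Per-table metadata from the leader node: distribution style,
-- first sort key column, size in 1 MB blocks, and row count.
select "table", diststyle, sortkey1, size, tbl_rows
from svv_table_info
order by size desc;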
Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
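With column storage, a query reads only the columns it references. A hypothetical illustration, assuming the rows above live in a table named sales:

-- Only the state and amount column blocks are read from disk;
-- the id and age columns are never touched.
select sum(amount)
from sales
where state = 'WA';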
Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
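The recommended encodings can be applied with ENCODE clauses when a table is created. A sketch assuming the TICKIT listing schema; the column types shown here are assumptions, not part of the original output:

create table listing_compressed (
  listid         integer      encode delta,
  sellerid       integer      encode delta32k,
  eventid        integer      encode delta32k,
  dateid         smallint     encode bytedict,
  numtickets     smallint     encode bytedict,
  priceperticket decimal(8,2) encode delta32k,
  totalprice     decimal(8,2) encode mostly32,
  listtime       timestamp    encode raw
);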
Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain the data needed for a given query
• Minimize unnecessary I/O
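Zone maps pay off most when data is sorted on the columns you filter by. A minimal sketch with a hypothetical orders table sorted on its timestamp column:

-- Rows are stored in sort order on ordertime, so each 1 MB block covers
-- a narrow time range and blocks outside the predicate are skipped.
create table orders (
  orderid   integer,
  ordertime timestamp sortkey,
  amount    decimal(8,2)
);

select sum(amount)
from orders
where ordertime between '2015-06-01' and '2015-06-08';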
Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
• Use direct-attached storage to
maximize throughput
• Hardware optimized for high
performance data processing
• Large block sizes to make the
most of each read
• Amazon Redshift manages
durability for you
Amazon Redshift Node Types
• Optimized for I/O intensive workloads
• High disk density
• On demand at $0.85/hour
• As low as $1,000/TB/Year
• Scale from 2TB to 2PB
DS2.XL: 31 GB RAM, 2 Cores
2 TB compressed storage, 0.5 GB/sec scan
DS2.8XL: 244 GB RAM, 16 Cores
16 TB compressed, 4 GB/sec scan
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour
• As low as $5,500/TB/Year
• Scale from 160GB to 326TB
DC1.L: 16 GB RAM, 2 Cores
160 GB compressed SSD storage
DC1.8XL: 256 GB RAM, 32 Cores
2.56 TB of compressed SSD storage
Priced to let you analyze all your data
Price is nodes times hourly cost
No charge for leader node
3x data compression on avg
Price includes 3 copies of data

DS2 (HDD)            Price Per Hour for DW1.XL Single Node   Effective Annual Price per TB compressed
On-Demand            $0.850                                   $3,725
1 Year Reservation   $0.500                                   $2,190
3 Year Reservation   $0.228                                   $999

DC1 (SSD)            Price Per Hour for DW2.L Single Node    Effective Annual Price per TB compressed
On-Demand            $0.250                                   $13,690
1 Year Reservation   $0.161                                   $8,795
3 Year Reservation   $0.100                                   $5,500
Built-in Security
• Load encrypted from S3
• SSL to secure data in transit; ECDHE for perfect forward secrecy
• Encryption to secure data at rest
• All blocks on disks & in Amazon S3 encrypted
• Block key, Cluster key, Master key (AES-256)
• On-premises HSM & CloudHSM support
• Audit logging & AWS CloudTrail integration
• Amazon VPC support
• SOC 1/2/3, PCI-DSS Level 1, FedRAMP
(Diagram: the same cluster architecture as before, split between a customer VPC for JDBC/ODBC access and an internal VPC for the compute nodes on a 10 GigE (HPC) network; ingestion, backup, and restore go through Amazon S3.)
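Building on the "Load encrypted from S3" bullet, COPY can read files that were client-side encrypted before being uploaded. A hedged sketch; the table, bucket, and key values are placeholders:

copy customer from 's3://mybucket/encrypted/customer'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>;master_symmetric_key=<base64_key>'
encrypted
delimiter '|';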
Durability and Availability – Managed
Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at
all times
Backups to Amazon S3 are continuous, automatic, and incremental
• Designed for eleven nines of durability
Continuous monitoring and automated recovery from failures of drives and nodes
Able to restore snapshots to any Availability Zone within a region
Easily enable backups to a second region for disaster recovery
Use cases
Common Customer Use Cases
Traditional Enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business

Companies with Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools

SaaS Companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
• 10s of millions of ads/day
• Stores 18 months of data
• Analyzes ad opportunities, clicks, and experiments

• 250M mobile events/day
• Stores 3 weeks of granular data and 4 years of aggregate data
• Analyzes new feature usage and A/B testing
Create and Scale
Enter Cluster Details
Select Node Configuration
Select Security Settings and Provision
Point and click resize
Resize
• Resize while remaining online
• Provision a new cluster in the
background
• Copy data in parallel from node to
node
• Only charged for source cluster
Load data
Data loading options
(Diagram: flat files in the corporate data center are uploaded to Amazon S3 in the AWS Cloud and loaded into Amazon Redshift.)
Data loading options
(Diagram: an ETL process in the corporate data center extracts from source databases and loads directly into Amazon Redshift in the AWS Cloud.)
Data loading options
(Diagram: streaming data flows through Amazon Kinesis into Amazon Redshift in the AWS Cloud.)
Demo for loading data
Use the COPY command
Each slice can load one file at a
time
A single input file means only one
slice is ingesting data
Instead of 100MB/s, you’re only
getting 6.25MB/s
Use multiple input files to maximize
throughput
Use the COPY command
You need at least as many input
files as you have slices
With 16 input files, all slices are
working so you maximize
throughput
Get 100MB/s per node; scale
linearly as you add nodes
Use multiple input files to maximize
throughput
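To size the number of input files, you can check how many slices the cluster has. A quick sketch using the STV_SLICES system view:

-- One row per slice; the row count is the number of slices,
-- and therefore the number of files that can be loaded in parallel.
select node, slice
from stv_slices
order by node, slice;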
Load lineorder table from single file
copy lineorder from 's3://awssampledb/load/lo/lineorder-single.tbl'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'
gzip
compupdate off
region 'us-east-1';
Load lineorder table from multiple files
copy lineorder from 's3://awssampledb/load/lo/lineorder-multi.tbl'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'
gzip
compupdate off
region 'us-east-1';
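If a COPY reports errors, the details are recorded in the STL_LOAD_ERRORS system table. A quick sketch for checking the most recent failures:

-- Most recent load errors first: which file, line, column, and why.
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 5;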
Query
Amazon Redshift works with your existing analysis tools
(Diagram: two common topologies over ODBC/JDBC. BI clients connect directly to Amazon Redshift, or clients connect to a BI server that in turn connects to Amazon Redshift.)
Monitor query performance
View explain plans
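Explain plans can also be pulled directly in SQL before a query is run. A minimal sketch against the customer table loaded later in the demo (the query itself is illustrative):

-- Show the plan steps, join types, and estimated costs without executing the query.
explain
select c_mktsegment, count(*)
from customer
group by c_mktsegment;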
Resources
Pavan Pothukuchi | pavanpo@amazon.com
Detail Pages
• http://aws.amazon.com/redshift
• https://aws.amazon.com/marketplace/redshift/
Best Practices
• http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
Deep Dive Webinar Series in July
• Migration and Loading Data
• Optimizing Performance
• Reporting and Advanced Analytics
AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new
customers about the AWS platform, best practices and new cloud services.
Details
• July 1, 2015
• Chicago, Illinois
• @ McCormick Place
Featuring
• New product launches
• 36+ sessions, labs, and bootcamps
• Executive and partner networking
Registration is now open
• Come and see what AWS and the cloud can do for you.
• Click here to register: http://amzn.to/1RooPPL
Pavan Pothukuchi – pavanpo@amazon.com
Load part table using key prefix
copy part from 's3://pp-redshift-webinar-demo/load/part-csv.tbl'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'
csv
null as '\000';
Load supplier table using gzip
copy supplier from 's3://awssampledb/ssbgz/supplier.tbl'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'
delimiter '|'
gzip
region 'us-east-1';
Load customer table using a manifest file
copy customer from 's3://pp-redshift-webinar-demo/load/customer-fw-manifest'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region:12,
c_phone:15, c_mktsegment:10'
maxerror 10
acceptinvchars as '^'
manifest;
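For reference, the manifest used above is a small JSON file in S3 listing the data files to load. A hedged sketch; the object names here are illustrative, not the actual demo files:

{
  "entries": [
    {"url": "s3://pp-redshift-webinar-demo/load/customer-fw.tbl-000", "mandatory": true},
    {"url": "s3://pp-redshift-webinar-demo/load/customer-fw.tbl-001", "mandatory": true}
  ]
}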
Load dwdate using auto
copy dwdate from 's3://pp-redshift-webinar-demo/load/dwdate-tab.tbl'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<key>'
delimiter '\t'
dateformat 'auto';
