AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Uses & Best Practices
for Amazon Redshift
Ian Meyers, AWS (meyersi@)
Adrian Wiles, The Financial Times (adrian.wiles@)
April 30, 2014

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Amazon Redshift

Collect: Direct Connect, Kinesis
Store: Glacier, S3, DynamoDB
Analyze: Redshift, EMR, EC2

Petabyte scale
Massively parallel
Relational data warehouse
Fully managed; zero admin
Amazon
Redshift
a lot faster
a lot cheaper
a whole lot simpler

Common Customer Use Cases
Traditional Enterprise DW
•  Reduce costs by extending DW rather than adding HW
•  Migrate completely from existing DW systems
•  Respond faster to business
Companies with Big Data
•  Improve performance by an order of magnitude
•  Make more data available for analysis
•  Access business data via standard reporting tools
SaaS Companies
•  Add analytic functionality to applications
•  Scale DW capacity as demand grows
•  Reduce HW & SW costs by an order of magnitude

AWS Marketplace
•  Find software to use with
Amazon Redshift
•  One-click deployments
•  Flexible pricing options
http://aws.amazon.com/marketplace/redshift

Data Loading Options
•  Parallel upload to Amazon S3
•  AWS Direct Connect
•  AWS Import/Export
•  Amazon Kinesis
•  SSH / Elastic MapReduce
•  Systems integrators
Data Integration Systems Integrators

Amazon Redshift Architecture
•  Leader Node
–  SQL endpoint
–  Stores metadata
–  Coordinates query execution
•  Compute Nodes
–  Local, columnar storage
–  Execute queries in parallel
–  Load, backup, restore via Amazon S3; load from Amazon DynamoDB, EMR & SSH
[Diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node; the leader node and three compute nodes (128GB RAM, 16TB disk, 16 cores each) are linked by 10 GigE (HPC); ingestion, backup and restore run against Amazon S3 / DynamoDB / SSH / EMR]
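
The split of work between the leader node and the compute nodes can be pictured with a small scatter/gather sketch. This is only an illustration of the idea on the slide, not how Redshift is implemented: the function names, the thread pool standing in for compute nodes, and the sample amounts are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node_sum(amounts):
    """Work done on one compute node: aggregate only its local slice of data."""
    return sum(amounts)


def leader_total(node_slices):
    """Leader role: fan the query out to every node, then combine the partials."""
    with ThreadPoolExecutor(max_workers=len(node_slices)) as pool:
        partials = list(pool.map(compute_node_sum, node_slices))
    return sum(partials), partials


# Hypothetical data: four amounts spread over three compute nodes.
NODE_SLICES = [[500, 250], [125], [375]]
TOTAL, PARTIALS = leader_total(NODE_SLICES)
print(TOTAL, PARTIALS)
```

Each node only touches its own slice, and the leader combines one small partial result per node, which is why adding nodes can shorten the scan.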

Amazon Redshift Node Types
DW1 (HDD)
•  Optimized for I/O intensive workloads
•  High disk density
•  On demand at $0.85/hour
•  As low as $1,000/TB/Year
•  Scale from 2TB to 1.6PB
•  DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage
•  DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed storage, 2 GB/sec scan rate
DW2 (SSD)
•  High performance at smaller storage size
•  High compute and memory density
•  On demand at $0.25/hour
•  As low as $5,500/TB/Year
•  Scale from 160GB to 256TB
•  DW2.L *New*: 16 GB RAM, 2 cores, 160 GB compressed SSD storage
•  DW2.8XL *New*: 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage
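
The scaling ranges follow from the per-node storage figures. A rough sizing sketch, assuming decimal units (1 TB = 1,000 GB) and ignoring anything beyond raw compressed capacity; the helper name `nodes_needed` is hypothetical:

```python
# Compressed storage per node in GB, taken from the node specs on the slide.
NODE_STORAGE_GB = {
    "DW1.XL": 2_000,
    "DW1.8XL": 16_000,
    "DW2.L": 160,
    "DW2.8XL": 2_560,
}


def nodes_needed(data_gb, node_type):
    """Smallest node count whose combined storage holds data_gb (ceiling division)."""
    per_node = NODE_STORAGE_GB[node_type]
    return (data_gb + per_node - 1) // per_node


# The quoted upper bounds, 1.6PB on DW1 and 256TB on DW2, expressed in GB.
print(nodes_needed(1_600_000, "DW1.8XL"))
print(nodes_needed(256_000, "DW2.8XL"))
```

Under these assumptions both upper bounds work out to the same node count for the largest node type in each family.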

Amazon Redshift dramatically reduces I/O
•  Column storage
•  Data compression
•  Zone maps
•  Direct-attached storage
•  With row storage you do unnecessary I/O
•  To get total amount, you have to read everything
 ID  | Age | State | Amount
-----+-----+-------+--------
 123 |  20 | CA    |    500
 345 |  25 | WA    |    250
 678 |  40 | FL    |    125
 957 |  37 | WA    |    375

•  With column storage, you only read the data you need to evaluate the query
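
The difference between the two layouts can be made concrete with the four-row table from the slides. The sketch below counts how many values each layout has to read to total the Amount column; the counting model (one "read" per stored value) is a simplifying assumption, not a description of Redshift's storage engine.

```python
ROWS = [
    {"id": 123, "age": 20, "state": "CA", "amount": 500},
    {"id": 345, "age": 25, "state": "WA", "amount": 250},
    {"id": 678, "age": 40, "state": "FL", "amount": 125},
    {"id": 957, "age": 37, "state": "WA", "amount": 375},
]


def total_row_store(rows):
    """Row layout: every field of every row is read to reach the amounts."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        total += row["amount"]
    return total, values_read


def to_columns(rows):
    """Pivot the same data into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def total_column_store(columns):
    """Column layout: only the 'amount' column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


print(total_row_store(ROWS))
print(total_column_store(to_columns(ROWS)))
```

Both layouts return the same total, but the columnar one reads a quarter of the values for this four-column table.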

•  Zone maps
•  Track the minimum and maximum value for each block
•  Skip over blocks that don’t contain relevant data
 Block contents                     | Min | Max
------------------------------------+-----+-----
 10 | 13 | 14 | 26 | … | 100 | 245 | 324 |  10 | 324
 375 | 393 | 417 | … | 512 | 549 | 623   | 375 | 623
 637 | 712 | 809 | … | 834 | 921 | 959   | 637 | 959
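
A minimal sketch of block skipping with a zone map, using only the values visible on the slide (the elided values are left out, not guessed). The function names and the example range are assumptions for illustration.

```python
# Sorted blocks, limited to the values shown on the slide.
BLOCKS = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]


def build_zone_map(blocks):
    """Record (min, max) for each block."""
    return [(min(block), max(block)) for block in blocks]


def scan_range(blocks, zone_map, low, high):
    """Return matching values and the indexes of the blocks actually scanned."""
    hits, scanned = [], []
    for index, (block, (block_min, block_max)) in enumerate(zip(blocks, zone_map)):
        if block_max < low or block_min > high:
            continue  # the zone map proves this block holds nothing in range
        scanned.append(index)
        hits.extend(value for value in block if low <= value <= high)
    return hits, scanned


ZONE_MAP = build_zone_map(BLOCKS)
HITS, SCANNED = scan_range(BLOCKS, ZONE_MAP, 400, 600)
print(ZONE_MAP)
print(HITS, SCANNED)
```

For the range 400 to 600 only the middle block is read; the other two are skipped on their min/max alone.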

•  Direct-attached storage
•  Use local storage for performance
•  Maximize scan rates
•  Automatic replication and continuous backup
•  HDD & SSD platforms

Amazon Redshift parallelizes and distributes everything
•  Query
•  Load
•  Backup/Restore
•  Resize

•  Load in parallel from Amazon S3, Elastic MapReduce, Amazon DynamoDB or any SSH connection
•  Data automatically distributed and sorted according to DDL
•  Scales linearly with number of nodes
[Diagram: Amazon S3 / DynamoDB / EMR / SSH loading into three compute nodes (128GB RAM, 16TB disk, 16 cores each)]
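
"Distributed and sorted according to DDL" can be sketched as hashing each row's distribution key to pick a node, then sorting each node's rows by the sort key. This is a toy model under stated assumptions: Redshift's real hash function is not given in the slides, so CRC32 is swapped in purely for illustration, and the choice of `state` and `amount` as keys is hypothetical.

```python
import zlib


def node_for(key, node_count):
    """Map a distribution-key value to a node with a stable hash (CRC32 here)."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def distribute(rows, dist_key, sort_key, node_count):
    """Place each row on a node by dist_key, then sort every node by sort_key."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for(row[dist_key], node_count)].append(row)
    for local_rows in nodes:
        local_rows.sort(key=lambda row: row[sort_key])
    return nodes


ROWS = [
    {"id": 123, "state": "CA", "amount": 500},
    {"id": 345, "state": "WA", "amount": 250},
    {"id": 678, "state": "FL", "amount": 125},
    {"id": 957, "state": "WA", "amount": 375},
]
NODES = distribute(ROWS, dist_key="state", sort_key="amount", node_count=3)
for index, local_rows in enumerate(NODES):
    print(index, local_rows)
```

Rows that share a distribution-key value always land on the same node, so each node can load and sort its share independently of the others.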

•  Backups to Amazon S3 are automatic, continuous and incremental
•  Configurable system snapshot retention period. Take user snapshots on-demand
•  Cross region backups for disaster recovery
•  Streaming restores enable you to resume querying faster
[Diagram: three compute nodes (128GB RAM, 16TB disk, 16 cores each) backing up to Amazon S3]

•  Resize while remaining online
•  Provision a new cluster in the background
•  Copy data in parallel from node to node
•  Only charged for source cluster
[Diagram: source and target clusters, each with a leader node and compute nodes (128GB RAM, 48TB disk, 16 cores per node)]

•  Automatic SQL endpoint switchover via DNS
•  Decommission the source cluster
•  Simple operation via Console or API
[Diagram: the resized cluster, a leader node with four compute nodes (128GB RAM, 48TB disk, 16 cores per node)]

Amazon Redshift is priced to let you analyze all your data
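•  Number of nodes x cost per hour
•  No charge for leader node
•  No upfront costs
•  Pay as you go
DW1 (HDD)
                    | Price Per Hour for DW1.XL Single Node | Effective Annual Price per TB
--------------------+---------------------------------------+-------------------------------
 On-Demand          | $0.850                                | $3,723
 1 Year Reservation | $0.500                                | $2,190
 3 Year Reservation | $0.228                                | $999
DW2 (SSD)
                    | Price Per Hour for DW2.L Single Node  | Effective Annual Price per TB
--------------------+---------------------------------------+-------------------------------
 On-Demand          | $0.250                                | $13,688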
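
The "Effective Annual Price per TB" column can be reproduced from the hourly price and the per-node storage. The sketch below assumes a 365-day year (8,760 hours) and decimal units (1 TB = 1,000 GB); the function names are made up for the example.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming a 365-day year


def cluster_hourly_cost(compute_nodes, price_per_node_hour):
    """Number of nodes x cost per hour; the leader node is not charged."""
    return compute_nodes * price_per_node_hour


def annual_price_per_tb(price_per_node_hour, node_storage_gb):
    """Yearly cost of one node divided by its compressed storage in TB."""
    return price_per_node_hour * HOURS_PER_YEAR * 1000 / node_storage_gb


# (label, hourly price, node storage in GB) as listed on the slide.
PRICES = [
    ("DW1.XL on-demand", 0.850, 2_000),
    ("DW1.XL 1 year reservation", 0.500, 2_000),
    ("DW1.XL 3 year reservation", 0.228, 2_000),
    ("DW2.L on-demand", 0.250, 160),
]
for label, hourly, storage_gb in PRICES:
    print(f"{label}: ${annual_price_per_tb(hourly, storage_gb):,.2f} per TB per year")
```

The DW2.L figure is much higher per TB because each node carries only 160 GB, even though its hourly price is lower.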

Amazon Redshift is easy to use
•  Provision in minutes
•  Powerful Web Console
•  Monitor query performance
•  Point and click resize
•  Built in security
•  Automatic backups

Amazon Redshift has security built-in
•  SSL to secure data in transit
•  Encryption to secure data at rest
–  AES-256; hardware accelerated
–  All blocks on disks and in Amazon S3 encrypted
–  HSM Support
•  No direct access to compute nodes
•  Audit logging & AWS CloudTrail integration
•  Amazon VPC support
[Diagram: JDBC/ODBC clients reach the leader node; the leader node and compute nodes (128GB RAM, 16TB disk, 16 cores each) span a customer VPC and an internal VPC, linked by 10 GigE (HPC), with ingestion, backup and restore against Amazon S3 / Amazon DynamoDB]

Amazon Redshift continuously backs up your data and recovers from failures
•  Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
•  Backups to Amazon S3 are continuous, automatic, and incremental
–  Designed for eleven nines of durability
•  Continuous monitoring and automated recovery from failures of drives and nodes
•  Able to restore snapshots to any Availability Zone within a region
•  Easily enable backups to a second region for disaster recovery

Amazon Redshift Feature Delivery
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
Unload Encrypted Files
DUB (4/25)
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
4 byte UTF-8 (7/18)
Statement Timeout (7/22)
SHA1 Builtin (7/15)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress (8/9)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts (11/13)
SOC1/2/3 (5/8)
Sharing snapshots (7/18)
Resource Level IAM (8/9)
PCI (8/22)
Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)
Resize progress indicator & Cluster Version (3/21)
Regexp_Substr, COPY from JSON (3/25)
50+ new features since launch
•  Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney
•  Certifications – PCI, SOC 1/2/3
•  Security – Load/unload encrypted files, Resource-level IAM, Temporary credentials, HSM
•  Manageability – Snapshot sharing, backup/restore/resize progress indicators, Cross-region snapshots
•  Query – Regex, Cursors, MD5, SHA1, Time Zone, workload queue timeout, HLL
•  Ingestion – S3 Manifest, LZOP/LZO, JSON built-ins, UTF-8 4byte, invalid character substitution, CSV, auto datetime format detection, epoch, Ingest from SSH

COPY from JSON
JSONPaths file (s3://mybucket/venue_jsonpaths.json):
{
    "jsonpaths": [
        "$['id']",
        "$['name']",
        "$['location'][0]",
        "$['location'][1]",
        "$['seats']"
    ]
}
COPY command:
COPY venue
FROM 's3://mybucket/venue.json'
CREDENTIALS 'aws_access_key_id=ACCESS-KEY-ID;aws_secret_access_key=SECRET-ACCESS-KEY'
JSON AS 's3://mybucket/venue_jsonpaths.json';
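
To see what the JSONPaths file expresses, the sketch below resolves each path against one JSON record and returns the values in column order. It is a stand-alone approximation for the bracket notation shown above, not Redshift's parser, and the sample venue record is invented for the example.

```python
import json
import re

# Matches either ['key'] or [index] in a bracket-notation path.
_TOKEN = re.compile(r"\['([^']+)'\]|\[(\d+)\]")


def resolve(path, record):
    """Follow one path such as $['location'][0] into a parsed JSON record."""
    if not path.startswith("$"):
        raise ValueError(f"path must start with '$': {path!r}")
    value = record
    for key, index in _TOKEN.findall(path[1:]):
        value = value[key] if key else value[int(index)]
    return value


def to_row(jsonpaths_doc, record):
    """One output value per path, in the order the paths are listed."""
    return [resolve(path, record) for path in jsonpaths_doc["jsonpaths"]]


JSONPATHS = json.loads("""
{"jsonpaths": ["$['id']", "$['name']", "$['location'][0]",
               "$['location'][1]", "$['seats']"]}
""")

# Hypothetical record; the real venue.json contents are not shown in the slides.
RECORD = {"id": 1, "name": "Sample Venue", "location": ["Sample City", "XX"], "seats": 1000}
ROW = to_row(JSONPATHS, RECORD)
print(ROW)
```

The order of the paths, not the order of keys in the record, decides which value lands in which column.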

REGEXP_SUBSTR()
select email, regexp_substr(email,'@[^.]*')
from users limit 5;
email | regexp_substr
--------------------------------------------+----------------
Suspendisse.tristique@nonnisiAenean.edu | @nonnisiAenean
sed@lacusUtnec.ca | @lacusUtnec
elementum@semperpretiumneque.ca | @semperpretiumneque
Integer.mollis.Integer@tristiquealiquet.org | @tristiquealiquet
Donec.fringilla@sodalesat.org | @sodalesat
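
The pattern '@[^.]*' matches the "@" and everything after it up to the first dot. The same pattern can be tried locally with Python's `re` module as a rough stand-in; this only mirrors the matching for this particular pattern and is not a claim about every detail of Redshift's regex dialect. The helper name is hypothetical.

```python
import re

DOMAIN_PATTERN = re.compile(r"@[^.]*")


def first_match(text, pattern=DOMAIN_PATTERN):
    """Return the first substring matching the pattern, or None if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None


# Two of the addresses from the sample output above.
EMAILS = ["sed@lacusUtnec.ca", "Donec.fringilla@sodalesat.org"]
for email in EMAILS:
    print(email, first_match(email))
```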

Resize Progress
•  Progress indicator in console
•  New API call

Adrian Wiles
Enterprise Data Architect
Financial Times


Flexibility
• Storage growth
• Ability to try new things
[Chart: x-axis 2011 to 2014, y-axis 0 to 14; units not recoverable]

Why Redshift?
•  Faster
•  No up-front commitment
–  Fail fast and cheap
–  Avoid procurement process
•  Scale down as well as up
•  AWS ecosystem

[Diagram sequence: components Source Data, Archive, Staging, DQ, Standardise, ETL, Data Warehouse, Reporting, Campaigns and Cubes. Successive slides reposition Data Warehouse, then Staging, then Archive; the final slide shows Data Warehouse, Staging, Archive, sftp, ETL, Cubes, Reporting and Campaigns.]
Performance
“My query must be wrong, that was too fast”
3.7 Billion Rows
Before: 7:03m
Redshift: 43.82s
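
As a quick check of the size of that gain, the two timings can be turned into a speedup ratio. This assumes "7:03m" means 7 minutes 3 seconds, which the slide does not spell out.

```python
BEFORE_SECONDS = 7 * 60 + 3  # assumption: "7:03m" is 7 minutes 3 seconds
REDSHIFT_SECONDS = 43.82

SPEEDUP = BEFORE_SECONDS / REDSHIFT_SECONDS
print(f"{SPEEDUP:.1f}x faster")
```

Under that reading the query runs a little under ten times faster.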

Learning
•  Plan your network in advance
•  Different way of thinking
–  Distributed data and column stores
–  No stored procedures
–  Small queries
•  Work with AWS
–  AWS invested in our success

Future
•  Consolidate existing marts
•  More sources near real-time
•  Service Oriented Architecture

