Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
SlideShare a Scribd company logo
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Uses & Best Practices
for Amazon Redshift
Ian Meyers, AWS (meyersi@)
Adrian Wiles, The Financial Times (adrian.wiles@)
April 30, 2014
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Amazon Redshift
Redshift
EMR
EC2
Analyze
Glacier
S3
DynamoDB
Store
Direct Connect
Collect
Kinesis
Petabyte  scale
Massively  parallel  
Rela2onal  data  warehouse  
Fully  managed;  zero  admin
Amazon
Redshift
a lot faster
a lot cheaper
a whole lot simpler
Common Customer Use Cases
•  Reduce costs by
extending DW rather than
adding HW
•  Migrate completely from
existing DW systems
•  Respond faster to
business
•  Improve performance by
an order of magnitude
•  Make more data
available for analysis
•  Access business data via
standard reporting tools
•  Add analytic functionality
to applications
•  Scale DW capacity as
demand grows
•  Reduce HW & SW costs
by an order of magnitude
Traditional Enterprise DW Companies with Big Data SaaS Companies
Amazon Redshift Customers
Growing Ecosystem
AWS Marketplace
•  Find software to use with
Amazon Redshift
•  One-click deployments
•  Flexible pricing options
http://aws.amazon.com/marketplace/redshift
Data Loading Options
•  Parallel upload to Amazon S3
•  AWS Direct Connect
•  AWS Import/Export
•  Amazon Kinesis
•  SSH / Elastic MapReduce
•  Systems integrators
Data Integration Systems Integrators
Amazon Redshift Architecture
•  Leader Node
–  SQL endpoint
–  Stores metadata
–  Coordinates query execution
•  Compute Nodes
–  Local, columnar storage
–  Execute queries in parallel
–  Load, backup, restore via
Amazon S3; load from
Amazon DynamoDB, EMR & SSH
10 GigE
(HPC)
Ingestion
Backup
Restore
SQL Clients/BI Tools
128GB RAM
16TB disk
16 cores
Amazon S3 / DynamoDB / SSH / EMR
JDBC/ODBC
128GB RAM
16TB disk
16 coresCompute
Node
128GB RAM
16TB disk
16 coresCompute
Node
128GB RAM
16TB disk
16 coresCompute
Node
Leader
Node
Amazon Redshift Node Types
•  Optimized for I/O intensive workloads
•  High disk density
•  On demand at $0.85/hour
•  As low as $1,000/TB/Year
•  Scale from 2TB to 1.6PB
DW1.XL: 16 GB RAM, 2 Cores
3 Spindles, 2 TB compressed storage
DW1.8XL: 128 GB RAM, 16 Cores, 24 Spindles
16 TB compressed, 2 GB/sec scan rate
•  High performance at smaller storage size
•  High compute and memory density
•  On demand at $0.25/hour
•  As low as $5,500/TB/Year
•  Scale from 160GB to 256TB
DW2.L *New*: 16 GB RAM, 2 Cores,
160 GB compressed SSD storage
DW2.8XL *New*: 256 GB RAM, 32 Cores,
2.56 TB of compressed SSD storage
Amazon Redshift dramatically reduces I/O
•  Column storage
•  Data compression
•  Zone maps
•  Direct-attached storage •  With row storage you do
unnecessary I/O
•  To get total amount, you have
to read everything
ID	

 Age	

 State	

 Amount	

123	

 20	

 CA	

 500	

345	

 25	

 WA	

 250	

678	

 40	

 FL	

 125	

957	

 37	

 WA	

 375
•  With column storage, you
only read the data you need
to evaluate the query
Amazon Redshift dramatically reduces I/O
•  Column storage
•  Data compression
•  Zone maps
•  Direct-attached storage
ID	

 Age	

 State	

 Amount	

123	

 20	

 CA	

 500	

345	

 25	

 WA	

 250	

678	

 40	

 FL	

 125	

957	

 37	

 WA	

 375
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
Amazon Redshift dramatically reduces I/O
•  Column storage
•  Data compression
•  Zone maps
•  Direct-attached storage •  COPY compresses automatically
•  You can analyze and override
•  More performance, less cost
Amazon Redshift dramatically reduces I/O
•  Column storage
•  Data compression
•  Zone maps
•  Direct-attached storage
10 | 13 | 14 | 26 |…
… | 100 | 245 | 324
375 | 393 | 417…
… 512 | 549 | 623
637 | 712 | 809 …
… | 834 | 921 | 959
10
324
375
623
637
959
•  Track the minimum and
maximum value for each block
•  Skip over blocks that don’t
contain relevant data
Amazon Redshift dramatically reduces I/O
•  Column storage
•  Data compression
•  Zone maps
•  Direct-attached storage
•  Use local storage for
performance
•  Maximize scan rates
•  Automatic replication and
continuous backup
•  HDD & SSD platforms
Amazon Redshift parallelizes and distributes everything
•  Query
•  Load
•  Backup/Restore
•  Resize
•  Load in parallel from Amazon S3,
Elastic MapReduce, Amazon
DynamoDB or any SSH connection
•  Data automatically distributed and
sorted according to DDL
•  Scales linearly with number of nodes
Amazon S3 / DynamoDB / EMR / SSH
128GB RAM
16TB disk
16 coresCompute
Node
128GB RAM
16TB disk
16 coresCompute
Node
128GB RAM
16TB disk
16 coresCompute
Node
Amazon Redshift parallelizes and distributes everything
•  Query
•  Load
• 
•  Backup/Restore
•  Resize
•  Backups to Amazon S3 are automatic,
continuous and incremental
•  Configurable system snapshot retention period.
Take user snapshots on-demand
•  Cross region backups for disaster recovery
•  Streaming restores enable you to resume
querying faster
Amazon S3
128GB RAM
16TB disk
16 coresCompute
Node
128GB RAM
16TB disk
16 coresCompute
Node
128GB RAM
16TB disk
16 coresCompute
Node
Amazon Redshift parallelizes and distributes everything
•  Query
•  Load
•  Backup/Restore
•  Resize
•  Resize while remaining online
•  Provision a new cluster in the
background
•  Copy data in parallel from node to node
•  Only charged for source cluster
SQL Clients/BI Tools
128GB RAM
48TB disk
16 cores
Comput
e Node
128GB RAM
48TB disk
16 cores
Comput
e Node
128GB RAM
48TB disk
16 cores
Comput
e Node
128GB RAM
48TB disk
16 cores
Leader
Node
128GB RAM
48TB disk
16 cores
Comput
e Node
128GB RAM
48TB disk
16 cores
Comput
e Node
128GB RAM
48TB disk
16 cores
Comput
e Node
128GB RAM
48TB disk
16 cores
Comput
e Node
128GB RAM
48TB disk
16 cores
Leader
Node
Amazon Redshift parallelizes and distributes everything
•  Query
•  Load
•  Backup/Restore
•  Resize
SQL Clients/BI Tools
128GB RAM
48TB disk
16 cores
Comput
e Node
128GB RAM
48TB disk
16 cores
Comput
e Node
128GB RAM
48TB disk
16 cores
Comput
e Node
128GB RAM
48TB disk
16 cores
Comput
e Node
128GB RAM
48TB disk
16 cores
Leader
Node
•  Automatic SQL endpoint
switchover via DNS
•  Decommission the source cluster
•  Simple operation via Console or
API
Amazon Redshift parallelizes and distributes everything
•  Query
•  Load
•  Backup/Restore
•  Resize
Amazon Redshift is priced to let you analyze all your data
•  Number of nodes x cost
per hour
•  No charge for leader
node
•  No upfront costs
•  Pay as you go
DW1 (HDD)
Price Per Hour for
DW1.XL Single Node
Effective Annual
Price per TB
On-Demand $ 0.850 $ 3,723
1 Year Reservation $ 0.500 $ 2,190
3 Year Reservation $ 0.228 $ 999
DW2 (SSD)
Price Per Hour for
DW2.L Single Node
Effective Annual
Price per TB
On-Demand $ 0.250 $ 13,688
1 Year Reservation $ 0.161 $ 8,794
3 Year Reservation $ 0.100 $ 5,498
Amazon Redshift is easy to use
•  Provision in minutes
•  Powerful Web Console
•  Monitor query performance
•  Point and click resize
•  Built in security
•  Automatic backups
Amazon Redshift has security built-in
•  SSL to secure data in transit
•  Encryption to secure data at rest
–  AES-256; hardware accelerated
–  All blocks on disks and in Amazon S3
encrypted
–  HSM Support
•  No direct access to compute nodes
•  Audit logging & AWS CloudTrail
integration
•  Amazon VPC support
10 GigE
(HPC)
Ingestion
Backup
Restore
SQL Clients/BI Tools
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
Amazon S3 / Amazon DynamoDB
Customer VPC
Internal
VPC
JDBC/ODBC
Leader
Node
Compute
Node
Compute
Node
Compute
Node
Amazon Redshift continuously backs up your
data and recovers from failures
•  Replication within the cluster and backup to Amazon S3 to maintain multiple copies of
data at all times
•  Backups to Amazon S3 are continuous, automatic, and incremental
–  Designed for eleven nines of durability
•  Continuous monitoring and automated recovery from failures of drives and nodes
•  Able to restore snapshots to any Availability Zone within a region
•  Easily enable backups to a second region for disaster recovery
Amazon Redshift Feature Delivery
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
Unload Encrypted Files
DUB (4/25)
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
4 byte UTF-8 (7/18)
Statement Timeout (7/22)
SHA1 Builtin (7/15)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress
(8/9)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy,
Distributed Tables, Audit Logging/
CloudTrail, Concurrency, Resize Perf.,
Approximate Count Distinct, SNS Alerts
(11/13)
SOC1/2/3 (5/8)
Sharing snapshots (7/18)
Resource Level IAM (8/9)
PCI (8/22)
Distributed Tables, Single Node Cursor
Support, Maximum Connections to 500
(12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and
diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, Fetch
size support for single node clusters, new
system tables with commit stats,
row_number(), strotol() and query
termination (2/13)
Resize progress indicator & Cluster
Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50+ new features since launch
•  Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney
•  Certifications – PCI, SOC 1/2/3
•  Security – Load/unload encrypted files, Resource-level IAM, Temporary credentials, HSM
•  Manageability – Snapshot sharing, backup/restore/resize progress indicators, Cross-region
snapshots
•  Query – Regex, Cursors, MD5, SHA1, Time Zone, workload queue timeout, HLL
•  Ingestion – S3 Manifest, LZOP/LZO, JSON built-ins, UTF-8 4byte, invalid character
substitution, CSV, auto datetime format detection, epoch, Ingest from SSH
COPY from JSON
{
"jsonpaths":
[
"$['id']",
"$['name']",
"$['location'][0]",
"$['location'][1]",
"$['seats']"
]
}
COPY venue FROM 's3://mybucket/venue.json’ credentials
'aws_access_key_id=ACCESS-KEY-ID; aws_secret_access_key=SECRET-ACCESS-KEY'
JSON AS 's3://mybucket/venue_jsonpaths.json';
REGEX_SUBSTR()
select email, regexp_substr(email,'@[^.]*')
from users limit 5;
email | regexp_substr
--------------------------------------------+----------------
Suspendisse.tristique@nonnisiAenean.edu | @nonnisiAenean
sed@lacusUtnec.ca | @lacusUtnec
elementum@semperpretiumneque.ca | @semperpretiumneque
Integer.mollis.Integer@tristiquealiquet.org | @tristiquealiquet
Donec.fringilla@sodalesat.org | @sodalesat
Resize Progress
•  Progress indicator in
console
•  New API call
Adrian Wiles
Enterprise Data Architect
Financial Times
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
Better value for money
Flexibility
• Storage growth
• Ability to try new things
0
2
4
6
8
10
12
14
2011 2012 2013 2014
Democratise Data
Why Redshift?
•  Faster
•  No up-front commitment
–  Fail fast and cheap
–  Avoid procurement process
•  Scale down as well as up
•  AWS ecosystem
Source Data
Archive
Staging
DQ
Standardise
ETL
Data
Warehouse
Reporting
Campaigns
Cubes
Source Data
Archive
Staging
DQ
Standardise
ETL
Reporting
Campaigns
Cubes
Data Warehouse
Source Data
Archive
DQ
Standardise
ETL
Reporting
Campaigns
Cubes
Data Warehouse
Staging
Source Data
DQ
Standardise
ETL
Reporting
Campaigns
Cubes
Data Warehouse
Staging
Archive
Data Warehouse
Staging
Archive
sftp
ETL
 Cubes
Reporting
Campaigns
Data Warehouse
Staging
 Archive
sftp
ETL
 Cubes
Reporting
Campaigns
Performance
“My query must be wrong, that was too
fast”
3.7 Billion Rows
Before: 7:03m
Redshift: 43.82s
Near real-time
Dashboard
“Yes we can”
Learning
•  Plan your network in advance
•  Different way of thinking
–  Distributed data and column stores
–  No stored procedures
–  Small queries
•  Work with AWS
–  AWS invested in our success
Future
•  Consolidate existing marts
•  More sources near real-time
•  Service Oriented Architecture
AWS Partner Trail
Win a Kindle Fire
•  10 in total
•  Get a code from our
sponsors
Please rate
this session
using the AWS
Summits App
and help us build
better events
#AWSSummit
@AWScloud @AWS_UKI

More Related Content

AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)

  • 1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Uses & Best Practices for Amazon Redshift Ian Meyers, AWS (meyersi@) Adrian Wiles, The Financial Times (adrian.wiles@) April 30, 2014
  • 2. Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year Amazon Redshift
  • 4. Petabyte  scale Massively  parallel   Rela2onal  data  warehouse   Fully  managed;  zero  admin Amazon Redshift a lot faster a lot cheaper a whole lot simpler
  • 5. Common Customer Use Cases •  Reduce costs by extending DW rather than adding HW •  Migrate completely from existing DW systems •  Respond faster to business •  Improve performance by an order of magnitude •  Make more data available for analysis •  Access business data via standard reporting tools •  Add analytic functionality to applications •  Scale DW capacity as demand grows •  Reduce HW & SW costs by an order of magnitude Traditional Enterprise DW Companies with Big Data SaaS Companies
  • 8. AWS Marketplace •  Find software to use with Amazon Redshift •  One-click deployments •  Flexible pricing options http://aws.amazon.com/marketplace/redshift
  • 9. Data Loading Options •  Parallel upload to Amazon S3 •  AWS Direct Connect •  AWS Import/Export •  Amazon Kinesis •  SSH / Elastic MapReduce •  Systems integrators Data Integration Systems Integrators
  • 10. Amazon Redshift Architecture •  Leader Node –  SQL endpoint –  Stores metadata –  Coordinates query execution •  Compute Nodes –  Local, columnar storage –  Execute queries in parallel –  Load, backup, restore via Amazon S3; load from Amazon DynamoDB, EMR & SSH 10 GigE (HPC) Ingestion Backup Restore SQL Clients/BI Tools 128GB RAM 16TB disk 16 cores Amazon S3 / DynamoDB / SSH / EMR JDBC/ODBC 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node Leader Node
  • 11. Amazon Redshift Node Types •  Optimized for I/O intensive workloads •  High disk density •  On demand at $0.85/hour •  As low as $1,000/TB/Year •  Scale from 2TB to 1.6PB DW1.XL: 16 GB RAM, 2 Cores 3 Spindles, 2 TB compressed storage DW1.8XL: 128 GB RAM, 16 Cores, 24 Spindles 16 TB compressed, 2 GB/sec scan rate •  High performance at smaller storage size •  High compute and memory density •  On demand at $0.25/hour •  As low as $5,500/TB/Year •  Scale from 160GB to 256TB DW2.L *New*: 16 GB RAM, 2 Cores, 160 GB compressed SSD storage DW2.8XL *New*: 256 GB RAM, 32 Cores, 2.56 TB of compressed SSD storage
  • 12. Amazon Redshift dramatically reduces I/O •  Column storage •  Data compression •  Zone maps •  Direct-attached storage •  With row storage you do unnecessary I/O •  To get total amount, you have to read everything ID Age State Amount 123 20 CA 500 345 25 WA 250 678 40 FL 125 957 37 WA 375
  • 13. •  With column storage, you only read the data you need to evaluate the query Amazon Redshift dramatically reduces I/O •  Column storage •  Data compression •  Zone maps •  Direct-attached storage ID Age State Amount 123 20 CA 500 345 25 WA 250 678 40 FL 125 957 37 WA 375
  • 14. analyze compression listing; Table | Column | Encoding ---------+----------------+---------- listing | listid | delta listing | sellerid | delta32k listing | eventid | delta32k listing | dateid | bytedict listing | numtickets | bytedict listing | priceperticket | delta32k listing | totalprice | mostly32 listing | listtime | raw Amazon Redshift dramatically reduces I/O •  Column storage •  Data compression •  Zone maps •  Direct-attached storage •  COPY compresses automatically •  You can analyze and override •  More performance, less cost
  • 15. Amazon Redshift dramatically reduces I/O •  Column storage •  Data compression •  Zone maps •  Direct-attached storage 10 | 13 | 14 | 26 |… … | 100 | 245 | 324 375 | 393 | 417… … 512 | 549 | 623 637 | 712 | 809 … … | 834 | 921 | 959 10 324 375 623 637 959 •  Track the minimum and maximum value for each block •  Skip over blocks that don’t contain relevant data
  • 16. Amazon Redshift dramatically reduces I/O •  Column storage •  Data compression •  Zone maps •  Direct-attached storage •  Use local storage for performance •  Maximize scan rates •  Automatic replication and continuous backup •  HDD & SSD platforms
  • 17. Amazon Redshift parallelizes and distributes everything •  Query •  Load •  Backup/Restore •  Resize
  • 18. •  Load in parallel from Amazon S3, Elastic MapReduce, Amazon DynamoDB or any SSH connection •  Data automatically distributed and sorted according to DDL •  Scales linearly with number of nodes Amazon S3 / DynamoDB / EMR / SSH 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node Amazon Redshift parallelizes and distributes everything •  Query •  Load •  •  Backup/Restore •  Resize
  • 19. •  Backups to Amazon S3 are automatic, continuous and incremental •  Configurable system snapshot retention period. Take user snapshots on-demand •  Cross region backups for disaster recovery •  Streaming restores enable you to resume querying faster Amazon S3 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node Amazon Redshift parallelizes and distributes everything •  Query •  Load •  Backup/Restore •  Resize
  • 20. •  Resize while remaining online •  Provision a new cluster in the background •  Copy data in parallel from node to node •  Only charged for source cluster SQL Clients/BI Tools 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Leader Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Leader Node Amazon Redshift parallelizes and distributes everything •  Query •  Load •  Backup/Restore •  Resize
  • 21. SQL Clients/BI Tools 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Leader Node •  Automatic SQL endpoint switchover via DNS •  Decommission the source cluster •  Simple operation via Console or API Amazon Redshift parallelizes and distributes everything •  Query •  Load •  Backup/Restore •  Resize
  • 22. Amazon Redshift is priced to let you analyze all your data •  Number of nodes x cost per hour •  No charge for leader node •  No upfront costs •  Pay as you go DW1 (HDD) Price Per Hour for DW1.XL Single Node Effective Annual Price per TB On-Demand $ 0.850 $ 3,723 1 Year Reservation $ 0.500 $ 2,190 3 Year Reservation $ 0.228 $ 999 DW2 (SSD) Price Per Hour for DW2.L Single Node Effective Annual Price per TB On-Demand $ 0.250 $ 13,688 1 Year Reservation $ 0.161 $ 8,794 3 Year Reservation $ 0.100 $ 5,498
  • 23. Amazon Redshift is easy to use •  Provision in minutes •  Powerful Web Console •  Monitor query performance •  Point and click resize •  Built in security •  Automatic backups
  • 24. Amazon Redshift has security built-in •  SSL to secure data in transit •  Encryption to secure data at rest –  AES-256; hardware accelerated –  All blocks on disks and in Amazon S3 encrypted –  HSM Support •  No direct access to compute nodes •  Audit logging & AWS CloudTrail integration •  Amazon VPC support 10 GigE (HPC) Ingestion Backup Restore SQL Clients/BI Tools 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores Amazon S3 / Amazon DynamoDB Customer VPC Internal VPC JDBC/ODBC Leader Node Compute Node Compute Node Compute Node
  • 25. Amazon Redshift continuously backs up your data and recovers from failures •  Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times •  Backups to Amazon S3 are continuous, automatic, and incremental –  Designed for eleven nines of durability •  Continuous monitoring and automated recovery from failures of drives and nodes •  Able to restore snapshots to any Availability Zone within a region •  Easily enable backups to a second region for disaster recovery
  • 26. Amazon Redshift Feature Delivery Service Launch (2/14) PDX (4/2) Temp Credentials (4/11) Unload Encrypted Files DUB (4/25) NRT (6/5) JDBC Fetch Size (6/27) Unload logs (7/5) 4 byte UTF-8 (7/18) Statement Timeout (7/22) SHA1 Builtin (7/15) Timezone, Epoch, Autoformat (7/25) WLM Timeout/Wildcards (8/1) CRC32 Builtin, CSV, Restore Progress (8/9) UTF-8 Substitution (8/29) JSON, Regex, Cursors (9/10) Split_part, Audit tables (10/3) SIN/SYD (10/8) HSM Support (11/11) Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/ CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts (11/13) SOC1/2/3 (5/8) Sharing snapshots (7/18) Resource Level IAM (8/9) PCI (8/22) Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13) EIP Support for VPC Clusters (12/28) New query monitoring system tables and diststyle all (1/13) Redshift on DW2 (SSD) Nodes (1/23) Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strotol() and query termination (2/13) Resize progress indicator & Cluster Version (3/21) Regex_Substr, COPY from JSON (3/25)
  • 27. 50+ new features since launch •  Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney •  Certifications – PCI, SOC 1/2/3 •  Security – Load/unload encrypted files, Resource-level IAM, Temporary credentials, HSM •  Manageability – Snapshot sharing, backup/restore/resize progress indicators, Cross-region snapshots •  Query – Regex, Cursors, MD5, SHA1, Time Zone, workload queue timeout, HLL •  Ingestion – S3 Manifest, LZOP/LZO, JSON built-ins, UTF-8 4byte, invalid character substitution, CSV, auto datetime format detection, epoch, Ingest from SSH
  • 28. COPY from JSON { "jsonpaths": [ "$['id']", "$['name']", "$['location'][0]", "$['location'][1]", "$['seats']" ] } COPY venue FROM 's3://mybucket/venue.json’ credentials 'aws_access_key_id=ACCESS-KEY-ID; aws_secret_access_key=SECRET-ACCESS-KEY' JSON AS 's3://mybucket/venue_jsonpaths.json';
  • 29. REGEX_SUBSTR() select email, regexp_substr(email,'@[^.]*') from users limit 5; email | regexp_substr --------------------------------------------+---------------- Suspendisse.tristique@nonnisiAenean.edu | @nonnisiAenean sed@lacusUtnec.ca | @lacusUtnec elementum@semperpretiumneque.ca | @semperpretiumneque Integer.mollis.Integer@tristiquealiquet.org | @tristiquealiquet Donec.fringilla@sodalesat.org | @sodalesat
  • 30. Resize Progress •  Progress indicator in console •  New API call
  • 31. Adrian Wiles Enterprise Data Architect Financial Times
  • 35. Flexibility • Storage growth • Ability to try new things 0 2 4 6 8 10 12 14 2011 2012 2013 2014
  • 37. Why Redshift? •  Faster •  No up-front commitment –  Fail fast and cheap –  Avoid procurement process •  Scale down as well as up •  AWS ecosystem
  • 43. Data Warehouse Staging Archive sftp ETL Cubes Reporting Campaigns
  • 44. Performance “My query must be wrong, that was too fast” 3.7 Billion Rows Before: 7:03m Redshift: 43.82s
  • 47. Learning •  Plan your network in advance •  Different way of thinking –  Distributed data and column stores –  No stored procedures –  Small queries •  Work with AWS –  AWS invested in our success
  • 48. Future •  Consolidate existing marts •  More sources near real-time •  Service Oriented Architecture
  • 49. AWS Partner Trail Win a Kindle Fire •  10 in total •  Get a code from our sponsors
  • 50. Please rate this session using the AWS Summits App and help us build better events