1

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Greg Khairallah, Business Development Manager, AWS
July 28, 2016
Getting Started with Amazon Redshift

2

Agenda
• Introduction
• Benefits
• Use cases
• Q&A

3

AWS big data portfolio
Stages: Collect, Store, Analyze
Services shown:
• Amazon Glacier
• Amazon S3
• Amazon DynamoDB
• Amazon RDS, Amazon Aurora
• AWS Data Pipeline
• Amazon CloudSearch
• Amazon EMR
• Amazon EC2
• Amazon Redshift
• Amazon Machine Learning
• Amazon Elasticsearch Service
• AWS Database Migration Service
• Amazon QuickSight
• Amazon Kinesis Firehose
• AWS Import/Export Snowball
• AWS Direct Connect
• Amazon Kinesis Streams

4

Amazon Redshift: a lot faster, a lot simpler, a lot cheaper
• Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour

5

The Amazon Redshift view of data warehousing
Enterprise
• 10x cheaper
• Easy to provision
• Higher DBA productivity
Big data
• 10x faster
• No programming
• Easily leverage BI tools, Hadoop, machine learning, streaming
SaaS
• Analysis inline with process flows
• Pay as you go, grow as you need
• Managed availability and disaster recovery

6

The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical
representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any
vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
Forrester Wave™ enterprise data warehouse Q4 ’15

7

Selected Amazon Redshift customers

8

Amazon Redshift architecture
Leader node
• Simple SQL endpoint
• Stores metadata
• Optimizes query plan
• Coordinates query execution
Compute nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS2: HDD; scale from 2 TB to 2 PB
Diagram labels: Ingestion/backup; Backup; Restore; JDBC/ODBC; 10 GigE (HPC)
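To make the split between the leader node and the compute nodes concrete, here is a minimal Python sketch of scatter-gather aggregation. It is an illustration under stated assumptions, not Redshift's implementation: rows are assigned to hypothetical compute nodes by hashing a key, each node sums only its local rows, and the leader combines the partial results. The function names (`distribute`, `partial_sum`, `leader_sum`) and the CRC32 hash are invented for this example; the column names are borrowed from the `listing` table on the next slide.

```python
from collections import defaultdict
from zlib import crc32


def distribute(rows, num_nodes, key):
    """Assign each row to a node by hashing its key (illustrative scheme only)."""
    nodes = defaultdict(list)
    for row in rows:
        node_id = crc32(str(row[key]).encode()) % num_nodes
        nodes[node_id].append(row)
    return nodes


def partial_sum(local_rows, column):
    """Work done on one compute node: aggregate only the rows stored locally."""
    return sum(row[column] for row in local_rows)


def leader_sum(nodes, column):
    """Work done on the leader: gather the partial sums and combine them."""
    return sum(partial_sum(local_rows, column) for local_rows in nodes.values())


# Hypothetical data: 1,000 listings spread over 4 compute nodes.
rows = [{"listid": i, "numtickets": i % 5} for i in range(1000)]
nodes = distribute(rows, num_nodes=4, key="listid")
total = leader_sum(nodes, "numtickets")
```

The point of the sketch is that the leader only ever touches one small partial result per node, while the row-level work runs on all nodes at once.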

9

Benefit #1: Amazon Redshift is fast
Dramatically less I/O
analyze compression listing;
  Table  |     Column     | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw
[Figure: zone maps. Three blocks of sorted values (10 | 13 | 14 | 26 | … | 100 | 245 | 324; 375 | 393 | 417 | … | 512 | 549 | 623; 637 | 712 | 809 | … | 834 | 921 | 959), each annotated with its minimum and maximum: 10 and 324, 375 and 623, 637 and 959.]
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
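The zone map idea on this slide can be sketched in a few lines of Python. This is a simplified illustration, not Redshift's internal format: each block keeps its minimum and maximum value, and a range predicate only reads blocks whose range overlaps it. The block values are the ones visible on the slide (the elided "…" values are left out), and the function names `build_zone_map` and `blocks_to_scan` are hypothetical.

```python
def build_zone_map(blocks):
    """Record the (min, max) value pair of each block."""
    return [(min(block), max(block)) for block in blocks]


def blocks_to_scan(zone_map, low, high):
    """Return indexes of blocks whose [min, max] range overlaps [low, high]."""
    return [
        i for i, (block_min, block_max) in enumerate(zone_map)
        if block_max >= low and block_min <= high
    ]


# Sorted values as shown on the slide; elided middle values are omitted.
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]
zone_map = build_zone_map(blocks)

# A predicate such as "value BETWEEN 400 AND 600" only needs the middle block.
hits = blocks_to_scan(zone_map, 400, 600)
```

Because the data is sorted, the block ranges do not overlap, so a narrow range predicate skips most blocks without reading them.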

10

Benefit #1: Amazon Redshift is fast
Parallel and distributed
• Query
• Load
• Export
• Backup
• Restore
• Resize

11

Benefit #1: Amazon Redshift is fast
• Hardware optimized for I/O intensive workloads, 4 GB/sec/node
• Enhanced networking, over 1 million packets/sec/node
• Choice of storage type, instance size
• Regular cadence of autopatched improvements

12

Benefit #1: Amazon Redshift is fast
• New dense storage (HDD) instance type (Jun 2015)
   ✓ Improved memory 2x, compute 2x, disk throughput 1.5x
   ✓ Cost: Same as our prior generation!
   ✓ Performance improvement: 50%
• Enhanced I/O and commit improvements (Jan 2016)
   ✓ Performance improvement: 35%
• Memory allocation improvements (May 2016)
   ✓ Performance improvement: 60%

13

Benefit #2: Amazon Redshift is inexpensive
 DS2 (HDD)          | Price per hour for DW1.XL single node | Effective annual price per TB compressed
--------------------+---------------------------------------+------------------------------------------
 On demand          | $0.850                                | $3,725
 1-year reservation | $0.500                                | $2,190
 3-year reservation | $0.228                                | $999

 DC1 (SSD)          | Price per hour for DW2.L single node  | Effective annual price per TB compressed
--------------------+---------------------------------------+------------------------------------------
 On demand          | $0.250                                | $13,690
 1-year reservation | $0.161                                | $8,795
 3-year reservation | $0.100                                | $5,500

Pricing is simple
• Number of nodes x price/hour
• No charge for leader node
• No upfront costs
• Pay as you go
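The "number of nodes x price/hour" rule and the effective annual price per TB can be worked through with a short Python sketch. It rests on assumptions that are not stated on this slide: a year of 8,760 hours, 2 TB of compressed storage per HDD node, and 0.16 TB per SSD node (both inferred from the smallest cluster sizes on the architecture slide). The function names are hypothetical. Under these assumptions the results land close to the slide's figures but not always exactly on them, which suggests the slide rounds or uses slightly different inputs.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; ignores leap years


def annual_cluster_cost(num_nodes, price_per_hour):
    """Number of nodes x price/hour, run continuously for one year."""
    return num_nodes * price_per_hour * HOURS_PER_YEAR


def annual_price_per_tb(price_per_hour, tb_per_node):
    """Effective annual price per TB for a single node."""
    return price_per_hour * HOURS_PER_YEAR / tb_per_node


# Assumed per-node capacities (not stated on this slide).
HDD_TB_PER_NODE = 2.0
SSD_TB_PER_NODE = 0.16

hdd_on_demand = annual_price_per_tb(0.850, HDD_TB_PER_NODE)
hdd_three_year = annual_price_per_tb(0.228, HDD_TB_PER_NODE)
ssd_on_demand = annual_price_per_tb(0.250, SSD_TB_PER_NODE)

# Example: a hypothetical 4-node SSD cluster at the on-demand rate.
four_node_ssd = annual_cluster_cost(4, 0.250)
```

The three-year HDD figure is where the headline "$1,000/TB/year" comes from: 0.228 x 8,760 / 2 is just under 1,000.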

14

Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups to Amazon S3
Continuous and incremental backups across regions
Streaming restore
[Diagram: Amazon S3 in Region 1 and Amazon S3 in Region 2]

15

Benefit #3: Amazon Redshift is fully managed
Fault tolerance
Disk failures
Node failures
Network failures
Availability Zone/region level disasters
[Diagram: Amazon S3 in Region 1 and Amazon S3 in Region 2]

16

Benefit #4: Security is built in
• Load encrypted from Amazon S3
• SSL to secure data in transit
• ECDHE perfect forward secrecy
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks and in Amazon S3 encrypted
• Block key, cluster key, master key (AES-256)
• On-premises HSM and AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
Diagram labels: Customer VPC; Internal VPC; JDBC/ODBC; 10 GigE (HPC); Ingestion; Backup; Restore

17

Benefit #5: We innovate quickly
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
DUB (4/25)
SOC1/2/3 (5/8)
Unload Encrypted Files
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
SHA1 Builtin (7/15)
4 byte UTF-8 (7/18)
Sharing snapshots (7/18)
Statement Timeout (7/22)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress (8/9)
Resource Level IAM (8/9)
PCI (8/22)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13)
Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)
Resize progress indicator & Cluster Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50 slots, COPY from EMR, ECDHE ciphers (4/22)
3 new regex features, Unload to single file, FedRAMP (5/6)
Rename Cluster (6/2)
Copy from multiple regions, percentile_cont, percentile_disc (6/30)
Free Trial (7/1)
pg_last_unload_count (9/15)
AES-128 S3 encryption (9/29)
UTF-16 support (9/29)
Well over 125 new features added since launch
Release every two weeks
Automatic patching

18

Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User-defined functions
• Machine learning
• Data science

19

Benefit #7: Amazon Redshift has a large ecosystem
Data integration
Systems integrators
Business intelligence

20

Benefit #8: Service-oriented architecture
Amazon EMR
Amazon DynamoDB
Amazon EC2/SSH
Amazon S3
Amazon RDS/Amazon Aurora
Amazon Redshift
Amazon Kinesis
Amazon ML
AWS Data Pipeline
Amazon CloudSearch
Amazon Mobile Analytics

21

Recent launches
Categories: Performance; Ease of use; Security; Analytics and functionality; SOA
Dynamic WLM parameters
Queue hopping for timed-out queries
Merge rows from staging to prod. table
2x improvement in query throughput
10x latency improvement for UNION ALL queries
Bzip2 format for ingestion
Table level restore
10x improvement in vacuum perf.
Default access privileges
Tag-based AWS IAM access
IAM roles for COPY/UNLOAD
SAS connector enhancements, Implicit conversion of SAS queries to Amazon Redshift
DMS support from OLTP sources
Enhanced data ingestion from Kinesis Firehose
Improved data schema conversion to Amazon ML

22

Use cases

23

NTT Docomo: Japan’s largest mobile service provider
• 68 million customers
• Tens of TBs per day of data across a mobile network
• 6 PB of total data (uncompressed)
• Data science for marketing operations, logistics, and so on
• Greenplum on premises
• Scaling challenges
• Performance issues
• Need same level of security
• Need for a hybrid environment

24

NTT Docomo: Japan’s largest mobile service provider
[Diagram labels: Data Source; ET; AWS Direct Connect; Client; Forwarder; Loader; State management; Sandbox; Amazon Redshift; S3]
• 125 node DS2.8XL cluster
• 4,500 vCPUs, 30 TB RAM
• 2 PB compressed
• 10x faster analytic queries
• 50% reduction in time for new BI application deployment
• Significantly less operations overhead
25

Nasdaq: powering 100 marketplaces in 50 countries
• Orders, quotes, trade executions, market “tick” data from 7 exchanges
• 7 billion rows/day
• Analyze market share, client activity, surveillance, billing, and so on
• Microsoft SQL Server on premises
• Expensive legacy DW ($1.16 M/yr.)
• Limited capacity (1 yr. of data online)
• Needed lower TCO
• Must satisfy multiple security and regulatory requirements
• Similar performance

26

Nasdaq: powering 100 marketplaces in 50 countries
• 23 node DS2.8XL cluster
• 828 vCPUs, 5 TB RAM
• 368 TB compressed
• 2.7 T rows, 900 B derived
• 8 tables with 100 B rows
• 7 man-month migration
• ¼ the cost, 2x storage, room to grow
• Faster performance, very secure
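The cluster totals quoted for Nasdaq and NTT Docomo can be cross-checked with simple per-node arithmetic. The Python sketch below assumes DS2.8XL per-node figures of 36 vCPUs, 244 GiB of RAM, and 16 TB of storage; these are not given on the slides and are labeled as assumptions in the code. `NodeSpec` and `cluster_totals` are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    vcpus: int
    ram_gib: int
    storage_tb: int


# Assumed DS2.8XL per-node figures (not stated on the slides).
DS2_8XL = NodeSpec(vcpus=36, ram_gib=244, storage_tb=16)


def cluster_totals(num_nodes, spec):
    """Scale per-node resources linearly by the number of compute nodes."""
    return {
        "vcpus": num_nodes * spec.vcpus,
        "ram_tib": num_nodes * spec.ram_gib / 1024,
        "storage_tb": num_nodes * spec.storage_tb,
    }


nasdaq = cluster_totals(23, DS2_8XL)    # 23-node cluster from this slide
docomo = cluster_totals(125, DS2_8XL)   # 125-node cluster from the NTT Docomo slide
```

Under these assumptions the vCPU and storage totals match the slides exactly (828 vCPUs and 368 TB; 4,500 vCPUs and 2,000 TB, i.e. 2 PB), and the RAM totals come out near the rounded 5 TB and 30 TB figures.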

27

Resources
Greg Khairallah | gregkh@amazon.com
Detail pages
• http://aws.amazon.com/redshift
• https://aws.amazon.com/marketplace/redshift/
Best practices
• http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html

28

Thank you!

Getting Started with Amazon Redshift - AWS July 2016 Webinar Series