Learn tuning best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve query performance and overall database performance. This session explains how to migrate from existing data warehouses, create an optimized schema, efficiently load data, use workload management, tune your queries, and use Amazon Redshift's interleaved sorting features. Finally, learn how to use these best practices to give your entire organization access to analytic insights at scale.
Presented by: Alex Sinner, Solutions Architecture PMO, Amazon Web Services
Customer Guest: Luuk Linssen, Product Manager, Bannerconnect
2. Fast, fully managed, petabyte-scale data warehousing for less than $1,000/TB/year
Amazon Redshift
3. Amazon Redshift delivers performance
“[Amazon] Redshift is twenty times faster than Hive.” (5x–20x reduction in query times)
“Queries that used to take hours came back in seconds. Our analysts are orders of magnitude more productive.” (20x–40x reduction in query times)
“…[Amazon Redshift] performance has blown away everyone here (we generally see 50–100x speedup over Hive).”
“Team played with [Amazon] Redshift today and concluded it is ****** awesome. Un-indexed complex queries returning in < 10s.”
“Did I mention it's ridiculously fast? We'll be using it immediately to provide our analysts an alternative to Hadoop.”
“We saw… 2x improvement in query times.”
“We regularly process multibillion row datasets and we do that in a matter of hours.”
7. Summary of Best Practices
Table Design
✓ Choose the best Sort Key
✓ Choose the best Distribution Key
✓ Compression Encodings – use automatic compression
Loading Data
✓ Use COPY command (not INSERT)
✓ Load multiple, compressed files in a single COPY command, in Sort Key order
8. Use multiple input files to maximize throughput
COPY command
Each slice loads one file at a time
A single input file means only one slice is ingesting data
Instead of full bandwidth, you only get 1/16th (on a cluster with 16 slices)
9. Use multiple input files to maximize throughput
COPY command
Use at least as many input files as you have slices
With 16 input files, all slices are working, so you maximize throughput from S3
Scale linearly as you add nodes
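As a sketch of the pattern above (bucket, prefix, and role names here are placeholders), split the source data into multiple gzipped files under one S3 prefix and issue a single COPY against that prefix so every slice can ingest in parallel:

COPY big_tab
FROM 's3://my-bucket/load/big_tab/part_'   -- prefix matching e.g. 16 gzipped part files
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'   -- placeholder role ARN
GZIP
DELIMITER '|';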
10. Primary keys and manifest files
Amazon Redshift doesn’t enforce primary key constraints
• If you load data multiple times, Amazon Redshift won’t complain
• If you declare primary keys, the optimizer will expect the data to be unique
Use manifest files to control exactly what is loaded and how to respond if input files are missing
• Define a JSON manifest on Amazon S3
• Ensures that the cluster loads exactly what you want
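A minimal sketch of that approach (bucket, file, and role names are placeholders): the manifest lists every file explicitly, and "mandatory": true makes the COPY fail rather than silently skip a missing file:

-- contents of s3://my-bucket/load/big_tab.manifest (JSON, shown here as comments)
-- {
--   "entries": [
--     {"url": "s3://my-bucket/load/big_tab/part_00.gz", "mandatory": true},
--     {"url": "s3://my-bucket/load/big_tab/part_01.gz", "mandatory": true}
--   ]
-- }
COPY big_tab
FROM 's3://my-bucket/load/big_tab.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
GZIP
MANIFEST;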
11. Data hygiene
Analyze tables regularly
• After every load for popular columns
• Weekly for all columns
• Check SVV_TABLE_INFO(stats_off) for stale stats
• Check STL_ALERT_EVENT_LOG for missing stats
Vacuum tables regularly
• Weekly is a good target
• Use the number of unsorted blocks as a trigger
• Check SVV_TABLE_INFO(unsorted, empty)
• A deep copy might be faster when a large fraction is unsorted (at roughly 20% unsorted or more, a deep copy is usually faster)
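A rough sketch of that routine (the thresholds and table name are illustrative): use SVV_TABLE_INFO to find tables that need attention, then run ANALYZE and VACUUM on just those tables:

-- tables with stale statistics or a high unsorted percentage
SELECT "table", stats_off, unsorted, empty
FROM svv_table_info
WHERE stats_off > 10 OR unsorted > 20;

-- refresh statistics and re-sort/reclaim space for an affected table
ANALYZE big_tab;
VACUUM big_tab;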
12. Automatic compression is a good thing (mostly)
Better performance, lower costs
Samples data automatically when you COPY into an empty table
• Samples up to 100,000 rows and picks the optimal encoding
For a regular ETL process using temp or staging tables: turn off automatic compression
• Use ANALYZE COMPRESSION to determine the right encodings
• Bake those encodings into your DDL
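A minimal sketch of that staging pattern (table names, encodings, and S3 paths are placeholders): run ANALYZE COMPRESSION once on representative data, bake the recommended encodings into the table DDL, and disable sampling on the recurring COPY:

-- one-off: get encoding recommendations from a sample of already-loaded data
ANALYZE COMPRESSION big_tab;

-- bake the recommendations into the staging table DDL
CREATE TEMP TABLE stage_big_tab (
  cust_id INT           ENCODE lzo,
  prod_id INT           ENCODE lzo,
  amt     DECIMAL(12,2) ENCODE lzo
);

-- recurring load: skip automatic compression sampling
COPY stage_big_tab
FROM 's3://my-bucket/load/big_tab/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
GZIP
COMPUPDATE OFF;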
13. Be careful when compressing your sort keys
Zone maps store min/max per block
After we know which block(s) contain the range, we know which row offsets to scan
Highly compressed sort keys mean many rows per block
You’ll scan more data blocks than you need
If your sort keys compress significantly more than your data columns, you might want to skip compression of the sort key column(s)
Check SVV_TABLE_INFO(skew_sortkey1)
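A minimal sketch of that trade-off (table and column names are placeholders): leave the sort key column uncompressed (RAW) so zone maps stay selective, while still compressing the wider data columns:

CREATE TABLE big_tab (
  event_ts TIMESTAMP    ENCODE raw,   -- sort key left uncompressed
  cust_id  INT          ENCODE lzo,
  payload  VARCHAR(256) ENCODE lzo
)
SORTKEY (event_ts);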
14. Keep your columns as narrow as possible
• Buffers are allocated based on the declared column width
• Columns wider than needed waste memory
• Fewer rows fit into memory; increased likelihood of queries spilling to disk
• Check SVV_TABLE_INFO(max_varchar)
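For example, a quick check for over-declared VARCHAR columns (the threshold here is arbitrary):

SELECT "table", max_varchar
FROM svv_table_info
WHERE max_varchar > 1000
ORDER BY max_varchar DESC;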
16. IAM Role support for COPY and UNLOAD
Use IAM roles to securely grant permission to COPY data from Amazon S3 or DynamoDB and to UNLOAD data to Amazon S3
You can associate up to 10 IAM roles with a cluster
Use the IAM role ARN in the COPY/UNLOAD command
Restrict roles to specific Redshift users
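A minimal sketch (role ARN, bucket, and table/column names are placeholders):

COPY big_tab
FROM 's3://my-bucket/load/big_tab/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
GZIP;

UNLOAD ('SELECT * FROM big_tab WHERE amt > 100')
TO 's3://my-bucket/unload/big_tab_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
GZIP;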
17. New SQL functions
We add SQL functions regularly to expand Amazon Redshift’s query capabilities
Added 25+ window and aggregate functions since launch, including:
• LISTAGG
• [APPROXIMATE] COUNT
• DROP IF EXISTS, CREATE IF NOT EXISTS
• REGEXP_SUBSTR, _COUNT, _INSTR, _REPLACE
• PERCENTILE_CONT, _DISC, MEDIAN
• PERCENT_RANK, RATIO_TO_REPORT
We’ll continue iterating but also want to enable you to write your own
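Two quick illustrations (table and column names are placeholders): LISTAGG collapses a group into a delimited string, and APPROXIMATE COUNT(DISTINCT ...) gives a fast cardinality estimate:

SELECT cust_id,
       LISTAGG(prod_id, ',') WITHIN GROUP (ORDER BY prod_id) AS products
FROM big_tab
GROUP BY cust_id;

SELECT APPROXIMATE COUNT(DISTINCT cust_id) FROM big_tab;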
18. Scalar user-defined functions (UDFs)
You can write UDFs using Python 2.7
• Syntax is largely identical to PostgreSQL UDF syntax
• System and network calls within UDFs are prohibited
Comes with Pandas, NumPy, and SciPy pre-installed
• You’ll also be able to import your own libraries for even more flexibility
19. Scalar UDF example
CREATE FUNCTION f_hostname (url VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
import urlparse
return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
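Once created, the function can be called like any built-in scalar function (the weblogs table here is hypothetical):

SELECT f_hostname(url) AS host, COUNT(*) AS hits
FROM weblogs
GROUP BY 1
ORDER BY 2 DESC;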
20. Scalar UDF examples from partners
http://www.looker.com/blog/amazon-redshift-user-defined-functions
https://www.periscope.io/blog/redshift-user-defined-functions-python.html
21. 1-click deployment to launch in multiple regions around the world
Pay-as-you-go pricing with no long-term contracts required
Advanced Analytics | Business Intelligence | Data Integration
23. Compound sort keys
Records in Amazon Redshift are stored in blocks
For this illustration, let’s assume that four records fill a block
Records with a given cust_id are all in one block
However, records with a given prod_id are spread across four blocks
[Illustration: rows sorted by the compound key (cust_id, prod_id); a 4×4 grid of cust_id vs. prod_id blocks shows each cust_id confined to a single block while each prod_id is spread across all four blocks]
SELECT SUM(amt) FROM big_tab WHERE cust_id = 1234;
SELECT SUM(amt) FROM big_tab WHERE prod_id = 5678;
24. Interleaved sort keys
[Illustration: the same 4×4 grid of cust_id vs. prod_id blocks, now sorted with an interleaved key; each cust_id and each prod_id value is spread across only two blocks]
Records with a given cust_id are spread across two blocks
Records with a given prod_id are also spread across two blocks
Data is sorted in equal measures for both keys
25. Usage
New keyword INTERLEAVED when defining sort keys
• Existing syntax will still work and behavior is unchanged
• You can choose up to 8 columns to include and can query with any or all of them
No change needed to queries
[[ COMPOUND | INTERLEAVED ] SORTKEY ( column_name [, ...] ) ]
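A minimal sketch of both variants (table and column names are placeholders):

-- compound: best when queries usually filter on the leading column(s)
CREATE TABLE big_tab_compound (
  cust_id INT,
  prod_id INT,
  amt     DECIMAL(12,2)
)
COMPOUND SORTKEY (cust_id, prod_id);

-- interleaved: gives equal weight to each key, good for varied single-column filters
CREATE TABLE big_tab_interleaved (
  cust_id INT,
  prod_id INT,
  amt     DECIMAL(12,2)
)
INTERLEAVED SORTKEY (cust_id, prod_id);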
28. Typical ETL/ELT on legacy data warehouse
One file per table, maybe a few if too big
Many updates (“massage” the data)
Every job clears the data, then loads it
Counts on the primary key to block double loads
High concurrency of load jobs
29. Two questions to ask
Why do you do what you do?
• Many times, users don’t even know
What is the customer need?
• Many times, needs do not match current practice
• You might benefit from adding other AWS services
30. Open-source tools
https://github.com/awslabs/amazon-redshift-utils
Admin scripts
• Collection of utilities for running diagnostics on your cluster.
Admin views
• Collection of utilities for managing your cluster, generating schema DDL, and so on
Column encoding utility
• Gives you the ability to apply optimal column encoding to an established schema with data already loaded
Analyze and vacuum utility
• Gives you the ability to automate VACUUM and ANALYZE operations
Unload and copy utility
• Helps you to migrate data between Amazon Redshift clusters or databases
35. challenges
the industry is becoming more advanced
processing more data than ever!
from 700 GB to 1.9 TB on a daily basis
need for near-real-time data processing
don’t want to wait hours for new insights
38. outcome
increase in speed
10 GB report went from minutes to seconds
increase in detail
all reports are now created on an hourly instead of a daily basis
completely new audience insights
45GB per report per account
40. lessons learned
use the AWS best practices!
✓ load compressed files instead of uncompressed files
✓ make sure the load data is split into multiple files, where the number of files is a multiple of the number of slices in the cluster
✓ make sure to choose the best distribution and sort keys for your tables