AWS: Redshift overview
PRESENTATION PREPARED BY VOLODYMYR ROVETSKIY
Agenda
What is AWS Redshift
Amazon Redshift Pricing
AWS Redshift Architecture
•Data Warehouse System Architecture
•Internal Architecture and System Operation
Query Planning and Designing Tables
•Query Planning And Execution Workflow
•Columnar Storage
•Zone Maps
•Compression
•Referential Integrity
Data Location in Redshift
•The Sort Key
•The Distribution Key
Workload Management (WLM)
Loading Data
•What is Amazon S3
•Data Loading from Amazon S3
•COPY from Amazon S3
Redshift table maintenance operations
•ANALYZE
•VACUUM
Amazon Redshift Snapshots
Amazon Redshift Security
Monitoring Cluster Performance
Useful resources
Conclusion
What is Amazon Redshift
Cluster architecture
Columnar storage
Zone maps
Compression
Read optimized
No referential integrity by design
Redshift is the Amazon cloud data warehousing service; it can interact with Amazon EC2 and S3
components but is managed separately using the Redshift tab of the AWS console. As a cloud-based
system it is rented by the hour from Amazon, and broadly the more storage you provision the more you
pay.
Amazon Redshift features
Amazon Redshift Pricing
You pay an hourly rate based on the type
and number of nodes in your cluster.
There is a discount of up to 75% over On-Demand
rates when you commit to using Amazon
Redshift for a 1- or 3-year term.
Prices include two additional copies of your
data, one on the cluster nodes and one in
Amazon S3.
Amazon Redshift takes care of backup,
durability, availability, security, monitoring,
and maintenance.
Prices depend on the chosen Region.
Dense Storage (DS) nodes allow you to
create large data warehouses using hard disk
drives (HDDs) for a low price point.
Dense Compute (DC) nodes allow you to
create high performance data warehouses
using fast CPUs, large amounts of RAM and
solid-state disks (SSDs).
Data Warehouse System Architecture
Leader node
• Stores metadata
• Manages communications with client programs and compute nodes
• Manages distributing data to the slices on compute nodes
• Develops and distributes execution plans for compute nodes
Compute nodes
• Execute the query segments in parallel and send results back to the leader node for final
aggregation
• Each compute node has its own dedicated CPU, memory, and attached disk storage
• User data is stored on the compute nodes
Node slices
• Each slice is allocated a portion of the node's memory and disk space
• The slices work in parallel to complete the operation.
• The number of slices per node is determined by the node size of the cluster.
• The rows of a table are distributed to the node slices according to the distribution key
Client applications
• Amazon Redshift is based on industry-standard PostgreSQL
The following diagram shows a high level
view of internal components and
functionality of the Amazon Redshift data
warehouse.
Internal Architecture
and System Operation
Query Planning And Execution Workflow
The query planning and execution workflow
follows these steps:
• 1. The leader node receives the query and parses the SQL.
• 2. The parser produces an initial query tree that is a logical
representation of the original query. Amazon Redshift then
inputs this query tree into the query optimizer.
• 3. The optimizer evaluates and if necessary rewrites the query
to maximize its efficiency.
• 4. The optimizer generates a query plan for the execution with
the best performance.
• 5. The execution engine translates the query plan
into compiled C++ code.
• 6. The compute nodes execute the compiled code segments in
parallel.
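To inspect the plan the optimizer produces before the compute nodes run it, you can use EXPLAIN; a minimal sketch, assuming the sales table defined later in this deck:
EXPLAIN
SELECT region, COUNT(*)
FROM sales
WHERE date >= '2013-06-01'
GROUP BY region;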
Columnar Storage
Pic.1 shows how records from database tables are typically stored into disk blocks by row.
Pic.2 shows how with columnar storage, the values for each column are stored sequentially into disk blocks.
Columnar storage optimizes analytic query performance because:
• it reduces the overall disk I/O requirements
• it reduces the amount of data you need to load from disk
• each block holds the same type of data
• block data can use a compression scheme selected specifically for the column data type
Zone Maps
The zone map is held separately from the block, like
an index
The zone map holds only two data points per block,
the highest and lowest values in the block.
Redshift uses the zone map when executing queries,
and excludes the blocks that the zone map indicates
won’t be returned by the WHERE clause filter.
Zone maps filter data blocks efficiently if the filtered
columns are used as the sort key.
Compression
Benefits of Compression
•Reduces the size of data when it is stored or read from storage
•Conserves storage space
•Reduces the amount of disk I/O
•Improves query performance
Redshift recommendations and advice:
•Use the COPY command to apply automatic compression (COMPUPDATE
ON)
•Produce a report with the suggested column encoding schemes for
the tables analyzed (ANALYZE COMPRESSION)
•Compression type cannot be changed for a column after the table is
created
•Highly compressed sort keys mean many rows per block, so you'll scan
more data blocks than you need
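For illustration, a column encoding can be declared when the table is created, and ANALYZE COMPRESSION reports suggested encodings for data already loaded; a minimal sketch with a hypothetical product table:
CREATE TABLE product(
  product_id INTEGER,
  product_name CHAR(20) ENCODE bytedict);

ANALYZE COMPRESSION product;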
Referential integrity. Redshift unsupported features:
Table partitioning
Tablespaces
Constraints:
◦ Unique
◦ Foreign key
◦ Primary key
◦ Check constraints
◦ Exclusion constraints
Indexes
Collations
Stored procedures
Triggers
Table functions
Sequences
Full text search
Exotic data types (arrays, JSON, Geospatial types, etc.)
Important:
Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon
Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if
your ETL process or some other process in your application enforces their integrity.
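Although these constraints are not enforced, primary and foreign keys can still be declared so the planner can use them as hints; a minimal sketch with hypothetical tables:
CREATE TABLE dim_region(
  region_id INTEGER NOT NULL PRIMARY KEY,
  region_name VARCHAR(50));

CREATE TABLE fact_sales(
  sale_id BIGINT NOT NULL PRIMARY KEY,
  region_id INTEGER REFERENCES dim_region(region_id),
  amount DECIMAL(12,2));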
Data Location in Redshift
The Sort key
• Each table can have a single Sort Key – a compound key, comprised of 1 to 400
columns from the table
• Redshift stores data on disk in Sort Key order
• Sort keys should be selected based on how the table is used:
• Columns that are used to join to other tables should be included in the
sort key;
• Date type columns that are used in filtering operations should be included;
• Redshift stores metadata about each data block, including the min and max of
each column value – using sortkey, Redshift can skip entire blocks when
answering a query;
Sort keys and Zone Maps
CREATE TABLE SOME_TABLE ( SALESID INTEGER NOT NULL,
DATE DATETIME NOT NULL )
SELECT COUNT(*) FROM SOME_TABLE
WHERE DATE = '09-JUNE-2013'
CREATE TABLE SOME_TABLE ( SALESID INTEGER NOT NULL,
DATE DATETIME NOT NULL )
SORTKEY (DATE)
SELECT COUNT(*) FROM SOME_TABLE
WHERE DATE = '09-JUNE-2013'
The Sort keys – Single Column
Table is sorted by 1 column [ SORTKEY ( date ) ].
Best for:
• Queries that use 1st column (i.e. date) as primary filter
• Can speed up joins and group by
• Quickest to VACUUM
Example:
create table sales(
date datetime not null,
region varchar not null,
country varchar not null)
distkey(date)
sortkey(date);
The Sort keys – Compound
Table is sorted by 1st column , then 2nd column etc. [ SORTKEY COMPOUND ( date, region, country) ].
Best for:
• Queries that use 1st column as primary filter, then other columns
• Can speed up joins and group by
• Slower to VACUUM
Example:
create table sales(
date datetime not null,
region varchar not null,
country varchar not null)
distkey(date)
sortkey compound (date, region, country);
The Sort keys – Interleaved
Equal weight is given to each column. [ SORTKEY INTERLEAVED ( date, region, country) ]
Best for:
• Queries that use different columns in filter
• Queries get faster the more columns used in the filter
• The Slowest to VACUUM
Example:
create table sales(
date datetime not null,
region varchar not null,
country varchar not null)
distkey(date)
sortkey interleaved(date, region, country);
Data Location in Redshift
The Distribution Key
•Redshift will distribute and replicate data between compute
nodes;
•By default, data will be spread evenly across all compute node
slices (EVEN distribution)
•The even distribution of data across the nodes is vital to ensuring
consistent query performance
•If data is denormalised and does not participate in joins, then an
EVEN distribution won’t be problematic
•Alternatively a Distribution key can be provided (KEY distribution)
•The Distribution key helps distribute data across a node’s slices
•The Distribution key is defined on a per-table basis
•The Distribution Key is comprised of only a single column
Distribution styles by example
• Large Fact tables
• Large dimension tables
• Medium dimension tables
(1K – 2M)
• Tables with no joins or group by
• Small dimension tables (<1000)
Data Distribution
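For illustration, the three distribution styles can be declared as follows; the table names, columns, and the pairing of table type with style are assumptions, not taken from the slide:
-- KEY distribution: rows with the same distribution key value land on the same slice,
-- which collocates joins (e.g. a large fact table joined to a large dimension)
CREATE TABLE fact_sales(
  sale_id BIGINT,
  customer_id INTEGER,
  amount DECIMAL(12,2))
DISTSTYLE KEY DISTKEY(customer_id);

-- ALL distribution: a full copy of the table on every node (e.g. a small or medium dimension)
CREATE TABLE dim_country(
  country_id INTEGER,
  country_name VARCHAR(50))
DISTSTYLE ALL;

-- EVEN distribution (the default): round-robin across slices, for tables with no joins or group by
CREATE TABLE audit_log(
  event_time TIMESTAMP,
  message VARCHAR(256))
DISTSTYLE EVEN;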
Workload Management (WLM)
WLM allows you to:
• Manage and adjust query concurrency
• Increase query concurrency up to 15 in a queue
• Define user groups and query groups
• Segregate short and long running queries
• Help improve performance of individual queries
Be aware:
• Query workload is distributed to every compute node
• Increasing concurrency may not always help due to resource
contention (CPU, Memory, I/O)
• Total throughput may increase by letting one query complete first
and allowing other queries to wait
Default WLM configuration:
• 1 queue with a concurrency of 5
• You can define up to 8 queues with a total concurrency of 15
• Redshift also has an internal superuser queue
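Queries can be steered to a WLM queue and given more memory at the session level; a minimal sketch, assuming a queue has been configured for a query group named 'reports':
SET query_group TO 'reports';      -- route the following queries to the matching WLM queue
SET wlm_query_slot_count TO 2;     -- temporarily claim two slots (more memory) in that queue
SELECT COUNT(*) FROM sales;
RESET wlm_query_slot_count;
RESET query_group;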
Short Description of Amazon Simple
Storage Service (S3)
Cloud storage for web applications
Origin store for content distribution
Staging area and persistent store for Big Data
analytics
Backup and archive target for databases
To use Amazon S3, you need an AWS account
Before you can store data in Amazon S3, you must
create a bucket.
Add an object to the created bucket (a text file, a
photo, a video and so forth)
When objects are added to the bucket you can view
and manage them
Data Loading from Amazon S3
Best practices and recommendations:
• S3 bucket and your cluster must be created in the
same region
• Split your data on S3 into multiple files
• Use a COPY Command to load data
• Load your data in sort key order to avoid needing to
vacuum
• Organize your data as a sequence of time-series
tables
• Run the VACUUM command whenever you add,
delete, or modify a large number of rows
• Run the ANALYZE command whenever you’ve made
a non-trivial number of changes to update table
statistics
COPY from Amazon S3
Syntax Parameters
FROM - the path to the Amazon S3 objects that contain the data
MANIFEST - The manifest is a text file in JSON format that lists the URL of each file
that is to be loaded from Amazon S3. The URL includes the bucket name and full
object path for the file. The files that are specified in the manifest can be in different
buckets, but all the buckets must be in the same region as the Amazon Redshift
cluster.
ENCRYPTED - Specifies that the input files on Amazon S3 are encrypted using client-
side encryption.
REGION [AS] 'aws-region' - Specifies the AWS region where the source data is
located.
Examples
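The examples below are a minimal sketch; the bucket, prefix, manifest file, and IAM role ARN are placeholders:
COPY customer
FROM 's3://mybucket/mydata/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRedshiftRole'
REGION 'us-east-1';

-- Using a manifest file that lists the exact objects to load
COPY customer
FROM 's3://mybucket/mydata.manifest'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST;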
Redshift table maintenance operations
ANALYZE: The command used to capture statistical information about a table for use by the query planner.
•Run before running queries.
•Run against the database after a regular load or update cycle.
•Run against any new tables that you create.
•Consider running ANALYZE operations on different schedules for different types of tables and columns, depending on their use in
queries and their propensity to change.
•You do not need to analyze all columns in all tables regularly or on the same schedule. Analyze the columns that are frequently used in
the following:
•Sorting and grouping operations
•Joins
•Query predicates
This command can analyze the whole table or specified columns:
ANALYZE <TABLE NAME>;
ANALYZE <TABLE NAME> (<COLUMN1>,<COLUMN2>);
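For example, using the sales table created earlier in this deck:
ANALYZE sales;
ANALYZE sales(date, region);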
Redshift table maintenance operations
VACUUM: A process to physically reorganize tables after load
activity.
•Can be run in 4 modes:
•VACUUM FULL - reclaims space and re-sorts;
•VACUUM DELETE ONLY - reclaims space but does not re-sort
•VACUUM SORT ONLY - re-sorts but does not reclaim space
•VACUUM REINDEX - used for INTERLEAVED sort keys, re-analyzes
sort keys and then runs FULL VACUUM
VACUUM is an I/O intensive operation and can take time to run. To
minimize the impact of VACUUM:
•Run VACUUM on a regular schedule during time periods when
you expect minimal activity on the cluster
•Use TRUNCATE instead of DELETE where possible
•TRUNCATE or DROP test tables
•Perform a Deep Copy instead of VACUUM
•Load data in sort order to remove the need for VACUUM
TO threshold PERCENT - the threshold above which VACUUM
skips the sort phase and the target threshold for reclaiming
space in the delete phase. If you include the
TO threshold PERCENT parameter, you must also specify a table
name. This parameter can't be used with REINDEX.
For example, if you specify 75 for threshold, VACUUM skips the
sort phase if 75 percent or more of the table's rows are already
in sort order. For the delete phase, VACUUM sets a target of
reclaiming disk space such that at least 75 percent of the table's
rows are not marked for deletion following the vacuum.
The threshold value must be an integer between 0 and 100. The
default is 95.
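A few illustrative invocations (the table names are hypothetical):
VACUUM;                                   -- full vacuum of the whole database at the default 95 percent threshold
VACUUM sales TO 100 PERCENT;              -- always reclaim space and re-sort the sales table
VACUUM SORT ONLY sales TO 75 PERCENT;     -- re-sort only if less than 75 percent of the rows are already sorted
VACUUM DELETE ONLY sales TO 75 PERCENT;   -- reclaim space so at least 75 percent of remaining rows are not marked for deletion
VACUUM REINDEX listing;                   -- for interleaved sort keys: re-analyze the sort key, then run a full vacuum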
Amazon Redshift Snapshots
Automated Snapshots
•enabled by default when the cluster is created
•taken periodically from the cluster (every eight hours or every 5 GB of data changes)
•deleted at the end of a retention period (1 day by default)
•can be disabled (set the retention period to 0)
Manual Snapshots
•can be taken whenever you want
•are never deleted automatically
•accrue storage charges
Excluding Tables From Snapshot
•To create a no-backup table, include the BACKUP NO parameter when you create the table
Copying Snapshots to Another Region
•Copying snapshots across regions incurs data transfer charges
Restoring a Table from a Snapshot (feature added March 10, 2016)
•You can restore a table only to the current, active running cluster and from a snapshot that was taken of that cluster.
•You can restore only one table at a time.
•You cannot restore a table from a cluster snapshot that was taken prior to a cluster being resized.
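A minimal sketch of excluding a table from snapshots (the table and columns are hypothetical):
CREATE TABLE staging_events(
  event_id BIGINT,
  payload VARCHAR(1024))
BACKUP NO;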
Amazon Redshift Security
Cluster Security: Controlling access to Redshift cluster for management
• Cluster runs within a Virtual Private Cloud (VPC) managed by the Amazon Redshift
service
Connection security: Controlling clients that can connect to Redshift cluster
•. Users can only connect to the cluster using an ODBC or JDBC connections. You may
optionally only permit connections to the Amazon Redshift cluster from a VPC you
control.
Database object security: Controlling which users have access to which database objects
• At the database security level Amazon Redshift uses the Postgres security model, with
user name / password authentication. Database user accounts are configured
separately from Redshift’s management security using SQL commands.
Data Security: encryption of data at rest (load data, table data, and backup data)
• You can encrypt data that is loaded into Amazon Redshift, encrypt the data stored in
the Amazon Redshift tables, and encrypt the backups.
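At the database object level, access is managed with standard SQL following the Postgres model; a minimal sketch with hypothetical user, group, and table names:
CREATE USER report_user PASSWORD 'Str0ngPassw0rd';
CREATE GROUP reporting WITH USER report_user;
GRANT USAGE ON SCHEMA public TO GROUP reporting;
GRANT SELECT ON TABLE sales TO GROUP reporting;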
Monitoring Cluster Performance
Amazon CloudWatch metrics help you monitor physical
aspects of your cluster, such as CPU utilization, latency,
and throughput.
Performance data helps you monitor database activity
and performance. This data is aggregated in the
Amazon Redshift console to help you easily correlate
what you see in Amazon CloudWatch metrics with
query and load performance data.
Useful resources to learn more about
Redshift
Redshift Documentation
• https://aws.amazon.com/redshift
• http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html
Open Source Scripts and Tools
• https://github.com/awslabs/amazon-redshift-utils
• http://www.aginity.com/redshift
Conclusion
Amazon Redshift’s features
•Optimized for Data Warehousing – It uses columnar storage, data compression, and
zone maps to reduce the amount of IO needed to perform queries. Redshift has a massively
parallel processing (MPP) architecture, parallelizing and distributing SQL operations to take
advantage of all available resources.
•Scalable – With a few clicks of the AWS Management Console or a simple API call, you can
easily scale the number of nodes in your data warehouse up or down as your performance
or capacity needs change.
•No Up-Front Costs – You pay only for the resources you provision. You can choose On-
Demand pricing with no up-front costs or long-term commitments, or obtain significantly
discounted rates with Reserved Instance pricing.
•Fault Tolerant – Amazon Redshift has multiple features that enhance the reliability of
your data warehouse cluster. All data written to a node in your cluster is automatically
replicated to other nodes within the cluster and all data is continuously backed up to
Amazon S3.
•SQL - Amazon Redshift is a SQL data warehouse and uses industry standard ODBC and
JDBC connections and Postgres drivers.
•Isolation - Amazon Redshift enables you to configure firewall rules to control network
access to your data warehouse cluster.
•Encryption – With just a couple of parameter settings, you can set up Amazon Redshift to
use SSL to secure data in transit and hardware-accelerated AES-256 encryption for data at
rest.
Diagram: Redshift – Optimized for Data Warehousing, Scalable, No Up-Front Costs, Fault Tolerant, Secure, SQL Standards
Jeff Bezos reacted to my payment :-))
Editor's Notes
  1. Amazon Redshift data warehouse is a cloud based massively parallel processing (MPP), columnar database that consists of multiple computers (nodes). It delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Redshift uses standard PostgreSQL ODBC drivers, allowing the usage of wide range of familiar SQL clients. Most common tasks associated with provisioning, configuring and monitoring a data warehouse are automated within Redshift for the ease of administration.
  2. Dense Storage Node Types
Node Size | vCPU | ECU | RAM (GiB) | Slices Per Node | Storage Per Node | Node Range | Total Capacity
ds1.xlarge | 2 | 4.4 | 15 | 2 | 2 TB HDD | 1–32 | 64 TB
ds1.8xlarge | 16 | 35 | 120 | 16 | 16 TB HDD | 2–128 | 2 PB
ds2.xlarge | 4 | 13 | 31 | 2 | 2 TB HDD | 1–32 | 64 TB
ds2.8xlarge | 36 | 119 | 244 | 16 | 16 TB HDD | 2–128 | 2 PB
Dense Compute Node Types
Node Size | vCPU | ECU | RAM (GiB) | Slices Per Node | Storage Per Node | Node Range | Total Capacity
dc1.large | 2 | 7 | 15 | 2 | 160 GB SSD | 1–32 | 5.12 TB
dc1.8xlarge | 32 | 104 | 244 | 32 | 2.56 TB SSD | 2–128 | 326 TB
Cluster architecture: Clients connect via existing protocols to the Leader Node. The leader node develops a query plan and may generate and compile C++ code to be executed by the compute nodes. The leader node will distribute work across compute nodes using Distribution Keys (more later). Compute nodes receive work from the leader node and may transmit data amongst themselves to answer the query. The leader aggregates the results and returns them to the client. The leader can distribute bulk data loads across compute nodes. http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
  3. For example, suppose a table contains 100 columns. A query that uses five columns will only need to read about five percent of the data contained in the table. This savings is repeated for possibly billions or even trillions of records for large databases. In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well. Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform any database loading or other operations that are part of query execution. Drawback: after an update, insert, or delete you need to run VACUUM and ANALYZE, so Redshift is heavy for operations that modify data.
  4. Compression is a column-level operation that reduces the size of data when it is stored. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance. By default, Amazon Redshift stores data in its raw, uncompressed format. When you create tables in an Amazon Redshift database, you can define a compression type, or encoding, for the columns.  For us, the most expensive part of Redshift is the data storage. Thus, the more we can compress the data to save space, the better. We can typically compress our data about 3x in Redshift. This means that Redshift’s cost of $1000/TB/year (for 3 year reserved nodes) is actually more like $350 per uncompressed TB per year. Redshift allows you to specify a different compression type for each column, which is great. Instead of using a generic strategy to compress all of your data, you should use the compression type that best suits the data in the column. In addition to compression, if you know you’ll be using Redshift on a continuous, long-term basis, another way to save costs is to use reserved instances. By using reserved instances, we save about 60% of the cost compared to on-demand Redshift. The syntax is as follows: CREATE TABLE table_name (column_name data_type ENCODE encoding-type)[, ...] create table product( product_id int, product_name char(20) encode bytedict); You cannot change the compression encoding for a column after the table is created.
  5. Add the data formats that are supported in Redshift.
  6. Use SORTKEY SORTKEY essentially defines how the data will be sorted in the storage. This feature is useful to limit the amount of data that has to be scanned. For example, if I have a large table full of news paper articles over a century and want to find article published between 1980 - 1985 that mention "Tiger", it's useful to have articles sorted by published_date on the storage, because that way I can limit the scanning on blocks that contain these dates. They are also useful for joining if the key is also the DISTKEY because the query planner can skip a lot of work. You *can* specify multiple SORTKEYs. When you specify SORTKEY(a, b), the data is effectively sorted as if with "ORDER BY (a, b). If cardinality of a is high enough, filtering by a is very effective, but having a second SORTKEY will make small sense, and vice versa. Therefore the utility of setting multiple SORTKEY is more difficult to judge. Start with a single SORTKEY and see how it goes. After data loads or inserts, the ANALYZE command should be run: ANALYZE updates the table metadata that is used by the query planner – very important for column-based data storage and ongoing query performance
  7. The data is not sorted by region and country.
  8. The data is not sorted by region and country. Sort Keys – Comparing Styles: increased load and vacuum times; more effective with large tables (> 100M+ rows); use a Compound Sort Key when appending data in order. Sort Keys – Interleaved Considerations.
  9. Add a description of skew. Skew Distribution Styles When you create a table, you designate one of three distribution styles: KEY, ALL, or EVEN. KEY distribution The rows are distributed according to the values in one column. The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together. ALL distribution A copy of the entire table is distributed to every node. Where EVEN distribution or KEY distribution place only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for every join that the table participates in. EVEN distribution The rows are distributed across the slices in a round-robin fashion, regardless of the values in any particular column. EVEN distribution is appropriate when a table does not participate in joins or when there is not a clear choice between KEY distribution and ALL distribution. EVEN distribution is the default distribution style. Consider choosing DISTKEY What is "DISTKEY" anyways? DISTKEY essentially decides which row goes to which node. For example, if you declare "user_id" as DISTKEY, RedShift will do node_id = hash(user_id) % num_nodes to choose the node to store that row. Well, it's not THAT simple, but you get the idea. Why does it matter? DISTKEY primarily matters when you do a join. Let's say a SQL statement SELECT * FROM User INNER JOIN Post ON (User.UserId = Post.UserId) WHERE Post.Type = 1 is issued. If User and Post both used UserId as DISTKEY, a RedShift node can just take the allocated shard, join them, filter them and send the (much smaller) contribution over the wire to be combined. However, if User was distributed by UserId and Post was distributed by ArticleId, Posts that belong to Users on a node will be on other nodes. Therefore the nodes have to ship the entire shard over the network to perform the join, which is expensive. What should I do? If a table is large and you anticipate a join with another large table, then consider choosing the key that will be used for the join to be the DISTKEY. In other words, unless this is the case don't declare a DISTKEY (RedShift will distribute the rows evenly). What is "data skew"? Data skew is when data concentrates on a small number of nodes due to a badly chosen DISTKEY. Imagine you have a huge user base which is predominantly located in the US. If you use "country_code" as DISTKEY, most of the data will end up on one node because most users will have the same country code "US". This means that this one node will do most of the work while other nodes will remain idle, which is inefficient. Therefore, it's important to choose a DISTKEY that will result in an even(-ish) distribution among the nodes.
  10. http://docs.aws.amazon.com/redshift/latest/dg/r_SET.html – set label for query group http://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html#wlm-wildcards Workload Management (WLM) is necessary to optimize access to database resources for concurrently executing queries. The goals of a functional workload management are to Optimally leverage available (hardware) resources for performance and throughput - Prioritize access for high priority jobs Assure resource availability by avoiding system lock-up by any small set of jobs Effective workload management starts with comprehensive monitoring, allowing identifying bottleneck conditions, and then leveraging the available platform tools to implement a workload management strategy. Amazon Redshift’s workload management (WLM) helps you allocate resources to certain user groups or query groups. By adjusting your WLM queue configurations, you can drastically improve performance and query speed. For our Redshift clusters, we use WLM to set what percentage of memory goes to a customer’s queries, versus loading data and other maintenance tasks. If we give a lot of memory to our customers and don’t leave much for loading new data, loading will never finish; if we do the opposite, customer queries will never finish. There has to be a good balance when allocating resources to our customers relative to internal functions. WLM also allows us to define how many queries can be run in parallel at the same time for a given group. If the maximum number of queries are already running in parallel, any new queries will enter a queue to wait for the first one to finish. Of course, the optimal configuration for this will depend on your specific use case, but it’s a tool that anyone using Amazon Redshift should definitely take advantage of. https://www.flydata.com/blog/optimal-wlm-settings-redshift/ Use WLM to counter resource hogging When queries are issued concurrently, resource hogging can become a problem. For example, if somebody issues 10 queries that take 1 hour each, another guy with a 5 min query can wait for a long time before he can get his query done. To prevent this kind of problem, consider using WLM. Work Load Management (WLM) - It enables to create multiple query queues according to different user groups or query groups. This will help in managing workloads. - Redshift allocates an equal, fixed share of server memory to each queue, and, by default, an equal, fixed share of a queue's memory to each query slot in the queue regardless of the number of queries that are actually running concurrently.. - WLM assigns a query to a queue based on the user's group or query group specified while running the SQL. Maxmimum concurrency level per queue is 50. Total concurrency of all the queues is 50. For each queue, you can specify • Concurrency level • User groups • Query groups • Wildcards • WLM memory percent to use  - To specify the amount of available memory that is allocated to a query  - If you specify a WLM Memory Percent to Use value for one or more queues but not all of the queues, then the unallocated memory is divided evenly among the remaining queues. • WLM timeout  - To limit the amount of time that queries in a given WLM queue are permitted to use. Specified in milliseconds.  
- statement timeout configuration parameter applies to the entire cluster, WLM timeout is specific to a single queue in the WLM configuration If you have multiple queries that each access data on a single slice, set up a separate WLM queue to execute those queries concurrently. Redshift will assign concurrent queries to separate slices. By increasing concurrency, you increase the contention for system resources and limit the overall throughput. If a specific query needs more memory than is allocated to a single query slot, you can increase the available memory by increasing the wlm_query_slot_count (p. 697) parameter Default queue The last queue defined in the WLM configuration is the default queue.You can set the concurrency level and the timeout for the default queue, but it cannot include user groups or query groups. The default queue counts against the limit of eight query queues and the limit of 50 concurrent queries. Superuser queue To run a query in the Superuser queue, a user must be logged in as a superuser and must run the query within the predefined 'superuser' query group. The WLM configuration is an editable parameter (wlm_json_configuration) in a parameter group, which can be associated with one or more clusters. You must reboot the cluster after changing the WLM configuration.
  11. To load data from files located in one or more S3 buckets, use the FROM clause to indicate how COPY will locate the files in Amazon S3. You can provide the object path to the data files as part of the FROM clause, or you can provide the location of a manifest file that contains a list of Amazon S3 object paths.  The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files in an Amazon S3 bucket. You can take maximum advantage of parallel processing by splitting your data into multiple files and by setting distribution keys on your tables.  Note We strongly recommend that you divide your data into multiple files to take advantage of parallel processing.
  12. The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files on Amazon S3, from a DynamoDB table, or from text output from one or more remote hosts. Loads data into a table from data files or from an Amazon DynamoDB table. The files can be located in an Amazon Simple Storage Service (Amazon S3) bucket, an Amazon Elastic MapReduce (Amazon EMR) cluster, or a remote host that is accessed using a Secure Shell (SSH) connection. The maximum size of a single input row from any source is 4 MB. To use the COPY command, you must have INSERT privilege for the Amazon Redshift table. We recommend using role-based access control because it is provides more secure, fine-grained control of access to AWS resources and sensitive user data, in addition to safeguarding your AWS credentials. To use role-based access control, you must first create an IAM role using the Amazon Redshift service role type, and then attach the role to your cluster.  When you create an IAM role, IAM returns an Amazon Resource Name (ARN) for the role. To execute a COPY command using an IAM role, provide the role ARN in the CREDENTIALS parameter string.  The following COPY command example uses the role MyRedshiftRole for authentication. copy customer from 's3://mybucket/mydata' credentials 'aws_iam_role=arn:aws:iam::12345678901:role/MyRedshiftRole';
  13. Step 6: Vacuum and Analyze the Database Whenever you add, delete, or modify a significant number of rows, you should run a VACUUM command and then an ANALYZE command. A vacuum recovers the space from deleted rows and restores the sort order. The ANALYZE command updates the statistics metadata, which enables the query optimizer to generate more accurate query plans. For more information, see Vacuuming Tables. If you load the data in sort key order, a vacuum is fast. In this tutorial, you added a significant number of rows, but you added them to empty tables. That being the case, there is no need to resort, and you didn't delete any rows. COPY automatically updates statistics after loading an empty table, so your statistics should be up-to-date. However, as a matter of good housekeeping, you will complete this tutorial by vacuuming and analyzing your database.
  14. To clean up tables after a bulk delete, a load, or a series of incremental updates, you need to run the VACUUMcommand, either against the entire database or against individual tables. Sort Stage and Merge Stage Amazon Redshift performs a vacuum operation in two stages: first, it sorts the rows in the unsorted region, then, if necessary, it merges the newly sorted rows at the end of the table with the existing rows. When vacuuming a large table, the vacuum operation proceeds in a series of steps consisting of incremental sorts followed by merges. If the operation fails or if Amazon Redshift goes off line during the vacuum, the partially vacuumed table or database will be in a consistent state, but you will need to manually restart the vacuum operation. Incremental sorts are lost, but merged rows that were committed before the failure do not need to be vacuumed again. If the unsorted region is large, the lost time might be significant. For more information about the sort and merge stages, see Managing the Volume of Merged Rows. Users can access tables while they are being vacuumed. You can perform queries and write operations while a table is being vacuumed, but when DML and a vacuum run concurrently, both might take longer. If you execute UPDATE and DELETE statements during a vacuum, system performance might be reduced. Incremental merges temporarily block concurrent UPDATE and DELETE operations, and UPDATE and DELETE operations in turn temporarily block incremental merge steps on the affected tables. DDL operations, such as ALTER TABLE, are blocked until the vacuum operation finishes with the table. Vacuum Threshold By default, VACUUM skips the sort phase for any table where more than 95 percent of the table's rows are already sorted. Skipping the sort phase can significantly improve VACUUM performance. In addition, in the delete phase VACUUM reclaims space such that at least 95 percent of the remaining rows are not marked for deletion. Because VACUUM can often skip rewriting many blocks that contain only a few rows marked for deletion, it usually needs much less time for the delete phase compared to reclaiming 100 percent of deleted rows. To change the default sort threshold for a single table, include the table name and the TO threshold PERCENT parameter when you run the VACUUM command. Vacuum Types You can run a full vacuum, a delete only vacuum, a sort only vacuum, or a reindex with full vacuum. VACUUM FULL We recommend a full vacuum for most applications where reclaiming space and resorting rows are equally important. It's more efficient to run a full vacuum than to run back-to-back DELETE ONLY and SORT ONLY vacuum operations. VACUUM FULL is the same as VACUUM. Full vacuum is the default vacuum operation. VACUUM DELETE ONLY A DELETE ONLY vacuum is the same as a full vacuum except that it skips the sort. A DELETE ONLY vacuum saves time when reclaiming disk space is important but resorting new rows is not. For example, you might perform a DELETE ONLY vacuum operation if you don't need to resort rows to optimize query performance. VACUUM SORT ONLY A SORT ONLY vacuum saves some time by not reclaiming disk space, but in most cases there is little benefit compared to a full vacuum. VACUUM REINDEX Use VACUUM REINDEX for tables that use interleaved sort keys. REINDEX reanalyzes the distribution of the values in the table's sort key columns, then performs a full VACUUM operation. 
VACUUM REINDEX takes significantly longer than VACUUM FULL because it needs to take an extra analysis pass over the data, and because merging in new interleaved data can involve touching all the data blocks. If a VACUUM REINDEX operation terminates before it completes, the next VACUUM resumes the reindex operation before performing the vacuum. Examples Reclaim space and database and resort rows in alls tables based on the default 95 percent vacuum threshold. vacuum; Reclaim space and resort rows in the SALES table based on the default 95 percent threshold. vacuum sales; Always reclaim space and resort rows in the SALES table. vacuum sales to 100 percent; Resort rows in the SALES table only if fewer than 75 percent of rows are already sorted. vacuum sort only sales to 75 percent; Reclaim space in the SALES table such that at least 75 percent of the remaining rows are not marked for deletion following the vacuum. vacuum delete only sales to 75 percent; Reindex and then vacuum the LISTING table. vacuum reindex listing; The following command returns an error. vacuum reindex listing to 75 percent;
  15. Snapshots are point-in-time backups of a cluster. There are two types of snapshots:automated and manual. Amazon Redshift stores these snapshots internally in Amazon S3 by using an encrypted Secure Sockets Layer (SSL) connection. If you need to restore from a snapshot, Amazon Redshift creates a new cluster and imports data from the snapshot that you specify.
  16. Amazon Virtual Private Cloud (Amazon VPC) enables you to launch Amazon Web Services (AWS) resources into a virtual network that you've defined. This virtual network closely resembles a traditional network that you'd operate in your own data center, with the benefits of using the scalable infrastructure of AWS. When you provision an Amazon Redshift cluster, it is locked down by default so nobody has access to it. To grant other users inbound access to an Amazon Redshift cluster, you associate the cluster with a security group.  Amazon Redshift has multiple levels of security for node, connection, and data security. The following diagram illustrates a number of the capabilities available for security using Redshift. All nodes of a Redshift cluster are contained in an Internal VPC. It is optionally possible to restrict access to the Redshift cluster to MicroStrategy by creating a Customer VPC. A Security Group further restricts access to Redshift
  17. Amazon CloudWatch monitors your Amazon Web Services (AWS) resources and the applications you run on AWS in real-time. You can use CloudWatch to collect and track metrics, which are the variables you want to measure for your resources and applications. CloudWatch alarms send notifications or automatically make changes to the resources you are monitoring based on rules that you define.