AWS: Redshift overview
PRESENTATION PREPARED BY VOLODYMYR ROVETSKIY
Agenda
What is AWS Redshift
Amazon Redshift Pricing
AWS Redshift Architecture
•Data Warehouse System Architecture
•Internal Architecture and System Operation
Query Planning and Designing Tables
•Query Planning And Execution Workflow
•Columnar Storage
•Zone Maps
•Compression
•Referential Integrity
Data Location in Redshift
•The Sort Key
•The Distribution Key
Workload Management (WLM)
Loading Data
•What is Amazon S3
•Data Loading from Amazon S3
•COPY from Amazon S3
Redshift table maintenance operations
•ANALYZE
•VACUUM
Amazon Redshift Snapshots
Amazon Redshift Security
Monitoring Cluster Performance
Useful resources
Conclusion
What is Amazon Redshift
Cluster architecture
Columnar storage
Zone maps
Compression
Read optimized
No referential integrity by design
Redshift is the Amazon cloud data warehousing service; it can interact with Amazon EC2 and S3
components but is managed separately using the Redshift tab of the AWS console. As a cloud-based
system it is rented by the hour from Amazon, and broadly the more storage you provision the more you
pay.
Amazon Redshift features
Amazon Redshift Pricing
You pay an hourly rate based on the type
and number of nodes in your cluster.
There is a discount of up to 75% over On-Demand
rates when you commit to using Amazon
Redshift for a 1- or 3-year term.
Prices include two additional copies of your
data, one on the cluster nodes and one in
Amazon S3.
Amazon Redshift takes care of backup,
durability, availability, security, monitoring,
and maintenance.
Prices depend on the chosen Region.
Dense Storage (DS) nodes allow you to
create large data warehouses using hard disk
drives (HDDs) for a low price point.
Dense Compute (DC) nodes allow you to
create high performance data warehouses
using fast CPUs, large amounts of RAM and
solid-state disks (SSDs).
Data Warehouse System Architecture
Leader node
• Stores metadata
• Manages communications with client programs and compute nodes
• Manages distributing data to the slices on compute nodes
• Develops and distributes execution plans for compute nodes
Compute nodes
• Execute the query segments in parallel and send results back to the leader node for final
aggregation
• Each compute node has its own dedicated CPU, memory, and attached disk storage
• User data is stored on the compute nodes
Node slices
• Each slice is allocated a portion of the node's memory and disk space
• The slices work in parallel to complete the operation.
• The number of slices per node is determined by the node size of the cluster.
• The rows of a table are distributed to the node slices according to the distribution key
Client applications
• Amazon Redshift is based on industry-standard PostgreSQL
The following diagram shows a high level
view of internal components and
functionality of the Amazon Redshift data
warehouse.
Internal Architecture
and System Operation
Query Planning And Execution Workflow
The query planning and execution workflow
follows these steps:
• 1. The leader node receives the query and parses the SQL.
• 2. The parser produces an initial query tree that is a logical
representation of the original query. Amazon Redshift then
inputs this query tree into the query optimizer.
• 3. The optimizer evaluates and if necessary rewrites the query
to maximize its efficiency.
• 4. The optimizer generates a query plan for the execution with
the best performance.
• 5. The execution engine translates the query plan
into compiled C++ code.
• 6. The compute nodes execute the compiled code segments in
parallel.
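To inspect the plan the optimizer produces before the compute nodes run it, you can use EXPLAIN; a minimal sketch, assuming the sales table defined later in this deck:
EXPLAIN
SELECT region, COUNT(*)
FROM sales
WHERE date >= '2013-06-01'
GROUP BY region;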
Columnar Storage
Pic.1 shows how records from database tables are typically stored into disk blocks by row.
Pic.2 shows how with columnar storage, the values for each column are stored sequentially into disk blocks.
Columnar storage optimizes analytic query performance because:
• it reduces the overall disk I/O requirements
• it reduces the amount of data you need to load from disk
• each block holds the same type of data
• block data can use a compression scheme selected specifically for the column data type
Zone Maps
The zone map is held separately from the block, like
an index
The zone map holds only two data points per block,
the highest and lowest values in the block.
Redshift uses the zone map when executing queries,
and excludes the blocks that the zone map indicates
won’t be returned by the WHERE clause filter.
Zone maps filter data blocks efficiently if the filtered
columns are used as the sort key.
Compression
Benefits of Compression
•Reduces the size of data when it is stored or read from storage
•Conserves storage space
•Reduces the amount of disk I/O
•Improves query performance
Redshift recommendations and advice:
•Use the COPY command to apply automatic compression (COMPUPDATE
ON)
•Produce a report with the suggested column encoding schemes for
the tables analyzed (ANALYZE COMPRESSION)
•Compression type cannot be changed for a column after the table is
created
•Highly compressed sort keys mean many rows per block, so you'll scan
more data blocks than you need
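For illustration, a column encoding can be declared when the table is created, and ANALYZE COMPRESSION reports suggested encodings for data already loaded; a minimal sketch with a hypothetical product table:
CREATE TABLE product(
  product_id INTEGER,
  product_name CHAR(20) ENCODE bytedict);

ANALYZE COMPRESSION product;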
Referential integrity. Redshift unsupported features:
Table partitioning
Tablespaces
Constraints:
◦ Unique
◦ Foreign key
◦ Primary key
◦ Check constraints
◦ Exclusion constraints
Indexes
Collations
Stored procedures
Triggers
Table functions
Sequences
Full text search
Exotic data types (arrays, JSON, Geospatial types, etc.)
Important:
Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon
Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if
your ETL process or some other process in your application enforces their integrity.
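Although these constraints are not enforced, primary and foreign keys can still be declared so the planner can use them as hints; a minimal sketch with hypothetical tables:
CREATE TABLE dim_region(
  region_id INTEGER NOT NULL PRIMARY KEY,
  region_name VARCHAR(50));

CREATE TABLE fact_sales(
  sale_id BIGINT NOT NULL PRIMARY KEY,
  region_id INTEGER REFERENCES dim_region(region_id),
  amount DECIMAL(12,2));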
Data Location in Redshift
The Sort key
• Each table can have a single Sort Key – a compound key, comprised of 1 to 400
columns from the table
• Redshift stores data on disk in Sort Key order
• Sort keys should be selected based on how the table is used:
• Columns that are used to join to other tables should be included in the
sort key;
• Date type columns that are used in filtering operations should be included;
• Redshift stores metadata about each data block, including the min and max of
each column value – using sortkey, Redshift can skip entire blocks when
answering a query;
Sort keys and Zone Maps
CREATE TABLE SOME_TABLE ( SALESID INTEGER NOT NULL,
DATE DATETIME NOT NULL )
SELECT COUNT(*) FROM SOME_TABLE
WHERE DATE = '09-JUNE-2013'
CREATE TABLE SOME_TABLE ( SALESID INTEGER NOT NULL,
DATE DATETIME NOT NULL )
SORTKEY (DATE)
SELECT COUNT(*) FROM SOME_TABLE
WHERE DATE = '09-JUNE-2013'
The Sort keys – Single Column
Table is sorted by 1 column [ SORTKEY ( date ) ].
Best for:
• Queries that use 1st column (i.e. date) as primary filter
• Can speed up joins and group by
• Quickest to VACUUM
Example:
create table sales(
date datetime not null,
region varchar not null,
country varchar not null)
distkey(date)
sortkey(date);
The Sort keys – Compound
Table is sorted by 1st column , then 2nd column etc. [ SORTKEY COMPOUND ( date, region, country) ].
Best for:
• Queries that use 1st column as primary filter, then other columns
• Can speed up joins and group by
• Slower to VACUUM
Example:
create table sales(
date datetime not null,
region varchar not null,
country varchar not null)
distkey(date)
sortkey compound (date, region, country);
The Sort keys – Interleaved
Equal weight is given to each column. [ SORTKEY INTERLEAVED ( date, region, country) ]
Best for:
• Queries that use different columns in filter
• Queries get faster the more columns used in the filter
• The Slowest to VACUUM
Example:
create table sales(
date datetime not null,
region varchar not null,
country varchar not null)
distkey(date)
sortkey interleaved(date, region, country);
Data Location in Redshift
The Distribution Key
•Redshift will distribute and replicate data between compute
nodes;
•By default, data will be spread evenly across all compute node
slices (EVEN distribution)
•The even distribution of data across the nodes is vital to ensuring
consistent query performance
•If data is denormalised and does not participate in joins, then an
EVEN distribution won’t be problematic
•Alternatively a Distribution key can be provided (KEY distribution)
•The Distribution key helps distribute data across a node’s slices
•The Distribution key is defined on a per-table basis
•The Distribution Key is comprised of only a single column
Distribution styles by example
• Large Fact tables
• Large dimension tables
• Medium dimension tables
(1K – 2M)
• Tables with no joins or group by
• Small dimension tables (<1000)
Data Distribution
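For illustration, the three distribution styles can be declared as follows; the table names, columns, and the pairing of table type with style are assumptions, not taken from the slide:
-- KEY distribution: rows with the same distribution key value land on the same slice,
-- which collocates joins (e.g. a large fact table joined to a large dimension)
CREATE TABLE fact_sales(
  sale_id BIGINT,
  customer_id INTEGER,
  amount DECIMAL(12,2))
DISTSTYLE KEY DISTKEY(customer_id);

-- ALL distribution: a full copy of the table on every node (e.g. a small or medium dimension)
CREATE TABLE dim_country(
  country_id INTEGER,
  country_name VARCHAR(50))
DISTSTYLE ALL;

-- EVEN distribution (the default): round-robin across slices, for tables with no joins or group by
CREATE TABLE audit_log(
  event_time TIMESTAMP,
  message VARCHAR(256))
DISTSTYLE EVEN;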
Workload Management (WLM)
WLM allows you to:
• Manage and adjust query concurrency
• Increase query concurrency up to 15 in a queue
• Define user groups and query groups
• Segregate short and long running queries
• Help improve performance of individual queries
Be aware:
• Query workload is distributed to every compute node
• Increasing concurrency may not always help due to resource
contention (CPU, Memory, I/O)
• Total throughput may increase by letting one query complete first
and allowing other queries to wait
Default WLM configuration:
• 1 queue with a concurrency of 5
• You can define up to 8 queues with a total concurrency of 15
• Redshift also has an internal superuser queue
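Queries can be steered to a WLM queue and given more memory at the session level; a minimal sketch, assuming a queue has been configured for a query group named 'reports':
SET query_group TO 'reports';      -- route the following queries to the matching WLM queue
SET wlm_query_slot_count TO 2;     -- temporarily claim two slots (more memory) in that queue
SELECT COUNT(*) FROM sales;
RESET wlm_query_slot_count;
RESET query_group;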
Short Description of Amazon Simple
Storage Service (S3)
Cloud storage for web applications
Origin store for content distribution
Staging area and persistent store for Big Data
analytics
Backup and archive target for databases
To use Amazon S3, you need an AWS account
Before you can store data in Amazon S3, you must
create a bucket.
Add an object to the created bucket (a text file, a
photo, a video and so forth)
When objects are added to the bucket you can view
and manage them
Data Loading from Amazon S3
Best practices and recommendations:
• S3 bucket and your cluster must be created in the
same region
• Split your data on S3 into multiple files
• Use a COPY Command to load data
• Load your data in sort key order to avoid needing to
vacuum
• Organize your data as a sequence of time-series
tables
• Run the VACUUM command whenever you add,
delete, or modify a large number of rows
• Run the ANALYZE command whenever you’ve made
a non-trivial number of changes to update table
statistics
COPY from Amazon S3
Syntax Parameters
FROM - the path to the Amazon S3 objects that contain the data
MANIFEST - The manifest is a text file in JSON format that lists the URL of each file
that is to be loaded from Amazon S3. The URL includes the bucket name and full
object path for the file. The files that are specified in the manifest can be in different
buckets, but all the buckets must be in the same region as the Amazon Redshift
cluster.
ENCRYPTED - Specifies that the input files on Amazon S3 are encrypted using client-
side encryption.
REGION [AS] 'aws-region' - Specifies the AWS region where the source data is
located.
Examples
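The examples below are a minimal sketch; the bucket, prefix, manifest file, and IAM role ARN are placeholders:
COPY customer
FROM 's3://mybucket/mydata/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRedshiftRole'
REGION 'us-east-1';

-- Using a manifest file that lists the exact objects to load
COPY customer
FROM 's3://mybucket/mydata.manifest'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST;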
Redshift table maintenance operations
ANALYZE: The command used to capture statistical information about a table for use by the query planner.
•Run before running queries.
•Run against the database after a regular load or update cycle.
•Run against any new tables that you create.
•Consider running ANALYZE operations on different schedules for different types of tables and columns, depending on their use in
queries and their propensity to change.
•You do not need to analyze all columns in all tables regularly or on the same schedule. Analyze the columns that are frequently used in
the following:
•Sorting and grouping operations
•Joins
•Query predicates
This command can analyze the whole table or specified columns:
ANALYZE <TABLE NAME>;
ANALYZE <TABLE NAME> (<COLUMN1>,<COLUMN2>);
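For example, using the sales table created earlier in this deck:
ANALYZE sales;
ANALYZE sales(date, region);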
Redshift table maintenance operations
VACUUM: A process to physically reorganize tables after load
activity.
•Can be run in 4 modes:
•VACUUM FULL - reclaims space and re-sorts;
•VACUUM DELETE ONLY - reclaims space but does not re-sort
•VACUUM SORT ONLY - re-sorts but does not reclaim space
•VACUUM REINDEX - used for INTERLEAVED sort keys, re-analyzes
sort keys and then runs FULL VACUUM
VACUUM is an I/O intensive operation and can take time to run. To
minimize the impact of VACUUM:
•Run VACUUM on a regular schedule during time periods when
you expect minimal activity on the cluster
•Use TRUNCATE instead of DELETE where possible
•TRUNCATE or DROP test tables
•Perform a Deep Copy instead of VACUUM
•Load data in sort order to remove the need for VACUUM
TO threshold PERCENT - the threshold above which VACUUM
skips the sort phase and the target threshold for reclaiming
space in the delete phase. If you include the
TO threshold PERCENT parameter, you must also specify a table
name. This parameter can't be used with REINDEX.
For example, if you specify 75 for threshold, VACUUM skips the
sort phase if 75 percent or more of the table's rows are already
in sort order. For the delete phase, VACUUM sets a target of
reclaiming disk space such that at least 75 percent of the table's
rows are not marked for deletion following the vacuum.
The threshold value must be an integer between 0 and 100. The
default is 95.
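A few illustrative invocations (the table names are hypothetical):
VACUUM;                                   -- full vacuum of the whole database at the default 95 percent threshold
VACUUM sales TO 100 PERCENT;              -- always reclaim space and re-sort the sales table
VACUUM SORT ONLY sales TO 75 PERCENT;     -- re-sort only if less than 75 percent of the rows are already sorted
VACUUM DELETE ONLY sales TO 75 PERCENT;   -- reclaim space so at least 75 percent of remaining rows are not marked for deletion
VACUUM REINDEX listing;                   -- for interleaved sort keys: re-analyze the sort key, then run a full vacuum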
Amazon Redshift Snapshots
Automated Snapshots
•enabled by default when the cluster is created
•taken periodically from the cluster (every eight hours or every 5 GB of data changes)
•deleted at the end of a retention period (1 day by default)
•can be disabled (set the retention period to 0)
Manual Snapshots
•can be taken whenever you want
•are never deleted automatically
•accrue storage charges
Excluding Tables From Snapshot
•To create a no-backup table, include the BACKUP NO parameter when you create the table
Copying Snapshots to Another Region
•Copying snapshots across regions incurs data transfer charges
Restoring a Table from a Snapshot (feature added March 10, 2016)
•You can restore a table only to the current, active running cluster and from a snapshot that was taken of that cluster.
•You can restore only one table at a time.
•You cannot restore a table from a cluster snapshot that was taken prior to a cluster being resized.
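A minimal sketch of excluding a table from snapshots (the table and columns are hypothetical):
CREATE TABLE staging_events(
  event_id BIGINT,
  payload VARCHAR(1024))
BACKUP NO;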
Amazon Redshift Security
Cluster Security: Controlling access to Redshift cluster for management
• Cluster runs within a Virtual Private Cloud (VPC) managed by the Amazon Redshift
service
Connection security: Controlling clients that can connect to Redshift cluster
•. Users can only connect to the cluster using an ODBC or JDBC connections. You may
optionally only permit connections to the Amazon Redshift cluster from a VPC you
control.
Database object security: Controlling which users have access to which database objects
• At the database security level Amazon Redshift uses the Postgres security model, with
user name / password authentication. Database user accounts are configured
separately from Redshift’s management security using SQL commands.
Data Security: encryption of data at rest (load data, table data, and backup data)
• You can encrypt data that is loaded into Amazon Redshift, encrypt the data stored in
the Amazon Redshift tables, and encrypt the backups.
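At the database object level, access is managed with standard SQL following the Postgres model; a minimal sketch with hypothetical user, group, and table names:
CREATE USER report_user PASSWORD 'Str0ngPassw0rd';
CREATE GROUP reporting WITH USER report_user;
GRANT USAGE ON SCHEMA public TO GROUP reporting;
GRANT SELECT ON TABLE sales TO GROUP reporting;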
Monitoring Cluster Performance
Amazon CloudWatch metrics help you monitor physical
aspects of your cluster, such as CPU utilization, latency,
and throughput.
Performance data helps you monitor database activity
and performance. This data is aggregated in the
Amazon Redshift console to help you easily correlate
what you see in Amazon CloudWatch metrics with
query and load performance data.
Useful resources to learn more about
Redshift
Redshift Documentation
• https://aws.amazon.com/redshift
• http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html
Open Source Scripts and Tools
• https://github.com/awslabs/amazon-redshift-utils
• http://www.aginity.com/redshift
Conclusion
Amazon Redshift’s features
•Optimized for Data Warehousing – It uses columnar storage, data compression, and
zone maps to reduce the amount of IO needed to perform queries. Redshift has a massively
parallel processing (MPP) architecture, parallelizing and distributing SQL operations to take
advantage of all available resources.
•Scalable – With a few clicks of the AWS Management Console or a simple API call, you can
easily scale the number of nodes in your data warehouse up or down as your performance
or capacity needs change.
•No Up-Front Costs – You pay only for the resources you provision. You can choose On-
Demand pricing with no up-front costs or long-term commitments, or obtain significantly
discounted rates with Reserved Instance pricing.
•Fault Tolerant – Amazon Redshift has multiple features that enhance the reliability of
your data warehouse cluster. All data written to a node in your cluster is automatically
replicated to other nodes within the cluster and all data is continuously backed up to
Amazon S3.
•SQL - Amazon Redshift is a SQL data warehouse and uses industry standard ODBC and
JDBC connections and Postgres drivers.
•Isolation - Amazon Redshift enables you to configure firewall rules to control network
access to your data warehouse cluster.
•Encryption – With just a couple of parameter settings, you can set up Amazon Redshift to
use SSL to secure data in transit and hardware-accelerated AES-256 encryption for data at
rest.
Diagram: Redshift – Optimized for Data Warehousing, Scalable, No Up-Front Costs, Fault Tolerant, Secure, SQL Standards
Jeff Bezos reacted to my payment :-))
Editor's Notes
  1. Amazon Redshift data warehouse is a cloud based massively parallel processing (MPP), columnar database that consists of multiple computers (nodes). It delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Redshift uses standard PostgreSQL ODBC drivers, allowing the usage of wide range of familiar SQL clients. Most common tasks associated with provisioning, configuring and monitoring a data warehouse are automated within Redshift for the ease of administration.
  2. Dense Storage Node Types
Node Size | vCPU | ECU | RAM (GiB) | Slices Per Node | Storage Per Node | Node Range | Total Capacity
ds1.xlarge | 2 | 4.4 | 15 | 2 | 2 TB HDD | 1–32 | 64 TB
ds1.8xlarge | 16 | 35 | 120 | 16 | 16 TB HDD | 2–128 | 2 PB
ds2.xlarge | 4 | 13 | 31 | 2 | 2 TB HDD | 1–32 | 64 TB
ds2.8xlarge | 36 | 119 | 244 | 16 | 16 TB HDD | 2–128 | 2 PB
Dense Compute Node Types
Node Size | vCPU | ECU | RAM (GiB) | Slices Per Node | Storage Per Node | Node Range | Total Capacity
dc1.large | 2 | 7 | 15 | 2 | 160 GB SSD | 1–32 | 5.12 TB
dc1.8xlarge | 32 | 104 | 244 | 32 | 2.56 TB SSD | 2–128 | 326 TB
Cluster architecture: Clients connect via existing protocols to the Leader Node. The leader node develops a query plan and may generate and compile C++ code to be executed by the compute nodes. The leader node will distribute work across compute nodes using Distribution Keys (more later). Compute nodes receive work from the leader node and may transmit data amongst themselves to answer the query. The leader aggregates the results and returns them to the client. The leader can distribute bulk data loads across compute nodes. http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
  3. For example, suppose a table contains 100 columns. A query that uses five columns will only need to read about five percent of the data contained in the table. This savings is repeated for possibly billions or even trillions of records for large databases. In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well. Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform any database loading or other operations that are part of query execution. Drawback: after an update, insert, or delete you need to run VACUUM and ANALYZE, so Redshift is heavy for operations that modify data.
  4. Compression is a column-level operation that reduces the size of data when it is stored. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance. By default, Amazon Redshift stores data in its raw, uncompressed format. When you create tables in an Amazon Redshift database, you can define a compression type, or encoding, for the columns.  For us, the most expensive part of Redshift is the data storage. Thus, the more we can compress the data to save space, the better. We can typically compress our data about 3x in Redshift. This means that Redshift’s cost of $1000/TB/year (for 3 year reserved nodes) is actually more like $350 per uncompressed TB per year. Redshift allows you to specify a different compression type for each column, which is great. Instead of using a generic strategy to compress all of your data, you should use the compression type that best suits the data in the column. In addition to compression, if you know you’ll be using Redshift on a continuous, long-term basis, another way to save costs is to use reserved instances. By using reserved instances, we save about 60% of the cost compared to on-demand Redshift. The syntax is as follows: CREATE TABLE table_name (column_name data_type ENCODE encoding-type)[, ...] create table product( product_id int, product_name char(20) encode bytedict); You cannot change the compression encoding for a column after the table is created.
  5. Add the data formats that are supported in Redshift.
  6. Use SORTKEY SORTKEY essentially defines how the data will be sorted in the storage. This feature is useful to limit the amount of data that has to be scanned. For example, if I have a large table full of news paper articles over a century and want to find article published between 1980 - 1985 that mention "Tiger", it's useful to have articles sorted by published_date on the storage, because that way I can limit the scanning on blocks that contain these dates. They are also useful for joining if the key is also the DISTKEY because the query planner can skip a lot of work. You *can* specify multiple SORTKEYs. When you specify SORTKEY(a, b), the data is effectively sorted as if with "ORDER BY (a, b). If cardinality of a is high enough, filtering by a is very effective, but having a second SORTKEY will make small sense, and vice versa. Therefore the utility of setting multiple SORTKEY is more difficult to judge. Start with a single SORTKEY and see how it goes. After data loads or inserts, the ANALYZE command should be run: ANALYZE updates the table metadata that is used by the query planner – very important for column-based data storage and ongoing query performance
  7. The data is not sorted by region and country.
  8. The data is not sorted by region and country. Sort Keys – Comparing Styles: increased load and vacuum times; more effective with large tables (> 100M+ rows); use a Compound Sort Key when appending data in order. Sort Keys – Interleaved Considerations.
  9. Add a description of skew. Skew Distribution Styles When you create a table, you designate one of three distribution styles: KEY, ALL, or EVEN. KEY distribution The rows are distributed according to the values in one column. The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together. ALL distribution A copy of the entire table is distributed to every node. Where EVEN distribution or KEY distribution place only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for every join that the table participates in. EVEN distribution The rows are distributed across the slices in a round-robin fashion, regardless of the values in any particular column. EVEN distribution is appropriate when a table does not participate in joins or when there is not a clear choice between KEY distribution and ALL distribution. EVEN distribution is the default distribution style. Consider choosing DISTKEY What is "DISTKEY" anyways? DISTKEY essentially decides which row goes to which node. For example, if you declare "user_id" as DISTKEY, RedShift will do node_id = hash(user_id) % num_nodes to choose the node to store that row. Well, it's not THAT simple, but you get the idea. Why does it matter? DISTKEY primarily matters when you do a join. Let's say a SQL statement SELECT * FROM User INNER JOIN Post ON (User.UserId = Post.UserId) WHERE Post.Type = 1 is issued. If User and Post both used UserId as DISTKEY, a RedShift node can just take the allocated shard, join them, filter them and send the (much smaller) contribution over the wire to be combined. However, if User was distributed by UserId and Post was distributed by ArticleId, Posts that belong to Users on a node will be on other nodes. Therefore the nodes have to ship the entire shard over the network to perform the join, which is expensive. What should I do? If a table is large and you anticipate a join with another large table, then consider choosing the key that will be used for the join to be the DISTKEY. In other words, unless this is the case don't declare a DISTKEY (RedShift will distribute the rows evenly). What is "data skew"? Data skew is when data concentrates on a small number of nodes due to a badly chosen DISTKEY. Imagine you have a huge user base which is predominantly located in the US. If you use "country_code" as DISTKEY, most of the data will end up on one node because most users will have the same country code "US". This means that this one node will do most of the work while other nodes will remain idle, which is inefficient. Therefore, it's important to choose a DISTKEY that will result in an even(-ish) distribution among the nodes.
  10. http://docs.aws.amazon.com/redshift/latest/dg/r_SET.html – set label for query group http://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html#wlm-wildcards Workload Management (WLM) is necessary to optimize access to database resources for concurrently executing queries. The goals of a functional workload management are to Optimally leverage available (hardware) resources for performance and throughput - Prioritize access for high priority jobs Assure resource availability by avoiding system lock-up by any small set of jobs Effective workload management starts with comprehensive monitoring, allowing identifying bottleneck conditions, and then leveraging the available platform tools to implement a workload management strategy. Amazon Redshift’s workload management (WLM) helps you allocate resources to certain user groups or query groups. By adjusting your WLM queue configurations, you can drastically improve performance and query speed. For our Redshift clusters, we use WLM to set what percentage of memory goes to a customer’s queries, versus loading data and other maintenance tasks. If we give a lot of memory to our customers and don’t leave much for loading new data, loading will never finish; if we do the opposite, customer queries will never finish. There has to be a good balance when allocating resources to our customers relative to internal functions. WLM also allows us to define how many queries can be run in parallel at the same time for a given group. If the maximum number of queries are already running in parallel, any new queries will enter a queue to wait for the first one to finish. Of course, the optimal configuration for this will depend on your specific use case, but it’s a tool that anyone using Amazon Redshift should definitely take advantage of. https://www.flydata.com/blog/optimal-wlm-settings-redshift/ Use WLM to counter resource hogging When queries are issued concurrently, resource hogging can become a problem. For example, if somebody issues 10 queries that take 1 hour each, another guy with a 5 min query can wait for a long time before he can get his query done. To prevent this kind of problem, consider using WLM. Work Load Management (WLM) - It enables to create multiple query queues according to different user groups or query groups. This will help in managing workloads. - Redshift allocates an equal, fixed share of server memory to each queue, and, by default, an equal, fixed share of a queue's memory to each query slot in the queue regardless of the number of queries that are actually running concurrently.. - WLM assigns a query to a queue based on the user's group or query group specified while running the SQL. Maxmimum concurrency level per queue is 50. Total concurrency of all the queues is 50. For each queue, you can specify • Concurrency level • User groups • Query groups • Wildcards • WLM memory percent to use  - To specify the amount of available memory that is allocated to a query  - If you specify a WLM Memory Percent to Use value for one or more queues but not all of the queues, then the unallocated memory is divided evenly among the remaining queues. • WLM timeout  - To limit the amount of time that queries in a given WLM queue are permitted to use. Specified in milliseconds.  
- statement timeout configuration parameter applies to the entire cluster, WLM timeout is specific to a single queue in the WLM configuration If you have multiple queries that each access data on a single slice, set up a separate WLM queue to execute those queries concurrently. Redshift will assign concurrent queries to separate slices. By increasing concurrency, you increase the contention for system resources and limit the overall throughput. If a specific query needs more memory than is allocated to a single query slot, you can increase the available memory by increasing the wlm_query_slot_count (p. 697) parameter Default queue The last queue defined in the WLM configuration is the default queue.You can set the concurrency level and the timeout for the default queue, but it cannot include user groups or query groups. The default queue counts against the limit of eight query queues and the limit of 50 concurrent queries. Superuser queue To run a query in the Superuser queue, a user must be logged in as a superuser and must run the query within the predefined 'superuser' query group. The WLM configuration is an editable parameter (wlm_json_configuration) in a parameter group, which can be associated with one or more clusters. You must reboot the cluster after changing the WLM configuration.
  11. To load data from files located in one or more S3 buckets, use the FROM clause to indicate how COPY will locate the files in Amazon S3. You can provide the object path to the data files as part of the FROM clause, or you can provide the location of a manifest file that contains a list of Amazon S3 object paths.  The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files in an Amazon S3 bucket. You can take maximum advantage of parallel processing by splitting your data into multiple files and by setting distribution keys on your tables.  Note We strongly recommend that you divide your data into multiple files to take advantage of parallel processing.
  12. The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files on Amazon S3, from a DynamoDB table, or from text output from one or more remote hosts. Loads data into a table from data files or from an Amazon DynamoDB table. The files can be located in an Amazon Simple Storage Service (Amazon S3) bucket, an Amazon Elastic MapReduce (Amazon EMR) cluster, or a remote host that is accessed using a Secure Shell (SSH) connection. The maximum size of a single input row from any source is 4 MB. To use the COPY command, you must have INSERT privilege for the Amazon Redshift table. We recommend using role-based access control because it is provides more secure, fine-grained control of access to AWS resources and sensitive user data, in addition to safeguarding your AWS credentials. To use role-based access control, you must first create an IAM role using the Amazon Redshift service role type, and then attach the role to your cluster.  When you create an IAM role, IAM returns an Amazon Resource Name (ARN) for the role. To execute a COPY command using an IAM role, provide the role ARN in the CREDENTIALS parameter string.  The following COPY command example uses the role MyRedshiftRole for authentication. copy customer from 's3://mybucket/mydata' credentials 'aws_iam_role=arn:aws:iam::12345678901:role/MyRedshiftRole';
  13. Step 6: Vacuum and Analyze the Database Whenever you add, delete, or modify a significant number of rows, you should run a VACUUM command and then an ANALYZE command. A vacuum recovers the space from deleted rows and restores the sort order. The ANALYZE command updates the statistics metadata, which enables the query optimizer to generate more accurate query plans. For more information, see Vacuuming Tables. If you load the data in sort key order, a vacuum is fast. In this tutorial, you added a significant number of rows, but you added them to empty tables. That being the case, there is no need to resort, and you didn't delete any rows. COPY automatically updates statistics after loading an empty table, so your statistics should be up-to-date. However, as a matter of good housekeeping, you will complete this tutorial by vacuuming and analyzing your database.
  14. To clean up tables after a bulk delete, a load, or a series of incremental updates, you need to run the VACUUMcommand, either against the entire database or against individual tables. Sort Stage and Merge Stage Amazon Redshift performs a vacuum operation in two stages: first, it sorts the rows in the unsorted region, then, if necessary, it merges the newly sorted rows at the end of the table with the existing rows. When vacuuming a large table, the vacuum operation proceeds in a series of steps consisting of incremental sorts followed by merges. If the operation fails or if Amazon Redshift goes off line during the vacuum, the partially vacuumed table or database will be in a consistent state, but you will need to manually restart the vacuum operation. Incremental sorts are lost, but merged rows that were committed before the failure do not need to be vacuumed again. If the unsorted region is large, the lost time might be significant. For more information about the sort and merge stages, see Managing the Volume of Merged Rows. Users can access tables while they are being vacuumed. You can perform queries and write operations while a table is being vacuumed, but when DML and a vacuum run concurrently, both might take longer. If you execute UPDATE and DELETE statements during a vacuum, system performance might be reduced. Incremental merges temporarily block concurrent UPDATE and DELETE operations, and UPDATE and DELETE operations in turn temporarily block incremental merge steps on the affected tables. DDL operations, such as ALTER TABLE, are blocked until the vacuum operation finishes with the table. Vacuum Threshold By default, VACUUM skips the sort phase for any table where more than 95 percent of the table's rows are already sorted. Skipping the sort phase can significantly improve VACUUM performance. In addition, in the delete phase VACUUM reclaims space such that at least 95 percent of the remaining rows are not marked for deletion. Because VACUUM can often skip rewriting many blocks that contain only a few rows marked for deletion, it usually needs much less time for the delete phase compared to reclaiming 100 percent of deleted rows. To change the default sort threshold for a single table, include the table name and the TO threshold PERCENT parameter when you run the VACUUM command. Vacuum Types You can run a full vacuum, a delete only vacuum, a sort only vacuum, or a reindex with full vacuum. VACUUM FULL We recommend a full vacuum for most applications where reclaiming space and resorting rows are equally important. It's more efficient to run a full vacuum than to run back-to-back DELETE ONLY and SORT ONLY vacuum operations. VACUUM FULL is the same as VACUUM. Full vacuum is the default vacuum operation. VACUUM DELETE ONLY A DELETE ONLY vacuum is the same as a full vacuum except that it skips the sort. A DELETE ONLY vacuum saves time when reclaiming disk space is important but resorting new rows is not. For example, you might perform a DELETE ONLY vacuum operation if you don't need to resort rows to optimize query performance. VACUUM SORT ONLY A SORT ONLY vacuum saves some time by not reclaiming disk space, but in most cases there is little benefit compared to a full vacuum. VACUUM REINDEX Use VACUUM REINDEX for tables that use interleaved sort keys. REINDEX reanalyzes the distribution of the values in the table's sort key columns, then performs a full VACUUM operation. 
VACUUM REINDEX takes significantly longer than VACUUM FULL because it needs to take an extra analysis pass over the data, and because merging in new interleaved data can involve touching all the data blocks. If a VACUUM REINDEX operation terminates before it completes, the next VACUUM resumes the reindex operation before performing the vacuum. Examples Reclaim space and database and resort rows in alls tables based on the default 95 percent vacuum threshold. vacuum; Reclaim space and resort rows in the SALES table based on the default 95 percent threshold. vacuum sales; Always reclaim space and resort rows in the SALES table. vacuum sales to 100 percent; Resort rows in the SALES table only if fewer than 75 percent of rows are already sorted. vacuum sort only sales to 75 percent; Reclaim space in the SALES table such that at least 75 percent of the remaining rows are not marked for deletion following the vacuum. vacuum delete only sales to 75 percent; Reindex and then vacuum the LISTING table. vacuum reindex listing; The following command returns an error. vacuum reindex listing to 75 percent;
  15. Snapshots are point-in-time backups of a cluster. There are two types of snapshots:automated and manual. Amazon Redshift stores these snapshots internally in Amazon S3 by using an encrypted Secure Sockets Layer (SSL) connection. If you need to restore from a snapshot, Amazon Redshift creates a new cluster and imports data from the snapshot that you specify.
  16. Amazon Virtual Private Cloud (Amazon VPC) enables you to launch Amazon Web Services (AWS) resources into a virtual network that you've defined. This virtual network closely resembles a traditional network that you'd operate in your own data center, with the benefits of using the scalable infrastructure of AWS. When you provision an Amazon Redshift cluster, it is locked down by default so nobody has access to it. To grant other users inbound access to an Amazon Redshift cluster, you associate the cluster with a security group.  Amazon Redshift has multiple levels of security for node, connection, and data security. The following diagram illustrates a number of the capabilities available for security using Redshift. All nodes of a Redshift cluster are contained in an Internal VPC. It is optionally possible to restrict access to the Redshift cluster to MicroStrategy by creating a Customer VPC. A Security Group further restricts access to Redshift
  17. Amazon CloudWatch monitors your Amazon Web Services (AWS) resources and the applications you run on AWS in real-time. You can use CloudWatch to collect and track metrics, which are the variables you want to measure for your resources and applications. CloudWatch alarms send notifications or automatically make changes to the resources you are monitoring based on rules that you define.