I'm trying to correlate the query plan with the query report in my Amazon Redshift cluster. How can I do that?
Short description
To see how Amazon Redshift plans to run a query, including the relative cost of each step, use the EXPLAIN command. The EXPLAIN command displays the execution plan for a query statement without actually running the query. The execution plan outlines the query planning and execution steps involved.
Then, use the SVL_QUERY_REPORT system view to view query information at a cluster slice level. You can use the slice-level information for detecting uneven data distribution across the cluster, which impacts query performance.
Note: In SVL_QUERY_REPORT, the rows column indicates the number of rows processed per cluster slice. The rows_pre_filter column indicates the total number of rows emitted before filtering out the rows marked for deletion.
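As a rough illustration of how the slice-level rows values can reveal uneven data distribution, the following Python sketch compares the busiest slice with the average slice for one step. The function name slice_skew and the sample row counts are hypothetical and are not taken from a real cluster. A ratio close to 1.0 suggests an even spread, and a much larger ratio suggests skew.

```python
from statistics import mean
from typing import Mapping


def slice_skew(rows_per_slice: Mapping[int, int]) -> float:
    """Return (max rows on any slice) / (mean rows per slice) for one step.

    rows_per_slice maps a slice number to the value of the rows column
    that SVL_QUERY_REPORT reports for that slice.
    """
    counts = list(rows_per_slice.values())
    if not counts:
        raise ValueError("no slices given")
    average = mean(counts)
    if average == 0:
        return 1.0  # no rows anywhere, so nothing is skewed
    return max(counts) / average


# Hypothetical per-slice row counts for a single scan step.
even = {0: 1000, 1: 1000, 2: 1000, 3: 1000}
skewed = {0: 3400, 1: 200, 2: 200, 3: 200}

print(slice_skew(even))    # 1.0
print(slice_skew(skewed))  # 3.4
```

The threshold at which a ratio counts as a problem depends on the workload, so treat the number as a prompt to inspect the distribution key rather than as a hard rule.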
Amazon Redshift processes the query plan and translates the plan into steps, segments, and streams. For more information, see Query planning and execution workflow.
Resolution
Creating a table and fetching the explain plan and SVL query report for the query
1. Create two tables with different sort keys and distribution keys.
2. Run the following query, in which the join is not performed on a distribution key:
This query distributes the inner table to all compute nodes.
3. Retrieve the query plan:
4. Run the following query:
Mapping the query plan with the query report
1. Run the following query to obtain the svl_query_report:
Here's an example output:
This output indicates that when the segment value is 0, Amazon Redshift performs a sequential scan operation to scan the event table.
2. Run the following query to obtain the query report of segment 1:
Here's an example output:
The query continues to segment 1, where a hash table operation is performed on the inner table in the join.
3. Run the following query to get the SVL_QUERY_REPORT for a query with a segment value of 2:
4. Run the following query:
Here's an example output:
In this example output, segment 2 performs a sequential scan operation to scan the sales table. In the same segment, an aggregate operation is performed to aggregate results and a hash join operation is performed to join the tables. The join column for one of the tables is not a distribution key or a sort key. As a result, the inner table is distributed to all the compute nodes as DS_BCAST_INNER, which can be seen in the EXPLAIN plan.
5. Run the following query to get the SVL_QUERY_REPORT for a query with a segment value of 3:
Here's an example output:
The query continues to segment 3, where a hash aggregate operation and a sort operation are performed. A hash aggregate operation is performed on unsorted grouped aggregate functions. The sort operation is performed to evaluate the ORDER BY clause.
6. Run the following query to get the SVL_QUERY_REPORT for a query with segment values of 4 and 5:
After the earlier segments are complete, the query runs a network operation on segments 4 and 5 to send intermediate results to the leader node for additional processing.
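The segment-by-segment walkthrough above can be sketched in Python as follows. This is a minimal sketch, not the exact queries from this article: the function names, the LABEL_TO_OPERATOR mapping, the table IDs in the sample labels, and the query ID are assumptions made for illustration. The column names are those of the SVL_QUERY_REPORT view, and the mapping covers only the operators discussed here.

```python
from collections import defaultdict

# Assumed mapping from the first word of an SVL_QUERY_REPORT label to the
# EXPLAIN operator it usually corresponds to. Not exhaustive.
LABEL_TO_OPERATOR = {
    "scan": "Seq Scan",
    "hash": "Hash",
    "hjoin": "Hash Join",
    "mjoin": "Merge Join",
    "nloop": "Nested Loop",
    "aggr": "Aggregate / HashAggregate",
    "sort": "Sort",
    "bcast": "DS_BCAST_INNER",
    "return": "Network (send to leader node)",
}


def segment_report_sql(query_id, segment):
    """Build a query that lists the report rows for one segment of a query."""
    query_id, segment = int(query_id), int(segment)  # reject non-numeric input
    return (
        "SELECT query, slice, segment, step, start_time, end_time, "
        "elapsed_time, rows, bytes, label "
        "FROM svl_query_report "
        f"WHERE query = {query_id} AND segment = {segment} "
        "ORDER BY segment, step, slice;"
    )


def operators_by_segment(rows):
    """Group (segment, step, label) rows into plan operators per segment.

    Rows repeated across slices collapse into a single operator entry.
    """
    seen = defaultdict(list)
    for segment, _step, label in sorted(rows):
        words = label.split()
        keyword = words[0] if words else ""
        operator = LABEL_TO_OPERATOR.get(keyword, keyword)
        if operator and operator not in seen[segment]:
            seen[segment].append(operator)
    return dict(seen)


# Hypothetical report rows that mirror the walkthrough (table IDs invented).
sample_rows = [
    (0, 0, "scan   tbl=100 name=event"),
    (1, 0, "hash   tbl=200"),
    (2, 0, "scan   tbl=101 name=sales"),
    (2, 1, "hjoin  tbl=200"),
    (2, 2, "aggr   tbl=201"),
    (3, 0, "aggr   tbl=201"),
    (3, 1, "sort   tbl=202"),
]

print(segment_report_sql(12345, 2))  # 12345 is a placeholder query ID
for segment, operators in operators_by_segment(sample_rows).items():
    print(segment, operators)
```

Reading the grouped operators next to the EXPLAIN output makes it easier to see which plan node each segment implements.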
After the query is run, use the following query to check the execution time of the query in milliseconds:
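One possible way to express this check, assuming that the STL_QUERY system table and its starttime and endtime columns are the source of the timing, is sketched below in Python. The helper names are hypothetical, and the exact query used in this article may differ. The second helper performs the same subtraction locally on two timestamps.

```python
from datetime import datetime, timedelta


def execution_time_sql(query_id):
    """Build a query that returns a query's run time in milliseconds.

    Assumes STL_QUERY (columns query, starttime, endtime) as the source.
    """
    query_id = int(query_id)  # reject non-numeric input
    return (
        "SELECT query, DATEDIFF(ms, starttime, endtime) AS execution_time_ms "
        "FROM stl_query "
        f"WHERE query = {query_id};"
    )


def elapsed_ms(start: datetime, end: datetime) -> int:
    """Return the whole milliseconds between two timestamps."""
    return (end - start) // timedelta(milliseconds=1)


print(execution_time_sql(12345))  # 12345 is a placeholder query ID
print(elapsed_ms(datetime(2024, 1, 1, 0, 0, 0),
                 datetime(2024, 1, 1, 0, 0, 2, 500000)))  # 2500
```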
Optimizing your query
To optimize your query while analyzing the query plan, perform the following steps:
1. Identify the steps with the highest cost.
2. Check if there are any high-cost sort operations. Note that performance of a query depends on the data distribution method along with the data being scanned by the query. Be sure to select the proper distribution style for a table to minimize the impact of the redistribution step. Additionally, use a sort key for suitable columns to improve query speed and reduce the number of blocks that need to be scanned. For more information on how to choose distribution and sort keys, see Amazon Redshift Engineering’s advanced table design playbook: distribution styles and distribution keys.
The following examples use the STL_ALERT_EVENT_LOG table to identify and correct potential query performance issues:
In this example output, the alert indicates that the statistics used by the query are outdated, so running the ANALYZE command can improve query performance.
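A minimal sketch of how alerts for a query might be collected and summarized is shown below. The SQL text uses the event, solution, and event_time columns of STL_ALERT_EVENT_LOG. The %s placeholder assumes a DB-API style driver, and the function name and sample rows are hypothetical.

```python
from collections import Counter

# Assumed lookup query; %s is a driver-side parameter placeholder.
ALERT_SQL = (
    "SELECT query, TRIM(event) AS event, TRIM(solution) AS solution, event_time "
    "FROM stl_alert_event_log "
    "WHERE query = %s "
    "ORDER BY event_time;"
)


def summarize_alerts(rows):
    """Count distinct (event, solution) pairs, most frequent first.

    rows is an iterable of (event, solution) string pairs.
    """
    pairs = ((event.strip(), solution.strip()) for event, solution in rows)
    return Counter(pairs).most_common()


# Illustrative rows only; not output captured from a real cluster.
sample = [
    ("Missing query planner statistics   ", "Run the ANALYZE command"),
    ("Missing query planner statistics", "Run the ANALYZE command   "),
]

for (event, solution), count in summarize_alerts(sample):
    print(f"{count}x {event} -> {solution}")
```

Collapsing repeated alerts this way keeps the focus on the distinct recommendations rather than on how many slices raised them.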
You can also use the EXPLAIN plan to see if there are any alerts that are being populated for the query:
3. Check the join types.
Note: A nested loop is the least optimal join because it is mainly used for cross-joins and some inequality joins.
The following example shows a cross-join between two tables. A nested loop join is being used and the first cost value is 0.00. This cost value is the relative cost for returning the first row of the cross-join operation. The second value (3901467082.32) provides the relative cost of completing the cross-join operation. Note the cost difference between the first and last row. The nested loops negatively impact your cluster’s performance by overloading the queue with long-running queries:
Note: Amazon Redshift selects a join operator based on the distribution style of the table and location of the data required.
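To compare the two cost values described for the cross-join, a plan line can be split into its first-row cost and its total cost. The following Python sketch assumes the cost=first..last notation quoted above. The function name parse_cost is hypothetical, and the rows and width values in the sample line are invented placeholders.

```python
import re

# Matches "cost=<first-row cost>..<total cost>" in an EXPLAIN plan line.
COST_RE = re.compile(r"cost=(\d+(?:\.\d+)?)\.\.(\d+(?:\.\d+)?)")


def parse_cost(plan_line):
    """Return (first_row_cost, total_cost) as floats, or None if absent."""
    match = COST_RE.search(plan_line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Cost values from the cross-join example; rows and width are placeholders.
line = "XN Nested Loop DS_BCAST_INNER  (cost=0.00..3901467082.32 rows=1 width=8)"

first_row_cost, total_cost = parse_cost(line)
print(first_row_cost, total_cost)
```

Because plan costs are relative values, they are most useful for comparing steps within the same plan or for comparing two versions of the same query.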
To optimize the query performance, the sort key and distribution key have been changed to 'eventid' for both tables. In the following example, the merge join is being used instead of a hash join:
4. Identify any broadcast operators with high-cost operations.
Note: For small tables, broadcast operators aren't always non-optimal, because redistributing a small table has relatively little impact on query performance.
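Broadcast and redistribution steps can be located in the EXPLAIN output by searching for the DS_ attributes on join operators. The sketch below is one hedged way to do that in Python. The function name and the sample plan text are hypothetical, and DS_DIST_NONE and DS_DIST_ALL_NONE are skipped because they indicate that no redistribution is required.

```python
import re

# Join attributes that mean no data has to move between nodes.
NO_MOVEMENT = {"DS_DIST_NONE", "DS_DIST_ALL_NONE"}

DS_RE = re.compile(r"\bDS_[A-Z_]+\b")


def data_movement(plan_text):
    """Return (line_number, attribute) for each join that moves data."""
    found = []
    for number, line in enumerate(plan_text.splitlines(), start=1):
        for attribute in DS_RE.findall(line):
            if attribute not in NO_MOVEMENT:
                found.append((number, attribute))
    return found


# Hypothetical plan text with costs omitted.
plan = """XN Merge
  ->  XN Hash Join DS_BCAST_INNER
        ->  XN Seq Scan on sales
        ->  XN Hash
              ->  XN Seq Scan on event"""

for number, attribute in data_movement(plan):
    print(f"line {number}: {attribute}")
```

Whether a flagged step is worth fixing still depends on the size of the table being moved, as the note above points out.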
5. Run the following query to check the execution time of the query.
A difference in execution time between the two queries confirms that the query plan correlates with the query report.
Related information
Using the SVL_QUERY_REPORT view