Parallel Data Warehouse (PDW) Performance Tip: CTAS vs INSERT SELECT

When loading tables in Microsoft’s Parallel Data Warehouse (PDW) via T-SQL, one will soon encounter the need to load data into a new table. Whether the developer needs to perform an initial data load of table or even go through the PDW-specific steps to perform a ‘trunc and load’ of a table, there are a two table load methods to consider.

PDW T-SQL Load Methods:

  • INSERT SELECT
  • CREATE TABLE AS SELECT(CTAS)

To make a direct comparison between the two methods, I have provided the actual PDW query and execution plans generated by each method using identical SELECTs

Load Method: INSERT SELECT  

Usage
Appends one or more new rows to a table. Add one row by listing the row values or add multiple rows by inserting the results of a SELECT statement.

Example
Insert Into TABLEA SELECT only, use, this, method, to, append FROM TABLEB

Generated Query Plan

Image

Generated Execution Plan

Load Method: Create Table as Select (CTAS)

Usage
Creates a new table in SQL Server PDW that is populated with the results from a SELECT statement.

Example 
CREATE TABLE myTable (
yearId int NOT NULL,
some varchar(50),
stuff varchar(50))
WITH
( CLUSTERED INDEX (yearId) );

Generated Query Plan

Generated Execution Plan

*When comparing to INSERT SELECT, notice the absence of the Index Scan and Parallelism steps above.  CTAS results in an Index Seek for this particular SELECT and also does away with the Parallellism steps.

The Verdict – Use CREATE TABLE AS SELECT (CTAS) when loading empty tables!

Performance Results
CTAS loaded the test table more that twice as fast as INSERT SELECT.

  • CTAS runtime: 1min  44sec
  • INSERT SELECT runtime: 4min  2sec 

Simple Explanation
On the PDW, INSERT SELECT is a fully logged transaction and occurs serially per each distribution. CTAS, however, is minimally logged and happens parellelly across all nodes.

NOTE: When performing CTAS or re-creating tables in any way, you will need to create statistics on those tables upon loading.

Ideally, you should create stats on all the join columns, group by, order by and restriction. SQL Server PDW does not automatically create and update statistics on the Control node for every occasion when SQL Server creates or updates statistics on the Compute nodes:

– This will create stat for all columns on all objects

select ‘create statistics ‘+b.name+‘ on dbo.’+a.name+‘ (‘+b.name+‘)’
from sys.tablesa,sys.columnsb
where a.object_id=b.object_id and notexists(
selectnullfromsys.stats_columnswhere object_idin(selectobject_idfromsys.stats_columnsgroupbyobject_idhavingcount(*)>=1)
and object_id=b.object_idandcolumn_id=b.column_id)
order bya.name,b.column_id;

About these ads

About Sal De Loera
I have been involved in Information Technology, specifically Data Warehousing and BI, for over 8 years. Over these years I have assumed the following roles: Software Engineer Intern, Programmer/Analyst, Sr. Data Warehouse Engineer, Data Architect, Director of Information Technology, and as most recently as a Data Warehouse / BI consultant. Additionally, my IT experience spans the following industries: Insurance, Restaurant, Mass Media, Education, and Local Government. I have created this blog primarily as a means to not only share information and case studies regarding Business Intelligence, Data Warehousing, and Data Modeling, but also receive input from the community on some of the topics that I'l cover. Looking forward to your feedback and to the exciting opportunities that come with new people and technology!

2 Responses to Parallel Data Warehouse (PDW) Performance Tip: CTAS vs INSERT SELECT

  1. preetimishra1984 says:

    I am using CTAS to create a table with some aggregated data from a view (which is a union all of 4-5 tables). the view has huge data (of the order of billion records). something like below:
    CREATE TABLE [dbo].[AGGFortinetDaily20121231IPAGG]
    WITH ( DISTRIBUTION = HASH (DestinationIPPart3), CLUSTERED INDEX (DestinationIPPart3) )
    AS SELECT SourceIPPart1,SourceIPPart2,SourceIPPart3 ,DestinationIPPart1,DestinationIPPart2,DestinationIPPart3
    ,SUM(CONVERT(BIGINT,[sent])) AS SENT
    ,SUM(CONVERT(BIGINT,[rcvd])) AS RCVD
    ,COUNT_BIG(*) AS EVTCOUNT
    ,MIN([dst_port]) AS MINPORT
    ,MAX([dst_port]) AS MAXPORT
    ,Min(logUTCtime) MinLogTime
    ,Max(logUTCtime) MaxLogTime
    FROM [dbo].[TempAggView] GROUP BY SourceIPPart1,SourceIPPart2,SourceIPPart3,DestinationIPPart1,DestinationIPPart2,DestinationIPPart3

    This is running forever. i force-stopped the query after 30 mins when it was still execution.
    Please suggest what i can do to make it faster.
    Thanks.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 60 other followers

%d bloggers like this: