Data Warehousing Tip: Using T-SQL vs SSIS for Slowly Changing Dimension (SCD) Types 1 and 2

Here’s an example of using T-SQL to process a Slowly Changing Type 1&2 Dimension . Note:  SSIS has an SCD transform ,however, it does not process large dimensional datasets very quickly.
…eh, I’m putting it too nicely, the SSIS SCD transform is painfully slow!  This T-SQL performs 20X quicker than the SSIS SCD transform.
USE [SomeDB]
GO
/****** Object: StoredProcedure [dbo].[sp_UpsertDimGuest] Script Date: 9/11/2013 6:09:04 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[sp_UpsertDimGuest]
AS
BEGIN
–DROP Constraints
–SCD 2 Historical Attribute Tracking via Upsert
INSERT INTO [dbo].[DimGuest]
( [GuestContactSrgKey],
[SourceSystem]
,[GuestIDAlternateKey]
,[GuestAcct]
,[GuestFirstName]
,[GuestLastName]
,[ETLLoadID]
,[EffStartDt]
,[EffEndDt]
,[IsCurrent]
)
SELECT
[GuestContactSrgKey]
,[SourceSystem]
,[GuestIDAlternateKey]
,[GuestAcct]
,[GuestFirstName]
,[GuestLastName]
,[ETLLoadID]
,GETDATE()
,’1/1/3000′
,’true’
FROM (
MERGE GTDW.dbo.DimGuest AS [Target]
USING GTDW_Staging.dbo.DimGuestStagingFinal AS [Source]
ON Target.GuestIDAlternateKey = Source.GuestIDAlternateKey
AND Target.IsCurrent = 1
WHEN MATCHED AND
(Target.[GuestFirstName] <> Source.[GuestFirstName]
OR Target.[GuestLastName] <> Source.[GuestLastName]
)
THEN UPDATE SET
IsCurrent = 0
,EffEndDt = GETDATE()
WHEN NOT MATCHED BY TARGET
THEN INSERT (
[GuestContactSrgKey]
,[SourceSystem]
,[GuestIDAlternateKey]
,[GuestAcct]
,[GuestFirstName]
,[GuestLastName]
,[ETLLoadID]
,[EffStartDt]
,[EffEndDt]
,[IsCurrent]
)
VALUES (
Source.[GuestContactSrgKey]
,Source.[SourceSystem]
,Source.[GuestIDAlternateKey]
,Source.[GuestAcct]
,Source.[GuestFirstName]
,Source.[GuestLastName]
,Source.[ETLLoadID]
,GETDATE()
,’1/1/3000′
,’true’
)
WHEN NOT MATCHED BY SOURCE AND Target.IsCurrent = 1
THEN UPDATE SET
IsCurrent = 0
,EffEndDt = GETDATE()
OUTPUT $action AS Action
,[Source].*
) AS MergeOutput
WHERE MergeOutput.Action = ‘UPDATE’
AND GuestIDAlternateKey IS NOT NULL
;
–Changing Attributes – history is not recorded
UPDATE DimG
SET
[GuestContactSrgKey] = DSF.[GuestContactSrgKey]
,[GuestAcct] = DSF.[GuestAcct]
,[ETLLoadID] = DSF.[ETLLoadID]
FROM DimGuest DimG
INNER JOIN [GTDW_Staging].[dbo].[DimGuestStagingFinal] DSF
ON DimG.GuestIDAlternateKey = DSF.GuestIDAlternateKey
AND DimG.IsCurrent = 1 –Optional
AND (
DimG.[GuestAcct] <> DSF.[GuestAcct]
)
END
Advertisements

Data Warehousing Tip: T-SQL Script to Create DEFAULT Constraints by Data Type


There is an ongoing discussion/debate regarding the handling of NULLS in a data warehouse, however, the following article by Michelle Poolet summarizes it beautifully – “consider the data warehouse business requirements” (see http://sqlmag.com/database-administration/null-data-warehouse).


Beyond consideration, there are some data types that should get the DEFAULT treatment (bit, char, nchar, nvarchar, and varchar) and the following script can assist in creating these across every applicable column in your database.


SELECT Object_schema_name(T.[object_id], Db_id()) AS [Schema], 
       T.[name]  AS [table_name], 
       AC.[name] AS [column_name], 
       TY.[name] AS system_data_type, 
       AC.[max_length], 
       AC.[precision], 
       AC.[scale], 
       AC.[is_nullable], 
       AC.[is_ansi_padded], 
       'ALTER TABLE dbo.' + T.[name] 
       + ' ADD CONSTRAINT ' + AC.[name] + ' DEFAULT(' 
       + CASE TY.[name] 
            --WHEN 'int' THEN '0' 
          --WHEN 'bigint' THEN '0' 
          WHEN 'varchar' THEN '' 
          WHEN 'bit' THEN '' 
          WHEN 'char' THEN '' 
          WHEN 'smalldatetime' THEN '' 
          WHEN 'datetime' THEN ''
          --WHEN 'tinyint' THEN '0'
          --WHEN 'smallint' THEN '0' 
          WHEN 'nvarchar' THEN '' 
          --WHEN 'money' THEN '0.00'
          ELSE '' 
       END +') FOR '+ac.name AS sqlstring 
   FROM   sys.[tables] AS T 
       INNER JOIN sys.[all_columns] AC 
               ON T.[object_id] = AC.[object_id] 
       INNER JOIN sys.[types] TY 
               ON AC.[system_type_id] = TY.[system_type_id] 
                  AND AC.[user_type_id] = TY.[user_type_id] 
WHERE  T.[is_ms_shipped] = 0 
       AND T.[name] LIKE ('DIM%') AND AC.[name] NOT LIKE ( '%SrgKey%' )  AND 
TY.[name] in ('varchar', 'bit' , 'char', 'smalldatetime', 'datetime' , 'nvarchar')

I’ve taken the approach of defaulting values for all columns (less those affected by the WHERE clause) in my STAGING Database.  The thought is that the non-NULL values inserted into my staging tables will persist to the TARGET database and I won’t experience one of a variety of issues (OLAP, T-SQL, or otherwise) that a development and/or user team will encounter due to having NULL values in the data warehouse.

 

Parallel Data Warehouse (PDW) Performance Tip: CTAS vs INSERT SELECT

When loading tables in Microsoft’s Parallel Data Warehouse (PDW) via T-SQL, one will soon encounter the need to load data into a new table. Whether the developer needs to perform an initial data load of table or even go through the PDW-specific steps to perform a ‘trunc and load’ of a table, there are a two table load methods to consider.

PDW T-SQL Load Methods:

  • INSERT SELECT
  • CREATE TABLE AS SELECT(CTAS)

To make a direct comparison between the two methods, I have provided the actual PDW query and execution plans generated by each method using identical SELECTs

Load Method: INSERT SELECT  

Usage
Appends one or more new rows to a table. Add one row by listing the row values or add multiple rows by inserting the results of a SELECT statement.

Example
Insert Into TABLEA SELECT only, use, this, method, to, append FROM TABLEB

Generated Query Plan

Image

Generated Execution Plan

Load Method: Create Table as Select (CTAS)

Usage
Creates a new table in SQL Server PDW that is populated with the results from a SELECT statement.

Example 
CREATE TABLE myTable (
yearId int NOT NULL,
some varchar(50),
stuff varchar(50))
WITH
( CLUSTERED INDEX (yearId) );

Generated Query Plan

Generated Execution Plan

*When comparing to INSERT SELECT, notice the absence of the Index Scan and Parallelism steps above.  CTAS results in an Index Seek for this particular SELECT and also does away with the Parallellism steps.

The Verdict – Use CREATE TABLE AS SELECT (CTAS) when loading empty tables!

Performance Results
CTAS loaded the test table more that twice as fast as INSERT SELECT.

  • CTAS runtime: 1min  44sec
  • INSERT SELECT runtime: 4min  2sec 

Simple Explanation
On the PDW, INSERT SELECT is a fully logged transaction and occurs serially per each distribution. CTAS, however, is minimally logged and happens parellelly across all nodes.

NOTE: When performing CTAS or re-creating tables in any way, you will need to create statistics on those tables upon loading.

Ideally, you should create stats on all the join columns, group by, order by and restriction. SQL Server PDW does not automatically create and update statistics on the Control node for every occasion when SQL Server creates or updates statistics on the Compute nodes:

— This will create stat for all columns on all objects

select ‘create statistics ‘+b.name+‘ on dbo.’+a.name+‘ (‘+b.name+‘)’
from sys.tablesa,sys.columnsb
where a.object_id=b.object_id and notexists(
selectnullfromsys.stats_columnswhere object_idin(selectobject_idfromsys.stats_columnsgroupbyobject_idhavingcount(*)>=1)
and object_id=b.object_idandcolumn_id=b.column_id)
order bya.name,b.column_id;

T-SQL Tip: Highlight Code Between Any Two Parenthesis

Where has this been all my working life? If you frequently find your self scouring complex (and sometimes needlessly complex) T-SQL code, simply place the cursor on the inside or outside of a parenthesis, and perform a CTRL+SHIFT+]. Sql Server Management Studio will then highlight the code for that particular SELECT block. There is some complex T-SQL out there in the wild (future consultants beware!), this will save you loads of scrolling and searching time and help you wrangle in pesky blocks of extended code.

Crunch that code!
– Sal

Parallel Data Warehouse (PDW) Basics: CREATE TABLE and CTAS Syntax

After forgetting the syntax for temp tables a couple of times, I decided to write a brief overview of the table creation syntax for Microsoft’s Parallel Data Warehouse (PDW) architecture.

One thing worth noting up-front, this post does not include considerations that should be made in determinining whether a table should be distributed vs replicated.

TEMPORARY TABLE BASICS

-Temp tables will reside in the tempdb (no surprise there), however, the ## global prefix as we see it on SQL Server SMP is not available.
-Temp tables can be either distributed or replicated
-Temp tables cannot have partitions, views, or indexes
-Table permissions cannot be changed
-Visible in only the current session
– CONTROL, INSERT, SELECT and UPDATE permissions are granted to the temp table creator.

Example: Distributed temporary table

CREATE TABLE testDB.dbo.#TableA (
   here int NOT NULL,
   we varchar(50),
   go varchar(50))
WITH
   ( LOCATION = USER_DB,
    DISTRIBUTION = HASH (here) );

NON-TEMP TABLE BASICS
-Indexes cannot be added (or dropped) after a table has been created (must use ‘CREATE TABLE AS SELECT’ aka ‘CTAS’ syntax if the developer wants to preserve the underlying data).
-All tables are created with page compression, there is no configurable option for this.
-When creating a distributed table, keep in mind that partitions will exist in the 8 distributions per compute node. So if you want the math on that, an db may have 10 year partitions on a server – in PDW, this translates to 10(partitions) * 8 (distributions) = 80 partitions on each compute node.  If you have a Parallel Data Warehouse (PDW) ‘Full-Rack’, there will be a total of 800 partitions (80 partitions per node * 10 nodes) …  interesting nonetheless!
-The default collation for the PDW appliance is Latin1_General_100_CI_AS_KS_WS – this is configurable, however, use caution if you plan on modifying a column collation as it can lead to collation errors when comparing strings.

Example: Replicated Table with Index

CREATE TABLE myTable (
   yearId int NOT NULL,
   some varchar(50),
   stuff varchar(50))
WITH 
    ( CLUSTERED INDEX (yearId) );

Example: Distributed Table with Multiple Partitions

CREATE TABLE TableA(
   yearId int NOT NULL,
   some varchar(50),
   stuff varchar(50)
WITH     ( DISTRIBUTION = HASH (yearId),     
PARTITION ( YearPurchased     
RANGE RIGHT FOR VALUES       
( 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012 )));

Example:  Adding a new distribution column and clustered index to an existing table using ‘Create Table as Select’ aka ‘CTAS’

CREATE TABLE TableTemp WITH ( 
    DISTRIBUTION = HASH (newDistributionColumn),
    CLUSTERED INDEX (id ASC) )
AS SELECT * FROM TableOriginal;
DROP TABLE TableOriginal
RENAME OBJECT [TableTemp] TO [TableOriginal]

Below is the base syntax for creating tables as provided by Microsoft:

  CREATE TABLE [ database_name . [ dbo ] . | dbo. ] table_name
(
{ column_name
[ COLLATE Windows_collation_name ]
[ NULL | NOT NULL ] }
[ ,...n ]
)[ WITH ( <table_option> [ ,...n ] ) ]

Create a new temporary table.

CREATE TABLE [ database_name . [ dbo ] . | dbo. ] #table_name
(
{ column_name
[ COLLATE Windows_collation_name ]
[ NULL | NOT NULL ] }
[ ,...n ]
)
WITH ( LOCATION = USER_DB [, <table_option>
[ ,...n ] ] ) [;]
::=
datetimeoffset [ ( n ) ]
| datetime2 [ ( n ) ]
| datetime
| smalldatetime
| date
| time [ ( n ) ]
| float [ ( n ) ]
| real [ ( n ) ]
| decimal [ ( precision [ , scale ] ) ]
| money
| smallmoney
| bigint
| int
| smallint
| tinyint
| bit
| nvarchar [ ( n ) ]
| nchar [ ( n ) ]
| varchar [ ( n ) ]
| char [ ( n ) ]
| varbinary [ ( n ) ]
| binary [ ( n ) ]
::={[ CLUSTERED INDEX ( { index_column_name [ ASC | DESC ] } [ ,...n ] ) ] | [ DISTRIBUTION = { HASH( distribution_column_name ) | REPLICATE } ] | [ PARTITION( partition_column_name RANGE [ LEFT|RIGHT ] FOR VALUES ( [ boundary_value [,...n] ] ) ) ] }

NOTE: When performing CTAS or re-creating tables in any way, you will need to create statistics on those tables upon loading.

Ideally, you should create stats on all the join columns, group by, order by and restriction. SQL Server PDW does not automatically create and update statistics on the Control node for every occasion when SQL Server creates or updates statistics on the Compute nodes:

— This will create stat for all columns on all objects

select ‘create statistics ‘ + b.name + ‘ on dbo.’ + a.name + ‘ (‘ + b.name + ‘)’ from sys.tables a, sys.columns b where a.object_id = b.object_id and not exists ( select null from sys.stats_columns where object_id in (select object_id from sys.stats_columns group by object_id having count(*)>=1) and object_id = b.object_id and column_id = b.column_id) order by a.name, b.column_id;

T-SQL Tip: Sniffing out special characters in a table using LIKE

Here’s a handy snippet of code if you need to find special characters in a table. I had to implement this for a client after discovering that a third party export process was not correctly handling special characters (i.e. the per table export would flip question marks to upside down question marks…weird huh!). We needed to find a way to identify those tables that contained special characters and develop custom exports for these tables only. To address the aforementioned issue, the developer can use the SQL Server system tables to retrieve meta-data of their tables and auto-generate code similar to the snippet below for all applicable tables and columns (i.e. nchar, char, varchar, etc).

CREATE TABLE #SpecialCharTable

(
colA VARCHAR(50),
colB VARCHAR(50),
colC VARCHAR(50)
)

INSERT INTO #SpecialCharTable
VALUES      (‘A’,  ‘B’, ‘C’)

INSERT INTO #SpecialCharTable
VALUES      (‘A’, ‘B’, Char(63))

INSERT INTO #SpecialCharTable
VALUES      (Char(1), Char(2), Char(3))

SELECT colA,
colB,
colC
FROM   #SpecialCharTable
WHERE  ( colA LIKE ‘%[^0-9a-zA-Z ]%’ )
OR ( colB LIKE ‘%[^0-9a-zA-Z ]%’ )
OR ( colC LIKE ‘%[^0-9a-zA-Z ]%’ )

SELECT TOP 1 *
FROM   #SpecialCharTable
WHERE  ( colA LIKE ‘%[^0-9a-zA-Z ]%’ )
OR ( colB LIKE ‘%[^0-9a-zA-Z ]%’ )
OR ( colC LIKE ‘%[^0-9a-zA-Z ]%’ )

SELECT *
FROM   #SpecialCharTable
DROP TABLE #A_SpecialCharTable

Note: PATINDEX is a handly function to find the location of special characters in a column.
http://msdn.microsoft.com/en-us/library/ms188395.aspx

%d bloggers like this: