Data Warehousing Tip: Using T-SQL vs SSIS for Slowly Changing Dimension (SCD) Types 1 and 2
September 11, 2013
There is an ongoing discussion/debate regarding the handling of NULLs in a data warehouse; however, the following article by Michelle Poolet summarizes it beautifully: "consider the data warehouse business requirements" (see http://sqlmag.com/database-administration/null-data-warehouse).
Beyond that consideration, some data types should get the DEFAULT treatment (bit, char, nchar, nvarchar, and varchar), and the following script can assist in creating these defaults across every applicable column in your database.
SELECT Object_schema_name(T.[object_id], Db_id()) AS [Schema],
       T.[name] AS [table_name],
       AC.[name] AS [column_name],
       TY.[name] AS system_data_type,
       AC.[max_length], AC.[precision], AC.[scale],
       AC.[is_nullable], AC.[is_ansi_padded],
       'ALTER TABLE dbo.' + T.[name] + ' ADD CONSTRAINT ' + AC.[name]
         + ' DEFAULT(' +
         CASE TY.[name]
             --WHEN 'int' THEN '0'
             --WHEN 'bigint' THEN '0'
             WHEN 'varchar' THEN ''
             WHEN 'bit' THEN ''
             WHEN 'char' THEN ''
             WHEN 'smalldatetime' THEN ''
             WHEN 'datetime' THEN ''
             --WHEN 'tinyint' THEN '0'
             --WHEN 'smallint' THEN '0'
             WHEN 'nvarchar' THEN ''
             --WHEN 'money' THEN '0.00'
             ELSE ''
         END + ') FOR ' + AC.[name] AS sqlstring
FROM sys.[tables] AS T
INNER JOIN sys.[all_columns] AC ON T.[object_id] = AC.[object_id]
INNER JOIN sys.[types] TY ON AC.[system_type_id] = TY.[system_type_id]
                         AND AC.[user_type_id] = TY.[user_type_id]
WHERE T.[is_ms_shipped] = 0
  AND T.[name] LIKE ('DIM%')
  AND AC.[name] NOT LIKE ('%SrgKey%')
  AND TY.[name] IN ('varchar', 'bit', 'char', 'smalldatetime', 'datetime', 'nvarchar')
I've taken the approach of defaulting values for all columns (less those excluded by the WHERE clause) in my STAGING database. The thought is that the non-NULL values inserted into my staging tables will persist to the TARGET database, and I won't run into one of the variety of issues (OLAP, T-SQL, or otherwise) that a development and/or user team encounters when NULL values reach the data warehouse.
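As a rough illustration of the metadata-driven approach, here is a minimal Python sketch (not the T-SQL above) that generates the same style of DEFAULT-constraint DDL from a hypothetical list of column metadata. The table/column names are invented, I use a DF_-prefixed constraint name rather than the bare column name the script uses, and the choice of 0 as the bit default is my assumption:

```python
# Hypothetical metadata rows: (table_name, column_name, system_data_type)
columns = [
    ("DimCustomer", "CustomerName",   "varchar"),
    ("DimCustomer", "IsActive",       "bit"),
    ("DimCustomer", "CustomerSrgKey", "int"),   # surrogate key: skipped
]

# Types that get the DEFAULT treatment (bit defaulted to 0 here, an assumption)
DEFAULTS = {"varchar": "''", "nvarchar": "''", "char": "''",
            "nchar": "''", "bit": "0"}

def default_ddl(rows):
    ddl = []
    for table, column, dtype in rows:
        if "SrgKey" in column or dtype not in DEFAULTS:
            continue  # mirror the WHERE clause: skip surrogate keys and other types
        ddl.append(
            f"ALTER TABLE dbo.{table} ADD CONSTRAINT DF_{table}_{column} "
            f"DEFAULT({DEFAULTS[dtype]}) FOR {column}"
        )
    return ddl

for stmt in default_ddl(columns):
    print(stmt)
```

Running the generated statements against the staging database is what actually installs the defaults; the sketch only shows the generation step.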
April 12, 2013
If IDENTITY insert is disabled or simply not available in a particular database or database architecture, you'll want to do a table-based surrogate key assignment or use an INSERT with ROW_NUMBER() and the OVER clause. I prefer using ROW_NUMBER().
USE [SomeDB]
CREATE TABLE [dbo].[testTgtTable](
[srgKey] [int] NULL,
[colA] [nchar](10) NULL,
[colB] [nchar](10) NULL
)
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[testSrcTable](
[colA] [nchar](10) NULL,
[colB] [nchar](10) NULL
)
GO
INSERT INTO testSrcTable
SELECT 'hear', 'no evil' UNION ALL
SELECT 'see', 'no evil' UNION ALL
SELECT 'speak', 'no evil'
DECLARE @someIntVar int
SET @someIntVar = (SELECT ISNULL(MAX(SrgKey),0) SK
FROM dbo.testTgtTable)
INSERT INTO testTgtTable (SrgKey, colA, colB)
SELECT
ROW_NUMBER() OVER(ORDER BY [colA]) + @someIntVar SK
, [colA]
, [colB]
FROM
testSrcTable
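The pattern above can be sketched outside of SQL as well. A minimal Python analogue (table contents match the demo rows; the function itself is my own illustration): take the current MAX(SrgKey) from the target, defaulting to 0 when the table is empty, then number the incoming source rows on top of that offset.

```python
# Existing target row already holds srgKey 1; two new source rows arrive.
target = [{"srgKey": 1, "colA": "hear", "colB": "no evil"}]
source = [{"colA": "see", "colB": "no evil"},
          {"colA": "speak", "colB": "no evil"}]

def assign_surrogate_keys(target_rows, source_rows, order_by="colA"):
    # ISNULL(MAX(SrgKey), 0) -- offset for the new keys
    offset = max((r["srgKey"] for r in target_rows), default=0)
    # ROW_NUMBER() OVER (ORDER BY colA) -- deterministic numbering
    ordered = sorted(source_rows, key=lambda r: r[order_by])
    return [{"srgKey": offset + i, **r} for i, r in enumerate(ordered, start=1)]

new_rows = assign_surrogate_keys(target, source)
# new_rows[0]["srgKey"] == 2, new_rows[1]["srgKey"] == 3
```

The T-SQL version does the same two steps in one statement: the scalar subquery supplies the offset and ROW_NUMBER() supplies the per-row increment.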
February 7, 2013
Rather than making individual time calculations for each measure in an SSAS cube (e.g. YTD, Prior YTD, YTD variance, etc.), one can use a more dynamic approach to time calculations applicable to all measures and save a great amount of dev time and overhead in the process.
1. Create Named Query in DSV:
2. Create the dimension and add it to the Dimension Usage tab of the cube (no relations to the fact will exist)
3. Add the MDX script calculations:
The result!
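To make the idea concrete outside of MDX, here is a hedged Python analogue. The point of the date-tool/shell-dimension approach is that each time calculation is written once and applies to any measure; the sample monthly values below are invented:

```python
# Hypothetical measure values keyed by (year, month)
monthly = {
    (2012, 1): 100, (2012, 2): 120,
    (2013, 1): 110, (2013, 2): 150,
}

def ytd(series, year, month):
    """Year-to-date: sum of the measure from January through `month`."""
    return sum(v for (y, m), v in series.items() if y == year and m <= month)

def prior_ytd(series, year, month):
    """Same YTD window, shifted back one year."""
    return ytd(series, year - 1, month)

def ytd_variance(series, year, month):
    return ytd(series, year, month) - prior_ytd(series, year, month)

# The same three calcs work unchanged for any measure series passed in:
assert ytd(monthly, 2013, 2) == 260
assert prior_ytd(monthly, 2013, 2) == 220
assert ytd_variance(monthly, 2013, 2) == 40
```

In the cube, the MDX script calculations play the role of these functions, and the unrelated "date tool" dimension is what lets one calculation apply across every measure instead of being re-authored per measure.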
February 7, 2013
SSRS Tip: Simplest Method to Pass MDX Multi-Select Parameters to MDX Datasets
I recently ran across a poorly covered topic regarding SSRS development with MDX parameters and datasets. After some trial and error and visits to a number of online resources, I found the approach below to be the most straightforward.
To pass a multivalued MDX parameter to an MDX based dataset, perform the following routine:
1. Create the parameter dataset (right-click Datasets >> Add Dataset…) – Note: MEMBER_CAPTION and UNIQUENAME enable the cleanest approach to assigning values to the parameter (step 2). Also, we will be filtering out unwanted parameter values.
Step 1.1 – click ‘use dataset embedded in report’
Step 1.2 – select datasource
Step 1.3 – click ‘Query Designer’
Step 1.4 – Click ‘Design Mode’ icon
Step 1.5 – Enter query
WITH
MEMBER [Measures].[ParameterCaption] AS [Fixed Asset Book].[Book ID].CURRENTMEMBER.MEMBER_CAPTION
MEMBER [Measures].[ParameterValue] AS [Fixed Asset Book].[Book ID].CURRENTMEMBER.UNIQUENAME
SET [Book ID] AS [Fixed Asset Book].[Book ID].ALLMEMBERS
SELECT {[Measures].[ParameterCaption], [Measures].[ParameterValue]} ON COLUMNS,
FILTER([Book ID],
    [Measures].[ParameterCaption] <> "All"
    AND [Measures].[ParameterCaption] <> ""
    AND [Measures].[ParameterCaption] <> "N/A"
    AND [Measures].[ParameterCaption] <> "Unknown") ON ROWS
FROM [FixedAssets]
Notice the full unique names in the 'ParameterValue' column... the use of these values will enable the StrToSet function later in our primary MDX dataset.
2. Create the parameter (on the left pane, right click ‘Parameters’, ‘Add Parameter…’)
2.1 – From the General tab, add ‘allow multiple selection’ if needed
2.2 – Assign ParameterCaption as the label (displayed to the user on the report) and ParameterValue as the value (interpreted by SSRS when filtering report data)
3. Create the dataset (Right click DataSet >> Add Dataset…)
Step 3.1 – click ‘use dataset embedded in report’
Step 3.2 – select datasource
Step 3.3 – click ‘Query Designer’
Step 3.4 – Click ‘Design Mode’ icon
Step 3.5 – Enter the query, including the WHERE (StrToSet(@BookID, CONSTRAINED)) clause – this enables the MDX to interpret the @BookID multi-select object (created in step 2) as a dynamic member set.
Step 3.6 – Click the ‘Query Parameters’ icon
Step 3.7 – Create the BookID parameter assignment
4. You’re all set!
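Behind the scenes, when 'allow multiple selection' is enabled, SSRS hands the dataset a comma-separated list of the parameter's Value fields (the UNIQUENAME strings from step 1), which StrToSet(@BookID, CONSTRAINED) turns into a member set. A sketch of that expansion (the &amp;[CORP]/&amp;[TAX] member keys are hypothetical examples):

```python
# Hypothetical selections from the multi-select parameter (Value fields)
selected_values = [
    "[Fixed Asset Book].[Book ID].&[CORP]",
    "[Fixed Asset Book].[Book ID].&[TAX]",
]

def to_mdx_set(values):
    # Equivalent of Join(Parameters!BookID.Value, ",") wrapped as an MDX set;
    # StrToSet then resolves this string to actual members.
    return "{" + ",".join(values) + "}"

print(to_mdx_set(selected_values))
# {[Fixed Asset Book].[Book ID].&[CORP],[Fixed Asset Book].[Book ID].&[TAX]}
```

This is why step 1 captures UNIQUENAME rather than just the caption: only full unique names survive the round trip into StrToSet with the CONSTRAINED flag.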
November 16, 2012
HP Announcement: PDW V2 Appliance Features
New features for HP's Enterprise Data Warehouse (EDW) V2 appliance for SQL Server PDW have been announced!
Highlights:
October 23, 2012
As with any other project, our objective is to drive PDW POC success! To clear and cultivate the playing field for successful POC engagements, we must, at a minimum, consider the following guidelines.
Guideline 1: Identify Existing Roadblocks and De-scope
Major POC challenges can manifest themselves in many different forms, e.g. inability to seed data, dealing with sub-par data warehouse architectures, wrangling poorly conceived T-SQL/ETL architectures, etc. If these challenges exist in the current production environment and are the primary driver behind the PDW POC initiative, approach with caution, as these core issues can take more time to address than minor issues such as the lack of an SSIS framework, data delivery issues, or an inability to generate very large data volumes. If show-stoppers exist, try either de-scoping the issues/components from the PDW POC altogether or finding a sustainable workaround.
Guideline 2: Gather Accurate Business and Technical Requirements
This should be fairly standard, as the client will have a core dataset and/or processes that they would like to see run on the PDW appliance, coupled with an 'ideal' runtime. Keep the deliverables manageable and implement the SMART methodology (Specific, Measurable, Achievable, Realistic, Timescale). Be sure to gather base requirements for documentation, data volumes, environmental requirements, etc. For example, the client may want to stage both a dev and a test environment, they may only want a schema's worth of tables/data brought over from an existing SMP database (note: PDW does not support schemas), or they may want to bring over 2 years of history rather than 10. Keep historical data requirements to a minimum, as it may be difficult to find extraction windows on existing source systems. Similarly, data loads into the PDW via the dwloader command or the SSIS destination (slower than dwloader) can take a significant amount of time and may run over the course of X number of days depending on source system availability and data volumes. Maybe most importantly, assess the current environment, data volumes, and processing windows to manage realistic expectations! Although there have been instances where specific loading routines resulted in a 100X improvement in load times, this is not always the case, and that factor should not be used as a baseline.
Guideline 3: Streamline Testing Requirements
When addressing the testing phase, do not overload the POC activities with extraneous testing/recon requirements, and keep the scope of testing and data validation to a minimum. Simplify the testing criteria to critical dataset load times vs. data volumes. The success criteria should be clear, e.g. Company GECKO has defined success as loading 2TB of prior-day policy activity data for 7 consecutive days in under 30 minutes each day, or Company STERNXM defines success as loading 20 regional subscriber datasets via flat file concurrently in under 2 hours once a week. Also, the client may be interested in testing ad-hoc queries and data loads while the appliance is actively processing large data load requests. At a minimum, the testing process will address the following:
Guideline 4: Resource Planning and Communication
While keeping project activities within scope, ensure that the team has at least the minimum number of resources dedicated to the project. In-house members of a POC delivery team usually include, at minimum, a DBA, a Business Analyst, and a Data Developer. Keep in mind, most streamlined POC engagements wrap up within a 2-3 week timeframe, and if the delivery requirements are modified (heaven forbid) to, say, include more complex PDW processing routines or additional table loads, always ensure that you remain communicative and provide updates/recommendations to resource budgets and overall POC timelines. Additionally, provide daily email updates and try to incorporate simple graphical representations of project progress (e.g. RAG traffic lights). While never under-estimating, pad for the unexpected and ensure that time is added to the plan to account for contingencies.
Guideline 5: Partnership with Microsoft PDW COE
In a recent PDW implementation, I was involved with a few PDW-specific challenges applicable to query optimization, data loading, and appliance software upgrades. As a result of both the client's and the consulting company's partnership with the Microsoft COE, the downtime associated with these challenges was minimized. Partnership led to a smoother and more efficient PDW POC delivery, as the COE was ready and willing to recommend and demonstrate solutions that aligned to MS COE best practices. Ultimately, the COE recommendations and actions directly benefited system usability and sustainability and ensured a successful POC delivery.
I would like to build a collection of guidelines within this post to benefit future PDW POC implementations! If you have an idea or want to expand on one of these guidelines, feel free to share your personal experiences and/or workarounds to common POC challenges.
I want to thank the following contributors of the PDW community for providing insight that led to the formulation of the POC guidelines above:
Matt Goswell
James Herring
Ralph Kemperdick
Derek Comingore
Matthew Winter
October 18, 2012
I received the following error today when attempting to use Visio’s ‘Database Model Diagram’ template under category ‘Software and Database’.
The program VISIO.EXE version 14.0.6122.5000 stopped interacting with Windows and was closed. To see if more information about the problem is available, check the problem history in the Action Center control panel. Process ID: 3fa8 Start Time: 01cdad7f36f7bba9 Termination Time: 0 Application Path: D:\Program Files\Microsoft Office\Office14\VISIO.EXE Report Id: 9c9eef0a-1972-11e2-9a12-005056995c65
Sure enough, per the seemingly improbable suggestion made in the post below, Visio did not hang after changing the default printer to 'Microsoft XPS Document Writer'. Don't make the mistake I did and change the default printer to 'Microsoft XPS Document Writer (redirected 7)'.
As it turns out, Remote Desktop (RDP) assigns remotely-mounted printers which can cause issues with Visio’s templates.
Below is similar to what you should see once you’ve assigned the correct default printer.
October 15, 2012
When loading tables in Microsoft's Parallel Data Warehouse (PDW) via T-SQL, one will soon encounter the need to load data into a new table. Whether the developer needs to perform an initial data load of a table or go through the PDW-specific steps to perform a 'trunc and load' of a table, there are two table-load methods to consider.
To make a direct comparison between the two methods, I have provided the actual PDW query and execution plans generated by each method using identical SELECTs.
Usage
Appends one or more new rows to a table. Add one row by listing the row values or add multiple rows by inserting the results of a SELECT statement.
Example
INSERT INTO TABLEA
SELECT only, use, this, method, to, append
FROM TABLEB
Generated Query Plan
Generated Execution Plan
Usage
Creates a new table in SQL Server PDW that is populated with the results from a SELECT statement.
Example
CREATE TABLE myTable
WITH
( CLUSTERED INDEX (yearId) )
AS
SELECT yearId, some, stuff
FROM TABLEB;
Generated Query Plan
Generated Execution Plan
*When comparing to INSERT SELECT, notice the absence of the Index Scan and Parallelism steps above. CTAS results in an Index Seek for this particular SELECT and also does away with the Parallelism steps.
Performance Results
CTAS loaded the test table more than twice as fast as INSERT SELECT.
Simple Explanation
On the PDW, INSERT SELECT is a fully logged transaction and occurs serially per distribution. CTAS, however, is minimally logged and happens in parallel across all nodes.
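A back-of-envelope model of why that matters (the per-distribution timings below are invented for illustration, not benchmark results): a serial load pays for each distribution one after another, while a parallel load is bounded by the slowest distribution.

```python
# Hypothetical seconds each distribution needs to load its slice of the data
per_distribution_seconds = [30, 32, 29, 31, 30, 33, 28, 30]

# Serial (INSERT SELECT-like): distributions load one after another
insert_select_time = sum(per_distribution_seconds)   # 243 seconds

# Parallel (CTAS-like): all distributions load at once, bounded by the slowest
ctas_time = max(per_distribution_seconds)            # 33 seconds

print(insert_select_time / ctas_time)  # > 7x in this toy model
```

Real-world gains are smaller (the post measured roughly 2x) because logging overhead, I/O contention, and plan differences all play a part; the sketch only isolates the serial-vs-parallel dimension.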
NOTE: When performing CTAS or re-creating tables in any way, you will need to create statistics on those tables upon loading.
Ideally, you should create stats on all join, GROUP BY, ORDER BY, and restriction (WHERE) columns. SQL Server PDW does not automatically create and update statistics on the Control node on every occasion when SQL Server creates or updates statistics on the Compute nodes:
-- This will create stats for all columns on all objects
SELECT 'create statistics ' + b.name + ' on dbo.' + a.name + ' (' + b.name + ')'
FROM sys.tables a, sys.columns b
WHERE a.object_id = b.object_id
  AND NOT EXISTS (
      SELECT NULL
      FROM sys.stats_columns
      WHERE object_id IN (SELECT object_id
                          FROM sys.stats_columns
                          GROUP BY object_id
                          HAVING COUNT(*) >= 1)
        AND object_id = b.object_id
        AND column_id = b.column_id)
ORDER BY a.name, b.column_id;
August 27, 2012
There are a couple of process-hungry operations that can be avoided when developing or migrating T-SQL queries into Microsoft's Parallel Data Warehouse (PDW) environment. With proper planning and attention to detail, one can side-step the two Data Movement Service (DMS) operations known as the ShuffleMove and the PartitionMove. To shed a bit of light on what DMS is: it's basically a service agent that moves data between the appliance compute nodes within the PDW 'shared-nothing' architecture.
In an attempt to set a hook of interest in ya, I'll share a recent project breakthrough associated with bypassing these two DMS operations. Recently my PDW implementation team saw a 600% performance boost in query runtimes after reformulating a few steps in a query that previously triggered the PartitionMove and ShuffleMove DMS operations... I assume I have your attention now!
So let's get to it... the idea of a large table or the concept of a subquery should be familiar to data developers, DBAs, and the like. When I refer to table distributions, this may sound foreign; however, I am referring to the method by which the developer has decided to distribute the data across the PDW nodes via a CREATE TABLE statement. There are only two methods of distribution, and one must be chosen when creating a table:
Replicated Tables – Replicated tables are an ideal method for storing dimensional tables, lookup tables, or other small reference tables that are commonly used in PDW joins. The developer should consider creating tables via the replicated method when they would like to have a local copy of a table on each node. Since every node has a local copy, joins to the replicated table should be quick and efficient as long as the DMS doesn’t have to arrange for datasets to be distributed on other nodes (see incompatible joins toward the bottom of this post) and as long as the replicated table is relatively small, say less than 5mil records (this is just an estimate, there is a science behind choosing distribution modes and I’ll dive further into this on a future post).
Sample:
CREATE TABLE [testpdwdb].[dbo].[replicatedTable]
(
[AcctCorp] INT NULL,
[CutoverDate] DATE NULL
)
WITH (DISTRIBUTION = REPLICATE);
Distributed Tables – Distributed tables are best suited for fact tables, very large dimension tables, or larger tables that contain a column with many distinct values used in table join predicates (columns that meet these criteria are considered candidate distribution columns). Finding the best distribution column can be a time-consuming process, as you really have to understand the queries that run (or may run) against the table. After an appropriate column has been chosen, PDW's hash distribution algorithm takes care of evenly distributing the data across the appliance nodes.
Note: Even data distribution across the appliance compute nodes prevents queries from encountering performance-draining 'data skew' scenarios. Data skew occurs when the developer makes a poor choice of distribution column, which can result in the hash algorithm 'clumping' together rows that share the same distribution column values. Ultimately, this results in some nodes working overtime while other nodes are vacationing... it's tough to call this a parallel data warehouse if all nodes are not working at the same time under similar workloads!
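Data skew is easy to demonstrate outside the appliance. A toy Python sketch (node count, column names, and data are all hypothetical) hashes the same rows once on a high-cardinality column and once on a low-cardinality column:

```python
from collections import Counter

NODES = 10  # hypothetical appliance node count
rows = [{"AcctCorp": i, "Region": i % 2} for i in range(100_000)]

def distribution(rows, column):
    """Count how many rows land on each node under hash distribution."""
    return Counter(hash(r[column]) % NODES for r in rows)

even = distribution(rows, "AcctCorp")  # high cardinality: 10,000 rows per node
skewed = distribution(rows, "Region")  # only 2 distinct values...

print(sorted(even.values()))  # perfectly even
print(len(skewed))            # 2 -- eight of the ten nodes hold no rows at all
```

The skewed case is exactly the 'clumping' described above: every row with the same Region value hashes to the same node, so two nodes do all the work while the rest sit idle.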
Sample:
CREATE TABLE [testpdwdb].[dbo].[distributedTable]
(
[AcctCorp] INT NULL,
[CutoverDate] DATE NULL
)
WITH (CLUSTERED INDEX ([CutoverDate]),
DISTRIBUTION = HASH([AcctCorp]));
With that mini-lesson out of the way, let me formally introduce the ShuffleMove and PartitionMove DMS operations, as these performance-draining operations kick in when tables are not distributed correctly.
The ShuffleMove:
Let's start with the ShuffleMove. A ShuffleMove occurs when the PDW must move data between nodes to fulfill a query request. These occur when a query joins two or more tables that are not distributed on the same column (also known as an incompatible join). To create a compatible join, PDW must create a temp table on every node for the incompatible table, redistribute that table's data across the nodes on a compatible column, join the temp table back to the other table, and stream the results to the client.
As you can see in the following diagram, there is a lot of data movement going on in a ShuffleMove to perform a simple query task. Records in fact_dealer_sales must be 'shuffled' to align with the make_id hash of fact_manufacturer_sales.
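The mechanics of the diagram can be sketched in a few lines of Python (node count, table shapes, and data are invented; this only models the data movement, not the appliance):

```python
NODES = 4  # hypothetical node count

def hash_distribute(rows, key):
    """Place each row on the node chosen by hashing its distribution column."""
    nodes = [[] for _ in range(NODES)]
    for r in rows:
        nodes[hash(r[key]) % NODES].append(r)
    return nodes

# fact_dealer_sales distributed on dealer_id, fact_manufacturer_sales on make_id
dealer_sales = [{"dealer_id": d, "make_id": d % 3, "amt": 10} for d in range(12)]
mfr_sales = [{"make_id": m, "units": 5} for m in range(3)]

by_dealer = hash_distribute(dealer_sales, "dealer_id")  # incompatible with a make_id join
by_make = hash_distribute(mfr_sales, "make_id")

# The ShuffleMove: DMS re-hashes dealer_sales into per-node temp tables on
# make_id, so matching join keys now live on the same node...
shuffled = hash_distribute(dealer_sales, "make_id")

# ...and each node can then join locally with no further data movement.
local_joins = [
    [(d, m) for d in shuffled[n] for m in by_make[n] if d["make_id"] == m["make_id"]]
    for n in range(NODES)
]
```

Distributing both tables on the join column up front would make `shuffled` identical to the original layout, which is exactly how a compatible join avoids the shuffle entirely.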
The PartitionMove:
A PartitionMove is the most expensive DMS operation and involves moving large amounts of data to the Control Node and across all of the appliance distributions on each node (8 per node). PartitionMoves are typically the result of a GROUP BY that does not include the distribution column. When a query attempts a GROUP BY on a non-distribution column, the aggregation takes place locally on each node's distributions (10 nodes * 8 per node = 80 distributions), and the partial results are then sent up to the Control Node for final aggregation.
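That two-phase aggregation can be sketched as follows (distribution count, column names, and data are hypothetical; the point is the local-then-global merge):

```python
from collections import Counter

DISTRIBUTIONS = 8
# Rows grouped by make_id, a column that is NOT the distribution column,
# so every group can appear on every distribution.
rows = [{"make_id": i % 3, "amt": 1} for i in range(240)]

# Phase 1: each distribution runs a local GROUP BY over its own slice of rows
partials = []
for d in range(DISTRIBUTIONS):
    local = Counter()
    for r in rows[d::DISTRIBUTIONS]:  # this distribution's slice
        local[r["make_id"]] += r["amt"]
    partials.append(local)

# Phase 2: all partial aggregates travel to the Control Node and are merged
final = Counter()
for p in partials:
    final.update(p)  # Counter.update adds counts rather than replacing them

# each make_id totals 80 across the appliance
```

If the GROUP BY column were the distribution column, each group would live entirely on one distribution, phase 2 would reduce to concatenating disjoint results, and the expensive movement of partials disappears.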
Note: the column fact_dreamcar_sales.make would not reside in the fact table as a text field (more likely make_id) and was included just for kicks. BTW, Lamborghini makes one heck of a supercar; however, it would've made the 'make' column quite wide. Additionally, Zonda is a 'model' and not a 'make', although the Zonda did 'make' a name for its manufacturer, Pagani (it just occurred to me that I've been watching too much of BBC's Top Gear lately).
Here are some steps that can be taken to avoid PDW ShuffleMove and PartitionMove operations:
If interested, below I've included the T-SQL skeleton and the DMS query plan for the query that was running for 4+ hours on the PDW. Note the T-SQL was pulled directly off the SMP (where it runs in 40 minutes) and was modified slightly to run on the PDW.
As you look at the T-SQL and the DMS query plan below, hopefully a couple of burning questions come to mind after reading this post, maybe something along the lines of "Are all of those joins to subqueries aligned?", "Is the distribution key included in those GROUP BYs?", or maybe even "WTH is this?"... we asked ourselves these questions, made some code changes (CTAS'd the subqueries into temp tables and aligned distribution columns before joining), and managed to get it down to a 30-minute runtime on the PDW... not bad considering all of the indexes the SMP required to get it to run in 40 minutes!
Declare @EndDate as date
Set @EndDate = '7/4/2012'
Declare @StartDate as date
Set @StartDate = '7/4/2012'
Delete From testpdwdb.dbo.someFinancialMasterTable where fiscal_date = @EndDate
insert into testpdwdb.dbo.someFinancialMasterTable
SELECT …
FROM
(
SELECT …
FROM
(
SELECT …
FROM
(
SELECT …
FROM
(
SELECT …
FROM testpdwdb.dbo.someOrderSnapshotTable
WHERE someDatecol = DATEADD(D, -1, @StartDate)
)c
JOIN
(
SELECT …
FROM testpdwdb.dbo.someProductRevenueTable
WHERE Fiscal_date = DATEADD(D, -1, @StartDate)
)r
ON c.sys_oci = r.sys
AND c.prin_oci = r.prin
AND c.sub_acct_no_oci = r.sub_acct_no
AND R.some_other_date_col = C.some_date_col
JOIN
(
SELECT …
FROM testpdwdb.dbo.someMarketTable
)M
ON R.sys = M.sys
and r.prin = m.prin
LEFT JOIN
(
SELECT …
FROM testpdwdb.dbo.someHierarchyTable
)h
ON r.gl_acct_id = h.gl_acct_id
) m
GROUP BY …
) a
FULL OUTER JOIN
(
SELECT …
FROM
(
SELECT …
FROM
(
SELECT …
FROM testpdwdb.dbo.someItemTable a
full outer JOIN
(
SELECT …
FROM testpdwdb.dbo.someAccountItemTable b
left join testpdwdb.dbo.someActivityTable c
on b.sys = c.sys
and b.prin = c.prin
and b.SUB_ACCT_NO = c.sub_acct_no
and b.date = c.date
WHERE b.date BETWEEN @StartDate AND @EndDate
and c.date BETWEEN @StartDate AND @EndDate
) b
ON a.sys = b.sys
AND a.prin = b.prin
AND a.acct = b.sub_acct_no
AND UPPER(a.product) = UPPER(b.product)
AND a.date = b.date
GROUP BY …
) z
JOIN
(
SELECT …
FROM
testpdwdb.dbo.someMarketTable
)M
ON z.sys = M.sys
and z.prin = m.prin
GROUP BY …
) X
FULL OUTER JOIN
(
SELECT …
FROM
(
SELECT …
FROM testpdwdb.dbo.someFinancialModelTable
WHERE DATE BETWEEN @StartDate AND @EndDate AND gl_acct_id IS NOT NULL
)pm
JOIN
(
SELECT …
FROM testpdwdb.dbo.someMarketTable
)M
ON pm.sys = M.sys
and pm.prin = m.prin
GROUP BY …
) PM
ON X.some_entity_col = PM.some_entity_col
AND UPPER(X.some_customer_col) = UPPER(PM.some_customer_col)
AND UPPER(X.some_product_col) = UPPER(PM.some_product_col)
AND X.gl_acct_id = PM.gl_acct_id
) b
ON a.some_entity_col = b.some_entity_col
AND UPPER(a.some_customer_col) = UPPER(b.some_customer_col)
AND UPPER(a.some_product_col) = UPPER(b.some_product_col)
AND a.gl_acct_id = b.gl_acct_id
GROUP BY …
) x
GROUP BY
…
…and the Parallel Data Warehouse DMS plan only a mother could love: