T-SQL Tip: Highlight Code Between Any Two Parentheses

Where has this been all my working life? If you frequently find yourself scouring complex (and sometimes needlessly complex) T-SQL code, simply place the cursor just inside or outside a parenthesis and press CTRL+SHIFT+]. SQL Server Management Studio will then highlight everything between that parenthesis and its match, for example a nested SELECT block. There is some complex T-SQL out there in the wild (future consultants beware!), and this will save you loads of scrolling and searching time and help you wrangle pesky blocks of extended code.
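If you want something to try the shortcut on, here is a small made-up query with nested parentheses. It only reads from the sys.objects and sys.columns catalog views, and the column names in the inner IN list are arbitrary examples.

```sql
-- Practice query: put the cursor next to the parenthesis after the first IN
-- and press CTRL+SHIFT+] to select through its matching closing parenthesis.
SELECT o.name
FROM sys.objects AS o
WHERE o.object_id IN (
    SELECT c.object_id
    FROM sys.columns AS c
    WHERE c.name IN ('id', 'name')
);
```

Repeating the shortcut next to the inner parenthesis should select just the short IN list.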

Crunch that code!
– Sal


Parallel Data Warehouse (PDW) Basics: CREATE TABLE and CTAS Syntax

After forgetting the syntax for temp tables a couple of times, I decided to write a brief overview of the table creation syntax for Microsoft’s Parallel Data Warehouse (PDW) architecture.

One thing worth noting up front: this post does not cover the considerations that go into determining whether a table should be distributed or replicated.

TEMPORARY TABLE BASICS

-Temp tables reside in tempdb (no surprise there); however, the ## global prefix that we see on SQL Server SMP is not available.
-Temp tables can be either distributed or replicated.
-Temp tables cannot have partitions, views, or indexes.
-Table permissions cannot be changed.
-Temp tables are visible only in the current session.
-CONTROL, INSERT, SELECT and UPDATE permissions are granted to the temp table creator.

Example: Distributed temporary table

CREATE TABLE testDB.dbo.#TableA (
   here int NOT NULL,
   we varchar(50),
   go varchar(50))
WITH
   ( LOCATION = USER_DB,
    DISTRIBUTION = HASH (here) );

NON-TEMP TABLE BASICS
-Indexes cannot be added (or dropped) after a table has been created (the developer must use the ‘CREATE TABLE AS SELECT’ aka ‘CTAS’ syntax to change indexes while preserving the underlying data).
-All tables are created with page compression; there is no configurable option for this.
-When creating a distributed table, keep in mind that every partition exists in each of the 8 distributions per compute node. If you want the math on that: a database may have 10 yearly partitions on a server. In PDW, this translates to 10 (partitions) * 8 (distributions) = 80 partitions on each compute node. If you have a Parallel Data Warehouse (PDW) ‘Full-Rack’, there will be a total of 800 partitions (80 partitions per node * 10 nodes) … interesting nonetheless!
-The default collation for the PDW appliance is Latin1_General_100_CI_AS_KS_WS. This is configurable; however, use caution if you plan on modifying a column collation, as it can lead to collation errors when comparing strings.
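The partition math above can be checked with a quick query. This is just the arithmetic from the bullet written as T-SQL; the variable names are made up for illustration, and the values (10 yearly partitions, 8 distributions per compute node, 10 compute nodes in a full rack) are the ones quoted in this post.

```sql
-- Partition count arithmetic for a distributed, partitioned table.
DECLARE @YearPartitions       int = 10;  -- partitions defined on the table
DECLARE @DistributionsPerNode int = 8;   -- distributions on each compute node
DECLARE @ComputeNodes         int = 10;  -- compute nodes in a full rack

SELECT
    @YearPartitions * @DistributionsPerNode                 AS PartitionsPerComputeNode,  -- 80
    @YearPartitions * @DistributionsPerNode * @ComputeNodes AS PartitionsPerFullRack;     -- 800
```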

Example: Replicated Table with Index

CREATE TABLE myTable (
   yearId int NOT NULL,
   some varchar(50),
   stuff varchar(50))
WITH
    ( DISTRIBUTION = REPLICATE,
      CLUSTERED INDEX (yearId) );

Example: Distributed Table with Multiple Partitions

CREATE TABLE TableA (
   yearId int NOT NULL,
   some varchar(50),
   stuff varchar(50))
WITH
   ( DISTRIBUTION = HASH (yearId),
     PARTITION ( yearId
        RANGE RIGHT FOR VALUES
        ( 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012 )));

Example:  Adding a new distribution column and clustered index to an existing table using ‘Create Table as Select’ aka ‘CTAS’

CREATE TABLE TableTemp WITH ( 
    DISTRIBUTION = HASH (newDistributionColumn),
    CLUSTERED INDEX (id ASC) )
AS SELECT * FROM TableOriginal;
DROP TABLE TableOriginal;
RENAME OBJECT [TableTemp] TO [TableOriginal];

Below is the base syntax for creating tables as provided by Microsoft:

  CREATE TABLE [ database_name . [ dbo ] . | dbo. ] table_name
(
{ column_name <data_type>
[ COLLATE Windows_collation_name ]
[ NULL | NOT NULL ] }
[ ,...n ]
)[ WITH ( <table_option> [ ,...n ] ) ]

Create a new temporary table.

CREATE TABLE [ database_name . [ dbo ] . | dbo. ] #table_name
(
{ column_name <data_type>
[ COLLATE Windows_collation_name ]
[ NULL | NOT NULL ] }
[ ,...n ]
)
WITH ( LOCATION = USER_DB [, <table_option>
[ ,...n ] ] ) [;]

<data_type> ::=
datetimeoffset [ ( n ) ]
| datetime2 [ ( n ) ]
| datetime
| smalldatetime
| date
| time [ ( n ) ]
| float [ ( n ) ]
| real [ ( n ) ]
| decimal [ ( precision [ , scale ] ) ]
| money
| smallmoney
| bigint
| int
| smallint
| tinyint
| bit
| nvarchar [ ( n ) ]
| nchar [ ( n ) ]
| varchar [ ( n ) ]
| char [ ( n ) ]
| varbinary [ ( n ) ]
| binary [ ( n ) ]

<table_option> ::=
{
[ CLUSTERED INDEX ( { index_column_name [ ASC | DESC ] } [ ,...n ] ) ]
| [ DISTRIBUTION = { HASH( distribution_column_name ) | REPLICATE } ]
| [ PARTITION( partition_column_name RANGE [ LEFT | RIGHT ] FOR VALUES ( [ boundary_value [,...n] ] ) ) ]
}

NOTE: When performing CTAS or re-creating tables in any way, you will need to create statistics on those tables upon loading.

Ideally, you should create stats on all columns used in joins, GROUP BY, ORDER BY and restrictions (WHERE clauses). SQL Server PDW does not automatically create and update statistics on the Control node every time SQL Server creates or updates statistics on the Compute nodes:

-- This generates a CREATE STATISTICS statement for every column, on every table, that does not have statistics yet

select 'create statistics ' + b.name + ' on dbo.' + a.name + ' (' + b.name + ')'
from sys.tables a, sys.columns b
where a.object_id = b.object_id
  and not exists (
      select null
      from sys.stats_columns
      where object_id = b.object_id
        and column_id = b.column_id)
order by a.name, b.column_id;
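For a single table, the statements the query above produces look roughly like the ones in this sketch. The table dbo.StatsDemo and the statistics names are hypothetical and only there so the example stands on its own; the CREATE TABLE options follow the PDW syntax shown earlier in this post.

```sql
-- Hypothetical table standing in for one that was just re-created with CTAS.
CREATE TABLE dbo.StatsDemo (
   id int NOT NULL,
   newDistributionColumn int NOT NULL)
WITH
   ( DISTRIBUTION = HASH (newDistributionColumn),
     CLUSTERED INDEX (id) );

-- One statistics object per column used in joins, GROUP BY, ORDER BY or WHERE.
-- The statistics names are arbitrary; they only need to be unique on the table.
CREATE STATISTICS StatsDemo_id ON dbo.StatsDemo (id);
CREATE STATISTICS StatsDemo_newDistributionColumn ON dbo.StatsDemo (newDistributionColumn);
```

In practice you would run the statements after the table has been loaded, so the statistics reflect the actual data.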

Dynamic Multi-Threaded SSIS using C#

This post describes an approach that will assist in the dynamic ‘spin-up’ of duplicate SSIS packages. To put it simply, I used this approach in a C# service that monitored a nightly FTP file drop folder and inserted records into a queue table (one record per file) for SSIS processing. On a timer interval, the routine looks at the queue table, gathers the count of files ready to be processed, compares that count to the number of available dtexec threads (the configurable number of SSIS threads allowed minus the number of SSIS packages currently running) and fires off dtexec calls as needed. Per Microsoft’s recommendation, the maximum number of concurrently running executables is equal to the total number of processors on the computer executing the package, plus two. In my case I configured this routine to have 18 max available processing threads, as the dtexec server contained 16 cores. Happy coding!
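The slot calculation described above can be sketched on the database side. This is not the C# service itself, only an illustration of the counting logic in T-SQL for an ordinary SQL Server instance. The #FileQueue table, its columns and the status values are assumptions made up for this example.

```sql
-- Hypothetical queue table: one row per dropped file.
CREATE TABLE #FileQueue (
    QueueId  int IDENTITY(1,1) PRIMARY KEY,
    FileName varchar(260) NOT NULL,
    Status   varchar(20)  NOT NULL);  -- assumed values: 'Ready', 'Running', 'Done'

INSERT INTO #FileQueue (FileName, Status) VALUES
    ('a.csv', 'Running'), ('b.csv', 'Running'),
    ('c.csv', 'Ready'),   ('d.csv', 'Ready'),   ('e.csv', 'Ready');

DECLARE @MaxThreads int = 18;  -- 16 cores + 2, as configured in this post
DECLARE @Running int = (SELECT COUNT(*) FROM #FileQueue WHERE Status = 'Running');
DECLARE @Ready   int = (SELECT COUNT(*) FROM #FileQueue WHERE Status = 'Ready');

-- Available slots = allowed threads minus packages already running (never negative).
DECLARE @Available int = CASE WHEN @MaxThreads > @Running THEN @MaxThreads - @Running ELSE 0 END;

-- Launch no more packages than there are ready files or free slots.
DECLARE @ToLaunch int = CASE WHEN @Ready < @Available THEN @Ready ELSE @Available END;

-- The files the service would hand to dtexec on this timer tick.
SELECT TOP (@ToLaunch) QueueId, FileName
FROM #FileQueue
WHERE Status = 'Ready'
ORDER BY QueueId;
```

A service following this pattern would mark each returned row as running before starting dtexec for it, so the next timer tick does not pick the same file again.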

Parallel Data Warehouse (PDW) Lesson 1: Basic Architecture Overview

I’ll begin my coverage of Microsoft’s Parallel Data Warehouse (PDW) architecture / appliance with a brief overview of the Massively Parallel Processing (MPP) architecture and the specific hardware components of the PDW appliance.

For starters, think of PDW as a single, pre-configured server and network environment, comprising all the hardware and software necessary to perform a very specific task: process and store massive amounts of data and retrieve it as quickly and reliably as possible. Overall, this goal is achieved by gearing user workloads on the PDW toward reading rather than writing, along with parallel load balancing and multi-level system redundancies.

The Parallel Data Warehouse appliance architecture is based on MPP technology, which has been in existence for a number of years, going back to the early 80’s. However, MPP was not widely adopted at that time due to high hardware costs. The primary difference between the traditional SQL Server symmetric multiprocessing (SMP) systems that we are all currently working with and the re-emerging MPP architecture lies in the method of processing, storage, and hardware configuration.

Scale-Up vs. Scale-Out
With SMP systems, as the size of and demands on the data grow, many server administrators choose to “throw more hardware” at emerging performance issues or increasing storage requirements. Whether it is adding more CPUs, RAM, or disks on the storage area network (SAN) to an existing SMP implementation, this approach is referred to as ‘scaling-up’. There are many issues with scaling-up, and I will briefly summarize the heavy hitters below:
– Performance Bottlenecks: Increasing the number of disks managed by a single storage controller negatively impacts performance. In an SMP environment, the single system bus cannot efficiently manage multiple user requests against an excessive number of disks.
– System Availability: Maintenance tasks and server outages typically impact all users.
– Hardware Limitations: Eventually the system architecture/hardware will not support any more CPUs or memory.
– System Compatibility: Typically the SMP is a multivendor solution, which leads to technical challenges and incompatibility.
– Total Cost of Ownership: Ongoing maintenance and the costs of changing server and network capacity lead to significant ownership costs over time.

The other school of thought is to ‘scale-out’, which offers a modular approach to scaling the server capabilities. When scaling-out, storage units or ‘racks’ are added to an MPP appliance as needed. Scaling-out not only addresses the scaling-up issues from above but also offers parallel processing of user requests via MPP-specific distribution and partitioning methods (this will be covered in Lesson 3: Distribution and Replication Tables).

Performance Bottlenecks RESOLVED:
Parallel Data Warehouse addresses performance bottlenecks via parallel processing, dual InfiniBand networks, multi-core CPUs, ample amounts of RAM, and multiple system buses.

System Availability RESOLVED:
Parallel Data Warehouse includes multiple levels of redundancy, including hot-swap drives, “warm” failover servers, mirrored drives, dual cooling fans, dual networks, dual power supplies, and failover clusters.

Hardware Limitations RESOLVED:
Parallel Data Warehouse is expandable up to four full data racks.

System Compatibility RESOLVED:
Parallel Data Warehouse is a single-vendor turn-key appliance.

Total Cost of Ownership RESOLVED:
Parallel Data Warehouse is known for ease of maintenance and effortless scalability.

Microsoft has paired up with hardware vendors, namely HP and Dell, to provide a few configuration options that fit the specific storage needs of any mid to large size customer (more on that later).

The base configuration for a “full rack” is comprised of two physical racks – the control rack and the data rack as shown below.

[Image: PDW full rack, showing the control rack and the data rack]

Control Rack
The control rack essentially hosts the management hardware/software and acts as the liaison between user requests and the compute nodes. As seen in the diagram above, the management node and the control node have backup servers to provide redundancy. Each of the servers found in the control rack is described below.

Management Node – Think of this as the nerve center of the PDW. The management node contains node images for reimaging activities, deploys software updates to all appliance nodes, monitors system health, serves as the domain controller while handling non-login authentication, and performs general hardware and software management functions.

Control Node – Serves as the central point of control for the appliance, the point of client connection and user query processing, and the store for system and database metadata. It also supports a centralized hardware monitoring scheme.

Landing Zone – Primarily serves as the point of entry for incoming data streams as data is cleansed and stored here.  Authorized NT users are able to access the LZ as it is visible to the corporate network. Third party tools such as the Nexus Query editor by Coffing Data Warehousing are also installed here.

Backup Node – Acts as a central backup and restore location for data. As with any backup location, files can be copied from here and stored in a separate backup archive.

Data Rack
The data rack, on the other hand, handles the parallel execution of user queries. Four data racks is the currently published scale-out limit of the PDW architecture, with each rack containing 10 compute nodes (effectively a 40 compute node limit).

Compute Node – Think of a compute node as a single standalone SQL Server instance with dedicated storage and memory. Current base specs for each compute node include 96 GB of memory, hex-core CPUs, and local tempdb workspace. Also, each node has a dedicated disk array, which is managed via a SAN component and interconnected to other disk arrays via a dual Fibre Channel data bus. The data bus supports high-speed disk I/O and plays an integral role in disk redundancy. Additionally, the primary data is stored as RAID1 on the SANs.

So to make a more direct connection to the DBA and/or developer that may be reading this, let’s walk through a specific scenario where a user wants to execute a query against the PDW.  The interaction with these system entities is as follows: The user connects to the control node via a query editing tool. The control node gets the request from the tool and creates a distributable query plan to execute on the compute nodes.  The compute nodes work on their subsets of data and return the result sets to the control node.  The control node compiles the result sets into one collection and outputs the results to the query editor/user.  Pretty simple stuff.
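To picture the last two steps of that flow, here is a toy illustration written for an ordinary SQL Server session. It does not show how PDW actually builds or runs its distributed plan. The temp table #PartialResults and its contents are invented stand-ins for the partial row counts that three compute nodes might send back for a query that counts rows per year.

```sql
-- Invented partial results, one row per compute node and year.
CREATE TABLE #PartialResults (
    computeNodeId int    NOT NULL,
    yearId        int    NOT NULL,
    rowsFound     bigint NOT NULL);

INSERT INTO #PartialResults (computeNodeId, yearId, rowsFound) VALUES
    (1, 2010, 120), (1, 2011, 95),
    (2, 2010, 130), (2, 2011, 101),
    (3, 2010, 110), (3, 2011, 99);

-- The "control node" step: combine the partial results into one result set.
SELECT yearId, SUM(rowsFound) AS totalRows
FROM #PartialResults
GROUP BY yearId
ORDER BY yearId;
```

Each compute node only ever touches its own subset of the data, and the final combination works on the much smaller partial results.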

Dual Infiniband Network

The compute nodes are connected together using a dual InfiniBand network, which enables high-speed data sharing and computations between nodes, called data shuffling. Compute nodes are also connected to the control and admin nodes to support high-speed backups/restores, table copies, data loads, and query execution.

Dual Fibre Channel

Essentially serving as the data bus, the dual fibre channel ensures super-fast I/O processing and failover redundancy.

General PDW Base Features:

Strong relational engine
Parallel Database Copy
Microsoft product stack integration
Physical separation of server workloads
Separation of IO patterns
Parallel loading – 1.5TB per hour on 1 rack
High speed scanning – 20 to 35GBps per rack

The “Half-Rack”

I think it is important to note here that the current PDW offerings best serve customers that have a projected storage requirement of between 10 TB and 500 TB over the next few years.

To support those customers that do not have hundreds of terabytes worth of data but still want to take advantage of the features of PDW, HP and Microsoft agreed on an additional topology called the Half-Data Rack.  The Half-Data Rack provides around 40% of the performance of a full rack and is around 65% of the price.

Below is a breakdown of the current offerings:

Configuration                        Servers  Processors  Cores  Space (TB)
HP PDW Full Rack                     17       22          132    125
HP PDW Full Rack with 4 Data Racks   47       82          492    500
HP PDW Half Rack                     11       8           48     15-60 (optional disk sizes available)

That wraps up this initial discussion on the basics of PDW hardware architecture. I hope you have found this post useful and have a better understanding of the hardware components that make up the PDW architecture.
