Parallel Data Warehouse (PDW) Tip: Proof of Concept (POC) Guidelines
October 23, 2012 Leave a comment
As with any other project our objective is to drive PDW POC success! To clear and cultivate the playing field for successful POC engagements, we must, at a minimum, consider the following guidelines.
Guideline 1: Identify Existing Roadblocks and De-scope
Major POC challenges can manifest themselves in many different forms i.e. inability to seed data, dealing with sub-par data warehouse architectures, wrangling poorly concieved T-SQL/ETL architectures, etc. If these challenges exist in the current production environment and are the primary driver behind the PDW POC initiative, approach with caution as these core issues can take more time to address when compared to minor issues such as lack of an SSIS framework, data delivery issues, or inability to generate very large data volumes. If show-stoppers exist, try either de-scoping the issues/components from the PDW altogether or try to find a sustainable work-around.
Guideline 2: Gather Accurate Business and Technical Requirements
This should be fairly standard as the client will have a core dataset and/or processes that they would like see run on the PDW appliance coupled with an ‘ideal’ runtime. Keep the deliverables manageable and implement the SMART methodology (Specific, Measureable, Achievable, Realistic, Timescale). Be sure to gather base requirements for documentation, data volumes, environmental requirements, etc. For example, the client may want to stage both a dev and test environment, they may only want a schemas worth of tables/data brought over from an existing SMP database (Note: PDW does not support schemas), or the client may want to bring over 2 years of history rather than 10 years. Keep historical data requirements to a minimum, as it may be difficult to find extraction windows from existing source systems. Similarly, data loads into the PDW via dwloader command or SSIS destination (slower than dwloader) can take a significant amount of time and may run over the course of X number of days depending on source system availability and data volumes. Maybe most importantly, assess the current environment, data volumes, and processing windows to manage realistic expectations! Although there have been instances where specific loading routines can result in 100X improvement in load times, this is not always the case and this factor should not be used as a baseline.
Guideline 3: Streamline Testing Requirements
When addressing the testing phase, do not overload the POC activities with extraneous testing/recon requirements and keep the scope of testing and data validation to a minimum. Simplify the testing criteria to critical dataset load times vs. data volumes. The success criteria should be clear i.e. Company GECKO has defined success as loading 2Tb of prior day policy activity data for 7 consecutive days in under 30 minutes each day, or Company STERNXM defines success as loading 20 regional subscriber datasets via flatfile concurrently in under 2 hours once a week. Also, the client may be interested in testing ad-hoc query and data loads while the appliance is actively processing large data load requests. At a minimum, the testing process will address the following :
- Data Load
- Data Expansion
- Query Performance
- Query Concurrency
- Mixed Workload Concurrency
Guideline 4: Resource Planning and Communication
While keeping project activities within scope, ensure that the team has at least the minimum number of dedicated resources to the project. In-house members of a POC delivery team usually include at minimum: DBA, Business Analyst, and a Data Developer engineer. Keep in mind, most streamlined POC engagements wrap up within a 2-3 week timeframe and if the delivery requirements are modified (heaven forbid) , to say, include more complex PDW processing routines or include additional table loads, always ensure that you remain communicative and provide updates/recommendations to resource budgets and overall POC timelines. Additionally, provide daily email updates and try to incorporate simple graphical representations of project progress (i.e. RAG Traffic lights). While never under-estimating, pad for the unexpected and ensure that time is added to the plan to account for contingencies.
Guideline 5: Partnership with Microsoft PDW COE
In a recent PDW implementation, I was involved with a few PDW-specific challenges applicable to query optimization, data loading, and appliance software upgrades. As a result of both the clients and the consulting company’s partnership with the Microsoft COE, the downtime associated with these challenges were minimized. Partnership led to smoother and more efficient PDW POC delivery as the COE was ready and willing to recommend and demonstrate solutions that aligned to MS COE best practices . Ultimately, the COE recommendations and actions directly benefited system usability and sustainability and ensured a successful POC delivery.
I would like to build a collection of guidelines within this post to benefit future PDW POC implementations! If you have an idea or want to expand on one of these guidelines, feel free to share your personal experiences and/or workarounds to common POC challenges.
I want to thank the following contributors of the PDW community for providing insight that led to the formulation of the POC guidelines above: