Around 80% of the work to create a data warehouse/BI solution is spent on the ETL phase. Although building an ETL solution can be a challenge, you can break down the project into at least two separate processes for easier management. One process is strictly related to business modeling, and therefore cannot be replicated. But the other is made up of purely technical processes that are always the same, regardless of the business environment we operate in, and thus can be highly automated.
In this session, we will look at well-known patterns for solving common problems and how they can be automated with the help of specific tools and techniques that use metadata to reduce development time and bugs. Using these engineering techniques, you will be able to adopt an Agile approach to your BI solution.
3. Davide Mauri
20 Years of experience on the SQL Server Platform
– Specialized in Data Solution Architecture, Database Design,
Performance Tuning, Business Intelligence, Data Warehouse, Big Data
& Analytics
Microsoft SQL Server MVP
President of UGISS (Italian SQL Server UG)
Mentor @ SolidQ
– Regular Speaker @ SQL Server events
– Projects, Consulting, Mentoring & Training
Find me here:
– Blog: http://sqlblog.com/blogs/davide_mauri/default.aspx
– Twitter:@mauridb
4. Building a DWH in 2013
Is still an (almost) manual process
A *lot* of repetitive low-value work
No (or very few) standard tools available
5. How it should be
Semi-automatic process
– “develop by intent”
Define the mapping logic from a semantic perspective
– Source to Dimensions / Measures
• (Metadata anyone?)
Design the model and let the tool build it for you
CREATE DIMENSION Customer
FROM SourceCustomerTable
MAP USING CustomerMetadata
ALTER DIMENSION Customers
ADD ATTRIBUTE LoyaltyLevel
AS TYPE 1
CREATE FACT Orders
FROM SourceOrdersTable
MAP USING OrdersMetadata
ALTER FACT Orders
ADD DIMENSION Customer
8. Invest in Automation?
Faster development
– Reduce Costs
– Embrace Changes
Fewer bugs
Increase solution quality and make it consistent
throughout the whole product
9. Automation Pre-Requisites
Split the process into two separate types of
processes
– What can be automated
– What can NOT be automated
Create and impose a set of rules that define
– How to solve common technical problems
– How to implement such identified solutions
10. No Monkey Work!
Let the people think
and let the machines
do the «monkey» work.
13. Design Pattern
Specific SQL Server Patterns
– Change Data Capture
– Change Tracking
– Partition Load
– SSIS Parallelism
14. Engineering the DWH
“Software Engineering
allows and requires the
formalization of the
software building and
maintenance process.”
15. Sample Rules
• Always add a «last_update» column
• Always log Inserted/Updated/Deleted rows to
log.load_info table
• Use MD5 – binary(16) for checksums
• Use views to expose data
– Dimension & Fact views MUST use the same column
names for lookup columns
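The checksum rule can be sketched as follows. This is a minimal illustration in Python rather than T-SQL; the function name `row_checksum`, the `|` delimiter and the NULL marker are assumptions made for the example, not part of the original rules. The only fact it relies on is that an MD5 digest is 16 bytes long, which is why it fits a binary(16) column.

```python
import hashlib

def row_checksum(values, delimiter="|", null_marker="\x00"):
    """Return the 16-byte MD5 digest of a row's column values.

    A delimiter keeps ("ab", "c") distinct from ("a", "bc"); a marker
    keeps NULL distinct from an empty string. Both are hypothetical choices.
    """
    parts = [null_marker if v is None else str(v) for v in values]
    return hashlib.md5(delimiter.join(parts).encode("utf-8")).digest()

# Two versions of the same customer row: one attribute differs.
old_row = row_checksum(["Alice", "Stockholm", None])
new_row = row_checksum(["Alice", "Gothenburg", None])
row_changed = old_row != new_row
```

Comparing one 16-byte value per row is cheaper than comparing every tracked column, which is the reason the SCD pattern later in the deck stores the checksum next to the dimension key.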
16. Engineering the DWH
There are two intrinsic
processes hidden in the
development of a BI
solution that must be
allowed (or forced) to
emerge.
20. ETL Phases
«E» and «L» must be
– Simple, Easy and Straightforward
– Completely Automated
– Completely Reusable
«E» and «L» have ZERO value in a BI Solution
– Should be done in the most economical way
26. Source Differential Load
E
• SQL Server 2012 features that can help with
incremental/differential load
– Change Data Capture
• Natively supported in SSIS 2012
• http://www.mattmasson.com/2011/12/cdc-in-ssis-for-sqlserver-2012-2/
– Change Tracking
• Underused feature in BI…not as rich as CDC but MUCH
simpler and easier
27. SCD 1 & SCD 2
L
• Start
• Lookup Dimension Id and MD5 Checksum from Business Key
• Dimension Id is Null?
– Yes: Insert new members into DWH
– No: Calculate MD5 Checksum of Non-SCD-Key Columns
• Checksums are different?
– Yes: Store into temp table
• Merge data from temp table to DWH
• End
28. SCD 2 Special Note
L
• Merge => UPDATE Interval + INSERT New Row
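The SCD flow can be sketched in a few lines. This is a hedged, in-memory illustration in Python, not the SSIS implementation from the session: the dimension is a dict keyed by business key, the temp table is a list, and all function and column names (`load_scd1`, `merge_scd2`, `valid_from`, `valid_to`, `customer_code`) are hypothetical. For simplicity the checksum is computed before the lookup, so it can also be stored with newly inserted members.

```python
import hashlib
from datetime import date

def checksum(row, columns):
    """MD5 (16 bytes) over the tracked, non-key columns of a row."""
    text = "|".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.md5(text.encode("utf-8")).digest()

def load_scd1(dimension, source_rows, business_key, tracked_columns):
    """Insert new members, stage changed ones, then merge (SCD 1 overwrite)."""
    inserted, staged = [], []
    for src in source_rows:
        new_sum = checksum(src, tracked_columns)
        existing = dimension.get(src[business_key])   # lookup by business key
        if existing is None:                          # dimension id is null
            dimension[src[business_key]] = {
                **src, "id": len(dimension) + 1, "checksum": new_sum}
            inserted.append(src[business_key])
        elif existing["checksum"] != new_sum:         # checksums are different
            staged.append({**src, "checksum": new_sum})  # store into temp table
    for row in staged:                                # merge temp table into DWH
        dimension[row[business_key]].update(row)
    return inserted, [row[business_key] for row in staged]

def merge_scd2(history, staged, business_key, today):
    """SCD 2 merge: UPDATE the open interval, then INSERT the new version."""
    for row in staged:
        for version in history:
            if version[business_key] == row[business_key] and version["valid_to"] is None:
                version["valid_to"] = today           # close the current interval
        history.append({**row, "valid_from": today, "valid_to": None})

# SCD 1: the first load inserts C1, the second overwrites its city.
dim = {}
first = load_scd1(dim, [{"customer_code": "C1", "city": "Rome"}], "customer_code", ["city"])
second = load_scd1(dim, [{"customer_code": "C1", "city": "Milan"}], "customer_code", ["city"])

# SCD 2: the old version is closed and a new current version is appended.
history = [{"customer_code": "C1", "city": "Rome",
            "valid_from": date(2013, 1, 1), "valid_to": None}]
merge_scd2(history, [{"customer_code": "C1", "city": "Milan"}],
           "customer_code", date(2013, 11, 4))
```

Staging changed rows and merging them in one step mirrors the temp-table approach on the slide: the set-based merge replaces row-by-row updates.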
31. Parallel Load
• Logically split the work in several steps
– E.g.: Load/Process one customer at a time
• Create a «queue» table that stores information for each step
– Step 1 -> Load Customer «A»
– Step 2 -> Load Customer «B»
• Create a Package that
1. Picks the first step not already picked up
2. Does the work
3. Goes back to step 1
• Call the Package «n» times simultaneously
EL
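The queue pattern can be sketched as follows. This is a minimal Python simulation under stated assumptions: threads stand in for simultaneous package executions, an in-memory list stands in for the «queue» table, and a lock stands in for whatever atomic "pick" mechanism the database provides. The names `queue_table`, `pick_next_step` and `run_package` are hypothetical.

```python
import threading

# Stand-in for the «queue» table: one row per step of work.
queue_table = [
    {"step": 1, "customer": "A", "status": "pending"},
    {"step": 2, "customer": "B", "status": "pending"},
    {"step": 3, "customer": "C", "status": "pending"},
    {"step": 4, "customer": "D", "status": "pending"},
]
queue_lock = threading.Lock()
loaded = []  # (worker id, customer) pairs, filled in as work completes

def pick_next_step():
    """Atomically claim the first step nobody has picked up yet."""
    with queue_lock:
        for row in queue_table:
            if row["status"] == "pending":
                row["status"] = "picked"
                return row
    return None

def run_package(worker_id):
    """One package instance: pick a step, do the work, repeat until empty."""
    while True:
        step = pick_next_step()
        if step is None:
            return
        with queue_lock:
            loaded.append((worker_id, step["customer"]))  # the "work"
            step["status"] = "done"

def run_parallel(n):
    """Call the package n times simultaneously and wait for all of them."""
    workers = [threading.Thread(target=run_package, args=(i,)) for i in range(n)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

run_parallel(3)
```

The important property is that picking a step is atomic, so no two package instances ever load the same customer; which instance gets which customer is not deterministic.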
32. Other SSIS Specific Patterns
• Range Lookup
– Not natively supported
– Matt Masson has the answer in his blog
• http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx
34. Metadata
Provide context information
– Which columns are used to build/feed a
Dimension?
– Which columns are Business Keys?
– Which table is the Fact Table?
– How Fact and Dimension are connected?
• Which columns are used?
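The questions above map naturally onto a small data structure. The sketch below, in Python, is only an illustration: the class names, the `dim.` schema, the `CustomerId` surrogate key and the `CustomerCode` business key are all hypothetical. It shows how metadata that records business keys and fact-to-dimension links is enough to generate a lookup query instead of writing it by hand.

```python
from dataclasses import dataclass

@dataclass
class DimensionMeta:
    name: str
    source_table: str
    business_keys: list    # which columns are Business Keys?
    attributes: list       # which columns feed the Dimension?

@dataclass
class FactMeta:
    name: str
    source_table: str
    measures: list
    dimension_links: dict  # dimension name -> fact columns used to reach it

def lookup_join(fact, dimension):
    """Generate the fact-to-dimension lookup query from metadata alone."""
    fact_columns = fact.dimension_links[dimension.name]
    conditions = " AND ".join(
        f"f.{fact_col} = d.{key}"
        for fact_col, key in zip(fact_columns, dimension.business_keys)
    )
    return (
        f"SELECT f.*, d.{dimension.name}Id "
        f"FROM {fact.source_table} AS f "
        f"LEFT JOIN dim.{dimension.name} AS d ON {conditions}"
    )

customer = DimensionMeta("Customer", "SourceCustomerTable",
                         ["CustomerCode"], ["Name", "LoyaltyLevel"])
orders = FactMeta("Orders", "SourceOrdersTable",
                  ["Amount"], {"Customer": ["CustomerCode"]})
query = lookup_join(orders, customer)
```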
35. How to manage Metadata?
• Naming Convention
• Extended Properties
• Specific, Ad Hoc Database or Tables
• Other (XML, File, etc.)
36. Naming Convention
• The easiest and cheapest
– No additional (hidden) costs
– No need to be maintained
– Never out-of-sync
– No documentation needed
• Actually, it IS PART of the documentation
– Imposes a Standard
• Very limited in terms of flexibility and usage
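As a sketch of how a naming convention can carry metadata, the Python function below infers the role of each column from its name. The prefixes (`bk_` for business keys, `id_dim_` for dimension lookups, `m_` for measures) are an invented convention for illustration only; the session does not prescribe one.

```python
def classify_columns(column_names):
    """Infer each column's role from a (hypothetical) prefix convention."""
    roles = {"business_keys": [], "lookups": [], "measures": [], "attributes": []}
    for name in column_names:
        if name.startswith("bk_"):
            roles["business_keys"].append(name)
        elif name.startswith("id_dim_"):
            roles["lookups"].append(name)
        elif name.startswith("m_"):
            roles["measures"].append(name)
        else:
            roles["attributes"].append(name)   # anything without a known prefix
    return roles

roles = classify_columns(
    ["bk_customer_code", "id_dim_customer", "m_amount", "last_update"])
```

The limit mentioned on the slide shows up quickly: a prefix can say what a column is, but not, for example, which SCD type applies to it without making names unwieldy.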
37. Extended Properties
Supports most metadata needs
No additional software needed
Very verbose usage
– Development of a wrapper to make usage simpler is
feasible and encouraged
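A wrapper of the kind suggested above could look like this. It is a minimal sketch in Python that only builds the T-SQL text for a call to the system procedure `sys.sp_addextendedproperty` on a column; the helper names `quote` and `add_column_property`, and the property name `dwh_role`, are hypothetical. Executing the statement against a server is left out.

```python
def quote(text):
    """Render a value as a T-SQL Unicode string literal, doubling quotes."""
    return "N'" + str(text).replace("'", "''") + "'"

def add_column_property(schema, table, column, name, value):
    """Build the verbose sp_addextendedproperty call for one column."""
    return (
        "EXEC sys.sp_addextendedproperty "
        f"@name = {quote(name)}, @value = {quote(value)}, "
        f"@level0type = N'SCHEMA', @level0name = {quote(schema)}, "
        f"@level1type = N'TABLE', @level1name = {quote(table)}, "
        f"@level2type = N'COLUMN', @level2name = {quote(column)};"
    )

# Mark a column as a business key with a single short call.
statement = add_column_property(
    "dim", "Customer", "CustomerCode", "dwh_role", "business_key")
```

The eight named parameters collapse into five positional arguments, which is the kind of simplification that makes extended properties practical as a metadata store.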
38. Metadata Objects
Dedicated Ad-Hoc Database and Tables
As Flexible as you need
Maintenance Overhead to keep metadata in-sync with
data
– Development of an automatic check procedure is needed
– DMVs can help a lot here
39. External Metadata Objects
Really expensive to keep them in-sync
– A tool is needed, otherwise too much manual
work
Does not give any specific benefits with respect
to Ad-Hoc Database/Tables
42. Automation Scenarios
• Run-Time: «Auto-Configuring» Packages
– Really hard to customize packages
– SSIS limitations must be managed
• E.g.: Data Flow cannot be changed at runtime
• On-the-fly creation of packages may be needed
• Design-Time: Package Generators / Package Templates
– Easy to customize created packages
46. Useful Resources
• «STOCK» Tasks:
– http://msdn.microsoft.com/en-us/library/ms135956.aspx
• How to set Task properties at runtime:
– http://technet.microsoft.com/en-us/library/microsoft.sqlserver.dts.runtime.executables.add.aspx
47. BIML – BI Markup Language
• Developed by Varigence
– http://www.varigence.com
– http://bimlscript.com/
– MIST: BIML Full-Featured IDE
• Free via BIDS Helper
– Support “limited” to SSIS package generation
– http://bidshelper.codeplex.com
48. THANK YOU!
• For attending this session and
PASS SQLRally Nordic 2013, Stockholm