Automating Data Warehouse
Patterns Through Metadata
Davide Mauri
dmauri@solidq.com
Davide Mauri
20 Years of experience on the SQL Server Platform
– Specialized in Data Solution Architecture, Database Design,
Performance Tuning, Business Intelligence, Data Warehouse, Big Data
& Analytics

Microsoft SQL Server MVP
President of UGISS (Italian SQL Server UG)
Mentor @ SolidQ
– Regular Speaker @ SQL Server events
– Projects, Consulting, Mentoring & Training

Find me here:
– Blog: http://sqlblog.com/blogs/davide_mauri/default.aspx
– Twitter: @mauridb
Building a DWH in 2013
Is still an (almost) manual process
A *lot* of repetitive low-value work
No (or very few) standard tools available
How it should be
Semi-automatic process
– “develop by intent”

CREATE DIMENSION Customer
FROM SourceCustomerTable
MAP USING CustomerMetadata

ALTER DIMENSION Customer
ADD ATTRIBUTE LoyaltyLevel
AS TYPE 1

CREATE FACT Orders
FROM SourceOrdersTable
MAP USING OrdersMetadata

ALTER FACT Orders
ADD DIMENSION Customer

Define the mapping logic from a semantic perspective
– Source to Dimensions / Measures
• (Metadata anyone?)

Design the model and let the tool build it for you
The perfect BI process & architecture

Iterative!
Is automation possible?

DWH PROCESSES
Invest in Automation?
Faster development
– Reduce Costs
– Embrace Changes

Fewer bugs
Increase solution quality and make it consistent
throughout the whole product
Automation Pre-Requisites
Split the work into two separate types of
processes
– What can be automated
– What can NOT be automated

Create and impose a set of rules that define
– How to solve common technical problems
– How to implement such identified solutions
No Monkey Work!
Let the people think
and let the machines
do the «monkey» work.
Design Pattern
“A general reusable
solution to a commonly
occurring problem
within a given context”
Design Pattern
Generic ETL Pattern
– Partition Load
– Incremental/Differential Load

Generic BI Design Pattern
– Slowly Changing Dimension
• SCD1, SCD2, etc.

– Fact Table
• Transactional, Snapshot, Temporal Snapshot
Design Pattern
Specific SQL Server Patterns
– Change Data Capture
– Change Tracking
– Partition Load
– SSIS Parallelism
Engineering the DWH
“Software Engineering
allows and requires the
formalization of the
software building and
maintenance process.”
Sample Rules
• Always add a «last_update» column
• Always log Inserted/Updated/Deleted rows to the
log.load_info table
• Use MD5 – binary(16) for checksums
• Use views to expose data
– Dimension & Fact views MUST use the same column
names for lookup columns
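
The checksum rule can be sketched as follows. This is a minimal illustration in Python so that it runs without a server; it is not code from the talk. The helper name `row_checksum` and the `|` separator are assumptions. In SQL Server the same idea would typically use `HASHBYTES('MD5', ...)`, whose MD5 result is 16 bytes long and therefore fits a binary(16) column.

```python
import hashlib

def row_checksum(values):
    """Return a 16-byte MD5 digest over the given column values.

    None is encoded as an empty string, and values are joined with a
    separator so that ("ab", "c") and ("a", "bc") hash differently.
    """
    text = "|".join("" if v is None else str(v) for v in values)
    return hashlib.md5(text.encode("utf-8")).digest()

a = row_checksum(["Alice", "Stockholm", None])
b = row_checksum(["Alice", "Oslo", None])
print(len(a), a != b)
```

Comparing one 16-byte value per row is what makes the later "did anything change?" tests cheap, regardless of how many columns the row has.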
Engineering the DWH
There are two intrinsic
processes hidden in the
development of a BI
solution that must be
allowed (or forced) to
emerge.
Business Process
Data manipulation,
transformation, enrichment
& cleansing logic

Specific to every customer.
Almost impossible to automate
Technical Process
Application of data
extraction and loading
techniques
Recurring (pattern) in
any solution

Highly Automatable
High-Level Vision
[Diagram: OLTP → STG → DWH. The «E» and «L» steps are labeled Technical Process; the «T» step is labeled Business Process.]
ETL Phases
«E» and «L» must be
– Simple, Easy and Straightforward
– Completely Automated
– Completely Reusable

«E» and «L» have ZERO value in a BI Solution
– Should be done in the most economical way
Well-known solutions to common problems

PATTERN
Source Full Load («E» phase)
Source Incremental Load («E» phase)
In this scenario,
“ID” is an IDENTITY/SEQUENCE column,
probably the PK.
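
A minimal sketch of this scenario, assuming an ever-increasing “ID” and using SQLite in place of SQL Server so it runs anywhere. The table names `src_orders` and `stg_orders` are hypothetical. Only rows above the highest ID already staged are extracted.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE stg_orders (id INTEGER PRIMARY KEY, amount REAL);
""")

def incremental_load(con):
    """Copy only the source rows whose id is above the staging high-water mark."""
    (last_id,) = con.execute("SELECT COALESCE(MAX(id), 0) FROM stg_orders").fetchone()
    cur = con.execute(
        "INSERT INTO stg_orders (id, amount) "
        "SELECT id, amount FROM src_orders WHERE id > ?",
        (last_id,),
    )
    con.commit()
    return cur.rowcount  # number of rows loaded in this run

con.executemany("INSERT INTO src_orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
first = incremental_load(con)    # loads ids 1 and 2
con.execute("INSERT INTO src_orders VALUES (3, 30.0)")
second = incremental_load(con)   # loads only id 3
print(first, second)
```

This only picks up new rows. Updates and deletes in the source go unnoticed, which is why the differential scenarios exist.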
Source Differential Load/1 («E» phase)

In this scenario the source table
offers no specific way to
tell what has changed
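
When the source gives no hint about changes, one common approach (an assumption here, not necessarily the one shown on the slide) is to compare a per-row checksum against the checksum kept from the previous load. The function names and sample data below are illustrative.

```python
import hashlib

def checksum(row):
    """MD5 digest over all non-key column values of a row (a tuple)."""
    return hashlib.md5("|".join(map(str, row)).encode("utf-8")).digest()

def diff_by_checksum(source, staged):
    """Classify keys as inserted, updated or deleted.

    source: dict key -> tuple of current column values
    staged: dict key -> checksum stored by the previous load
    """
    inserted = sorted(k for k in source if k not in staged)
    deleted = sorted(k for k in staged if k not in source)
    updated = sorted(
        k for k in source if k in staged and checksum(source[k]) != staged[k]
    )
    return inserted, updated, deleted

previous = {
    1: checksum(("Alice", "Rome")),
    2: checksum(("Bob", "Oslo")),
    3: checksum(("Carl", "Bern")),
}
current = {1: ("Alice", "Rome"), 2: ("Bob", "Stockholm"), 4: ("Dana", "Paris")}
result = diff_by_checksum(current, previous)
print(result)
```

The price is a full read of the source on every run; the comparison itself stays cheap because only keys and checksums are kept.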
Source Differential Load/2 («E» phase)

In this scenario the source table
has a TimeStamp-like column
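
With a timestamp-like column, the extract can filter on a stored watermark and upsert the result. A minimal sketch, again on SQLite; the column name `last_update`, the table names, and the use of ISO-8601 text timestamps are assumptions made for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_customer (id INTEGER PRIMARY KEY, city TEXT, last_update TEXT);
    CREATE TABLE stg_customer (id INTEGER PRIMARY KEY, city TEXT, last_update TEXT);
""")

def differential_load(con, watermark):
    """Upsert rows changed after `watermark`; return the new watermark."""
    rows = con.execute(
        "SELECT id, city, last_update FROM src_customer WHERE last_update > ?",
        (watermark,),
    ).fetchall()
    con.executemany("INSERT OR REPLACE INTO stg_customer VALUES (?, ?, ?)", rows)
    con.commit()
    # ISO-8601 strings sort chronologically, so max() gives the latest change
    return max((r[2] for r in rows), default=watermark)

con.executemany(
    "INSERT INTO src_customer VALUES (?, ?, ?)",
    [(1, "Rome", "2013-01-01T10:00:00"), (2, "Oslo", "2013-01-02T10:00:00")],
)
wm = differential_load(con, "")
con.execute(
    "UPDATE src_customer SET city = 'Bergen', "
    "last_update = '2013-01-03T09:00:00' WHERE id = 2"
)
wm = differential_load(con, wm)
print(wm)
```

Unlike the ID-based load, this catches updates as well as inserts. Deleted source rows still leave no trace.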
Source Differential Load («E» phase)

• SQL Server 2012 features that can help with
incremental/differential load
– Change Data Capture
• Natively supported in SSIS 2012
• http://www.mattmasson.com/2011/12/cdc-in-ssis-for-sql-server-2012-2/

– Change Tracking
• Underused feature in BI: not as rich as CDC, but MUCH
simpler and easier
SCD 1 & SCD 2 («L» phase)
[Flowchart: Start to End]
– Steps: Lookup Dimension Id and MD5 Checksum from Business Key; Calculate MD5 Checksum of Non-SCD-Key Columns; Insert new members into DWH; Store into temp table; Merge data from temp table to DWH
– Decisions: Dimension Id is Null? (Yes/No); Checksums are different? (Yes)
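
One plausible reading of this flowchart, sketched for a type 1 dimension: look up the dimension row by business key, insert it if it is missing, and otherwise update it only when the checksum differs. The order of the steps is an assumption, and the table and column names (`dim_customer`, `customer_bk`) are hypothetical. SQLite stands in for the DWH, and a Python list stands in for the temp table.

```python
import hashlib
import sqlite3

def md5(*values):
    return hashlib.md5("|".join(map(str, values)).encode("utf-8")).digest()

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate dimension id
        customer_bk TEXT UNIQUE,                        -- business key
        city TEXT,
        checksum BLOB                                   -- MD5 of the non-key columns
    )
""")

def load_scd1(con, source_rows):
    """source_rows: iterable of (business_key, city). Returns (inserted, updated)."""
    inserted, changed = 0, []
    for bk, city in source_rows:
        new_sum = md5(city)
        found = con.execute(
            "SELECT customer_id, checksum FROM dim_customer WHERE customer_bk = ?",
            (bk,),
        ).fetchone()
        if found is None:              # dimension id is null: new member
            con.execute(
                "INSERT INTO dim_customer (customer_bk, city, checksum) VALUES (?, ?, ?)",
                (bk, city, new_sum),
            )
            inserted += 1
        elif found[1] != new_sum:      # checksums differ: park the row
            changed.append((city, new_sum, found[0]))
    # stand-in for "store into temp table" + "merge": apply parked rows in one batch
    con.executemany(
        "UPDATE dim_customer SET city = ?, checksum = ? WHERE customer_id = ?", changed
    )
    con.commit()
    return inserted, len(changed)

r1 = load_scd1(con, [("C1", "Rome"), ("C2", "Oslo")])
r2 = load_scd1(con, [("C1", "Rome"), ("C2", "Bergen"), ("C3", "Bern")])
print(r1, r2)
```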
SCD 2 Special Note («L» phase)

• Merge => UPDATE the validity interval of the current row + INSERT a new row
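
The two halves of that merge can be sketched like this. It is a simplified illustration, not the speaker's implementation: the `valid_from`/`valid_to` columns, the use of NULL for the open interval, and the single tracked attribute are all assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_bk TEXT,
        city TEXT,
        valid_from TEXT,
        valid_to TEXT            -- NULL marks the current version
    )
""")

def apply_scd2(con, bk, city, load_date):
    """Close the current version if the attribute changed, then insert a new one."""
    current = con.execute(
        "SELECT customer_id, city FROM dim_customer "
        "WHERE customer_bk = ? AND valid_to IS NULL",
        (bk,),
    ).fetchone()
    if current is not None and current[1] == city:
        return False                        # nothing changed, keep current version
    if current is not None:                 # UPDATE interval of the old version
        con.execute(
            "UPDATE dim_customer SET valid_to = ? WHERE customer_id = ?",
            (load_date, current[0]),
        )
    con.execute(                            # INSERT the new version
        "INSERT INTO dim_customer (customer_bk, city, valid_from, valid_to) "
        "VALUES (?, ?, ?, NULL)",
        (bk, city, load_date),
    )
    con.commit()
    return True

apply_scd2(con, "C1", "Rome", "2013-01-01")
unchanged = apply_scd2(con, "C1", "Rome", "2013-02-01")
changed = apply_scd2(con, "C1", "Oslo", "2013-03-01")
history = con.execute(
    "SELECT city, valid_from, valid_to FROM dim_customer ORDER BY customer_id"
).fetchall()
print(history)
```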
Fact Table Load («L» phase)

Partition Load («E» and «L» phases)
Parallel Load («E» and «L» phases)
• Logically split the work into several steps
– E.g.: Load/process one customer at a time

• Create a «queue» table that stores information for each step
– Step 1 -> Load Customer «A»
– Step 2 -> Load Customer «B»

• Create a Package that
1. Picks the first step not already picked up
2. Does the work
3. Goes back to step 1

• Call the Package «n» times simultaneously
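
The queue idea can be sketched with threads playing the role of the «n» package instances. Everything here is illustrative: the `load_queue` table, the `picked_by` column and the worker names are assumptions, and a Python lock stands in for whatever makes the claim atomic on a real server.

```python
import sqlite3
import threading

con = sqlite3.connect(":memory:", check_same_thread=False)
con.execute(
    "CREATE TABLE load_queue (step INTEGER PRIMARY KEY, customer TEXT, picked_by TEXT)"
)
con.executemany(
    "INSERT INTO load_queue (step, customer) VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "C"), (4, "D")],
)
con.commit()
lock = threading.Lock()   # serializes access to the shared connection
done = []

def pick_next(worker):
    """Claim the first step nobody has picked yet; return None when the queue is empty."""
    with lock:
        row = con.execute(
            "SELECT step, customer FROM load_queue "
            "WHERE picked_by IS NULL ORDER BY step LIMIT 1"
        ).fetchone()
        if row is not None:
            con.execute(
                "UPDATE load_queue SET picked_by = ? WHERE step = ?", (worker, row[0])
            )
            con.commit()
        return row

def package(worker):
    while True:
        step = pick_next(worker)   # 1. pick the first step not already picked up
        if step is None:
            break
        done.append(step[1])       # 2. do the work (here: just record the customer)
                                   # 3. go back to 1

workers = [threading.Thread(target=package, args=(f"w{i}",)) for i in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(sorted(done))
```

The important property is that selecting and marking a step happen as one unit, so no two instances ever load the same customer.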
Other SSIS Specific Patterns
• Range Lookup
– Not natively supported
– Matt Masson has the answer in his blog
• http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx
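
For readers unfamiliar with the term, a range lookup maps a value to the row whose range contains it, rather than to an exact key match. The sketch below only illustrates the concept in Python; it is not the SSIS technique from the linked post, and the age-band data is made up.

```python
from bisect import bisect_right

# Hypothetical age-band dimension: (lower bound inclusive, dimension id), sorted by bound.
bands = [(0, 1), (18, 2), (35, 3), (65, 4)]
lower_bounds = [low for low, _ in bands]

def range_lookup(value):
    """Return the id of the band whose range contains value, or None if below all bands."""
    pos = bisect_right(lower_bounds, value) - 1
    return bands[pos][1] if pos >= 0 else None

print(range_lookup(17), range_lookup(18), range_lookup(70), range_lookup(-1))
```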
A key ingredient in automation

METADATA
Metadata
Provides context information
– Which columns are used to build/feed a
Dimension?
– Which columns are Business Keys?
– Which table is the Fact Table?
– How are Facts and Dimensions connected?
• Which columns are used?
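
A minimal sketch of metadata that answers these questions, assuming nothing about how the talk's demo stored it. The class names and the column names (`CustomerCode`, `Amount`) are hypothetical; the table names reuse the ones from the “develop by intent” slide.

```python
from dataclasses import dataclass, field

@dataclass
class DimensionMeta:
    name: str
    source_table: str
    business_keys: list
    attributes: list

@dataclass
class FactMeta:
    name: str
    source_table: str
    measures: list
    # dimension name -> source column used to look the dimension up
    dimension_links: dict = field(default_factory=dict)

customer = DimensionMeta(
    "Customer", "SourceCustomerTable", ["CustomerCode"], ["Name", "LoyaltyLevel"]
)
orders = FactMeta(
    "Orders", "SourceOrdersTable", ["Amount"], {"Customer": "CustomerCode"}
)

def describe(fact, dimensions):
    """Answer the questions above for one fact table from metadata alone."""
    lines = [f"Fact table: {fact.name} (from {fact.source_table})"]
    for dim in dimensions:
        link = fact.dimension_links.get(dim.name)
        if link:
            keys = ", ".join(dim.business_keys)
            lines.append(f"{fact.name}.{link} -> {dim.name} on {keys}")
    return lines

summary = describe(orders, [customer])
print("\n".join(summary))
```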
How to manage Metadata?
• Naming Convention

• Extended Properties
• Specific, Ad Hoc Database or Tables
• Other (XML, File, etc.)
Naming Convention
• The easiest and cheapest
– No additional (hidden) costs
– No need to be maintained
– Never out-of-sync
– No documentation needed
• Actually, it IS PART of the documentation
– Imposes a Standard

• Very limited in terms of flexibility and usage
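
As an illustration of metadata carried by names alone, the sketch below derives a column's role from its prefix. The prefixes (`bk_`, `scd1_`, `scd2_`, `id_`) are an invented convention for this example, not one prescribed by the talk.

```python
def classify_column(column_name):
    """Derive a column's role from its prefix (hypothetical convention)."""
    prefixes = {
        "bk_": "business key",
        "scd1_": "type 1 attribute",
        "scd2_": "type 2 attribute",
        "id_": "dimension lookup",
    }
    for prefix, role in prefixes.items():
        if column_name.startswith(prefix):
            return role, column_name[len(prefix):]
    return "other", column_name

for name in ["bk_customer_code", "scd2_city", "last_update"]:
    print(name, "->", classify_column(name))
```

The limit noted on the slide shows up quickly: a prefix can carry one fact about a column, and anything richer needs another mechanism.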
Extended Properties
Supports most metadata needs
No additional software needed

Very verbose usage
– Development of a wrapper to make usage simpler is
feasible and encouraged
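
A sketch of what such a wrapper might look like: a small function that emits the verbose `sys.sp_addextendedproperty` call for a column-level property. The function name and the property name `dwh_role` are assumptions; the stored procedure and its `@name`, `@value` and `@levelNtype`/`@levelNname` parameters are the standard SQL Server ones.

```python
def extended_property_sql(schema, table, column, name, value):
    """Build a sys.sp_addextendedproperty call for one column-level property."""
    def lit(text):
        # N'...' literal with embedded single quotes doubled
        return "N'" + str(text).replace("'", "''") + "'"
    return (
        "EXEC sys.sp_addextendedproperty "
        f"@name = {lit(name)}, @value = {lit(value)}, "
        f"@level0type = N'SCHEMA', @level0name = {lit(schema)}, "
        f"@level1type = N'TABLE', @level1name = {lit(table)}, "
        f"@level2type = N'COLUMN', @level2name = {lit(column)};"
    )

sql = extended_property_sql("dbo", "DimCustomer", "CustomerCode", "dwh_role", "business key")
print(sql)
```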
Metadata Objects
Dedicated Ad-Hoc Database and Tables

As Flexible as you need
Maintenance overhead to keep metadata in-sync with
data
– An automatic check procedure needs to be developed
– DMVs can help a lot here
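
A minimal sketch of such a check, assuming a metadata table named `meta_column` (hypothetical). It compares what the metadata claims against what the database catalog reports. SQLite's `PRAGMA table_info` is used here so the example runs standalone; on SQL Server the same comparison would read the system catalog instead, for example `sys.columns` or `INFORMATION_SCHEMA.COLUMNS`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stg_customer (customer_code TEXT, name TEXT, city TEXT);
    CREATE TABLE meta_column (table_name TEXT, column_name TEXT, role TEXT);
    INSERT INTO meta_column VALUES
        ('stg_customer', 'customer_code', 'business key'),
        ('stg_customer', 'name', 'attribute'),
        ('stg_customer', 'loyalty_level', 'attribute');
""")

def check_metadata(con, table):
    """Return (columns described but missing in the table, columns not described).

    `table` is interpolated into the PRAGMA, so it must come from a trusted list.
    """
    actual = {row[1] for row in con.execute(f"PRAGMA table_info({table})")}
    described = {
        row[0]
        for row in con.execute(
            "SELECT column_name FROM meta_column WHERE table_name = ?", (table,)
        )
    }
    return sorted(described - actual), sorted(actual - described)

missing, undescribed = check_metadata(con, "stg_customer")
print(missing, undescribed)
```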
External Metadata Objects
Really expensive to keep them in-sync
– A tool is needed, otherwise too much manual
work

Does not give any specific benefits with respect
to Ad-Hoc Database/Tables
DEMO
Let’s make it possible!

AUTOMATION
Automation Scenarios
• Run-Time: «Auto-Configuring» Packages
– Really hard to customize packages
– SSIS limitations must be managed
• E.g.: Data Flow cannot be changed at runtime
• On-the-fly creation of packages may be needed

• Design-Time: Package Generators / Package Templates
– Easy to customize created packages
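
The design-time approach can be illustrated with a tiny generator: one template, one metadata entry per table, one generated script each. This is a sketch of the idea only. It produces T-SQL text rather than SSIS packages, and the table names, the metadata layout and the `@last_loaded_key` variable are assumptions.

```python
from string import Template

LOAD_TEMPLATE = Template(
    "INSERT INTO $target ($columns)\n"
    "SELECT $columns\n"
    "FROM $source\n"
    "WHERE $key > @last_loaded_key;"
)

def generate_load(meta):
    """Render one incremental-load statement from a metadata entry."""
    return LOAD_TEMPLATE.substitute(
        target=meta["target"],
        source=meta["source"],
        columns=", ".join(meta["columns"]),
        key=meta["key"],
    )

metadata = [
    {"source": "dbo.Orders", "target": "stg.Orders", "key": "OrderId",
     "columns": ["OrderId", "CustomerCode", "Amount"]},
    {"source": "dbo.Customers", "target": "stg.Customers", "key": "CustomerId",
     "columns": ["CustomerId", "Name"]},
]
scripts = [generate_load(m) for m in metadata]
print(scripts[0])
```

Because the output is ordinary text, a generated script can still be opened and adjusted by hand, which is the customization advantage the slide attributes to design-time generation.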
Automation Solutions
• Specific tools/frameworks
– BIML / MIST

• SQL Server Platform
– SQL, PowerShell, .NET
– SMO, AMO
Package Generators
Required Assemblies
Microsoft.SqlServer.ManagedDTS
Microsoft.SqlServer.DTSRuntimeWrap
Microsoft.SqlServer.DTSPipelineWrap

Path:
C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies
DEMO
Useful Resources
• «STOCK» Tasks:
– http://msdn.microsoft.com/en-us/library/ms135956.aspx

• How to set Task properties at runtime:
– http://technet.microsoft.com/en-us/library/microsoft.sqlserver.dts.runtime.executables.add.aspx
BIML – BI Markup Language
• Developed by Varigence
– http://www.varigence.com
– http://bimlscript.com/
– MIST: BIML Full-Featured IDE

• Free via BIDS Helper
– Support “limited” to SSIS package generation
– http://bidshelper.codeplex.com
THANK YOU!
• For attending this session and
PASS SQLRally Nordic 2013, Stockholm


Editor's notes

  1. http://chartporn.org/2012/05/10/repetitive-tasks/
  2. http://en.wikipedia.org/wiki/Software_design_pattern
  3. http://en.wikipedia.org/wiki/Software_design_pattern
  4. http://en.wikipedia.org/wiki/Software_design_pattern
  5. Matt Masson Blog: http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx