SlideShare ist ein Scribd-Unternehmen logo
1 von 29
Datawarehouse modeling – the basics
SQL&BI UG Meeting Oct 2019
What Is a Data Warehouse?
A centralized store of business data for reporting and
analysis that typically:
• Contains large volumes of historical data
• Is optimized for querying, as opposed to inserting or updating
data
• Is incrementally loaded with new business data at regular
intervals
• Provides the basis for enterprise BI solutions
Data Warehouse Architectures
Central Data Warehouse
Departmental Data Marts
Hub-and-Spoke
Components of a Data Warehousing Solution
Data Sources
ETL and
Data Cleansing
Data Warehouse
Data Models
Reporting and
Analysis
MDM
The dimensional model
5
The Dimensional Model
Snowflake
schema
DimensionAttributes
DimensionAttributes
Star schema
Dimension
Attributes FactMeasures
DimensionAttributes
DimensionAttributes
DimensionAttributes
Star Schema
Star schema in PowerBI Models
8
The Data Warehouse Design Process
1. Determine analytical and reporting requirements
2. Identify the business processes that generate the required data
3. Examine the source data for those business processes
4. Conform dimensions across business processes
5. Prioritize processes and create a dimensional model for each
6. Document and refine the models to determine the database
logical schema
7. Design the physical data structures for the database
Dimensional Modeling
Business Processes
Time
Product
Customer
Salesperson
FactoryLine
Shipper
Account
Department
Warehouse
Manufacturing x x x
Order Processing x x x x
Order Fulfillment x x x
Financial Accounting x x x
Inventory Management x x x
• Grain: 1 row per order item
• Dimensions: Time (order date and ship date), Product, Customer, Salesperson
• Facts: Item Quantity, Unit Cost, Total Cost, Unit Price, Sales Amount, Shipping Cost
Documenting Dimensional Models
Country
State or
Province
City
Age
Marital Status
Gender
Sales
Order
Item Quantity
Unit Cost
Total Cost
Unit Price
Sales Amount
Shipping Cost
Time
(Order Date and
Ship Date)
CustomerProduct
Calendar Year
Month
Date
Fiscal Year
Fiscal Quarter
Month
Date
Region
Country
Territory
Manager
Surname
Forename
Category
Subcategory
Product Name
Color
Size
Salesperson
Lesson 2: Designing Dimension Tables
 Considerations for Dimension Keys
 Dimension Attributes and Hierarchies
 Unknown and None
 Designing Slowly Changing Dimensions
 Time Dimension Tables
 Self-Referencing Dimension Tables
 Junk Dimensions
Considerations for Dimension Keys
ProductKey ProductAltKey ProductName Color Size
1 MB1-B-32 MB1 Mountain Bike Blue 32
2 MB1-R-32 MB1 Mountain Bike Red 32
CustomerKey CustomerAltKey Name
1 1002 Amy Alberts
2 1005 Neil Black
Surrogate Key Business (Alternate) Key
Dimension Attributes and Hierarchies
CustKey CustAltKey Name Country State City Phone Gender
1 1002 Amy Alberts Canada BC Vancouver 555 123 F
2 1005 Neil Black USA CA Irvine 555 321 M
3 1006 Ye Xu USA NY New York 555 222 M
Hierarchy
SlicerDrill-through detail
Unknown and None
• Identify the semantic meaning of NULL
• Unknown or None?
• Do not assume NULL equality
• Use ISNULL( )
OrderNo Discount DiscountType
1000 1.20 Bulk Discount
1001 0.00 N/A
1002 2.00
1003 0.50 Promotion
1004 2.50 Other
1005 0.00 N/A
1006 1.50
Source
Dimension Table
DiscKey DiscAltKey DiscountType
-1 Unknown Unknown
0 N/A None
1 Bulk Discount Bulk Discount
2 Promotion Promotion
3 Other Other
Designing Slowly Changing Dimensions
CustKey CustAltKey Name Phone
1 1002 Amy Alberts 555 123
CustKey CustAltKey Name City Current Start End
1 1002 Amy Alberts Vancouver Yes 1/1/2000
CustKey CustAltKey Name Phone
1 1002 Amy Alberts 555 222
Type 1
CustKey CustAltKey Name City Current Start End
1 1002 Amy Alberts Vancouver No 1/1/2000 1/1/2012
4 1002 Amy Alberts Toronto Yes 1/1/2012
Type 2
CustKey CustAltKey Name Cars
1 1002 Amy Alberts 0
CustKey CustAltKey Name Prior Cars Current Cars
1 1002 Amy Alberts 0 1
Type 3
Time Dimension Tables
• Surrogate key
• Granularity
• Range
• Attributes and hierarchies
• Multiple calendars
• Unknown values
DateKey DateAltKey MonthDay WeekDay Day MonthNo Month Year
00000000 01-01-1753 NULL NULL NULL NULL NULL NULL
20130101 01-01-2016 1 3 Tue 01 Jan 2016
20130102 01-02-2016 2 4 Wed 01 Jan 2016
20130103 01-03-2016 3 5 Thu 01 Jan 2016
20130104 01-04-2016 4 6 Fri 01 Jan 2016
Self-Referencing Dimension Tables
EmployeeKey EmployeeAltKey EmployeeName ManagerKey
1 1000 Kim Abercrombie NULL
2 1001 Kamil Amireh 1
3 1002 Cesar Garcia 1
4 1003 Jeff Hay 2
Kim Abercrombie 1st Level Manager
Kamil Amireh 2nd Level Manager
Jeff Hay 3rd Level (Employee)
Cesar Garcia 2nd Level Manager
Junk Dimensions
• Combine low-cardinality attributes that don’t
belong in existing dimensions into a junk
dimension
• Avoids creating many small dimension tables
JunkKey OutOfStockFlag FreeShippingFlag CreditOrDebit
1 1 1 Credit
2 1 1 Debit
3 1 0 Credit
4 1 0 Debit
5 0 1 Credit
6 0 1 Debit
7 0 0 Credit
8 0 0 Debit
Fact tables
Fact Table Columns
Types of Measure
Types of Fact Table
Fact Table Keys
• Dimension Keys
• Measures
• Degenerate Dimensions
OrderDateKey ProductKey CustomerKey OrderNo Qty SalesAmount
20160101 25 120 1000 1 350.99
20160101 99 120 1000 2 6.98
20160101 25 178 1001 2 701.98
OrderDateKey ProductKey CustomerKey OrderNo Qty SalesAmount
20160101 25 120 1000 1 350.99
20160101 99 120 1000 2 6.98
20160101 25 178 1001 2 701.98
OrderDateKey ProductKey CustomerKey OrderNo Qty SalesAmount
20160101 25 120 1000 1 350.99
20160101 99 120 1000 2 6.98
20160101 25 178 1001 2 701.98
Types of Measure
• Additive
• Semi-Additive
• Non-Additive
OrderDateKey ProductKey CustomerKey SalesAmount
20160101 25 120 350.99
20160101 99 120 6.98
20160102 25 178 701.98
DateKey ProductKey StockCount
20160101 25 23
20160101 99 118
20160102 25 22
OrderDateKey ProductKey CustomerKey ProfitMargin
20160101 25 120 25
20160101 99 120 22
20160102 25 178 27
Types of Fact Table
• Transaction Fact Table
• Periodic Snapshot Fact Table
• Accumulating Snapshot Fact Table
OrderDateKey ProductKey CustomerKey OrderNo Qty Cost SalesAmount
20160101 25 120 1000 1 125.00 350.99
20160101 99 120 1000 2 2.50 6.98
20160101 25 178 1001 2 250.00 701.98
DateKey ProductKey OpeningStock UnitsIn UnitsOut ClosingStock
20160101 25 25 1 3 23
20160101 99 120 0 2 118
OrderNo OrderDateKey ShipDateKey DeliveryDateKey
1000 20160101 20160102 20160105
1001 20160101 20160102 00000000
1002 20160102 00000000 00000000
Physical implementation specifics
24
25
Considerations for physical implementation
 Data files and filegroups
 Staging tables
 tempdb
 Transaction logs
 Backup files
 Partitioning
 Indexing
Considerations for Indexes
• Dimension table indexes
• Clustered index on surrogate key column
• Nonclustered index on business key and SCD columns
• Nonclustered indexes on frequently searched columns
• Fact table indexes
• Clustered index on most commonly searched date key
• Nonclustered indexes on other dimension keys
Or
• Columnstore index on all columns
• Composite index key comprises up to 16 columns
Incremental load flows
27
Options for Extracting Modified Data
• Extract all records
• Store a primary key and checksum
• Use a datetime column as a “high water mark”
• Use Change Data Capture
• Use Change Tracking
Common ETL Data Flow Architectures
• Single-stage ETL:
 Data is transferred directly from
source to data warehouse
 Transformations and validations
occur in-flight or on extraction
• Two-stage ETL:
 Data is staged for a coordinated
load
 Transformations and validations
occur in-flight, or on staged data
• Three-stage ETL:
 Data is extracted quickly to a
landing zone, and then staged
prior to loading
 Transformations and validation can
occur throughout the data flow
Source DW
Source DWStaging
Source
DWStaging
Landing Zone

Weitere ähnliche Inhalte

Kürzlich hochgeladen

Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 

Empfohlen

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Empfohlen (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

BG SQL&BI User group meeting - DWH modeling - the basics

  • 1. Datawarehouse modeling – the basics SQL&BI UG Meeting Oct 2019
  • 2. What Is a Data Warehouse? A centralized store of business data for reporting and analysis that typically: • Contains large volumes of historical data • Is optimized for querying, as opposed to inserting or updating data • Is incrementally loaded with new business data at regular intervals • Provides the basis for enterprise BI solutions
  • 3. Data Warehouse Architectures Central Data Warehouse Departmental Data Marts Hub-and-Spoke
  • 4. Components of a Data Warehousing Solution Data Sources ETL and Data Cleansing Data Warehouse Data Models Reporting and Analysis MDM
  • 6. The Dimensional Model Snowflake schema DimensionAttributes DimensionAttributes Star schema Dimension Attributes FactMeasures DimensionAttributes DimensionAttributes DimensionAttributes
  • 8. Star schema in PowerBI Models 8
  • 9. The Data Warehouse Design Process 1. Determine analytical and reporting requirements 2. Identify the business processes that generate the required data 3. Examine the source data for those business processes 4. Conform dimensions across business processes 5. Prioritize processes and create a dimensional model for each 6. Document and refine the models to determine the database logical schema 7. Design the physical data structures for the database
  • 10. Dimensional Modeling Business Processes Time Product Customer Salesperson FactoryLine Shipper Account Department Warehouse Manufacturing x x x Order Processing x x x x Order Fulfillment x x x Financial Accounting x x x Inventory Management x x x • Grain: 1 row per order item • Dimensions: Time (order date and ship date), Product, Customer, Salesperson • Facts: Item Quantity, Unit Cost, Total Cost, Unit Price, Sales Amount, Shipping Cost
  • 11. Documenting Dimensional Models Country State or Province City Age Marital Status Gender Sales Order Item Quantity Unit Cost Total Cost Unit Price Sales Amount Shipping Cost Time (Order Date and Ship Date) CustomerProduct Calendar Year Month Date Fiscal Year Fiscal Quarter Month Date Region Country Territory Manager Surname Forename Category Subcategory Product Name Color Size Salesperson
  • 12. Lesson 2: Designing Dimension Tables  Considerations for Dimension Keys  Dimension Attributes and Hierarchies  Unknown and None  Designing Slowly Changing Dimensions  Time Dimension Tables  Self-Referencing Dimension Tables  Junk Dimensions
  • 13. Considerations for Dimension Keys ProductKey ProductAltKey ProductName Color Size 1 MB1-B-32 MB1 Mountain Bike Blue 32 2 MB1-R-32 MB1 Mountain Bike Red 32 CustomerKey CustomerAltKey Name 1 1002 Amy Alberts 2 1005 Neil Black Surrogate Key Business (Alternate) Key
  • 14. Dimension Attributes and Hierarchies CustKey CustAltKey Name Country State City Phone Gender 1 1002 Amy Alberts Canada BC Vancouver 555 123 F 2 1005 Neil Black USA CA Irvine 555 321 M 3 1006 Ye Xu USA NY New York 555 222 M Hierarchy SlicerDrill-through detail
  • 15. Unknown and None • Identify the semantic meaning of NULL • Unknown or None? • Do not assume NULL equality • Use ISNULL( ) OrderNo Discount DiscountType 1000 1.20 Bulk Discount 1001 0.00 N/A 1002 2.00 1003 0.50 Promotion 1004 2.50 Other 1005 0.00 N/A 1006 1.50 Source Dimension Table DiscKey DiscAltKey DiscountType -1 Unknown Unknown 0 N/A None 1 Bulk Discount Bulk Discount 2 Promotion Promotion 3 Other Other
  • 16. Designing Slowly Changing Dimensions CustKey CustAltKey Name Phone 1 1002 Amy Alberts 555 123 CustKey CustAltKey Name City Current Start End 1 1002 Amy Alberts Vancouver Yes 1/1/2000 CustKey CustAltKey Name Phone 1 1002 Amy Alberts 555 222 Type 1 CustKey CustAltKey Name City Current Start End 1 1002 Amy Alberts Vancouver No 1/1/2000 1/1/2012 4 1002 Amy Alberts Toronto Yes 1/1/2012 Type 2 CustKey CustAltKey Name Cars 1 1002 Amy Alberts 0 CustKey CustAltKey Name Prior Cars Current Cars 1 1002 Amy Alberts 0 1 Type 3
  • 17. Time Dimension Tables • Surrogate key • Granularity • Range • Attributes and hierarchies • Multiple calendars • Unknown values DateKey DateAltKey MonthDay WeekDay Day MonthNo Month Year 00000000 01-01-1753 NULL NULL NULL NULL NULL NULL 20130101 01-01-2016 1 3 Tue 01 Jan 2016 20130102 01-02-2016 2 4 Wed 01 Jan 2016 20130103 01-03-2016 3 5 Thu 01 Jan 2016 20130104 01-04-2016 4 6 Fri 01 Jan 2016
  • 18. Self-Referencing Dimension Tables EmployeeKey EmployeeAltKey EmployeeName ManagerKey 1 1000 Kim Abercrombie NULL 2 1001 Kamil Amireh 1 3 1002 Cesar Garcia 1 4 1003 Jeff Hay 2 Kim Abercrombie 1st Level Manager Kamil Amireh 2nd Level Manager Jeff Hay 3rd Level (Employee) Cesar Garcia 2nd Level Manager
  • 19. Junk Dimensions • Combine low-cardinality attributes that don’t belong in existing dimensions into a junk dimension • Avoids creating many small dimension tables JunkKey OutOfStockFlag FreeShippingFlag CreditOrDebit 1 1 1 Credit 2 1 1 Debit 3 1 0 Credit 4 1 0 Debit 5 0 1 Credit 6 0 1 Debit 7 0 0 Credit 8 0 0 Debit
  • 20. Fact tables Fact Table Columns Types of Measure Types of Fact Table
  • 21. Fact Table Keys • Dimension Keys • Measures • Degenerate Dimensions OrderDateKey ProductKey CustomerKey OrderNo Qty SalesAmount 20160101 25 120 1000 1 350.99 20160101 99 120 1000 2 6.98 20160101 25 178 1001 2 701.98 OrderDateKey ProductKey CustomerKey OrderNo Qty SalesAmount 20160101 25 120 1000 1 350.99 20160101 99 120 1000 2 6.98 20160101 25 178 1001 2 701.98 OrderDateKey ProductKey CustomerKey OrderNo Qty SalesAmount 20160101 25 120 1000 1 350.99 20160101 99 120 1000 2 6.98 20160101 25 178 1001 2 701.98
  • 22. Types of Measure • Additive • Semi-Additive • Non-Additive OrderDateKey ProductKey CustomerKey SalesAmount 20160101 25 120 350.99 20160101 99 120 6.98 20160102 25 178 701.98 DateKey ProductKey StockCount 20160101 25 23 20160101 99 118 20160102 25 22 OrderDateKey ProductKey CustomerKey ProfitMargin 20160101 25 120 25 20160101 99 120 22 20160102 25 178 27
  • 23. Types of Fact Table • Transaction Fact Table • Periodic Snapshot Fact Table • Accumulating Snapshot Fact Table OrderDateKey ProductKey CustomerKey OrderNo Qty Cost SalesAmount 20160101 25 120 1000 1 125.00 350.99 20160101 99 120 1000 2 2.50 6.98 20160101 25 178 1001 2 250.00 701.98 DateKey ProductKey OpeningStock UnitsIn UnitsOut ClosingStock 20160101 25 25 1 3 23 20160101 99 120 0 2 118 OrderNo OrderDateKey ShipDateKey DeliveryDateKey 1000 20160101 20160102 20160105 1001 20160101 20160102 00000000 1002 20160102 00000000 00000000
  • 25. 25 Considerations for physical implementation  Data files and filegroups  Staging tables  tempdb  Transaction logs  Backup files  Partitioning  Indexing
  • 26. Considerations for Indexes • Dimension table indexes • Clustered index on surrogate key column • Nonclustered index on business key and SCD columns • Nonclustered indexes on frequently searched columns • Fact table indexes • Clustered index on most commonly searched date key • Nonclustered indexes on other dimension keys Or • Columnstore index on all columns • Composite index key comprises up to 16 columns
  • 28. Options for Extracting Modified Data • Extract all records • Store a primary key and checksum • Use a datetime column as a “high water mark” • Use Change Data Capture • Use Change Tracking
  • 29. Common ETL Data Flow Architectures • Single-stage ETL:  Data is transferred directly from source to data warehouse  Transformations and validations occur in-flight or on extraction • Two-stage ETL:  Data is staged for a coordinated load  Transformations and validations occur in-flight, or on staged data • Three-stage ETL:  Data is extracted quickly to a landing zone, and then staged prior to loading  Transformations and validation can occur throughout the data flow Source DW Source DWStaging Source DWStaging Landing Zone

Hinweis der Redaktion

  1. Explain that this course uses a fairly generic definition for a data warehouse. There are many very specific definitions in use throughout the data warehousing industry; students might be aware of some of the more common schools of thought on database design, including those of Bill Inmon and Ralph Kimball. This course does not advocate one approach over another, although the lab solutions and data warehouse schema design discussed here are more in line with a Kimball-based approach than any other.
  2. Encourage students to consider the benefits of each architecture discussed in this topic: A centralized data warehouse provides a single source for all business analysis, reporting, and decision-making. However, designing and building such a comprehensive solution presents some significant challenges in devising a single schema that meets the needs of every business unit, and can result in huge volumes of data being transferred into the data warehouse from various source systems. A departmental data warehouse is more focused on the needs of a specific set of users, but might not hold a complete business-wide view of all key data. You might end up with multiple small data marts, each dealing with a discrete part of the business. These will improve reporting, analysis, and decision-making within individual departments but will not necessarily make it easier to get an overall view of the business. In some ways, a hub-and-spoke architecture offers the best aspects of the centralized data warehouse and departmental data mart approaches. However, it can be difficult to design and build—and synchronizing data between the hub and various spokes can present its own challenges. For a more detailed discussion about these various architectures, see Hub-And-Spoke: Building an EDW with SQL Server and Strategies of Implementation at: Building an EDW with SQL Server http://go.microsoft.com/fwlink/?LinkID=248853
  3. Note that not all data warehousing solutions include every component shown on the slide. However, each component has an important part to play in the implementation of a data warehousing solution, and will be considered later in this course.
  4. If students want to better understand how the star join query optimizations in the SQL Server query optimizer work, they should review the articles referenced in their notes. However, emphasize that the optimizations are automatic and that students do not need to do anything, other than use a star schema for the data warehouse tables, to benefit from them.
  5. Note the reference to “The Microsoft Data Warehouse Toolkit” in the student notes. The principles described in this book underpin most of the generally accepted best practices for designing data warehouses with SQL Server. Therefore, this book is highly recommended reading for any instructor teaching this course.
  6. Question What is the difference between a snowflake schema and a star schema? Answer A star schema has all the dimension attributes linked directly to the fact measures. A snowflake schema has some of the dimension attributes linked to each other, in addition to others linked directly to the fact measures.
  7. Use this topic to ensure that all students understand why the business key from the source system is not used as a unique key in dimension tables.
  8. Emphasize that the categorization of attributes in this topic is simply used to help identify reasons why a data value would be included as a dimension attribute column. You do not need to apply any specific configuration to define an attribute as a slicer or a member of a hierarchy. Point out that the levels of the hierarchy are all stored within a single dimension table, resulting in duplication. This is preferable to normalizing the data to create a table for each hierarchy in a snowflake schema. OLTP database developers might find this preference for duplication over normalization unintuitive. However, you should remind them that dimension data is generally denormalized from multiple tables before being loaded, and does not experience the same level of transactional updates as would occur in an OLTP database. Therefore, the performance benefits of storing the data in a single table generally outweigh the reduced duplication benefits of normalizing the data.
  9. In the example on the slide, the dimension data is not normalized in the source system, so there is no business key. Instead, the value itself is used as the alternate key for rows where the value is not “Unknown” or “None”. In this example, ask students to consider a scenario where a foreign key relationship in the source data references a lookup table for the DiscountType column. The foreign key table includes a numeric primary key column with the values, 0, 1, 2, and 3 for the discount types “N/A,” “Bulk Discount,” “Promotion,” and “Other,” respectively. How would this affect the design of the dimension table? One answer is that the primary key values in the source system would serve as the alternate key values in the dimension table; an unused value (such as -1) would be used for the “Unknown” row. The source foreign key and dimension alternate key could then be compared using a similar ISNULL expression, as shown in the student notes (except that the value -1 would be used instead of “Unknown”). As an additional consideration, what if the lookup table in the source system did not include the 0 value for “N/A,” and both “None” and “Unknown” are indicated by a NULL in the source table? If users want to differentiate between “None” and “Unknown” in reports or analyses, you could implement some logic in the ETL load process to match NULL values to the dimension row for “Unknown” when the Discount value is non-zero, and to “N/A” when the discount is 0.00. However, unless this level of differentiation has business value, it would be easier to include only a single row for “None or Unknown” in the dimension table.
  10. Explain that Type 2 slowly changing dimensions can be implemented by using a current flag, and start time stamp—or both. Point out the following considerations: If only a current flag is used, there is no way to match a new fact row to a dimension row, based on the time that the fact was recorded. The ETL process must assume that the current version of the dimension entity should be used. If only a start date is used, identifying the current row requires that you find the row with the most recent start date, typically by using the MAX function. Using an end date column makes it easier to find the right version of a dimension entity for fact, based on the point in time when the fact event occurred. Without the end date column, the appropriate dimension table row can only be determined by finding the most recent start date occurring before the date of the fact event.
  11. Discuss the issues in the bulleted list in the student content. In some cases, you might choose to include a column for the parent alternate key in addition to the parent key, because this can be useful in some load techniques. Some techniques for loading self-referencing dimension tables are discussed in Module 6 of this course: Creating an ETL Solution.
  12. As an alternative to a junk dimension, fact-specific attributes can be used to create degenerate dimensions in the fact table. This approach is discussed in the next lesson. Categorize Activity Place each item into the appropriate category which is a dimension type. Indicate your answer by writing the category number to the right of each item. Slicer (1)Gender Time Dimension (1)Fiscal Year (2)Month Junk Dimension (1)Invoice Number (2)Out of Stock Indicator
  13. Point out that degenerate dimension columns provide the same capability as a junk dimension table. In a scenario where only one fact table requires the additional miscellaneous attributes for analysis and reporting, it is generally more efficient to include them as degenerate dimension columns. Conversely, if the additional attributes are relevant for multiple fact tables, a junk dimension is probably a better choice. Discuss the note about fact table primary keys in the student manual. Students with a strong background in relational database design might feel uncomfortable about not defining a primary key for every table. If, however, there is no need to uniquely identify individual fact rows, and the ETL process can be relied on to eliminate accidental duplicate entries, defining a primary key adds unnecessary overhead to the table definition and generates an index, which can negatively affect the performance of data loads. Similarly, note that declaring foreign key constraints on dimension-key columns in a fact table is not necessary to enforce referential integrity in most data warehouses, and can negatively impact load performance. You can declare them, and then drop and recreate them during each load, but this creates its own overhead and adds little value if the ETL process is correctly implemented. The query optimizer can use foreign key constraints to identify the fact table in a star join query but, in their absence, selects the largest table, which is usually correct.
  14. Use the examples on the slide and in the student notes to show how summing semi-additive and non-additive measures produces meaningless results.
  15. Discuss the importance of including a row for “Unknown” or “None” in the time dimension table when using accumulating snapshot fact tables. Point out that accumulating snapshot fact tables must be updated after the initial load. This requirement can affect the physical design of the table, especially if partitions or column store indexes are used. These considerations are discussed in the next lesson and in Module 6: Creating an ETL Solution. Question What kind of measure is a stock count? ( )Option 1: Additive ( )Option 2: Semi-additive ( )Option 3: Non-additive Answer (√) Option 2: Semi-additive
  16. Point out that query performance in a data warehouse is degraded by fragmentation, and having many indexes on tables in a data warehouse can lead to significant fragmentation after each ETL data load. Additionally, the loads take longer because index keys must be stored on the correct page. To reduce the fragmentation caused by a data load, you can use a fill factor to leave space for inserts. Periodically, however, you will still need to either reorganize or rebuild indexes, which will affect performance and might require some indexes to be taken offline. Alternatively, you can drop all indexes before each load and recreate them afterwards, an action that improves load performance, but incurs the overhead of recreating the indexes and, depending on the volume of data loaded, may be very time-consuming.
  17. Point out the note in the student workbook, and discuss the considerations for propagating deletions in source databases to the data warehouse. Emphasize that, in most scenarios, data warehouses are used to store historical data, so it is common for deletions to not be propagated. In some cases, logical deletions are performed in the data warehouse by setting a “deleted” flag column.