2. What Is a Data Warehouse?
A centralized store of business data for reporting and analysis that typically:
• Contains large volumes of historical data
• Is optimized for querying, as opposed to inserting or updating data
• Is incrementally loaded with new business data at regular intervals
• Provides the basis for enterprise BI solutions
9. The Data Warehouse Design Process
1. Determine analytical and reporting requirements
2. Identify the business processes that generate the required data
3. Examine the source data for those business processes
4. Conform dimensions across business processes
5. Prioritize processes and create a dimensional model for each
6. Document and refine the models to determine the logical database schema
7. Design the physical data structures for the database
11. Documenting Dimensional Models
[Star schema diagram: a Sales Order fact table surrounded by Customer, Product, Time, and Salesperson dimensions]
• Sales Order fact: Item Quantity, Unit Cost, Total Cost, Unit Price, Sales Amount, Shipping Cost
• Customer dimension: Country, State or Province, City, Age, Marital Status, Gender
• Product dimension: Category, Subcategory, Product Name, Color, Size
• Time dimension (Order Date and Ship Date), with two hierarchies: Calendar Year > Month > Date; Fiscal Year > Fiscal Quarter > Month > Date
• Salesperson dimension: Region, Country, Territory, Manager, Surname, Forename
12. Lesson 2: Designing Dimension Tables
Considerations for Dimension Keys
Dimension Attributes and Hierarchies
Unknown and None
Designing Slowly Changing Dimensions
Time Dimension Tables
Self-Referencing Dimension Tables
Junk Dimensions
13. Considerations for Dimension Keys
Product dimension (ProductKey is the surrogate key; ProductAltKey is the business, or alternate, key):
ProductKey | ProductAltKey | ProductName       | Color | Size
1          | MB1-B-32      | MB1 Mountain Bike | Blue  | 32
2          | MB1-R-32      | MB1 Mountain Bike | Red   | 32
Customer dimension (CustomerKey is the surrogate key; CustomerAltKey is the business, or alternate, key):
CustomerKey | CustomerAltKey | Name
1           | 1002           | Amy Alberts
2           | 1005           | Neil Black
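The key-handling pattern on this slide can be sketched in a few lines. This is an illustrative sketch, not part of the course materials: the warehouse assigns its own integer surrogate keys during the dimension load, reusing an existing mapping when the business (alternate) key has been seen before. The function and variable names are ours.

```python
# Illustrative: assign warehouse surrogate keys during a dimension load.
# Business (alternate) keys come from the source system; the warehouse
# generates its own integer surrogates, independent of the source.

def assign_surrogate_keys(rows, key_map, next_key=1):
    """Tag (business_key, name) rows with surrogate keys, reusing mappings."""
    out = []
    for business_key, name in rows:
        if business_key not in key_map:
            key_map[business_key] = next_key
            next_key += 1
        out.append((key_map[business_key], business_key, name))
    return out, next_key

key_map = {}
rows, next_key = assign_surrogate_keys(
    [("1002", "Amy Alberts"), ("1005", "Neil Black")], key_map)
# rows == [(1, "1002", "Amy Alberts"), (2, "1005", "Neil Black")]
```

Because the mapping persists across loads, a customer who reappears in a later extract keeps the same surrogate key.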
14. Dimension Attributes and Hierarchies
CustKey | CustAltKey | Name        | Country | State | City      | Phone   | Gender
1       | 1002       | Amy Alberts | Canada  | BC    | Vancouver | 555 123 | F
2       | 1005       | Neil Black  | USA     | CA    | Irvine    | 555 321 | M
3       | 1006       | Ye Xu       | USA     | NY    | New York  | 555 222 | M
Country > State > City form a hierarchy; Gender is a slicer; Phone is drill-through detail.
15. Unknown and None
• Identify the semantic meaning of NULL: Unknown or None?
• Do not assume NULL equality
• Use ISNULL( )
Source table:
OrderNo | Discount | DiscountType
1000    | 1.20     | Bulk Discount
1001    | 0.00     | N/A
1002    | 2.00     | NULL
1003    | 0.50     | Promotion
1004    | 2.50     | Other
1005    | 0.00     | N/A
1006    | 1.50     | NULL
Dimension table:
DiscKey | DiscAltKey    | DiscountType
-1      | Unknown       | Unknown
0       | N/A           | None
1       | Bulk Discount | Bulk Discount
2       | Promotion     | Promotion
3       | Other         | Other
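The lookup the slide implies can be sketched as follows. This is a hypothetical illustration, not course code: a source DiscountType value (which may be NULL) is mapped onto the dimension's surrogate key, with NULLs routed to the explicit "Unknown" row instead of relying on NULL equality. The keys and values mirror the slide's dimension table.

```python
# Hypothetical lookup: source DiscountType -> dimension surrogate key.
# Mirrors the dimension table above; NULL is mapped to the "Unknown" row.

DISCOUNT_DIM = {
    "Unknown": -1,       # semantic: value missing in the source
    "N/A": 0,            # semantic: explicitly no discount (None)
    "Bulk Discount": 1,
    "Promotion": 2,
    "Other": 3,
}

def discount_key(source_value):
    # Equivalent of ISNULL(source_value, 'Unknown') followed by the lookup.
    return DISCOUNT_DIM[source_value if source_value is not None else "Unknown"]

assert discount_key(None) == -1        # NULL -> Unknown
assert discount_key("N/A") == 0        # explicit none
assert discount_key("Promotion") == 2
```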
16. Designing Slowly Changing Dimensions
Type 1 (overwrite the value in place; no history):
Before: CustKey 1 | CustAltKey 1002 | Amy Alberts | Phone 555 123
After:  CustKey 1 | CustAltKey 1002 | Amy Alberts | Phone 555 222
Type 2 (add a new row with a new surrogate key; full history):
Before: CustKey 1 | CustAltKey 1002 | Amy Alberts | Vancouver | Current: Yes | Start: 1/1/2000 | End: (none)
After:  CustKey 1 | CustAltKey 1002 | Amy Alberts | Vancouver | Current: No  | Start: 1/1/2000 | End: 1/1/2012
        CustKey 4 | CustAltKey 1002 | Amy Alberts | Toronto   | Current: Yes | Start: 1/1/2012 | End: (none)
Type 3 (add a "prior value" column; one generation of history):
Before: CustKey 1 | CustAltKey 1002 | Amy Alberts | Cars: 0
After:  CustKey 1 | CustAltKey 1002 | Amy Alberts | Prior Cars: 0 | Current Cars: 1
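A Type 2 change like the one above can be sketched as a two-step operation: expire the current row, then insert a new row with a fresh surrogate key but the same alternate key. This is an illustrative sketch with our own helper and column names, not the course's implementation.

```python
from datetime import date

# Illustrative SCD Type 2 handling: expire the current row and insert a
# new current row with a fresh surrogate key and the same alternate key.

def apply_scd2(dim_rows, alt_key, new_city, change_date, next_key):
    for row in dim_rows:
        if row["alt_key"] == alt_key and row["current"]:
            row["current"] = False          # expire the old version
            row["end"] = change_date
            dim_rows.append({               # insert the new current version
                "key": next_key, "alt_key": alt_key, "name": row["name"],
                "city": new_city, "current": True,
                "start": change_date, "end": None,
            })
            return next_key + 1
    return next_key

dim = [{"key": 1, "alt_key": 1002, "name": "Amy Alberts", "city": "Vancouver",
        "current": True, "start": date(2000, 1, 1), "end": None}]
apply_scd2(dim, 1002, "Toronto", date(2012, 1, 1), next_key=4)
# dim now holds the expired Vancouver row and a current Toronto row (key 4)
```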
17. Time Dimension Tables
• Surrogate key
• Granularity
• Range
• Attributes and hierarchies
• Multiple calendars
• Unknown values
DateKey  | DateAltKey | MonthDay | WeekDay | Day  | MonthNo | Month | Year
00000000 | 01-01-1753 | NULL     | NULL    | NULL | NULL    | NULL  | NULL
20130101 | 01-01-2013 | 1        | 3       | Tue  | 01      | Jan   | 2013
20130102 | 01-02-2013 | 2        | 4       | Wed  | 01      | Jan   | 2013
20130103 | 01-03-2013 | 3        | 5       | Thu  | 01      | Jan   | 2013
20130104 | 01-04-2013 | 4        | 6       | Fri  | 01      | Jan   | 2013
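A table like this is usually generated rather than hand-entered. The following is a minimal sketch in the spirit of the slide, with names of our own choosing: "smart" integer surrogate keys in yyyymmdd form, a few calendar attributes, and an explicit unknown member keyed 0 for facts with no usable date.

```python
from datetime import date, timedelta

# Minimal date-dimension generator: integer yyyymmdd surrogate keys,
# calendar attributes, and an explicit "unknown" member keyed 0.

def build_date_dim(start, end):
    rows = [{"DateKey": 0, "DateAltKey": None, "Day": None,
             "MonthNo": None, "Month": None, "Year": None}]  # unknown member
    d = start
    while d <= end:
        rows.append({"DateKey": int(d.strftime("%Y%m%d")),
                     "DateAltKey": d,
                     "Day": d.strftime("%a"),      # e.g. "Tue"
                     "MonthNo": d.month,
                     "Month": d.strftime("%b"),    # e.g. "Jan"
                     "Year": d.year})
        d += timedelta(days=1)
    return rows

rows = build_date_dim(date(2013, 1, 1), date(2013, 1, 4))
# rows[1]["DateKey"] == 20130101; rows[1]["Day"] == "Tue"
```

Extending the loop with fiscal-year columns is how a second calendar would typically be added to the same table.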
18. Self-Referencing Dimension Tables
EmployeeKey | EmployeeAltKey | EmployeeName    | ManagerKey
1           | 1000           | Kim Abercrombie | NULL
2           | 1001           | Kamil Amireh    | 1
3           | 1002           | Cesar Garcia    | 1
4           | 1003           | Jeff Hay        | 2
Resulting hierarchy:
Kim Abercrombie (1st level: manager)
├─ Kamil Amireh (2nd level: manager)
│  └─ Jeff Hay (3rd level: employee)
└─ Cesar Garcia (2nd level: manager)
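Resolving a reporting chain from such a table is a simple walk up the ManagerKey column. This sketch uses our own helper names and an in-memory stand-in for the table above.

```python
# Illustrative: walk a self-referencing dimension from an employee up to
# the top of the hierarchy. Keys and names match the table on the slide.

EMPLOYEES = {
    1: ("Kim Abercrombie", None),  # NULL ManagerKey: top of the hierarchy
    2: ("Kamil Amireh", 1),
    3: ("Cesar Garcia", 1),
    4: ("Jeff Hay", 2),
}

def chain(key):
    """Return the management chain from the given employee up to the top."""
    names = []
    while key is not None:
        name, manager = EMPLOYEES[key]
        names.append(name)
        key = manager
    return names

assert chain(4) == ["Jeff Hay", "Kamil Amireh", "Kim Abercrombie"]
```

In SQL Server, the same traversal is typically expressed with a recursive common table expression.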
19. Junk Dimensions
• Combine low-cardinality attributes that don't belong in existing dimensions into a junk dimension
• Avoids creating many small dimension tables
JunkKey | OutOfStockFlag | FreeShippingFlag | CreditOrDebit
1       | 1              | 1                | Credit
2       | 1              | 1                | Debit
3       | 1              | 0                | Credit
4       | 1              | 0                | Debit
5       | 0              | 1                | Credit
6       | 0              | 1                | Debit
7       | 0              | 0                | Credit
8       | 0              | 0                | Debit
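One way (not prescribed by the course) to pre-populate such a table is to enumerate the full cross product of the low-cardinality attributes, so every combination already has a surrogate key before any facts are loaded. The helper name below is our own.

```python
from itertools import product

# Illustrative: pre-populate a junk dimension with every combination of
# its low-cardinality attributes, each keyed with a surrogate integer.

def build_junk_dim(*domains):
    """Map each attribute combination to a surrogate key, starting at 1."""
    return {combo: key for key, combo in enumerate(product(*domains), start=1)}

junk = build_junk_dim([1, 0], [1, 0], ["Credit", "Debit"])
# 2 * 2 * 2 = 8 rows, matching the slide's table
assert len(junk) == 8
assert junk[(1, 1, "Credit")] == 1
```

During the fact load, each incoming flag combination is then resolved to its JunkKey with a single dictionary (or table) lookup.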
25. Considerations for Physical Implementation
Data files and filegroups
Staging tables
tempdb
Transaction logs
Backup files
Partitioning
Indexing
26. Considerations for Indexes
• Dimension table indexes
• Clustered index on the surrogate key column
• Nonclustered index on the business key and SCD columns
• Nonclustered indexes on frequently searched columns
• Fact table indexes
• Clustered index on the most commonly searched date key
• Nonclustered indexes on the other dimension keys (a composite index key can comprise up to 16 columns)
• Or: a columnstore index on all columns
28. Options for Extracting Modified Data
• Extract all records
• Store a primary key and checksum
• Use a datetime column as a “high water mark”
• Use Change Data Capture
• Use Change Tracking
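The "high water mark" option above can be sketched in a few lines. This is a hedged illustration with our own names, not course code: the ETL process remembers the highest modified timestamp seen so far, and each run extracts only rows newer than that mark.

```python
from datetime import datetime

# Illustrative "high water mark" extraction: keep the highest modified
# timestamp seen so far; each run extracts only rows newer than it.

def extract_modified(rows, high_water_mark):
    """rows: (id, modified) tuples; returns (new_rows, new_mark)."""
    new_rows = [r for r in rows if r[1] > high_water_mark]
    new_mark = max((r[1] for r in new_rows), default=high_water_mark)
    return new_rows, new_mark

source = [(1, datetime(2024, 1, 1)), (2, datetime(2024, 1, 5)),
          (3, datetime(2024, 1, 9))]
batch, mark = extract_modified(source, datetime(2024, 1, 2))
assert [r[0] for r in batch] == [2, 3]
assert mark == datetime(2024, 1, 9)
```

Note that this approach misses deletions and any rows whose modified column is not reliably maintained, which is one reason Change Data Capture or Change Tracking may be preferred.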
29. Common ETL Data Flow Architectures
• Single-stage ETL: data is transferred directly from source to data warehouse; transformations and validations occur in-flight or on extraction
  Source -> DW
• Two-stage ETL: data is staged for a coordinated load; transformations and validations occur in-flight or on staged data
  Source -> Staging -> DW
• Three-stage ETL: data is extracted quickly to a landing zone, and then staged prior to loading; transformations and validations can occur throughout the data flow
  Source -> Landing Zone -> Staging -> DW
Editor's Notes
Explain that this course uses a fairly generic definition for a data warehouse. There are many very specific definitions in use throughout the data warehousing industry; students might be aware of some of the more common schools of thought on database design, including those of Bill Inmon and Ralph Kimball. This course does not advocate one approach over another, although the lab solutions and data warehouse schema design discussed here are more in line with a Kimball-based approach than any other.
Encourage students to consider the benefits of each architecture discussed in this topic:
A centralized data warehouse provides a single source for all business analysis, reporting, and decision-making. However, designing and building such a comprehensive solution presents some significant challenges in devising a single schema that meets the needs of every business unit, and can result in huge volumes of data being transferred into the data warehouse from various source systems.
A departmental data warehouse is more focused on the needs of a specific set of users, but might not hold a complete business-wide view of all key data. You might end up with multiple small data marts, each dealing with a discrete part of the business. These will improve reporting, analysis, and decision-making within individual departments but will not necessarily make it easier to get an overall view of the business.
In some ways, a hub-and-spoke architecture offers the best aspects of the centralized data warehouse and departmental data mart approaches. However, it can be difficult to design and build—and synchronizing data between the hub and various spokes can present its own challenges.
For a more detailed discussion about these various architectures, see "Hub-And-Spoke: Building an EDW with SQL Server and Strategies of Implementation" at:
http://go.microsoft.com/fwlink/?LinkID=248853
Note that not all data warehousing solutions include every component shown on the slide. However, each component has an important part to play in the implementation of a data warehousing solution, and will be considered later in this course.
If students want to better understand how the star join query optimizations in the SQL Server query optimizer work, they should review the articles referenced in their notes. However, emphasize that the optimizations are automatic and that students do not need to do anything, other than use a star schema for the data warehouse tables, to benefit from them.
Note the reference to “The Microsoft Data Warehouse Toolkit” in the student notes. The principles described in this book underpin most of the generally accepted best practices for designing data warehouses with SQL Server. Therefore, this book is highly recommended reading for any instructor teaching this course.
Question
What is the difference between a snowflake schema and a star schema?
Answer
In a star schema, every dimension table is linked directly to the fact table. In a snowflake schema, some dimensions are normalized into multiple related tables, so some dimension tables link to the fact table indirectly through other dimension tables.
Use this topic to ensure that all students understand why the business key from the source system is not used as a unique key in dimension tables.
Emphasize that the categorization of attributes in this topic is simply used to help identify reasons why a data value would be included as a dimension attribute column. You do not need to apply any specific configuration to define an attribute as a slicer or a member of a hierarchy.
Point out that the levels of the hierarchy are all stored within a single dimension table, resulting in duplication. This is preferable to normalizing the data to create a table for each hierarchy in a snowflake schema. OLTP database developers might find this preference for duplication over normalization unintuitive. However, you should remind them that dimension data is generally denormalized from multiple tables before being loaded, and does not experience the same level of transactional updates as would occur in an OLTP database. Therefore, the performance benefits of storing the data in a single table generally outweigh the reduced duplication benefits of normalizing the data.
In the example on the slide, the dimension data is not normalized in the source system, so there is no business key. Instead, the value itself is used as the alternate key for rows where the value is not “Unknown” or “None”.
In this example, ask students to consider a scenario where a foreign key relationship in the source data references a lookup table for the DiscountType column. The foreign key table includes a numeric primary key column with the values, 0, 1, 2, and 3 for the discount types “N/A,” “Bulk Discount,” “Promotion,” and “Other,” respectively. How would this affect the design of the dimension table?
One answer is that the primary key values in the source system would serve as the alternate key values in the dimension table; an unused value (such as -1) would be used for the “Unknown” row. The source foreign key and dimension alternate key could then be compared using a similar ISNULL expression, as shown in the student notes (except that the value -1 would be used instead of “Unknown”).
As an additional consideration, what if the lookup table in the source system did not include the 0 value for “N/A,” and both “None” and “Unknown” are indicated by a NULL in the source table? If users want to differentiate between “None” and “Unknown” in reports or analyses, you could implement some logic in the ETL load process to match NULL values to the dimension row for “Unknown” when the Discount value is non-zero, and to “N/A” when the discount is 0.00. However, unless this level of differentiation has business value, it would be easier to include only a single row for “None or Unknown” in the dimension table.
Explain that Type 2 slowly changing dimensions can be implemented by using a current flag, start and end date stamps, or both. Point out the following considerations:
If only a current flag is used, there is no way to match a new fact row to a dimension row, based on the time that the fact was recorded. The ETL process must assume that the current version of the dimension entity should be used.
If only a start date is used, identifying the current row requires that you find the row with the most recent start date, typically by using the MAX function.
Using an end date column makes it easier to find the right version of a dimension entity for a fact, based on the point in time when the fact event occurred. Without the end date column, the appropriate dimension table row can only be determined by finding the most recent start date occurring before the date of the fact event.
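The point-in-time lookup described here can be sketched as follows. This is an illustrative sketch with our own names: with start and end dates on each Type 2 row, the right version for a fact is the row whose validity window contains the fact's event date.

```python
from datetime import date

# Illustrative point-in-time lookup: pick the Type 2 row whose validity
# window (start inclusive, end exclusive, NULL end = still current)
# contains the fact's event date.

def lookup_version(rows, alt_key, event_date):
    for row in rows:
        starts_before = row["start"] <= event_date
        ends_after = row["end"] is None or event_date < row["end"]
        if row["alt_key"] == alt_key and starts_before and ends_after:
            return row["key"]
    return -1  # fall back to the Unknown member

rows = [
    {"key": 1, "alt_key": 1002, "start": date(2000, 1, 1), "end": date(2012, 1, 1)},
    {"key": 4, "alt_key": 1002, "start": date(2012, 1, 1), "end": None},
]
assert lookup_version(rows, 1002, date(2010, 6, 1)) == 1
assert lookup_version(rows, 1002, date(2013, 6, 1)) == 4
```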
Discuss the issues in the bulleted list in the student content. In some cases, you might choose to include a column for the parent alternate key in addition to the parent key, because this can be useful in some load techniques.
Some techniques for loading self-referencing dimension tables are discussed in Module 6 of this course: Creating an ETL Solution.
As an alternative to a junk dimension, fact-specific attributes can be used to create degenerate dimensions in the fact table. This approach is discussed in the next lesson.
Categorize Activity
Place each item into the appropriate dimension-type category. Indicate your answer by writing the category number to the right of each item.
Slicer
(1) Gender
Time Dimension
(1) Fiscal Year
(2) Month
Junk Dimension
(1) Invoice Number
(2) Out of Stock Indicator
Point out that degenerate dimension columns provide the same capability as a junk dimension table. In a scenario where only one fact table requires the additional miscellaneous attributes for analysis and reporting, it is generally more efficient to include them as degenerate dimension columns. Conversely, if the additional attributes are relevant for multiple fact tables, a junk dimension is probably a better choice.
Discuss the note about fact table primary keys in the student manual. Students with a strong background in relational database design might feel uncomfortable about not defining a primary key for every table. If, however, there is no need to uniquely identify individual fact rows, and the ETL process can be relied on to eliminate accidental duplicate entries, defining a primary key adds unnecessary overhead to the table definition and generates an index, which can negatively affect the performance of data loads.
Similarly, note that declaring foreign key constraints on dimension-key columns in a fact table is not necessary to enforce referential integrity in most data warehouses, and can negatively impact load performance. You can declare them, and then drop and recreate them during each load, but this creates its own overhead and adds little value if the ETL process is correctly implemented. The query optimizer can use foreign key constraints to identify the fact table in a star join query but, in their absence, selects the largest table, which is usually correct.
Use the examples on the slide and in the student notes to show how summing semi-additive and non-additive measures produces meaningless results.
Discuss the importance of including a row for “Unknown” or “None” in the time dimension table when using accumulating snapshot fact tables.
Point out that accumulating snapshot fact tables must be updated after the initial load. This requirement can affect the physical design of the table, especially if partitions or column store indexes are used. These considerations are discussed in the next lesson and in Module 6: Creating an ETL Solution.
Question
What kind of measure is a stock count?
( ) Option 1: Additive
( ) Option 2: Semi-additive
( ) Option 3: Non-additive
Answer
(√) Option 2: Semi-additive
Point out that query performance in a data warehouse is degraded by fragmentation, and having many indexes on tables in a data warehouse can lead to significant fragmentation after each ETL data load. Additionally, the loads take longer because index keys must be stored on the correct page. To reduce the fragmentation caused by a data load, you can use a fill factor to leave space for inserts. Periodically, however, you will still need to either reorganize or rebuild indexes, which will affect performance and might require some indexes to be taken offline. Alternatively, you can drop all indexes before each load and recreate them afterwards, an action that improves load performance, but incurs the overhead of recreating the indexes and, depending on the volume of data loaded, may be very time-consuming.
Point out the note in the student workbook, and discuss the considerations for propagating deletions in source databases to the data warehouse. Emphasize that, in most scenarios, data warehouses are used to store historical data, so it is common for deletions to not be propagated. In some cases, logical deletions are performed in the data warehouse by setting a “deleted” flag column.