This is a group assignment by my students on Chapter 2, "Retail Sales," of the book The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling
by Ralph Kimball and Margy Ross
1. Retail Sales
Kimball & Ross, Chapter 2
Name | Student ID | Course
Jessica Raquel Zaqueu | 14165 | BIS
Chayanit Nadam | 14191 | BIS
Wong Aun Chyi | 14214 | BIS
2. Overview
• Four-step dimensional design process
• Transaction-level fact tables
• Additive and non-additive facts
• Sample dimension table attributes
• Causal dimensions
• Degenerate dimensions
• Extending an existing dimension model
• Snowflaking dimension attributes
• Avoiding the “too many dimensions” trap
• Surrogate keys
3. Four-Step Dimensional Design Process
1. Select the business process to model.
▫ Not a business department or function
▫ E.g., purchasing, ordering, shipping, invoicing, inventorying
2. Declare the grain of the business process.
▫ Specifies exactly what an individual fact table row represents
▫ E.g., individual line item on a sales ticket, daily snapshot of the inventory levels for a product
4. Four-Step Dimensional Design Process
3. Choose the dimensions that apply for each fact table row.
▫ Q: How do business people describe the data that results from the business process?
▫ E.g., date, product, store, customer, transaction type
4. Identify the numeric (measured) facts that will populate each fact table row.
▫ Q: What are we measuring?
▫ Typical facts are numeric additive figures
▫ E.g., quantity ordered, dollar cost amount
• In making decisions regarding the 4 steps, consider both the user requirements and the realities of the source data
5. Retail Case Study
• Large grocery chain: 100 grocery stores over 5 regions
• Each store:
▫ Departments: grocery, frozen foods, dairy, meat, produce, bakery,
floral, health/beauty aids, etc.
▫ 60,000 products (SKUs = stock keeping units) on shelves
▫ 55,000 SKUs with UPCs
▫ 5,000 SKUs without UPCs but with assigned SKU numbers
• Data is collected:
▫ from cash registers into a point-of-sale (POS) system
▫ at back door where vendors make deliveries
6. Retail Case Study – Cont’d
• Management concerns
▫ Logistics of ordering, stocking, and selling products
▫ Maximizing profit
▫ Product pricing
▫ Lowering cost of acquisition and overhead
▫ Use of promotions to increase sales: temporary price reductions, newspaper ads, grocery store displays, coupons
7. Step 1. Select the Business Process
• Decide what business process to model, by combining an
understanding of the business requirements with an
understanding of data realities.
• The first dimensional model built should be the one
▫ with the most impact,
▫ that answers the most pressing business questions,
▫ and whose data is readily accessible for extraction.
• In retail case study: POS retail sales
• Business Question: What products are selling in which stores on
what days and under what promotional conditions?
8. Step 2. Declare the Grain
• What level of data detail should be made available in
the dimensional model?
• Choose the most atomic information captured by the
business process.
▫ Atomic data
- Most detailed, cannot be subdivided
- Facilitates ad hoc, unexpected usage and the ability to drill down to details
• Case study grain: individual line item on a POS
transaction
9. Step 3. Choose the Dimensions
• A careful grain statement determines the primary
dimensions.
• It is then usually possible to add additional
dimensions.
• If an additional desired dimension violates the grain
by causing additional fact rows to be generated, then
the grain statement must be revised to accommodate
this dimension.
• Case study dimensions: date, product, store,
promotion
11. Step 4. Identify the Facts
• Pick the business measurements for the fact table; the facts must be true to the grain.
• Case study - Facts collected by POS system:
▫ Sales quantity, sales price/unit, sales $ amount, standard cost $
amount
▫ Gross profit = sales $ amount - standard cost $ amount
Recommendation: Include it in the fact table even though it can be calculated. This eliminates the possibility of user error.
• For non-additive measurements such as percentages and ratios
(e.g., gross margin) store the numerator (gross profit) and
denominator ($ revenue) in the fact table. The ratio can be
calculated in a data access tool for any slice of the fact table.
Caution: Calculate the ratio of the sums, not the sum of the
ratios
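The caution about ratios can be checked with a small sketch. The three fact rows and their dollar values are made up for illustration, and gross profit is taken as sales dollar amount minus cost dollar amount.

```python
# Hypothetical fact rows: (sales_dollar_amount, cost_dollar_amount).
fact_rows = [
    (100.0, 80.0),  # row-level margin 20%
    (10.0, 5.0),    # row-level margin 50%
    (50.0, 45.0),   # row-level margin 10%
]


def gross_profit(sales, cost):
    """Additive fact: can safely be summed across any set of rows."""
    return sales - cost


def gross_margin(rows):
    """Ratio of the sums: total gross profit / total dollar revenue."""
    total_profit = sum(gross_profit(s, c) for s, c in rows)
    total_sales = sum(s for s, _ in rows)
    return total_profit / total_sales


def average_of_row_margins(rows):
    """The tempting but wrong calculation: combines per-row ratios."""
    return sum(gross_profit(s, c) / s for s, c in rows) / len(rows)


correct = gross_margin(fact_rows)
wrong = average_of_row_margins(fact_rows)
print(f"ratio of the sums:     {correct:.4f}")
print(f"average of row ratios: {wrong:.4f}")
```

The two numbers differ because the per-row ratios ignore how much revenue each row contributes, which is why only the numerator and denominator are stored as facts.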
13. Key Input to the four-step dimensional
design process
14. Date Dimension
• Found in virtually every data mart
• Use verbose, self-explanatory values rather than codes
Ex. For the holiday indicator, use Holiday and Nonholiday instead of Y and N
• The date key should be an integer rather than a date type
• If the time of day of a transaction is of interest -> use a separate time-of-day dimension table
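Because the date dimension can be built in advance, a short script is enough to populate it. The sketch below is one possible way to do so; the column names, the holiday list, and the date range are assumptions chosen for illustration.

```python
from datetime import date, timedelta

# Hypothetical holiday list; a real one would come from the business calendar.
HOLIDAYS = {date(2024, 1, 1), date(2024, 12, 25)}

# Spelled out explicitly so the output does not depend on the system locale.
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def build_date_dimension(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    key = 1  # integer key, assigned sequentially
    while day <= end:
        rows.append({
            "date_key": key,
            "full_date": day.isoformat(),
            "day_of_week": DAY_NAMES[day.weekday()],
            "year": day.year,
            # Verbose values instead of Y/N codes.
            "holiday_indicator": "Holiday" if day in HOLIDAYS else "Nonholiday",
            "weekday_indicator": "Weekday" if day.weekday() < 5 else "Weekend",
        })
        day += timedelta(days=1)
        key += 1
    return rows


date_dim = build_date_dimension(date(2024, 1, 1), date(2024, 1, 7))
for row in date_dim:
    print(row["date_key"], row["full_date"], row["day_of_week"],
          row["holiday_indicator"], row["weekday_indicator"])
```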
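16. Date Dimension
• Why is an explicit date dimension table needed, instead of an SQL date type in the fact table?
• Answer: the join is not a real cost; if a relational database cannot handle an efficient join to the date dimension table, we are already in deep trouble
• Answer: most databases do not index SQL date calculations, so queries that constrain on an SQL-calculated field cannot take advantage of an index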
17. Product Dimension
• Describes every SKU in the store
• Sourced from the operational product master file
• Holds the descriptive attributes of each SKU
• Hierarchies = groups of attributes
• Merchandise hierarchy -> each level is a many-to-one relationship
• The flattened hierarchy is redundant -> but there is no need to normalize it, because the space saving is minimal
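A flattened product dimension can be sketched as a list of rows in which the hierarchy attributes are simply repeated. The products, brands, and sales figures below are invented for illustration, and the brand -> category -> department levels are an assumed merchandise hierarchy.

```python
from collections import defaultdict

# Hypothetical denormalized product dimension: hierarchy values repeat per SKU.
product_dim = [
    {"product_key": 1, "description": "Whole Milk 1L", "brand": "DairyBest",
     "category": "Milk", "department": "Dairy"},
    {"product_key": 2, "description": "Skim Milk 1L", "brand": "DairyBest",
     "category": "Milk", "department": "Dairy"},
    {"product_key": 3, "description": "Cheddar 200g", "brand": "CheeseCo",
     "category": "Cheese", "department": "Dairy"},
    {"product_key": 4, "description": "Sourdough Loaf", "brand": "OvenFresh",
     "category": "Bread", "department": "Bakery"},
]


def is_many_to_one(rows, child, parent):
    """True if every child value maps to exactly one parent value."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[child], row[parent]) != row[parent]:
            return False
    return True


def rollup(sales_by_product_key, rows, attribute):
    """Sum sales dollars by any descriptive attribute, with no extra joins."""
    by_key = {row["product_key"]: row for row in rows}
    totals = defaultdict(float)
    for product_key, amount in sales_by_product_key.items():
        totals[by_key[product_key][attribute]] += amount
    return dict(totals)


# Hypothetical sales dollars per product key.
sales = {1: 30.0, 2: 20.0, 3: 15.0, 4: 12.0}
department_totals = rollup(sales, product_dim, "department")
print(department_totals)
print(is_many_to_one(product_dim, "brand", "category"))
print(is_many_to_one(product_dim, "category", "department"))
```

The same `rollup` call works for brand or category, which is the practical payoff of keeping the hierarchy in one table.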
21. Promotion Dimension
• Describe the promotion conditions under which
product was sold
• Causal Dimension -> factors thought to cause a
change in product sales
• 4 causal mechanisms -> price reductions, ads, displays, coupons
22. Promotion Dimension
• Keeping the 4 causal mechanisms together in one dimension
- they are highly correlated -> so there is not much difference in space requirement
- the combined dimension can be browsed efficiently to see how the various promotions are used together
• Separating the 4 causal mechanisms into distinct dimension tables
- may be more understandable to the business community
- may be more straightforward than administering a combined dimension
• "No Promotion in Effect" row -> used when the line item is not being promoted -> avoids a null promotion key in the fact table
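The "No Promotion in Effect" row can be sketched as follows. The key value 0, the attribute names, and the promotion code lookup are all assumptions made for this example, not something the slides prescribe.

```python
# Hypothetical key reserved for the "no promotion" row.
NO_PROMOTION_KEY = 0

# Hypothetical promotion dimension, keyed by promotion key.
promotion_dim = {
    NO_PROMOTION_KEY: {
        "promotion_name": "No Promotion in Effect",
        "price_reduction_type": "None",
        "ad_type": "None",
        "display_type": "None",
        "coupon_type": "None",
    },
    1: {
        "promotion_name": "Summer Cola Discount",
        "price_reduction_type": "Temporary Price Reduction",
        "ad_type": "Newspaper Ad",
        "display_type": "End-Aisle Display",
        "coupon_type": "None",
    },
}

# Hypothetical lookup from an operational promotion code to the promotion key.
promotion_key_by_code = {"SUMMER-COLA": 1}


def promotion_key_for(line_item):
    """Map a POS line item to a promotion key, never returning None."""
    code = line_item.get("promotion_code")
    if code is None:
        return NO_PROMOTION_KEY
    return promotion_key_by_code[code]


plain_item = {"sku": "SKU-0001"}
promoted_item = {"sku": "SKU-0002", "promotion_code": "SUMMER-COLA"}
print(promotion_dim[promotion_key_for(plain_item)]["promotion_name"])
print(promotion_dim[promotion_key_for(promoted_item)]["promotion_name"])
```

Every fact row then joins to exactly one promotion row, so queries never have to special-case a missing key.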
23. Promotion Dimension
• Q: Which products were under promotion but did not
sell?
Cannot answer! -> POS sales fact table has only
products that were sold
• Factless fact table = has no measurement metrics -> a promotion coverage table records which products were on promotion, whether or not they sold
• 2-step process to answer the question
- Query the promotion coverage table to find the products on promotion
- Determine which products sold from the POS sales fact table
So -> the answer is the set difference between these 2 lists of products.
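The two-step process maps directly onto a set difference. In this sketch the row layouts, key values, and the single store and day are assumptions for illustration only.

```python
# Hypothetical promotion coverage (factless) rows:
# (date_key, product_key, store_key, promotion_key)
promotion_coverage = [
    (1, 101, 1, 7),
    (1, 102, 1, 7),
    (1, 103, 1, 8),
]

# Hypothetical POS sales fact rows:
# (date_key, product_key, store_key, promotion_key, sales_quantity)
pos_sales = [
    (1, 101, 1, 7, 3),
    (1, 103, 1, 8, 1),
    (1, 104, 1, 0, 2),  # sold, but not on promotion
]


def promoted_but_unsold(coverage, sales, date_key, store_key):
    """Products on promotion in a store on a day that had no sales rows."""
    # Step 1: products on promotion, from the coverage table.
    promoted = {p for d, p, s, _ in coverage
                if d == date_key and s == store_key}
    # Step 2: products that sold, from the POS sales fact table.
    sold = {p for d, p, s, _, _ in sales
            if d == date_key and s == store_key}
    # Answer: the set difference between the two lists.
    return promoted - sold


unsold = promoted_but_unsold(promotion_coverage, pos_sales, date_key=1, store_key=1)
print(sorted(unsold))
```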
25. Degenerate Dimension
• Degenerate dimensions often play an integral role in the
fact table’s primary key.
• Degenerate dimensions are very common when the grain
of a fact table represents a single transaction or
transaction line item because the degenerate dimension
represents the unique identifier of the parent.
• Operational control numbers such as order numbers,
invoice numbers, and bill-of-lading numbers usually
give rise to empty dimensions and are represented as
degenerate dimensions (that is, dimension keys without
corresponding dimension tables) in fact tables where the
grain of the table is the document itself or a line item in
the document.
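A degenerate dimension can be pictured as a column that sits in the fact row with no dimension table behind it. The sketch below assumes a POS transaction number as that column; the field names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical line-item fact rows. pos_transaction_number is the degenerate
# dimension: it is stored in the fact row and joins to no dimension table.
line_items = [
    {"date_key": 1, "product_key": 101, "store_key": 1, "promotion_key": 0,
     "pos_transaction_number": "T1001", "sales_dollar_amount": 5.0},
    {"date_key": 1, "product_key": 102, "store_key": 1, "promotion_key": 7,
     "pos_transaction_number": "T1001", "sales_dollar_amount": 2.5},
    {"date_key": 1, "product_key": 101, "store_key": 1, "promotion_key": 0,
     "pos_transaction_number": "T1002", "sales_dollar_amount": 4.0},
]


def group_by_transaction(rows):
    """Reassemble sales tickets by grouping line items on the ticket number."""
    tickets = defaultdict(list)
    for row in rows:
        tickets[row["pos_transaction_number"]].append(row)
    return dict(tickets)


tickets = group_by_transaction(line_items)
ticket_totals = {
    number: sum(item["sales_dollar_amount"] for item in items)
    for number, items in tickets.items()
}
print(ticket_totals)
```

Grouping on the transaction number is what lets line items be tied back to their parent ticket, even though the number carries no descriptive attributes of its own.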
26. Retail Schema Extensibility
• Original schema extends gracefully because POS
transaction data was modeled at its most
granular level.
• Premature aggregation limits the ability to extend the schema, because new dimensions may not apply at the higher grain
27. Surrogate Keys
• We strongly encourage the use of surrogate keys
in dimensional models rather than relying on
operational production codes.
• Surrogate keys are integers that are assigned
sequentially as needed to populate a dimension.
• The surrogate keys merely serve to join the
dimension tables to the fact table.
28. Surrogate Keys
• One of the primary benefits of surrogate keys is
that they buffer the data warehouse environment
from operational changes.
• Surrogate keys allow the warehouse team to
maintain control of the environment rather than
being whipsawed by operational rules for
generating, updating, deleting, recycling, and
reusing production codes.
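29. Surrogate Keys
• When an operational system reuses a production code (for example, reassigning an inactive account number), surrogate keys provide the warehouse with a mechanism to
differentiate the two separate instances of the same operational
account number.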
• Surrogate keys allow the data warehouse team to integrate data
from multiple operational source systems, even if they lack
consistent source keys.
• The surrogate key is as small an integer as possible while ensuring
that it will accommodate the future cardinality or maximum
number of rows in the dimension comfortably.
• The smaller surrogate key translates into smaller fact tables, smaller
fact table indices, and more fact table rows per block input-output
operation.
• Finally, surrogate keys are needed to support one of the primary
techniques for handling changes to dimension table attributes. This
is actually one of the most important reasons to use surrogate keys.
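Sequential surrogate key assignment can be sketched in a few lines. The class name, the source system names, and the `instance` parameter used to tell apart a reused production code are all hypothetical choices for this example.

```python
class SurrogateKeyGenerator:
    """Assigns sequential integer keys, independent of production codes."""

    def __init__(self):
        self._next_key = 1
        self._keys = {}

    def key_for(self, source_system, production_code, instance=1):
        """Return the surrogate key for a natural key, creating it if new.

        instance distinguishes separate uses of a recycled production code.
        """
        natural_key = (source_system, production_code, instance)
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next_key
            self._next_key += 1
        return self._keys[natural_key]


keys = SurrogateKeyGenerator()
first = keys.key_for("billing", "ACCT-100")
other_system = keys.key_for("crm", "ACCT-100")            # same code, other source
reused = keys.key_for("billing", "ACCT-100", instance=2)  # recycled account number
repeat = keys.key_for("billing", "ACCT-100")              # already known
print(first, other_system, reused, repeat)
```

The surrogate key carries no meaning of its own; it only joins the dimension row to the fact table, which is why the operational code can change or be reused without disturbing the warehouse.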
30. Market Basket Analysis
• This notion of analyzing the combination of
products that sell together is known by data
miners as affinity grouping but more popularly
is called market basket analysis.
• Market basket analysis gives the retailer insights
about how to merchandise various combinations
of items. If frozen pasta dinners sell well with
cola products, then these two products perhaps
should be located near each other or marketed
with complementary pricing.
31. Market Basket Analysis (cont’d)
• Data mining tools and some OLAP products can
assist with market basket analysis
• The key to realistic market basket analysis is to
remember that the primary goal is to understand
the meaningful combinations of products sold
together.
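One simple way to find products that sell together is to count how often each pair appears in the same basket. This is only a minimal sketch of pair counting, not a full data mining technique, and the baskets below are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: transaction number -> set of products on the ticket.
baskets = {
    "T1": {"frozen pasta dinner", "cola", "bread"},
    "T2": {"frozen pasta dinner", "cola"},
    "T3": {"cola", "chips"},
    "T4": {"bread", "milk"},
}


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for items in baskets.values():
        # Sorting makes (a, b) and (b, a) the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

With real data the number of pairs grows quickly, so counts like these are usually restricted to meaningful combinations, for example at the brand or category level.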
32. Conclusion
• In this chapter we got our first exposure to designing a
dimensional model.
• Regardless of industry, we strongly encourage the four-step
process for tackling dimensional model designs. Remember
that it is especially important that we clearly state the grain
associated with our dimensional schema. Loading the fact
table with atomic data provides the greatest flexibility because
we can summarize that data “every which way.”
• As soon as the fact table is restricted to more aggregated
information, we’ll run into walls when the summarization
assumptions prove to be invalid.
• Remember that it is vitally important to populate our
dimension tables with verbose, robust descriptive attributes.