3. Agenda
Data Warehouse architecture & building blocks
ER modeling review
Need for Dimensional Modeling
Dimensional modeling and its internals
Comparison of ER with dimensional modeling
5. Components
Major components
Source data component
Data staging component
Data storage component
Information delivery component
Metadata component
Management and control component
6. 1. Source Data Components
Source data can be grouped into four categories
Production data
Comes from the operational systems of the enterprise
Only selected segments are taken from it
Narrow scope, e.g. order details
Internal data
Private datasheets, documents, customer profiles, etc.
E.g. customer profiles for a specific offering
Special strategies are needed to transform it (e.g. text documents) for the DW
Archived data
Old data is archived
The DW holds snapshots of historical data
External data
Executives depend upon external sources
E.g. market data on competitors; a car rental company needs data on new car manufacturing
Conversions into the DW's formats must be defined
8. 2. Data Staging Components
After data is extracted, it must be prepared
Data extracted from sources needs to be changed, converted, and made ready in a suitable format
Three major functions make the data ready
Extract
Transform
Load
The staging area provides a place and a set of functions to
Clean
Change
Combine
Convert
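The three staging functions can be sketched in a few lines of Python. This is a minimal illustration only: the source rows, field names, and the in-memory list standing in for the warehouse are all hypothetical and do not come from the slides.

```python
# Hypothetical source rows; field names and values are illustrative only.
production_orders = [
    {"order_id": "1001", "cust": " alice ", "amount": "25.50", "date": "2024-01-05"},
    {"order_id": "1002", "cust": "BOB", "amount": "n/a", "date": "2024-01-06"},
]
internal_profiles = [
    {"customer": "Alice", "segment": "retail"},
    {"customer": "Bob", "segment": "wholesale"},
]


def extract():
    """Extract: pull the selected segments from each source."""
    return production_orders, internal_profiles


def clean(order):
    """Clean: tidy customer names and reject rows with unusable amounts."""
    try:
        amount = float(order["amount"])
    except ValueError:
        return None
    return {**order, "cust": order["cust"].strip().title(), "amount": amount}


def transform(orders, profiles):
    """Transform: clean, convert types, and combine the two sources."""
    segment_by_customer = {p["customer"]: p["segment"] for p in profiles}
    rows = []
    for order in orders:
        cleaned = clean(order)
        if cleaned is None:
            continue
        rows.append({
            "order_id": int(cleaned["order_id"]),  # convert text to integer
            "customer": cleaned["cust"],
            # combine: attach the segment from the internal profile data
            "segment": segment_by_customer.get(cleaned["cust"], "unknown"),
            "amount": cleaned["amount"],
            "order_date": cleaned["date"],
        })
    return rows


def load(rows, warehouse):
    """Load: append the prepared rows to the warehouse store."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(transform(*extract()), [])
```

In this sketch the second order is rejected during cleaning because its amount cannot be converted, so only one prepared row reaches the warehouse list.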
10. 3. Data Storage Components
Separate repository
Data structured for efficient processing
Redundancy is increased
Updated after specific periods
Read-only
16. Background (ER Modeling)
For ER modeling, entities are collected from the environment
Each entity acts as a table
Success reasons
Normalized after ER modeling, since normalization removes redundancy (to handle update/delete anomalies)
But the number of tables increases
Useful for fast access to small amounts of data
17. ER Drawbacks for DW / Need for Dimensional Modeling
ER models are hard to remember, due to the increased number of tables
Queries over multiple tables are complex (table joins)
Conventional RDBMSs are optimized for a small number of tables, whereas a large number of tables might be required in a DW
ER models ideally hold no calculated attributes
The DW does not need to update data as an OLTP system does, so there is no need for normalization
OLAP is not the only purpose of a DW; we need a model that facilitates data integration, data mining, and historically consolidated data
An efficient indexing scheme is needed to avoid scanning all data
De-Normalization (in DW)
Add primary key
Direct relationships
Re-introduce redundancy
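A small sketch may make the de-normalization steps concrete. It uses Python's built-in `sqlite3` module; the product/category/department tables and their contents are hypothetical examples, not taken from the slides. The wide table gets its own primary key, relates each product directly to its category and department, and repeats those names on every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (ER style): product -> category -> department.
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE category (cat_id INTEGER PRIMARY KEY, cat_name TEXT,
                           dept_id INTEGER REFERENCES department(dept_id));
    CREATE TABLE product (prod_id INTEGER PRIMARY KEY, prod_name TEXT,
                          cat_id INTEGER REFERENCES category(cat_id));
    INSERT INTO department VALUES (1, 'Grocery');
    INSERT INTO category VALUES (10, 'Dairy', 1), (11, 'Bakery', 1);
    INSERT INTO product VALUES (100, 'Milk', 10), (101, 'Cheese', 10),
                               (102, 'Bread', 11);

    -- De-normalized: one wide table with its own primary key; category and
    -- department names are repeated on every row (redundancy re-introduced).
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, prod_name TEXT,
                              cat_name TEXT, dept_name TEXT);
    INSERT INTO product_dim
    SELECT p.prod_id, p.prod_name, c.cat_name, d.dept_name
    FROM product p
    JOIN category c ON p.cat_id = c.cat_id
    JOIN department d ON c.dept_id = d.dept_id;
""")

# The same question needs two joins on the normalized tables ...
normalized = conn.execute("""
    SELECT p.prod_name
    FROM product p
    JOIN category c ON p.cat_id = c.cat_id
    JOIN department d ON c.dept_id = d.dept_id
    WHERE d.dept_name = 'Grocery'
    ORDER BY p.prod_name
""").fetchall()

# ... and no join at all on the de-normalized table.
denormalized = conn.execute("""
    SELECT prod_name FROM product_dim
    WHERE dept_name = 'Grocery'
    ORDER BY prod_name
""").fetchall()
```

Both queries return the same products; the trade-off is that 'Grocery' is now stored once per product instead of once overall.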
18. Dimensional Modeling
Dimensional modeling focuses on subject orientation and the critical factors of the business
Critical factors are stored in facts
Redundancy is not a problem; it is accepted to achieve efficiency
Logical design technique for high performance
It is the modeling technique for storage
19. Dimensional Modeling (cont.)
Two important concepts
Fact
Numeric measurements that represent a business activity or event
Are pre-computed and redundant
Example: profit, quantity sold
Dimension
Qualifying characteristics that give a perspective on a fact
Example: date (date, month, quarter, year)
20. Dimensional Modeling (cont.)
Facts are stored in a fact table
Dimensions are represented by dimension tables
Dimensions are the perspectives along which facts can be judged
Each fact table is surrounded by dimension tables
The layout looks like a star, hence the name Star Schema
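As an illustration, a star schema with one fact table and three dimension tables might be set up as follows with Python's built-in `sqlite3` module. The table names, columns, and sample rows are assumptions chosen for the example; only the facts (profit, quantity sold) and the date dimension attributes echo the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: the points of the star.
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, date TEXT,
                           month TEXT, quarter TEXT, year INTEGER);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY,
                              product_name TEXT, category TEXT);
    CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY,
                            store_name TEXT, city TEXT);

    -- Fact table: the center of the star, keyed by its dimensions.
    CREATE TABLE sales_fact (
        date_key INTEGER REFERENCES date_dim(date_key),
        product_key INTEGER REFERENCES product_dim(product_key),
        store_key INTEGER REFERENCES store_dim(store_key),
        quantity_sold INTEGER,
        profit REAL,
        PRIMARY KEY (date_key, product_key, store_key)
    );

    INSERT INTO date_dim VALUES (1, '2024-01-15', 'Jan', 'Q1', 2024),
                                (2, '2024-04-10', 'Apr', 'Q2', 2024);
    INSERT INTO product_dim VALUES (1, 'Milk', 'Dairy'), (2, 'Bread', 'Bakery');
    INSERT INTO store_dim VALUES (1, 'Store A', 'Springfield');
    INSERT INTO sales_fact VALUES (1, 1, 1, 10, 5.0), (1, 2, 1, 4, 2.0),
                                  (2, 1, 1, 6, 3.0), (2, 2, 1, 8, 4.5);
""")

# Judge the facts along two dimensions: quarter and product category.
profit_by_quarter_and_category = conn.execute("""
    SELECT d.quarter, p.category, SUM(f.quantity_sold), SUM(f.profit)
    FROM sales_fact f
    JOIN date_dim d ON f.date_key = d.date_key
    JOIN product_dim p ON f.product_key = p.product_key
    GROUP BY d.quarter, p.category
    ORDER BY d.quarter, p.category
""").fetchall()
```

Every query follows the same shape: start at the fact table and join outward by one step to whichever dimensions the question needs.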
22. Inside Dimensional Modeling
Inside dimension table
Key attribute of the dimension table, for identification
Large number of columns; wide table
Non-calculated, textual attributes
Attributes are not directly related to one another
Un-normalized in a star schema
Drill-down and drill-up are two ways of exploiting dimensions
Can have multiple hierarchies
Relatively small number of records
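Drill-up and drill-down can be pictured as aggregating the same facts at coarser or finer levels of a dimension hierarchy. The sketch below assumes a year > quarter > month hierarchy on the date dimension and uses made-up rows; the `aggregate` helper is hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows, already joined to their date dimension attributes.
sales = [
    {"year": 2024, "quarter": "Q1", "month": "Jan", "quantity": 10},
    {"year": 2024, "quarter": "Q1", "month": "Feb", "quantity": 5},
    {"year": 2024, "quarter": "Q2", "month": "Apr", "quantity": 8},
]


def aggregate(rows, levels):
    """Sum quantity grouped by the given hierarchy levels."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[level] for level in levels)] += row["quantity"]
    return dict(totals)


by_year = aggregate(sales, ["year"])                         # drilled up
by_quarter = aggregate(sales, ["year", "quarter"])           # one level down
by_month = aggregate(sales, ["year", "quarter", "month"])    # drilled down
```

A dimension with multiple hierarchies would simply offer more than one list of levels to pass to the same function.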
23. Inside Dimensional Modeling
Inside fact table
Has two types of attributes
Key attributes, for connections
Facts
Concatenated key
Grain or level of data identified
Large number of records
Limited attributes
Sparse data set
Degenerate dimensions (e.g. order number, used to find the average products per order)
Fact-less fact table
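Two of these ideas, the degenerate dimension and the fact-less fact table, are easier to see in a small example. The sketch below uses Python's built-in `sqlite3` module; the table names, the attendance scenario, and all values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Grain: one row per product per order.
    CREATE TABLE order_line_fact (
        order_number INTEGER,  -- degenerate dimension: no dimension table behind it
        product_key INTEGER,
        date_key INTEGER,
        quantity INTEGER,
        PRIMARY KEY (order_number, product_key)  -- concatenated key
    );
    INSERT INTO order_line_fact VALUES
        (5001, 1, 20240105, 2), (5001, 2, 20240105, 1), (5001, 3, 20240105, 4),
        (5002, 1, 20240106, 1);

    -- Fact-less fact table: keys only, no measures; a row records that an event happened.
    CREATE TABLE attendance_fact (
        student_key INTEGER, class_key INTEGER, date_key INTEGER,
        PRIMARY KEY (student_key, class_key, date_key)
    );
    INSERT INTO attendance_fact VALUES
        (1, 10, 20240105), (2, 10, 20240105), (1, 10, 20240106);
""")

# The degenerate dimension groups order lines: average products per order.
avg_products_per_order = conn.execute("""
    SELECT AVG(n)
    FROM (SELECT COUNT(*) AS n FROM order_line_fact GROUP BY order_number)
""").fetchone()[0]

# A fact-less fact table is analyzed by counting rows.
attendance_per_day = conn.execute("""
    SELECT date_key, COUNT(*) FROM attendance_fact
    GROUP BY date_key ORDER BY date_key
""").fetchall()
```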
24. Star Schema Keys
Primary keys
Identifying attribute in a dimension table
In the fact table, relationship attributes combine to form the primary key
Surrogate keys
Replace the operational primary key
System generated
Foreign keys
Collection of the primary keys of the dimension tables
Primary key of the fact table
Either system generated
Or the collection of the dimension primary keys
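A surrogate key generator can be sketched as a lookup from operational keys to system-generated integers. The class name, the idea of tagging each key with its source system, and the sample keys are all hypothetical; this is one possible scheme, not a prescribed one.

```python
import itertools


class SurrogateKeyGenerator:
    """Hands out system-generated integer keys in place of operational keys."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self._by_operational_key = {}

    def key_for(self, source_system, operational_key):
        """Return the surrogate key for an operational key, creating it if new."""
        lookup = (source_system, operational_key)
        if lookup not in self._by_operational_key:
            self._by_operational_key[lookup] = next(self._next_key)
        return self._by_operational_key[lookup]


keys = SurrogateKeyGenerator()
billing_customer = keys.key_for("billing", "C-17")
crm_customer = keys.key_for("crm", "C-17")       # same code, different source
billing_again = keys.key_for("billing", "C-17")  # already known: key is reused
```

Because the warehouse key carries no operational meaning, two source systems that happen to reuse the same code do not collide in the dimension table.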
25. Advantages of Star Schema
Easy for users to understand
Optimized for navigation (fewer joins, so faster)
Most suitable for query processing
Karen Corral, et al. (2006) The impact of alternative
diagrams on the accuracy of recall: A comparison of
star-schema diagrams and entity-relationship diagrams,
Decision Support Systems, 42(1), 450-468.
27. DATA WAREHOUSES AND DATA MARTS
Bill Inmon stated, “The single most important issue facing the IT manager
this year is whether to build the data warehouse first or the data mart first.”
This statement is true even today. Let us examine this statement and take a
stand.
Before deciding to build a data warehouse for your organization, you need to
ask the following basic and fundamental questions and address the relevant issues:
Top-down or bottom-up approach?
Enterprise-wide or departmental?
Which first—data warehouse or data mart?
Build pilot or go with a full-fledged implementation?
Dependent or independent data marts?
31. A Practical Approach
In order to formulate an approach for your organization, you need to examine
what exactly your organization wants. Is your organization looking for long-term
results or fast data marts for only a few subjects for now? Does your organization
want quick, proof-of-concept, throw-away implementations? Or, do you want to look
into some other practical approach?
32. Although the top-down and the bottom-up approaches each have their own
advantages and drawbacks, a compromise approach accommodating both views
appears to be practical.
The chief proponent of this practical approach is Ralph Kimball, an eminent
author and data warehouse expert. The steps in this practical approach are as
follows:
1. Plan and define requirements at the overall corporate level
2. Create a surrounding architecture for a complete warehouse
3. Conform and standardize the data content
4. Implement the data warehouse as a series of supermarts, one at a time
33. METADATA IN THE DATA WAREHOUSE
Types of Metadata
Metadata in a data warehouse fall into three major categories:
Operational Metadata
Extraction and Transformation Metadata
End-User Metadata
34. Operational Metadata
As you know, data for the data warehouse comes from several operational
systems of the enterprise. These source systems contain different data structures.
The data elements selected for the data warehouse have various field lengths and
data types.
In selecting data from the source systems for the data warehouse, you split
records, combine parts of records from different source files, and deal with
multiple coding schemes and field lengths.
When you deliver information to the end-users, you must be able to tie that back
to the original source data sets.
Operational metadata contain all of this information about the operational data
sources.
35. Extraction and Transformation Metadata
Extraction and transformation metadata contain data about the extraction of data
from the source systems, namely, the extraction frequencies, extraction methods,
and business rules for the data extraction.
Also, this category of metadata contains information about all the data
transformations that take place in the data staging area.
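One way to picture this category of metadata is as a record kept for each source extract. The sketch below is an assumption about how such a record could look in Python; the class, its field names, and the sample values are hypothetical, though the fields mirror the items listed above (frequency, method, business rules, transformations).

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionMetadata:
    """Describes how one source is extracted and transformed."""
    source_system: str
    extraction_frequency: str
    extraction_method: str
    business_rules: list = field(default_factory=list)
    transformations: list = field(default_factory=list)

    def describe(self):
        """Return a one-line summary of this extract."""
        return (f"{self.source_system}: {self.extraction_method} extract, "
                f"{self.extraction_frequency}, "
                f"{len(self.transformations)} transformation(s)")


orders_meta = ExtractionMetadata(
    source_system="order_entry",
    extraction_frequency="daily",
    extraction_method="incremental",
    business_rules=["skip cancelled orders"],
    transformations=["trim customer names", "convert amounts to decimal"],
)
summary = orders_meta.describe()
```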
End-User Metadata
The end-user metadata is the navigational map of the data warehouse.
It enables the end-users to find information from the data warehouse.
The end-user metadata allows the end-users to use their own business
terminologies.
36. Significance
Why is metadata especially important in a data warehouse?
First, it acts as the glue that connects all parts of the data
warehouse.
Next, it provides information about the contents and structures to
the developers.
Finally, it opens the door to the end-users and makes the contents
recognizable in their own terms.