2. Data Warehousing
•Aims of information technology:
• To help workers in their everyday business activity
and improve their productivity – clerical data
processing tasks
• To help knowledge workers (executives,
managers, analysts) make faster and better decisions
– decision support systems
•Two types of applications:
• Operational applications
• Analytical applications
3. Data Warehousing (Contd..)
•In most organizations, data about specific parts of the
business is there: lots and lots of data, somewhere, in
some form.
•Data is available but not information, and not the
right information at the right time.
•There is a need to
• bring together information
• off-load decision support applications from the on-line
transaction system
4. Data Warehouse
•“A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
management’s decision-making process.” --- W. H. Inmon
•Collection of data that is used primarily in organizational
decision making
•A decision support database that is maintained separately
from the organization’s operational database
5. Data Warehouse - Subject Oriented
•Data that gives information about a particular subject.
•Organized for the modeling and analysis of data.
•Provides a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process.
6. Data Warehouse – Integrated
•It is constructed by integrating multiple, heterogeneous
data sources.
•Data cleaning and data integration techniques are
applied.
•When data is moved to the warehouse, it is converted.
7. Data Warehouse - Time Variant
•Data is stable in a data warehouse.
•It holds historical as well as current data.
•Every key structure in the data warehouse
contains an element of time, explicitly or implicitly.
•But the key of operational data may or may not
contain a “time element”.
8. Data Warehouse - Non-Volatile
• A physically separate store of data transformed from
the operational environment.
• No update or delete on historical data.
•Operational update of data does not occur in the data
warehouse.
•New data is appended.
•Requires only two operations: initial loading of data and access of data.
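The time-variant and non-volatile properties can be sketched together in a few lines. This is only an illustration using Python's built-in sqlite3 module; the table `customer_history`, its columns, and the sample rows are assumptions made up for the example, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical warehouse table: the key contains an explicit time element.
conn.execute("""
    CREATE TABLE customer_history (
        customer_id   INTEGER,
        snapshot_date TEXT,
        city          TEXT,
        PRIMARY KEY (customer_id, snapshot_date)
    )
""")

def load_snapshot(rows):
    """Append a batch of rows; existing history is never updated or deleted."""
    conn.executemany("INSERT INTO customer_history VALUES (?, ?, ?)", rows)
    conn.commit()

load_snapshot([(1, "2024-01-31", "Pune")])
load_snapshot([(1, "2024-02-29", "Mumbai")])  # customer moved; the old row stays

history = conn.execute(
    "SELECT snapshot_date, city FROM customer_history "
    "WHERE customer_id = 1 ORDER BY snapshot_date"
).fetchall()
print(history)
```

An operational system would typically overwrite the city in place; here both values remain queryable because each load only appends rows keyed by date.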
9. Data modifications & schema design
• A data warehouse is updated on a regular
basis by the ETL process (run nightly or weekly)
using bulk data modification techniques.
• Data warehouses often use denormalized or
partially denormalized schemas (such as a star
schema) to optimize query performance.
10. Why Separate Data Warehouse?
•Separate and historical data are needed for decision support.
•Complex decisions.
•Missing data.
•Data consolidation.
•Data quality.
11. Advantages of Data Warehousing
•High query performance
•Queries not visible outside warehouse
•Local processing at sources unaffected
•Can operate when sources unavailable
•Can query data not stored in a DBMS
•Extra information at warehouse
• Modify, summarize (store aggregates)
• Add historical information
12. Decision Support System
• Information technology to help knowledge workers
(executives, managers, analysts) make faster and
better decisions
• OLAP is an element of decision support system
• Data mining is a powerful, high-performance data
analysis tool for decision support.
13. Three-Tier Decision Support Systems
•Warehouse database server
• Almost always a relational DBMS, rarely flat files
•OLAP servers
• Relational OLAP (ROLAP): extended relational DBMS that
maps operations on multidimensional data to standard
relational operators
• Multidimensional OLAP (MOLAP): special-purpose server
that directly implements multidimensional data and operations
•Clients
• Query and reporting tools
• Analysis tools
• Data mining tools
14. The Complete Decision Support System
[Figure: three-tier decision support architecture]
•Information sources: operational DBs, semistructured sources
•Extract, transform, load, refresh, etc. into the data warehouse and data marts
•Data warehouse server (Tier 1) serves the OLAP servers
•OLAP servers (Tier 2), e.g., MOLAP, ROLAP, serve the clients
•Clients (Tier 3): OLAP, query/reporting, data mining
15. Data Sources
•Data sources are often the operational systems,
providing the lowest level of data.
•Data sources are designed for operational use, not for
decision support, and the data reflect this fact.
•Multiple data sources often come from different systems
that run on a wide range of hardware, and much of the
software is built in-house or highly customized.
•Multiple data sources introduce a large number of
issues, notably semantic conflicts.
16. Creating and Maintaining a Warehouse
•A data warehouse needs several tools that automate or support tasks
such as:
• Data extraction from different external data sources,
operational databases, files of standard applications
• Data cleaning (finding and resolving inconsistency in the
source data)
• inconsistent field lengths, inconsistent descriptions, inconsistent value
assignments, missing entries and violation of integrity constraints.
• optional fields in data entry are significant sources of inconsistent data.
17. Creating and Maintaining a Warehouse (Contd..)
• Integration and transformation of data (between different data
formats, languages, etc.)
• Data loading (loading the data into the data warehouse)
• checking integrity constraints, sorting, summarizing, etc.
• Data replication (replicating source database into the data
warehouse)
• used to incrementally refresh a warehouse when sources change
• Data refreshment
• propagating updates on source data to the data stored in the warehouse
• periodically or immediately
• Data archiving
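The extraction, cleaning, transformation and loading tasks listed above can be sketched as a tiny pipeline. This is a minimal illustration, not a real ETL tool: the two source lists, the `GENDER_CODES` mapping, and the `sales` table are all hypothetical names invented for the example.

```python
import sqlite3

# Hypothetical rows extracted from two sources with inconsistent conventions.
source_a = [{"id": 1, "gender": "M", "amount": "120.50"},
            {"id": 2, "gender": "F", "amount": ""}]       # missing entry
source_b = [{"id": 3, "gender": "male", "amount": "80"}]  # different value coding

GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def clean(row):
    """Resolve inconsistent value assignments; return None for unusable rows."""
    amount = row["amount"].strip()
    if not amount:
        return None  # missing entry: dropped here, a real pipeline might log it
    gender = GENDER_CODES.get(row["gender"].strip().lower())
    if gender is None:
        return None
    return (row["id"], gender, float(amount))

def load(conn, rows):
    """Bulk-load cleaned rows; the PRIMARY KEY enforces an integrity constraint."""
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, gender TEXT, amount REAL)")
cleaned = [r for r in map(clean, source_a + source_b) if r is not None]
load(conn, sorted(cleaned))  # sorted before loading, as in the loading step
loaded = conn.execute("SELECT id, gender, amount FROM sales ORDER BY id").fetchall()
print(loaded)
```

Rejecting the row with the empty amount is one possible policy; filling in a default or routing it to a review queue would be equally valid choices.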
18. The Data Warehousing Models
•Enterprise Warehouse
• collects all the information about subjects spanning entire
organization
•Data Mart
• a subset of corporate-wide data that is of value to a specific
group of users
• its scope is confined to specific, selected groups, such as
marketing data mart
• independent vs. dependent (sourced directly from the warehouse) data
marts
•Virtual warehouse
• a set of views over operational databases
• only some summary views are materialized
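The difference between a plain view and a materialized summary in a virtual warehouse can be shown with a small sketch. It uses Python's sqlite3 module; the `orders` table and view names are hypothetical. SQLite has no materialized views, so a table filled from the view stands in for one here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'North', 100.0), (2, 'North', 50.0), (3, 'South', 70.0);

    -- Virtual warehouse: a view computed from operational data at query time
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

    -- Stand-in for a materialized summary view: a stored copy of the result
    CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region;
""")

# A new operational transaction arrives after materialization.
conn.execute("INSERT INTO orders VALUES (4, 'South', 30.0)")

live = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
stored = conn.execute(
    "SELECT region, total FROM sales_by_region_mat ORDER BY region").fetchall()
print(live)    # reflects the new order
print(stored)  # stale until refreshed
```

The view always reflects current operational data but is recomputed on every query, while the stored summary is fast to read and needs a refresh step.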
19. Physical Structure of Data Warehouse
•There are three basic architectures for constructing a
data warehouse:
• Centralized
• Federated (distributed)
• Tiered (distributed)
•The data warehouse is distributed for: load
balancing, scalability and higher availability
20. Physical Structure of Data Warehouse (Contd..)
•Federated: the logical data warehouse is only virtual.
•Tiered: the central data warehouse is physical.
•There exist local data marts on different tiers
which store copies or summarization of the
previous tier.
21. Data Processing Models
•There are two basic data processing models:
• OLTP (On-Line Transaction Processing)
• Describes processing at operational sites
• aim is reliable and efficient processing of a large number
of transactions and ensuring data consistency.
• OLAP (On-Line Analytical Processing)
• Describes processing at warehouse
• aim is efficient multidimensional processing of large data
volumes.
22. OLTP vs. OLAP
                     OLTP                          OLAP
users                clerk, IT professional        knowledge worker
function             day-to-day operations         decision support
DB design            application-oriented          subject-oriented
data                 current, up-to-date,          historical, summarized,
                     detailed, flat relational,    multidimensional,
                     isolated                      integrated, consolidated
usage                repetitive                    ad-hoc
access               read/write,                   lots of scans
                     index/hash on primary key
unit of work         short, simple transaction     complex query
# records accessed   tens                          millions
# users              thousands                     hundreds
DB size              100MB-GB                      100GB-TB
metric               transaction throughput        query throughput, response
23. OLAP
•Main goal: support ad-hoc but complex querying
performed by business analysts
•Interactive process of creating, managing, analyzing
and reporting on data
•Extends spreadsheet-like analysis to work with huge
amounts of data in a data warehouse
•Data exploration and aggregation in various ways
•Typical applications include assessing the effectiveness
of a marketing campaign, product sales forecasting, and
spotting trends
24. OLAP (Contd..)
•Allows a sophisticated user to analyse data using complex,
multi-dimensional views
•Place key performance indicators (measures) into context
(dimensions)
• Measures are pre-aggregated
• Data retrieval is significantly faster
•The proposed cube is made available to business analysts,
who can browse the data using a variety of tools for ad hoc,
interactive and analytical processing
25. OLAP Server Architectures
•Relational OLAP (ROLAP):
• Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middleware to support missing pieces
• Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
• Greater scalability
• schema design: Star, Snowflake, Fact Constellation
•Multidimensional OLAP (MOLAP):
• Array based multidimensional storage engine (sparse matrix
techniques)
• Fast indexing to pre-computed summarized data
• Schema design: Cube
•Hybrid OLAP (HOLAP):
• User flexibility - low level: relational, high level: array
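The array-based storage idea behind MOLAP can be sketched in plain Python. This is a simplified, dense illustration under made-up dimension members and facts; real MOLAP engines use sparse-matrix techniques and far richer indexing.

```python
# Hypothetical dimension members; each one maps to an array position.
products = ["p1", "p2"]
regions = ["North", "South"]
p_idx = {p: i for i, p in enumerate(products)}
r_idx = {r: i for i, r in enumerate(regions)}

# Dense 2-D array of one measure (sales), addressed by position.
cube = [[0.0] * len(regions) for _ in products]
facts = [("p1", "North", 10.0), ("p1", "South", 5.0), ("p2", "North", 7.0)]
for product, region, amount in facts:
    cube[p_idx[product]][r_idx[region]] += amount

# Summaries computed once, ahead of query time.
total_by_product = [sum(row) for row in cube]
total_by_region = [sum(col) for col in zip(*cube)]
grand_total = sum(total_by_product)

def sales(product=None, region=None):
    """Look up a cell or a pre-computed total; None means 'all members'."""
    if product is None and region is None:
        return grand_total
    if region is None:
        return total_by_product[p_idx[product]]
    if product is None:
        return total_by_region[r_idx[region]]
    return cube[p_idx[product]][r_idx[region]]

print(sales("p1"), sales(region="North"), sales())
```

Every query is answered by index arithmetic on stored values rather than by scanning and joining rows, which is the trade-off the slide contrasts with ROLAP.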
26. ROLAP
•Special schema design: snowflake
•Special indexes: bitmap, multi-table join
•Proven technology (relational models, DBMS)
• Tend to outperform specialized MDDB especially
on large data sets
•Products
• IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
27. Measures and Dimensions
•Measures: key performance indicators that you want to
evaluate
• Typically numerical, including volume, sales and cost
• A rule of thumb: if a number makes business sense
when aggregated, then it is a measure
• Examples
• Aggregate daily volume to month, quarter and year
• Aggregating telephone numbers would not make sense,
so they are not measures
• Affects what should be stored in the data warehouse
28. Measures and Dimensions (Contd..)
•Dimensions: categories of data analysis
• Typical dimensions include product, time,
region
• A rule of thumb: when a report is requested
“by” something, that something is usually a
dimension
• Example
• Sales report: view sales by month, by region
• Two dimensions needed are time and region
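The two rules of thumb can be tried out on a small table. The sketch below uses Python's sqlite3 module; the `sales` table, its columns, and the rows are hypothetical. The measure (`amount`) is summed, the dimensions (`month`, `region`) appear in GROUP BY, and the phone column is deliberately left out of the aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, region TEXT, phone TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("2024-01", "North", "555-0101", 100.0),
    ("2024-01", "North", "555-0102", 40.0),
    ("2024-01", "South", "555-0103", 60.0),
    ("2024-02", "North", "555-0101", 25.0),
])

# "Sales by month, by region": the two "by" terms become the GROUP BY columns.
report = conn.execute(
    "SELECT month, region, SUM(amount) FROM sales "
    "GROUP BY month, region ORDER BY month, region"
).fetchall()
for row in report:
    print(row)
```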
31. Star Schema
•Fact table
•Dimension tables
•Measures
•A single fact table and for each dimension one dimension table
•Does not capture hierarchies directly
32. Example - Star Schema
[Figure: star schema with one fact table and four dimension tables]
•Sales Fact Table: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales
•time: time_key, day, day_of_the_week, month, quarter, year
•item: item_key, item_name, brand, type, supplier_type
•branch: branch_key, branch_name, branch_type
•location: location_key, street, city, province_or_street, country
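A cut-down version of the star schema can be built and queried to show how the fact table joins to its dimensions. This sketch uses Python's sqlite3 module and keeps only the time and item dimensions; the table name `time_dim` (instead of `time`), the reduced column set, and all data values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
                           month INTEGER, quarter TEXT, year INTEGER);
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT);
    -- Fact table: one foreign key per dimension plus the measures
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER REFERENCES item(item_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );
    INSERT INTO time_dim VALUES (1, 15, 1, 'Q1', 2024), (2, 20, 4, 'Q2', 2024);
    INSERT INTO item VALUES (10, 'Laptop', 'Acme', 'computer'),
                            (11, 'Mouse', 'Acme', 'accessory');
    INSERT INTO sales_fact VALUES (1, 10, 2, 2000.0), (1, 11, 5, 100.0),
                                  (2, 10, 1, 1000.0);
""")

# One join per dimension used in the report, then aggregate the measure.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON t.time_key = f.time_key
    JOIN item i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
for row in rows:
    print(row)
```

Each dimension is exactly one join away from the fact table, which is what keeps star-schema queries simple.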
34. Example of Snowflake Schema
[Figure: snowflake schema; the item and location dimensions are normalized]
•Sales Fact Table: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales
•time: time_key, day, day_of_the_week, month, quarter, year
•item: item_key, item_name, brand, type, supplier_key
•supplier: supplier_key, supplier_type
•branch: branch_key, branch_name, branch_type
•location: location_key, street, city_key
•city: city_key, city, province_or_street, country
•Represents the dimensional hierarchy directly by normalizing tables.
•Easy to maintain and saves storage.
35. Example of Fact Constellation
[Figure: two fact tables sharing dimension tables]
•Sales Fact Table: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales
•Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
• Measures: dollars_cost, units_shipped
•time: time_key, day, day_of_the_week, month, quarter, year
•item: item_key, item_name, brand, type, supplier_type
•branch: branch_key, branch_name, branch_type
•location: location_key, street, city, province_or_street, country
•shipper: shipper_key, shipper_name, location_key, shipper_type
•Multiple fact tables that share many dimension tables.
36. Aggregates
• Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
Result on the slide's example data: 81
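38. Aggregates (Contd..)
• Add up amounts by day, product
In SQL: SELECT date, prodId, sum(amt) FROM
SALE
GROUP BY date, prodId
• Rollup: move to a coarser grouping (e.g. by day only)
• Drill-down: move to a finer grouping (e.g. by day and product)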
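The aggregate queries can be run end to end with Python's sqlite3 module. The `SALE` table layout (`prodId`, `storeId`, `date`, `amt`) and the six rows below are assumptions made up for this sketch. Grouping by `date, prodId` is the drilled-down view, and grouping by `date` alone is the rolled-up view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALE (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO SALE VALUES (?, ?, ?, ?)", [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
])

# Add up amounts for day 1.
day1_total = conn.execute("SELECT sum(amt) FROM SALE WHERE date = 1").fetchone()[0]

# Drill-down: amounts by day and product.
by_day_product = conn.execute(
    "SELECT date, prodId, sum(amt) FROM SALE "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Rollup: amounts by day only.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM SALE GROUP BY date ORDER BY date"
).fetchall()

print(day1_total)
print(by_day_product)
print(by_day)
```

Rolling up simply drops a column from GROUP BY, so each rolled-up total equals the sum of the drilled-down rows beneath it.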
39. Points to be noticed about ROLAP
•Defines complex, multi-dimensional data with simple model
•Reduces the number of joins a query has to process
•Allows the data warehouse to evolve with relatively low
maintenance
•Can contain both detailed and summarized data.
•ROLAP is based on familiar, proven, and already selected
technologies.
•BUT: it relies on SQL for multi-dimensional manipulation of calculations.