2. 2
What is Data Mining?
Data mining is the process of identifying
valid, novel, useful and understandable
patterns in large amount of data.
Also known as KDD (Knowledge Discovery in
Databases).
“We’re drowning in information, but starving for
knowledge.” (John Naisbett)
3. 3
What is a Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support methods that is maintained
separately from the organization’s operational
database
Support information processing by providing a solid
platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in support
of management’s decision-making process.”—W. H.
Inmon
Data warehousing:
The process of constructing and using data
warehouses
4. 4
Data Warehouse—Subject-Oriented
Organized around major subjects, such as customer,
product, sales.
Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.
Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process.
5. 5
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous
data sources
relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques are
applied.
Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources.
6. 6
Data Warehouse—Time Variant
The time horizon for the data warehouse is significantly
longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years…)
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not
contain “time element”.
7. 7
Data Warehouse—Non-Volatile
A physically separate store of data transformed from the
operational environment.
Operational update of data does not occur in the data
warehouse environment.
Does not require daily transaction processing,
recovery, and concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data.
8. 8
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
9. 9
OLTP vs. OLAP
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date
detailed, flat relational
isolated
historical,
summarized, multidimensional
integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
10. 10
Conceptual Modeling of
Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a
set of dimension tables
Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to star.
11. When compared star with snowflake model,star
model is the best one,but the snowflake is the
normalized form to reduce redundancies.
-easy to maintain.
-save storage space.
-reduce the effectiveness of browsing.
-More joins will be needed to execute the query.
11
12. 12
Example of Star Schema
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city
province_or_street
country
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
Brand_type
supplier_type
item
branch_key
branch_name
branch_type
branch
13. 13
Example of Snowflake Schema
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_key
item
branch_key
branch_name
branch_type
branch
supplier_key
supplier_type
supplier
city_key
city
province_or_street
country
city
14. 14
A Concept Hierarchy: Dimension (location)
all
Europe North_America
MexicoCanadaSpainGermany
Vancouver
M. WindL. Chan
...
......
... ...
...
all
region
office
country
TorontoFrankfurtcity
15. 15
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Example: street<city<state<country.
Aggregates the data by ascending the location hierarchy from
the level city to the level country.
Dimension reduction –one or more dimensions are removed
From the given cube.
Drill down (roll down): reverse of roll-up
Navigate from less detailed data to more detailed data, or
introducing new dimensions.
17. Cont..
Slice and dice:
Slice operation performs selection on one dimension of a given
cube.
Example:time=q1.
Dice operation performs selection on two or more operations.
Example:location=q1 or q2, time= t1 or t2.
Pivot (rotate):
Visualization operation that rotates the data axes in view in
order to provide an alternative presentation of data.
17
18. Data warehouse architecture
Steps:
1.top-down view
-select relevant information.
-information matches current and future
business needs.
2.Data source view
-information being collected,stored
managed by operating systems.
18
19. -information may be documented at various
levels of detail and accuracy.
3.Data warehouse view
-fact table and dimension table.
-represents the information that is stored
inside the data warehouse.
4.Business query view
-perspective of data in warehouse from view
point of end user.
19
20. Process of data warehouse design
1.top-down approach
-planning and designing
-technology mature and well known
-business problem solved clearly and
well understood.
2.Bottom-up approach
-experiments and prototypes.
-business modeling and technology
development.
20
21. 21
Data Warehouse Design Process
Choose the grain (atomic level of data) of the business
process
Choose a business process to model, e.g., orders,
invoices, etc.
Choose the dimensions that will apply to each fact table
record
Choose the measure that will populate each fact table
record
22. Three tier architecture
Back end tools-feed data into bottom tier.
These tools perform data
extraction,clean,transform,load,refresh.
Data extracted using application program
interface called gateways,like JDBC,ODBC
and OLEDB.
22
24. 24
Recommended approach
Enterprise warehouse
collects all of the information about subjects spanning
the entire organization
Data Mart
a subset of corporate-wide data that is of value to a
specific groups of users. Its scope is confined to
specific, selected groups, such as marketing data mart
Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may be
materialized
25. 25
Data Warehouse Development:
A Recommended Approach
Define a high-level corporate data model
Data
Mart
Data
Mart
Distributed
Data Marts
Multi-Tier Data
Warehouse
Enterprise
Data
Warehouse
Model refinementModel refinement
26. 26
Types of OLAP
Relational OLAP (ROLAP)
Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware to support query language
extension.
greater scalability
Multidimensional OLAP (MOLAP)
Array-based multidimensional storage engine
Handle sparse and dense datasets.
fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)=ROLAP+MOLAP
User flexibility=scalable + fast computation.
Mysql 2000–example.
27. 27
Metadata Repository
Meta data is the data defining warehouse objects. It has the following
kinds
Description of the structure of the warehouse
schema, view, dimensions, hierarchies, derived data defn, data mart
locations and contents
Operational meta-data
data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring information
(warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization(aggregation, reports..).
Business metadata(policy)
28. 28
Data Warehouse Back-End Tools and
Utilities
Data extraction:
get data from multiple, heterogeneous, and external
sources
Data cleaning:
detect errors in the data and rectify them when
possible
Data transformation:
convert data from legacy or host format to warehouse
format
Load:
sort, summarize, consolidate, compute views, check
integrity, and partitions
Refresh
propagate the updates from the data sources to the
warehouse
29. 29
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and reporting
using crosstables, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
30. OLAM ARCHITECTURE
OLAM and OLAP servers both accept on-line
queries.
Via graphical user interface and work with data
cube via cube API.
Metadata data –access of the data cube .
Data cube –constructed by accessing and
integrating multiple database via MDDB.
Filtering a datawarehouse via database API.
30
31. Cont..
OLAM –data mining tasks like
classification,prediction,clustering,concept
description..
Sophisticated than an OLAP server.
High quality of data.
31
32. 32
An OLAM Architecture
Data
Warehouse
Meta Data
MDDB
OLAM
Engine
OLAP
Engine
User GUI API
Data Cube API
Database API
Data cleaning
Data integration
Layer3
OLAP/OLAM
Layer2
MDDB
Layer1
Data
Repository
Layer4
User Interface
Filtering&Integration Filtering
Databases
Mining query Mining result