1. Building the Data Warehouse by
Inmon
Chapter 5: The Data Warehouse and Technology
http://it-slideshares.blogspot.com/
2. 5.0 Overview
The data warehouse requires a simpler set of technological
features than its operational
predecessors:
◦ Online updating: not needed.
◦ Locking, integrity: needs are minimal.
◦ Teleprocessing interface: only a very basic one is
required.
This chapter outlines some of the
technological requirements for the data
warehouse.
3. MANAGING LARGE
AMOUNTS OF DATA
1. Manage Volumes
2. Manage multiple
media technology
3. Index and
monitoring data
4. Interfaces to retrieve
and pass data
4. Managing Multiple Media
Following is a hierarchy of storage of data in terms
of speed of access and cost of storage:
Medium           Speed of access   Cost of storage
Main memory      Very fast         Very expensive
Expanded memory  Very fast         Expensive
Cache            Very fast         Expensive
DASD             Fast              Moderate
Magnetic tape    Not fast          Not expensive
Near line        Not fast*         Not expensive
Optical disk     Not slow          Not expensive
Fiche            Slow              Cheap
*Not fast to find first record sought; very fast to find all other records in the block.
5. Indexing and Monitoring Data
Monitoring data warehouse data
determines such factors as the following:
◦ If a reorganization needs to be done
◦ If an index is poorly structured
◦ If too much or not enough data is in overflow
◦ The statistical composition of the access of
the data
◦ Available remaining space
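The monitoring factors above can be sketched as a simple health check. This is only an illustration: the `TableStats` fields, the `monitor` function, and every threshold value are hypothetical assumptions, not figures from the chapter.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical statistics a monitor might collect for one table."""
    rows_in_primary: int
    rows_in_overflow: int
    index_levels: int
    bytes_used: int
    bytes_allocated: int


def monitor(stats, max_overflow_ratio=0.10, max_index_levels=4, min_free_ratio=0.15):
    """Return a list of findings; all thresholds are assumed, illustrative values."""
    total_rows = stats.rows_in_primary + stats.rows_in_overflow
    overflow_ratio = stats.rows_in_overflow / total_rows if total_rows else 0.0
    if stats.bytes_allocated:
        free_ratio = 1 - stats.bytes_used / stats.bytes_allocated
    else:
        free_ratio = 0.0

    findings = []
    if overflow_ratio > max_overflow_ratio:
        findings.append("reorganize: too much data in overflow")
    if stats.index_levels > max_index_levels:
        findings.append("index: too many levels")
    if free_ratio < min_free_ratio:
        findings.append("space: low remaining space")
    return findings


busy = TableStats(rows_in_primary=900, rows_in_overflow=300,
                  index_levels=3, bytes_used=95, bytes_allocated=100)
healthy = TableStats(rows_in_primary=1000, rows_in_overflow=10,
                     index_levels=3, bytes_used=50, bytes_allocated=100)
print(monitor(busy))
print(monitor(healthy))
```

In practice the statistics would come from the DBMS catalog or a monitoring utility rather than hand-built records.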
6. Interfaces to Many Technologies
The interface to different technologies requires
several considerations:
Does the data pass from one DBMS to another
easily?
Does it pass from one operating system to another
easily?
Does it change its basic format in passage (EBCDIC,
ASCII, and so forth)?
Can passage into multidimensional processing be
done easily?
Can selected increments of data, such as those identified by
changed data capture (CDC), be passed rather than entire
tables?
Is the context of data lost in translation as data is
moved to other environments?
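Two of the considerations above, passing only changed increments and converting the basic format in passage, can be sketched in a few lines. This is a minimal sketch under stated assumptions: `changed_increment` compares two in-memory snapshots keyed by primary key, which is only one possible way to find changes, and `ebcdic_to_ascii` assumes the source uses the `cp037` EBCDIC code page. Both function names and the sample rows are hypothetical.

```python
def changed_increment(previous, current):
    """Compare two snapshots (dicts keyed by primary key) and keep only the changes."""
    inserts = {k: row for k, row in current.items() if k not in previous}
    updates = {k: row for k, row in current.items()
               if k in previous and previous[k] != row}
    deletes = sorted(k for k in previous if k not in current)
    return {"insert": inserts, "update": updates, "delete": deletes}


def ebcdic_to_ascii(raw):
    """Convert EBCDIC bytes (code page cp037 assumed) to ASCII bytes."""
    return raw.decode("cp037").encode("ascii")


yesterday = {1: ("Smith", "NY"), 2: ("Jones", "CA"), 3: ("Lee", "TX")}
today = {1: ("Smith", "NY"), 2: ("Jones", "WA"), 4: ("Kim", "OR")}

# Only three rows travel to the warehouse instead of the whole table.
delta = changed_increment(yesterday, today)
print(delta)

# Round trip through EBCDIC to show the format change in passage.
print(ebcdic_to_ascii("HELLO".encode("cp037")))
```

Snapshot comparison is simple but requires reading both full copies; reading a log or using DBMS change-tracking features avoids that cost.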
7. PROGRAMMER OR
DESIGNER CONTROL OF
DATA PLACEMENT
Place data at block/page
level
Manage data in parallel
Solid metadata control
Rich Language Interface
8. Parallel Storage and Management of
Data
Metadata Management
Data warehouse table structures
Data warehouse table attribution
Data warehouse source data (the system of
record)
Mapping from the system of record to the data
warehouse
Data model specification
Extract logging
Common routines for access of data
Definitions and/or descriptions of data
Relationships of one unit of data to another
9. Language Interface
Typically, the language interface to the data
warehouse should do the following:
◦ Be able to access data a set at a time
◦ Be able to access data a record at a time
◦ Specifically ensure that one or more indexes
will be used in the satisfaction of a query
◦ Have an SQL interface
◦ Be able to insert, delete, or update data
10. EFFICIENT LOADING OF
DATA
Load efficiently
Use indexes efficiently
Store data in a compact
way
Support compound
keys
11. Efficient Index Utilization
Technology can support efficient index access in
several ways:
◦ Using bit maps
◦ Having multileveled indexes
◦ Storing all or parts of an index in main memory
◦ Compacting the index entries when the order of
the data being indexed allows such compaction
◦ Creating selective indexes and range indexes
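The first technique in the list, bit maps, can be illustrated with a small sketch. It assumes a low-cardinality column and uses a Python integer as the bitmap, one bit per row position; the function names and sample rows are hypothetical, and production DBMSs typically use compressed bitmap structures instead.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an integer whose set bits mark matching rows."""
    index = {}
    for position, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << position)
    return index


def matching_positions(bitmap):
    """List the row positions whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


rows = [
    {"region": "east", "status": "open"},
    {"region": "west", "status": "open"},
    {"region": "east", "status": "closed"},
    {"region": "east", "status": "open"},
]

by_region = build_bitmap_index(rows, "region")
by_status = build_bitmap_index(rows, "status")

# A two-condition query becomes a single bitwise AND of two bitmaps.
hits = matching_positions(by_region["east"] & by_status["open"])
print(hits)
```

The query is answered from the index alone; the rows themselves are read only for the positions that survive the AND.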
12. Compaction of Data
Compaction helps manage large amounts of data.
The programmer gets the most out of a given
I/O when data is stored compactly.
13. Compound Keys
Compound keys occur throughout the data warehouse because of:
The time variancy of data warehouse
data.
Key-foreign key relationships, which are quite
common in the atomic data.
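A short sketch shows why time variancy leads to compound keys: the same business key appears once per snapshot, so the key must include a time element. The table contents, the `(account, snapshot date)` key layout, and the `as_of` helper are hypothetical illustrations.

```python
from datetime import date

# Compound key: (business key, snapshot date). Values are hypothetical balances.
balances = {
    ("ACCT-1", date(2023, 1, 31)): 100.0,
    ("ACCT-1", date(2023, 2, 28)): 140.0,
    ("ACCT-2", date(2023, 1, 31)): 75.0,
}


def as_of(table, business_key, when):
    """Return the value from the latest snapshot on or before `when`, or None."""
    candidates = [snap for (key, snap) in table if key == business_key and snap <= when]
    if not candidates:
        return None
    return table[(business_key, max(candidates))]


print(as_of(balances, "ACCT-1", date(2023, 2, 15)))
print(as_of(balances, "ACCT-1", date(2023, 3, 1)))
print(as_of(balances, "ACCT-2", date(2023, 1, 1)))
```

Without the date in the key, the second snapshot for `ACCT-1` would overwrite the first and the history would be lost.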
15. Lock Management
Ensures that two or more people are not
updating the same record at the same
time.
The ability to turn the lock manager off and on is
necessary.
18. Other Technological Features
Some of those features include the
following:
◦ Transaction integrity
◦ High-speed buffering
◦ Row- or page-level locking
◦ Referential integrity
◦ VIEWs of data
◦ Partial block loading
19. DBMS Types and the Data
Warehouse
Data warehouses manage massive amounts of data
because:
Granular, atomic detail
Historical information
Summary as well as detailed data
Because record-level, transaction-based updates are a
regular feature of the general-purpose DBMS, it must
offer facilities for the following:
Locking
COMMITs
Checkpoints
Log tape processing
Deadlock
Backout
20. Changing DBMS Technology
Such a change may be in order for several reasons:
DBMS technologies may be available now that were not an option earlier.
The size of the warehouse has grown.
Use of the warehouse has escalated and changed.
The basic DBMS decision must be revisited from
time to time.
Should the decision be made to go to a new DBMS
technology, what are the considerations?
Will the new DBMS technology meet the foreseeable
requirements?
How will the conversion from the older DBMS
technology to the newer DBMS technology be done?
How will the transformation programs be converted?
22. Multidimensional DBMS and the
Data Warehouse con’t
The multidimensional DBMS:
1. holds at least an order of magnitude less data.
2. is geared for very heavy and unpredictable access and analysis of data.
3. holds a much shorter time horizon of data.
4. allows unfettered access.
The data warehouse:
1. holds massive amounts of data.
2. is geared for a limited amount of flexible access.
3. contains data with a very lengthy time horizon (from 5 to 10 years).
4. allows analysts to access its data in a constrained fashion.
5. Rather than the data warehouse being housed in a multidimensional DBMS, the two enjoy a complementary relationship.
23. Multidimensional DBMS and the
Data Warehouse con’t
Following is the relational foundation for
multidimensional DBMS data marts:
Strengths:
Can support a lot of data.
Can support dynamic joining of data.
Has proven technology.
Is capable of supporting general-purpose update
processing.
If there is no known pattern of usage of data,
then the relational structure is as good as any
other.
Weaknesses:
Has performance that is less than optimal.
Cannot be purely optimized for access processing.
24. Multidimensional DBMS and the
Data Warehouse con’t
Following is the cube foundation for multidimensional
DBMS data marts:
Strengths:
Performance that is optimal for DSS processing.
Can be optimized for very fast access of data.
If pattern of access of data is known, then the structure of
data can be optimized.
Can easily be sliced and diced.
Can be examined in many ways.
Weaknesses:
Cannot handle nearly as much data as a standard
relational format.
Does not support general-purpose update processing.
May take a long time to load.
If access is desired on a path not supported by the design
of the data, the structure is not flexible.
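The trade-off described above can be sketched with a tiny pre-aggregated cube. This is an illustration only: `build_cube`, `slice_cube`, and the sample facts are hypothetical, and a real multidimensional DBMS uses far more elaborate storage.

```python
from collections import defaultdict


def build_cube(facts, dimensions, measure):
    """Pre-aggregate `measure` along the chosen dimensions (the known access pattern)."""
    cube = defaultdict(float)
    for fact in facts:
        cube[tuple(fact[d] for d in dimensions)] += fact[measure]
    return dict(cube)


def slice_cube(cube, dimensions, **fixed):
    """Sum the cells matching the fixed dimension values.

    Raises ValueError if a requested dimension was not designed into the cube.
    """
    positions = {name: dimensions.index(name) for name in fixed}
    return sum(value for key, value in cube.items()
               if all(key[positions[name]] == wanted for name, wanted in fixed.items()))


facts = [
    {"region": "east", "product": "A", "month": "Jan", "sales": 10.0},
    {"region": "west", "product": "A", "month": "Jan", "sales": 5.0},
    {"region": "east", "product": "B", "month": "Feb", "sales": 7.0},
]
dims = ("region", "product")
cube = build_cube(facts, dims, "sales")

print(slice_cube(cube, dims, region="east"))
print(slice_cube(cube, dims, product="A"))

try:
    slice_cube(cube, dims, month="Jan")  # month was not part of the cube design
except ValueError:
    print("month is not a supported access path")
```

Slicing along a designed dimension touches only the small aggregated structure, while the unsupported `month` path fails, which mirrors the last weakness in the list.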
28. Data Warehousing across Multiple
Storage Media
A large amount of data is spread across
more than one storage medium.
◦ One processing environment is the DASD
environment where online, interactive
processing is done.
◦ The other processing environment is often a
tape or mass store environment
29. The Role of Metadata in the Data
Warehouse Environment
33. Three Types of Contextual
Information
Three levels of contextual information must be
managed:
Simple contextual information
Complex contextual information
External contextual information
Simple contextual information relates to the basic
structure of data itself, and includes such things
as these:
The structure of data
The encoding of data
The naming conventions used for data
The metrics describing the data, such as:
How much data there is
How fast the data is growing
What sectors of the data are growing
34. Three Types of Contextual
Information con’t
Complex contextual information addresses such aspects
of data as these:
◦ Product definitions
◦ Marketing territories
◦ Pricing
◦ Packaging
◦ Organization structure
◦ Distribution
35. Three Types of Contextual
Information con’t
Some examples of external contextual
information include the following:
Economic forecasts:
Inflation
Financial trends
Taxation
Economic growth
Political information
Competitive information
Technological advancements
Consumer demographic movements
36. Capturing and Managing Contextual
Information
Complex and external contextual types
of information are hard to capture and
quantify because they are so
unstructured.
37. Looking at the Past
Some of the shortcomings of past attempts to manage contextual information are as follows:
The information management attempts
were aimed at the information systems
developer, not the end user.
Attempts at contextual management
were passive.
Attempts at contextual information
management were in many cases
removed from the development effort.
Attempts to manage contextual
38. Refreshing the Data Warehouse
Reading a log tape is no small matter,
however. Many obstacles are in the way,
including the following:
The log tape contains much extraneous
data.
The log tape format is often arcane.
The log tape contains spanned records.
The log tape often contains addresses
instead of data values.
The log tape reflects the idiosyncrasies of
39. Testing
It is very unusual to find a similar test
environment in the world of the data
warehouse, for the following reasons:
Data warehouses are so large that a
corporation has a hard time justifying one
of them, much less two of them.
The nature of the development life cycle
for the data warehouse is iterative.
For the most part, programs are run in a
heuristic manner, not in a repetitive
40. Summary
Some technological features are
required:
Robust language interface
Compound keys
Variable-length data
The abilities to do the following:
Manage large amounts of data
Manage data on diverse media
Easily index and monitor data
Interface with a wide number of technologies
Allow the programmer to place the data directly on the physical device
Store and access data in parallel
Have metadata control of the warehouse
Efficiently load the warehouse
Efficiently use indexes
Store data in a compact way
Support compound keys
Selectively turn off the lock manager
Do index-only processing
Quickly restore from bulk storage
41. Summary con’t
The data architect must recognize the
differences between a transaction-based
DBMS and a data warehouse-based
DBMS.
42. Summary con’t
Multidimensional OLAP technology is suited for
data mart processing and not data warehouse
processing.
When the data mart approach is used, many
problems become evident:
The number of extract programs grows large.
Each new multidimensional database must return to
the legacy operational environment for its own data.
There is no basis for reconciliation of differences in
analysis.
A tremendous amount of redundant data among
different multidimensional DBMS environments
exists.
43. Summary con’t
Metadata in the data warehouse
environment plays a very different role
than metadata in the operational legacy
environment.