Data Warehousing: A Perspective
by
Hemant Kirpekar                                                                                                                              7/15/2011



Introduction
    The Need for proper understanding of Data Warehousing
    The Key Issues
    The Definition of a Data Warehouse
    The Lifecycle of a Data Warehouse
    The Goals of a Data Warehouse

Why Data Warehousing is different from OLTP

E/R Modeling Vs Dimension Tables

Two Sample Data Warehouse Designs
    Designing a Product-Oriented Data Warehouse
    Designing a Customer-Oriented Data Warehouse

Mechanics of the Design
    Interviewing End-Users and DBAs
    Assembling the team
    Choosing Hardware/Software platforms
    Handling Aggregates
    Server-Side activities
    Client-Side activities

Conclusions

A Checklist for an Ideal Data Warehouse





Introduction


The need for proper understanding of Data Warehousing

The following is an extract from "Knowledge Asset Management and Corporate Memory", a White Paper
to be published on the WWW, possibly via the Hispacom site, in the third week of August 1996.
Data Warehousing may well leverage the rising tide of technologies that everyone will want or need;
however, the current trend in Data Warehousing marketing leaves a lot to be desired.
In many organizations there still exists an enormous divide that separates Information Technology from a
manager's need for Knowledge and Information. It is common currency that there is a whole host of
available tools and techniques for locating, scrubbing, sorting, storing, structuring, documenting,
processing and presenting information. Unfortunately, tools are tangible while business information and
knowledge are not, so the two tend to get confused.
So why do we still have this confusion? First consider how certain companies market Data Warehousing.
There are companies that sell database technologies, other companies that sell the platforms (ostensibly
consisting of an MPP or SMP architecture), some that sell technical Consultancy services, others meta-data
tools and services, and finally the business Consultancy services and the systems integrators - each
and every one with their own particular focus on the critical factors in the success of Data Warehousing
projects.
In the main, most RDBMS vendors seem to see Data Warehouse projects as a challenge to provide
greater performance, greater capacity and greater divergence. With this excuse, most RDBMS products
carry functionality that makes them about as truly "open" as a UNIVAC 90/30, i.e. no standards for View
Partitioning, Bit Mapped Indexing, Histograms, Object Partitioning, SQL query decomposition or SQL
evaluation strategies. This, however, is not really the important issue; the real issue is that some
vendors sell Data Warehousing as if it just provided a big dumping ground for massive amounts of data
with which users are allowed to do anything they like, whilst at the same time freeing up Operational
Systems from the need to support end-user informational requirements.
Some hardware vendors have a similar approach, i.e. a Data Warehouse platform must inherently have a
lot of disks, a lot of memory and a lot of CPUs. However, one of the most successful Data Warehouse
projects I have worked on used COMPAQ hardware, which provides an excellent cost/benefit ratio.
 Some Technical Consultancy Services providers tend to dwell on the performance aspects of Data
Warehousing. They see Data Warehousing as a technical challenge, rather than a business opportunity,
but the biggest performance payoffs will be brought about when there is a full understanding of how the
user wishes to use the information.




The Key Issues
Organizations are swimming in data. However, most will have to create new data with improved quality
to meet strategic business planning requirements.


So:
. How should IS plan for the mass of end user information demand?
. What vendors and tools will emerge to help IS build and maintain a data warehouse architecture?
. What strategies can users deploy to develop a successful data warehouse architecture?
. What technology breakthroughs will occur to empower knowledge workers and reduce operational
  data access requirements?
These are some of the key questions outlined by the Gartner Group in their 1995 report on Data
Warehousing.
I will try to answer some of these questions in this report.

The Definition of a Data Warehouse
A Data Warehouse is a:
          . subject-oriented
          . integrated
          . time-variant
          . non-volatile
collection of data in support of management decisions.
(W.H. Inmon, in "Building a Data Warehouse", Wiley 1996)
The data warehouse is oriented to the major subject areas of the corporation that have been defined in the
data model. Examples of subject areas are: customer, product, activity, policy, claim, account. The major
subject areas end up being physically implemented as a series of related tables in the data warehouse.
Personal Note: Could these be objects? No one to my knowledge has explored this possibility as yet.
The second salient characteristic of the data warehouse is that it is integrated. This is the most important
aspect of a data warehouse. The different design decisions that application designers have made over
the years show up in a thousand different ways. Generally, there is no application consistency in
encoding, naming conventions, physical attributes, measurements of attributes, key structure and physical
characteristics of the data. Each application has most likely been designed independently. As data is
entered into the data warehouse, inconsistencies at the application level are undone.
The third salient characteristic of the data warehouse is that it is time-variant. A 5 to 10 year time horizon
of data is normal for the data warehouse. Data warehouse data is a sophisticated series of snapshots, each
taken at one moment in time, and the key structure always contains some time element.
The last important characteristic of the data warehouse is that it is nonvolatile. Unlike operational data,
data warehouse data is loaded en masse and is then accessed. Update of the data does not occur in the data
warehouse environment.




The Lifecycle of a Data Warehouse
Data flows into the data warehouse from the operational environment. Usually a significant amount of
transformation of data occurs at the passage from the operational level to the data warehouse level.
Once the data ages, it passes from current detail to older detail. As the data is summarized, it passes from
current detail to lightly summarized data and then onto summarized data.
At some point in time data is purged from the warehouse. There are several ways in which this can be
made to happen:
. Data is added to a rolling summary file where the detail is lost.
. Data is transferred to a bulk medium from a high-performance medium such as DASD.
. Data is transferred from one level of the architecture to another.
. Data is actually purged from the system at the DBAs request.
The following diagram is from "Building a Data Warehouse" 2nd Ed, by W.H. Inmon, Wiley '96

[Figure: Structure of a Data Warehouse]
. highly summarized: monthly sales by product line ('81 - '92)
. lightly summarized (data mart): weekly sales by subproduct line ('84 - '92)
. current detail: sales detail (1990 - 1991), fed by transformation from the operational environment
. old detail: sales detail ('84 - '89)
. metadata: spans all levels of the warehouse




The Goals of a Data Warehouse
According to Ralph Kimball (founder of Red Brick Systems - A highly successful Data Warehouse
DBMS startup), the goals of a Data Warehouse are:

1. The data warehouse provides access to corporate or organizational data.
    Access means several things. Managers and analysts must be able to connect to the data warehouse
    from their personal computers, and this connection must be immediate, on demand, and with high
    performance. The tiniest queries must run in less than a second. The tools available must be easy to
    use, i.e. useful reports can be run with a one-button click and can be changed and rerun with two
    button clicks.

2. The data in the warehouse is consistent.
    Consistency means that when two people request sales figures for the Southeast Region for January
    they get the same number. Consistency means that when they ask for the definition of the "sales"
    data element, they get a useful answer that lets them know what they are fetching. Consistency also
    means that if yesterday's data has not been completely loaded, the analyst is warned that the data
    load is not complete and will not be complete till tomorrow.

3. The data in the warehouse can be combined by every possible measure of the
        business (i.e. slice & dice)
    This implies a very different organization from the E/R organization of typical Operational Data.
    This implies row headers and constraints, i.e. dimensions in a dimensional data model.

4. The data warehouse is not just data, but is also a set of tools to query, analyze, and to
        present information.
    The "back room" components, namely the hardware, the relational database software and the data
    itself are only about 60% of what is needed for a successful data warehouse implementation. The
    remaining 40% is the set of front-end tools that query, analyze and present the data. The "show me
    what is important" requirement needs all of these components.

5. The data warehouse is where used data is published.
    Data is not simply accumulated at a central point and let loose. It is assembled from a variety of
    information sources in the organization, cleaned up, quality assured, and then released only if it is fit
    for use. A data quality manager is critical for a data warehouse and plays a role similar to that of a
    magazine editor or a book publisher. He/she is responsible for the content and quality of the
    publication and is identified with the deliverable.

6. The quality of the data in the data warehouse is the driver of business reengineering.
    The best data in any company is the record of how much money someone else owes the company.
    Data quality goes downhill from there. The data warehouse cannot fix poor quality data but the
    inability of a data warehouse to be effective with poor quality data is the best driver for business
    reengineering efforts in an organization.





Why Data Warehousing is different from OLTP
On-line transaction processing is profoundly different from data warehousing. The users are different, the
data content is different, the data structures are different, the hardware is different, the software is
different, the administration is different, the management of the systems is different, and the daily
rhythms are different. The design techniques and design instincts appropriate for transaction processing
are inappropriate and even destructive for information warehousing.

OLTP Transactional Properties
In OLTP a transaction is defined by its ACID properties.
A Transaction is a user-defined sequence of instructions that maintains consistency
across a persistent set of values. It is a sequence of operations that is atomic with
respect to recovery.
To remain valid, a transaction must maintain its ACID properties.
Atomicity is the condition that, for a transaction to be valid, the effects of all its instructions must
be enforced, or none at all.
Consistency is a property of the persistent data that must be preserved by the execution of a complete
transaction.
Isolation states that the effect of running transactions concurrently must be that of
serializability, i.e. as if each of the transactions were run in isolation.
Durability is the ability of a transaction to preserve its effects once it has committed, in the presence of
media and system failures.
A serious data warehouse will often process only one transaction per day, but this transaction will contain
thousands or even millions of records. This kind of transaction has a special name in data warehousing. It
is called a production data load.
In a data warehouse, consistency is measured globally. We do not care about an individual transaction,
but we care enormously that the current load of new data is a full and consistent set of data. What we
care about is the consistent state of the system we started with before the production data load, and the
consistent state of the system we ended up with after a successful production data load. The most
practical frequency of this production data load is once per day, usually in the early hours of the morning.
So, instead of a microscopic perspective, we have a quality assurance manager's judgment of data
consistency.
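The all-or-nothing character of a production data load can be sketched in a few lines. This is a minimal illustration (table and data values are hypothetical; SQLite via Python stands in for a warehouse DBMS): the entire load is wrapped in a single transaction, so readers see either the prior consistent state or the fully loaded one, never a partial load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (product TEXT, day TEXT, units INTEGER)")

# The nightly "production data load": thousands or millions of records
# in real systems, three illustrative rows here.
new_records = [("Framis", "1995-01-0%d" % i, 10 * i) for i in range(1, 4)]

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", new_records)
except sqlite3.Error:
    pass  # a failed load leaves the warehouse in its prior consistent state

count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(count)  # 3 -- the whole load arrived, or none of it would have
```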
OLTP systems are driven by performance and reliability concerns. Users of a data warehouse almost
never deal with one account at a time, usually requiring hundreds or thousands of records to be searched
and compressed into a small answer set. Users of a data warehouse change the kinds of questions they ask
constantly. Although the templates of their requests may be similar, the impact of these queries on the
database system will vary wildly. Small single-table queries, called browses, need to be instantaneous
whereas large multitable queries, called join queries, are expected to run for seconds or minutes.
Reporting is the primary activity in a data warehouse. Users consume information in human-sized
chunks of one or two pages. "Blinking" (highlighted) numbers on a page can be clicked on to answer why
questions. In the report below, the negative values are blinking numbers.




Example of a Data Warehouse Report

Product        Region     Sales        Growth in    Sales as    Change in       Change in
                          This Month   Sales Vs     % of        Sales as %      Sales as %
                                       Last Month   Category    of Cat. Vs      of Cat. YTD Vs
                                                                Last Mt.        Last Yr YTD
Framis         Central    110          12%          31%         3%              7%
Framis         Eastern    179          -3%          28%         -1%             3%
Framis         Western    55           5%           44%         1%              5%
Total Framis              344          6%           33%         1%              5%
Widget         Central    66           2%           18%         2%              10%
Widget         Eastern    102          4%           12%         5%              13%
Widget         Western    39           -9%          9%          -1%             8%
Total Widget              207          1%           13%         4%              11%
Grand Total               551          4%           20%         2%              8%




The twinkling nature of OLTP databases (constant updates of new values) is the first kind of temporal
inconsistency that we avoid in data warehouses.
The second kind of temporal inconsistency in an OLTP database is the lack of explicit support for
correctly representing prior history. Although it is possible to keep history in an OLTP system, it is a
major burden on that system to correctly depict old history. We have a long series of transactions that
incrementally alter history and it is close to impossible to quickly reconstruct the snapshot of a business
at a specified point in time.
We make a data warehouse a specific time series. We move snapshots of the OLTP systems over to the
data warehouse as a series of data layers, like geologic layers. By bringing static snapshots to the
warehouse only on a regular basis, we solve both of the time representation problems we had on the
OLTP system. No updates during the day - so no twinkling. By storing snapshots, we represent prior
points in time correctly. This allows us to ask comparative queries easily. The snapshot is called the
production data extract, and we migrate this extract to the data warehouse system at regular time
intervals. This process gives rise to the two phases of the data warehouse: loading and querying.
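The "geologic layers" idea can be sketched directly: each production data extract is stamped with its snapshot date, earlier layers are never updated, and comparative queries across points in time become trivial. Table and column names below are illustrative assumptions, with SQLite via Python as a stand-in for a warehouse DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory_snapshot "
    "(snapshot_date TEXT, product TEXT, on_hand INTEGER)"
)

# Two nightly extracts; the earlier layer is only ever appended to, never changed.
conn.executemany(
    "INSERT INTO inventory_snapshot VALUES (?, ?, ?)",
    [("1995-03-01", "Framis", 120),
     ("1995-03-02", "Framis", 95)],
)

# Comparative query: how did on-hand stock change between the two snapshots?
rows = conn.execute(
    "SELECT snapshot_date, on_hand FROM inventory_snapshot "
    "WHERE product = 'Framis' ORDER BY snapshot_date"
).fetchall()
change = rows[-1][1] - rows[0][1]
print(change)  # -25
```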





E/R Modeling Vs Dimension Tables
Entity/Relationship modeling seeks to drive all the redundancy out of the data. If there is no redundancy
in the data, then a transaction that changes any data only needs to touch the database in one place. This is
the secret behind the phenomenal improvement in transaction processing speed since the early 80s. E/R
modeling works by dividing the data into many discrete entities, each of which becomes a table in the
OLTP database. A simple E/R diagram looks like the map of a large metropolitan area where the entities
are the cities and the relationships are the connecting freeways. This diagram is very symmetric. For
queries that span many records or many tables, E/R diagrams are too complex for users to
understand and too complex for software to navigate.
SO, E/R MODELS CANNOT BE USED AS THE BASIS FOR ENTERPRISE DATA
WAREHOUSES.
In data warehousing, 80% of the queries are single-table browses, and 20% are multitable joins. This
allows for a tremendously simple data structure. This structure is the dimensional model or the star join
schema.
This name is chosen because the E/R diagram looks like a star with one large central table called the fact
table and a set of smaller attendant tables called dimensional tables, displayed in a radial pattern around
the fact table. This structure is very asymmetric. The fact table in the schema is the only one that
participates in multiple joins with the dimension tables. The dimension tables all have a single join to this
central fact table.




[Figure: A typical dimensional model]
Sales Fact: time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost
Time Dimension: time_key, day_of_week, month, quarter, year, holiday_flag
Product Dimension: product_key, description, brand, category
Store Dimension: store_key, store_name, address, floor_plan_type



The above is an example of a star schema for a typical grocery store chain. The Sales Fact table contains
daily item totals of all the products sold. This is called the grain of the fact table. Each record in the fact
table represents the total sales of a specific product in a market on a day. Any other combination
generates a different record in the fact table. The fact table of a typical grocery retailer with 500
stores, each carrying 50,000 products on the shelves and measuring a daily item movement over 2
years could approach 1 Billion rows. However, using a high-performance server and an industrial-
strength dbms we can store and query such a large fact table with good performance.
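The star schema in the figure can be sketched as DDL. This is a minimal illustration (SQLite via Python is used purely for convenience; as the text notes, a production warehouse would run on an industrial-strength DBMS), with the fact table's composite primary key built from the foreign keys into each dimension:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dimension (
    time_key     INTEGER PRIMARY KEY,
    day_of_week  TEXT,
    month        TEXT,
    quarter      TEXT,
    year         INTEGER,
    holiday_flag INTEGER
);
CREATE TABLE product_dimension (
    product_key  INTEGER PRIMARY KEY,
    description  TEXT,
    brand        TEXT,
    category     TEXT
);
CREATE TABLE store_dimension (
    store_key       INTEGER PRIMARY KEY,
    store_name      TEXT,
    address         TEXT,
    floor_plan_type TEXT
);
-- The fact table key is a composite of concatenated foreign keys.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dimension,
    product_key  INTEGER REFERENCES product_dimension,
    store_key    INTEGER REFERENCES store_dimension,
    dollars_sold REAL,
    units_sold   INTEGER,
    dollars_cost REAL,
    PRIMARY KEY (time_key, product_key, store_key)
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```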




The fact table is where the numerical measurements of the business are stored. These measurements are
taken at the intersection of all the dimensions. The best and most useful facts are continuously valued
and additive. If there is no product activity on a given day in a market, we leave the record out of the
database. Fact tables therefore are always sparse. Fact tables can also contain semiadditive facts, which
can be added only along some of the dimensions, and nonadditive facts, which cannot be added at all. The
only interesting thing one can do with nonadditive facts in a table with billions of records is to get a count.
The dimension tables are where the textual descriptions of the dimensions of the business are stored.
Here the best attributes are textual, discrete and used as the source of constraints and row headers in
the user's answer set.
Typical attributes for a product would include a short description (10 to 15 characters), a long description
(30 to 60 characters), the brand name, the category name, the packaging type, and the size. Occasionally,
it may be possible to model an attribute either as a fact or as a dimension. In such a case it is the
designer's choice.
A key role for dimension table attributes is to serve as the source of constraints in a query or to
serve as row headers in the user's answer set.
e.g.
         Brand               Dollar Sales               Unit Sales
         Axon                780                        263
         Framis              1044                       509
         Widget              213                        444
         Zapper              95                         39


A standard SQL Query example for data warehousing could be:
select p.brand, sum(f.dollars), sum(f.units)            <=== select list
from salesfact f, product p, time t                     <=== from clause with aliases f, p, t
where f.timekey = t.timekey                             <=== join constraint
and f.productkey = p.productkey                         <=== join constraint
and t.quarter = '1 Q 1995'                              <=== application constraint
group by p.brand                                        <=== group by clause
order by p.brand                                        <=== order by clause
Virtually every query like this one contains row headers and aggregated facts in the select list. The row
headers are not summed, the aggregated facts are.
The from clause lists the tables involved in the join.
The join constraints join on the primary key from the dimension table and the foreign key in the fact
table. Referential integrity is extremely important in data warehousing and is enforced by the data base
management system.
This fact table key is a composite key consisting of concatenated foreign keys.
In OLTP applications joins are usually among artificially generated numeric keys that have little
administrative significance elsewhere in the company. In data warehousing one job function maintains
the master product file and oversees the generation of new product keys, and another job function makes
sure that every sales record contains valid product keys. These joins are therefore called MIS joins.


Application constraints apply to individual dimension tables. Browsing the dimension tables, the user
specifies application constraints. It rarely makes sense to apply an application constraint simultaneously
across two dimensions, thereby linking the two dimensions. The dimensions are linked only through the
fact table. It is possible to directly apply an application constraint to a fact in the fact table. This can be
thought of as a filter on the records that would otherwise be retrieved by the rest of the query.
The group by clause summarizes records in the row headers. The order by clause determines the sort
order of the answer set when it is presented to the user.
From a performance viewpoint then, the SQL query should be evaluated as follows:
First, the application constraints are evaluated dimension by dimension. Each dimension thus produces a
set of candidate keys. The candidate keys are then assembled from each dimension into trial composite
keys to be searched for in the fact table. All the "hits" in the fact table are then grouped and summed
according to the specifications in the select list and group by clause.
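The query shape described above can be run end to end against a toy star schema. Data values below are illustrative (chosen so the totals match the earlier Brand answer set), and SQLite via Python again stands in for a warehouse DBMS:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (productkey INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE time (timekey INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE salesfact (timekey INTEGER, productkey INTEGER,
                        dollars REAL, units INTEGER);
INSERT INTO product VALUES (1, 'Axon'), (2, 'Framis');
INSERT INTO time VALUES (10, '1 Q 1995'), (11, '2 Q 1995');
INSERT INTO salesfact VALUES (10, 1, 500, 150), (10, 1, 280, 113),
                             (10, 2, 1044, 509), (11, 2, 999, 400);
""")

# Row headers (p.brand) are not summed; the aggregated facts are.
answer = conn.execute("""
    SELECT p.brand, SUM(f.dollars), SUM(f.units)
    FROM salesfact f, product p, time t
    WHERE f.timekey = t.timekey          -- join constraint
      AND f.productkey = p.productkey    -- join constraint
      AND t.quarter = '1 Q 1995'         -- application constraint
    GROUP BY p.brand
    ORDER BY p.brand
""").fetchall()
print(answer)  # [('Axon', 780.0, 263), ('Framis', 1044.0, 509)]
```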

Attributes' Role in Data Warehousing
Attributes are the drivers of the Data Warehouse. The user begins by placing application constraints on
the dimensions through the process of browsing the dimension tables one at a time. The browse queries
are always on single dimension tables and are usually fast-acting and lightweight. Browsing allows the
user to assemble the correct constraints on each dimension. The user launches several queries in this
phase. The user also drags row headers from the dimension tables and additive facts from the fact table to
the answer staging area (the report). The user then launches a multitable join. Finally, the dbms groups
and summarizes millions of low-level records from the fact table into the small answer set and returns the
answer to the user.
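The browsing phase above amounts to fast single-table queries against one dimension at a time. A minimal sketch (data values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (product_key INTEGER, brand TEXT, category TEXT)")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [(1, "Axon", "Snacks"), (2, "Framis", "Snacks"),
                  (3, "Widget", "Dairy")])

# Browse 1: what categories exist in the product dimension?
cats = [r[0] for r in conn.execute(
    "SELECT DISTINCT category FROM product ORDER BY category")]
print(cats)    # ['Dairy', 'Snacks']

# Browse 2: which brands fall under the category the user just chose?
brands = [r[0] for r in conn.execute(
    "SELECT DISTINCT brand FROM product "
    "WHERE category = 'Snacks' ORDER BY brand")]
print(brands)  # ['Axon', 'Framis']
```

Only once the constraints are assembled this way does the user launch the expensive multitable join against the fact table.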



Two Sample Data Warehouse Designs
Designing a Product-Oriented Data Warehouse


[Figure: The Grocery Store Schema]
Sales Fact: time_key, product_key, store_key, promotion_key, dollar_sales, units_sales, dollar_cost, customer_count
Time Dimension: time_key, day_of_week, day_no_in_month, other time dimension attributes
Product Dimension: product_key, SKU_no, SKU_desc, other product attributes
Promotion Dimension: promotion_key, promotion_name, price_reduction_type, other promotion attributes
Store Dimension: store_key, store_name, store_number, store_addr, other store attributes






Background
The above schema is for a grocery chain with 500 large grocery stores spread over a three-state area.
Each store has a full complement of departments including grocery, frozen foods, dairy, meat, produce,
bakery, floral, hard goods, liquor and drugs. Each store has about 60,000 individual products on its
shelves. The individual products are called Stock Keeping Units or SKUs. About 40,000 of the SKUs
come from outside manufacturers and have bar codes imprinted on the product package. These bar codes,
called Universal Product Codes or UPCs, are at the same grain as individual SKUs. The remaining
20,000 SKUs come from departments like meat, produce, bakery or floral and do not have
nationally recognized UPC codes.
Management is concerned with the logistics of ordering, stocking the shelves and selling the products, as
well as maximizing the profit at each store. The most significant management decisions have to do with
pricing and promotions. Promotions include temporary price reductions, ads in newspapers, displays in
the grocery store (including shelf displays and end-aisle displays) and coupons.

Identifying the Processes to Model
The first step in the design is to decide what business processes to model, by combining an understanding
of the business with an understanding of what data is available. The second step is to decide on the grain
of the fact table in each business process.
A data warehouse always demands data expressed at the lowest possible grain of each dimension, not so
that queries can see individual low-level records, but so that queries can cut through the database in very
precise ways. The best grain for the grocery store data warehouse is daily item movement: SKU by store
by promotion by day.

Dimension Table Modeling
A careful grain statement determines the primary dimensionality of the fact table. It is then possible to
add additional dimensions to the basic grain of the fact table, where these additional dimensions naturally
take on only a single value under each combination of the primary dimensions. If it is recognized that an
additional desired dimension violates the grain by causing additional records to be generated, then the
grain statement must be revised to accommodate this additional dimension. The grain of the grocery store
table allows the primary dimensions of time, product and store to fall out immediately.
Most data warehouses need an explicit time dimension table even though the primary time key may be
an SQL date-valued object. The explicit time dimension table is needed to describe fiscal periods,
seasons, holidays, weekends and other calendar calculations that are difficult to get from the SQL date
machinery.
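As an illustrative sketch (the attribute set and calendar rules here are placeholders, not from the text), such a time dimension can be generated once in code rather than computed from SQL date machinery at query time:

```python
from datetime import date, timedelta

def build_time_dimension(start, days):
    """One row per day, with calendar attributes precomputed."""
    rows = []
    for i in range(days):
        d = start + timedelta(days=i)
        rows.append({
            "date_key": d.isoformat(),
            "day_of_week": d.strftime("%A"),
            "month": d.month,
            "quarter": (d.month - 1) // 3 + 1,
            "is_weekend": d.weekday() >= 5,
            # Fiscal periods, seasons, and holidays would be filled in here
            # from the organization's own calendar rules.
        })
    return rows

dim = build_time_dimension(date(2011, 1, 1), 730)   # two years of days
print(len(dim), dim[0]["day_of_week"])              # 730 Saturday
```

Loading the table once this way keeps fiscal and holiday logic out of every query.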
Time is usually the first dimension in the underlying sort order in the database because when it is the first
in the sort order, the successive loading of time intervals of data will load data into virgin territory on the
disk.
The product dimension is one of the two or three primary dimensions in nearly every data warehouse.
This type of dimension has a great many attributes; it is not unusual for it to have more than 50.
The other two dimensions are an artifact of the grocery store example.
A note of caution:





Product Dimension

Product table: product_key, SKU_desc, SKU_number, package_size_key, brand_key, package_type,
diet_type, weight, weight_unit_of_measure, storage_type_key, units_per_retail_case, etc.
Package size outrigger: package_size_key, package_size
Brand outrigger: brand_key, brand, subcategory_key
Subcategory outrigger: subcategory_key, subcategory, category_key
Category outrigger: category_key, category, department_key
Department outrigger: department_key, department
Storage type outrigger: storage_type_key, storage_type, shelf_life_type_key
Shelf life outrigger: shelf_life_type_key, shelf_life_type

A snowflaked product dimension



Browsing is the act of navigating around in a dimension, either to gain an intuitive understanding of how
the various attributes correlate with each other or to build a constraint on the dimension as a whole. If a
large product dimension table is split apart into a snowflake, and robust browsing is attempted among
widely separated attributes, possibly lying along various tree structures, it is inevitable that browsing
performance will be compromised.




Fact Table Modeling
The sales fact table records only the SKUs actually sold; no record is kept of the SKUs that did not sell.
(Some applications require these records as well; the tables that hold them are termed "factless" fact
tables.)
The customer count, because it is additive across three of the dimensions, but not the fourth, is called
semiadditive. Any analysis using the customer count must be restricted to a single product key to be
valid.
The application must group line items together and find those groups where the desired products coexist.
This can be done with the COUNT DISTINCT operator in SQL.
A different solution is to store brand, subcategory, category, department and all merchandise customer
counts in explicitly stored aggregates. This is an important technique in data warehousing that I will not
cover in this report.
Finally, drilling down in a data warehouse is nothing more than adding row headers from the dimension
tables. Drilling up is subtracting row headers. An explicit hierarchy is not needed to support drilling
down.
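This can be illustrated with a toy schema (table and attribute names are hypothetical): drilling down from category to brand means nothing more than adding the brand row header to the SELECT list and GROUP BY clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INT, brand TEXT, category TEXT);
CREATE TABLE sales_fact (product_key INT, dollars_sold REAL);
INSERT INTO product_dim VALUES (1, 'Axon', 'Snacks'), (2, 'Brio', 'Snacks');
INSERT INTO sales_fact VALUES (1, 10.0), (1, 5.0), (2, 7.0);
""")

# Summary by category:
summary = conn.execute("""
    SELECT p.category, SUM(f.dollars_sold)
    FROM sales_fact f JOIN product_dim p ON f.product_key = p.product_key
    GROUP BY p.category
""").fetchall()

# Drilling down: add the brand row header to SELECT and GROUP BY.
drill = conn.execute("""
    SELECT p.category, p.brand, SUM(f.dollars_sold)
    FROM sales_fact f JOIN product_dim p ON f.product_key = p.product_key
    GROUP BY p.category, p.brand
    ORDER BY p.brand
""").fetchall()

print(summary)   # [('Snacks', 22.0)]
print(drill)     # [('Snacks', 'Axon', 15.0), ('Snacks', 'Brio', 7.0)]
```

Drilling back up is the reverse: removing the brand column from both clauses.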

Database Sizing for the Grocery Chain
The fact table is overwhelmingly large. The dimensional tables are geometrically smaller. So all realistic
estimates of the disk space needed for the warehouse can ignore the dimension tables.
The fact table in a dimensional schema should be highly normalized whereas efforts to normalize any of
the dimensional tables are a waste of time. If we normalize them by extracting repeating data elements
into separate "outrigger" tables, we make browsing and pick list generation difficult or impossible.
Time dimension: 2 years X 365 days = 730 days
Store dimension: 300 stores, reporting sales each day
Product dimension: 30,000 products in each store, of which 3,000 sell each day in a given store
Promotion dimension: a sold item appears in only one promotion condition in a store on a day.
Number of base fact records = 730 X 300 X 3000 X 1 = 657 million records
Number of key fields = 4; Number of fact fields = 4; Total fields = 8
Base fact table size = 657 million X 8 fields X 4 bytes = 21 GB
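The arithmetic above can be checked in a few lines; the helper function and the 4-byte field width simply restate the assumptions already listed:

```python
def fact_table_size_gb(records, key_fields, fact_fields, bytes_per_field=4):
    """Base fact table size, ignoring the geometrically smaller dimensions."""
    return records * (key_fields + fact_fields) * bytes_per_field / 1e9

days, stores, selling_skus = 2 * 365, 300, 3000
records = days * stores * selling_skus   # one promotion condition per item/day
print(records)                                       # 657000000 base fact records
print(round(fact_table_size_gb(records, 4, 4), 1))   # 21.0 GB
```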





Two Sample Data Warehouse Designs
Designing a Customer-Oriented Data Warehouse
I will outline an insurance application as an example of a customer-oriented data warehouse.
In this example the insurance company is a $3 billion property and casualty insurer for automobiles,
home fire protection, and personal liability. There are two main production data sources: all transactions
relating to the formulation of policies, and all transactions involved in processing claims. The insurance
company wants to analyze both the written policies and claims. It wants to see which coverages are most
profitable and which are the least. It wants to measure profits over time by covered item type (i.e. kinds
of cars and kinds of houses), state, county, demographic profile, underwriter, sales broker and sales
region, and events. Both revenues and costs need to be identified and tracked. The company wants to
understand what happens during the life of a policy, especially when a claim is processed.






The following four schemas outline the star schema for the insurance application:


Claims Transaction Schema

Fact table: transaction_date, effective_date, insured_party_key, employee_key, coverage_key,
covered_item_key, policy_key, claimant_key, claim_key, third_party_key, transaction_key, amount
Date dimension: date_key, day_of_week, fiscal_period
Insured party dimension: insured_party_key, name, address, type, demographic attributes
Employee dimension: employee_key, name, employee_type, department
Coverage dimension: coverage_key, coverage_desc, market_segment, line_of_business,
annual_statement_line, automobile_attributes ...
Covered item dimension: covered_item_key, covered_item_desc, covered_item_type,
automobile_attributes ...
Policy dimension: policy_key, risk_grade
Claimant dimension: claimant_key, claimant_name, claimant_address, claimant_type
Claim dimension: claim_key, claim_desc, claim_type, automobile_attributes ...
Third party dimension: third_party_key, third_party_name, third_party_addr, third_party_type
Transaction dimension: transaction_key, transaction_description, reason




Policy Transaction Schema

Fact table: transaction_date, effective_date, insured_party_key, employee_key, coverage_key,
covered_item_key, policy_key, transaction_key, amount
Date dimension: date_key, day_of_week, fiscal_period
Insured party dimension: insured_party_key, name, address, type, demographic_attributes ...
Employee dimension: employee_key, name, employee_type, department
Coverage dimension: coverage_key, coverage_description, market_segment, line_of_business,
annual_statement_line, automobile_attributes
Covered item dimension: covered_item_key, covered_item_description, covered_item_type,
automobile_attributes ...
Policy dimension: policy_key, risk_grade
Transaction dimension: transaction_key, transaction_description, reason








Policy Snapshot Schema

Fact table: snapshot_date, effective_date, insured_party_key, agent_key, coverage_key,
covered_item_key, policy_key, status_key, written_premium, earned_premium, primary_limit,
primary_deductible, number_transactions, automobile_facts ...
Date dimension: date_key, fiscal_period
Insured party dimension: insured_party_key, name, address, type, demographic attributes
Agent dimension: agent_key, agent_name, agent_location, agent_type
Coverage dimension: coverage_key, coverage_desc, market_segment, line_of_business,
annual_statement_line, automobile_attributes ...
Covered item dimension: covered_item_key, covered_item_description, covered_item_type,
automobile_attributes ...
Policy dimension: policy_key, risk_grade
Status dimension: status_key, status_description



Claims Snapshot Schema

Fact table: transaction_date, effective_date, insured_party_key, agent_key, employee_key,
coverage_key, covered_item_key, policy_key, claim_key, status_key, reserve_amount,
paid_this_month, received_this_month, number_transactions, automobile facts ...
Date dimension: date_key, day_of_week, fiscal_period
Insured party dimension: insured_party_key, name, address, type, demographic attributes
Agent dimension: agent_key, agent_name, agent_type, agent_location
Coverage dimension: coverage_key, coverage_desc, market_segment, line_of_business,
annual_statement_line, automobile_attributes ...
Covered item dimension: covered_item_key, covered_item_desc, covered_item_type,
automobile_attributes ...
Policy dimension: policy_key, risk_grade
Claim dimension: claim_key, claim_desc, claim_type, automobile_attributes ...
Status dimension: status_key, status_description




An appropriate design for a property and casualty insurance data warehouse is a short value chain
consisting of policy creation and claims processing, where these two major processes are represented
both by transaction fact tables and monthly snapshot fact tables.
This data warehouse will need to represent a number of heterogeneous coverage types with appropriate
combinations of core and custom dimension tables and fact tables.
The large insured party and covered item dimensions will need to be decomposed into one or more
minidimensions in order to provide reasonable browsing performance and in order to accurately track
these slowly changing dimensions.

Database Sizing for the Insurance Application

Policy Transaction Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Number of policy transactions (not claim transactions) per year per policy: 12
Number of years: 3
Other dimensions: 1 for each policy line item transaction
Number of base fact records: 2,000,000 X 10 X 12 X 3 = 720 million records
Number of key fields: 8; Number of fact fields = 1; Total fields = 9
Base fact table size = 720 million X 9 fields X 4 bytes = 26 GB

Claim Transaction Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Yearly percentage of all covered item coverages with a claim: 5%
Number of claim transactions per actual claim: 50
Number of years: 3
Other dimensions: 1 for each policy line item transaction
Number of base fact records: 2,000,000 X 10 X 0.05 X 50 X 3 = 150 million records
Number of key fields: 11; Number of fact fields = 1; Total fields = 12
Base fact table size = 150 million X 12 fields X 4 bytes = 7.2 GB




Policy Snapshot Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Number of years: 3 => 36 months
Other dimensions: 1 for each policy line item transaction
Number of base fact records: 2,000,000 X 10 X 36 = 720 million records
Number of key fields: 8; Number of fact fields = 5; Total fields = 13
Base fact table size = 720 million X 13 fields X 4 bytes = 37 GB
Total custom policy snapshot fact tables assuming an average of 5 custom facts: 52 GB

Claim Snapshot Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Yearly percentage of all covered item coverages with a claim: 5%
Average length of time that a claim is open: 12 months
Number of years: 3
Other dimensions: 1 for each policy line item transaction
Number of base fact records: 2,000,000 X 10 X 0.05 X 3 X 12 = 36 million records
Number of key fields: 11; Number of fact fields = 4; Total fields = 15
Base fact table size = 36 million X 15 fields X 4 bytes = 2.2 GB
Total custom claim snapshot fact tables assuming an average of 5 custom facts: 2.9 GB
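All four insurance estimates follow the same pattern; this short script restates the stated assumptions and reproduces the numbers:

```python
def size_gb(records, fields, bytes_per_field=4):
    """Base fact table size: records x total fields x bytes per field."""
    return records * fields * bytes_per_field / 1e9

policies, line_items, years = 2_000_000, 10, 3
claims_per_year = policies * line_items * 5 // 100   # 5% of coverages claim yearly

policy_txn  = policies * line_items * 12 * years     # 12 policy transactions/year
claim_txn   = claims_per_year * 50 * years           # 50 transactions per claim
policy_snap = policies * line_items * 12 * years     # 36 monthly snapshots
claim_snap  = claims_per_year * 12 * years           # claims open ~12 months

print(policy_txn, round(size_gb(policy_txn, 9)))      # 720000000 26
print(claim_txn, round(size_gb(claim_txn, 12), 1))    # 150000000 7.2
print(policy_snap, round(size_gb(policy_snap, 13)))   # 720000000 37
print(claim_snap, round(size_gb(claim_snap, 15), 1))  # 36000000 2.2
```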





Mechanics of the Design

There are nine decision points that need to be resolved for a complete data warehouse design:
1. The processes, and hence the identity of the fact tables
2. The grain of each fact table
3. The dimensions of each fact table
4. The facts, including precalculated facts.
5. The dimension attributes with complete descriptions and proper terminology
6. How to track slowly changing dimensions
7. The aggregations, heterogeneous dimensions, minidimensions, query models and other physical
storage decisions
8. The historical duration of the database
9. The urgency with which the data is extracted and loaded into the data warehouse

Interviewing End-Users and DBAs
Interviewing the end users is the most important first step in designing a data warehouse. The interviews
really accomplish two purposes. First, the interviews give the designers the insight into the needs and
expectations of the user community. The second purpose is to allow the designers to raise the level of
awareness of the forthcoming data warehouse with the end users, and to adjust and correct some of the
users' expectations.
The DBAs are often the primary experts on the legacy systems that may be used as the sources for the
data warehouse. These interviews serve as a reality check on some of the themes that come up in the end
user interviews.

Assembling the team
The entire data warehouse team should be assembled for two to three days to go through the nine
decision points. The attendees should be all the people who have an ongoing responsibility for the data
warehouse, including DBAs, system administrators, extract programmers, application developers, and
support personnel. End users should not attend the design sessions.
In the design sessions, the fact tables are identified and their grains chosen. Next the dimension tables are
identified by name and their grains chosen. E/R diagrams are not used to identify the fact tables or their
grains. They simply familiarize the staff with the complexities of the data.




Choosing the Hardware/Software platforms
These choices boil down to two primary concerns:
1. Does the proposed system actually work?
2. Is this a vendor relationship that we want to have for a long time?
Ask the vendor:
1. Can the system query, store, load, index, and alter a billion-row fact table with a dozen dimensions?
2. Can the system rapidly browse a 100,000-row dimension table?
Benchmark the system to simulate fact and dimension table loading.
Conduct a query test for:
1. Average browse query response time
2. Average browse query delay compared with an unloaded system
3. Ratio between longest and shortest browse query time
4. Average join query response time
5. Average join query delay compared with an unloaded system
6. Ratio between longest and shortest join query time (gives a sense of the stability of the optimizer)
7. Total number of query suites processed per hour

Handling Aggregates
An aggregate is a fact table record representing a summarization of base-level fact table records. An
aggregate fact table record is always associated with one or more aggregate dimension table records. Any
dimension attribute that remains unchanged in the aggregate dimension table can be used more
efficiently in the aggregate schema than in the base-level schema because it is guaranteed to make sense
at the aggregate level.
Several different precomputed aggregates will accelerate summarization queries. The effect on
performance will be huge. There will be a ten to thousand-fold improvement in runtime by having the
right aggregates available.
DBAs should spend time watching what the users are doing and deciding whether to build more
aggregates. The creation of aggregates requires a significant administrative effort. Whereas the
operational production system will provide a framework for administering base-level record keys, the
data warehouse team must create and maintain aggregate keys.
An aggregate navigator is very useful to intercept the end user's SQL query and transform it to use the
best available aggregate. It is thus an essential component of the data warehouse because it insulates end
user applications from the changing portfolio of aggregations, and allows the DBA to adjust the
aggregations dynamically without having to roll over the application base.
Finally, aggregations provide a home for planning data. Aggregations built from the base layer upward
coincide with the planning process that creates plans and forecasts at these very same levels.
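A sketch of the navigator idea, assuming a purely hypothetical registry that records which dimension levels each aggregate provides (a real navigator parses and rewrites the SQL itself):

```python
# Hypothetical registry: dimension levels an aggregate provides -> its table,
# listed from most summarized (smallest) to least summarized.
AGGREGATES = [
    ({"category"}, "sales_by_category"),
    ({"category", "brand"}, "sales_by_brand"),
]
BASE_TABLE = "sales_fact"   # the base-level fact table always works

def choose_table(levels_needed):
    """Answer the query from the smallest aggregate that covers its levels."""
    for provided, table in AGGREGATES:
        if levels_needed <= provided:    # subset test: this aggregate suffices
            return table
    return BASE_TABLE                    # fall back to base-level facts

print(choose_table({"category"}))           # sales_by_category
print(choose_table({"category", "brand"}))  # sales_by_brand
print(choose_table({"product"}))            # sales_fact
```

Because applications never name the aggregate tables directly, the DBA can add or drop entries in the registry without touching the application base.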




Server-Side activities
In summary, the "back" room or server functions can be listed as follows:
    Build and use the production data extract system.
    Perform daily data quality assurance.
    Monitor and tune the performance of the data warehouse system.
    Perform backup and recovery on the data warehouse.
    Communicate with the user community.
The steps in the daily production extract can be outlined as follows:
    1. Primary extraction (read the legacy format)
    2. Identify the changed records
    3. Generalize keys for changing dimensions
    4. Transform the extract into load record images
    5. Migrate from the legacy system to the data warehouse system
    6. Sort and build aggregates
    7. Generalize keys for aggregates
    8. Perform loading
    9. Process exceptions
    10. Quality assurance
    11. Publish
Additional notes:
Data extract tools are expensive. It does not make sense to buy them until the extract and transformation
requirements are well understood.
Maintenance of comparison copies of production files is a significant application burden that is a unique
responsibility of the data warehouse team.
To control slowly changing dimensions, the data warehouse team must create an administrative process
for issuing new dimension keys each time a trackable change occurs. The two alternatives for
administering keys are: derived keys and sequentially assigned integer keys.
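The sequentially assigned integer alternative might look like the following sketch, where the class, production IDs, and attributes are all illustrative:

```python
# Assign a new surrogate key each time a trackable change occurs
# (slowly changing dimension administration, sketched in miniature).
class KeyAdministrator:
    def __init__(self):
        self.next_key = 1
        self.current = {}   # production ID -> (surrogate key, attributes)

    def process(self, prod_id, attributes):
        """Return the surrogate key, issuing a new one on any change."""
        if prod_id in self.current and self.current[prod_id][1] == attributes:
            return self.current[prod_id][0]   # unchanged: reuse existing key
        key = self.next_key                   # new record or trackable change
        self.next_key += 1
        self.current[prod_id] = (key, attributes)
        return key

admin = KeyAdministrator()
k1 = admin.process("SKU-100", {"package_size": "12 oz"})
k2 = admin.process("SKU-100", {"package_size": "12 oz"})   # no change
k3 = admin.process("SKU-100", {"package_size": "16 oz"})   # trackable change
print(k1, k2, k3)   # 1 1 2
```

Each issued key becomes a new dimension record, so history is preserved rather than overwritten.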
Metadata - Metadata is a loose term for any form of auxiliary data that is maintained by an application.
Metadata is also kept by the aggregate navigator and by front-end query tools. The data warehouse team
should carefully document all forms of metadata. Ideally, front-end tools should provide facilities for
metadata administration.
Most of the extraction steps should be handled on the legacy system. This will allow for the biggest
reduction in data volumes.




A bulk data loader should allow for:
    The parallelization of the bulk data load across a number of processors in either SMP or MPP
    environments.
    Selectively turning the master index off before, and back on after, bulk loads
    Insert and update modes selectable by the DBA
    Referential integrity handling options
It is a good idea, as mentioned earlier, to think of the load process as one transaction. If the load is
corrupted, it should be rolled back and the load retried in the next load window.
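The load-as-one-transaction idea can be sketched with SQLite (production bulk loaders differ in mechanics, but the all-or-nothing behavior is the point):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (product_key INT NOT NULL, dollars REAL)")

def load_batch(conn, rows):
    """Treat the whole load as one transaction: all rows land, or none do."""
    try:
        with conn:   # commits on success, rolls back on any exception
            conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
        return True
    except sqlite3.Error:
        return False   # corrupted load: retry in the next load window

ok = load_batch(conn, [(1, 9.99), (None, 5.00)])   # NULL key corrupts the load
count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(ok, count)   # False 0 -- nothing from the failed batch was kept
```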

Client-Side activities
The client functions can be summarized as follows:
    Build reusable application templates
    Design usable graphical user interfaces
    Train users on both the applications and the data
    Keep the network running efficiently
Additional notes:
Ease of use should be a primary criterion for an end user application tool.
The data warehouse should consist of a library of template applications that run immediately on the user's
desktop. These applications should have a limited set of user-selectable alternatives for setting new
constraints and for picking new measures. These template applications are precanned, parameterized
reports.
The query tools should perform comparisons flexibly and immediately. A single row of an answer set
should be able to show comparisons over multiple time periods of differing grains (month, quarter,
year-to-date), comparisons over other dimensions (a product's share of its category), and compound
comparisons across two or more dimensions (share change this year versus last year). These comparison
alternatives should be available in the form of a pull-down menu. SQL should never be shown.
Presentation should be treated as an activity separate from querying and comparing; tools that allow
answer sets to be transferred easily into multiple presentation environments should be chosen.
A report-writing query tool should communicate the context of the report instantly, including the
identities of the attributes and the facts, as well as any constraints placed by the user. If a user wishes to
edit a column, they should be able to do so directly. Requerying after an edit should at most fetch the
data needed to rebuild the edited column.
All query tools must have an instant STOP command. The tool should not engage the client machine
while waiting on data from the server.





Conclusions
The data warehousing market is moving quickly as all major DBMS and tool vendors try to satisfy IS
needs. The industry needs to be driven by the users, as opposed to the software/hardware vendors, as has
been the case up to now.
Software is the key. Although there have been several advances in hardware, such as parallel processing,
the main impact will still be felt through software.
Here are a few software issues:
Optimization of the execution of star join queries
Indexing of dimension tables for browsing and constraining, especially multi-million-row dimension
tables
Indexing of composite keys of fact tables
Syntax extensions for SQL to handle aggregations and comparisons
Support for low-level data compression
Support for parallel processing
Database Design tools for star schemas
Extract, administration and QA tools for star schemas
End user query tools





A Checklist for an Ideal Data Warehouse
The following checklist is from Ralph Kimball's The Data Warehouse Toolkit (Wiley, 1996).

•   Preliminary complete list of affected user groups prior to interviews

•   Preliminary complete list of legacy data sources prior to interviews

•   Data warehouse implementation team identified

    •   Data warehouse manager identified

    •   Interview leader identified

    •   Extract programming manager identified

•   End user groups to be interviewed identified

•   Data warehouse kickoff meeting with all affected end user groups

•   End user interviews

        •    Marketing interviews

        •    Finance interviews

        •    Logistics interviews

        •    Field management interviews

        •    Senior management interviews

        •    Six-inch stack of existing management reports representing all interviewed groups

•   Legacy system DBA interviews

        •    Copy books obtained for candidate legacy systems

        •    Data dictionary explaining meaning of each candidate table and field

        •    High-level description of which tables and fields are populated with quality data

•   Interview findings report distributed

        •    Prioritized information needs as expressed by end user community

        •    Data audit performed showing what data is available to support information needs

•   Data warehouse design meeting

        •    Major processes identified and fact tables laid out

        •    Grain for each fact table chosen

                 •    Choice of transaction grain Vs time period accumulating snapshot grain

        •    Dimensions for each fact table identified

        •    Facts for each fact table with legacy source fields identified

        •    Dimension attributes with legacy source fields identified



        •   Core and custom heterogeneous product tables identified

        •   Slowly changing dimension attributes identified

        •   Demographic minidimensions identified

        •   Initial aggregated dimensions identified

        •   Duration of each fact table (need to extract old data upfront) identified

        •   Urgency of each fact table (e.g. need to extract on a daily basis) identified

        •   Implementation staging (first process to be implemented...)
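The grain and dimension choices above determine the physical star schema. As a minimal sketch (SQLite via Python's standard library; all table names, column names, and data are invented for illustration), a transaction-grain fact table surrounded by a dimension table looks like this, queried with a typical star join:

```python
import sqlite3

# In-memory database standing in for the warehouse DBMS server.
con = sqlite3.connect(":memory:")

# Dimension table: one row per product, carrying browsable attributes.
con.execute("""CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT)""")

# Fact table at the transaction grain: one row per sales transaction,
# keyed entirely by dimensions, carrying additive facts.
con.execute("""CREATE TABLE sales_fact (
    time_key     INTEGER,
    product_key  INTEGER REFERENCES product_dim(product_key),
    store_key    INTEGER,
    dollars_sold INTEGER,
    units_sold   INTEGER)""")

con.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                [(19960101, 1, 10, 20, 1),
                 (19960101, 2, 10, 25, 2),
                 (19960102, 1, 11, 40, 2)])

# A star join: constrain and group on dimension attributes,
# then sum the additive facts from the fact table.
rows = con.execute("""
    SELECT p.product_name, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact f JOIN product_dim p USING (product_key)
    GROUP BY p.product_name ORDER BY p.product_name""").fetchall()
print(rows)  # → [('Gadget', 25, 2), ('Widget', 60, 3)]
```

The transaction grain keeps one fact row per event; an accumulating snapshot grain would instead keep one row per period, updated as the period accumulates.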

•   Block diagram for production data extract (as each major process is implemented)

        •   System for reading legacy data

        •   System for identifying changing records

        •   System for handling slowly changing dimensions

        •   System for preparing load record images

        •   Migration system (mainframe to DBMS server machine)

        •   System for creating aggregates

        •   System for loading data, handling exceptions, guaranteeing referential integrity

        •   System for data quality assurance check

        •   System for data snapshot backup and recovery

        •   System for publishing, notifying users of daily data status
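Two of the trickier systems in the extract block diagram above — identifying changed records and handling slowly changing dimensions — can be sketched end to end. This is a minimal Python sketch assuming the common "Type 2" approach (add a new dimension row rather than overwrite the old one); the record layout and key names are invented:

```python
# Yesterday's and today's legacy extracts, keyed by the source-system id.
yesterday = {"C1": {"name": "Acme", "region": "East"},
             "C2": {"name": "Bolt", "region": "West"}}
today     = {"C1": {"name": "Acme", "region": "South"},  # changed record
             "C2": {"name": "Bolt", "region": "West"},
             "C3": {"name": "Core", "region": "East"}}   # new record

# System for identifying changed records: compare successive snapshots.
new_keys     = [k for k in today if k not in yesterday]
changed_keys = [k for k in today if k in yesterday and today[k] != yesterday[k]]

# System for handling slowly changing dimensions (Type 2): never overwrite
# a changed attribute; retire the old dimension row and add a new one under
# a fresh surrogate key, so existing fact rows keep their history.
dimension = [{"key": 1, "source_id": "C1", "region": "East", "current": True},
             {"key": 2, "source_id": "C2", "region": "West", "current": True}]

def next_key():
    return max(row["key"] for row in dimension) + 1

for k in changed_keys:
    for row in dimension:
        if row["source_id"] == k:
            row["current"] = False          # retire the old version
    dimension.append({"key": next_key(), "source_id": k,
                      "region": today[k]["region"], "current": True})

for k in new_keys:                          # brand-new records get one row
    dimension.append({"key": next_key(), "source_id": k,
                      "region": today[k]["region"], "current": True})

# C1 now has two rows: the East history survives alongside the South row.
print([row for row in dimension if row["source_id"] == "C1"])
```

The same comparison step also drives the "preparing load record images" system: only new and changed keys need load records built for them.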

•   DBMS server hardware

        •   Vendor sales and support team qualified

        •   Vendor reference sites contacted and qualified as to relevance

        •   Vendor on-site test (if no qualified, relevant references available)

                 •   Vendor demonstrates ability to support system startup, backup, debugging

        •   Open systems and parallel scalability goals met

        •   Contractual terms approved

•   DBMS software

        •   Vendor sales and support team qualified

        •   Vendor team has implemented a similar data warehouse

        •   Vendor team agrees with dimensional approach

        •   Vendor team demonstrates competence in prototype test

        •   Ability to load, index and quality assure data volume demonstrated




        •   Ability to browse large dimension tables demonstrated

        •   Ability to query family of fact tables from 20 PCs under load demonstrated

        •   Superior performance and optimizer stability demonstrated for star join queries

        •   Superior large dimension table browsing demonstrated

        •   Extended SQL syntax for special data warehouse functions

        •   Ability to immediately and gracefully stop a query from end user PC

•   Extract tools

        •    Specific need for features of extract tool identified from extract system block diagram

        •    Alternative of writing home-grown extract system rejected

        •    Reference sites supplied by vendor qualified for relevance

•   Aggregate navigator

        •    Open system approach of navigator verified (serves all SQL network clients)

        •    Metadata table administration understood and compared with other navigators

        •    User query statistics, aggregate recommendations, link to aggregate creation tool

        •    Subsecond browsing performance with the navigator demonstrated for tiny browses
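What the aggregate navigator does can be sketched briefly: it keeps metadata describing which aggregate tables exist and at what dimensional level, and transparently redirects each incoming query to the smallest table that can answer it, falling back to the base fact table. The table names and metadata layout below are invented for illustration:

```python
# Metadata: each prebuilt aggregate table, the dimension levels it carries,
# and its size. An aggregate can answer a query only if every attribute the
# query groups or constrains on is available at that aggregate's level.
AGGREGATES = [
    {"table": "sales_by_product_month",  "levels": {"product", "month"},  "rows": 50_000},
    {"table": "sales_by_category_month", "levels": {"category", "month"}, "rows": 2_000},
]
BASE_TABLE = "sales_fact"  # always a correct (if slow) fallback

def navigate(query_attrs):
    """Redirect a query to the smallest table that can answer it."""
    candidates = [a for a in AGGREGATES if query_attrs <= a["levels"]]
    if not candidates:
        return BASE_TABLE
    return min(candidates, key=lambda a: a["rows"])["table"]

print(navigate({"category", "month"}))  # → sales_by_category_month
print(navigate({"month"}))              # → sales_by_category_month (smallest fit)
print(navigate({"product", "day"}))     # → sales_fact (no aggregate fits)
```

Because the rewrite happens in the network layer, every SQL client benefits without knowing the aggregates exist — which is why the open-system behavior in the first item above matters.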

•   Front end tool for delivering parameterized reports

        •    Saved reports that can be mailed from user to user and run

        •    Saved constraint definitions that can be reused (public and private)

        •    Saved behavioral group definitions that can be reused (public and private)

        •    Dimension table browser with cross attribute subsetting

        •    Existing report can be opened and run with one button click

        •    Multiple answer sets can be automatically assembled in tool with outer join

        •    Direct support for single and multi dimension comparisons

        •    Direct support for multiple comparisons with different aggregations

        •    Direct support for average time period calculations (e.g. average daily balance)

        •    STOP QUERY command

        •    Extensible interface to HELP allowing warehouse data tables to be described to user

        •    Simple drill-down command supporting multiple hierarchies and nonhierarchies

        •    Drill across that allows multiple fact tables to appear in same report

        •    Correctly calculated break rows

        •    Red-Green exception highlighting with interface to drill down




        •   Ability to use network aggregate navigator with every atomic query issued by tool

        •   Sequential operations on the answer set such as numbering top N, and rolling

        •   Ability to extend query syntax for DBMS special functions

        •   Ability to define very large behavioral groups of customers or products

        •   Ability to graph data or hand off data to third-party graphics package

        •   Ability to pivot data or to hand off data to third-party pivot package

        •   Ability to support OLE hot links with other OLE aware applications

        •   Ability to place answer set in clipboard or TXT file in Lotus or Excel formats

        •   Ability to print horizontal and vertical tiled report

        •   Batch operation

        •   Graphical user interface user development facilities

                •    Ability to build a startup screen for the end user

                •    Ability to define pull down menu items

                •    Ability to define buttons for running reports and invoking the browser

•   Consultants

        •    Consultant team qualified

                •   Consultant team has implemented a similar data warehouse

                •   Consultant team agrees with the dimensional approach

                •   Consultant team demonstrates competence in prototype test
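The "drill across" and "multiple answer sets automatically assembled with outer join" items in the front-end checklist above are really one mechanism: the tool issues a separate query against each fact table, then outer-joins the answer sets on their shared row headers on the client side. A minimal sketch with invented data:

```python
# Answer set from the sales fact table: product -> dollars sold.
sales     = {"Widget": 60, "Gadget": 25}
# Answer set from the inventory fact table: product -> units on hand.
inventory = {"Widget": 12, "Sprocket": 7}

# Outer join on the shared row header (product), so products present in
# only one answer set still appear in the combined drill-across report.
products = sorted(set(sales) | set(inventory))
report = [(p, sales.get(p), inventory.get(p)) for p in products]
for row in report:
    print(row)
# ('Gadget', 25, None)
# ('Sprocket', None, 7)
# ('Widget', 60, 12)
```

Issuing one query per fact table (rather than one giant join) is what lets each query be answered by its own fact table or aggregate at the correct grain.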





Bibliography
1. Building a Data Warehouse, Second Edition, by W.H. Inmon, Wiley, 1996
2. The Data Warehouse Toolkit, by Ralph Kimball, Wiley, 1996
3. Strategic Database Technology: Management for the Year 2000, by Alan Simon, Morgan Kaufmann, 1995
4. Applied Decision Support, by Michael W. Davis, Prentice Hall, 1988
5. Data Warehousing: Passing Fancy or Strategic Imperative, white paper by the Gartner Group, 1995
6. Knowledge Asset Management and Corporate Memory, white paper by the Hispacom Group, to be published in August 1996


                                            The End




                                                 28

Weitere ähnliche Inhalte

Was ist angesagt?

Basic Introduction of Data Warehousing from Adiva Consulting
Basic Introduction of  Data Warehousing from Adiva ConsultingBasic Introduction of  Data Warehousing from Adiva Consulting
Basic Introduction of Data Warehousing from Adiva Consultingadivasoft
 
Project Presentation on Data WareHouse
Project Presentation on Data WareHouseProject Presentation on Data WareHouse
Project Presentation on Data WareHouseAbhi Bhardwaj
 
Business intelligence and data warehousing
Business intelligence and data warehousingBusiness intelligence and data warehousing
Business intelligence and data warehousingOZ Assignment help
 
Data warehousing and business intelligence project report
Data warehousing and business intelligence project reportData warehousing and business intelligence project report
Data warehousing and business intelligence project reportsonalighai
 
Data mining and data warehousing
Data mining and data warehousingData mining and data warehousing
Data mining and data warehousingumesh patil
 
DATA Warehousing & Data Mining
DATA Warehousing & Data MiningDATA Warehousing & Data Mining
DATA Warehousing & Data Miningcpjcollege
 
Gulabs Ppt On Data Warehousing And Mining
Gulabs Ppt On Data Warehousing And MiningGulabs Ppt On Data Warehousing And Mining
Gulabs Ppt On Data Warehousing And Mininggulab sharma
 
Data Warehouse Project Report
Data Warehouse Project Report Data Warehouse Project Report
Data Warehouse Project Report Tom Donoghue
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data WarehouseShanthi Mukkavilli
 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousingwork
 
Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)Harish Chand
 
Big data, Analytics and 4th Generation Data Warehousing
Big data, Analytics and 4th Generation Data WarehousingBig data, Analytics and 4th Generation Data Warehousing
Big data, Analytics and 4th Generation Data WarehousingMartyn Richard Jones
 
Data mining and data warehousing
Data mining and data warehousingData mining and data warehousing
Data mining and data warehousingSatya P. Joshi
 
Dwdm 2(data warehouse)
Dwdm 2(data warehouse)Dwdm 2(data warehouse)
Dwdm 2(data warehouse)Er Bansal
 

Was ist angesagt? (20)

Basic Introduction of Data Warehousing from Adiva Consulting
Basic Introduction of  Data Warehousing from Adiva ConsultingBasic Introduction of  Data Warehousing from Adiva Consulting
Basic Introduction of Data Warehousing from Adiva Consulting
 
Project Presentation on Data WareHouse
Project Presentation on Data WareHouseProject Presentation on Data WareHouse
Project Presentation on Data WareHouse
 
Ppt
PptPpt
Ppt
 
Business intelligence and data warehousing
Business intelligence and data warehousingBusiness intelligence and data warehousing
Business intelligence and data warehousing
 
Data warehousing and business intelligence project report
Data warehousing and business intelligence project reportData warehousing and business intelligence project report
Data warehousing and business intelligence project report
 
Data mining and data warehousing
Data mining and data warehousingData mining and data warehousing
Data mining and data warehousing
 
DATA Warehousing & Data Mining
DATA Warehousing & Data MiningDATA Warehousing & Data Mining
DATA Warehousing & Data Mining
 
Gulabs Ppt On Data Warehousing And Mining
Gulabs Ppt On Data Warehousing And MiningGulabs Ppt On Data Warehousing And Mining
Gulabs Ppt On Data Warehousing And Mining
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Data Warehouse Project Report
Data Warehouse Project Report Data Warehouse Project Report
Data Warehouse Project Report
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousing
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)
 
Big data, Analytics and 4th Generation Data Warehousing
Big data, Analytics and 4th Generation Data WarehousingBig data, Analytics and 4th Generation Data Warehousing
Big data, Analytics and 4th Generation Data Warehousing
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Data mining and data warehousing
Data mining and data warehousingData mining and data warehousing
Data mining and data warehousing
 
Dwdm 2(data warehouse)
Dwdm 2(data warehouse)Dwdm 2(data warehouse)
Dwdm 2(data warehouse)
 
Data Warehousing
Data WarehousingData Warehousing
Data Warehousing
 

Ähnlich wie Data Warehousing Perspective Overview

oracle-adw-melts snowflake-report.pdf
oracle-adw-melts snowflake-report.pdforacle-adw-melts snowflake-report.pdf
oracle-adw-melts snowflake-report.pdfssuserf8f9b2
 
Data warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-designData warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-designSarita Kataria
 
White Paper - How Data Works
White Paper - How Data WorksWhite Paper - How Data Works
White Paper - How Data WorksDavid Walker
 
Data warehouse
Data warehouseData warehouse
Data warehouseRajThakuri
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSINGKing Julian
 
TOPIC 9 data warehousing and data mining.pdf
TOPIC 9 data warehousing and data mining.pdfTOPIC 9 data warehousing and data mining.pdf
TOPIC 9 data warehousing and data mining.pdfSCITprojects2022
 
data warehousing and data mining (1).pdf
data warehousing and data mining (1).pdfdata warehousing and data mining (1).pdf
data warehousing and data mining (1).pdfSCITprojects2022
 
Next generation Data Governance
Next generation Data GovernanceNext generation Data Governance
Next generation Data GovernanceVladimiro Borsi
 
In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...
In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...
In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...Hitachi Vantara
 
Data warehouse
Data warehouseData warehouse
Data warehouseMR Z
 
Data warehousing has quickly evolved into a unique and popular busin.pdf
Data warehousing has quickly evolved into a unique and popular busin.pdfData warehousing has quickly evolved into a unique and popular busin.pdf
Data warehousing has quickly evolved into a unique and popular busin.pdfapleather
 
Emerging database landscape july 2011
Emerging database landscape july 2011Emerging database landscape july 2011
Emerging database landscape july 2011navaidkhan
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and howbobosenthil
 

Ähnlich wie Data Warehousing Perspective Overview (20)

oracle-adw-melts snowflake-report.pdf
oracle-adw-melts snowflake-report.pdforacle-adw-melts snowflake-report.pdf
oracle-adw-melts snowflake-report.pdf
 
Data warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-designData warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-design
 
Abstract
AbstractAbstract
Abstract
 
DMDW 1st module.pdf
DMDW 1st module.pdfDMDW 1st module.pdf
DMDW 1st module.pdf
 
White Paper - How Data Works
White Paper - How Data WorksWhite Paper - How Data Works
White Paper - How Data Works
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
 
Course Outline Ch 2
Course Outline Ch 2Course Outline Ch 2
Course Outline Ch 2
 
TOPIC 9 data warehousing and data mining.pdf
TOPIC 9 data warehousing and data mining.pdfTOPIC 9 data warehousing and data mining.pdf
TOPIC 9 data warehousing and data mining.pdf
 
data warehousing and data mining (1).pdf
data warehousing and data mining (1).pdfdata warehousing and data mining (1).pdf
data warehousing and data mining (1).pdf
 
Next generation Data Governance
Next generation Data GovernanceNext generation Data Governance
Next generation Data Governance
 
In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...
In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...
In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...
 
Unit 1
Unit 1Unit 1
Unit 1
 
H1802045666
H1802045666H1802045666
H1802045666
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Data warehousing has quickly evolved into a unique and popular busin.pdf
Data warehousing has quickly evolved into a unique and popular busin.pdfData warehousing has quickly evolved into a unique and popular busin.pdf
Data warehousing has quickly evolved into a unique and popular busin.pdf
 
Emerging database landscape july 2011
Emerging database landscape july 2011Emerging database landscape july 2011
Emerging database landscape july 2011
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and how
 
Data mining notes
Data mining notesData mining notes
Data mining notes
 

Kürzlich hochgeladen

The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 

Kürzlich hochgeladen (20)

The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 

Data Warehousing Perspective Overview

  • 1. Data Warehousing: A Perspective by Hemant Kirpekar 7/15/2011 Data Warehousing: A Perspective by Hemant Kirpekar Introduction The Need for proper understanding of Data Warehousing......................................................................2 The Key Issues..........................................................................................................................................3 The Definition of a Data Warehouse.......................................................................................................3 The Lifecycle of a Data Warehouse.........................................................................................................4 The Goals of a Data Warehouse...............................................................................................................5 Why Data Warehousing is different from OLTP................................................6 E/R Modeling Vs Dimension Tables...................................................................8 Two Sample Data Warehouse Designs Designing a Product-Oriented Data Warehouse....................................................................................10 Designing a Customer-Oriented Data Warehouse................................................................................14 Mechanics of the Design Interviewing End-Users and DBAs........................................................................................................19 Assembling the team..............................................................................................................................19 Choosing Hardware/Software platforms................................................................................................20 Handling Aggregates..............................................................................................................................20 Server-Side 
activities.............................................................................................................................21 Client-Side activities..............................................................................................................................22 Conclusions........................................................................................................23 A Checklist for an Ideal Data Warehouse........................................................24 1
  • 2. Data Warehousing: A Perspective by Hemant Kirpekar 7/15/2011 Introduction The need for proper understanding of Data Warehousing The following is an extract from "Knowledge Asset Management and Corporate Memory" a White Paper to be published on the WWW possibly via the Hispacom site in the third week of August 1996...... Data Warehousing may well leverage the rising tide technologies that everyone will want or need, however the current trend in Data Warehousing marketing leaves a lot to be desired. In many organizations there still exists an enormous divide that separates Information Technology and a managers need for Knowledge and Information. It is common currency that there is a whole host of available tools and techniques for locating, scrubbing, sorting, storing, structuring, documenting, processing and presenting information. Unfortunately, tools are tangible and business information and knowledge are not, so they tend to get confused. So why do we still have this confusion? First consider how certain companies market Data Warehousing. There are companies that sell database technologies, other companies that sell the platforms (ostensibly consisting of an MPP or SMP architecture), some sell technical Consultancy services, others meta-data tools and services, finally there are the business Consultancy services and the systems integrators - each and everyone with their own particular focus on the critical factors in the success of Data Warehousing projects. In the main, most RDBMS vendors seem to see Data Warehouse projects as a challenge to provide greater performance, greater capacity and greater divergence. With this excuse, most RDBMS products carry functionality that make them about as truly "open" as a UNIVAC 90/30, i.e. No standards for View Partitioning, Bit Mapped Indexing, Histograms, Object Partitioning, SQL query decomposition or SQL evaluation strategies etc. 
This however is not really the important issue, the real issue is that some vendors sell Data Warehousing as if it just provided a big dumping ground for massive amounts of data with which users are allowed to do anything they like, whilst at the same time freeing up Operational Systems from the need to support end-user informational requirements. Some hardware vendors have a similar approach, i.e. a Data Warehouse platform must inherently have a lot of disks, a lot of memory and a lot of CPUs. However, one of the most successful Data Warehouse projects have worked on used COMPAQ hardware, which provides an excellent cost/benefit ratio. Some Technical Consultancy Services providers tend to dwell on the performance aspects of Data Warehousing. They see Data Warehousing as a technical challenge, rather than a business opportunity, but the biggest performance payoffs will be brought about when there is a full understanding of how the user wishes to use the information. 2
  • 3. Data Warehousing: A Perspective by Hemant Kirpekar 7/15/2011 The Key Issues Organizations are swimming in data. However, most will have to create new data with improved quality, to meet strategic business planning requirements. So: How should IS plan for the mass of end user information demand? What vendors and tools will emerge to help IS build and maintain a data warehouse architecture? What strategies can users deploy to develop a successful data warehouse architecture ? What technology breakthroughs will occur to empower knowledge workers and reduce operational data access requirements? These are some of the key questions outlined by the Gartner Group in their 1995 report on Data Warehousing. I will try to answer some of these questions in this report. The Definition a Data Warehouse A Data Warehouse is a: . subject-oriented . integrated .time-variant . non-volatile collection of data in support of management decisions. (W.H. Inmon, in "Building a Data Warehouse, Wiley 1996) The data warehouse is oriented to the major subject areas of the corporation that have been defined in the data model. Examples of subject areas are: customer, product, activity, policy, claim, account. The major subject areas end up being physically implemented as a series of related tables in the data warehouse. Personal Note: Could these be objects? No one to my knowledge has explored this possibility as yet. The second salient characteristic of the data warehouse is that it is integrated. This is the most important aspect of a data warehouse. The different design decisions that the application designers have made over the years show up in a thousand different ways. Generally, there is no application consistency in encoding, naming conventions, physical attributes, measurements of attributes, key structure and physical characteristics of the data. Each application has been most likely been designed independently. 
As data is entered into the data warehouse, inconsistencies at the application level are undone. The third salient characteristic of the data warehouse is that it is time-variant. A 5 to 10 year time horizon of data is normal for the data warehouse. Data warehouse data is a sophisticated series of snapshots, each taken at one moment in time, and the key structure always contains some time element. The last important characteristic of the data warehouse is that it is nonvolatile. Unlike operational data, warehouse data is loaded en masse and is then accessed. Update of the data does not occur in the data warehouse environment.
The Lifecycle of a Data Warehouse

Data flows into the data warehouse from the operational environment. Usually a significant amount of transformation of data occurs at the passage from the operational level to the data warehouse level. Once the data ages, it passes from current detail to older detail. As the data is summarized, it passes from current detail to lightly summarized data and then on to highly summarized data. At some point in time data is purged from the warehouse. There are several ways in which this can be made to happen:

- Data is added to a rolling summary file where the detail is lost.
- Data is transferred to a bulk medium from a high-performance medium such as DASD.
- Data is transferred from one level of the architecture to another.
- Data is actually purged from the system at the DBA's request.

[Diagram: "Structure of a Data Warehouse", from "Building a Data Warehouse" 2nd Ed, by W.H. Inmon, Wiley '96. Operational data passes through transformation into current detail (sales detail, 1990-1991), ages into old detail (sales detail, '84-'89), and is summarized upward into lightly summarized data (weekly sales by subproduct line, '84-'92; the data mart) and highly summarized data (monthly sales by product line, '81-'92), with metadata describing each level.]
The Goals of a Data Warehouse

According to Ralph Kimball (founder of Red Brick Systems, a highly successful Data Warehouse DBMS startup), the goals of a Data Warehouse are:

1. The data warehouse provides access to corporate or organizational data. Access means several things. Managers and analysts must be able to connect to the data warehouse from their personal computers, and this connection must be immediate, on demand, and with high performance. The tiniest queries must run in less than a second. The tools available must be easy to use, i.e. useful reports can be run with a one-button click and can be changed and rerun with two button clicks.

2. The data in the warehouse is consistent. Consistency means that when two people request sales figures for the Southeast Region for January, they get the same number. Consistency means that when they ask for the definition of the "sales" data element, they get a useful answer that lets them know what they are fetching. Consistency also means that if yesterday's data has not been completely loaded, the analyst is warned that the data load is not complete and will not be complete until tomorrow.

3. The data in the warehouse can be combined by every possible measure of the business (i.e. sliced and diced). This implies a very different organization from the E/R organization of typical Operational Data. This implies row headers and constraints, i.e. dimensions in a dimensional data model.

4. The data warehouse is not just data, but is also a set of tools to query, analyze, and present information. The "back room" components, namely the hardware, the relational database software and the data itself, are only about 60% of what is needed for a successful data warehouse implementation. The remaining 40% is the set of front-end tools that query, analyze and present the data. The "show me what is important" requirement needs all of these components.

5. The data warehouse is where used data is published. Data is not simply accumulated at a central point and let loose. It is assembled from a variety of information sources in the organization, cleaned up, quality assured, and then released only if it is fit for use. A data quality manager is critical for a data warehouse and plays a role similar to that of a magazine editor or a book publisher. He/she is responsible for the content and quality of the publication and is identified with the deliverable.

6. The quality of the data in the data warehouse is the driver of business reengineering. The best data in any company is the record of how much money someone else owes the company. Data quality goes downhill from there. The data warehouse cannot fix poor quality data, but the inability of a data warehouse to be effective with poor quality data is the best driver for business reengineering efforts in an organization.
Why Data Warehousing is different from OLTP

On-line transaction processing is profoundly different from data warehousing. The users are different, the data content is different, the data structures are different, the hardware is different, the software is different, the administration is different, the management of the systems is different, and the daily rhythms are different. The design techniques and design instincts appropriate for transaction processing are inappropriate and even destructive for information warehousing.

OLTP Transactional Properties

In OLTP a transaction is defined by its ACID properties. A transaction is a user-defined sequence of instructions that maintains consistency across a persistent set of values. It is a sequence of operations that is atomic with respect to recovery. To remain valid, a transaction must maintain its ACID properties:

- Atomicity states that for a transaction to be valid, either the effects of all its instructions must be enforced or none at all.
- Consistency is a property of the persistent data that must be preserved by the execution of a complete transaction.
- Isolation states that the effect of running transactions concurrently must be that of serializability, i.e. as if each of the transactions were run in isolation.
- Durability is the ability of a transaction to preserve its effects once it has committed, in the presence of media and system failures.

A serious data warehouse will often process only one transaction per day, but this transaction will contain thousands or even millions of records. This kind of transaction has a special name in data warehousing. It is called a production data load. In a data warehouse, consistency is measured globally. We do not care about an individual transaction, but we care enormously that the current load of new data is a full and consistent set of data.
What we care about is the consistent state of the system we started with before the production data load, and the consistent state of the system we ended up with after a successful production data load. The most practical frequency of this production data load is once per day, usually in the early hours of the morning. So, instead of a microscopic perspective, we have a quality assurance manager's judgment of data consistency.

OLTP systems are driven by performance and reliability concerns. Users of a data warehouse almost never deal with one account at a time, usually requiring hundreds or thousands of records to be searched and compressed into a small answer set. Users of a data warehouse constantly change the kinds of questions they ask. Although the templates of their requests may be similar, the impact of these queries on the database system will vary wildly. Small single-table queries, called browses, need to be instantaneous, whereas large multitable queries, called join queries, are expected to run for seconds or minutes. Reporting is the primary activity in a data warehouse. Users consume information in human-sized chunks of one or two pages. Blinking numbers on a page can be clicked on to answer "why" questions. In the report below, the negative values (shown in angle brackets) are blinking numbers.
Example of a Data Warehouse Report

Product  Region   Sales      Growth in   Sales as    Change in     Change in Sales
                  This       Sales vs    % of        Sales as %    as % of Cat. YTD
                  Month      Last Month  Category    of Cat. vs    vs Last Yr YTD
                                                     Last Mt.
Framis   Central  110        12%         31%         3%            7%
Framis   Eastern  179        <3%>        28%         <1%>          3%
Framis   Western  55         5%          44%         1%            5%
Total Framis      344        6%          33%         1%            5%
Widget   Central  66         2%          18%         2%            10%
Widget   Eastern  102        4%          12%         5%            13%
Widget   Western  39         <9%>        9%          <1%>          8%
Total Widget      207        1%          13%         4%            11%
Grand Total       551        4%          20%         2%            8%

The twinkling nature of OLTP databases (constant updates of new values) is the first kind of temporal inconsistency that we avoid in data warehouses. The second kind of temporal inconsistency in an OLTP database is the lack of explicit support for correctly representing prior history. Although it is possible to keep history in an OLTP system, it is a major burden on that system to correctly depict old history. We have a long series of transactions that incrementally alter history, and it is close to impossible to quickly reconstruct the snapshot of the business at a specified point in time. We make the data warehouse a specific time series. We move snapshots of the OLTP systems over to the data warehouse as a series of data layers, like geologic layers. By bringing static snapshots to the warehouse on a regular basis, we solve both of the time representation problems we had on the OLTP system. There are no updates during the day, so no twinkling. By storing snapshots, we represent prior points in time correctly. This allows us to ask comparative queries easily. The snapshot is called the production data extract, and we migrate this extract to the data warehouse system at regular time intervals. This process gives rise to the two phases of the data warehouse: loading and querying.
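A minimal sketch of the loading phase, using SQLite as a stand-in warehouse (the table, its columns, and the sample rows are all hypothetical, not from the report). The point is that the nightly production data load is a single atomic bulk transaction, in contrast to OLTP's many tiny transactions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (time_key INT, product_key INT, units INT)")

# Today's production data extract -- thousands or millions of rows in practice.
todays_extract = [(1, 101, 5), (1, 102, 3), (1, 103, 7)]

try:
    with conn:  # one transaction: all rows commit together, or none do
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", todays_extract)
except sqlite3.Error:
    pass  # a failed load leaves the warehouse in its prior consistent state

row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(row_count)  # 3
```

If any insert fails, the `with conn:` block rolls the whole load back, which is exactly the global (rather than per-record) notion of consistency described above.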
E/R Modeling Vs Dimension Tables

Entity/Relationship modeling seeks to drive all the redundancy out of the data. If there is no redundancy in the data, then a transaction that changes any data only needs to touch the database in one place. This is the secret behind the phenomenal improvement in transaction processing speed since the early 80s. E/R modeling works by dividing the data into many discrete entities, each of which becomes a table in the OLTP database. A simple E/R diagram looks like the map of a large metropolitan area, where the entities are the cities and the relationships are the connecting freeways. This diagram is very symmetric. For queries that span many records or many tables, E/R diagrams are too complex for users to understand and too complex for software to navigate. SO, E/R MODELS CANNOT BE USED AS THE BASIS FOR ENTERPRISE DATA WAREHOUSES.

In data warehousing, 80% of the queries are single-table browses, and 20% are multitable joins. This allows for a tremendously simple data structure. This structure is the dimensional model or the star join schema. This name is chosen because the diagram looks like a star, with one large central table called the fact table and a set of smaller attendant tables called dimension tables displayed in a radial pattern around the fact table. This structure is very asymmetric. The fact table is the only table in the schema that participates in multiple joins with the dimension tables. The dimension tables all have a single join to this central fact table.

[Diagram: a typical dimensional model — the star schema for a typical grocery store chain. A central Sales Fact table (time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost) joins to a Time Dimension (time_key, day_of_week, month, quarter, year, holiday_flag), a Product Dimension (product_key, description, brand, category) and a Store Dimension (store_key, store_name, address, floor_plan_type).]

The above is an example of a star schema for a typical grocery store chain.
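The star schema above can be sketched as DDL. This is a hypothetical rendering in SQLite; the column types and the composite primary key of concatenated foreign keys are my assumptions about a reasonable physical design, not the report's:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, day_of_week TEXT,
                          month TEXT, quarter TEXT, year INTEGER, holiday_flag TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT,
                          brand TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT,
                          address TEXT, floor_plan_type TEXT);
-- The fact table's key is a composite of foreign keys into each dimension;
-- each dimension table has exactly one join path to the fact table.
CREATE TABLE sales_fact  (time_key INTEGER REFERENCES time_dim,
                          product_key INTEGER REFERENCES product_dim,
                          store_key INTEGER REFERENCES store_dim,
                          dollars_sold REAL, units_sold INTEGER, dollars_cost REAL,
                          PRIMARY KEY (time_key, product_key, store_key));
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['product_dim', 'sales_fact', 'store_dim', 'time_dim']
```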
The Sales Fact table contains daily item totals of all the products sold. This is called the grain of the fact table. Each record in the fact table represents the total sales of a specific product in a market on a day. Any other combination generates a different record in the fact table. The fact table of a typical grocery retailer with 500 stores, each carrying 50,000 products on the shelves and measuring daily item movement over 2 years, could approach 1 billion rows. However, using a high-performance server and an industrial-strength dbms we can store and query such a large fact table with good performance.
The fact table is where the numerical measurements of the business are stored. These measurements are taken at the intersection of all the dimensions. The best and most useful facts are continuously valued and additive. If there is no product activity on a given day in a market, we leave the record out of the database. Fact tables are therefore always sparse. Fact tables can also contain semiadditive facts, which can be added only along some of the dimensions, and nonadditive facts, which cannot be added at all. In a table with billions of records, about the only useful operation on a nonadditive fact is to count it.

The dimension tables are where the textual descriptions of the dimensions of the business are stored. Here the best attributes are textual, discrete and used as the source of constraints and row headers in the user's answer set. Typical attributes for a product would include a short description (10 to 15 characters), a long description (30 to 60 characters), the brand name, the category name, the packaging type, and the size. Occasionally, it may be possible to model an attribute either as a fact or as a dimension. In such a case it is the designer's choice. A key role for dimension table attributes is to serve as the source of constraints in a query or as row headers in the user's answer set. For example:
Brand     Dollar Sales   Unit Sales
Axon      780            263
Framis    1044           509
Widget    213            444
Zapper    95             39

A standard SQL query example for data warehousing could be:

select p.brand, sum(f.dollars), sum(f.units)    <=== select list
from salesfact f, product p, time t             <=== from clause with aliases f, p, t
where f.timekey = t.timekey                     <=== join constraint
and f.productkey = p.productkey                 <=== join constraint
and t.quarter = '1 Q 1995'                      <=== application constraint
group by p.brand                                <=== group by clause
order by p.brand                                <=== order by clause

Virtually every query like this one contains row headers and aggregated facts in the select list. The row headers are not summed; the aggregated facts are. The from clause lists the tables involved in the join. The join constraints join the primary key from the dimension table to the foreign key in the fact table. Referential integrity is extremely important in data warehousing and is enforced by the database management system. The fact table key is a composite key consisting of concatenated foreign keys. In OLTP applications, joins are usually among artificially generated numeric keys that have little administrative significance elsewhere in the company. In data warehousing, one job function maintains the master product file and oversees the generation of new product keys, and another job function makes sure that every sales record contains valid product keys. These joins are therefore called MIS joins.
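The annotated query above can be run end to end against a toy SQLite database. The table and column names follow the query; the rows I load are made up for illustration and do not come from the report:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product   (productkey INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE time      (timekey INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE salesfact (timekey INT, productkey INT, dollars REAL, units INT);
INSERT INTO product VALUES (1, 'Axon'), (2, 'Framis');
INSERT INTO time VALUES (10, '1 Q 1995'), (11, '2 Q 1995');
INSERT INTO salesfact VALUES (10, 1, 780, 263), (10, 2, 1044, 509), (11, 1, 99, 9);
""")
rows = conn.execute("""
    SELECT p.brand, SUM(f.dollars), SUM(f.units)    -- select list
    FROM salesfact f, product p, time t             -- from clause with aliases
    WHERE f.timekey = t.timekey                     -- join constraint
      AND f.productkey = p.productkey               -- join constraint
      AND t.quarter = '1 Q 1995'                    -- application constraint
    GROUP BY p.brand
    ORDER BY p.brand
""").fetchall()
print(rows)  # [('Axon', 780.0, 263), ('Framis', 1044.0, 509)]
```

Note how the row header (brand) is not summed while the facts (dollars, units) are, and how the application constraint on the time dimension restricts which fact records participate.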
Application constraints apply to individual dimension tables. By browsing the dimension tables, the user specifies application constraints. It rarely makes sense to apply an application constraint simultaneously across two dimensions, thereby linking the two dimensions. The dimensions are linked only through the fact table. It is possible to apply an application constraint directly to a fact in the fact table. This can be thought of as a filter on the records that would otherwise be retrieved by the rest of the query. The group by clause groups records by the row headers. The order by clause determines the sort order of the answer set when it is presented to the user.

From a performance viewpoint, then, the SQL query should be evaluated as follows. First, the application constraints are evaluated dimension by dimension. Each dimension thus produces a set of candidate keys. The candidate keys are then assembled from each dimension into trial composite keys to be searched for in the fact table. All the "hits" in the fact table are then grouped and summed according to the specifications in the select list and group by clause.

The Role of Attributes in Data Warehousing

Attributes are the drivers of the Data Warehouse. The user begins by placing application constraints on the dimensions through the process of browsing the dimension tables one at a time. The browse queries always run against single-dimension tables and are usually fast acting and lightweight. Browsing allows the user to assemble the correct constraints on each dimension. The user launches several queries in this phase. The user also drags row headers from the dimension tables and additive facts from the fact table to the answer staging area (the report). The user then launches a multitable join.
Finally, the dbms groups and summarizes millions of low-level records from the fact table into the small answer set and returns the answer to the user.

Two Sample Data Warehouse Designs

Designing a Product-Oriented Data Warehouse

[Diagram: the Grocery Store Schema — a central Sales Fact table (time_key, product_key, store_key, promotion_key, dollar_sales, unit_sales, dollar_cost, customer_count) surrounded by a Product Dimension (product_key, SKU_no, SKU_desc, other product attributes), a Time Dimension (time_key, day_of_week, day_no_in_month, other time dimension attributes), a Promotion Dimension (promotion_key, promotion_name, price_reduction_type, other promotion attributes) and a Store Dimension (store_key, store_name, store_number, store_addr, other store attributes).]
Background

The above schema is for a grocery chain with 500 large grocery stores spread over a three-state area. Each store has a full complement of departments, including grocery, frozen foods, dairy, meat, produce, bakery, floral, hard goods, liquor and drugs. Each store has about 60,000 individual products on its shelves. The individual products are called Stock Keeping Units, or SKUs. About 40,000 of the SKUs come from outside manufacturers and have bar codes imprinted on the product package. These bar codes, called Universal Product Codes or UPCs, are at the same grain as individual SKUs. The remaining 20,000 SKUs come from departments like the meat, produce, bakery or floral departments and do not have nationally recognized UPC codes. Management is concerned with the logistics of ordering, stocking the shelves and selling the products, as well as maximizing the profit at each store. The most significant management decisions have to do with pricing and promotions. Promotions include temporary price reductions, ads in newspapers, displays in the grocery store (including shelf displays and end-aisle displays) and coupons.

Identifying the Processes to Model

The first step in the design is to decide what business processes to model, by combining an understanding of the business with an understanding of what data is available. The second step is to decide on the grain of the fact table in each business process. A data warehouse always demands data expressed at the lowest possible grain of each dimension, not so that queries see individual low-level records, but so that queries can cut through the database in very precise ways. The best grain for the grocery store data warehouse is daily item movement, or SKU by store by promotion by day.

Dimension Table Modeling

A careful grain statement determines the primary dimensionality of the fact table.
It is then possible to add additional dimensions to the basic grain of the fact table, where these additional dimensions naturally take on only a single value under each combination of the primary dimensions. If it is recognized that an additional desired dimension violates the grain by causing additional records to be generated, then the grain statement must be revised to accommodate this dimension. The grain of the grocery store table allows the primary dimensions of time, product and store to fall out immediately.

Most data warehouses need an explicit time dimension table even though the primary time key may be an SQL date-valued object. The explicit time dimension table is needed to describe fiscal periods, seasons, holidays, weekends and other calendar calculations that are difficult to get from the SQL date machinery. Time is usually the first dimension in the underlying sort order in the database, because when it is first in the sort order, the successive loading of time intervals of data will load data into virgin territory on the disk.

The product dimension is one of the two or three primary dimensions in nearly every data warehouse. This type of dimension has a great many attributes; it can easily exceed 50. The other two dimensions are an artifact of the grocery store example. A note of caution:
[Diagram: a snowflaked product dimension — the Product Dimension (product_key, SKU_desc, SKU_number, package_size_key, package_type, diet_type, weight, weight_unit_of_measure, units_per_retail_case, etc.) with its attributes extracted into separate outrigger tables: brand (brand_key, brand, subcategory_key), subcategory (subcategory_key, subcategory, category_key), category (category_key, category, department_key), department (department_key, department), package_size (package_size_key, package_size), storage_type (storage_type_key, storage_type, shelf_life_type_key) and shelf_life_type (shelf_life_type_key, shelf_life_type).]

Browsing is the act of navigating around in a dimension, either to gain an intuitive understanding of how the various attributes correlate with each other or to build a constraint on the dimension as a whole. If a large product dimension table is split apart into a snowflake, and robust browsing is attempted among widely separated attributes, possibly lying along various tree structures, it is inevitable that browsing performance will be compromised.
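A browse, as described above, is a lightweight query against a single (unsnowflaked) dimension table, typically used to build a pick list of constraint values. A hypothetical sketch in SQLite, with made-up attribute values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY,
                          brand TEXT, category TEXT, package_type TEXT);
INSERT INTO product_dim VALUES
  (1, 'Axon', 'Snacks', 'Bag'), (2, 'Framis', 'Snacks', 'Box'),
  (3, 'Widget', 'Dairy', 'Carton');
""")
# Browse: which package types occur within the Snacks category?
# One flat table, one query -- no outrigger joins needed.
picks = [r[0] for r in conn.execute(
    "SELECT DISTINCT package_type FROM product_dim "
    "WHERE category = 'Snacks' ORDER BY package_type")]
print(picks)  # ['Bag', 'Box']
```

In a snowflaked design the same browse would require joining product to package-type and category outriggers, which is exactly the performance cost the text warns about.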
Fact Table Modeling

The sales fact table records only the SKUs actually sold. No record is kept of the SKUs that did not sell. (Some applications require these records as well; the fact tables are then termed "factless" fact tables.) The customer count, because it is additive across three of the dimensions but not the fourth, is called semiadditive. Any analysis using the customer count must be restricted to a single product key to be valid. The application must group line items together and find those groups where the desired products coexist. This can be done with the COUNT DISTINCT operator in SQL. A different solution is to store brand, subcategory, category, department and all-merchandise customer counts in explicitly stored aggregates. This is an important technique in data warehousing that I will not cover in this report. Finally, drilling down in a data warehouse is nothing more than adding row headers from the dimension tables. Drilling up is subtracting row headers. An explicit hierarchy is not needed to support drilling down.

Database Sizing for the Grocery Chain

The fact table is overwhelmingly large. The dimension tables are geometrically smaller, so all realistic estimates of the disk space needed for the warehouse can ignore the dimension tables. The fact table in a dimensional schema should be highly normalized, whereas efforts to normalize any of the dimension tables are a waste of time. If we normalize them by extracting repeating data elements into separate "outrigger" tables, we make browsing and pick list generation difficult or impossible.

Time dimension: 2 years X 365 days = 730 days
Store dimension: 300 stores, reporting sales each day
Product dimension: 30,000 products in each store, of which 3,000 sell each day in a given store
Promotion dimension: a sold item appears in only one promotion condition in a store on a day
Number of base fact records = 730 X 300 X 3,000 X 1 = 657 million records
Number of key fields = 4; Number of fact fields = 4; Total fields = 8
Base fact table size = 657 million X 8 fields X 4 bytes = 21 GB
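The sizing arithmetic above generalizes to a small helper. The 4-bytes-per-field assumption is the report's own; the function name is mine:

```python
# Fact table sizing: records x (key fields + fact fields) x bytes per field.
def fact_table_size(records: int, key_fields: int, fact_fields: int,
                    bytes_per_field: int = 4) -> tuple[int, float]:
    """Return (total fields per record, table size in GB)."""
    fields = key_fields + fact_fields
    return fields, records * fields * bytes_per_field / 1e9

# Grocery chain: days x stores x products sold per store-day (x 1 promotion).
records = 730 * 300 * 3000
fields, gb = fact_table_size(records, key_fields=4, fact_fields=4)
print(records, fields, round(gb))  # 657000000 8 21
```

The same helper reproduces the insurance sizings later in the report, e.g. `fact_table_size(720_000_000, 8, 1)` gives roughly 26 GB for the policy transaction fact table.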
Designing a Customer-Oriented Data Warehouse

I will outline an insurance application as an example of a customer-oriented data warehouse. In this example the insurance company is a $3 billion property and casualty insurer for automobiles, home fire protection, and personal liability. There are two main production data sources: all transactions relating to the formulation of policies, and all transactions involved in processing claims. The insurance company wants to analyze both the written policies and the claims. It wants to see which coverages are most profitable and which are least profitable. It wants to measure profits over time by covered item type (i.e. kinds of cars and kinds of houses), state, county, demographic profile, underwriter, sales broker and sales region, and events. Both revenues and costs need to be identified and tracked. The company wants to understand what happens during the life of a policy, especially when a claim is processed.
The following four schemas outline the star schema for the insurance application:

[Diagram: the Claims Transaction Schema — a claims transaction fact table (date_key for transaction_date and effective_date, insured_party_key, employee_key, coverage_key, covered_item_key, policy_key, claimant_key, claim_key, third_party_key, transaction_key, amount) joined to a date dimension (day_of_week, fiscal_period), insured party (name, address, type, demographic attributes), employee (name, employee_type, department), coverage (coverage_desc, market_segment, line_of_business, annual_statement_line, automobile attributes), covered item (covered_item_desc, covered_item_type, automobile attributes), policy (risk_grade), claimant (claimant_name, claimant_address, claimant_type), claim (claim_desc, claim_type, automobile attributes), third party (third_party_name, third_party_addr, third_party_type) and transaction (transaction_description, reason) dimensions.]

[Diagram: the Policy Transaction Schema — a policy transaction fact table (date_key for transaction_date and effective_date, insured_party_key, employee_key, coverage_key, covered_item_key, policy_key, transaction_key, amount) joined to the corresponding date, insured party, employee, coverage, covered item, policy and transaction dimensions.]
[Diagram: the Policy Snapshot Schema — a monthly policy snapshot fact table (date_key for snapshot_date and effective_date, insured_party_key, agent_key, coverage_key, covered_item_key, policy_key, status_key, written_premium, earned_premium, primary_limit, primary_deductible, number_transactions, automobile facts) joined to date (fiscal_period), insured party, agent (agent_name, agent_location, agent_type), coverage, covered item, policy and status (status_description) dimensions.]

[Diagram: the Claims Snapshot Schema — a monthly claims snapshot fact table (date_key for transaction_date and effective_date, insured_party_key, agent_key, coverage_key, covered_item_key, policy_key, claim_key, status_key, reserve_amount, paid_this_month, received_this_month, number_transactions, automobile facts) joined to date, insured party, agent, coverage, covered item, policy, claim (claim_desc, claim_type, automobile attributes) and status dimensions.]
An appropriate design for a property and casualty insurance data warehouse is a short value chain consisting of policy creation and claims processing, where these two major processes are represented both by transaction fact tables and by monthly snapshot fact tables. This data warehouse will need to represent a number of heterogeneous coverage types with appropriate combinations of core and custom dimension tables and fact tables. The large insured party and covered item dimensions will need to be decomposed into one or more minidimensions, both to provide reasonable browsing performance and to accurately track these slowly changing dimensions.

Database Sizing for the Insurance Application

Policy Transaction Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Number of policy transactions (not claim transactions) per year per policy: 12
Number of years: 3
Other dimensions: 1 for each policy line item transaction
Number of base fact records: 2,000,000 X 10 X 12 X 3 = 720 million records
Number of key fields: 8; Number of fact fields: 1; Total fields: 9
Base fact table size = 720 million X 9 fields X 4 bytes = 26 GB

Claim Transaction Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Yearly percentage of all covered item coverages with a claim: 5%
Number of claim transactions per actual claim: 50
Number of years: 3
Other dimensions: 1 for each claim line item transaction
Number of base fact records: 2,000,000 X 10 X 0.05 X 50 X 3 = 150 million records
Number of key fields: 11; Number of fact fields: 1; Total fields: 12
Base fact table size = 150 million X 12 fields X 4 bytes = 7.2 GB
Policy Snapshot Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Number of years: 3 => 36 months
Other dimensions: 1 for each policy line item snapshot
Number of base fact records: 2,000,000 X 10 X 36 = 720 million records
Number of key fields: 8; Number of fact fields: 5; Total fields: 13
Base fact table size = 720 million X 13 fields X 4 bytes = 37 GB
Total custom policy snapshot fact tables, assuming an average of 5 custom facts: 52 GB

Claim Snapshot Fact Table Sizing
Number of policies: 2,000,000
Number of covered item coverages (line items) per policy: 10
Yearly percentage of all covered item coverages with a claim: 5%
Average length of time that a claim is open: 12 months
Number of years: 3
Other dimensions: 1 for each claim line item snapshot
Number of base fact records: 2,000,000 X 10 X 0.05 X 3 X 12 = 36 million records
Number of key fields: 11; Number of fact fields: 4; Total fields: 15
Base fact table size = 36 million X 15 fields X 4 bytes = 2.2 GB
Total custom claim snapshot fact tables, assuming an average of 5 custom facts: 2.9 GB
Mechanics of the Design

There are nine decision points that need to be resolved for a complete data warehouse design:

1. The processes, and hence the identity of the fact tables
2. The grain of each fact table
3. The dimensions of each fact table
4. The facts, including precalculated facts
5. The dimension attributes, with complete descriptions and proper terminology
6. How to track slowly changing dimensions
7. The aggregations, heterogeneous dimensions, minidimensions, query models and other physical storage decisions
8. The historical duration of the database
9. The urgency with which the data is extracted and loaded into the data warehouse

Interviewing End-Users and DBAs

Interviewing the end users is the most important first step in designing a data warehouse. The interviews accomplish two purposes. First, they give the designers insight into the needs and expectations of the user community. Second, they allow the designers to raise the level of awareness of the forthcoming data warehouse among the end users, and to adjust and correct some of the users' expectations. The DBAs are often the primary experts on the legacy systems that may be used as the sources for the data warehouse. These interviews serve as a reality check on some of the themes that come up in the end-user interviews.

Assembling the team

The entire data warehouse team should be assembled for two to three days to go through the nine decision points. The attendees should be all the people who have an ongoing responsibility for the data warehouse, including DBAs, system administrators, extract programmers, application developers, and support personnel. End users should not attend the design sessions. In the design sessions, the fact tables are identified and their grains chosen. Next the dimension tables are identified by name and their grains chosen.
E/R diagrams are not used to identify the fact tables or their grains. They simply familiarize the staff with the complexities of the data. 19
Choosing Hardware/Software platforms
These choices boil down to two primary concerns:
1. Does the proposed system actually work?
2. Is this a vendor relationship that we want to have for a long time?

Ask the vendor:
1. Can the system query, store, load, index, and alter a billion-row fact table with a dozen dimensions?
2. Can the system rapidly browse a 100,000-row dimension table?

Benchmark the system to simulate fact and dimension table loading, and conduct a query test measuring:
1. Average browse query response time
2. Average browse query delay compared with an unloaded system
3. Ratio between longest and shortest browse query times
4. Average join query response time
5. Average join query delay compared with an unloaded system
6. Ratio between longest and shortest join query times (gives a sense of the stability of the optimizer)
7. Total number of query suites processed per hour

Handling Aggregates
An aggregate is a fact table record representing a summarization of base-level fact table records. An aggregate fact table record is always associated with one or more aggregate dimension table records. Any dimension attribute that remains unchanged in the aggregate dimension table can be used more efficiently in the aggregate schema than in the base-level schema, because it is guaranteed to make sense at the aggregate level. Several different precomputed aggregates will accelerate summarization queries, and the effect on performance is huge: having the right aggregates available yields a ten- to thousand-fold improvement in runtime. DBAs should spend time watching what the users are doing and deciding whether to build more aggregates. Creating aggregates requires a significant administrative effort: whereas the operational production system provides a framework for administering base-level record keys, the data warehouse team must create and maintain the aggregate keys.
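The query test above reduces to simple summary statistics over two sets of timings; a minimal sketch (the timing figures are invented for illustration):

```python
def benchmark_summary(loaded_times, unloaded_times):
    """Summarize a suite of query timings (in seconds), per the test list above."""
    avg_loaded = sum(loaded_times) / len(loaded_times)
    avg_unloaded = sum(unloaded_times) / len(unloaded_times)
    return {
        "avg_response": avg_loaded,
        "avg_delay_vs_unloaded": avg_loaded - avg_unloaded,
        # A large longest-to-shortest ratio suggests an unstable optimizer.
        "longest_to_shortest": max(loaded_times) / min(loaded_times),
    }

# Hypothetical join-query timings under load and on an otherwise idle system
print(benchmark_summary([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]))
# {'avg_response': 4.0, 'avg_delay_vs_unloaded': 2.0, 'longest_to_shortest': 3.0}
```

The same function applies to the browse-query tests; only the input timings differ.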
An aggregate navigator intercepts the end user's SQL query and transforms it to use the best available aggregate. It is an essential component of the data warehouse because it insulates end user applications from the changing portfolio of aggregations, allowing the DBA to adjust the aggregations dynamically without having to roll over the application base. Finally, aggregations provide a home for planning data: aggregations built upward from the base layer coincide with the planning process that creates plans and forecasts at those very same levels.
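The navigator's core decision can be illustrated with a toy sketch that routes a query to the smallest fact table whose grain still contains every dimension the query needs. The table names, grains, and row counts below are invented, and a real navigator would also understand rollup hierarchies (e.g. that months roll up from days):

```python
# (table name, dimensions available at this grain, approximate rows)
AGGREGATES = [
    ("sales_fact_base",        {"day", "product", "store"}, 720_000_000),
    ("sales_agg_month_prod",   {"month", "product"},          2_000_000),
    ("sales_agg_month_region", {"month", "region"},             100_000),
]

def choose_table(query_dims):
    """Pick the smallest table that can answer a query over query_dims."""
    candidates = [(rows, name) for name, dims, rows in AGGREGATES
                  if query_dims <= dims]
    if not candidates:
        raise ValueError("no table can answer this query")
    return min(candidates)[1]   # smallest row count wins

print(choose_table({"month", "product"}))   # sales_agg_month_prod
print(choose_table({"day", "product"}))     # sales_fact_base
```

The point of the sketch is the insulation the text describes: adding or dropping an entry in AGGREGATES changes which table answers a query without touching the application.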
Server-Side activities
In summary, the "back room" or server functions are:
• Build and use the production data extract system
• Perform daily data quality assurance
• Monitor and tune the performance of the data warehouse system
• Perform backup and recovery on the data warehouse
• Communicate with the user community

The daily production extract proceeds in these steps:
1. Primary extraction (read the legacy format)
2. Identify the changed records
3. Generalize keys for changing dimensions
4. Transform the extract into load record images
5. Migrate from the legacy system to the data warehouse system
6. Sort and build aggregates
7. Generalize keys for aggregates
8. Perform loading
9. Process exceptions
10. Perform quality assurance
11. Publish

Additional notes:
• Data extract tools are expensive. It does not make sense to buy them until the extract and transformation requirements are well understood.
• Maintaining comparison copies of production files is a significant application burden that falls uniquely on the data warehouse team.
• To control slowly changing dimensions, the data warehouse team must create an administrative process for issuing a new dimension key each time a trackable change occurs. The two alternatives for administering keys are derived keys and sequentially assigned integer keys.
• Metadata is a loose term for any form of auxiliary data maintained by an application. Metadata is also kept by the aggregate navigator and by front-end query tools. The data warehouse team should carefully document all forms of metadata, and ideally the front-end tools should provide facilities for metadata administration.
• Most of the extraction steps should be handled on the legacy system, since that allows for the biggest reduction in data volumes.
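The sequentially assigned integer key alternative can be sketched as follows; the class and attribute names are illustrative, not from the text:

```python
# Sketch of sequentially assigned surrogate keys for a slowly changing
# dimension: any change to a tracked attribute gets a brand-new key, so
# fact records loaded earlier keep pointing at the old attribute values.
class DimensionKeyManager:
    def __init__(self):
        self.next_key = 1
        self.current = {}          # natural key -> (surrogate key, attributes)

    def key_for(self, natural_key, attributes):
        entry = self.current.get(natural_key)
        if entry and entry[1] == attributes:
            return entry[0]        # unchanged: reuse the existing key
        key = self.next_key        # new or changed: issue a fresh key
        self.next_key += 1
        self.current[natural_key] = (key, attributes)
        return key

mgr = DimensionKeyManager()
k1 = mgr.key_for("CUST-42", {"segment": "retail"})      # first sighting
k2 = mgr.key_for("CUST-42", {"segment": "retail"})      # same attributes
k3 = mgr.key_for("CUST-42", {"segment": "wholesale"})   # trackable change
print(k1, k2, k3)   # 1 1 2
```

A derived-key scheme would instead build the new key from the natural key plus a version suffix; the administrative burden of issuing keys falls on the warehouse team either way.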
A bulk data loader should allow for:
• Parallelization of the bulk data load across a number of processors, in either SMP or MPP environments
• Selectively turning the master index off before, and back on after, bulk loads
• Insert and update modes selectable by the DBA
• Referential integrity handling options
It is a good idea, as mentioned earlier, to think of the load process as one transaction: if the load is corrupted, roll it back and retry the load in the next load window.

Client-Side activities
The client functions can be summarized as follows:
• Build reusable application templates
• Design usable graphical user interfaces
• Train users on both the applications and the data
• Keep the network running efficiently

Additional notes:
Ease of use should be a primary criterion for an end user application tool. The data warehouse should include a library of template applications that run immediately on the user's desktop. These applications should have a limited set of user-selectable alternatives for setting new constraints and for picking new measures; they are, in effect, precanned, parameterized reports.

The query tools should perform comparisons flexibly and immediately. A single row of an answer set should show comparisons over multiple time periods of differing grains (month, quarter, year-to-date), comparisons over other dimensions (the share of a product to its category), and compound comparisons across two or more dimensions (share change this year Vs last year). These comparison alternatives should be available in the form of a pull-down menu. SQL should never be shown.
Presentation should be treated as an activity separate from querying and comparing, and tools that allow answer sets to be transferred easily into multiple presentation environments should be chosen. A report-writing query tool should communicate the context of the report instantly, including the identities of the attributes and the facts as well as any constraints placed by the user. If users wish to edit a column, they should be able to do so directly, and requerying after an edit should at most fetch the data needed to rebuild the edited column. All query tools must have an instant STOP command, and the tool should not tie up the client machine while waiting on data from the server.
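The "share of a product to its category" comparison described above can be sketched in a few lines; the product and sales figures are made up:

```python
# Compute each product's share of its category in one pass over the data.
sales = {("Soft Drinks", "Cola"): 120, ("Soft Drinks", "Root Beer"): 80,
         ("Snacks", "Chips"): 50}

category_totals = {}
for (category, _), amount in sales.items():
    category_totals[category] = category_totals.get(category, 0) + amount

for (category, product), amount in sorted(sales.items()):
    share = 100 * amount / category_totals[category]
    print(f"{product}: {share:.0f}% of {category}")
# Chips: 100% of Snacks
# Cola: 60% of Soft Drinks
# Root Beer: 40% of Soft Drinks
```

In a real tool this computation would run inside the query layer so that the share appears as just another column of the answer set, selected from the comparison menu.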
Conclusions
The data warehousing market is moving quickly as all major DBMS and tool vendors try to satisfy IS needs. The industry needs to be driven by the users, as opposed to by the software/hardware vendors as has been the case up to now.

Software is the key. Although there have been several advances in hardware, such as parallel processing, the main impact will still be felt through software. A few software issues:
• Optimization of the execution of star join queries
• Indexing of dimension tables for browsing and constraining, especially multi-million-row dimension tables
• Indexing of composite keys of fact tables
• Syntax extensions for SQL to handle aggregations and comparisons
• Support for low-level data compression
• Support for parallel processing
• Database design tools for star schemas
• Extract, administration, and QA tools for star schemas
• End user query tools
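As a concrete instance of the star join queries the list above asks optimizers to handle, here is a minimal sketch using SQLite; the schema and data are invented for the example:

```python
import sqlite3

# A tiny star schema: one fact table surrounded by two dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE time_dim    (time_key    INTEGER PRIMARY KEY, month    TEXT);
    CREATE TABLE sales_fact  (product_key INTEGER, time_key INTEGER, dollars REAL);
    INSERT INTO product_dim VALUES (1, 'Snacks'), (2, 'Soft Drinks');
    INSERT INTO time_dim    VALUES (1, '1996-01'), (2, '1996-02');
    INSERT INTO sales_fact  VALUES (1, 1, 100), (2, 1, 250), (2, 2, 300);
""")

# Star join: constrain and group by dimension attributes, sum the facts.
rows = db.execute("""
    SELECT p.category, t.month, SUM(f.dollars)
      FROM sales_fact f
      JOIN product_dim p ON p.product_key = f.product_key
      JOIN time_dim    t ON t.time_key    = f.time_key
     GROUP BY p.category, t.month
     ORDER BY p.category, t.month
""").fetchall()
print(rows)
```

At warehouse scale the fact table has hundreds of millions of rows while the dimensions stay comparatively small, which is exactly why the optimizer's handling of this join shape matters so much.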
A Checklist for an Ideal Data Warehouse
The following checklist is from Ralph Kimball's The Data Warehouse Toolkit, Wiley, 1996.
• Preliminary complete list of affected user groups prior to interviews
• Preliminary complete list of legacy data sources prior to interviews
• Data warehouse implementation team identified
• Data warehouse manager identified
• Interview leader identified
• Extract programming manager identified
• End user groups to be interviewed identified
• Data warehouse kickoff meeting with all affected end user groups
• End user interviews
• Marketing interviews
• Finance interviews
• Logistics interviews
• Field management interviews
• Senior management interviews
• Six-inch stack of existing management reports representing all interviewed groups
• Legacy system DBA interviews
• Copy books obtained for candidate legacy systems
• Data dictionary explaining meaning of each candidate table and field
• High-level description of which tables and fields are populated with quality data
• Interview findings report distributed
• Prioritized information needs as expressed by end user community
• Data audit performed showing what data is available to support information needs
• Data warehousing design meeting
• Major processes identified and fact tables laid out
• Grain for each fact table chosen
• Choice of transaction grain Vs time period accumulating snapshot grain
• Dimensions for each fact table identified
• Facts for each fact table with legacy source fields identified
• Dimension attributes with legacy source fields identified
• Core and custom heterogeneous product tables identified
• Slowly changing dimension attributes identified
• Demographic minidimensions identified
• Initial aggregated dimensions identified
• Duration of each fact table (need to extract old data upfront) identified
• Urgency of each fact table (e.g. need to extract on a daily basis) identified
• Implementation staging (first process to be implemented...)
• Block diagram for production data extract (as each major process is implemented)
• System for reading legacy data
• System for identifying changed records
• System for handling slowly changing dimensions
• System for preparing load record images
• Migration system (mainframe to DBMS server machine)
• System for creating aggregates
• System for loading data, handling exceptions, guaranteeing referential integrity
• System for data quality assurance check
• System for data snapshot backup and recovery
• System for publishing, notifying users of daily data status
• DBMS server hardware
• Vendor sales and support team qualified
• Vendor reference sites contacted and qualified as to relevance
• Vendor on-site test (if no qualified, relevant references available)
• Vendor demonstrates ability to support system startup, backup, debugging
• Open systems and parallel scalability goals met
• Contractual terms approved
• DBMS software
• Vendor sales and support team qualified
• Vendor team has implemented a similar data warehouse
• Vendor team agrees with dimensional approach
• Vendor team demonstrates competence in prototype test
• Ability to load, index and quality assure data volume demonstrated
• Ability to browse large dimension tables demonstrated
• Ability to query family of fact tables from 20 PCs under load demonstrated
• Superior performance and optimizer stability demonstrated for star join queries
• Superior large dimension table browsing demonstrated
• Extended SQL syntax for special data warehouse functions
• Ability to immediately and gracefully stop a query from end user PC
• Extract tools
• Specific need for features of extract tool identified from extract system block diagram
• Alternative of writing home-grown extract system rejected
• Reference sites supplied by vendor qualified for relevance
• Aggregate navigator
• Open system approach of navigator verified (serves all SQL network clients)
• Metadata table administration understood and compared with other navigators
• User query statistics, aggregate recommendations, link to aggregate creation tool
• Subsecond browsing performance with the navigator demonstrated for tiny browses
• Front end tool for delivering parameterized reports
• Saved reports that can be mailed from user to user and run
• Saved constraint definitions that can be reused (public and private)
• Saved behavioral group definitions that can be reused (public and private)
• Dimension table browser with cross attribute subsetting
• Existing report can be opened and run with one button click
• Multiple answer sets can be automatically assembled in tool with outer join
• Direct support for single and multi dimension comparisons
• Direct support for multiple comparisons with different aggregations
• Direct support for average time period calculations (e.g. average daily balance)
• STOP QUERY command
• Extensible interface to HELP allowing warehouse data tables to be described to user
• Simple drill-down command supporting multiple hierarchies and nonhierarchies
• Drill across that allows multiple fact tables to appear in same report
• Correctly calculated break rows
• Red-Green exception highlighting with interface to drill down
• Ability to use network aggregate navigator with every atomic query issued by tool
• Sequential operations on the answer set such as numbering top N, and rolling
• Ability to extend query syntax for DBMS special functions
• Ability to define very large behavioral groups of customers or products
• Ability to graph data or hand off data to third-party graphics package
• Ability to pivot data or to hand off data to third-party pivot package
• Ability to support OLE hot links with other OLE aware applications
• Ability to place answer set in clipboard or TXT file in Lotus or Excel formats
• Ability to print horizontal and vertical tiled report
• Batch operation
• Graphical user interface user development facilities
• Ability to build a startup screen for the end user
• Ability to define pull down menu items
• Ability to define buttons for running reports and invoking the browser
• Consultants
• Consultant team qualified
• Consultant team has implemented a similar data warehouse
• Consultant team agrees with the dimensional approach
• Consultant team demonstrates competence in prototype test
Bibliography
1. Building a Data Warehouse, Second Edition, by W. H. Inmon, Wiley, 1996
2. The Data Warehouse Toolkit, by Ralph Kimball, Wiley, 1996
3. Strategic Database Technology: Management for the Year 2000, by Alan Simon, Morgan Kaufmann, 1995
4. Applied Decision Support, by Michael W. Davis, Prentice Hall, 1988
5. Data Warehousing: Passing Fancy or Strategic Imperative, white paper by the Gartner Group, 1995
6. Knowledge Asset Management and Corporate Memory, white paper by the Hispacom Group, to be published in Aug 1996

The End