Chapter 1: Data Warehousing
1. Basic concepts of data warehousing
2. Data warehouse architectures
3. Some characteristics of data warehouse data
4. The reconciled data layer
5. Data transformation
6. The derived data layer
7. The user interface

                                                1
                   HCMC UT, 2008
Motivation
     “Modern organization is drowning in data but starving
      for information”.
     Operational processing (transaction processing)
      captures, stores and manipulates data to support daily
      operations.
     Information processing is the analysis of data or other
      forms of information to support decision making.

     A data warehouse can consolidate and integrate
      information from many internal and external sources
      and arrange it in a meaningful format for making
      business decisions.
Chapter 1                                       2
Definition
   Data Warehouse (W.H. Inmon):
      – A subject-oriented, integrated, time-variant,
          non-updatable collection of data used in support of
          management decision-making processes
      –   Subject-oriented: e.g. customers, patients, students,
          products
      –   Integrated: consistent naming conventions, formats,
          encoding structures; from multiple data sources
      –   Time-variant: can study trends and changes
      –   Non-updatable: read-only, periodically refreshed
   Data Warehousing:
      – The process of constructing and using a data
          warehouse
Data Warehouse - Subject-Oriented
   Organized around major subjects, such as
    customer, product, sales.
   Focuses on the modeling and analysis of data for
    decision makers, not on daily operations or
    transaction processing.
   Provides a simple and concise view around
    particular subject issues by excluding data that are
    not useful in the decision support process.
Data Warehouse - Integrated
   Constructed by integrating multiple,
     heterogeneous data sources
      – relational databases, flat files, on-line transaction
        records
   Data  cleaning and data integration techniques are
     applied.
      – Ensure consistency in naming conventions, encoding
        structures, attribute measures, etc. among different data
        sources
               E.g., Hotel price: currency, tax, breakfast covered, etc.
      – When data is moved to the warehouse, it is converted.
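
The hotel-price example can be sketched in code. This is only an illustration: the two source systems, the field names, the exchange rate and the tax rate are all assumed values, not taken from the slides. It shows prices quoted in different currencies and tax conventions being converted to one warehouse convention.

```python
# Hypothetical source records: each system quotes hotel prices differently.
SOURCE_A = [{"hotel": "Rex", "price": 100.0, "currency": "USD", "tax_included": True}]
SOURCE_B = [{"hotel": "Rex", "price": 2_000_000.0, "currency": "VND", "tax_included": False}]

# Assumed conversion constants, for illustration only.
RATE_TO_USD = {"USD": 1.0, "VND": 0.00005}
TAX_RATE = 0.10


def to_warehouse_format(record):
    """Convert one source record to the warehouse convention:
    price in USD with tax included."""
    price_usd = record["price"] * RATE_TO_USD[record["currency"]]
    if not record["tax_included"]:
        price_usd *= 1 + TAX_RATE
    return {"hotel": record["hotel"], "price_usd_incl_tax": round(price_usd, 2)}


integrated = [to_warehouse_format(r) for r in SOURCE_A + SOURCE_B]
```

After conversion, both records use the same attribute name, currency and tax convention, so they can be compared directly.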


Data Warehouse - Time-Variant
   The time horizon for the data warehouse is
     significantly longer than that of operational
     systems.
      – Operational database: current value data.
      – Data warehouse data: provide information from a
        historical perspective (e.g., past 5-10 years)
   Every key structure in the data warehouse contains
     an element of time, explicitly or implicitly.
      – The key of operational data may or may not contain
        a time element.
Data Warehouse - Non-Updatable

   A physically separate store of data
    transformed from the operational
    environment.
   Operational update of data does not occur in
    the data warehouse environment.
      – Does not require transaction processing,
        recovery, and concurrency control mechanisms.
      – Requires only two data access operations:
        initial loading of data and access of data.
Need for Data Warehousing
     Integrated, company-wide view of high-quality
      information (from disparate databases)
     Separation of operational and informational systems and
      data (for improved performance)
        Table 11-1: comparison of operational and informational systems




Need to separate operational and information systems
  Three primary factors:
      – A data warehouse centralizes data that are scattered
        throughout disparate operational systems and
        makes them available for decision support.
      – A well-designed data warehouse adds value to data
        by improving their quality and consistency.
      – A separate data warehouse eliminates much of the
        contention for resources that results when
        information applications are mixed with
        operational processing.
Data Warehouse Architectures
      1.Generic Two-Level Architecture
      2.Independent Data Mart
      3.Dependent Data Mart and Operational
       Data Store
      4.Logical Data Mart and @ctive Warehouse
      5.Three-Layer architecture

All involve some form of extraction, transformation and loading (ETL)
Figure 11-2: Generic two-level architecture
   E (extract), T (transform), L (load) into one company-wide warehouse
   Periodic extraction → data is not completely current in warehouse
Figure 11-3: Independent Data Mart
   Data marts: mini-warehouses, limited in scope
   E, T, L: separate ETL for each independent data mart
   Data access complexity due to multiple data marts
Independent Data Mart
   Independent data mart: a data mart filled
     with data extracted from the operational
     environment without the benefit of a data
     warehouse.
Figure 11-4: Dependent data mart with operational data store
   E, T, L: single ETL for enterprise data warehouse (EDW)
   Dependent data marts loaded from EDW
   ODS provides option for obtaining current data
   Simpler data access
Dependent data mart - Operational data store
   Dependent data mart: A data mart filled
    exclusively from the enterprise data
    warehouse and its reconciled data.
   Operational data store (ODS): An
    integrated, subject-oriented, updatable,
    current-valued, enterprise-wide, detailed
    database designed to serve operational users
    as they do decision support processing.
Figure 11-5: Logical data mart and @ctive data warehouse
   E, T, L: near real-time ETL for @ctive data warehouse
   ODS and data warehouse are one and the same
   Data marts are NOT separate databases,
    but logical views of the data warehouse
   Easier to create new data marts
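
The idea of a logical data mart as a view rather than a separate database can be sketched with Python's built-in sqlite3 module. The table, view and column names are hypothetical and SQLite only stands in for a real warehouse DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO warehouse_sales VALUES
        ('North', 'Widget', 120.0),
        ('South', 'Widget', 80.0),
        ('North', 'Gadget', 50.0);
    -- The "data mart" is only a view over the warehouse table.
    CREATE VIEW north_sales_mart AS
        SELECT product, amount FROM warehouse_sales WHERE region = 'North';
""")

MART_QUERY = "SELECT product, amount FROM north_sales_mart ORDER BY product"
mart_rows = conn.execute(MART_QUERY).fetchall()

# A new warehouse row is visible through the mart at once; no separate load runs.
conn.execute("INSERT INTO warehouse_sales VALUES ('North', 'Gizmo', 30.0)")
mart_rows_after = conn.execute(MART_QUERY).fetchall()
```

Creating another mart under this approach means defining another view, which is why new data marts are easier to create.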
@ctive data warehouse

   @ctive data warehouse: An enterprise data
     warehouse that accepts near-real-time feeds of
     transactional data from the systems of record,
     analyzes warehouse data, and in near-real-time
     relays business rules to the data warehouse and
     systems of record so that immediate actions can be
     taken in response to business events.
Table 11-2: Data Warehouse vs. Data Mart




    Source: adapted from Strange (1997).

Figure 11-6: Three-layer architecture




Three-layer architecture: reconciled and derived data
   Reconciled data: detailed, current data
    intended to be the single, authoritative
    source for all decision support.
   Derived data: data that have been selected,
    formatted, and aggregated for end-user
    decision support applications.
   Metadata: technical and business data that
    describe the properties or characteristics of
    other data.
Data Characteristics: Status vs. Event Data
Figure 11-7: Example of DBMS log entry
   Status → Event → Status
   Event = a database action (create/update/delete) that
    results from a transaction
Data Characteristics: Transient vs. Periodic Data
Figure 11-8: Transient operational data
   Changes to existing records are written over previous
    records, thus destroying the previous data content
Data Characteristics: Transient vs. Periodic Data
Figure 11-9: Periodic warehouse data
   Data are never physically altered or deleted once they
    have been added to the store
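
The difference between transient and periodic data can be sketched as follows. The customer key, balances and dates are made-up values, and a plain Python list stands in for the warehouse store.

```python
from datetime import date

# Transient store: an update overwrites the previous value.
transient = {"cust_001": {"balance": 500}}
transient["cust_001"] = {"balance": 750}  # the old balance is lost

# Periodic store: every change is appended with a date; nothing is overwritten.
periodic = []


def record_change(store, key, balance, as_of):
    store.append({"key": key, "balance": balance, "as_of": as_of})


def balance_as_of(store, key, when):
    """Return the most recent balance recorded on or before `when`."""
    rows = [r for r in store if r["key"] == key and r["as_of"] <= when]
    return max(rows, key=lambda r: r["as_of"])["balance"] if rows else None


record_change(periodic, "cust_001", 500, date(2008, 1, 5))
record_change(periodic, "cust_001", 750, date(2008, 2, 5))
```

Only the periodic store can answer a historical question such as "what was the balance at the end of January?".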
Other data warehouse changes
   New descriptive attributes
   New business activity attributes
   New classes of descriptive attributes
   Descriptive attributes become more refined
   Descriptive data are related to one another
   New source of data




Data Reconciliation
 Typical     operational data is:
    – Transient – not historical
    – Not normalized (perhaps due to denormalization for
      performance)
    – Restricted in scope – not comprehensive
    – Sometimes poor quality – inconsistencies and errors
 After     ETL, data should be:
    –   Detailed – not summarized yet
    –   Historical – periodic
    –   Normalized – 3rd normal form or higher
    –   Comprehensive – enterprise-wide perspective
    –   Quality controlled – accurate with full integrity
The ETL Process
  Capture
  Scrub or data cleansing
  Transform
  Load and Index

        ETL = Extract, transform, and load

Figure 11-10: Steps in data reconciliation
   Capture = extract: obtaining a snapshot of a chosen
    subset of the source data for loading into the data
    warehouse
      – Static extract = capturing a snapshot of the source
        data at a point in time
      – Incremental extract = capturing changes that have
        occurred since the last static extract
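
A minimal sketch of the two capture styles, assuming each source row carries a last-modified date. That column is an assumption made for the example; real systems may detect changes from logs or other mechanisms instead.

```python
# Hypothetical source rows; ISO date strings compare correctly as text.
source = [
    {"id": 1, "name": "Ann", "modified": "2008-01-10"},
    {"id": 2, "name": "Bob", "modified": "2008-02-15"},
    {"id": 3, "name": "Cy", "modified": "2008-03-01"},
]


def static_extract(rows):
    """Snapshot of all source rows at this point in time."""
    return [dict(r) for r in rows]


def incremental_extract(rows, last_extract):
    """Only the rows changed after the previous extract."""
    return [dict(r) for r in rows if r["modified"] > last_extract]


snapshot = static_extract(source)
changes = incremental_extract(source, last_extract="2008-02-01")
```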
Figure 11-10: Steps in data reconciliation (continued)
   Scrub = cleanse: uses pattern recognition and AI
    techniques to upgrade data quality
   Fixing errors: misspellings, erroneous dates, incorrect
    field usage, mismatched addresses, missing data,
    duplicate data, inconsistencies
   Also: decoding, reformatting, time stamping, conversion,
    key generation, merging, error detection/logging,
    locating missing data
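
A few of the scrubbing tasks listed above (reformatting, decoding, flagging missing data, removing duplicates) can be sketched with simple rules. The rows and the decoding table are hypothetical, and this rule-based sketch does not attempt the pattern recognition or AI techniques the slide mentions.

```python
import re

# Hypothetical raw customer rows with typical quality problems.
raw = [
    {"name": " alice  SMITH ", "phone": "(028) 555-0101", "city": "HCMC"},
    {"name": "Alice Smith", "phone": "028 555 0101", "city": "Ho Chi Minh City"},
    {"name": "Bob Lee", "phone": "", "city": "Hanoi"},
]

CITY_DECODE = {"HCMC": "Ho Chi Minh City"}  # assumed decoding table


def scrub(row):
    name = " ".join(row["name"].split()).title()      # reformat spacing and case
    phone = re.sub(r"\D", "", row["phone"]) or None   # keep digits; None marks missing
    city = CITY_DECODE.get(row["city"], row["city"])  # decode abbreviations
    return {"name": name, "phone": phone, "city": city}


def deduplicate(rows):
    """Keep the first row for each (name, phone) pair."""
    seen, unique = set(), []
    for r in rows:
        key = (r["name"], r["phone"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


clean = deduplicate([scrub(r) for r in raw])
missing_phone = [r["name"] for r in clean if r["phone"] is None]
```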
Figure 11-10: Steps in data reconciliation (continued)
   Transform = convert data from format of operational
    system to format of data warehouse
   Record-level:
      – Selection – data partitioning
      – Joining – data combining
      – Aggregation – data summarization
   Field-level:
      – Single-field – from one field to one field
      – Multi-field – from many fields to one, or one field to many
Figure 11-10: Steps in data reconciliation (continued)
   Load/Index = place transformed data into the warehouse
    and create indexes
   Refresh mode: bulk rewriting of target data at periodic
    intervals
   Update mode: only changes in source data are written to
    the data warehouse
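
The two load modes can be contrasted in a small sketch. A Python list stands in for the warehouse table, and the rows and the as_of field are made up; update mode is modeled here as appending changed rows, which is one possible reading that keeps earlier rows intact.

```python
def refresh_load(target, source_rows):
    """Refresh mode: rewrite the whole target in bulk."""
    target[:] = [dict(r) for r in source_rows]
    return len(target)


def update_load(target, changed_rows):
    """Update mode: write only the rows that changed in the source."""
    target.extend(dict(r) for r in changed_rows)
    return len(changed_rows)


warehouse = []
written_refresh = refresh_load(warehouse, [
    {"id": 1, "qty": 5, "as_of": "2008-01-31"},
    {"id": 2, "qty": 7, "as_of": "2008-01-31"},
])
written_update = update_load(warehouse, [
    {"id": 2, "qty": 9, "as_of": "2008-02-29"},
])
```

Refresh mode writes every row each time, while update mode writes only the changed row, so its cost follows the volume of change rather than the size of the target.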
Data Transformation
   Data transformation is the component of
    data reconciliation that converts data from
    the format of the source operational systems
    to the format of the enterprise data warehouse.
   Data transformation consists of a variety of
    different functions:
      – record-level functions,
      – field-level functions and
      – more complex transformations.
Record-level functions & Field-level functions
   Record-level functions
     – Selection: data partitioning
     – Joining: data combining
     – Normalization
     – Aggregation: data summarization
   Field-level functions
     – Single-field transformation: from one field to
       one field
     – Multi-field transformation: from many fields to
       one, or one field to many
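
The record-level functions of selection, joining and aggregation can be illustrated on in-memory records. The order and customer data are hypothetical.

```python
from collections import defaultdict

orders = [
    {"order_id": 1, "cust_id": "C1", "region": "North", "amount": 100},
    {"order_id": 2, "cust_id": "C2", "region": "South", "amount": 40},
    {"order_id": 3, "cust_id": "C1", "region": "North", "amount": 60},
]
customers = {"C1": {"name": "Ann"}, "C2": {"name": "Bob"}}

# Selection: partition the data by a criterion.
north_orders = [o for o in orders if o["region"] == "North"]

# Joining: combine data from two sources on a shared key.
joined = [{**o, "name": customers[o["cust_id"]]["name"]} for o in orders]

# Aggregation: summarize detail rows.
total_by_customer = defaultdict(int)
for o in joined:
    total_by_customer[o["name"]] += o["amount"]
```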
Figure 11-11: Single-field transformation
   In general, some transformation function translates
    data from old form to new form
   Algorithmic transformation uses a formula or logical
    expression
   Table lookup is another approach
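
Both single-field approaches can be sketched briefly. The temperature conversion and the state-code table are illustrative choices, not necessarily the ones shown in the figure.

```python
def fahrenheit_to_celsius(temp_f):
    """Algorithmic transformation: a formula maps the old value to the new one."""
    return round((temp_f - 32) * 5 / 9, 1)


# Table lookup: a hypothetical code table maps source codes to target values.
STATE_NAMES = {"CA": "California", "NY": "New York"}


def decode_state(code):
    return STATE_NAMES.get(code, "Unknown")
```

A formula suits values that can be computed, whereas a lookup table suits codes whose meaning has to be enumerated.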
Figure 11-12: Multifield transformation
   M:1 – from many source fields to one target field
   1:M – from one source field to many target fields
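
A short sketch of the two multifield directions. The address fields and the 'BRAND-PRODUCT-SIZE' code layout are assumptions made for the example.

```python
def combine_address(street, city, postal_code):
    """M:1 - several source fields become one target field."""
    return f"{street}, {city} {postal_code}"


def split_product_code(code):
    """1:M - one source field becomes several target fields.
    Assumes a hypothetical 'BRAND-PRODUCT-SIZE' coding scheme."""
    brand, product, size = code.split("-")
    return {"brand": brand, "product": product, "size": size}


address = combine_address("268 Ly Thuong Kiet", "Ho Chi Minh City", "700000")
parts = split_product_code("ACME-WIDGET-XL")
```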
Derived Data
   Objectives
      –   Ease of use for decision support applications
      –   Fast response to predefined user queries
      –   Customized data for particular target audiences
      –   Ad-hoc query support
      –   Data mining capabilities
   Characteristics
      – Detailed (mostly periodic) data
      – Aggregate (for summary)
      – Distributed (to departmental servers)

   Most common data model = star schema
    (also called “dimensional model”)
The Star Schema
   Star schema: a simple database design in
    which dimensional data (describing how data are
    commonly aggregated) are separated from
    fact or event data.
   A star schema consists of two types of
    tables: fact tables and dimension tables.
Figure 11-13: Components of a star schema
   Fact tables contain factual or quantitative data
   Dimension tables contain descriptions about the
    subjects of the business
   1:N relationship between dimension tables and fact tables
   Dimension tables are denormalized to maximize performance
   Excellent for ad-hoc queries, but bad for online
    transaction processing
Figure 11-14: Star schema example

                    Fact table provides statistics for sales
                    broken down by product, period and store
                    dimensions
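
A star schema of this shape (a sales fact table with product, period and store dimensions) can be sketched with Python's sqlite3 module. The column names and rows are hypothetical and are not copied from the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
    CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);
    -- Fact table: one foreign key per dimension plus the numeric measures.
    CREATE TABLE sales (
        product_key  INTEGER REFERENCES product(product_key),
        period_key   INTEGER REFERENCES period(period_key),
        store_key    INTEGER REFERENCES store(store_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );
    INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO period  VALUES (1, 2008, 1), (2, 2008, 2);
    INSERT INTO store   VALUES (1, 'Hanoi'), (2, 'Hue');
    INSERT INTO sales VALUES
        (1, 1, 1, 10, 100.0),
        (1, 2, 1,  5,  50.0),
        (2, 1, 2,  3,  90.0);
""")

# A typical star query: join the fact table to one dimension and aggregate.
dollars_by_product = conn.execute("""
    SELECT p.description, SUM(s.dollars_sold)
    FROM sales s JOIN product p ON p.product_key = s.product_key
    GROUP BY p.description
    ORDER BY p.description
""").fetchall()
```

Each dimension is one join away from the fact table, which keeps ad-hoc queries short.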




Figure 11-15: Star schema with sample data




Issues Regarding Star Schema
 Dimension table keys must be surrogate (non-intelligent
  and non-business related), because:
  – Keys may change over time
  – Length/format consistency
 Granularity of fact table: what level of detail do
  you want?
  –   Transactional grain – finest level
  –   Aggregated grain – more summarized
  –   Finer grain → better market basket analysis capability
  –   Finer grain → more dimension tables, more rows in fact table
 Duration of the database
  – Ex: 13 months or 5 quarters
  – Some businesses need a longer duration.
 Size of the fact table
  – Estimate the number of possible values for each
    dimension associated with the fact table.
  – Multiply the values obtained in the first step
    after making any necessary adjustments.
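
The two-step size estimate can be worked through as arithmetic. The dimension sizes are hypothetical, and the density factor is one assumed way to model the "necessary adjustments" (here, only half of the possible combinations actually occur).

```python
import math


def estimate_fact_rows(dimension_sizes, density=1.0):
    """Multiply the number of values per dimension, then scale by the
    fraction of combinations expected to occur."""
    return int(math.prod(dimension_sizes.values()) * density)


estimated_rows = estimate_fact_rows(
    {"store": 1000, "product": 10000, "period": 24},
    density=0.5,
)
```

With these assumed figures the estimate is 1000 x 10000 x 24 x 0.5 = 120,000,000 rows.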
Figure 11-16: Modeling dates
   Fact tables contain time-period data
    → Date dimensions are important
Variations of the Star Schema
   1. Multiple fact tables
   2. Factless fact tables
   3. Normalizing Dimension Tables
   4. Snowflake schema




Multiple Fact tables
   More  than one fact table in a given star
    schema.
   Ex: There are 2 fact tables, one at the center
    of each star:
      – Sales – facts about the sale of a product to a customer
        in a store on a date.
      – Receipts - facts about the receipt of a product from a
        vendor to a warehouse on a date.
      – Two separate product dimension tables have been
        created.
      – One date dimension table is used.


Factless Fact Tables
   In some applications, fact tables have no
     nonkey data but do have foreign keys for
     the associated dimensions.

   Two situations:
      – To track events
      – To inventory the set of possible occurrences (called
        coverage)
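
A factless fact table that tracks events can be sketched as below. The attendance scenario, table names and rows are hypothetical. Because the table holds only foreign keys, queries count rows instead of summing a measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course (course_key INTEGER PRIMARY KEY, title TEXT);
    -- Factless fact table: foreign keys only, no nonkey measures.
    CREATE TABLE attendance (
        student_key INTEGER,
        course_key  INTEGER REFERENCES course(course_key),
        date_key    INTEGER
    );
    INSERT INTO course VALUES (1, 'Databases'), (2, 'Networks');
    INSERT INTO attendance VALUES
        (101, 1, 20080301),
        (102, 1, 20080301),
        (101, 2, 20080302);
""")

attendance_by_course = conn.execute("""
    SELECT c.title, COUNT(*)
    FROM attendance a JOIN course c ON c.course_key = a.course_key
    GROUP BY c.title
    ORDER BY c.title
""").fetchall()
```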
Factless fact table showing occurrence of an event.
Factless fact table showing coverage
Normalizing dimension tables
   Dimension tables may not be normalized. Most
    data warehouse experts find this acceptable.
   In some situations it makes sense to
    further normalize dimension tables.
   Multivalued dimensions:
      – Ex: Hospital charge/payment for a patient on a date is
        associated with one or more diagnoses.
      – N:M relationship between the Diagnosis dimension and
        the Finances fact table.
      – Solution: create an associative entity (helper table)
        between Diagnosis and Finances.
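
The helper-table solution can be sketched in sqlite3. The key names, patients and charges are hypothetical; the point is that the N:M relationship is resolved by one row per (finances row, diagnosis) pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE diagnosis (diagnosis_key INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE finances  (finances_key INTEGER PRIMARY KEY, patient TEXT, charge REAL);
    -- Helper (associative) table between the fact table and the dimension.
    CREATE TABLE diagnosis_helper (
        finances_key  INTEGER REFERENCES finances(finances_key),
        diagnosis_key INTEGER REFERENCES diagnosis(diagnosis_key),
        PRIMARY KEY (finances_key, diagnosis_key)
    );
    INSERT INTO diagnosis VALUES (1, 'Fracture'), (2, 'Infection');
    INSERT INTO finances  VALUES (10, 'P1', 500.0), (11, 'P2', 200.0);
    INSERT INTO diagnosis_helper VALUES (10, 1), (10, 2), (11, 2);
""")

charges_for_infection = conn.execute("""
    SELECT f.patient, f.charge
    FROM finances f
    JOIN diagnosis_helper h ON h.finances_key = f.finances_key
    JOIN diagnosis d ON d.diagnosis_key = h.diagnosis_key
    WHERE d.description = 'Infection'
    ORDER BY f.patient
""").fetchall()
```

Note that a charge linked to several diagnoses appears once per diagnosis, so summing charges across diagnoses would double count unless the query allows for it.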
Multivalued dimension
Snowflake schema
   Snowflake schema is an expanded version of a star
    schema in which dimension tables are normalized
    into several related tables.
   Advantages
      – Small saving in storage space
      – Normalized structures are easier to update and maintain
   Disadvantages
      – Schema is less intuitive
      – Browsing through the content is more difficult
      – Degraded query performance because of additional
        joins.
Example of snowflake schema

Sales Fact Table
   Keys: time_key, item_key, branch_key, location_key
   Measures: units_sold, dollars_sold, avg_sales
Dimension tables
   time (time_key, day, day_of_the_week, month, quarter, year)
   item (item_key, item_name, brand, type, supplier_key)
   supplier (supplier_key, supplier_type)
   branch (branch_key, branch_name, branch_type)
   location (location_key, street, city_key)
   city (city_key, city, province_or_stre, country)
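
The extra join that a snowflake schema introduces can be seen in a cut-down sketch of the item and supplier tables. Only part of the example schema is reproduced, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
    CREATE TABLE item (
        item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT,
        supplier_key INTEGER REFERENCES supplier(supplier_key)
    );
    CREATE TABLE sales (
        item_key INTEGER REFERENCES item(item_key),
        units_sold INTEGER, dollars_sold REAL
    );
    INSERT INTO supplier VALUES (1, 'wholesale'), (2, 'retail');
    INSERT INTO item VALUES
        (1, 'TV',    'Acme', 'electronics', 1),
        (2, 'Radio', 'Acme', 'electronics', 2),
        (3, 'Lamp',  'Lux',  'home',        1);
    INSERT INTO sales VALUES (1, 2, 800.0), (2, 5, 150.0), (3, 4, 120.0), (1, 1, 400.0);
""")

# Reaching supplier_type takes two joins: sales -> item -> supplier.
dollars_by_supplier_type = conn.execute("""
    SELECT sp.supplier_type, SUM(s.dollars_sold)
    FROM sales s
    JOIN item i      ON i.item_key = s.item_key
    JOIN supplier sp ON sp.supplier_key = i.supplier_key
    GROUP BY sp.supplier_type
    ORDER BY sp.supplier_type
""").fetchall()
```

In a plain star schema supplier_type would be a column of the item dimension and the second join would disappear.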
The User Interface

  A   variety of tools are available to query and
     analyze data stored in data warehouses.
      – 1. Querying tools
      – 2. On-line Analytical processing (OLAP,
        MOLAP, ROLAP) tools
      – 3. Data Mining tools
      – 4. Data Visualization tools


Role of Metadata (data catalog)

 Identify subjects of the data mart
 Identify dimensions and facts
 Indicate how data is derived from enterprise data
  warehouses, including derivation rules
 Indicate how data is derived from operational data store,
  including derivation rules
 Identify available reports and predefined queries
 Identify data analysis techniques (e.g. drill-down)
 Identify responsible people

Querying Tools
   SQL  is not an analytical language
   SQL-99 includes some data warehousing
    extensions
   SQL-99 still is not a full-featured data
    warehouse querying and analysis tool.
   Different DBMS vendors will implement
    some or all of the SQL-99 OLAP extension
    commands and possibly others.

On-Line Analytical Processing (OLAP)
  OLAP    is the use of a set of graphical tools that
   provides users with multidimensional views of
   their data and allows them to analyze the data
   using simple windowing techniques
  Relational OLAP (ROLAP)
     – OLAP tools that view the database as a traditional
        relational database in either a star schema or other
        normalized or denormalized set of tables.
  Multidimensional        OLAP (MOLAP)
     – OLAP tools that load data into an intermediate
        structure, usually a three- or higher-dimensional
        array (cube structure).
From tables to data cubes
  A  data warehouse is based on a multidimensional
    data model which views data in the form of a data
    cube
   A data cube, such as sales, allows data to be
    modeled and viewed in multiple dimensions
      – Dimension tables, such as item (item_name, brand,
        type), or time (day, week, month, quarter, year)
      – Fact table contains measures (such as dollars_sold) and
        keys to each of the related dimension tables



MOLAP Operations

   Roll    up (drill-up): summarize data
      – by climbing up hierarchy or by dimension reduction
   Drill   down (roll down): reverse of roll-up
      – from higher level summary to lower level summary or
        detailed data, or introducing new dimensions
   Slice   and dice:
      – project and select
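
These operations can be sketched on a tiny cube held in a Python dictionary. The dimension names and cell values are hypothetical, and real OLAP tools use specialized storage rather than a dict.

```python
from collections import defaultdict

# Cube cells keyed by (product, city, quarter); values are dollars sold.
DIMS = ("product", "city", "quarter")
cube = {
    ("Widget", "Hanoi", "Q1"): 100,
    ("Widget", "Hanoi", "Q2"): 150,
    ("Widget", "Hue", "Q1"): 80,
    ("Gadget", "Hanoi", "Q1"): 60,
}


def roll_up(cells, keep):
    """Roll up by dimension reduction: sum over every dimension not in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def slice_cube(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}


def dice(cells, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return {
        k: v for k, v in cells.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


by_product = roll_up(cube, ["product"])
by_product_city = roll_up(cube, ["product", "city"])  # drill down: one more dimension
q1_slice = slice_cube(cube, "quarter", "Q1")
widget_hanoi = dice(cube, product={"Widget"}, city={"Hanoi"})
```

Drill-down is shown here as the reverse of roll-up: keeping an additional dimension returns the finer-grained totals.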




Figure 11-22: Slicing a data cube




Figure 11-23: Example of drill-down
   Summary report
   Drill-down with color added
Data Mining
    Data mining is knowledge discovery using a blend of
     statistical, AI, and computer graphics techniques
    Goals:
     – Explain observed events or conditions
     – Confirm hypotheses
     – Explore data for new or unexpected relationships
    Techniques
     –   Case-based reasoning
     –   Rule discovery
     –   Signal processing
     –   Neural nets
     –   Fractals




Data Visualization
 Data  visualization is the representation of data in
   graphical/multimedia formats for human analysis




OLAP tool Vendors

   IBM
   Informix
   Cartelon
   NCR
   Oracle (Oracle Warehouse Builder, Oracle OLAP)
   Red Brick
   Sybase
   SAS
   Microsoft (SQL Server OLAP)
   MicroStrategy Corporation

Weitere ähnliche Inhalte

Was ist angesagt?

Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
vivekjv
 
introduction to datawarehouse
introduction to datawarehouseintroduction to datawarehouse
introduction to datawarehouse
kiran14360
 
The Data Warehouse Lifecycle
The Data Warehouse LifecycleThe Data Warehouse Lifecycle
The Data Warehouse Lifecycle
bartlowe
 
Business analysis in data warehousing
Business analysis in data warehousingBusiness analysis in data warehousing
Business analysis in data warehousing
Himanshu
 
Dwdm 2(data warehouse)
Dwdm 2(data warehouse)Dwdm 2(data warehouse)
Dwdm 2(data warehouse)
Er Bansal
 

Was ist angesagt? (20)

Dbm630_Lecture02-03
Dbm630_Lecture02-03Dbm630_Lecture02-03
Dbm630_Lecture02-03
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
 
introduction to datawarehouse
introduction to datawarehouseintroduction to datawarehouse
introduction to datawarehouse
 
Presentation
PresentationPresentation
Presentation
 
VTU 7TH SEM CSE DATA WAREHOUSING AND DATA MINING SOLVED PAPERS OF DEC2013 JUN...
VTU 7TH SEM CSE DATA WAREHOUSING AND DATA MINING SOLVED PAPERS OF DEC2013 JUN...VTU 7TH SEM CSE DATA WAREHOUSING AND DATA MINING SOLVED PAPERS OF DEC2013 JUN...
VTU 7TH SEM CSE DATA WAREHOUSING AND DATA MINING SOLVED PAPERS OF DEC2013 JUN...
 
Lecture 03 - The Data Warehouse and Design
Lecture 03 - The Data Warehouse and Design Lecture 03 - The Data Warehouse and Design
Lecture 03 - The Data Warehouse and Design
 
Basics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration TechniquesBasics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration Techniques
 
Data Warehouses & Deployment By Ankita dubey
Data Warehouses & Deployment By Ankita dubeyData Warehouses & Deployment By Ankita dubey
Data Warehouses & Deployment By Ankita dubey
 
Data mining
Data miningData mining
Data mining
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
 
Data modeling star schema
Data modeling star schemaData modeling star schema
Data modeling star schema
 
business analysis-Data warehousing
business analysis-Data warehousingbusiness analysis-Data warehousing
business analysis-Data warehousing
 
SALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCE
SALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCESALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCE
SALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCE
 
The Data Warehouse Lifecycle
The Data Warehouse LifecycleThe Data Warehouse Lifecycle
The Data Warehouse Lifecycle
 
Real World Business Intelligence and Data Warehousing
Real World Business Intelligence and Data WarehousingReal World Business Intelligence and Data Warehousing
Real World Business Intelligence and Data Warehousing
 
Business analysis in data warehousing
Business analysis in data warehousingBusiness analysis in data warehousing
Business analysis in data warehousing
 
Introduction to data warehousing
Introduction to data warehousingIntroduction to data warehousing
Introduction to data warehousing
 
Data warehouse system and its concepts
Data warehouse system and its conceptsData warehouse system and its concepts
Data warehouse system and its concepts
 
Sqlserver interview questions
Sqlserver interview questionsSqlserver interview questions
Sqlserver interview questions
 
Dwdm 2(data warehouse)
Dwdm 2(data warehouse)Dwdm 2(data warehouse)
Dwdm 2(data warehouse)
 

Ähnlich wie Data warehouse

BI Chapter 03.pdf business business business business business business
BI Chapter 03.pdf business business business business business businessBI Chapter 03.pdf business business business business business business
BI Chapter 03.pdf business business business business business business
JawaherAlbaddawi
 
Introduction to data warehousing
Introduction to data warehousing   Introduction to data warehousing
Introduction to data warehousing
Girish Dhareshwar
 

Ähnlich wie Data warehouse (20)

Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
DW 101
DW 101DW 101
DW 101
 

Data warehouse

  • 1. Chapter 1: Data Warehousing. 1. Basic concepts of data warehousing; 2. Data warehouse architectures; 3. Some characteristics of data warehouse data; 4. The reconciled data layer; 5. Data transformation; 6. The derived data layer; 7. The user interface. HCMC UT, 2008
  • 2. Motivation. “Modern organization is drowning in data but starving for information”. Operational processing (transaction processing) captures, stores and manipulates data to support daily operations. Information processing is the analysis of data or other forms of information to support decision making. A data warehouse can consolidate and integrate information from many internal and external sources and arrange it in a meaningful format for making business decisions. Chapter 1
  • 3. Definition. Data warehouse (W. H. Inmon): a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes. Subject-oriented: e.g. customers, patients, students, products. Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources. Time-variant: can study trends and changes. Non-updatable: read-only, periodically refreshed. Data warehousing: the process of constructing and using a data warehouse.
  • 4. Data Warehouse: Subject-Oriented. Organized around major subjects, such as customer, product, sales. Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
  • 5. Data Warehouse: Integrated. Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources (e.g., hotel price: currency, tax, breakfast covered, etc.). When data is moved to the warehouse, it is converted.
  • 6. Data Warehouse: Time-Variant. The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current-value data. Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years). Every key structure in the data warehouse contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a time element.
  • 7. Data Warehouse: Non-Updatable. A physically separate store of data transformed from the operational environment. Operational update of data does not occur in the data warehouse environment, so it does not require transaction processing, recovery, and concurrency control mechanisms. It requires only two operations in data accessing: initial loading of data and access of data.
  • 8. Need for Data Warehousing. Integrated, company-wide view of high-quality information (from disparate databases). Separation of operational and informational systems and data (for improved performance). Table 11-1: Comparison of operational and informational systems.
  • 9. Need to separate operational and informational systems. Three primary factors: (1) A data warehouse centralizes data that are scattered throughout disparate operational systems and makes them available for decision support (DS). (2) A well-designed data warehouse adds value to data by improving their quality and consistency. (3) A separate data warehouse eliminates much of the contention for resources that results when information applications are mixed with operational processing.
  • 10. Data Warehouse Architectures. 1. Generic two-level architecture; 2. Independent data mart; 3. Dependent data mart and operational data store; 4. Logical data mart and @ctive warehouse; 5. Three-layer architecture. All involve some form of extraction, transformation and loading (ETL).
  • 11. Figure 11-2: Generic two-level architecture. Figure annotations: one company-wide warehouse, loaded through the E, T and L steps; periodic extraction means data in the warehouse is not completely current.
  • 12. Figure 11-3: Independent data mart. Data marts are mini-warehouses, limited in scope. Figure annotations: separate ETL for each independent data mart; data access complexity due to multiple data marts.
  • 13. Independent data mart. Independent data mart: a data mart filled with data extracted from the operational environment without the benefits of a data warehouse.
  • 14. Figure 11-4: Dependent data mart with operational data store. Figure annotations: the ODS provides an option for obtaining current data; single ETL for the enterprise data warehouse (EDW); dependent data marts loaded from the EDW; simpler data access.
  • 15. Dependent data mart and operational data store. Dependent data mart: a data mart filled exclusively from the enterprise data warehouse and its reconciled data. Operational data store (ODS): an integrated, subject-oriented, updatable, current-valued, enterprise-wide, detailed database designed to serve operational users as they do decision support processing.
  • 16. Figure 11-5: Logical data mart and @ctive data warehouse. Figure annotations: the ODS and data warehouse are one and the same; near-real-time ETL for the @ctive data warehouse; data marts are NOT separate databases but logical views of the data warehouse, which makes it easier to create new data marts.
  • 17. @ctive data warehouse. @ctive data warehouse: an enterprise data warehouse that accepts near-real-time feeds of transactional data from the systems of record, analyzes warehouse data, and in near-real-time relays business rules to the data warehouse and systems of record so that immediate actions can be taken in response to business events.
  • 18. Table 11-2: Data Warehouse vs. Data Mart. Source: adapted from Strange (1997).
  • 19. Figure 11-6: Three-layer architecture.
  • 20. Three-layer architecture: reconciled and derived data. Reconciled data: detailed, current data intended to be the single, authoritative source for all decision support. Derived data: data that have been selected, formatted, and aggregated for end-user decision support applications. Metadata: technical and business data that describe the properties or characteristics of other data.
  • 21. Data Characteristics. Figure 11-7: Example of status vs. event data. Figure labels: Status, Event (DBMS log entry), Status. Event = a database action (create/update/delete) that results from a transaction.
  • 22. Data Characteristics: Transient vs. Periodic Data. Figure 11-8: Transient operational data. Changes to existing records are written over previous records, thus destroying the previous data content.
  • 23. Data Characteristics: Transient vs. Periodic Data. Figure 11-9: Periodic warehouse data. Data are never physically altered or deleted once they have been added to the store.
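The contrast between transient and periodic data on slides 22 and 23 can be sketched in a few lines of Python. This is only an illustration under assumed names: the `transient` and `periodic` stores, the helper functions and the sample customer values are all hypothetical and do not come from the slides.

```python
from datetime import date

# Transient store: one current value per key; an update overwrites the old value.
transient = {}

# Periodic store: append-only list of (key, value, effective_date) rows.
periodic = []

def record_change(key, value, effective_date):
    """Apply one change to both stores (hypothetical helper)."""
    transient[key] = value                          # previous content is lost
    periodic.append((key, value, effective_date))   # history is preserved

def value_as_of(key, as_of):
    """Return the latest periodic value effective on or before as_of, else None."""
    rows = [r for r in periodic if r[0] == key and r[2] <= as_of]
    return max(rows, key=lambda r: r[2])[1] if rows else None

# Made-up sample changes to one customer balance.
record_change("cust-001", 750, date(2008, 6, 20))
record_change("cust-001", 1000, date(2008, 6, 22))
```

After both changes the transient store only knows the latest balance, while the periodic store can still answer what the balance was on an earlier date.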
  • 24. Other data warehouse changes: new descriptive attributes; new business activity attributes; new classes of descriptive attributes; descriptive attributes become more refined; descriptive data are related to one another; new source of data.
  • 25. Data Reconciliation. Typical operational data is: transient (not historical); not normalized (perhaps due to denormalization for performance); restricted in scope (not comprehensive); sometimes poor quality (inconsistencies and errors). After ETL, data should be: detailed (not summarized yet); historical (periodic); normalized (3rd normal form or higher); comprehensive (enterprise-wide perspective); quality controlled (accurate with full integrity).
  • 26. The ETL Process. Steps: (1) capture; (2) scrub, or data cleansing; (3) transform; (4) load and index. ETL = extract, transform, and load.
  • 27. Figure 11-10: Steps in data reconciliation. Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse. Static extract = capturing a snapshot of the source data at a point in time. Incremental extract = capturing changes that have occurred since the last static extract.
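A minimal sketch of the two capture styles on slide 27, assuming each source row carries a last-modified timestamp. That timestamp column, the `SOURCE` rows and both function names are assumptions made for illustration; a real system might detect changes from DBMS logs instead.

```python
from datetime import datetime

# Hypothetical source rows: (order_id, amount, last_modified).
SOURCE = [
    (1, 120.0, datetime(2008, 1, 5)),
    (2, 80.0, datetime(2008, 1, 9)),
    (3, 45.5, datetime(2008, 1, 12)),
]

def static_extract(rows):
    """Static extract: a snapshot of every chosen source row at this point in time."""
    return list(rows)

def incremental_extract(rows, since):
    """Incremental extract: only the rows changed after the previous extract."""
    return [r for r in rows if r[2] > since]

snapshot = static_extract(SOURCE)
changes = incremental_extract(SOURCE, since=datetime(2008, 1, 8))
```

Here `snapshot` holds all three rows, while `changes` holds only the rows modified after 8 January 2008.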
  • 28. Figure 11-10: Steps in data reconciliation (continued). Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality. Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies. Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data.
  • 29. Figure 11-10: Steps in data reconciliation (continued). Transform = convert data from the format of the operational system to the format of the data warehouse. Record-level: selection (data partitioning); joining (data combining); aggregation (data summarization). Field-level: single-field (from one field to one field); multi-field (from many fields to one, or one field to many).
  • 30. Figure 11-10: Steps in data reconciliation (continued). Load/Index = place transformed data into the warehouse and create indexes. Refresh mode: bulk rewriting of target data at periodic intervals. Update mode: only changes in source data are written to the data warehouse.
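The difference between the two load modes on slide 30 can be illustrated with a toy target held in a Python dict. This is a simplified sketch: the keyed dict, the function names and the sample rows are assumptions, and the sketch ignores indexing and the history-keeping a real warehouse load would need.

```python
def refresh_load(target, source_rows):
    """Refresh mode: discard the target contents and bulk-rewrite them."""
    target.clear()
    target.update(source_rows)

def update_load(target, changed_rows):
    """Update mode: write only the rows that changed in the source."""
    target.update(changed_rows)

# Made-up target keyed by a hypothetical row id.
warehouse = {}
refresh_load(warehouse, {1: "A", 2: "B"})
update_load(warehouse, {2: "B2", 3: "C"})
```

After these two calls the target holds rows 1, 2 (with its changed value) and 3; a later `refresh_load` would replace all of them at once.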
  • 31. Data Transformation. Data transformation is the component of data reconciliation that converts data from the format of the source operational systems to the format of the enterprise data warehouse. Data transformation consists of a variety of different functions: record-level functions, field-level functions, and more complex transformations.
  • 32. Record-level functions and field-level functions. Record-level functions: selection (data partitioning); joining (data combining); normalization; aggregation (data summarization). Field-level functions: single-field transformation (from one field to one field); multi-field transformation (from many fields to one, or one field to many).
  • 33. Figure 11-11: Single-field transformation. In general, some transformation function translates data from the old form to the new form. An algorithmic transformation uses a formula or logical expression. Table lookup is another approach.
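Both single-field approaches from slide 33 can be sketched briefly in Python. The temperature conversion and the state-code table are made-up examples chosen for illustration, not taken from Figure 11-11.

```python
def fahrenheit_to_celsius(temp_f):
    """Algorithmic transformation: a formula maps the old form to the new form."""
    return (temp_f - 32) * 5 / 9

# Table lookup: a mapping from source codes to target names (made-up sample).
STATE_NAMES = {"CA": "California", "NY": "New York"}

def state_code_to_name(code):
    """Table-lookup transformation; unknown codes are flagged, not guessed."""
    return STATE_NAMES.get(code, "UNKNOWN")
```

The formula suits values that can be computed, while the lookup table suits codes whose target values have to be enumerated.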
  • 34. Figure 11-12: Multifield transformation. M:1: from many source fields to one target field. 1:M: from one source field to many target fields.
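As a rough illustration of the two multifield directions on slide 34, the sketch below combines two source fields into one target field and splits one source field into two. The field names and the `BRAND-MODEL` code format are hypothetical.

```python
def combine_name(first, last):
    """M:1: many source fields (first, last) feed one target field."""
    return f"{last}, {first}"

def split_product_code(code):
    """1:M: one source field feeds many target fields.

    Assumes a made-up 'BRAND-MODEL' code format.
    """
    brand, model = code.split("-", 1)
    return {"brand": brand, "model": model}
```

For example, `combine_name("Ann", "Lee")` yields a single name field, and `split_product_code("ACME-X200")` yields separate brand and model fields.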
  • 35. Derived Data. Objectives: ease of use for decision support applications; fast response to predefined user queries; customized data for particular target audiences; ad-hoc query support; data mining capabilities. Characteristics: detailed (mostly periodic) data; aggregate (for summary); distributed (to departmental servers). Most common data model = star schema (also called “dimensional model”).
  • 36. The Star Schema. A star schema is a simple database design in which dimensional data (describing how data are commonly aggregated) are separated from fact or event data. A star schema consists of two types of tables: fact tables and dimension tables.
  • 37. Figure 11-13: Components of a star schema. Fact tables contain factual or quantitative data. Dimension tables contain descriptions about the subjects of the business and are denormalized to maximize performance. There is a 1:N relationship between dimension tables and fact tables. Excellent for ad-hoc queries, but bad for online transaction processing.
  • 38. Figure 11-14: Star schema example. The fact table provides statistics for sales broken down by product, period and store dimensions.
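A small star schema along the lines of slide 38 can be tried out with Python's built-in `sqlite3` module. The table names follow the product, period and store dimensions named on the slide, but every column name and data row below is an assumption made for illustration and is not copied from Figure 11-14.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptions of the subjects of the business.
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);

-- Fact table: one foreign key per dimension plus the quantitative measures.
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    period_key   INTEGER REFERENCES period(period_key),
    store_key    INTEGER REFERENCES store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL
);

INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO period  VALUES (1, 2008, 1), (2, 2008, 2);
INSERT INTO store   VALUES (1, 'Hanoi'), (2, 'Hue');
INSERT INTO sales   VALUES (1, 1, 1, 10, 100.0), (1, 2, 1, 5, 50.0),
                           (2, 1, 2, 3, 90.0),  (1, 1, 2, 2, 20.0);
""")

# Ad-hoc query: total dollars sold per product and store city.
rows = conn.execute("""
    SELECT p.description, s.city, SUM(f.dollars_sold)
    FROM sales AS f
    JOIN product AS p ON p.product_key = f.product_key
    JOIN store   AS s ON s.store_key   = f.store_key
    GROUP BY p.description, s.city
    ORDER BY p.description, s.city
""").fetchall()
```

The query joins the fact table to only the dimensions it needs, which is the usual access pattern for a star schema.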
  • 39. Figure 11-15: Star schema with sample data.
  • 40. Issues Regarding Star Schema. Dimension table keys must be surrogate (non-intelligent and non-business related), because keys may change over time and for length/format consistency. Granularity of the fact table: what level of detail do you want? Transactional grain: finest level. Aggregated grain: more summarized. A finer grain gives better market basket analysis capability, but it also means more dimension tables and more rows in the fact table.
  • 41. Duration of the database. Ex: 13 months or 5 quarters. Some businesses need a longer duration. Size of the fact table: (1) estimate the number of possible values for each dimension associated with the fact table; (2) multiply the values obtained in the first step after making any necessary adjustments.
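The two-step size estimate on slide 41 amounts to a product of dimension cardinalities scaled by an adjustment. The sketch below uses made-up numbers (1,000 stores, 10,000 products, 24 periods, and an assumed 50% of combinations actually occurring); none of these figures come from the slides.

```python
def estimate_fact_rows(dimension_sizes, adjustment=1.0):
    """Multiply the possible values per dimension, then apply an adjustment.

    `adjustment` is an assumed fraction of dimension combinations that
    actually occur (for example, not every product sells in every store).
    """
    rows = adjustment
    for size in dimension_sizes.values():
        rows *= size
    return int(rows)

# Hypothetical dimension sizes for a sales fact table.
estimate = estimate_fact_rows(
    {"store": 1000, "product": 10000, "period": 24},
    adjustment=0.5,
)
```

With these assumed inputs the estimate is 1,000 × 10,000 × 24 × 0.5 = 120,000,000 rows.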
  • 42. Figure 11-16: Modeling dates. Fact tables contain time-period data, so date dimensions are important.
  • 43. Variations of the Star Schema. 1. Multiple fact tables; 2. Factless fact tables; 3. Normalizing dimension tables; 4. Snowflake schema.
  • 44. Multiple Fact Tables. More than one fact table in a given star schema. Ex: there are 2 fact tables, one at the center of each star. Sales: facts about the sale of a product to a customer in a store on a date. Receipts: facts about the receipt of a product from a vendor to a warehouse on a date. Two separate product dimension tables have been created. One date dimension table is used.
  • 46. Factless Fact Tables. There are applications in which fact tables have no nonkey data but do have foreign keys for the associated dimensions. The two situations: to track events; to inventory the set of possible occurrences (called coverage).
  • 47. Factless fact table showing occurrence of an event.
  • 48. Factless fact table showing coverage.
  • 49. Normalizing dimension tables. Dimension tables may not be normalized. Most data warehouse experts find this acceptable. In some situations it makes sense to further normalize dimension tables. Multivalued dimensions: Ex: a hospital charge/payment for a patient on a date is associated with one or more diagnoses. This is an N:M relationship between the Diagnosis dimension and the Finances fact table. Solution: create an associative entity (helper table) between Diagnosis and Finances.
  • 51. Snowflake schema. A snowflake schema is an expanded version of a star schema in which dimension tables are normalized into several related tables. Advantages: small saving in storage space; normalized structures are easier to update and maintain. Disadvantages: the schema is less intuitive; browsing through the content is more difficult; query performance is degraded because of additional joins.
  • 52. Example of snowflake schema.
      Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
      time: time_key, day, day_of_the_week, month, quarter, year
      item: item_key, item_name, brand, type, supplier_key
      supplier: supplier_key, supplier_type
      branch: branch_key, branch_name, branch_type
      location: location_key, street, city_key
      city: city_key, city, province_or_stre, country
  • 53. The User Interface. A variety of tools are available to query and analyze data stored in data warehouses: 1. Querying tools; 2. On-line analytical processing (OLAP, MOLAP, ROLAP) tools; 3. Data mining tools; 4. Data visualization tools.
  • 54. Role of Metadata (data catalog). Identify subjects of the data mart. Identify dimensions and facts. Indicate how data is derived from enterprise data warehouses, including derivation rules. Indicate how data is derived from the operational data store, including derivation rules. Identify available reports and predefined queries. Identify data analysis techniques (e.g. drill-down). Identify responsible people.
  • 55. Querying Tools. SQL is not an analytical language. SQL-99 includes some data warehousing extensions. SQL-99 still is not a full-featured data warehouse querying and analysis tool. Different DBMS vendors will implement some or all of the SQL-99 OLAP extension commands and possibly others.
  • 56. On-Line Analytical Processing (OLAP). OLAP is the use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques. Relational OLAP (ROLAP): OLAP tools that view the database as a traditional relational database in either a star schema or other normalized or denormalized set of tables. Multidimensional OLAP (MOLAP): OLAP tools that load data into an intermediate structure, usually a three- or higher-dimensional array (cube structure).
  • 57. From tables to data cubes. A data warehouse is based on a multidimensional data model which views data in the form of a data cube. A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions. Dimension tables: such as item (item_name, brand, type), or time (day, week, month, quarter, year). The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
  • 58. MOLAP Operations. Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction. Drill down (roll down): reverse of roll-up, going from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions. Slice and dice: project and select.
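The operations on slide 58 can be mimicked on a tiny in-memory cube. This is a sketch only: the dict-based cube, the dimension names in `DIMS`, the sample cells and both helper functions are assumptions for illustration, and real MOLAP tools use dedicated array storage.

```python
from collections import defaultdict

# Made-up cube cells: (year, quarter, city, item) -> dollars_sold.
CUBE = {
    (2008, "Q1", "Hanoi", "phone"): 100,
    (2008, "Q1", "Hue", "phone"): 40,
    (2008, "Q2", "Hanoi", "phone"): 70,
    (2008, "Q2", "Hanoi", "laptop"): 300,
}
DIMS = ("year", "quarter", "city", "item")

def roll_up(cube, keep):
    """Roll up by dimension reduction: sum over every dimension not in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def slice_cube(cube, dim, value):
    """Slice: select the cells where one dimension equals a fixed value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

by_year = roll_up(CUBE, ["year"])                # highest-level summary
by_quarter = roll_up(CUBE, ["year", "quarter"])  # drill down from year to quarter
hanoi = slice_cube(CUBE, "city", "Hanoi")        # slice on the city dimension
```

Drill-down is shown here simply as rolling up to a finer set of dimensions, which is the reverse direction of the year-level roll-up.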
  • 59. Figure 11-22: Slicing a data cube.
  • 60. Figure 11-23: Example of drill-down. Panels: summary report; drill-down with color added.
  • 61. Data Mining. Data mining is knowledge discovery using a blend of statistical, AI, and computer graphics techniques. Goals: explain observed events or conditions; confirm hypotheses; explore data for new or unexpected relationships. Techniques: case-based reasoning; rule discovery; signal processing; neural nets; fractals.
  • 62. Data Visualization. Data visualization is the representation of data in graphical/multimedia formats for human analysis.
  • 63. OLAP Tool Vendors: IBM; Informix; Cartelon; NCR; Oracle (Oracle Warehouse Builder, Oracle OLAP); Red Brick; Sybase; SAS; Microsoft (SQL Server OLAP); Microstrategy Corporation.