Dr. Vaibhav Bansal | Associate Professor - MCA Department | Paper : Data Warehousing & Data Mining [MCA 204]
    Provide superior services and products
         Know the business
         New products
         Invest in customers
         Retain customers
         Invest in technology
         Reinvent to face new challenges



    Ad Hoc Reports: The earliest stage. Marketing and Finance would send
     requests to IT for special reports.
    Special Extract Programs: An attempt by IT to anticipate the types of
     reports that would be requested from time to time.
    Small Applications: Simple applications built on the extracted files.
    Information Centers: A place where users could go to request ad hoc
     reports or view special information screens.
    Decision-Support Systems: Sophisticated systems intended to provide
     strategic information.
    Executive Information Systems: An attempt to bring strategic
     information to the executive desktop.

   Defined in many different ways
                A decision support database that is maintained separately
                 from the organization’s operational database
                Supports information processing by providing a solid
                 platform of consolidated, historical data for analysis

              “A data warehouse is a subject-oriented, integrated, time-
               variant, and nonvolatile collection of data in support of
               management’s decision-making process.”—W. H. Inmon

              Data warehousing:
                The process of constructing and using data warehouses


    Organized around major subjects, such as customer, product,
         sales.
        Focusing on the modeling and analysis of data for decision
         makers, not on daily operations or transaction processing.
        Provide a simple and concise view around particular subject
         issues by excluding data that are not useful in the decision
         support process.




    Constructed by integrating multiple,
                heterogeneous data sources
                 relational databases, flat files, on-line transaction
                    records
               Data cleaning and data integration techniques are
                applied.
                 Ensure consistency in naming conventions, encoding
                    structures, attribute measures, etc. among different data
                    sources
                     ▪ E.g., Hotel price: currency, tax, breakfast covered, etc.
                 When data is moved to the warehouse, it is converted.
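
The hotel price example can be sketched in Python. This is a minimal illustration, not a prescribed method: the record layout, the exchange rate, the tax rate, and the breakfast charge are all hypothetical values chosen for the example. It converts records from two sources to one warehouse convention (US dollars, tax and breakfast included).

```python
from dataclasses import dataclass

# Hypothetical reference data; real values would come from the business.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10}
TAX_RATE = 0.10
BREAKFAST_USD = 12.0


@dataclass(frozen=True)
class HotelPrice:
    hotel: str
    usd_incl_tax_and_breakfast: float


def normalize(record: dict) -> HotelPrice:
    """Convert one source record to the warehouse convention."""
    price = record["price"] * RATES_TO_USD[record["currency"]]
    if not record["tax_included"]:
        price *= 1 + TAX_RATE
    if not record["breakfast_included"]:
        price += BREAKFAST_USD
    return HotelPrice(record["hotel"], round(price, 2))


# Two sources describe a price using different conventions.
source_a = {"hotel": "Plaza", "price": 100.0, "currency": "USD",
            "tax_included": False, "breakfast_included": False}
source_b = {"hotel": "Plaza", "price": 100.0, "currency": "EUR",
            "tax_included": True, "breakfast_included": True}

price_a = normalize(source_a)
price_b = normalize(source_b)
```

After conversion the two prices are directly comparable, which the raw source values were not.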

    The time horizon for the data warehouse is
                significantly longer than that of operational systems.
                  Operational database: current value data.
                  Data warehouse data: provide information from a historical
                     perspective (e.g., past 5-10 years)
               Every key structure in the data warehouse
                  Contains an element of time, explicitly or implicitly
                  But the key of operational data may or may not contain
                     “time element”.
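
A minimal Python sketch of the difference in key structure, assuming a hypothetical customer-address attribute: the operational store is keyed by customer only and keeps the current value, while the warehouse key carries an explicit date and keeps every snapshot.

```python
from datetime import date

operational = {}  # customer_id -> current city (hypothetical attribute)
warehouse = {}    # (customer_id, snapshot_date) -> city as of that date


def record_city(customer_id: int, city: str, on: date) -> None:
    operational[customer_id] = city        # overwrites the old value
    warehouse[(customer_id, on)] = city    # adds a new time-stamped entry


record_city(42, "Delhi", date(2010, 1, 1))
record_city(42, "Mumbai", date(2012, 6, 1))
```

Only the warehouse can still answer where customer 42 lived in 2010.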


 A physically separate store of data transformed
            from the operational environment.
           Operational update of data does not occur in the
            data warehouse environment.
                Does not require transaction processing, recovery, and
                    concurrency control mechanisms
                Requires only two operations in data accessing:
                    ▪ initial loading of data and access of data.
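
The two access operations can be sketched as a small Python class. The class name and row layout are hypothetical; the point is only that the interface offers loading and reading, with no update or delete.

```python
class WarehouseTable:
    """Append-only table: rows are loaded and read, never updated in place."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Store immutable copies so loaded data cannot be modified later.
        self._rows.extend(tuple(row) for row in rows)

    def query(self, predicate):
        return [row for row in self._rows if predicate(row)]


table = WarehouseTable()
table.load([("2010", "north", 100.0), ("2011", "north", 120.0)])
table.load([("2011", "south", 80.0)])
rows_2011 = table.query(lambda row: row[0] == "2011")
```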


    Distinct features (OLTP vs. OLAP):
              User and system orientation: customer vs. market
              Data contents: current, detailed vs. historical, consolidated
              Database design: ER + application vs. star + subject
              View: current, local vs. evolutionary, integrated
              Access patterns: update vs. read-only but complex queries




    OLTP (on-line transaction processing)
                Major task of traditional relational DBMS
                Day-to-day operations: purchasing, inventory, banking,
                    manufacturing, payroll, registration, accounting, etc.

              OLAP (on-line analytical processing)
                Major task of data warehouse system
                Data analysis and decision making




                       OLTP                                    OLAP
users                  clerk, IT professional                  knowledge worker
function               day-to-day operations                   decision support
DB design              application-oriented                    subject-oriented
data                   current, up-to-date,                    historical,
                       detailed, flat relational,              summarized, multidimensional,
                       isolated                                integrated, consolidated
usage                  repetitive                              ad hoc
access                 read/write,                             lots of scans
                       index/hash on primary key
unit of work           short, simple transaction               complex query
# records accessed     tens                                    millions
# users                thousands                               hundreds
DB size                100 MB to GB                            100 GB to TB
metric                 transaction throughput                  query throughput, response time
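
The contrast in unit of work can be illustrated with Python's built-in sqlite3 module. This is only a sketch on a hypothetical orders table: the OLTP-style step is a short transaction touching one row by primary key, and the OLAP-style step is a read-only query that scans and aggregates all rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 10.0), (2, "south", 20.0), (3, "north", 30.0)],
)

# OLTP style: short, simple transaction that updates one record by key.
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (25.0, 2))

# OLAP style: read-only query that scans the table and summarizes it.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
```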
    High performance for both systems

               DBMS— tuned for OLTP: access methods,
                  indexing, concurrency control, recovery

               Warehouse—tuned for OLAP: complex OLAP
                  queries, multidimensional view, consolidation.




    Different functions and different data:
               missing data: Decision support requires historical
                data which operational DBs do not typically
                maintain
               data consolidation: DS requires consolidation
                (aggregation, summarization) of data from
                heterogeneous sources
               data quality: different sources typically use
                inconsistent data representations, codes and
                formats which have to be reconciled

Operational vs Decision Support


                    OLTP                                                      Complex Analysis

                    Information to support                                    Historical information
                    day-to-day service                                        to analyze

                    Data stored at transaction                                Data needs to be integrated
                    level

                    Database design: Normalized                               Database design:
                                                                              Denormalized, star schema




MIS and Decision Support

[Figure labels: Production platforms, Operational reports, Decision makers, Ad hoc access]

             •     MIS systems provided business data
             •     Reports were developed on request
             •     Reports provided little analysis capability
             •     No personal ad hoc access to data

    Data structures are complex
    Systems are designed for high performance and throughput
    Data is not meaningfully represented
    Data is dispersed
    TPS systems are unsuitable for intensive queries

[Figure labels: ERP, Production platforms, Operational reports]
[Figure labels: Operational systems, Extracts, Decision makers]

    End-user computing offloaded from the operational environment
    User’s own data
[Figure labels: Operational systems, Extracts, Decision makers]

    Extract explosion
        Duplicated effort
        Multiple technologies
        Obsolete reports
        No metadata
    No common time basis
                           Different calculation algorithms
                           Different levels of extraction
                           Different levels of granularity
                           Different data field names
                           Different data field meanings
                           Missing information
                           No data correction rules
                           No drill-down capability



[Figure labels: Internal and external systems, Data warehouse, Decision makers]

    Controlled
    Reliable
    Quality information
    Single source of data

[Figure: data warehouse architecture]
    Sources:      External Data Sources, Operational Databases
    Processing:   Extract, Clean, Transform, Load, Refresh
    Storage:      Data Warehouse, Metadata repository
    Serves:       OLAP, Data Mining, Visualisation


[Figure labels: Mainframe; Server; Corporate data warehouse (Corporate, Financial, Marketing, Manufacturing, Distribution); Analysts]

Federated data warehouse

[Figure labels: Mainframe; Corporate data warehouse; Financial, Marketing, Manufacturing, Distribution; Analysts]

[Figure labels: Mainframe, Analyst]

    Tier 3 (detailed data):            Corporate data warehouse
    Tier 2 (summarized data):          Local data mart
    Tier 1 (highly summarized data):   Workstation




[Figure labels: Data Warehouse, Data Mart]

                Property                               Data Warehouse                   Data Mart
                Scope                                  Enterprise                       Department
                Subjects                               Multiple                         Single-subject
                Data Source                            Many                             Few
                Size (typical)                         100 GB to > 1 TB                 < 100 GB
                Implementation time                    Months to years                  Months


    High performance is achieved by pre-planning the
              requirements for joins, summations, and periodic
              reports by end-users.

             There are five main groups of access tools:
               Data reporting and query tools
               Application development tools
               Executive information system (EIS) tools
               Online analytical processing (OLAP) tools
               Data mining tools



    Rotate and drill down
                   Create and examine calculated data
                   Determine comparative or relative
                    differences.
                   Perform exception and trend analysis.
                   Perform advanced analytical functions




    Roll up (drill-up): summarize data by climbing up a hierarchy or by
     dimension reduction
    Drill down (roll down): reverse of roll-up; from a higher-level
     summary to a lower-level summary or detailed data, or by
     introducing new dimensions
    Slice and dice: project and select
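
These operations can be sketched on a tiny cube held in a Python dictionary. The cell values, the city-to-country hierarchy, and the function names are hypothetical; the sketch treats slice as selection on one dimension and dice as selection on several.

```python
from collections import defaultdict

# Hypothetical cube cells: (quarter, city, item) -> units sold.
cube = {
    ("Q1", "Delhi", "phone"): 5,
    ("Q1", "Mumbai", "phone"): 3,
    ("Q2", "Delhi", "laptop"): 2,
    ("Q2", "Mumbai", "phone"): 4,
}
CITY_TO_COUNTRY = {"Delhi": "India", "Mumbai": "India"}


def roll_up_city_to_country(cells):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, city, item), units in cells.items():
        out[(quarter, CITY_TO_COUNTRY[city], item)] += units
    return dict(out)


def roll_up_drop_dimension(cells, axis):
    """Roll up by dimension reduction: aggregate one axis away."""
    out = defaultdict(int)
    for key, units in cells.items():
        out[key[:axis] + key[axis + 1:]] += units
    return dict(out)


def slice_cube(cells, axis, value):
    """Slice: select the cells with a single value on one dimension."""
    return {k: v for k, v in cells.items() if k[axis] == value}


def dice_cube(cells, allowed):
    """Dice: select on several dimensions; allowed maps axis -> value set."""
    return {
        k: v
        for k, v in cells.items()
        if all(k[axis] in values for axis, values in allowed.items())
    }


by_country = roll_up_city_to_country(cube)
without_city = roll_up_drop_dimension(cube, 1)
q1_slice = slice_cube(cube, 0, "Q1")
q2_mumbai = dice_cube(cube, {0: {"Q2"}, 1: {"Mumbai"}})
```

Drill down is the reverse direction: it cannot be computed from `by_country` alone and requires going back to the more detailed cells in `cube`.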

    Pivot (rotate): reorient the cube for visualization, e.g., 3D to a
     series of 2D planes
    Other operations
        drill across: involving (across) more than one fact table
        drill through: through the bottom level of the cube to its
         back-end relational tables (using SQL)
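
A minimal Python sketch of pivoting, assuming hypothetical cells keyed by (quarter, city, item): the function turns the 3D cube into a series of 2D planes, and calling it with the row and column axes swapped rotates each plane.

```python
from collections import defaultdict

# Hypothetical cube cells: (quarter, city, item) -> units sold.
cells = {
    ("Q1", "Delhi", "phone"): 5,
    ("Q2", "Delhi", "phone"): 7,
    ("Q1", "Mumbai", "laptop"): 2,
}


def planes(cells, plane_axis, row_axis, col_axis):
    """Return {plane: {row: {column: value}}} for the chosen orientation."""
    out = defaultdict(lambda: defaultdict(dict))
    for key, value in cells.items():
        out[key[plane_axis]][key[row_axis]][key[col_axis]] = value
    return {plane: dict(rows) for plane, rows in out.items()}


# One 2D plane per item, cities as rows and quarters as columns.
city_by_quarter = planes(cells, plane_axis=2, row_axis=1, col_axis=0)
# Pivot: same planes, quarters as rows and cities as columns.
quarter_by_city = planes(cells, plane_axis=2, row_axis=0, col_axis=1)
```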
A multi-dimensional data model




[Figure: lattice of cuboids for the dimensions time, item, location, supplier]

    0-D (apex) cuboid:   all
    1-D cuboids:         time; item; location; supplier
    2-D cuboids:         time,item; time,location; time,supplier;
                         item,location; item,supplier; location,supplier
    3-D cuboids:         time,item,location; time,item,supplier;
                         time,location,supplier; item,location,supplier
    4-D (base) cuboid:   time,item,location,supplier
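
The lattice can be enumerated with a short Python sketch: every subset of the four dimensions is one cuboid, so there are 2^4 = 16 in total. The variable names are illustrative only.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

# Level k of the lattice holds every k-dimensional cuboid.
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

apex = lattice[0]                 # [()] stands for "all"
base = lattice[len(dimensions)]   # the single 4-D cuboid
total_cuboids = sum(len(level) for level in lattice.values())
```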

    Modeling data warehouses: dimensions & measures
                Star schema: A fact table in the middle connected to a set
                    of dimension tables
                Snowflake schema: A refinement of star schema where
                    some dimensional hierarchy is normalized into a set of
                    smaller dimension tables, forming a shape similar to
                    snowflake
                Fact constellations: Multiple fact tables share dimension
                    tables, viewed as a collection of stars, therefore called
                    galaxy schema or fact constellation

[Figure: star schema for sales]

    Sales Fact Table
        keys:       time_key, item_key, branch_key, location_key
        measures:   units_sold, dollars_sold, avg_sales

    time:       time_key, day, day_of_the_week, month, quarter, year
    item:       item_key, item_name, brand, type, supplier_type
    branch:     branch_key, branch_name, branch_type
    location:   location_key, street, city, province_or_state, country
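
A reduced version of this star schema can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table names `time_dim`, `item_dim`, and `sales_fact` and all sample rows are hypothetical, and the branch and location dimensions are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "Q1", 2023), (2, "Q1", 2024)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?)",
                 [(1, "phone", "Acme"), (2, "laptop", "Zen")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 20.0), (1, 2, 1, 15.0),
                  (2, 1, 3, 30.0), (2, 1, 1, 10.0)])

# The fact table in the middle is joined to each dimension by its key.
dollars_by_year_brand = conn.execute("""
    SELECT t.year, i.brand, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.year, i.brand
    ORDER BY t.year, i.brand
""").fetchall()
```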
                                Measures

[Figure: snowflake schema for sales]

    Sales Fact Table
        keys:       time_key, item_key, branch_key, location_key
        measures:   units_sold, dollars_sold, avg_sales

    time:       time_key, day, day_of_the_week, month, quarter, year
    item:       item_key, item_name, brand, type, supplier_key
    supplier:   supplier_key, supplier_type
    branch:     branch_key, branch_name, branch_type
    location:   location_key, street, city_key
    city:       city_key, city, province_or_state, country
[Figure: fact constellation for sales and shipping]

    Sales Fact Table
        keys:       time_key, item_key, branch_key, location_key
        measures:   units_sold, dollars_sold, avg_sales

    Shipping Fact Table
        keys:       time_key, item_key, shipper_key, from_location, to_location
        measures:   dollars_cost, units_shipped

    time:       time_key, day, day_of_the_week, month, quarter, year
    item:       item_key, item_name, brand, type, supplier_type
    branch:     branch_key, branch_name, branch_type
    location:   location_key, street, city, province_or_state, country
    shipper:    shipper_key, shipper_name, location_key, shipper_type
    Cube Definition (Fact Table)
        define cube <cube_name> [<dimension_list>]: <measure_list>
    Dimension Definition (Dimension Table)
        define dimension <dimension_name> as (<attribute_or_subdimension_list>)
    Special Case (Shared Dimension Tables)
        First time as “cube definition”
        define dimension <dimension_name> as <dimension_name_first_time>
          in cube <cube_name_first_time>

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars),
    avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city,
    province_or_state, country)
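
What the three measures compute can be sketched in plain Python. The base rows and the function name are hypothetical; each row carries the four dimension keys followed by `sales_in_dollars`, and the measures are evaluated per combination of keys.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical base rows:
# (time_key, item_key, branch_key, location_key, sales_in_dollars)
rows = [
    (1, 10, 100, 7, 20.0),
    (1, 10, 100, 7, 30.0),
    (2, 10, 100, 7, 10.0),
]


def sales_star(rows):
    """Group rows by the four dimension keys and compute the measures."""
    groups = defaultdict(list)
    for *dims, sales_in_dollars in rows:
        groups[tuple(dims)].append(sales_in_dollars)
    return {
        dims: {
            "dollars_sold": sum(values),   # sum(sales_in_dollars)
            "avg_sales": mean(values),     # avg(sales_in_dollars)
            "units_sold": len(values),     # count(*)
        }
        for dims, values in groups.items()
    }


cube = sales_star(rows)
```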


define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars),
    avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type,
    supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street,
    city(city_key, city, province_or_state, country))




define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars),
    avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city,
    province_or_state, country)



define cube shipping [time, item, shipper, from_location,
          to_location]:
             dollar_cost = sum(cost_in_dollars), unit_shipped =
               count(*)
        define dimension time as time in cube sales
        define dimension item as item in cube sales
        define dimension shipper as (shipper_key, shipper_name,
          location as location in cube sales, shipper_type)
        define dimension from_location as location in cube sales
        define dimension to_location as location in cube sales




   distributive: if the result derived by applying the
              function to n aggregate values is the same as that
              derived by applying the function on all the data
              without partitioning.
                    ▪ E.g., count(), sum(), min(), max().
             algebraic: if it can be computed by an algebraic
              function with M arguments (where M is a bounded
              integer), each of which is obtained by applying a
              distributive aggregate function.
                    ▪ E.g., avg(), min_N(), standard_deviation().

    holistic: if there is no constant bound on the
             storage size needed to describe a
             subaggregate.
                  ▪ E.g., median(), mode(), rank().
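
The three categories can be illustrated with a small Python sketch. The partitions and variable names are made up for illustration. The sum is combined from partial sums (distributive), the average from two distributive results (algebraic), while the median cannot be recovered from per-partition medians (holistic).

```python
from statistics import median

# Hypothetical data split into partitions (e.g., one per chunk of the base table).
partitions = [[4, 8, 9], [15, 21], [21, 24, 25, 26]]
all_values = [v for part in partitions for v in part]

# Distributive: combining per-partition sums gives the global sum.
partial_sums = [sum(part) for part in partitions]
total = sum(partial_sums)

# Algebraic: avg() is computed from M = 2 distributive results, sum() and count().
partial_counts = [len(part) for part in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: the median of the partition medians is generally not the global
# median, so no fixed-size summary per partition is sufficient.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

For this data the median of medians differs from the true median, which is why a holistic measure must be computed from the full data.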




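[Figure: concept hierarchy, one level per row]
    all       all
    region    Europe, ..., North_America
    country   Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
    city      Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
    office    L. Chan, ..., M. Wind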

       Sales volume as a function of product, month, and region
       Dimensions: Product, Location, Time
       Hierarchical summarization paths:
         Product:  Industry > Category > Product
         Location: Region > Country > City > Office
         Time:     Year > Quarter > Month or Week > Day
       [Figure: data cube; visible axis labels: Product, Month]
[Figure: sample data cube]
    Date axis:     1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
    Product axis:  TV, PC, VCR, sum
    Country axis:  U.S.A, Canada, Mexico, sum
    Highlighted cell: total annual sales of TV in U.S.A.




[Figure: lattice of cuboids]
    0-D (apex) cuboid:   all
    1-D cuboids:         product; date; country
    2-D cuboids:         product, date; product, country; date, country
    3-D (base) cuboid:   product, date, country




[Figure: two cubes. SALES with dimensions Customer, Store, Time, and Product; FINANCE with dimensions Store and Time.]



               The data is found at the intersection of
                dimensions.

Data warehouse architecture




   The design of a data warehouse: a business
    analysis framework
   The process of data warehouse design
   A three-tier data warehouse architecture




   Four views regarding the design of a data
    warehouse
     Top-down view
      ▪ allows selection of the relevant information
        necessary for the data warehouse




 Data warehouse view
  ▪ consists of fact tables and dimension tables


 Data source view
  ▪ exposes the information being captured, stored, and
    managed by operational systems


 Business query view
  ▪ sees the perspectives of data in the warehouse from the viewpoint of the end user
   Top-down, bottom-up approaches or a combination
    of both
     Top-down: Starts with overall design and planning
      (mature)
     Bottom-up: Starts with experiments and prototypes
      (rapid)
   From software engineering point of view
     Waterfall: structured and systematic analysis at each step
      before proceeding to the next
     Spiral: rapid generation of increasingly functional
      systems, short turnaround time
   Typical data warehouse design process
     Choose a business process to model, e.g., orders,
      invoices, etc.
     Choose the grain (atomic level of data) of the
      business process
     Choose the dimensions that will apply to each fact
      table record
     Choose the measure that will populate each fact
      table record
Multi-Tiered Architecture
[Figure: data flows left to right through four stages]
    Data Sources:     operational DBs, other sources
                      (Extract, Transform, Load, Refresh; Monitor & Integrator)
    Data Storage:     Data Warehouse, Data Marts, Metadata
    OLAP Engine:      OLAP Server (Serve)
    Front-End Tools:  Analysis, Query, Reports, Data mining
   Metadata is the data defining warehouse objects. It comes in the
    following kinds:
     Description of the structure of the warehouse
      ▪ schema, view, dimensions, hierarchies, derived data definitions, data mart locations
        and contents
     Operational meta-data
      ▪ data lineage (history of migrated data and transformation path), currency of
        data (active, archived, or purged), monitoring information (warehouse usage
        statistics, error reports, audit trails)
     The algorithms used for summarization
     The mapping from operational environment to the data warehouse
     Data related to system performance
      ▪ warehouse schema, view and derived data definitions
     Business data
      ▪ business terms and definitions, ownership of data, charging policies
   Data extraction:
     get data from multiple, heterogeneous, and external
      sources
   Data cleaning:
     detect errors in the data and rectify them when possible
   Data transformation:
     convert data from legacy or host format to warehouse
      format
   Load:
     sort, summarize, consolidate, compute views, check
      integrity, and build indices and partitions
   Refresh
     propagate the updates from the data sources to the
      warehouse
 Enterprise warehouse
   collects all of the information about subjects spanning the entire
     organization
 Data Mart
   a subset of corporate-wide data that is of value to specific groups
     of users. Its scope is confined to specific, selected groups, such as
     a marketing data mart
     ▪ Independent vs. dependent (directly from warehouse) data mart
 Virtual warehouse
   A set of views over operational databases
   Only some of the possible summary views may be materialized

[Figure: data warehouse development path; boxes from bottom to top]
    Define a high-level corporate data model
    Model refinement (two paths)
    Data Mart, Data Mart, Enterprise Data Warehouse
    Distributed Data Marts
    Multi-Tier Data Warehouse
   Relational OLAP (ROLAP)
     Use relational or extended-relational DBMS to store and
      manage warehouse data and OLAP middleware to support
      missing pieces
     Include optimization of DBMS backend, implementation of
      aggregation navigation logic, and additional tools and
      services
     greater scalability
   Multidimensional OLAP (MOLAP)
     Array-based multidimensional storage engine (sparse
      matrix techniques)
     fast indexing to pre-computed summarized data
   Hybrid OLAP (HOLAP)
     User flexibility, e.g., low level: relational, high-
      level: array
   Specialized SQL servers
     specialized support for SQL queries over
      star/snowflake schemas




Data warehouse implementation




   Data cube can be viewed as a lattice of cuboids
     The bottom-most cuboid is the base cuboid
     The top-most cuboid (apex) contains only one cell
     How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
                 T = \prod_{i=1}^{n} (L_i + 1)
   Materialization of data cube
     Materialize every (cuboid) (full materialization), none (no
      materialization), or some (partial materialization)
     Selection of which cuboids to materialize
       ▪ Based on size, sharing, access frequency, etc.
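
As a quick check of the cuboid count, the product can be evaluated directly. This is a minimal sketch; the function name `count_cuboids` is an illustrative assumption, and the extra 1 per dimension stands for the virtual top level "all".

```python
from math import prod

def count_cuboids(levels_per_dimension):
    """Return T = product of (L_i + 1) over all dimensions."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions without hierarchies (one level each): 2**3 cuboids.
print(count_cuboids([1, 1, 1]))
# Ten dimensions with four levels each: 5**10 cuboids.
print(count_cuboids([4] * 10))
```

The second call shows how quickly full materialization becomes impractical, which motivates partial materialization.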
   Cube definition and computation in DMQL
      define cube sales[item, city, year]: sum(sales_in_dollars)
      compute cube sales
   Transform it into a SQL-like language (with a new operator
    cube by, introduced by Gray et al.’96)
      SELECT item, city, year, SUM (amount)
      FROM SALES
      CUBE BY item, city, year
   Need to compute the following Group-Bys
      (city, item, year),
      (city, item), (city, year), (item, year),
      (city), (item), (year)
      ()
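
The set of group-bys produced by a cube operator over the three dimensions of the query (item, city, year) can be enumerated mechanically. The sketch below only lists the groupings and does not compute any aggregate; the function name `cube_group_bys` is an illustrative assumption.

```python
from itertools import combinations

def cube_group_bys(dimensions):
    """Yield every subset of the dimensions, from the full group-by down to ()."""
    for size in range(len(dimensions), -1, -1):
        yield from combinations(dimensions, size)

group_bys = list(cube_group_bys(["item", "city", "year"]))
for group_by in group_bys:
    print(group_by)
```

For n dimensions without hierarchies this yields 2^n group-bys, here 8, ending with the empty grouping () that corresponds to the apex cuboid.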
   Efficient cube computation methods
     ROLAP-based cubing algorithms (Agarwal et al’96)
     Array-based cubing algorithm (Zhao et al’97)
     Bottom-up computation method (Beyer & Ramakrishnan’99)
   ROLAP-based cubing algorithms
     Sorting, hashing, and grouping operations are applied to the
      dimension attributes in order to reorder and cluster related tuples
     Grouping is performed on some subaggregates as a “partial
      grouping step”
     Aggregates may be computed from previously computed
      aggregates, rather than from the base fact table

   Partition arrays into chunks (a chunk is a small subcube which fits in
    memory).
   Compressed sparse array addressing: (chunk_id, offset)
   Compute aggregates in “multiway” fashion by visiting cube cells in the
    order which minimizes the number of times each cell is visited, and
    reduces memory access and storage cost.




[Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]


   Method: the planes should be sorted and
    computed according to their size in
    ascending order.
     Idea: keep the smallest plane in main
     memory, fetch and compute only one chunk at a
     time for the largest plane
   Limitation of the method: it performs well
    only for a small number of dimensions
     If there are a large number of dimensions,
     “bottom-up computation” and iceberg cube
     computation methods can be explored
    Index on a particular column
       Each value in the column has a bit vector: bit-op is fast
       The length of the bit vector: # of records in the base table
       The i-th bit is set if the i-th row of the base table has the
        value for the indexed column
       not suitable for high cardinality domains

  Base table                Index on Region                   Index on Type
  Cust  Region   Type       RecID  Asia  Europe  America      RecID  Retail  Dealer
  C1    Asia     Retail     1      1     0       0            1      1       0
  C2    Europe   Dealer     2      0     1       0            2      0       1
  C3    Asia     Dealer     3      1     0       0            3      0       1
  C4    America  Retail     4      0     0       1            4      1       0
  C5    Europe   Dealer     5      0     1       0            5      0       1
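
The bitmap index above can be reproduced with a few lines of Python. This is a minimal sketch that stores each bit vector as a list of 0/1 values; a real system would use packed bits, and the function name `build_bitmap_index` is an illustrative assumption.

```python
# Base table from the slide: (customer, region, type), in RecID order 1..5.
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """Map each distinct value of the column to a list of 0/1 bits, one per row."""
    index = {}
    for position, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[position] = 1
    return index

region_index = build_bitmap_index(base_table, column=1)
type_index = build_bitmap_index(base_table, column=2)

# "Region = Europe AND Type = Dealer" becomes a bitwise AND of two bit vectors.
matches = [r & t for r, t in zip(region_index["Europe"], type_index["Dealer"])]
matching_customers = [row[0] for row, bit in zip(base_table, matches) if bit]
print(matching_customers)
```

Each distinct value needs its own vector of length equal to the number of rows, which is why the technique is not suitable for high-cardinality columns.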
   Join index: JI(R-id, S-id) where R (R-id, …) ⋈
     S (S-id, …)
   Traditional indices map the values to a list
    of record ids
     A join index materializes the relational join in the JI file and speeds
      up relational join, a rather costly operation
   In data warehouses, a join index relates the
    values of the dimensions of a star schema
    to rows in the fact table.
     E.g. fact table: Sales and two dimensions city and
      product
      ▪ A join index on city maintains for each distinct
         city a list of R-IDs of the tuples recording the
         Sales in the city
     Join indices can span multiple dimensions
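
A join index on city can be sketched as a dictionary from each city to the R-IDs of its fact tuples. The fact rows, R-IDs, and the function name `build_join_index` below are hypothetical and chosen only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical Sales fact table: R-ID -> (city, product, dollars_sold).
sales = {
    "R1": ("Vancouver", "TV", 500),
    "R2": ("Toronto", "PC", 1200),
    "R3": ("Vancouver", "PC", 900),
    "R4": ("Toronto", "TV", 450),
}

def build_join_index(fact_rows, column):
    """For each distinct dimension value, list the R-IDs of matching fact tuples."""
    index = defaultdict(list)
    for rid, row in fact_rows.items():
        index[row[column]].append(rid)
    return dict(index)

city_index = build_join_index(sales, column=0)

# Sales in Vancouver: follow the precomputed R-IDs instead of joining at query time.
vancouver_total = sum(sales[rid][2] for rid in city_index["Vancouver"])
print(city_index, vancouver_total)
```

A join index spanning two dimensions could use (city, product) pairs as keys in the same way.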
   Determine which operations should be performed on
    the available cuboids:
     transform drill, roll, etc. into corresponding SQL and/or
      OLAP operations, e.g., dice = selection + projection
   Determine to which materialized cuboid(s) the
    relevant operations should be applied.
   Exploring indexing structures and compressed vs.
    dense array structures in MOLAP
   Data in the real world is:
     incomplete: lacking attribute values, lacking certain
      attributes of interest, or containing only aggregate data
     noisy: containing errors or outliers
     inconsistent: containing discrepancies in codes or names
   No quality data, no quality mining results!
     Quality decisions must be based on quality data
     Data warehouse needs consistent integration of quality
      data


               Lecture - Why Data Preprocessing?
   A well-accepted multidimensional view:
       Accuracy
       Completeness
       Consistency
       Timeliness
       Believability
       Value added
       Interpretability
       Accessibility
   Broad categories:
     intrinsic, contextual, representational, and accessibility.
   Data cleaning
     Fill in missing values, smooth noisy data, identify or remove outliers,
      and resolve inconsistencies
   Data integration
     Integration of multiple databases, data cubes, or files
   Data transformation
     Normalization and aggregation
   Data reduction
     Obtains reduced representation in volume but produces the same or
      similar analytical results
   Data discretization
     Part of data reduction but with particular importance, especially for
      numerical data
Lecture
         Data cleaning




   Data cleaning tasks
     Fill in missing values

     Identify outliers and smooth out noisy data

     Correct inconsistent data




   Data is not always available
     E.g., many tuples have no recorded value for several attributes, such
      as customer income in sales data
   Missing data may be due to
     equipment malfunction
     data inconsistent with other recorded data and thus deleted
     data not entered due to misunderstanding
     certain data not considered important at the time of entry
     history or changes of the data not registered
   Missing data may need to be inferred.

   Ignore the tuple: usually done when class label is
    missing.
   Fill in the missing value manually
   Use a global constant to fill in the missing value, e.g.,
    “unknown”




   Use the attribute mean to fill in the missing value
   Use the attribute mean for all samples belonging to the
    same class to fill in the missing value
   Use the most probable value to fill in the missing value:
    inference-based such as Bayesian formula or decision tree
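
The two mean-based strategies can be compared on a toy data set. This is a sketch under stated assumptions: the records, class labels, and variable names are invented for illustration, and `None` marks a missing value.

```python
from statistics import mean

# Hypothetical tuples: (class label, income); None marks a missing income.
records = [
    ("low_risk", 50), ("low_risk", None), ("low_risk", 70),
    ("high_risk", 20), ("high_risk", 30), ("high_risk", None),
]

# Attribute mean over all known values, regardless of class.
attribute_mean = mean(income for _, income in records if income is not None)

# Mean per class, computed from the known values of that class only.
class_means = {
    label: mean(
        income for other, income in records
        if other == label and income is not None
    )
    for label in {label for label, _ in records}
}

filled_with_attribute_mean = [
    (label, attribute_mean if income is None else income)
    for label, income in records
]
filled_with_class_mean = [
    (label, class_means[label] if income is None else income)
    for label, income in records
]
print(filled_with_attribute_mean)
print(filled_with_class_mean)
```

The class-based fill keeps the imputed values closer to the other samples of the same class, while the global mean pulls both classes toward the middle.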




 Noise: random error or variance in a measured
  variable
 Incorrect attribute values may be due to
       faulty data collection instruments
       data entry problems
       data transmission problems
       technology limitation
       inconsistency in naming convention
   Other data problems which require data cleaning
     duplicate records
     incomplete data
     inconsistent data
   Binning method:
     first sort data and partition into (equal-frequency)
      bins
     then one can smooth by bin means, smooth by bin
      median, smooth by bin boundaries
   Clustering
     detect and remove outliers
   Regression
     smooth by fitting the data to a regression function,
      e.g., linear regression
   Equal-width (distance) partitioning:
     It divides the range into N intervals of equal size: uniform
        grid
       if A and B are the lowest and highest values of the
        attribute, the width of intervals will be: W = (B-A)/N.
       The most straightforward
       But outliers may dominate presentation
       Skewed data is not handled well.
   Equal-depth (frequency) partitioning:
     It divides the range into N intervals, each containing
      approximately same number of samples
     Good data scaling
     Managing categorical attributes can be tricky.
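
Equal-width partitioning follows directly from W = (B - A) / N. The sketch below applies it to the price list used in this lecture's binning example; the function name `equal_width_bins` is an illustrative assumption, and the code assumes the highest value is larger than the lowest.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for value in values:
        # The maximum value would fall just past the last interval, so clamp it.
        position = min(int((value - low) / width), n_bins - 1)
        bins[position].append(value)
    return width, bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(prices, 3)
print(width, bins)
```

With three intervals of width 10 the bins hold 3, 3, and 6 values, which shows how equal-width bins can end up unevenly filled when the data is skewed.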
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
  26, 28, 29, 34
* Partition into (equi-depth) bins:
   - Bin 1: 4, 8, 9, 15
   - Bin 2: 21, 21, 24, 25
   - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
   - Bin 1: 9, 9, 9, 9
   - Bin 2: 23, 23, 23, 23
   - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
   - Bin 1: 4, 4, 4, 15
   - Bin 2: 21, 21, 25, 25
   - Bin 3: 26, 26, 26, 34
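
The worked example can be reproduced in Python. This is a minimal sketch with illustrative function names; it assumes the number of values divides evenly into the bins, rounds bin means to whole dollars (the exact means of bins 2 and 3 are 22.75 and 29.25), and sends ties to the lower boundary.

```python
def equal_depth_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```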
                      Lecture-14 - Data cleaning
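Cluster analysis groups similar values into clusters, so values that fall outside every cluster can be flagged as outliers; regression smooths the data by fitting it to a regression line.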




Lecture
Data integration and transformation




   Data integration:
     combines data from multiple sources into a coherent store
   Schema integration
     integrate metadata from different sources
     Entity identification problem: identify real world entities
      from multiple data sources, e.g., A.cust-id ≡ B.cust-#
   Detecting and resolving data value conflicts
     for the same real world entity, attribute values from different
      sources are different
     possible reasons: different representations, different scales,
      e.g., metric vs. British units

              Lecture-15 - Data integration and transformation
   Redundant data often occur when integrating
    multiple databases
     The same attribute may have different names in
     different databases
     One attribute may be a “derived” attribute in
     another table, e.g., annual revenue



   Redundant data can often be detected
    by correlation analysis
   Careful integration of the data from multiple
    sources may help reduce/avoid redundancies
    and inconsistencies and improve mining
    speed and quality
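
One common form of correlation analysis for numeric attributes is the Pearson correlation coefficient; a value close to +1 or -1 suggests that one attribute can be derived from the other. The slides do not name a specific measure, so the choice of Pearson, the attribute names, and the data below are assumptions for illustration.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical attributes: annual revenue is derived from monthly revenue (times 12).
monthly_revenue = [10.0, 12.0, 9.0, 15.0, 11.0]
annual_revenue = [12 * m for m in monthly_revenue]
store_count = [2.0, 5.0, 4.0, 3.0, 1.0]

r_redundant = pearson_correlation(monthly_revenue, annual_revenue)
r_unrelated = pearson_correlation(monthly_revenue, store_count)
print(r_redundant, r_unrelated)
```

The derived attribute correlates almost perfectly with its source, so one of the two could be dropped during integration, while the weakly correlated attribute carries separate information.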


 Smoothing: remove noise from data
 Aggregation: summarization, data cube
  construction
 Generalization: concept hierarchy climbing




    Normalization: scaled to fall within a small, specified range
            min-max normalization: the process of taking data measured in its
             engineering units (for example, miles per hour or degrees) and transforming it to a value between
             0.0 and 1.0.
             The lowest (min) value is set to 0.0 and the highest (max) value is set to 1.0. This provides an easy
             way to compare values that are measured using different scales (for example degrees Celsius and
             degrees Fahrenheit) or different units of measure (speed and distance).

             The normalized value is defined as: (the value - the minimum) / (the range of values).

             As an example, consider a temperature reading with values 20, 24, 26, 27, 30.

             The minimum value is 20.
             The maximum values is 30.
             The data has a range of 10 degrees.

             For a temperature reading of 26, the normalized value is 0.6.

             (26 - 20) / 10 = 0.6.
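
The temperature example can be reproduced with a short sketch. The function name `min_max_normalize` and its optional target-range parameters are illustrative assumptions, not part of the lecture.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the range is zero")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


readings = [20, 24, 26, 27, 30]
normalized = min_max_normalize(readings)
print([round(v, 2) for v in normalized])
```

The reading of 26 maps to about 0.6, matching the calculation above.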


     z-score normalization: each value is transformed to
      (the value - the mean) / (the standard deviation), so that the attribute
      has mean 0 and standard deviation 1. Applied to the pixel values of
      images, for example, it makes images of different individuals more comparable.

     Normalization by decimal scaling: each value is divided by 10^j, where j is
      the smallest integer such that every scaled value has absolute value below 1.
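
The two remaining normalization methods, z-score and decimal scaling, can be sketched as follows. This is an illustration under stated assumptions: the function names are made up here, and the z-score uses the population standard deviation (dividing by n), which is one common convention.

```python
import math


def z_score_normalize(values):
    """Transform values to (v - mean) / std, using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("standard deviation is zero")
    return [(v - mean) / std for v in values]


def decimal_scaling_normalize(values):
    """Divide by 10**j, with j the smallest integer so that max |v| / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


z = z_score_normalize([20, 24, 26, 27, 30])
scaled = decimal_scaling_normalize([-986, 917])
print([round(v, 3) for v in z])
print(scaled)
```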
Lecture-16
Data reduction




   Warehouse may store terabytes of data:
    Complex data analysis/mining may take a very
    long time to run on the complete data set
   Data reduction
     Obtains a reduced representation of the data set
      that is much smaller in volume but yet produces the
      same (or almost the same) analytical results
     Example: Materialized Views in Oracle.


   Data reduction strategies
     Data cube aggregation
     Attribute subset selection
     Dimensionality reduction
     Numerosity reduction
     Discretization and concept hierarchy generation




   The lowest level of a data cube (the data are reduced to the
    concept level required by the analysis)
     holds the aggregated data for an individual entity of interest,
      e.g., a customer in a phone-calling data warehouse.
   Multiple levels of aggregation in data cubes
     further reduce the size of the data to be analyzed.
   Reference appropriate levels
     Use the smallest representation that is sufficient for the
      task and analysis.
   Queries regarding aggregated information should be
    answered using the data cube.
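
A rough sketch of data cube aggregation, using hypothetical phone-call records of the form (customer, quarter, minutes): rolling up along one dimension replaces many detailed rows with a few aggregated ones. The helper `roll_up` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical detailed records: (customer, quarter, minutes).
calls = [
    ("alice", "Q1", 120), ("alice", "Q2", 90), ("alice", "Q3", 150), ("alice", "Q4", 60),
    ("bob", "Q1", 30), ("bob", "Q2", 45), ("bob", "Q3", 20), ("bob", "Q4", 55),
]


def roll_up(records, key_index, measure_index):
    """Sum the measure over all records that share the same key value."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[measure_index]
    return dict(totals)


minutes_per_customer = roll_up(calls, key_index=0, measure_index=2)
minutes_per_quarter = roll_up(calls, key_index=1, measure_index=2)
print(minutes_per_customer)
print(minutes_per_quarter)
```

Eight detailed rows shrink to two rows per customer or four rows per quarter, which is enough when the analysis only needs totals at that level.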
   Feature selection (i.e. attribute subset selection):
     Select only the necessary attributes.
     The goal is to find a minimum set of features such that the
      resulting probability distribution of different data classes is
      as close as possible to the original distribution obtained using
      the values of all attributes.




   Reduces the number of attributes appearing in the discovered patterns, making them easier to understand
   Heuristic methods
     step-wise forward selection: {} -> {A1,A3} -> {A1,A3,A5}
     step-wise backward elimination: {A1,A2,A3,A4,A5} -> {A1,A3,A4,A5} -> {A1,A3,A5}
     combining forward selection and backward elimination
     decision-tree induction
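
Step-wise forward selection can be sketched as a greedy loop. This is a simplified illustration: it adds one attribute per step, and the scoring function `toy_score` with its weights is a hypothetical stand-in for a real evaluation measure such as classifier accuracy or information gain.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that most improves score(subset); stop when none helps."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no remaining attribute improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected


# Hypothetical usefulness weights, with a small penalty per selected attribute.
weights = {"A1": 0.5, "A2": 0.02, "A3": 0.3, "A4": 0.01, "A5": 0.2}


def toy_score(subset):
    return sum(weights[a] for a in subset) - 0.05 * len(subset)


chosen = forward_selection(list(weights), toy_score)
print(chosen)
```

Backward elimination works the same way in reverse: start from the full set and repeatedly drop the attribute whose removal hurts the score least.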
Example of Decision Tree Induction
    Initial attribute set:
    {A1, A2, A3, A4, A5, A6}


                                    A4 ?


               A1?                                       A6?



     Class 1           Class 2                 Class 1               Class 2

Reduced attribute set: {A1, A4, A6}

   Attribute subset selection reduces the data
    set size by removing irrelevant or redundant
    attributes.
   The goal is to find a minimum set of attributes.
   It uses basic heuristic methods of attribute
    selection.



   Linear regression: data are modeled to fit a straight line
     Often uses the least-squares method to fit the line

   Multiple regression: allows a response variable Y to be
    modeled as a linear function of a multidimensional
    feature vector
   Log-linear model: approximates discrete
    multidimensional probability distributions
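
As a sketch of how linear regression reduces data, the least-squares fit below replaces a list of (x, y) points with just two parameters, a slope and an intercept. The function name `fit_line` and the sample points are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # these points lie exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Only the two parameters need to be stored; the original values can then be estimated from the model instead of being kept.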
   Three types of attributes:
     Nominal — values from an unordered set
     Ordinal — values from an ordered set
     Continuous — real numbers
   Discretization: divide the range of a continuous
    attribute into intervals
     Some classification algorithms only accept categorical
      attributes.
     Reduce data size by discretization
     Prepare for further analysis

       Lecture-17 - Discretization and concept hierarchy generation
   Discretization
     reduce the number of values for a given continuous
      attribute by dividing the range of the attribute into
      intervals. Interval labels can then be used to replace
      actual data values.
   Concept hierarchies
     reduce the data by collecting and replacing low level
      concepts (such as numeric values for the attribute age) by
      higher level concepts (such as young, middle-aged, or
      senior).
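
A minimal sketch of discretization with interval labels, assuming hypothetical cut points for the attribute age (below 40 is young, 40 to 59 is middle-aged, 60 and above is senior). The function name `discretize` is made up for this example.

```python
from bisect import bisect_right


def discretize(value, boundaries, labels):
    """Map a numeric value to the label of the interval it falls into.

    boundaries must be sorted ascending; labels needs len(boundaries) + 1 entries.
    """
    return labels[bisect_right(boundaries, value)]


ages = [23, 41, 67, 39, 60]
age_labels = [discretize(a, [40, 60], ["young", "middle-aged", "senior"]) for a in ages]
print(age_labels)
```

Each numeric age is replaced by one of three labels, so the attribute has far fewer distinct values.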

   Binning

   Histogram analysis

   Clustering analysis

   Entropy-based discretization

   Discretization by intuitive partitioning

   A popular data reduction technique
   Divide data into buckets and store the average (or sum) for each bucket
   Can be constructed optimally in one dimension using dynamic programming
   Related to quantization problems
[Figure: histogram with horizontal axis from 10000 to 90000 and vertical axis from 0 to 40]
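
The bucket idea can be sketched with a simple equal-width histogram. This is not the optimal dynamic-programming construction mentioned on the slide, and the values are hypothetical.

```python
def equal_width_histogram(values, num_buckets):
    """Split [min, max] into equal-width buckets and keep a count and sum per bucket."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so no bucket width can be computed")
    width = (hi - lo) / num_buckets
    buckets = [
        {"low": lo + i * width, "high": lo + (i + 1) * width, "count": 0, "sum": 0}
        for i in range(num_buckets)
    ]
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bucket.
        index = min(int((v - lo) / width), num_buckets - 1)
        buckets[index]["count"] += 1
        buckets[index]["sum"] += v
    return buckets


values = [10000, 12000, 30000, 35000, 50000, 52000, 58000, 90000]
histogram = equal_width_histogram(values, 4)
for b in histogram:
    print(b["low"], b["high"], b["count"], b["sum"] / b["count"])
```

Storing one count and one sum (or average) per bucket replaces the individual values.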
   Partition data set into clusters, and one can store cluster
    representation only
   Can be very effective if data is clustered but not if data is
    “smeared”
   Can have hierarchical clustering and be stored in multi-
    dimensional index tree structures
   There are many choices of clustering definitions and
    clustering algorithms.

   Given a set of samples S, if S is partitioned into two
    intervals S1 and S2 using boundary T, the entropy
    after partitioning is
      $E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$
   The boundary that minimizes the entropy function
    over all possible boundaries is selected as a binary
    discretization.
   The process is recursively applied to the partitions
    obtained until some stopping criterion is met, e.g.,
    when the information gain $\mathrm{Ent}(S) - E(S, T)$ falls below a small threshold.
   Experiments show that it may reduce data size and
    improve classification accuracy
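
One step of entropy-based discretization can be sketched as follows. The code evaluates the weighted entropy E(S, T) for every candidate boundary T (taken here as the midpoint between adjacent distinct values, which is an assumption of this sketch) and returns the boundary with the smallest value. The sample data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Ent(S): entropy of the class labels in S, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(samples):
    """samples: list of (value, class_label). Returns (T, E(S, T)) with minimal E(S, T)."""
    samples = sorted(samples)
    values = [v for v, _ in samples]
    labels = [c for _, c in samples]
    n = len(samples)
    best = None
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # no boundary between equal values
        t = (values[i] + values[i - 1]) / 2
        e = (i / n) * entropy(labels[:i]) + ((n - i) / n) * entropy(labels[i:])
        if best is None or e < best[1]:
            best = (t, e)
    return best


data = [(1, "a"), (2, "a"), (3, "a"), (7, "b"), (8, "b"), (9, "b")]
boundary, split_entropy = best_split(data)
gain = entropy([c for _, c in data]) - split_entropy
print(boundary, split_entropy, gain)
```

Applying `best_split` recursively to each resulting interval, and stopping when the gain is too small, gives the full procedure.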
   The 3-4-5 rule can be used to segment numeric data into
    relatively uniform, “natural” intervals.
• If an interval covers 3, 6, 7 or 9 distinct values at the most
   significant digit, partition the range into 3 intervals (equal width
   for 3, 6 and 9; grouped 2-3-2 for 7).
• Example: the range 3 to 9 covers 6 values: (9 - 3) / 3 = 2 width, giving 3-5, 5-7, 7-9
• If it covers 2, 4, or 8 distinct values at the most significant digit,
   partition the range into 4 equal-width intervals.
• Example: the range 2 to 10 covers 8 values: (10 - 2) / 4 = 2 width, giving 2-4, 4-6, 6-8, 8-10
• If it covers 1, 5, or 10 distinct values at the most significant digit,
   partition the range into 5 equal-width intervals.
• Example: the range 0 to 10 covers 10 values: (10 - 0) / 5 = 2 width, giving 0-2, 2-4, 4-6, 6-8, 8-10
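
A simplified sketch of one level of the 3-4-5 rule. It assumes that `low` and `high` have already been rounded to multiples of `msd`, the unit of the most significant digit; the function name and parameters are hypothetical, and the rounding and recursion steps of the full procedure are left out.

```python
def three_four_five(low, high, msd):
    """Partition [low, high] into 3, 4 or 5 intervals according to the 3-4-5 rule."""
    distinct = round((high - low) / msd)  # distinct values at the most significant digit
    if distinct == 7:
        # Seven values are grouped 2-3-2 rather than split into equal widths.
        cuts = [low, low + 2 * msd, low + 5 * msd, high]
        return list(zip(cuts, cuts[1:]))
    if distinct in (3, 6, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    elif distinct in (1, 5, 10):
        parts = 5
    else:
        raise ValueError(f"the rule does not cover {distinct} distinct values")
    width = (high - low) / parts
    cuts = [low + i * width for i in range(parts)] + [high]
    return list(zip(cuts, cuts[1:]))


print(three_four_five(3, 9, 1))
print(three_four_five(0, 7, 1))
```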
   Specification of a partial ordering of attributes explicitly at the
    schema level by users or experts
   Specification of a portion of a hierarchy by explicit data grouping
   Specification of a set of attributes, but not of their partial
    ordering
   Specification of only a partial set of attributes




   Sampling allows a large data set to be represented by a much
    smaller random sample (subset) of the data.

   Let a large data set D contain N tuples.

   Methods to reduce data set D:
     Simple random sample without replacement (SRSWOR)
     Simple random sample with replacement (SRSWR)
     Cluster sample
     Stratified sample
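
Three of the sampling methods can be sketched with Python's standard `random` module. The data set, the stratum key, and the 20 percent sampling fraction are hypothetical choices for illustration; a cluster sample would instead draw whole groups of tuples.

```python
import random
from collections import defaultdict


def srswor(data, s, rng):
    """Simple random sample without replacement: each tuple is drawn at most once."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample with replacement: a tuple may be drawn more than once."""
    return rng.choices(data, k=s)


def stratified_sample(data, key, fraction, rng):
    """Draw the same fraction from every stratum (group of tuples sharing key(row))."""
    strata = defaultdict(list)
    for row in data:
        strata[key(row)].append(row)
    sample = []
    for rows in strata.values():
        k = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, k))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable
data = [(i, "senior" if i % 4 == 0 else "young") for i in range(100)]

without_replacement = srswor(data, 10, rng)
with_replacement = srswr(data, 10, rng)
stratified = stratified_sample(data, key=lambda row: row[1], fraction=0.2, rng=rng)
print(len(without_replacement), len(with_replacement), len(stratified))
```

The stratified sample keeps the proportion of each group, which helps when the data are skewed.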
[Figure: Sampling from the raw data]
[Figure: Raw data and the corresponding cluster/stratified sample]
Concept hierarchy can be automatically generated
  based on the number of distinct values per
  attribute in the given attribute set. The attribute
  with the most distinct values is placed at the
  lowest level of the hierarchy.

             country                                 15 distinct values

       province_or_ state                             65 distinct values

                 city                               3567 distinct values

                street                             674,339 distinct values
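
The heuristic can be sketched in a few lines: count the distinct values of each attribute and order the attributes from fewest (top of the hierarchy) to most (bottom). The rows below are hypothetical. The heuristic is not foolproof: a time dimension may have 20 distinct years, 12 months, and only 7 days of the week, yet the day of the week does not belong at the top.

```python
def generate_hierarchy(rows, attributes):
    """Order attributes from the fewest distinct values (top level) to the most (lowest level)."""
    counts = {a: len({row[a] for row in rows}) for a in attributes}
    ordered = sorted(attributes, key=lambda a: counts[a])
    return ordered, counts


# Hypothetical location data.
rows = [
    {"country": "Canada", "province_or_state": "BC", "city": "Vancouver", "street": "Main St"},
    {"country": "Canada", "province_or_state": "BC", "city": "Vancouver", "street": "Oak St"},
    {"country": "Canada", "province_or_state": "BC", "city": "Victoria", "street": "Fort St"},
    {"country": "Canada", "province_or_state": "ON", "city": "Toronto", "street": "King St"},
    {"country": "Canada", "province_or_state": "ON", "city": "Ottawa", "street": "Bank St"},
]

levels, counts = generate_hierarchy(rows, ["street", "country", "city", "province_or_state"])
print(" > ".join(levels))
```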
Lecture      Topic
**********************************************
Lecture-18   Data mining primitives: What defines a data
             mining task?
Lecture-19   A data mining query language
Lecture-20   Design graphical user interfaces
             based on a data mining query language
Lecture-21   Architecture of data mining systems
Lecture-18
Data mining primitives: What defines a data
                mining task?
   Finding all the patterns autonomously in a database?
      --unrealistic, because there could be too many patterns
      --many would be uninteresting (excluded by the user's constraints)
   Data mining should be an interactive process
     The user directs what is to be mined and when
   Users must be provided with a set of primitives to
    communicate with the data mining system
   Incorporating these primitives in a data mining query language provides
     More flexible user interaction
     A foundation for the design of graphical user interfaces
     Standardization of the data mining industry and practice

   Task-relevant data (Minable View).

   Type of knowledge to be mined

   Background/History knowledge

   Pattern interestingness measurements

   Visualization of discovered patterns


   Database or data warehouse name

   Database tables or data warehouse cubes

   Condition for data selection

   Relevant attributes or dimensions

   Data grouping criteria


   Characterization
   Discrimination
   Association
   Classification/prediction
   Clustering
   Outlier analysis
   Other data mining tasks

   Schema hierarchy
     street < city < province_or_state < country
   Set-grouping hierarchy
     {20-39} = young, {40-59} = middle_aged
   Operation-derived hierarchy
     email address: login-name < department <
      university < country
   Rule-based hierarchy
     low_profit_margin (X) <= price(X, P1) and cost (X,
      P2) and (P1 - P2) < $50


    Simplicity
        association rule length, decision tree size
    Certainty
        confidence, P(A|B) = n(A and B) / n(B), classification
        reliability or accuracy, certainty factor, rule strength, rule
        quality, discriminating weight
    Utility
        potential usefulness, support (association), noise
        threshold (description)
    Novelty
        not previously known, surprising (used to remove
        redundant rules, e.g., the Canada vs. Vancouver rule
        implication support ratio)
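
Support and confidence, two of the measures listed above, can be computed directly from a list of transactions. The transactions below are hypothetical, and the function names are chosen for this sketch only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent) = n(antecedent and consequent) / n(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market-basket data.
transactions = [
    {"wine", "pasta"},
    {"wine", "pasta", "bread"},
    {"wine"},
    {"bread"},
    {"pasta", "bread"},
]

rule_support = support(transactions, {"wine", "pasta"})
rule_confidence = confidence(transactions, {"wine"}, {"pasta"})
print(rule_support, round(rule_confidence, 3))
```

A rule such as wine => pasta would be reported only if both values reach the thresholds chosen by the user.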
   Different backgrounds/usages may require different
    forms of representation
     rules, tables, cross tabs, pie/bar chart
   Concept hierarchy is also important
     Discovered knowledge might be more understandable
      when represented at high level of abstraction
     Interactive drill up/down, pivoting, slicing and dicing
      provide different perspective to data
   Different kinds of knowledge require different representation:
    association, classification, clustering
Lecture-19
A data mining query language
   Motivation
     A DMQL can provide the ability to support ad-hoc and
      interactive data mining
     By providing a standardized language like SQL
      ▪ to achieve an effect similar to the one SQL has had on
        relational databases
      ▪ Foundation for system development and evolution
      ▪ Facilitate information exchange, technology transfer,
        commercialization and wide acceptance
   Design
     DMQL is designed with the primitives
   Syntax for specification of
     task-relevant data
     the kind of knowledge to be mined
     concept hierarchy specification
     interestingness measure
     pattern presentation and visualization
     putting it all together: a DMQL query

   use database database_name, or use data
    warehouse data_warehouse_name
   from relation(s)/cube(s) [where condition]
   in relevance to att_or_dim_list
   order by order_list
   group by grouping_list
   having condition

   Characterization
     Mine_Knowledge_Specification ::=
       mine characteristics [as pattern_name]
       analyze measure(s)
   Discrimination
     Mine_Knowledge_Specification ::=
       mine comparison [as pattern_name]
       for target_class where target_condition
       {versus contrast_class_i where contrast_condition_i}
       analyze measure(s)
   Association
     Mine_Knowledge_Specification ::=
       mine associations [as pattern_name]
   Classification
     Mine_Knowledge_Specification ::=
       mine classification [as pattern_name]
       analyze classifying_attribute_or_dimension
   Prediction
     Mine_Knowledge_Specification ::=
       mine prediction [as pattern_name]
       analyze prediction_attribute_or_dimension
       {set {attribute_or_dimension_i= value_i}}


   To specify what concept hierarchies to use
    use hierarchy <hierarchy> for <attribute_or_dimension>
   Different syntax is used to define different types of hierarchies
     schema hierarchies
      define hierarchy time_hierarchy on date as [date, month, quarter, year]
     set-grouping hierarchies
        define hierarchy age_hierarchy for age on customer as
                 level1: {young, middle_aged, senior} < level0: all
                 level2: {20, ..., 39} < level1: young
                 level2: {40, ..., 59} < level1: middle_aged
                 level2: {60, ..., 89} < level1: senior



   operation-derived hierarchies
    define hierarchy age_hierarchy for age on
      customer as
     {age_category(1), ..., age_category(5)} :=
      cluster(default, age, 5) < all(age)




   rule-based hierarchies
    define hierarchy profit_margin_hierarchy on item as
     level_1: low_profit_margin < level_0: all
             if (price - cost) < $50
     level_1: medium_profit_margin < level_0: all
             if ((price - cost) >= $50) and ((price - cost) <= $250)
     level_1: high_profit_margin < level_0: all
             if (price - cost) > $250
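
The rule-based hierarchy can be read as a small classification function. The sketch below is an illustration, not DMQL itself; it assumes that a margin of exactly $50 counts as medium so that every item falls into exactly one level.

```python
def profit_margin_level(price, cost):
    """Map an item to its level_1 concept in the profit margin hierarchy."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if margin <= 250:
        # Assumption of this sketch: a margin of exactly 50 is treated as medium.
        return "medium_profit_margin"
    return "high_profit_margin"


for price, cost in [(100, 60), (150, 100), (350, 100), (400, 100)]:
    print(price, cost, profit_margin_level(price, cost))
```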


   Interestingness measures and thresholds can be
    specified by the user with the statement:
    with <interest_measure_name> threshold =
      threshold_value
   Example:
    with support threshold = 0.05
    with confidence threshold = 0.7



   Syntax that allows users to specify the display of
    discovered patterns in one or more forms:
      display as <result_form>
   To facilitate interactive viewing at different concept levels,
    the following syntax is defined:
      Multilevel_Manipulation ::= roll up on attribute_or_dimension
                                | drill down on attribute_or_dimension
                                | add attribute_or_dimension
                                | drop attribute_or_dimension
use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchases P, items_sold S, works_at W,
   branch B
where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
     and P.cust_ID = C.cust_ID and P.method_paid = "AmEx"
     and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID
     and B.address = "Canada" and I.price >= 100
with noise threshold = 0.05
display as table


   Association rule language specifications
     MSQL (Imielinski & Virmani ’99)
     MineRule (Meo, Psaila and Ceri ’96)
     Query flocks based on Datalog syntax (Tsur et al. ’98)
   OLE DB for DM (Microsoft, 2000)
     Based on OLE, OLE DB, OLE DB for OLAP
     Integrates DBMS, data warehouse and data mining
   CRISP-DM (CRoss-Industry Standard Process for Data
    Mining)
     Provides a platform and process structure for effective data mining
     Emphasizes deploying data mining technology to solve business
      problems

Lecture-20
Design graphical user interfaces based on a
        data mining query language
   What tasks should be considered in the design of
          GUIs based on a data mining query language?
           Data collection and data mining query composition
           Presentation of discovered patterns

           Hierarchy specification and manipulation
           Manipulation of data mining primitives

           Interactive multilevel mining
           Other miscellaneous information

Lecture-21
Architecture of data mining systems
   Coupling data mining system with DB/DW
    system
     No coupling—flat file processing,
     Loose coupling
      ▪ Fetching data from DB/DW
     Semi-tight coupling—enhanced DM performance

     ▪ Provide efficient implementations of a few data mining primitives
       in the DB/DW system: sorting, indexing, aggregation,
       histogram analysis, multiway join, precomputation of
       some statistical functions

   Tight coupling: a uniform information
    processing environment

     DM is smoothly integrated into the DB/DW system;
     mining queries are optimized using mining query analysis,
     indexing and query processing methods




   Associations
     85 percent of customers who buy a certain brand of wine also buy a certain
      type of pasta
   Sequential patterns
     32 percent of female customers who order a red jacket buy a gray skirt
      within six months
   Classifying
     Frequent customers are those with incomes about $50,000 and having two
      or more children
   Clustering
     Market segmentation
   Predicting
     Predict the revenue value of a new customer based on that customer's
      personal demographic variables

                         Dr. Vaibhav Bansal | Associate Professor - MCA Department | Paper : Data
                         Warehousing & Data Mining [MCA 204]                                        147

Weitere ähnliche Inhalte

Was ist angesagt?

How Real TIme Data Changes the Data Warehouse
How Real TIme Data Changes the Data WarehouseHow Real TIme Data Changes the Data Warehouse
How Real TIme Data Changes the Data Warehousemark madsen
 
Implementing BI & DW Governance
Implementing BI & DW GovernanceImplementing BI & DW Governance
Implementing BI & DW GovernanceDavid Walker
 
Storage Characteristics Of Call Data Records In Column Store Databases
Storage Characteristics Of Call Data Records In Column Store DatabasesStorage Characteristics Of Call Data Records In Column Store Databases
Storage Characteristics Of Call Data Records In Column Store DatabasesDavid Walker
 
6 - Foundations of BI: Database & Info Mgmt
6 - Foundations of BI: Database & Info Mgmt6 - Foundations of BI: Database & Info Mgmt
6 - Foundations of BI: Database & Info MgmtHemant Nagwekar
 
Warehouse Planning and Implementation
Warehouse Planning and ImplementationWarehouse Planning and Implementation
Warehouse Planning and ImplementationSHIKHA GAUTAM
 
Data & database administration hoffer
Data & database administration   hofferData & database administration   hoffer
Data & database administration hofferMohd Arif
 
Evolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital eraEvolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital eraVishal Puri
 
Data warehouse 101-fundamentals-
Data warehouse 101-fundamentals-Data warehouse 101-fundamentals-
Data warehouse 101-fundamentals-AshishGuleria
 
CH10-Managing a Database
CH10-Managing a DatabaseCH10-Managing a Database
CH10-Managing a DatabaseSukanya Ben
 
Data Warehouse: A Primer
Data Warehouse: A PrimerData Warehouse: A Primer
Data Warehouse: A PrimerIJRTEMJOURNAL
 
Rev_3 Components of a Data Warehouse
Rev_3 Components of a Data WarehouseRev_3 Components of a Data Warehouse
Rev_3 Components of a Data WarehouseRyan Andhavarapu
 
Wallchart - Data Warehouse Documentation Roadmap
Wallchart - Data Warehouse Documentation RoadmapWallchart - Data Warehouse Documentation Roadmap
Wallchart - Data Warehouse Documentation RoadmapDavid Walker
 
Capitalizing on the New Era of In-memory Computing
Capitalizing on the New Era of In-memory ComputingCapitalizing on the New Era of In-memory Computing
Capitalizing on the New Era of In-memory ComputingInfosys
 
New Microsoft Office WordDatabase administration and automation Document (2)
New Microsoft Office WordDatabase administration and automation Document (2)New Microsoft Office WordDatabase administration and automation Document (2)
New Microsoft Office WordDatabase administration and automation Document (2)naveen
 

Was ist angesagt? (19)

How Real TIme Data Changes the Data Warehouse
How Real TIme Data Changes the Data WarehouseHow Real TIme Data Changes the Data Warehouse
How Real TIme Data Changes the Data Warehouse
 
Implementing BI & DW Governance
Implementing BI & DW GovernanceImplementing BI & DW Governance
Implementing BI & DW Governance
 
Storage Characteristics Of Call Data Records In Column Store Databases
Storage Characteristics Of Call Data Records In Column Store DatabasesStorage Characteristics Of Call Data Records In Column Store Databases
Storage Characteristics Of Call Data Records In Column Store Databases
 
6 - Foundations of BI: Database & Info Mgmt
6 - Foundations of BI: Database & Info Mgmt6 - Foundations of BI: Database & Info Mgmt
6 - Foundations of BI: Database & Info Mgmt
 
Unit 1
Unit 1Unit 1
Unit 1
 
Warehouse Planning and Implementation
Warehouse Planning and ImplementationWarehouse Planning and Implementation
Warehouse Planning and Implementation
 
Data & database administration hoffer
Data & database administration   hofferData & database administration   hoffer
Data & database administration hoffer
 
Evolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital eraEvolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital era
 
Data warehouse 101-fundamentals-
Data warehouse 101-fundamentals-Data warehouse 101-fundamentals-
Data warehouse 101-fundamentals-
 
CH10-Managing a Database
CH10-Managing a DatabaseCH10-Managing a Database
CH10-Managing a Database
 
Planning Data Warehouse
Planning Data WarehousePlanning Data Warehouse
Planning Data Warehouse
 
Data Warehouse: A Primer
Data Warehouse: A PrimerData Warehouse: A Primer
Data Warehouse: A Primer
 
Database Management
Database ManagementDatabase Management
Database Management
 
Rev_3 Components of a Data Warehouse
Rev_3 Components of a Data WarehouseRev_3 Components of a Data Warehouse
Rev_3 Components of a Data Warehouse
 
Wallchart - Data Warehouse Documentation Roadmap
Wallchart - Data Warehouse Documentation RoadmapWallchart - Data Warehouse Documentation Roadmap
Wallchart - Data Warehouse Documentation Roadmap
 
SQL- Data Base
SQL- Data BaseSQL- Data Base
SQL- Data Base
 
Capitalizing on the New Era of In-memory Computing
Capitalizing on the New Era of In-memory ComputingCapitalizing on the New Era of In-memory Computing
Capitalizing on the New Era of In-memory Computing
 
Week 5
Week 5Week 5
Week 5
 
New Microsoft Office WordDatabase administration and automation Document (2)
New Microsoft Office WordDatabase administration and automation Document (2)New Microsoft Office WordDatabase administration and automation Document (2)
New Microsoft Office WordDatabase administration and automation Document (2)
 

Andere mochten auch

Vbmca204821311240

  • 1. Dr. Vaibhav Bansal | Associate Professor - MCA Department | Paper : Data Warehousing & Data Mining [MCA 204]
  • 2. Provide superior services and products
      • Know the business
      • New products
      • Invest in customers
      • Retain customers
      • Invest in technology
      • Reinvent to face new challenges
  • 3.
      • Ad hoc reports: the earliest stage; Marketing and Finance would send requests to IT for special reports.
      • Special extract programs: an attempt by IT to anticipate the types of reports that would be requested from time to time.
      • Small applications: simple applications built on the extracted files.
      • Information centers: typically a place where users could go to request ad hoc reports or view special information screens.
      • Decision-support systems: sophisticated systems intended to provide strategic information.
      • Executive information systems: an attempt to bring strategic information to the executive desktop.
  • 5. Defined in many different ways
      • A decision support database that is maintained separately from the organization’s operational database
      • Supports information processing by providing a solid platform of consolidated, historical data for analysis
      • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
      • Data warehousing: the process of constructing and using data warehouses
  • 6.
      • Organized around major subjects, such as customer, product, sales
      • Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
      • Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
  • 7.
      • Constructed by integrating multiple, heterogeneous data sources
          ▪ Relational databases, flat files, on-line transaction records
      • Data cleaning and data integration techniques are applied
          ▪ Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
          ▪ E.g., hotel price: currency, tax, breakfast covered, etc.
      • When data is moved to the warehouse, it is converted
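As a rough illustration of the hotel price example, the sketch below converts prices from two source systems to a single warehouse convention. The field names, the exchange rates, and the chosen convention (USD, tax included, breakfast excluded) are all assumptions made up for this example; they do not come from the slides.

```python
# Illustrative, made-up exchange rates (assumption, not real data).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10}


def to_warehouse_price(record):
    """Convert a source price to one convention: USD, tax included, breakfast excluded."""
    rate = RATES_TO_USD[record["currency"]]
    price = record["price"] * rate
    if not record["tax_included"]:
        # Source quotes the price before tax, so add it.
        price *= 1 + record["tax_rate"]
    if record["breakfast_included"]:
        # Source bundles breakfast, so subtract its fee.
        price -= record["breakfast_fee"] * rate
    return round(price, 2)


# Two hypothetical sources that describe the same room differently.
source_a = {"price": 100.0, "currency": "USD", "tax_included": False,
            "tax_rate": 0.10, "breakfast_included": False, "breakfast_fee": 0.0}
source_b = {"price": 110.0, "currency": "EUR", "tax_included": True,
            "tax_rate": 0.10, "breakfast_included": True, "breakfast_fee": 10.0}

print(to_warehouse_price(source_a), to_warehouse_price(source_b))
```

Once both records follow the same convention, their prices can be compared or aggregated directly, which is the point of converting data as it moves into the warehouse.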
  • 8.
      • The time horizon for the data warehouse is significantly longer than that of operational systems
          ▪ Operational database: current value data
          ▪ Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)
      • Every key structure in the data warehouse contains an element of time, explicitly or implicitly
          ▪ The key of operational data may or may not contain a “time element”
  • 9.
      • A physically separate store of data transformed from the operational environment
      • Operational update of data does not occur in the data warehouse environment
          ▪ Does not require transaction processing, recovery, and concurrency control mechanisms
          ▪ Requires only two operations in data accessing: initial loading of data and access of data
  • 10. Distinct features (OLTP vs. OLAP)
      • User and system orientation: customer vs. market
      • Data contents: current, detailed vs. historical, consolidated
      • Database design: ER + application vs. star + subject
      • View: current, local vs. evolutionary, integrated
      • Access patterns: update vs. read-only but complex queries
  • 11.
      • OLTP (on-line transaction processing)
          ▪ Major task of traditional relational DBMS
          ▪ Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
      • OLAP (on-line analytical processing)
          ▪ Major task of data warehouse system
          ▪ Data analysis and decision making
  • 12. OLTP vs. OLAP
      • Users: clerk, IT professional vs. knowledge worker
      • Function: day-to-day operations vs. decision support
      • DB design: application-oriented vs. subject-oriented
      • Data: current, up-to-date, detailed, flat relational, isolated vs. historical, summarized, multidimensional, integrated, consolidated
      • Usage: repetitive vs. ad hoc
      • Access: read/write, index/hash on primary key vs. lots of scans
      • Unit of work: short, simple transaction vs. complex query
      • Number of records accessed: tens vs. millions
      • Number of users: thousands vs. hundreds
      • DB size: 100 MB to GB vs. 100 GB to TB
      • Metric: transaction throughput vs. query throughput, response
  • 13. High performance for both systems
      • DBMS, tuned for OLTP: access methods, indexing, concurrency control, recovery
      • Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
  • 14. Different functions and different data
      • Missing data: decision support requires historical data, which operational DBs do not typically maintain
      • Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
      • Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled
  • 15. Operational vs. Decision Support
      • OLTP vs. complex analysis
      • Information to support day-to-day service vs. historical information to analyze
      • Data stored at transaction level vs. data needs to be integrated
      • Database design: normalized vs. denormalized, star schema
  • 16. MIS and Decision Support
      [Figure labels: production platforms, operational reports, ad hoc access, decision makers]
      • MIS systems provided business data
      • Reports were developed on request
      • Reports provided little analysis capability
      • No personal ad hoc access to data
  • 17.
      [Figure labels: production platforms, operational reports]
      • Data structures are complex
      • Systems are designed for high performance and ERP throughput
      • Data is not meaningfully represented
      • Data is dispersed
      • TPS systems unsuitable for intensive queries
  • 18.
      [Figure labels: operational systems, extracts, decision makers]
      • End user computing offloaded from the operational environment
      • User’s own data
  • 19. Extract explosion
      [Figure labels: operational systems, extracts, decision makers]
      • Duplicated effort
      • Multiple technologies
      • Obsolete reports
      • No metadata
  • 20.
      • No common time basis
      • Different calculation algorithms
      • Different levels of extraction
      • Different levels of granularity
      • Different data field names
      • Different data field meanings
      • Missing information
      • No data correction rules
      • No drill-down capability
  • 21.
      [Figure labels: internal and external systems, data warehouse, decision makers]
      • Controlled
      • Reliable
      • Quality information
      • Single source of data
  • 22. [Figure: operational databases and external data sources pass through extract, clean, transform, load, and refresh into the data warehouse, which has a metadata repository and serves OLAP, data mining, and visualisation]
  • 23. [Figure labels: corporate data warehouse on a mainframe server, accessed by analysts; federated data warehouse, with financial, marketing, manufacturing, and distribution analysts and a corporate data warehouse on a mainframe]
  • 24.
      • Tier 3 (detailed data): corporate data warehouse (mainframe)
      • Tier 2 (summarized data): local data mart
      • Tier 1 (highly summarized data): workstation (analyst)
  • 25. Data Warehouse vs. Data Mart
      • Scope: enterprise vs. department
      • Subjects: multiple vs. single-subject
      • Data source: many vs. few
      • Size (typical): 100 GB to > 1 TB vs. < 100 GB
      • Implementation time: months to years vs. months
  • 26.
      • High performance is achieved by pre-planning the requirements for joins, summations, and periodic reports by end-users
      • There are five main groups of access tools:
          ▪ Data reporting and query tools
          ▪ Application development tools
          ▪ Executive information system (EIS) tools
          ▪ Online analytical processing (OLAP) tools
          ▪ Data mining tools
  • 28.
      • Rotate and drill down
      • Create and examine calculated data
      • Determine comparative or relative differences
      • Perform exception and trend analysis
      • Perform advanced analytical functions
  • 29.
      • Roll up (drill-up): summarize data
          ▪ By climbing up a hierarchy or by dimension reduction
      • Drill down (roll down): reverse of roll-up
          ▪ From higher-level summary to lower-level summary or detailed data, or by introducing new dimensions
      • Slice and dice
          ▪ Project and select
  • 30.
      • Pivot (rotate)
          ▪ Reorient the cube; visualization; 3D to a series of 2D planes
      • Other operations
          ▪ Drill across: involving (across) more than one fact table
          ▪ Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
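The roll-up, slice, dice, and pivot operations listed above can be sketched on a tiny in-memory fact table. This is a minimal illustration in plain Python, not how an OLAP server implements them; the rows, column names, and helper functions (`aggregate`, `slice_`, `dice`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, item, city, country, units_sold)
COLUMNS = ("quarter", "item", "city", "country", "units_sold")
FACTS = [
    ("Q1", "TV", "Toronto", "Canada", 10),
    ("Q1", "PC", "Toronto", "Canada", 5),
    ("Q2", "TV", "Vancouver", "Canada", 7),
    ("Q1", "TV", "Frankfurt", "Germany", 4),
    ("Q2", "PC", "Frankfurt", "Germany", 6),
]


def aggregate(rows, dims):
    """Sum units_sold grouped by the named dimensions."""
    idx = [COLUMNS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: select on a single dimension."""
    i = COLUMNS.index(dim)
    return [r for r in rows if r[i] == value]


def dice(rows, conditions):
    """Dice: select on several dimensions; conditions maps dimension -> allowed values."""
    return [r for r in rows
            if all(r[COLUMNS.index(d)] in allowed for d, allowed in conditions.items())]


by_city = aggregate(FACTS, ["quarter", "city"])         # detailed view (drill-down level)
by_country = aggregate(FACTS, ["quarter", "country"])   # roll-up: climb city -> country
by_quarter = aggregate(FACTS, ["quarter"])              # roll-up by dimension reduction
q1_only = aggregate(slice_(FACTS, "quarter", "Q1"), ["item", "country"])
tv_canada = dice(FACTS, {"item": {"TV"}, "country": {"Canada"}})
pivoted = {(c, q): v for (q, c), v in by_country.items()}  # pivot: swap the two axes
```

Drill-down is simply the reverse direction: moving from `by_country` back to `by_city` requires the more detailed data to still be available.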
  • 31. A multi-dimensional data model
  • 32.
      • 0-D (apex) cuboid: all
      • 1-D cuboids: (time); (item); (location); (supplier)
      • 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
      • 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
      • 4-D (base) cuboid: (time, item, location, supplier)
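The lattice above can be reproduced by enumerating every subset of the four dimensions. A minimal sketch, assuming no hierarchy levels within a dimension; the function name `cuboid_lattice` is made up for the example.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboid_lattice(dims):
    """Map k to the list of k-D cuboids, each a tuple of dimension names."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(DIMENSIONS)
for k, cuboids in lattice.items():
    # The empty tuple is the apex cuboid, shown as "all".
    print(f"{k}-D:", [", ".join(c) or "all" for c in cuboids])

total = sum(len(cuboids) for cuboids in lattice.values())
```

With four dimensions the lattice holds 1 + 4 + 6 + 4 + 1 cuboids, that is, one per subset of the dimension set.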
  • 33. Modeling data warehouses: dimensions & measures
      • Star schema: a fact table in the middle connected to a set of dimension tables
      • Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
      • Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
  • 34. [Figure: star schema]
      • time: time_key, day, day_of_the_week, month, quarter, year
      • item: item_key, item_name, brand, type, supplier_type
      • branch: branch_key, branch_name, branch_type
      • location: location_key, street, city, province_or_state, country
      • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  • 35. [Figure: snowflake schema]
      • time: time_key, day, day_of_the_week, month, quarter, year
      • item: item_key, item_name, brand, type, supplier_key
      • supplier: supplier_key, supplier_type
      • branch: branch_key, branch_name, branch_type
      • location: location_key, street, city_key
      • city: city_key, city, province_or_state, country
      • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
  • 36. [Figure: fact constellation]
      • time: time_key, day, day_of_the_week, month, quarter, year
      • item: item_key, item_name, brand, type, supplier_type
      • branch: branch_key, branch_name, branch_type
      • location: location_key, street, city, province_or_state, country
      • shipper: shipper_key, shipper_name, location_key, shipper_type
      • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
      • Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
  • 37.
      • Cube definition (fact table)
          define cube <cube_name> [<dimension_list>]: <measure_list>
      • Dimension definition (dimension table)
          define dimension <dimension_name> as (<attribute_or_subdimension_list>)
      • Special case (shared dimension tables)
          ▪ First time as “cube definition”
          ▪ define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
  • 38.
      define cube sales_star [time, item, branch, location]:
          dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
      define dimension time as (time_key, day, day_of_week, month, quarter, year)
      define dimension item as (item_key, item_name, brand, type, supplier_type)
      define dimension branch as (branch_key, branch_name, branch_type)
      define dimension location as (location_key, street, city, province_or_state, country)
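One way to read the sales_star definition above is as a relational star schema: a fact table keyed by four dimension tables, with the measures computed as aggregates. The sketch below uses Python's built-in sqlite3 module; the `_dim` and `_fact` table names and the sample rows are assumptions chosen for the example, and the slides do not prescribe this mapping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the raw sales amount.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    branch_key INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    sales_in_dollars REAL
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 15, "Mon", 1, "Q1", 2010), (2, 20, "Tue", 4, "Q2", 2010)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, "TV-A", "BrandX", "TV", "wholesale"), (2, "PC-B", "BrandY", "PC", "retail")])
conn.execute("INSERT INTO branch_dim VALUES (1, 'Main', 'store')")
conn.execute("INSERT INTO location_dim VALUES (1, '1 Main St', 'Toronto', 'Ontario', 'Canada')")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 500.0), (1, 1, 1, 1, 300.0), (2, 2, 1, 1, 1000.0)])

# The three measures of the cube, grouped by quarter and item type.
rows = conn.execute("""
    SELECT t.quarter, i.type,
           SUM(f.sales_in_dollars) AS dollars_sold,
           AVG(f.sales_in_dollars) AS avg_sales,
           COUNT(*) AS units_sold
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
print(rows)
```

Each join follows one arm of the star from the fact table to a dimension table; choosing different GROUP BY columns yields a different cuboid of the same cube.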
  • 39.
      define cube sales_snowflake [time, item, branch, location]:
          dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
      define dimension time as (time_key, day, day_of_week, month, quarter, year)
      define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
  • 40.
      define dimension branch as (branch_key, branch_name, branch_type)
      define dimension location as (location_key, street, city(city_key, province_or_state, country))
  • 41.
      define cube sales [time, item, branch, location]:
          dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
      define dimension time as (time_key, day, day_of_week, month, quarter, year)
      define dimension item as (item_key, item_name, brand, type, supplier_type)
      define dimension branch as (branch_key, branch_name, branch_type)
      define dimension location as (location_key, street, city, province_or_state, country)
  • 42.
      define cube shipping [time, item, shipper, from_location, to_location]:
          dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
      define dimension time as time in cube sales
      define dimension item as item in cube sales
      define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
      define dimension from_location as location in cube sales
      define dimension to_location as location in cube sales
  • 43.
      • Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
          ▪ E.g., count(), sum(), min(), max()
      • Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
          ▪ E.g., avg(), min_N(), standard_deviation()
  • 44.
      • Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
          ▪ E.g., median(), mode(), rank()
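The difference between the three categories shows up when data is split into partitions and only per-partition results are kept. A small sketch with made-up numbers: sum() combines directly, avg() can be rebuilt from two distributive values (sum and count), and median() cannot be recovered from per-partition medians.

```python
from statistics import median

# Hypothetical data split across three partitions.
partitions = [[1, 2, 3], [4, 100], [6]]
all_values = [v for part in partitions for v in part]

# Distributive: combining the partition sums gives the overall sum.
sum_of_partials = sum(sum(part) for part in partitions)


def avg_from_partials(parts):
    """Algebraic: avg needs only two distributive values per partition (sum, count)."""
    total = sum(sum(part) for part in parts)
    count = sum(len(part) for part in parts)
    return total / count


# Holistic: the median of the partition medians is generally not the true median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(sum_of_partials, avg_from_partials(partitions), median_of_medians, true_median)
```

This is why distributive and algebraic measures are cheap to precompute across cuboids, while a holistic measure in general needs access to the underlying values.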
  • 45.
      • all: all
      • region: Europe ... North_America
      • country: Germany ... Spain; Canada ... Mexico
      • city: Frankfurt ...; Vancouver ... Toronto
      • office: L. Chan ... M. Wind
  • 46. Sales volume as a function of product, month, and region
      • Dimensions: Product, Location, Time
      • Hierarchical summarization paths
          ▪ Product: Industry, Category, Product
          ▪ Location: Region, Country, City, Office
          ▪ Time: Year, Quarter, Month or Week, Day
  • 47. [Figure: sample data cube]
      • Date: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
      • Product: TV, PC, VCR, sum
      • Country: U.S.A, Canada, Mexico, sum
      • Annotation: total annual sales of TV in U.S.A.
  • 48.
      • 0-D (apex) cuboid: all
      • 1-D cuboids: (product); (date); (country)
      • 2-D cuboids: (product, date); (product, country); (date, country)
      • 3-D (base) cuboid: (product, date, country)
  • 50. The data is found at the intersection of dimensions.
      [Figure labels: SALES and FINANCE cubes; dimensions Customer, Store, Time, Product]
  • 51. Data warehouse architecture
  • 52.
      • The design of a data warehouse: a business analysis framework
      • The process of data warehouse design
      • A three-tier data warehouse architecture
  • 53. Four views regarding the design of a data warehouse
      • Top-down view
          ▪ Allows selection of the relevant information necessary for the data warehouse
  • 54.
      • Data warehouse view
          ▪ Consists of fact tables and dimension tables
      • Data source view
          ▪ Exposes the information being captured, stored, and managed by operational systems
      • Business query view
          ▪ Sees the perspectives
  • 55.
      • Top-down, bottom-up approaches, or a combination of both
          ▪ Top-down: starts with overall design and planning (mature)
          ▪ Bottom-up: starts with experiments and prototypes (rapid)
      • From a software engineering point of view
          ▪ Waterfall: structured and systematic analysis at each step before proceeding to the next
          ▪ Spiral: rapid generation of increasingly functional systems, with short turnaround time
  • 56. Typical data warehouse design process
      • Choose a business process to model, e.g., orders, invoices, etc.
      • Choose the grain (atomic level of data) of the business process
      • Choose the dimensions that will apply to each fact table record
      • Choose the measure that will populate each fact table record
  • 57. Multi-Tiered Architecture
      [Figure: data sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed through a monitor and integrator into data storage (data warehouse, data marts, metadata); an OLAP server (OLAP engine) serves the front-end tools for analysis, query, reports, and data mining]
  • 58. Metadata is the data defining warehouse objects. It has the following kinds:
      • Description of the structure of the warehouse
          ▪ Schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
      • Operational metadata
          ▪ Data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
      • The algorithms used for summarization
      • The mapping from operational environment to the data warehouse
      • Data related to system performance
          ▪ Warehouse schema, view and derived data definitions
      • Business data
          ▪ Business terms and definitions, ownership of data, charging policies
  • 59.
      • Data extraction
          ▪ Get data from multiple, heterogeneous, and external sources
      • Data cleaning
          ▪ Detect errors in the data and rectify them when possible
      • Data transformation
          ▪ Convert data from legacy or host format to warehouse format
      • Load
          ▪ Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
      • Refresh
          ▪ Propagate the updates from the data sources to the warehouse
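The five steps above can be sketched as a small pipeline of functions. This is a toy illustration under stated assumptions: the sources are in-memory lists of dictionaries, cleaning only drops rows with a missing or negative amount, and loading only keeps per-region totals. Every field and function name is hypothetical.

```python
def extract(sources):
    """Extraction: gather rows from several heterogeneous sources."""
    return [row for source in sources for row in source]


def clean(rows):
    """Cleaning: drop rows whose amount is missing or negative."""
    return [r for r in rows if r.get("amount") is not None and r["amount"] >= 0]


def transform(rows):
    """Transformation: map source-specific values to one warehouse format."""
    return [{"region": r["region"].strip().upper(), "amount": float(r["amount"])}
            for r in rows]


def load(warehouse, rows):
    """Load: consolidate rows into per-region summary totals."""
    for r in rows:
        warehouse[r["region"]] = warehouse.get(r["region"], 0.0) + r["amount"]
    return warehouse


def refresh(warehouse, new_rows):
    """Refresh: propagate newly arrived source rows into the warehouse."""
    return load(warehouse, transform(clean(new_rows)))


# Two hypothetical sources with inconsistent region spellings and bad rows.
legacy = [{"region": " north ", "amount": 120.5}, {"region": "South", "amount": None}]
web = [{"region": "NORTH", "amount": 79.5}, {"region": "south", "amount": -5}]

warehouse = load({}, transform(clean(extract([legacy, web]))))
after_initial_load = dict(warehouse)
refresh(warehouse, [{"region": "south", "amount": 50}])
print(after_initial_load, warehouse)
```

A real load step would also build indices and partitions and check integrity, which this sketch leaves out.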
  • 60.
      - Enterprise warehouse
          ▪ Collects all of the information about subjects spanning the entire organization
      - Data mart
          ▪ A subset of corporate-wide data that is of value to specific groups of users. Its scope is confined to specific, selected groups, such as a marketing data mart
          ▪ Independent vs. dependent (sourced directly from the warehouse) data mart
      - Virtual warehouse
          ▪ A set of views over operational databases
          ▪ Only some of the possible summary views may be materialized
  • 61. [Figure: development path toward a multi-tier data warehouse. Labels: Define a high-level corporate data model; Model refinement (twice); Data Mart (twice); Distributed Data Marts; Enterprise Data Warehouse; Multi-Tier Data Warehouse.]
  • 62.
      - Relational OLAP (ROLAP)
          ▪ Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
          ▪ Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
          ▪ Greater scalability
      - Multidimensional OLAP (MOLAP)
          ▪ Array-based multidimensional storage engine (sparse matrix techniques)
          ▪ Fast indexing to pre-computed summarized data
  • 63.
      - Hybrid OLAP (HOLAP)
          ▪ User flexibility, e.g., low level: relational, high level: array
      - Specialized SQL servers
          ▪ Specialized support for SQL queries over star/snowflake schemas
  • 64. Data warehouse implementation
  • 65. A data cube can be viewed as a lattice of cuboids
      - The bottom-most cuboid is the base cuboid
      - The top-most cuboid (apex) contains only one cell
      - How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
            $T = \prod_{i=1}^{n} (L_i + 1)$
      - Materialization of the data cube
          ▪ Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
          ▪ Selection of which cuboids to materialize: based on size, sharing, access frequency, etc.
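The cuboid count above can be checked with a few lines of Python. This is only a minimal sketch: the function name `count_cuboids` and the sample level counts are illustrative assumptions, not part of the slides.

```python
from math import prod


def count_cuboids(levels_per_dimension):
    """Return T = product of (L_i + 1) over all dimensions.

    The extra 1 per dimension accounts for the top level "all".
    """
    return prod(levels + 1 for levels in levels_per_dimension)


# Assumed example: three dimensions with one level each (no hierarchies).
flat_cube = count_cuboids([1, 1, 1])   # 2 * 2 * 2
# Assumed example: dimensions with 4, 3, and 2 hierarchy levels.
deep_cube = count_cuboids([4, 3, 2])   # 5 * 4 * 3
print(flat_cube, deep_cube)
```

With no hierarchies the formula reduces to 2^n, which is why full materialization becomes expensive as dimensions are added.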
  • 66. Cube definition and computation in DMQL
          define cube sales[item, city, year]: sum(sales_in_dollars)
          compute cube sales
      - Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.'96)
          SELECT item, city, year, SUM (amount)
          FROM SALES
          CUBE BY item, city, year
      - Need to compute the following group-bys:
          (city, item, year),
          (city, item), (city, year), (item, year),
          (city), (item), (year),
          ()
      [Figure: lattice of these cuboids, from the apex () down to the base cuboid (city, item, year)]
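The group-bys implied by a cube by clause are simply all subsets of its dimension list. A minimal Python sketch of that enumeration follows; the helper name `cube_group_bys` is an assumption made for illustration, and the code only lists the groupings rather than computing any aggregates.

```python
from itertools import combinations


def cube_group_bys(dimensions):
    """Return every subset of the dimensions, from the full set down to ()."""
    group_bys = []
    for size in range(len(dimensions), -1, -1):
        group_bys.extend(combinations(dimensions, size))
    return group_bys


group_bys = cube_group_bys(["city", "item", "year"])
for group_by in group_bys:
    print(group_by)
```

For three dimensions this yields 2^3 = 8 groupings, one per cuboid in the lattice.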
  • 67. Efficient cube computation methods
      - ROLAP-based cubing algorithms (Agarwal et al.'96)
      - Array-based cubing algorithm (Zhao et al.'97)
      - Bottom-up computation method (Beyer & Ramakrishnan'99)
    ROLAP-based cubing algorithms
      - Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
      - Grouping is performed on some sub-aggregates as a "partial grouping step"
      - Aggregates may be computed from previously computed aggregates, rather than from the base fact table
  • 68.
      - Partition arrays into chunks (a chunk is a small subcube which fits in memory)
      - Compressed sparse array addressing: (chunk_id, offset)
      - Compute aggregates in "multiway" fashion by visiting cube cells in the order which minimizes the number of times each cell is visited, and reduces memory access and storage cost
  • 69. [Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
  • 70.
      - Method: the planes should be sorted and computed according to their size in ascending order
          ▪ Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
      - Limitation of the method: it performs well only for a small number of dimensions
          ▪ If there are a large number of dimensions, "bottom-up computation" and iceberg cube computation methods can be explored
  • 71.
      - Index on a particular column
      - Each value in the column has a bit vector: bit operations are fast
      - The length of the bit vector: number of records in the base table
      - The i-th bit is set if the i-th row of the base table has the value for the indexed column
      - Not suitable for high-cardinality domains

      Base table                  Index on Region                     Index on Type
      Cust  Region   Type         RecID  Asia  Europe  America        RecID  Retail  Dealer
      C1    Asia     Retail       1      1     0       0              1      1       0
      C2    Europe   Dealer       2      0     1       0              2      0       1
      C3    Asia     Dealer       3      1     0       0              3      0       1
      C4    America  Retail       4      0     0       1              4      1       0
      C5    Europe   Dealer       5      0     1       0              5      0       1
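The bitmap index on this slide can be reproduced with a short Python sketch. The function name `build_bitmap_index` and the dictionary-based row layout are assumptions made for illustration; a real system would pack the bits rather than store Python lists.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to a bit vector (list of 0/1)."""
    index = {}
    for position, row in enumerate(rows):
        value = row[column]
        if value not in index:
            index[value] = [0] * len(rows)  # one bit per base-table record
        index[value][position] = 1
    return index


# Base table from the slide.
base_table = [
    {"cust": "C1", "region": "Asia", "type": "Retail"},
    {"cust": "C2", "region": "Europe", "type": "Dealer"},
    {"cust": "C3", "region": "Asia", "type": "Dealer"},
    {"cust": "C4", "region": "America", "type": "Retail"},
    {"cust": "C5", "region": "Europe", "type": "Dealer"},
]

region_index = build_bitmap_index(base_table, "region")
type_index = build_bitmap_index(base_table, "type")

# A conjunctive selection becomes a bitwise AND of two bit vectors.
europe_dealers = [a & b for a, b in zip(region_index["Europe"], type_index["Dealer"])]
print(europe_dealers)  # [0, 1, 0, 0, 1]: records 2 and 5
```

Each distinct value needs its own vector, which is why the approach suits low-cardinality columns.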
  • 72.
      - Join index: JI(R-id, S-id) where R(R-id, …) ⋈ S(S-id, …)
      - Traditional indices map the values to a list of record ids
          ▪ A join index materializes the relational join in the JI file and speeds up the relational join, a rather costly operation
      - In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table
          ▪ E.g., fact table Sales and two dimensions, city and product
          ▪ A join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city
      - Join indices can span multiple dimensions
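A join index on one dimension can be sketched as a mapping from each dimension value to the fact-table record IDs that reference it. The table contents, city names, and the function name `build_join_index` below are hypothetical and only illustrate the idea.

```python
def build_join_index(fact_rows, dimension_column):
    """Map each dimension value to the list of fact-table record IDs (1-based)."""
    index = {}
    for record_id, row in enumerate(fact_rows, start=1):
        index.setdefault(row[dimension_column], []).append(record_id)
    return index


# Hypothetical Sales fact table with two dimensions, city and product.
sales = [
    {"city": "Delhi", "product": "TV", "amount": 300},
    {"city": "Mumbai", "product": "TV", "amount": 250},
    {"city": "Delhi", "product": "Phone", "amount": 120},
]

city_index = build_join_index(sales, "city")
print(city_index)  # {'Delhi': [1, 3], 'Mumbai': [2]}
```

A query on one city then reads only the listed fact rows instead of joining the whole fact table with the dimension table.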
  • 73.
      - Determine which operations should be performed on the available cuboids
          ▪ Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
      - Determine to which materialized cuboid(s) the relevant operations should be applied
      - Explore indexing structures and compressed vs. dense array structures in MOLAP
  • 75. Why data preprocessing?
      - Data in the real world is:
          ▪ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
          ▪ noisy: containing errors or outliers
          ▪ inconsistent: containing discrepancies in codes or names
      - No quality data, no quality mining results!
          ▪ Quality decisions must be based on quality data
          ▪ A data warehouse needs consistent integration of quality data
  • 76. A well-accepted multidimensional view:
      - Accuracy
      - Completeness
      - Consistency
      - Timeliness
      - Believability
      - Value added
      - Interpretability
      - Accessibility
    Broad categories: intrinsic, contextual, representational, and accessibility.
  • 77.
      - Data cleaning
          ▪ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
      - Data integration
          ▪ Integration of multiple databases, data cubes, or files
      - Data transformation
          ▪ Normalization and aggregation
      - Data reduction
          ▪ Obtains a representation reduced in volume that produces the same or similar analytical results
      - Data discretization
          ▪ Part of data reduction but with particular importance, especially for numerical data
  • 79. Lecture: Data cleaning
  • 80. Data cleaning tasks
      - Fill in missing values
      - Identify outliers and smooth out noisy data
      - Correct inconsistent data
  • 81. Data is not always available
      - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
      - Missing data may be due to
          ▪ equipment malfunction
          ▪ data that was inconsistent with other recorded data and thus deleted
          ▪ data not entered due to misunderstanding
          ▪ certain data not being considered important at the time of entry
          ▪ history or changes of the data not being registered
      - Missing data may need to be inferred
  • 82.
      - Ignore the tuple: usually done when the class label is missing
      - Fill in the missing value manually
      - Use a global constant to fill in the missing value, e.g., "unknown"
  • 83.
      - Use the attribute mean to fill in the missing value
      - Use the attribute mean for all samples belonging to the same class to fill in the missing value
      - Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
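The two mean-based strategies can be sketched in Python as follows. The function names, the use of `None` to mark a missing value, and the sample income data are assumptions made for illustration; the sketch also assumes every class has at least one known value.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None with the mean of all known values of the attribute."""
    attribute_mean = mean(v for v in values if v is not None)
    return [attribute_mean if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace None with the mean of known values sharing the same class label."""
    known_by_class = {}
    for value, label in zip(values, labels):
        if value is not None:
            known_by_class.setdefault(label, []).append(value)
    class_means = {label: mean(vs) for label, vs in known_by_class.items()}
    return [class_means[label] if value is None else value
            for value, label in zip(values, labels)]


# Hypothetical customer incomes (in thousands) with two missing entries.
income = [30, None, 50, 70, None, 90]
credit_class = ["low", "low", "low", "high", "high", "high"]

global_fill = fill_with_mean(income)
class_fill = fill_with_class_mean(income, credit_class)
print(global_fill)
print(class_fill)
```

The class-conditional version usually distorts the data less, because the filled value reflects the group the tuple belongs to rather than the whole population.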
  • 84.
      - Noise: random error or variance in a measured variable
      - Incorrect attribute values may be due to
          ▪ faulty data collection instruments
          ▪ data entry problems
          ▪ data transmission problems
          ▪ technology limitation
          ▪ inconsistency in naming convention
      - Other data problems which require data cleaning
          ▪ duplicate records
          ▪ incomplete data
          ▪ inconsistent data
  • 85.
      - Binning method
          ▪ First sort the data and partition it into (equal-frequency) bins
          ▪ Then smooth by bin means, by bin medians, or by bin boundaries
      - Clustering
          ▪ Detect and remove outliers
      - Regression
          ▪ Smooth by fitting the data to a regression function, e.g., linear regression
  • 86.
      - Equal-width (distance) partitioning
          ▪ Divides the range into N intervals of equal size: uniform grid
          ▪ If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
          ▪ The most straightforward approach
          ▪ But outliers may dominate the presentation
          ▪ Skewed data is not handled well
      - Equal-depth (frequency) partitioning
          ▪ Divides the range into N intervals, each containing approximately the same number of samples
          ▪ Good data scaling
          ▪ Managing categorical attributes can be tricky
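The two partitioning schemes can be compared on a small sorted price list. This is a minimal sketch: the function names are illustrative assumptions, and the equal-width version assumes the values are not all identical (otherwise W would be zero).

```python
def equal_width_bins(values, n):
    """Split the value range into n intervals of width W = (B - A) / n."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n
    bins = [[] for _ in range(n)]
    for value in values:
        # The maximum value would index bin n, so clamp it into the last bin.
        position = min(int((value - lowest) / width), n - 1)
        bins[position].append(value)
    return bins


def equal_depth_bins(values, n):
    """Split the sorted values into n bins of approximately equal size."""
    ordered = sorted(values)
    return [ordered[i * len(ordered) // n:(i + 1) * len(ordered) // n]
            for i in range(n)]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_bins = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)
print(width_bins)  # bin sizes 3, 3, 6
print(depth_bins)  # bin sizes 4, 4, 4
```

On this data the equal-width bins are unevenly filled, while the equal-depth bins hold four values each.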
  • 87.
      * Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
      * Partition into (equi-depth) bins:
          - Bin 1: 4, 8, 9, 15
          - Bin 2: 21, 21, 24, 25
          - Bin 3: 26, 28, 29, 34
      * Smoothing by bin means (rounded):
          - Bin 1: 9, 9, 9, 9
          - Bin 2: 23, 23, 23, 23
          - Bin 3: 29, 29, 29, 29
      * Smoothing by bin boundaries:
          - Bin 1: 4, 4, 4, 15
          - Bin 2: 21, 21, 25, 25
          - Bin 3: 26, 26, 26, 34
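The smoothing results in this example can be reproduced with a short Python sketch. The function names are illustrative assumptions; bin means are rounded to whole dollars as on the slide, and sending ties to the lower boundary is an assumption the slide does not state.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an assumption).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


price_bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(price_bins)
by_boundaries = smooth_by_boundaries(price_bins)
print(by_means)
print(by_boundaries)
```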
  • 88. Cluster Analysis is based on Regression Lines.
  • 89. Lecture: Data integration and transformation
  • 90.
      - Data integration
          ▪ Combines data from multiple sources into a coherent store
      - Schema integration
          ▪ Integrate metadata from different sources
          ▪ Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
      - Detecting and resolving data value conflicts
          ▪ For the same real-world entity, attribute values from different sources are different
          ▪ Possible reasons: different representations, different scales, e.g., metric vs. British units
  • 91. Redundant data often occur when multiple databases are integrated
      - The same attribute may have different names in different databases
      - One attribute may be a "derived" attribute in another table, e.g., annual revenue
  • 92.
      - Redundant data can often be detected by correlation analysis
      - Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
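One common form of correlation analysis for numeric attributes is the Pearson correlation coefficient; the slides do not say which measure is intended, so its use here is an assumption. In the sketch below, the attribute names and values are hypothetical, and a coefficient near +1 or -1 suggests that one attribute may be derivable from the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical attributes: annual revenue is derived from monthly revenue.
monthly_revenue = [10, 12, 9, 15, 11]
annual_revenue = [12 * m for m in monthly_revenue]

r = pearson(monthly_revenue, annual_revenue)
print(r)  # very close to 1, so one of the two attributes is redundant
```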
  • 93.
      - Smoothing: remove noise from data
      - Aggregation: summarization, data cube construction
      - Generalization: concept hierarchy climbing
  • 94. Normalization: values are scaled to fall within a small, specified range
      - Min-max normalization takes data measured in its engineering units (for example, miles per hour or degrees) and transforms it to a value between 0.0 and 1.0. The lowest (min) value is set to 0.0 and the highest (max) value is set to 1.0. This provides an easy way to compare values that are measured using different scales (for example, degrees Celsius and degrees Fahrenheit) or different units of measure (speed and distance). The normalized value is defined as: (the value - the minimum) / (the range of values).
      - Example: consider temperature readings with values 20, 24, 26, 27, 30. The minimum value is 20 and the maximum value is 30, so the data has a range of 10 degrees. For a temperature reading of 26, the normalized value is (26 - 20) / 10 = 0.6.
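The temperature example can be reproduced with a minimal Python sketch. The function name and the optional `new_min`/`new_max` parameters are assumptions added for illustration, and the code assumes the values are not all equal (the range must be nonzero).

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min maps to new_min and max to new_max."""
    lowest, highest = min(values), max(values)
    value_range = highest - lowest
    return [new_min + (v - lowest) / value_range * (new_max - new_min)
            for v in values]


temperatures = [20, 24, 26, 27, 30]
normalized = min_max_normalize(temperatures)
print(normalized)  # the reading 26 maps to 0.6
```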
  • 95.
      - z-score normalization: each value v is replaced by v' = (v - mean) / (standard deviation), so the normalized values have mean 0 and standard deviation 1. Applied to image pixel values, for example, this makes the images of different individuals more comparable.
      - Normalization by decimal scaling.
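Both methods can be sketched in a few lines of Python. The function names and sample values are illustrative assumptions. The z-score version uses the population standard deviation, and decimal scaling is taken in its usual textbook form, v' = v / 10^j with j the smallest integer such that max(|v'|) < 1, which the slide itself does not spell out.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]


def decimal_scaling_normalize(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


z_scores = z_score_normalize([20, 24, 26, 27, 30])
scaled = decimal_scaling_normalize([-986, 917])
print(z_scores)
print(scaled)  # divides by 10**3
```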
  • 96. Lecture: Data reduction
  • 97.
      - A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
      - Data reduction
          ▪ Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
          ▪ Example: materialized views in Oracle
  • 98. Data reduction strategies
      - Data cube aggregation
      - Attribute subset selection
      - Dimensionality reduction
      - Numerosity reduction
      - Discretization and concept hierarchy generation
  • 99.
      - The lowest level of a data cube (data reduced to the concept level required for analysis)
          ▪ The aggregated data for an individual entity of interest
          ▪ E.g., a customer in a phone calling data warehouse
      - Multiple levels of aggregation in data cubes
          ▪ Further reduce the size of the data to deal with in analysis
      - Reference appropriate levels
          ▪ Use the smallest representation which is enough to solve the task
      - Queries regarding aggregated information should be answered using the data cube
  • 100. Feature selection (i.e., attribute subset selection):
      - Select only the necessary attributes
      - The goal is to find a minimum set of features such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using the values of all attributes
  • 101.
      - Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
      - Heuristic methods
          ▪ Step-wise forward selection: {} → {A1, A3} → {A1, A3, A5}
          ▪ Step-wise backward elimination: {A1, A2, A3, A4, A5} → {A1, A3, A4, A5} → {A1, A3, A5}
          ▪ Combining forward selection and backward elimination
          ▪ Decision-tree induction
  • 102. Example of Decision Tree Induction
      - Initial attribute set: {A1, A2, A3, A4, A5, A6}
      [Figure: decision tree with root test A4?, internal tests A1? and A6?, and leaves Class 1, Class 2, Class 1, Class 2]
      - Reduced attribute set: {A1, A4, A6}
  • 104. Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes.
      - The goal is to find a minimum set of attributes
      - Uses basic heuristic methods of attribute selection
  • 105.
      - Linear regression: data are modeled to fit a straight line
          ▪ Often uses the least-squares method to fit the line
      - Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
      - Log-linear model: approximates discrete multidimensional probability distributions
  • 106.
      - Three types of attributes:
          ▪ Nominal: values from an unordered set
          ▪ Ordinal: values from an ordered set
          ▪ Continuous: real numbers
      - Discretization: divide the range of a continuous attribute into intervals
          ▪ Some classification algorithms only accept categorical attributes
          ▪ Reduce data size by discretization
          ▪ Prepare for further analysis
  • 107.
      - Discretization
          ▪ Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
      - Concept hierarchies
          ▪ Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior).
  • 108.
      - Binning
      - Histogram analysis
      - Clustering analysis
      - Entropy-based discretization
      - Discretization by intuitive partitioning
  • 109.
      - A popular data reduction technique
      - Divide data into buckets and store the average (sum) for each bucket
      - Can be constructed optimally in one dimension using dynamic programming
      - Related to quantization problems
      [Figure: histogram; vertical axis from 0 to 40, horizontal axis from 10000 to 90000]
  • 110.
      - Partition the data set into clusters, and store only the cluster representation
      - Can be very effective if the data is clustered, but not if the data is "smeared"
      - Clustering can be hierarchical and stored in multi-dimensional index tree structures
      - There are many choices of clustering definitions and clustering algorithms
  • 111.
      - Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
            $E(S, T) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2)$
      - The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
      - The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., a threshold on the information gain $Ent(S) - E(S, T)$
      - Experiments show that it may reduce data size and improve classification accuracy
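One step of entropy-based discretization, choosing the single best boundary T, can be sketched as follows. The function names, the choice of midpoints between adjacent distinct values as candidate boundaries, and the sample data are assumptions made for illustration; the recursive application and the stopping criterion are left out.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Ent(S): class entropy of a list of labels, in bits."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def best_boundary(samples):
    """Return (T, E(S, T)) minimizing the weighted entropy after the split.

    samples: list of (numeric_value, class_label) pairs.
    """
    ordered = sorted(samples)
    values = [value for value, _ in ordered]
    labels = [label for _, label in ordered]
    best = None
    for i in range(1, len(ordered)):
        if values[i] == values[i - 1]:
            continue  # no boundary can separate equal values
        boundary = (values[i - 1] + values[i]) / 2
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        if best is None or weighted < best[1]:
            best = (boundary, weighted)
    return best


# Hypothetical samples: (age, buys_product).
samples = [(21, "no"), (25, "no"), (30, "no"), (45, "yes"), (52, "yes"), (60, "yes")]
boundary, split_entropy = best_boundary(samples)
gain = entropy([label for _, label in samples]) - split_entropy
print(boundary, gain)
```

Here the boundary 37.5 separates the two classes perfectly, so the entropy after the split is 0 and the gain equals Ent(S) = 1 bit.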
  • 112. The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals.
      - If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals (equal-width for 3, 6, and 9; a 2-3-2 grouping for 7)
          ▪ Example: range 3 to 9, width (9 - 3)/3 = 2: [3, 5), [5, 7), [7, 9]
      - If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
          ▪ Example: range 2 to 8, width (8 - 2)/4 = 1.5: [2, 3.5), [3.5, 5), [5, 6.5), [6.5, 8]
      - If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
          ▪ Example: range 1 to 10, width (10 - 1)/5 = 1.8: [1, 2.8), [2.8, 4.6), [4.6, 6.4), [6.4, 8.2), [8.2, 10]
  • 113.
      - Specification of a partial ordering of attributes explicitly at the schema level by users or experts
      - Specification of a portion of a hierarchy by explicit data grouping
      - Specification of a set of attributes, but not of their partial ordering
      - Specification of only a partial set of attributes
  • 114. Sampling allows a large data set to be represented by a smaller sample of the data.
      - Let a large data set D contain N tuples
      - Methods to reduce data set D:
          ▪ Simple random sample without replacement (SRSWOR)
          ▪ Simple random sample with replacement (SRSWR)
          ▪ Cluster sample
          ▪ Stratified sample
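Three of these sampling methods can be sketched with Python's standard `random` module. The function names, the proportional allocation used for the stratified sample, and the sample data are assumptions made for illustration; a cluster sample is omitted.

```python
import random


def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: s distinct tuples."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample WITH replacement: a tuple may be drawn repeatedly."""
    return [rng.choice(data) for _ in range(s)]


def stratified_sample(data, key, fraction, rng):
    """Draw the same fraction from every stratum (at least one tuple each)."""
    strata = {}
    for row in data:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for rows in strata.values():
        size = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, size))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical data set D: 80 young and 20 senior customers.
customers = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]

without = srswor(customers, 10, rng)
with_replacement = srswr(customers, 10, rng)
stratified = stratified_sample(customers, key=lambda row: row[0], fraction=0.1, rng=rng)
print(len(without), len(with_replacement), len(stratified))
```

The stratified sample keeps the small "senior" group represented, which a simple random sample of the same size does not guarantee.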
  • 115. Sampling [Figure: samples drawn from the raw data]
  • 116. [Figure: raw data and the resulting cluster/stratified sample]
  • 117. A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.
          country              15 distinct values
          province_or_state    65 distinct values
          city                 3,567 distinct values
          street               674,339 distinct values
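The heuristic amounts to sorting the attributes by their number of distinct values. A minimal Python sketch using the counts from the slide follows; the function name `generate_hierarchy` is an assumption made for illustration.

```python
def generate_hierarchy(distinct_counts):
    """Order attributes from the top level (fewest distinct values) to the
    lowest level (most distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get)


# Distinct-value counts from the slide, deliberately given out of order.
counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 65,
}

levels = generate_hierarchy(counts)
print(" < ".join(reversed(levels)))  # street < city < province_or_state < country
```

The heuristic is not foolproof: an attribute set such as year, month, and weekday would be ordered incorrectly, so the generated hierarchy should be reviewed by a user or expert.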
  • 118. Lecture topics
      - Data mining primitives: What defines a data mining task?
      - A data mining query language
      - Design graphical user interfaces based on a data mining query language
      - Architecture of data mining systems
  • 119. Lecture Data mining primitives: What defines a data mining task?
  • 120.
      - Finding all the patterns autonomously in a database?
          ▪ Unrealistic, because the patterns could be too many
          ▪ Many would be uninteresting (not to be considered due to constraints)
      - Data mining should be an interactive process
          ▪ The user directs what is to be mined and when
      - Users must be provided with a set of primitives to communicate with the data mining system
      - Incorporating these primitives in a data mining query language gives
          ▪ More flexible user interaction
          ▪ A foundation for the design of graphical user interfaces
          ▪ Standardization of data mining industry and practice
  • 121.
      - Task-relevant data (minable view)
      - Type of knowledge to be mined
      - Background/history knowledge
      - Pattern interestingness measurements
      - Visualization of discovered patterns
  • 122.
      - Database or data warehouse name
      - Database tables or data warehouse cubes
      - Condition for data selection
      - Relevant attributes or dimensions
      - Data grouping criteria
  • 123.
      - Characterization
      - Discrimination
      - Association
      - Classification/prediction
      - Clustering
      - Outlier analysis
      - Other data mining tasks
  • 124.
      - Schema hierarchy
          ▪ street < city < province_or_state < country
      - Set-grouping hierarchy
          ▪ {20-39} = young, {40-59} = middle_aged
      - Operation-derived hierarchy
          ▪ email address: login-name < department < university < country
      - Rule-based hierarchy
          ▪ low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
  • 125.
      - Simplicity: association rule length, decision tree size
      - Certainty: confidence, P(A|B) = n(A and B) / n(B); classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight
      - Utility: potential usefulness, support (association), noise threshold (description)
      - Novelty: not previously known, surprising (used to remove redundant rules, e.g., Canada vs. Vancouver rule implication support ratio)
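Support and confidence for an association rule B ⇒ A can be computed directly from counts, following P(A|B) = n(A and B) / n(B). The sketch below is illustrative only: the function name, the representation of transactions as Python sets, and the transaction data are assumptions.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    # Confidence is undefined when the antecedent never occurs; report 0.0.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market-basket transactions.
transactions = [
    {"wine", "pasta"},
    {"wine", "pasta", "bread"},
    {"wine"},
    {"bread"},
    {"pasta"},
]

support, confidence = support_and_confidence(transactions, {"wine"}, {"pasta"})
print(support, confidence)  # 2 of 5 transactions; 2 of the 3 wine transactions
```

A rule is typically kept only if both values meet user-specified thresholds.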
  • 126.
      - Different backgrounds/usages may require different forms of representation
          ▪ Rules, tables, cross tabs, pie/bar charts
      - Concept hierarchy is also important
          ▪ Discovered knowledge might be more understandable when represented at a high level of abstraction
          ▪ Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data
      - Different kinds of knowledge require different representations: association, classification, clustering
  • 127. Lecture-19 A data mining query language
  • 128.
      - Motivation
          ▪ A DMQL can provide the ability to support ad hoc and interactive data mining
          ▪ By providing a standardized language like SQL, it aims
              to achieve an effect similar to the one SQL has had on relational databases
              to be a foundation for system development and evolution
              to facilitate information exchange, technology transfer, commercialization, and wide acceptance
      - Design
          ▪ DMQL is designed with the primitives
  • 129. Syntax for specification of
      - task-relevant data
      - the kind of knowledge to be mined
      - concept hierarchy specification
      - interestingness measure
      - pattern presentation and visualization
    Together these form a DMQL query
  • 130.
          use database database_name, or use data warehouse data_warehouse_name
          from relation(s)/cube(s) [where condition]
          in relevance to att_or_dim_list
          order by order_list
          group by grouping_list
          having condition
  • 131.
      - Characterization
          Mine_Knowledge_Specification ::= mine characteristics [as pattern_name] analyze measure(s)
      - Discrimination
          Mine_Knowledge_Specification ::= mine comparison [as pattern_name] for target_class where target_condition {versus contrast_class_i where contrast_condition_i} analyze measure(s)
      - Association
          Mine_Knowledge_Specification ::= mine associations [as pattern_name]
  • 132.
      - Classification
          Mine_Knowledge_Specification ::= mine classification [as pattern_name] analyze classifying_attribute_or_dimension
      - Prediction
          Mine_Knowledge_Specification ::= mine prediction [as pattern_name] analyze prediction_attribute_or_dimension {set {attribute_or_dimension_i = value_i}}
  • 133. To specify what concept hierarchies to use:
          use hierarchy <hierarchy> for <attribute_or_dimension>
    Different syntax is used to define different types of hierarchies
      - Schema hierarchies
          define hierarchy time_hierarchy on date as [date, month, quarter, year]
      - Set-grouping hierarchies
          define hierarchy age_hierarchy for age on customer as
              level1: {young, middle_aged, senior} < level0: all
              level2: {20, ..., 39} < level1: young
              level2: {40, ..., 59} < level1: middle_aged
              level2: {60, ..., 89} < level1: senior
  • 134.
      - Operation-derived hierarchies
          define hierarchy age_hierarchy for age on customer as
              {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
  • 135.
      - Rule-based hierarchies
          define hierarchy profit_margin_hierarchy on item as
              level_1: low_profit_margin < level_0: all
                  if (price - cost) < $50
              level_1: medium_profit_margin < level_0: all
                  if ((price - cost) >= $50) and ((price - cost) <= $250)
              level_1: high_profit_margin < level_0: all
                  if (price - cost) > $250
  • 136. Interestingness measures and thresholds can be specified by the user with the statement:
          with <interest_measure_name> threshold = threshold_value
      - Example:
          with support threshold = 0.05
          with confidence threshold = 0.7
  • 137.
      - Syntax which allows users to specify the display of discovered patterns in one or more forms:
          display as <result_form>
      - To facilitate interactive viewing at different concept levels, the following syntax is defined:
          Multilevel_Manipulation ::= roll up on attribute_or_dimension
              | drill down on attribute_or_dimension
              | add attribute_or_dimension
              | drop attribute_or_dimension
  • 138.
          use database AllElectronics_db
          use hierarchy location_hierarchy for B.address
          mine characteristics as customerPurchasing
          analyze count%
          in relevance to C.age, I.type, I.place_made
          from customer C, item I, purchases P, items_sold S, works_at W, branch B
          where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
              and P.cust_ID = C.cust_ID and P.method_paid = "AmEx"
              and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID
              and B.address = "Canada" and I.price >= 100
          with noise threshold = 0.05
          display as table
  • 139. Association rule language specifications: MSQL (Imielinski & Virmani '99); MineRule (Meo, Psaila, and Ceri '96); query flocks based on Datalog syntax (Tsur et al. '98). OLE DB for DM (Microsoft, 2000): based on OLE, OLE DB, and OLE DB for OLAP; integrates DBMS, data warehouse, and data mining. CRISP-DM (CRoss-Industry Standard Process for Data Mining): provides a platform and process structure for effective data mining; emphasizes deploying data mining technology to solve business problems.
  • 140. Lecture-20 Design graphical user interfaces based on a data mining query language
  • 141. What tasks should be considered in the design of GUIs based on a data mining query language?
        Data collection and data mining query composition
        Presentation of discovered patterns
        Hierarchy specification and manipulation
        Manipulation of data mining primitives
        Interactive multilevel mining
        Other miscellaneous information
  • 143. Coupling a data mining system with a DB/DW system
        No coupling: flat file processing
        Loose coupling: fetching data from the DB/DW
        Semi-tight coupling: enhanced DM performance
            Provide efficient implementations of a few data mining primitives in the DB/DW system: sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
    Lecture-21 - Architecture of data mining systems
  • 144. Tight coupling: a uniform information processing environment
        DM is smoothly integrated into the DB/DW system; mining queries are optimized based on mining query analysis, indexing, and query processing methods
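The difference between loose and semi-tight coupling can be seen in where a primitive such as aggregation is executed. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the purchases table and its rows are made up for illustration.

```python
import sqlite3
from collections import Counter

# Hypothetical purchase data held in a database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (cust_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, "wine"), (1, "pasta"), (2, "wine"), (3, "pasta"), (3, "wine")],
)

# Loose coupling: the DB only serves rows; counting happens in the mining program.
rows = conn.execute("SELECT item FROM purchases").fetchall()
loose_counts = Counter(item for (item,) in rows)

# Semi-tight coupling: the aggregation primitive runs inside the DB.
semi_tight_counts = dict(
    conn.execute("SELECT item, COUNT(*) FROM purchases GROUP BY item").fetchall()
)

print(dict(loose_counts))
print(semi_tight_counts)
conn.close()
```

Both approaches give the same item counts; with semi-tight coupling only the aggregated result leaves the database, which is where the performance gain comes from on large tables.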
  • 146. Associations: 85 percent of customers who buy a certain brand of wine also buy a certain type of pasta
    Sequential patterns: 32 percent of female customers who order a red jacket buy a gray skirt within six months
    Classifying: frequent customers are those with incomes about $50,000 and having two or more children
    Clustering: market segmentation
    Predicting: predict the revenue value of a new customer based on that customer's personal demographic variables