DBM630: Data Mining and
                         Data Warehousing

                                   MS.IT. Rangsit University
                                                          Semester 2/2011
    by Kritsada Sriphaew (sriphaew.k AT gmail.com)

                                     Lecture 1
                               Introduction to
            Data Mining and Data Warehousing
               Text: Data Mining: Concepts and Techniques, By Jiawei Han
               and Micheline Kamber, Morgan Kaufmann Publishers (2006).

               ISBN: 978-1558609013

Administrative Matters
   • Course Syllabus
   • Lecture Notes, Assignments, and Quizzes
   • Course Communication
       - Announcements, discussion, lecture notes, etc.
       - Page: http://www.facebook.com/pages/Data-mining-MSIT-RSU/


2                               Data Mining and Data Warehousing by Kritsada Sriphaew
How will we be evaluated?
   • Assessment Tasks
             Tasks                                    % Score
             Quizzes (approx. 2 times)                20
             Assignment (Discussion/Demonstration)    20
             Final                                    60

   • To Pass
       - At least 60% of the overall score.

Text Books
   • Mandatory Book
       Data Mining: Concepts and Techniques
       By Jiawei Han and Micheline Kamber
       Morgan Kaufmann Publishers (2006), Second Edition
       ISBN-10: 1558609016, ISBN-13: 978-1558609013

   • Supplementary Book
       Data Mining: Practical Machine Learning Tools and Techniques
       with Java Implementations
       By Ian H. Witten and Eibe Frank
       Morgan Kaufmann Publishers (2005), Second Edition
       ISBN-10: 0120884070, ISBN-13: 978-0120884070

Course Description (What will we learn?)
   • Data warehousing: introduction; characteristics, benefits, and drawbacks of data
     warehousing; architecture of a data warehouse; internal data structures; data
     integration; creating high-quality data; data marts; online analytical
     processing (OLAP).
   • Data mining: introduction; types of data for mining; architecture of a typical
     data mining system; data preprocessing; association rule mining; classification
     and prediction; clustering; mining complex data; data mining applications;
     current trends in data mining; text mining; web mining; and tools for data
     mining analysis such as WEKA and SAS.

Course Schedule (tentative)
Week   Date     Topics
  1     8 JAN   Introduction to Data Mining and Data Warehousing
  2    15 JAN   Data Warehouse and OLAP Technology - I
  3    22 JAN   Data Warehouse and OLAP Technology - II
  4    29 JAN   Data Mining Concepts and Data Preparation
  5     5 FEB   Association Rule Mining
  6    12 FEB   Classification Model: Decision Tree, Classification Rules
  7    19 FEB   Classification Model: Naïve Bayes
  8    26 FEB   Prediction Model: Regression
  9     4 MAR   Clustering
 10    11 MAR   Data Mining Application: Text Mining, Web Mining, Social Network Analysis
 11    18 MAR   Introduction to Data Mining Tool: WEKA
 12    25 MAR   Tutorials
                Final

Prerequisites
   • Basic Database Concepts
   • Basic Statistics:
       - Probability, Sampling, Logic, Linear Regression, ...
   • Algorithms:
       - Basic Data Structures, Dynamic Programming, ...

We will provide some background, but the class will move faster if you
already have some of the basics.

Introduction
   • Motivation: Why mine data?
   • KDD: Knowledge Discovery in Databases
   • What is Data Mining?
   • Data Mining: On What Kind of Data?
   • Data Mining Tasks
   • Data Mining Applications

Evolution of Database Technology
   • 1960s:
       - Data collection, database creation, IMS and network DBMS
   • 1970s:
       - Relational data model, relational DBMS implementation
   • 1980s:
       - RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
         and application-oriented DBMS (spatial, scientific, engineering, etc.)
   • 1990s-2000s:
       - Data mining and data warehousing, multimedia databases, and Web databases

Large Data Sets: A Motivation
   • There is often information "hidden" in the data that is not readily evident.
   • Human analysts may take weeks to discover useful information.
   • Much of the data is never analyzed at all.

   How do you explore millions of records, each with tens or hundreds of
   fields, and find patterns?

KDD Process
(Knowledge Discovery in Databases)
   [Figure: the KDD process as a pipeline]
   Data -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data
        -> (Data Mining) -> Patterns -> (Interpretation/Evaluation) -> Knowledge

   Adapted from: U. Fayyad et al. (1995), "From Knowledge Discovery to Data Mining:
   An Overview," in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al.
   (Eds.), AAAI/MIT Press.

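The stages of the KDD process can be sketched as a chain of small functions. This is only an illustration and not part of the original slides: the toy records, the field names, and the stand-in "mining" step (counting frequent values) are all assumptions chosen to make each stage visible.

```python
from collections import Counter

# Hypothetical raw data: (customer_id, age, city) records, some incomplete.
RAW_DATA = [
    (1, 34, "Bangkok"),
    (2, None, "Bangkok"),
    (3, 29, "Chiang Mai"),
    (4, 41, "Bangkok"),
    (5, 38, None),
]

def select(data):
    """Selection: keep only the fields relevant to the task (target data)."""
    return [(age, city) for _, age, city in data]

def preprocess(target):
    """Preprocessing: drop records that have missing values."""
    return [row for row in target if None not in row]

def mine(clean, min_count=2):
    """Data mining (stand-in): find cities seen at least min_count times."""
    counts = Counter(city for _, city in clean)
    return {city: n for city, n in counts.items() if n >= min_count}

def interpret(patterns):
    """Interpretation/evaluation: turn patterns into readable statements."""
    return [f"{city} appears in {n} clean records"
            for city, n in sorted(patterns.items())]

# Data -> Target Data -> Preprocessed Data -> Patterns -> Knowledge
knowledge = interpret(mine(preprocess(select(RAW_DATA))))
```

Each function corresponds to one arrow in the pipeline; in a real project the mining step would be replaced by an actual algorithm such as a classifier or an association rule miner.
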
Knowledge Discovery




Business Intelligence (BI) vs. Data Mining
   • BI is a term for the processes, techniques, and tools that support
     business decisions using information technology.

   [Figure: pyramid of layers; the potential to support business decisions
   increases toward the top]
       Layer (top to bottom)                                              Role
       Making Decisions                                                   End User
       Data Presentation: Visualization Techniques                        Business Analyst
       Data Mining: Knowledge Discovery                                   Data Analyst
       Data Exploration: Statistical Analysis, Querying and Reporting
       Data Warehouses / Data Marts: OLAP                                 DBA
       Data Sources: Paper, Files, Information Providers,
                     Database Systems, OLTP

Terminology
   • Data Mining
     A step in the knowledge discovery process consisting of
     particular algorithms (methods) that, under some
     acceptable objective, produce a particular enumeration
     of patterns (models) over the data.

   • Knowledge Discovery Process
     The process of using data mining methods (algorithms)
     to extract (identify) what is deemed knowledge according
     to the specifications of measures and thresholds, using a
     database along with any necessary preprocessing or
     transformations.

Other Definitions of Data Mining
   • Non-trivial extraction of implicit, previously unknown,
     and useful information from data
   • Automatic or semi-automatic process for analyzing
     large databases to find patterns that are:
       - valid: hold on new data with some certainty
       - novel: non-obvious to the system
       - useful: should be possible to act on the item
       - understandable: humans should be able to interpret the pattern

Origins of Data Mining
   • Overlaps with various fields, but focuses on:
       - Scalability
       - Algorithms and architecture
       - Automation to handle large data

Data Mining: On What Kind of Data?
   • Relational Databases
   • Data Warehouses
   • Transactional Databases
   • Advanced Database Systems
       - Object-Relational
       - Spatial and Temporal
       - Time-Series
       - Multimedia
       - Text
       - Heterogeneous, Legacy, and Distributed
       - WWW

   [Figure: example data types. Structure (3D anatomy); Function (1D signal);
   Metadata (annotation), illustrated by a "GeneFilter Comparison Report" table
   of raw and normalized intensities per ORF name and gene name]

Data Mining Tasks
   • Classification
   • Clustering
   • Association Rule Mining
   • Sequential Pattern Discovery
   • Regression
   • Anomaly Detection

Ex: Classifying Galaxies

Ex: Market Basket Analysis
   • Where should detergents be placed in the store to maximize their sales?
   • Are window cleaning products purchased when detergents and orange juice
     are bought together?
   • Is soda typically purchased with bananas? Does the brand of soda make a
     difference?
   • How are the demographics of the neighborhood affecting what customers
     are buying?

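Questions like these are commonly answered with association rules, scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below is a minimal illustration under stated assumptions: the transactions and item names are made up, not data from the lecture.

```python
# Hypothetical transactions; each basket is a set of purchased items.
TRANSACTIONS = [
    {"detergent", "orange juice", "window cleaner"},
    {"detergent", "orange juice"},
    {"soda", "bananas"},
    {"detergent", "orange juice", "window cleaner", "soda"},
    {"bananas", "orange juice"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A union C) / support(A)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base

# Rule: {detergent, orange juice} -> {window cleaner}
rule_support = support({"detergent", "orange juice", "window cleaner"}, TRANSACTIONS)
rule_confidence = confidence({"detergent", "orange juice"}, {"window cleaner"}, TRANSACTIONS)
```

With these toy baskets, the rule holds in 2 of 5 transactions (support 0.4), and window cleaner appears in 2 of the 3 baskets that contain both detergent and orange juice (confidence about 0.67).
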
Ex: Anomaly Detection
   • Detect significant deviations from normal behavior
   • Applications:
       - Credit Card Fraud Detection
       - Network Intrusion Detection

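One simple way to flag a "significant deviation" is a z-score rule: mark values that lie more than a chosen number of standard deviations from the mean. The sketch below is an illustration only; the transaction amounts and the threshold of 2 are assumptions, and real fraud or intrusion detectors use far richer models.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
anomalies = find_anomalies(amounts)
```

Note that a single extreme value also inflates the mean and standard deviation it is judged against, which is one reason robust statistics (for example the median) are often preferred in practice.
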
Some Success Stories
   • Network intrusion detection using a combination of sequential
     rule discovery and classification trees on 4 GB of DARPA data
       - Outperformed the (manual) knowledge-engineering approach
       - http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good,
         detailed description of the entire process
   • Major US bank: customer attrition prediction
       - Segment customers based on financial behavior: 3 segments
       - Build attrition models for each of the 3 segments
       - 40-50% of attritions were predicted, a factor-of-18 increase
   • Targeted credit marketing: major US banks
       - Find customer segments based on 13 months of credit balances
       - Build another response model based on surveys
       - Increased response 4 times (2%)

How You'll Benefit
   • Confidently discuss the role and applicability of data
     warehousing and data mining to business and
     organizational problems.
   • Gain background knowledge to explore further in your
     thesis, independent study, or professional projects,
     since data mining methods (for extracting knowledge
     from data) are useful in every field.

Assignment
   • Assignments will aim to test your detailed knowledge
     and understanding of the topics, as well as your
     critical thinking and research ability. Assignments may
     include tasks involving: writing detailed designs;
     reading research papers; learning and using specialist
     software/hardware.
   • Assessment: the assignment will be worth 20% of the
     total course assessment.

PreTest
1. For each task below, fill in the blank with exactly one of the following functions.
          (a) Characterization/Discrimination
          (b) Classification
          (c) Numeric Prediction
          (d) Clustering
          (e) Association Analysis
          (f) Trend Analysis
          Which function matches each of the following tasks?
          ______(1) To estimate the price of stock A next month.
          ______(2) To display the proportion of products sold, by product type.
          ______(3) To find out which products are likely to be sold together.
          ______(4) To group customers into sets of similar customers based on their features.
          ______(5) To find the resulting value of an experiment when a substance is tested.
          ______(6) To predict whether or not a customer is likely to be a good customer.

2. Assume that we want to design a model to forecast tomorrow's SET index. Suggest
   the details of the model that we should construct, and recommend its inputs
   and outputs.
