KDD: A Definition

• KDD is the automatic extraction of non-obvious,
  hidden knowledge from large volumes of data.



• 10^6 to 10^12 bytes: we never see the whole data set, so we
  put it in the memory of computers.
• Then run data mining algorithms.
• What is the knowledge? How to represent and use it?
Why do we need KDD?
Some Data Overload Examples:

• Wal-Mart records 20 million transactions per day
• Health care transactions: multi-gigabyte databases
• Mobil Oil: geological data of over 100 terabytes

[Figure: "Data Overload" at the center, surrounded by Science, Retail,
 Marketing, Healthcare, and Finance]

Data is the most important tool to gain a competitive edge by
providing improved, customized services.
Knowledge Discovery Process

[Figure: Raw Data → (Integration) → Data Warehouse → Target Data →
 Transformed Data → Patterns and Rules → (Interpretation & Evaluation) →
 Knowledge → Understanding]
Knowledge Discovery in Database

• Knowledge discovery in databases (KDD) is the non-trivial
  process of identifying valid, potentially useful and ultimately
  understandable patterns in data


[Figure: Operational Databases → (Clean, Collect, Summarize) →
 Data Warehouse → (Data Preparation) → Training Data → (Data Mining) →
 Model, Patterns → (Verification, Evaluation)]
Knowledge Discovery Process
  Goals

  Data Selection, Acquisition & Integration

  Data Cleaning

  Data Reduction & Projection

  Matching the Goals

  Exploratory Data Analysis

  Data Mining

  Interpretation and Testing

  Consolidation & Use
Knowledge Discovery Process

STEP – 1: IDENTIFYING THE GOAL

• The first step is developing an understanding of the application
  domain and the relevant prior knowledge, and identifying the goal
  of the KDD process from the customer’s viewpoint.
Knowledge Discovery Process

STEP – 2: CREATING A TARGET DATA SET

• Selecting a data set, or focusing on a subset of variables or
  data samples, on which discovery is to be performed.
Knowledge Discovery Process

STEP – 3: DATA CLEANING AND PREPROCESSING

• Basic operations include removing noise if appropriate, collecting
  the necessary information to model or account for noise, deciding
  on strategies for handling missing data fields, and accounting for
  time-sequence information and known changes.
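
The cleaning operations in Step 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ages` field, the median-imputation strategy, and the 0 to 120 plausibility range are hypothetical choices, not part of the original slides.

```python
from statistics import mean, median

def fill_missing(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]

def clip_outliers(values, lower, upper):
    """Limit noisy readings to a plausible range chosen from domain knowledge."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical field with two missing entries and one obviously noisy value.
ages = [34, None, 29, 41, None, 250]
cleaned = clip_outliers(fill_missing(ages), lower=0, upper=120)
print(cleaned)
```

The median is used here because it is less sensitive to the noisy value than the mean; which strategy is appropriate depends on the application domain.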
Knowledge Discovery Process

STEP – 4: DATA REDUCTION AND PROJECTION

• Finding useful features to represent the data depending on the
  goal of the task.
• With dimensionality reduction or transformation methods, the
  effective number of variables under consideration can be reduced,
  or invariant representations for the data can be found.
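
One very simple form of data reduction is to drop variables that carry no information. The sketch below removes near-constant columns by a variance threshold; the feature names and the threshold are assumptions made for illustration, and real projects often use richer methods such as principal component analysis.

```python
from statistics import pvariance

def select_features(rows, names, min_variance=1e-9):
    """Keep only the columns whose population variance exceeds min_variance."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    reduced = [[row[i] for i in keep] for row in rows]
    return reduced, [names[i] for i in keep]

# Hypothetical data: the first column is constant and therefore uninformative.
rows = [
    [1.0, 5.0, 0.2],
    [1.0, 7.0, 0.4],
    [1.0, 6.0, 0.9],
]
names = ["country_code", "basket_size", "discount"]
reduced, kept = select_features(rows, names)
print(kept)
```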
Knowledge Discovery Process

STEP – 5: MATCHING THE GOALS

• Matching the goals of the KDD process to a particular data-mining
  method such as summarization, classification, regression,
  clustering, etc.
Knowledge Discovery Process

STEP – 6: EXPLORATORY ANALYSIS AND MODEL & HYPOTHESIS SELECTION

• Choosing the data mining algorithms and selecting methods to be
  used for searching for data patterns.
• This process includes deciding which models and parameters might
  be appropriate and matching a particular data-mining method with
  the overall criteria of the KDD process.
Knowledge Discovery Process

STEP – 7: DATA MINING

• Searching for patterns of interest in a particular representational
  form or a set of such representations, including classification
  rules or trees, regression, and clustering.
• The user can significantly aid the data-mining method by correctly
  performing the preceding steps.
Knowledge Discovery Process

STEP – 8: INTERPRETATION & TESTING

• Interpreting mined patterns, possibly returning to any of steps 1
  through 7 for further iteration.
• This step can also involve visualization of the extracted patterns
  and models or visualization of the data given the extracted models.
Knowledge Discovery Process

STEP – 9: KNOWLEDGE PRESENTATION

• Using the knowledge directly, incorporating the knowledge into
  another system for further action, or simply documenting it and
  reporting it to interested parties.
• This process also includes checking for and resolving potential
  conflicts with previously believed (or extracted) knowledge.
Data Warehousing

• A platform for online analytical processing (OLAP)
• Warehouses collect transactional data from several
  transactional databases and organize them in a fashion
  amenable to analysis
• Smaller, subject-specific warehouses are called “data marts”
• A critical component of the decision support system (DSS) of
  enterprises
• Some typical DW queries:
   – Which item sells best in each region that has retail outlets?
   – Which advertising strategy is best for Dubai Markets?
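
The first query above ("which item sells best in each region") is an aggregation over sales facts. A minimal sketch in plain Python follows; the `(region, item, units)` record layout and the sample figures are hypothetical, and a real warehouse would answer this with an OLAP or SQL query instead.

```python
from collections import defaultdict

def best_seller_per_region(sales):
    """Return the item with the highest total units sold in each region."""
    totals = defaultdict(lambda: defaultdict(int))
    for region, item, units in sales:
        totals[region][item] += units
    return {region: max(items, key=items.get) for region, items in totals.items()}

# Hypothetical sales facts: (region, item, units sold).
sales = [
    ("North", "milk", 120), ("North", "diapers", 90), ("North", "milk", 30),
    ("South", "diapers", 200), ("South", "milk", 150),
]
best = best_seller_per_region(sales)
print(best)
```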
Data Warehousing

[Figure: OLTP databases (such as Inventory) → Data Cleaning →
 Data Warehouse (OLAP)]
Data Cleaning
• Performs logical transformation of transactional data to suit the data
  warehouse
• Model of operations → model of enterprise
• Usually a semi-automatic process

Example (from the figure):
  Operational tables:
    Orders (Order_id, Price, Cust_id)
    Inventory (Prod_id, Price, Price_change)
  Derived table:
    Sales (Cust_id, Cust_profit, Total_sales)
  Data warehouse contents:
    Customers, Products, Orders, Inventory, Price, Time
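
As a rough illustration of this kind of logical transformation, the sketch below rolls order-level rows up into per-customer sales rows. The column names follow the figure, but the sample rows are made up, and `Cust_profit` is left out because computing it would need cost data that the slide does not show.

```python
from collections import defaultdict

def orders_to_sales(orders):
    """Aggregate operational Orders rows into one Sales row per customer."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["Cust_id"]] += order["Price"]
    return [{"Cust_id": c, "Total_sales": t} for c, t in sorted(totals.items())]

# Hypothetical operational data.
orders = [
    {"Order_id": 1, "Price": 20.0, "Cust_id": "C1"},
    {"Order_id": 2, "Price": 5.5, "Cust_id": "C2"},
    {"Order_id": 3, "Price": 4.5, "Cust_id": "C1"},
]
sales_table = orders_to_sales(orders)
print(sales_table)
```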
Primary Tasks of Data Mining
• Classification: finding the description of several predefined
  classes and classifying a data item into one of them.
• Clustering: identifying a finite set of categories or clusters to
  describe the data.
• Regression: maps a data item to a real-valued prediction variable.
• Dependency Modeling: finding a model which describes significant
  dependencies between variables.
• Deviation and change detection: discovering the most significant
  changes in the data.
• Summarization: finding a compact description for a subset of data.
Data Mining Algorithm Components
• Model representation
   – descriptions of discovered patterns
   – overly limited representation: unable to capture data patterns
   – too powerful representation: potential for overfitting
   – examples: decision trees, rules, linear/non-linear regression &
     classification, nearest neighbor and case-based reasoning methods,
     graphical dependency models


• Model evaluation criteria
   – how well a pattern (model) meets goals (fit function)
   – e.g., accuracy, novelty, etc.
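
Accuracy is probably the simplest fit function to write down: the fraction of examples on which a model's prediction matches the known label. In the sketch below the `always_yes` model and the four examples are hypothetical stand-ins used only to exercise the function.

```python
def accuracy(model, examples):
    """Fraction of (features, label) pairs that the model predicts correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)

def always_yes(features):
    """A deliberately trivial baseline model."""
    return "Yes"

# Hypothetical labelled examples.
examples = [
    ({"Outlook": "Sunny"}, "Yes"),
    ({"Outlook": "Cloudy"}, "Yes"),
    ({"Outlook": "Overcast"}, "No"),
    ({"Outlook": "Sunny"}, "Yes"),
]
score = accuracy(always_yes, examples)
print(score)
```

Comparing a candidate model against such a trivial baseline is a quick check that the model has learned something beyond the majority class.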
Data Mining Algorithm Components
• Search method
   – parameter search: optimization of parameters for a given model
     representation
   – model search: considers a family of models


• Different methods suit different problems. Proper problem
  formulation is crucial.
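
Parameter search can be illustrated with a one-parameter model. The sketch below assumes a simple threshold classifier (predict 1 when x is at least t) and tries a handful of candidate thresholds, keeping the one with the highest accuracy; the data and the candidate list are invented for the example.

```python
def threshold_model(t):
    """Model representation: predict 1 when x >= t, otherwise 0."""
    return lambda x: 1 if x >= t else 0

def accuracy(model, data):
    """Evaluation criterion: fraction of correctly predicted labels."""
    return sum(model(x) == y for x, y in data) / len(data)

def parameter_search(data, candidates):
    """Search method: pick the candidate threshold with the best accuracy."""
    return max(candidates, key=lambda t: accuracy(threshold_model(t), data))

# Hypothetical (value, label) pairs.
data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (2.5, 0)]
best_t = parameter_search(data, candidates=[1.5, 2.25, 2.75, 3.5])
print(best_t, accuracy(threshold_model(best_t), data))
```

Model search would wrap a further loop around this one, repeating the parameter search for each family of models under consideration.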
Data Mining Techniques

• Descriptive
   – Clustering
   – Association
   – Sequential Analysis
• Predictive
   – Classification: Decision Tree, Rule Induction, Neural Networks,
     Nearest Neighbor Classification
   – Regression
Association Rule: Application

• Supermarket Shelf Management
• Goal: to identify items which are bought together (by sufficiently many
  customers)
• Approach: process point-of-sale data (collected with barcode scanners)
  to find dependencies among items.
• Consider discovered rule:
   {Diapers, Milk … } --> {Baby food}
• Example:
   – If a customer buys diapers and milk, then they are very likely
     to buy baby food.
   – So stack baby food next to diapers?
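
Whether a rule is "bought together by sufficiently many customers" is usually judged by its support and confidence. The sketch below computes both for the rule {Diapers, Milk} --> {Baby food}; the five transactions are made-up sample data, not figures from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical point-of-sale baskets.
transactions = [
    {"Diapers", "Milk", "Baby food"},
    {"Diapers", "Milk", "Baby food", "Bread"},
    {"Diapers", "Milk"},
    {"Milk", "Bread"},
    {"Diapers", "Baby food"},
]
rule_support = support(transactions, {"Diapers", "Milk", "Baby food"})
rule_confidence = confidence(transactions, {"Diapers", "Milk"}, {"Baby food"})
print(rule_support, rule_confidence)
```

In this sample, two of five baskets contain all three items, and two of the three baskets with diapers and milk also contain baby food.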
Sequential Pattern Discovery: Application

• Sequences in which customers purchase goods/services
• Understanding long term customer behavior -- timely
  promotions.

• In point-of-sale transaction sequences
   – Computer bookstore:
  (Intro to Visual C++) (Java & J2EE) --> (Perl for Dummies, PHP in 24 Hrs)

   – Athletic Apparel Store:
  (Shoes) (Racket, Racket ball) --> (Sports Jacket)
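
A sequential pattern is supported by a customer when its itemsets appear, in order, somewhere in that customer's purchase history. The sketch below counts such support for the athletic apparel example; the four customer histories are hypothetical, and practical miners use algorithms that avoid testing every candidate pattern this directly.

```python
def contains(sequence, pattern):
    """True if the itemsets of pattern occur, in order, within sequence."""
    position = 0
    for itemset in pattern:
        # Advance to the next purchase that includes this whole itemset.
        while position < len(sequence) and not set(itemset) <= set(sequence[position]):
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True

def sequence_support(sequences, pattern):
    """Fraction of customer sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

# Hypothetical purchase histories, one list of baskets per customer.
customers = [
    [{"Shoes"}, {"Racket", "Racket ball"}, {"Sports Jacket"}],
    [{"Shoes"}, {"Socks"}, {"Racket", "Racket ball"}, {"Sports Jacket"}],
    [{"Racket", "Racket ball"}, {"Shoes"}],
    [{"Shoes"}, {"Sports Jacket"}],
]
pattern = [{"Shoes"}, {"Racket", "Racket ball"}, {"Sports Jacket"}]
print(sequence_support(customers, pattern))
```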
Clustering (Hierarchical and K-Means): Application

     Hierarchical clustering: Clusters are formed at different levels by
                              merging clusters at a lower level

[Figure: K-means example with K=2, shown as a series of scatter plots
 on 0 to 10 axes]
  1. Arbitrarily choose K objects as the initial cluster centers.
  2. Assign each of the objects to the most similar center.
  3. Update the cluster means.
  4. Reassign the objects and update the cluster means again.
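
The K-means loop pictured in the figure (choose K centers, assign each object to the most similar center, update the cluster means, reassign) can be sketched as follows. Note that K-means is a partitioning method, not a hierarchical one. The six 2-D points are invented, and taking the first two points as initial centers is an arbitrary assumption that keeps the run deterministic.

```python
def kmeans(points, centers, max_iterations=100):
    """Basic K-means on 2-D points; centers holds the initial cluster centers."""
    clusters = []
    for _ in range(max_iterations):
        # Assign each object to the most similar (nearest) center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update the cluster means; an empty cluster keeps its old center.
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments are stable, so stop
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data with two visible groups; K=2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, centers=[points[0], points[1]])
print(clusters)
```

The result depends on the initial centers, which is why implementations commonly restart from several random initializations.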
Decision Tree Identification: Application


  Decision Tree Identification Example

   Outlook    Temp      Play?
   Sunny      Warm      Yes
   Overcast   Chilly    No
   Sunny      Chilly    Yes
   Cloudy     Pleasant  Yes
   Overcast   Pleasant  Yes
   Overcast   Chilly    No
   Cloudy     Chilly    No
   Cloudy     Warm      Yes

   Splitting on Outlook:
     Sunny     Yes
     Cloudy    Yes/No
     Overcast  Yes/No
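
The first split on Outlook can be checked by grouping the "Play?" labels per Outlook value and measuring how mixed each group is. The sketch below uses entropy as the impurity measure, which is an assumption here since the slides do not name the splitting criterion.

```python
from collections import Counter, defaultdict
from math import log2

# The eight rows of the table: (Outlook, Temp, Play?).
rows = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]

def entropy(labels):
    """Shannon entropy in bits; 0 means every label in the group is the same."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

def split_by(rows, column):
    """Group the class labels (last field) by the value in the given column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    return dict(groups)

by_outlook = split_by(rows, column=0)
for value, labels in by_outlook.items():
    print(value, sorted(set(labels)), round(entropy(labels), 3))
```

The Sunny group is pure (all Yes), while Cloudy and Overcast still mix Yes and No and therefore need a further split on Temp.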
Decision Tree Identification: Application




   Outlook (Yes/No)
     Sunny: Yes
     Cloudy (Yes/No), split on Temp
       Warm: Yes
       Chilly: No
       Pleasant: Yes
     Overcast (Yes/No), split on Temp
       Chilly: No
       Pleasant: Yes
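
The resulting tree can be written down as a nested structure and applied to new examples. The dictionary layout below is one hypothetical encoding chosen for illustration; the branches themselves follow the tree in the figure.

```python
# Internal nodes name the attribute to test; leaves are the predicted class.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": "Yes",
        "Cloudy": {
            "attribute": "Temp",
            "branches": {"Warm": "Yes", "Chilly": "No", "Pleasant": "Yes"},
        },
        "Overcast": {
            "attribute": "Temp",
            "branches": {"Chilly": "No", "Pleasant": "Yes"},
        },
    },
}

def classify(node, example):
    """Follow the branches that match the example until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node

print(classify(tree, {"Outlook": "Cloudy", "Temp": "Chilly"}))
```

A combination that never occurred in the training table, such as Overcast with Warm, has no branch and raises a `KeyError` in this sketch; a practical classifier would need a default for such cases.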
Major Application Areas for Data
Mining (Classification)
•   Advertising
•   Bioinformatics
•   Customer Relationship Management (CRM)
•   Database Marketing
•   Fraud Detection
•   E-commerce
•   Health Care
•   Investment/Securities
•   Manufacturing, Process Control
•   Sports and Entertainment
•   Telecommunications
•   Web
Major Application Areas for Data
Mining: Marketing
• Direct Marketing:
  Most major direct marketing companies are using
  modeling and data mining.
• Customer segmentation:
  All industries can take advantage of DM to discover
  discrete segments in their customer bases by considering
  additional variables beyond traditional analysis.
• CRM:
  Find other people in similar life stages and determine
  which customers are following similar behavior patterns
   – Up-sell
   – Cross-sell
   – Keeping the customers for a longer period of time
  For example, Verizon Wireless reduced its churn rate from 2% to 1.5%.
Major Application Areas for Data
Mining: Fraud Detection

• Credit Card Fraud Detection
• Money laundering
   – FAIS (US Treasury)
• Securities Fraud
   – NASDAQ Sonar system
• Phone fraud
   – AT&T, Bell Atlantic, British Telecom/MCI
• Bio-terrorism detection at Salt Lake
  Olympics 2002
Major Application Areas for Data
Mining: Retail
• Sales forecasting:
    Examining time-based patterns helps retailers make
    stocking decisions.

• Database Retailing:
    Retailers can develop profiles of customers with
    certain behaviors, for example, those who purchase
    designer-label clothing or those who attend sales.

•   Merchandise planning and allocation:
    When retailers add new stores, they can improve
    merchandise planning and allocation by examining
    patterns in stores with similar demographic
    characteristics.
Major Application Areas for Data
Mining: Banking

• Credit Card marketing
  By identifying customer segments, card
  issuers and acquirers can improve
  profitability with more effective acquisition
  and retention programs.


• Cardholder pricing and profitability
  Card issuers can take advantage of data
  mining technology to price their products so
  as to maximize profit and minimize loss of
  customers.
Major Application Areas for Data
  Mining: Telecommunication
• Call detail record analysis:
  Telecommunication companies accumulate
  detailed call records. By identifying customer
  segments with similar use patterns, the
  companies can develop attractive pricing and
  feature promotions.

• Customer loyalty:
  Some customers repeatedly switch providers, or
  “churn”, to take advantage of attractive incentives
  by competing companies. The companies can use
  DM to identify the characteristics of customers
  who are likely to remain loyal once they switch,
  thus enabling the companies to target their
  spending on customers who will produce the most
  profit.
Major Application Areas for Data
Mining: Manufacturing

• Manufacturing:
  Through choice boards, manufacturers are
  beginning to customize products for
  customers; therefore they must be able to
  predict which features should be bundled to
  meet customer demand.

• Warranties:
  Manufacturers need to predict the number of
  customers who will submit warranty claims
  and the average cost of those claims.
Issues and Challenges
• Large data
   – Number of variables (features), number of cases (examples)
   – Multi gigabyte, terabyte databases
   – Efficient algorithms, parallel processing
• High dimensionality
   – Large number of features: exponential increase in search space
   – Potential for spurious patterns
   – Dimensionality reduction
• Overfitting
   – Models noise in training data, rather than just the general patterns
• Changing data, missing and noisy data
• Use of domain knowledge
   – Utilizing knowledge on complex data relationships, known facts
• Understandability of patterns
Success Stories

• Network intrusion detection using a combination of sequential
  rule discovery and classification tree on 4 GB DARPA data
   – Won over (manual) knowledge engineering approach
   – http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides
      good detailed description of the entire process
• Major US bank: customer attrition prediction
   – First segment customers based on financial behavior: found 3
      segments
   – Build attrition models for each of the 3 segments
   – 40-50% of attritions were predicted == factor of 18 increase
• Targeted credit marketing: major US banks
   – Find customer segments based on 13 months credit balances
   – Build another response model based on surveys
   – Increased response 4 times -- 2%
Amitava Manna
(11DCP007)
Amritanshu Mehra
(11DCP008)
Animesh Ranjan
(11DCP009)

 
Data mining
Data miningData mining
Data mining
 
Introduction to dm and dw
Introduction to dm and dwIntroduction to dm and dw
Introduction to dm and dw
 
Zakipoint Introduction
Zakipoint IntroductionZakipoint Introduction
Zakipoint Introduction
 
15 19
15 1915 19
15 19
 
KDD assignmnt data.docx
KDD assignmnt data.docxKDD assignmnt data.docx
KDD assignmnt data.docx
 
Data mining in the field of library
Data mining in the field of libraryData mining in the field of library
Data mining in the field of library
 
knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)
 
Data mining nouman javed
Data mining   nouman javedData mining   nouman javed
Data mining nouman javed
 
Seminar Report Vaibhav
Seminar Report VaibhavSeminar Report Vaibhav
Seminar Report Vaibhav
 

Recently uploaded

Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 

Recently uploaded (20)

Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 

Knowledge Discovery and Data Mining

  • 1.
  • 2. KDD: A Definition
      • KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
      • 10^6 to 10^12 bytes: we never see the whole data set, so we put it in the memory of computers.
      • Then run data mining algorithms.
      • What is the knowledge? How do we represent and use it?
  • 3. Why do we need KDD?
      • Some data overload examples:
          – Wal-Mart records 20 million transactions per day
          – Health care transactions: multi-gigabyte databases
          – Mobil Oil: geological data of over 100 terabytes
      • [Diagram: "Data Overload" at the center, surrounded by Science, Retail, Marketing, Healthcare, Finance]
      • Data is the most important tool for gaining a competitive edge by providing improved, customized services.
  • 4. Knowledge Discovery Process
      • [Diagram labels: Raw Data, Data Warehouse, Integration, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]
  • 5. Knowledge Discovery in Databases
      • Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, potentially useful, and ultimately understandable patterns in data.
      • [Diagram labels: Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation]
  • 6. Knowledge Discovery Process
      • Goals
      • Data Selection, Acquisition & Integration
      • Data Cleaning
      • Data Reduction & Projection
      • Matching the Goals
      • Exploratory Data Analysis
      • Data Mining
      • Interpretation and Testing
      • Consolidation & Use
  • 7. Knowledge Discovery Process, STEP – 1: IDENTIFYING THE GOAL
      • The first step is developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process from the customer's viewpoint.
  • 8. Knowledge Discovery Process, STEP – 2: CREATING A TARGET DATA SET
      • Selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
  • 9. Knowledge Discovery Process, STEP – 3: DATA CLEANING AND PREPROCESSING
      • Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.
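One possible cleaning strategy for this step can be sketched in Python. This is a minimal illustration, not the method used in the slides: the `clean_column` helper, the `ages` data, and the choice of median imputation plus clipping to a plausible range are all assumptions made for the example.

```python
from statistics import median

def clean_column(values, lower, upper):
    """Fill missing entries (None) with the median of the observed values,
    then clip every value into the plausible range [lower, upper]."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    cleaned = []
    for v in values:
        v = fill if v is None else v            # strategy for missing fields
        cleaned.append(min(max(v, lower), upper))  # crude noise handling
    return cleaned

# Hypothetical column: two missing values and one data-entry error (410).
ages = [34, None, 29, 410, 41, None, 38]
cleaned_ages = clean_column(ages, lower=0, upper=120)
print(cleaned_ages)
```

The median is used instead of the mean because a single extreme value such as 410 would otherwise distort the fill value.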
  • 10. Knowledge Discovery Process, STEP – 4: DATA REDUCTION AND PROJECTION
      • Finding useful features to represent the data depending on the goal of the task.
      • With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.
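As a rough illustration of reducing the number of variables, the sketch below drops features whose variance is (near) zero, since a constant column cannot help distinguish records. The feature names, the sample rows, and the threshold are hypothetical; real projects often use richer methods such as principal component analysis.

```python
from statistics import pvariance

def select_features(rows, names, min_variance):
    """Keep only the columns whose population variance is at least min_variance."""
    columns = list(zip(*rows))  # transpose rows into columns
    keep = [i for i, col in enumerate(columns) if pvariance(col) >= min_variance]
    reduced = [[row[i] for i in keep] for row in rows]
    return [names[i] for i in keep], reduced

# Hypothetical customer records; "country_code" is constant and carries no signal.
names = ["age", "country_code", "monthly_spend"]
rows = [
    [25, 1, 120.0],
    [40, 1, 80.0],
    [31, 1, 200.0],
    [58, 1, 95.0],
]
kept_names, reduced_rows = select_features(rows, names, min_variance=1e-9)
print(kept_names)
```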
  • 11. Knowledge Discovery Process, STEP – 5: MATCHING THE GOALS
      • Matching the goals of the KDD process to a particular data-mining method such as summarization, classification, regression, clustering, etc.
  • 12. Knowledge Discovery Process, STEP – 6: EXPLORATORY ANALYSIS AND MODEL & HYPOTHESIS SELECTION
      • Choosing the data mining algorithms and selecting methods to be used for searching for data patterns.
      • This process includes deciding which models and parameters might be appropriate and matching a particular data-mining method with the overall criteria of the KDD process.
  • 13. Knowledge Discovery Process, STEP – 7: DATA MINING
      • Searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering.
      • The user can significantly aid the data-mining method by correctly performing the preceding steps.
  • 14. Knowledge Discovery Process, STEP – 8: INTERPRETATION & TESTING
      • Interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration.
      • This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.
  • 15. Knowledge Discovery Process, STEP – 9: KNOWLEDGE PRESENTATION
      • Using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties.
      • This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
  • 16. Data Warehousing
      • A platform for online analytical processing (OLAP)
      • Warehouses collect transactional data from several transactional databases and organize them in a fashion amenable to analysis
      • Sometimes loosely called "data marts" (a data mart is usually a smaller, subject-specific warehouse)
      • A critical component of the decision support system (DSS) of enterprises
      • Some typical DW queries:
          – Which item sells best in each region that has retail outlets?
          – Which advertising strategy is best for Dubai markets?
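The first typical query above ("which item sells best in each region") amounts to a group-by aggregation. The sketch below shows the idea in plain Python; the `sales` records, region names, and item names are invented for illustration, and a real warehouse would answer this with an OLAP or SQL query instead.

```python
from collections import defaultdict

def best_seller_by_region(sales):
    """Sum units per (region, item) and return the top item for each region."""
    totals = defaultdict(lambda: defaultdict(int))
    for region, item, units in sales:
        totals[region][item] += units
    return {region: max(items, key=items.get) for region, items in totals.items()}

# Hypothetical fact rows: (region, item, units sold).
sales = [
    ("Dubai", "tea", 30),
    ("Dubai", "coffee", 45),
    ("Dubai", "tea", 20),
    ("Riyadh", "coffee", 10),
    ("Riyadh", "dates", 25),
]
best = best_seller_by_region(sales)
print(best)
```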
  • 17. Data Warehousing
      • [Diagram labels: OLTP, Inventory, Data Cleaning, Data Warehouse (OLAP)]
  • 18. Data Cleaning
      • Performs logical transformation of transactional data to suit the data warehouse
      • Model of operations → model of enterprise
      • Usually a semi-automatic process
      • [Diagram labels: tables Orders, Customers, Products, Inventory, Sales, Time; fields Order_id, Cust_id, Prod_id, Price, Cust_profit, Price_change, Total_sales; Data Warehouse]
  • 19. Primary Tasks of Data Mining
      • Classification: finding the description of several predefined classes and classifying a data item into one of them.
      • Clustering: identifying a finite set of categories or clusters to describe the data.
      • Regression: maps a data item to a real-valued prediction variable.
      • Dependency Modeling: finding a model which describes significant dependencies between variables.
      • Deviation and change detection: discovering the most significant changes in the data.
      • Summarization: finding a compact description for a subset of data.
  • 20. Data Mining Algorithm Components
      • Model representation
          – descriptions of discovered patterns
          – overly limited representation: unable to capture data patterns
          – too powerful: potential for overfitting
          – examples: decision trees, rules, linear/non-linear regression & classification, nearest neighbor and case-based reasoning methods, graphical dependency models
      • Model evaluation criteria
          – how well a pattern (model) meets goals (fit function)
          – e.g., accuracy, novelty, etc.
  • 21. Data Mining Algorithm Components
      • Search method
          – parameter search: optimization of parameters for a given model representation
          – model search: considers a family of models
      • Different methods suit different problems. Proper problem formulation is crucial.
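To make the three components concrete, here is a small sketch under stated assumptions: the model representation is a single threshold on one feature, the evaluation criterion is accuracy, and the parameter search is an exhaustive scan over candidate thresholds. The data and function names are hypothetical.

```python
def accuracy(threshold, xs, labels):
    """Evaluation criterion: fraction of items the threshold rule labels correctly."""
    predictions = [x >= threshold for x in xs]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def parameter_search(xs, labels):
    """Search method: try every observed value as the threshold, keep the best."""
    candidates = sorted(set(xs))
    return max(candidates, key=lambda t: accuracy(t, xs, labels))

# Hypothetical one-feature training set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [False, False, False, True, True, True]

best_threshold = parameter_search(xs, labels)
print(best_threshold, accuracy(best_threshold, xs, labels))
```

A model search, by contrast, would repeat this loop over several representations (for example, thresholds on different features) and compare their best scores.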
  • 22. Data Mining Techniques
      • Descriptive: Clustering, Association, Sequential Analysis
      • Predictive: Classification, Decision Tree, Rule Induction, Neural Networks, Nearest Neighbor Classification, Regression
  • 23. Association Rule: Application
      • Supermarket shelf management
      • Goal: to identify items which are bought together (by sufficiently many customers)
      • Approach: process point-of-sale data (collected with barcode scanners) to find dependencies among items.
      • Consider the discovered rule: {Diapers, Milk, …} --> {Baby food}
      • Example:
          – If a customer buys diapers and milk, then they are very likely to buy baby food.
          – So stack baby food next to diapers?
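How strongly a rule such as {Diapers, Milk} --> {Baby food} holds is commonly measured by its support and confidence. The sketch below computes both on a handful of invented transactions; the baskets are hypothetical and only serve to show the arithmetic.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Hypothetical point-of-sale baskets.
transactions = [
    {"diapers", "milk", "baby food"},
    {"diapers", "milk", "baby food", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"diapers", "baby food"},
]

rule_support = support({"diapers", "milk", "baby food"}, transactions)
rule_confidence = confidence({"diapers", "milk"}, {"baby food"}, transactions)
print(rule_support, rule_confidence)
```

Here the rule is supported by 2 of 5 baskets, and 2 of the 3 baskets containing diapers and milk also contain baby food.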
  • 24. Sequential Pattern Discovery: Application
      • Sequences in which customers purchase goods/services
      • Understanding long-term customer behavior enables timely promotions.
      • In point-of-sale transaction sequences:
          – Computer bookstore: (Intro to Visual C++) (Java & J2EE) --> (Perl for Dummies, PHP in 24 Hrs)
          – Athletic apparel store: (Shoes) (Racket, Racquetball) --> (Sports Jacket)
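A sequential pattern is counted for a customer when its item sets appear in that customer's purchase history in the same order, though not necessarily in consecutive visits. The following sketch checks this and computes the pattern's support; the customer histories are invented, and the function names are illustrative only.

```python
def contains_sequence(history, pattern):
    """True if the item sets in pattern occur in history in order
    (each pattern step must be a subset of some later basket)."""
    position = 0
    for basket in history:
        if position < len(pattern) and pattern[position] <= basket:
            position += 1
    return position == len(pattern)

def sequence_support(histories, pattern):
    """Fraction of customers whose history contains the pattern."""
    return sum(contains_sequence(h, pattern) for h in histories) / len(histories)

# Pattern from the athletic apparel example: shoes, then racket and racquetball, then jacket.
pattern = [{"shoes"}, {"racket", "racquetball"}, {"sports jacket"}]

# Hypothetical purchase histories, one list of baskets per customer.
histories = [
    [{"shoes"}, {"racket", "racquetball"}, {"sports jacket"}],
    [{"shoes", "socks"}, {"water bottle"}, {"racket", "racquetball"}, {"sports jacket", "cap"}],
    [{"sports jacket"}, {"shoes"}],
    [{"shoes"}, {"racket"}],
]
pattern_support = sequence_support(histories, pattern)
print(pattern_support)
```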
  • 25. Hierarchical Clustering (K-Means): Application
      • Hierarchical clustering: clusters are formed at different levels by merging clusters at a lower level.
      • K-means, shown in the figure, is a partitioning method: it fixes K in advance rather than merging clusters level by level.
      • [Figure: K-means example with K=2 on points plotted over axes from 0 to 10]
          – Arbitrarily choose K objects as the initial cluster centers
          – Assign each of the objects to the most similar center
          – Update the cluster means
          – Reassign the objects and update the cluster means again
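The K-means loop described in the figure (choose centers, assign, update, reassign) can be sketched in a few lines of Python. This is a minimal version under simplifying assumptions: the "arbitrary" initial centers are simply the first K points, distance is squared Euclidean, and the sample points are made up.

```python
def kmeans(points, k, iterations=100):
    """Plain K-means: returns the final centers and the clusters of points."""
    centers = list(points[:k])  # arbitrary choice: the first k objects
    clusters = []
    for _ in range(iterations):
        # Assign each object to the most similar (nearest) center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Update the cluster means; an empty cluster keeps its old center.
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments are stable, so stop
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, k=2)
print(centers)
```

Different initial centers can lead to different final clusters, which is why practical implementations usually run several random restarts.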
  • 26. Decision Tree Identification: Application
      • Decision tree identification example:

          Outlook    Temp       Play?
          Sunny      Warm       Yes
          Overcast   Chilly     No
          Sunny      Chilly     Yes
          Cloudy     Pleasant   Yes
          Overcast   Pleasant   Yes
          Overcast   Chilly     No
          Cloudy     Chilly     No
          Cloudy     Warm       Yes

      • [Partial tree after splitting on Outlook: Sunny: Yes; Cloudy: Yes/No; Overcast: Yes/No]
  • 27. Decision Tree Identification: Application
      • [Tree diagram, root Yes/No split on Outlook]
          – Sunny: Yes
          – Overcast: Yes/No, split on Temp
              Chilly: No
              Pleasant: Yes
          – Cloudy: Yes/No, split on Temp
              Chilly: No
              Pleasant: Yes
              Warm: Yes
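Once a tree like the one above has been identified, classifying a record is just a walk from the root to a leaf. The sketch below encodes the tree as nested dictionaries and checks it against the training table. Both the table rows and the tree shape are read off the flattened slide text, so treat them as a best-effort reconstruction; the `classify` helper is hypothetical.

```python
# Tree as read from the slide: split on Outlook first, then on Temp.
tree = {
    "Sunny": "Yes",
    "Overcast": {"Chilly": "No", "Pleasant": "Yes"},
    "Cloudy": {"Chilly": "No", "Pleasant": "Yes", "Warm": "Yes"},
}

def classify(outlook, temp):
    """Walk the tree; returns None when the tree has no branch for the input."""
    branch = tree.get(outlook)
    if isinstance(branch, dict):
        return branch.get(temp)
    return branch

# Training table (Outlook, Temp, Play?) as reconstructed from the slide.
rows = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]

predictions = [classify(outlook, temp) for outlook, temp, _ in rows]
training_accuracy = sum(p == row[2] for p, row in zip(predictions, rows)) / len(rows)
print(training_accuracy)
```

Note that the tree has no leaf for an overcast, warm day because no such row appears in the table, so `classify("Overcast", "Warm")` returns `None`.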
  • 28. Major Application Areas for Data Mining (Classification)
      • Advertising
      • Bioinformatics
      • Customer Relationship Management (CRM)
      • Database Marketing
      • Fraud Detection
      • E-commerce
      • Health Care
      • Investment/Securities
      • Manufacturing, Process Control
      • Sports and Entertainment
      • Telecommunications
      • Web
  • 29. Major Application Areas for Data Mining: Marketing
      • Direct marketing: Most major direct marketing companies are using modeling and data mining.
      • Customer segmentation: All industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.
      • CRM: Find other people in similar life stages and determine which customers are following similar behavior patterns.
          – Up-sell
          – Cross-sell
          – Keeping the customers for a longer period of time
          – For example, Verizon Wireless reduced its churn rate from 2% to 1.5%
  • 30. Major Application Areas for Data Mining: Fraud Detection
      • Credit card fraud detection
      • Money laundering: FAIS (US Treasury)
      • Securities fraud: NASDAQ Sonar system
      • Phone fraud: AT&T, Bell Atlantic, British Telecom/MCI
      • Bio-terrorism detection at the Salt Lake Olympics 2002
  • 31. Major Application Areas for Data Mining: Retail
      • Sales forecasting: Examining time-based patterns helps retailers make stocking decisions.
      • Database retailing: Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales.
      • Merchandise planning and allocation: When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics.
  • 32. Major Application Areas for Data Mining: Banking
      • Credit card marketing: By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs.
      • Cardholder pricing and profitability: Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers.
  • 33. Major Application Areas for Data Mining: Telecommunication
      • Call detail record analysis: Telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
      • Customer loyalty: Some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives from competing companies. The companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit.
  • 34. Major Application Areas for Data Mining: Manufacturing
      • Manufacturing: Through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.
      • Warranties: Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
  • 35. Issues and Challenges
      • Large data
          – Number of variables (features), number of cases (examples)
          – Multi-gigabyte, terabyte databases
          – Efficient algorithms, parallel processing
      • High dimensionality
          – Large number of features: exponential increase in search space
          – Potential for spurious patterns
          – Dimensionality reduction
      • Overfitting
          – Models noise in training data, rather than just the general patterns
      • Changing data, missing and noisy data
      • Use of domain knowledge
          – Utilizing knowledge on complex data relationships, known facts
      • Understandability of patterns
  • 36. Success Stories
      • Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data
          – Won over the (manual) knowledge engineering approach
          – http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good detailed description of the entire process
      • Major US bank: customer attrition prediction
          – First segment customers based on financial behavior: found 3 segments
          – Build attrition models for each of the 3 segments
          – 40-50% of attritions were predicted, a factor of 18 increase
      • Targeted credit marketing: major US banks
          – Find customer segments based on 13 months of credit balances
          – Build another response model based on surveys
          – Increased response 4 times (2%)