Presented and Contributed by:
                         Ahmet Selman Bozkır
                    Hacettepe University Ph.D. Student



November 29, 2011                                        1
   What is data mining?
   Motivation: Why data mining?
   Classification of data mining systems
   Architecture: Typical Data Mining System
   Data mining functionality

   Data mining (knowledge discovery from data)
     Extraction of interesting (non-trivial, implicit, previously unknown
        and potentially useful) patterns or knowledge from huge amounts of
        data
     Data mining: a misnomer?
   Alternative names
     Knowledge discovery (mining) in databases (KDD), knowledge
        extraction, data/pattern analysis, data archeology, data
        dredging, information harvesting, business intelligence, etc.
   Watch out: not everything is “data mining”
     (Deductive) query processing is not data mining
     Neither are expert systems or small ML/statistical programs


   Data explosion problem
     Automated data collection tools and mature database technology lead
        to tremendous amounts of data accumulated and/or to be analyzed in
        databases, data warehouses, and other information repositories
   We are drowning in data, but starving for knowledge!
   Solution: Data warehousing and data mining
     Data warehousing and on-line analytical processing
     Mining interesting knowledge (rules, regularities, patterns, constraints)
        from data in large databases


   Data analysis and decision support
     Market analysis and management
        ▪ Target marketing, customer relationship management
          (CRM), market basket analysis, cross selling, market segmentation
     Risk analysis and management
        ▪ Forecasting, customer retention, improved underwriting, quality
          control, competitive analysis
     Fraud detection and detection of unusual patterns (outliers)
   Other Applications
     Text mining (news group, email, documents) and Web mining
     Bioinformatics and bio-data analysis


   Target marketing
     Find clusters of “model” customers who share the same characteristics:
      interest, income level, spending habits, etc.
     Determine customer purchasing patterns over time

   Cross-market analysis: find associations/correlations between product
    sales, and predict based on such associations

   Customer profiling: what types of customers buy what products
    (clustering or classification)

   Customer requirement analysis
     Identify the best products for different groups of customers
     Predict what factors will attract new customers


   Finance planning and asset evaluation
     Cash flow analysis and prediction
     Cross-sectional and time series analysis (financial-ratio, trend
        analysis, etc.)
   Resource planning
     Summarize and compare the resources and spending
   Competition
     Monitor competitors and market directions
     Group customers into classes and apply a class-based pricing procedure
     Set pricing strategy in a highly competitive market




   Approaches: Clustering & model construction for frauds, outlier analysis
   Applications: Health care, retail, credit card service, telecomm.
     Auto insurance: ring of collisions
     Money laundering: suspicious monetary transactions
     Medical insurance
        ▪ Professional patients, ring of doctors, and ring of references
        ▪ Unnecessary or correlated screening tests
     Telecommunications: phone-call fraud
        ▪ Phone call model: destination of the call, duration, time of day or week.
          Analyze patterns that deviate from an expected norm
     Retail industry
        ▪ Analysts estimate that 38% of retail shrink is due to dishonest employees
     Anti-terrorism


[Figure: Data mining as the core of the knowledge discovery process.
 Stages, in order: Databases → Data Cleaning and Data Integration →
 Data Warehouse → Selection → Task-relevant Data → Data Mining →
 Pattern Evaluation]
[Figure: Layers with increasing potential to support business decisions,
 listed from bottom to top:
   Data Sources (paper, files, information providers, database systems, OLTP)
   Data Warehouses / Data Marts (OLAP, MDA)
   Data Exploration (statistical analysis, querying and reporting)
   Data Mining (information discovery)
   Data Presentation (visualization techniques)
   Making Decisions
 Roles shown beside the layers, from bottom to top: DBA, Data Analyst,
 Business Analyst, End User]
   Learning the application domain
     relevant prior knowledge and goals of application
   Creating a target data set: data selection
   Data cleaning and preprocessing: (may take 70% of effort!)
   Data reduction and transformation
     Find useful features, dimensionality/variable reduction, invariant
        representation.
   Choosing functions of data mining
       summarization, classification, regression, association, clustering.
   Choosing the mining algorithm(s)
   Data mining: search for patterns of interest
   Pattern evaluation and knowledge presentation
     visualization, transformation, removing redundant patterns, etc.
   Use of discovered knowledge

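[Figure: Architecture of a typical data mining system, from top to bottom:
   Graphical user interface
   Pattern evaluation
   Data mining engine
   Database or data warehouse server
   Data cleaning, data integration, and filtering
   Databases and data warehouse
 A knowledge base is drawn alongside the data mining engine]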
   General functionality
       Descriptive data mining
       Predictive data mining
     Different views, different classifications
       Kinds of databases to be mined
       Kinds of knowledge to be discovered
       Kinds of techniques utilized
       Kinds of applications adapted

   Concept description: Characterization and discrimination
     Generalize, summarize, and contrast data characteristics, e.g., dry vs.
        wet regions
   Association (correlation and causality)
     Diaper → Beer [support 0.5%, confidence 75%]
   Classification and Prediction
     Construct models (functions) that describe and distinguish classes or
        concepts for future prediction
        ▪ E.g., classify countries based on climate, or classify cars based on gas
          mileage
     Presentation: decision-tree, classification rule, neural network
     Predict some unknown or missing numerical values
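
The two numbers attached to the Diaper → Beer rule above are its support and confidence. A minimal sketch of how they can be computed is given below. The transaction list is a hypothetical toy data set (not taken from the slides), and the helper names `support` and `confidence` are chosen only for illustration.

```python
# Hypothetical market-basket data; each set is one customer transaction.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
    {"milk", "bread"},
    {"diaper", "milk", "beer"},
    {"bread", "chips"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
print(f"Diaper -> Beer [support={rule_support:.1%}, confidence={rule_confidence:.1%}]")
```

Support measures how often the whole rule applies in the data, while confidence measures how often the consequent holds when the antecedent does.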
   Cluster analysis
     Class label is unknown: Group data to form new
      classes, e.g., cluster houses to find distribution patterns
     Maximizing intra-class similarity & minimizing interclass
      similarity
   Outlier analysis
     Outlier: a data object that does not comply with the general
      behavior of the data
     Just noise or an exception to discard? No! Outliers are useful in
      fraud detection and rare-event analysis
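
As a rough illustration of outlier analysis, the sketch below flags values that deviate from the general behavior of the data using the interquartile-range (IQR) rule. The IQR rule is only one of many possible techniques and is an assumption here, not a method named in the slides. The call durations are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical phone-call durations in minutes; one call is unusually long.
call_minutes = [3, 5, 4, 6, 5, 4, 7, 5, 6, 240]
outliers = iqr_outliers(call_minutes)
print(outliers)
```

In a fraud-detection setting the flagged records would be candidates for closer inspection rather than rows to delete.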



   Data mining: discovering interesting patterns from large amounts of data
   A natural evolution of database technology, in great demand, with wide
    applications
   A KDD process includes data cleaning, data integration, data
    selection, transformation, data mining, pattern evaluation, and knowledge
    presentation
   Mining can be performed in a variety of information repositories
   Data mining functionalities:
    characterization, discrimination, association, classification, clustering,
    outlier and trend analysis, etc.
   Data mining systems and architectures
   Major issues in data mining


   R. Agrawal, J. Han, and H. Mannila, Readings in Data Mining: A
    Database Perspective, Morgan Kaufmann (in preparation)
   J. Han and M. Kamber. Data Mining: Concepts and Techniques.
    Morgan Kaufmann, 2001




Thank you!
   • A decision tree (DT) is a hierarchical classification
    and prediction model

    • It is organized as a rooted tree with two types of
    nodes: decision (internal) nodes and terminal (leaf) nodes

    • It is a supervised data mining model used for
    classification or prediction


   Chance and Terminal Nodes

    •Each internal node of a DT is a decision point, where some
    condition is tested
    •The result of this condition determines which branch of the
    tree is to be taken next
    •For this reason internal nodes are called decision nodes, chance
    nodes, or non-terminal nodes
    •Chance nodes partition the available data at that point to
    maximize the differences in the dependent variable


   Terminal nodes

    •The leaf nodes of a DT are called terminal nodes
    •They indicate the class into which a data instance will
    be classified
    •They have exactly one incoming edge (one parent node)
    •They do not have child nodes (outgoing edges)
    •There are no conditions tested at terminal nodes
    •Tree traversal from the root to a leaf produces the
    production rule for that class
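
The relationship between traversal and production rules can be sketched as follows. The nested-dict representation and the small weather tree are hypothetical choices made for illustration: a decision node is a dict naming the attribute it tests, and a terminal node is simply a class label.

```python
# Hypothetical tree: decision nodes are dicts, terminal nodes are class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": {
            "attribute": "windy",
            "branches": {"true": "no", "false": "yes"},
        },
    },
}


def classify(node, instance):
    """Follow one branch per decision node until a terminal node is reached."""
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node


def production_rules(node, conditions=()):
    """Collect one IF ... THEN rule for every root-to-leaf path."""
    if not isinstance(node, dict):
        lhs = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {lhs} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += production_rules(child, conditions + ((node["attribute"], value),))
    return rules


rules = production_rules(tree)
for rule in rules:
    print(rule)
print(classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": "false"}))
```

Each terminal node yields exactly one rule, so the number of rules equals the number of leaves.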

   Advantages of DT

    • Easy to understand and interpret
    • Works for categorical and continuous data
    • High classification performance (generally)
    • DT can grow to any depth
    • On-the-fly prediction
    • Pruning a DT is very easy
    • Can handle missing or null values

   Advantages contd.

    • Can be used to identify outliers
    • Production rules can be obtained directly from the built DT
    • They are relatively fast compared with other classification models
    • DT can be used even when domain experts are absent
    • Provide a clear indication of which fields are important for
    prediction and classification




   Disadvantages

    •Class-overlap problem (due to the curse of
    dimensionality)
    •Complex production rules
    •A DT can be sub-optimal (for this reason ensemble
    methods were developed)
    •Some decision tree algorithms can deal only with binary-valued target classes



      •Training set: used to derive the classifier
        (generally 70-80% of the data)

      •Test set: used to measure accuracy
        (generally 20-30% of the data)
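
A minimal sketch of such a split, using only the Python standard library, is shown below. The function name `train_test_split`, the 30% test fraction, and the fixed seed are assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


# Hypothetical data set of 100 record identifiers.
data = list(range(100))
train_rows, test_rows = train_test_split(data, test_fraction=0.3)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters when the records are stored in some meaningful order, for example sorted by class or by date.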




   Construction phase: the initial decision tree is
    built in this phase
    Q: How should nodes be split?
    A: Each algorithm uses its own splitting approach

   Pruning phase: in this stage lower branches
    are removed to improve performance
    Q: Why?
    A: To avoid overfitting (overtraining)
   ID3 (widely available)
   C4.5 / C5.0 (Weka / SPSS Clementine)
   CART (SPSS Clementine)
   CHAID (SPSS Clementine, etc.)
   Microsoft Decision Trees (MS Analysis Services)
   Random Forests (Statistica)




   ID3 induction algorithm

    •ID3 (Iterative Dichotomiser 3)
    •Introduced in 1986 by Quinlan
    •Designed for classification only
    •Works on categorical attributes only
    •Uses an entropy-based measure (information gain) as the splitting criterion
    •Does not handle missing values
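
The entropy-based criterion can be sketched as below: the attribute with the highest information gain is chosen for the split. The six-row weather table is hypothetical, and the function names are chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical categorical training data.
rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

Here splitting on `outlook` leaves only the `rainy` partition mixed, so it gains far more information than splitting on `windy`.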

   C4.5 induction algorithm

    •Introduced by Quinlan in 1993
    •Extends the ID3 algorithm
    •Designed for classification only
    •Accepts numerical attributes as well as categorical ones
    •Uses an entropy-based measure (gain ratio) as the splitting criterion
    •Uses multi-way splits
    •Missing value handling is provided
    •Tree pruning is also provided
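
C4.5 normalizes information gain by the entropy of the split itself, which gives the gain ratio. The sketch below is a simplified illustration under that assumption and omits the other refinements of the full algorithm. The data, including the artificial `id` attribute, is hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_and_ratio(rows, attribute, target):
    """Return (information gain, gain ratio) for splitting on `attribute`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([row[target] for row in rows]) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy([row[attribute] for row in rows])
    return gain, (gain / split_info if split_info > 0 else 0.0)


# Hypothetical data; `id` is unique per row and therefore useless for prediction.
rows = [
    {"id": 1, "outlook": "sunny", "play": "no"},
    {"id": 2, "outlook": "sunny", "play": "no"},
    {"id": 3, "outlook": "overcast", "play": "yes"},
    {"id": 4, "outlook": "overcast", "play": "yes"},
    {"id": 5, "outlook": "rainy", "play": "yes"},
    {"id": 6, "outlook": "rainy", "play": "no"},
]

id_gain, id_ratio = gain_and_ratio(rows, "id", "play")
outlook_gain, outlook_ratio = gain_and_ratio(rows, "outlook", "play")
print(id_gain, id_ratio, outlook_gain, outlook_ratio)
```

Plain information gain prefers the many-valued `id` attribute, while the gain ratio penalizes it and ranks `outlook` higher.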
   Classification and Regression Trees (CART)

    •Introduced by Breiman et al. in 1984
    •Uses a binary recursive partitioning method
    •Designed for both classification and regression
    •Works on both categorical & numerical attributes
    •Uses the Gini index as the splitting criterion
    •Uses two-way splits
    •Missing value handling is provided
    •Tree pruning is also provided
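
A two-way split on a numerical attribute using the Gini index can be sketched as follows. The exhaustive search over midpoints between distinct values is a simple illustrative strategy, and the income data is hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (threshold, weighted Gini) of the best split `value <= threshold`."""
    n = len(values)
    best = (None, gini(labels))  # no split at all is the baseline
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical customers: yearly income (in thousands) and purchase decision.
income = [20, 25, 30, 45, 50, 60]
buys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_binary_split(income, buys)
print(threshold, impurity)
```

A weighted impurity of zero means both child nodes are pure, so no further splitting is needed on this branch.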
   Chi-squared Automatic Interaction Detection (CHAID)

    •Introduced by Kass in 1980
    •Designed for both classification and regression
    •Works on both categorical & numerical attributes
    •Uses Pearson's chi-squared (χ²) test as the splitting criterion
    •Uses multi-way splits
    •Missing value handling is provided
    •Avoids tree pruning
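
The statistic behind the chi-squared test can be sketched as below for a contingency table of attribute values against classes. This computes only the Pearson statistic; turning it into a p-value and merging categories, as a full CHAID implementation would, is left out. The counts are hypothetical, and every row and column total is assumed to be non-zero.

```python
def chi_squared(table):
    """Pearson's chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if attribute and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are attribute values, columns are classes.
related = [[30, 10], [10, 30]]
unrelated = [[20, 20], [20, 20]]
print(chi_squared(related), chi_squared(unrelated))
```

For tables of the same shape, a larger statistic indicates a stronger association between the attribute and the class, which makes the attribute a better split candidate.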

   Microsoft Decision Trees

    •Introduced by Microsoft in 1999
    •Designed for both classification and regression
    •Works on both categorical & numerical attributes
    •Offers entropy, Bayesian K2, and Bayesian
    Dirichlet Equivalent with Uniform prior as choices for the
    splitting criterion
    •Uses multi-way splits and also supports binary splits
    •Missing value handling is provided
    •Avoids tree pruning
   Overfitting: An induced tree may overfit the training data
     Too many branches, some may reflect anomalies due to noise or outliers
     Poor accuracy for unseen samples




   Two approaches to avoid overfitting
     Prepruning: Halt tree construction early—do not split a node if this
       would result in the goodness measure falling below a threshold
       ▪ Difficult to choose an appropriate threshold
     Postpruning: Remove branches from a “fully grown” tree—get a
       sequence of progressively pruned trees
       ▪ Use a set of data different from the training data to decide which is
         the “best pruned tree”
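
One simple form of postpruning is reduced-error pruning, sketched below: working bottom-up, a subtree is replaced by a leaf whenever that does not increase the error on held-out data. This is a swapped-in illustrative variant, not necessarily the procedure the slides have in mind. The tree, its `majority` field (the most common training class at each node), and the validation rows are all hypothetical.

```python
def classify(node, instance):
    """Walk from the root to a leaf; fall back to the majority class if needed."""
    while isinstance(node, dict):
        node = node["branches"].get(instance[node["attribute"]], node["majority"])
    return node


def errors(node, rows, target):
    """Number of rows that the (sub)tree misclassifies."""
    return sum(classify(node, row) != row[target] for row in rows)


def prune(node, rows, target):
    """Reduced-error pruning using the held-out rows that reach this node."""
    if not isinstance(node, dict):
        return node
    for value, child in list(node["branches"].items()):
        subset = [row for row in rows if row[node["attribute"]] == value]
        node["branches"][value] = prune(child, subset, target)
    leaf_errors = sum(row[target] != node["majority"] for row in rows)
    if leaf_errors <= errors(node, rows, target):
        return node["majority"]  # a single leaf does at least as well
    return node


# Hypothetical fully grown tree; the sunny/false branch was fitted to noise.
tree = {
    "attribute": "outlook", "majority": "yes",
    "branches": {
        "sunny": {"attribute": "windy", "majority": "no",
                  "branches": {"true": "no", "false": "yes"}},
        "overcast": "yes",
        "rainy": "no",
    },
}
# Hypothetical validation rows, kept separate from the training data.
validation = [
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
    {"outlook": "rainy", "windy": "false", "play": "no"},
]

errors_before = errors(tree, validation, "play")
pruned = prune(tree, validation, "play")
errors_after = errors(pruned, validation, "play")
print(errors_before, errors_after, pruned["branches"]["sunny"])
```

Note that `prune` modifies the tree in place, and that a subtree reached by no validation rows is also collapsed because there is no evidence in its favor.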




[Figure: Training error and validation error plotted against time]

Weitere ähnliche Inhalte

Was ist angesagt?

Classification
ClassificationClassification
ClassificationCloudxLab
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
Recovery with concurrent transaction
Recovery with concurrent transactionRecovery with concurrent transaction
Recovery with concurrent transactionlavanya marichamy
 
Multidimensional schema
Multidimensional schemaMultidimensional schema
Multidimensional schemaChaand Chopra
 
Support Vector Machine ppt presentation
Support Vector Machine ppt presentationSupport Vector Machine ppt presentation
Support Vector Machine ppt presentationAyanaRukasar
 
Learn from Example and Learn Probabilistic Model
Learn from Example and Learn Probabilistic ModelLearn from Example and Learn Probabilistic Model
Learn from Example and Learn Probabilistic ModelJunya Tanaka
 
Structure of shared memory space
Structure of shared memory spaceStructure of shared memory space
Structure of shared memory spaceCoder Tech
 
Data mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityData mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityRushali Deshmukh
 
Dm from databases perspective u 1
Dm from databases perspective u 1Dm from databases perspective u 1
Dm from databases perspective u 1sakthyvel3
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsPrashanth Guntal
 
Data Streaming For Big Data
Data Streaming For Big DataData Streaming For Big Data
Data Streaming For Big DataSeval Çapraz
 
Lecture 12 binary classifier confusion matrix
Lecture 12 binary classifier confusion matrixLecture 12 binary classifier confusion matrix
Lecture 12 binary classifier confusion matrixMostafa El-Hosseini
 
Adbms 23 distributed database design
Adbms 23 distributed database designAdbms 23 distributed database design
Adbms 23 distributed database designVaibhav Khanna
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data miningKamal Acharya
 
Decision Tree in Machine Learning
Decision Tree in Machine Learning  Decision Tree in Machine Learning
Decision Tree in Machine Learning Souma Maiti
 
Over fitting underfitting
Over fitting underfittingOver fitting underfitting
Over fitting underfittingSivapriyaS12
 

Was ist angesagt? (20)

Classification
ClassificationClassification
Classification
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Recovery with concurrent transaction
Recovery with concurrent transactionRecovery with concurrent transaction
Recovery with concurrent transaction
 
Multidimensional schema
Multidimensional schemaMultidimensional schema
Multidimensional schema
 
Support Vector Machine ppt presentation
Support Vector Machine ppt presentationSupport Vector Machine ppt presentation
Support Vector Machine ppt presentation
 
Bayes Belief Networks
Bayes Belief NetworksBayes Belief Networks
Bayes Belief Networks
 
Learn from Example and Learn Probabilistic Model
Learn from Example and Learn Probabilistic ModelLearn from Example and Learn Probabilistic Model
Learn from Example and Learn Probabilistic Model
 
Structure of shared memory space
Structure of shared memory spaceStructure of shared memory space
Structure of shared memory space
 
Data mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityData mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarity
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Dm from databases perspective u 1
Dm from databases perspective u 1Dm from databases perspective u 1
Dm from databases perspective u 1
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
 
Data Streaming For Big Data
Data Streaming For Big DataData Streaming For Big Data
Data Streaming For Big Data
 
Lecture 12 binary classifier confusion matrix
Lecture 12 binary classifier confusion matrixLecture 12 binary classifier confusion matrix
Lecture 12 binary classifier confusion matrix
 
Adbms 23 distributed database design
Adbms 23 distributed database designAdbms 23 distributed database design
Adbms 23 distributed database design
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
Decision Tree in Machine Learning
Decision Tree in Machine Learning  Decision Tree in Machine Learning
Decision Tree in Machine Learning
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Over fitting underfitting
Over fitting underfittingOver fitting underfitting
Over fitting underfitting
 

Andere mochten auch

Andere mochten auch (8)

Hopfield Ağı
Hopfield AğıHopfield Ağı
Hopfield Ağı
 
ADEM: An Online Decision Tree Based Menu Demand Prediction Tool for Food Courts
ADEM: An Online Decision Tree Based Menu Demand Prediction Tool for Food CourtsADEM: An Online Decision Tree Based Menu Demand Prediction Tool for Food Courts
ADEM: An Online Decision Tree Based Menu Demand Prediction Tool for Food Courts
 
Yapay sinir agları
Yapay sinir aglarıYapay sinir agları
Yapay sinir agları
 
002.decision trees
002.decision trees002.decision trees
002.decision trees
 
hopfield neural network
hopfield neural networkhopfield neural network
hopfield neural network
 
Id3,c4.5 algorithim
Id3,c4.5 algorithimId3,c4.5 algorithim
Id3,c4.5 algorithim
 
Hopfield Networks
Hopfield NetworksHopfield Networks
Hopfield Networks
 
HOPFIELD NETWORK
HOPFIELD NETWORKHOPFIELD NETWORK
HOPFIELD NETWORK
 

Ähnlich wie Data mining & Decison Trees

What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)Pratik Tambekar
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhar6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhardeepikakaler1
 
6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhiana6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhianadeepikakaler1
 
6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhar6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhardeepikakaler1
 
6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhiana6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhianadeepikakaler1
 
Introduction.ppt
Introduction.pptIntroduction.ppt
Introduction.pptbommaiah
 
Introduction To Data Mining
Introduction To Data MiningIntroduction To Data Mining
Introduction To Data Miningdataminers.ir
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1DanWooster1
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesDeepaR42
 

Ähnlich wie Data mining & Decison Trees (20)

Data mining
Data miningData mining
Data mining
 
Introduction data mining
Introduction data miningIntroduction data mining
Introduction data mining
 
D
DD
D
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
 
Data mining 1
Data mining 1Data mining 1
Data mining 1
 
Introduction
IntroductionIntroduction
Introduction
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
isd314-01
isd314-01isd314-01
isd314-01
 
6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhar6months industrial training in data mining, jalandhar
6months industrial training in data mining, jalandhar
 
6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhiana6 weeks summer training in data mining,ludhiana
6 weeks summer training in data mining,ludhiana
 
6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhar6 weeks summer training in data mining,jalandhar
6 weeks summer training in data mining,jalandhar
 
6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhiana6months industrial training in data mining,ludhiana
6months industrial training in data mining,ludhiana
 
Introduction to data warehouse
Introduction to data warehouseIntroduction to data warehouse
Introduction to data warehouse
 
Introduction.ppt
Introduction.pptIntroduction.ppt
Introduction.ppt
 
Introduction To Data Mining
Introduction To Data MiningIntroduction To Data Mining
Introduction To Data Mining
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Chapter 1. Introduction.ppt
Chapter 1. Introduction.pptChapter 1. Introduction.ppt
Chapter 1. Introduction.ppt
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and Techniques
 
Data mining
Data miningData mining
Data mining
 

Mehr von Selman Bozkır

23--Web-Design-Principles
23--Web-Design-Principles23--Web-Design-Principles
23--Web-Design-PrinciplesSelman Bozkır
 
Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...
Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...
Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...Selman Bozkır
 
Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...
Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...
Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...Selman Bozkır
 
Use of hog descriptors in phishing detection
Use of hog descriptors in phishing detectionUse of hog descriptors in phishing detection
Use of hog descriptors in phishing detectionSelman Bozkır
 
Measurement and metrics in model driven software development
Measurement and metrics in model driven software developmentMeasurement and metrics in model driven software development
Measurement and metrics in model driven software developmentSelman Bozkır
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
SHOE (simple html ontology extensions)
SHOE (simple html ontology extensions)SHOE (simple html ontology extensions)
SHOE (simple html ontology extensions)Selman Bozkır
 
Predicting food demand in food courts by decision tree approaches
Predicting food demand in food courts by decision tree approachesPredicting food demand in food courts by decision tree approaches
Predicting food demand in food courts by decision tree approachesSelman Bozkır
 
Identification of User Patterns in Social Networks by Data Mining Techniques:...
Identification of User Patterns in Social Networks by Data Mining Techniques:...Identification of User Patterns in Social Networks by Data Mining Techniques:...
Identification of User Patterns in Social Networks by Data Mining Techniques:...Selman Bozkır
 
FUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis ToolFUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis ToolSelman Bozkır
 

Mehr von Selman Bozkır (12)

lecture_07.pptx
lecture_07.pptxlecture_07.pptx
lecture_07.pptx
 
23--Web-Design-Principles
23--Web-Design-Principles23--Web-Design-Principles
23--Web-Design-Principles
 
Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...
Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...
Phishing Attacks: Trends, Detection Systems and Computer Vision as a Promisin...
 
Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...
Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...
Kötücül Yazılımların Tanınmasında Evrişimsel Sinir Ağlarının Kullanımı ve Kar...
 
Use of hog descriptors in phishing detection
Use of hog descriptors in phishing detectionUse of hog descriptors in phishing detection
Use of hog descriptors in phishing detection
 
Measurement and metrics in model driven software development
Measurement and metrics in model driven software developmentMeasurement and metrics in model driven software development
Measurement and metrics in model driven software development
 
UML ile Modelleme
UML ile ModellemeUML ile Modelleme
UML ile Modelleme
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systems
 
SHOE (simple html ontology extensions)
SHOE (simple html ontology extensions)SHOE (simple html ontology extensions)
SHOE (simple html ontology extensions)
 
Predicting food demand in food courts by decision tree approaches
Predicting food demand in food courts by decision tree approachesPredicting food demand in food courts by decision tree approaches
Predicting food demand in food courts by decision tree approaches
 
Identification of User Patterns in Social Networks by Data Mining Techniques:...
Identification of User Patterns in Social Networks by Data Mining Techniques:...Identification of User Patterns in Social Networks by Data Mining Techniques:...
Identification of User Patterns in Social Networks by Data Mining Techniques:...
 
FUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis ToolFUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis Tool
 


Data Mining & Decision Trees

  • 1. Presented and contributed by: Ahmet Selman Bozkır, Ph.D. student, Hacettepe University. November 29, 2011
  • 2.
      - What is data mining?
      - Motivation: why data mining?
      - Classification of data mining systems
      - Architecture: typical data mining system
      - Data mining functionality
  • 3.
      - Data mining (knowledge discovery from data)
          - Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
          - Data mining: a misnomer?
      - Alternative names
          - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
      - Watch out: is everything "data mining"?
          - (Deductive) query processing
          - Expert systems or small ML/statistical programs
  • 4.
      - Data explosion problem
          - Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories
      - We are drowning in data, but starving for knowledge!
      - Solution: data warehousing and data mining
          - Data warehousing and on-line analytical processing
          - Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
  • 5.
      - Data analysis and decision support
          - Market analysis and management
              ▪ Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
          - Risk analysis and management
              ▪ Forecasting, customer retention, improved underwriting, quality control, competitive analysis
          - Fraud detection and detection of unusual patterns (outliers)
      - Other applications
          - Text mining (news groups, email, documents) and Web mining
          - Bioinformatics and bio-data analysis
  • 6.
      - Target marketing
          - Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
          - Determine customer purchasing patterns over time
      - Cross-market analysis: find associations/correlations between product sales, and predict based on such associations
      - Customer profiling: what types of customers buy what products (clustering or classification)
      - Customer requirement analysis
          - Identify the best products for different groups of customers
          - Predict what factors will attract new customers
  • 7.
      - Finance planning and asset evaluation
          - Cash flow analysis and prediction
          - Cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
      - Resource planning
          - Summarize and compare the resources and spending
      - Competition
          - Monitor competitors and market directions
          - Group customers into classes and apply a class-based pricing procedure
          - Set pricing strategy in a highly competitive market
  • 8.
      - Approaches: clustering and model construction for frauds, outlier analysis
      - Applications: health care, retail, credit card service, telecommunications
          - Auto insurance: ring of collisions
          - Money laundering: suspicious monetary transactions
          - Medical insurance
              ▪ Professional patients, ring of doctors, and ring of references
              ▪ Unnecessary or correlated screening tests
          - Telecommunications: phone-call fraud
              ▪ Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
          - Retail industry
              ▪ Analysts estimate that 38% of retail shrink is due to dishonest employees
          - Anti-terrorism
  • 9. Data mining: core of the knowledge discovery process
      [Figure: KDD pipeline. Databases → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining → pattern evaluation]
  • 10. [Figure: layered pyramid; potential to support business decisions increases toward the top. Layers, top to bottom:]
      - Making decisions
      - Data presentation: visualization techniques
      - Data mining: information discovery
      - Data exploration: statistical analysis, querying and reporting
      - Data warehouses / data marts: OLAP, MDA
      - Data sources: paper, files, information providers, database systems, OLTP
      Roles shown alongside the layers, top to bottom: end user, business analyst, data analyst, DBA
  • 11.
      - Learning the application domain
          - Relevant prior knowledge and goals of the application
      - Creating a target data set: data selection
      - Data cleaning and preprocessing (may take 70% of the effort!)
      - Data reduction and transformation
          - Find useful features, dimensionality/variable reduction, invariant representation
      - Choosing functions of data mining
          - Summarization, classification, regression, association, clustering
      - Choosing the mining algorithm(s)
      - Data mining: search for patterns of interest
      - Pattern evaluation and knowledge presentation
          - Visualization, transformation, removing redundant patterns, etc.
      - Use of discovered knowledge
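  • 12. [Figure: architecture of a typical data mining system. Components, top to bottom:]
      - Graphical user interface
      - Pattern evaluation
      - Data mining engine
      - Database or data warehouse server
      - Data cleaning, data integration, and filtering
      - Databases and data warehouse
      - Knowledge base (shown beside the layers)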
  • 13.
      - General functionality
          - Descriptive data mining
          - Predictive data mining
      - Different views, different classifications
          - Kinds of databases to be mined
          - Kinds of knowledge to be discovered
          - Kinds of techniques utilized
          - Kinds of applications adapted
  • 14.
      - Concept description: characterization and discrimination
          - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
      - Association (correlation and causality)
          - Diaper → Beer [0.5%, 75%]
      - Classification and prediction
          - Construct models (functions) that describe and distinguish classes or concepts for future prediction
              ▪ E.g., classify countries based on climate, or classify cars based on gas mileage
          - Presentation: decision tree, classification rule, neural network
          - Predict some unknown or missing numerical values
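The bracketed pair on the association rule above is conventionally read as [support, confidence]. The sketch below shows how those two numbers could be computed for a rule such as Diaper → Beer. The function name `support_confidence` and the eight toy transactions are assumptions made for illustration; they are not from the slides and do not reproduce the 0.5% / 75% figures.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions containing every item of the antecedent.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions containing both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market-basket data (illustrative only).
transactions = [
    {"diaper", "beer"},
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"milk"},
    {"bread"},
    {"bread", "milk"},
    {"beer"},
]

support, confidence = support_confidence(transactions, {"diaper"}, {"beer"})
print(f"support={support:.3f}, confidence={confidence:.2f}")
```

Here 3 of 8 baskets contain both items (support) and 3 of the 4 diaper baskets also contain beer (confidence).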
  • 15.
      - Cluster analysis
          - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
          - Maximizing intra-class similarity and minimizing inter-class similarity
      - Outlier analysis
          - Outlier: a data object that does not comply with the general behavior of the data
          - Noise or exception? No! Useful in fraud detection and rare events analysis
  • 16.
      - Data mining: discovering interesting patterns from large amounts of data
      - A natural evolution of database technology, in great demand, with wide applications
      - A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
      - Mining can be performed in a variety of information repositories
      - Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
      - Data mining systems and architectures
      - Major issues in data mining
  • 17.
      - R. Agrawal, J. Han, and H. Mannila. Readings in Data Mining: A Database Perspective. Morgan Kaufmann (in preparation)
      - J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001
  • 18. Thank you!!!
  • 19.
      - A decision tree (DT) is a hierarchical classification and prediction model
      - It is organized as a rooted tree with two types of nodes: decision nodes and terminal nodes
      - It is a supervised data mining model used for classification or prediction
  • 21. Chance and Terminal Nodes
      - Each internal node of a DT is a decision point, where some condition is tested
      - The result of this condition determines which branch of the tree is taken next
      - Such nodes are therefore called decision nodes, chance nodes, or non-terminal nodes
      - Chance nodes partition the available data at that point to maximize differences in the dependent variable
  • 22. Terminal nodes
      - The leaf nodes of a DT are called terminal nodes
      - They indicate the class into which a data instance will be classified
      - They have exactly one incoming edge (a single parent node)
      - They have no child nodes (no outgoing edges)
      - No conditions are tested at terminal nodes
      - Traversing the tree from the root to a leaf produces the production rule for that class
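The node types described in the slides above can be sketched as a small data structure. This is only an illustration: the class names `Decision` and `Leaf`, the helper functions, and the weather-style attributes are assumptions chosen for the example, not something the deck specifies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    """Terminal node: holds a class label, tests no condition, has no children."""
    label: str


@dataclass
class Decision:
    """Decision (non-terminal) node: tests one attribute and branches on its value."""
    attribute: str
    branches: Dict[str, Union["Decision", Leaf]] = field(default_factory=dict)


def classify(node, instance):
    """Follow branches from the root until a terminal node is reached."""
    while isinstance(node, Decision):
        node = node.branches[instance[node.attribute]]
    return node.label


def production_rules(node, conditions: Tuple[str, ...] = ()) -> List[str]:
    """Each root-to-leaf path yields one IF ... THEN rule."""
    if isinstance(node, Leaf):
        lhs = " AND ".join(conditions) or "TRUE"
        return [f"IF {lhs} THEN class = {node.label}"]
    rules: List[str] = []
    for value, child in node.branches.items():
        rules += production_rules(child, conditions + (f"{node.attribute} = {value}",))
    return rules


# Hypothetical hand-built tree (illustrative only).
tree = Decision("outlook", {
    "sunny": Decision("humidity", {"high": Leaf("no"), "normal": Leaf("yes")}),
    "overcast": Leaf("yes"),
    "rainy": Decision("windy", {"true": Leaf("no"), "false": Leaf("yes")}),
})

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))
for rule in production_rules(tree):
    print(rule)
```

The tree has five terminal nodes, so traversal yields five production rules, one per root-to-leaf path.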
  • 24. Advantages of DT
      - Easy to understand and interpret
      - Works for categorical and continuous data
      - High-performance classification (generally)
      - A DT can grow to any depth
      - On-the-fly prediction
      - Pruning a DT is very easy
      - Works with missing or null values
  • 25. Advantages (contd.)
      - Can be used to identify outliers
      - Production rules can be obtained directly from the built DT
      - They are relatively fast compared with other classification models
      - A DT can be used even when domain experts are absent
      - Provides a clear indication of which fields are important for prediction and classification
  • 26. Disadvantages
      - Class-overlap problem (due to the curse of dimensionality)
      - Complex production rules
      - A DT can be sub-optimal (for this reason ensemble methods were developed)
      - Some decision tree algorithms can deal only with binary values
  • 28.
      - Training set: used to derive the classifier (generally 70%-80% of the data)
      - Test set: used to measure accuracy (generally 20%-30% of the data)
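A hold-out split along these lines could look like the following sketch. The helper names `train_test_split` and `accuracy`, the 30% default, and the fixed seed are illustrative assumptions; libraries such as scikit-learn provide their own, more complete versions.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and hold out `test_fraction` of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def accuracy(classifier, test_rows):
    """Fraction of (features, label) pairs the classifier labels correctly."""
    correct = sum(1 for features, label in test_rows if classifier(features) == label)
    return correct / len(test_rows)


# Hypothetical data: the label is "big" when the single feature exceeds 5.
data = [(x, "big" if x > 5 else "small") for x in range(10)]
train, test = train_test_split(data, test_fraction=0.3)

# A stand-in classifier; in practice it would be derived from `train` only.
def threshold_rule(x):
    return "big" if x > 5 else "small"

print(len(train), len(test), accuracy(threshold_rule, test))
```

The key point is that accuracy is measured only on rows the classifier never saw while it was being built.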
  • 29.
      - Construction phase: the initial decision tree is constructed in this phase
          - Q: How to split nodes? A: Different algorithms take different approaches
      - Pruning phase: lower branches are removed to improve performance
          - Q: Why? A: To avoid overfitting (overtraining)
  • 30.
      - ID3 (available everywhere)
      - C4.5 / C5.0 (Weka, SPSS Clementine)
      - CART (SPSS Clementine)
      - CHAID (SPSS Clementine, etc.)
      - Microsoft Decision Trees (MS Analysis Services)
      - Random Forests (Statistica)
  • 31. ID3 induction algorithm
      - ID3 (Iterative Dichotomiser 3)
      - Introduced in 1986 by Quinlan
      - Designed for classification only
      - Works on categorical attributes only
      - Uses an entropy measure (information gain) as the splitting criterion
      - No missing-value handling
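The entropy-based splitting criterion can be made concrete with a short sketch. With class proportions p_i, entropy is H = -Σ p_i log2 p_i, and the information gain of an attribute is the parent entropy minus the size-weighted entropy of the partitions it induces. The function names and the four-row toy data set below are assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Parent entropy minus the weighted entropy after splitting on `attribute`."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical categorical data: "windy" separates the classes, "color" does not.
rows = [
    {"windy": "false", "color": "a"},
    {"windy": "false", "color": "b"},
    {"windy": "true", "color": "a"},
    {"windy": "true", "color": "b"},
]
labels = ["yes", "yes", "no", "no"]

# An ID3-style learner splits on the attribute with the highest gain.
best = max(["windy", "color"], key=lambda a: information_gain(rows, labels, a))
print(best)
```

A full ID3 implementation would apply this choice recursively to each partition until the labels are pure or no attributes remain.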
  • 32. C4.5 induction algorithm
      - Invented by Quinlan in 1993
      - An extension of the ID3 algorithm
      - Designed for classification only
      - Accepts numerical attributes as input
      - Uses an entropy-based measure (gain ratio) as the splitting criterion
      - Uses multi-way splits
      - Provides missing-value handling
      - Provides tree pruning
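One common way to let an entropy-based learner accept numerical attributes is to turn each one into a binary test `value <= t` and pick the threshold `t` with the highest information gain. The sketch below does that using midpoints between consecutive distinct values. It is a simplified illustration, not Quinlan's implementation: the function names and data are assumptions, and C4.5's gain-ratio normalization and missing-value handling are omitted.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split `value <= threshold`."""
    pairs = list(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    distinct = sorted(set(values))
    best_t, best_gain = None, 0.0
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


# Hypothetical numeric attribute with a clean class boundary between 3 and 10.
values = [1, 2, 3, 10, 11, 12]
classes = ["a", "a", "a", "b", "b", "b"]
threshold, gain = best_threshold(values, classes)
print(threshold, gain)
```

If no candidate improves on the parent entropy, the function returns `(None, 0.0)`, which a caller could treat as "do not split on this attribute".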
  • 33. Classification and Regression Trees (CART)
      - Invented by Breiman et al. in 1984
      - Uses the binary recursive partitioning method
      - Designed for both classification and regression
      - Works on both categorical and numerical attributes
      - Uses the Gini measure as the splitting criterion
      - Uses two-way splits
      - Provides missing-value handling
      - Provides tree pruning
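The Gini measure used for classification splits is 1 - Σ p_i², where p_i is the proportion of class i in a node; a two-way split is scored by the size-weighted Gini of its two children. A minimal sketch, with function names and toy labels chosen for illustration:

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for more mixed nodes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_of_split(left, right):
    """Size-weighted Gini impurity of a two-way (binary) split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


parent = ["a", "a", "b", "b"]
print(gini(parent))                          # evenly mixed two-class node
print(gini_of_split(["a", "a"], ["b", "b"]))  # split that separates the classes
print(gini_of_split(["a", "b"], ["a", "b"]))  # split that separates nothing
```

A CART-style learner would evaluate every candidate binary split this way and keep the one with the lowest weighted impurity.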
  • 34. Chi-squared Automatic Interaction Detection (CHAID)
      - Invented by Kass in 1980
      - Designed for both classification and regression
      - Works on both categorical and numerical attributes
      - Uses Karl Pearson's χ² test as the splitting criterion
      - Uses multi-way splits
      - Provides missing-value handling
      - Avoids tree pruning
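Pearson's χ² statistic compares the observed counts in a contingency table (attribute value by class) with the counts expected if attribute and class were independent: χ² = Σ (O - E)² / E. The sketch below computes only the statistic; the function name and tables are illustrative assumptions, every row and column total is assumed to be non-zero, and the p-value computation and category merging that CHAID performs are left out.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table of observed counts.

    `table[i][j]` is the number of rows with attribute value i and class j.
    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected under independence of attribute and class.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical tables: rows are attribute values, columns are classes.
independent = [[10, 10], [10, 10]]  # attribute tells nothing about the class
separating = [[20, 0], [0, 20]]     # attribute determines the class

print(chi_squared(independent))
print(chi_squared(separating))
```

A larger statistic indicates a stronger association between the attribute and the class, which makes the attribute a better split candidate.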
  • 35. Microsoft Decision Trees
      - Invented by Microsoft in 1999
      - Designed for both classification and regression
      - Works on both categorical and numerical attributes
      - Offers entropy, Bayesian K2, and Bayesian Dirichlet Equivalent with Uniform prior as choices of splitting criterion
      - Uses multi-way splits and also supports binary splitting
      - Provides missing-value handling
      - Avoids tree pruning
  • 36. Overfitting: an induced tree may overfit the training data
      - Too many branches, some of which may reflect anomalies due to noise or outliers
      - Poor accuracy for unseen samples
  • 37. Two approaches to avoid overfitting
      - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
          ▪ It is difficult to choose an appropriate threshold
      - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
          ▪ Use a set of data different from the training data to decide which is the "best pruned tree"
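As one concrete instance of postpruning, the sketch below uses reduced-error pruning: working bottom-up, a subtree is replaced by a leaf carrying its majority class whenever that does not increase the error on a separate validation set. This particular technique and the nested-dict tree layout (keys `attr`, `branches`, `majority`) are assumptions made for illustration; the slides do not say which postpruning method is meant.

```python
def predict(tree, row):
    """Leaves are plain label strings; internal nodes are dicts."""
    while isinstance(tree, dict):
        # Fall back to the node's majority class for unseen attribute values.
        tree = tree["branches"].get(row[tree["attr"]], tree["majority"])
    return tree


def errors(tree, rows):
    """Number of (row, label) pairs the tree misclassifies."""
    return sum(1 for row, label in rows if predict(tree, row) != label)


def prune(tree, validation):
    """Reduced-error pruning; modifies the tree in place and returns the result."""
    if not isinstance(tree, dict):
        return tree
    # Prune children first, each with the validation rows that reach it.
    for value, child in list(tree["branches"].items()):
        subset = [(r, y) for r, y in validation if r[tree["attr"]] == value]
        tree["branches"][value] = prune(child, subset)
    leaf = tree["majority"]
    # Collapse this node if a single leaf is at least as accurate here.
    if errors(leaf, validation) <= errors(tree, validation):
        return leaf
    return tree


# Hypothetical fully grown tree whose "windy" test fits noise in the training data.
full_tree = {
    "attr": "outlook", "majority": "yes",
    "branches": {
        "sunny": {"attr": "windy", "majority": "no",
                  "branches": {"true": "no", "false": "yes"}},
        "rainy": "yes",
    },
}
validation = [
    ({"outlook": "sunny", "windy": "true"}, "no"),
    ({"outlook": "sunny", "windy": "false"}, "no"),
    ({"outlook": "rainy", "windy": "false"}, "yes"),
]

pruned = prune(full_tree, validation)
print(pruned)
```

In this example the `windy` subtree is collapsed to the leaf `no`, while the root split on `outlook` survives because removing it would raise the validation error. Note that a node reached by no validation rows is also collapsed, since the leaf ties the subtree at zero errors.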
  • 38. [Figure: validation error and training error plotted against time]