Introduction to Data Mining




                  Informatika
                         Taken from © Copyright 2007, Natash
Outline
       Motivation: Why Data Mining?
       What is Data Mining?
       Data Mining Applications
       Issues in Data Mining




Data vs. Information

   Society produces massive amounts of data
       business, science, medicine, economics, sports, …
   Potentially valuable resource
   Raw data is useless
       need techniques to automatically extract information
       Data: recorded facts
       Information: patterns underlying the data




Multidisciplinary Field
   Data mining draws on:
       Database Technology
       Statistics
       Machine Learning
       Visualization
       Artificial Intelligence (Machine Learning – Neural Network)
       Other Disciplines


Terminology

       Gold Mining
       Knowledge mining from databases
       Knowledge extraction
       Data/pattern analysis
       Knowledge Discovery in Databases (KDD)
       Information harvesting
       Business intelligence

KDD Process

    Database
      -> Selection, Transformation
      -> Data Preparation
      -> Training Data
      -> Data Mining
      -> Model, Patterns
      -> Evaluation, Verification

Data Mining Tasks

       Exploratory Data Analysis
       Predictive Modeling: Classification and Regression
       Descriptive Modeling
         Cluster analysis/segmentation

       Discovering Patterns and Rules
           Association/Dependency rules
           Sequential patterns
           Temporal sequences
       Deviation detection
Data Mining Tasks

   Concept/Class description: Characterization
    and discrimination
           Generalize, summarize, and contrast data
            characteristics, e.g., dry vs. wet regions
   Association (correlation and causality)
           Multi-dimensional or single-dimensional association
        age(X, “20-29”) ^ income(X, “60-90K”) => buys(X, “TV”)
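
A rule like the one above is usually judged by its support and confidence. The following is a minimal sketch of that calculation; the customer records and the helper name rule_stats are hypothetical and exist only for illustration.

```python
# Hypothetical customer records; the values are invented for illustration.
customers = [
    {"age": "20-29", "income": "60-90K", "buys": "TV"},
    {"age": "20-29", "income": "60-90K", "buys": "TV"},
    {"age": "20-29", "income": "60-90K", "buys": "radio"},
    {"age": "30-39", "income": "60-90K", "buys": "TV"},
    {"age": "20-29", "income": "30-60K", "buys": "radio"},
]


def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""

    def matches(record, conditions):
        return all(record.get(key) == value for key, value in conditions.items())

    n_antecedent = sum(matches(r, antecedent) for r in records)
    n_both = sum(
        matches(r, antecedent) and matches(r, consequent) for r in records
    )
    # Support: share of all records that satisfy both sides of the rule.
    support = n_both / len(records)
    # Confidence: share of antecedent matches that also satisfy the consequent.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


support, confidence = rule_stats(
    customers,
    antecedent={"age": "20-29", "income": "60-90K"},
    consequent={"buys": "TV"},
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Here the antecedent uses two attributes (age and income), which is what makes the rule multi-dimensional.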

Data Mining Tasks

   Classification and Prediction
       Finding models (functions) that describe and
        distinguish classes or concepts for future prediction
       Example: classify countries based on climate, or
        classify cars based on gas mileage
       Presentation:
          IF-THEN rules, decision tree, classification rule,
          neural network
       Prediction: Predict some unknown or missing
        numerical values
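
Predicting a missing numerical value can be sketched with a least-squares line. This is only one possible technique, chosen here for brevity; the car weights and mileage figures are hypothetical, and fit_line is an illustrative helper rather than anything from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical cars: weight in tonnes and gas mileage in miles per gallon.
weights = [1.0, 1.5, 2.0, 2.5]
mileage = [40.0, 35.0, 30.0, 25.0]

slope, intercept = fit_line(weights, mileage)

# Estimate the unknown mileage of a car that weighs 1.8 tonnes.
predicted = slope * 1.8 + intercept
print(f"predicted mileage: {predicted:.1f} mpg")
```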
Data Mining Tasks


    Cluster analysis
        Class label is unknown: group data to form
         new classes
             Example: cluster houses to find distribution
              patterns
        Clustering based on the principle: maximizing
         the intra-class similarity and minimizing the
         interclass similarity
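
One common way to pursue this principle is k-means, which the slides do not name; it is swapped in here as an assumption. The sketch below groups hypothetical house coordinates into two clusters, starting from two hand-picked initial centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical house locations (x, y); no class labels are given.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, clusters = kmeans(houses, centroids=[houses[0], houses[3]])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Houses that end up in the same cluster are close to each other (high intra-class similarity), while the two centroids lie far apart (low interclass similarity).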

Data Mining Applications

   Science: Chemistry, Physics, Medicine
        Biochemical analysis
        Remote sensors on a satellite
        Telescopes – star/galaxy classification
        Medical Image analysis




Data Mining Applications

   Bioscience
        Sequence-based analysis
        Protein structure and function prediction
        Protein family classification
        Microarray gene expression




Data Mining Applications
    Pharmaceutical companies, Insurance
     and Health care, Medicine
        Drug development
        Identify successful medical therapies
        Claims analysis, fraudulent behavior
        Medical diagnostic tools
        Predict office visits



Data Mining Applications

    Financial Industry, Banks, Businesses, E-commerce
        Stock and investment analysis
        Identify loyal customers vs. risky customers
        Predict customer spending
        Risk management
        Sales forecasting


Data Mining Applications

   Retail and Marketing
        Customer buying patterns/demographic
         characteristics
        Mailing campaigns
        Market basket analysis
        Trend analysis



Data Mining Applications

   Database analysis and decision support
        Market analysis and management
            target marketing, customer relationship management, market
             basket analysis, cross-selling, market segmentation
        Risk analysis and management
            Forecasting, customer retention, improved underwriting,
             quality control, competitive analysis
        Fraud detection and management
Data Mining Applications

    Sports and Entertainment
        IBM Advanced Scout analyzed NBA game
         statistics (shots blocked, assists, and fouls) to
         gain a competitive advantage for the New York
         Knicks and Miami Heat
    Astronomy
        JPL and the Palomar Observatory discovered
         22 quasars with the help of data mining

DATA MINING EXAMPLES
    Grocery store
    NBA
    Banking and Credit Card scoring
        Fraud detection
    Personalization & Customer Profiling
    Campaign Management and Database
     Marketing

Data Mining Challenges

    Computationally expensive to investigate all
     possibilities
    Dealing with noise/missing information and
     errors in data
    Choosing appropriate attributes/input
     representation
    Finding the minimal attribute space
    Finding adequate evaluation function(s)
    Extracting meaningful information
 
    Not overfitting
Summary

   Data mining: discovering interesting patterns
    from large amounts of data
   A KDD process includes data cleaning, data
    integration, data selection, transformation, data
    mining, pattern evaluation, and knowledge
    presentation


Summary
    Mining can be performed in a variety of
     information repositories
    Data mining functionalities: characterization,
     association, classification, clustering, outlier
     and trend analysis, etc.
    Classification of data mining systems
    Major issues in data mining
Kinds of Data Mining
        Decision Tree Learning
        Clustering
        Neural Networks
        Association Rules
        Support Vector Machines
        Genetic Algorithms
        Nearest Neighbor Method

DECISION TREE FOR THE CONCEPT

                                   “Play Tennis”
        Day      Outlook    Temp    Humidity   Wind     PlayTennis

           D1    Sunny      Hot     High       Weak     No
           D2    Sunny      Hot     High       Strong   No
           D3    Overcast   Hot     High       Weak     Yes
           D4    Rain       Mild    High       Weak     Yes
           D5    Rain       Cool    Normal     Weak     Yes
           D6    Rain       Cool    Normal     Strong   No
           D7    Overcast   Cool    Normal     Strong   Yes
           D8    Sunny      Mild    High       Weak     No
           D9    Sunny      Cool    Normal     Weak     Yes
           D10   Rain       Mild    Normal     Weak     Yes
           D11   Sunny      Mild    Normal     Strong   Yes
           D12   Overcast   Mild    High       Strong   Yes
           D13   Overcast   Hot     Normal     Weak     Yes
           D14   Rain       Mild    High       Strong   No

        [Mitchell, 1997]
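
To see which attribute of the table separates the PlayTennis classes best, the information gain of each attribute can be computed. The sketch below encodes the 14 rows above; the function names are illustrative.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temp", "Humidity", "Wind"]

# Rows D1..D14 from the table; the last field is the PlayTennis label.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, index):
    """Expected reduction in entropy from splitting rows on one attribute."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[index] for row in rows}:
        subset = [row[-1] for row in rows if row[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:9s} {gain:.3f}")
```

Outlook comes out with the highest gain (about 0.25), which is why it is the natural choice for the root of the tree.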


DECISION TREE FOR THE CONCEPT

                “Play Tennis”




                           [Mitchell,1997]
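
The tree figure itself did not survive on this slide. Assuming it is the well-known PlayTennis tree from Mitchell (1997), it can be written as nested tuples and used to classify an instance as follows; the data structure and the classify helper are illustrative choices, not part of the slides.

```python
# Assumed to match the PlayTennis tree in Mitchell (1997).
# An inner node is (attribute, {value: subtree}); a leaf is the class label.
TREE = (
    "Outlook",
    {
        "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
    },
)


def classify(tree, instance):
    """Sort an instance down the tree from the root until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


saturday = {"Outlook": "Sunny", "Temp": "Cool", "Humidity": "Normal", "Wind": "Weak"}
decision = classify(TREE, saturday)
print("PlayTennis:", decision)
```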


Editor's notes

  1. One Midwest grocery chain used a data mining tool to analyze local buying patterns. It discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays; on Thursdays they bought only a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, it could move the beer display closer to the diaper display, and it could make sure that beer and diapers were sold at full price on Thursdays.

     WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products. These suppliers use the data to identify customer buying patterns at the store display level, and they use this information to manage local store inventory and identify new merchandising opportunities.

     Data mining can also be used to build a model of customer behavior that predicts which customers are likely to respond to a new product. Using this information, a marketing manager can select only the customers who are most likely to respond.

     The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of a game can reveal that when player A played the guard position, the opposing team's player B attempted four jump shots and made each one. Advanced Scout not only finds this pattern but also explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the team during that game.
  2. Decision tree (DT) algorithms have been successfully applied to a wide range of learning tasks, from medical diagnosis to classifying equipment malfunctions by their cause. Simple to understand. Works with data types.
  3. Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values of this attribute. Example: this tree classifies Saturday mornings according to whether or not they are suitable for playing tennis.

     This family of algorithms infers decision trees by growing them from the root downward, greedily selecting the next best attribute for each new decision branch added to the tree. During the top-down construction of the tree, the algorithm must decide which attribute to place at the root and which to split on later. To determine which attribute is the best classifier of the input instances, the algorithm uses a statistical measure called information gain: the expected reduction in entropy caused by partitioning the examples according to that attribute. It measures how well a given attribute separates the training examples according to their target classification.
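
The greedy top-down construction described in this note can be sketched in a few lines. This is a simplified ID3-style illustration on the PlayTennis examples, not a full implementation: it has no pruning and no handling of numeric or missing values, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("Outlook", "Temp", "Humidity", "Wind")

# PlayTennis examples (Mitchell, 1997); the last field is the class label.
EXAMPLES = [row.split() for row in """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
""".strip().splitlines()]


def entropy(rows):
    """Entropy (in bits) of the class labels in rows."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def gain(rows, i):
    """Information gain of splitting rows on attribute index i."""
    sizes = Counter(row[i] for row in rows)
    remainder = sum(
        n / len(rows) * entropy([row for row in rows if row[i] == value])
        for value, n in sizes.items()
    )
    return entropy(rows) - remainder


def id3(rows, indices):
    """Grow a tree from the root downward, always splitting on the best attribute."""
    labels = Counter(row[-1] for row in rows)
    if len(labels) == 1 or not indices:
        return labels.most_common(1)[0][0]  # leaf: the (majority) class label
    best = max(indices, key=lambda i: gain(rows, i))
    rest = [i for i in indices if i != best]
    branches = {
        value: id3([row for row in rows if row[best] == value], rest)
        for value in sorted({row[best] for row in rows})
    }
    return (ATTRIBUTES[best], branches)


tree = id3(EXAMPLES, list(range(len(ATTRIBUTES))))
print(tree)
```

On this data the sketch places Outlook at the root, then splits the Sunny branch on Humidity and the Rain branch on Wind.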