Chapter 35

              Data Mining
             Transparencies
Chapter Objectives

     The concepts associated with data mining.
     The main features of data mining
     operations, including predictive modeling,
     database segmentation, link analysis, and
     deviation detection.
     The techniques associated with the data
     mining operations.



                                                  2
Chapter Objectives

     The process of data mining.
     Important characteristics of data mining
     tools.
     The relationship between data mining and
     data warehousing.
     How Oracle supports data mining.




                                                3
Data Mining

    The process of extracting valid, previously
    unknown, comprehensible, and actionable
    information from large databases and using it
    to make crucial business decisions
    (Simoudis, 1996).

    Involves the analysis of data and the use of
    software techniques for finding hidden and
    unexpected patterns and relationships in sets of
    data.
                                                    4
Data Mining

    Reveals information that is hidden and
    unexpected, as there is little value in finding
    patterns and relationships that are already
    intuitive.

    Patterns and relationships are identified by
    examining the underlying rules and features in
    the data.



                                                      5
Data Mining

    Tends to work from the data up; the most
    accurate results normally require large
    volumes of data to support reliable conclusions.

    Starts by developing an optimal representation
    of the structure of sample data, during which
    time knowledge is acquired and then extended to
    larger sets of data.


                                                       6
Data Mining

    Data mining can provide huge paybacks for
    companies who have made a significant
    investment in data warehousing.

    Relatively new technology, but already
    used in a number of industries.




                                                 7
Examples of Applications of Data Mining

     Retail / Marketing
     – Identifying buying patterns of customers
     – Finding associations among customer
       demographic characteristics
     – Predicting response to mailing campaigns
     – Market basket analysis




                                                  8
Examples of Applications of Data Mining

     Banking
     – Detecting patterns of fraudulent credit card
       use
     – Identifying loyal customers
     – Predicting customers likely to change their
       credit card affiliation
     – Determining credit card spending by
       customer groups

                                                      9
Examples of Applications of Data Mining

     Insurance
      – Claims analysis
      – Predicting which customers will buy new
        policies

     Medicine
     – Characterizing patient behavior to predict
       surgery visits
     – Identifying successful medical therapies for
       different illnesses
                                                      10
Data Mining Operations
  The four main operations are:
   – Predictive modeling
   – Database segmentation
   – Link analysis
   – Deviation detection

  There are recognized associations between the
  applications and the corresponding operations.
   – e.g. Direct marketing strategies use database
     segmentation.
                                                      11
Data Mining Techniques

     Techniques are specific implementations of the
     data mining operations.

     Each operation has its own strengths and
     weaknesses.




                                                      12
Data Mining Techniques

     Data mining tools sometimes offer a choice of
     techniques to implement an operation.

     Criteria for selecting a tool include:
     – Suitability for certain input data types
     – Transparency of the mining output
     – Tolerance of missing variable values
     – Level of accuracy possible
     – Ability to handle large volumes of data
                                                     13
Data Mining Operations and Associated
Techniques




                                        14
Predictive Modeling
     Similar to the human learning experience
      – uses observations to form a model of the
        important characteristics of some
        phenomenon.

     Uses generalizations of the ‘real world’ and the
     ability to fit new data into a general framework.

     Can analyze a database to determine the essential
     characteristics (model) of the data set.
                                                        15
Predictive Modeling

     Model is developed using a supervised learning
     approach, which has two phases: training and
     testing.
      – Training builds a model using a large
        sample of historical data called a training
        set.
      – Testing involves trying out the model on
        new, previously unseen data to determine its
        accuracy and physical performance
        characteristics.
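
The two phases can be sketched as follows. This is a minimal illustration, not a method from the slides: the loan history is invented, the "model" is a single income threshold, and the function names are hypothetical.

```python
import random

def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and hold back a fraction as unseen test data."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_threshold(training_set):
    """Training phase: pick the income threshold that classifies the most
    training records correctly (predict 'repaid' when income >= threshold)."""
    best_threshold, best_correct = None, -1
    for candidate, _ in training_set:
        correct = sum((income >= candidate) == repaid
                      for income, repaid in training_set)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold

def accuracy(threshold, records):
    """Testing phase: fraction of records the threshold model gets right."""
    hits = sum((income >= threshold) == repaid for income, repaid in records)
    return hits / len(records)

# Hypothetical history: (annual income in thousands, loan repaid?)
history = [(income, income >= 40) for income in range(10, 90, 4)]
training_set, test_set = split_train_test(history)
model = train_threshold(training_set)
print(f"threshold={model}, test accuracy={accuracy(model, test_set):.2f}")
```

Accuracy is measured only on the held-back records, which the training phase never saw.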
                                                   16
Predictive Modeling

     Applications of predictive modeling include
     customer retention management, credit
     approval, cross selling, and direct marketing.

     There are two techniques associated with
     predictive modeling: classification and value
     prediction, which are distinguished by the
     nature of the variable being predicted.


                                                      17
Predictive Modeling - Classification
     Used to establish a specific predetermined class
     for each record in a database from a finite set
     of possible class values.

     Two specializations of classification: tree
     induction and neural induction.
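
As a rough sketch of the first step of tree induction, the code below picks the attribute to test at the root of a tree. It assumes Gini impurity as the splitting criterion, which is one common choice and is not stated in the slides; the customer records and attribute names are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(records, attributes, target):
    """Return the attribute whose split gives the lowest weighted impurity."""
    def weighted_impurity(attribute):
        groups = {}
        for record in records:
            groups.setdefault(record[attribute], []).append(record[target])
        return sum(len(g) / len(records) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_impurity)

# Hypothetical records, invented for illustration only.
customers = [
    {"rents_over_2_years": True,  "over_25": True,  "outcome": "buy"},
    {"rents_over_2_years": True,  "over_25": True,  "outcome": "buy"},
    {"rents_over_2_years": True,  "over_25": False, "outcome": "rent"},
    {"rents_over_2_years": False, "over_25": True,  "outcome": "rent"},
    {"rents_over_2_years": False, "over_25": False, "outcome": "rent"},
    {"rents_over_2_years": False, "over_25": True,  "outcome": "rent"},
]
root = best_split(customers, ["rents_over_2_years", "over_25"], "outcome")
print(root)
```

A full tree would repeat the same choice on each resulting group until the groups are pure or too small to split.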




                                                    18
Example of Classification using Tree Induction




                                             19
Example of Classification using Neural
Induction




                                         20
Predictive Modeling - Value Prediction

     Used to estimate a continuous numeric value
     that is associated with a database record.

     Uses the traditional statistical techniques of
     linear regression and nonlinear regression.

     Relatively easy to use and understand.



                                                      21
Predictive Modeling - Value Prediction

     Linear regression attempts to fit a straight line
     through a plot of the data, such that the line is
     the best representation of the average of all
     observations at that point in the plot.

     The problem is that the technique only works
     well with linear data and is sensitive to the
     presence of outliers (that is, data values that
     do not conform to the expected norm).
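
The sensitivity to outliers can be seen in a small worked example. The sketch below assumes ordinary least squares as the fitting method and uses invented data points; changing a single observation triples the fitted slope.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
clean = fit_line(xs, [2, 4, 6, 8, 10])    # perfectly linear data
skewed = fit_line(xs, [2, 4, 6, 8, 30])   # last value replaced by an outlier

print(clean)   # (2.0, 0.0)
print(skewed)  # (6.0, -8.0)
```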

                                                         22
Predictive Modeling - Value Prediction

     Although nonlinear regression avoids the main
     problems of linear regression, it is still not
     flexible enough to handle all possible shapes of
     the data plot.

     Statistical measurements are fine for building
     linear models that describe predictable data
     points; however, most data is not linear in
     nature.

                                                        23
Predictive Modeling - Value Prediction

     Data mining requires statistical methods that
     can accommodate non-linearity, outliers, and
     non-numeric data.

     Applications of value prediction include credit
     card fraud detection and target mailing list
     identification.



                                                       24
Database Segmentation

     Aim is to partition a database into an unknown
     number of segments, or clusters, of similar
     records.

     Uses unsupervised learning to discover
     homogeneous sub-populations in a database to
     improve the accuracy of the profiles.
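
A minimal sketch of unsupervised segmentation follows. It swaps in k-means, a simple clustering technique not named in the slides, and it assumes the number of segments (two) and the starting centroids are supplied by the caller, whereas the slides describe the number of segments as unknown. The customer records are invented.

```python
import math

def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move every centroid
    to the mean of the points assigned to it; repeat a fixed number of times."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical (age, yearly spend in thousands) records.
records = [(22, 3), (25, 4), (24, 5), (58, 20), (61, 22), (60, 19)]
centroids, clusters = k_means(records, centroids=[records[0], records[3]])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No class labels are supplied; the two groups emerge only from the distances between records.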



                                                    25
Database Segmentation
   Less precise than other operations, and therefore
   less sensitive to redundant and irrelevant features.

   Sensitivity can be reduced by ignoring a subset
   of the attributes that describe each instance or
   by assigning a weighting factor to each
   variable.
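
One way to picture attribute weighting is a weighted Euclidean distance, sketched below. The distance measure, the records, and the weights are all assumptions made for illustration; a weight of 0 ignores an attribute entirely.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance in which each attribute is scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical records: (age, income in thousands, customer id)
a = (30, 40, 1001)
b = (33, 44, 9876)

full = weighted_distance(a, b, (1, 1, 1))         # dominated by the id
ignoring_id = weighted_distance(a, b, (1, 1, 0))  # irrelevant id ignored
print(full, ignoring_id)
```

With the identifier weighted out, the two customers are 5.0 units apart instead of several thousand.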

   Applications of database segmentation include
   customer profiling, direct marketing, and cross
   selling.
                                                      26
Example of Database Segmentation using a
Scatterplot




                                           27
Database Segmentation

     Associated with demographic or neural
     clustering techniques, which are distinguished
     by
      – Allowable data inputs
      – Methods used to calculate the distance
        between records
      – Presentation of the resulting segments for
        analysis


                                                      28
Link Analysis
   Aims to establish links (associations) between
   records, or sets of records, in a database.

   There are three specializations
   – Associations discovery
   – Sequential pattern discovery
   – Similar time sequence discovery

   Applications include product affinity analysis,
   direct marketing, and stock price movement.
                                                     29
Link Analysis - Associations Discovery

     Finds items that imply the presence of other
     items in the same event.

     Affinities between items are represented by
     association rules.
      – e.g. ‘When a customer rents property for
        more than 2 years and is more than 25 years
        old, in 40% of cases, the customer will buy a
        property. This association happens in 35%
        of all customers who rent properties’.
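
The two percentages in a rule like this are usually called confidence and support. The sketch below computes both under the standard definitions, which are assumed here rather than taken from the slides: support is the share of all events containing both sides of the rule, and confidence is the share of events containing the left side that also contain the right side. The baskets are invented.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical shopping baskets, one set of items per event.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
support, confidence = rule_metrics(baskets, {"bread"}, {"butter"})
print(support, confidence)  # 0.6 0.75
```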
                                                    30
Link Analysis - Sequential Pattern Discovery

     Finds patterns between events such that the
     presence of one set of items is followed by
     another set of items in a database of events
     over a period of time.
      – e.g. Used to understand long-term customer
        buying behavior.
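
A minimal sketch of the idea: count how many customers bought one item and then, at some later point, another. The purchase histories and the function name are hypothetical, and real tools search for such patterns rather than checking one given pair.

```python
def sequence_support(histories, first, then):
    """Fraction of customers whose history contains `first` followed,
    at any later position, by `then`."""
    def follows(history):
        if first not in history:
            return False
        start = history.index(first)
        return then in history[start + 1:]
    return sum(follows(h) for h in histories) / len(histories)

# Hypothetical purchase histories, each in time order.
histories = [
    ["laptop", "mouse", "printer"],
    ["printer", "laptop"],
    ["laptop", "printer"],
    ["mouse", "keyboard"],
]
print(sequence_support(histories, "laptop", "printer"))  # 0.5
```

Unlike an association rule, the order matters: the second customer bought both items but in the opposite order, so that history does not count.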




                                                     31
Link Analysis - Similar Time Sequence
Discovery
     Finds links between two sets of data that are
     time-dependent, and is based on the degree of
     similarity between the patterns that both time
     series demonstrate.
      – e.g. Within three months of buying property,
        new home owners will purchase goods such
        as cookers, freezers, and washing machines.
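
One simple way to score the similarity between two time series is the Pearson correlation coefficient, sketched below. This measure is an assumption chosen for illustration and is not named in the slides; the monthly sales figures are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: +1 for identical shape, -1 for mirrored shape."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical monthly sales for three stores.
store_a = [10, 12, 15, 11, 9, 14]
store_b = [20, 24, 30, 22, 18, 28]  # same pattern as store_a at twice the scale
store_c = [14, 9, 11, 15, 12, 10]   # store_a in reverse order

print(pearson(store_a, store_b))
print(pearson(store_a, store_c))
```

Stores A and B score close to 1 because their series rise and fall together, even though the absolute values differ.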



                                                   32
Deviation Detection

     Relatively new operation in terms of
     commercially available data mining tools.

     Often a source of true discovery because it
     identifies outliers, which express deviation
     from some previously known expectation and
     norm.



                                                    33
Deviation Detection

     Can be performed using statistics and
     visualization techniques or as a by-product of
     data mining.
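
As a sketch of the statistical route, the code below flags values that lie more than two standard deviations from the mean. The z-score rule and the threshold of 2 are assumptions chosen for illustration, and the claim amounts are invented.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical insurance claim amounts.
claims = [120, 130, 125, 118, 122, 128, 900]
print(find_outliers(claims))  # [900]
```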

     Applications include fraud detection in the use
     of credit cards and insurance claims, quality
     control, and defects tracing.



                                                       34
Example of Database Segmentation using a
Visualization




                                           35
The Data Mining Process

     Recognizing that a systematic approach is
     essential to successful data mining, many
     vendor and consulting organizations have
     specified a process model designed to guide the
     user through a sequence of steps that will lead
     to good results.

     These organizations developed a specification
     called the Cross Industry Standard Process for
     Data Mining (CRISP-DM).
                                                       36
The Data Mining Process

     CRISP-DM specifies a data mining process
     model that is not specific to a particular
     industry or tool.

     CRISP-DM has evolved from the knowledge
     discovery processes used widely in industry
     and in direct response to user requirements.



                                                     37
The Data Mining Process

     The major aims of CRISP-DM are to make
     large data mining projects more efficient,
     cheaper, more reliable, and more
     manageable.

     CRISP-DM is a hierarchical process model. At
     the top level, the process is divided into six
     different generic phases, ranging from business
     understanding to deployment of project
     results.
                                                    38
The Data Mining Process

     The next level elaborates each of these phases
     as comprising several generic tasks. At this
     level, the description is generic enough to cover
     all the DM scenarios.

     The third level specialises these tasks for
     specific situations. For instance, the generic
     task might be cleaning data, and the specialised
     task could be cleaning numeric values or
     categorical values.
                                                      39
The Data Mining Process

     The fourth level is the process instance; that is,
     a record of the actions, decisions, and results of
     an actual execution of a DM project.

     The model also discusses relationships between
     different DM tasks. It gives an idealised sequence
     of actions during a DM project.



                                                         40
Phases of the CRISP-DM Model




                               41
Data Mining Tools

     There is a growing number of commercial
     data mining tools on the market.

     Important characteristics of data mining tools
     include:
      – Data preparation facilities
      – Selection of data mining operations
      – Product scalability and performance
      – Facilities for understanding results
                                                      42
Data Mining Tools

     Data preparation facilities
     – Data preparation is the most time-
       consuming aspect of data mining.
     – Functions supported include: data
       preparation, data cleansing, data describing,
       data transforming and data sampling.
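
Three of these functions can be sketched in a few lines. The choices below are assumptions made for illustration: cleansing drops records with missing values, transforming rescales a numeric field to the range 0 to 1, and sampling draws a random subset. The records and function names are hypothetical.

```python
import random

def cleanse(records):
    """Data cleansing: drop records that have a missing (None) value."""
    return [r for r in records if None not in r.values()]

def normalise(records, field):
    """Data transforming: rescale a numeric field to the range 0..1."""
    values = [r[field] for r in records]
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid dividing by zero when all values match
    return [{**r, field: (r[field] - low) / span} for r in records]

def sample(records, size, seed=0):
    """Data sampling: draw a reproducible random subset."""
    return random.Random(seed).sample(records, size)

# Hypothetical raw records.
raw = [
    {"age": 20, "city": "Leeds"},
    {"age": None, "city": "York"},
    {"age": 40, "city": "Hull"},
    {"age": 60, "city": "Bath"},
]
prepared = normalise(cleanse(raw), "age")
subset = sample(prepared, 2)
print([r["age"] for r in prepared])  # [0.0, 0.5, 1.0]
```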




                                                   43
Data Mining Tools

     Selection of data mining operations
      – Important to understand the characteristics
        of the operations (algorithms) to ensure that
        they meet the user’s requirements.
      – In particular, important to establish how the
        algorithms treat the data types of the
        response and predictor variables, how fast
        they train, and how fast they work on new
        data.

                                                    44
Data Mining Tools

     Product scalability and performance
      – Capable of dealing with increasing amounts
        of data, possibly with sophisticated
        validation controls.
      – Maintaining satisfactory performance may
        require investigations into whether a tool is
        capable of supporting parallel processing
        using technologies such as SMP or MPP.


                                                        45
Data Mining Tools

     Facilities for understanding results
      – Helps the user understand results by
        providing measures such as those
        describing accuracy and significance in
        useful formats such as confusion matrices,
        by allowing the user to perform sensitivity
        analysis on the result, and by presenting the
        result in alternative ways using, for example,
        visualization techniques.
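
A confusion matrix simply counts each combination of actual and predicted class. The sketch below builds one from two invented label lists and derives accuracy from it; the labels and the function name are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count every (actual class, predicted class) pair."""
    return Counter(zip(actual, predicted))

# Hypothetical test results for a fraud classifier.
actual    = ["fraud", "ok", "ok",    "fraud", "ok", "ok"]
predicted = ["fraud", "ok", "fraud", "ok",    "ok", "ok"]

matrix = confusion_matrix(actual, predicted)
accuracy = sum(n for (a, p), n in matrix.items() if a == p) / len(actual)

for (a, p), n in sorted(matrix.items()):
    print(f"actual={a:5} predicted={p:5} count={n}")
print(f"accuracy={accuracy:.2f}")
```

The off-diagonal counts show which kind of error the model makes: here one fraud case was missed and one legitimate case was flagged.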


                                                        46
Data Mining and Data Warehousing

     A major challenge in exploiting data mining is
     identifying suitable data to mine.

     Data mining requires a single, separate, clean,
     integrated, and self-consistent source of data.




                                                       47
Data Mining and Data Warehousing

     A data warehouse is well equipped for
     providing data for mining.

     Data quality and consistency are prerequisites
     for mining to ensure the accuracy of the
     predictive models. Data warehouses are
     populated with clean, consistent data.



                                                       48
Data Mining and Data Warehousing

     It is advantageous to mine data from multiple
     sources to discover as many interrelationships
     as possible. Data warehouses contain data from
     a number of sources.

     Selecting the relevant subsets of records and
     fields for data mining requires the query
     capabilities of the data warehouse.


                                                     49
Data Mining and Data Warehousing

     The results of a data mining study are useful if
     there is some way to further investigate the
     uncovered patterns. Data warehouses provide
     the capability to go back to the data source.




                                                        50


Ch35

  • 1. Chapter 35 Data Mining Transparencies
  • 2. Chapter Objectives The concepts associated with data mining. The main features of data mining operations, including predictive modeling, database segmentation, link analysis, and deviation detection. The techniques associated with the data mining operations. 2
  • 3. Chapter Objectives The process of data mining. Important characteristics of data mining tools. The relationship between data mining and data warehousing. How Oracle supports data mining. 3
  • 4. Data Mining The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions, (Simoudis,1996). Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data. 4
  • 5. Data Mining Reveals information that is hidden and unexpected, as little value in finding patterns and relationships that are already intuitive. Patterns and relationships are identified by examining the underlying rules and features in the data. 5
  • 6. Data Mining Tends to work from the data up and most accurate results normally require large volumes of data to deliver reliable conclusions. Starts by developing an optimal representation of structure of sample data, during which time knowledge is acquired and extended to larger sets of data. 6
  • 7. Data Mining Data mining can provide huge paybacks for companies who have made a significant investment in data warehousing. Relatively new technology, however already used in a number of industries. 7
  • 8. Examples of Applications of Data Mining Retail / Marketing – Identifying buying patterns of customers – Finding associations among customer demographic characteristics – Predicting response to mailing campaigns – Market basket analysis 8
  • 9. Examples of Applications of Data Mining Banking – Detecting patterns of fraudulent credit card use – Identifying loyal customers – Predicting customers likely to change their credit card affiliation – Determining credit card spending by customer groups 9
  • 10. Examples of Applications of Data Mining Insurance – Claims analysis – Predicting which customers will buy new policies Medicine – Characterizing patient behavior to predict surgery visits – Identifying successful medical therapies for different illnesses 10
  • 11. Data Mining Operations Four main operations include: – Predictive modeling – Database segmentation – Link analysis – Deviation detection There are recognized associations between the applications and the corresponding operations. – e.g. Direct marketing strategies use database segmentation. 11
  • 12. Data Mining Techniques Techniques are specific implementations of the data mining operations. Each operation has its own strengths and weaknesses. 12
  • 13. Data Mining Techniques Data mining tools sometimes offer a choice of operations to implement a technique. Criteria for selection of tool includes – Suitability for certain input data types – Transparency of the mining output – Tolerance of missing variable values – Level of accuracy possible – Ability to handle large volumes of data 13
  • 14. Data Mining Operations and Associated Techniques 14
  • 15. Predictive Modeling Similar to the human learning experience – uses observations to form a model of the important characteristics of some phenomenon. Uses generalizations of ‘real world’ and ability to fit new data into a general framework. Can analyze a database to determine essential characteristics (model) about the data set. 15
  • 16. Predictive Modeling Model is developed using a supervised learning approach, which has two phases: training and testing. – Training builds a model using a large sample of historical data called a training set. – Testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics. 16
  • 17. Predictive Modeling Applications of predictive modeling include customer retention management, credit approval, cross selling, and direct marketing. There are two techniques associated with predictive modeling: classification and value prediction, which are distinguished by the nature of the variable being predicted. 17
  • 18. Predictive Modeling - Classification Used to establish a specific predetermined class for each record in a database from a finite set of possible, class values. Two specializations of classification: tree induction and neural induction. 18
  • 19. Example of Classification using Tree Induction 19
  • 20. Example of Classification using Neural Induction 20
  • 21. Predictive Modeling - Value Prediction Used to estimate a continuous numeric value that is associated with a database record. Uses the traditional statistical techniques of linear regression and nonlinear regression. Relatively easy-to-use and understand. 21
  • 22. Predictive Modeling - Value Prediction Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot. Problem is that the technique only works well with linear data and is sensitive to the presence of outliers (that is, data values, which do not conform to the expected norm). 22
  • 23. Predictive Modeling - Value Prediction Although nonlinear regression avoids the main problems of linear regression, it is still not flexible enough to handle all possible shapes of the data plot. Statistical measurements are fine for building linear models that describe predictable data points, however, most data is not linear in nature. 23
  • 24. Predictive Modeling - Value Prediction Data mining requires statistical methods that can accommodate non-linearity, outliers, and non-numeric data. Applications of value prediction include credit card fraud detection or target mailing list identification. 24
  • 25. Database Segmentation Aim is to partition a database into an unknown number of segments, or clusters, of similar records. Uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles. 25
  • 26. Database Segmentation Less precise than other operations thus less sensitive to redundant and irrelevant features. Sensitivity can be reduced by ignoring a subset of the attributes that describe each instance or by assigning a weighting factor to each variable. Applications of database segmentation include customer profiling, direct marketing, and cross selling. 26
  • 27. Example of Database Segmentation using a Scatterplot 27
  • 28. Database Segmentation Associated with demographic or neural clustering techniques, which are distinguished by – Allowable data inputs – Methods used to calculate the distance between records – Presentation of the resulting segments for analysis 28
  • 29. Link Analysis Aims to establish links (associations) between records, or sets of records, in a database. There are three specializations – Associations discovery – Sequential pattern discovery – Similar time sequence discovery Applications include product affinity analysis, direct marketing, and stock price movement. 29
  • 30. Link Analysis - Associations Discovery Finds items that imply the presence of other items in the same event. Affinities between items are represented by association rules. – e.g. ‘When a customer rents property for more than 2 years and is more than 25 years old, in 40% of cases, the customer will buy a property. This association happens in 35% of all customers who rent properties’. 30
  • 31. Link Analysis - Sequential Pattern Discovery Finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. – e.g. Used to understand long term customer buying behavior. 31
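A minimal sketch of the idea behind sequential pattern discovery: checking how often one event is later followed by another in each customer's history. The helper `follows` and the event histories are hypothetical, and real tools search for such patterns rather than testing one given pair.

```python
def follows(events, first, second):
    """True if `second` occurs somewhere after the earliest `first`."""
    try:
        start = events.index(first)
    except ValueError:
        return False
    return second in events[start + 1:]


# Hypothetical time-ordered event histories, one list per customer.
histories = {
    "c1": ["rent flat", "buy house", "buy cooker"],
    "c2": ["buy cooker", "buy house"],
    "c3": ["buy house", "buy freezer", "buy cooker"],
    "c4": ["rent flat"],
}

with_first = [h for h in histories.values() if "buy house" in h]
matching = [h for h in with_first if follows(h, "buy house", "buy cooker")]
ratio = len(matching) / len(with_first)
print(ratio)
```

Customer c2 bought a cooker before the house, so that history does not count toward the pattern; order is what separates this from plain associations discovery.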
  • 32. Link Analysis - Similar Time Sequence Discovery Finds links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate. – e.g. Within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines. 32
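One simple way to score the degree of similarity between two time series is the Pearson correlation coefficient. The slides do not say which similarity measure is used, so this choice is an assumption, as are the function name `pearson` and the sample series.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical monthly figures for three series.
house_sales = [1, 2, 3, 4, 5]
cooker_sales = [2, 4, 6, 8, 10]   # rises together with house_sales
rentals = [5, 4, 3, 2, 1]         # moves in the opposite direction

same_shape = pearson(house_sales, cooker_sales)
opposite_shape = pearson(house_sales, rentals)
print(same_shape, opposite_shape)
```

A value close to 1 means the two series follow the same pattern, and a value close to -1 means they mirror each other.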
  • 33. Deviation Detection Relatively new operation in terms of commercially available data mining tools. Often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm. 33
  • 34. Deviation Detection Can be performed using statistics and visualization techniques or as a by-product of data mining. Applications include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing. 34
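As a sketch of deviation detection using statistics, the snippet below flags values that lie far from the mean in units of standard deviation (a z-score test). The threshold of 2.0, the function name `find_outliers`, and the claim amounts are assumptions chosen for illustration.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical claim amounts; one is far from the rest.
claims = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = find_outliers(claims)
print(outliers)
```

The expectation here is simply the mean of the data; in practice the norm could come from a model or from business rules.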
  • 35. Example of Database Segmentation using a Visualization 35
  • 36. The Data Mining Process Recognizing that a systematic approach is essential to successful data mining, many vendors and consulting organizations have specified a process model designed to guide the user through a sequence of steps that will lead to good results. One such specification is the Cross Industry Standard Process for Data Mining (CRISP-DM). 36
  • 37. The Data Mining Process CRISP-DM specifies a data mining process model that is not specific to a particular industry or tool. CRISP-DM has evolved from the knowledge discovery processes used widely in industry and in direct response to user requirements. 37
  • 38. The Data Mining Process The major aims of CRISP-DM are to make large data mining projects run more efficiently and to make them cheaper, more reliable, and more manageable. CRISP-DM is a hierarchical process model. At the top level, the process is divided into six generic phases, ranging from business understanding to deployment of project results. 38
  • 39. The Data Mining Process The next level elaborates each of these phases as comprising several generic tasks. At this level, the description is generic enough to cover all DM scenarios. The third level specialises these tasks for specific situations. For instance, the generic task might be cleaning data, and the specialised task could be cleaning of numeric values or categorical values. 39
  • 40. The Data Mining Process The fourth level is the process instance; that is, a record of the actions, decisions, and results of an actual execution of a DM project. The model also discusses relationships between different DM tasks. It gives an idealised sequence of actions during a DM project. 40
  • 41. Phases of the CRISP-DM Model 41
  • 42. Data Mining Tools There is a growing number of commercial data mining tools in the marketplace. Important characteristics of data mining tools include: – Data preparation facilities – Selection of data mining operations – Product scalability and performance – Facilities for understanding results 42
  • 43. Data Mining Tools Data preparation facilities – Data preparation is the most time-consuming aspect of data mining. – Functions supported include: data cleansing, data describing, data transforming and data sampling. 43
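A small sketch of the data sampling step, assuming it is used to split records into a training set and a validation set (the slide does not say what the samples are for). The function name `split_sample`, the 25% fraction, and the fixed seed are all illustrative choices.

```python
import random


def split_sample(records, validation_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (training, validation)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * validation_fraction)
    return shuffled[cut:], shuffled[:cut]


# Hypothetical record identifiers.
records = list(range(20))
training, validation = split_sample(records)
print(len(training), len(validation))
```

Every record ends up in exactly one of the two sets, so a model built on the training set can be checked against records it has not seen.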
  • 44. Data Mining Tools Selection of data mining operations – Important to understand the characteristics of the operations (algorithms) to ensure that they meet the user’s requirements. – In particular, important to establish how the algorithms treat the data types of the response and predictor variables, how fast they train, and how fast they work on new data. 44
  • 45. Data Mining Tools Product scalability and performance – Capable of dealing with increasing amounts of data, possibly with sophisticated validation controls. – Maintaining satisfactory performance may require investigations into whether a tool is capable of supporting parallel processing using technologies such as SMP or MPP. 45
  • 46. Data Mining Tools Facilities for understanding results – A tool should help the user understand the results by providing measures such as those describing accuracy and significance in useful formats such as confusion matrices, by allowing the user to perform sensitivity analysis on the result, and by presenting the result in alternative ways using, for example, visualization techniques. 46
  • 47. Data Mining and Data Warehousing A major challenge in exploiting data mining is identifying suitable data to mine. Data mining requires a single, separate, clean, integrated, and self-consistent source of data. 47
  • 48. Data Mining and Data Warehousing A data warehouse is well equipped for providing data for mining. Data quality and consistency are prerequisites for mining to ensure the accuracy of the predictive models. Data warehouses are populated with clean, consistent data. 48
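A confusion matrix counts how often each actual class was predicted as each class. The sketch below builds one from two label lists and derives accuracy from it; the function names, labels, and predictions are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count occurrences of each (actual, predicted) label pair."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Share of cases where the predicted label equals the actual label."""
    total = sum(matrix.values())
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / total


# Hypothetical outcomes of a fraud classifier.
actual = ["fraud", "ok", "ok", "fraud", "ok", "ok"]
predicted = ["fraud", "ok", "fraud", "ok", "ok", "ok"]

matrix = confusion_matrix(actual, predicted)
model_accuracy = accuracy(matrix)
print(dict(matrix), model_accuracy)
```

The off-diagonal counts show which kinds of mistakes the model makes (here one missed fraud and one false alarm), which a single accuracy figure hides.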
  • 49. Data Mining and Data Warehousing It is advantageous to mine data from multiple sources to discover as many interrelationships as possible. Data warehouses contain data from a number of sources. Selecting the relevant subsets of records and fields for data mining requires the query capabilities of the data warehouse. 49
  • 50. Data Mining and Data Warehousing The results of a data mining study are most useful if there is some way to further investigate the uncovered patterns. Data warehouses provide the capability to go back to the data source. 50

Editor's notes

  1. September 98 1 Chapter Name