# Information Retrieval 08

28 Nov 2017


• 1. Induction of Decision Trees Blaž Zupan and Ivan Bratko magix.fri.uni-lj.si/predavanja/uisp
• 2. CS583, Bing Liu, UIC 2 An example application • An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients. • A decision is needed: whether to put a new patient in an intensive-care unit. • Due to the high cost of ICU, those patients who may survive less than a month are given higher priority. • Problem: to predict high-risk patients and discriminate them from low-risk patients.
• 3. CS583, Bing Liu, UIC 3 Another application • A credit card company receives thousands of applications for new cards. Each application contains information about an applicant, – age – marital status – annual salary – outstanding debts – credit rating – etc. • Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.
• 4. CS583, Bing Liu, UIC 4 Machine learning and our focus • Like human learning from past experiences. • A computer does not have “experiences”. • A computer system learns from data, which represent some “past experiences” of an application domain. • Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not-approved, and high-risk or low-risk. • The task is commonly called: supervised learning, classification, or inductive learning.
• 5. CS583, Bing Liu, UIC 5 The data and the goal • Data: A set of data records (also called examples, instances or cases) described by – k attributes: A1, A2, … Ak. – a class: each example is labelled with a pre-defined class. • Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
• 6. CS583, Bing Liu, UIC 6 An example: data (loan application) Approved or not
• 7. CS583, Bing Liu, UIC 7 An example: the learning task • Learn a classification model from the data • Use the model to classify future loan applications into – Yes (approved) and – No (not approved) • What is the class for the following case/instance?
• 8. CS583, Bing Liu, UIC 8 Supervised vs. unsupervised learning • Supervised learning: classification is seen as supervised learning from examples. – Supervision: The data (observations, measurements, etc.) are labeled with pre-defined classes. It is as if a “teacher” gives the classes (supervision). – Test data are classified into these classes too. • Unsupervised learning (clustering) – Class labels of the data are unknown – Given a set of data, the task is to establish the existence of classes or clusters in the data
• 9. CS583, Bing Liu, UIC 9 Supervised learning process: two steps  Learning (training): Learn a model using the training data  Testing: Test the model using unseen test data to assess the model accuracy: Accuracy = Number of correct classifications / Total number of test cases
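The accuracy formula above can be sketched in a few lines of Python; the predicted and true labels here are hypothetical, purely for illustration:

```python
# Accuracy = number of correct classifications / total number of test cases.
# Hypothetical predictions and true labels for 10 test cases.
predicted = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
actual    = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no", "yes", "no"]

correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)  # 7 correct out of 10 -> 0.7
```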
• 10. CS583, Bing Liu, UIC 10 What do we mean by learning? • Given – a data set D, – a task T, and – a performance measure M, a computer system is said to learn from D to perform the task T if after learning the system’s performance on T improves as measured by M. • In other words, the learned model helps the system to perform T better as compared to no learning.
• 11. CS583, Bing Liu, UIC 11 An example • Data: Loan application data • Task: Predict whether a loan should be approved or not. • Performance measure: accuracy. No learning: classify all future applications (test data) to the majority class (i.e., Yes): Accuracy = 9/15 = 60%. • We can do better than 60% with learning.
• 12. CS583, Bing Liu, UIC 12 Fundamental assumption of learning Assumption: The distribution of training examples is identical to the distribution of test examples (including future unseen examples). • In practice, this assumption is often violated to certain degree. • Strong violations will clearly result in poor classification accuracy. • To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.
• 13. Analysis of Severe Trauma Patients Data [Figure: decision tree with root PH_ICU (the worst pH value at ICU; branches <7.2, 7.2–7.33, >7.33) and an inner node APPT_WORST (the worst active partial thromboplastin time; branches <78.7, >=78.7); leaves: Death 0.0 (0/15), Death 0.0 (0/7), Well 0.82 (9/11), Well 0.88 (14/16).] PH_ICU and APPT_WORST are exactly the two factors (theoretically) advocated as the most important ones in the study by Rotondo et al., 1997.
• 14. Breast Cancer Recurrence — tree induced by Assistant Professional. Interesting: the accuracy of this tree compared to medical specialists. [Figure: decision tree splitting on Degree of Malig (< 3, >= 3), Tumor Size (< 15, >= 15), Involved Nodes (< 3, >= 3) and Age, with leaves giving class counts such as no_rec 125 / recurr 39 and no_rec 32 / recurr 0.]
• 15. Prostate cancer recurrence [Figure: decision tree splitting on Secondary Gleason Grade (1,2 / 3 / 4 / 5), PSA Level (≤14.9 / >14.9), Stage (T1c,T2a,T2b,T2c / T1ab,T3) and Primary Gleason Grade (2,3 / 4), with Yes/No leaves.]
• 16. CS583, Bing Liu, UIC 16 Introduction • Decision tree learning is one of the most widely used techniques for classification. – Its classification accuracy is competitive with other methods, and – it is very efficient. • The classification model is a tree, called decision tree. • C4.5 by Ross Quinlan is perhaps the best known system. It can be downloaded from the Web.
• 17. CS583, Bing Liu, UIC 17 The loan data (reproduced) Approved or not
• 18. CS583, Bing Liu, UIC 18 A decision tree from the loan data  Decision nodes and leaf nodes (classes)
• 19. CS583, Bing Liu, UIC 19 Use the decision tree [Figure: the test case is passed down the tree; the predicted class is No.]
• 20. CS583, Bing Liu, UIC 20 Is the decision tree unique?  No. Here is a simpler tree.  We want a tree that is both small and accurate: smaller trees are easier to understand and tend to perform better.  Finding the best tree is NP-hard.  All current tree-building algorithms are heuristic algorithms
• 21. An Example Data Set and Decision Tree [Figure: tree with root outlook — sunny → yes; rainy → company (no → no, big → yes, med → sailboat: small → yes, big → no).]

| # | Outlook | Company | Sailboat | Sail? |
|---|---------|---------|----------|-------|
| 1 | sunny | big | small | yes |
| 2 | sunny | med | small | yes |
| 3 | sunny | med | big | yes |
| 4 | sunny | no | small | yes |
| 5 | sunny | big | big | yes |
| 6 | rainy | no | small | no |
| 7 | rainy | med | small | yes |
| 8 | rainy | big | big | yes |
| 9 | rainy | no | big | no |
| 10 | rainy | med | big | no |
• 22. Classification — use the induced tree to classify new instances:

| # | Outlook | Company | Sailboat | Sail? |
|---|---------|---------|----------|-------|
| 1 | sunny | no | big | ? |
| 2 | rainy | big | small | ? |
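As a sketch of how classification with the induced tree works, the sailing tree can be encoded as nested dictionaries and walked attribute by attribute; the exact branch layout is my reading of the flattened tree figure, so treat it as an assumption:

```python
# Decision tree from the sailing example, encoded as nested dicts:
# an internal node maps one attribute's values to subtrees; leaves are class labels.
# Branch layout read off the slide figure (assumption).
tree = {"outlook": {
    "sunny": "yes",
    "rainy": {"company": {
        "no": "no",
        "big": "yes",
        "med": {"sailboat": {"small": "yes", "big": "no"}},
    }},
}}

def classify(node, instance):
    """Follow the branches matching the instance until a leaf (a string) is reached."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[instance[attribute]]
    return node

case1 = {"outlook": "sunny", "company": "no", "sailboat": "big"}
case2 = {"outlook": "rainy", "company": "big", "sailboat": "small"}
```

Walking the tree: case 1 is sunny, so it is classified yes at the first branch; case 2 is rainy with a big company, so it is also classified yes.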
• 23. Induction of Decision Trees • Data Set (Learning Set) – Each example = Attributes + Class • Induced description = Decision tree • TDIDT – Top Down Induction of Decision Trees • Recursive Partitioning
• 24. Some TDIDT Systems • ID3 (Quinlan 79) • CART (Breiman et al. 84) • Assistant (Cestnik et al. 87) • C4.5 (Quinlan 93) • See5 (Quinlan 97) • ... • Orange (Demšar, Zupan 98-03)
• 25. TDIDT Algorithm • Also known as ID3 (Quinlan) • To construct decision tree T from learning set S: – If all examples in S belong to some class C Then make leaf labeled C – Otherwise • select the “most informative” attribute A • partition S according to A’s values • recursively construct subtrees T1, T2, ..., for the subsets of S
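The recursion above can be sketched in a few dozen lines; this is a minimal TDIDT using information gain (equivalently, minimal residual information) as the "most informative" criterion, with function and variable names of my own choosing, run on the sailing data set from slide 21:

```python
import math
from collections import Counter

def entropy(labels):
    """I = -sum_c p(c) log2 p(c) over the class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def tdidt(examples, attributes):
    """examples: list of (attribute-dict, class-label) pairs."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:      # pure node, or no attribute left
        return Counter(labels).most_common(1)[0][0]  # leaf labelled with (majority) class

    def residual(a):  # Ires(A): weighted entropy of the partition induced by A
        total = 0.0
        for v in set(ex[a] for ex, _ in examples):
            subset = [label for ex, label in examples if ex[a] == v]
            total += len(subset) / len(examples) * entropy(subset)
        return total

    best = min(attributes, key=residual)             # max gain == min residual information
    rest = [a for a in attributes if a != best]
    return {best: {v: tdidt([(ex, l) for ex, l in examples if ex[best] == v], rest)
                   for v in set(ex[best] for ex, _ in examples)}}

# The sailing data set (slide 21).
data = [
    ({"outlook": "sunny", "company": "big", "sailboat": "small"}, "yes"),
    ({"outlook": "sunny", "company": "med", "sailboat": "small"}, "yes"),
    ({"outlook": "sunny", "company": "med", "sailboat": "big"},   "yes"),
    ({"outlook": "sunny", "company": "no",  "sailboat": "small"}, "yes"),
    ({"outlook": "sunny", "company": "big", "sailboat": "big"},   "yes"),
    ({"outlook": "rainy", "company": "no",  "sailboat": "small"}, "no"),
    ({"outlook": "rainy", "company": "med", "sailboat": "small"}, "yes"),
    ({"outlook": "rainy", "company": "big", "sailboat": "big"},   "yes"),
    ({"outlook": "rainy", "company": "no",  "sailboat": "big"},   "no"),
    ({"outlook": "rainy", "company": "med", "sailboat": "big"},   "no"),
]
tree = tdidt(data, ["outlook", "company", "sailboat"])
```

On this data the induced root is outlook, with the sunny branch immediately pure (all yes), matching the tree on slide 21.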
• 26. Hunt’s Algorithm [Figure: step-by-step tree growth on a cheat-detection data set — start with a single leaf Don’t Cheat; split on Refund (Yes → Don’t Cheat); then split the No branch on Marital Status (Married → Don’t Cheat; Single, Divorced → Cheat); finally refine the Single, Divorced branch with Taxable Income (< 80K → Don’t Cheat, >= 80K → Cheat).]
• 27. TDIDT Algorithm • Resulting tree T is: [Figure: root attribute A with values v1, v2, ..., vn leading to subtrees T1, T2, ..., Tn.]
• 28. Another Example

| # | Outlook | Temperature | Humidity | Windy | Play |
|---|----------|----------|--------|-----|---|
| 1 | sunny | hot | high | no | N |
| 2 | sunny | hot | high | yes | N |
| 3 | overcast | hot | high | no | P |
| 4 | rainy | moderate | high | no | P |
| 5 | rainy | cold | normal | no | P |
| 6 | rainy | cold | normal | yes | N |
| 7 | overcast | cold | normal | yes | P |
| 8 | sunny | moderate | high | no | N |
| 9 | sunny | cold | normal | no | P |
| 10 | rainy | moderate | normal | no | P |
| 11 | sunny | moderate | normal | yes | P |
| 12 | overcast | moderate | high | yes | P |
| 13 | overcast | hot | normal | no | P |
| 14 | rainy | moderate | high | yes | N |
• 29. Simple Tree [Figure: Outlook at the root — sunny → Humidity (high → N, normal → P); overcast → P; rainy → Windy (yes → N, no → P).]
• 30. Complicated Tree [Figure: a much larger tree for the same data, with Temperature at the root and repeated tests on Outlook, Windy and Humidity in its branches; it fits the training data but is far less comprehensible.]
• 31. Attribute Selection Criteria • Main principle – Select the attribute which partitions the learning set into subsets that are as “pure” as possible • Various measures of purity – Information-theoretic – Gini index – χ² (chi-square) – ReliefF – ... • Various improvements – probability estimates – normalization – binarization, subsetting
• 32. Information-Theoretic Approach • To classify an object, a certain amount of information is needed – I, information • After we have learned the value of attribute A, we only need some remaining amount of information to classify the object – Ires, residual information • Gain – Gain(A) = I – Ires(A) • The most ‘informative’ attribute is the one that minimizes Ires, i.e., maximizes Gain
• 33. Entropy • The average amount of information I needed to classify an object is given by the entropy measure: I = −Σc p(c) log2 p(c) • For a two-class problem: I = −p(c1) log2 p(c1) − (1 − p(c1)) log2 (1 − p(c1)) [Figure: plot of this entropy against p(c1), peaking at 1 bit for p(c1) = 0.5.]
• 34. Residual Information • After applying attribute A, S is partitioned into subsets according to values v of A • Ires is equal to the weighted sum of the amounts of information for the subsets: Ires(A) = −Σv p(v) Σc p(c|v) log2 p(c|v)
• 35. Triangles and Squares

| # | Color | Outline | Dot | Shape |
|---|--------|--------|-----|----------|
| 1 | green | dashed | no | triangle |
| 2 | green | dashed | yes | triangle |
| 3 | yellow | dashed | no | square |
| 4 | red | dashed | no | square |
| 5 | red | solid | no | square |
| 6 | red | solid | yes | triangle |
| 7 | green | solid | no | square |
| 8 | green | dashed | no | triangle |
| 9 | yellow | solid | yes | square |
| 10 | red | solid | no | square |
| 11 | green | solid | yes | square |
| 12 | yellow | dashed | yes | square |
| 13 | yellow | solid | no | square |
| 14 | red | dashed | yes | triangle |

(Shape is the class attribute; Color, Outline and Dot are the describing attributes.)
• 36. Triangles and Squares • Data Set: A set of classified objects (the same 14 examples).
• 37. Entropy • 5 triangles • 9 squares • class probabilities: p(triangle) = 5/14, p(square) = 9/14 • entropy: I = −(5/14) log2(5/14) − (9/14) log2(9/14) ≈ 0.940 bits
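As a quick check of the arithmetic:

```python
import math

# 5 triangles and 9 squares out of 14 objects.
p_triangle, p_square = 5 / 14, 9 / 14

# Entropy I = -sum_c p(c) log2 p(c); about 0.940 bits.
I = -p_triangle * math.log2(p_triangle) - p_square * math.log2(p_square)
```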
• 38. Entropy reduction by data set partitioning [Figure: Color? splits the 14 objects into red (2 triangles, 3 squares), yellow (4 squares) and green (3 triangles, 2 squares).]
• 39. Entropy of the attribute’s values: I(red) = I(green) ≈ 0.971 bits, I(yellow) = 0 bits, so Ires(Color) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 ≈ 0.694 bits
• 40. Information gain: Gain(Color) = 0.940 − 0.694 ≈ 0.246 bits
• 41. Information Gain of The Attribute • Attributes – Gain(Color) = 0.246 – Gain(Outline) = 0.151 – Gain(Dot) = 0.048 • Heuristic: the attribute with the highest gain is chosen • This heuristic is local (local minimization of impurity)
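These gain values can be reproduced directly from the 14-example table (a sketch; the attribute and class strings follow the table above, and small rounding differences against the slide's figures are expected):

```python
import math
from collections import Counter

# (color, outline, dot, shape) for the 14 objects of the triangles-and-squares table.
rows = [
    ("green", "dashed", "no", "triangle"), ("green", "dashed", "yes", "triangle"),
    ("yellow", "dashed", "no", "square"),  ("red", "dashed", "no", "square"),
    ("red", "solid", "no", "square"),      ("red", "solid", "yes", "triangle"),
    ("green", "solid", "no", "square"),    ("green", "dashed", "no", "triangle"),
    ("yellow", "solid", "yes", "square"),  ("red", "solid", "no", "square"),
    ("green", "solid", "yes", "square"),   ("yellow", "dashed", "yes", "square"),
    ("yellow", "solid", "no", "square"),   ("red", "dashed", "yes", "triangle"),
]

def entropy(labels):
    """I = -sum_c p(c) log2 p(c)."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(attr_index):
    """Gain(A) = I - Ires(A) for the attribute in column attr_index."""
    i_res = 0.0
    for v in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == v]
        i_res += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - i_res

# gain(0) ~ 0.247 (Color), gain(1) ~ 0.152 (Outline), gain(2) ~ 0.048 (Dot)
```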
• 42. [Figure: after splitting on Color?, gains are computed for the green subset (3 triangles, 2 squares):] Gain(Outline) = 0.971 – 0 = 0.971 bits Gain(Dot) = 0.971 – 0.951 = 0.020 bits
• 43. [Figure: the tree so far — Color? with the green branch split on Outline? (dashed/solid); gains are computed for the red subset (2 triangles, 3 squares):] Gain(Outline) = 0.971 – 0.951 = 0.020 bits Gain(Dot) = 0.971 – 0 = 0.971 bits
• 44. [Figure: the grown tree — Color? (red/yellow/green); the green branch split on Outline? (dashed/solid); the red branch split on Dot? (no/yes).]
• 45. Decision Tree [Figure: final tree — Color?: red → Dot? (yes → triangle, no → square); yellow → square; green → Outline? (dashed → triangle, solid → square).]
• 46. A Defect of Ires • Ires favors attributes with many values • Such an attribute splits S into many subsets, and if these are small, they will tend to be pure anyway • One way to rectify this is through a corrected measure, the information gain ratio.
• 47. Information Gain Ratio • I(A) is the amount of information needed to determine the value of attribute A: I(A) = −Σv p(v) log2 p(v) • Information gain ratio: GainRatio(A) = Gain(A) / I(A)
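For the Color attribute this works out as follows (a sketch of the arithmetic; the gain value is taken from the earlier slide):

```python
import math

# Color splits the 14 objects into red (5), yellow (4) and green (5).
split = [5 / 14, 4 / 14, 5 / 14]

# Intrinsic information I(A) = -sum_v p(v) log2 p(v); about 1.577 bits.
i_color = -sum(p * math.log2(p) for p in split)

gain_color = 0.247                 # Gain(Color) from the earlier slide
gain_ratio = gain_color / i_color  # about 0.156
```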
• 48. Information Gain Ratio [Figure: for Color? (red/yellow/green), I(Color) = −(5/14) log2(5/14) − (4/14) log2(4/14) − (5/14) log2(5/14) ≈ 1.577 bits, so GainRatio(Color) = 0.247 / 1.577 ≈ 0.156]
• 49. Information Gain and Information Gain Ratio

| A | \|v(A)\| | Gain(A) | GainRatio(A) |
|---|---|---|---|
| Color | 3 | 0.247 | 0.156 |
| Outline | 2 | 0.152 | 0.152 |
| Dot | 2 | 0.048 | 0.049 |
• 50. Gini Index • Another sensible measure of impurity (i and j are classes): Gini(S) = Σ i<j p(i) p(j), the sum over pairs of distinct classes (for two classes, Gini = p(1)·p(2)) • After applying attribute A, the resulting Gini index is GiniA(S) = Σv p(v) · Gini(Sv) • Gini can be interpreted as expected error rate
• 51. Gini Index [Figure: for the 5 triangles and 9 squares, Gini(S) = (5/14)·(9/14) ≈ 0.230]
• 52. Gini Index for Color [Figure: Color? — Gini(red) = (2/5)·(3/5) = 0.24, Gini(yellow) = 0, Gini(green) = (3/5)·(2/5) = 0.24; GiniColor(S) = (5/14)·0.24 + (4/14)·0 + (5/14)·0.24 ≈ 0.171]
• 53. Gain of the Gini Index: GiniGain(A) = Gini(S) − GiniA(S); for the example, GiniGain(Color) = 0.230 − 0.171 ≈ 0.058
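The slides' Gini numbers are consistent with taking the two-class Gini to be the product of the two class probabilities (i.e., the sum over unordered pairs of distinct classes); assuming that definition, a quick check:

```python
# Two-class Gini as implied by the slides' numbers: Gini = p(c1) * p(c2).
gini_all = (5 / 14) * (9 / 14)  # whole set: 5 triangles, 9 squares -> ~0.230

# Color subsets: red = 2 triangles + 3 squares, yellow = 4 squares,
# green = 3 triangles + 2 squares.
gini_color = (5 / 14) * (2 / 5) * (3 / 5) \
           + (4 / 14) * 0.0 \
           + (5 / 14) * (3 / 5) * (2 / 5)  # weighted Gini after the split, ~0.171

gini_gain = gini_all - gini_color          # ~0.058
```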
• 54. Three Impurity Measures

| A | Gain(A) | GainRatio(A) | GiniGain(A) |
|---|---|---|---|
| Color | 0.247 | 0.156 | 0.058 |
| Outline | 0.152 | 0.152 | 0.046 |
| Dot | 0.048 | 0.049 | 0.015 |

• These impurity measures assess the effect of a single attribute • The “most informative” criterion they define is local (and “myopic”) • It does not reliably predict the effect of several attributes applied jointly

### Editor's notes

1. Induction is different from deduction, and a DBMS does not support induction. The result of induction is higher-level information or knowledge: general statements about the data. There are many approaches; refer to the lecture notes for CS3244 available at the Co-Op. We focus on three approaches here. Other examples of approaches: instance-based learning, neural networks, concept learning (version space, Focus, Aq11, …), genetic algorithms, reinforcement learning.