# Unit-2 Bayes Decision Theory.pptx

Bayes theorem

1. **Data Preprocessing**
   - Data Preprocessing: An Overview
   - Data Quality
   - Major Tasks in Data Preprocessing
   - Data Cleaning
   - Data Integration
   - Data Reduction
   - Data Transformation and Data Discretization
   - Data collection gathers parameters, features, and variables. Errors propagate: a mistake in the very first step carries through every later step (1st → 2nd → 3rd → 4th).
2. **Data Quality: Why Preprocess the Data?**
   - Measures for data quality: a multidimensional view
   - Accuracy: correct or wrong, accurate or not
   - Completeness: not recorded, unavailable, …
   - Consistency: some modified but some not, dangling, …
   - Timeliness: updated in a timely manner?
   - Believability: how trustworthy is the data?
   - Interpretability: how easily can the data be understood?
3. **Major Tasks in Data Preprocessing**
   - Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
   - Data integration: integration of multiple databases, data cubes, or files
   - Data reduction: dimensionality reduction, numerosity reduction, data compression
   - Data transformation and data discretization: normalization, concept hierarchy generation
4. **Histogram Analysis**
   - Divide the data into buckets and store the average (or sum) for each bucket
   - Partitioning rules:
     - Equal-width: equal bucket range
     - Equal-frequency (or equal-depth): equal number of values per bucket
   - (Figure: histogram over intervals of width 10,000, x-axis 10,000-90,000, y-axis counts 0-40)
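The two partitioning rules can be sketched as follows. This is a minimal illustration on made-up values; the function names and the sample list are my own, not from the slides.

```python
def equal_width_buckets(values, n_buckets):
    """Partition values into buckets covering equal value ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in sorted(values):
        # Clamp the top edge so max(values) lands in the last bucket.
        idx = min(int((v - lo) / width), n_buckets - 1)
        buckets[idx].append(v)
    return buckets

def equal_frequency_buckets(values, n_buckets):
    """Partition sorted values into buckets of (roughly) equal count."""
    vals = sorted(values)
    size = len(vals) // n_buckets
    return [vals[i * size:(i + 1) * size] for i in range(n_buckets)]

data = [1, 2, 2, 3, 10, 11, 12, 40]
ew = equal_width_buckets(data, 2)      # split by value range; the outlier 40 sits alone
ef = equal_frequency_buckets(data, 2)  # split by count: four values per bucket
```

Note how the outlier 40 distorts the equal-width split but not the equal-frequency one, which is why equal-depth buckets are often preferred for skewed data.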
5. **Correlation Analysis (Nominal Data)**
   - The χ² (chi-square) test: χ² = Σ (Observed − Expected)² / Expected
   - The larger the χ² value, the more likely the variables (features/attributes) are related
   - The cells that contribute the most to the χ² value are those whose actual count differs most from the expected count
   - Correlation does not imply causality: the number of hospitals and the number of car thefts in a city are correlated, but both are causally linked to a third variable, population
6. **Chi-Square Calculation: An Example**
   - χ² calculation (numbers in parentheses are expected counts calculated from the data distribution in the two categories):

   |                          | Play chess | Not play chess | Sum (row) |
   |--------------------------|------------|----------------|-----------|
   | Like science fiction     | 250 (90)   | 200 (360)      | 450       |
   | Not like science fiction | 50 (210)   | 1000 (840)     | 1050      |
   | Sum (col.)               | 300        | 1200           | 1500      |

   χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 ≈ 507.93

   - This shows that like_science_fiction and play_chess are correlated in this group
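The slide's χ² value can be reproduced directly from the contingency table. A small sketch (the dictionary layout is my own; expected counts come from row total × column total / grand total under the independence assumption):

```python
# Observed counts from the like_science_fiction / play_chess table.
observed = {
    ("sf", "chess"): 250, ("sf", "no_chess"): 200,
    ("no_sf", "chess"): 50, ("no_sf", "no_chess"): 1000,
}
row_tot = {"sf": 450, "no_sf": 1050}
col_tot = {"chess": 300, "no_chess": 1200}
n = 1500

chi2 = 0.0
for (r, c), obs in observed.items():
    exp = row_tot[r] * col_tot[c] / n   # expected count under independence
    chi2 += (obs - exp) ** 2 / exp
# chi2 ≈ 507.94, matching the slide's 507.93 up to rounding
```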
7. **Correlation Analysis (Numeric Data)**
   - Correlation coefficient (also called Pearson's product-moment coefficient):

   r_{A,B} = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B) = (Σᵢ aᵢbᵢ − n Ā B̄) / ((n − 1) σ_A σ_B)

   where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ aᵢbᵢ is the sum of the AB cross-product.
   - If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation
   - r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
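The Pearson formula can be implemented in a few lines. This is a sketch on a small made-up sample (the function name and data are my own); it uses sample standard deviations with the (n − 1) denominator, matching the formula.

```python
import math

def pearson_r(a, b):
    """Pearson's product-moment correlation of two equal-length samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    # Sample standard deviations (n - 1 denominator).
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / (n - 1))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b) / (n - 1))
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)
    return cov / (sa * sb)

a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]
r = pearson_r(a, b)   # strongly positive: a and b rise together
```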
8. **Visually Evaluating Correlation**
   - (Figure: scatter plots showing correlation values ranging from −1 to 1, with example axes such as Age vs. Height)
9. **Covariance (Numeric Data)**
   - Covariance is similar to correlation:

   Cov(A, B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n, and the correlation coefficient is r_{A,B} = Cov(A, B) / (σ_A σ_B)

   where n is the number of tuples, Ā and B̄ are the respective means (expected values) of A and B, and σ_A and σ_B are their respective standard deviations.
   - Positive covariance: if Cov_{A,B} > 0, then A and B both tend to be larger than their expected values together
   - Negative covariance: if Cov_{A,B} < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value
   - Independence implies Cov_{A,B} = 0, but the converse is not true: some pairs of random variables have a covariance of 0 yet are not independent. Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
10. **Covariance: An Example**
    - The computation can be simplified as Cov(A, B) = E(AB) − E(A)·E(B)
    - Suppose two stocks A and B have the following values over one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
    - Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?
    - E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
    - E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
    - Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
    - Thus A and B rise together, since Cov(A, B) > 0
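The stock example checks out numerically with the simplified formula Cov(A, B) = E(AB) − E(A)·E(B):

```python
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]

e_a = sum(A) / len(A)                             # E(A)  = 4.0
e_b = sum(B) / len(B)                             # E(B)  = 9.6
e_ab = sum(a * b for a, b in zip(A, B)) / len(A)  # E(AB) = 212/5 = 42.4
cov = e_ab - e_a * e_b                            # 42.4 - 38.4 = 4.0
```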
11. **The Normal Distribution**
    - (Figure: bell curve f(X) vs. X, annotated with the mean μ and standard deviation σ)
    - Changing μ shifts the distribution left or right; changing σ increases or decreases the spread
    - Notation covers: population vs. samples, mean, standard deviation
12. **The Normal Distribution as a Mathematical Function (pdf)**

    f(x) = (1 / (σ√(2π))) · e^(−(1/2)·((x − μ)/σ)²)

    - Note the constants: π = 3.14159, e = 2.71828
    - This is a bell-shaped curve whose center and spread depend on μ and σ
13. **The beauty of the normal curve:** no matter what μ and σ are, the area between μ−σ and μ+σ is about 68%, the area between μ−2σ and μ+2σ is about 95%, and the area between μ−3σ and μ+3σ is about 99.7%. Almost all values fall within 3 standard deviations of the mean.
14. **68-95-99.7 Rule**
    - (Figure: normal curve shaded to show 68% of the data within ±1σ, 95% within ±2σ, and 99.7% within ±3σ)
15. **Example**
    - Suppose SAT scores roughly follow a normal distribution in the U.S. population of college-bound students (with range restricted to 200-800), and the average math SAT is 500 with a standard deviation of 50. Then:
    - 68% of students will have scores between 450 and 550
    - 95% will be between 400 and 600
    - 99.7% will be between 350 and 650
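The 68-95-99.7 figures can be verified from the normal CDF without any external library, since Φ(z) = (1 + erf(z/√2))/2 and hence P(μ − kσ < X < μ + kσ) = erf(k/√2):

```python
import math

def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for any normal distribution."""
    return math.erf(k / math.sqrt(2))

p1, p2, p3 = prob_within(1), prob_within(2), prob_within(3)
# For SAT scores with mu = 500, sigma = 50, these are the probabilities of
# scoring in (450, 550), (400, 600), and (350, 650) respectively.
```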
16. **Basic Formulas for Probabilities**
    - Product rule: the probability of a conjunction of two events A and B: P(A ∧ B) = P(A|B)·P(B) = P(B|A)·P(A)
    - Sum rule: the probability of a disjunction of two events A and B: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
    - Theorem of total probability: if events A₁, …, Aₙ are mutually exclusive with Σᵢ P(Aᵢ) = 1, then P(B) = Σᵢ P(B|Aᵢ)·P(Aᵢ)
17. **Basic Approach**
    - Bayes rule: P(h|D) = P(D|h)·P(h) / P(D)
    - P(h) = prior probability of hypothesis h
    - P(D) = prior probability of training data D
    - P(h|D) = probability of h given D (posterior density)
    - P(D|h) = probability of D given h (likelihood of D given h)
    - The goal of Bayesian learning: find the most probable hypothesis given the training data (the Maximum A Posteriori hypothesis, h_MAP):

    h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h)·P(h) / P(D) = argmax_{h∈H} P(D|h)·P(h)

    - In classification terms: into which class should the sample be put? Prediction (classification), e.g. class ∈ {0, 1}, framed as a null hypothesis versus an alternate hypothesis
18. **An Example**
    - Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
    - P(cancer) = .008, P(¬cancer) = .992
    - P(+|cancer) = .98, P(−|cancer) = .02
    - P(+|¬cancer) = .03, P(−|¬cancer) = .97
    - P(cancer|+) = P(+|cancer)·P(cancer) / P(+), and P(¬cancer|+) = P(+|¬cancer)·P(¬cancer) / P(+)
19. **MAP Learner**
    - For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h)·P(h) / P(D)
    - Output the hypothesis h_MAP with the highest posterior probability: h_MAP = argmax_{h∈H} P(h|D)
    - Comments:
      - Computationally intensive
      - Provides a standard for judging the performance of learning algorithms
      - Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task
20. **Bayes Optimal Classifier**
    - Question: given a new instance x, what is its most probable classification?
    - h_MAP(x) is not necessarily the most probable classification!
    - Example: let P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3. Given new data x, suppose h1(x) = +, h2(x) = −, h3(x) = −. What is the most probable classification of x?
    - Bayes optimal classification: argmax_{vj∈V} Σ_{hi∈H} P(vj|hi)·P(hi|D)
    - Example: P(h1|D) = .4 with P(−|h1) = 0, P(+|h1) = 1; P(h2|D) = .3 with P(−|h2) = 1, P(+|h2) = 0; P(h3|D) = .3 with P(−|h3) = 1, P(+|h3) = 0
    - Then Σᵢ P(+|hᵢ)·P(hᵢ|D) = .4 and Σᵢ P(−|hᵢ)·P(hᵢ|D) = .6, so the most probable classification of x is −
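The slide's example can be sketched as a weighted vote: each hypothesis votes for its predicted label, weighted by its posterior P(h|D). The dictionary names below are my own.

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # each hypothesis's label for x

votes = {}
for h, p in posteriors.items():
    votes[predictions[h]] = votes.get(predictions[h], 0.0) + p

best = max(votes, key=votes.get)
# "-" wins with weight 0.6 against 0.4, even though the MAP hypothesis h1 says "+".
```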
21. **Naïve Bayes Learner**
    - Assume a target function f: X → V, where each instance x is described by attributes <a1, a2, …, an>. The most probable value of f(x) is:

    v_MAP = argmax_{vj∈V} P(vj | a1, a2, …, an) = argmax_{vj∈V} P(a1, a2, …, an | vj)·P(vj) / P(a1, a2, …, an) = argmax_{vj∈V} P(a1, a2, …, an | vj)·P(vj)

    - Naïve Bayes assumption: P(a1, a2, …, an | vj) = Πᵢ P(aᵢ|vj), i.e. the attributes are conditionally independent given the class
    - (Illustration: a table with attribute columns a1 (# persons), a2 (temp), …, an and a label column)
22. **Bayesian Classification**
    - The classification problem may be formalized using a-posteriori probabilities: P(C|X) = the probability that the sample tuple X = <x1, …, xk> is of class C
    - E.g. P(class = N | outlook = sunny, windy = true, …)
    - Idea: assign to sample X the class label C such that P(C|X) is maximal
23. **Estimating A-Posteriori Probabilities**
    - Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X)
    - P(X) is constant for all classes
    - P(C) = relative frequency of class C samples
    - The C that maximizes P(C|X) is the C that maximizes P(X|C)·P(C)
    - Problem: computing P(X|C) directly is infeasible!
24. **Naïve Bayesian Classification**
    - Naïve assumption: attribute independence, i.e. P(x1, …, xk|C) = P(x1|C)·…·P(xk|C)
    - If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
    - If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function
    - Computationally easy in both cases
25. **Naive Bayesian Classifier (II)**
    - Given a training set, we can compute the probabilities:

    | Outlook  | P   | N   |
    |----------|-----|-----|
    | sunny    | 2/9 | 3/5 |
    | overcast | 4/9 | 0   |
    | rain     | 3/9 | 2/5 |

    | Temperature | P   | N   |
    |-------------|-----|-----|
    | hot         | 2/9 | 2/5 |
    | mild        | 4/9 | 2/5 |
    | cool        | 3/9 | 1/5 |

    | Humidity | P   | N   |
    |----------|-----|-----|
    | high     | 3/9 | 4/5 |
    | normal   | 6/9 | 1/5 |

    | Windy | P   | N   |
    |-------|-----|-----|
    | true  | 3/9 | 3/5 |
    | false | 6/9 | 2/5 |
26. **Play-Tennis Example: Estimating P(xi|C)**

    | Outlook  | Temperature | Humidity | Windy | Class |
    |----------|-------------|----------|-------|-------|
    | sunny    | hot         | high     | false | N     |
    | sunny    | hot         | high     | true  | N     |
    | overcast | hot         | high     | false | P     |
    | rain     | mild        | high     | false | P     |
    | rain     | cool        | normal   | false | P     |
    | rain     | cool        | normal   | true  | N     |
    | overcast | cool        | normal   | true  | P     |
    | sunny    | mild        | high     | false | N     |
    | sunny    | cool        | normal   | false | P     |
    | rain     | mild        | normal   | false | P     |
    | sunny    | mild        | normal   | true  | P     |
    | overcast | mild        | high     | true  | P     |
    | overcast | hot         | normal   | false | P     |
    | rain     | mild        | high     | true  | N     |

    - P(p) = 9/14, P(n) = 5/14
    - Outlook: P(sunny|p) = 2/9, P(sunny|n) = 3/5; P(overcast|p) = 4/9, P(overcast|n) = 0; P(rain|p) = 3/9, P(rain|n) = 2/5
    - Temperature: P(hot|p) = 2/9, P(hot|n) = 2/5; P(mild|p) = 4/9, P(mild|n) = 2/5; P(cool|p) = 3/9, P(cool|n) = 1/5
    - Humidity: P(high|p) = 3/9, P(high|n) = 4/5; P(normal|p) = 6/9, P(normal|n) = 2/5
    - Windy: P(true|p) = 3/9, P(true|n) = 3/5; P(false|p) = 6/9, P(false|n) = 2/5
27. **Example: Naïve Bayes**
    - Predict playing tennis on a day with the conditions <sunny, cool, high, strong>, i.e. P(v | o = sunny, t = cool, h = high, w = strong), using the following training data:

    | Day | Outlook  | Temperature | Humidity | Wind   | Play Tennis |
    |-----|----------|-------------|----------|--------|-------------|
    | 1   | Sunny    | Hot         | High     | Weak   | No          |
    | 2   | Sunny    | Hot         | High     | Strong | No          |
    | 3   | Overcast | Hot         | High     | Weak   | Yes         |
    | 4   | Rain     | Mild        | High     | Weak   | Yes         |
    | 5   | Rain     | Cool        | Normal   | Weak   | Yes         |
    | 6   | Rain     | Cool        | Normal   | Strong | No          |
    | 7   | Overcast | Cool        | Normal   | Strong | Yes         |
    | 8   | Sunny    | Mild        | High     | Weak   | No          |
    | 9   | Sunny    | Cool        | Normal   | Weak   | Yes         |
    | 10  | Rain     | Mild        | Normal   | Weak   | Yes         |
    | 11  | Sunny    | Mild        | Normal   | Strong | Yes         |
    | 12  | Overcast | Mild        | High     | Strong | Yes         |
    | 13  | Overcast | Hot         | Normal   | Weak   | Yes         |
    | 14  | Rain     | Mild        | High     | Strong | No          |

    We have:
    - P(yes)·P(sunny|yes)·P(cool|yes)·P(high|yes)·P(strong|yes) = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 ≈ .005
    - P(no)·P(sunny|no)·P(cool|no)·P(high|no)·P(strong|no) = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 ≈ .021
    - Since .021 > .005, the prediction is No
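The full worked example fits in a short script: estimate the prior and the per-attribute conditional frequencies from the 14-row table, then score both classes for <sunny, cool, high, strong>. A minimal sketch (function and variable names are my own):

```python
data = [  # (outlook, temperature, humidity, wind, play)
    ("sunny","hot","high","weak","no"), ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"), ("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"), ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"), ("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"), ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"), ("rain","mild","high","strong","no"),
]

def nb_score(x, label):
    """P(label) times the product of P(a_i | label) frequency estimates."""
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)            # prior P(label)
    for i, value in enumerate(x):
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    return score

x = ("sunny", "cool", "high", "strong")
s_yes, s_no = nb_score(x, "yes"), nb_score(x, "no")
prediction = "yes" if s_yes > s_no else "no"   # "no": 0.005... < 0.021...
```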
28. **The Independence Hypothesis…**
    - … makes computation possible
    - … yields optimal classifiers when satisfied
    - … but is seldom satisfied in practice, as attributes (variables) are often correlated
    - Attempts to overcome this limitation:
      - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
      - Decision trees, which reason on one attribute at a time, considering the most important attributes first
29. **Naïve Bayes Algorithm**
    - Naïve_Bayes_Learn(examples): for each target value vj, estimate P(vj); for each attribute value ai of each attribute a, estimate P(ai|vj)
    - Classify_New_Instance(x): v = argmax_{vj∈V} P(vj) · Π_{ai∈x} P(ai|vj)
    - Typical (m-estimate) estimation of P(ai|vj): P(ai|vj) = (nc + m·p) / (n + m), where n is the number of examples with v = vj, nc is the number of those with a = ai, p is a prior estimate for P(ai|vj), and m is the weight given to the prior
30. **K-Nearest-Neighbors Algorithm**
    - K-nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (distance function)
    - KNN has been used in statistical estimation and pattern recognition since the 1970s
31. **K-Nearest-Neighbors Algorithm**
    - A case is classified by a majority vote of its neighbors: it is assigned to the class most common among its K nearest neighbors, as measured by a distance function
    - If K = 1, the case is simply assigned to the class of its nearest neighbor
32. **Distance Function Measurements**
    - (Figure: distance-function formulas, not captured)
33. **Hamming Distance**
    - For categorical variables, the Hamming distance can be used
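The distance-function slide's figure is not captured; common choices for KNN can be sketched as below (this is an illustrative assumption about which functions the slide showed: Euclidean and Manhattan for numeric features, Hamming for categorical ones).

```python
import math

def euclidean(p, q):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def hamming(p, q):
    """Number of positions where two categorical vectors disagree."""
    return sum(1 for a, b in zip(p, q) if a != b)

d1 = euclidean((0, 0), (3, 4))                     # 5.0
d2 = manhattan((0, 0), (3, 4))                     # 7
d3 = hamming(("sunny", "hot"), ("rain", "hot"))    # 1
```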
34. **K-Nearest-Neighbors** (worked example)
35. **What is the most probable label for c?** (Figure: scatter of points labeled a and o, with an unlabeled query point c)
36. **What is the most probable label for c?**
    - Solution: look for the K nearest neighbors of c and take the majority label as c's label. Suppose k = 3.
37. **What is the most probable label for c?** (Figure: the 3 nearest neighbors of c highlighted)
38. **What is the most probable label for c?**
    - The 3 points nearest to c are: a, a, and o. Therefore the most probable label for c is a.
39. **Nearest Neighbour Rule**
    - Non-parametric pattern classification. Consider a two-class problem where each sample consists of two measurements (x, y).
    - k = 1: for a given query point q, assign the class of the nearest neighbour
    - k = 3: compute the k nearest neighbours and assign the class by majority vote
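The rule above can be sketched in a few lines: sort the training samples by distance to the query and take the majority label among the k closest. The training points below are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of ((x, y), label); returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((5, 5), "o"), ((6, 5), "o")]

label_k3 = knn_predict(train, (1.5, 1.5), k=3)   # majority vote among 3 nearest
label_k1 = knn_predict(train, (5.5, 5.0), k=1)   # class of the single nearest point
```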
40. **Nearest Neighbour Issues**
    - Expensive: to determine the nearest neighbour of a query point q, the distance to all N training examples must be computed
      - (+) Pre-sort training examples into fast data structures (kd-trees)
      - (+) Compute only an approximate distance (LSH)
      - (+) Remove redundant data (condensing)
    - Storage requirements: all training data must be stored
      - (+) Remove redundant data (condensing)
      - (−) Pre-sorting often increases the storage requirements
    - High-dimensional data: the "curse of dimensionality"
      - The required amount of training data increases exponentially with dimension
      - Computational cost also increases dramatically
      - Partitioning techniques degrade to linear search in high dimensions
41. **Decision Theory**
    - Decision theory is the study of making decisions that have a significant impact
    - Decision-making is distinguished into:
      - Decision-making under certainty
      - Decision-making under non-certainty:
        - Decision-making under risk
        - Decision-making under uncertainty
42. **Probability Theory**
    - Most decisions have to be taken in the presence of uncertainty
    - Probability theory quantifies uncertainty regarding the occurrence of events or states of the world
    - Basic elements of probability theory:
      - Random variables describe aspects of the world whose state is initially unknown
      - Each random variable has a domain of values that it can take on (discrete, boolean, continuous)
      - An atomic event is a complete specification of the state of the world, i.e. an assignment of values to all the variables of which the world is composed
43. **Probability Theory (cont.)**
    - Probability space:
      - The sample space S = {e1, e2, …, en}, a set of atomic events
      - A probability measure P, which assigns a real number between 0 and 1 to the members of the sample space
    - Axioms:
      - All probabilities are between 0 and 1
      - The probabilities of the atomic events of a probability space must sum to 1
      - The certain event S (the sample space itself) has probability 1
44. **Prior**
    - Prior probabilities (priors) reflect our prior knowledge of how likely an event is to occur
    - In the absence of any other information, a random variable is assigned a degree of belief called its unconditional or prior probability
45. **Class-Conditional Probability**
    - When we have information concerning previously unknown random variables, we use posterior or conditional probabilities: P(a|b), the probability of event a given that we know b
    - By the product rule this can be written P(a ∧ b) = P(a|b)·P(b), i.e. P(a|b) = P(a ∧ b) / P(b)
46. **Bayes' Rule**
    - The product rule can be written two ways:
      - P(a ∧ b) = P(a|b)·P(b)
      - P(a ∧ b) = P(b|a)·P(a)
    - Equating the right-hand sides gives P(b|a) = P(a|b)·P(b) / P(a)
    - This is known as Bayes' rule
47. **Bayesian Decision Theory**
    - Bayesian decision theory is a fundamental statistical approach that quantifies the tradeoffs between various decisions using the probabilities and costs that accompany such decisions
    - Example: a patient has trouble breathing
      - Decision: asthma versus lung cancer
      - Deciding lung cancer when the person has asthma: cost moderately high (e.g., unnecessary tests are ordered, the patient is scared)
      - Deciding asthma when the person has lung cancer: cost very high (the cancer goes untreated)
48. **Decision Rules**
    - Progression of decision rules:
      1. Decide based on prior probabilities
      2. Decide based on posterior probabilities
      3. Decide based on risk
49. **Fish Sorting Example Revisited**
50. **Decision Based on Prior Probabilities**
51. **Question**
    - Consider a two-class problem {c1, c2} where the prior probabilities of the two classes are given by P(c1) = 0.7 and P(c2) = 0.3
    - Design a classification rule for a pattern based only on the prior probabilities
    - Calculate the probability of error, P(error)
52. **Solution**
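The solution slide's content is not captured; under a prior-only rule, the standard answer is to always decide the class with the larger prior, so the error probability is the smaller prior. A sketch of that reasoning for the numbers on the question slide:

```python
p_c1, p_c2 = 0.7, 0.3   # prior probabilities from the question slide

# Prior-only rule: always decide the class with the larger prior.
decision = "c1" if p_c1 >= p_c2 else "c2"

# We err exactly when the other class actually occurs.
p_error = min(p_c1, p_c2)   # 0.3
```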
53. **Decision Based on Class-Conditional Probabilities**
54. **Posterior Probabilities**
55. **Bayes Formula**
    - Suppose the priors P(ωj) and the conditional densities p(x|ωj) are known. Then

    P(ωj | x) = p(x | ωj) · P(ωj) / p(x)

    i.e. posterior = (likelihood × prior) / evidence
56. **Making a Decision**
57. **Probability of Error**
    - Average probability of error: P(error) = ∫ P(error|x) p(x) dx
    - The Bayes decision rule minimizes this error because it achieves P(error|x) = min[P(ω1|x), P(ω2|x)] at every x
58. Example of the two regions R1 and R2 formed by the Bayesian classifier for the case of two equiprobable classes: the dotted line at x0 is a threshold partitioning the feature space into two regions, R1 and R2. According to the Bayes decision rule, for all values of x in R1 the classifier decides ω1, and for all values in R2 it decides ω2. However, it is obvious from the figure that decision errors are unavoidable.
59. The total probability, Pe, of committing a decision error is Pe = ∫_{R2} p(x|ω1)P(ω1) dx + ∫_{R1} p(x|ω2)P(ω2) dx, which is equal to the total shaded area under the curves in the figure.
60. **Minimizing the Classification Error Probability**
    - Show that the Bayesian classifier is optimal with respect to minimizing the classification error probability
61. **Generalized Bayesian Decision Theory**
62. **Bayesian Decision Theory (cont.)**
63. **Bayesian Decision Theory (cont.)**