[Event Series] A Quick Tour of Data Mining
1. Quick Tour of Data Mining
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
2. About the Speaker
陳宜欣 Yi-Shin Chen
▷ Currently
• Associate Professor, Department of Computer Science, National Tsing Hua University
• Director of the Intelligent Data Engineering and Applications Lab (IDEA Lab)
▷ Education
• Ph.D. in Computer Science, USC, USA
• M.B.A. in Information Management, NCU, TW
• B.B.A. in Information Management, NCU, TW
▷ Courses (all in English)
• Research and Presentation Skills
• Introduction to Database Systems
• Advanced Database Systems
• Data Mining: Concepts, Techniques, and
Applications
4. Timeline: 1900–1970
Eras: Manual Record Managers → Punched-Card Record Managers → Programmed Record Managers → On-line Network Databases
• 1931: Gödel's incompleteness theorem
• 1944: Mark I
• 1948: Information theory and information entropy (Shannon)
• 1950: Univac had developed a magnetic tape
• 1951: Univac I delivered to the US Census Bureau
• 1963: The origins of the Internet
Programmed Record Managers
• Birth of high-level programming languages
• Batch processing
On-line Network Databases
• Indexed sequential records
• Data independence
• Concurrent access
5. Timeline: 1970–2010
• 1974: IBM System R
• 1976: E-R Model by Peter Chen
• 1980: Artificial neural networks
• 1985: First standardization of SQL
• 1993: WWW
• 2001: Data science
• 2006: Amazon.com Elastic Compute Cloud
• 2009: Deep learning
Relational Model
• Gives database users high-level, set-oriented data access operations
Object Relational Model
• Supports multiple datatypes and applications
Knowledge Discovery in Databases
7. Data Mining
▷ What is data mining?
• Algorithms for seeking unexpected “pearls of wisdom”
▷ Current data mining research:
• Focuses on efficient ways to discover models of existing data sets
• Developed algorithms include classification, clustering, association-rule discovery, and summarization
10. Knowledge Discovery (KDD) Process
Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
14. Informal Design Guidelines for Databases
▷ Design a schema that can be explained easily, relation by relation; the semantics of attributes should be easy to interpret
▷ Avoid update anomaly problems
▷ Design relations such that their tuples contain as few NULL values as possible
▷ Design relations to satisfy the lossless join condition (guaranteeing meaningful results for join operations)
15. Data Warehouse
▷Assemble and manage data from various sources
for the purpose of answering business questions
[Diagram: OLTP sources (CRM, ERP, POS, …) flow into the Data Warehouse to produce meaningful answers]
16. Knowledge Discovery (KDD) Process
Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
18. Data
Many slides provided by Tan, Steinbach, and Kumar for the book “Introduction to Data Mining” are adapted in this presentation
Data is the most important part of the whole process
19. Types of Attributes
▷There are different types of attributes
• Nominal (=,≠)
→ Nominal values can only distinguish one object from
another
→ Examples: ID numbers, eye color, zip codes
• Ordinal (<,>)
→ Ordinal values can help to order objects
→ Examples: rankings, grades
• Interval (+,-)
→ Differences between values are meaningful
→ Examples: calendar dates
• Ratio (*,/)
→ Both differences and ratios are meaningful
→ Examples: temperature in Kelvin, length, time, counts
→ Note: only ratio attributes support all of these operations
20. Types of Data Sets
▷Record
• Data Matrix
• Document Data
• Transaction Data
▷Graph
• World Wide Web
• Molecular Structures
▷Ordered
• Spatial Data
• Temporal Data
• Sequential Data
• Genetic Sequence Data
Example data matrix (record data):
Thickness | Load | Distance | Projection of y load | Projection of x load
1.1 | 2.2 | 16.22 | 6.25 | 12.65
1.2 | 2.7 | 15.22 | 5.27 | 10.23
Example document-term matrix (document data):
 | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1 | 3 | 0 | 5 | 0 | 2 | 6 | 0 | 2 | 0 | 2
Document 2 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0
Document 3 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0
Example transaction data:
TID | Time | Items
1 | 2009/2/8 | Bread, Coke, Milk
2 | 2009/2/13 | Beer, Bread
3 | 2009/2/23 | Beer, Diaper
4 | 2009/3/1 | Coke, Diaper, Milk
28. Recap: Types of Attributes
▷There are different types of attributes
• Nominal (=,≠)
→ Nominal values can only distinguish one object from
another
→ Examples: ID numbers, eye color, zip codes
• Ordinal (<,>)
→ Ordinal values can help to order objects
→ Examples: rankings, grades
• Interval (+,-)
→ Differences between values are meaningful
→ Examples: calendar dates
• Ratio (*,/)
→ Both differences and ratios are meaningful
→ Examples: temperature in Kelvin, length, time, counts
29. Vector Space Model
▷Represent the keywords of objects using a term vector
• Term: basic concept, e.g., keywords to describe an object
• Each term represents one dimension in a vector
• N total terms define an N-dimensional vector
• The value of each term in a vector corresponds to the
importance of that term
▷Measure similarity by the vector distances
Example: the document-term matrix from slide 20, where each document is a vector over the terms (team, coach, play, ball, score, game, win, lost, timeout, season)
30. Term Frequency and Inverse
Document Frequency (TFIDF)
▷Since not all objects in the vector space are equally
important, we can weight each term using its
occurrence probability in the object description
• Term frequency: TF(d,t)
→ number of times t occurs in the object description d
• Inverse document frequency: IDF(t)
→ to scale down the terms that occur in many descriptions
31. Normalizing Term Frequency
▷nij represents the number of times a term ti occurs in
a description dj . tfij can be normalized using the total
number of terms in the document
• tf_ij = n_ij / NormalizedValue
▷ NormalizedValue could be:
• The sum of all term frequencies in the description
• The maximum frequency value
• Any other value that keeps tf_ij between 0 and 1
32. Inverse Document Frequency
▷ IDF seeks to scale down the coordinates of terms
that occur in many object descriptions
• For example, some stop words(the, a, of, to, and…) may
occur many times in a description. However, they should
be considered as non-important in many cases
• idf_i = log(N / df_i) + 1
→ where df_i (the document frequency of term t_i) is the number of descriptions in which t_i occurs
▷ IDF can be replaced with ICF (inverse class frequency) and
many other concepts based on applications
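A minimal Python sketch of the TF and IDF definitions above, using max-frequency normalization for TF and the log(N/df_i) + 1 form of IDF; the toy documents are invented for illustration:

import math
from collections import Counter

docs = [
    "the game score was tied until the final play".split(),
    "the coach called a timeout to set up the play".split(),
]

N = len(docs)
# document frequency: number of documents in which each term occurs
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    max_freq = max(counts.values())
    weights = {}
    for term, n in counts.items():
        tf = n / max_freq                 # normalized term frequency
        idf = math.log(N / df[term]) + 1  # scale down terms that occur in many documents
        weights[term] = tf * idf
    return weights

print(tfidf(docs[0]))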
33. Reasons for Using the Log
▷ Each distribution can indicate a hidden force behind the data
[Figure: a power-law distribution and two normal distributions]
35. Big Data?
▷ “Every day, we create 2.5 quintillion bytes of data — so much
that 90% of the data in the world today has been created in the
last two years alone. This data comes from everywhere:
sensors used to gather climate information, posts to social
media sites, digital pictures and videos, purchase transaction
records, and cell phone GPS signals, to name a few. This data
is big data.”
• — from www.ibm.com/software/data/bigdata/what-is-big-data.html
37. Data Quality
▷What kinds of data quality problems?
▷How can we detect problems with the data?
▷What can we do about these problems?
▷Examples of data quality problems:
•Noise and outliers
•Missing values
•Duplicate data
38. Noise
▷Noise refers to modification of original values
• Examples: distortion of a person’s voice when talking
on a poor phone, or “snow” on a television screen
[Figure: two sine waves vs. the same two sine waves + noise]
39. Outliers
▷ Outliers are data objects with characteristics that are
considerably different from most of the other data objects
in the data set
40. Missing Values
▷Reasons for missing values
• Information is not collected
→ e.g., people decline to give their age and weight
• Attributes may not be applicable to all cases
→ e.g., annual income is not applicable to children
▷ Handling missing values (see the sketch below)
• Eliminate Data Objects
• Estimate Missing Values
• Ignore the Missing Value During Analysis
• Replace with all possible values
→ Weighted by their probabilities
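A hedged pandas sketch of the first two handling options (eliminating data objects and estimating missing values); the column names and values are invented:

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "weight": [60.0, 72.5, np.nan, 55.0]})

dropped = df.dropna()             # eliminate data objects with missing values
estimated = df.fillna(df.mean())  # estimate missing values, e.g., with the column mean

print(dropped)
print(estimated)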
41. Duplicate Data
▷Data set may include data objects that are
duplicates, or almost duplicates of one another
• Major issue when merging data from heterogeneous sources
▷Examples:
• Same person with multiple email addresses
▷Data cleaning
• Process of dealing with duplicate data issues
44. Aggregation
▷Combining two or more attributes (or objects) into a single
attribute (or object)
▷Purpose
• Data reduction
→ Reduce the number of attributes or objects
• Change of scale
→ Cities aggregated into regions, states, countries, etc
• More “stable” data
→ Aggregated data tends to have less variability
-- Average salary per department, for departments with at least two employees
SELECT d.Name, AVG(e.Salary)
FROM Employee AS e JOIN Department AS d ON e.Dept = d.DNo
GROUP BY d.Name
HAVING COUNT(e.ID) >= 2;
45. Sampling
▷Sampling is the main technique employed for data
selection
• It is often used for both
→ Preliminary investigation of the data
→ The final data analysis
• Reasons:
→ Statistics: Obtaining the entire set of data of interest is too
expensive
→ Data mining: Processing the entire data set is too
expensive
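A minimal sketch of simple random sampling, with and without replacement; the data list is a stand-in for a real data set:

import random

random.seed(42)
data = list(range(1000))   # stand-in for the full data set

# simple random sampling without replacement
sample = random.sample(data, 100)

# sampling with replacement (a record can be picked more than once)
bootstrap = [random.choice(data) for _ in range(100)]

print(len(sample), len(bootstrap))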
46. Key Principle For Effective Sampling
▷ The sample must be representative
• Using a sample will work almost as well as using the entire data set
• It should have approximately the same properties as the original set of data
49. Dimensionality Reduction
▷Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data
mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
▷Techniques
• Principal Component Analysis (see the sketch below)
• Singular Value Decomposition
• Others: supervised and non-linear techniques
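A hedged scikit-learn sketch of PCA (an assumption; the lecture does not name a library); the random matrix stands in for real data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # 200 objects, 10 attributes

pca = PCA(n_components=2)        # keep the 2 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (200, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component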
50. Curse of Dimensionality
▷When dimensionality increases, data becomes
increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which is
critical for clustering and outlier detection, become less
meaningful
[Experiment: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points as dimensionality grows; see the sketch below]
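A sketch reproducing this experiment, assuming the distances are Euclidean; as the dimensionality grows, the relative gap between the farthest and closest pairs shrinks:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in [2, 10, 50, 200]:
    points = rng.random((500, dim))   # 500 random points in [0, 1]^dim
    d = pdist(points)                 # all pairwise Euclidean distances
    # relative gap between the farthest and closest pair
    print(dim, (d.max() - d.min()) / d.min())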
52. Feature Subset Selection
▷Another way to reduce dimensionality of data
▷Redundant features
•Duplicate much or all of the information contained in
one or more other attributes
•E.g. purchase price of a product vs. sales tax
▷Irrelevant features
•Contain no information that is useful for the data
mining task at hand
•E.g. students' ID is often irrelevant to the task of
predicting students' GPA
53. Feature Creation
▷Create new attributes that can capture the
important information in a data set much
more efficiently than the original attributes
▷Three general methodologies:
•Feature extraction
→Domain-specific
•Mapping data to new space
•Feature construction
→Combining features
54. Mapping Data to a New Space
▷Fourier transform
▷Wavelet transform
[Figure: two sine waves; two sine waves + noise; and the frequency-domain representation after the transform]
55. Discretization Using Class Labels
▷Entropy based approach
[Figure: entropy-based discretization with 3 categories for both x and y vs. 5 categories for both x and y]
57. Attribute Transformation
▷A function that maps the entire set of values of a
given attribute to a new set of replacement values
• So each old value can be identified with one of the new
values
• Simple functions: x^k, log(x), e^x, |x|
• Standardization and normalization (see the sketch below)
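A small NumPy sketch of these transformations; the values are invented:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

standardized = (x - x.mean()) / x.std()         # zero mean, unit variance
min_max = (x - x.min()) / (x.max() - x.min())   # rescaled into [0, 1]
log_scaled = np.log(x)                          # simple function transform

print(standardized.round(2))
print(min_max.round(2))
print(log_scaled.round(2))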
61. Language Detection
▷ Detect the language (or possible languages) in which a given text is written
▷ Difficulties
• Short messages
• Different languages in one statement
• Noisy content
▷ Examples (see the sketch below):
• 你好 現在幾點鐘 → Traditional Chinese (zh-tw)
• apa kabar sekarang jam berapa ? → Indonesian (id)
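One way to try this is the open-source langdetect package (an assumption; the lecture does not name a tool). Its output is probabilistic, so results may vary slightly:

from langdetect import detect, detect_langs  # pip install langdetect

print(detect("你好 現在幾點鐘"))                        # e.g., 'zh-tw'
print(detect("apa kabar sekarang jam berapa ?"))       # e.g., 'id'
print(detect_langs("apa kabar sekarang jam berapa ?")) # ranked candidate languages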
62. Wrong Detection Examples
▷ Twitter examples (detected language before → after removing noise):
• “@sayidatynet top song #LailaGhofran shokran ya garh new album #listen” : en → id
• “中華隊的服裝挺特別的,好藍。。。#ChineseTaipei #Sochi #2014冬奧” : it → zh-tw
• “授業前の雪合戦w http://t.co/d9b5peaq7J” : en → ja
64. Data Cleaning
▷ Special characters
▷ Utilize regular expressions to clean data (see the sketch below)
• Unicode emoticons: ☺, ♥ …
• Symbol icons: ☏, ✉ …
• Currency symbols: €, £, $ …
• Tweet URLs
• Filter out anything that is not a letter, space, punctuation mark, or digit
Example inputs:
• ◕‿◕ Friendship is everything ♥ ✉ xxxx@gmail.com
• I added a video to a @YouTube playlist http://t.co/ceYX62StGO Jamie Riepe
Patterns:
• (^|\s*)http(\S+)?(\s*|$)
• (\p{L}+)|(\p{Z}+)|(\p{Punct}+)|(\p{Digit}+)
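A hedged Python sketch of the cleaning patterns above. Python's built-in re module lacks \p{...} classes, so this uses the third-party regex module, and the Unicode general categories \p{L}, \p{Z}, \p{P}, \p{N} stand in for the Java-style classes on the slide:

import regex  # pip install regex; supports \p{...} Unicode categories

text = "◕‿◕ Friendship is everything ♥ ✉ xxxx@gmail.com http://t.co/ceYX62StGO"

# drop URLs first
no_urls = regex.sub(r"(^|\s*)http(\S+)?(\s*|$)", " ", text)

# keep only letters, separators, punctuation, and digits
cleaned = "".join(regex.findall(r"\p{L}+|\p{Z}+|\p{P}+|\p{N}+", no_urls))

print(cleaned.strip())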
65. Japanese Examples
▷ Use a regular expression to remove all special characters
• Before: うふふふふ(*^^*)楽しむ!ありがとうございます^o^ アイコン、ラブラブ(-_-)♡
• After: うふふふふ 楽しむ ありがとうございます アイコン ラブラブ
66. Part-of-speech (POS) Tagging
▷Processing text and assigning parts of
speech to each word
▷Twitter POS tagging
•Noun (N), Adjective (A), Verb (V), URL (U)…
Example (before tagging):
Happy Easter! I went to work and came home to an empty house now im going for a quick run http://t.co/Ynp0uFp6oZ
After tagging:
Happy_A Easter_N !_, I_O went_V to_P work_N and_& came_V home_N to_P an_D empty_A house_N now_R im_L going_V for_P a_D quick_A run_N http://t.co/Ynp0uFp6oZ_U
67. Stemming
▷ Examples (original → stemmed; see the sketch below):
• @DirtyDTran gotta be caught up for tomorrow nights episode
→ @DirtyDTran gotta be catch up for tomorrow night episode
• @ASVP_Jaykey for some reasons I found this very amusing
→ @ASVP_Jaykey for some reason I find this very amusing
• RT @kt_biv : @caycelynnn loving and missing you! we are still looking for Lucy
→ loving → love, missing → miss, are → be, looking → look
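A hedged NLTK sketch (an assumption; the lecture does not name a tool) contrasting a rule-based stemmer with a dictionary-based lemmatizer:

from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download('wordnet') may be required before the lemmatizer works

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["loving", "missing", "looking"]
print([stemmer.stem(w) for w in words])                   # rule-based suffix stripping
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # verb lemmas via WordNet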
68. Hashtag Segmentation
▷ By using the Microsoft Web N-Gram Service (or the Viterbi algorithm; see the sketch below)
Example: Wow! explosion at a boston race ... #prayforboston
• #prayforboston → #pray #for #boston
• #citizenscience → #citizen #science
• #bostonmarathon → #boston #marathon
• #goodthingsarecoming → #good #things #are #coming
• #lowbloodpressure → #low #blood #pressure
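A minimal Viterbi-style sketch over cut points; the toy unigram probabilities are invented stand-ins for the Microsoft Web N-Gram Service:

import math

# toy unigram probabilities (a real system would query an n-gram service)
probs = {"pray": 0.02, "for": 0.05, "boston": 0.01,
         "low": 0.02, "blood": 0.01, "pressure": 0.01}

def segment(text):
    # best[i] holds the best (log-probability, word list) for text[:i]
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 12), i):   # cap candidate word length at 12
            word = text[j:i]
            if word in probs:
                score = best[j][0] + math.log(probs[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(text)][1]

print(segment("prayforboston"))     # ['pray', 'for', 'boston']
print(segment("lowbloodpressure"))  # ['low', 'blood', 'pressure']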
69. More Preprocessing for Different Web Data
▷ Extract source code without JavaScript
▷ Remove HTML tags
70. Extract Source Code Without Javascript
▷ JavaScript code should be treated as an exception
• it may contain hidden content
76. Similarity and Dissimilarity
▷Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
▷Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
77. Euclidean Distance
▷ dist = sqrt( sum_{k=1..n} (p_k − q_k)^2 )
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q
▷ Standardization is necessary if scales differ
78. Minkowski Distance
▷ Minkowski Distance is a generalization of Euclidean Distance
dist = ( sum_{k=1..n} |p_k − q_k|^r )^(1/r)
where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q
▷ Note: Minkowski distance is extremely sensitive to the scales of the variables involved (see the sketch below)
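A small sketch of both distances; r and the example points are arbitrary:

import numpy as np

def minkowski(p, q, r):
    # Minkowski distance; r=1 is Manhattan, r=2 is Euclidean
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)) ** r) ** (1.0 / r)

p, q = [0, 2], [2, 0]
print(minkowski(p, q, 1))   # 4.0 (Manhattan)
print(minkowski(p, q, 2))   # ≈ 2.83 (Euclidean, = sqrt(8))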
79. Mahalanobis Distance
▷Mahalanobis distance measure:
• Transforms the variables so that they are uncorrelated
• Makes their variances equal to 1
• Calculates simple Euclidean distance in the transformed space
d(x, y) = sqrt( (x − y)^T S^{-1} (x − y) )
where S is the covariance matrix of the input data (see the sketch below)
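A sketch under the definition above, using NumPy to invert the covariance matrix; the generated data is synthetic:

import numpy as np

def mahalanobis(x, y, data):
    # d(x, y) = sqrt((x - y)^T S^{-1} (x - y)), with S the covariance of the data
    S_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = np.asarray(x) - np.asarray(y)
    return np.sqrt(diff @ S_inv @ diff)

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)
print(mahalanobis([1, 1], [0, 0], data))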
80. Similarity Between Binary Vectors
▷ Common situation is that objects, p and q, have
only binary attributes
▷ Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
▷ Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attributes values
= (M11) / (M01 + M10 + M11)
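A direct translation of the M-count definitions above into Python; p and q are two example binary vectors:

def smc_and_jaccard(p, q):
    # count the four combinations over two binary vectors
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    smc = (m11 + m00) / (m00 + m01 + m10 + m11)
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))   # (0.7, 0.0): matching zeros inflate SMC but not Jaccard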
81. Cosine Similarity
▷ If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d.
▷ Example (verified in the sketch below):
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 5*5 + 2*2)^0.5 = 42^0.5 ≈ 6.481
||d2|| = (1*1 + 1*1 + 2*2)^0.5 = 6^0.5 ≈ 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) ≈ 0.3150
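A quick NumPy check of the example above:

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315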
82. Correlation
▷ Correlation measures the linear relationship between objects
▷ To compute correlation, we standardize data objects, p and q,
and then take their dot product
p'_k = (p_k − mean(p)) / std(p)
q'_k = (q_k − mean(q)) / std(q)
correlation(p, q) = p' · q'
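A sketch of the standardize-then-dot-product recipe; with the population standard deviation, dividing the dot product by n matches NumPy's Pearson correlation (the vectors are invented):

import numpy as np

p = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
q = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

p_std = (p - p.mean()) / p.std()
q_std = (q - q.mean()) / q.std()
print((p_std @ q_std) / len(p))   # dot product of standardized vectors / n
print(np.corrcoef(p, q)[0, 1])    # matches NumPy's Pearson correlation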
83. Using Weights to Combine Similarities
▷May not want to treat all attributes the same.
• Use weights wk which are between 0 and 1 and sum to 1.
84. Density
▷Density-based clustering require a notion of
density
▷Examples:
• Euclidean density
→ Euclidean density = number of points per unit volume
• Probability density
• Graph-based density
86. Data Exploration
▷A preliminary exploration of the data to better
understand its characteristics
▷Key motivations of data exploration include
• Helping to select the right tool for preprocessing or
analysis
• Making use of humans’ abilities to recognize
patterns
• People can recognize patterns not captured by
data analysis tools
87. Summary Statistics
▷Summary statistics are numbers that
summarize properties of the data
•Summarized properties include frequency,
location and spread
→ Examples: location: mean; spread: standard deviation
•Most summary statistics can be calculated in
a single pass through the data
88. Frequency and Mode
▷ Given a set of unordered categorical values
→ Computing the frequency with which each value occurs is the easiest summary
▷ The mode of a categorical attribute
• The attribute value that has the highest frequency
frequency(v_i) = (number of objects with attribute value v_i) / m
where m is the number of objects
89. Percentiles
▷For ordered data, the notion of a percentile is
more useful
▷Given
• An ordinal or continuous attribute x
• A number p between 0 and 100
▷ The pth percentile x_p is a value of x such that
• p% of the observed values of x are less than x_p
90. Measures of Location: Mean and Median
▷The mean is the most common measure of the
location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used
91. Measures of Spread: Range and Variance
▷Range is the difference between the max and min
▷The variance or standard deviation is the most
common measure of the spread of a set of points.
▷However, this is also sensitive to outliers, so that
other measures are often used
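A short sketch pulling the last few slides together (frequency/mode, percentiles, mean vs. median, range and variance); the data values are invented, with one deliberate outlier:

import numpy as np
from collections import Counter

x = np.array([45, 50, 50, 55, 60, 65, 70, 200])   # note the outlier: 200

counts = Counter(x.tolist())
print(counts.most_common(1))        # mode: the most frequent value
print(np.percentile(x, 50))         # 50th percentile = median
print(x.mean(), np.median(x))       # the mean is pulled up by the outlier; the median is not
print(x.max() - x.min(), x.var())   # range and variance, both outlier-sensitive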
92. Visualization
Visualization is the conversion of data into a
visual or tabular format
▷Visualization of data is one of the most
powerful and appealing techniques for data
exploration.
• Humans have a well developed ability to analyze large
amounts of information that is presented visually
• Can detect general patterns and trends
• Can detect outliers and unusual patterns
93. Arrangement
▷ The placement of visual elements within a display
▷ Can make a large difference in how easy it is to understand the data
94. Visualization Techniques: Histograms
▷ Histogram
• Usually shows the distribution of values of a single variable
• Divide the values into bins and show a bar plot of the number of
objects in each bin.
• The height of each bar indicates the number of objects
• Shape of histogram depends on the number of bins
▷ Example: Petal Width (10 and 20 bins, respectively)
95. Visualization Techniques: Box Plots
▷Another way of displaying the distribution of data
• [Figure: the basic parts of a box plot]
97. Visualization Techniques: Contour Plots
▷Contour plots
• Partition the plane into regions of similar values
• The contour lines that form the boundaries of these regions
connect points with equal values
• The most common example is contour maps of elevation
• Can also display temperature, rainfall, air pressure, etc.
[Figure: contour map of Sea Surface Temperature (SST) in Celsius]
100. Visualization Techniques: Star Plots
▷ Similar approach to parallel coordinates
• One axis for each attribute
▷ The size and the shape of the polygon give a visual description of
the attribute values of the object
[Figure: star plots over petal length, sepal length, sepal width, and petal width]
101. Visualization Techniques: Chernoff Faces
▷This approach associates each attribute with a
characteristic of a face
▷The values of each attribute determine the
appearance of the corresponding facial
characteristic
▷Each object becomes a separate face
Data Feature | Facial Feature
Sepal length | Size of face
Sepal width | Forehead/jaw relative arc length
Petal length | Shape of forehead
Petal width | Shape of jaw
102. Do's and Don'ts
▷ Apprehension
• Correctly perceive relations among variables
▷ Clarity
• Visually distinguish all the elements of a graph
▷ Consistency
• Interpret a graph based on similarity to previous graphs
▷ Efficiency
• Portray a possibly complex relation in as simple a way as
possible
▷ Necessity
• The need for the graph, and the graphical elements
▷ Truthfulness
• Determine the true value represented by any graphical
element
103. Data Mining Techniques
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
Many slides provided by Tan, Steinbach, and Kumar for the book “Introduction to Data Mining” are adapted in this presentation
105. Tasks in Data Mining
▷Problems should be well defined at the beginning
▷Two categories of tasks [Fayyad et al., 1996]
Predictive Tasks
• Predict unknown values
• e.g., potential customers
Descriptive Tasks
• Find patterns to describe data
• e.g., Friendship finding
[Figure: customers labeled VIP, Cheap, and Potential]
106. Select Techniques
▷Problems could be further decomposed
Predictive Tasks (supervised learning)
• Classification
• Ranking
• Regression
• …
Descriptive Tasks (unsupervised learning)
• Clustering
• Association rules
• Summarization
• …
107. Supervised vs. Unsupervised Learning
▷ Supervised learning
• Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
• New data is classified based on the training set
▷ Unsupervised learning
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data
108. Classification
▷ Given a collection of records (training set )
• Each record contains a set of attributes
• One of the attributes is the class
▷ Find a model for class attribute:
• The model forms a function of the values of other attributes
▷ Goal: previously unseen records should be assigned a class as
accurately as possible.
• A test set is needed
→ To determine the accuracy of the model
▷Usually, the given data set is divided into training & test
• With training set used to build the model
• With test set used to validate it
109. Ranking
▷ Produce a permutation of the items in a list
• Items ranked in higher positions should be more important
• E.g., ranking webpages in a search engine: webpages in higher positions are more relevant
110. Regression
▷ Find a function that models the data with the least error
• The output might be a numerical value
• E.g., predicting the stock value
111. Clustering
▷Group data into clusters
• Objects within the same cluster are similar to each other
• Objects in different clusters are dissimilar
• No predefined classes (unsupervised classification)
112. Association Rule Mining
▷Basic concept
• Given a set of transactions
• Find rules that will predict the occurrence of an item
• Based on the occurrences of other items in the transaction
113. Summarization
▷Provide a more compact representation of the data
• Data: visualization
• Text: document summarization
→ E.g., snippets
115. Illustrating Classification Task
[Diagram: a learning algorithm performs induction on the training set to learn a model; the model is then applied (deduction) to the test set]
Training set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes
Test set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?
117. Algorithm for Decision Tree Induction
▷ Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
119. The Problem with Decision Trees
▷ A deep, bushy tree can be effectively useless
▷ Decision trees have a hard time with correlated attributes
[Figure: scatter plots illustrating how correlated attributes force deep, bushy trees of rectangular splits]
120. Advantages/Disadvantages of Decision Trees
▷Advantages:
• Easy to understand
• Easy to generate rules
▷Disadvantages:
• May suffer from overfitting.
• Classifies by rectangular partitioning (so does not handle
correlated features very well).
• Can be quite large – pruning is necessary.
• Does not handle streaming data easily
127. Naïve Bayes Classifier
▷ A simplifying assumption: attributes are conditionally independent
given the class, and each data sample has n attributes
▷ No dependence relation between attributes
▷ By Bayes' theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
and with the independence assumption:
P(X|Ci) = prod_{k=1..n} P(x_k|Ci)
▷ As P(X) is constant for all classes, assign X to the class with
maximum P(X|Ci) * P(Ci) (see the sketch below)
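A minimal sketch of this decision rule for categorical attributes; the toy records and attribute values are invented for illustration (and no smoothing is applied):

from collections import Counter, defaultdict

# toy training data: (attribute tuple, class)
train = [(("sunny", "hot"), "no"), (("sunny", "cool"), "yes"),
         (("rainy", "hot"), "no"), (("rainy", "cool"), "yes"),
         (("sunny", "cool"), "yes")]

class_counts = Counter(c for _, c in train)
# attr_counts[class][attribute index][value] = count
attr_counts = defaultdict(lambda: defaultdict(Counter))
for x, c in train:
    for k, v in enumerate(x):
        attr_counts[c][k][v] += 1

def classify(x):
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(train)                 # P(Ci)
        for k, v in enumerate(x):
            score *= attr_counts[c][k][v] / n_c  # P(x_k | Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(("sunny", "hot")))   # 'no' on this toy data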
128. Naïve Bayesian Classifier: Comments
▷ Advantages :
• Easy to implement
• Good results obtained in most of the cases
▷ Disadvantages
• Assumption: class conditional independence
• Practically, dependencies exist among variables
→ E.g., hospital patients: profile attributes such as age and family history
→ E.g., symptoms such as fever and cough; diseases such as lung cancer and diabetes
• Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
▷ How to deal with these dependencies?
• Bayesian Belief Networks
129. Bayesian Networks
▷ A Bayesian belief network allows a subset of the variables to be
conditionally independent
▷ A graphical model of causal relationships
• Represents dependencies among the variables
• Gives a specification of the joint probability distribution
130. Bayesian Belief Network: An Example
[Network: Family History and Smoker are parents of LungCancer; Smoker is a parent of Emphysema; LungCancer points to PositiveXRay; LungCancer and Emphysema point to Dyspnea]
The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:
 | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S)
LC | 0.8 | 0.5 | 0.7 | 0.1
~LC | 0.2 | 0.5 | 0.3 | 0.9
The joint probability factorizes as
P(z_1, …, z_n) = prod_{i=1..n} P(z_i | Parents(Z_i))
131. Neural Networks
▷Artificial neuron
• Each input is multiplied by a weighting factor.
• Output is 1 if sum of weighted inputs exceeds a threshold
value; 0 otherwise
▷Network is programmed by adjusting weights using
feedback from examples
132. General Structure
[Diagram: an input vector x_i enters the input nodes, flows through weighted connections w_ij to hidden nodes, and on to output nodes that produce the output vector]
Net input to unit j: I_j = sum_i w_ij * O_i + theta_j
Sigmoid activation: O_j = 1 / (1 + e^(−I_j))
Error at an output unit: Err_j = O_j (1 − O_j) (T_j − O_j)
Error at a hidden unit: Err_j = O_j (1 − O_j) * sum_k Err_k * w_jk
Weight update: w_ij = w_ij + (l) * Err_j * O_i
Bias update: theta_j = theta_j + (l) * Err_j
133. Network Training
▷The ultimate objective of training
• Obtain a set of weights that makes almost all the tuples in
the training data classified correctly
▷Steps
• Initialize weights with random values
• Feed the input tuples into the network one by one
• For each unit
→ Compute the net input to the unit as a linear combination of
all the inputs to the unit
→ Compute the output value using the activation function
→ Compute the error
→ Update the weights and the bias
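A minimal sketch of these steps for a single sigmoid unit, using the update rules from the General Structure slide; the toy OR data, learning rate, and epoch count are assumptions:

import math, random

random.seed(1)
# toy data: learn OR with one sigmoid unit
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w = [random.uniform(-0.5, 0.5) for _ in range(2)]   # initialize weights randomly
theta = random.uniform(-0.5, 0.5)                   # bias
l = 0.5                                             # learning rate

for epoch in range(2000):
    for x, t in data:                               # feed tuples in one by one
        I = sum(wi * xi for wi, xi in zip(w, x)) + theta   # net input
        O = 1.0 / (1.0 + math.exp(-I))                     # sigmoid activation
        err = O * (1 - O) * (t - O)                        # error at the output unit
        w = [wi + l * err * xi for wi, xi in zip(w, x)]    # weight update
        theta += l * err                                   # bias update

print([round(1 / (1 + math.exp(-(w[0] * x0 + w[1] * x1 + theta))), 2)
       for (x0, x1), _ in data])   # outputs approach 0, 1, 1, 1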
134. Summary of Neural Networks
▷Advantages
• Prediction accuracy is generally high
• Robust, works when training examples contain errors
• Fast evaluation of the learned target function
▷Criticism
• Long training time
• Difficult to understand the learned function (weights)
• Not easy to incorporate domain knowledge
135. The k-Nearest Neighbor Algorithm
▷ All instances correspond to points in an n-D space
▷ The nearest neighbors are defined in terms of Euclidean distance
▷ The target function could be discrete- or real-valued
▷ For discrete-valued functions, k-NN returns the most common
value among the k training examples nearest to x_q (see the sketch below)
[Figure: a query point x_q surrounded by + and − training examples]
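A small sketch of the majority-vote rule described above; the points and labels are invented:

from collections import Counter
import math

def knn_classify(query, examples, k=3):
    # majority vote among the k nearest training examples (Euclidean distance)
    neighbors = sorted(examples, key=lambda e: math.dist(query, e[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

examples = [((1, 1), "+"), ((1, 2), "+"), ((2, 2), "+"),
            ((6, 6), "-"), ((7, 7), "-"), ((6, 7), "-")]
print(knn_classify((2, 1), examples))   # '+'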
136. Discussion on the k-NN Algorithm
▷Distance-weighted nearest neighbor algorithm
• Weight the contribution of each of the k neighbors
according to their distance to the query point xq
→ Giving greater weight to closer neighbors
▷Curse of dimensionality: distance between
neighbors could be dominated by irrelevant
attributes.
• To overcome it, eliminate the least relevant attributes
140. Strong Rules & Interestingness
▷ Corr(A, B) = P(A ∩ B) / (P(A) P(B))
• Corr(A, B) = 1: A and B are independent
• Corr(A, B) < 1: the occurrence of A is negatively correlated with B
• Corr(A, B) > 1: the occurrence of A is positively correlated with B
▷ E.g., out of 10,000 transactions, 6,000 contain games, 7,500 contain videos, and 4,000 contain both (see the sketch below):
Corr(games, videos) = 0.4 / (0.6 * 0.75) ≈ 0.89
• In fact, games and videos are negatively associated
→ the purchase of one actually decreases the likelihood of purchasing the other
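The computation in this example, spelled out:

n = 10_000      # total transactions
games = 6_000   # transactions containing games
videos = 7_500  # transactions containing videos
both = 4_000    # transactions containing both

corr = (both / n) / ((games / n) * (videos / n))
print(round(corr, 2))   # 0.89 < 1, so the items are negatively correlated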
142. Good Clustering
▷Good clustering (produce high quality clusters)
• Intra-cluster similarity is high
• Inter-cluster similarity is low
▷Quality factors
• Similarity measure and its implementation
• Definition and representation of cluster chosen
• Clustering algorithm
148. Partitioning Algorithms: Basic Concept
▷Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
• Global optimal: exhaustively enumerate all partitions.
• Heuristic methods.
→ k-means: each cluster is represented by the center of the
cluster
→ k-medoids or PAM (Partition Around Medoids) : each
cluster is represented by one of the objects in the cluster.
149. K-Means Clustering Algorithm
▷ Algorithm (see the sketch below):
• Randomly initialize k cluster means
• Iterate:
→ Assign each object to the nearest cluster mean
→ Recompute the cluster means
• Stop when the clustering converges
[Figure: example run with K = 4]
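A plain NumPy sketch of these steps; the synthetic blobs are invented for illustration (and the sketch ignores the rare empty-cluster case):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # plain k-means: assign points to the nearest mean, then recompute the means
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial means
    for _ in range(iters):
        # assign each object to its nearest cluster mean
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):           # converged
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in [(0, 0), (3, 3), (0, 3), (3, 0)]])
centers, labels = kmeans(X, k=4)
print(centers.round(2))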
160. Density-Based Clustering
▷Clustering based on density (local cluster criterion),
such as density-connected points
▷ Each cluster has a considerably higher density of points than
the area outside of the cluster
161. Density-Based Clustering Methods
▷Major features:
• Discover clusters of arbitrary shape
• Handle noise
• One scan
• Need density parameters as termination condition
▷Approaches
• DBSCAN (KDD’96)
• OPTICS (SIGMOD’99).
• DENCLUE (KDD’98)
• CLIQUE (SIGMOD’98)
167. Evaluation
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
Many slides provided by Tan, Steinbach, and Kumar for the book “Introduction to Data Mining” are adapted in this presentation
168. Tasks in Data Mining
▷Problems should be well defined at the beginning
▷Two categories of tasks [Fayyad et al., 1996]
Predictive Tasks
• Predict unknown values
• e.g., potential customers
Descriptive Tasks
• Find patterns to describe data
• e.g., Friendship finding
[Figure: customers labeled VIP, Cheap, and Potential]
197. Case Studies
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
Many slides provided by Tan, Steinbach, and Kumar for the book “Introduction to Data Mining” are adapted in this presentation