2. âą Introduction
âą What is Data Mining?
âą Challenges and Trends
âą The origin of DM
âą Data Mining Tasks
âą Types of Data
âą Data Quality
âą Takeaways
Agenda
3.
4. What is Data Mining?
Definition:
ïŹ
Data mining is the process of extracting patterns from
data.- Wikipedia
ïŹ
Process of semi-automatically analyzing large databases
to find patterns that are:
ïŹ
valid: hold on new data with some certainty
ïŹ
novel: non-obvious to the system
ïŹ
useful: should be possible to act on the item
ïŹ
understandable: humans should be able to interpret the pattern
Other Definitions:
ïŹ
Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
ïŹ
Exploration & analysis, by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns
5. What is Data Mining? (Other Definitions)
ïŹ
Extracting or âminingâ knowledge from large amounts of
data
ïŹ
Data -driven discovery and modeling of hidden patterns
(we never new existed) in large volumes of data
ïŹ
Extraction of implicit, previously unknown and
unexpected, potentially extremely useful information
from data
6. What is Data Mining?
What is Data Mining?
â Certain names are more
prevalent in certain US
locations (OâBrien, OâRurke,
OâReilly⊠in Boston area)
â Group together similar
documents returned by
search engine according to
their context (e.g. Amazon
rainforest, Amazon.com,)
What is not Data
Mining?
â Look up phone
number in phone
directory
â Query a Web
search engine for
information about
âAmazonâ
7. What is Data Mining?
Moore's Law:
ïŹ
The law is named after Intel co-founder Gordon E. Moore, who described the
trend in his 1965 paper. The paper noted that number of components in
integrated circuits had doubled every year from the invention of the integrated
circuit in 1958 until 1965 and predicted that the trend would continue "for at least
ten years"
ïŹ
The number of transistors that can be placed inexpensively on an integrated
circuit has doubled approximately every two years. The trend has continued for
more than half a century and is not expected to stop until 2015 or later.
9. What is Data Mining? History
ïŹ
Data Analysis?
ïŹ
Data Warehouse?
ïŹ
Data Mining and Statistics
â Standard Deviation
ïŹ
Data Mining and AI and Machine Learning
â Identify Possible Heart Attack cases
1980's1980's
1990's1990's
2000's2000's
2010's2010's
12. Data Mining Challenges and Trends
ïź Computationally expensive to investigate all
possibilities
ïź Dealing with noise/missing information and
errors in data
ïź Choosing appropriate attributes/input
representation
ïź Finding the minimal attribute space
ïź Finding adequate evaluation function(s)
ïź Extracting meaningful information
ïź Not overfitting
13. Predictive AnalysisPredictive Analysis
Presentation Exploration Discovery
Passive
Interactive
Proactive
Role of Software
Business
Insight
Canned reporting
Ad-hoc reporting
Online Analytical
Processing
Data mining
Data Mining Challenges and Trends
14. Data Mining Tasks
ïŹ
Prediction Methods
â Use some variables to predict unknown or
future values of other variables.
ïŹ
Description Methods
â Find human-interpretable patterns that
describe the data.
16. Data Mining Tasks
ïź Concept/Class description: Characterization
and discrimination
ïź Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
ïź Association (correlation and causality)
ïź Multi-dimensional or single-dimensional association
age(X, â20-29â) ^ income(X, â60-90Kâ) ï buys(X, âTVâ)
17. Data Mining Tasks
ïź Classification and Prediction
ïź Finding models (functions) that describe and
distinguish classes or concepts for future prediction
ïź Example: classify countries based on climate, or
classify cars based on gas mileage
ïź Presentation:
ïź If-THEN rules, decision-tree, classification rule,
neural network
ïź Prediction: Predict some unknown or missing
numerical values
18. Data Mining Tasks
ïŹ
Cluster analysis
â Class label is unknown: Group data to form
new classes,
ïŹ
Example: cluster houses to find distribution
patterns
â Clustering based on the principle:
maximizing the intra-class similarity and
minimizing the interclass similarity
19. Data Mining Tasks
ïź Outlier analysis
ïź Outlier: a data object that does not comply with the
general behavior of the data
ïź Mostly considered as noise or exception, but is
quite useful in fraud detection, rare events analysis
ïź Trend and evolution analysis
ïź Trend and deviation: regression analysis
ïź Sequential pattern mining, periodicity analysis
21. What is Data?
ïŹ
Collection of data objects and
their attributes
ïŹ
An attribute is a property or
characteristic of an object
â Examples: eye color of a
person, temperature, etc.
â Attribute is also known as
variable, field, characteristic,
or feature
ïŹ
A collection of attributes
describe an object
â Object is also known as
record, point, case, sample,
entity, or instance
Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Attributes
Objects
22. Attribute Values
ïŹ
Attribute values are numbers or symbols assigned to an
attribute
ïŹ
Distinction between attributes and attribute values
â Same attribute can be mapped to different attribute values
ïŹ
Example: height can be measured in feet or meters
â Different attributes can be mapped to the same set of
values
ïŹ
Example: Attribute values for ID and age are integers
ïŹ
But properties of attribute values can be different
â ID has no limit but age has a maximum and
minimum value
23. Types of Attribute
ïŹ
There are different types of attributes
â Nominal
ï” Examples: ID numbers, eye color, zip codes
â Ordinal
ï” Examples: rankings (e.g., taste of potato chips on
a scale from 1-10), grades, height in {tall, medium,
short}
â Interval
ï” Examples: calendar dates, temperatures in Celsius
or Fahrenheit.
â Ratio
ï” Examples: temperature in Kelvin, length, time,
counts
24. Properties of Attribute Values
ïŹ
The type of an attribute depends on which of
the following properties it possesses:
â Distinctness: = â
â Order: < >
â Addition: + -
â Multiplication: * /
â Nominal attribute: distinctness
â Ordinal attribute: distinctness & order
â Interval attribute: distinctness, order & addition
â Ratio attribute: all 4 properties
25. Attribute
Type
Description Examples Operations
Nominal The values of a nominal attribute
are just different names, i.e.,
nominal attributes provide only
enough information to distinguish
one object from another. (=, â )
zip codes, employee
ID numbers, eye
color, sex: {male,
female}
mode, entropy,
contingency
correlation, Ï2
test
Ordinal The values of an ordinal attribute
provide enough information to
order objects. (<, >)
hardness of
minerals, {good,
better, best},
grades, street
numbers
median,
percentiles, rank
correlation, run
tests, sign tests
Interval For interval attributes, the
differences between values are
meaningful, i.e., a unit of
measurement exists.
(+, - )
calendar dates,
temperature in
Celsius or
Fahrenheit
mean, standard
deviation,
Pearson's
correlation, t and
F tests
Ratio For ratio variables, both
differences and ratios are
meaningful. (*, /)
temperature in
Kelvin, monetary
quantities, counts,
age, mass, length,
electrical current
geometric mean,
harmonic mean,
percent variation
26. Discreet and Continuous Attributes
ïŹ
Discrete Attribute
â Has only a finite or countably infinite set of values
â Examples: zip codes, counts, or the set of words in a collection
of documents
â Often represented as integer variables.
â Note: binary attributes are a special case of discrete attributes
ïŹ
Continuous Attribute
â Has real numbers as attribute values
â Examples: temperature, height, or weight.
â Practically, real values can only be measured and represented
using a finite number of digits.
â Continuous attributes are typically represented as floating-point
variables.
27. Attribute Values
ïŹ
Attribute values are numbers or symbols assigned to an
attribute
ïŹ
Distinction between attributes and attribute values
â Same attribute can be mapped to different attribute values
ïŹ
Example: height can be measured in feet or meters
â Different attributes can be mapped to the same set of
values
ïŹ
Example: Attribute values for ID and age are integers
ïŹ
But properties of attribute values can be different
â ID has no limit but age has a maximum and
minimum value
28. Types of Record Sets
Record
â Data Matrix
â Document Data
â Transaction Data
Graph
â World Wide Web
â Molecular Structures
Ordered
â Spatial Data
â Temporal Data
â Sequential Data
â Genetic Sequence Data
29. Record Sets
ïŹ
Data that consists of a collection of
records, each of which consists of a fixed
set of attributes
Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
30. Data Matrics
ïŹ
If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
ïŹ
Such data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
31. Document Data
ïŹ
Each document becomes a `term' vector,
â each term is a component (attribute) of the
vector,
â the value of each component is the number of
times the corresponding term occurs in the
document.
32. Transaction Data
ïŹ
A special type of record data, where
â each record (transaction) involves a set of
items.
â For example, consider a grocery store. The
set of products purchased by a customer
during one shopping trip constitute a
transaction, while the individual products that
were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
38. Data Quality
ïŹ
What kinds of data quality problems?
ïŹ
How can we detect problems with the
data?
ïŹ
What can we do about these problems?
ïŹ
Examples of data quality problems:
â Noise and outliers
â missing values
â duplicate data
39. Noise
ïŹ
Noise refers to modification of original
values
â Examples: distortion of a personâs voice when
talking on a poor phone and âsnowâ on
television screen
Two Sine Waves Two Sine Waves + Noise
40. Outliers
ïŹ
Outliers are data objects with
characteristics that are considerably
different than most of the other data
objects in the data set
41. Missing Values
ïŹ
Reasons for missing values
â Information is not collected
(e.g., people decline to give their age and
weight)
â Attributes may not be applicable to all cases
(e.g., annual income is not applicable to
children)
ïŹ
Handling missing values
â Eliminate Data Objects
â Estimate Missing Values
â Ignore the Missing Value During Analysis
â Replace with all possible values (weighted by
their probabilities)
42. Duplicate Data
ïŹ
Data set may include data objects that are
duplicates, or almost duplicates of one
another
â Major issue when merging data from
heterogeous sources
ïŹ
Examples:
â Same person with multiple email addresses
ïŹ
Data cleaning
â Process of dealing with duplicate data issues
43. Six Rules of Data Quality
1. Data that is not used cannot be correct for very long
2. Data Quality in an information system is a function of
its use, not its collection
3.Data quality will ultimately be no better than its most
stringent use
4. Data quality problems tend to become worse with
the age of the system
5. Less likely it is that some data element will change,
more traumatic it will be when it finally does change.
6. Information overload affects data quality
44. Takeaways
What is Data Mining?
Challenges and Trends
The origin of DM
Data Mining Tasks
Data Types
Data Quality