3 Data Mining Tasks

Introduction to Data Mining
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan yonis
Development of computer systems
2016
Chapter 1 – Lecture 3

Outline
 Definition of Data Mining
 Data Mining as an Interdisciplinary field
 Process of Data Mining
 Data Mining Tasks
 Challenges of Data Mining
 Data mining application examples
 Introduction to RapidMiner

Data Mining Tasks
 Data mining tasks are the kinds of patterns that can be mined from data.
 Data mining functionalities specify which kinds of patterns a data mining task should find.

 In general, data mining tasks fall into two categories:
 Descriptive mining tasks characterize the general properties of the data.
 Predictive mining tasks perform inference on the current data in order to make predictions.

 The best-known data mining tasks:
 Classification [Predictive]
 Prediction [Predictive]
 Association Rules [Descriptive]
 Clustering [Descriptive]
 Outlier Analysis [Descriptive]

Classification
 Classification is a predictive mining task.
 The input data for predictive modeling consists of two types of variables:
 Explanatory variables, which describe the essential properties of the data.
 Target variables, whose values are to be predicted.
 Classification is used to predict the value of a discrete target variable.
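
As a rough illustration of predicting a discrete target from explanatory variables, the following is a minimal sketch of a 1-nearest-neighbor classifier. The records, the variable names (age, income) and the choice of nearest-neighbor are assumptions made for this example; they do not come from the lecture.

```python
import math

# Hypothetical training records: (age, income in thousands) -> class label.
# Age and income are the explanatory variables; the label is the discrete target.
training = [
    ((25, 20), "non-purchaser"),
    ((30, 25), "non-purchaser"),
    ((45, 60), "purchaser"),
    ((50, 80), "purchaser"),
]

def classify(record):
    """Return the label of the training record closest to `record` (1-nearest neighbor)."""
    nearest = min(training, key=lambda item: math.dist(item[0], record))
    return nearest[1]

print(classify((48, 70)))  # purchaser
print(classify((27, 22)))  # non-purchaser
```

Any other classifier (decision tree, naive Bayes, and so on) fills the same role: it maps explanatory values to one of a fixed set of classes.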

Prediction
 Similar to classification, except that the target is a numeric value (e.g. amount of purchase) rather than a class (e.g. purchaser or non-purchaser).
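
To contrast with classification, here is a small sketch that predicts a numeric amount of purchase with a least-squares line. The data (number of store visits and amounts) is invented for illustration, and simple linear regression is only one possible prediction technique.

```python
# Hypothetical data: number of store visits -> amount of purchase.
visits = [1, 2, 3, 4, 5]
amounts = [15.0, 25.0, 35.0, 45.0, 55.0]

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(visits, amounts)

def predict(x):
    """Predict the amount of purchase for x visits."""
    return slope * x + intercept

print(predict(6))  # 65.0
```

The output is a number on a continuous scale, not one of a fixed set of class labels.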

Association
 Association rule mining aims to discover relationships among variables in a database, resulting in different types of rules.
 It seeks to produce a set of rules describing the features that are strongly related to each other.

Original data from a research study on heart disease:
Gender  Age  Smoker  LAD%  RCA%
F       52   Y       85    100
M       62   N       80    0
M       75   Y       70    80
M       73   Y       40    99
M       66   N       50    45
…       …    …       …     …
 LAD%: The percentage of heart disease caused by the left anterior descending coronary artery.
 RCA%: The percentage of heart disease caused by the right coronary artery.

Medical Association Rules
No.  Rule
1    Gender=M ∩ Age≥70 ∩ Smoker=Y ⇒ RCA%≥50 (40%, 100%)
2    Gender=F ∩ Age<70 ∩ Smoker=Y ⇒ LAD%≥70 (20%, 100%)
 Rule 1 indicates: 40% of the cases are male, aged 70 or over, and smokers; for these cases, the probability that RCA%≥50 is 100%.
 Rule 2 indicates: 20% of the cases are female, under 70 years old, and smokers; for these cases, the probability that LAD%≥70 is 100%.
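
The two percentages attached to each rule can be checked against the five cases listed in the heart-disease table. The sketch below assumes the first number is the share of cases matching the left-hand side of the rule (as the rule descriptions read) and the second is the share of those cases that also satisfy the right-hand side. The function and field names are hypothetical, and the full data set has more rows than the five shown.

```python
# The five cases listed in the heart-disease table (the full data set is larger).
cases = [
    {"gender": "F", "age": 52, "smoker": "Y", "lad": 85, "rca": 100},
    {"gender": "M", "age": 62, "smoker": "N", "lad": 80, "rca": 0},
    {"gender": "M", "age": 75, "smoker": "Y", "lad": 70, "rca": 80},
    {"gender": "M", "age": 73, "smoker": "Y", "lad": 40, "rca": 99},
    {"gender": "M", "age": 66, "smoker": "N", "lad": 50, "rca": 45},
]

def evaluate_rule(antecedent, consequent):
    """Return (coverage, confidence) for the rule antecedent => consequent."""
    matching = [c for c in cases if antecedent(c)]
    if not matching:
        return 0.0, 0.0
    coverage = len(matching) / len(cases)
    confidence = sum(consequent(c) for c in matching) / len(matching)
    return coverage, confidence

rule1 = evaluate_rule(
    lambda c: c["gender"] == "M" and c["age"] >= 70 and c["smoker"] == "Y",
    lambda c: c["rca"] >= 50,
)
rule2 = evaluate_rule(
    lambda c: c["gender"] == "F" and c["age"] < 70 and c["smoker"] == "Y",
    lambda c: c["lad"] >= 70,
)
print(rule1)  # (0.4, 1.0)
print(rule2)  # (0.2, 1.0)
```

Many texts define support over both sides of the rule; when the confidence is 100%, as here, that gives the same value as the coverage computed above.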

Clustering
 Finds groups of data points (clusters) such that points in the same cluster are more similar to each other than to points in other clusters.
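
One common way to form such groups is k-means, sketched below under stated assumptions: the points and the two starting centroids are made up, and k-means is chosen only as a familiar example, not as the method the lecture prescribes.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep the old one if empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters)
# [[(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 8), (8, 9)]]
```

Similarity here is Euclidean distance; other data types call for other similarity measures.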

Document Clustering:
 Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
 Approach: To identify frequently occurring terms in each
document. Form a similarity measure based on the frequencies
of different terms. Use it to cluster.
 Gain: Information Retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
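
The approach above can be sketched as follows: count term frequencies per document, compare documents with cosine similarity, and group documents that are similar enough. The sample documents, the threshold value and the greedy grouping step are assumptions for illustration only.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased term occurs in the text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents.
documents = {
    "d1": "stock market prices fall",
    "d2": "market prices rise as stock demand grows",
    "d3": "football team wins the match",
}
vectors = {name: term_frequencies(text) for name, text in documents.items()}

THRESHOLD = 0.3  # hypothetical similarity cut-off
clusters = []
for name in documents:
    # Greedy grouping: join the first cluster whose first member is similar enough.
    for cluster in clusters:
        if cosine_similarity(vectors[name], vectors[cluster[0]]) >= THRESHOLD:
            cluster.append(name)
            break
    else:
        clusters.append([name])

print(clusters)  # [['d1', 'd2'], ['d3']]
```

A new document or search term can be related to the clusters in the same way, by computing its similarity to a representative of each cluster.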

Outlier Analysis
 Discovers data points that are significantly different from the rest of the data. Such points are known as anomalies or outliers.
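
A simple way to flag such points is a z-score test, sketched below. The values and the threshold of 2 standard deviations are assumptions for this example; other outlier criteria are equally valid.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical measurements with one value far from the rest.
values = [10, 12, 11, 13, 12, 11, 95]
print(find_outliers(values))  # [95]
```

With very small samples a single extreme value inflates the standard deviation, so the threshold has to be chosen with the sample size in mind.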

Challenges of Data Mining

Scalability: Scalable techniques are needed to handle the massive scale of data.
Dimensionality: Many applications involve a large number of dimensions (e.g. features or attributes of the data).

Heterogeneous and Complex Data: In recent years, complicated data types such as graph-based, free-text, and structured data have been introduced. Data mining techniques must be able to handle this heterogeneity.

Data Quality: Many data sets are imperfect due to the presence of missing values and noise in the data. Robust data mining algorithms must be developed to handle these imperfections.

Data Distribution: As the volume of data increases, it is no longer possible or safe to keep all the data in the same place. As a result, the need for distributed data mining techniques has increased over the years.

Privacy Preservation: While privacy aims to prevent the disclosure of information, data mining attempts to reveal interesting knowledge about data. As a result, there is growing interest in developing privacy-preserving data mining algorithms.

Data Mining Applications
Science: astronomy, bioinformatics, drug discovery, …
Business: advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-commerce, targeted marketing, health care, …
Web: search engines, web mining, …
Government: law enforcement, profiling tax cheaters, …
