This slide deck was presented by Anne Regina at the Seminar & Workshop "Pengenalan & Potensi Big Data & Machine Learning" (Introduction & Potential of Big Data & Machine Learning), held by KUDIO on 14 May 2016.
1. Combining Data Mining and Machine Learning for Effective User Profiling
Saturday, 14 May 2016
2. Wealth of data/information, Lack of knowledge
Databases are growing larger and larger
• Terabytes of data ("terror-bytes"!)
A deluge of data, containing a lot of hidden information
• new knowledge
What are the technological motivations?
• Technologies to collect data
• Bar code readers, scanners, cameras, etc.
• Technologies to store data
• Databases, data warehouses, other repositories
• Network (Web) as computing and storage platform
An example of data deluge:
• the WEB and SOCIAL MEDIA!
3. Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
• Web data, e-commerce
• Purchases at department/grocery stores
• Bank/Credit Card transactions
Competitive Pressure is Strong
• Use Data Mining to provide better, customized services for an edge
(e.g. in Customer Relationship Management)
4. Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous
speeds (GB/hour)
• remote sensors on a satellite
• telescopes scanning the skies
• microarrays generating gene expression data
• scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists:
• in classifying and segmenting data
• in hypothesis formation
5. What is Data Mining?
Data mining (Many Definitions)
Exploration & analysis, by automatic or semi-automatic means, of large
quantities of data in order to discover meaningful patterns
Data mining: a misnomer?
It should really be pattern mining, by analogy with gold mining (named for the gold extracted, not the rock mined)
Alternative names:
Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
6. Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics,
database systems, HPC
• Traditional Techniques may be unsuitable due to
1. Enormity of data
2. High dimensionality of data
3. Heterogeneous, distributed nature of data
[Venn diagram: Data Mining at the intersection of Machine Learning / Pattern Recognition, Statistics / AI, Database Systems, and High Performance Computing]
7. KDD is a process
[Diagram: Databases → Data Integration → Data Warehouse → Cleansing / Selection / Transformation → Task-relevant Data → Data Mining → Pattern Interpretation / Evaluation]
– Data mining is the core of the KDD process
8. Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
1. Object-oriented and object-relational databases
2. Spatial databases
3. Time-series data and temporal data
4. Text databases and multimedia databases
5. Heterogeneous and legacy databases
6. WWW
9. Web Mining applies DM to WWW
Data Mining
• Often applied to structured databases
Web mining
• Applied to less structured, dynamic data of huge size
• Not only Web content, but also hyperlinks and access logs
11. Why?
Data gathered from both the web and more conventional sources can
be used to answer such questions as:
• Marketing - those likely to buy.
• Forecasts - predicting demand.
• Loyalty - those likely to defect.
• Credit - which items/customers are profitable.
• Fraud - when and where they occur.
12. Related Terms
Data mining: discovery and communication of meaningful patterns in data; the process of discovering patterns in large datasets using methods from AI, machine learning, statistics and database systems.
Predictive analytics: techniques from statistics, machine learning and data mining, in conjunction with historical and current data, to make predictions about the future.
14. Machine Learning
[Diagram: an underlying process maps inputs x to outputs y; a machine learning algorithm builds a model that approximates the underlying process]
“Using data to understand an underlying process”
15. [Diagram: samples {x1, x2, …} from the underlying process feed a machine learning algorithm, which outputs a model that approximates the underlying process]
“Using data to understand an underlying process”
16. [Diagram: Data set 1 → machine learning algorithm → Model 1; Data set 2 → machine learning algorithm → Model 2]
The created model depends on the data values used for training.
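The point above can be sketched in a few lines of Python (the process, sample sizes and noise level are all invented for illustration): the same underlying process y ≈ 2x is sampled twice, and each sample yields a slightly different fitted model.

```python
import random

def fit_slope(points):
    """Fit y = a*x through the origin by least squares: a = sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx

def sample(n):
    """Draw n noisy observations from the (hypothetical) underlying process y = 2x + noise."""
    return [(x, 2 * x + random.gauss(0, 0.5))
            for x in [random.uniform(1, 10) for _ in range(n)]]

random.seed(1)
model_1 = fit_slope(sample(20))  # trained on data set 1
model_2 = fit_slope(sample(20))  # trained on data set 2
print(model_1, model_2)  # both close to 2, but not identical
```

Both models approximate the same process, yet they differ, because each saw different training data.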
17. Why build a model?
[Scatter plot: data points ('o', with a few points marked 'x') over time]
• Predict
– A continuous value
– A category label
• Find clusters in data
• Identify key predictors
• …
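As one concrete instance of "find clusters in data", here is a minimal 1-D k-means sketch in plain Python (the data set and the number of clusters are made up for the example):

```python
def kmeans_1d(values, k, iters=20):
    """Minimal 1-D k-means: assign each value to its nearest centre, then recompute centres."""
    step = max(1, len(values) // k)
    centres = sorted(values)[::step][:k]  # spread the initial centres across the sorted data
    for _ in range(iters):
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # a cluster keeps its old centre if it happens to receive no points
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return sorted(centres)

data = [1.0, 1.2, 0.8, 5.1, 4.9, 5.3, 9.8, 10.1, 10.0]
print(kmeans_1d(data, 3))  # three centres, near 1, 5 and 10
```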
22. Training phase
– The machine learning algorithm learns from data
– Output is a trained model
– Time consuming
– Typically involves multiple iterations over the training data
Testing or scoring phase
– The trained model is used in conjunction with new data inputs to estimate the corresponding outputs
– Much quicker than training
[Diagram: Training data → machine learning algorithm → trained model; new data input → trained model → corresponding data output]
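The two phases can be sketched with a tiny ordinary-least-squares model (the data set is invented for illustration): training computes the model once, and scoring merely applies it to new inputs.

```python
def train(pairs):
    """Training phase: fit y = a*x + b by ordinary least squares (closed form)."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def score(model, x_new):
    """Scoring phase: just apply the trained model to a new input."""
    a, b = model
    return a * x_new + b

training_data = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.1)]
model = train(training_data)  # the (relatively) slow phase
print(score(model, 6))        # the fast phase: estimate the output for a new input
```

With a real algorithm the training phase would loop over the data many times; scoring stays a single cheap evaluation either way.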
23. Algorithms
Linear
– OLS regression
Generalized linear
– Logistic regression, GAMs
Rule based
– Decision trees
Kernel-based
– Support vector machines
White box
– Regression family, decision tree family
Black box
– Neural networks
Parametric
– Regression family
Non-parametric
– Support vector machines, rule-based fuzzy systems
Ensemble based
– Random forest, AdaBoost
Supervised
– Decision trees, logistic regression
Unsupervised
– K-means clustering, hierarchical clustering
Generative
– Naïve Bayes, mixture of Gaussians
Discriminative
– Support vector machines, logistic regression, decision trees
Classification
– Decision trees, logistic regression
Regression (predicting a continuous value)
– OLS regression
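As a concrete member of the rule-based, supervised family, here is a one-rule "decision stump" learner in plain Python (the toy data is invented): it scans candidate thresholds and keeps the one with the fewest training errors.

```python
def best_stump(examples):
    """Learn a decision stump: classify as 1 when x >= t; pick the t with fewest errors."""
    best_t, best_errors = None, None
    for t, _ in sorted(examples):
        errors = sum((x >= t) != bool(label) for x, label in examples)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

# Toy supervised data: the label flips to 1 somewhere between x = 4 and x = 6.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1)]
threshold, errors = best_stump(data)
print(threshold, errors)  # threshold 6, 0 training errors
```

A decision tree is essentially a hierarchy of such rules; boosting (e.g. AdaBoost) combines many stumps into an ensemble.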
25. Which algorithm should I use?
• Objective of the analysis
– Prediction of a continuous value
– Classification
– Identifying key predictors
• Data type and distribution
• Computational complexity of the algorithm
• Data volume
Editor's notes
Big Data
1. The increasing ‘datafication’ of the world, which means we generate new data at frightening rates.
2. Our increasing ability to harness and analyse large and complex sets of data
Activity Data: Simple activities like listening to music or reading a book are now generating data. Digital music players and eBooks collect data on our activities. Your smart phone collects data on how you use it and your web browser collects information on
what you are searching for. Your credit card company collects data on where you shop and your shop collects data on what you buy. It is hard to imagine any activity that does not generate data.
Conversation Data: Our conversations are now digitally recorded. It all started with emails but nowadays most of our conversations leave a digital trail. Just think of all the conversations we have on
social media sites like Facebook or Twitter. Even many of our phone conversations are now digitally recorded.
Photo and Video Image Data: Just think about all the pictures we take on our smart phones or digital cameras. We upload and share hundreds of thousands of them on social media sites every second. The growing number of CCTV cameras captures video images, and every minute we upload hundreds of hours of video to YouTube and other sites.
Sensor Data: We are increasingly surrounded by sensors that collect and share data. Take your smart phone: it contains a global positioning sensor to track exactly where you are every second of the day, and an accelerometer to track the speed and direction at which you are travelling. We now have sensors in many devices and products.
The Internet of Things Data: We now have smart TVs that are able to collect and process data, we have smart watches, smart fridges, and smart alarms. The Internet of Things, or Internet of Everything connects these devices so that the traffic sensors on the road send data to your alarm clock which will wake you up earlier than planned because the blocked road means you have to leave earlier to make your 9am meeting…
1. Learning the application domain:
– relevant prior knowledge and goals of the application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of the effort!)
4. Data reduction and transformation:
– find useful features, dimensionality/variable reduction, invariant representation
5. Choosing the functions of data mining:
– summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation:
– visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge
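The steps above can be sketched end to end as a tiny pipeline in Python. Everything here (the transactions, the support threshold) is invented for illustration, and the mining step is a bare-bones frequent-pair count standing in for a real association-mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Made-up toy transactions (the empty string stands for a dirty record).
raw = [["milk", "bread", ""], ["milk", "bread", "butter"],
       ["bread", "butter"], ["milk", "bread"]]

def clean(transactions):
    """Cleaning step: drop empty items."""
    return [[item for item in t if item] for t in transactions]

def mine_pairs(transactions):
    """Mining step: count co-occurring item pairs across transactions."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(set(t)), 2))
    return counts

def evaluate(counts, min_support):
    """Evaluation step: keep only patterns frequent enough to be interesting."""
    return {pair: n for pair, n in counts.items() if n >= min_support}

patterns = evaluate(mine_pairs(clean(raw)), min_support=3)
print(patterns)  # {('bread', 'milk'): 3}
```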
Content mining seeks to uncover the objects and resources within a site, while structure mining reveals the inter- and intra-connectivity of its web pages. Usage mining analyzes web server logs to track the activities of users as they traverse a site.
A web site is often the first point of contact between a potential customer and a company. It is therefore essential that the process of browsing/using the web site is made as simple and pleasurable as possible for the customer. Carefully designed web pages play a major part here and can be enhanced through information relating to web access. The progress of the customer is monitored by the web server log, which holds details of every web page visited.
Information is available from:
• Registration forms, these are very useful and the customers should be persuaded to fill out at least one. Useful information such as age, sex and location can be obtained.
• Server log, this provides details of each web page visited and timings.
However, the main advantage of mining web server logs relates to sales and marketing. Sites like Amazon hold an individual customer's previous product searches and past purchases with which to target that particular individual.
• Past purchases and previous search patterns, useful for personalization of web pages.
• Cookies, these reside on the customer's hard drive and enable details between sessions to be recorded.
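Usage mining of a server log can be sketched in a few lines of Python. The lines below follow the Common Log Format, but the hosts, users and pages are invented for illustration:

```python
import re
from collections import Counter

# Hypothetical log lines in Common Log Format.
log = '''\
10.0.0.1 - alice [14/May/2016:10:02:11 +0000] "GET /products HTTP/1.1" 200 512
10.0.0.1 - alice [14/May/2016:10:02:40 +0000] "GET /products/42 HTTP/1.1" 200 734
10.0.0.2 - bob [14/May/2016:10:03:02 +0000] "GET /products HTTP/1.1" 200 512
'''

# Fields: host, ident, user, [timestamp], "method path ..."
pattern = re.compile(r'^(\S+) \S+ (\S+) \[([^\]]+)\] "(\S+) (\S+)')
visits = Counter()
for line in log.splitlines():
    m = pattern.match(line)
    if m:
        host, user, when, method, page = m.groups()
        visits[(user, page)] += 1

print(visits.most_common())  # page visits per (user, page)
```

From such counts a site can derive per-customer navigation patterns, which is the raw material for the personalization described above.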