This slide deck was presented by Anne Regina at the Seminar & Workshop "Pengenalan & Potensi Big Data & Machine Learning" (Introduction & Potential of Big Data & Machine Learning), held by KUDIO on 14 May 2016.
1. Combining Data Mining and Machine Learning for Effective User Profiling
Saturday, 14 May 2016
2. Wealth of data/information, Lack of knowledge
Databases are growing larger and larger
• Terabytes of data ("terror-bytes"!)
A deluge of data, containing a lot of hidden information
• new knowledge
What are the technological motivations?
• Technologies to collect data
• Bar code readers, scanners, cameras, etc.
• Technologies to store data
• Databases, data warehouses, other repositories
• Network (Web) as computing and storage platform
An example of data deluge:
• the WEB and SOCIAL MEDIA!
3. Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
• Web data, e-commerce
• Purchases at department/grocery stores
• Bank/Credit Card transactions
Competitive Pressure is Strong
• Use Data Mining to provide better, customized services for an edge
(e.g. in Customer Relationship Management)
4. Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous
speeds (GB/hour)
• remote sensors on a satellite
• telescopes scanning the skies
• microarrays generating gene expression data
• scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists:
• in classifying and segmenting data
• in hypothesis formation
5. What is Data Mining?
Data mining (Many Definitions)
Exploration & analysis, by automatic or semi-automatic means, of large
quantities of data in order to discover meaningful patterns
Data mining: a misnomer?
It should really be pattern mining, by analogy with gold mining (named for the gold extracted, not the rock mined)
Alternative names:
Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
6. Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics,
database systems, HPC
• Traditional Techniques may be unsuitable due to
1. Enormity of data
2. High dimensionality of data
3. Heterogeneous, distributed nature of data
[Venn diagram: Data Mining at the intersection of Machine Learning / Pattern Recognition, Statistics / AI, Database Systems, and High Performance Computing]
7. KDD is a process
[Diagram: Databases → Data Integration → Data Warehouse → Cleansing / Selection / Transformation → Task-relevant Data → Data Mining → Pattern Interpretation / Evaluation]
– Data mining is the core of the KDD process
8. Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
1. Object-oriented and object-relational databases
2. Spatial databases
3. Time-series data and temporal data
4. Text databases and multimedia databases
5. Heterogeneous and legacy databases
6. WWW
9. Web Mining applies DM to WWW
Data Mining
• Often applied to structured databases
Web mining
• Applied to less structured, dynamic data of huge size
• Not only Web content, but also hyperlinks and access logs
11. Why?
Data gathered from both the web and more conventional sources can
be used to answer such questions as:
• Marketing - those likely to buy.
• Forecasts - predicting demand.
• Loyalty - those likely to defect.
• Credit - which items/customers are profitable.
• Fraud - when and where they occur.
12. Related Terms
Data mining: discovery and communication of meaningful patterns in data; the process of discovering patterns in large datasets using methods from AI, machine learning, statistics and database systems.
Predictive analytics: techniques from statistics, machine learning and data mining, in conjunction with historical and current data, to make predictions about the future.
14. Machine Learning
[Diagram: an underlying process maps inputs x to outputs y; a machine learning algorithm builds a model that approximates the underlying process]
“Using data to understand an underlying process”
15. [Diagram: samples {x1, x2, …} from the underlying process feed a machine learning algorithm, which outputs a model that approximates the underlying process]
“Using data to understand an underlying process”
16. [Diagram: Data set 1 → machine learning algorithm → Model 1; Data set 2 → machine learning algorithm → Model 2]
The created model depends on the data values used for training.
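The point above can be sketched in a few lines of Python (the process, sample sizes and noise level are all invented for illustration): the same underlying process y ≈ 2x is sampled twice, and each sample yields a slightly different fitted model.

```python
import random

def fit_slope(points):
    """Fit y = a*x through the origin by least squares: a = sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx

def sample(n):
    """Draw n noisy observations from the (hypothetical) underlying process y = 2x + noise."""
    return [(x, 2 * x + random.gauss(0, 0.5))
            for x in [random.uniform(1, 10) for _ in range(n)]]

random.seed(1)
model_1 = fit_slope(sample(20))  # trained on data set 1
model_2 = fit_slope(sample(20))  # trained on data set 2
print(model_1, model_2)  # both close to 2, but not identical
```

Both models approximate the same process, yet they differ, because each saw different training data.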
17. Why build a model?
[Scatter plot: data points ('o', with a few points marked 'x') over time]
• Predict
– A continuous value
– A category label
• Find clusters in data
• Identify key predictors
• …
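As one concrete instance of "find clusters in data", here is a minimal 1-D k-means sketch in plain Python (the data set and the number of clusters are made up for the example):

```python
def kmeans_1d(values, k, iters=20):
    """Minimal 1-D k-means: assign each value to its nearest centre, then recompute centres."""
    step = max(1, len(values) // k)
    centres = sorted(values)[::step][:k]  # spread the initial centres across the sorted data
    for _ in range(iters):
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # a cluster keeps its old centre if it happens to receive no points
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return sorted(centres)

data = [1.0, 1.2, 0.8, 5.1, 4.9, 5.3, 9.8, 10.1, 10.0]
print(kmeans_1d(data, 3))  # three centres, near 1, 5 and 10
```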
22. Training phase
– The machine learning algorithm learns from data
– Output is a trained model
– Time consuming
– Typically involves multiple iterations over the training data
Testing or scoring phase
– The trained model is used in conjunction with new data inputs to estimate the corresponding outputs
– Much quicker than training
[Diagram: Training data → machine learning algorithm → trained model; new data input → trained model → corresponding data output]
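The two phases can be sketched with a tiny ordinary-least-squares model (the data set is invented for illustration): training computes the model once, and scoring merely applies it to new inputs.

```python
def train(pairs):
    """Training phase: fit y = a*x + b by ordinary least squares (closed form)."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def score(model, x_new):
    """Scoring phase: just apply the trained model to a new input."""
    a, b = model
    return a * x_new + b

training_data = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.1)]
model = train(training_data)  # the (relatively) slow phase
print(score(model, 6))        # the fast phase: estimate the output for a new input
```

With a real algorithm the training phase would loop over the data many times; scoring stays a single cheap evaluation either way.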
23. Algorithms
Linear
– OLS regression
Generalized linear
– Logistic regression, GAMs
Rule based
– Decision trees
Kernel-based
– Support vector machines
White box
– Regression family, decision tree family
Black box
– Neural networks
Parametric
– Regression family
Non-parametric
– Support vector machines, rule-based fuzzy systems
Ensemble based
– Random forest, AdaBoost
Supervised
– Decision trees, logistic regression
Unsupervised
– K-means clustering, hierarchical clustering
Generative
– Naïve Bayes, mixture of Gaussians
Discriminative
– Support vector machines, logistic regression, decision trees
Classification
– Decision trees, logistic regression
Regression (predicting a continuous value)
– OLS regression
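As a concrete member of the rule-based, supervised family, here is a one-rule "decision stump" learner in plain Python (the toy data is invented): it scans candidate thresholds and keeps the one with the fewest training errors.

```python
def best_stump(examples):
    """Learn a decision stump: classify as 1 when x >= t; pick the t with fewest errors."""
    best_t, best_errors = None, None
    for t, _ in sorted(examples):
        errors = sum((x >= t) != bool(label) for x, label in examples)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

# Toy supervised data: the label flips to 1 somewhere between x = 4 and x = 6.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1)]
threshold, errors = best_stump(data)
print(threshold, errors)  # threshold 6, 0 training errors
```

A decision tree is essentially a hierarchy of such rules; boosting (e.g. AdaBoost) combines many stumps into an ensemble.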
25. Which algorithm should I use?
• Objective of the analysis
– Prediction of a continuous value
– Classification
– Identifying key predictors
• Data type and distribution
• Computational complexity of the algorithm
• Data volume
Editor's notes
Big Data
1. The increasing ‘datafication’ of the world, which means we generate new data at frightening rates.
2. Our increasing ability to harness and analyse large and complex sets of data
Activity Data: Simple activities like listening to music or reading a book are now generating data. Digital music players and eBooks collect data on our activities. Your smart phone collects data on how you use it and your web browser collects information on
what you are searching for. Your credit card company collects data on where you shop and your shop collects data on what you buy. It is hard to imagine any activity that does not generate data.
Conversation Data: Our conversations are now digitally recorded. It all started with emails but nowadays most of our conversations leave a digital trail. Just think of all the conversations we have on
social media sites like Facebook or Twitter. Even many of our phone conversations are now digitally recorded.
Photo and Video Image Data: Just think about all the pictures we take on our smart phones or digital cameras. We upload and share hundreds of thousands of them on social media sites every second. The growing number of CCTV cameras captures video images, and every minute we upload hundreds of hours of video to YouTube and other sites.
Sensor Data: We are increasingly surrounded by sensors that collect and share data. Take your smart phone: it contains a global positioning sensor to track exactly where you are every second of the day, and an accelerometer to track the speed and direction at which you are travelling. We now have sensors in many devices and products.
The Internet of Things Data: We now have smart TVs that are able to collect and process data, we have smart watches, smart fridges, and smart alarms. The Internet of Things, or Internet of Everything connects these devices so that the traffic sensors on the road send data to your alarm clock which will wake you up earlier than planned because the blocked road means you have to leave earlier to make your 9am meeting…
1. Learning the application domain:
– relevant prior knowledge and goals of the application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of the effort!)
4. Data reduction and transformation:
– find useful features, dimensionality/variable reduction, invariant representation
5. Choosing the functions of data mining:
– summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation:
– visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge
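The steps above can be sketched end to end as a tiny pipeline in Python. Everything here (the transactions, the support threshold) is invented for illustration, and the mining step is a bare-bones frequent-pair count standing in for a real association-mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Made-up toy transactions (the empty string stands for a dirty record).
raw = [["milk", "bread", ""], ["milk", "bread", "butter"],
       ["bread", "butter"], ["milk", "bread"]]

def clean(transactions):
    """Cleaning step: drop empty items."""
    return [[item for item in t if item] for t in transactions]

def mine_pairs(transactions):
    """Mining step: count co-occurring item pairs across transactions."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(set(t)), 2))
    return counts

def evaluate(counts, min_support):
    """Evaluation step: keep only patterns frequent enough to be interesting."""
    return {pair: n for pair, n in counts.items() if n >= min_support}

patterns = evaluate(mine_pairs(clean(raw)), min_support=3)
print(patterns)  # {('bread', 'milk'): 3}
```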
Content mining seeks to uncover the objects and resources within a site, while structure mining reveals the inter- and intra-connectivity of its web pages. Usage mining analyzes web server logs to track the activities of users as they traverse a site.
A web site is often the first point of contact between a potential customer and a company. It is therefore essential that the process of browsing/using the web site is made as simple and pleasurable as possible for the customer. Carefully designed web pages play a major part here and can be enhanced through information relating to web access. The progress of the customer is monitored by the web server log, which holds details of every web page visited.
Information is available from:
• Registration forms, these are very useful and the customers should be persuaded to fill out at least one. Useful information such as age, sex and location can be obtained.
• Server log, this provides details of each web page visited and timings.
However, the main advantage of mining web server logs relates to sales and marketing. Sites like Amazon hold an individual customer's previous product searches and past purchases with which to target that particular individual.
• Past purchases and previous search patterns, useful for personalization of web pages.
• Cookies, these reside on the customer's hard drive and enable details between sessions to be recorded.
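Usage mining of a server log can be sketched in a few lines of Python. The lines below follow the Common Log Format, but the hosts, users and pages are invented for illustration:

```python
import re
from collections import Counter

# Hypothetical log lines in Common Log Format.
log = '''\
10.0.0.1 - alice [14/May/2016:10:02:11 +0000] "GET /products HTTP/1.1" 200 512
10.0.0.1 - alice [14/May/2016:10:02:40 +0000] "GET /products/42 HTTP/1.1" 200 734
10.0.0.2 - bob [14/May/2016:10:03:02 +0000] "GET /products HTTP/1.1" 200 512
'''

# Fields: host, ident, user, [timestamp], "method path ..."
pattern = re.compile(r'^(\S+) \S+ (\S+) \[([^\]]+)\] "(\S+) (\S+)')
visits = Counter()
for line in log.splitlines():
    m = pattern.match(line)
    if m:
        host, user, when, method, page = m.groups()
        visits[(user, page)] += 1

print(visits.most_common())  # page visits per (user, page)
```

From such counts a site can derive per-customer navigation patterns, which is the raw material for the personalization described above.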