| © Copyright 2015 Hitachi Consulting1
Data Mining
The big picture!
Khalid M. Salama, Ph.D.
Microsoft Business Intelligence
Hitachi Consulting UK
We Make it Happen. Better.
Outline
Context
Data Mining Tasks, Techniques, and Applications
Knowledge Discovery Process
Screenshots
Concluding Remarks
Business Intelligence as a Context
Business Intelligence - “A broad category of concepts, methods, tools and
techniques of collecting, storing, managing, analysing and sharing data to
support/improve decision making”.
 Data Mining is a subset of these concepts, methods, tools and techniques that
is concerned with automatically extracting hidden, useful patterns from the data.
 Examples:
−CRM: Customer Segmentation, Profiling, etc.
−Finance, Banking & Insurance: Fraud Detection, Credit Scoring, Stock Market, etc.
−Medicine/Health Care: Disease Development, Diagnosis, Best Treatments, etc.
−Telecommunication: Churn Analysis, Network Fault Isolation, etc.
−Retail: Cross-selling, Targeted Marketing, Propensity Modelling, etc.
revealing the mystery…
Terms and Significance
 Data Mining – “An interdisciplinary subfield of computer science, which is the computational
process of discovering patterns in datasets” – “Knowledge Discovery in Databases (KDD)”
 Data Science – “the extraction of knowledge from volumes of data, which is a continuation of
the field data mining and predictive analytics”
 Machine Learning – “A subfield of computer science that evolved from the study of pattern
recognition and computational learning theory”
 Predictive Analytics – “A variety of statistical techniques from modelling, machine learning,
and data mining that analyse current and historical facts to make predictions about the future”
 Big Data – “A broad term for data sets so large or complex that traditional data processing
applications are inadequate”
bringing order to buzzword chaos
Data Mining
… in a nutshell
[Diagram: Data Mining at the centre, surrounded by Machine Learning, Statistics, Artificial Intelligence, Databases, and Other Technologies]
“Data mining, an interdisciplinary subfield of
computer science, is the computational
process of discovering patterns in large data
sets involving methods at the intersection of
artificial intelligence, machine learning,
statistics, and database systems.”
Other Related Technologies:
 Visualization
 Big Data
 High Performance Computing
 Cloud Computing
 Others..
Knowledge Discovery in Databases (KDD)
…or data science, if you like!
[Diagram: Cross Industry Standard Process for Data Mining (CRISP-DM), a cycle around the Data: Understanding the Business → Understanding the Data → Preparing the Data → Modelling → Evaluation and Interpretation → Deployment]
Data Mining Taxonomy
A 10,000 foot view…
Learning Paradigms (e.g. Supervised Learning)
Mining Tasks (e.g. Classification)
Modelling Techniques (e.g. Decision Trees)
Measures (e.g. Information Gain)
Heuristic Search Methods (e.g. Greedy Recursive Partitioning)
Learning Paradigms
Data as the teacher, machine as the student…
Supervised Learning
• Labelled data = data + output (predictable, target, response, class) variable
• Learn the relationship between data and output
Unsupervised Learning
• Unlabelled data
• Learn associations, similarities, groups, etc.
Semi-supervised Learning
• Partially labelled data
Online/Active Learning
• Real-time learning on data streams
Reinforcement Learning
• Game theory, control theory, simulation-based optimization, operations research, robotics, etc.
Data Mining Task
…only the genuine ones!
• Classification (Supervised) – Predicting the class of a given case
• Regression (Supervised) – Estimating the value of a response variable
• Clustering (Unsupervised) – Partitioning the cases into similar groups
• Association Rules Analysis (Unsupervised) – Finding frequent (co-)occurring items
• Similarity Analysis (Both) – Finding similar cases of a given case
• Probabilistic Inference (Both) – Calculating the probability of variables
• Time Series Analysis (Supervised) – Forecasting future values
Important Terms:
• Learning Paradigms:
− Supervised
− Unsupervised
− Semi-supervised
− Others (Reinforcement
learning, Active, etc.)
• Analytics Types:
− Descriptive (Exploratory)
− Predictive
− Prescriptive (Decisive)
Application Fields:
• Text Mining
• Information Retrieval
• (Social) Web Mining
• Speech Recognition
• Image Recognition
• Anomaly Detection
• State Transition Analysis
• Collaborative Filtering
(Recommender systems)
Classification Learning
my favourite data mining task!
Data Mining Task:
• Classification
• Regression
• Clustering
• Association Rules Analysis
• Similarity Analysis
• Probabilistic Inference
• Time Series Analysis
Target Class Type
• Binary vs. Multi-class
• Multi-label
• Hierarchical Class
Classification Applications:
• Targeted Advertising
• Churn Analysis
• Fraud Detection
• OCR
• Sentiment Analysis
• Predictive Maintenance
• Document Classification
• Protein Function Prediction
• Medical Support Systems
 Input: Labelled cases (nominal labels).
 Process: Learn the relationships between the input variables and the target class.
 Output: A model that is used to predict the class of unlabelled cases (+ probability).
Labelled cases (Training Set) → Classification Algorithm → Model (Classifier)
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      no     Don’t
sunny     hot          high      yes    Don’t
overcast  hot          high      no     OK
rain      mild         high      no     OK
rain      cool         normal    no     OK
rain      cool         normal    yes    Don’t
overcast  cool         normal    yes    OK
sunny     mild         normal    no     Don’t
sunny     cool         normal    no     OK
rain      mild         normal    no     OK
sunny     mild         normal    yes    OK
overcast  mild         high      yes    OK
overcast  hot          normal    no     OK
rain      mild         high      yes    Don’t
Unlabelled (new) Case → Model (Classifier) → predicted class (e.g. OK)
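As a rough illustration of how a decision-tree learner might pick its first split on this training set, the sketch below computes the information gain of each attribute. The rows are copied from the table above; the function and variable names are illustrative assumptions, not part of the original deck.

```python
from collections import Counter
from math import log2

# Training set from the slide: (outlook, temperature, humidity, windy, class).
ROWS = [
    ("sunny", "hot", "high", "no", "Don't"),
    ("sunny", "hot", "high", "yes", "Don't"),
    ("overcast", "hot", "high", "no", "OK"),
    ("rain", "mild", "high", "no", "OK"),
    ("rain", "cool", "normal", "no", "OK"),
    ("rain", "cool", "normal", "yes", "Don't"),
    ("overcast", "cool", "normal", "yes", "OK"),
    ("sunny", "mild", "normal", "no", "Don't"),
    ("sunny", "cool", "normal", "no", "OK"),
    ("rain", "mild", "normal", "no", "OK"),
    ("sunny", "mild", "normal", "yes", "OK"),
    ("overcast", "mild", "high", "yes", "OK"),
    ("overcast", "hot", "normal", "no", "OK"),
    ("rain", "mild", "high", "yes", "Don't"),
]
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

A greedy recursive partitioning procedure would split on the best attribute and repeat the same calculation within each resulting subset.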
Classification Learning
classification modelling techniques
Classification Techniques:
• Decision Trees
• Classification Rules
• Linear Discriminant Analysis
• Artificial Neural Networks
• Instance-based Learning
• Probabilistic Graphical Models
• Support Vector Machines
• Gaussian Process
• Ensemble Methods
Advanced Classification Tasks:
• Multi-label Classification
• Hierarchical Classification
 Decision Trees
 Forests/ Jungles
 Classification Rules
 Ordered List/ Unordered Set
 Linear Discriminant Analysis
 Logistic Regression
 Artificial Neural Networks
 Feed-forward Multilayer perceptron
 Instance-based Learning
 Nearest-neighbours classifiers
 Probabilistic Graphical Models
 Bayesian Network Classifiers
 Support Vector Machines
 Kernel Methods
 Gaussian Process
 Non-parametric Methods
 Ensemble Methods
 Bagging/ Boosting/ Stacking
IF .. AND .. AND .. THEN A
ELSE IF .. AND .. THEN C
ELSE IF .. AND .. THEN B
..
ELSE C
Regression Analysis
 Input: cases with a numerical target variable (response value).
 Process: Learn the relationship $y = f(\mathbf{X})$.
 Output: A regression model that is used to estimate the target value of new cases (+ confidence intervals).
Linear Regression
 Simple Linear Regression: $y = ax + b$
 Multi-variate Linear Regression: $y = a_1 x_1 + a_2 x_2 + \cdots + a_m x_m + b$
 Generalized Linear Model (Binomial, Poisson, Chi-square, Gaussian, etc.)
Non-linear Regression
 Non-linear Transformation: $y = a_1 \log x_1 + a_2 x_2^3 + b$
 Multi-variate Adaptive Regression Splines (MARS)
 Regression Trees (Hierarchical Regression)
 Artificial Neural Networks
 Gaussian Process
Related Concepts
 Parameter Estimation: Least Squares Error, Weighted LSE, etc.
 Regularization: least absolute shrinkage and selection operator (LASSO) - Ridge
 Model Selection (e.g. stepwise, Information Criterion, …)
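To make the parameter-estimation step concrete, here is a minimal sketch of least-squares fitting for the simple linear model y = ax + b, using the standard closed-form solution. The function name and the sample points are assumptions chosen for illustration.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates of slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Illustrative points lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_simple_linear(xs, ys)
print(a, b)
```

Regularised variants such as ridge or LASSO add a penalty on the coefficients to this objective; they are not shown here.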
the most classical ML task
Regression Applications:
• Credit Scoring
• Survival Analysis
• Risk Estimation
• Value Evaluation
Regression Techniques:
• Simple vs. Multi-variate
• Generalized LM
• Local Models - Splines
• Trees - ANN - GP
Related Concepts:
• Parameter Estimation
• Regularization
• Model Selection
Cluster Analysis
 Input: cases without a specific target class.
 Process: find groups where the “within” distance is minimized and the “between” distance is maximized.
 Output: case-cluster assignment (membership).
Clustering Techniques
 Exclusive vs. Overlapping
 K-Means vs. Fuzzy K-Means, EM
 Partitioned vs. Hierarchical
 K-Means vs. Agglomerative/Divisive
 Center-based vs. Density-based
 K-Means vs. DBScan
 Complete vs. Partial.
Clusters Quality
 Minimize intra-distance/linkage (Cohesion)
 Maximize inter-distance/linkage (Separation)
 Number of Clusters
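The sketch below shows one way K-Means, the exclusive, partitioned, centre-based technique named above, could be written. It assumes the number of clusters and the initial centroids are supplied by the caller; the sample points are invented for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid (cohesion).
        new_assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids


# Two visibly separated groups (hypothetical data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignment, centroids = k_means(points, centroids=[points[0], points[3]])
print(assignment, centroids)
```

The result depends on the initial centroids and on the chosen number of clusters, which is why the slide lists "Number of Clusters" as a quality concern.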
… rather a means to an end
Clustering Applications:
• Customer Segmentation
• Outlier Detection
• Topic Grouping
• Profiling
• Summarisation
• Mixture of Models
Clustering Techniques
• Exclusive vs. Overlapping
• Partitioned vs. Hierarchical
• Center-based vs. Density-based
• Complete vs. Partial.
Association Rule Analysis
 Input: cases without a specific target class (or basket data).
 Process: Find frequent co-occurrences between variable values (items).
 Output: Frequent Item sets/ Association Rules.
 Frequent Item sets: {a}, {b}, {d}, {a,b}, {a,d}
 Association Rule: IF {a,b} THEN {d,e}
Approach
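 Define Constraints (Data Space/ Rule Space)
 Frequent Item Set Generation (min. support threshold): $\mathrm{support}(a, b) = \frac{|a, b|}{|T|}$, where $|a, b|$ is the number of transactions containing both items and $|T|$ is the total number of transactions
 Rule Generation (min. confidence threshold): $\mathrm{confidence}(a \rightarrow b) = \frac{|a, b|}{|a|}$
 Prune and proceed to larger item sets (adjusted thresholds)
 Rank the rules based on an interestingness measure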
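The support and confidence definitions can be sketched in a few lines. The baskets below are only the three transactions shown on this slide (the elided rows are not reconstructed), and the helper names are illustrative assumptions.

```python
from itertools import combinations

# The three transactions listed on the slide.
BASKETS = {
    "T1": {"a", "c", "d"},
    "T2": {"a", "d"},
    "T3": {"b", "e"},
}


def support(itemset, baskets):
    """Fraction of transactions that contain every item in the item set."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def frequent_itemsets(baskets, size, min_support):
    """Brute-force enumeration of item sets of a given size above the threshold."""
    items = sorted(set().union(*baskets.values()))
    return [set(c) for c in combinations(items, size) if support(c, baskets) >= min_support]


print(support({"a", "d"}, BASKETS))
print(confidence({"a"}, {"d"}, BASKETS))
print(frequent_itemsets(BASKETS, 2, 0.5))
```

Brute-force enumeration is only workable for tiny examples; practical miners prune candidates using the minimum support threshold, as the approach above describes.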
discovery of “interesting” relationships
Asso. Rules Applications:
• Market Basket Analysis
• Text Mining - Sentiment Analysis
• Graph/Link Analysis
Rule Measures:
• Support & Confidence
• Interestingness
− Lift & Chi-Squared
− Jaccard & Kulczynski
− Kappa & Conviction
Related Issues:
• Negative Item sets
• Quantitative Items
• Sequential Patterns
• Item Sets Compression
• Redundancy-Aware Patterns
• Colossal Item Sets & Scalability
Id   a    b    c    d    e
T1   yes  no   yes  yes  no
T2   yes  no   no   yes  no
T3   no   yes  no   no   yes
..   ..   ..   ..   ..   ..
Basket Data
T1 → {a,c,d}
T2 → {a,d}
T3 → {b,e}
…
Similarity Analysis
a.k.a. instance-based learning
Similarity Matching Applications:
• Case-based Reasoning
• Lazy Classification
• Record Matching
• Outlier Detection
• Search Engines
Attribute Proximity Measures:
• Edit-based – Levenshtein and Jaro-Winkler distance.
• Token-based – Jaccard, Shannon, and Cosine Similarity.
• Sequence-based – Longest Common Subsequence.
• Phonetic-based – Soundex and Metaphone.
• Numeric-based – Euclidean distance.
          Att-1   Att-2   …   Att-m
Case i    V(i,1)  V(i,2)  …   V(i,m)
Case j    V(j,1)  V(j,2)  …   V(j,m)
Weights   W1      W2      …   Wm
$\mathrm{Similarity}(i, j) = W_1 \cdot \mathrm{Sim}(V_{i,1}, V_{j,1}) + W_2 \cdot \mathrm{Sim}(V_{i,2}, V_{j,2}) + \cdots + W_m \cdot \mathrm{Sim}(V_{i,m}, V_{j,m})$
 Input: A set of (labelled/ unlabeled) cases + subject case.
 Process: find a set of similar cases to the subject case.
 Output: similar cases (nearest neighbors).
Proximity Measure
 Distance vs. Similarity
Weighting
 User Input vs. Automatic Optimisation
Neighbours
 Distance-based (Threshold) vs. Top K
Classification / Regression
 Voting / Average
 Weighted Voting / Weighted Average – Kernel Methods (Gaussian Kernel)
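A minimal sketch of weighted similarity matching with top-K neighbour selection follows. The per-attribute measure (exact match for categorical values, 1/(1 + |u − v|) for numbers) is an assumption made for illustration, as are the sample cases and the user-supplied weights.

```python
def attribute_similarity(u, v):
    """Hypothetical per-attribute measure: numeric closeness or exact categorical match."""
    if isinstance(u, (int, float)) and isinstance(v, (int, float)):
        return 1.0 / (1.0 + abs(u - v))
    return 1.0 if u == v else 0.0


def similarity(case_i, case_j, weights):
    """Weighted sum of attribute similarities between two cases."""
    return sum(w * attribute_similarity(u, v) for u, v, w in zip(case_i, case_j, weights))


def top_k_neighbours(subject, cases, weights, k):
    """Return the k cases most similar to the subject case."""
    ranked = sorted(cases, key=lambda c: similarity(subject, c, weights), reverse=True)
    return ranked[:k]


# Invented cases: (outlook, temperature in degrees).
cases = [("sunny", 30), ("rain", 18), ("sunny", 21)]
subject = ("sunny", 29)
weights = (0.5, 0.5)  # user-supplied weights, one per attribute
neighbours = top_k_neighbours(subject, cases, weights, k=2)
print(neighbours)
```

For classification, the labels of the returned neighbours would then be combined by (weighted) voting; for regression, by a (weighted) average.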
Probability Estimation and Inference
 Input: A set of (labelled/ unlabeled) cases.
 Process: learn the structure/parameters of the variable dependency relationships
 Output: A Probabilistic Graphical Model
Probabilistic Graphical Models
 Directed Acyclic Graphs
 Bayesian Networks (classifiers)
 Dynamic Bayesian Networks
 Markov Blankets
 Directed Cyclic Graphs
 Markov Chains
 (Hidden) Markov Models
 Undirected Graphs
 Factor Graphs
 Dependency Networks
 Markov Random fields
Learning
 Structure (variable-dependency relationships)
 Parameters (quantification of the relationships)
Inference
 Exact inference and the junction tree
 MCMC
 Variational methods and EM
the doctrine of chances…
Probabilistic Inference
Applications:
• ML Framework
• Diagnostic Systems
• State Transition Analysis
Probabilistic Graphical Models:
• Directed Acyclic Graphs
− Bayesian Networks
− Markov Blankets
• Directed Cyclic Graphs
− Markov Chains
− Markov Models
• Undirected Graphs
− Factor Graphs
− Dependency Networks
− Markov Random fields
Time Series Analysis
 Input: a sequence of evenly-spaced numerical data.
 Process: learn a function that describes the current value with respect to the previous ones.
 Output: Time Series Model (describe/forecast).
Components:
 Trend: Overall upward, downward, or stationary pattern.
 Cyclical: Repeating upwards or downwards movements.
 Seasonal: Regular pattern of up & down fluctuations.
 Irregular: Unsystematic, ‘residual’ fluctuations (random).
Techniques:
 Regression.
 (Weighted) Moving Average.
 Exponential Smoothing.
 Auto-regressive (ARMA, ARIMA, etc.) and decomposition (STL).
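Two of the simpler techniques listed, the moving average and exponential smoothing, can be sketched as follows. The window size, the smoothing factor alpha and the sample series are illustrative assumptions.

```python
def moving_average(series, window):
    """Unweighted mean of each run of `window` consecutive values."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


print(moving_average([1, 2, 3, 4, 5], window=3))
print(exponential_smoothing([10, 20, 30], alpha=0.5))
```

A larger alpha makes the smoothed series react faster to recent values; a larger window makes the moving average smoother but slower to respond.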
history tends to repeat itself…
Time Series Applications:
• Stock Market
• Supply/Demand
• Financial Applications
• Signal Processing
Time Series Components
• Trend
• Cyclical
• Seasonal
• Random
Techniques
• Regression
• Moving Average
• Exponential Smoothing
• Auto-regressive
Knowledge Discovery in Databases (KDD)
the virtuous cycle of data science
CRISP-DM Process:
• Understanding the Business
• Understanding the Data
• Preparing the Data
• Modelling
• Evaluation & Interpretation
• Deployment
Step 1 - Understanding the Business
Ways to answer “Data Analysis” questions:
 Query/Report – “How many new customers bought my service this month? How many renewed? How many left?”
 Complex Query/Report – “What are the top-selling products by region in online sales? How does that compare to store sales?” (Multi-dimensional Analysis/Visualisation)
 Calculations/KPIs – “Is my business going well? Are we meeting our targets?”
 What-if Analysis – “Based on last year's sales, what will the revenues be if we increase the price of product X by 1% and decrease the price of product Y by 2%?” (budgeting/planning)
 Statistical Analysis – “What are the most important factors that impact energy consumption in our facilities?” (dependency/correlation)
 Hypothesis Testing – “Is there a significant improvement among the group of people who took the new drug, compared to the placebo group?” (experimental studies/market research)
 Data Mining – “Which customers are most likely to respond to our new advertising campaign?” (predictive analytics)
“The formulation of a problem is often more essential than its solution” - Albert Einstein
Analytics Techniques:
• Database Query
• Multi-dimensional
Analysis/Visualisation
• Calculations/KPIs
• What-if Analysis
• Statistical Analysis
• Hypothesis Testing
• Data Mining
Step 1 - Understanding the Business
 A business problem can be decomposed into multiple business questions, each of which can
be mapped to a different analytics technique or data mining task.
 Example 1: Microsoft How-old.net
− “What are the distinct objects in the picture?” → Clustering
− “For each object, is it a face or not?” → Classification
− “What is the estimated age for each identified face?” → Regression
 Example 2: Churn Analysis and Targeted Offering
− “Which customers are likely to terminate the contract this month?” → Classification
− “Which service package will a customer likely purchase if given an incentive?” → Classification
− “How much will this customer use the service?” → Regression
− “What will be the expected utility of targeting this customer?” → Calculation
 Example 3: Planning
− “What will be the demand for each item next year, per region?” → Time Series
− “What will be the revenue under this pricing scheme?” → What-if
from business problems to analytics tasks
Step 2 – Understanding the Data
what is data?
“Data are values of qualitative or quantitative variables, belonging to a set of items”
Variables
 Numerical
 Categorical (Nominal, Ordinal)
 Special (Identifier, Time Index)
What should data look like:
 One row for each case
 Columns represent attributes
What does data really look like:
 Transactional (normalised) data
 Ordered data
− Sequence data (DNA)
− Time-based data (temporal auto-correlated)
− Spatial data (spatial auto-correlated)
 Graph-based data
 Free-form Text
 Image/Video (sequence of images)
 Audio
Id      Att-1   Att-2   ..  Att-M
Case 1  V(1,1)  V(1,2)  ..
Case 2  V(2,1)  V(2,2)  ..
…
Case N                  ..  V(N,M)
Variables:
• Numerical
• Categorical
− Nominal
− Ordinal
Data Forms:
• Matrix
• Normalized
• Ordered
− Sequence
− Time-Series
− Spatial
• Graph-based
• Free-form Text
• Image/Video
• Audio
Step 2 – Understanding the Data
Answering the following questions…
 What is the available data?
 Do we need to acquire other data? (Publicly available/ Buy data)
 What is the nature of the dataset? (Data profiling)
− Number of cases
− Number of attributes
− Missing values (sparsity)
− Numerical variables (min, max, mean, median, std. dev., outliers)
− Categorical variables (cardinality, frequencies, mode value)
− Correlations between numerical variables
− Statistical dependency between categorical variables.
− Statistical variance (numerical vs. categorical variables)
− Inconsistencies (based on business rules)
Should lead to…
 Identify the data pre-processing operations needed.
 Suggest the model to be used.
exploratory data analysis
Data Profiling:
• Number of cases
• Number of attributes
• Missing values
• Numerical variables
− Min - Max - Median
− Distribution(Mean, stdv.)
− Outliers
• Categorical variables
− Cardinality
− Frequencies
• Correlations/Dependencies
• Inconsistencies
Step 3 – Preparing the Data
Feature Engineering: Building the dataset.
 Feature Construction: fabricating a set of (possibly) useful features.
Example
- Input: Sales Transactions (Customer, Product, Orders)
- Objective: Customer Segmentation
- Features: Days First Purchase, Days Last Purchase, Avg. Days between 2 Purchases, Last 3 Months Total Spending, Last 6 Months Total Spending, Promotion Responsiveness, New Product Responsiveness, Avg. Purchased Product Price, …, Web Usage Information, Demographics, Geographic, Economic Indices, Date Indicators, etc.
 Feature Selection: Selecting the most effective subset of the available
features – Filter vs. Wrapper
 Feature Extraction: constructing a new set of independent (uncorrelated)
features, from the existing feature set, using mathematical transformation –
Principal Component Analysis (PCA), Factor Analysis (FA), etc.
good luck is a residue of preparation…
Data Preparation:
• Feature Engineering
− Feature Construction
− Feature Selection
− Feature Extraction
• Type Conversion
− Discretisation
− To Numeric
• Variable Tuning
− Missing values
− Clipping
− Scaling
• Row Processing
− Aggregation
− Removing duplicates
− Sampling
− Data Reduction
Step 3 – Preparing the Data
Variable Type Conversion:
 Numerical to Categorical (Discretisation) → Equal Width/ Equal Size/ Supervised.
 Categorical to Numerical → One-hot/ Relative Counts
Variable Tuning:
 Missing Values → Eliminate/ Estimate.
 Clipping Extreme Values → Fix/ Remove.
 Scaling → Normalisation/ Standardisation.
Row Processing:
 Aggregation
 Removing Duplicates
 Instance Selection (Data Reduction)
 Sampling/Partitioning
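Several of the conversions and tuning steps above are short enough to sketch directly. The helpers below are illustrative assumptions (names and sample values are not from the deck) and assume each numeric column contains at least two distinct values.

```python
def one_hot(values):
    """Categorical to numerical: one 0/1 column per category (categories sorted)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def normalise(values):
    """Min-max scaling to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


def equal_width_bins(values, n_bins):
    """Numerical to categorical: equal-width discretisation into bin indices."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just outside the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


print(one_hot(["rain", "sunny", "rain"]))
print(normalise([10, 20, 30]))
print(standardise([1, 2, 3]))
print(equal_width_bins([0, 5, 10], n_bins=2))
```

Equal-size (equal-frequency) and supervised discretisation would need different bin boundaries and are not covered by this sketch.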
“garbage in–garbage out”…
Step 4 - Modelling
If you interrogate the data, it will confess…
Modelling Variation:
• Approaches
• Algorithms
• Parameters
• Dataset Representations
 Overall Procedure:
 sets = Split( dataset, ratio);
 train=sets[0]; test=sets[1];
 model=Build(algorithm, train, preproc, param);
 Visualize(model);
 quality= Evaluate( model, test, measure);
 Always Build Multiple Models:
 Using different approaches.
 Using different algorithms.
 Using different parameters (parameter sweeping).
 Using different dataset representations.
 Empirical Evaluation for Model Selection
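The overall procedure (split, build, evaluate, then compare several models) might look like the following in Python. This is a sketch under stated assumptions: the "algorithm" is a hypothetical k-nearest-neighbour classifier, the quality measure is plain accuracy, the dataset is synthetic, and the function names only loosely mirror the pseudocode on the slide.

```python
import random


def split(dataset, ratio, seed=0):
    """Shuffle the cases and split them into a training and a test set."""
    rows = list(dataset)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * ratio)
    return rows[:cut], rows[cut:]


def build(train, k):
    """Hypothetical algorithm: a k-nearest-neighbour classifier over numeric features."""
    def model(features):
        nearest = sorted(
            train,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
        )[:k]
        votes = [label for _, label in nearest]
        return max(set(votes), key=votes.count)
    return model


def evaluate(model, test):
    """Accuracy: the fraction of test cases whose class is predicted correctly."""
    correct = sum(1 for features, label in test if model(features) == label)
    return correct / len(test)


# Synthetic dataset: two well-separated classes of (features, label) cases.
dataset = [((x, x + 1.0), "low") for x in range(10)]
dataset += [((x + 100.0, x + 101.0), "high") for x in range(10)]

train, test = split(dataset, ratio=0.7)
# Parameter sweep: build one model per value of k and compare empirically.
results = {k: evaluate(build(train, k), test) for k in (1, 3)}
best_k = max(results, key=results.get)
print(results, best_k)
```

On real data the candidate models would differ in approach, algorithm and dataset representation as well as in parameters, and the comparison would use a measure suited to the business problem.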
Step 5 – Evaluation and Interpretation
Model Predictive Effectiveness
 Predictive Accuracy
Model Comprehensibility
 Interpretability → Insights
 Model acceptance
 Legal explanation (Justifiability)
− Credit Denial
− Medical Decisions
Algorithm Efficiency
 Scalability/running time
 User Input parameters
Performance Quality Aspects
Predictive Model Quality
• Predictive Effectiveness
• Comprehensibility
Algorithm Efficiency
• Scalability, running time
• User input parameters
Step 5 – Evaluation and Interpretation
Predictive Models – Predictive Effectiveness (accuracy?)
 Considerations
− Class Imbalance
− Misclassification Cost (Expected Utility)
− Single Class Focus (Hit Rate vs. False Alarms)
 Measures
− Confusion Matrix
− Accuracy (Micro vs. Macro)
− Precision, Recall, Sensitivity, Specificity, F-Measure, etc.
− Area Under Curve, lift Chart, Profit/Cost Chart, etc.
− QLF, BIR, etc. (Probabilistic Classification/Regression)
 Methods
− Hold-out
− k-fold Cross Validation
− Leave-one-out
Descriptive Models – It is up to you!
all models are wrong, but some are useful…
                     Actual Positive   Actual Negative
Predicted Positive   TP                FP
Predicted Negative   FN                TN
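Given the four cells of the confusion matrix, the common measures follow directly. The sketch below uses the standard definitions; the counts passed in at the end are made-up numbers for illustration only.

```python
def classification_measures(tp, fp, fn, tn):
    """Standard measures derived from a binary confusion matrix."""
    precision = tp / (tp + fp)            # how many predicted positives are correct
    recall = tp / (tp + fn)               # sensitivity, i.e. the hit rate
    specificity = tn / (tn + fp)          # 1 - false alarm rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical counts, not taken from any real model.
measures = classification_measures(tp=40, fp=10, fn=20, tn=30)
print(measures)
```

With imbalanced classes, accuracy alone can look good while recall on the minority class is poor, which is why the slide lists several measures.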
Predictive Quality Measures:
• Accuracy (Micro vs. Macro)
• Precision vs. Recall
• Sensitivity vs. Specificity
• Kappa – Lift – odds
• QLF, CE, BIR
• AUC, lift, cost charts
Evaluation Methods:
• Hold-out
• k-fold Cross Validation
Step 6 – Deployment
data mining in action!
Demo
Tools & Technologies
• MS Azure ML
• MS Analysis Services
• Infer.NET
• WEKA (JAVA)
• R Statistics (caret, rattle)
• Python (Mlpy, scikit-learn)
• OpenML
• C/C++ - Matlab
• SAS
• SPSS
• RapidMiner
• Apache Mahout
Dataset Repository
• UCI - KDD
• data.gov.uk
• GapMinder
Screenshot – Decision Trees (Microsoft Analysis Services)
Screenshot – Cluster Analysis (Microsoft Analysis Services)
Screenshot – Association Rules Analysis (Microsoft Analysis Services)
Screenshot – Time Series (Microsoft Analysis Services)
Screenshot – ML Experiment (Microsoft Azure Machine Learning)
Screenshot – ML Web Services (Microsoft Azure Machine Learning)
Screenshot – Probabilistic Models (Microsoft Infer.NET)
Screenshot – Classification Rules (Java - WEKA)
Screenshot – Text Mining (R Statistics)
Screenshot – Regression Models (R Statistics)
Concluding Remarks
a few takeaways…
• Understand the business problem first, please!
• Use the appropriate tool/technique that best suits the business problem, not the other way around.
• Start by solving simple business problems first, before moving to complex ones (BI Insight Maturity Journey).
• Spend some time exploring and understanding the data.
• Incorporate domain knowledge in your analysis (avoid reinventing the wheel!).
• Data preparation is very important for building effective models.
• Data mining is an experimental/ iterative process (not ideal for fixed-price projects!).
• Try to tackle the business problem with different analytic approaches.
• It is clever to solve complex problems with simple techniques.
My Background
Applying Ant Colony Optimisation (ACO) in Building Classification Models
• Honorary Research Fellow, School of Computing , University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• M.Sc. Computer Science , The American University in Cairo, Egypt.
• 20+ published journal and conference papers, focusing on:
– classification rules induction,
– decision trees construction,
– Bayesian classification modelling,
– data reduction,
– instance-based learning, and
– evolving neural networks.
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation,
Intelligent Data Analysis, Applied Soft Computing, and Memetic Computing.
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, and INNS-BigData.

Weitere ähnliche Inhalte

Was ist angesagt?

Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data WarehousingAmdocs
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
Creating a Next-Generation Big Data Architecture
Creating a Next-Generation Big Data ArchitectureCreating a Next-Generation Big Data Architecture
Creating a Next-Generation Big Data ArchitecturePerficient, Inc.
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big DataFrank Kienle
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Splunk Business Analytics
Splunk Business AnalyticsSplunk Business Analytics
Splunk Business AnalyticsCleverDATA
 
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",..."From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...Dataconomy Media
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesSpringPeople
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseVaticle
 
Data Integration Alternatives: When to use Data Virtualization, ETL, and ESB
Data Integration Alternatives: When to use Data Virtualization, ETL, and ESBData Integration Alternatives: When to use Data Virtualization, ETL, and ESB
Data Integration Alternatives: When to use Data Virtualization, ETL, and ESBDenodo
 
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...Revolution Analytics
 
Data Warehousing 2016
Data Warehousing 2016Data Warehousing 2016
Data Warehousing 2016Kent Graziano
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAmpoolIO
 
Fixing data science & Accelerating Artificial Super Intelligence Development
 Fixing data science & Accelerating Artificial Super Intelligence Development Fixing data science & Accelerating Artificial Super Intelligence Development
Fixing data science & Accelerating Artificial Super Intelligence DevelopmentManojKumarR41
 
Data Virtualization: From Zero to Hero
Data Virtualization: From Zero to HeroData Virtualization: From Zero to Hero
Data Virtualization: From Zero to HeroDenodo
 
Are You Killing the Benefits of Your Data Lake?
Are You Killing the Benefits of Your Data Lake?Are You Killing the Benefits of Your Data Lake?
Are You Killing the Benefits of Your Data Lake?Denodo
 

Data Mining - The Big Picture!

  • 1. Data Mining: The big picture!
      Khalid M. Salama, Ph.D., Microsoft Business Intelligence, Hitachi Consulting UK
      We Make it Happen. Better.
      © Copyright 2015 Hitachi Consulting
  • 2. Outline
      − Context
      − Data Mining Tasks, Techniques, and Applications
      − Knowledge Discovery Process
      − Screenshots
      − Concluding Remarks
  • 3. Business Intelligence as a Context (revealing the mystery…)
      Business Intelligence: “A broad category of concepts, methods, tools and techniques of collecting, storing, managing, analysing and sharing data to support/improve decision making”.
      Data Mining is the subset of these concepts, methods, tools and techniques that is concerned with automatically extracting hidden, useful patterns from the data.
      Examples:
      − CRM: Customer Segmentation, Profiling, etc.
      − Finance, Banking & Insurance: Fraud Detection, Credit Scoring, Stock Market, etc.
      − Medicine/Health Care: Disease Development, Diagnosis, Best Treatments, etc.
      − Telecommunication: Churn Analysis, Network Fault Isolation, etc.
      − Retail: Cross-selling, Targeted Marketing, Propensity Modelling, etc.
  • 4. Terms and Significance (bringing order to the buzzword chaos)
      − Data Mining: “An interdisciplinary subfield of computer science, which is the computational process of discovering patterns in datasets”; also known as “Knowledge Discovery in Databases (KDD)”
      − Data Science: “the extraction of knowledge from volumes of data, which is a continuation of the field data mining and predictive analytics”
      − Machine Learning: “A subfield of computer science that evolved from the study of pattern recognition and computational learning theory”
      − Predictive Analytics: “A variety of statistical techniques from modelling, machine learning, and data mining that analyse current and historical facts to make predictions about future”
      − Big Data: “A broad term for data sets so large or complex that traditional data processing applications are inadequate”
  • 5. Data Mining … in a nutshell
      [Diagram: Data Mining at the intersection of Machine Learning, Statistics, Artificial Intelligence, Databases, and Other Technologies]
      “Data mining, an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.”
      Other Related Technologies:
      − Visualization
      − Big Data
      − High Performance Computing
      − Cloud Computing
      − Others
  • 6. Knowledge Discovery in Databases (KDD) …or data science, if you like!
      Cross Industry Standard Process for Data Mining (CRISP-DM), a cycle of phases centred on the data:
      − Understanding the Business
      − Understanding the Data
      − Preparing the Data
      − Modelling
      − Evaluation / Interpretation
      − Deployment
  • 7. Data Mining Taxonomy (a 10,000 foot view…)
      Taxonomy levels, each with one example:
      − Learning Paradigms: Supervised Learning
      − Mining Tasks: Classification
      − Modelling Techniques: Decision Trees
      − Measures: Information Gain
      − Heuristic Search Methods: Greedy Recursive Partitioning
  • 8. Learning Paradigms (data as the teacher, machine as the student…)
      − Supervised Learning: labelled data = data + output (predictable, target, response, class) variable; learn the relationship between data and output.
      − Unsupervised Learning: unlabelled data; learn associations, similarities, groups, etc.
      − Semi-supervised Learning: partially labelled data.
      − Online/Active Learning: real-time learning on data streams.
      − Reinforcement Learning: game theory, control theory, simulation-based optimization, operations research, robotics, etc.
  • 9. Data Mining Tasks (…only the genuine ones!)
      Tasks:
      − Classification: predicting the class of a given case (supervised)
      − Regression: estimating the value of a response variable (supervised)
      − Clustering: partitioning the cases into similar groups (unsupervised)
      − Association Rules Analysis: finding frequently co-occurring items (unsupervised)
      − Similarity Analysis: finding cases similar to a given case (both)
      − Probabilistic Inference: calculating the probability of variables (both)
      − Time Series Analysis: forecasting future values (supervised)
      Important Terms:
      − Learning Paradigms: Supervised; Unsupervised; Semi-supervised; Others (Reinforcement Learning, Active Learning, etc.)
      − Analytics Types: Descriptive (Exploratory); Predictive; Prescriptive (Decisive)
      Application Fields:
      − Text Mining
      − Information Retrieval
      − (Social) Web Mining
      − Speech Recognition
      − Image Recognition
      − Anomaly Detection
      − State Transition Analysis
      − Collaborative Filtering (Recommender Systems)
  • 10. Classification Learning (my favourite data mining task!)
      Data Mining Task: Classification; Regression; Clustering; Association Rules Analysis; Similarity Analysis; Probabilistic Inference; Time Series Analysis
      − Input: labelled cases (nominal labels).
      − Process: learn the relationships between the input variables and the target class.
      − Output: a model that is used to predict the class of unlabelled cases (+ probability).
      Target Class Type: Binary vs. Multi-class; Multi-label; Hierarchical Class
      Classification Applications: Targeted Advertising; Churn Analysis; Fraud Detection; OCR; Sentiment Analysis; Predictive Maintenance; Document Classification; Protein Function Prediction; Medical Support Systems
      [Diagram: Labelled cases (training set) → Classification Algorithm → Model (Classifier); Unlabelled (new) case → Model → predicted class "OK"]
      Labelled cases (training set):
      Outlook   Temperature  Humidity  Windy  Class
      sunny     hot          high      no     Don’t
      sunny     hot          high      yes    Don’t
      overcast  hot          high      no     OK
      rain      mild         high      no     OK
      rain      cool         normal    no     OK
      rain      cool         normal    yes    Don’t
      overcast  cool         normal    yes    OK
      sunny     mild         normal    no     Don’t
      sunny     cool         normal    no     OK
      rain      mild         normal    no     OK
      sunny     mild         normal    yes    OK
      overcast  mild         high      yes    OK
      overcast  hot          normal    no     OK
      rain      mild         high      yes    Don’t
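As a rough illustration of how a classification algorithm could start learning from the training set on this slide, the sketch below computes information gain (the measure named on the taxonomy slide for decision trees) for each input variable. The function and variable names are illustrative choices, not taken from the deck, and a full decision tree learner would repeat this step recursively on each partition.

```python
from collections import Counter
from math import log2

# Training set transcribed from the weather table on this slide.
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]
CASES = [
    ("sunny", "hot", "high", "no", "Don't"),
    ("sunny", "hot", "high", "yes", "Don't"),
    ("overcast", "hot", "high", "no", "OK"),
    ("rain", "mild", "high", "no", "OK"),
    ("rain", "cool", "normal", "no", "OK"),
    ("rain", "cool", "normal", "yes", "Don't"),
    ("overcast", "cool", "normal", "yes", "OK"),
    ("sunny", "mild", "normal", "no", "Don't"),
    ("sunny", "cool", "normal", "no", "OK"),
    ("rain", "mild", "normal", "no", "OK"),
    ("sunny", "mild", "normal", "yes", "OK"),
    ("overcast", "mild", "high", "yes", "OK"),
    ("overcast", "hot", "normal", "no", "OK"),
    ("rain", "mild", "high", "yes", "Don't"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(cases, attr_index):
    """Entropy reduction obtained by splitting the cases on one attribute."""
    labels = [case[-1] for case in cases]
    partitions = {}
    for case in cases:
        partitions.setdefault(case[attr_index], []).append(case[-1])
    remainder = sum(len(p) / len(cases) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


gains = {name: information_gain(CASES, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)
print(best_attribute, round(gains[best_attribute], 3))
```

On this table the largest gain belongs to Outlook (about 0.247 bits), so a greedy tree learner would split on it first.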
  • 11. Classification Learning (classification modelling techniques)
      Classification Techniques, each with typical variants:
      − Decision Trees: Forests / Jungles
      − Classification Rules: Ordered List / Unordered Set
      − Linear Discriminant Analysis: Logistic Regression
      − Artificial Neural Networks: Feed-forward Multilayer Perceptron
      − Instance-based Learning: Nearest-neighbour Classifiers
      − Probabilistic Graphical Models: Bayesian Network Classifiers
      − Support Vector Machines: Kernel Methods
      − Gaussian Process: Non-parametric Methods
      − Ensemble Methods: Bagging / Boosting / Stacking
      Advanced Classification Tasks:
      − Multi-label Classification
      − Hierarchical Classification
      Example of an ordered rule list:
      IF .. AND .. AND .. THEN A
      ELSE IF .. AND .. THEN C
      ELSE IF .. AND .. THEN B
      ..
      ELSE C
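The ordered rule list (IF .. THEN .. ELSE IF ..) could be applied along the lines of the sketch below: rules are tried in order, the first one whose conditions all hold decides the class, and a default class plays the role of the final ELSE. The rules here are hand-written, hypothetical examples over the weather attributes from the previous slide; they are not learned by any algorithm and do not come from the deck.

```python
# Each rule is (conditions, predicted_class); order matters.
# These rules are hypothetical illustrations, not learned from data.
RULES = [
    ({"outlook": "overcast"}, "OK"),
    ({"outlook": "sunny", "temperature": "hot"}, "Don't"),
    ({"outlook": "rain", "windy": "yes"}, "Don't"),
]
DEFAULT_CLASS = "OK"  # the final ELSE branch


def classify(case, rules=RULES, default=DEFAULT_CLASS):
    """Return the class of the first rule whose conditions all match the case."""
    for conditions, predicted in rules:
        if all(case.get(attr) == value for attr, value in conditions.items()):
            return predicted
    return default


new_case = {"outlook": "rain", "temperature": "mild", "humidity": "high", "windy": "yes"}
print(classify(new_case))
```

In an unordered rule set, by contrast, several rules may fire for the same case, so some conflict-resolution scheme (for example voting) would be needed.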
  • 12. Regression Analysis (the most classical ML task)
      − Input: cases with a numerical target variable (response value).
      − Process: learn the relationship $y = f(\mathbf{X})$.
      − Output: a regression model that is used to estimate the target value of new cases (+ confidence intervals).
      Linear Regression:
      − Simple Linear Regression: $y = ax + b$
      − Multi-variate Linear Regression: $y = a_1 x_1 + a_2 x_2 + \dots + a_m x_m + b$
      − Generalized Linear Model (Binomial, Poisson, Chi-square, Gaussian, etc.)
      Non-linear Regression:
      − Non-linear Transformation, e.g. $y = a_1 \log x_1 + a_2 x_2^3 + b$
      − Multi-variate Adaptive Regression Splines (MARS)
      − Regression Trees (Hierarchical Regression)
      − Artificial Neural Networks
      − Gaussian Process
      Related Concepts:
      − Parameter Estimation: Least Squares Error (LSE), Weighted LSE, etc.
      − Regularization: Least Absolute Shrinkage and Selection Operator (LASSO), Ridge
      − Model Selection (e.g. stepwise, Information Criterion, …)
      Regression Applications: Credit Scoring; Survival Analysis; Risk Estimation; Value Evaluation
      Regression Techniques (summary): Simple vs. Multi-variate; Generalized LM; Local Models (Splines); Trees, ANN, GP
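For the simple linear regression case, least squares parameter estimation has a closed form. The sketch below fits $y = ax + b$ to a handful of made-up points; the function name and the sample data are assumptions for illustration only.

```python
def fit_simple_linear(xs, ys):
    """Least squares estimates (a, b) for the model y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); both sums share the same 1/n factor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Made-up sample lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_simple_linear(xs, ys)
print(a, b)


def predict(x):
    """Estimate the response for a new case using the fitted parameters."""
    return a * x + b
```

The multi-variate case follows the same idea, but it is usually solved with matrix routines rather than by hand.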
  • 13. Cluster Analysis: … rather a means to an end
    − Input: cases without a specific target class.
    − Process: find groups where the distance "within" is minimized and the distance "between" is maximized.
    − Output: case-cluster assignment (membership).
    Clustering Techniques
    − Exclusive vs. Overlapping: K-Means vs. Fuzzy K-Means, EM
    − Partitioned vs. Hierarchical: K-Means vs. Agglomerative/Divisive
    − Center-based vs. Density-based: K-Means vs. DBSCAN
    − Complete vs. Partial
    Cluster Quality
    − Minimize intra-distance/linkage (Cohesion)
    − Maximize inter-distance/linkage (Separation)
    − Number of Clusters
    Clustering Applications: Customer Segmentation, Outlier Detection, Topic Grouping, Profiling, Summarisation, Mixture of Models
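To make the "minimize the distance within" idea concrete, the following is a minimal sketch of K-Means (Lloyd's iterations) on 2-D points, together with a cohesion score computed as the within-cluster sum of squared distances. The initial centres are passed in explicitly so the run is deterministic; a practical implementation would also handle initialisation and empty clusters more carefully.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def k_means(points: Sequence[Point], centres: Sequence[Point],
            max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Alternate assignment and centre-update steps until assignments stop changing."""
    centres = list(centres)
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre (exclusive clustering).
        new_assignment = [
            min(range(len(centres)), key=lambda k: sq_dist(p, centres[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: move each centre to the mean of its members.
        for k in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centre
                centres[k] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignment, centres


def cohesion(points: Sequence[Point], assignment: Sequence[int],
             centres: Sequence[Point]) -> float:
    """Within-cluster sum of squared distances (lower means tighter clusters)."""
    return sum(sq_dist(p, centres[a]) for p, a in zip(points, assignment))
```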
  • 14. Association Rule Analysis: discovery of "interesting" relationships
    − Input: cases without a specific target class (or basket data).
    − Process: find frequent co-occurrences between variable values (items).
    − Output: frequent item sets / association rules.
      − Frequent item sets: {a}, {b}, {d}, {a,b}, {a,d}
      − Association rule: IF {a,b} THEN {d,e}
    Approach
    − Define constraints (data space / rule space).
    − Frequent item set generation (min. support threshold): $\mathrm{support}(a,b) = \frac{|a,b|}{|T|}$
    − Rule generation (min. confidence threshold): $\mathrm{confidence}(a \rightarrow b) = \frac{|a,b|}{|a|}$
    − Prune and proceed to larger item sets (adjusted thresholds).
    − Rank the rules based on an interestingness measure.
    Data in matrix form:
           a    b    c    d    e
      T1   yes  no   yes  yes  no
      T2   yes  no   no   yes  no
      T3   no   yes  no   no   yes
      ..   ..   ..   ..   ..   ..
    The same data as baskets:
      T1 → {a,c,d}
      T2 → {a,d}
      T3 → {b,e}
      …
    Association Rules Applications: Market Basket Analysis, Text Mining (Sentiment Analysis), Graph/Link Analysis
    Rule Measures:
    − Support & Confidence
    − Interestingness: Lift & Chi-Squared, Jaccard & Kulczynski, Kappa & Conviction
    Related Issues: Negative Item Sets, Quantitative Items, Sequential Patterns, Item Set Compression, Redundancy-Aware Patterns, Colossal Item Sets & Scalability
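The support and confidence definitions above can be checked directly on the three baskets shown on the slide. This is a minimal sketch of the two measures only; it does not generate frequent item sets or prune candidates, and the function names are assumptions made for this example.

```python
from typing import FrozenSet, Iterable, Sequence

Basket = FrozenSet[str]

# The three baskets from the slide: T1 = {a,c,d}, T2 = {a,d}, T3 = {b,e}.
BASKETS = [frozenset("acd"), frozenset("ad"), frozenset("be")]


def support(itemset: Iterable[str], transactions: Sequence[Basket]) -> float:
    """Fraction of transactions that contain every item of the item set."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: Sequence[Basket]) -> float:
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    if count_a == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    count_both = sum(1 for t in transactions if both <= t)
    return count_both / count_a
```

On these baskets, {a,d} appears in two of three transactions, and every basket containing a also contains d, so the rule a → d has confidence 1.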
  • 15. Similarity Analysis: a.k.a. instance-based learning
    − Input: a set of (labelled/unlabelled) cases plus a subject case.
    − Process: find a set of cases similar to the subject case.
    − Output: similar cases (nearest neighbours).
    Proximity Measure
    − Distance vs. Similarity
    Weighting
    − User Input vs. Automatic Optimisation
    Neighbours
    − Distance-based (Threshold) vs. Top K
    Classification / Regression
    − Voting / Average
    − Weighted Voting / Weighted Average, Kernel Methods (Gaussian Kernel)
    Attribute Proximity Measures:
    − Edit-based: Levenshtein and Jaro-Winkler distance
    − Token-based: Jaccard, Shannon, and Cosine Similarity
    − Sequence-based: Longest Common Subsequence
    − Phonetic-based: Soundex and Metaphone
    − Numeric-based: Euclidean distance
    Case comparison, attribute by attribute:
               Att-1    Att-2    …   Att-m
      Case i   V(i,1)   V(i,2)   …   V(i,m)
      Case j   V(j,1)   V(j,2)   …   V(j,m)
      Weights  W1       W2       …   Wm
    $\mathrm{Similarity}(i,j) = W_1 \cdot \mathrm{Sim}(V_{i,1}, V_{j,1}) + W_2 \cdot \mathrm{Sim}(V_{i,2}, V_{j,2}) + \dots + W_m \cdot \mathrm{Sim}(V_{i,m}, V_{j,m})$
    Similarity Matching Applications: Case-based Reasoning, Lazy Classification, Record Matching, Outlier Detection, Search Engines
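The weighted sum of per-attribute similarities, followed by a top-K neighbour selection, can be sketched as below. The two attribute measures used here (token-based Jaccard for text and a range-scaled similarity for numbers) are illustrative choices, and the weights are user-supplied assumptions rather than optimised values.

```python
from typing import Callable, List, Sequence, Tuple

Case = Tuple[object, ...]
AttrSim = Callable[[object, object], float]


def jaccard(a: str, b: str) -> float:
    """Token-based similarity: shared words divided by all distinct words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def numeric_sim(a: float, b: float, value_range: float) -> float:
    """1.0 for identical numbers, falling linearly to 0.0 at a gap of value_range."""
    return 1.0 - min(abs(a - b) / value_range, 1.0)


def similarity(case_i: Case, case_j: Case,
               sims: Sequence[AttrSim], weights: Sequence[float]) -> float:
    """Weighted sum of per-attribute similarities: sum of W_k * Sim(V_ik, V_jk)."""
    return sum(w * sim(vi, vj)
               for vi, vj, sim, w in zip(case_i, case_j, sims, weights))


def top_k(subject: Case, cases: Sequence[Case],
          sims: Sequence[AttrSim], weights: Sequence[float], k: int) -> List[Case]:
    """Return the k cases most similar to the subject case."""
    ranked = sorted(cases,
                    key=lambda c: similarity(subject, c, sims, weights),
                    reverse=True)
    return ranked[:k]
```

With labelled neighbours, a lazy classifier would then vote over the labels of the returned cases, optionally weighting each vote by its similarity.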
  • 16. Probability Estimation and Inference: the doctrine of chances…
    − Input: a set of (labelled/unlabelled) cases.
    − Process: learn the structure and parameters of the variable dependency relationships.
    − Output: a Probabilistic Graphical Model.
    Probabilistic Graphical Models
    − Directed Acyclic Graphs: Bayesian Networks (classifiers), Dynamic Bayesian Networks, Markov Blankets
    − Directed Cyclic Graphs: Markov Chains, (Hidden) Markov Models
    − Undirected Graphs: Factor Graphs, Dependency Networks, Markov Random Fields
    Learning
    − Structure (variable-dependency relationships)
    − Parameters (quantification of the relationships)
    Inference
    − Exact inference and the junction tree
    − MCMC
    − Variational methods and EM
    Probabilistic Inference Applications: ML Framework, Diagnostic Systems, State Transition Analysis
  • 17. Time Series Analysis: history tends to repeat itself…
    − Input: a sequence of evenly spaced numerical data.
    − Process: learn a function that describes the current value with respect to the previous ones.
    − Output: a time series model (describe/forecast).
    Components:
    − Trend: overall upward, downward, or stationary pattern.
    − Cyclical: repeating upward or downward movements.
    − Seasonal: regular pattern of up-and-down fluctuations.
    − Irregular: unsystematic, "residual" fluctuations (random).
    Techniques:
    − Regression
    − (Weighted) Moving Average
    − Exponential Smoothing
    − Auto-regressive and decomposition models (ARMA, ARIMA, STL, etc.)
    Time Series Applications: Stock Market, Supply/Demand, Financial Applications, Signal Processing
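Two of the listed techniques are simple enough to sketch in a few lines: a trailing moving average and simple exponential smoothing, where each smoothed value is `alpha * current + (1 - alpha) * previous_smoothed`. The function names and the choice to seed the smoother with the first observation are assumptions made for this example.

```python
from typing import List, Sequence


def moving_average(series: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; the first output covers series[0:window]."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def exponential_smoothing(series: Sequence[float], alpha: float) -> List[float]:
    """Simple exponential smoothing, seeded with the first observation."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [float(series[0])]
    for value in series[1:]:
        # Recent observations get weight alpha; the history gets 1 - alpha.
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed
```

A larger `alpha` tracks recent values more closely, while a smaller one gives a smoother curve; the last smoothed value can serve as a naive one-step-ahead forecast.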
  • 18. Knowledge Discovery in Databases (KDD): the virtuous cycle of data science
    CRISP-DM Process (a cycle centred on the data):
    − Understanding the Business
    − Understanding the Data
    − Preparing the Data
    − Modelling
    − Evaluation & Interpretation
    − Deployment
  • 19. Step 1 - Understanding the Business
    "The formulation of a problem is often more essential than its solution" - Albert Einstein
    Ways to answer "Data Analysis" questions:
    − Query/Report: "How many new customers bought my service this month? How many renewed? How many left?"
    − Complex Query/Report: "What are the top-selling products by region in online sales? How does that compare to store sales?" (multi-dimensional analysis/visualisation)
    − Calculations/KPIs: "Is my business going well? Are we meeting our targets?"
    − What-if Analysis: "Based on last year's sales, what will the revenues be if we increase the price of product X by 1% and decrease the price of product Y by 2%?" (budgeting/planning)
    − Statistical Analysis: "What are the most important factors that impact the energy consumption in our facilities?" (dependency/correlation)
    − Hypothesis Testing: "Is there a significant improvement among the group of people who took the new drug, compared with the placebo group?" (experimental studies/market research)
    − Data Mining: "Which customers are most likely to respond to our new advertising campaign?" (predictive analytics)
  • 20. Step 1 - Understanding the Business: from business problems to analytics tasks
    A business problem can be decomposed into multiple business questions, each of which can be mapped to a different analytics technique or data mining task.
    − Example 1: Microsoft How-old.net
      − "What are the distinct objects in the picture?" → Clustering
      − "For each object, is it a face or not?" → Classification
      − "What is the estimated age for each identified face?" → Regression
    − Example 2: Churn Analysis and Targeted Offering
      − "Which customers are likely to terminate the contract this month?" → Classification
      − "Which service package is a customer likely to purchase if given an incentive?" → Classification
      − "How much will this customer use the service?" → Regression
      − "What will be the expected utility of targeting this customer?" → Calculation
    − Example 3: Planning
      − "What will the demand for each item be next year, per region?" → Time Series
      − "What will the revenue be under this pricing scheme?" → What-if
  • 21. Step 2 – Understanding the Data: what is data?
    "Data are values of qualitative or quantitative variables, belonging to a set of items"
    Variables
    − Numerical
    − Categorical (Nominal, Ordinal)
    − Special (Identifier, Time Index)
    What data should look like:
    − One row for each case
    − Columns represent attributes
      Id      Att-1    Att-2    ..   Att-M
      Case 1  V(1,1)   V(1,2)
      Case 2  V(2,1)   V(2,2)
      …
      Case N                         V(N,M)
    What data really looks like:
    − Transactional (normalised) data
    − Ordered data
      − Sequence data (DNA)
      − Time-based data (temporally auto-correlated)
      − Spatial data (spatially auto-correlated)
    − Graph-based data
    − Free-form text
    − Image/Video (sequence of images)
    − Audio
  • 22. Step 2 – Understanding the Data: exploratory data analysis
    Answering the following questions…
    − What data is available?
    − Do we need to acquire other data? (publicly available / bought data)
    − What is the nature of the dataset? (data profiling)
      − Number of cases
      − Number of attributes
      − Missing values (sparsity)
      − Numerical variables (min, max, mean, median, std. dev., outliers)
      − Categorical variables (cardinality, frequencies, mode value)
      − Correlations between numerical variables
      − Statistical dependency between categorical variables
      − Statistical variance (numerical vs. categorical variables)
      − Inconsistencies (based on business rules)
    Should lead to…
    − Identifying the data pre-processing operations needed.
    − Suggesting the model to be used.
  • 23. Step 3 – Preparing the Data: good luck is a residue of preparation…
    Feature Engineering: building the dataset.
    − Feature Construction: fabricating a set of (possibly) useful features.
      Example
      − Input: Sales Transactions (Customer, Product, Orders)
      − Objective: Customer Segmentation
      − Features: Days Since First Purchase, Days Since Last Purchase, Avg. Days Between Two Purchases, Last 3 Months Total Spending, Last 6 Months Total Spending, Promotion Responsiveness, New Product Responsiveness, Avg. Purchased Product Price, …, Web Usage Information, Demographics, Geographic, Economic Indices, Date Indicators, etc.
    − Feature Selection: selecting the most effective subset of the available features (Filter vs. Wrapper).
    − Feature Extraction: constructing a new set of independent (uncorrelated) features from the existing feature set, using a mathematical transformation such as Principal Component Analysis (PCA) or Factor Analysis (FA).
    Data Preparation overview:
    − Feature Engineering: Feature Construction, Feature Selection, Feature Extraction
    − Type Conversion: Discretisation, To Numeric
    − Variable Tuning: Missing Values, Clipping, Scaling
    − Row Processing: Aggregation, Removing Duplicates, Sampling, Data Reduction
  • 24. Step 3 – Preparing the Data: "garbage in, garbage out"…
    Variable Type Conversion:
    − Numerical to Categorical (Discretisation) → Equal Width / Equal Size / Supervised
    − Categorical to Numerical → One-hot / Relative Counts
    Variable Tuning:
    − Missing Values → Eliminate / Estimate
    − Clipping Extreme Values → Fix / Remove
    − Scaling → Normalisation / Standardisation
    Row Processing:
    − Aggregation
    − Removing Duplicates
    − Instance Selection (Data Reduction)
    − Sampling/Partitioning
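Several of the conversions named on this slide fit in a few lines each. The sketch below shows one-hot encoding, min-max normalisation, standardisation (z-scores using the population standard deviation) and equal-width discretisation; the function names and the handling of constant columns are assumptions made for this example.

```python
import statistics
from typing import List, Sequence


def one_hot(values: Sequence[str]) -> List[List[int]]:
    """Categorical to numerical: one 0/1 column per category (sorted order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def min_max(values: Sequence[float]) -> List[float]:
    """Normalisation: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values: Sequence[float]) -> List[float]:
    """Standardisation: subtract the mean, divide by the population std. dev."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Discretisation: bin index in [0, n_bins - 1] using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

In practice the scaling parameters (min, max, mean, standard deviation, bin edges) should be learned on the training partition only and then reused on the test partition.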
  • 25. Step 4 - Modelling: if you interrogate the data, it will confess…
    Overall Procedure:
      sets = Split(dataset, ratio);
      train = sets[0]; test = sets[1];
      model = Build(algorithm, train, preproc, param);
      Visualize(model);
      quality = Evaluate(model, test, measure);
    Always Build Multiple Models:
    − Using different approaches.
    − Using different algorithms.
    − Using different parameters (parameter sweeping).
    − Using different dataset representations.
    Empirical Evaluation for Model Selection
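The overall procedure on this slide (split, build, evaluate, then compare several models) can be sketched with the standard library alone. The two "algorithms" here, a majority-class baseline and a 1-nearest-neighbour classifier, and the toy dataset are assumptions chosen to keep the example self-contained; `split`, `build_majority`, `build_1nn` and `evaluate` are hypothetical stand-ins for the slide's `Split`, `Build` and `Evaluate`.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Features = Tuple[float, ...]
Example = Tuple[Features, str]
Model = Callable[[Features], str]


def split(dataset: Sequence[Example], ratio: float,
          seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Shuffle reproducibly, then cut into (train, test) at the given ratio."""
    rows = list(dataset)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]


def build_majority(train: Sequence[Example]) -> Model:
    """Baseline model: always predict the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def build_1nn(train: Sequence[Example]) -> Model:
    """1-nearest-neighbour: predict the label of the closest training case."""
    def predict(x: Features) -> str:
        nearest = min(train,
                      key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
        return nearest[1]
    return predict


def evaluate(model: Model, test: Sequence[Example]) -> float:
    """Predictive accuracy on held-out cases."""
    return sum(1 for x, y in test if model(x) == y) / len(test)


# Toy dataset (an assumption): two well-separated groups on one attribute.
DATASET: List[Example] = (
    [((float(i),), "low") for i in range(10)]
    + [((100.0 + i,), "high") for i in range(10)]
)
```

Building both models on the same training partition and comparing `evaluate` scores on the same test partition is the empirical model selection the slide recommends; the baseline gives a floor that any useful model should beat.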
  • 26. Step 5 – Evaluation and Interpretation: Performance Quality Aspects
    Model Predictive Effectiveness
    − Predictive Accuracy
    Model Comprehensibility
    − Interpretability → Insights
    − Model acceptance
    − Legal explanation (Justifiability)
      − Credit Denial
      − Medical Decisions
    Algorithm Efficiency
    − Scalability / running time
    − User input parameters
  • 27. Step 5 – Evaluation and Interpretation: all models are wrong, but some are useful…
    Predictive Models: Predictive Effectiveness (accuracy?)
    − Considerations
      − Class Imbalance
      − Misclassification Cost (Expected Utility)
      − Single Class Focus (Hit Rate vs. False Alarms)
    − Measures
      − Confusion Matrix
      − Accuracy (Micro vs. Macro)
      − Precision, Recall, Sensitivity, Specificity, F-Measure, etc.
      − Kappa, Lift, Odds
      − Area Under Curve (AUC), Lift Chart, Profit/Cost Chart, etc.
      − QLF, CE, BIR, etc. (Probabilistic Classification/Regression)
    − Methods
      − Hold-out
      − k-fold Cross Validation
      − Leave-one-out
    Descriptive Models: it is up to you!
    Confusion matrix:
                           Actual Positive   Actual Negative
      Predicted Positive   TP                FP
      Predicted Negative   FN                TN
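Most of the listed measures are simple ratios over the four confusion-matrix cells. The sketch below counts the cells for a chosen positive class and derives accuracy, precision, recall (sensitivity), specificity and the F1 form of the F-measure; returning 0.0 for an undefined ratio is a convention assumed here, not something the slides specify.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(actual: Sequence[object], predicted: Sequence[object],
                     positive: object) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) with respect to the given positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1  # false alarm
        elif a == positive:
            fn += 1  # missed hit
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Common single-class quality measures derived from the confusion matrix."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0  # assumed convention for 0/0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity or hit rate
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }
```

On imbalanced data, accuracy alone can look high even when recall on the minority class is poor, which is why the slide lists class imbalance and single-class focus as considerations.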
  • 28. Step 6 – Deployment: data mining in action!
    Demo
    Tools & Technologies:
    − MS Azure ML
    − MS Analysis Services
    − Infer.NET
    − WEKA (Java)
    − R Statistics (caret, rattle)
    − Python (mlpy, scikit-learn)
    − OpenML
    − C/C++, Matlab
    − SAS
    − SPSS
    − RapidMiner
    − Apache Mahout
    Dataset Repositories:
    − UCI, KDD
    − data.gov.uk
    − GapMinder
  • 29. Screenshot – Decision Trees (Microsoft Analysis Services)
  • 30. Screenshot – Cluster Analysis (Microsoft Analysis Services)
  • 31. Screenshot – Association Rules Analysis (Microsoft Analysis Services)
  • 32. Screenshot – Time Series (Microsoft Analysis Services)
  • 33. Screenshot – ML Experiment (Microsoft Azure Machine Learning)
  • 34. Screenshot – ML Web Services (Microsoft Azure Machine Learning)
  • 35. Screenshot – Probabilistic Models (Microsoft Infer.NET)
  • 36. Screenshot – Classification Rules (Java, WEKA)
  • 37. Screenshot – Text Mining (R Statistics)
  • 38. Screenshot – Regression Models (R Statistics)
  • 39. Concluding Remarks: a few takeaways…
    − Understand the business problem first, please!
    − Use the tool/technique that best suits the business problem, not the other way around.
    − Start by solving simple business problems before moving to complex ones (BI Insight Maturity Journey).
    − Spend some time exploring and understanding the data.
    − Incorporate domain knowledge in your analysis (avoid reinventing the wheel!).
    − Data preparation is very important for building effective models.
    − Data mining is an experimental, iterative process (not ideal for fixed-price projects!).
    − Try to tackle the business problem with different analytic approaches.
    − It is clever to solve complex problems with simple techniques.
  • 40. My Background: Applying Ant Colony Optimisation (ACO) in Building Classification Models
    − Honorary Research Fellow, School of Computing, University of Kent.
    − Ph.D. Computer Science, University of Kent, Canterbury, UK.
    − M.Sc. Computer Science, The American University in Cairo, Egypt.
    − 20+ published journal and conference papers, focusing on:
      − classification rules induction,
      − decision trees construction,
      − Bayesian classification modelling,
      − data reduction,
      − instance-based learning, and
      − evolving neural networks.
    − Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Intelligent Data Analysis, Applied Soft Computing, and Memetic Computing.
    − Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, and INNS-BigData.

Editor's notes

  1. Pattern fusion