
Data Mining - The Big Picture!

Recently, in the fields of Business Intelligence and Data Management, everybody has been talking about data science, machine learning, predictive analytics and many other “clever” terms that promise to turn your data into gold. In these slides, we present the big picture of data science and machine learning. First, we define the context for data mining from a BI perspective and try to clarify the various buzzwords in this field. Then we give an overview of the machine learning paradigms. After that, we discuss, at a high level, the various data mining tasks, techniques and applications. Next, we take a quick tour through the Knowledge Discovery Process. Screenshots from demos are shown, and finally we conclude with some takeaway points.



  1. 1. | © Copyright 2015 Hitachi Consulting1 Data Mining The big picture! Khalid M. Salama, Ph.D. Microsoft Business Intelligence Hitachi Consulting UK We Make it Happen. Better.
  2. 2. | © Copyright 2015 Hitachi Consulting2 Outline Context Data Mining Tasks, Techniques, and Applications Knowledge Discovery Process Screenshots Concluding Remarks
  3. 3. | © Copyright 2015 Hitachi Consulting3 Business Intelligence as a Context Business Intelligence - “A broad category of concepts, methods, tools and techniques of collecting, storing, managing, analysing and sharing data to support/improve decision making”.  Data Mining is the subset of these concepts, methods, tools and techniques concerned with automatically extracting hidden, useful patterns from the data.  Examples: −CRM: Customer Segmentation, Profiling, etc. −Finance, Banking & Insurance: Fraud Detection, Credit Scoring, Stock Market, etc. −Medicine/Health Care: Disease Development, Diagnosis, Best Treatments, etc. −Telecommunication: Churn Analysis, Network Fault Isolation, etc. −Retail: Cross-selling, Targeted Marketing, Propensity Modelling, etc. revealing the mystery…
  4. 4. | © Copyright 2015 Hitachi Consulting4 Terms and Significance  Data Mining – “An interdisciplinary subfield of computer science, which is the computational process of discovering patterns in datasets” – “Knowledge Discovery in Databases (KDD)”  Data Science – “the extraction of knowledge from volumes of data, which is a continuation of the fields of data mining and predictive analytics”  Machine Learning – “A subfield of computer science that evolved from the study of pattern recognition and computational learning theory”  Predictive Analytics – “A variety of statistical techniques from modelling, machine learning, and data mining that analyse current and historical facts to make predictions about the future”  Big Data – “A broad term for data sets so large or complex that traditional data processing applications are inadequate” bringing order to the buzzword chaos
  5. 5. | © Copyright 2015 Hitachi Consulting5 Data Mining … in a nutshell Data Mining Machine Learning Statistics Artificial Intelligence Databases Other Technologies “Data mining, an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.” Other Related Technologies:  Visualization  Big Data  High Performance Computing  Cloud Computing  Others..
  6. 6. | © Copyright 2015 Hitachi Consulting6 Knowledge Discovery in Databases (KDD) …or data science, if you like! Understanding the Business Understanding the Data Preparing the Data Modelling Evaluation - Interpretation Deployment Cross Industry Standard Process for Data Mining (CRISP-DM) Data
  7. 7. | © Copyright 2015 Hitachi Consulting7 Data Mining Taxonomy A 10,000 foot view… Learning Paradigms Mining Tasks Modelling Techniques Measures Heuristic Search Methods Supervised Learning Classification Decision Trees Information Gain Greedy Recursive Partitioning
  8. 8. | © Copyright 2015 Hitachi Consulting8 Learning Paradigms Data as the teacher, machine as the student… Supervised Learning Labelled data = data + output (predictable, target, response, class) variable Learn the relationship between data and output Unsupervised Learning Unlabelled data Learn associations, similarities, groups, etc. Semi- supervised Learning Partially labelled data Online/Active Learning Real-time learning on data streams Reinforcement Learning game theory, control theory, simulation-based optimization, operations research, robotics, etc.
  9. 9. | © Copyright 2015 Hitachi Consulting9 Data Mining Tasks …only the genuine ones! • Classification – Predicting the class of a given case (Supervised) • Regression – Estimating the value of a response variable (Supervised) • Clustering – Partitioning the cases into similar groups (Unsupervised) • Association Rules Analysis – Finding frequent (co-)occurring items (Unsupervised) • Similarity Analysis – Finding similar cases to a given case (Both) • Probabilistic Inference – Calculating the probability of variables (Both) • Time Series Analysis – Forecasting future values (Supervised) Important Terms: • Learning Paradigms: − Supervised − Unsupervised − Semi-supervised − Others (Reinforcement learning, Active, etc.) • Analytics Types: − Descriptive (Exploratory) − Predictive − Prescriptive (Decisive) Application Fields: • Text Mining • Information Retrieval • (Social) Web Mining • Speech Recognition • Image Recognition • Anomaly Detection • State Transition Analysis • Collaborative Filtering (Recommender systems)
  10. 10. | © Copyright 2015 Hitachi Consulting10 Classification Learning my favourite data mining task! Data Mining Task: • Classification • Regression • Clustering • Association Rules Analysis • Similarity Analysis • Probabilistic Inference • Time Series Analysis Target Class Type • Binary vs. Multi-class • Multi-label • Hierarchical Class Classification Applications: • Targeted Advertising • Churn Analysis • Fraud Detection • OCR • Sentiment Analysis • Predictive Maintenance • Document Classification • Protein Function Prediction • Medical Support Systems  Input: Labelled cases (nominal labels).  Process: Learn the relationships between the input variables and the target class.  Output: A model that is used to predict the class of unlabelled cases (+ probability). Model (Classifier) Classification Algorithm Outlook Temperature Humidity Windy Class sunny hot high no Don't sunny hot high yes Don't overcast hot high no OK rain mild high no OK rain cool normal no OK rain cool normal yes Don't overcast cool normal yes OK sunny mild normal no Don't sunny cool normal no OK rain mild normal no OK sunny mild normal yes OK overcast mild high yes OK overcast hot normal no OK rain mild high yes Don't OK Labelled cases (Training Set) Unlabelled (new) Case
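As a minimal sketch of the learning step described above, the slide's toy weather data could be fed to a decision-tree classifier. This assumes scikit-learn (one of the tools listed later in the deck) and uses only a handful of the rows for brevity; the column and label encodings are illustrative choices, not the author's setup.

```python
# Minimal classification sketch on the slide's toy weather data (assumes scikit-learn).
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame(
    [["sunny", "hot", "high", "no", "Don't"],
     ["sunny", "hot", "high", "yes", "Don't"],
     ["overcast", "hot", "high", "no", "OK"],
     ["rain", "mild", "high", "no", "OK"],
     ["rain", "cool", "normal", "no", "OK"],
     ["rain", "cool", "normal", "yes", "Don't"]],
    columns=["Outlook", "Temperature", "Humidity", "Windy", "Class"])

encoder = OrdinalEncoder()                               # nominal values -> numeric codes
X = encoder.fit_transform(train.drop(columns="Class"))
y = train["Class"]

model = DecisionTreeClassifier().fit(X, y)               # learn the classifier

new_case = pd.DataFrame([["sunny", "cool", "normal", "no"]],
                        columns=["Outlook", "Temperature", "Humidity", "Windy"])
print(model.predict(encoder.transform(new_case)))        # predicted class
print(model.predict_proba(encoder.transform(new_case)))  # class probabilities
```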
  11. 11. | © Copyright 2015 Hitachi Consulting11 Classification Learning classification modelling techniques Data Mining Task: • Classification • Regression • Clustering • Association Rules Analysis • Similarity Analysis • Probabilistic Inference • Time Series Analysis Classification Techniques: • Decision Trees • Classification Rules • Linear Discriminant Analysis • Artificial Neural Networks • Instance-based Learning • Probabilistic Graphical Models • Support Vector Machines • Gaussian Process • Ensemble Methods Advanced Classification Tasks: • Multi-label Classification • Hierarchical Classification  Decision Trees  Forests/ Jungles  Classification Rules  Ordered List/ Unordered Set  Linear Discriminant Analysis  Logistic Regression  Artificial Neural Networks  Feed-forward Multilayer Perceptron  Instance-based Learning  Nearest-neighbours classifiers  Probabilistic Graphical Models  Bayesian Network Classifiers  Support Vector Machines  Kernel Methods  Gaussian Process  Non-parametric Methods  Ensemble Methods  Bagging/ Boosting/ Stacking IF .. AND .. AND .. THEN A ELSE IF .. AND .. THEN C ELSE IF .. AND .. THEN B .. .. ELSE C
  12. 12. | © Copyright 2015 Hitachi Consulting12 Regression Analysis  Input: cases with a numerical target variable (response value)  Process: Learn the relationship y = f(X).  Output: A regression model that is used to estimate the target value of new cases (+ confidence intervals) Linear Regression  Simple Linear Regression: y = a·x + b  Multi-variate Linear Regression: y = a1·x1 + a2·x2 + … + am·xm + b  Generalized Linear Model (Binomial, Poisson, Chi-square, Gaussian, etc.) Non-linear Regression  Non-linear Transformation: y = a1·log(x1) + a2·x2³ + b  Multi-variate Adaptive Regression Splines (MARS)  Regression Trees (Hierarchical Regression)  Artificial Neural Networks  Gaussian Process Related Concepts  Parameter Estimation: Least Square Error, Weighted LSE, etc.  Regularization: least absolute shrinkage and selection operator (LASSO) - Ridge  Model Selection (e.g. stepwise, Information Criterion, …) the most classical ML task Data Mining Task: • Classification • Regression • Clustering • Association Rules Analysis • Similarity Analysis • Probabilistic Inference • Time Series Analysis Regression Applications: • Credit Scoring • Survival Analysis • Risk Estimation • Value Evaluation Regression Techniques: • Simple vs. Multi-variate • Generalized LM • Local Models - Splines • Trees - ANN - GP Related Concepts: • Parameter Estimation • Regularization • Model Selection
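A minimal sketch of multivariate linear regression as defined above, y = a1·x1 + … + am·xm + b, fitted by ordinary least squares; it assumes scikit-learn, and the synthetic data and true coefficients are purely illustrative.

```python
# Minimal multivariate linear regression sketch (assumes scikit-learn; synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                   # 100 cases, 3 input variables
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + 3.0 \
    + rng.normal(scale=0.1, size=100)                           # noisy linear response

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)                            # estimated a1..am and b
print(model.predict(X[:2]))                                     # estimated targets for new cases
```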
  13. 13. | © Copyright 2015 Hitachi Consulting13 Cluster Analysis  Input: cases without a specific target class.  Process: find groups where the distance “within” is minimized, and the distance “between” is maximized.  Output: case-cluster assignment (membership). Clustering Techniques  Exclusive vs. Overlapping  K-Means vs. Fuzzy K-Means, EM  Partitioned vs. Hierarchical  K-Means vs. Agglomerative/Divisive  Center-based vs. Density-based  K-Means vs. DBScan  Complete vs. Partial. Clusters Quality  Minimize intra-distance/linkage (Cohesion)  Maximize inter-distance/linkage (Separation)  Number of Clusters … rather a means to an end Data Mining Task: • Classification • Regression • Clustering • Association Rules Analysis • Similarity Analysis • Probabilistic Inference • Time Series Analysis Clustering Applications: • Customer Segmentation • Outlier Detection • Topic Grouping • Profiling • Summarisation • Mixture of Models Clustering Techniques • Exclusive vs. Overlapping • Partitioned vs. Hierarchical • Center-based vs. Density-based • Complete vs. Partial.
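A minimal k-means sketch along the lines of this slide, assuming scikit-learn; the synthetic blobs, the choice of four clusters, and the use of inertia as the cohesion measure are illustrative assumptions.

```python
# Minimal k-means clustering sketch (assumes scikit-learn; synthetic data).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])   # case-cluster assignment (membership)
print(kmeans.inertia_)       # within-cluster sum of squares (cohesion to minimise)
```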
  14. 14. | © Copyright 2015 Hitachi Consulting14 Association Rule Analysis  Input: cases without a specific target class (or basket data).  Process: Find frequent co-occurrences between variable values (items).  Output: Frequent Item sets/ Association Rules.  Frequent Item set: {a}, {b}, {d}, {a,b}, {a,d}  Association Rule: IF {a,b} THEN {d,e} Approach  Define Constraints (Data Space/ Rule Space)  Frequent Item set Generation (min. support threshold): support(a,b) = |{a,b}| / |T|  Rule Generation (min. confidence threshold): confidence(a→b) = |{a,b}| / |{a}|  Prune and proceed to larger item sets (adjusted thresholds)  Rank the rules based on an interestingness measure discovery of “interesting” relationships Data Mining Task: • Classification • Regression • Clustering • Association Rules Analysis • Similarity Analysis • Probabilistic Inference • Time Series Analysis Asso. Rules Applications: • Market Basket Analysis • Text Mining - Sentiment Analysis • Graph/Link Analysis Rule Measures: • Support & Confidence • Interestingness − Lift & Chi-Squared − Jaccard & Kulczynski − Kappa & Conviction Related Issues: • Negative Item sets • Quantitative Items • Sequential Patterns • Item Sets Compression • Redundancy-Aware Patterns • Colossal Item Sets & Scalability a b c d e T1 yes no yes yes no T2 yes no no yes no T3 no yes no no yes .. .. .. .. .. .. Basket Data T1 → {a,c,d} T2 → {a,d} T3 → {b,e} …
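A minimal sketch of the support and confidence calculations above, computed in plain Python directly over the slide's toy baskets; the 0.5 minimum-support threshold is an arbitrary assumption.

```python
# Minimal support/confidence sketch over the slide's toy basket data.
from itertools import combinations

baskets = [{"a", "c", "d"}, {"a", "d"}, {"b", "e"}]      # T1, T2, T3
n = len(baskets)

def support(itemset):
    # fraction of transactions containing the whole item set
    return sum(itemset <= basket for basket in baskets) / n

def confidence(antecedent, consequent):
    # confidence(antecedent -> consequent) = support(both) / support(antecedent)
    return support(antecedent | consequent) / support(antecedent)

items = set().union(*baskets)
frequent = [set(c) for k in (1, 2) for c in combinations(sorted(items), k)
            if support(set(c)) >= 0.5]                   # min. support threshold (assumed)
print(frequent)                                          # e.g. {'a'}, {'d'}, {'a', 'd'}
print(confidence({"a"}, {"d"}))                          # rule: IF {a} THEN {d}
```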
  15. 15. | © Copyright 2015 Hitachi Consulting15 Similarity Analysis a.k.a. instance-based learning Data Mining Task: • Classification • Regression • Clustering • Association Rules Analysis • Similarity Analysis • Probabilistic Inference • Time Series Analysis Similarity Matching Applications: • Case-based Reasoning • Lazy Classification • Record Matching • Outlier Detection • Search Engines Attribute Proximity Measures: • Edit-based – Levenshtein and Jaro-Winkler distance. • Token-based – Jaccard, Shannon, and Cosine Similarity. • Sequence-based – Longest Common Subsequence. • Phonetic-based – Soundex and Metaphone. • Numeric-based – Euclidean distance. Similarity(i,j) = W1·Sim(Vi,1, Vj,1) + W2·Sim(Vi,2, Vj,2) + … + Wm·Sim(Vi,m, Vj,m), where cases i and j have attribute values Vi,1…Vi,m and Vj,1…Vj,m over attributes Att-1…Att-m, and W1…Wm are the attribute weights.  Input: A set of (labelled/ unlabelled) cases + subject case.  Process: find a set of similar cases to the subject case.  Output: similar cases (nearest neighbours). Proximity Measure  Distance vs. Similarity Weighting  User Input vs. Automatic Optimisation Neighbours  Distance-based (Threshold) vs. Top K Classification / Regression  Voting / Average  Weighted Voting / Weighted Average – Kernel Methods (Gaussian Kernel)
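A minimal sketch of the weighted attribute-wise similarity above; the per-attribute similarity functions and the weights below are illustrative assumptions, not measures prescribed by the deck.

```python
# Minimal weighted case-similarity sketch: Similarity(i,j) = sum_k Wk * Sim(v_ik, v_jk).
def attribute_similarity(a, b):
    if isinstance(a, str):                       # nominal attribute: exact-match similarity
        return 1.0 if a == b else 0.0
    return 1.0 / (1.0 + abs(a - b))              # numeric attribute: distance-based similarity

def case_similarity(case_i, case_j, weights):
    return sum(w * attribute_similarity(a, b)
               for w, a, b in zip(weights, case_i, case_j))

case_i = ["sunny", 30.0, "high"]
case_j = ["sunny", 27.0, "normal"]
print(case_similarity(case_i, case_j, weights=[0.5, 0.3, 0.2]))
```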
  16. 16. | © Copyright 2015 Hitachi Consulting16 Probability Estimation and Inference  Input: A set of (labelled/ unlabeled) cases.  Process: learn the structure/parameters of the variable dependency relationships  Output: A Probabilistic Graphical Model Probabilistic Graphical Models  Directed Acyclic Graphs  Bayesian Networks (classifiers)  Dynamic Bayesian Networks  Markov Blankets  Directed Cyclic Graphs  Markov Chains  (Hidden) Markov Models  Undirected Graphs  Factor Graphs  Dependency Networks  Markov Random fields Learning  Structure (variable-dependency relationships)  Parameters (quantification of the relationships) Inferencing  Exact inference and the junction tree  MCMC  Variational methods and EM the doctrine of chances… Data Mining Task: • Classification • Regression • Clustering • Association Rules Analysis • Similarity Analysis • Probabilistic Inference • Time Series Analysis Probabilistic Inference Applications: • ML Framework • Diagnostic Systems • State Transition Analysis Probabilistic Graphical Models: • Directed Acyclic Graphs − Bayesian Networks − Markov Blankets • Directed Cyclic Graphs − Markov Chains − Markov Models • Undirected Graphs − Factor Graphs − Dependency Networks − Markov Random fields
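A minimal sketch of probabilistic inference on the smallest possible model, a two-node Disease → Test network queried with Bayes' rule; the probabilities are illustrative assumptions, not values from the deck.

```python
# Minimal Bayes'-rule inference sketch on a tiny Disease -> Test model (assumed numbers).
p_disease = 0.01
p_test_pos_given_disease = 0.95
p_test_pos_given_healthy = 0.05

# marginal probability of a positive test (sum over the hidden variable)
p_test_pos = (p_test_pos_given_disease * p_disease
              + p_test_pos_given_healthy * (1 - p_disease))

# posterior P(Disease | Test = positive)
p_disease_given_pos = p_test_pos_given_disease * p_disease / p_test_pos
print(p_disease_given_pos)
```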
  17. 17. | © Copyright 2015 Hitachi Consulting17 Time Series Analysis  Input: a sequence of evenly-spaced numerical data.  Process: learn a function that describes the current value with respect to the previous ones.  Output: Time Series Model (describe/forecast). Components:  Trend: Overall upward, downward, or stationary pattern.  Cyclical: Repeating upwards or downwards movements.  Seasonal: Regular pattern of up & down fluctuations.  Irregular: Unsystematic, ‘residual’ fluctuations (random). Techniques:  Regression.  (Weighted) Moving Average.  Exponential Smoothing.  Auto-regressive (STL, ARMA, ARIMA, etc.). history tends to repeat itself… Data Mining Task: • Classification • Regression • Clustering • Association Rules Analysis • Similarity Analysis • Probabilistic Inference • Time Series Analysis Time Series Applications: • Stock Market • Supply/Demand • Financial Applications • Signal Processing Time Series Components • Trend • Cyclical • Seasonal • Random Techniques • Regression • Moving Average • Exponential Smoothing • Auto-regressive
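A minimal sketch of two of the listed techniques, a moving average and simple exponential smoothing; the series values, window size, and smoothing factor are illustrative assumptions.

```python
# Minimal moving-average and exponential-smoothing sketch (synthetic series).
import numpy as np

series = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119], dtype=float)

window = 3
moving_avg = np.convolve(series, np.ones(window) / window, mode="valid")
print(moving_avg)                         # smoothed values

alpha = 0.3                               # smoothing factor
level = series[0]
for value in series[1:]:
    level = alpha * value + (1 - alpha) * level
print(level)                              # simple one-step-ahead forecast
```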
  18. 18. | © Copyright 2015 Hitachi Consulting18 Knowledge Discovery in Databases (KDD) the virtuous cycle of data science CRISP-DM Process: • Understanding the Business • Understanding the Data • Preparing the Data • Modelling • Evaluation & Interpretation • Deployment Understanding the Business Understanding the Data Preparing the Data Modelling Evaluation - Interpretation Deployment Data
  19. 19. | © Copyright 2015 Hitachi Consulting19 Step 1 - Understanding the Business Ways to answer “Data Analysis” questions:  Query/Report – “How many new customers bought my service this month? How many renewed? How many left?”  Complex Query/Report – “What are the top selling products by region in the Online sales? How does that compare to the store sales?” (Multi-dimensional Analysis/Visualisation)  Calculations/KPIs – “Is my business going well? Are we meeting our targets?”  What-if Analysis – “Based on last year's sales, what will be the revenues if we increase the price of product X by 1% and decrease the price of product Y by 2%?” (budgeting/planning)  Statistical Analysis – “What are the most important factors that impact the energy consumption in our facilities?” (dependency/correlation)  Hypothesis Testing – “Is there a significant improvement amongst the group of people who took the new drug, compared to the placebo group?” (experimental studies/market research)  Data Mining – “Which customers are most likely to respond to our new advertising campaign?” (predictive analytics) “The formulation of a problem is often more essential than its solution” - Albert Einstein CRISP-DM Process: • Understanding the Business • Understanding the Data • Preparing the Data • Modelling • Evaluation & Interpretation • Deployment Analytics Techniques: • Database Query • Multi-dimensional Analysis/Visualisation • Calculations/KPIs • What-if Analysis • Statistical Analysis • Hypothesis Testing • Data Mining
  20. 20. | © Copyright 2015 Hitachi Consulting21 Step 1 - Understanding the Business  A business problem can be decomposed into multiple business questions, each of which can be mapped to a different analytics technique or data mining task.  Example 1: Microsoft How-old.net − “What are the distinct objects in the picture?” → Clustering − “For each object, is it a face or not?” → Classification − “What is the estimated age for each identified face?” → Regression  Example 2: Churn Analysis and Targeted Offering − “Which customers would likely terminate the contract this month?” → Classification − “Which service package will a customer likely purchase if given an incentive?” → Classification − “How much will this customer use the service?” → Regression − “What will be the expected utility of targeting this customer?” → Calculation  Example 3: Planning − “What will be the amount of demand for each item next year, per region?” → Time Series − “What will be the revenue according to this pricing scheme?” → What-if from business problems to analytics tasks CRISP-DM Process: • Understanding the Business • Understanding the Data • Preparing the Data • Modelling • Evaluation & Interpretation • Deployment Analytics Techniques: • Database Query • Multi-dimensional Analysis/Visualisation • Calculations/KPIs • What-if Analysis • Statistical Analysis • Hypothesis Testing • Data Mining
  21. 21. | © Copyright 2015 Hitachi Consulting22 Step 2 – Understanding the Data what is data? CRISP-DM Process: • Understanding the Business • Understanding the Data • Preparing the Data • Modelling • Evaluation & Interpretation • Deployment “Data are values of qualitative or quantitative variables, belonging to a set of items” Variables  Numerical  Categorical (Nominal, Ordinal)  Special (Identifier, Time Index) What should data look like:  One row for each case  Columns represent attributes What does data really look like:  Transactional (normalised) data  Ordered data − Sequence data (DNA) − Time-based data (temporally auto-correlated) − Spatial data (spatially auto-correlated)  Graph-based data  Free-form Text  Image/Video (sequence of images)  Audio Id Att-1 Att-2 .. Att-M Case 1 V(1,1) V(1,2) Case 2 V(2,1) V(2,2) … Case N V(N,M) Variables: • Numerical • Categorical − Nominal − Ordinal Data Forms: • Matrix • Normalized • Ordered − Sequence − Time-Series − Spatial • Graph-based • Free-form Text • Image/Video • Audio
  22. 22. | © Copyright 2015 Hitachi Consulting23 Step 2 – Understanding the Data Answering the following questions…  What is the available data?  Do we need to acquire other data? (Publicly available/ Buy data)  What is the nature of the dataset? (Data profiling) − Number of cases − Number of attributes − Missing values (sparsity) − Numerical variables (min, max, mean, median, stdv., outliers) − Categorical variables (cardinality, frequencies, mode value) − Correlations between numerical variables − Statistical dependency between categorical variables − Statistical variance (numerical vs. categorical variables) − Inconsistencies (based on business rules) Should lead to…  Identify the data pre-processing operations needed.  Suggest the models to be used. exploratory data analysis CRISP-DM Process: • Understanding the Business • Understanding the Data • Preparing the Data • Modelling • Evaluation & Interpretation • Deployment Data Profiling: • Number of cases • Number of attributes • Missing values • Numerical variables − Min - Max - Median − Distribution (Mean, stdv.) − Outliers • Categorical variables − Cardinality − Frequencies • Correlations/Dependencies • Inconsistencies
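A minimal data-profiling sketch with pandas covering several of the checks listed above; the file name "dataset.csv" and the split into numeric and categorical columns are assumptions for illustration.

```python
# Minimal data-profiling sketch with pandas ("dataset.csv" is an assumed input file).
import pandas as pd

df = pd.read_csv("dataset.csv")

print(df.shape)                                  # number of cases / attributes
print(df.isnull().mean())                        # missing values (sparsity) per attribute
print(df.describe())                             # min, max, mean, median, stdv. of numerics
print(df.select_dtypes("object").nunique())      # cardinality of categorical variables
print(df.select_dtypes("number").corr())         # correlations between numerical variables
```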
  23. 23. | © Copyright 2015 Hitachi Consulting24 Step 3 – Preparing the Data Feature Engineering: Building the dataset.  Feature Construction: fabricating a set of (possibly) useful features. Example - Input: Sales Transactions (Customer, Product, Orders) - Objective: Customer Segmentation - Features: Days Since First Purchase, Days Since Last Purchase, Avg. Days between 2 Purchases, Last 3 Months Total Spending, Last 6 Months Total Spending, Promotion Responsiveness, New Product Responsiveness, Avg. Purchased Product Price, …, Web Usage Information, Demographics, Geographic, Economic Indices, Date Indicators, etc.  Feature Selection: Selecting the most effective subset of the available features – Filter vs. Wrapper  Feature Extraction: constructing a new set of independent (uncorrelated) features, from the existing feature set, using mathematical transformations – Principal Component Analysis (PCA), Factor Analysis (FA), etc. good luck is a residue of preparation… CRISP-DM Process: • Understanding the Business • Understanding the Data • Preparing the Data • Modelling • Evaluation & Interpretation • Deployment Data Preparation: • Feature Engineering − Feature Construction − Feature Selection − Feature Extraction • Type Conversion − Discretisation − To Numeric • Variable Tuning − Missing values − Clipping − Scaling • Row Processing − Aggregation − Removing duplicates − Sampling − Data Reduction
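A minimal feature-construction sketch in the spirit of the slide's example, deriving per-customer features from a sales-transactions table with pandas; the file name and the columns "customer_id", "order_date", and "amount" are assumptions.

```python
# Minimal feature-construction sketch from a transactions table (assumed file and columns).
import pandas as pd

tx = pd.read_csv("sales_transactions.csv", parse_dates=["order_date"])
today = tx["order_date"].max()

features = tx.groupby("customer_id").agg(
    days_since_first_purchase=("order_date", lambda d: (today - d.min()).days),
    days_since_last_purchase=("order_date", lambda d: (today - d.max()).days),
    total_spending=("amount", "sum"),
    avg_order_value=("amount", "mean"),
)
print(features.head())   # one row per customer, ready for segmentation
```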
  24. 24. | © Copyright 2015 Hitachi Consulting25 Step 3 – Preparing the Data Variable Type Conversion:  Numerical to Categorical (Discretisation) → Equal Width/ Equal Size/ Supervised.  Categorical to Numerical → One-hot/ Relative Counts Variable Tuning:  Missing Values → Eliminate/ Estimate.  Clipping Extreme Values → Fix/ Remove.  Scaling → Normalisation/ Standardisation. Row Processing:  Aggregation  Removing Duplicates  Instance Selection (Data Reduction)  Sampling/Partitioning “garbage in – garbage out”… CRISP-DM Process: • Understanding the Business • Understanding the Data • Preparing the Data • Modelling • Evaluation & Interpretation • Deployment Data Preparation: • Feature Engineering − Feature Construction − Feature Selection − Feature Extraction • Type Conversion − Discretisation − To Numeric • Variable Tuning − Missing values − Clipping − Scaling • Row Processing − Aggregation − Removing duplicates − Sampling − Data Reduction
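A minimal sketch of a few of the conversion and tuning steps above (one-hot encoding, missing-value estimation, standardisation, equal-width discretisation), assuming scikit-learn and pandas; the toy frame is purely illustrative.

```python
# Minimal data-preparation sketch (assumes scikit-learn and pandas; toy data).
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"colour": ["red", "blue", "red"],
                   "age": [23.0, None, 54.0]})

one_hot = pd.get_dummies(df["colour"])                              # categorical -> numeric
age = SimpleImputer(strategy="mean").fit_transform(df[["age"]])     # estimate missing values
age_scaled = StandardScaler().fit_transform(age)                    # standardisation
age_binned = KBinsDiscretizer(n_bins=2, encode="ordinal",
                              strategy="uniform").fit_transform(age)  # equal-width bins
print(one_hot, age_scaled.ravel(), age_binned.ravel(), sep="\n")
```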
  25. 25. | © Copyright 2015 Hitachi Consulting26 Step 4 - Modelling If you interrogate the data, it will confess… CRISP-DM Process: • Understanding the Business • Understanding the Data • Preparing the Data • Modelling • Evaluation & Interpretation • Deployment Modelling Variation: • Approaches • Algorithms • Parameters • Dataset Representations  Overall Procedure:  sets = Split( dataset, ratio);  train=sets[0]; test=sets[1];  model=Build(algorithm, train, preproc, param);  Visualize(model);  quality= Evaluate( model, test, measure);  Always Build Multiple Models:  Using different approaches.  Using different algorithms.  Using different parameters (parameter sweeping).  Using different dataset representations.  Empirical Evaluation for Model Selection
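A runnable rendering of the slide's overall procedure (split, build several candidate models, evaluate, select); it assumes scikit-learn, and the bundled toy dataset and the three candidate models are illustrative assumptions rather than the author's setup.

```python
# Runnable sketch of the slide's procedure: split, build multiple models, evaluate, select.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# build multiple models: different algorithms and parameters (parameter sweeping)
candidates = {
    "tree_depth3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "tree_depth6": DecisionTreeClassifier(max_depth=6, random_state=0),
    "logistic": LogisticRegression(max_iter=5000),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    quality = accuracy_score(y_test, model.predict(X_test))
    print(name, quality)        # empirical evaluation for model selection
```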
  26. 26. | © Copyright 2015 Hitachi Consulting27 Step 5 – Evaluation and Interpretation Model Predictive Effectiveness  Predictive Accuracy Model Comprehensibility  Interpretability → Insights  Model acceptance  Legal explanation (Justifiability) − Credit Denial − Medical Decisions Algorithm Efficiency  Scalability/running time  User Input parameters Performance Quality Aspects CRISP-DM Process: • Understanding the Business • Understanding the Data • Preparing the Data • Modelling • Evaluation & Interpretation • Deployment Predictive Model Quality • Predictive Effectiveness • Comprehensibility Algorithm Efficiency • Scalability, running time • User input parameters
  27. 27. | © Copyright 2015 Hitachi Consulting28 Step 5 – Evaluation and Interpretation Predictive Models – Predictive Effectiveness (accuracy?)  Considerations − Class Imbalance − Misclassification Cost (Expected Utility) − Single Class Focus (Hit Rate vs. False Alarms)  Measures − Confusion Matrix − Accuracy (Micro vs. Macro) − Precision, Recall, Sensitivity, Specificity, F-Measure, etc. − Area Under Curve, Lift Chart, Profit/Cost Chart, etc. − QLF, BIR, etc. (Probabilistic Classification/Regression)  Methods − Hold-out − k-fold Cross Validation − Leave-one-out Descriptive Models – It is up to you! all models are wrong, but some are useful… Confusion Matrix: rows = Predicted (Positive, Negative), columns = Actual (Positive, Negative); Predicted Positive → TP, FP; Predicted Negative → FN, TN. CRISP-DM Process: • Understanding the Business • Understanding the Data • Preparing the Data • Modelling • Evaluation & Interpretation • Deployment Predictive Model Quality • Predictive Effectiveness • Comprehensibility Algorithm Efficiency • Scalability, running time • User input parameters Predictive Quality Measures: • Accuracy (Micro vs. Macro) • Precision vs. Recall • Sensitivity vs. Specificity • Kappa – Lift – odds • QLF, CE, BIR • AUC, lift, cost charts Evaluation Methods: • Hold-out • k-fold Cross Validation
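A minimal sketch of some of the measures and methods above, assuming scikit-learn; the hard-coded labels and the 10-fold cross-validation setup on a bundled toy dataset are illustrative assumptions.

```python
# Minimal evaluation sketch: confusion matrix, precision/recall/F1, and k-fold CV.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))             # TP/FP/FN/TN counts
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())                                # 10-fold cross-validation accuracy
```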
  28. 28. | © Copyright 2015 Hitachi Consulting29 Step 6 – Deployment data mining in action! CRISP-DM Process: • Understanding the Business • Understanding the Data • Preparing the Data • Modelling • Evaluation & Interpretation • Deployment Demo Tools & Technologies • MS Azure ML • MS Analysis Services • Infer.NET • WEKA (JAVA) • R Statistics (caret, rattle) • Python (Mlpy, scikit-learn) • OpenML • C/C++ - Matlab • SAS • SPSS • RapidMiner • Apache Mahout Dataset Repository • UCI - KDD • data.gov.uk • GapMinder
  29. 29. | © Copyright 2015 Hitachi Consulting30 Screenshot – Decision Trees Microsoft Analysis Services
  30. 30. | © Copyright 2015 Hitachi Consulting31 Screenshot – Cluster Analysis Microsoft Analysis Services
  31. 31. | © Copyright 2015 Hitachi Consulting32 Screenshot – Association Rules Analysis Microsoft Analysis Services
  32. 32. | © Copyright 2015 Hitachi Consulting33 Screenshot – Time Series Microsoft Analysis Services
  33. 33. | © Copyright 2015 Hitachi Consulting34 Screenshot – ML Experiment Microsoft Azure Machine Learning
  34. 34. | © Copyright 2015 Hitachi Consulting35 Screenshot – ML Web Services Microsoft Azure Machine Learning
  35. 35. | © Copyright 2015 Hitachi Consulting36 Screenshot – Probabilistic Models Microsoft Infer.net
  36. 36. | © Copyright 2015 Hitachi Consulting37 Screenshot – Classification Rules Java - WEKA
  37. 37. | © Copyright 2015 Hitachi Consulting38 Screenshot – Text Mining R Statistics
  38. 38. | © Copyright 2015 Hitachi Consulting39 Screenshot – Regression Models R Statistics
  39. 39. | © Copyright 2015 Hitachi Consulting40 Concluding Remarks a few takeaways… • Understand the business problem first, please! • Use the appropriate tool/technique that best suits the business problem, not the other way around. • Start by solving simple business problems first, before moving to complex ones (BI Insight Maturity Journey). • Spend some time exploring and understanding the data. • Incorporate domain knowledge in your analysis (avoid reinventing the wheel!). • Data preparation is very important for building effective models. • Data mining is an experimental/ iterative process (not ideal for fixed-price projects!). • Try to tackle the business problem with different analytic approaches. • It is clever to solve complex problems with simple techniques.
  40. 40. | © Copyright 2015 Hitachi Consulting41 My Background Applying Ant Colony Optimisation (ACO) in Building Classification Models • Honorary Research Fellow, School of Computing, University of Kent. • Ph.D. Computer Science, University of Kent, Canterbury, UK. • M.Sc. Computer Science, The American University in Cairo, Egypt. • 20+ published journal and conference papers, focusing on: – classification rules induction, – decision trees construction, – Bayesian classification modelling, – data reduction, – instance-based learning, and – evolving neural networks. • Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Intelligent Data Analysis, Applied Soft Computing, and Memetic Computing. • Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, and INNS-BigData.
