Machine Learning and Quality Analytics
Ryan Riopelle, UC San Diego/CU Boulder
Engineering Intern, Database Management and Analytics Team, Solidfire
Automating DU Quality Tracking
Identify data sources and quality metrics that have been previously
monitored.
Automate data processes for recording and evaluating DU events.
Machine Learning and Predictive Analytics
Implement machine learning algorithms to predict DU events.
Determine the key contributing factors involved with DU events.
Group customer feature usage into patterns that can be tracked.
Build a streaming pipeline for anomaly detection over customer usage
patterns.
Future Work
Methods
Milestones Achieved
Project Goals
Benefits and Challenges
Query Data
K-Means Clustering Columns
Clustering Input: Group using K-Means, Mean Shift, Affinity Propagation, Spectral Clustering
Use Cluster Variance Analysis to Determine N-Groups
TF-IDF Vectorizer, Python Dictionary, Inverse Document Frequency (IDF) weighting
Clustered Categorical Data
Split Data: Test Data, Training Data
Out to Streaming Analysis
ML Classification: AdaBoost, Boosted Trees, Support Vector Machines (SVM), Neural Networks (Multi-layer Perceptron), Stochastic Gradient Descent (SGD)
Optimization Tuning: Bagging, Ensembles, Boosting, Changing Kernel Functions, Changing Learning Rate/Step Size Parameter, Loss/Error Function
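The TF-IDF weighting step listed above can be sketched in plain Python, with the result held in dictionaries. This is a minimal illustration and not the poster's implementation: the function name tfidf, the feature-usage tokens, and the smoothed formula idf = ln((1 + N) / (1 + df)) + 1 are assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {token: weight} dict per document.

    documents: list of token lists (hypothetical feature-usage logs).
    """
    n_docs = len(documents)
    # Document frequency: how many documents contain each token.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        # Term frequency (share of the document) times the IDF weight.
        weighted.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return weighted


# Hypothetical per-customer feature usage, one token list per customer.
usage = [
    ["snapshot", "replication", "snapshot"],
    ["snapshot", "qos"],
    ["qos", "qos", "clone"],
]
weights = tfidf(usage)
```

Tokens that occur for only one customer (here "replication" and "clone") receive a larger IDF factor than tokens shared across customers, which is what makes the weighted vectors useful as clustering input.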
Benefits
Improve analytics by reducing the time it takes to manually query
each data source.
Reduce overhead by reducing the time it takes to manually query each
data source.
Normalize data processes for consistent and reliable analytics.
Identify a consistent set of key performance indicators (KPIs) companywide.
Challenges
Dealing with highly nested hierarchical data.
Variable time intervals for metrics recorded by collectors.
Constantly changing features associated with different element releases.
Dealing with effects of multicollinearity across data.
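One common way to handle the nested hierarchical data named in the challenges above is to flatten each record into dotted column names before loading it into a relational table. The sketch below is only an illustration under assumptions: the function name flatten, the dotted-key convention, and the sample record are hypothetical and not taken from the project.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts and lists into a single-level dict with dotted keys."""
    if isinstance(record, dict):
        pairs = record.items()
    elif isinstance(record, list):
        pairs = enumerate(record)  # list positions become key segments
    else:
        return {parent_key: record}  # leaf value
    items = {}
    for key, value in pairs:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        items.update(flatten(value, new_key, sep))
    return items


# Hypothetical nested metrics record.
sample = {
    "cluster": {
        "name": "c1",
        "nodes": [{"id": 1, "iops": 500}, {"id": 2, "iops": 750}],
    }
}
flat = flatten(sample)
```

Note that empty dicts and lists produce no keys in this sketch, so a record whose nested collections are empty loses those fields after flattening.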
AIQ
System design, operations and data management.
Operations
Data management best practices
Monitoring and alerting
Disaster Recovery
Implementation of replication and backup practices for
critical business systems: AIQ, AT2, and DMA.
Support
Move from a reactive to a predictive model for DU/DL.
Engineering
Analytics tools for better resource management and
identification of potential problems.
Automating DU Quality Tracking
Identify data sources and quality metrics previously monitored.
Completed:
  • Identify schema structure for database and machine learning.
  • Set up schemas with automated updates using Cron and SQL.
  • Automated data store with information automatically pulled in from
    AIQ, Salesforce, Jira, and FogBugz.
Currently working on:
  • Implement predictive learning algorithms.
  • Measure performance for input fields related to DU events.
  • Fine-tune the extraction, transformation, and loading (ETL) pipeline.
Data Sources
Normalized Data Store
Data Extraction, Transformation, and Loading
Automating DU Tracking
Build Python connections using SQLAlchemy and PYSH 2 connectors.
Import NoSQL data into a MySQL database schema, normalize it, and perform ETL
processes.
Provide visualizations and regularly timed export options for tracking key performance
indicators (KPIs).
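The import-and-normalize step above can be sketched end to end with the standard library. This is a hedged illustration, not the project's pipeline: an in-memory sqlite3 database stands in for the MySQL store and SQLAlchemy connection, and the table names, JSON fields, customer names, and dates are all hypothetical.

```python
import json
import sqlite3

# Hypothetical NoSQL-style documents, one JSON string per customer report.
raw_documents = [
    '{"customer": "acme", "events": [{"type": "DU", "ts": "2016-07-01"}]}',
    '{"customer": "globex", "events": []}',
    '{"customer": "acme", "events": [{"type": "DL", "ts": "2016-07-03"}]}',
]

conn = sqlite3.connect(":memory:")  # assumption: stand-in for the MySQL store
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE event (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        type TEXT NOT NULL,
        ts TEXT NOT NULL
    );
""")

for raw in raw_documents:
    doc = json.loads(raw)  # extract
    # Normalize: each customer is stored once, events reference it by id.
    conn.execute("INSERT OR IGNORE INTO customer (name) VALUES (?)", (doc["customer"],))
    (customer_id,) = conn.execute(
        "SELECT id FROM customer WHERE name = ?", (doc["customer"],)
    ).fetchone()
    for event in doc["events"]:  # transform: one row per nested event
        conn.execute(
            "INSERT INTO event (customer_id, type, ts) VALUES (?, ?, ?)",
            (customer_id, event["type"], event["ts"]),
        )
conn.commit()

# A KPI-style query: number of DU events per customer, including zero counts.
du_counts = conn.execute(
    "SELECT c.name, COUNT(e.id) FROM customer c "
    "LEFT JOIN event e ON e.customer_id = c.id AND e.type = 'DU' "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
```

Keeping the event-type filter in the join condition rather than a WHERE clause preserves customers with no DU events, so they appear in the export with a count of zero.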
Algorithms and Training
Data Management Life Cycle
Dealing With Textual Data
