36x48_Trifold_FinalPoster

This document discusses using machine learning to automate tracking of data usage (DU) quality metrics. It outlines automating the collection of DU event data from multiple sources and storing it in a normalized database. Machine learning algorithms would then be implemented to predict DU events and identify contributing factors. Key steps completed include identifying relevant data sources and metrics, and setting up schemas to automatically update and pull data. Current work involves implementing predictive learning algorithms and refining the data extraction, transformation and loading processes. Challenges include dealing with nested hierarchical data, variable recording intervals, and changing features between releases.

Machine Learning and Quality Analytics
Ryan Riopelle, UC San Diego/CU Boulder
Engineering Intern, Database Management and Analytics Team, Solidfire
Automating DU Quality Tracking
Identify data sources and quality metrics that have been previously
monitored.
Automate data processes for recording and evaluating DU events.
Machine Learning and Predictive Analytics
Implement machine learning algorithms to predict DU events.
Determine the key contributing factors involved in DU events.
Group customer feature usage into patterns that can be tracked.
Build a streaming pipeline for anomaly detection over customer usage patterns.
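As a rough illustration of the streaming anomaly detection goal, the sketch below flags values that sit far from the running mean of a stream. The poster does not say which detection method was planned, so the running z-score rule, the `threshold` and `warmup` parameters, and the sample usage counts are all assumptions made for illustration.

```python
import math


class RunningStats:
    """Welford's online algorithm for the mean and variance of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; 0.0 until at least two points are seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def detect_anomalies(stream, threshold=3.0, warmup=5):
    """Yield (index, value) for points more than `threshold` standard
    deviations away from the running mean of the points seen before them."""
    stats = RunningStats()
    for i, x in enumerate(stream):
        if stats.n >= warmup:
            sd = stats.std()
            if sd > 0 and abs(x - stats.mean) > threshold * sd:
                yield i, x
        stats.update(x)


# Hypothetical per-interval usage counts for one customer (not poster data).
usage = [10, 11, 9, 10, 12, 11, 10, 55, 10, 9]
anomalies = list(detect_anomalies(usage))
print(anomalies)
```

Because the statistics are updated one value at a time, the same loop could consume an unbounded stream without holding the history in memory.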
Poster sections: Project Goals, Methods, Milestones Achieved, Benefits and Challenges, Future Work
Query Data
K-Means Clustering Columns
Clustering Input: Group using K-Means, Mean Shift, Affinity Propagation, Spectral Clustering
Use Cluster Variance Analysis to Determine N-Groups
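One way to read the clustering step is to run K-Means for several group counts and compare the within-cluster variance of each run. The sketch below does that with a small hand-written K-Means so it needs only the standard library; a library such as scikit-learn would normally be used instead. The deterministic farthest-first initialization and the sample points are assumptions for illustration, not the poster's method or data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic initialization: start at the first point, then keep
    adding the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=100):
    """Lloyd's algorithm. Returns (centroids, labels, inertia), where inertia
    is the within-cluster sum of squared distances."""
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia


# Hypothetical 2-D feature vectors forming three well-separated groups.
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (1, 10), (1, 11), (2, 10), (2, 11),
]

# Inertia drops sharply until k reaches the natural number of groups,
# then flattens; that "elbow" suggests the group count.
inertia_by_k = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 2))
```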
TF-IDF Vectorizer, Python Dictionary
Inverse Document Frequency (IDF) weighting
Clustered Categorical Data
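The TF-IDF step can be sketched with plain Python dictionaries, which matches the "Python Dictionary" label on the poster. This version uses the classic weighting, term frequency times `log(N / document frequency)`; library vectorizers often apply a smoothed variant, so the numbers would differ. The ticket texts are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dictionary per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {term: (count / len(tokens)) * idf[term] for term, count in counts.items()}
        )
    return vectors


# Hypothetical free-text ticket summaries (not data from the poster).
tickets = [
    "node offline after upgrade",
    "drive failure on node",
    "upgrade failed drive offline",
]
vectors = tfidf(tickets)
for vector in vectors:
    print(sorted(vector.items(), key=lambda item: -item[1]))
```

Terms that occur in only one ticket, such as "after", receive a higher weight than terms shared across tickets, such as "node".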
Split Data: Test Data, Training Data
Out to Streaming Analysis
ML Classification: AdaBoost, Boosted Trees, Support Vector Machines (SVM), Neural Networks (Multi-layer Perceptron), Stochastic Gradient Descent (SGD)
Optimization Tuning: Bagging, Ensembles, Boosting, Changing Kernel Functions, Changing Learning Rate/Step Size Parameter, Loss/Error Function
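To make the split, train, and tune steps concrete, here is a minimal sketch of one of the listed approaches: a linear classifier trained with stochastic gradient descent on the logistic loss, evaluated on a held-out test set. It is written against the standard library only, and the synthetic data, the `learning_rate` and `epochs` defaults, and the function names are assumptions for illustration rather than the poster's implementation.

```python
import math
import random


def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle indices reproducibly and split into training and test sets."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]
    return (
        [rows[i] for i in train], [labels[i] for i in train],
        [rows[i] for i in test], [labels[i] for i in test],
    )


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(rows, labels, learning_rate=0.1, epochs=200, seed=0):
    """Fit weights and bias by SGD on the logistic (log) loss.
    `learning_rate` is the step size named in the tuning list."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            logit = sum(w * x for w, x in zip(weights, rows[i])) + bias
            error = sigmoid(logit) - labels[i]  # gradient of log loss w.r.t. logit
            weights = [w - learning_rate * error * x for w, x in zip(weights, rows[i])]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, row):
    logit = sum(w * x for w, x in zip(weights, row)) + bias
    return 1 if logit >= 0 else 0


# Synthetic, clearly separable data: label 1 when the two features are large.
rows, labels = [], []
for i in range(10):
    for j in range(10):
        if i + j <= 7 or i + j >= 12:
            rows.append((i / 10, j / 10))
            labels.append(1 if i + j >= 12 else 0)

x_train, y_train, x_test, y_test = train_test_split(rows, labels)
weights, bias = sgd_logistic(x_train, y_train)
correct = sum(predict(weights, bias, r) == y for r, y in zip(x_test, y_test))
accuracy = correct / len(x_test)
print("test accuracy:", accuracy)
```

Changing `learning_rate`, or swapping the logistic loss for another loss function, corresponds to the tuning options listed above.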
Benefits
Improve analytics by reducing the time it takes to manually query each data source.
Reduce overhead by cutting the time spent manually querying each data source.
Normalize data processes for consistent and reliable analytics.
Identify a consistent set of key performance indicators (KPIs) companywide.
Challenges
Dealing with highly nested hierarchical data.
Variable time intervals for metrics recorded by collectors.
Constantly changing features associated with different element releases.
Dealing with effects of multicollinearity across data.
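For the variable recording intervals, one common workaround is to bucket readings into fixed-width windows before analysis. The sketch below averages readings per window; the 60-second interval and the sample readings are assumptions for illustration, and the poster does not say how this challenge was actually handled.

```python
from collections import defaultdict


def resample_mean(samples, interval):
    """Group (timestamp_seconds, value) pairs into fixed-width buckets and
    return a sorted list of (bucket_start, mean_value)."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % interval
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical collector readings taken at irregular times (seconds).
readings = [(0, 10.0), (7, 12.0), (61, 20.0), (65, 22.0), (118, 24.0), (185, 30.0)]
resampled = resample_mean(readings, 60)
print(resampled)
```

Windows with no readings, such as the one starting at 120 seconds here, are simply absent from the result; whether to interpolate or leave such gaps is a separate decision.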
AIQ
System design, operations and data management.
Operations
Data management best practices
Monitoring and alerting
Disaster Recovery
Implementation of replication and backup practices for
critical business systems: AIQ, AT2, and DMA.
Support
Move from a reactive to a predictive model for DU/DL.
Engineering
Analytics tools for better resource management and
identification of potential problems.
Automating DU Quality Tracking
Identify data sources and quality metrics previously monitored.
Completed:
• Identify schema structure for database and machine learning.
• Set up schemas with automated updates using Cron and SQL.
• Automated data store with information automatically pulled in from AIQ, Salesforce, Jira, and FogBugz.
Currently working on:
• Implement predictive learning algorithms.
• Measure performance for input fields related to DU events.
• Fine-tune the extraction, transformation, and loading (ETL) pipeline.
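A scheduled refresh of this kind is often written as a small idempotent script that cron runs at a fixed interval. The sketch below uses an in-memory SQLite database as a stand-in for the real data store; the table name `du_events`, its columns, the crontab path, and the sample rows are all hypothetical.

```python
import sqlite3

# Example crontab entry (hypothetical path) that would run the refresh hourly:
#   0 * * * * /usr/bin/python3 /opt/etl/refresh_du_events.py


def refresh(conn, fetched_rows):
    """Create the table if needed, then insert new rows and overwrite rows
    whose primary key already exists, so repeated runs are safe."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS du_events (
               event_id TEXT PRIMARY KEY,
               source   TEXT NOT NULL,
               status   TEXT NOT NULL
           )"""
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO du_events (event_id, source, status) "
            "VALUES (?, ?, ?)",
            fetched_rows,
        )


conn = sqlite3.connect(":memory:")

# First scheduled run: two events pulled from the source systems.
refresh(conn, [("E-1", "jira", "open"), ("E-2", "salesforce", "open")])
# Second run: E-1 has changed status and E-3 is new.
refresh(conn, [("E-1", "jira", "closed"), ("E-3", "aiq", "open")])

rows = conn.execute(
    "SELECT event_id, source, status FROM du_events ORDER BY event_id"
).fetchall()
print(rows)
```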
Data Sources
Normalized Data Store
Data Extraction, Transformation, and Loading
Automating DU Tracking
Build Python connections using SQLAlchemy and PYSH 2 connectors.
Import NoSQL data into a MySQL database schema, normalize it, and perform ETL processes.
Provide visualizations and regularly timed export options for tracking key performance indicators (KPIs).
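The import and normalize steps can be sketched as follows: nested, document-style records are split into a parent table and a child table, and a KPI summary is then exported as CSV. The standard-library `sqlite3` module stands in for MySQL and SQLAlchemy here so that the example is self-contained; the document layout, table names, and the KPI chosen are assumptions for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical nested documents, as they might arrive from a NoSQL source.
documents = [
    {"customer": {"id": 1, "name": "Acme"},
     "events": [{"type": "DU", "minutes": 12}, {"type": "DU", "minutes": 3}]},
    {"customer": {"id": 2, "name": "Globex"},
     "events": [{"type": "DU", "minutes": 5}]},
]


def load(conn, docs):
    """Normalize each document into one customers row plus one events row
    per nested event."""
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE events (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            type TEXT NOT NULL,
            minutes INTEGER NOT NULL
        );
        """
    )
    with conn:
        for doc in docs:
            customer = doc["customer"]
            conn.execute(
                "INSERT INTO customers (id, name) VALUES (?, ?)",
                (customer["id"], customer["name"]),
            )
            conn.executemany(
                "INSERT INTO events (customer_id, type, minutes) VALUES (?, ?, ?)",
                [(customer["id"], e["type"], e["minutes"]) for e in doc["events"]],
            )


def kpi_csv(conn):
    """Export event count and total minutes per customer as CSV text."""
    rows = conn.execute(
        """SELECT c.name, COUNT(e.id), SUM(e.minutes)
           FROM customers c JOIN events e ON e.customer_id = c.id
           GROUP BY c.id, c.name
           ORDER BY c.name"""
    ).fetchall()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["customer", "event_count", "total_minutes"])
    writer.writerows(rows)
    return buffer.getvalue()


conn = sqlite3.connect(":memory:")
load(conn, documents)
report = kpi_csv(conn)
print(report)
```

A regularly timed export would call `kpi_csv` from a scheduled job and write the text to a file or dashboard feed.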
Algorithms and Training
Data Management Life Cycle
Dealing With Textual Data
