
[DSC Europe 22] Smart approach in development and deployment process for various ML models - Danijel Ilievski & Milos Josifovic


When developing a machine learning model, about 80% of the time goes into data preparation and resolving data quality issues, especially when data from structured and unstructured sources must be combined. Developing a smart generic data mart can reduce the time to production for new ML models. We will share creative solutions to the challenges we encountered in data transfer between the DWH and the Data Lake, as well as in data preprocessing, development, and deployment/orchestration of ML models using Python/PySpark scripts.


  1. Smart approach in development and deployment process for various ML models
     Jelena Pekez (Advanced Analytics Team Lead), Miloš Josifović (Big Data Architect), Danijel Ilievski (Senior ML Engineer)
  2. Comtrade System Integration: Introduction
     → Since 87% of models are never deployed, all steps should be planned at the beginning of the Data Science Lifecycle (pipeline): 1. Manage, 2. Develop, 3. Deploy, 4. Monitor.
     → The first goal is to reduce the time to production for new ML models by developing Smart Generic Data Mart(s).
     → With Smart Data Mart(s) we can prototype an ML model and evaluate its feasibility.
     → The final goal is to generate production models and orchestrate them easily.
     [Diagram labels: Results Interpretation, Modeling, Data Preprocess, Data mart design, ADS, Problem Formulation, Deployment, PROD.MODEL]
  3. ADS smart development to support all future ML models
     → Planning the Data Mart for the first ML model in a program takes an exhaustive amount of time:
       • Collect, at a high level, all possible future use cases
       • Come up with all relevant and available data sources
       • Identify the customer activities the company has an interest in
       • Combine data from structured and unstructured data sources
       • Extensive feature engineering (text processing, normalization, binning, …)
       • Comply with GDPR regulation
       • Define proper access rights on the selected Data Mart(s)
       • Resolving data quality issues at the very beginning will reduce endless reloads
     For the next ML model, data scientists can spend more time on creative activities using the developed Analytical Datamarts/Sets (ADS).
  4. Smart generic data mart(s)
     → Creating multipurpose Data Marts:
       • Generate a list of target features and relevant target events
       • Design it so that new events can be added easily
       • Eliminate data that has no business/use-case value
       • Filter out system records (clean the data)
       • Make the initial (starting) base table(s): what is the definition of a customer?
       • Aggregate data to different granularity levels to catch behavior trends
       • Feature engineering does indeed make a difference!
     Generate new ML training datasets quickly and easily.
  5. Data Science requires domain knowledge: it makes a big difference
     → How much domain knowledge do I need? It depends.
     → Domain knowledge is critical for data preparation, productization and orchestration.
     → Which data points add value?
     → Domain knowledge is necessary in data pre-processing:
       • Outlier detection, feature importance, model selection, model evaluation stage...
     [Venn diagram labels: DATA SCIENCE, DOMAIN KNOWLEDGE, MATH, STATS & ML, COMPUTER SCIENCE]
     You have to get the best of both worlds!
  6. Control your data mart(s) in production
     → Steps in the data pipeline for data quality checks:
       • Missing data vs. loaded data: aggregations
       • Duplicates: the same records were repeated
       • Relative change threshold: increment or decrement in the number of records
       • Statistical expected range
       • Data drift: target variable distribution
     [Diagram: Data Pipeline]
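
The duplicate, relative-change and expected-range checks listed on this slide could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline; the function names, the 20% change threshold and the 3-sigma range are all assumptions chosen for the example.

```python
from collections import Counter
from statistics import mean, stdev


def find_duplicates(rows):
    """Return the records that occur more than once (rows must be hashable, e.g. tuples)."""
    counts = Counter(rows)
    return [row for row, n in counts.items() if n > 1]


def relative_change(previous_count, current_count):
    """Relative increase or decrease in the number of records between two loads."""
    if previous_count == 0:
        raise ValueError("previous_count must be positive")
    return (current_count - previous_count) / previous_count


def within_expected_range(history, value, n_sigmas=3.0):
    """Check whether value lies within mean +/- n_sigmas * stdev of past values."""
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) <= n_sigmas * sigma


def check_load(rows, previous_count, count_history, max_rel_change=0.2):
    """Run all checks and return a list of readable issues (empty if the load looks clean)."""
    issues = []
    dupes = find_duplicates(rows)
    if dupes:
        issues.append(f"{len(dupes)} duplicated record(s)")
    change = relative_change(previous_count, len(rows))
    if abs(change) > max_rel_change:  # hypothetical threshold
        issues.append(f"row count changed by {change:+.0%}")
    if not within_expected_range(count_history, len(rows)):
        issues.append("row count outside expected statistical range")
    return issues
```

A pipeline step could call `check_load` after each load and stop downstream scoring when the returned list is not empty. Data drift on the target variable would need a distribution comparison, which is left out of this sketch.
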
  7. Example of how a Generic Data Set can help to focus on Data Science: transfer between DWH and Data Lake
     → Data lives on two platforms (DWH: SQL database, Data Lake: Hadoop).
     → Data can be transferred between the databases:
       • Through SQL federation / DB link, subject to certain specifics and product compatibility
       • Via the Spark engine (PySpark) to Hadoop
     → The aim is to simplify data transfer between platforms so that Data Scientists can do it on their own, without:
       • Dealing with Spark jobs directly
       • Managing Hadoop security (Kerberos, read-write permissions, etc.)
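
A thin wrapper of the kind this slide describes might look like the sketch below. It assumes the DWH is reachable through Spark's JDBC data source and that an already configured `SparkSession` is passed in; the helper names, the connection URL and the partition column are hypothetical. Kerberos and permission handling are deliberately out of scope.

```python
def jdbc_read_options(url, table, user, password,
                      partition_column=None, lower_bound=None,
                      upper_bound=None, num_partitions=None):
    """Build the option dictionary for Spark's JDBC data source."""
    options = {"url": url, "dbtable": table, "user": user, "password": password}
    if partition_column is not None:
        # These four options make Spark read the table in parallel slices.
        options.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower_bound),
            "upperBound": str(upper_bound),
            "numPartitions": str(num_partitions),
        })
    return options


def transfer_table(spark, options, target_table):
    """Copy one DWH table into the Data Lake as a table; spark is a SparkSession."""
    df = spark.read.format("jdbc").options(**options).load()
    df.write.mode("overwrite").saveAsTable(target_table)
    return target_table


# Hypothetical connection details, for illustration only.
example_options = jdbc_read_options(
    url="jdbc:oracle:thin:@//dwh-host:1521/DWH",
    table="ADS.DS_PAYMENT",
    user="ds_user",
    password="***",
    partition_column="CUSTOMER_ID",
    lower_bound=1,
    upper_bound=1_000_000,
    num_partitions=8,
)
```

With such a wrapper a data scientist would only call `transfer_table(spark, example_options, "lake.ds_payment")`, while the platform team owns the Spark and security configuration.
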
  8. Speed up writing SQL queries
     → ADS → [GENERATE SQL QUERY] → Training/Scoring table
     → Query automation for the training table
     → Input for the Python script:

        SCHEMA  SOURCE      VAR_IN                                                     VAR_OUT                FUNCTIONS          PERIODS  ZERO EXCLUDE
        ADS     DS_PAYMENT  TOTAL_PAYMENT_AMT                                          TOTAL_PAYMENT_AMT      [MAX, AVG/P]       [3, 6]   1
        ADS     DS_PAYMENT  TOTAL_PAYMENT_CNT                                          TOTAL_PAYMENT_CNT      [SUM]              [1]      1
        ADS     DS_PAYMENT  MAX_PAYMENT_AMT                                            MAX_PAYMENT_AMT        [MAX]              [3]      1
        ADS     DS_PAYMENT  MIN_PAYMENT_AMT                                            MIN_PAYMENT_AMT        [MIN]              [3]      1
        ADS     DS_PAYMENT  ADD_PAYMENT_CNT                                            ADD_PAYMENT_CNT        [AVG/P]            [6]      1
        ADS     DS_USAGE    USAGE_OUT_DUR                                              USAGE_OUT_DUR          [SUM]              [1]      1
        ADS     DS_USAGE    USAGE_OUT_DUR                                              USAGE_OUT_DUR          [AVG/P, MAX, MIN]  [3, 6]   1
        ADS     DS_USAGE    USAGE_OUT_IN_PACK_DUR                                      USAGE_OUT_IN_PACK_DUR  [SUM]              [1]      1
        ADS     DS_USAGE    NVL(USAGE_OUT_REG_INT_DUR, 0) + NVL(USAGE_OUT_INT_DUR,0)   USAGE_OUT_INT_DUR      [AVG/P]            [6]      1

     → Excerpt of the Python script:

        # Each entry of `variables` is (VAR_IN, VAR_OUT, FUNCTIONS, PERIODS, ZERO_EXCLUDE).
        for i, line in enumerate(variables):
            for i2, k in enumerate(line[2]):  # function
                for i3, kk in enumerate(line[3]):  # period
                    # `zarez` (comma) separates columns; the last column gets none.
                    if (i == len(variables) - 1) and (i2 == len(line[2]) - 1) and (i3 == len(line[3]) - 1):
                        zarez = ''
                    else:
                        zarez = ','
                    # Creates the aggregation column, e.g. AVG(FIELD_NAME) AS NEW_FIELD_NAME
                    divider = ''
                    if 'AVG/P' == str.upper(k):
                        func1 = 'SUM'
                        func2 = '_' + 'AVG'
                        divider = '/' + str(kk)
                    elif ('SUM' == str.upper(k)) and (kk == '1'):
                        func1 = 'SUM'
                        func2 = ''
                    else:
                        func1 = k
                        func2 = '_' + k
                    query += (func1 + '(' + line[1] + '_' + str(kk) + 'M' + ')' + divider
                              + ' AS ' + line[1] + func2 + '_' + str(kk) + 'M' + zarez + ' \n')
        …
        for i, line in enumerate(variables):
            for i2, line2 in enumerate(line[3]):
                if (i == len(variables) - 1) and (i2 == len(line[3]) - 1):
                    zarez = ''
                else:
                    zarez = ','
                # Optionally exclude zero values from the time-windowed column.
                if line[4] == 1:
                    zero_rule = 'AND {varijabla} <> 0'.format(varijabla=line[0])
                else:
                    zero_rule = ''
                query += ("CASE WHEN TIME_ID BETWEEN ADD_MONTHS('{datum_place}', {vreme2}) AND '{datum_place}' "
                          "{zero_rule} THEN {varijabla} ELSE NULL END AS {varijabla2}_{vreme}M{zarez_place}".format(
                              varijabla=line[0], varijabla2=line[1], datum_place=datum,
                              vreme2=-1 * (int(line2) - 1), zero_rule=zero_rule,
                              vreme=line2, zarez_place=zarez)) + ' \n'
        query += ("FROM\n …
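
To make the effect of the specification table concrete, here is a compact, self-contained sketch of the aggregation step only. It follows the naming rules visible in the slide's excerpt but is not the authors' script: the function name `aggregation_columns` is hypothetical, periods are assumed to be integers, and the column list is joined with `",\n".join(...)` rather than by tracking a trailing comma.

```python
def aggregation_columns(var_out, functions, periods):
    """Expand one specification row into SQL aggregation expressions."""
    columns = []
    for func in functions:
        for months in periods:
            source = f"{var_out}_{months}M"
            if func.upper() == "AVG/P":
                # Average per period: sum over the window divided by its length in months.
                columns.append(f"SUM({source})/{months} AS {var_out}_AVG_{months}M")
            elif func.upper() == "SUM" and months == 1:
                # A one-month sum keeps the plain output name.
                columns.append(f"SUM({source}) AS {var_out}_{months}M")
            else:
                columns.append(f"{func}({source}) AS {var_out}_{func}_{months}M")
    return columns


# Two rows taken from the slide's input table: (VAR_OUT, FUNCTIONS, PERIODS).
spec = [
    ("TOTAL_PAYMENT_AMT", ["MAX", "AVG/P"], [3, 6]),
    ("TOTAL_PAYMENT_CNT", ["SUM"], [1]),
]

select_list = ",\n".join(
    column for row in spec for column in aggregation_columns(*row)
)
print(select_list)
# MAX(TOTAL_PAYMENT_AMT_3M) AS TOTAL_PAYMENT_AMT_MAX_3M,
# MAX(TOTAL_PAYMENT_AMT_6M) AS TOTAL_PAYMENT_AMT_MAX_6M,
# SUM(TOTAL_PAYMENT_AMT_3M)/3 AS TOTAL_PAYMENT_AMT_AVG_3M,
# SUM(TOTAL_PAYMENT_AMT_6M)/6 AS TOTAL_PAYMENT_AMT_AVG_6M,
# SUM(TOTAL_PAYMENT_CNT_1M) AS TOTAL_PAYMENT_CNT_1M
```

Joining a list of expressions avoids the "is this the last column" bookkeeping, which is one common source of stray commas in generated SQL.
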
  9. Develop phase: devote more time to the creative side
     → Improve traditional ML development processes:
       • Benefit from pre-trained models (deep learning, mainly image recognition)
       • Automated machine learning (AutoML): pretty good in supervised ML
     → AutoML:
       • Optimizes the DS workload or compensates for lack of experience
       • Handles tasks like feature selection, data preprocessing, hyperparameter optimization, model/algorithm selection
       • Lets you focus more on the data side
       • Is no silver bullet; it is more an exploration tool than an optimal model generation tool
     MLBox, Auto-Sklearn, TPOT, H2O AutoML, Auto Keras, Auto PyTorch, Google Cloud AutoML, DataRobot, etc.
  10. Deploy phase: you don't get any value out of a model sitting on someone's computer
     → The phase where the model is transferred to a production environment.
     → The same best-practice principles and design patterns used for software also apply to ML models.
     → The ML model should be deployed as part of the existing data pipeline.
     → The output of the ML model should be monitored for bias.
     → An ML model in the deploy phase:
       • Is registered in an appropriate repository
       • Has passed testing
       • Has its model artifacts retained
     → Validate model → Publish model → Deliver model
     → Don't update Python libraries before proper testing in the development environment 😊
  11. Deploy phase: more than one ML model
     → Model registry:
       • Place for all trained/production-ready models (with version control)
       • Alternative models as backup
       • All model artifacts, model dependencies, evaluation metrics, documentation
       • Which dataset was used for training (model lineage)
       • Log of the model's performance details and comparison with other models
       • Tracking of models over their whole lifetime (training, staging and production)
     → A model registry enables faster deployment of your models and retraining of current ones.
     → Shared by multiple team members (team collaboration)
     → Tie up business rules and the output from the production model
     → Consume the model through API integration
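
The registry features listed on this slide (versioning, lineage, metrics, stages, a backup model) can be illustrated with a small in-memory sketch. It is a teaching example under stated assumptions, not a production registry and not the tool the authors used; the class names, the model name `churn` and the dataset names are hypothetical.

```python
from dataclasses import dataclass, field

STAGES = ("training", "staging", "production")


@dataclass
class ModelVersion:
    name: str
    version: int
    training_dataset: str  # model lineage: which dataset was used for training
    metrics: dict
    stage: str = "training"


@dataclass
class ModelRegistry:
    _versions: dict = field(default_factory=dict)

    def register(self, name, training_dataset, metrics):
        """Store a new version of a model; versions are numbered from 1."""
        versions = self._versions.setdefault(name, [])
        model_version = ModelVersion(name, len(versions) + 1, training_dataset, dict(metrics))
        versions.append(model_version)
        return model_version

    def promote(self, name, version, stage):
        """Move a version to a stage; the previous production model becomes the staging backup."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        target = self._versions[name][version - 1]
        if stage == "production":
            for model_version in self._versions[name]:
                if model_version.stage == "production":
                    model_version.stage = "staging"
        target.stage = stage
        return target

    def production_model(self, name):
        """Return the version currently in production, or None."""
        for model_version in self._versions.get(name, []):
            if model_version.stage == "production":
                return model_version
        return None

    def best(self, name, metric):
        """Compare versions of one model by a logged metric (higher is better)."""
        return max(self._versions[name], key=lambda mv: mv.metrics[metric])


registry = ModelRegistry()
v1 = registry.register("churn", "ADS.CHURN_TRAIN_V1", {"auc": 0.81})
v2 = registry.register("churn", "ADS.CHURN_TRAIN_V2", {"auc": 0.84})
registry.promote("churn", 1, "production")
registry.promote("churn", 2, "production")  # version 1 is kept as the backup
```

In practice a shared registry service fills this role so that all team members see the same versions, and the production model is consumed through an API rather than an in-process object.
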
  12. Conclusion
       • Single pipeline for data transfer
       • Easy deployment
       • Smart Generic Data Mart(s)
       • More creative time
  13. Contact us at: Danijel.Ilievski@comtrade.com, Jelena.Pekez@comtrade.com, Milos.Josifovic@comtrade.com
  14. Q&A
  15. Thank you
     www.comtradeintegration.com
     Copyright © 2020 Comtrade. All rights reserved. The content of this presentation is copyright protected. Any reproduction, distribution, or modification is not allowed. The information, solutions, and opinions contained in this presentation are of an informative nature only and are not intended to be a comprehensive study, nor should they be relied on or treated as a means to provide a complete solution or advice, since we may not be aware of all specific circumstances of the case. We try to provide quality information, but we make no claims, promises, or guarantees about the accuracy, completeness, or adequacy of the information contained herein.

Editor's notes

  • DANIJEL
  • DANIJEL
    During deployment in large organizations, we have to orchestrate more than one ML model. It is best to keep in mind from the very beginning of a project that more ML models will follow, and to organize everything so that new models can be added easily.

    - From the very beginning, the Data Science Lifecycle should put special focus on data quality and production.

    Foundation for more models in the future:
    Developing the analytical dataset for future models can be treated as a separate project.
  • JELENA
    So, if we go into more detail…

    When developing a model, the focus is on data preparation
    – Organize DB tables considering performance and optimization
    Analysis, adding columns, important sources
    Think through the sources and target tables, and how to organize the tables with regard to performance and logic; have a high-level view of what the use cases are.
  • JELENA

    MENTION:
    Organize DB tables considering performance and optimization
    Feature engineering isn't about generating a higher quantity of new features; it's about the quality of the features created.
  • DANIJEL OR JELENA
    Domain knowledge cannot be optimized.

    - Make an instruction file with field names and the action to take when handling nulls:
    Constant value, Max(), Min(), Mean(), Nearby value, Regression, Delete record

    - Domain knowledge will allow you to take the impact of your machine learning skills to a much higher level of significance.
    --------------
    -- Random forests, for example, can handle heterogeneous data types right out of the box.


    As a Data Scientist with domain knowledge, you will have the answer to the question "Which data points add value?" You just need to find them.
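
The "instruction file" for null handling mentioned in this note could be represented as a simple mapping from field name to action. The sketch below is an assumption about what such a file might drive, using only the standard library; the rule contents are made up for illustration, and only the constant, max, min and mean actions are covered (nearby value, regression and record deletion are omitted).

```python
from statistics import mean


def fill_nulls(values, action, constant=None):
    """Replace None entries in one column according to the given action."""
    present = [v for v in values if v is not None]
    if action == "constant":
        fill = constant
    elif action == "max":
        fill = max(present)
    elif action == "min":
        fill = min(present)
    elif action == "mean":
        fill = mean(present)
    else:
        raise ValueError(f"unsupported action: {action}")
    return [fill if v is None else v for v in values]


# Hypothetical instruction "file": field name -> how to handle nulls.
NULL_RULES = {
    "TOTAL_PAYMENT_AMT": {"action": "constant", "constant": 0},
    "USAGE_OUT_DUR": {"action": "mean"},
}


def apply_rules(columns, rules):
    """Apply the per-field rules; columns without a rule are returned unchanged."""
    return {
        name: fill_nulls(values, **rules[name]) if name in rules else values
        for name, values in columns.items()
    }
```

Which action fits which field is exactly the kind of decision that depends on domain knowledge: a missing payment amount may legitimately mean zero, while a missing usage duration may not.
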

  • DANIJEL
  • MILOS
    Benefit / suggestion:
    Parallel execution
    No temp data on initial database
    Fast transfer
    Careful about data types specified on table level
  • DANIJEL
  • JELENA UNTIL THE END
    Efficiently automate all regular, manual, and tedious workloads of ML implementations

    "Falls short" on feature engineering.
    Can easily overfit (watch the label distribution, the number of outliers, etc.).
  • Deploying the model as a standalone container is easier
