
PyData Meetup - Feature Store for Hopsworks and ML Pipelines

Tags: Hopsworks, Feature Store, Machine Learning Pipelines, TensorFlow, PySpark



  1. The Feature Store: Data Engineering meets Data Science. PyData 56th Meetup, London, May 7th, 2019. jim_dowling, CEO @ Logical Clocks, Assoc Prof @ KTH
  2. ©2018 Logical Clocks AB. All Rights Reserved. www.logicalclocks.com. Stockholm Office: Box 1263, Isafjordsgatan 22, Kista, Sweden. Silicon Valley Office: 470 Ramona St, Palo Alto, California, USA. UK Office: IDEALondon, 69 Wilson St, London, EC2A 2BB, UK. Team: Dr. Jim Dowling, CEO; Steffen Grohsschmiedt, Head of Cloud; Theofilos Kakantousis, COO; Fabio Buso, Head of Engineering; Prof Seif Haridi, Chief Scientist. Venture Capital Backed (Inventure, Frontline.vc, AI Seed).
  3. ©2018 Logical Clocks AB. All Rights Reserved. [Slide graphic: He, Her, Han, Hon, Hen]
  4. Become a Data Scientist! Eureka! This will give a 12% increase in the efficiency of this wind farm!
  5. Data Scientists are not Data Engineers. HDFS, GCS, Storage, CosmosDB. How do I find features in this sea of data sources? This tastes like dairy in my Latte!
  6. Data Science with the Feature Store. HDFS, GCS, Storage, CosmosDB → Feature Pipelines (Select, Transform, Aggregate, ...) → Feature Store. Now I can change the world, one click-through at a time.
  7. Writing to the Feature Store (Data Engineer):
        from hops import featurestore
        raw_data = spark.read.parquet(filename)
        polynomial_features = raw_data.map(lambda x: x**2)
        featurestore.insert_into_featuregroup(polynomial_features, "polynomial_featuregroup")
     Reading from the Feature Store (Data Scientist):
        from hops import featurestore
        df = featurestore.get_features(["average_attendance", "average_player_age"])
        featurestore.create_training_dataset(df, "players_td")
     Training dataset file formats: tfrecords, numpy, petastorm, hdf5, csv
  8. What is a Feature? A feature may be a column in a Data Warehouse, but more generally it is a measurable property of some phenomenon under observation and (part of) an input to an ML model. Features are often computed from raw or structured data sources:
     • A raw word, a pixel, a sound wave, a sensor value
     • An aggregate (mean, max, sum, min)
     • A window (last_hour, last_day, etc.)
     • A derived representation (an embedding or a cluster)
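     To make the aggregate and window bullets above concrete, here is a minimal PySpark sketch. It is not from the talk: the table and column names are made up, and spark is assumed to be an existing SparkSession, as in the slide 7 snippet.

        from pyspark.sql import functions as F
        from pyspark.sql.window import Window

        # Raw events: one row per match appearance (hypothetical schema with
        # columns player_id, age, attendance, event_ts).
        events = spark.read.parquet("hdfs:///raw/player_appearances")

        # Aggregate features: per-player mean age and mean attendance.
        agg_features = (events.groupBy("player_id")
                              .agg(F.avg("age").alias("average_player_age"),
                                   F.avg("attendance").alias("average_attendance")))

        # Window feature: total attendance over the last 7 days per player.
        w = (Window.partitionBy("player_id")
                   .orderBy(F.col("event_ts").cast("long"))
                   .rangeBetween(-7 * 86400, 0))
        windowed = events.withColumn("attendance_last_7d", F.sum("attendance").over(w))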
  9. Duplicated Feature Engineering: Marketing, Research, and Analytics each compute their own Bert Features. DUPLICATED.
  10. Prevent Inconsistent Features: Training vs. Serving. Feature implementations may not be consistent between training and serving, which causes correctness problems!
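     A common way to avoid this kind of skew (a general pattern, not specific to Hopsworks) is to define the feature transformation exactly once and import it from both the training pipeline and the serving code. A minimal Python sketch with a made-up click_rate feature:

        # features.py: single source of truth for a feature transformation (illustrative only).
        def click_rate(clicks: int, impressions: int) -> float:
            # Click-through rate, guarded against division by zero.
            return clicks / impressions if impressions else 0.0

        # Both the training pipeline and the online serving code import the same
        # function, so the feature is computed identically in both places.
        training_row = {"clicks": 3, "impressions": 120}
        serving_request = {"clicks": 1, "impressions": 40}

        training_feature = click_rate(training_row["clicks"], training_row["impressions"])
        serving_feature = click_rate(serving_request["clicks"], serving_request["impressions"])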
  11. Features as first-class entities
     • Features should be discoverable and reused.
     • Features should be access controlled, versioned, and governed, to enable reproducibility.
     • Ability to pre-compute and automatically backfill features (aggregates, embeddings) to avoid expensive re-computation; on-demand computation of features should also be possible.
     • The Feature Store should help “solve the data problem, so that Data Scientists don’t have to.” [Uber]
  12. Data Engineering meets Data Science. The Data Engineer adds/removes features in the Feature Store; the Data Scientist browses and selects features to create train/test data.
  13. An ML Pipeline with the Feature Store: Structured & Raw Data → Register Feature and its Job/Data in the Feature Store → Select Features and generate Train/Test Data → Train Model → Validate Models, Deploy → Serve Model (with Online Features).
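     A rough sketch of the Data Scientist's side of this pipeline, reusing the featurestore calls from slide 7. The Keras model is a generic stand-in (not from the talk) and the label column is made up:

        from hops import featurestore
        import tensorflow as tf

        # Select features from the Feature Store (calls as on slide 7); "label" is a
        # hypothetical target column added for illustration.
        df = featurestore.get_features(["average_attendance", "average_player_age", "label"])
        featurestore.create_training_dataset(df, "players_td")

        # Train a toy model on the selected features (generic Keras, not the talk's code).
        pdf = df.toPandas()
        X = pdf[["average_attendance", "average_player_age"]].values
        y = pdf["label"].values

        model = tf.keras.Sequential([
            tf.keras.layers.Dense(8, activation="relu", input_shape=(2,)),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")
        model.fit(X, y, epochs=5)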
  14. Offline (Batch/Streaming) Feature Store. Data Lake → Offline Feature Store → Training Job → Batch or Streaming Inference. Training path: 1. Register Feature Engineering Job, copy Feature Data; 2. Create Training Data and Train; 3. Save Model. Inference path: a. Get Feature Engineering Job, Model, Conda Environment; b. Run Job.
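     For step (b), a generic batch-inference sketch; this is not the Hopsworks Jobs API, and the model path and feature table are made up:

        import tensorflow as tf

        # Load the model saved in step 3 (hypothetical path).
        model = tf.keras.models.load_model("Models/players_model")

        # Read offline feature data and score it in batch (hypothetical feature table).
        features = spark.read.parquet("hdfs:///featurestore/players_features")
        pdf = features.select("average_attendance", "average_player_age").toPandas()
        predictions = model.predict(pdf.values)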
  15. Online Feature Store. Data Lake → Online Feature Store → Train → Real-Time Serving → Online Apps. Training path: 1. Engineer Features; 2. Create Training Data; 3. Train Model; 4. Deploy Model. Serving path: a. Request Prediction; b. Get Online Features; c. Response.
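     A rough sketch of the serving path (a-c). The online store below is just an in-memory dict standing in for a low-latency database, and get_online_features is a hypothetical helper, not the Hopsworks API:

        # Hypothetical online store: in production this would be a low-latency
        # key-value or SQL store keyed by the entity id.
        online_store = {"player_42": {"average_attendance": 31250.0, "average_player_age": 26.4}}

        def get_online_features(player_id: str) -> list:
            # b. Get Online Features: look up the precomputed feature values.
            row = online_store[player_id]
            return [row["average_attendance"], row["average_player_age"]]

        def handle_prediction_request(player_id: str) -> float:
            # a. Request Prediction arrives with only the entity id.
            feature_vector = get_online_features(player_id)
            # c. Response: the model deployed in step 4 would score the vector here;
            #    a simple normalized average stands in for the prediction.
            return 0.5 * feature_vector[0] / 40000.0 + 0.5 * feature_vector[1] / 40.0

        print(handle_prediction_request("player_42"))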
  16. Known Feature Stores in Production
     • Logical Clocks – Hopsworks (world's first open source)
     • Uber – Michelangelo
     • Airbnb – Bighead/Zipline
     • Comcast
     • Twitter
     • GO-JEK – Feast (GCE)
     • Branch
  17. A Feature Store for Hopsworks
  18. Hopsworks – Batch, Streaming, Deep Learning (On-Premise, AWS, Azure, GCE). [Architecture diagram: Data Sources, HopsFS, Kafka, Airflow, Spark / Flink, Spark, Feature Store, Hive, Deep Learning, Elastic, Notebooks, Serving w/ Kubernetes, BI Tools & Reporting; components are marked as either External Service or Hopsworks Service.]
  19. Hopsworks – Batch, Streaming, Deep Learning. [Same architecture diagram as the previous slide, grouped into Batch Analytics, Streaming, and ML & Deep Learning workloads.]
  20. ML Infrastructure in Hopsworks: Model Training, Feature Store, HopsML API & Airflow. [Diagram adapted from “technical debt of machine learning”.]
  21. Distributed Deep Learning in Hopsworks. [Diagram: a Driver and Executors 1..N, each with its own conda_env, backed by HopsFS (HDFS) holding TensorBoard data, Models, Experiments, Training Data, and Logs.]
  22. Hyperparameter Optimization (https://github.com/logicalclocks/hops-examples)
        # RUNS ON THE EXECUTORS
        def train(lr, dropout):
            def input_fn():  # return dataset
                ...
            optimizer = ...
            model = ...
            model.add(Conv2D(...))
            model.compile(...)
            model.fit(...)
            model.evaluate(...)

        # RUNS ON THE DRIVER
        hparams = {'lr': [0.001, 0.0001], 'dropout': [0.25, 0.5, 0.75]}
        experiment.grid_search(train, hparams)
  23. Distributed Training (https://github.com/logicalclocks/hops-examples)
        # RUNS ON THE EXECUTORS
        def train():
            def input_fn():  # return dataset
                ...
            model = ...
            optimizer = ...
            model.compile(...)
            rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
            keras_estimator = tf.keras.estimator.model_to_estimator(...)
            tf.estimator.train_and_evaluate(keras_estimator, input_fn)

        # RUNS ON THE DRIVER
        experiment.collective_all_reduce(train)
  24. Hopsworks’ Feature Store Concepts
  25. Online Model Serving and Monitoring. Link predictions with outcomes to measure model performance. [Diagram: an inference request enters Hopsworks; 1. Access Control, 2. Build Feature Vector (Feature Store), 3. Make Prediction (Model Server on Kubernetes), 4. Log Prediction to the Data Lake for monitoring; the response is returned to the caller.]
  26. HopsML Feature Store Pipelines
  27. [Pipeline diagram: Raw Data and Event Data → Ingest → Feature Engineering → Feature Store → Experiment/Train → Deploy → Serving → Monitor, orchestrated by Airflow and backed by HopsFS; serving logs feed back into monitoring.]
  28. ML Pipelines of Jupyter Notebooks, orchestrated with Airflow. Feature Backfill Pipeline: Feature Engineering → Feature Store. Training and Deployment Pipeline: Select Features, File Format → Experiment, Train Model → Validate & Deploy Model. Together these form the End-to-End ML Pipeline.
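     A minimal Airflow sketch of the training-and-deployment pipeline above, using generic operators rather than the Hopsworks-specific ones; the DAG name and the scripts each task runs are made up:

        from datetime import datetime
        from airflow import DAG
        from airflow.operators.bash_operator import BashOperator

        # Each task would run one converted notebook as a job.
        with DAG("training_and_deployment_pipeline",
                 start_date=datetime(2019, 5, 1),
                 schedule_interval="@daily",
                 catchup=False) as dag:

            select_features = BashOperator(
                task_id="select_features",
                bash_command="python select_features_and_create_training_data.py")

            train_model = BashOperator(
                task_id="train_model",
                bash_command="python experiment_train_model.py")

            validate_and_deploy = BashOperator(
                task_id="validate_and_deploy",
                bash_command="python validate_and_deploy_model.py")

            select_features >> train_model >> validate_and_deploy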
  29. Hopsworks Feature Store as a Service. [Diagram: Feature Engineering (AWS EMR, Azure HDInsight, Cloudera) reads from a Data Lake and writes to the Hopsworks Feature Store (df.save(topic), df.read(s3://)); training platforms (SageMaker, Azure ML, Google AI) select features and generate training data (s3://..tfrecords); Real-Time Serving reads online features; Batch Serving reads from the Data Lake.]
  30. Feature Store Demo
  31. Summary and Roadmap
     • Hopsworks is a new Data Platform with first-class support for Python / Deep Learning / ML / Data Governance / GPUs
        - Hopsworks has an open-source Feature Store
     • Ongoing Work
        - Data Provenance
        - Feature Store Incremental Updates with Hudi on Hive
  32. Try it Out! Register for an account at: www.hops.site. @logicalclocks – www.logicalclocks.com. ©2018 Logical Clocks AB. All Rights Reserved.
  33. ML Pipelines of Jupyter Notebooks. [Diagram: Interactive use runs .ipynb notebooks through the PySpark kernel and Livy Server on HopsYARN/HopsFS, materializing certs and ENV variables, with notebook contents, logs and results stored in HDFS. For jobs, convert .ipynb to .py, run .py or .jar via the Jobs Service, and schedule it using the REST API or UI, again materializing certs and ENV variables. Old notebooks, experiments and visualizations can be viewed via Experiments & TensorBoard.]
