ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with Earth Observation data

BDVe Webinar Series SIMPLIFY - Data Analytics and Machine Learning Made Simple
11th of June 2020, 10:00 CEST
Theofilos Kakantousis
COO / Co-Founder
ExtremeEarth: Hopsworks, a data-intensive
AI platform for Deep Learning with Earth
Observation data

ExtremeEarth
From Copernicus Big Data to Extreme Earth Analytics
Copernicus data is a paradigmatic case of big data:
• Volume (>191*10 3 users, >11*10 6 products, >106 PB of data in
the Copernicus Open Access Hub)
• Velocity (as of 2017, 10TB of data were generated and 93TB
were disseminated every day)
• Variety (many kinds of satellite images, many kinds of collateral
data)
• Veracity (quality is important)
• Value (13.5 billion and 28.030 job years are projected for 2008-
2020)
https://www.slideshare.net/BDVA/extreme-earth-overview

ExtremeEarth
Go beyond current state of the art.
Develop extreme earth analytics techniques and technologies that scale to the PBs of big Copernicus
data information and knowledge.
http://earthanalytics.eu/partners.html

ExtremeEarth
Need for a platform that enables data scientists to develop scalable Artificial Intelligence applications.
Hopsworks a data-intensive AI platform for Deep Learning with Earth Observation data

Management Team & Offices
Stockholm
Box 1263,
Isafjordsgatan 22
Kista,
Sweden
London
IDEALondon,
69 Wilson St,
London,,
UK
Silicon Valley
470 Ramona St
Palo Alto
California,
USA
Dr. Jim Dowling
CEO
Theo Kakantousis
COO
Prof. Seif Haridi
Chief Scientist
Fabio Buso
VP Engineering
Steffen Grohsschmiedt
Head Of Cloud
www.logicalclocks.com

Hopsworks - Award Winning Platform

Datasources
Applications
API
Dashboards
Hopsworks
Apache Beam
Apache Spark Pip
Conda
Tensorflow
scikit-learn
PyTorch
Jupyter
Notebooks
Tensorboard
Apache Beam
Apache Spark
Apache Flink
Kubernetes
Batch Distributed
ML & DL
Model
Serving
Hopsworks
Feature Store
Kafka +
Spark
Streaming
Model
Monitoring
Orchestration in Airflow
Data Preparation
& Ingestion
Experimentation
& Model Training
Deploy
& Productionalize
Streaming
Filesystem and Metadata storage
HopsFS
Apache
Kafka
Datasources

ML Pipelines in Hopsworks
ServingTrainingFeature SelectionFeature Engineering Prediction
Batch
Prediction
Kafka
Data
Lake
Online
Feature Store
Offline
Feature Store
Data
Warehouse
Online
Application
Model
Serving
Model
Repository
Train
Train/Test Data
(S3, HDFS, etc)
Deploy
Experiments
Monitor
Feature Vectors
Train
Model
Validate
Model
Select
Features
Feature
Engineering

ML Pipelines Technologies in Hopsworks
Event DataRaw Data
DATA PIPELINES FEATURE STORE TRAIN+VALIDATE MODEL SERVING
MONITOR
Data Lake

THE HIDDEN
COST OF
MACHINE
LEARNING
80% of the ML projects effort is
feature engineering, considerably
slowing down the production and
release of models.
Hopsworks supports the reusability, discoverability and
management of features, allowing ML projects to be
released faster.

Hopsworks – Manage the Complexity of Machine Learning
Data validation
Distributed
Training
Model
Serving
A/B
Testing
Monitoring
Pipeline
Management
HyperParameter
Tuning
Feature Engineering
Data
Collection
Hardware
Management
Data Model Prediction
φ(x)
[Adapted from Schulley et Al “Technical Debt of ML” ]

Data validation
Distributed
Training
Model
Serving
A/B
Testing
Monitoring
Pipeline
Management
HyperParameter
Tuning
Feature Engineering
Data
Collection
Hardware
Management
φ(x)
Hopsworks
Feature Store

Data validation
Distributed
Training
Model
Serving
A/B
Testing
Monitoring
Pipeline
Management
HyperParameter
Tuning
Feature Engineering
Data
Collection
Hardware
Management
φ(x)
Hopsworks
REST API
Hopsworks
Feature Store

numbers
(in arrays)
Feature Store – API between Data Engineering and Data
Science
numbers
arrays
(of numbers)
one-hot
encoding
Databases
Schemas
varchar, charsets
integer, blob,
varbinary

Postgres Dataset 2 Dataset N
Models are trained using sets
of features. The features are
fetched from the feature store and
can be reused between models. 40 60 80 100
160
180
200
A
BB
( -8, -8 )(0, -10 )(-10, 0 )(-1, -1 )
< 11.2< 0.9< 0.2≥ 0.9
≥ 0.2 ≥ 11.2
The Feature Store
Data management platform
for machine learning.
Interface (API) between
data engineering and
data science
Datasets are made
of sets of features which do not
communicate with each other.
Postgres - Mongo - Redshift - S3 - Delta Lake

Experiments service
• Reproducible experiments
• Experiment metadata and
artifacts
• Provenance

Distributed Deep Learning
https://www.slideshare.net/dowlingjim/invited-lecture-on-gpus-and-distributed-deep-learning-at-uppsala-university

Challenges
• How do I maintain up to 4 different code bases for training models?
DRY training code, please!
• What is Python / Cluster?
Dask, PySpark, Distributed TensorFlow, etc?
• Can I have a single execution framework to run all these 4 phases?
Kubernetes, python, spark-submit, Jupyter notebook
Hopsworks solution
• Make the training loop oblivious to the given distribution context
• Maggy: a framework to support the distribution contexts based on PySpark
https://github.com/logicalclocks/maggy

ML Pipelines in Hopsworks
• Horizontally scalable
• Aggregate results
• Progress monitor
• Fault-tolerant

Project-Based Multi-Tenancy
Proj-All
Proj-X
• A Project is a Grouping of Users and Datasets
– HopsFS datasets, FeatureStore, Kafka Topics, HiveDB
𝑝𝑟𝑜𝑗𝑒𝑐𝑡 = 𝑢 ∪ 𝑑
𝑢 ⊆ 𝑈𝑠𝑒𝑟𝑠
𝑑 ⊆ 𝐷𝑎𝑡𝑎
Proj-42
Shared TopicFeatureStore /Projs/My/Data
CompanyDB

Provenance
• Track lineage of ML artifacts
• Navigate from a model – training/validation dataset – experiment- features
• Instrument instead of rewrite (TFX, MLFlow) – enabled by a CDC API

Feature Store
Data warehouse for ML
FS
Distributed Deep Learning
Faster with more GPUs
HopsFS
NVMe speed with Big Data
Horizontally Scalable
Ingestion, DataPrep,
Training, Serving
Elasticity & Performance
Notebooks for Development
First-class Python Support
Version Everything
Code, Infrastructure, Data
Model Serving on Kubernetes
TF Serving, MLeap, SkLearn
End-to-End ML Pipelines
Orchestrated by Airflow
Development & Operations
Secure Multi-Tenancy
Project-based restricted access
Encryption At-Rest, In-Motion
TLS/SSL everywhere
AI-Asset Governance
Models, experiments, data, GPUs
Data/Model/Feature Lineage
Discover/track dependencies
Governance & Compliance
Hopsworks

ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with Earth Observation data

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with Earth Observation data

Ähnlich wie ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with Earth Observation data (20)

Mehr von Big Data Value Association

Mehr von Big Data Value Association (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with Earth Observation data