The main goal of the session is to showcase approaches that greatly simplify the work of a data analyst when performing data analytics, or when employing machine learning algorithms, over Big Data. The session will include presentations on
(a) How data analytics workflows can be easily and graphically composed, and then optimized for execution,
(b) How raw data with great variety can be easily queried using SQL interfaces, and
(c) How complex machine learning operations can be performed efficiently in distributed settings.
After these presentations, the speakers will join a discussion with the audience on further tools that could simplify the work of a data analyst.
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with Earth Observation data
1. BDVe Webinar Series SIMPLIFY - Data Analytics and Machine Learning Made Simple
11th of June 2020, 10:00 CEST
Theofilos Kakantousis, COO / Co-Founder
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with Earth Observation data
2. ExtremeEarth
From Copernicus Big Data to Extreme Earth Analytics
Copernicus data is a paradigmatic case of big data:
• Volume (>191×10³ users, >11×10⁶ products, >10 PB of data in the Copernicus Open Access Hub)
• Velocity (as of 2017, 10 TB of data were generated and 93 TB were disseminated every day)
• Variety (many kinds of satellite images, many kinds of collateral data)
• Veracity (quality is important)
• Value (€13.5 billion and 28,030 job-years are projected for 2008–2020)
https://www.slideshare.net/BDVA/extreme-earth-overview
3. ExtremeEarth
Go beyond the current state of the art: develop Extreme Earth analytics techniques and technologies that scale to the petabytes of Copernicus data, information, and knowledge.
http://earthanalytics.eu/partners.html
4. ExtremeEarth
Need for a platform that enables data scientists to develop scalable Artificial Intelligence applications.
Hopsworks: a data-intensive AI platform for Deep Learning with Earth Observation data
5. Management Team & Offices
Offices:
• Stockholm – Box 1263, Isafjordsgatan 22, Kista, Sweden
• London – IDEALondon, 69 Wilson St, London, UK
• Silicon Valley – 470 Ramona St, Palo Alto, California, USA
Team:
• Dr. Jim Dowling, CEO
• Theo Kakantousis, COO
• Prof. Seif Haridi, Chief Scientist
• Fabio Buso, VP Engineering
• Steffen Grohsschmiedt, Head of Cloud
www.logicalclocks.com
9. Hopsworks
[Architecture diagram: data sources flow into Hopsworks; applications, an API, and dashboards sit on top.]
• Data Preparation & Ingestion: Apache Beam, Apache Spark, Pip, Conda
• Experimentation & Model Training: TensorFlow, scikit-learn, PyTorch, Jupyter Notebooks, TensorBoard
• Batch & Distributed ML/DL: Apache Beam, Apache Spark, Apache Flink, Kubernetes
• Deploy & Productionalize: Model Serving, Model Monitoring, Kafka + Spark Streaming
• Streaming, with the Hopsworks Feature Store shared across stages
• Orchestration in Airflow
• Filesystem and metadata storage: HopsFS, Apache Kafka
10. ML Pipelines in Hopsworks
Pipeline stages: Feature Engineering → Feature Selection → Training → Serving → Prediction
• Feature Engineering ingests from the Data Lake, Kafka, and the Data Warehouse into the Offline and Online Feature Store.
• Feature Selection materializes feature vectors as Train/Test Data (S3, HDFS, etc.).
• Models are trained and validated in Experiments and stored in the Model Repository.
• Deployment pushes models to Model Serving for Online Applications and Batch Prediction, with Monitoring in production.
11. ML Pipeline Technologies in Hopsworks
Raw data and event data flow from the Data Lake through: DATA PIPELINES → FEATURE STORE → TRAIN + VALIDATE → MODEL SERVING, with MONITOR spanning all stages.
12. The Hidden Cost of Machine Learning
80% of the effort in ML projects is feature engineering, considerably slowing down the production and release of models. Hopsworks supports the reusability, discoverability, and management of features, allowing ML projects to be released faster.
13. Hopsworks – Manage the Complexity of Machine Learning
The core flow Data → Model → Prediction (φ(x)) is surrounded by many supporting components:
• Data Collection • Feature Engineering • Data Validation • Hardware Management • Distributed Training • Hyperparameter Tuning • Pipeline Management • Model Serving • A/B Testing • Monitoring
[Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems”]
14. Hopsworks – Manage the Complexity of Machine Learning
The same supporting components, with the Hopsworks Feature Store at the center of the Data → Model → Prediction (φ(x)) flow.
[Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems”]
15. Hopsworks – Manage the Complexity of Machine Learning
The same components again, now accessed through the Hopsworks REST API in front of the Hopsworks Feature Store.
[Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems”]
16. Feature Store – API between Data Engineering and Data Science
Data science works with numbers, arrays (of numbers), and one-hot encodings; data engineering works with databases and schemas (varchar, charsets, integer, blob, varbinary).
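To make the gap between the two worlds concrete, here is a minimal sketch (hypothetical names, not Hopsworks' actual API) of turning a database-style record with a categorical text column into the numeric array a model consumes, via one-hot encoding:

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot array over a fixed vocabulary."""
    vec = [0.0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1.0
    return vec

def record_to_features(record, vocab):
    """Flatten a raw record (integer + varchar columns) into one numeric array."""
    return [float(record["age"])] + one_hot(record["country"], vocab)

vocab = ["SE", "UK", "US"]
row = {"age": 42, "country": "UK"}   # a row as it might come from a database
features = record_to_features(row, vocab)
print(features)  # [42.0, 0.0, 1.0, 0.0]
```

A feature store performs and records transformations like this once, so every model sees the same numeric representation.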
17. The Feature Store
A data management platform for machine learning, and the interface (API) between data engineering and data science.
Models are trained using sets of features. The features are fetched from the feature store and can be reused between models. Datasets are made of sets of features drawn from sources that do not communicate with each other: Postgres, Mongo, Redshift, S3, Delta Lake.
[Figure: feature sets from multiple data stores feeding a trained decision-tree model.]
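The reuse idea above can be sketched with a minimal in-memory feature store (illustrative only; this is not Hopsworks' API): features are registered once and different models select different subsets of them.

```python
class FeatureStore:
    """Toy feature store: named feature columns, joined on demand."""

    def __init__(self):
        self._features = {}          # feature name -> list of values

    def register(self, name, values):
        self._features[name] = values

    def get_training_data(self, feature_names):
        """Join the requested feature columns into row-wise training vectors."""
        columns = [self._features[n] for n in feature_names]
        return [list(row) for row in zip(*columns)]

fs = FeatureStore()
fs.register("avg_spend", [10.0, 25.0])
fs.register("n_visits", [3, 7])
fs.register("churned", [0, 1])

# Two models reuse the same engineered features with different selections:
churn_data = fs.get_training_data(["avg_spend", "n_visits", "churned"])
spend_data = fs.get_training_data(["n_visits", "avg_spend"])
print(churn_data)  # [[10.0, 3, 0], [25.0, 7, 1]]
```

The point of the sketch is the single `register` step: feature engineering happens once, and every downstream model fetches from the same place.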
18. ML Pipelines in Hopsworks
(Repeats the ML pipeline diagram of slide 10: Feature Engineering → Feature Selection → Training → Serving → Prediction.)
22. Challenges
• How do I maintain up to 4 different code bases for training models? DRY training code, please!
• Which Python execution environment does my cluster need: Dask, PySpark, Distributed TensorFlow, etc.?
• Can I have a single execution framework to run all 4 phases (Kubernetes, python, spark-submit, Jupyter notebooks)?
Hopsworks solution:
• Make the training loop oblivious to the given distribution context
• Maggy: a framework supporting these distribution contexts on top of PySpark
https://github.com/logicalclocks/maggy
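The "oblivious training function" idea can be sketched as follows (illustrative only; this is not Maggy's real API): the training function is written once and knows nothing about how it is launched, while a pluggable launcher supplies the execution context and hyperparameters.

```python
def train(lr, epochs):
    """A distribution-oblivious training loop: a pure function of its config."""
    loss = 100.0
    for _ in range(epochs):
        loss *= (1.0 - lr)          # stand-in for one optimization step
    return loss

def run_local(train_fn, configs):
    """Single-process launcher. A PySpark-based launcher could replace this
    without touching train_fn -- the DRY property the slide asks for."""
    return [train_fn(**cfg) for cfg in configs]

results = run_local(train, [{"lr": 0.1, "epochs": 10},
                            {"lr": 0.5, "epochs": 10}])
best = min(results)   # pick the best hyperparameter configuration
```

Swapping `run_local` for a distributed launcher changes where the trials run, not how `train` is written: that is the separation Maggy aims for.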
23. ML Pipelines in Hopsworks
• Horizontally scalable
• Aggregate results
• Progress monitor
• Fault-tolerant
26. Project-Based Multi-Tenancy
• A Project is a grouping of Users and Datasets
  – HopsFS datasets, Feature Store, Kafka topics, Hive DBs
• project = u ∪ d, where u ⊆ Users and d ⊆ Data
[Diagram: Proj-All, Proj-X, and Proj-42 sharing assets such as a shared Kafka topic, a Feature Store, /Projs/My/Data, and CompanyDB.]
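The set notation on the slide translates directly into code. A minimal sketch (hypothetical, not Hopsworks internals) of projects as unions of user and dataset sets, with access restricted to a project's own members and assets:

```python
Users = {"alice", "bob", "carol"}
Data = {"hopsfs:/Projs/My/Data", "kafka:shared-topic", "hive:CompanyDB"}

def make_project(u, d):
    """project = u ∪ d, with u ⊆ Users and d ⊆ Data."""
    assert u <= Users and d <= Data   # subset checks
    return u | d

proj_x = make_project({"alice"}, {"kafka:shared-topic"})
proj_42 = make_project({"bob"}, {"kafka:shared-topic", "hive:CompanyDB"})

def can_access(project, principal, asset):
    """Multi-tenancy rule: both principal and asset must be in the project."""
    return principal in project and asset in project

print(can_access(proj_x, "alice", "kafka:shared-topic"))   # True
print(can_access(proj_x, "bob", "kafka:shared-topic"))     # False
```

Note that the same asset (`kafka:shared-topic`) appears in two projects, mirroring the shared-asset arrows in the diagram.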
27. Provenance
• Track the lineage of ML artifacts
• Navigate from a model to its training/validation dataset, its experiment, and its features
• Instrument existing pipelines instead of rewriting them (as TFX and MLflow require) – enabled by a CDC (Change Data Capture) API
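The model-to-features navigation can be sketched as a walk over a lineage graph (an illustrative data model with hypothetical artifact names, not Hopsworks' provenance API):

```python
# artifact -> artifacts it was derived from
lineage = {
    "model:churn-v3": ["experiment:exp-17"],
    "experiment:exp-17": ["dataset:train-2020-06"],
    "dataset:train-2020-06": ["feature:avg_spend", "feature:n_visits"],
}

def ancestors(artifact):
    """All upstream artifacts a given artifact depends on, depth-first."""
    out = []
    for parent in lineage.get(artifact, []):
        out.append(parent)
        out.extend(ancestors(parent))
    return out

print(ancestors("model:churn-v3"))
# ['experiment:exp-17', 'dataset:train-2020-06',
#  'feature:avg_spend', 'feature:n_visits']
```

With CDC-style instrumentation, entries like these are captured automatically as pipelines run, rather than being declared by hand.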
28. Hopsworks
Elasticity & Performance:
• Feature Store – data warehouse for ML
• Distributed Deep Learning – faster with more GPUs
• HopsFS – NVMe speed with Big Data
• Horizontally scalable ingestion, DataPrep, training, and serving
Development & Operations:
• Notebooks for development, first-class Python support
• Version everything: code, infrastructure, data
• Model serving on Kubernetes: TF Serving, MLeap, SkLearn
• End-to-end ML pipelines orchestrated by Airflow
Governance & Compliance:
• Secure multi-tenancy – project-based restricted access
• Encryption at rest and in motion – TLS/SSL everywhere
• AI-asset governance – models, experiments, data, GPUs
• Data/model/feature lineage – discover/track dependencies