In this webinar, Francisco Brasileiro and Ignacio Blanquer discuss the trustworthiness requirements of big-data applications deployed atop cloud infrastructures, and how the ATMOSPHERE platform can be used to handle them. This is explained using as an example a medical application developed in the context of the ATMOSPHERE project and deployed over a transatlantic federated cloud infrastructure.
Managing Trustworthy Big-data Applications in the Cloud with the ATMOSPHERE Platform
1. Co-funded by the European Commission
Horizon 2020 - Grant #777154
Managing Trustworthy Big-Data Applications in the Cloud with the ATMOSPHERE Platform
Ignacio Blanquer
ATMOSPHERE EU Project coordinator
Francisco Brasileiro
ATMOSPHERE Brazil Project coordinator
2. • ATMOSPHERE is a 24-month H2020
project aiming at the design and
development of a framework and a
platform to implement trustworthy
cloud services on a federated
intercontinental cloud.
• Expected Results
• A federated cloud platform.
• A development framework
• Trustworthy evaluation and monitoring
• Trustworthy Distributed Data Management
• Trustworthy Distributed Data Processing
• A pilot use case on Medical Imaging
Processing.
The Project
[Architecture diagram: the Application sits atop Trustworthy Data Processing Services (TDPS), Trustworthy Data Management Services (TDMS) and Infrastructure Management Services (IMS), all running on the Federated Infrastructure, with Trustworthiness Monitoring & Assessment (TMA) as a crosscutting layer.]
3. The problem
I do not want to deal with infrastructure, resource
management, job scheduling, secure access and
similar burdens. Moreover, I want to guarantee that
no sensitive data is exposed outside the country
where it was produced.
I need to build an image processing tool that
uses sensitive data and has high computing
demands. Once developed, I want to exploit it
securely as a service, with guaranteed Quality of Service.
5. • PROVAR study – the first large-scale RHD (rheumatic heart disease) screening program in Brazil.
• RHD Screening: public schools, private schools and primary health
units in the cities of Belo Horizonte,
Montes Claros and Bocaiúva,
Minas Gerais, Brazil.
The Data
6. • The characterization of echocardiographic
images obtained in public schools
• 5,600 exams, with an average of 14
videos per exam (a total of 75,836 videos)
• 5,330 exams are classified as normal (with a
total of 71,686 videos) - 95%
• 238 exams are classified as borderline RHD
(with a total of 3,649 videos) - 4%.
• 32 exams are classified as definite RHD (with a
total of 501 videos) - 1%.
• Additionally, there is another databank with 3.5
million electrocardiograms from the same
population area and age range.
Image Biobank Requirements
Mean age: 13 ± 3 y.o.
Female sex: 55%.
7. • Sensitive data must not be accessible outside the boundaries of
the hosting country
• Sensitive data is protected by the Brazilian LGPD and must be processed under strong
access-protection measures, robust even on a potentially vulnerable cloud offering.
• Anonymised data, though, can be released, but should remain accessible only in a
secured environment.
• Medical imaging processing and machine learning model
building require intensive computing resources
• Sufficient processing capacity may not be available within the boundaries where the
data is located, so the processing algorithms must run elsewhere.
• Access should be coherent and secure, and image processing should be efficient.
• Experiments should be reproducible and stable
• Model building, image processing and classification should run in well-defined
environments that can be reproduced for further analysis.
Image Biobank Requirements
8. • Trust is a choice that is based on past experience. Trust takes time to
build, but it can disappear in a second.
• Trusting cloud services is as complicated as trusting people. You need a
way to measure it and pieces of evidence to build trust.
• Trust in a cloud environment is the reliance of a customer on a cloud
service and, consequently, on its provider.
• Trust is based on a broad spectrum of properties, such as
Security, Privacy, Coherence, Isolation, Stability,
Fairness, Transparency and Dependability.
• To date, few approaches deal with the
quantification of trust in cloud computing.
What is trust?
9. • Along with these requirements, we also explore:
• Measurement of the Fairness of
the models to evaluate the bias
of the model with respect to
sensitive categories, such as
gender or race.
• Evaluation of the Explainability
of the model.
• Evaluation of the privacy loss
risk to determine the quality of
the anonymisation and the
potential leakage of personal
data inside the models.
Trust in Health Data Processing
... successfully reidentified the demographic data of
4478 adults (94.9%) & 2120 children (87.4%) …
(P < .001)
10.
The Previous situation
Application Developers
- Who develop the tools for
processing the data.
- They require the
infrastructure to provide
some types of services and
resources, such as
computing, secure storage,
high-availability, data
persistence.
- They will deliver the
applications to others
to operate.
Application Manager
- An Application Developer may
not be in charge of deploying
the application on the
production infrastructure.
- The deployment implies the
monitoring and management
of the resources, services,
user accounts and data.
- The Application Manager will
have access credentials to the
infrastructure and will decide
the optimal allocation of
resources.
End-Users
- Data providers and Data
scientists exploring and
processing data.
- Need for secure data
transfer and data access
tracing, as well as
simplified processing
tools.
- No need to worry about
acquiring ICT skills.
12.
One platform, multiple dimensions
• The platform can be described considering
different conceptual dimensions
• Users and their roles
• Service delivery models
• Service classes
• Application life cycle
13.
Users and their roles
[Diagram: the ATMOSPHERE Platform delivers Trustworthy Applications & Services on top of a Federated Infrastructure composed of multiple Resource Providers; the user roles interacting with the platform are the application developer, data scientist, application manager, system administrator and data owner.]
21. ● Lemonade* is a web-based system for
designing and running analytics
applications.
● Users, who are not necessarily
programmers, describe applications as
workflows; Lemonade generates code and
controls their execution.
● Workflows consist of operations (boxes) and
data flows (arrows) among them,
performing:
⁃ Data preparation and engineering
⁃ Machine learning methods (Spark MLlib)
⁃ Visualization metaphors
LEMONADE
22.
Supported Trustworthiness properties
For each property, what the platform offers Developers and what it offers Data Scientists:
• Stability. Developers: stability strategies (e.g., cross-validation). Data Scientists: quality assurance of the model outcome (e.g., calibrate cross-validation and evaluate accuracy variance).
• Privacy. Developers: privacy-preserving algorithms and techniques (e.g., k-anonymity). Data Scientists: assess the impact of preserving privacy on the outcome's utility and effectiveness.
• Transparency. Developers: transparency methods to be combined with different data analytic flows (e.g., LIME/SHAP methods). Data Scientists: execute ML models and, based on the explanations, calibrate the model or enhance the input.
• Fairness. Developers: fairness-enhancing mechanisms and strategies (e.g., the Aequitas toolkit). Data Scientists: generate reports to evaluate fairness and decide on the features to include in models.
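The calibrated cross-validation mentioned under Stability can be sketched in plain Python. This is an illustration, not ATMOSPHERE code: the fold splitter, the toy majority-class model and the 95/5 class split (mirroring the normal vs. RHD imbalance in the data) are all assumptions made for the example.

```python
import random
import statistics

def k_fold_indices(n, k, seed=42):
    """Shuffle indices and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_accuracy_variance(X, y, fit, predict, k=5):
    """Return mean and variance of per-fold accuracies for a fit/predict pair."""
    folds = k_fold_indices(len(X), k)
    accs = []
    for fold in folds:
        test = set(fold)
        train = [i for i in range(len(X)) if i not in test]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in fold)
        accs.append(correct / len(fold))
    return statistics.mean(accs), statistics.variance(accs)

# Toy model: always predict the majority class seen in training.
fit = lambda X, y: max(set(y), key=y.count)
predict = lambda model, x: model

X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5          # mirrors the 95% "normal" class imbalance
mean_acc, var_acc = cv_accuracy_variance(X, y, fit, predict)
```

A low accuracy variance across folds is the quantitative signal of stability; a high variance tells the data scientist the model outcome cannot yet be trusted.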
23. • PAF helps organizations that own and
process datasets understand how data
processing can affect their conformance
with privacy regulations (GDPR and LGPD)
• These assessments may be used to
generate appropriate security/privacy
policies consumed by other services (e.g.,
LEMONADE)
23
Privacy assessment forms (PAF)
25. • Typical best practices
• Data in transit and at rest can be encrypted
• Some processing can even be done over encrypted data
• Keys and certificates not included in repositories
• But this is not enough...
• If attacker has access to the machine (VM escapes, internal
attacker, cold boots), code can be changed, memory can be
dumped
• Keys or data can be stolen 25
Data access challenges
26.
ATMOSPHERE approach for data
access security and privacy
• Use trusted execution environments (TEE) to protect data access
• Advantages
• Raw data is preserved: no noise or anonymisation is applied
before storage, so the value of the original data is retained
• Proxies used for filtering queries and results to guarantee
protection of sensitive data
• Data is encrypted not only in transit and at rest, but also
during processing
• Enforcement of which applications can access data
• Vallum: the TEE-enabled Access and Privacy Protection Layer
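The result-filtering role of the proxies can be sketched in a few lines of Python. This is not Vallum's API: the profile names, field names and policy table below are invented for illustration, and the real proxy rewrites queries inside an SGX enclave rather than filtering plain dictionaries.

```python
# Hypothetical policy: which columns each user profile may see in clear.
POLICY = {
    "researcher": {"age", "sex", "diagnosis"},            # no direct identifiers
    "clinician":  {"name", "record_id", "age", "sex", "diagnosis"},
}

def filter_result(rows, profile, policy=POLICY):
    """Drop every column the profile is not allowed to see before the
    result leaves the protected environment."""
    allowed = policy[profile]
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]

rows = [{"name": "A. Silva", "record_id": 17, "age": 13, "sex": "F",
         "diagnosis": "borderline RHD"}]
safe = filter_result(rows, "researcher")
# safe exposes only age, sex and diagnosis; the name never leaves the enclave
```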
27. The Vallum Framework
[Diagram: the Data Protection Layer (Vallum) provides proxying, authentication, authorization, privacy and auditing. Clients submit queries and receive only compliant results; Vallum forwards modified queries to the underlying stores and filters their results. Supported back ends include a columnar DBMS (e.g., Cassandra), a relational DBMS (e.g., MySQL), a document store (e.g., MongoDB) and a file system (e.g., IPFS).]
31. • The underlying infrastructure is a federated cloud
• Using fogbow (www.fogbowcloud.org) on OpenStack and OpenNebula.
• With a Federated Network to provide a coherent network space among nodes.
• Heterogeneous resources: SGX-enabled and GPU nodes.
• Using EC3(1) and Infrastructure Manager(2) to deploy a virtual infrastructure.
Intercontinental infrastructure
[Diagram: cloud resources in the EU and in Brazil joined by a Federation Layer over a secure overlay network; SGX-enabled resources in Brazil hold the encrypted PROVAR study, GPU-enabled resources run processing containers, and a central TMA together with TOSCA-IM/EC3 drives deployment through each site's cloud manager.]
(1) https://marketplace.eosc-portal.eu/services/elastic-cloud-compute-cluster-ec3
(2) https://marketplace.eosc-portal.eu/services/infrastructure-manager-im
32. • The virtual infrastructure is managed by an elastic
Kubernetes cluster spawned over the federated network
• Containers and services are accessible from both sites, but
only through the federated network.
• Resources are tagged (SGX and GPU capabilities, and
Brazil / Europe) so Kubernetes applications are placed on the
correct resources.
• The infrastructure is described as code(3).
• The Kubernetes front end is deployed first, and nodes are
powered on as applications are deployed and request
specific resources.
Deployment of the virtual
infrastructure
(3) https://github.com/grycap/ec3/tree/atmosphere
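The tag-based placement described above can be sketched as the label matching Kubernetes performs for a `nodeSelector`: a pod is eligible only for nodes carrying every requested label. The node names and label keys below are hypothetical, chosen to mirror the SGX/GPU and Brazil/Europe tags in the text.

```python
# Hypothetical node labels, mirroring the SGX/GPU and region tags in the text.
NODES = {
    "node-br-1": {"region": "brazil", "sgx": "true"},
    "node-eu-1": {"region": "europe", "gpu": "true"},
    "node-eu-2": {"region": "europe"},
}

def eligible_nodes(node_selector, nodes=NODES):
    """Return the nodes whose labels contain every key/value pair in the
    selector, the same matching rule Kubernetes applies for nodeSelector."""
    return sorted(
        name for name, labels in nodes.items()
        if all(labels.get(k) == v for k, v in node_selector.items())
    )

# A secure-storage pod must land on an SGX node in Brazil;
# a training pod needs a GPU node in Europe.
placement = eligible_nodes({"region": "brazil", "sgx": "true"})
# placement == ["node-br-1"]
```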
33. • A secure storage is deployed on the
Brazilian side
• It uses Vallum(4), a service that provides
on-the-fly anonymisation based on policies.
• It masks (or blurs) the fields that are marked
as sensitive, for different user profiles.
• It relies on an HDFS filesystem for the files
and on SQL databases for the structured data.
• It runs the data anonymisation and sensitive data access in enclaves on
SGX-enabled containers, so they execute securely even in untrusted clouds
• Data remains encrypted on disk.
Secure storage at Brazilian side
[Diagram: Vallum runs on SGX-enabled resources in the Brazilian cloud, behind the cloud manager, holding the encrypted PROVAR study.]
(4) https://www.atmosphere-eubrazil.eu/vallum-framework-access-privacy-protection
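The masking and blurring actions applied per user profile can be sketched as simple field transformations. These functions, the field names and the salt value are illustrative stand-ins, not Vallum's actual policy engine.

```python
import hashlib

def mask(value):
    """Replace a value entirely (e.g., a patient name)."""
    return "***"

def pseudonymise(value, salt="s3cr3t"):      # salt is a made-up example value
    """Deterministic pseudonym so records stay linkable across queries."""
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:8]

def blur_age(age, bucket=5):
    """Reduce precision: report ages in 5-year buckets."""
    lo = (age // bucket) * bucket
    return f"{lo}-{lo + bucket - 1}"

# Hypothetical per-field policy in the spirit of Vallum's on-the-fly rules.
POLICY = {"name": mask, "record_id": pseudonymise, "age": blur_age}

def apply_policy(record, policy=POLICY):
    """Transform only the fields named in the policy; pass others through."""
    return {k: policy[k](v) if k in policy else v for k, v in record.items()}

out = apply_policy({"name": "A. Silva", "record_id": 17, "age": 13, "sex": "F"})
# out["name"] == "***", out["age"] == "10-14"; sex is untouched
```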
34. • External users request data from Vallum, but they can only access
partially anonymised data
• The anonymised data (~1 TB) is copied to where the computing accelerators are located.
Anonymised Data
[Diagram: the encrypted PROVAR study sits behind Vallum on SGX-enabled resources in Brazil; plain and anonymised data flow over the Federation Layer's secure overlay network to the application and storage service on GPU-enabled resources in the EU, with local and central TMA instances monitoring both sides.]
35. • Videos are split into frames and
classified by color inspection
• A color-based segmentation using k-means
clustering extracts the color pixels from the
Doppler images.
• Images are classified according to
their acquisition view using a CNN
• The parasternal long-axis view has proven
relevant for obtaining an accurate classification.
• First & second order texture analyses
characterize the images by the spatial variation of pixel intensities.
• Besides texture features, blood velocity information is also obtained.
• Finally, all the extracted features are classified through machine learning
techniques in order to differentiate between RHD positive and healthy subjects.
Building the models for the
Estimation pipeline.
[Pipeline diagram. Data preparation: frame splitting, then colour-based segmentation (Doppler) and preparation of images for the classifier, then image classification and view classification (parasternal long axis). Data analysis: texture analysis & velocity extraction, then features classification.]
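The colour-based segmentation step uses k-means clustering to pull the coloured Doppler pixels out of the grey background. A minimal 1-D sketch of that idea is below; real Doppler frames are 2-D RGB images, so the scalar "colourfulness" values and the pixel numbers here are illustrative only.

```python
def kmeans_1d(values, k=2, iters=20):
    """Plain k-means on scalar values (e.g., per-pixel colourfulness of a
    Doppler frame): returns cluster centres and per-value assignments."""
    centres = [min(values), max(values)] if k == 2 else list(values[:k])
    assign = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest centre, then recompute centres.
        assign = [min(range(k), key=lambda c: abs(v - centres[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centres[c] = sum(members) / len(members)
    return centres, assign

# Grey background pixels cluster near 0, coloured Doppler pixels near 200.
pixels = [3, 5, 2, 4, 198, 205, 201, 7, 199]
centres, assign = kmeans_1d(pixels, k=2)
doppler = [p for p, a in zip(pixels, assign) if a == assign[4]]
# doppler keeps the four high-intensity pixels
```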
36. • The pipeline is developed
using LEMONADE(5)
• LEMONADE provides a GUI and a
machine learning library to develop
data analytics pipelines.
• Pipelines can be run interactively
or transformed into executable code.
• Code can be run interactively or further embedded into
services exposed for production.
• A model-building pipeline and an estimation
pipeline are developed.
Coding the pipeline: LEMONADE
(5) https://www.atmosphere-eubrazil.eu/lemonade-live-exploration-and-mining-non-trivial-amount-data-everywhere
37. Fairness
● Algorithms, in ML and AI, learn by identifying patterns in data collected
over many years. Why may algorithms become “unfair”?
○ By using unbalanced data sets, biased toward certain populations.
○ By using data sets that perpetuate historical biases.
○ By inappropriate data handling.
○ As a result of inappropriate model selection, or incorrect algorithm design or application.
● Algorithmic fairness components:
○ The Aequitas Bias and Fairness Audit Toolkit, proposed
by the DSSG group at the University of Chicago
(http://aequitas.dssg.io/)
○ Properties:
■ Equal Parity & Proportional Parity.
■ False Positive Rate and False Discovery
Rate Parity.
■ False Negative Rate and False Omission
Rate Parity.
[Fairness Tree diagram: representational fairness branches into Equal Parity and Proportional Parity; error fairness branches into False Negative Rate, False Positive Rate, False Discovery Rate and False Omission Rate parity (FNRP, FPRP, FDRP, FORP).]
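One of the error-fairness properties above, False Positive Rate parity, can be computed in a few lines. The sketch below follows the disparity-ratio form Aequitas reports (each group's metric divided by a reference group's); the labels, predictions and groups are toy data, not PROVAR results.

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), computed over the actual negatives."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if fp + tn else 0.0

def fpr_parity(y_true, y_pred, groups, reference):
    """Per-group FPR divided by the reference group's FPR;
    a ratio of 1.0 means parity with the reference group."""
    by_group = {}
    for g in set(groups):
        yt = [t for t, gg in zip(y_true, groups) if gg == g]
        yp = [p for p, gg in zip(y_pred, groups) if gg == g]
        by_group[g] = false_positive_rate(yt, yp)
    ref = by_group[reference]
    return {g: v / ref for g, v in by_group.items()}

# Toy data: all subjects are actual negatives; the model false-alarms
# twice as often for group "M" as for group "F".
y_true = [0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]
disparity = fpr_parity(y_true, y_pred, groups, reference="F")
# disparity == {"F": 1.0, "M": 2.0}
```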
38. ● Increasing model complexity typically reduces interpretability
○ Complex multilayer Convolutional Neural Networks are far more difficult to explain than
Decision Trees or Linear Regression.
● Effort is invested in characterizing explainability and providing
information to explain how the algorithm reached its results
○ δ-Interpretability (https://arxiv.org/pdf/1707.03886.pdf).
○ LIME (https://github.com/marcotcr/lime)
■ The output of LIME is a list of explanations,
reflecting the contribution of each feature to
the prediction of a data sample.
Interpretability
(Figure: severe-retinopathy prediction using a 48-layer deep network;
https://www.kaggle.com/kmader/inceptionv3-for-retinopathy-gpu-hr)
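The shape of a LIME-style output, one contribution per feature, can be illustrated with a much cruder perturbation scheme: measure how the model's score changes when each feature is replaced by a baseline. LIME proper fits a local linear surrogate on sampled perturbations; the linear scorer and weights below are invented for the sketch.

```python
def explain(predict, x, baseline=0.0):
    """Crude local explanation: the drop in the model's score when each
    feature is replaced by a baseline value. The output has the same
    shape as a LIME explanation: one contribution per feature."""
    base_score = predict(x)
    contributions = {}
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = baseline
        contributions[i] = base_score - predict(perturbed)
    return contributions

# Hypothetical linear scorer standing in for the trained classifier.
weights = [2.0, -1.0, 0.5]
predict = lambda x: sum(w * v for w, v in zip(weights, x))

contrib = explain(predict, [1.0, 1.0, 1.0])
# contrib == {0: 2.0, 1: -1.0, 2: 0.5}: feature 0 pushes the score up most
```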
39. Privacy Assessment Forms for GDPR
and LGPD
● The International context requires
dealing with multiple legal
frameworks
○ Brazilian LGPD and GDPR in our case.
● Integrated a tool for tagging and
following up sensitive fields
○ To provide a list of Personally Identifiable
Information (PII) and Sensitive Information
■ PIIs: full name, ethnicity, medical record ID, gender, ...
■ Sensitive info: medical information, genetics, ...
○ Traces the use of sensitive data within a
processing workflow, to guide the annotation
of sensitive derived information.
40. Re-identification Risk
● Anonymisation defined by policies
○ Define actions (Removal, Blurring, Reduction,
Substitution) and fields.
○ The system starts with the least restrictive
policy, applies the anonymisation and computes
the metric.
● Data Privacy Model
○ Anonymisation Process.
○ K-anonymity Model Computation.
○ Threshold Checker.
○ Linkage Attack for Validation.
○ Increase Anonymity.
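The escalation loop above, try the least restrictive policy, measure k-anonymity, and tighten until a threshold is met, can be sketched in plain Python. The quasi-identifiers, the toy policies and the records are invented for the example; the real system also runs a linkage attack for validation, which this sketch omits.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """k = size of the smallest group sharing the same quasi-identifier
    values; every individual then hides among at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def blur_age(r):
    """Reduction action: collapse ages into 5-year buckets."""
    r = dict(r)
    r["age"] = (r["age"] // 5) * 5
    return r

# Hypothetical policies, ordered from least to most restrictive.
POLICIES = [
    ("as-is",     lambda r: r),
    ("blur age",  blur_age),
    ("drop city", lambda r: {**blur_age(r), "city": "*"}),
]

def anonymise(records, qids, k_threshold):
    """Apply the least restrictive policy that reaches the k threshold."""
    for name, policy in POLICIES:
        out = [policy(r) for r in records]
        if k_anonymity(out, qids) >= k_threshold:
            return name, out
    raise ValueError("no policy reaches the requested k")

records = [{"age": 12, "city": "BH"}, {"age": 13, "city": "BH"},
           {"age": 14, "city": "MC"}, {"age": 11, "city": "MC"}]
chosen, out = anonymise(records, ["age", "city"], k_threshold=2)
# the raw records are all unique (k=1); blurring ages reaches k=2
```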
41.
Conclusions
Before:
• Need to manually configure the environment.
• Lack of reproducibility.
• Qualitative appraisal of trustworthiness.
• Manual analysis of GDPR/LGPD risks.
• Need to trust the storage provider.
• Anonymisation level is qualitative.
After:
• Application templates for complex & distributed applications.
• A repeatable way to deploy the whole application.
• Quantitative measure of trustworthiness.
• Self-assessment of GDPR/LGPD.
• Trustable storage environment, even on an untrusted provider.
• Quantitative anonymisation level.