Khalid Belhajjame, Noura Faci, Zakaria Maamar, Vanilson Burégio,
Edvan Soares and Mahmoud Berhamgi
Contact: kbelhajj@gmail.com
• Data driven analysis pipelines
  • Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving
  • Tools for automating frequently performed data intensive activities
• Provenance for the resulting datasets
  • The method followed
  • The resources used
  • The datasets used
Khalid Belhajjame @ DarliAP Workshop, 2019 2
[Figure: example applications of data-driven analysis pipelines. Credit: Carole A. Goble]
• GWAS, pharmacogenomics: association study of Nevirapine-induced skin rash in Thai population
• Trypanosomiasis (sleeping sickness parasite) in African cattle
• Astronomy and heliophysics
• Library document preservation
• Systems biology of micro-organisms
• Observing Systems Simulation Experiments (JPL, NASA)
• Biodiversity: invasive species modelling
• In fields such as biomedicine and the social and behavioral sciences, workflow executions manipulate and generate sensitive information about individuals.
• There is, therefore, a serious concern that datasets may be inappropriately manipulated or misused during experiments, which could lead to leaks of sensitive data.
• Publishing the provenance of the executions of such workflows raises privacy concerns.
To our knowledge, no existing proposal assists scientists in the task of anonymizing the provenance of their experiments.
Our objective: we seek to assist scientists in the task of anonymizing workflow provenance to preserve the privacy of individuals.
• Most related work in the area has focused on the problem of securing workflow provenance and policing access to it:
  • Protecting the integrity of provenance data from corruption using cryptographic techniques [Hasan and Khan, 2017; Lyle and Martin, 2010].
  • Deriving a partial view of a workflow that conforms to pre-specified access permissions on the modules' inputs and outputs and their dependencies [Chebotko et al., 2008; Cohen Boulakia et al., 2008].
  • Policy languages allowing scientists to specify relationships between datasets and workflow modules, and their properties relevant to datasets [Alhaqbani et al., 2013; Gil et al., 2010].
  • Protecting the privacy of the modules that compose a workflow by hiding certain parameters (attributes) of those modules [Davidson et al., 2011].
[Credit: Steve Touw, Immuta]
‘Differential privacy formalizes the idea that a "private" computation should
not reveal whether any one person participated in the input or not, much
less what their data are.’ - [Frank McSherry]
(https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-03.md)
[Figure: salaries of $320k, $340k, $330k and $30M]
Sensitivity of median ≈ $10k
Sensitivity of mean ≈ $30M
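The contrast between the two statistics can be checked numerically. The sketch below is illustrative only: it measures how far the median and the mean move when the $30M salary is added to the other three. It does not compute a formal worst-case sensitivity, so its numbers differ from the approximate bounds quoted above. The helper name `shift` is an assumption made for this example.

```python
from statistics import mean, median

# Salaries taken from the example above (in dollars).
base = [320_000, 340_000, 330_000]
outlier = 30_000_000

def shift(stat, data, extra):
    """Absolute change in a statistic when one record is added."""
    return abs(stat(data + [extra]) - stat(data))

median_shift = shift(median, base, outlier)
mean_shift = shift(mean, base, outlier)

print(f"median moves by ${median_shift:,.0f}")
print(f"mean moves by   ${mean_shift:,.0f}")
```

On this data the median moves by a few thousand dollars while the mean moves by several million, which is why a single individual is far easier to detect through the mean.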
• For our work, we chose to use the most fundamental anonymization privacy model, namely k-anonymity, which was proposed to protect individual privacy in data publishing.
• While k-anonymity is less powerful than differential privacy, it is suitable for our purposes, given that it provides the means for:
  • Exploring the provenance of workflows,
  • Examining the data products used and generated by the workflows,
  • Preserving (to a certain extent) lineage information between data products.
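As a reminder of what k-anonymity requires, here is a minimal sketch of a k-anonymity check: every combination of quasi-identifier values that occurs in a dataset must be shared by at least k records. The function name `is_k_anonymous`, the attribute names and the records are hypothetical and are not taken from the talk.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in `records` is shared by at least k records."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in groups.values())

# Hypothetical, already generalized patient records.
records = [
    {"age": "30-39", "zip": "750**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "750**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "690**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "690**", "diagnosis": "diabetes"},
]

two_anonymous = is_k_anonymous(records, ["age", "zip"], k=2)
three_anonymous = is_k_anonymous(records, ["age", "zip"], k=3)
print(two_anonymous, three_anonymous)
```

Each (age, zip) group here contains exactly two records, so the table satisfies 2-anonymity but not 3-anonymity.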
• A workflow is defined by the triple:
• An operation op in OP is defined as:
• The data links:
• Sensitive parameters
To specify that a given input or output parameter carries sensitive data, we use a boolean function that is true if the data bound to <op, p> during the execution are sensitive.
• Anonymity degree
We use a function that specifies the anonymity degree of the parameter <op, p> with respect to a workflow instance insWf.
• Manually identifying the sensitive parameters of a workflow and setting their anonymity degrees can be tedious.
• This is particularly the case when the workflow includes a large number of operations.
• We assist the scientist in this task by leveraging parameter dependencies.
• A parameter <op, p> depends on a parameter <op', p'> in a workflow (DWf) if, during the execution of (DWf), the data bound to <op', p'> contribute to or influence the data bound to <op, p>.
• Given a workflow (DWf), the dependencies between its parameters are inferred as follows:
  • Given an operation (op) that belongs to (DWf), we can infer that the outputs of (op) depend on its inputs.
  • If the workflow (DWf) contains a data link connecting an output <op, o> to an input <op', i>, then <op', i> depends on <op, o>.
  • We also transitively derive dependencies between the operation parameters: if <op, p> depends on <op', p'> and <op', p'> depends on <op'', p''>, then <op, p> depends on <op'', p''>.
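The three inference rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: a parameter is modelled as a pair (operation, parameter name), and the workflow, its operation names and the function `infer_dependencies` are all hypothetical.

```python
from itertools import product

# Hypothetical two-step workflow: "align" feeds "stats".
operations = {
    "align": {"inputs": ["reads"], "outputs": ["aligned"]},
    "stats": {"inputs": ["table"], "outputs": ["report"]},
}
# Each data link connects an output parameter to an input parameter.
data_links = [(("align", "aligned"), ("stats", "table"))]

def infer_dependencies(operations, data_links):
    """Return {parameter: set of parameters it depends on}."""
    deps = {}
    # Rule 1: every output of an operation depends on all its inputs.
    for op, sig in operations.items():
        for o, i in product(sig["outputs"], sig["inputs"]):
            deps.setdefault((op, o), set()).add((op, i))
    # Rule 2: the target input of a data link depends on the source output.
    for source, target in data_links:
        deps.setdefault(target, set()).add(source)
    # Rule 3: transitive closure, iterated until nothing changes.
    changed = True
    while changed:
        changed = False
        for direct in deps.values():
            indirect = set()
            for d in direct:
                indirect |= deps.get(d, set())
            if not indirect <= direct:
                direct |= indirect
                changed = True
    return deps

deps = infer_dependencies(operations, data_links)
print(sorted(deps[("stats", "report")]))
```

In this toy workflow the final report ends up depending on the workflow input `("align", "reads")` through the data link, which is the kind of indirect dependency the later steps rely on.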
• A parameter <op', p'> that is not an input to the workflow may be sensitive if it depends on a workflow input that is known to be sensitive.
• Note that we say "may be" sensitive. This is because an operation that consumes sensitive datasets may produce non-sensitive datasets.
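A minimal sketch of this rule, assuming the dependencies are available as a mapping from each parameter to the set of parameters it depends on (already transitively closed). The function name `candidate_sensitive` and the example parameters are hypothetical. The result is only a list of candidates, since the scientist still has to confirm which of them are truly sensitive.

```python
def candidate_sensitive(deps, sensitive_inputs):
    """Parameters that may be sensitive: those that depend on at
    least one workflow input known to be sensitive."""
    return {param for param, sources in deps.items()
            if sources & sensitive_inputs}

# Hypothetical dependencies, keyed by (operation, parameter name).
deps = {
    ("align", "aligned"): {("align", "reads")},
    ("stats", "table"): {("align", "aligned"), ("align", "reads")},
    ("stats", "report"): {("stats", "table"), ("align", "aligned"),
                          ("align", "reads")},
    ("plot", "figure"): {("plot", "palette")},
}
# Workflow inputs that the scientist marked as sensitive.
sensitive_inputs = {("align", "reads")}

flagged = candidate_sensitive(deps, sensitive_inputs)
print(sorted(flagged))
```

Here `("plot", "figure")` is not flagged because it depends only on a non-sensitive input.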
• In addition to helping the designer identify sensitive intermediate and final output parameters, we also infer the anonymity degree that should be applied to dataset instances of those sensitive parameters.
• The anonymity degree of a parameter <op', p'> given a workflow execution insWf can be defined as the maximum degree of the sensitive datasets that are used as input to the workflow and that contribute to the dataset instances of <op', p'>.
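The "maximum degree" rule can be sketched as follows, assuming the same hypothetical representation as before: a mapping from each parameter to the parameters it depends on, plus the anonymity degrees that the scientist set on the sensitive workflow inputs for one execution. The function name `inferred_degree` and the example values are assumptions.

```python
def inferred_degree(param, deps, input_degrees):
    """Maximum anonymity degree among the sensitive workflow inputs
    that `param` depends on, or None if it depends on none."""
    degrees = [input_degrees[src]
               for src in deps.get(param, ())
               if src in input_degrees]
    return max(degrees, default=None)

# Hypothetical dependencies, keyed by (operation, parameter name).
deps = {
    ("merge", "joined"): {("merge", "patients"), ("merge", "visits")},
    ("stats", "report"): {("stats", "table"), ("merge", "joined"),
                          ("merge", "patients"), ("merge", "visits")},
    ("plot", "figure"): {("plot", "palette")},
}
# Anonymity degrees of the sensitive inputs for this execution.
input_degrees = {("merge", "patients"): 5, ("merge", "visits"): 3}

report_degree = inferred_degree(("stats", "report"), deps, input_degrees)
figure_degree = inferred_degree(("plot", "figure"), deps, input_degrees)
print(report_degree, figure_degree)
```

Taking the maximum means a derived dataset is never published with a weaker guarantee than the strictest sensitive input that contributed to it.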
[Figure: architecture of the approach. Elements shown: data owners holding sensitive and non-sensitive data; public data repositories; a trusted workflow environment; a workflow workbench; a workflow execution engine; a data anonymizer; a private data repository. Seven numbered interactions (1 to 7) carry the labels: share, launch execution, get inputs, store outputs, launch data anonymization, get data, publish data.]
• For validation purposes, we used 20 different CWL workflows [1] and performed 500 executions per workflow. We computed the overhead of our method in terms of the computation of parameter dependencies, the identification of sensitive parameters and the computation of anonymity degrees.
• The results showed that the overhead is small compared to the execution time of the workflow: on average, it takes less than a millisecond to perform all the necessary computations.
[1] view.commonwl.org/workflows
• We presented an approach for preserving privacy in the context of scientific workflows that rely heavily on large datasets.
• We have shown how data plays a role in i) identifying sensitive operation parameters in the workflow and ii) deriving the anonymity degree that needs to be enforced when publishing the dataset instances of these parameters.
• This is preliminary work that opens up opportunities for more research in the field of anonymization of workflow data.

Data Driven Workflow Anonymization

  • 1. Khalid Belhajjame, Noura Faci, Zakaria Maamar, Vanilson Burégio, Edvan Soares and Mahmoud Berhamgi. Contact: kbelhajj@gmail.com
  • 2. Data driven analysis pipelines: systematic gathering of data and analysis tools into computational solutions for scientific problem-solving; tools for automating frequently performed data intensive activities. Provenance for the resulting datasets: the method followed, the resources used, the datasets used. Khalid Belhajjame @ DarliAP Workshop, 2019
  • 3. GWAS, Pharmacogenomics: association study of Nevirapine-induced skin rash in Thai population; Trypanosomiasis (sleeping sickness parasite) in African cattle; Astronomy & HelioPhysics; Library Doc Preservation; Systems Biology of Micro-Organisms; Observing Systems Simulation Experiments (JPL, NASA); BioDiversity: Invasive Species Modelling. [Credit: Carole A. Goble]
  • 4. In fields such as biomedicine and the social and behavioral sciences, workflow executions manipulate and generate sensitive information about individuals. There is, therefore, a serious concern that inappropriate manipulation or misuse of datasets during experiments could lead to leaks of sensitive data. Publishing the provenance of the executions of such workflows raises privacy concerns.
  • 5. To our knowledge, there does not exist any proposal that assists scientists in the task of anonymizing the provenance of their experiments. Our objective: we seek to assist scientists in the task of anonymizing workflow provenance to preserve the privacy of individuals. Most related work in the area has focused on the problem of securing workflow provenance and policing access to it: (i) protecting the integrity of provenance data from corruption using cryptography techniques [Hasan and Khan, 2017; Lyle and Martin, 2010]; (ii) deriving a partial view on a workflow that conforms to pre-specified access permissions on the modules' inputs and outputs and their dependencies [Chebotko et al., 2008; Cohen Boulakia et al., 2008]; (iii) policy languages allowing scientists to specify relationships between datasets and the workflow modules, and their properties relevant to datasets [Alhaqbani et al., 2013; Gil et al., 2010]; (iv) protecting the privacy of the modules that compose the workflow by hiding certain parameters (attributes) of those modules [Davidson et al., 2011].
  • 6. ‘Differential privacy formalizes the idea that a "private" computation should not reveal whether any one person participated in the input or not, much less what their data are.’ - [Frank McSherry] (https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-03.md). Illustration [Credit: Steve Touw, Immuta]: for the values $320k, $340k, $330k and $30M, sensitivity of median = ~10k, sensitivity of mean = ~30M.
  • 7. For our work, we chose to use the most fundamental anonymization privacy model, namely k-anonymity, which has been proposed to protect individual privacy in data publishing. While k-anonymity is less powerful than differential privacy, it is suitable for our purposes, given that it provides the means for: exploring the provenance of workflows; examining the data products used and generated by the workflows; preserving (to a certain extent) lineage information between data products.
  • 8. • A workflow is defined by the triple • An operation op in OP is defined as. • The data links:
  • 13. Sensitive parameters: to specify that a given input or output parameter carries sensitive data, we use the following boolean function: that is true if the data bound to <op,p> during the execution are sensitive. Anonymity degree: we use the following function to specify the anonymity degree of the parameter <p, op> with respect to a workflow instance insWf:
  • 14. Manual identification of a workflow’s parameters that are sensitive and setting their anonymity degrees can be tedious. This is the case when the workflow includes a large number of operations. We assist the scientist in this task by leveraging parameter dependencies.
  • 15. A parameter <op, p> depends on a parameter <op', p’> in a workflow (DWf) if, during the execution of (DWf), the data bound to <op', p’> contribute to or influence the data bound to <op, p>. Given a workflow (DWf), the dependencies between its parameters are inferred as follows. Given an operation (op) that belongs to (DWf), we can infer that the outputs of (op) depend on its inputs. If the workflow (DWf) contains a data link connecting an output <op, o> to an input <op', i>, then: We also transitively derive dependencies between the operation parameters:
  • 16. A parameter <p', op’> that is not an input to the workflow may be sensitive if it depends on a workflow input that is known to be sensitive: Note that we say may be sensitive. This is because an operation that consumes sensitive datasets may produce non-sensitive datasets.
  • 17. In addition to assisting the designer in identifying sensitive intermediate and final output parameters, we also infer details about the anonymity degree that should be applied to dataset instances of those sensitive parameters. The anonymity degree of a parameter <p', op’> given a workflow execution insWf can be defined as the maximum degree of the sensitive datasets that are used as input to the workflow and that contribute to the dataset instances of <p', op’>.
  • 18. [Architecture figure. Components: data owners (holding sensitive and non-sensitive data), public data repositories (non-sensitive data), and a trusted workflow environment comprising a workflow workbench, a workflow execution engine, a data anonymizer and a private data repository. Labelled interactions, numbered 1 to 7 in the figure: share, launch execution, get inputs, store outputs, publish data, get data, launch data anonymization.]
  • 19. For validation purposes, we used 20 different CWL workflows [1]. We performed 500 executions per workflow and computed the overhead of our method in terms of the computation of parameter dependencies, the identification of sensitive parameters and the computation of anonymity degrees. The results obtained showed that the overhead is small compared to the execution of the workflow: it takes on average less than a millisecond to perform all the necessary computation. [1] view.commonwl.org/workflows
  • 20. We presented an approach for preserving privacy in the context of scientific workflows that heavily rely on large datasets. We have shown how data plays a role in i) identifying sensitive operation parameters in the workflow and ii) deriving the anonymity degree that needs to be enforced when publishing the dataset instances of these parameters. This is preliminary work that opens up opportunities for more research in the field of anonymization of workflow data.
  • 21. Khalid Belhajjame, Noura Faci, Zakaria Maamar, Vanilson Burégio, Edvan Soares and Mahmoud Berhamgi. Contact: kbelhajj@gmail.com
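
The dependency inference rules of slide 15 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the formulas on the slide are not preserved in this transcript, and the three-operation workflow, its operation names (load, filter, stats) and the function name infer_dependencies are hypothetical. Parameters are modelled as (operation, name) pairs.

```python
from itertools import product

# Hypothetical workflow; operation and parameter names are illustrative only.
operations = {
    "load":   {"inputs": ["patients"], "outputs": ["records"]},
    "filter": {"inputs": ["records"],  "outputs": ["cohort"]},
    "stats":  {"inputs": ["cohort"],   "outputs": ["summary"]},
}
# Each data link connects an output of one operation to an input of another.
data_links = [
    (("load", "records"), ("filter", "records")),
    (("filter", "cohort"), ("stats", "cohort")),
]

def infer_dependencies(operations, data_links):
    """Return {parameter: set of parameters it depends on}, transitively closed."""
    depends_on = {}
    # Rule 1: every output of an operation depends on every input of it.
    for op, sig in operations.items():
        for out_name, in_name in product(sig["outputs"], sig["inputs"]):
            depends_on.setdefault((op, out_name), set()).add((op, in_name))
    # Rule 2: the input at the end of a data link depends on the linked output.
    for source, target in data_links:
        depends_on.setdefault(target, set()).add(source)
    # Rule 3: transitive closure, repeated until no new dependency appears.
    changed = True
    while changed:
        changed = False
        for deps in depends_on.values():
            extra = set()
            for d in deps:
                extra |= depends_on.get(d, set())
            if not extra <= deps:
                deps |= extra
                changed = True
    return depends_on

deps = infer_dependencies(operations, data_links)
print(sorted(deps[("stats", "summary")]))
```

In this toy workflow the final output ("stats", "summary") ends up depending on every upstream parameter, including the workflow input ("load", "patients"), which is what later allows sensitivity to be traced from inputs to outputs.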

Editor's notes

  1. In this age of data-intensive science we’re witnessing the unprecedented generation and sharing of large scientific datasets, where the pace of data generation has far surpassed the pace of conducting analysis over the data. Scientific Workflows [6] are a recent but very popular method for task automation and resource integration. Using workflows, scientists are able to systematically weave datasets and analytical tools into pipelines, represented as networks of data processing operations connected with dataflow links. (Figure 1 illustrates a workflow from genomics, which “from a given set of gene ids, retrieves corresponding enzyme ids and finds the biological pathways involving them, then for each pathway retrieves its diagram with a designated coloring scheme”). As well as being automation pipelines, workflows are of paramount importance for the provenance of data generated from their execution [6]. Provenance refers to data’s derivation history starting from the original sources, namely its lineage.
  2. In fields such as biomedicine and the social and behavioral sciences, workflow executions manipulate and generate sensitive information about individuals. There is a serious concern that inappropriate manipulation or misuse of datasets during experiments could lead to leaks of sensitive data. Although this could happen inadvertently, the consequences remain the same. Publishing the provenance of the executions of such workflows raises privacy concerns. For example, record linking techniques can be applied to provenance traces to cross-reference datasets used and generated by the workflow modules with the intention of revealing private or sensitive information about individuals, thereby violating basic privacy rights.
  3. Some proposals protect the integrity of provenance data from corruption using sophisticated secure computing and cryptography techniques. Chebotko et al. \cite{DBLP:conf/waim/ChebotkoCLFY08} discuss means for deriving a partial view on a workflow that conforms to pre-specified access permissions on the modules' inputs and outputs and their dependencies. Gil et al. \cite{DBLP:conf/semweb/CheungG07,DBLP:conf/aaaiss/GilF10} and Alhaqbani et al. \cite{Alhaqbani2013} proposed policy languages allowing scientists to specify relationships between datasets and the workflow modules, and their properties relevant to datasets. Policies can be utilized, for instance, to specify that the data instances of a module's output need to be anonymized. In doing so, however, the policy language does not specify how the datasets are to be anonymized, and even less how their lineage information is to be preserved. Davidson et al. \cite{DBLP:conf/icdt/DavidsonKRSTC11,DBLP:conf/pods/DavidsonKMPR11,DBLP:conf/cidr/DavidsonKTRCMS11} investigated a related problem but with a focus on module privacy. The objective of this line of proposals is to identify the subset of the inputs and outputs, or more specifically attributes thereof, of the workflow modules that need to be hidden to keep the functionality of the workflow modules hidden. Our objective is different in that we consider that the modules that compose the workflow are public, and we seek to anonymize the workflow provenance, with the objective of hiding sensitive information about individuals from the provenance records. In doing so, we examine anonymization techniques to generalize attribute values of data records, as opposed to hiding the attributes completely as done in \cite{DBLP:conf/pods/DavidsonKMPR11}.
  4. The intuition of differential privacy is that the removal or addition of a single record does not significantly affect the outcome of any analysis. With differential privacy, it is very hard to do exploration because of the privacy budget: you somewhat have to know the questions you intend to ask up front, and you can only ask aggregate questions. Different techniques have been proposed in the literature for protecting the privacy of individuals, e.g., k-anonymity [28, 31], l-diversity [24], t-closeness [22] and differential privacy [11]. In particular, differential privacy has recently gained momentum as the method of choice in statistical databases. It involves adding random noise to the data so that the distribution of the resulting dataset is almost invariant to the inclusion of any data record. While extremely powerful, differential privacy is not suitable for our purposes, and its application may hamper the utility of anonymized provenance data in order to preserve a more rigorous guarantee of privacy [29]. Indeed, for it to be useful, provenance information should keep track of the data records that have been used and generated by the workflow modules as well as their connections (lineage), which may be lost or broken when applying differential privacy techniques. [Khalid: you need to check the validity of the following statement, with evidence (paper).]
  5. The anonymity degree of a parameter $\mathtt{\langle p, op \rangle}$ of a workflow $\mathtt{DWf}$ is defined with respect to a given instance $\mathtt{insWf}$ of $\mathtt{DWf}$. Indeed, different instances of $\mathtt{DWf}$ may take input datasets with different anonymity degree requirements. For example, the owner of an input dataset used for a given workflow instance ($\mathtt{insWf_1}$) may impose a more stringent anonymity degree than the owner of an input dataset used for a different workflow instance ($\mathtt{insWf_2}$).
  6. Manual identification of a workflow’s parameters that are sensitive and setting their anonymity degrees can be tedious. This becomes a serious concern when the workflow includes a large number of operations. To address this issue, we propose in this section an approach that takes as input the sensitivity of the input parameters of the workflow (DWf) together with their anonymity degrees. It then detects the list of (intermediate and final) parameters in (DWf) that may be sensitive, and infers the anonymity degree that should be applied to the datasets bound to those parameters during the execution of (DWf).
  7. Taking the maximum anonymity degree of the contributing inputs ensures that the anonymity degrees imposed on such inputs are honored by the dependent parameter in question.
  8. This work opens up opportunities for more research in the field of anonymization of workflow data. In this respect, our ongoing work includes investigating the applicability of our solution to anonymization techniques
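
The rule discussed in notes 5 to 7 (a non-input parameter may be sensitive when a sensitive workflow input contributes to it, and it inherits the maximum anonymity degree of those inputs) can be sketched as follows. This is a minimal illustration, not the authors' code: the dependency map, the parameter names, the degrees 5 and 10, and the function name infer_degree are all hypothetical, and the dependency map is assumed to be already transitively closed.

```python
# Hypothetical dependency map: parameter -> parameters it depends on.
# Parameters are (operation, name) pairs; all names are illustrative.
depends_on = {
    ("merge", "joined"): {("merge", "patients"), ("merge", "visits")},
    ("report", "table"): {("merge", "joined"), ("merge", "patients"),
                          ("merge", "visits"), ("report", "template")},
    ("report", "logo"): {("report", "template")},
}

# Anonymity degrees declared for the sensitive inputs of one workflow
# instance; inputs absent from this map are treated as non-sensitive.
input_degree = {("merge", "patients"): 5, ("merge", "visits"): 10}

def infer_degree(param, depends_on, input_degree):
    """Return the inferred anonymity degree of `param`, or None when no
    sensitive workflow input contributes to it."""
    contributing = [p for p in depends_on.get(param, set()) if p in input_degree]
    if not contributing:
        return None  # not flagged as possibly sensitive
    # The maximum honors the most stringent requirement among the inputs.
    return max(input_degree[p] for p in contributing)

for param in depends_on:
    print(param, infer_degree(param, depends_on, input_degree))
```

A returned degree only marks the parameter as possibly sensitive. As slide 16 points out, an operation that consumes sensitive datasets may still produce non-sensitive ones, so the scientist can override the inferred flag.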