SGX-PySpark: Secure Distributed Data Analytics addresses how public cloud users can protect sensitive data while preserving the utility of data analytics.
WWW19: SGX-PySpark: Secure Distributed Data Analytics
Do Le Quoc, Franz Gregor, Jatinder Singh and Christof Fetzer
Motivation
• Data analytics has become an important component of modern cloud-based, data-driven services
• Large-scale datasets processed by these services may contain customers' sensitive information
• Customers need to trust both service providers and cloud providers
• How can sensitive data be protected while preserving the utility of data analytics?
Key idea
• Ensure confidentiality and integrity for both code and data using trusted hardware, i.e., Intel Software Guard Extensions (SGX)
• Execute only sensitive parts of data analytics inside enclaves
• Encrypt input data; decrypt and securely process it inside enclaves
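The split described above (encrypted data outside, plaintext only inside the trusted part) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SGX-PySpark implementation: the "enclave" is simulated by an ordinary function, the names `encrypt`, `decrypt` and `enclave_sum` are hypothetical, and the cipher is a toy SHA-256 keystream with an HMAC tag standing in for a real authenticated cipher such as AES-GCM.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode). Illustration only, not vetted crypto."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    """Data owner side: encrypt a record, then append a MAC over nonce + ciphertext."""
    nonce = os.urandom(16)
    stream = _keystream(enc_key, nonce, len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, stream))
    tag = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def decrypt(enc_key: bytes, mac_key: bytes, blob: bytes) -> bytes:
    """Verify integrity first, then decrypt. Raises ValueError on tampering."""
    nonce, ciphertext, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    stream = _keystream(enc_key, nonce, len(ciphertext))
    return bytes(c ^ k for c, k in zip(ciphertext, stream))


def enclave_sum(enc_key: bytes, mac_key: bytes, encrypted_rows: list) -> int:
    """Stand-in for enclave code: plaintext values exist only inside this function."""
    return sum(int(decrypt(enc_key, mac_key, row).decode()) for row in encrypted_rows)


# Hypothetical keys and values; in a real deployment the keys would be
# released only to an attested enclave.
ENC_KEY = os.urandom(32)
MAC_KEY = os.urandom(32)
rows = [encrypt(ENC_KEY, MAC_KEY, str(v).encode()) for v in (10, 20, 12)]
total = enclave_sum(ENC_KEY, MAC_KEY, rows)
```

In this sketch the untrusted side only ever handles the opaque `rows`. Flipping any byte of a row makes `decrypt` raise, which mirrors the integrity goal alongside confidentiality.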
Implementation
• PySpark: widely used in industry for big data analytics
• SCONE: enables unmodified applications to run inside Intel SGX enclaves
• Execute the Spark Driver and the Python processes of PySpark inside enclaves using SCONE
SGX-PySpark
• Objectives:
  • Support complex operations for big data analytics
  • Provide strong security guarantees
  • Minimize performance overhead
  • Support Python
• Architecture:
Evaluation
• Dataset: TPC-H Benchmark
• ~22% overhead compared to native execution
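For reference, a relative overhead figure of this kind is commonly computed as the extra latency divided by the native latency. The sketch below assumes that definition; the poster does not state how the ~22% was aggregated across queries, and the latencies used here are hypothetical values, not measurements from the evaluation.

```python
def relative_overhead(secure_seconds: float, native_seconds: float) -> float:
    """Return the extra latency of the secure run as a fraction of the native run."""
    if native_seconds <= 0:
        raise ValueError("native latency must be positive")
    return (secure_seconds - native_seconds) / native_seconds


# Hypothetical latencies for a single query (not taken from the poster).
example = relative_overhead(secure_seconds=61.0, native_seconds=50.0)
print(f"overhead: {example:.0%}")
```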
Demo
• GitHub repository: https://github.com/doflink/sgx-pyspark-demo
• Demo video: https://youtu.be/yI3iEFWUWbU
[Figure: Latency (seconds, axis 0 to 100) of TPC-H queries Q1, Q3, Q4, Q5, Q6, Q7, Q10, Q12, Q13, Q14, Q16, Q18 and Q19 for SGX-PySpark vs. native PySpark]