Gidon Gershinsky, IBM Research - Haifa
gidon@il.ibm.com
Efficient Spark Analytics
on Encrypted Data
#SAISDev14
Overview
• What problem are we solving?
• Parquet modular encryption
• Spark encryption with Parquet
– connected car scenario: demo screenshots
– performance implications of encryption
– getting from prototype to real thing
What Problem Are We Solving?
• Protect sensitive data-at-rest
– data confidentiality: encryption
– data integrity
– in any storage: untrusted, cloud or private; file system, object store, archives
• Preserve performance of Spark analytics
– advanced data filtering (projection, predicate) with encrypted data
• Leverage encryption for fine-grained access control
– per-column encryption keys
– key-based access in any storage: private -> cloud -> archive
Background
• EU Horizon2020 research project
• EU partners: Adaptant, IT Innovation, OCC, Thales, UDE
– usage-based car insurance, social services
• Collaboration with UC Berkeley RISELab
– Opaque technology
• Secure cloud analytics (Spark and H/W Enclaves)
– how to protect “data-in-use”?
– how to protect “data-at-rest” – without losing analytics efficiency?
Apache Parquet
• Integral part of Apache Spark
• Encoding, compression
• Advanced data filtering
– columnar projection: skip columns
– predicate pushdown: skip files, row groups, or data pages
• Performance benefits
– less data to fetch from storage: I/O, latency
– less data to process: CPU, latency
• How to protect sensitive Parquet data?
– in any storage - keeping performance, supporting column access control, data tamper-proofing etc.
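The filtering ideas above can be sketched in a few lines. This is a toy model, not Parquet's reader: the `RowGroup` class, its `stats` property and the sample values are assumptions made for illustration. It shows how per-row-group min/max statistics let a reader skip row groups that cannot match a predicate, and how projection limits the columns that are touched.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group: column data plus min/max stats."""
    columns: dict  # column name -> list of values

    @property
    def stats(self):
        # Real Parquet stores these statistics in metadata; here we compute them.
        return {name: (min(vals), max(vals)) for name, vals in self.columns.items()}


def scan(row_groups, projection, column, lower_bound):
    """Return projected rows where `column` > lower_bound, skipping row groups via stats."""
    rows, skipped = [], 0
    for rg in row_groups:
        _, col_max = rg.stats[column]
        if col_max <= lower_bound:
            skipped += 1  # predicate pushdown: no value in this group can match
            continue
        for i, value in enumerate(rg.columns[column]):
            if value > lower_bound:
                # columnar projection: only the requested columns are copied out
                rows.append({name: rg.columns[name][i] for name in projection})
    return rows, skipped


row_groups = [
    RowGroup({"driver_id": [1, 2, 3], "speed": [50, 55, 60], "location": ["A", "B", "C"]}),
    RowGroup({"driver_id": [4, 5, 6], "speed": [70, 130, 90], "location": ["D", "E", "F"]}),
]
rows, skipped = scan(row_groups, projection=["driver_id", "speed"], column="speed", lower_bound=100)
print(rows, skipped)
```

The first row group is skipped without reading its values because its maximum speed is below the bound. With encrypted files, the same decision requires that the reader can decrypt the statistics.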
Parquet Modular Encryption
• Apache Parquet community work
• Full encryption: all data and metadata modules
– metadata: indexes (min/max), schema, size and number of secret values, encryption key metadata, etc.
– encrypted and tamper-proof
• Enables columnar projection and predicate pushdown
• Storage server / admin never sees encryption keys or cleartext data
– “client-side” encryption (e.g., Spark-side)
Parquet Modular Encryption
• Works in any storage
– supporting range-read
• Multiple encryption algorithms
– different security and performance requirements
• Data integrity verification
– AES GCM: “authenticated encryption”
– data not tampered with
– file not replaced with wrong version
• Column access control
– encryption with column-specific keys
AES Modes: GCM and CTR
• AES: symmetric encryption standard supported in
– Java and other languages
– CPUs (x86 and others): hardware acceleration of AES
• GCM (Galois/Counter Mode)
– encryption
– integrity verification
• basic mode: make sure data was not modified
• AAD mode: “additional authenticated data”, e.g. file name with a version/timestamp
– prevents replacing a file with a wrong (old) version
• CTR (Counter Mode)
– encryption only, no integrity verification
– useful when AES hardware acceleration is not available (e.g. Java 8)
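The AAD idea can be illustrated without an AES library. The sketch below is a stand-in, not Parquet's implementation: it uses HMAC-SHA256 from the Python standard library in place of the GCM authentication tag, and it does not encrypt anything. The function names and the `carEvents.parquet:v1` style identifiers are hypothetical. It shows how binding a file name and version into the tag lets a reader detect both modified data and a file replaced with an older version.

```python
import hashlib
import hmac
import os


def make_tag(key: bytes, data: bytes, aad: bytes) -> bytes:
    """Compute an integrity tag over the data and the additional authenticated data."""
    # Length-prefix the AAD so that different (aad, data) splits cannot collide.
    message = len(aad).to_bytes(4, "big") + aad + data
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, data: bytes, aad: bytes, tag: bytes) -> bool:
    """Return True only if neither the data nor the AAD differs from what was tagged."""
    return hmac.compare_digest(make_tag(key, data, aad), tag)


key = os.urandom(32)
old_data, new_data = b"speed=80", b"speed=130"

# The writer tags each version with its own AAD (file name plus version).
old_tag = make_tag(key, old_data, b"carEvents.parquet:v1")
new_tag = make_tag(key, new_data, b"carEvents.parquet:v2")

# The reader expects version 2 of the file.
expected_aad = b"carEvents.parquet:v2"
genuine_ok = verify(key, new_data, expected_aad, new_tag)
tampered_ok = verify(key, b"speed=90", expected_aad, new_tag)
replaced_ok = verify(key, old_data, expected_aad, old_tag)
print(genuine_ok, tampered_ok, replaced_ok)
```

The old file carries a valid tag of its own, so a check without AAD would accept it; only the version bound into the tag exposes the swap.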
Status & Roadmap
• PARQUET-1178
– full encryption functionality layer, with API access
– format definition PR merged in Apache Parquet feature branch
– Java and C++ implementation
• one PR merged, others being reviewed
• Next: tools on top
➢ PARQUET-1373: key management tools
▪ number of different mechanisms
▪ options for storing key material
➢ PARQUET-1376: data obfuscation layer
▪ per-cell masking and aggregated anonymization
▪ reader alerts and choice of obfuscated data
➢ PARQUET-1396: schema service interface
▪ “no-API” interface to encryption
▪ all parameters fetched via loaded class
➢ Demo feature: Property interface
▪ “no-API” interface to encryption
▪ most parameters passed via (Hadoop) config properties
▪ key server access via loaded class
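A property-driven interface of this kind might look as follows. This is a sketch only: the property names (`encryption.column.keys`, `encryption.footer.key`) and the `keyID:col,col;keyID:col` value format are hypothetical placeholders, not the prototype's actual configuration keys, which the slides do not list. It shows how a writer could derive a column-to-key mapping from plain string properties, with no encryption API calls in application code.

```python
def parse_column_keys(spec: str) -> dict:
    """Parse 'keyID:col,col;keyID:col' into a {column: key ID} mapping."""
    mapping = {}
    for group in filter(None, (g.strip() for g in spec.split(";"))):
        key_id, _, columns = group.partition(":")
        key_id = key_id.strip()
        names = [c.strip() for c in columns.split(",") if c.strip()]
        if not key_id or not names:
            raise ValueError(f"malformed key group: {group!r}")
        for name in names:
            if name in mapping:
                raise ValueError(f"column {name!r} is assigned to two keys")
            mapping[name] = key_id
    return mapping


# Hypothetical properties, as they might be set in a Hadoop-style configuration.
config = {
    "encryption.column.keys": "k1:location; k2:credit_card,home_address",
    "encryption.footer.key": "k0",
}
column_keys = parse_column_keys(config["encryption.column.keys"])
print(column_keys)
```

Columns absent from the mapping would be written unencrypted, which matches the demo's mix of secret and unencrypted columns.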
Spark Integration Prototype
• Standard Spark 2.3.0 is built with Parquet 1.8.2
• Encryption prototype: standard Spark 2.3.0 with Parquet 1.8.2-E
– Spark-side encryption and decryption
• storage (admin) never sees data key or plain data
– Column access control (key-per-column)
– Cryptographic verification of data integrity
• detects files replaced with a wrong version or tampered with
– Filtering of encrypted data
• column projection, predicate push-down
Connected Car Use Case
• Car Monitoring Events
– car ID (plate #)
– driver ID (license #)
– speed
– location
• Personal Data of Drivers
– driver ID (license #)
– name
– home address
– email address
– credit card
• Usage-based Insurance Service
– who is speeding at night? (20% and more above limit)
– raise premium, notify by email
[Diagram: Gateway, Spark/Parquet, Usage-based Insurance Service]
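Car Event Schema
[Demo screenshot: car event schema, with secret columns marked to be encrypted and the remaining columns left unencrypted]
Driver Table Schema
[Demo screenshot: driver table schema, with secret columns marked to be encrypted, and masked data]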
Writing Encrypted Parquet Files
[Demo screenshots: data keys are supplied in base64]
Reading Encrypted Parquet Files
[Demo screenshot: reader is not eligible for the location key or the credit card key; table schema and other metadata are protected]
Hidden Columns
[Demo screenshot]
Query Execution
[Demo screenshot: who is speeding at night?]
Tampering with Data
[Demo screenshot: with tampered data, the query missed Jack Wilson, 80% above speed limit]
Parquet Data Integrity Protection
[Demo screenshot]
Key Management
[Diagram: Spark/Parquet client, AUTH service, token, KMS]
• PARQUET-1373
– randomly generated Data keys
– wrapped in KMS with Master keys
– storage options for wrapped keys
• in file itself: easier to find
• in separate location: easier to re-key (e.g. after a Master key is compromised)
– client: specify Master key ID per column
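The envelope scheme in these bullets can be sketched as below. Everything here is an assumption for illustration: `MockKms`, its methods and the key IDs are hypothetical, and the wrapping step XORs the data key with an HMAC-SHA256-derived pad as a standard-library stand-in for real AES key wrapping, so it must not be used to protect real keys. The sketch shows a randomly generated data key being wrapped under a Master key, and re-keying by re-wrapping the data key without touching the encrypted data.

```python
import hashlib
import hmac
import os


class MockKms:
    """Hypothetical in-memory KMS; Master keys never leave this object."""

    def __init__(self):
        self._master_keys = {}

    def create_master_key(self, key_id: str) -> None:
        self._master_keys[key_id] = os.urandom(32)

    def _pad(self, key_id: str, nonce: bytes) -> bytes:
        # Stand-in for AES key wrapping: a one-time pad derived with HMAC-SHA256.
        return hmac.new(self._master_keys[key_id], nonce, hashlib.sha256).digest()

    def wrap(self, key_id: str, data_key: bytes) -> bytes:
        if len(data_key) > 32:
            raise ValueError("this toy wrapper handles keys of up to 32 bytes")
        nonce = os.urandom(16)
        pad = self._pad(key_id, nonce)
        return nonce + bytes(a ^ b for a, b in zip(data_key, pad))

    def unwrap(self, key_id: str, wrapped: bytes) -> bytes:
        nonce, body = wrapped[:16], wrapped[16:]
        pad = self._pad(key_id, nonce)
        return bytes(a ^ b for a, b in zip(body, pad))


kms = MockKms()
kms.create_master_key("mk-location-v1")

# Writer: a random data key per column, wrapped under the column's Master key.
data_key = os.urandom(16)
key_store = {"location": ("mk-location-v1", kms.wrap("mk-location-v1", data_key))}

# Re-keying: only the wrapped key in the separate store is replaced.
kms.create_master_key("mk-location-v2")
master_id, wrapped = key_store["location"]
rewrapped = kms.wrap("mk-location-v2", kms.unwrap(master_id, wrapped))
key_store["location"] = ("mk-location-v2", rewrapped)

# Reader: unwrap with the Master key ID recorded next to the wrapped key.
master_id, wrapped = key_store["location"]
recovered = kms.unwrap(master_id, wrapped)
print(recovered == data_key)
```

Keeping `key_store` outside the data files is what makes the re-keying step cheap here: the Parquet files themselves are not rewritten.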
IBM Spark-based Services
• IBM Analytics Engine
– on-demand Spark (and Hadoop) clusters in IBM cloud
• Watson Studio Spark Environments
– cloud tools for data scientists and application developers
– dedicated Spark cluster per Notebook
• DB2 Event Store
– rapidly ingest, store and analyze streamed data in event-driven applications
– analyze with a combination of Spark and C++ engines, store with C++ Parquet
• SQL Query Service
– serverless scale-out SQL execution on TB of data
– Spark SQL engine, running in IBM Cloud
Column Access Control with Encryption @Uber
• Access control to sensitive columns only
• PARQUET-1178 provides fundamentals to encrypt columns
• PARQUET-1396 provides schema and key retrieval interface
• Currently under development - stay tuned!
Encryption Performance
• Ubuntu 16.04 box
– 8-core Intel Core i7 6820HQ CPU / 2.70GHz
– 32GB memory
• Raw Java AES-GCM speed
– no Spark or Parquet, just encrypting buffers in loop
• 8 threads, 1MB buffers
– Java 8: ~ 250 MB/sec
– Java 9 and later: ~ 2-5 GB/sec
• AES-NI in Hotspot (hardware acceleration)
• Slow warmup in decryption: Bug report JDK-8201633
– can be bypassed with a workaround
– if you know someone who knows someone in Java Hotspot team…
Not a Benchmark
Spark Encryption Performance
• Local mode, 1 executor: 8 cores, 32GB memory
• Storage: Local RamDisk
• Write / encryption test
– load carEvents.csv files (10GB)
– cache in Spark memory!
– write as parquet{.encrypted}: measure time (of second write)
• Query / decryption test
– ‘read’ carEvents.parquet{.encrypted} files (2.7GB)
– run “speeding at night” query: measure time
Spark Encryption Performance
• Typical use cases: only 5-10% of columns are sensitive (encrypted)
• On Java 8, if needed, use GCM_CTR instead of GCM
• Next Java version in Spark will support AES hardware acceleration
• Encryption is not a bottleneck
– app workload, data I/O, encoding, compression
Test                    Write (sec)   Query (sec)   Notes
no encryption           26            2             query on 4 columns: input ~12% of data
encryption (GCM)        28.5          2.5           GCM on data and metadata
encryption (GCM_CTR)    26.8          2.2           CTR on data, GCM on metadata
To Be Benchmarked!
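The relative overhead implied by the table can be worked out directly. The timings are the ones reported on the slide; the `overhead_pct` helper is only illustrative arithmetic, and the slide itself cautions that these numbers are not a benchmark.

```python
# Timings (seconds) as reported in the table above.
timings = {
    "no encryption": {"write": 26.0, "query": 2.0},
    "GCM": {"write": 28.5, "query": 2.5},
    "GCM_CTR": {"write": 26.8, "query": 2.2},
}


def overhead_pct(encrypted: float, baseline: float) -> float:
    """Relative slowdown of the encrypted run versus the baseline, in percent."""
    return round((encrypted - baseline) / baseline * 100, 1)


baseline = timings["no encryption"]
overheads = {
    mode: {op: overhead_pct(t[op], baseline[op]) for op in ("write", "query")}
    for mode, t in timings.items()
    if mode != "no encryption"
}
print(overheads)
```

On these numbers GCM adds roughly 10% to the write and 25% to the query, while GCM_CTR adds roughly 3% and 10%. The query runs for only about two seconds, so its percentages are sensitive to small timing differences.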
From Prototype to Real Thing
• Spark encryption prototype
– standard Spark code, built with modified Parquet
– encryption is triggered via Hadoop config
• transparent to Spark
• That’s it? Just upgrade to new Parquet version? Not really ..
– partitions!
• exception on encrypted columns? obfuscate or encrypt sensitive partition values?
– Spark support for other things above file format
• TBD!
Parquet encryption: Community work. On track. Feedback welcome!
Spark encryption: Prototype, with gaps. Open challenge.
Backup
Existing Solutions
Approaches compared:
(A) flat file encryption
(B) storage server encryption with range read
(C) storage client encryption with range read

Requirement                                   (A)   (B)   (C)
Preserves analytics performance               no    yes   yes
Works in any storage system                   yes   no    no
Hides data and keys from storage system       yes   no    yes
Supports data integrity verification          yes   yes   no
Enables column access control (column keys)   no    no    no

• Bottom line: Need data protection mechanism built in the file format
Obfuscated Data
• Transportation Safety Agency: no keys for personal data
– where do the worst drivers live?
– anti-speeding ad campaign (“Speeding Kills!”): budget per State?
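Per-cell masking and aggregated anonymization, as listed under PARQUET-1376, might be sketched as follows. The record layout, the `mask` helper and the sample values are all hypothetical; this is not the obfuscation layer's design. It shows a reader without keys for personal data still answering a per-State question from masked and aggregated values.

```python
from collections import Counter


def mask(value: str, visible: int = 4) -> str:
    """Per-cell masking: hide all but the last `visible` characters."""
    hidden = max(len(value) - visible, 0)
    return "*" * hidden + value[hidden:]


# Hypothetical joined records: personal fields plus a speeding flag.
records = [
    {"name": "Driver A", "credit_card": "4000123412341234", "state": "CA", "speeding": True},
    {"name": "Driver B", "credit_card": "4000567856785678", "state": "CA", "speeding": True},
    {"name": "Driver C", "credit_card": "4000999999990000", "state": "NY", "speeding": False},
    {"name": "Driver D", "credit_card": "4000111122223333", "state": "NY", "speeding": True},
]

# View for a reader with no keys for personal data: cells are masked or dropped.
masked_view = [
    {"credit_card": mask(r["credit_card"]), "state": r["state"], "speeding": r["speeding"]}
    for r in records
]

# Aggregated anonymization: only per-State counts are exposed.
speeders_per_state = Counter(r["state"] for r in masked_view if r["speeding"])
print(masked_view[0]["credit_card"], dict(speeders_per_state))
```

The per-State counts are enough to split an ad budget, while names are dropped and card numbers are reduced to their last four digits.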

Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 

Efficient Spark Analytics on Encrypted Data with Gidon Gershinsky

  • 1. Gidon Gershinsky, IBM Research - Haifa gidon@il.ibm.com Efficient Spark Analytics on Encrypted Data #SAISDev14
  • 2. Overview • What problem are we solving? • Parquet modular encryption • Spark encryption with Parquet – connected car scenario: demo screenshots – performance implications of encryption – getting from prototype to real thing
  • 3. What Problem Are We Solving? • Protect sensitive data-at-rest – data confidentiality: encryption – data integrity – in any storage - untrusted, cloud or private, file system, object store, archives • Preserve performance of Spark analytics – advanced data filtering (projection, predicate) with encrypted data • Leverage encryption for fine-grained access control – per-column encryption keys – key-based access in any storage: private -> cloud -> archive
  • 4. Background • EU Horizon2020 research project • EU partners: Adaptant, IT Innovation, OCC, Thales, UDE – usage-based car insurance, social services • Collaboration with UC Berkeley RISELab – Opaque technology • Secure cloud analytics (Spark and H/W Enclaves) – how to protect “data-in-use”? – how to protect “data-at-rest” – without losing analytics efficiency?
  • 5. Apache Parquet • Integral part of Apache Spark • Encoding, compression • Advanced data filtering – columnar projection: skip columns – predicate push down: skip files, or row groups, or data pages • Performance benefits – less data to fetch from storage: I/O, latency – less data to process: CPU, latency • How to protect sensitive Parquet data? – in any storage - keeping performance, supporting column access control, data tamper-proofing etc.
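The row-group skipping described on slide 5 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Parquet's reader: the RowGroup class, its min_speed/max_speed fields and the scan_speed_above function are hypothetical names invented for illustration, and the speed values are made up. It only shows why min/max statistics let a reader avoid fetching data that cannot match a predicate.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with min/max statistics."""
    min_speed: int
    max_speed: int
    speeds: List[int]


def make_row_group(speeds: List[int]) -> RowGroup:
    # Statistics are computed once, at write time.
    return RowGroup(min(speeds), max(speeds), list(speeds))


def scan_speed_above(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for rg in row_groups:
        if rg.max_speed <= threshold:
            # The statistics prove no row can match, so the data is never fetched.
            continue
        groups_read += 1
        matches.extend(s for s in rg.speeds if s > threshold)
    return matches, groups_read


row_groups = [
    make_row_group([40, 55, 60]),
    make_row_group([70, 95, 120]),
    make_row_group([30, 45, 50]),
]
matches, groups_read = scan_speed_above(row_groups, 90)
print(matches, groups_read)
```

In this toy run only one of the three row groups is read. The point made on the following slides is that this kind of skipping still has to work when the statistics themselves are encrypted, which is why the metadata is encrypted as separate modules.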
  • 6. Parquet Modular Encryption • Apache Parquet community work • Full encryption: all data and metadata modules – metadata: indexes (min/max), schema, size and number of secret values, encryption key metadata, etc. – encrypted and tamper-proof • Enables columnar projection and predicate pushdown • Storage server / admin never sees encryption keys or cleartext data – “client-side” encryption (e.g., Spark-side)
  • 7. Parquet Modular Encryption • Works in any storage – supporting range-read • Multiple encryption algorithms – different security and performance requirements • Data integrity verification – AES GCM: “authenticated encryption” – data not tampered with – file not replaced with wrong version • Column access control – encryption with column-specific keys
  • 8. AES Modes: GCM and CTR • AES: symmetric encryption standard supported in – Java and other languages – CPUs (x86 and others): hardware acceleration of AES • GCM (Galois Counter Mode) – encryption – integrity verification • basic mode: make sure data not modified • AAD mode: “additional authenticated data”, e.g. file name with a version/timestamp – prevent replacing with wrong (old) version • CTR (Counter Mode) – encryption only, no integrity verification – useful when AES hardware acceleration is not available (e.g. Java 8)
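The AAD idea on slide 8 (binding a file name and version into the integrity check) can be illustrated with the Python standard library. Python has no built-in AES-GCM, so this sketch swaps in HMAC-SHA256 as a stand-in for the GCM authentication tag and does no encryption at all; in real GCM the tag is computed over the ciphertext together with the AAD. The function names, the payload and the "carEvents.parquet|v1" identifier format are all assumptions made for the example.

```python
import hashlib
import hmac
import os


def make_tag(key: bytes, aad: bytes, payload: bytes) -> bytes:
    """Compute a tag over AAD + payload (HMAC stand-in for a GCM tag)."""
    # Length-prefix the AAD so the AAD/payload boundary is unambiguous.
    message = len(aad).to_bytes(4, "big") + aad + payload
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, aad: bytes, payload: bytes, tag: bytes) -> bool:
    """Constant-time check that the payload and AAD match the stored tag."""
    return hmac.compare_digest(make_tag(key, aad, payload), tag)


key = os.urandom(32)

# Hypothetical module contents and file identity (name plus version).
payload = b"driver=D-1001;speed=126;limit=70"
aad_v1 = b"carEvents.parquet|v1"
aad_v2 = b"carEvents.parquet|v2"

tag_v1 = make_tag(key, aad_v1, payload)

# Basic mode: untouched data verifies, modified data does not.
ok_original = verify(key, aad_v1, payload, tag_v1)
ok_tampered = verify(key, aad_v1, b"driver=D-1001;speed=069;limit=70", tag_v1)

# AAD mode: a reader expecting version 2 rejects a validly tagged version 1 file.
ok_replayed = verify(key, aad_v2, payload, tag_v1)

print(ok_original, ok_tampered, ok_replayed)
```

The third check is the one that plain integrity verification cannot provide: the old file is internally consistent, and only the version carried in the AAD exposes the swap.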
  • 9. Status & Roadmap • PARQUET-1178 – full encryption functionality layer, with API access – format definition PR merged in Apache Parquet feature branch – Java and C++ implementation • one PR merged, others being reviewed • Next: tools on top ➢ PARQUET-1373: key management tools ▪ number of different mechanisms ▪ options for storing key material ➢ PARQUET-1376: data obfuscation layer ▪ per-cell masking and aggregated anonymization ▪ reader alerts and choice of obfuscated data ➢ PARQUET-1396: schema service interface ▪ “no-API” interface to encryption ▪ all parameters fetched via loaded class ➢ Demo feature: Property interface ▪ “no-API” interface to encryption ▪ most parameters passed via (Hadoop) config properties ▪ key server access via loaded class
  • 10. Spark Integration Prototype • Standard Spark 2.3.0 is built with Parquet 1.8.2 • Encryption prototype: standard Spark 2.3.0 with Parquet 1.8.2-E – Spark-side encryption and decryption • storage (admin) never sees data key or plain data – Column access control (key-per-column) – Cryptographic verification of data integrity • replaced with wrong version, tampered with – Filtering of encrypted data • column projection, predicate push-down
  • 11. Connected Car Use Case • Car Monitoring Events: car ID (plate #), driver ID (license #), speed, location • Personal Data of Drivers: driver ID (license #), name, home address, email address, credit card • Diagram labels: Gateway, Usage-based Insurance Service, Spark, Parquet • Query: who is speeding at night? (20% and more above limit) • Action: raise premium, notify by email
  • 12. Car Event Schema (slide annotations: secret data, to be encrypted; unencrypted columns)
  • 13. Driver Table Schema (slide annotations: secret data, to be encrypted; masked data)
  • 14. Writing Encrypted Parquet Files (slide annotation: data keys, base64)
  • 15. Writing Encrypted Parquet Files
  • 16. Reading Encrypted Parquet Files (slide annotations: location key: not eligible; credit card key: not eligible; table schema and other metadata are protected)
  • 19. Tampering with Data (slide annotation: missed Jack Wilson, 80% above speed limit!)
  • 20. Parquet Data Integrity Protection
  • 21. Key Management (diagram labels: Spark, Parquet, Client, AUTH Service, Token, KMS) • PARQUET-1373 – randomly generated Data keys – wrapped in KMS with Master keys – storage options for wrapped keys • in file itself: easier to find • in separate location: easier to re-key – compromised Master key – client: specify Master key ID per column
  • 22. IBM Spark-based Services • IBM Analytics Engine – on-demand Spark (and Hadoop) clusters in IBM cloud • Watson Studio Spark Environments – cloud tools for data scientists and application developers – dedicated Spark cluster per Notebook • DB2 Event Store – rapidly ingest, store and analyze streamed data in event-driven applications – analyze with a combination of Spark and C++ engines, store with C++ Parquet • SQL Query Service – serverless scale-out SQL execution on TB of data – Spark SQL engine, running in IBM Cloud
  • 23. Column Access Control with Encryption @Uber • Access control to sensitive columns only • PARQUET-1178 provides fundamentals to encrypt columns • PARQUET-1396 provides schema and key retrieval interface • Currently under development - stay tuned!
  • 24. Encryption Performance (Not a Benchmark) • Ubuntu 16.04 box – 8-core Intel Core i7 6820HQ CPU / 2.70GHz – 32GB memory • Raw Java AES-GCM speed – no Spark or Parquet, just encrypting buffers in loop • 8 threads, 1MB buffers – Java 8: ~ 250 MB/sec – Java 9 and later: ~ 2-5 GB/sec • AES-NI in Hotspot (hardware acceleration) • Slow warmup in decryption: Bug report JDK-8201633 – can be bypassed with a workaround – if you know someone who knows someone in Java Hotspot team…
  • 25. Spark Encryption Performance • Local mode, 1 executor: 8 cores, 32GB memory • Storage: Local RamDisk • Write / encryption test – load carEvents.csv files (10GB) – cache in Spark memory! – write as parquet{.encrypted}: measure time (of second write) • Query / decryption test – ‘read’ carEvents.parquet{.encrypted} files (2.7GB) – run “speeding at night” query: measure time
  • 26. Spark Encryption Performance (To Be Benchmarked!) • Typical use cases - only 5-10% of columns are sensitive (encrypted) • Java 8… If needed, use GCM_CTR instead of GCM • Next Java version in Spark will support AES hardware acceleration • Encryption is not a bottleneck – app workload, data I/O, encoding, compression
      Test | Write (sec) | Query (sec) | Notes
      no encryption | 26 | 2 | query on 4 columns: input ~12% of data
      encryption (GCM) | 28.5 | 2.5 | GCM on data and metadata
      encryption (GCM_CTR) | 26.8 | 2.2 | CTR on data, GCM on metadata
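To put the slide 26 timings in relative terms, the short Python calculation below derives the overhead of each encrypted run against the unencrypted baseline. The timings are copied from the slide; the overhead_pct helper and the dictionary layout are just illustrative choices. Since the slide itself marks these figures as not yet benchmarked, the percentages are rough indications only.

```python
# (write seconds, query seconds) per test, as reported on slide 26.
timings = {
    "no encryption": (26.0, 2.0),
    "encryption (GCM)": (28.5, 2.5),
    "encryption (GCM_CTR)": (26.8, 2.2),
}


def overhead_pct(value: float, baseline: float) -> float:
    """Relative slowdown versus the baseline, in percent, rounded to 0.1."""
    return round(100.0 * (value - baseline) / baseline, 1)


base_write, base_query = timings["no encryption"]

overheads = {
    name: (overhead_pct(write, base_write), overhead_pct(query, base_query))
    for name, (write, query) in timings.items()
    if name != "no encryption"
}

for name, (write_pct, query_pct) in overheads.items():
    print(f"{name}: write +{write_pct}%, query +{query_pct}%")
```

On these numbers GCM adds roughly 10% to the write and 25% to the query, while GCM_CTR adds roughly 3% and 10%. The query percentages rest on absolute differences of a few tenths of a second, so they are sensitive to measurement noise.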
  • 27. From Prototype to Real Thing • Spark encryption prototype – standard Spark code, built with modified Parquet – encryption is triggered via Hadoop config • transparent to Spark • That’s it? Just upgrade to new Parquet version? Not really... – partitions! • exception on encrypted columns? obfuscate or encrypt sensitive partition values? – Spark support for other things above file format • TBD! • Parquet encryption: Community work. On track. Feedback welcome! • Spark encryption: Prototype, with gaps. Open challenge.
  • 29. Existing Solutions • Bottom line: Need data protection mechanism built in the file format
      Property | Flat file encryption | Storage server encryption with range read | Storage client encryption with range read
      Preserves analytics performance | X | V | V
      Works in any storage system | V | X | X
      Hides data and keys from storage system | V | X | V
      Supports data integrity verification | V | V | X
      Enables column access control (column keys) | X | X | X
  • 30. Obfuscated Data (diagram labels: Transportation Safety Agency; no keys for personal data; where do the worst drivers live?; anti-speeding ad campaign: budget per State?; “Speeding Kills!”)