© 2018 Bloomberg Finance L.P. All rights reserved.
Data Gloveboxes: A Philosophy of Data Science Data Security
DataWorks Summit - Barcelona
March 21, 2019
Clay Baenziger
Hadoop Infrastructure
Bloomberg By the Numbers
• Founded in 1981
• 325,000 subscribers in 170 countries
• Over 19,000 employees in 192 locations
— Over 5,000 software engineers
— 100+ machine learning data scientists and engineers
• More News reporters than The New York Times + Washington Post + Chicago Tribune
— News content from 125K+ sources
— >1.5M news stories ingested / published each day (that's 500 news stories ingested/second)
• One of the largest private networks in the world
• 100B+ tick messages per day, with a peak of more than 10 million messages/second
• More than a billion messages (E-Mails and IB chats) processed each day
Nuclear Materials Manufacturing
Image: Office of Legacy Management, U.S. D.O.E., Rocky Flats Plant History & Information Used to Process EEOICPA Claim
Requests. 16 April, 2014
Former U.S. Department of Energy Rocky Flats Plant - South of Boulder, CO
Plutonium Dropbox? Isolation Glovebox?
Dropbox: [n] a container where one can deposit something to be retrieved later
Glovebox: [n] a sealed protective container in which one may safely manipulate a
dangerous substance using gloves attached to holes
Images:
(Top) Office of Legacy Management, U.S. D.O.E., CO-83-M-2 -
Interior view of X-Y retriever. 29 Nov, 1988
(Right) Office of Legacy Management, U.S. D.O.E., CO-83-K-15 -
View of safe geometry station from the inside of an input-output
station. 3 Dec, 1988
Data Dropbox? Data Glovebox?
Dropbox: [n] a data-system where one can deposit a file for later reading and
processing dependent on client (network) location; ideally providing a positive
verification of file contents
Glovebox: [n] a sealed compute environment in which one may safely
manipulate data using restricted access - with strong exfiltration controls
MySQL
Plutonium Enclave
Image: Office of Legacy Management, U.S. D.O.E.,CO-83-M-14 - Downdraft Table, 20 Aug. 2014
Data Enclave
Centralized Models:
• Curator Model: Restrict access, operations and results
— “The curator must remain present throughout the lifetime of the database”
(Dwork, Cynthia. “Differential Privacy: A Survey of Results”, 1, Apr. 2008)
— Statistical Disclosure Control
• Data Enclave:
(Lane and Shipp. “Using a Remote Access Data Enclave for Data Dissemination”. Intl. Journal of Digital Curation. 1.2 (2007))
— Allow for Direct and Exact Access
— Allow Arbitrary Computation
— Prevent Business Automation
Image: Denver Public Library, Rocky Mountain News Photographic Archives, “Rocky Flats
employee handles a robotic arm assembly.” 11 Nov. 1987
Plutonium Glovebox
Data Glovebox
• Leaded Pane of Glass: Remote Desktop Without Download
• Glove Ports: Arbitrary Code Execution (Run code to manipulate)
• Robotics: Workflow Management
— Deployment
— Routine operations
• Pass-throughs:
— Firewalls are insufficient
— Protocol aware deep packet inspection
— Databases
• Firewalls: Ensure user and workload isolation
— Distributed file-systems
— Local file-systems
— Processing
Glovebox Architecture
Copyrights: Git Logo, Jason Long; Python “Two Snakes” Logo, Python Software Foundation
Hadoop Architecture
[Diagram: Client Nodes, a Master Node (HBase Master, YARN Resource Manager, HDFS Namenode), and Cluster Nodes. Each cluster node runs an HBase Region Server, an HDFS Datanode, and a YARN Nodemanager, alongside Map/Reduce, Spark, and novel applications.]
<HTTP REST API>
YARN Job:
• Submission
• Status
• Logs
• App. WebUIs
<HTTP/2 REST API>
WebHDFS:
• GET Methods:
— Open (read)
— GetFileChecksum
• PUT Methods:
— Create
• POST Methods:
— Append
<HBaseProtobuf RPC>
• Get
• Put
• Scan
Handling Material
Image: Office of Legacy Management, U.S. D.O.E., “Rocky Flats Overview”, 20 Aug. 2014. Pg. 13
Properties of Radioactive Materials
Radioactive Materials:
• Can be harmful to people in small quantities
• Can have a very long hazard life if released
• Should be isolated to prevent their spread
• Should be cataloged and characterized to assess harm
• Can still be machined and worked with proper technique
— Robotics
— Personal Protective Equipment
Properties of Material Data
Material Data:
• Can be harmful to people in small quantities
• Can have a very long hazard life if released
• Should be isolated to prevent their spread
• Should be cataloged and characterized to assess harm
• Can still be used and analyzed with proper technique
— Continuous/Automated Deployment
— Workflow Automation and Gloveboxes
Nuclear Material (non)Proliferation
Image: Office of Legacy Management, U.S. D.O.E.,”Rocky Flats Overview”, 20 Aug. 2014. Pg. 59
Data Proliferation
• Terrorists? (Well, certainly hackers...)
• Accidental loss (USB sticks, laptops, etc.)
• No Price-Anderson Act for Data Incidents
— Quite the opposite with GDPR!
— GDPR limits untraceable mixing of data
• Data Sovereignty
— Requires data to remain geographically stationary
— Must move computation to the data
• Data:
— Swamps
— Lineage (Visibility e.g. via Apache Atlas)
— Masking (Curator Model e.g. via Apache Ranger)
Lock Everything Down
Image: Office of Legacy Management, U.S. D.O.E.,”Rocky Flats Overview”, 20 Aug. 2014. Pg. 21
Lock Data Down
Quality Attributes of a Dropbox
• Perimeter Controls (Network Firewall)
• Encryption:
— At rest
— On the wire
• Authentication (Kerberos)
• Client Location Controls Usage (Directionality)
— Data goes in from insecure networks
— Data cannot come back out to an insecure network
— Allow validation of transmission from anywhere
— “Normal” usage from trusted networks
DropboxFilter for HDFS (WebHDFS API)
• Upload:
$ curl -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"
HTTP/1.1 307 TEMPORARY_REDIRECT
$ curl -X PUT -T <File> "http://<DN>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."
HTTP/1.1 201 Created
• Download:
$ curl -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN"
• Checksum:
$ curl -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILECHECKSUM"
{"FileChecksum": {
  "algorithm": "MD5-of-0MD5-of-512CRC32C",
  "bytes": "[...]00eb745ad2f5bd1dccab359b12f7f9411b00000000",
  "length": 28
}}
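The curl calls above pair each WebHDFS operation with a fixed HTTP method. As a minimal sketch, assuming nothing beyond the operations named on the slides, the helper below builds the method and URL for a request without sending it. The function name `webhdfs_request`, the host `nn.example.com`, and port 9870 are illustrative assumptions, not part of the talk.

```python
from urllib.parse import quote, urlencode

# HTTP method for each WebHDFS operation named on the slides.
OP_METHODS = {
    "OPEN": "GET",
    "GETFILECHECKSUM": "GET",
    "CREATE": "PUT",
    "APPEND": "POST",
}


def webhdfs_request(host, port, path, op, **params):
    """Return (http_method, url) for a WebHDFS call; nothing is sent.

    Hypothetical helper for illustration only.
    """
    op = op.upper()
    if op not in OP_METHODS:
        raise ValueError(f"unsupported WebHDFS operation: {op}")
    query = urlencode({"op": op, **params})
    # Percent-encode the path but keep "/" separators intact.
    url = f"http://{host}:{port}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"
    return OP_METHODS[op], url


if __name__ == "__main__":
    # Host and port are placeholders for a real NameNode address.
    print(webhdfs_request("nn.example.com", 9870, "/user/clay/foo", "CREATE"))
    print(webhdfs_request("nn.example.com", 9870, "/user/clay/foo", "GETFILECHECKSUM"))
```

As in the upload example, a CREATE sent to the first URL is answered with a 307 redirect to a datanode; the file body goes in a second PUT to the redirected location.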
DropboxFilter for HDFS (Architecture)
• HDFS Protocols
— Protobuf RPC
— RESTful API over HTTPS/2
• Servlet Based Web Server Design
— Filter
— Request Handler
[Diagram: Client (curl), HTTP Server (Netty), Servlet Container (Jetty), and the WebHDFS Handler. AuthenticationFilter(request…) and DropboxFilter(request…) sit in the request path; two arrows are labelled "Provides User Info".]
DropboxFilter for HDFS (Examples)
<head><title>Error 401 Authentication
required</title></head>
<body><h2>HTTP ERROR 401</h2>
<p>Problem accessing /webhdfs/v1/user/ubuntu/foo. Reason:
<pre>Authentication required</pre></p>
<hr /><i><small>Powered by Jetty://</small></i><br/>
<head><title>Error 403 WebHDFS is configured write-only for
clay</title></head>
<body><h2>HTTP ERROR 403</h2>
<p>Problem accessing /webhdfs/v1/user/clay/foo. Reason:
<pre> WebHDFS is configured write-only for clay</pre></p>
<hr/><i><small>Powered by Jetty://</small></i><br/>
Download from:
https://issues.apache.org/jira/browse/HDFS-14234
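The two error pages above suggest a chain of filters: authentication first, then a location-aware check that only lets data flow inward from untrusted networks. The Python model below is a minimal sketch of that behaviour, not the code attached to HDFS-14234. The trusted network range, the `Request` type, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional, Tuple

# Assumption: which client networks count as trusted.
TRUSTED_NETWORKS = [ip_network("10.0.0.0/8")]

# Operations that move data in, or only verify it, are allowed from anywhere.
INBOUND_OPS = {"CREATE", "APPEND", "GETFILECHECKSUM"}


@dataclass
class Request:
    user: Optional[str]  # None models a request without credentials
    client_ip: str
    op: str
    path: str


Rejection = Optional[Tuple[int, str]]


def authentication_filter(request: Request) -> Rejection:
    """Reject requests that carry no authenticated user."""
    if request.user is None:
        return 401, "Authentication required"
    return None


def dropbox_filter(request: Request) -> Rejection:
    """Allow reads only from trusted networks; writes and checksums from anywhere."""
    address = ip_address(request.client_ip)
    trusted = any(address in network for network in TRUSTED_NETWORKS)
    if trusted or request.op.upper() in INBOUND_OPS:
        return None
    return 403, f"WebHDFS is configured write-only for {request.user}"


def handle(request: Request) -> Tuple[int, str]:
    """Run the filters in order; the first rejection wins."""
    for check in (authentication_filter, dropbox_filter):
        rejection = check(request)
        if rejection is not None:
            return rejection
    return 200, "OK"


if __name__ == "__main__":
    print(handle(Request(None, "203.0.113.5", "OPEN", "/user/ubuntu/foo")))
    print(handle(Request("clay", "203.0.113.5", "OPEN", "/user/clay/foo")))
    print(handle(Request("clay", "203.0.113.5", "CREATE", "/user/clay/foo")))
```

Running authentication first means the location check can always name the user in its error message, which matches the 403 example.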
Isolation Glovebox
Images:
(Left) Library of Congress, U.S. D.O.E.,View Of A Worker Holding A Plutonium 'Button.' 19 Sep. 1973
(Right) Office of Legacy Management, U.S. D.O.E.,CO-83-M-8 - View of foundry induction furnaces. N.D.
Glovebox Production Line
Images:
(Left) Library of Congress, U.S. D.O.E., View Of A Glovebox Line Used In Plutonium Operations. 5 May. 1970
(Right) Office of Legacy Management, U.S. D.O.E., CO-83-M-3 - View of Chainveyor. 25 Jan. 1993
Data Glovebox (Leaded Pane of Glass)
Avoid Overexposure to Raw Data
• Remote Desktop:
— Limited RDP
• Key Attributes:
— No copy out
— No file shares
— Isolation per user
• Useful to Have Tools:
— Web browser for Jupyter/Zeppelin
— SSH client for command-line access
Data Glovebox (Glove Ports & Robotics)
Manipulate Your Data - With Code
• Run on a compute cloud using Apache YARN; submit:
— SQL to Apache Hive
— Python or Scala to Apache Spark
— An arbitrary application
• Automation to ensure consistency (e.g. Apache Oozie)
— A workflow manager for Hive and Spark jobs
— Data transformations for expected reports: known processes generating “decontaminated” results
— Can run as a non-human service account to drop data in a directory for data exfiltration
— Can provide repeatable deployment of code using Git
Data Glovebox (Pass-Through)
Negative pressure (one-way) network; exfiltrate only “decontaminated” data
• Provide a process for data hand-off through an environment
• Firewalls:
— Mostly a transport OSI Layer 4 device (TCP/IP)
— Can do “deep packet inspection” - but need to MITM traffic
— Policy rules for which users can manipulate which data become extensive
— Prohibitively expensive
• Technology Specific:
— DropboxFilter for WebHDFS
— Database RPCs are complex but:
— GRANT INSERT ON DATABASE.* TO write_only@'%';
— GRANT SELECT ON DATABASE.* TO read_only@'%';
— HBase today has no built in client location filtering
Firewalls (Workload and User Isolation)
Don’t let your data spontaneously combust; clean up “chips”
File Systems Leak
• Permission on data sets
• User collaboration locations
• Temporary/failed job data
• Temporary data locations
— Distributed file systems
— Hive Warehouse
— /tmp
— Local file systems
— /tmp, /var/tmp, /dev/shm
Take out the Trash
Image: State of Idaho Oversight Monitor. Nov. 2006. Pg. 10
Private Temporary Directories
• To provide isolation, one can use pam_namespace
• To set up directories and clean up, one can use pam_exec
See also: Our integration of the work in https://github.com/bloomberg/chef-bach/pull/1278
Initial Mount Namespace
/ (inode 2)
  tmp (inode 100)
  polyinst (inode 101)
    tmp_clay (inode 201)
    tmp_foo (inode 201)
  home (inode 300)
User clay’s Mount Namespace
/ (inode 2)
  tmp (inode 201)
  polyinst (inode 101)
    tmp_clay (inode 201)
    tmp_foo (inode 201)
  home (inode 300)
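In the namespaces above, /tmp inside user clay's session is backed by the per-user directory tmp_clay, while other paths are shared. The Python model below is a minimal sketch of that path mapping only: it performs no mounts and is not how pam_namespace is implemented. The prefix `/polyinst/tmp_`, the exempt user list, and the function name `backing_path` are assumptions for illustration.

```python
def backing_path(path, user, prefix="/polyinst/tmp_", exempt=("root",)):
    """Return the directory that actually backs `path` for `user`.

    Toy model of a private /tmp: paths under /tmp map to a per-user
    instance directory; every other path is shared and returned unchanged.
    """
    if user in exempt:
        return path
    if path == "/tmp" or path.startswith("/tmp/"):
        return prefix + user + path[len("/tmp"):]
    return path


if __name__ == "__main__":
    # The same name resolves to different storage for different users.
    print(backing_path("/tmp/job.out", "clay"))
    print(backing_path("/tmp/job.out", "foo"))
    print(backing_path("/home/clay/notes", "clay"))
```

Because each user's temporary files land in a separate directory, one user's leftover job data is never visible through another user's /tmp.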
Keeping the Pipes Flowing
Image: Office of Legacy Management, U.S. D.O.E., CO-83-AF-1 - View of Building 215A. N.D.
YARN Network Isolation (Example)
YARN-7468 - Provide means for container network policy control
[Diagram: YARN nodes, each co-located with an HDFS Data Node. On a node, the YARN Nodemanager and iptables separate User A's novel application (Network Class 1) from User B's Spark job (Network Class 2). External services shown: Database A, Database B, and WebService A.]
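The network-class idea in this example can be modelled as a default-deny lookup from user to network class to permitted destinations. The Python below is a minimal sketch of such a policy table. The slide does not state which class may reach which service, so the mappings, the names, and the function `may_connect` are all hypothetical; real enforcement would happen in the kernel, not in Python.

```python
# Hypothetical policy: destinations each network class may reach.
CLASS_DESTINATIONS = {
    "network-class-1": {"database-a", "webservice-a"},
    "network-class-2": {"database-b"},
}

# Hypothetical assignment of users' containers to network classes.
USER_CLASS = {
    "user_a": "network-class-1",
    "user_b": "network-class-2",
}


def may_connect(user, destination):
    """Default deny: allow only destinations listed for the user's class."""
    network_class = USER_CLASS.get(user)
    if network_class is None:
        return False
    return destination in CLASS_DESTINATIONS.get(network_class, set())


if __name__ == "__main__":
    print(may_connect("user_a", "database-a"))
    print(may_connect("user_b", "database-a"))
    print(may_connect("unknown", "database-b"))
```

Keeping the policy keyed by class rather than by individual container means a new job inherits its user's restrictions without any per-job rules.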
Firewalls Are Important
Images:
(Left) Office of Legacy Management, U.S. D.O.E., CO-83-N-3 - Damaged Filter Plenums. 16 Sept. 1957
(Right) Office of Legacy Management, U.S. D.O.E., CO-83-M-5 - View of a glove box firewall detail. 8 May. 1970
Data Glovebox
• Leaded Pane of Glass: Remote Desktop Without Copy
• Glove Ports: Manipulate your Data at An Arm’s Length
• Robotics: Workflow Management
• Pass-throughs: Negative Pressure to Keep the Bits Flowing
• Firewalls: Ensure User and Workload Isolation
Cleanup Is Messy
Image: CO Dept. of Pub. Health, “Citizen Summary Rocky Flats Historical Public Exposures
Studies 1969 Fire”,
Thank You
Connect with Hadoop Team: hadoop@bloomberg.net

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersDataWorks Summit/Hadoop Summit
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobileDataWorks Summit
 
Hadoop: The Unintended Benefits
Hadoop: The Unintended BenefitsHadoop: The Unintended Benefits
Hadoop: The Unintended BenefitsDataWorks Summit
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performanceDataWorks Summit
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...DataStax
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsDataWorks Summit
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudDataWorks Summit
 
Lightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and DruidLightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and DruidDataWorks Summit
 
Benefits of an Agile Data Fabric for Business Intelligence
Benefits of an Agile Data Fabric for Business IntelligenceBenefits of an Agile Data Fabric for Business Intelligence
Benefits of an Agile Data Fabric for Business IntelligenceDataWorks Summit/Hadoop Summit
 
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data PlatformLessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data PlatformDataWorks Summit
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsDataWorks Summit
 
Exploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyExploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyDataWorks Summit
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?DataWorks Summit
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
 
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column EncryptionProtect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column EncryptionDataWorks Summit
 

Was ist angesagt? (20)

Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China Mobile
 
Hadoop: The Unintended Benefits
Hadoop: The Unintended BenefitsHadoop: The Unintended Benefits
Hadoop: The Unintended Benefits
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloud
 
Shaping a Digital Vision
Shaping a Digital VisionShaping a Digital Vision
Shaping a Digital Vision
 
Lightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and DruidLightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and Druid
 
Benefits of an Agile Data Fabric for Business Intelligence
Benefits of an Agile Data Fabric for Business IntelligenceBenefits of an Agile Data Fabric for Business Intelligence
Benefits of an Agile Data Fabric for Business Intelligence
 
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data PlatformLessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
Exploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyExploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthy
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
 
Log I am your father
Log I am your fatherLog I am your father
Log I am your father
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column EncryptionProtect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
 

Ähnlich wie Data Gloveboxes: A Philosophy of Data Science Data Security

First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
 
Nyc web perf-final-july-23
Nyc web perf-final-july-23Nyc web perf-final-july-23
Nyc web perf-final-july-23Dan Boutin
 
Filtering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingFiltering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingCloud Elements
 
Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIAlluxio, Inc.
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
Instrumenting and Scaling Databases with Envoy
Instrumenting and Scaling Databases with EnvoyInstrumenting and Scaling Databases with Envoy
Instrumenting and Scaling Databases with EnvoyDaniel Hochman
 
巨量資料入門 The evolution of data architecture
巨量資料入門 The evolution of data architecture巨量資料入門 The evolution of data architecture
巨量資料入門 The evolution of data architectureWei-Chiu Chuang
 
Transitioning Geoscience Research to the Cloud: Opportunities and Challenges
Transitioning Geoscience Research to the Cloud: Opportunities and ChallengesTransitioning Geoscience Research to the Cloud: Opportunities and Challenges
Transitioning Geoscience Research to the Cloud: Opportunities and ChallengesAmazon Web Services
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDATAVERSITY
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions Yugabyte
 
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data ExplosionAudax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data Explosionactifio
 
Big Data to SMART Data : Process Scenario
Big Data to SMART Data : Process ScenarioBig Data to SMART Data : Process Scenario
Big Data to SMART Data : Process ScenarioCHAKER ALLAOUI
 
IDERA Live | The Ever Growing Science of Database Migrations
IDERA Live | The Ever Growing Science of Database MigrationsIDERA Live | The Ever Growing Science of Database Migrations
IDERA Live | The Ever Growing Science of Database MigrationsIDERA Software
 
Stream processing for the practitioner: Blueprints for common stream processi...
Stream processing for the practitioner: Blueprints for common stream processi...Stream processing for the practitioner: Blueprints for common stream processi...
Stream processing for the practitioner: Blueprints for common stream processi...Aljoscha Krettek
 
Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline DevelopmentTimothy Spann
 
DEM07 Best Practices for Monitoring Amazon ECS Containers Launched with Fargate
DEM07 Best Practices for Monitoring Amazon ECS Containers Launched with FargateDEM07 Best Practices for Monitoring Amazon ECS Containers Launched with Fargate
DEM07 Best Practices for Monitoring Amazon ECS Containers Launched with FargateAmazon Web Services
 
Container and Kubernetes without limits
Container and Kubernetes without limitsContainer and Kubernetes without limits
Container and Kubernetes without limitsAntje Barth
 

Ähnlich wie Data Gloveboxes: A Philosophy of Data Science Data Security (20)

First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
 
Nyc web perf-final-july-23
Nyc web perf-final-july-23Nyc web perf-final-july-23
Nyc web perf-final-july-23
 
Filtering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingFiltering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media Streaming
 
Big Data and OSS at IBM
Big Data and OSS at IBMBig Data and OSS at IBM
Big Data and OSS at IBM
 
Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AI
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks Presentation
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Mesoscon 2015
Mesoscon 2015Mesoscon 2015
Mesoscon 2015
 
Instrumenting and Scaling Databases with Envoy
Instrumenting and Scaling Databases with EnvoyInstrumenting and Scaling Databases with Envoy
Instrumenting and Scaling Databases with Envoy
 
巨量資料入門 The evolution of data architecture
巨量資料入門 The evolution of data architecture巨量資料入門 The evolution of data architecture
巨量資料入門 The evolution of data architecture
 
Transitioning Geoscience Research to the Cloud: Opportunities and Challenges
Transitioning Geoscience Research to the Cloud: Opportunities and ChallengesTransitioning Geoscience Research to the Cloud: Opportunities and Challenges
Transitioning Geoscience Research to the Cloud: Opportunities and Challenges
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
Data Gloveboxes: A Philosophy of Data Science Data Security

  • 1. © 2018 Bloomberg Finance L.P. All rights reserved. Data Gloveboxes: A Philosophy of Data Science Data Security DataWorks Summit - Barcelona March 21, 2019 Clay Baenziger Hadoop Infrastructure
  • 2. Bloomberg by the Numbers
  • 3. Bloomberg By the Numbers
      • Founded in 1981
      • 325,000 subscribers in 170 countries
      • Over 19,000 employees in 192 locations
        — Over 5,000 software engineers
        — 100+ machine learning data scientists and engineers
      • More news reporters than The New York Times + Washington Post + Chicago Tribune
        — News content from 125K+ sources
        — >1.5M news stories ingested / published each day (that's 500 news stories ingested/second)
      • One of the largest private networks in the world
      • 100B+ tick messages per day, with a peak of more than 10 million messages/second
      • More than a billion messages (e-mails and IB chats) processed each day
  • 4. Nuclear Materials Manufacturing
      Former U.S. Department of Energy Rocky Flats Plant - South of Boulder, CO
      Image: Office of Legacy Management, U.S. D.O.E., Rocky Flats Plant History & Information Used to Process EEOICPA Claim Requests. 16 April, 2014
  • 5. Plutonium Dropbox? Isolation Glovebox?
      Dropbox: [n] a container where one can deposit something to be retrieved later
      Glovebox: [n] a sealed protective container in which one may safely manipulate a dangerous substance using gloves attached to holes
      Images:
        (Top) Office of Legacy Management, U.S. D.O.E., CO-83-M-2 - Interior view of X-Y retriever. 29 Nov, 1988
        (Right) Office of Legacy Management, U.S. D.O.E., CO-83-K-15 - View of safe geometry station from the inside of an input-output station. 3 Dec, 1988
  • 6. Data Dropbox? Data Glovebox?
      Dropbox: [n] a data-system where one can deposit a file for later reading and processing dependent on client (network) location; ideally providing a positive verification of file contents
      Glovebox: [n] a sealed compute environment in which one may safely manipulate data using restricted access - with strong exfiltration controls
      Diagram label: MySQL
  • 7. Plutonium Enclave
      Image: Office of Legacy Management, U.S. D.O.E., CO-83-M-14 - Downdraft Table, 20 Aug. 2014
  • 8. Data Enclave
      Centralized Models:
      • Curator Model: Restrict access, operations and results
        — “The curator must remain present throughout the lifetime of the database” (Dwork, Cynthia. “Differential Privacy: A Survey of Results”, 1, Apr. 2008)
        — Statistical Disclosure Control
      • Data Enclave: (Lane and Shipp. “Using a Remote Access Data Enclave for Data Dissemination”. Intl. Journal of Digital Curation. 1.2 (2007))
        — Allow for Direct and Exact Access
        — Allow Arbitrary Computation
        — Prevent Business Automation
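The curator model on slide 8 can be illustrated with a small sketch. This is not taken from the talk: the `Curator` class, its `min_cell_size` parameter, and the suppression rule are assumptions chosen to show one simple form of statistical disclosure control, where the curator keeps the raw records and releases only counts that are large enough.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


class Curator:
    """Hypothetical curator: holds raw records and answers only count queries."""

    def __init__(self, records: Iterable[dict], min_cell_size: int = 5):
        self._records = list(records)        # raw data stays inside the curator
        self._min_cell_size = min_cell_size  # assumed disclosure threshold

    def count_by(self, field: str) -> Dict[str, Optional[int]]:
        """Return counts per value of `field`; small cells are suppressed as None."""
        counts = Counter(record[field] for record in self._records)
        return {
            value: (n if n >= self._min_cell_size else None)
            for value, n in counts.items()
        }


records = [{"dept": "A"}] * 6 + [{"dept": "B"}] * 2
curator = Curator(records, min_cell_size=5)
result = curator.count_by("dept")  # the small "B" cell is withheld
```

A data enclave, by contrast, would hand the analyst the exact records inside a controlled environment and restrict what can leave it.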
  • 9. Plutonium Glovebox
      Image: Denver Public Library, Rocky Mountain News Photographic Archives, “Rocky Flats employee handles a robotic arm assembly.” 11 Nov. 1987
  • 10. Data Glovebox
      • Leaded Pane of Glass: Remote Desktop Without Download
      • Glove Ports: Arbitrary Code Execution (Run code to manipulate)
      • Robotics: Workflow Management
        — Deployment
        — Routine operations
      • Pass-throughs:
        — Firewalls are insufficient
        — Protocol aware deep packet inspection
        — Databases
      • Firewalls: Ensure user and workload isolation
        — Distributed file-systems
        — Local file-systems
        — Processing
  • 11. Glovebox Architecture
      Copyrights: Git Logo, Jason Long; Python “Two Snakes” Logo, Python Software Foundation
  • 12. Hadoop Architecture
      Diagram components:
      • Client Nodes
      • Master Node: HBase Master, YARN Resource Manager, HDFS Namenode
      • Cluster Nodes: HBase Region Server, HDFS Datanode, YARN Nodemanager (running Map/Reduce, Spark, or a novel application)
      Client-facing interfaces:
      • <HTTP REST API> YARN Job:
        — Submission
        — Status
        — Logs
        — App. WebUIs
      • <HTTP/2 REST API> WebHDFS:
        — GET Methods: Open (read), GetFileChecksum
        — PUT Methods: Create
        — POST Methods: Append
      • <HBase Protobuf RPC>
        — Get
        — Put
        — Scan
  • 13. Handling Material
      Image: Office of Legacy Management, U.S. D.O.E., “Rocky Flats Overview”, 20 Aug. 2014. Pg. 13
  • 14. Properties of Radioactive Materials
      Radioactive Materials:
      • Can be harmful to people in small quantities
      • Can have a very long hazard life if released
      • Should be isolated to prevent their spread
      • Should be cataloged and characterized to assess harm
      • Can still be machined and worked with proper technique
        — Robotics
        — Personal Protective Equipment
  • 15. Properties of Material Data
      Material Data:
      • Can be harmful to people in small quantities
      • Can have a very long hazard life if released
      • Should be isolated to prevent their spread
      • Should be cataloged and characterized to assess harm
      • Can still be used and analyzed with proper technique
        — Continuous/Automated Deployment
        — Workflow Automation and Gloveboxes
  • 16. Nuclear Material (non)Proliferation
      Image: Office of Legacy Management, U.S. D.O.E., “Rocky Flats Overview”, 20 Aug. 2014. Pg. 59
  • 17. Data Proliferation
      • Terrorists? (Well, certainly hackers...)
      • Accidental loss (USB sticks, laptops, etc.)
      • No Price-Anderson Act for Data Incidents
        — Quite the opposite with GDPR!
        — GDPR limits untraceable mixing of data
      • Data Sovereignty
        — Requires data to remain geographically stationary
        — Must move computation to the data
      • Data:
        — Swamps
        — Lineage (Visibility e.g. via Apache Atlas)
        — Masking (Curator Model e.g. via Apache Ranger)
  • 18. Lock Everything Down
      Image: Office of Legacy Management, U.S. D.O.E., “Rocky Flats Overview”, 20 Aug. 2014. Pg. 21
  • 19. Lock Data Down
      Quality Attributes of a Dropbox
      • Perimeter Controls (Network Firewall)
      • Encryption:
        — At rest
        — On the wire
      • Authentication (Kerberos)
      • Client Location Controls Usage (Directionality)
        — Data goes in from insecure networks
        — Data cannot come back out to an insecure network
        — Allow validation of transmission from anywhere
        — “Normal” usage from trusted networks
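The directionality rules on slide 19 can be read as a small decision table. The sketch below is one possible encoding, not the talk's implementation: the `TRUSTED_NETWORKS` range, the grouping of WebHDFS operation names into write, verify, and read sets, and the function names are all assumptions made for illustration.

```python
import ipaddress

# Hypothetical trusted range; a real deployment would supply its own networks.
TRUSTED_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]

# Assumed grouping of operations by the direction the data moves.
WRITE_OPS = {"CREATE", "APPEND"}   # data goes in
VERIFY_OPS = {"GETFILECHECKSUM"}   # validation of a transmission
READ_OPS = {"OPEN"}                # data comes back out


def is_trusted(client_ip: str) -> bool:
    """True if the client address falls inside any trusted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in TRUSTED_NETWORKS)


def is_allowed(client_ip: str, op: str) -> bool:
    """Apply the dropbox rule: reads only from trusted networks."""
    op = op.upper()
    if is_trusted(client_ip):
        return op in WRITE_OPS | VERIFY_OPS | READ_OPS  # "normal" usage
    return op in WRITE_OPS | VERIFY_OPS                 # write and verify only
```

Operations not listed in any set are denied in this sketch, so the policy fails closed when a new operation appears.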
  • 20. DropboxFilter for HDFS (WebHDFS API)
      • Upload:
          $ curl -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"
          HTTP/1.1 307 TEMPORARY_REDIRECT
          $ curl -X PUT -T <File> "http://<DN>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."
          HTTP/1.1 201 Created
      • Download:
          $ curl -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN"
      • Checksum:
          $ curl -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILECHECKSUM"
          {"FileChecksum": {
            "algorithm": "MD5-of-0MD5-of-512CRC32C",
            "bytes": "[...]00eb745ad2f5bd1dccab359b12f7f9411b00000000",
            "length": 28 }}
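The curl calls on slide 20 all share one URL shape, which a client script might build programmatically. The sketch below is an illustration only: the helper names `webhdfs_url` and `parse_checksum`, the host name, and the port are assumptions, and no network request is made. The sample response body is the one shown on the slide, with its elided bytes left as they appear.

```python
import json
from typing import Tuple
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS URL of the form shown on the slide for a given operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def parse_checksum(body: str) -> Tuple[str, str, int]:
    """Extract (algorithm, bytes, length) from a GETFILECHECKSUM response body."""
    checksum = json.loads(body)["FileChecksum"]
    return checksum["algorithm"], checksum["bytes"], checksum["length"]


# Hypothetical host and port, used only to show the resulting URL shape.
create_url = webhdfs_url("namenode.example.com", 9870, "/user/clay/foo", "CREATE")

sample_body = (
    '{"FileChecksum": {"algorithm": "MD5-of-0MD5-of-512CRC32C", '
    '"bytes": "[...]00eb745ad2f5bd1dccab359b12f7f9411b00000000", "length": 28}}'
)
algorithm, checksum_bytes, length = parse_checksum(sample_body)
```

Because the checksum operation stays available from any network, an uploader outside the trusted zone could compare the returned value across two transfers of the same file without ever being able to read the file back.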
  • 21. DropboxFilter for HDFS (Architecture)
      • HDFS Protocols
        — Protobuf RPC
        — RESTful API over HTTPS/2
      • Servlet Based Web Server Design
        — Filter
        — Request Handler
      Diagram labels: Client (curl); HTTP Server (Netty); Servlet Container (Jetty); AuthenticationFilter(request…); DropboxFilter(request…); WebHDFS Handler; “Provides User Info”
  • 22. DropboxFilter for HDFS (Examples)
      Unauthenticated request:
        <head><title>Error 401 Authentication required</title></head>
        <body><h2>HTTP ERROR 401</h2>
        <p>Problem accessing /webhdfs/v1/user/ubuntu/foo. Reason:
        <pre>Authentication required</pre></p>
        <hr /><i><small>Powered by Jetty://</small></i><br/>
      Read attempt by a write-only user:
        <head><title>Error 403 WebHDFS is configured write-only for clay</title></head>
        <body><h2>HTTP ERROR 403</h2>
        <p>Problem accessing /webhdfs/v1/user/clay/foo. Reason:
        <pre> WebHDFS is configured write-only for clay</pre></p>
        <hr/><i><small>Powered by Jetty://</small></i><br/>
      Download from: https://issues.apache.org/jira/browse/HDFS-14234
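The two error pages on slide 22 correspond to two filters rejecting a request before it reaches the WebHDFS handler. The sketch below models that chain in plain Python; it is not the Java code attached to HDFS-14234. The request dictionary, the `WRITE_ONLY_USERS` set, and every function name are assumptions, and the location-based decision is reduced to a per-user lookup to keep the example short.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, str]
Handler = Callable[[Dict[str, str]], Response]

# Hypothetical configuration: users limited to write-only WebHDFS access.
WRITE_ONLY_USERS = {"clay"}
READ_OPS = {"OPEN"}


def webhdfs_handler(request: Dict[str, str]) -> Response:
    """Stand-in for the real handler; reached only if every filter passes."""
    return 200, "OK"


def dropbox_filter(request: Dict[str, str], next_handler: Handler) -> Response:
    """Reject read operations for users configured as write-only."""
    user = request["user"]
    if user in WRITE_ONLY_USERS and request["op"].upper() in READ_OPS:
        return 403, f"WebHDFS is configured write-only for {user}"
    return next_handler(request)


def authentication_filter(request: Dict[str, str], next_handler: Handler) -> Response:
    """Reject requests with no authenticated user; otherwise pass user info on."""
    if not request.get("user"):
        return 401, "Authentication required"
    return next_handler(request)


def handle(request: Dict[str, str]) -> Response:
    """Run the chain: authentication, then dropbox filter, then the handler."""
    return authentication_filter(
        request, lambda r: dropbox_filter(r, webhdfs_handler)
    )
```

Ordering matters in this sketch: the dropbox check depends on the user identity, so it can only run after authentication has supplied it.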
  • 23. Isolation Glovebox
      Images:
        (Left) Library of Congress, U.S. D.O.E., View Of A Worker Holding A Plutonium 'Button.' 19 Sep. 1973
        (Right) Office of Legacy Management, U.S. D.O.E., CO-83-M-8 - View of foundry induction furnaces. N.D.
  • 24. Glovebox Production Line
      Images:
        (Left) Library of Congress, U.S. D.O.E., View Of A Glovebox Line Used In Plutonium Operations. 5 May. 1970
        (Right) Office of Legacy Management, U.S. D.O.E., CO-83-M-3 - View of Chainveyor. 25 Jan. 1993
  • 25. Data Glovebox (Leaded Pane of Glass)
      Avoid Overexposure to Raw Data
      • Remote Desktop:
        — Limited RDP
      • Key Attributes:
        — No copy out
        — No file shares
        — Isolation per user
      • Useful to Have Tools:
        — Web browser for Jupyter/Zeppelin
        — SSH client for command-line access
  • 26. Data Glovebox (Glove Ports & Robotics)
      Manipulate Your Data - With Code
      • Run on a compute cloud using Apache YARN; submit:
        — SQL to Apache Hive
        — Python or Scala to Apache Spark
        — An arbitrary application
      • Automation to ensure consistency (e.g. Apache Oozie)
        — A workflow manager for Hive and Spark jobs
        — Data transformations for expected reports: known processes generating “decontaminated” results
        — Can run as non-human service accounts to drop data in a directory for data exfiltration
        — Can provide repeatable deployment of code using Git
  • 27. Data Glovebox (Pass-Through)
      Negative pressure (one-way) network; exfiltrate only “decontaminated” data
      • Provide a process for data hand-off through an environment
      • Firewalls:
        — Mostly a transport OSI Layer 4 device (TCP/IP)
        — Can do “deep packet inspection” - but need to MITM traffic
        — Policy rules for which users can manipulate which data become extensive
        — Prohibitively expensive
      • Technology Specific:
        — DropboxFilter for WebHDFS
        — Database RPCs are complex but:
            GRANT INSERT ON DATABASE.* TO write_only@'%';
            GRANT SELECT ON DATABASE.* TO read_only@'%';
        — HBase today has no built-in client location filtering
  • 28. Firewalls (Workload and User Isolation): Don’t let your data spontaneously combust; clean up “chips”
      File Systems Leak
      • Permissions on data sets
      • User collaboration locations
      • Temporary/failed job data
      • Temporary data locations:
          — Distributed file systems: Hive Warehouse, /tmp
          — Local file systems: /tmp, /var/tmp, /dev/shm
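One way to find leaked "chips" on a local file system is to walk the temporary locations listed above and report regular files that other users can read or write. The sketch below is an assumption about how such an audit might look, not a tool from the talk; `find_exposed` and `TEMP_ROOTS` are names invented for the example.

```python
import os
import stat

# Local temporary locations named on the slide.
TEMP_ROOTS = ("/tmp", "/var/tmp", "/dev/shm")

def find_exposed(root):
    """Yield regular files under root that are readable or writable by 'other'."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is inaccessible; skip it
            if stat.S_ISREG(mode) and mode & (stat.S_IROTH | stat.S_IWOTH):
                yield path

def audit(roots=TEMP_ROOTS):
    """Return a sorted list of exposed files across all temporary roots."""
    return sorted(p for root in roots for p in find_exposed(root))
```

Calling `audit()` on a shared node lists the candidates; what to do with them (tighten permissions, delete, alert the owner) is a policy decision the sketch leaves open.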
  • 29. Take out the Trash Image: State of Idaho Oversight Monitor. Nov. 2006. Pg. 10
  • 30. Private Temporary Directories
      • To provide isolation, one can use pam_namespace
      • To set up directories and clean up, one can use pam_exec
      See also: Our integration of the work in https://github.com/bloomberg/chef-bach/pull/1278
      Initial Mount Namespace:
          / (inode 2)
              tmp (inode 100)
              polyinst (inode 101)
                  tmp_clay (inode 201)
                  tmp_foo (inode 201)
              home (inode 300)
      User clay’s Mount Namespace:
          / (inode 2)
              tmp (inode 201)
              polyinst (inode 101)
                  tmp_clay (inode 201)
                  tmp_foo (inode 201)
              home (inode 300)
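The set-up and clean-up half of this can be sketched as a small hook that pam_exec runs at session open and close. pam_exec passes the user and the PAM phase to the program through the `PAM_USER` and `PAM_TYPE` environment variables; everything else here (the `/polyinst` location, the `handle_session` name, deleting on close) is an assumption for illustration and may differ from the chef-bach integration linked above. The `tmp_<user>` naming follows the diagram.

```python
import os
import shutil

# Hypothetical parent of the per-user instance directories.
POLYINST_ROOT = "/polyinst"

def handle_session(env, root=POLYINST_ROOT):
    """Create or remove a user's private temp directory based on PAM_TYPE."""
    user = env.get("PAM_USER", "")
    pam_type = env.get("PAM_TYPE", "")
    # Refuse names that could escape the polyinstantiation root.
    if not user or "/" in user or user in (".", ".."):
        raise ValueError("refusing unsafe user name: %r" % user)
    instance = os.path.join(root, "tmp_" + user)
    if pam_type == "open_session":
        os.makedirs(instance, mode=0o700, exist_ok=True)
        os.chmod(instance, 0o700)  # enforce the mode regardless of umask
        # A real hook would also chown the directory to the user (needs root).
    elif pam_type == "close_session":
        shutil.rmtree(instance, ignore_errors=True)
    return instance

# When invoked by pam_exec, the hook would call: handle_session(os.environ)
```

Removing the directory on every `close_session` is only safe if each user has a single session; with concurrent logins a real hook would need to check for remaining sessions before deleting.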
  • 31. Keeping the Pipes Flowing Image: Office of Legacy Management, U.S. D.O.E., CO-83-AF-1 - View of Building 215A. N.D.
  • 32. YARN Network Isolation (Example). YARN-7468 - Provide means for container network policy control. [Diagram: YARN nodes co-located with HDFS DataNodes; on each node the YARN NodeManager and iptables separate containers by network class. Network Class 1: User A, novel application. Network Class 2: User B, Spark. External services shown: Database A, Database B, WebService A.]
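A per-class policy like the one in the diagram could be expressed as iptables rules keyed on a container's network class. The sketch below only builds rule strings; it assumes containers are tagged with a net_cls cgroup class ID and matched with the iptables `cgroup` match, and the class IDs, addresses, and which class may reach which service are all invented for the example rather than read from the slide.

```python
# Hypothetical policy: net_cls class ID -> destination addresses it may reach.
POLICY = {
    0x00100001: ["10.0.1.10"],  # Network Class 1 (e.g., User A's application)
    0x00100002: ["10.0.2.20"],  # Network Class 2 (e.g., User B's Spark job)
}

def build_rules(policy, chain="OUTPUT"):
    """Return iptables commands: allow listed destinations, drop the rest, per class."""
    rules = []
    for classid, destinations in sorted(policy.items()):
        for dest in destinations:
            rules.append(
                f"iptables -A {chain} -m cgroup --cgroup {classid} -d {dest} -j ACCEPT"
            )
        # Anything else from this class is dropped.
        rules.append(f"iptables -A {chain} -m cgroup --cgroup {classid} -j DROP")
    return rules

for rule in build_rules(POLICY):
    print(rule)
```

Generating the rules from one table keeps every node's firewall consistent; a real deployment would also have to allow the traffic that containers need to reach HDFS and YARN itself.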
  • 34. Firewalls Are Important. Images: (Left) Office of Legacy Management, U.S. D.O.E., CO-83-N-3 - Damaged Filter Plenums. 16 Sept. 1957. (Right) Office of Legacy Management, U.S. D.O.E., CO-83-M-5 - View of a glove box firewall detail. 8 May 1970
  • 35. Data Glovebox
      • Leaded Pane of Glass: Remote Desktop Without Copy
      • Glove Ports: Manipulate Your Data at Arm’s Length
      • Robotics: Workflow Management
      • Pass-throughs: Negative Pressure to Keep the Bits Flowing
      • Firewalls: Ensure User and Workload Isolation
  • 36. Cleanup Is Messy Image: CO Dept. of Pub. Health, “Citizen Summary Rocky Flats Historical Public Exposures Studies 1969 Fire”,
  • 37. Thank You. Connect with the Hadoop Team: hadoop@bloomberg.net