Data scientists often have access to very sensitive material: data! Today's data scientists need a way to work with toxic data, where spilling more than a few records could be destructive to a company. Securing compute clusters so that they behave like the nuclear glove boxes of old is one technique to limit data exfiltration and ensure that data production is regularized, reliable and secure.
This talk will cover the philosophy and implementation of:
Data Dropbox: data goes in blindly but can be verified via checksums; data directionality is enforced. HDFS is used as a model, and the state of HBase is discussed.
Data Glovebox: one can manipulate data as desired but cannot exfiltrate it except via very specific, controlled processes; the Oozie Git action is a step in this direction.
4. Nuclear Materials Manufacturing
Image: Office of Legacy Management, U.S. D.O.E., Rocky Flats Plant History & Information Used to Process EEOICPA Claim Requests. 16 April, 2014
Former U.S. Department of Energy Rocky Flats Plant - South of Boulder, CO
5. Plutonium Dropbox? Isolation Glovebox?
Dropbox: [n] a container where one can deposit something to be retrieved later
Glovebox: [n] a sealed protective container in which one may safely manipulate a dangerous substance using gloves attached to holes
Images:
(Top) Office of Legacy Management, U.S. D.O.E., CO-83-M-2 - Interior view of X-Y retriever. 29 Nov, 1988
(Right) Office of Legacy Management, U.S. D.O.E., CO-83-K-15 - View of safe geometry station from the inside of an input-output station. 3 Dec, 1988
6. Data Dropbox? Data Glovebox?
Dropbox: [n] a data system where one can deposit a file for later reading and processing, with access dependent on the client's (network) location; ideally it provides positive verification of file contents
Glovebox: [n] a sealed compute environment in which one may safely manipulate data using restricted access, with strong exfiltration controls
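The "positive verification of file contents" in the dropbox definition can be illustrated with a checksum comparison. The following is a minimal sketch, assuming the depositor keeps a local copy of the file and the dropbox reports back a SHA-256 digest of what it stored; the function names and the choice of SHA-256 are assumptions for illustration and are not taken from the talk.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large deposits do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_deposit(local_file: Path, reported_checksum: str) -> bool:
    """Compare the local digest with the digest reported by the dropbox.

    `reported_checksum` is a hypothetical value returned by the dropbox
    after the write; the depositor never needs read access to the stored file.
    """
    return sha256_of(local_file) == reported_checksum.strip().lower()
```

The depositor only ever compares digests, so the write path can stay one-directional: data goes in, and a checksum is the only thing that comes back.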
32. YARN Network Isolation (Example)
YARN-7468 - Provide means for container network policy control
Diagram: YARN nodes, each co-located with an HDFS Data Node, shown alongside Database A, WebService A and Database B. Network Class 1: User A, Novel Application. Network Class 2: User B, Spark. The YARN Nodemanager is shown with iptables.
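The diagram pairs per-user network classes with iptables on the YARN node. One way to picture such a policy is as an allowlist per class that is rendered into iptables rules. The sketch below is only an illustration under stated assumptions: it assumes each class is identified by a net_cls cgroup classid and matched with the iptables `cgroup` match, and the classid values, destination addresses and the mapping of classes to services are hypothetical. It builds rule strings only and does not run them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class NetworkClass:
    name: str
    classid: int                    # hypothetical net_cls classid for this class
    allowed_hosts: Tuple[str, ...]  # hypothetical destinations this class may reach


def iptables_rules(net_class: NetworkClass) -> List[str]:
    """Render an allowlist as iptables rule strings (not executed here)."""
    match = f"-m cgroup --cgroup {net_class.classid}"
    # One ACCEPT rule per permitted destination for this class.
    rules = [
        f"iptables -A OUTPUT {match} -d {host} -j ACCEPT"
        for host in net_class.allowed_hosts
    ]
    # Everything else from this class is dropped.
    rules.append(f"iptables -A OUTPUT {match} -j DROP")
    return rules


if __name__ == "__main__":
    # Assumed mapping for illustration: class 1 may reach one database only.
    class_1 = NetworkClass("network-class-1", 1048577, ("192.0.2.10",))
    for rule in iptables_rules(class_1):
        print(rule)
```

Ordering matters in this sketch: the ACCEPT rules come first and the final DROP acts as the default for the class, so a container can only reach destinations that were listed explicitly.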