SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Downloaden Sie, um offline zu lesen
Apache Spark on K8s +
HDFS Security
Ilan Filonenko
Agenda
1. Kubernetes intro
2. Big Data on Kubernetes
3. Demo: Spark on K8s accessing secure HDFS
4. Secure HDFS deep dive
5. HDFS running on K8s
6. Data locality deep dive
Kubernetes
New open-source cluster manager.
- github.com/kubernetes/kubernetes
libs
app
kernel
libs
app
libs
app
libs
app
Runs programs in Linux containers.
1600+ contributors and 60,000+ commits.
“My app was running fine
until someone installed
their software”
DON’T
TOUCH
MY
STUFF
More isolation is good
Kubernetes provides each program with:
● a lightweight virtual file system -- Docker image
○ an independent set of S/W packages
● a virtual network interface
○ a unique virtual IP address
○ an entire range of ports
Other isolation layers
● Separate process ID space
● Max memory limit
● CPU share throttling
● Mountable volumes
○ Config files -- ConfigMaps
○ Credentials -- Secrets
○ Local storages -- EmptyDir, HostPath
○ Network storages -- PersistentVolumes
Kubernetes architecture
node A node B
Pod 1 Pod 2 Pod 3
10.0.0.2
196.0.0.5 196.0.0.6
10.0.0.3 10.0.1.2
Pod, a unit of scheduling and isolation.
● runs a user program in a primary container
● holds isolation layers like a virtual IP in an infra container
Big Data on Kubernetes
github.com/apache-spark-on-k8s
● Bloomberg, Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata,
Red Hat, and growing
● Patching up Spark Driver and Executor code to work on Kubernetes.
● Upstreaming. Part of Spark 2.3 --
“Spark release 2.3.0. … Major features: Spark on Kubernetes: [SPARK-18278] A new kubernetes scheduler
backend that supports native submission of spark jobs to a cluster managed by kubernetes. ...”
Related talks: spark-summit.org/2017/events/apache-spark-on-kubernetes/
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63855
Spark on Kubernetes
Spark Core Kubernetes Scheduler Backend
Kubernetes Clusternew executors
remove executors
configuration
• Resource Requests
• Authnz
• Communication with K8s
Spark on Kubernetes
node A node B
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.2
196.0.0.5 196.0.0.6
10.0.0.3 10.0.1.2
Client
Client
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.4 10.0.0.5 10.0.1.3
Job 1
Job 2
What about storage?
Spark on Kubernetes supports cloud storages like S3.
Your data is often stored on HDFS:
node A
node B
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.2
196.0.0.5 196.0.0.6
10.0.0.3 10.0.1.2
Namenode Datanode 1 Datanode 2
● Access remote HDFS running outside Kubernetes
● Run HDFS itself on Kubernetes -- github.com/apache-spark-on-k8s/kubernetes-HDFS
node A
node B
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.2
196.0.0.5 196.0.0.6
10.0.0.3 10.0.1.2
Namenode Datanode 1 Datanode 2
Kerberos
Agenda
1. Kubernetes intro
2. Big Data on Kubernetes
3. Demo: Spark on K8s accessing secure HDFS
4. Secure HDFS deep dive
5. HDFS running on K8s
6. Data locality deep dive
Demo: Spark k8s Accessing Secure HDFS
Running a Spark Job on Kubernetes accessing Secure HDFS
https://github.com/ifilonenko/secure-hdfs-test
Security deep dive
● Kerberos tickets
● HDFS tokens
● Long running jobs
● Access Control of Secrets
User A
encrypted with session key SK1
encrypted with HDFS’ password
encrypted with A’s password
Session 1 Requests/Responses
Kerberos
Server
A’s password
HDFS’ password
HDFS’ password
I’m user A. May I talk to HDFS?
SK1 copy for HDFS
SK1 copy for User A
SK1 copy for HDFS
Ticket to HDFS
Kerberos, simplified
SK1
You guys should talk only if the
other side knows SK1.
I’ll get SK1 to each of you secretly.
I guarantee that the other side is
genuine if they know SK1.
Order # SK1
Customer copy
Order # SK1
Merchant copy
SK1 SK1
HDFS Delegation Token
Kerberos ticket, no good for executors on cluster nodes.
● Stamped with the client IP.
Give tokens to driver and executors instead.
● Issued by namenode only if the client has a valid
Kerberos ticket.
● No client IP stamped.
● Permit for driver and executors to use HDFS on
your behalf across all cluster nodes.
Solved: Share tokens via K8s Secret
node A
node B
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.2
196.0.0.5 196.0.0.6
10.0.0.3 10.0.1.2
Client
Namenode Datanode 1 Datanode 2
Secret 1
Kerberos
Problem: Driver & executors need token
ADMIT
USER
Solved: Refresh tokens with K8s microservice
node A node B
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.2
196.0.0.5 196.0.0.6
10.0.0.3 10.0.1.2
Client
Namenode Datanode 1 Datanode 2
Refresh Pod
10.0.0.8
Secret 1
Kerberos
Problem: Tokens expire
ADMIT
SERVER
Solved: Keep Secret to yourself with K8s RBAC
node A node B
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.2
196.0.0.5 196.0.0.6
10.0.0.3 10.0.1.2
Client
Client
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.4 10.0.0.5 10.0.1.3
Secret 1
Secret 1
Job 1
Job 2
Problem: Secrets can be exposed to others
Access Control of Secrets
HDFS DTs and renewal service keytab in Secrets
Job
owner
human
user
Job
owner’s
pods
Other
human
users
Other
users’
pods
Renew
service
pods
Access
to the
DT
secret
create get none none get,
update
Access
to the
renewal
keytab
secret
none none none none get
Admin can restrict access by:
1. Per-user AC, manual
2. Per-group AC, manual
3. Per-user AC (automated, upcoming)
Demo: Spark k8s Accessing Secure HDFS
Running a Spark Job on Kubernetes accessing Secure HDFS
https://github.com/ifilonenko/secure-hdfs-test
Agenda
1. Kubernetes intro
2. Big Data on Kubernetes
3. Demo: Spark on K8s accessing secure HDFS
4. Secure HDFS deep dive
5. HDFS running on K8s
6. Data locality deep dive
node A
node B
196.0.0.5 196.0.0.6
Namenode Datanode 1
node A
node B
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.2
196.0.0.5 196.0.0.6
10.0.0.3 10.0.1.2
Namenode Datanode 1 Datanode 2
Run HDFS itself on Kubernetes
node A node C
Driver Pod Executor Pod 2
10.0.0.2
196.0.0.5 196.0.0.7
10.0.1.2
Client
Spark
Namenode Pod 1
Datanode Pod 1 Datanode Pod 3
HDFS
HostPath HostPath
github.com/apache-spark-on-k8s/kubernetes-HDFS
196.0.0.6
Executor Pod 1
10.0.0.3
Datanode Pod 2
HostPath
Namenode Pod 2
node B
Persistent
volume 1
Persistent
volume 2
ZK
Pod 1
ZK
Pod 2
JN
Pod 1
ZK
Pod 3
JN
Pod 2
JN
Pod 3
Zookeeper
Journal node
Kerberos
StatefulSet
DaemonSet
active standby
anti pod affinity
Locality deep dive
Send compute to data
● Node locality
● Rack locality
● Where to launch executors
Spark on K8s had to be fixed
Executor 2
node B
Executor 1
node A
Datanode 1 Datanode 2
SLOWFAST
Problem: Node locality broken with virtual pod IPs
Executor Pod 2
10.0.1.2
Driver Executor Pod 1
10.0.0.2 10.0.0.3
Location of fileA == Location of Executor 1
Read /fileA
Read /fileB
/fileA /fileB
node A
196.0.0.5
node B
196.0.0.6
Datanode Pod 1 Datanode Pod 2Namenode Pod
(/fileA → Datanode 1 → 196.0.0.5) == Location of Executor 1(/fileA → Datanode 1 → 196.0.0.5) != (Executor 1 →10.0.0.3)(/fileA → Datanode 1 → 196.0.0.5) == (Executor 1 →10.0.0.3 → 196.0.0.5)
Solved: Node locality
Problem: Rack locality broken with virtual pod IPs
Executor Pod 1
10.0.1.2
Driver
10.0.0.2
Read /fileA
/fileA
node A
196.0.0.5
node B
196.0.0.6
Datanode Pod 1 Datanode Pod 2
(/fileA → Datanode 1 → 196.0.0.5 → Rack 1) != (Executor 1 →10.0.1.2)
Executor Pod 2
10.0.2.2
Read /fileB
/fileB
node C
196.0.1.5
Datanode Pod 3
Rack 1 Rack 2
Rack of fileA == Rack of Executor 1(/fileA → Datanode 1 → 196.0.0.5 → Rack 1) == (Executor 1 →10.0.1.2 → 196.0.0.6 → Rack 1)
SLOW
Solved: Rack locality
Solved: Node preference
Hey K8s, I’d like node A much more for my executors
Driver Executor Pod 1
10.0.0.2 10.0.0.3
/fileA
node A
196.0.0.5
node B
196.0.0.6
Datanode Pod 1 Datanode Pod 2/fileB
Executor Pod 2
10.0.0.4
Node affinity
Rescued data locality!
with data locality fix
- duration: 10 minutes
without data locality fix
- duration: 25 minutes
Thank you
Ilan Filonenko
Appendix
Hadoop Cluster Setup
Launching: single-noded, pseudo-distributed, kerberized HC
https://github.com/ifilonenko/hadoop-kerberos-helm

Weitere ähnliche Inhalte

Was ist angesagt?

Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesMesosphere Inc.
 
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneUnleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneDatabricks
 
Spark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene PangSpark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene PangSpark Summit
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data PlatformShu-Jeng Hsieh
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierDatabricks
 
State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)Nicolas Poggi
 
Spark Summit EU talk by William Benton
Spark Summit EU talk by William BentonSpark Summit EU talk by William Benton
Spark Summit EU talk by William BentonSpark Summit
 
War Stories: DIY Kafka
War Stories: DIY KafkaWar Stories: DIY Kafka
War Stories: DIY Kafkaconfluent
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Chris Fregly
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesDataWorks Summit
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...Spark Summit
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Spark Summit
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Databricks
 
Supporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsSupporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsDatabricks
 
Pedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon InnovationPedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon InnovationJen Aman
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangSpark Summit
 
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...Spark Summit
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Databricks
 

Was ist angesagt? (20)

Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data Services
 
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneUnleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
 
Spark Working Environment in Windows OS
Spark Working Environment in Windows OSSpark Working Environment in Windows OS
Spark Working Environment in Windows OS
 
Spark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene PangSpark Pipelines in the Cloud with Alluxio with Gene Pang
Spark Pipelines in the Cloud with Alluxio with Gene Pang
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
 
State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)State of Spark in the cloud (Spark Summit EU 2017)
State of Spark in the cloud (Spark Summit EU 2017)
 
Spark Summit EU talk by William Benton
Spark Summit EU talk by William BentonSpark Summit EU talk by William Benton
Spark Summit EU talk by William Benton
 
War Stories: DIY Kafka
War Stories: DIY KafkaWar Stories: DIY Kafka
War Stories: DIY Kafka
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
 
Supporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsSupporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined Functions
 
Pedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon InnovationPedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon Innovation
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
 
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 

Ähnlich wie Apache Spark on K8s and HDFS Security

Dayta AI Seminar - Kubernetes, Docker and AI on Cloud
Dayta AI Seminar - Kubernetes, Docker and AI on CloudDayta AI Seminar - Kubernetes, Docker and AI on Cloud
Dayta AI Seminar - Kubernetes, Docker and AI on CloudJung-Hong Kim
 
Scylla on Kubernetes: Introducing the Scylla Operator
Scylla on Kubernetes: Introducing the Scylla OperatorScylla on Kubernetes: Introducing the Scylla Operator
Scylla on Kubernetes: Introducing the Scylla OperatorScyllaDB
 
Scaling Docker with Kubernetes
Scaling Docker with KubernetesScaling Docker with Kubernetes
Scaling Docker with KubernetesCarlos Sanchez
 
How to build a Kubernetes networking solution from scratch
How to build a Kubernetes networking solution from scratchHow to build a Kubernetes networking solution from scratch
How to build a Kubernetes networking solution from scratchAll Things Open
 
Open stackaustinmeetupsept21
Open stackaustinmeetupsept21Open stackaustinmeetupsept21
Open stackaustinmeetupsept21Brent Doncaster
 
Lessons learned and challenges faced while running Kubernetes at Scale
Lessons learned and challenges faced while running Kubernetes at ScaleLessons learned and challenges faced while running Kubernetes at Scale
Lessons learned and challenges faced while running Kubernetes at ScaleSidhartha Mani
 
State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDatainside-BigData.com
 
Metal-k8s presentation by Julien Girardin @ Paris Kubernetes Meetup
Metal-k8s presentation by Julien Girardin @ Paris Kubernetes MeetupMetal-k8s presentation by Julien Girardin @ Paris Kubernetes Meetup
Metal-k8s presentation by Julien Girardin @ Paris Kubernetes MeetupLaure Vergeron
 
[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless Podman[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless PodmanAkihiro Suda
 
Container network security
Container network securityContainer network security
Container network securityDaisuke Nakajima
 
Running a database on local NVMes on Kubernetes
Running a database on local NVMes on KubernetesRunning a database on local NVMes on Kubernetes
Running a database on local NVMes on KubernetesDoKC
 
Running a database on local NVMes on Kubernetes
Running a database on local NVMes on KubernetesRunning a database on local NVMes on Kubernetes
Running a database on local NVMes on KubernetesDoKC
 
Data weekender deploying prod grade sql 2019 big data clusters
Data weekender deploying prod grade sql 2019 big data clustersData weekender deploying prod grade sql 2019 big data clusters
Data weekender deploying prod grade sql 2019 big data clustersChris Adkin
 
Automating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestratorAutomating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestratorAndrew Yongjoon Kong
 
Revolutionizing WSO2 PaaS with Kubernetes & App Factory
Revolutionizing WSO2 PaaS with Kubernetes & App FactoryRevolutionizing WSO2 PaaS with Kubernetes & App Factory
Revolutionizing WSO2 PaaS with Kubernetes & App FactoryImesh Gunaratne
 
Best practices for optimizing Red Hat platforms for large scale datacenter de...
Best practices for optimizing Red Hat platforms for large scale datacenter de...Best practices for optimizing Red Hat platforms for large scale datacenter de...
Best practices for optimizing Red Hat platforms for large scale datacenter de...Jeremy Eder
 
Kubernetes deployment on bare metal with container linux
Kubernetes deployment on bare metal with container linuxKubernetes deployment on bare metal with container linux
Kubernetes deployment on bare metal with container linuxmacchiang
 
Octo talk : docker multi-host networking
Octo talk : docker multi-host networking Octo talk : docker multi-host networking
Octo talk : docker multi-host networking Hervé Leclerc
 

Ähnlich wie Apache Spark on K8s and HDFS Security (20)

Dayta AI Seminar - Kubernetes, Docker and AI on Cloud
Dayta AI Seminar - Kubernetes, Docker and AI on CloudDayta AI Seminar - Kubernetes, Docker and AI on Cloud
Dayta AI Seminar - Kubernetes, Docker and AI on Cloud
 
Scylla on Kubernetes: Introducing the Scylla Operator
Scylla on Kubernetes: Introducing the Scylla OperatorScylla on Kubernetes: Introducing the Scylla Operator
Scylla on Kubernetes: Introducing the Scylla Operator
 
Scaling Docker with Kubernetes
Scaling Docker with KubernetesScaling Docker with Kubernetes
Scaling Docker with Kubernetes
 
How to build a Kubernetes networking solution from scratch
How to build a Kubernetes networking solution from scratchHow to build a Kubernetes networking solution from scratch
How to build a Kubernetes networking solution from scratch
 
Open stackaustinmeetupsept21
Open stackaustinmeetupsept21Open stackaustinmeetupsept21
Open stackaustinmeetupsept21
 
Lessons learned and challenges faced while running Kubernetes at Scale
Lessons learned and challenges faced while running Kubernetes at ScaleLessons learned and challenges faced while running Kubernetes at Scale
Lessons learned and challenges faced while running Kubernetes at Scale
 
State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigData
 
Metal-k8s presentation by Julien Girardin @ Paris Kubernetes Meetup
Metal-k8s presentation by Julien Girardin @ Paris Kubernetes MeetupMetal-k8s presentation by Julien Girardin @ Paris Kubernetes Meetup
Metal-k8s presentation by Julien Girardin @ Paris Kubernetes Meetup
 
[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless Podman[Podman Special Event] Kubernetes in Rootless Podman
[Podman Special Event] Kubernetes in Rootless Podman
 
Container network security
Container network securityContainer network security
Container network security
 
Cloud Native SDN
Cloud Native SDNCloud Native SDN
Cloud Native SDN
 
Running a database on local NVMes on Kubernetes
Running a database on local NVMes on KubernetesRunning a database on local NVMes on Kubernetes
Running a database on local NVMes on Kubernetes
 
Running a database on local NVMes on Kubernetes
Running a database on local NVMes on KubernetesRunning a database on local NVMes on Kubernetes
Running a database on local NVMes on Kubernetes
 
Data weekender deploying prod grade sql 2019 big data clusters
Data weekender deploying prod grade sql 2019 big data clustersData weekender deploying prod grade sql 2019 big data clusters
Data weekender deploying prod grade sql 2019 big data clusters
 
Automating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestratorAutomating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestrator
 
Introduction to Docker
Introduction to DockerIntroduction to Docker
Introduction to Docker
 
Revolutionizing WSO2 PaaS with Kubernetes & App Factory
Revolutionizing WSO2 PaaS with Kubernetes & App FactoryRevolutionizing WSO2 PaaS with Kubernetes & App Factory
Revolutionizing WSO2 PaaS with Kubernetes & App Factory
 
Best practices for optimizing Red Hat platforms for large scale datacenter de...
Best practices for optimizing Red Hat platforms for large scale datacenter de...Best practices for optimizing Red Hat platforms for large scale datacenter de...
Best practices for optimizing Red Hat platforms for large scale datacenter de...
 
Kubernetes deployment on bare metal with container linux
Kubernetes deployment on bare metal with container linuxKubernetes deployment on bare metal with container linux
Kubernetes deployment on bare metal with container linux
 
Octo talk : docker multi-host networking
Octo talk : docker multi-host networking Octo talk : docker multi-host networking
Octo talk : docker multi-host networking
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 

Kürzlich hochgeladen (20)

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 

Apache Spark on K8s and HDFS Security

  • 1. Apache Spark on K8s + HDFS Security Ilan Filonenko
  • 2. Agenda 1. Kubernetes intro 2. Big Data on Kubernetes 3. Demo: Spark on K8s accessing secure HDFS 4. Secure HDFS deep dive 5. HDFS running on K8s 6. Data locality deep dive
  • 3. Kubernetes New open-source cluster manager. - github.com/kubernetes/kubernetes libs app kernel libs app libs app libs app Runs programs in Linux containers. 1600+ contributors and 60,000+ commits.
  • 4. “My app was running fine until someone installed their software” DON’T TOUCH MY STUFF
  • 5. More isolation is good Kubernetes provides each program with: ● a lightweight virtual file system -- Docker image ○ an independent set of S/W packages ● a virtual network interface ○ a unique virtual IP address ○ an entire range of ports
  • 6. Other isolation layers ● Separate process ID space ● Max memory limit ● CPU share throttling ● Mountable volumes ○ Config files -- ConfigMaps ○ Credentials -- Secrets ○ Local storages -- EmptyDir, HostPath ○ Network storages -- PersistentVolumes
  • 7. Kubernetes architecture node A node B Pod 1 Pod 2 Pod 3 10.0.0.2 196.0.0.5 196.0.0.6 10.0.0.3 10.0.1.2 Pod, a unit of scheduling and isolation. ● runs a user program in a primary container ● holds isolation layers like a virtual IP in an infra container
  • 8. Big Data on Kubernetes github.com/apache-spark-on-k8s ● Bloomberg, Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata, Red Hat, and growing ● Patching up Spark Driver and Executor code to work on Kubernetes. ● Upstreaming. Part of Spark 2.3 -- “Spark release 2.3.0. … Major features: Spark on Kubernetes: [SPARK-18278] A new kubernetes scheduler backend that supports native submission of spark jobs to a cluster managed by kubernetes. ...” Related talks: spark-summit.org/2017/events/apache-spark-on-kubernetes/ https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63855
  • 9. Spark on Kubernetes Spark Core Kubernetes Scheduler Backend Kubernetes Clusternew executors remove executors configuration • Resource Requests • Authnz • Communication with K8s
  • 10. Spark on Kubernetes node A node B Driver Pod Executor Pod 1 Executor Pod 2 10.0.0.2 196.0.0.5 196.0.0.6 10.0.0.3 10.0.1.2 Client Client Driver Pod Executor Pod 1 Executor Pod 2 10.0.0.4 10.0.0.5 10.0.1.3 Job 1 Job 2
  • 11. What about storage? Spark on Kubernetes supports cloud storages like S3. Your data is often stored on HDFS: node A node B Driver Pod Executor Pod 1 Executor Pod 2 10.0.0.2 196.0.0.5 196.0.0.6 10.0.0.3 10.0.1.2 Namenode Datanode 1 Datanode 2 ● Access remote HDFS running outside Kubernetes ● Run HDFS itself on Kubernetes -- github.com/apache-spark-on-k8s/kubernetes-HDFS node A node B Driver Pod Executor Pod 1 Executor Pod 2 10.0.0.2 196.0.0.5 196.0.0.6 10.0.0.3 10.0.1.2 Namenode Datanode 1 Datanode 2 Kerberos
  • 12. Agenda 1. Kubernetes intro 2. Big Data on Kubernetes 3. Demo: Spark on K8s accessing secure HDFS 4. Secure HDFS deep dive 5. HDFS running on K8s 6. Data locality deep dive
  • 13. Demo: Spark k8s Accessing Secure HDFS Running a Spark Job on Kubernetes accessing Secure HDFS https://github.com/ifilonenko/secure-hdfs-test
  • 14. Security deep dive ● Kerberos tickets ● HDFS tokens ● Long running jobs ● Access Control of Secrets
  • 15. User A encrypted with session key SK1 encrypted with HDFS’ password encrypted with A’s password Session 1 Requests/Responses Kerberos Server A’s password HDFS’ password HDFS’ password I’m user A. May I talk to HDFS? SK1 copy for HDFS SK1 copy for User A SK1 copy for HDFS Ticket to HDFS Kerberos, simplified SK1 You guys should talk only if the other side knows SK1. I’ll get SK1 to each of you secretly. I guarantee that the other side is genuine if they know SK1. Order # SK1 Customer copy Order # SK1 Merchant copy SK1 SK1
  • 16. HDFS Delegation Token Kerberos ticket, no good for executors on cluster nodes. ● Stamped with the client IP. Give tokens to driver and executors instead. ● Issued by namenode only if the client has a valid Kerberos ticket. ● No client IP stamped. ● Permit for driver and executors to use HDFS on your behalf across all cluster nodes.
  • 17. Solved: Share tokens via K8s Secret node A node B Driver Pod Executor Pod 1 Executor Pod 2 10.0.0.2 196.0.0.5 196.0.0.6 10.0.0.3 10.0.1.2 Client Namenode Datanode 1 Datanode 2 Secret 1 Kerberos Problem: Driver & executors need token ADMIT USER
  • 18. Solved: Refresh tokens with K8s microservice node A node B Driver Pod Executor Pod 1 Executor Pod 2 10.0.0.2 196.0.0.5 196.0.0.6 10.0.0.3 10.0.1.2 Client Namenode Datanode 1 Datanode 2 Refresh Pod 10.0.0.8 Secret 1 Kerberos Problem: Tokens expire ADMIT SERVER
  • 19. Solved: Keep Secret to yourself with K8s RBAC node A node B Driver Pod Executor Pod 1 Executor Pod 2 10.0.0.2 196.0.0.5 196.0.0.6 10.0.0.3 10.0.1.2 Client Client Driver Pod Executor Pod 1 Executor Pod 2 10.0.0.4 10.0.0.5 10.0.1.3 Secret 1 Secret 1 Job 1 Job 2 Problem: Secrets can be exposed to others
  • 20. Access Control of Secrets HDFS DTs and renewal service keytab in Secrets Job owner human user Job owner’s pods Other human users Other users’ pods Renew service pods Access to the DT secret create get none none get, update Access to the renewal keytab secret none none none none get Admin can restrict access by: 1. Per-user AC, manual 2. Per-group AC, manual 3. Per-user AC (automated, upcoming)
  • 21. Demo: Spark k8s Accessing Secure HDFS Running a Spark Job on Kubernetes accessing Secure HDFS https://github.com/ifilonenko/secure-hdfs-test
  • 22. Agenda 1. Kubernetes intro 2. Big Data on Kubernetes 3. Demo: Spark on K8s accessing secure HDFS 4. Secure HDFS deep dive 5. HDFS running on K8s 6. Data locality deep dive node A node B 196.0.0.5 196.0.0.6 Namenode Datanode 1 node A node B Driver Pod Executor Pod 1 Executor Pod 2 10.0.0.2 196.0.0.5 196.0.0.6 10.0.0.3 10.0.1.2 Namenode Datanode 1 Datanode 2
  • 23. Run HDFS itself on Kubernetes node A node C Driver Pod Executor Pod 2 10.0.0.2 196.0.0.5 196.0.0.7 10.0.1.2 Client Spark Namenode Pod 1 Datanode Pod 1 Datanode Pod 3 HDFS HostPath HostPath github.com/apache-spark-on-k8s/kubernetes-HDFS 196.0.0.6 Executor Pod 1 10.0.0.3 Datanode Pod 2 HostPath Namenode Pod 2 node B Persistent volume 1 Persistent volume 2 ZK Pod 1 ZK Pod 2 JN Pod 1 ZK Pod 3 JN Pod 2 JN Pod 3 Zookeeper Journal node Kerberos StatefulSet DaemonSet active standby anti pod affinity
  • 24. Locality deep dive Send compute to data ● Node locality ● Rack locality ● Where to launch executors Spark on K8s had to be fixed Executor 2 node B Executor 1 node A Datanode 1 Datanode 2 SLOWFAST
  • 25. Problem: Node locality broken with virtual pod IPs Executor Pod 2 10.0.1.2 Driver Executor Pod 1 10.0.0.2 10.0.0.3 Location of fileA == Location of Executor 1 Read /fileA Read /fileB /fileA /fileB node A 196.0.0.5 node B 196.0.0.6 Datanode Pod 1 Datanode Pod 2Namenode Pod (/fileA → Datanode 1 → 196.0.0.5) == Location of Executor 1(/fileA → Datanode 1 → 196.0.0.5) != (Executor 1 →10.0.0.3)(/fileA → Datanode 1 → 196.0.0.5) == (Executor 1 →10.0.0.3 → 196.0.0.5) Solved: Node locality
  • 26. Problem: Rack locality broken with virtual pod IPs Executor Pod 1 10.0.1.2 Driver 10.0.0.2 Read /fileA /fileA node A 196.0.0.5 node B 196.0.0.6 Datanode Pod 1 Datanode Pod 2 (/fileA → Datanode 1 → 196.0.0.5 → Rack 1) != (Executor 1 →10.0.1.2) Executor Pod 2 10.0.2.2 Read /fileB /fileB node C 196.0.1.5 Datanode Pod 3 Rack 1 Rack 2 Rack of fileA == Rack of Executor 1(/fileA → Datanode 1 → 196.0.0.5 → Rack 1) == (Executor 1 →10.0.1.2 → 196.0.0.6 → Rack 1) SLOW Solved: Rack locality
  • 27. Solved: Node preference Hey K8s, I’d like node A much more for my executors Driver Executor Pod 1 10.0.0.2 10.0.0.3 /fileA node A 196.0.0.5 node B 196.0.0.6 Datanode Pod 1 Datanode Pod 2/fileB Executor Pod 2 10.0.0.4 Node affinity
  • 28. Rescued data locality! with data locality fix - duration: 10 minutes without data locality fix - duration: 25 minutes
  • 31. Hadoop Cluster Setup Launching: single-noded, pseudo-distributed, kerberized HC https://github.com/ifilonenko/hadoop-kerberos-helm