Since mid-2016, Spark-as-a-Service has been available to researchers in Sweden from the Rise SICS ICE Data Center at www.hops.site. In this session, Dowling will discuss the challenges in building multi-tenant Spark structured streaming applications on YARN that are metered and easy-to-debug. The platform, called Hopsworks, is in an entirely UI-driven environment built with only open-source software. Learn how they use the ELK stack (Elasticsearch, Logstash and Kibana) for logging and debugging running Spark streaming applications; how they use Grafana and InfluxDB for monitoring Spark streaming applications; and, finally, how Apache Zeppelin can provide interactive visualizations and charts to end-users.
This session will also show how Spark applications are run within a ‘project’ on a YARN cluster with the novel property that Spark applications are metered and charged to projects. Projects are securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In addition, hear about the experiences of their users (over 150 users as of early 2017): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and the novel solutions for helping researchers debug and optimize Spark applications.hear about the experiences of their users (over 150 users as of early 2017): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and the novel solutions for helping researchers debug and optimize Spark applications.afka topics are protected from access by users that are not members of the project. We will also discuss the experiences of our users (over 150 users as of early 2017): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
1. Jim Dowling
Assoc. Prof, Royal Institute of Technology, Stockholm
Structured Spark
Streaming-as-a-Service
with Hopsworks
Hops
@hopshadoop
http://github.com/hopshadoop
http://www.hops.io
2. Spark Streaming-as-a-Service in Sweden
• SICS ICE
Datacenter research environment
• Hopsworks
Spark/Flink/Kafka/Tensorflow-as-a-service
– Built on Hops Hadoop (www.hops.io)
>150 active users
Swedish National
Day Today!
5. Smoking Spark Streaming
A pipe for my data: Kafka
A cloud for my smoke: Hops Hadoop
A way to roll A/B tests: Jupyter/Zeppelin
A lookout for trouble: Grafana/Influx
A log with evidence: ELK Stack
6. Structured Spark Streaming (Hops)
HopsFS
YARN
HopsFS
YARN
InfluxDBInfluxDB
ElasticElastic
Public Cloud or On-PremisePublic Cloud or On-Premise
Parquet
Data Src
Spark Batch Analytics
SQL
Spark StreamingKafka
…...….….
7. General Data Protection Regulation (GDPR)
Ostrich Day: 2018-05-25
http://www.computerweekly.com/news/450295538/D-Day-for-GDPR-is-25-May-2018
9. Proj-42
Projects for Multi-tenancy
A Project is a Grouping of Users and Data
Proj-X
Hive DBKafka Topic hdfs://My/Data
Proj-Company-WideSparkSQL Table
13. Share Datasets/Topics like Dropbox
Spark Summit, Hopsworks, June 2017 13/53
Share any Data Source/Sink: HDFS Datasets, Kafka Topics, etc
14. Custom Python Environments with Conda
Spark Summit, Hopsworks, June 2017 14/53
Python libraries are usable by any framework (Spark/Tensorflow)
15. Case Study: IoT Data Platform
Sensor
Node
Sensor
Node
Sensor
Node
Sensor
Node
Sensor
Node
Sensor
Node
Field Gateway
StorageStorage
AnalysisAnalysis
IngestionIngestion
ACMEACME
Evil CorpEvil Corp
IoT Infrastructure
16. Multi-Cloud IoT Data Platform
ACME DontBeEvil Corp Evil-Corp
AWS Google
Cloud
Oracle
Cloud
User Apps control IoT Devices
IoT Data
Platform
Field Gateway Field Gateway Field Gateway
17. Cloud Native Solution
ACME S3S3
[Authorization]
GCSGCS
IoT Company
IoT Company cedes control
of the data to its Customers
IoT Company cedes control
of the data to its Customers
Spark
Streaming App
23. Tuning Kafka/Spark Streaming
• Key Kafka Tuning Parameters
– Number of Topics
– Number of Partitions per Topic
●
Spark Streaming Tuning Considerations
– Match # of Executors to the # of Partitions
– Ensure balanced data across partitions
24. Realtime Logs
●
YARN aggregates logs on job completion
– No good to us for Streaming
●
Collect logs and make them searchable in real-
time using Logstash, Elasticsearch, and Kibana
– Log4j auto-configured to write to Logstash
http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/
31. SSL/TLS Everywhere
• For each project, a user has a different SSL/TLS
(X.509) certificate.
– Client-side SSL certs for authentication
• Services are also issued with SSL/TLS certificates.
– Kafka performs access control to topics using certs
36. Secure Structured Spark Streaming
• Streaming Apps also need to know:
– Credentials, Endpoints
– monitoring.properties
– How to shutdown gracefully
• The HopsUtil API hides this complexity.
41. Hops Roadmap
• HopsFS
– Multi-Data-Center High Availability
– Small files, 2-Level Erasure Coding
• HopsYARN
– Tensorflow on Spark with GPUs-as-a-Resource
• Open Datasets
– P2P Dataset Sharing
42. Spark Streaming on Hopsworks
• Hopsworks is a new Data Platform built on Hops
• Spark-Streaming is a First Class Citizen
• Built-in Tooling for Spark-Streaming
43. Hops Heads
Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Roberto
Bampi, Fabio Buso, Fanti Machmount Al Samisti, Braulio Grana, Zahin Azher Rashid, Robin
Andersson, ArunaKumari Yedurupaka, Tobias Johansson, August Bonds, Filotas Siskos.
Active:
Alumni:
Vasileios Giannokostas, Johan Svedlund Nordström,Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca,
Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis
Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana
Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari,
Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.