Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by Jim Dowling

1,830 views

Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software.
Spark applications can be deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications run within a project on a YARN cluster, with the novel property that they are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users who are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark streaming applications, how we use Grafana and Graphite for monitoring Spark streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr. Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
To conclude, we will also give an overview of our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.

Published in: Data & Analytics

  1. Spark Streaming-as-a-Service with Kafka and YARN. Jim Dowling, KTH Royal Institute of Technology, Stockholm; Senior Researcher, SICS; CEO, Logical Clocks AB
  2. Spark Streaming-as-a-Service in Sweden • SICS ICE: datacenter research environment • Hopsworks: Spark/Flink/Kafka/Tensorflow/Hadoop-as-a-service – Built on Hops Hadoop (www.hops.io) – >130 active users
  3. Hadoop is not a cool kid anymore!
  4. Hadoop’s Evolution: 2009, 2016, ?
  5. Hadoop’s Evolution: 2009, 2016, ? Tiny Brain (NameNode, ResourceMgr), Huge Body (DataNodes)
  6. Build out Hadoop’s Brain with External Weakly Consistent MetaData Services: Google-Glass Approach to Intelligence
  7. HopsFS (diagram): HDFS Client, NameNodes, NDB, DataNodes. >37X Capacity, >16X Throughput
  8. Larger Brains => Bigger, Faster*: 16x Performance on Spotify Workload. *Usenix FAST 2017, HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases
  9. Larger Brains => More Intelligent*: User-Friendly Concepts. Hopsworks • Projects – Datasets/Files – Topics – Jobs/Notebooks. Hadoop • Clusters • Users • Jobs/Applications • Files • ACLs • Sys Admins • Kerberos. *HMGA2 gene mutations correlated with increased intracranial volume as well as enhanced IQ. http://newsroom.ucla.edu/releases/international-team-uncovers-new-231989 http://www.ibtimes.co.uk/embargoed-8pm-25th-jan-size-matters-brain-size-relative-body-size-indicates-animals-ability-1539994
  10. YARN Spark Streaming Support (Hopsworks) • Apache Kafka • ELK Stack – Real-time Logs • Grafana/InfluxDB – Monitoring. YARN aggregates logs on job completion: http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/
  11. Kafka Self-Service UI: Manage & Share • Topics • ACLs • Avro Schemas
  12. Logs: Elasticsearch, Logstash, Kibana (ELK Stack)
  13. Monitoring/Alerting: InfluxDB and Grafana. metrics.properties: StreamingMetrics.streaming.lastReceivedBatch_records == 0
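The alert condition on slide 13 fires when the last received batch contained no records. A minimal sketch of such a rule in plain Java follows. The class name `StalledStreamAlert` and the idea of requiring several consecutive empty batches before alerting are assumptions made for illustration; they are not taken from Grafana, InfluxDB, or Hopsworks.

```java
import java.util.List;

/** Hypothetical alert rule: fire when the most recent batches all received zero records. */
final class StalledStreamAlert {
    private final int consecutiveEmptyBatches;

    StalledStreamAlert(int consecutiveEmptyBatches) {
        if (consecutiveEmptyBatches < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.consecutiveEmptyBatches = consecutiveEmptyBatches;
    }

    /**
     * @param samples values of lastReceivedBatch_records, oldest first
     * @return true if the newest N samples are all zero, where N is the threshold
     */
    boolean shouldAlert(List<Long> samples) {
        if (samples.size() < consecutiveEmptyBatches) {
            return false; // not enough history to decide
        }
        for (int i = samples.size() - consecutiveEmptyBatches; i < samples.size(); i++) {
            if (samples.get(i) != 0L) {
                return false; // a recent batch still carried data
            }
        }
        return true;
    }
}
```

With a threshold of 1 this reduces to the slide's condition; a larger threshold is one way to avoid alerting on a single quiet batch interval.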
  14. Zeppelin for Prototyping Streaming Apps [https://github.com/knockdata/spark-highcharts]
  15. Debugging Spark with Dr. Elephant • Analyzes Spark Jobs for errors and common problems using pluggable heuristics • Doesn’t show killed jobs • No online support for streaming apps yet
  16. Integration as Microservices in Hopsworks • Project-based Multi-tenancy • Self-Service UI • Simplifying Spark Streaming Apps
  17. Projects in Hopsworks (diagram): Proj-All, Proj-X, Proj-42; Topic, Shared Topic; /Projs/My/Data; CompanyDB
  18. User roles. Data Owner - Import/Export data - Manage Membership - Share DataSets, Topics. Data Scientist - Write and Run code. Self-Service Administration – No Administrator Needed
  19. Notebooks, Data sharing and Quotas • Zeppelin Notebooks in HDFS, Jobs launcher UI. • Sharing is not Copying – Datasets/Topics • Per-Project quotas – Storage in HDFS – CPU in YARN (Uber-style Pricing)
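Slide 19 says CPU use in YARN is charged against a per-project quota with "Uber-style" pricing. The sketch below shows one way such metering could work. It assumes that the price is a multiplier applied to vcore-seconds and that the multiplier rises with cluster load; the class `ProjectQuota` and its methods are hypothetical and do not reflect the actual Hopsworks implementation.

```java
/** Hypothetical per-project CPU quota, charged in credits per vcore-second. */
final class ProjectQuota {
    private double remainingCredits;

    ProjectQuota(double initialCredits) {
        if (initialCredits < 0) {
            throw new IllegalArgumentException("initial credits must not be negative");
        }
        this.remainingCredits = initialCredits;
    }

    /**
     * Charges vcores * seconds * priceMultiplier credits to the project.
     * Returns false and charges nothing if the remaining quota is too small.
     */
    boolean charge(int vcores, long seconds, double priceMultiplier) {
        if (vcores < 0 || seconds < 0 || priceMultiplier < 0) {
            throw new IllegalArgumentException("arguments must not be negative");
        }
        double cost = vcores * seconds * priceMultiplier;
        if (cost > remainingCredits) {
            return false; // quota exhausted: the caller could refuse or stop the job
        }
        remainingCredits -= cost;
        return true;
    }

    double remaining() {
        return remainingCredits;
    }
}
```

For example, a project with 100 credits that runs 2 vcores for 10 seconds at a multiplier of 1.5 is charged 30 credits, and a later request costing 80 credits is refused.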
  20. Dynamic roles (diagram): alice@gmail.com authenticates to ProjectA and ProjectB as ProjectA__alice and ProjectB__alice; HopsFS, YARN, Kafka; SSL/TLS Certificates; Secure Impersonation
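Slide 20 shows one person acting under a different identity in each project, named `ProjectA__alice` and `ProjectB__alice`. A minimal sketch of composing and parsing such names follows. The double-underscore separator comes from the slide; the class `ProjectUser` and its validation rules are assumptions added so that parsing stays unambiguous.

```java
/** Hypothetical helper for project-specific user names such as ProjectA__alice. */
final class ProjectUser {
    static final String SEPARATOR = "__";

    private ProjectUser() {}

    /** Builds the project-specific user name for a member of a project. */
    static String of(String project, String user) {
        if (project.isEmpty() || user.isEmpty()) {
            throw new IllegalArgumentException("project and user must be non-empty");
        }
        // Assumed rule: the project name must not contain the separator or end in '_',
        // otherwise the first separator in the result would not mark the boundary.
        if (project.contains(SEPARATOR) || project.endsWith("_")) {
            throw new IllegalArgumentException("invalid project name: " + project);
        }
        return project + SEPARATOR + user;
    }

    /** Returns the project part of a project-specific user name. */
    static String projectOf(String projectUser) {
        int idx = projectUser.indexOf(SEPARATOR);
        if (idx <= 0) {
            throw new IllegalArgumentException("not a project user: " + projectUser);
        }
        return projectUser.substring(0, idx);
    }

    /** Returns the user part of a project-specific user name. */
    static String userOf(String projectUser) {
        int idx = projectUser.indexOf(SEPARATOR);
        if (idx <= 0 || idx + SEPARATOR.length() >= projectUser.length()) {
            throw new IllegalArgumentException("not a project user: " + projectUser);
        }
        return projectUser.substring(idx + SEPARATOR.length());
    }
}
```

Because access rights attach to the composed name, the same person holds separate permissions in each project, which is what lets a role change when the user switches projects.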
  21. Look Ma, no Kerberos • Each project-specific user is issued an SSL/TLS (X.509) certificate for both authentication and encryption. • Services are also issued SSL/TLS certificates. – Same root CA as user certs
  22. Simplifying Spark Streaming Apps • Spark Streaming Applications need to know – Credentials • Hadoop, Kafka, InfluxDB, Logstash – Endpoints • Kafka Broker, Kafka SchemaRegistry, ResourceManager, NameNode, InfluxDB, Logstash • The HopsUtil API hides this complexity. – Location/security transparent Spark applications
  23. Secure Streaming App with Kafka. Developer and Operations steps: 1. Discover: Schema Registry and Kafka/InfluxDB/ELK Endpoints 2. Create: Kafka Properties file with certs and broker details 3. Create: Producer/Consumer using Kafka Properties 4. Download: the Schema for the Topic from the Schema Registry 5. Distribute: X.509 certs to all hosts on the cluster 6. Cleanup securely. These steps are replaced by calls to the HopsUtil API. https://github.com/hopshadoop/hops-kafka-examples
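Step 2 on slide 23 asks the developer to create Kafka properties containing certificate and broker details. The sketch below builds such a set using only `java.util.Properties`. The property keys are standard Kafka client settings; the helper class `KafkaSslProps`, its signature, and the use of one password for the keystore, key, and truststore are assumptions made for brevity.

```java
import java.util.Properties;

/** Hypothetical helper; not part of HopsUtil or the Kafka client library. */
final class KafkaSslProps {
    private KafkaSslProps() {}

    /** Builds client properties for a TLS-secured Kafka connection. */
    static Properties build(String brokers, String keystorePath,
                            String truststorePath, String password) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokers);
        props.setProperty("security.protocol", "SSL");
        // Client certificate and private key, used for authentication.
        props.setProperty("ssl.keystore.location", keystorePath);
        props.setProperty("ssl.keystore.password", password);
        props.setProperty("ssl.key.password", password);
        // Root CA certificate, used to verify the brokers.
        props.setProperty("ssl.truststore.location", truststorePath);
        props.setProperty("ssl.truststore.password", password);
        return props;
    }
}
```

A caller would pass the result to a Kafka producer or consumer constructor. Every argument here is something the application would otherwise have to discover by itself, which is the work the slide says HopsUtil takes over.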
  24. Streaming Producer in HopsWorks
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        String topic = HopsUtil.getTopic(); // Optional
        SparkProducer producer = HopsUtil.getSparkProducer();
        Map<String, String> message = …
        producer.produce(message);
  25. Streaming Consumer in HopsWorks
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
        String topic = HopsUtil.getTopic(); // Optional
        String consumerGroup = HopsUtil.getConsumerGroup(); // Optional
        SparkConsumer consumer = HopsUtil.getSparkConsumer(jssc);
        JavaInputDStream<ConsumerRecord<String, byte[]>> messages = consumer.createDirectStream();
        jssc.start();
  26. Less code to write https://github.com/hopshadoop/hops-kafka-examples
      Without HopsUtil (Lots of Hard-Coded Endpoints Here!):
        Properties props = new Properties();
        // Broker and schema registry endpoints
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
        props.put(SCHEMA_REGISTRY_URL, restApp.restConnect);
        // Serializers
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroSerializer.class);
        props.put("producer.type", "sync");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "1");
        // Hard-coded keystore path and passwords
        props.put("ssl.keystore.location", "/var/ssl/kafka.client.keystore.jks");
        props.put("ssl.keystore.password", "test1234");
        props.put("ssl.key.password", "test1234");
        ProducerConfig config = new ProducerConfig(props);
        String userSchema = "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\","
            + "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}";
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(userSchema);
        GenericRecord avroRecord = new GenericData.Record(schema);
        avroRecord.put("name", "testUser");
        Producer<String, String> producer = new Producer<String, String>(config);
        ProducerRecord<String, Object> message = new ProducerRecord<>("topicName", avroRecord);
        producer.send(message);
      With HopsUtil (Massively Simplified Code for Secure Spark Streaming/Kafka):
        SparkProducer producer = HopsUtil.getSparkProducer();
        Map<String, String> message = …
        producer.produce(message);
  27. Distributing Certs for Spark Streaming (diagram): Alice@gmail.com, Hopsworks, Distributed Database, YARN Private LocalResources, Spark Streaming App, HopsUtil. 1. Launch Spark Job 2. Get certs, service endpoints 3. YARN Job, config 4. Materialize certs 5. Read Certs 6. Get Schema 7. Consume/Produce 8. Read ACLs for authentication
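Steps 4 and 5 on slide 27 materialize certificates on the host that runs the job so the application can read them, and slide 23 ends with a secure cleanup. The sketch below illustrates that idea locally with `java.nio.file`: it writes keystore bytes to an owner-only file and later overwrites and deletes it. The class `CertMaterializer` is hypothetical, the code assumes a POSIX file system, and it does not show how YARN Private LocalResources actually deliver files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileAttribute;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Hypothetical local illustration of materializing and cleaning up a keystore file. */
final class CertMaterializer {
    private CertMaterializer() {}

    /** Writes the keystore bytes to a new temp file readable and writable by the owner only. */
    static Path materialize(byte[] keystoreBytes) {
        try {
            Set<PosixFilePermission> ownerOnly = PosixFilePermissions.fromString("rw-------");
            FileAttribute<Set<PosixFilePermission>> attr =
                    PosixFilePermissions.asFileAttribute(ownerOnly);
            Path file = Files.createTempFile("keystore", ".jks", attr);
            Files.write(file, keystoreBytes);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Overwrites the file with zeros, then deletes it. Overwriting is best effort:
     * journaling or copy-on-write file systems may keep old blocks.
     */
    static void cleanup(Path file) {
        try {
            if (Files.exists(file)) {
                Files.write(file, new byte[(int) Files.size(file)]);
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Restricting permissions when the file is created, rather than afterwards, avoids a window in which other local users could read the key material.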
  28. Multi-Tenant IoT Scenario (diagram): Sensor Nodes, Field Gateway, IoT Cloud Platform (Ingestion, Storage, Analysis); tenants ACME, Evil Corp, DontBeEvil Corp
  29. IoT Scenario (diagram): ACME, DontBeEvil Corp, Evil-Corp; AWS, Google Cloud, Oracle Cloud; User Apps control IoT Devices; IoT Company: Analyze Data, Data Services for Clients
  30. Cloud-Native Analytics Solution (diagram): ACME, IoT Company; S3, GCS, Oracle [Authorization]; Spark Streaming App. Each customer needs its own Analytics Infrastructure
  31. Hopsworks Solution using Projects (diagram): IoT Company Project, Gateway Topic; ACME Project, ACME Topic, ACME Dataset; Data Stream; Analytics Reports
  32. Hopsworks Solution (diagram): ACME Project; ACME Topic; Spark Streaming App [Authorized]; ACME Dataset; Spark Batch Job; ACME Analytics Reports
  33. Karamel/Chef for Automated Installation: Google Compute Engine, BareMetal
  34. DEMO
  35. Hops Roadmap • HopsFS – HA support for Multi-Data-Center – Small files, 2-Level Erasure Coding • HopsYARN – Tensorflow with isolated GPUs • Hopsworks – P2P Dataset Sharing – Jupyter, Presto, Hive
  36. Summary • Hops is a new distribution of Hadoop – Tinker-friendly and open-source. • Hopsworks provides first-class support for Spark-Streaming-as-a-Service – With support services like Kafka, ELK Stack, Zeppelin, Grafana/InfluxDB.
  37. Hops Team. Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Roberto Bampi, Fabio Buso, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, Robin Andersso, ArunaKumari Yedurupaka, Tobias Johansson, August Bonds, Tiago Brito, Filotas Siskos. Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
  38. Thank You. We totally understand it’s going to be America First Spark Streaming first, but can we take this chance to say Hopsworks second! http://www.hops.io @hopshadoop