Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?

Flink Forward 2015

Published in: Technology

  2. INTRODUCTION
     • YARN opened Hadoop to many more developers
       • API to integrate into a Hadoop cluster
       • Flexibility
       • Applications: MR, Tez, Flink, Spark, …
     • Flink has been great at using this opportunity
       • Flexible program execution graph
       • Operators other than Map and Reduce
       • Clean and convenient API
       • Efficient with I/O
  3. EXPECTATIONS FROM YARN
     • New programming models in addition to MapReduce
     • More alternatives to cover cases where the MapReduce paradigm is not a good fit
     • Flexibility in expressing operations on data
     • Elasticity of a cluster
     • Ability to write our own applications to distribute computations across the cluster
  4. DISTRIBUTING COMPUTATIONAL TASKS
     • Writing our own YARN application
       • Complicated
       • Tedious
       • Error-prone
     • Somebody must have done something simpler
       • Apache Twill
         • Still not simple enough
       • Execute CLI tools remotely (if everything else fails)
       • Flink?
  5. FLINK AT RESEARCHGATE
     Lots of benefits:
     • Made MapReduce jobs more readable
       • More compact
       • Less boilerplate code
       • Easier to understand and maintain
     • Got rid of ugly Hive queries and optimised runtime
     • Better and cleaner orchestration of workflow subtasks (previously we had to glue multiple MR jobs together)
     • Iterative machine learning algorithms
     • Distributing computational tasks across a cluster
  7. REAL USE CASE
     A MongoDB to Avro Bridge (aka Avrongo)
     Used to dump live DB data to HDFS for further batch processing and analytics
     • In essence:
       • Reads MongoDB documents
       • Converts them to Avro records (based on a provided Avro schema)
       • Persists them on HDFS
     • Avrongo evolution:
       • Single-threaded program
       • Multi-threaded program talking to different shards in parallel
       • Distributed across the cluster
     • Reasons for distributing:
       • We were CPU-bound
       • HDFS load distribution
  8. HOW DOES AVRONGO WORK?
     Basic version
     • One thread
     • Uses one MongoDB cursor to iterate over the whole collection
     • Suitable for smaller collections
  9. MONGODB SHARDS AND CHUNKS
     Utilizing MongoDB chunks
     • Controlling load on the MongoDB cluster
     • Deterministic way of splitting a collection for input
  10. AVRONGO - SHARDED VERSION
      • Collecting chunk information (chunks are sets of documents living on a particular shard)
      • Processing the chunks of each shard in a separate group of threads
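
The sharded approach above can be sketched in plain Java. This is only an illustration under stated assumptions, not the Avrongo code: `Chunk`, `ShardedDump`, `groupByShard` and `processAll` are hypothetical names, and the real work (reading a chunk's documents, converting them to Avro, writing to HDFS) is replaced by a counter. Giving each shard its own fixed-size pool caps the number of concurrent readers per shard, which is one way to control the load on the MongoDB cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a MongoDB chunk: the shard it lives on plus an id.
final class Chunk {
    final String shard;
    final int id;

    Chunk(String shard, int id) {
        this.shard = shard;
        this.id = id;
    }
}

final class ShardedDump {

    // Groups chunks by the shard they live on (TreeMap keeps the shard order stable).
    static Map<String, List<Chunk>> groupByShard(List<Chunk> chunks) {
        Map<String, List<Chunk>> byShard = new TreeMap<>();
        for (Chunk c : chunks) {
            byShard.computeIfAbsent(c.shard, s -> new ArrayList<>()).add(c);
        }
        return byShard;
    }

    // Processes each shard's chunks in that shard's own fixed-size thread pool
    // and returns the total number of chunks processed.
    static int processAll(List<Chunk> chunks, int threadsPerShard) {
        AtomicInteger processed = new AtomicInteger();
        List<ExecutorService> pools = new ArrayList<>();
        for (List<Chunk> shardChunks : groupByShard(chunks).values()) {
            ExecutorService pool = Executors.newFixedThreadPool(threadsPerShard);
            pools.add(pool);
            for (Chunk c : shardChunks) {
                // Placeholder for the real work on chunk c: read, convert to Avro, persist.
                pool.execute(() -> processed.incrementAndGet());
            }
        }
        for (ExecutorService pool : pools) {
            pool.shutdown();
            try {
                pool.awaitTermination(1, TimeUnit.MINUTES);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for shard pools", e);
            }
        }
        return processed.get();
    }
}
```

For example, six chunks spread over three shards with `threadsPerShard = 2` would be handled by three pools of two threads each.
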
  11. AVRONGO - FLINK VERSION
      • Custom InputFormat that distributes MongoDB chunks uniformly
      • FlatMap operator
      • Number of task nodes = (number of shards) × (parallelism per shard)
      • Custom Generic AvroOutputFormat
      • Slower shards receive a bit more attention
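
To make the task-count formula concrete, here is a minimal sketch of one possible way to spread chunks over tasks. It is an assumption about how "uniformly" could be realised, not the talk's actual InputFormat, and it uses no Flink API: `ChunkSplitPlanner`, `taskCount` and `plan` are hypothetical names. Each shard owns a contiguous block of `parallelismPerShard` tasks, and its chunks are dealt out round-robin within that block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ChunkSplitPlanner {

    // Number of parallel tasks = (number of shards) x (parallelism per shard).
    static int taskCount(int shards, int parallelismPerShard) {
        return shards * parallelismPerShard;
    }

    // chunksByShard maps a shard name to the ids of the chunks living on it.
    // The shard at position s owns tasks [s * p, (s + 1) * p), where p is
    // parallelismPerShard; its chunks are assigned round-robin among those tasks.
    static List<List<String>> plan(Map<String, List<String>> chunksByShard,
                                   int parallelismPerShard) {
        int tasks = taskCount(chunksByShard.size(), parallelismPerShard);
        List<List<String>> perTask = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            perTask.add(new ArrayList<>());
        }
        int shardIndex = 0;
        for (List<String> chunkIds : chunksByShard.values()) {
            int firstTask = shardIndex * parallelismPerShard;
            for (int i = 0; i < chunkIds.size(); i++) {
                perTask.get(firstTask + i % parallelismPerShard).add(chunkIds.get(i));
            }
            shardIndex++;
        }
        return perTask;
    }
}
```

With two shards and a parallelism of 2 per shard this yields four tasks; a shard with chunks a1..a4 would give a1 and a3 to its first task and a2 and a4 to its second. A static plan like this does not by itself give slower shards extra attention; that would need some form of dynamic assignment.
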
  12. FLINK APPROACH
      Outcome
      • No longer CPU-bound
      • Imports to HDFS are faster
        • Some collections: from 6h to 2.5h or from 3.5h to 2h
      Benefits
      • Very few lines of code
      • Same command-line interface (no effort needed to migrate to the Flink-based version)
      • Reusing the same converter as in the standalone versions
      • All orchestration and parallelisation work is done automatically by Flink
  14. HADOOP DISTCP
      • Generates a MapReduce job that copies large amounts of data
      • List of files as input to a Map task
      • Two types of input formats:
        • UniformSizeInputFormat
        • DynamicInputFormat
          • gives more load to faster mappers
          • complicated code
          • utilizes the FS to feed the mappers
      https://hadoop.apache.org/docs/r1.2.1/distcp2.html
  15. FLINK DISTCP
      • Implements the same logic as the DynamicInputFormat of Hadoop's distcp
      • Far fewer lines of code
      • Same runtime as Hadoop distcp
      • Available in the Flink Java examples
      • Not fault-tolerant (yet)
      https://github.com/apache/flink/tree/master/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/distcp
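
The idea of giving more load to faster workers can be illustrated with a pull-based queue. The sketch below is a simplified stand-in, not the Hadoop or Flink distcp code: `DynamicCopySketch` and `run` are hypothetical names, plain threads play the role of mappers or Flink tasks, and the file copy is simulated by a sleep. Because every worker asks for the next file only when it has finished the previous one, a faster worker naturally ends up taking more files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class DynamicCopySketch {

    // Starts one worker per entry in delaysMillis. Each worker repeatedly pulls the
    // next file from a shared queue, "copies" it (simulated by sleeping for its own
    // delay), and counts it. Returns how many files each worker handled.
    static int[] run(List<String> files, long[] delaysMillis) {
        Queue<String> pending = new ConcurrentLinkedQueue<>(files);
        int[] copied = new int[delaysMillis.length];
        List<Thread> workers = new ArrayList<>();
        for (int w = 0; w < delaysMillis.length; w++) {
            final int worker = w;
            Thread t = new Thread(() -> {
                // poll() returns null once the queue is empty, which ends the worker.
                while (pending.poll() != null) {
                    try {
                        Thread.sleep(delaysMillis[worker]); // placeholder for the real copy
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    copied[worker]++; // each worker writes only its own slot
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join(); // after join() the worker's counts are visible to this thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return copied;
    }
}
```

Calling `DynamicCopySketch.run(files, new long[] {1L, 5L})` processes every file exactly once; the worker with the 1 ms delay will typically, though not deterministically, handle most of them.
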
  17. CONCLUSIONS
      • Flink - a thin layer for implementing your YARN application for parallelising independent tasks on the cluster
        • Thanks to custom input formats that are easy to implement
        • No boilerplate code
      Would be nice to have:
      • Elasticity
      • Better progress tracking
      • Fault tolerance
      Custom input format + a Flink operator with business logic = Happiness
  18. QUESTIONS?
      https://www.researchgate.net/careers