
Google Cloud Dataflow

Introduction to Google Cloud DataFlow/Apache Beam


  1. Cloud Dataflow: A Google Service and SDK
  2. The Presenter: Alex Van Boxel, software architect, software engineer, devops, tester, ... at Vente-Exclusive.com. Twitter @alexvb, Plus +AlexVanBoxel, E-mail alex@vanboxel.be, Web http://alex.vanboxel.be
  3. Dataflow: what, how, where?
  4. Dataflow is... A set of SDKs that defines the programming model you use to build your streaming and batch processing pipelines (*). Google Cloud Dataflow is a fully managed service that runs and optimizes your pipeline.
  5. Dataflow where... ETL ● Move ● Filter ● Enrich Analytics ● Streaming Computation ● Batch Computation ● Machine Learning Composition ● Connecting via Pub/Sub ○ Could be other dataflows ○ Could be something else
  6. NoOps Model ● Resources allocated on demand ● Lifecycle managed by the service ● Optimization ● Rebalancing Cloud Dataflow Benefits
  7. Unified Programming Model Unified: Streaming and Batch Open Sourced ● Java 8 implementation ● Python in the works Cloud Dataflow Benefits
  8. Unified Programming Model (streaming) Cloud Dataflow Benefits [diagram: DataFlow (streaming), DataFlow (batch), BigQuery]
  9. Unified Programming Model (stream and backup) Cloud Dataflow Benefits [diagram: DataFlow (streaming), DataFlow (batch), BigQuery]
  10. Unified Programming Model (batch) Cloud Dataflow Benefits [diagram: DataFlow (streaming), DataFlow (batch), BigQuery]
  11. Beam: Apache Incubation
  12. DataFlow SDK becomes Apache Beam Cloud Dataflow Benefits
  13. DataFlow SDK becomes Apache Beam ● Pipeline first, runtime second ○ focus on your data pipelines, not on the characteristics of the runner ● Portability ○ portable across a number of runtime engines ○ choose a runtime based on any number of considerations ■ performance ■ cost ■ scalability Cloud Dataflow Benefits
  14. DataFlow SDK becomes Apache Beam ● Unified model ○ Batch and streaming are integrated into a unified model ○ Powerful semantics, such as windowing, ordering and triggering ● Development tooling ○ the tools you need to create portable data pipelines quickly and easily using open-source languages, libraries and tools Cloud Dataflow Benefits
  15. DataFlow SDK becomes Apache Beam ● Scala Implementation ● Alternative Runners ○ Spark Runner from Cloudera Labs ○ Flink Runner from data Artisans Cloud Dataflow Benefits
  16. DataFlow SDK becomes Apache Beam ● Cloudera ● data Artisans ● Talend ● Cask ● PayPal Cloud Dataflow Benefits
  17. SDK Primitives Autopsy of a dataflow pipeline
  18. Pipeline ● Inputs: Pub/Sub, BigQuery, GCS, XML, JSON, ... ● Transforms: Filter, Join, Aggregate, ... ● Windows: Sliding, Sessions, Fixed, ... ● Triggers: Correctness, ... ● Outputs: Pub/Sub, BigQuery, GCS, XML, JSON, ... Logical Model (see the sketch below)
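     A minimal sketch of that logical model end to end, in the Dataflow 1.x SDK the deck uses. The GCS paths and the empty-line filter are illustrative assumptions, not the deck's own example:

     import com.google.cloud.dataflow.sdk.Pipeline;
     import com.google.cloud.dataflow.sdk.io.TextIO;
     import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
     import com.google.cloud.dataflow.sdk.transforms.DoFn;
     import com.google.cloud.dataflow.sdk.transforms.ParDo;
     import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
     import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
     import org.joda.time.Duration;

     public class LogicalModelSketch {
       public static void main(String[] args) {
         Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
         p.apply(TextIO.Read.from("gs://my-bucket/input/*"))                            // input
          .apply(ParDo.of(new DoFn<String, String>() {                                  // transform
            @Override
            public void processElement(ProcessContext c) {
              if (!c.element().isEmpty()) { c.output(c.element()); }                    // filter empty lines
            }
          }))
          .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))     // window
          .apply(TextIO.Write.to("gs://my-bucket/output/part"));                        // output
         p.run();
       }
     }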
  19. PCollection<T> ● Immutable collection ● Can be bounded or unbounded ● Created by ○ a backing data store ○ a transformation from another PCollection ○ generated in memory ● PCollection<KV<K,V>> Dataflow SDK Primitives (see the sketch below)
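     A small sketch of those creation paths (SDK 1.x; the values and the bucket path are illustrative):

     import com.google.cloud.dataflow.sdk.Pipeline;
     import com.google.cloud.dataflow.sdk.coders.KvCoder;
     import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
     import com.google.cloud.dataflow.sdk.coders.VarLongCoder;
     import com.google.cloud.dataflow.sdk.io.TextIO;
     import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
     import com.google.cloud.dataflow.sdk.transforms.Create;
     import com.google.cloud.dataflow.sdk.values.KV;
     import com.google.cloud.dataflow.sdk.values.PCollection;

     public class PCollectionSketch {
       public static void main(String[] args) {
         Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
         // Backed by a data store (bounded):
         PCollection<String> fromStore = p.apply(TextIO.Read.from("gs://my-bucket/input.txt"));
         // Generated in memory (handy for tests):
         PCollection<String> generated = p.apply(Create.of("a", "b", "c"));
         // A keyed collection, PCollection<KV<K,V>>:
         PCollection<KV<String, Long>> keyed = p.apply(Create.of(KV.of("clicks", 42L))
             .withCoder(KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of())));
         p.run();
       }
     }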
  20. ParDo<I,O> ● Parallel processing of each element in the PCollection ● The developer provides a DoFn<I,O> ● Shards are processed independently of each other Dataflow SDK Primitives (see the sketch below)
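     A minimal DoFn sketch: each element is handled independently, which is what lets shards run in parallel. The length mapping is an illustrative assumption:

     import com.google.cloud.dataflow.sdk.transforms.DoFn;
     import com.google.cloud.dataflow.sdk.transforms.ParDo;

     class LineLengthFn extends DoFn<String, Integer> {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().length()); // one output per input element
       }
     }
     // Applied in parallel over the whole collection:
     // PCollection<Integer> lengths = lines.apply(ParDo.of(new LineLengthFn()));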
  21. ParDo<I,O> Interesting concept of side outputs: a single ParDo can emit extra, tagged output PCollections for sharing data with other branches Dataflow SDK Primitives (see the sketch below)
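     A sketch of side outputs in the 1.x API, given an input PCollection<String> named input; the tags and the validity check are illustrative assumptions:

     import com.google.cloud.dataflow.sdk.transforms.DoFn;
     import com.google.cloud.dataflow.sdk.transforms.ParDo;
     import com.google.cloud.dataflow.sdk.values.PCollection;
     import com.google.cloud.dataflow.sdk.values.PCollectionTuple;
     import com.google.cloud.dataflow.sdk.values.TupleTag;
     import com.google.cloud.dataflow.sdk.values.TupleTagList;

     final TupleTag<String> validTag = new TupleTag<String>() {};
     final TupleTag<String> invalidTag = new TupleTag<String>() {};

     PCollectionTuple results = input.apply(
         ParDo.withOutputTags(validTag, TupleTagList.of(invalidTag))
              .of(new DoFn<String, String>() {
                @Override
                public void processElement(ProcessContext c) {
                  if (c.element().contains("=")) {
                    c.output(c.element());                 // main output
                  } else {
                    c.sideOutput(invalidTag, c.element()); // tagged side output
                  }
                }
              }));
     PCollection<String> valid = results.get(validTag);
     PCollection<String> invalid = results.get(invalidTag);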
  22. PTransform ● Combine small operations into bigger transforms ● Standard transforms (COUNT, SUM, ...) ● Reuse and monitoring Dataflow SDK Primitives
     public class BlockTransform extends PTransform<PCollection<Legacy>, PCollection<Event>> {
       @Override
       public PCollection<Event> apply(PCollection<Legacy> input) {
         return input
             .apply(ParDo.of(new TimestampEvent1Fn()))
             .apply(ParDo.of(new FromLegacy2EventFn())).setCoder(AvroCoder.of(Event.class))
             .apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20))))
             .apply(ParDo.of(new ExtractChannelUserFn()));
       }
     }
  23. Merging ● GroupByKey ○ Takes a PCollection of KV<K,V> and groups the values by key ○ Must be keyed ● Flatten ○ Merges PCollections of the same element type Dataflow SDK Primitives (see the sketch below)
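     Sketches of both merge primitives, given hypothetical collections keyedCounts : PCollection<KV<String, Long>> and first/second : PCollection<String>:

     import com.google.cloud.dataflow.sdk.transforms.Flatten;
     import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
     import com.google.cloud.dataflow.sdk.values.KV;
     import com.google.cloud.dataflow.sdk.values.PCollection;
     import com.google.cloud.dataflow.sdk.values.PCollectionList;

     // GroupByKey: PCollection<KV<K,V>> -> PCollection<KV<K, Iterable<V>>>
     PCollection<KV<String, Iterable<Long>>> grouped =
         keyedCounts.apply(GroupByKey.<String, Long>create());

     // Flatten: merge PCollections of the same element type into one.
     PCollection<String> merged =
         PCollectionList.of(first).and(second).apply(Flatten.<String>pCollections());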
  24. Input/Output Dataflow SDK Primitives
     PipelineOptions options = PipelineOptionsFactory.create();
     Pipeline p = Pipeline.create(options);
     // streamData is unbounded; apply windowing afterward.
     PCollection<String> streamData = p
         .apply(PubsubIO.Read.named("ReadFromPubsub")
             .topic("/topics/my-topic"));

     PipelineOptions options = PipelineOptionsFactory.create();
     Pipeline p = Pipeline.create(options);
     PCollection<TableRow> weatherData = p
         .apply(BigQueryIO.Read.named("ReadWeatherStations")
             .from("clouddataflow-readonly:samples.weather_stations"));
  25. Input/Output ● Multiple Inputs + Outputs Dataflow SDK Primitives
  26. Primitives on steroids Autopsy of a dataflow pipeline
  27. Windows ● Split unbounded collections (e.g. Pub/Sub) ● Work on bounded collections as well Dataflow SDK Primitives
     .apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20))))
     Window.<CAL>into(SlidingWindows.of(Duration.standardMinutes(5)));
  28. Triggers ● Time-based ● Data-based ● Composite Dataflow SDK Primitives
     AfterEach.inOrder(
         AfterWatermark.pastEndOfWindow(),
         Repeatedly.forever(AfterProcessingTime
                 .pastFirstElementInPane()
                 .plusDelayOf(Duration.standardMinutes(10)))
             .orFinally(AfterWatermark
                 .pastEndOfWindow()
                 .plusDelayOf(Duration.standardDays(2))));
  29. Input/Output Dataflow SDK Primitives [diagram: event time vs. processing time (14:21 to 14:25), showing the watermark, one early trigger and two late triggers]
  30. Input/Output Dataflow SDK Primitives [same diagram, next animation step]
  31. Google Cloud Dataflow: The Service
  32. What you need ● Google Cloud SDK ○ utilities to authenticate and manage your project ● Java IDE ○ Eclipse ○ IntelliJ ● Standard Build System ○ Maven ○ Gradle Dataflow SDK Primitives
  33. What you need Dataflow SDK Primitives
     subprojects {
       apply plugin: 'java'
       repositories {
         maven { url "http://jcenter.bintray.com" }
         mavenLocal()
       }
       dependencies {
         compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:1.3.0'
         testCompile 'org.hamcrest:hamcrest-all:1.3'
         testCompile 'org.assertj:assertj-core:3.0.0'
         testCompile 'junit:junit:4.12'
       }
     }
  34. How it works Google Cloud Dataflow Service
     Pipeline p = Pipeline.create(options);
     p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
      .apply(FlatMapElements.via((String word) -> Arrays.asList(word.split("[^a-zA-Z']+")))
          .withOutputType(new TypeDescriptor<String>() {}))
      .apply(Filter.byPredicate((String word) -> !word.isEmpty()))
      .apply(Count.<String>perElement())
      .apply(MapElements
          .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue())
          .withOutputType(new TypeDescriptor<String>() {}))
      .apply(TextIO.Write.to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX"));
  35. Service ● Stage ● Optimize (Flume) ● Execute Google Cloud Dataflow Service (see the sketch below)
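     A hedged sketch of handing a pipeline to the managed service with the 1.x runner; the project and bucket names are placeholders, and args is the program's command-line arguments:

     import com.google.cloud.dataflow.sdk.Pipeline;
     import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
     import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
     import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

     DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
         .withValidation().as(DataflowPipelineOptions.class);
     options.setRunner(DataflowPipelineRunner.class);      // execute on the service, not locally
     options.setProject("my-gcp-project");                 // placeholder project id
     options.setStagingLocation("gs://my-bucket/staging"); // jars are staged here
     Pipeline p = Pipeline.create(options);
     // ... apply transforms ...
     p.run(); // the service stages, optimizes, then executes the job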
  36. Optimization Google Cloud Dataflow Service
  37. Optimization Google Cloud Dataflow Service
  38. Console Google Cloud Dataflow Service
  39. Lifecycle Google Cloud Dataflow Service
  40. Compute Worker Scaling Google Cloud Dataflow Service
  41. Compute Worker Rescaling Google Cloud Dataflow Service
  42. Compute Worker Rebalancing Google Cloud Dataflow Service
  43. Compute Worker Rescaling Google Cloud Dataflow Service
  44. Lifecycle Google Cloud Dataflow Service
  45. BigQuery Google Cloud Dataflow Service
  46. Testability Test-Driven Development Built In
  47. Testability - Built in Dataflow Testability
     KV<String, List<CalEvent>> input = KV.of("", new ArrayList<>());
     input.getValue().add(CalEvent.event("SESSION", "REFERRAL").build());
     input.getValue().add(CalEvent.event("SESSION", "START")
         .data("{\"referral\":{\"Sourc...}").build());

     DoFnTester<KV<?, List<CalEvent>>, CalEvent> fnTester =
         DoFnTester.of(new SessionRepairFn());
     fnTester.processElement(input);

     List<CalEvent> calEvents = fnTester.peekOutputElements();
     assertThat(calEvents).hasSize(2);
     CalEvent referral = CalUtil.findEvent(calEvents, "REFERRAL");
     assertNotNull("REFERRAL should still exist", referral);
     CalEvent start = CalUtil.findEvent(calEvents, "START");
     assertNotNull("START should still exist", start);
     assertNull("START should have referral removed.",
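     Beyond DoFnTester, whole pipelines can be tested in-process. A sketch with TestPipeline and DataflowAssert from the 1.x SDK; the counted input is an illustrative assumption:

     import com.google.cloud.dataflow.sdk.testing.DataflowAssert;
     import com.google.cloud.dataflow.sdk.testing.TestPipeline;
     import com.google.cloud.dataflow.sdk.transforms.Count;
     import com.google.cloud.dataflow.sdk.transforms.Create;
     import com.google.cloud.dataflow.sdk.values.KV;
     import com.google.cloud.dataflow.sdk.values.PCollection;

     TestPipeline p = TestPipeline.create();
     PCollection<KV<String, Long>> counts = p
         .apply(Create.of("a", "b", "a"))
         .apply(Count.<String>perElement());
     DataflowAssert.that(counts).containsInAnyOrder(KV.of("a", 2L), KV.of("b", 1L));
     p.run(); // assertions are verified when the pipeline runs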
  48. Appendix Follow the yellow brick road
  49. Paper - FlumeJava FlumeJava: Easy, Efficient Data-Parallel Pipelines http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf More Dataflow Resources
  50. DataFlow/Beam vs Spark Dataflow/Beam & Spark: A Programming Model Comparison https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison More Dataflow Resources
  51. GitHub - Integrate Dataflow with Luigi Google Cloud integration for Luigi https://github.com/alexvanboxel/luigiext-gcloud More Dataflow Resources
  52. Questions and Answers The last slide Twitter @alexvb, Plus +AlexVanBoxel, E-mail alex@vanboxel.be, Web http://alex.vanboxel.be
