Google Dataflow Intro


Main concepts of Google Dataflow. Pipelines, Windowing, Triggers, Late Data, etc.

Published in: Software

Sep 2015
Google Dataflow introduction
iglushkov@machinezone.com

What is Google Dataflow
❖ Data processing system: batch and streaming
❖ Set of SDKs
❖ Google Cloud Platform managed services:
  ❖ Google Compute Engine (VMs)
  ❖ Google Cloud Storage (r/w data)
  ❖ BigQuery (r/w data)

Programming Model
❖ Pipeline - entire series of computations
❖ PCollection - set of data in a pipeline
❖ Transform - any data processing operation
❖ Pipeline I/O - data source and data sink APIs

Pipeline
❖ Data + Transforms
❖ Branching + merging
❖ Multiple sources
❖ Unit testing + Integration testing
❖ Pipeline Execution Parameters (local/prod)
❖ Defines where the data comes from, what it looks like, what to do with it, and where to store the results
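
The "data + transforms, branching + merging" idea can be illustrated with a minimal plain-Python sketch. This is not the Dataflow SDK: `read_source`, `apply`, and the in-memory `sink` are hypothetical stand-ins chosen for illustration, and everything runs eagerly on one machine.

```python
from typing import Callable, Iterable, List


def read_source() -> List[str]:
    # Hypothetical source; a real pipeline would read from external storage.
    return ["3", "8", "15", "4"]


def apply(transform: Callable[[Iterable], Iterable], data: Iterable) -> tuple:
    # Each step produces a new immutable collection; the input is not mutated.
    return tuple(transform(data))


# What the data looks like: parse raw strings into integers.
numbers = apply(lambda rows: (int(r) for r in rows), read_source())

# Branching: two transforms consume the same collection.
small = apply(lambda xs: (x for x in xs if x < 5), numbers)
large = apply(lambda xs: (x * 10 for x in xs if x >= 5), numbers)

# Merging: flatten both branches back into one collection.
merged = small + large

# Hypothetical sink; a real pipeline would write to external storage.
sink: List[int] = []
sink.extend(merged)
```

The four stages map onto the last bullet above: the source says where the data comes from, parsing fixes what it looks like, the branches say what to do, and the sink says where to store it.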

PCollection
❖ Represents data in a pipeline from any source
❖ Potentially unbounded (stream)
❖ Serializable, immutable, no random access to elements
❖ Deferred data (may have yet to be computed)
❖ Windowing, triggers

Windowing
❖ Window - a logical subdivision of a PCollection
❖ Each element is assigned to one or more windows
❖ Fixed time windows
❖ Sliding time windows
❖ Per-session windows
❖ Single global window
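
How elements end up in "one or more windows" can be sketched in plain Python. This is an illustration of the window types listed above, not SDK code: the function names are hypothetical, timestamps are integer seconds, and windows are assumed to be half-open intervals `[start, end)`.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end), in seconds


def fixed_window(ts: int, size: int) -> Window:
    # Every timestamp falls into exactly one fixed window.
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    # A new window starts every `period` seconds, so when size > period
    # one timestamp belongs to several overlapping windows.
    last_start = ts - ts % period
    starts = range(last_start, ts - size, -period)
    return [(s, s + size) for s in reversed(starts)]


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    # Elements closer together than `gap` are merged into one session.
    windows: List[Window] = []
    for ts in sorted(timestamps):
        if windows and ts < windows[-1][1]:
            windows[-1] = (windows[-1][0], ts + gap)
        else:
            windows.append((ts, ts + gap))
    return windows
```

For example, with 60-second windows sliding every 30 seconds, an element at second 65 belongs to both `(30, 90)` and `(60, 120)`, while a fixed 60-second window places it only in `(60, 120)`.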

Late Data
❖ Event time / Processing time
❖ No order guarantee
❖ No consistent delta between event time and processing time
❖ Watermark
❖ Late data
❖ Triggers to refine windowing and the time at which data is reported
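
The relationship between event time, the watermark, and late data can be sketched as follows. The watermark heuristic here (largest event time seen so far minus a fixed `allowed_skew`) is an assumption made for illustration; the slides do not say how the service estimates its watermark, and `label_arrivals` is a hypothetical name.

```python
from typing import List, Tuple


def label_arrivals(
    event_times: List[int], window_size: int, allowed_skew: int
) -> List[Tuple[int, str]]:
    """Label each element as on time or late.

    `event_times` lists event timestamps in arrival (processing-time) order.
    """
    watermark = float("-inf")
    labels = []
    for event_time in event_times:
        # End of the fixed window this element belongs to.
        window_end = event_time - event_time % window_size + window_size
        # Late: the watermark had already passed the window's end on arrival.
        status = "late" if watermark >= window_end else "on time"
        labels.append((event_time, status))
        # Assumed heuristic: watermark trails the newest event by a fixed skew.
        watermark = max(watermark, event_time - allowed_skew)
    return labels


labels = label_arrivals([12, 5, 31, 8, 27], window_size=10, allowed_skew=5)
```

In this run the element with event time 5 arrives out of order but is still on time, whereas the element with event time 8 arrives after the watermark has advanced to 26 and is therefore late for its window `[0, 10)`.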

Triggers
❖ When enough data has arrived for a window, emit an aggregated result: a “pane”
❖ Help handle late data
❖ Time-based triggers
❖ Data-driven triggers (e.g. a certain amount of data is enough)
❖ Composite triggers: OR, AND - operations on triggers
❖ Window accumulation modes: accumulate/discard the previous “panes”
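
The difference between the two accumulation modes is easiest to see with a data-driven trigger that fires after a fixed number of elements. The sketch below is plain Python rather than SDK code; `fire_panes` and its parameters are hypothetical names, and a single window is assumed.

```python
from typing import List


def fire_panes(elements: List[int], count: int, accumulate: bool) -> List[List[int]]:
    """Emit a pane each time `count` new elements have arrived in the window."""
    panes: List[List[int]] = []
    fired: List[int] = []    # elements already emitted in earlier panes
    pending: List[int] = []  # elements buffered since the last firing
    for element in elements:
        pending.append(element)
        if len(pending) == count:
            # Accumulating: each pane repeats everything seen so far.
            # Discarding: each pane holds only the new elements.
            pane = fired + pending if accumulate else list(pending)
            panes.append(pane)
            fired = fired + pending
            pending = []
    return panes


discarding = fire_panes([1, 2, 3, 4, 5], count=2, accumulate=False)
accumulating = fire_panes([1, 2, 3, 4, 5], count=2, accumulate=True)
```

With these inputs the discarding mode yields `[[1, 2], [3, 4]]` and the accumulating mode yields `[[1, 2], [1, 2, 3, 4]]`. The fifth element stays buffered because this sketch has no final firing when the window closes.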

Transforms
❖ Math, convert format, grouping, filtering, combining
❖ [PCollection] -> [PCollection]
❖ Core Transforms: ParDo, GroupByKey, Combine, …
❖ Functions with business logic to apply: Serializable, Thread-compatible, Idempotent
❖ Composite Transforms
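
The three core transforms named above can be imitated on in-memory lists to show what each one does. These helpers (`par_do`, `group_by_key`, `combine_per_key`) are hypothetical plain-Python analogues, not the SDK's classes, and they run sequentially rather than in parallel.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple


def par_do(fn: Callable[[Any], Iterable[Any]], elements: Iterable[Any]) -> List[Any]:
    # Element-wise processing: fn maps one input to zero or more outputs.
    return [out for element in elements for out in fn(element)]


def group_by_key(pairs: Iterable[Tuple[Any, Any]]) -> Dict[Any, List[Any]]:
    # Collect all values that share a key.
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def combine_per_key(
    fn: Callable[[List[Any]], Any], grouped: Dict[Any, List[Any]]
) -> Dict[Any, Any]:
    # Reduce each key's values to a single result.
    return {key: fn(values) for key, values in grouped.items()}


# Word count as a composite of the three steps.
lines = ["to be or", "not to be"]
pairs = par_do(lambda line: [(word, 1) for word in line.split()], lines)
counts = combine_per_key(sum, group_by_key(pairs))
```

The function passed to `par_do` has no side effects and depends only on its input, which is the property the slide's "idempotent" requirement is after: re-running it on the same element gives the same output.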

Pipeline I/O
❖ Read/Write from/to external sources
❖ Text Files in Google Cloud Storage or local FS
❖ BigQuery tables
❖ Google Cloud PubSub
❖ Custom Sources and Sinks

Extra
❖ Parallelization, distribution, optimization, scaling
❖ Dataflow monitoring UI and CLI
❖ Logging
❖ Local unit testing of any Fn, plus end-to-end testing
❖ Introspection toolchain
❖ Update toolchain: for code, windowing configs

Questions?
