4. About Me: Taro L. Saito
2007 University of Tokyo. Ph.D.
XML DBMS, Transaction Processing
Relational-Style XML Query [SIGMOD 2008]
~ 2014 Assistant Professor at University of Tokyo
Genome Science Research
- Big Data Processing
- Distributed Computing
2014.03~ Treasure Data, Inc., Tokyo
2015.07~ Treasure Data, Inc., Mountain View, CA
8. Cloud Platform for Data Analytics
• Importing 1,000,000~ records / sec.
• Presto (Distributed SQL engine)
• 50,000~ queries / day
• Processing 10 trillion records / day
• http://qiita.com/xerial/items/a9093b60062f2c613fda
[Figure: platform overview with the stages Import, Store, Analyze with Presto/Hive (Distributed SQL Engine), and Export; outputs labeled Enterprise Data and BI]
9. Workflow Fundamental Features
• Dependency management
• task1 -> task2 -> task3 …
• Scheduling
• Execution monitoring
• State management
• Error handling
• Easy access to logs
• Notification
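The features listed above can be sketched in a few lines. This is a minimal illustration, not the API of any workflow tool: the function name `run_workflow`, the state labels, and the `notify` callback are all hypothetical. It runs tasks in dependency order, records a state per task, retries failures, and skips tasks whose upstream did not succeed.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, deps, max_retries=1, notify=print):
    """Run tasks in dependency order (hypothetical helper, for illustration).

    tasks: dict of task name -> zero-argument callable
    deps:  dict of task name -> set of task names it depends on
    """
    graph = {name: set(deps.get(name, ())) for name in tasks}
    state = {name: "pending" for name in tasks}           # state management
    # static_order() yields a task only after all of its dependencies
    for name in TopologicalSorter(graph).static_order():  # dependency management
        if any(state[d] != "success" for d in graph[name]):
            state[name] = "skipped"                       # upstream did not succeed
            continue
        for attempt in range(1 + max_retries):            # error handling
            try:
                tasks[name]()
                state[name] = "success"
                break
            except Exception as e:
                state[name] = "failed"
                notify(f"{name} failed (attempt {attempt + 1}): {e}")  # notification
    return state


# task1 -> task2 -> task3
executed = []
tasks = {f"task{i}": (lambda i=i: executed.append(f"task{i}")) for i in (1, 2, 3)}
deps = {"task2": {"task1"}, "task3": {"task2"}}
final_state = run_workflow(tasks, deps)
```

Scheduling and log access are left out; a real workflow manager would also persist `state` so that a run can be resumed.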
11. Dataflow DSL
• Translate this data processing program
• into a cluster computing program
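[Figure: the program A -> f -> B -> g -> C and its cluster version; A is partitioned into A0, A1, A2, map (f) produces B0, B1, B2, and reduce (g) produces C]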
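As a rough sketch of that translation, the program "apply f, then combine with g" can be run as independent per-partition tasks followed by a merge. The helper names `split` and `run_map_reduce` and the partition count are assumptions made for illustration; in a real engine each partition would run on a different node.

```python
from functools import reduce


def split(data, n):
    """Partition a list into n chunks (A -> A0, A1, ...)."""
    return [data[i::n] for i in range(n)]


def run_map_reduce(data, f, g, partitions=3):
    parts_a = split(data, partitions)                          # A0, A1, A2
    parts_b = [[f(x) for x in part] for part in parts_a]       # B0, B1, B2 (map)
    partials = [reduce(g, part) for part in parts_b if part]   # reduce within each partition
    return reduce(g, partials)                                 # merge partial results into C


A = list(range(1, 11))
f = lambda x: x * x
g = lambda a, b: a + b
C = run_map_reduce(A, f, g)   # sum of squares of 1..10
```

This only gives the same answer as the sequential program when g is associative and commutative, because partitioning changes the order in which values are combined.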
12. Redbook: Dataflow Engines
• Chapter 5: Large-Scale Dataflow Engines, by Peter Bailis
• http://www.redbook.io/ch5-dataflow.html
• DryadLINQ
• Most influential interface
for dataflow DSL
• SQL-like operation
• Functional style
• Spark
• SparkSQL
• 70% of Spark accesses
• Dataset API
• Shift to the dataframe based API
13. Dataflow -> Execution Plan
• Example - Hive: SQL to MapReduce
• Mapping SQL stages into MapReduce program
• SELECT page, count(*) FROM weblog
GROUP BY page
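[Figure: MapReduce execution of the query; input read from HDFS as A0, A1, A2 (A), intermediate data B0 to B3 (B), stages labeled split, map, reduce, merge, output written to HDFS. Plan: TableScan(weblog) -> GroupBy(hash(page)) -> count(weblog of a page) -> result]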
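The mapping from the query to the three plan stages can be imitated in plain Python. This is a sketch of the idea only, not how Hive generates code; the function names and the sample rows are made up for the example.

```python
from collections import defaultdict


def map_phase(split):
    # TableScan(weblog): emit (page, 1) for every row in one input split
    return [(row["page"], 1) for row in split]


def shuffle(mapped, num_reducers):
    # GroupBy(hash(page)): all pairs with the same page go to the same reducer
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for page, one in pairs:
            buckets[hash(page) % num_reducers][page].append(one)
    return buckets


def reduce_phase(bucket):
    # count(*): sum the ones collected for each page
    return {page: sum(ones) for page, ones in bucket.items()}


# SELECT page, count(*) FROM weblog GROUP BY page
weblog_splits = [
    [{"page": "/index"}, {"page": "/about"}],
    [{"page": "/index"}, {"page": "/index"}],
    [{"page": "/about"}, {"page": "/contact"}],
]
result = {}
for bucket in shuffle([map_phase(s) for s in weblog_splits], num_reducers=2):
    result.update(reduce_phase(bucket))
```

Which reducer receives a given page depends on the hash, but each page lands in exactly one bucket, so the merged result does not depend on that choice.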
15. Hadoop is not enough
• C. Olston et al. [SIGMOD 2011]
• continuous processing
• independent scheduling
• Incremental processing
• Google Percolator [OSDI 2010]
• Naiad - Differential Dataflow
Microsoft [SOSP 2013]
16. Continuous Processing
• The Dataflow Model
• Akidau et al., Google [VLDB2015]
• Unbounded data processing
• late-coming data
• Integration of
• batch processing
• accumulation
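A very reduced sketch of late-coming data and accumulation: events carry their own event time, are grouped into fixed windows, and each arrival (even a late one) refines the accumulated result for its window. This is not the API of the Dataflow Model paper; the window size and function names are assumptions, and watermarks and triggers are left out entirely.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size


def window_start(event_time):
    return event_time - event_time % WINDOW_SECONDS


def process(arrivals):
    """arrivals: iterable of (event_time, value) in arrival order.

    Yields (window_start, accumulated_sum) after every update, so a late
    event produces a refined result for an older window.
    """
    totals = defaultdict(int)
    for event_time, value in arrivals:
        w = window_start(event_time)
        totals[w] += value          # accumulation
        yield w, totals[w]


# The event with time 30 arrives after events from the next window.
arrivals = [(5, 1), (70, 10), (65, 20), (30, 2)]
updates = list(process(arrivals))
```

Because `process` consumes any iterable, the same code handles a finite list (batch) and an unbounded generator (stream).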
20. Airflow
• Best practices with Airflow - An open source
platform for workflows & schedules (Nov 2015)
• At Silicon Valley Data Engineering Meetup
• https://youtu.be/dgaoqOZlvEA
21. Workflow Development
• Programmatic
• Generate workflows by code
• Configuration as Code
• Workflow reuse/overwrite
• object oriented
• Parameterization
24. Dataflow DSL vs Workflow DSL
• Dataflow
• A -> B -> C -> …
• Data dependencies
• Workflow
• Task A -> Task B -> Task C -> …
• Task dependencies
• Data transfer is optional (through file or DB)
• + Scheduling
• + Task names
• For monitoring, redo, etc.
25. Weavelet (wvlet)
• Object-oriented workflow DSL for Scala
• Workflow reuse, extension, override
• Parameterization
• Function := Task, Workflow := Class
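The "Function := Task, Workflow := Class" idea can be illustrated as follows. Weavelet itself is a Scala DSL and its real API is not shown here; this is an analogy in Python with hypothetical class and method names, showing parameterization through the constructor and reuse through subclassing and overriding.

```python
class DailyReport:
    """A workflow is a class; each method is a task."""

    def __init__(self, date, table="weblog"):    # parameterization
        self.date = date
        self.table = table

    def extract(self):
        return f"SELECT * FROM {self.table} WHERE dt = '{self.date}'"

    def summarize(self, query):
        return f"summary of ({query})"

    def run(self):
        # task dependency: extract -> summarize
        return self.summarize(self.extract())


class DailyPageCount(DailyReport):
    """Reuses the whole workflow and overrides a single task."""

    def extract(self):
        return (f"SELECT page, count(*) FROM {self.table} "
                f"WHERE dt = '{self.date}' GROUP BY page")


base = DailyReport("2015-11-01").run()
custom = DailyPageCount("2015-11-01", table="access_log").run()
```

The subclass inherits `summarize` and `run` unchanged, so only the part that differs has to be written again.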
26. Isolating DAG generation and its execution
• Alternatives to MapReduce
• Tez
• Pig on Spark https://issues.apache.org/jira/browse/PIG-4059
• Asakusa on Hadoop, Spark
[Figure: the DSL generates a DAG, which is executed on a Local, Hadoop, or Spark backend to produce the Result]
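A small sketch of this separation: the DSL calls only build a plan, and a separate executor interprets it. All names here (`Node`, `source`, `explain`, `LocalExecutor`) are hypothetical, and the plan is a simple linear chain, the simplest case of a DAG. A Hadoop or Spark executor would walk the same plan and emit jobs for that engine instead.

```python
class Node:
    """One operator in the plan; calling map/filter only records the step."""

    def __init__(self, op, fn=None, parent=None, data=None):
        self.op, self.fn, self.parent, self.data = op, fn, parent, data

    def map(self, fn):
        return Node("map", fn, parent=self)

    def filter(self, fn):
        return Node("filter", fn, parent=self)


def source(data):
    return Node("source", data=data)


def explain(node):
    """Render the plan without running it."""
    ops = []
    while node is not None:
        ops.append(node.op)
        node = node.parent
    return " -> ".join(reversed(ops))


class LocalExecutor:
    """Interprets the plan in-process."""

    def run(self, node):
        if node.op == "source":
            return list(node.data)
        rows = self.run(node.parent)
        if node.op == "map":
            return [node.fn(r) for r in rows]
        if node.op == "filter":
            return [r for r in rows if node.fn(r)]
        raise ValueError(f"unknown operator: {node.op}")


plan = source(range(6)).map(lambda x: x * 10).filter(lambda x: x >= 30)
plan_text = explain(plan)            # nothing has been computed yet
result = LocalExecutor().run(plan)   # execution happens only here
```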
27. Stream DSL
• Add “moving stream” support to the Dataflow DSL
• “Moving” streams and “resting” datasets
• Example
• Spark Streaming
• Spark DSL + Micro-batch for stream
• Microsoft Azure Stream SQL
• Windowing support for moving data
• Norikra
• Stream processing with SQL
• Reactive programming
• ReactiveX (Netflix), Akka Streams (beta) <- Stream DSL (DAG)
• Back-pressure support
• Controlling data transfer speed from receiver side
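One simple form of receiver-side control is a bounded buffer: when the receiver falls behind, the buffer fills up and the sender blocks. The sketch below uses Python's standard `queue.Queue`; it is not the demand-signalling protocol that ReactiveX or Akka Streams implement, and the buffer size of 2 is an arbitrary assumption.

```python
import queue
import threading


def producer(q, items):
    for item in items:
        q.put(item)   # blocks while the queue is full, slowing the sender
    q.put(None)       # sentinel: end of stream


def consume(q, handle):
    while True:
        item = q.get()
        if item is None:
            break
        handle(item)


buffer = queue.Queue(maxsize=2)   # receiver-side capacity (hypothetical size)
received = []
sender = threading.Thread(target=producer, args=(buffer, range(10)))
sender.start()
consume(buffer, received.append)  # the receiver sets the pace
sender.join()
```

The sender can never be more than `maxsize` items ahead of the receiver, so memory use stays bounded however slow `handle` is.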
28. Task Execution Retry
• Design patterns for retries and idempotency (blog post, in Japanese)
• http://frsyuki.hatenablog.com/entry/2014/06/09/164559
• System failures
• Process is not responding
• network, hardware failures
• Middleware failures
• provisioning failures, missing components
• User failures
• Wrong configuration
• Programming error
29. Retry Example
• Example: Task calling a REST API /create/xxx
• Client: First attempt
• Server returns 200 (success)
• But the client fails to receive the status code
• Client retries the task
• Gets a 409 Conflict error (entry xxx has already been created)
• Solution (Application side)
• Handle the 409 error as success in the client (idempotent execution)
• Stricter approach
• Make xxx unique for each request
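The scenario can be reproduced with an in-memory stand-in for the server. `FakeServer` and `create_idempotent` are hypothetical names written for this example, and the lost response is simulated with a `ConnectionError`; no real HTTP library is involved.

```python
import uuid


class FakeServer:
    """In-memory stand-in for the REST API (hypothetical)."""

    def __init__(self):
        self.entries = {}
        self.drop_next_response = True   # simulate losing the first response

    def create(self, name):
        if name in self.entries:
            return 409                   # conflict: entry already exists
        self.entries[name] = True
        if self.drop_next_response:
            self.drop_next_response = False
            raise ConnectionError("response lost")   # created, but client never sees 200
        return 200


def create_idempotent(server, name, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            status = server.create(name)
        except ConnectionError:
            continue                     # outcome unknown: retry
        if status in (200, 409):         # 409: an earlier attempt already created it
            return status
        raise RuntimeError(f"unexpected status {status}")
    raise RuntimeError("gave up after retries")


server = FakeServer()
status = create_idempotent(server, "xxx")

# Stricter approach: a name that is unique per request, so a 409 can only
# come from our own earlier attempt and never from another client.
request_name = f"xxx-{uuid.uuid4()}"
```

Treating 409 as success is only safe when nobody else could have created the same entry, which is what the unique name guarantees.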
30. Fault Tolerance
• Presto: Distributed query engine developed by Facebook
• Uses HTTP data transfer
• No fault-tolerance
• 99.5% of queries finish without any failure
• For queries processing 10 billion or more rows, the rate drops to 85%
[Figure: distributed execution of a query; data A0, A1, A2 (A) and B0 to B3 (B), stages labeled split, map, reduce, merge. Plan: TableScan(weblog) -> GroupBy(hash(page)) -> count(weblog of a page) -> result]
31. Summary
• Recent workflow tools
• Driven by Python community
• Because of this book! (=>)
• Airflow, Luigi, etc.
• Workflow manager
• Handle system failures, monitoring
• Workflow development
• DAG based DSL (dataflow, workflow, stream processing) -> Execution
• Does not cover application logic errors
• Idempotent execution
• Requires splitting large tasks into smaller ones