Keynote of HadoopCon 2014 Taiwan:
* Data analytics platform architecture & designs
* Lambda architecture overview
* Using SQL as DSL for stream processing
* Lambda architecture using SQL
3. Topics
About Me & LINE
Data analytics workloads
Batch processing
Stream processing
Lambda architecture
Lambda architecture using SQL
Norikra: Stream processing with SQL
13:30-14:20 4F
11. Various Data Analytics Workload
Reports
Monthly/Daily reports
Hourly (or shorter) news
Real-time metrics
Automatically updated reports/graphs
Alerts for abuse of services, overload, ...
12.
13. Batch Processing
Hadoop
MapReduce (or Spark, Tez) & DSLs (Hive, Pig, ...)
For reports
MPP Engines
Cloudera Impala, Apache Drill, Facebook Presto, ...
For interactive analysis
For reports of shorter window
14. Stream Processing
Apache Storm
Incubator project
“Distributed and fault-tolerant realtime computation”
Norikra
by tagomoris
Non-distributed “Stream processing with SQL”
15. Why Stream Processing?
Less latency
Realtime metrics
Short-term prompt reports
Less computing power
10Mbps for batch processing: 100GB/day
10Mbps for stream processing: 1 Server
No query schedule management
Once query registered, it runs forever
16. Disadvantage of Stream Processing
Queries must be written before data
There should be another way to query past data
Queries cannot be run twice
All results will be lost when any error occurs
All data have gone when bugs found
Disorders of events break results
Recorded time based queries? Or arrival time based queries?
18. Lambda Architecture
“The Lambda-architecture aims to satisfy the needs for a
robust system that is fault-tolerant, both against hardware
failures and human mistakes, being able to serve a wide
range of workloads and use cases, and in which low-latency
reads and updates are required. The resulting system should
be linearly scalable, and it should scale out rather than up.”
http://lambda-architecture.net/
19. Lambda Architecture: Overview
new data
batch layer
master dataset
serving layer
view
speed layer
real-time view
query
21. What Lambda Architecture Provides
Replayable queries
Redo queries anytime if results of speed layer are broken
Accurate results on demand
Prompt reports in speed layer with arrival time
Fixed reports in batch layer with recorded time
... And many more benefits of stream processing
22. Why All of Us Don’t Use It?
Storm doesn’t fit well with many uses
Storm requires computer resources too big to deploy
Summingbird requires many steps to deploy
Many directors/analysts don’t write Scala/Java
Summingbird DSL is not enough easy for non-professional
people
25. Norikra
Schema-less stream processing with SQL
“Norikra is a open source server software provides "Stream
Processing" with SQL, written in JRuby, runs on JVM, licensed
under GPLv2.”
SELECT
path,
COUNT(1, status=200) AS success_count,
COUNT(1, status=500) AS server_error_count,
COUNT(*) AS count
FROM AccessLog.win:time_batch(10 min, 0L)
WHERE service='myservice' AND path LIKE '/api/%'
GROUP BY path
http://norikra.github.io/
27. “Pseudo Lambda” Architecture Using SQL
Lambda architecture platform
with almost same queries
SELECT path,
COUNT(IF(status=200,1,NULL)) AS success_count,
COUNT(IF(status=500,1,NULL)) AS server_error_count,
COUNT(*) AS count
FROM AccessLog
WHERE service='myservice' AND path LIKE '/api/%'
AND timestamp >= ‘2014-09-13 10:40:00’
AND timestamp < ‘2014-09-13 10:50:00’
GROUP BY path
SELECT path,
COUNT(1, status=200) AS success_count,
COUNT(1, status=500) AS server_error_count,
COUNT(*) AS count
FROM AccessLog.win:time_batch(10 min, 0L)
WHERE service='myservice' AND path LIKE '/api/%'
GROUP BY path
28. “Pseudo Lambda” Architecture Using SQL
SQL dialects are easy to learn!
Standard SQL, Hive, Presto, Impala, Drill, ...
+ Norikra
For non-professional people too!
SQL queries are very easy to write twice!
29. Use Cases in LINE
Prompt reports for Ads service
Short-term prompt reports by Norikra
Daily fixed reports by Hive
Summary of application server error log
Aggregate error log for alerting by Norikra
Check details with Hive, Presto (or grep!)
See you later for details!