1. Real-time Big Data Applications with Hadoop Ecosystem
Chris Huang
Sr. Manager, Core Tech
2014/9/24
1 9/25/2014 Confidential | Copyright 2013 TrendMicro Inc.
2. About – Chris Huang
• Chris Huang
– SPN Solution Developer Manager
– SPN Hadoop Architect
– Hadoop.TW Active Member
• Believes cloud, services, software, and big data are critical factors for Taiwan’s future economic development
5. Hot Keywords in Hadoop Community
• Real-time
– Impala, Stinger
• Computing Framework
– YARN, Tez
• In Memory
– Spark
• Streaming
– Kafka, Storm
6. Big Data Applications
• Operational
– Real-time
– Near Real-time
• Analytical
– Batch
– Interactive
– Near Real-time
– Streaming
7. An Online Music Example
• Operational
– Recent N login times (and listening durations)
– Recent N albums/artists the user browsed
– Recent N keywords the user searched for
– Recent N songs/albums/artists the user listened to (or bought)
– User’s purchase amount over the recent N months
• Analytical
– Recommend the right song/album/artist to the right user at the right time
– Correlate similar songs/albums/artists (by CDDB or user behavior)
– Identify seasonal music trends (X’mas, Valentine’s Day, New Year)
– Identify regional music trends
– Calculate regional leaderboards
– Connect users with their social networks
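The operational "Recent N" items in this example all come down to keeping a small, bounded history per user that can be read with low latency. A minimal in-memory sketch, assuming a hypothetical `RecentN` helper; in a real deployment this state would live in a store such as HBase rather than in a Python dict:

```python
from collections import defaultdict, deque


class RecentN:
    """Keep only the N most recent events per user (hypothetical helper)."""

    def __init__(self, n):
        self.n = n
        # deque(maxlen=n) drops the oldest item automatically once it is full
        self._events = defaultdict(lambda: deque(maxlen=n))

    def record(self, user_id, event):
        """Append one event (a search keyword, a song played, ...) for a user."""
        self._events[user_id].append(event)

    def recent(self, user_id):
        """Return the user's events, newest first; empty list if none exist."""
        return list(reversed(self._events.get(user_id, ())))
```

For example, after recording four searches with `RecentN(3)`, `recent()` returns only the latest three, newest first.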
8. An Online Banking Example
• Operational
– Recent N login times / login frequency
– Recent N items purchased by credit card
– Account balance over the recent N months
– Recent N transfer-in/transfer-out amounts
– Recent N investment events
– Investment balance over the recent N months
• Analytical
– Understand the user’s profile better (assets/debts/shopping habits/family)
– Recommend the right product to the right user (investment, credit card, loan)
– Identify seasonal trends (tax month/year end/back to school/X’mas)
– Identify regional investment product leaderboards (by age group)
– Recommend products based on similar user profiles
9. Building Your Big Data Applications
• Think about your data
– Entity or Event?
• Think about your use case
– Operational or Analytical?
• Think about your data user
– External or Internal?
10. Think About Your Data
Slides from “Apache HBase Application Archetypes”, HBaseCon 2014
You can replace HBase with similar alternatives; the concepts are the same
19. Think About Your Use Case
20. Operational Use Case 1
[Architecture diagram. Recoverable labels: HDFS; MR / Spark; Batch; Real-time]
21. HBase: No Secondary Index (yet)
• Search index building (row key)
• Use Solr to make text data searchable
– Snapshot & clone table
– Index column qualifier text
– Record row-key in Solr document
– Use HBase client to fetch data
• Usually takes less than a few seconds
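The Solr pattern on this slide (index the text, record the row key in the index document, then fetch the row through the HBase client) can be sketched with plain Python dictionaries. Everything here is a stand-in for illustration: `table` plays the role of an HBase table, `index` plays the role of a Solr index, and `put` and `search` are hypothetical helper names, not HBase or Solr APIs.

```python
import re
from collections import defaultdict

# Stand-ins for the real systems (assumptions for illustration only):
#   table -> an HBase table, keyed by row key
#   index -> a Solr index, mapping a term to the row keys that contain it
table = {}
index = defaultdict(set)


def put(row_key, text):
    """Store the row, then index its text and record the row key in the index."""
    table[row_key] = text
    for term in re.findall(r"\w+", text.lower()):
        index[term].add(row_key)


def search(term):
    """Query the index for row keys, then fetch each row from the table by key."""
    row_keys = sorted(index.get(term.lower(), ()))
    return [(key, table[key]) for key in row_keys]
```

The index stores only terms and row keys, and the data itself is always read from the table. This sketch does not remove stale index entries when a row is overwritten; a real pipeline would have to handle that.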
22. Operational Use Case 2 (SPN)
[Architecture diagram. Recoverable labels: Feed App; Flume; HDFS; MapReduce; Pig; Batch; Solr Client; Index Query; Get, Scan; Real-time; low latency; high throughput]
23. Operational Use Case 3 (Mixed)
[Architecture diagram. Recoverable labels: Feed App; Flume; HDFS; MapReduce; Pig; MR / Spark; Batch; Bulk Import; HBase Client; Put, Incr, Append; Get, Scan; Gets; Short scan; HBase Replication; Solr; Solr Client; Index Query; Real-time; low latency; high throughput]
24. HBase or HDFS?
• Depends on what your data is
– Entity or Event?
• Depends on your workload
– Low latency?
– Random read/write?
– Short/full scan?
– Sequential read/write?
– Update?
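One way to read this checklist is as a rule of thumb: low-latency access, random reads and writes, short scans, and in-place updates point toward HBase, while sequential reads and writes and full scans point toward plain files on HDFS. The hypothetical `suggest_store` function below only illustrates that heuristic; it is an assumption about how to apply the checklist, not a rule stated in the slides.

```python
def suggest_store(low_latency=False, random_access=False,
                  short_scan=False, updates=False):
    """Suggest a storage layer from workload traits (rule of thumb only).

    Any random-access style requirement leans toward HBase; a purely
    sequential, full-scan, append-only workload leans toward HDFS files.
    """
    if low_latency or random_access or short_scan or updates:
        return "HBase"
    return "HDFS"
```

For instance, `suggest_store(updates=True)` suggests HBase, while a workload with none of these traits falls through to HDFS. Real decisions also depend on data volume, cost, and operational experience.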
25. Wait… Batch for Operational?
30. Operational: Batch + Real-time
• Bridge the gap between batch and now
• 80/20 rule
– HDFS/MapReduce/Spark solves 80% easily
– The remaining 20% takes 80% of the effort
• Get as close as possible, but don’t overdo it!
31. What is Real-time?
• Real-time is NOT always “faster than batch”
– If you have really BIG DATA
• Most of the time, we want Timely Information
• Minimize the gap between scheduled batch jobs
[Timeline diagram: three consecutive hourly jobs. How do you get a result at 1:33?]
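A common way to answer a query that falls between two hourly jobs is to add the results of the completed batch jobs to whatever has arrived since the last one finished. The sketch below assumes a hypothetical `count_as_of` function, batch results keyed by the start of the hour they cover, and recent event timestamps held in some low-latency store; none of these names come from the slides.

```python
from datetime import datetime


def count_as_of(query_time, batch_counts, recent_events):
    """Combine finished hourly batch counts with events newer than the last batch.

    batch_counts:  {hour_start (datetime): count} written by completed hourly jobs
    recent_events: event timestamps (datetime) kept in a low-latency store
    """
    current_hour = query_time.replace(minute=0, second=0, microsecond=0)
    # Batch view: every fully processed hour before the current one
    batch_total = sum(c for hour, c in batch_counts.items() if hour < current_hour)
    # Real-time view: events since the top of the hour, up to the query time
    realtime_total = sum(1 for t in recent_events if current_hour <= t <= query_time)
    return batch_total + realtime_total
```

With a batch count of 100 for the 00:00 hour and events at 1:05, 1:30, and 1:40, a query at 1:33 returns 102: the batch result plus the two events that arrived before the query time.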
32. Analytical Use Case
• Compute in batch or streaming
• Deliver near real-time or interactively
36. The Online Music Example
• Operational
– Recent N login times (and listening durations)
– Recent N albums/artists the user browsed
– Recent N keywords the user searched for
– Recent N songs/albums/artists the user listened to (or bought)
– User’s purchase amount over the recent N months
• Analytical
– Recommend the right song/album/artist to the right user at the right time
– Correlate similar songs/albums/artists (by CDDB or user behavior)
– Identify seasonal music trends (X’mas, Valentine’s Day, New Year)
– Identify regional music trends
– Calculate regional leaderboards
– Connect users with their social networks
Do you really want an analytical result (recommendation) EVERY 50 milliseconds?
37. Analytical Use Case 1
[Architecture diagram. Recoverable labels: HDFS; Batch; Solr Client; Index Query; Real-time]
38. Analytical Use Case 2 (SPN)
“A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase”,
HBaseCon 2014
49. Apache Pig (MapReduce)
• Do an hourly count on the Akamai log
– A = load 'date://2014/07/20/00'
using AkamaiRCLoader();
B = foreach (group A all) generate COUNT_STAR(A);
dump B;
– …
0% complete
100% complete
(194202349)
Too slow for interactive use
50. Using Impala
• Without memory cache
– select count(*) from akafast
where day=20140720 and hour=0;
– 194202349
• With OS cache
• Do a further query:
– select count(*) from akafast where day=20140720
and hour=00 and c='US';
– 41118019
Makes sense now
51. Don’t Connect an Analytical Engine to an Operational Use Case
52. Analytical Use Case 3
[Architecture diagram. Recoverable labels: Feed App; Flume; HDFS; MR / Spark; Batch; Bulk Import; HBase Client; Put, Incr, Append; Gets; Short scan; Impala/Stinger; Interactive; Real-time; low latency; high throughput; Customer; Analyst]
53. Streaming Use Cases
59. Streaming Operational Use Case
[Architecture diagram. Recoverable labels: Kafka/Storm; Streaming; HBase Client; Put, Incr, Append; Gets; Short scan; HDFS; Solr Client; Index Query; Real-time; low latency; high throughput]
60. Streaming Analytical Use Case
[Architecture diagram. Recoverable labels: Feed App; Flume; Kafka/Storm; Streaming; HBase Client; Put, Incr, Append; Gets; Short scan; HDFS; Impala/Stinger; Interactive; Analyst; Real-time; Customer; low latency; high throughput]
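The streaming path pairs a stream processor (Kafka/Storm) with incremental writes (Put, Incr, Append) and cheap reads (Gets, short scans). The sketch below mimics that shape in memory for a regional leaderboard: a Python list stands in for the stream, a `Counter` stands in for HBase counter columns, and `process_stream` and `top_songs` are hypothetical names used only for illustration.

```python
from collections import Counter


def process_stream(events, counters):
    """Apply one increment per event as it arrives (stand-in for an HBase Incr)."""
    for region, song in events:
        counters[(region, song)] += 1


def top_songs(counters, region, n):
    """Short scan over one region's counters, highest play count first."""
    rows = [(song, count) for (r, song), count in counters.items() if r == region]
    # Sort by count descending, then by song name for a stable order
    return sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
```

Because each event only increments a counter, the leaderboard is up to date as soon as the event is processed, and reading it never requires rescanning the raw events.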
61. Think About Your Data User
62. Data User
• External
– Customer
– Partner
• Internal
– Business report user
– Data researcher
– Data analyst
– Algorithm developer
• External users want an instant response
• They don’t know (and don’t care) whether the recommendation was computed 1 hour ago or 50 ms ago
• For internal users, interactive or near real-time is enough
• Sometimes they can even wait for batch (make the data small, then analyze)
• Of course, everyone wants results faster, but it depends on your investment ($$)
63. No Silver Bullet for Real-time or Big Data Applications