Jubatus talk at HadoopSummit 2013
1. Jubatus: Real-time and Highly-scalable
Machine Learning Platform
Shohei Hido
Preferred Infrastructure, Inc. Japan.
HadoopSummit 2013 @ San Jose, CA
2013/06/27
2. Jubatus: OSS for real-time big data analytics
l Joint development with NTT laboratory in Japan
l Released Oct. 2011 (current version is v0.4.3)
l You can download it from https://github.com/jubatus/
Three trends: (1) bigger data, (2) more in real-time, (3) machine learning
4. Preferred Infrastructure, Inc. (PFI)
-To bring cutting-edge research advances to the real world-
l Software company in Tokyo, Japan (founded in 2006)
l Focus on long-term technology innovation
l 28 regular employees, many top-notch engineers
l Customers: media, e-commerce, research institutes
l Expertise: distributed computing, natural language processing, machine learning, information retrieval
5. Agenda
l What is Jubatus? : Motivation and applications
l How Jubatus works? : The architecture
l How to use it : Quick-start steps
l Summary and future
6. At HadoopSummit last year: everyone talked about “real-time”
Topics: real-time BI, definition of real-time, real-time analytics, real-time SQL-like query, real-time processing, real-time ad-hoc query, real-time visualization
7. Real-time big data analytics: A trend
From an O’Reilly article (2013)
“Real-time big data isn’t just a process for storing petabytes
or exabytes of data in a data warehouse”
“It’s about the ability to make better decisions and take
meaningful actions at the right time.”
“It’s about combining and analyzing data so you can take
the right action, at the right time, and at the right place”
- Michael Minelli, Co-author of “Big Data, Big Analytics”
9. Big data analytics will go real-time and deeper
Three trends: (1) bigger data, (2) more in real-time, (3) machine learning
l Future: Deeper analytics for rapid decisions and actions
l Twitter analysis for personalized advertisement optimization
l Anomaly detection from M2M sensor data
l Energy demand forecast / Smart grid optimization
l Security monitoring on Network traffic or financial fraud
10. Demo: real-time tweet categorization
l Automatically learns “Apple + iPad => Apple” then “iPad => Apple” in real-time
11. Jubatus is part of the Twitter ecosystem in Japan
l NTT Data: Exclusive tweet reseller in Japan
l Firehose contract with Twitter
l Jubatus is an official tool for analytics on Japanese tweets
l Jubatus can classify 5,000+ tweets per second on a few servers
http://blog.jp.twitter.com/2012/09/twitter.html
http://www.nttdata.com/jp/ja/news/release/2012/092700.html
Our Twitter analysis modules
12. Jubatus as a big data analytics platform for industry
l Gov. fund for IT fusion: big-data new business creation
l In collaboration with NEC and other research labs.
l Focus on performance improvement for larger M2M data
Figure: development plan plotted against data size, scaling up from human-generated data, to machine-generated data, to data with severe real-time requirements. Domains shown: SNS data, healthcare, agriculture, network traffic, video surveillance.
13. Active development & growing business/community
l 10+ active committers
l & Pull requests from users
l Monthly minor update
l Bug & usability fix
l Quarterly major update
l Add new features & interface
l PoC on user companies
l Real-time ad optimization
l Server monitoring
l Smart-house / smart-grid
l Intelligent camera
l Deployment & Experiment
l Twitter analysis
l Social media monitoring
l Malicious attack detection
l Malware detection
l 2 hands-on sessions: 90+ attendees in total
l 1 meetup: 90+ attendees
14. Agenda
l What is Jubatus? : Motivation and applications
l How Jubatus works? : The architecture
l How to use it : Quick-start steps
l Summary and future
15. Online machine learning in Jubatus
l Batch learning
l Scan all data before building a model
l Data must be stored in memory or storage
l Online learning
l The model is updated with each data sample
l Sometimes backed by theory showing that the online model converges to the batch model
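The contrast between the two modes can be sketched with a deliberately tiny "model": the mean of a stream. This is an illustrative choice, not an algorithm taken from Jubatus, and the names `batch_mean` and `OnlineMean` are hypothetical. The batch version needs every sample stored before it can scan them; the online version keeps only a count and the current mean, and here it ends at the same value as the batch result.

```python
from typing import Iterable, List


def batch_mean(samples: List[float]) -> float:
    """Batch learning: all samples must be stored and scanned first."""
    return sum(samples) / len(samples)


class OnlineMean:
    """Online learning: the model (count, mean) is updated by each sample."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, x: float) -> None:
        # Incremental update; the sample itself is not kept afterwards.
        self.count += 1
        self.mean += (x - self.mean) / self.count


def online_mean(stream: Iterable[float]) -> float:
    """Feed a stream to the online model one sample at a time."""
    model = OnlineMean()
    for x in stream:
        model.update(x)
    return model.mean


data = [2.0, 4.0, 6.0, 8.0]
print(batch_mean(data), online_mean(data))
```

For most real models the online result only approximates the batch one; the mean is a case where the two coincide.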
17. Online learning or distributed learning:
No unified solution has been available
l Jubatus combines them into a unified computation framework
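Figure: existing tools placed by learning mode and scale
Batch, small scale / stand-alone: WEKA (1993-), SPSS (1988-)
Batch, large scale & distributed/parallel computing: Mahout (2006-)
Real-time/online, small scale / stand-alone: online ML algorithms, PA [2003], CW [2008]
Real-time/online, large scale & distributed/parallel computing: Jubatus (2011-)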
18. Q: How to make online algorithms distributed?
A: It is not trivial; some tricks are needed
l Online learning requires frequent model updates
l Naïve data distribution leads to too many synchronization operations
l This causes problems in both network communication cost and accuracy
Figure: timeline for Server A, Server B, and Server C in which local model updates (L) alternate with frequent synchronization steps (Sync). Data synchronization?
19. Our approach: Loose model sharing
l Jubatus only shares the local models in a loose manner
l Model size << Data size
l Jubatus DOES NOT share datasets
l Unique approach compared to existing frameworks
l Local models can be different on the servers
l Different models will be gradually merged
l We define three fundamental operations
l UPDATE / MIX / ANALYZE
l Algorithms can be implemented independently from
l Distribution logic
l Data sharing
l Failover
20. UPDATE, MIX, and ANALYZE
1. UPDATE - locally
l Receive a sample, learn and update the local model
2. MIX - globally
l Exchange and merge the local models between servers
3. ANALYZE - locally
l Receive a sample, apply the local model, return result
Figure: UPDATE (distributed training) -> MIX (share only models, giving every server a unified model) -> ANALYZE (distributed prediction)
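The three operations can be sketched in a few lines. Everything in this block is an assumption made for illustration: the `Server` class, the perceptron update rule, and averaging as the merge step are stand-ins, not the Jubatus implementation or its API.

```python
from typing import Dict, List

Weights = Dict[str, float]


class Server:
    """One hypothetical server holding a local linear model."""

    def __init__(self) -> None:
        self.w: Weights = {}

    def score(self, features: List[str]) -> float:
        return sum(self.w.get(f, 0.0) for f in features)

    def update(self, features: List[str], label: int) -> None:
        """UPDATE: perceptron step on the local model (label is +1 or -1)."""
        if label * self.score(features) <= 0:
            for f in features:
                self.w[f] = self.w.get(f, 0.0) + label

    def analyze(self, features: List[str]) -> int:
        """ANALYZE: apply the local model and return the predicted label."""
        return 1 if self.score(features) > 0 else -1


def mix(servers: List[Server]) -> None:
    """MIX: average the local models and give the result to every server."""
    keys = {k for s in servers for k in s.w}
    merged = {k: sum(s.w.get(k, 0.0) for s in servers) / len(servers) for k in keys}
    for s in servers:
        s.w = dict(merged)


servers = [Server(), Server()]
servers[0].update(["apple", "ipad"], +1)    # sample seen only by server 0
servers[1].update(["banana", "fruit"], -1)  # sample seen only by server 1
before = servers[1].analyze(["ipad"])       # server 1 has never seen "ipad"
mix(servers)                                # only models cross the network
after = servers[1].analyze(["ipad"])
print(before, after)
```

In this toy run, server 1 can label "ipad" only after MIX, although the training sample itself never left server 0.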
21. UPDATE
l Each server starts from an initial model
l Each data sample is sent to one (or two) servers
l Local models updated based on the sample
l Data samples are NEVER shared
Figure: samples are distributed to the servers randomly or consistently; each server turns its initial model into its own local model (local model 1, local model 2)
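The two routing options can be sketched as follows. The slide does not say how "consistently" is implemented; this block assumes a consistent-hash ring as one plausible reading, and the names `HashRing`, `route`, and `route_randomly` are hypothetical.

```python
import bisect
import hashlib
import random
from typing import List


def _hash(key: str) -> int:
    """Stable hash, so the same key maps to the same point on every run."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual nodes per server."""

    def __init__(self, servers: List[str], replicas: int = 50) -> None:
        self._points = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(replicas)
        )
        self._keys = [p for p, _ in self._points]

    def route(self, sample_id: str) -> str:
        # First ring point clockwise from the sample's hash (wraps around).
        idx = bisect.bisect(self._keys, _hash(sample_id)) % len(self._points)
        return self._points[idx][1]


def route_randomly(servers: List[str], rng: random.Random) -> str:
    """Random routing: any server may receive any sample."""
    return rng.choice(servers)


servers = ["server-a", "server-b", "server-c"]
ring = HashRing(servers)
print(ring.route("user-42"), route_randomly(servers, random.Random(0)))
```

With the ring, a given sample ID always reaches the same server, and removing one server only moves the IDs that server owned.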
22. MIX
l Each server sends its model diff
l Model diffs are merged and distributed
l Only model diffs are transmitted
Figure: diff-based MIX with two servers
Local model 1 - Initial model = Model diff 1
Local model 2 - Initial model = Model diff 2
Model diff 1 + Model diff 2 = Merged diff
Initial model + Merged diff = Mixed model (the same on both servers)
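The diff arithmetic can be sketched with models as plain dictionaries. The merge rule here is assumed to be an average of the diffs, which suits averaged linear models; the slide itself only says the diffs are merged. The function names `diff`, `merge`, and `apply_diff` are hypothetical.

```python
from typing import Dict, List

Model = Dict[str, float]


def diff(local: Model, base: Model) -> Model:
    """Model diff = local model - initial (or last mixed) model."""
    keys = set(local) | set(base)
    return {k: local.get(k, 0.0) - base.get(k, 0.0) for k in keys}


def merge(diffs: List[Model]) -> Model:
    """Merged diff, assumed here to be the average of the server diffs."""
    keys = {k for d in diffs for k in d}
    return {k: sum(d.get(k, 0.0) for d in diffs) / len(diffs) for k in keys}


def apply_diff(base: Model, merged: Model) -> Model:
    """Mixed model = initial model + merged diff."""
    keys = set(base) | set(merged)
    return {k: base.get(k, 0.0) + merged.get(k, 0.0) for k in keys}


initial = {"a": 1.0, "b": 0.0}
local1 = {"a": 2.0, "b": 0.0}   # server 1 after its local updates
local2 = {"a": 1.0, "b": -1.0}  # server 2 after its local updates

merged = merge([diff(local1, initial), diff(local2, initial)])
mixed = apply_diff(initial, merged)
print(mixed)
```

Because both servers start from the same initial model, averaging the diffs gives the same mixed model as averaging the two local models directly.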
23. UPDATE (iteration)
l After MIX, the previous locally updated models are discarded
l Each server starts updating from the mixed model
l The mixed model improves gradually thanks to all of the servers
Figure: samples are distributed randomly or consistently; each server updates the mixed model into a new local model (local model 1, local model 2)
24. ANALYZE
l For prediction, each sample randomly goes to a server
l Server applies the current mixed model to the sample
l The prediction will be returned to the client
l You can add servers for higher throughput
Figure: samples are distributed randomly to the servers; each server applies the mixed model and returns the prediction
25. Model inside Jubatus (1): classification
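Figure: servers 1 to n hold local models w_1, w_2, ..., w_n; after MIX every server holds the same model w:
$$w = \frac{1}{n}\left(w_1 + \cdots + w_n\right)$$
l Each server updates local linear models
l MIX computes the averaged coefficients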
26. Model inside Jubatus (2): nearest neighbor
Figure: each sample ID maps to a bit array (1: 011010010, 2: 110001100, 3: 110010111, 4: 000100101, 5: 110101011, 6: 000010110); after MIX, every server holds the same ID-to-bit-array table (1: 011010010, ..., 6: 000010110)
l Samples are approximated by LSH, MinHash, etc
l Only bit-arrays are shared between servers
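A minimal sketch of the bit-array idea, assuming random-hyperplane LSH as the hashing scheme (the slide names LSH and MinHash but gives no further detail). The helper names and the parameters `dim`, `bits`, and `seed` are hypothetical.

```python
import random
from typing import Dict, List


def make_hyperplanes(dim: int, bits: int, seed: int = 0) -> List[List[float]]:
    """One random hyperplane per output bit; every server must share these."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(x: List[float], planes: List[List[float]]) -> List[int]:
    """Bit i records which side of hyperplane i the sample falls on."""
    return [1 if sum(p * v for p, v in zip(plane, x)) >= 0 else 0 for plane in planes]


def hamming(a: List[int], b: List[int]) -> int:
    """Number of differing bits; small distance suggests similar samples."""
    return sum(u != v for u, v in zip(a, b))


def nearest(query: List[int], table: Dict[str, List[int]]) -> str:
    """Return the stored ID whose bit array is closest to the query."""
    return min(table, key=lambda key: hamming(query, table[key]))


planes = make_hyperplanes(dim=3, bits=16)
table = {
    "a": signature([1.0, 2.0, 3.0], planes),
    "b": signature([-1.0, -2.0, -3.0], planes),
}
query = signature([2.0, 4.0, 6.0], planes)  # same direction as sample "a"
print(nearest(query, table))
```

Only the entries of `table` would need to travel between servers in this sketch; the original vectors stay where they were received.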
27. Jubatus architecture
Standard client-server system
l ZooKeeper and RPC handle connections between clients and servers
l We have clients for C++/Java/Ruby/Python (All under MIT license)
Figure: architecture diagram. Components shown: Client and Client+JubaKeeper nodes, a JubaKeeper, and multiple multi-threaded JubaServer processes on Linux servers, connected by RPC. JubaServer internals: fv_converter, Algorithm, Model.
28. Best QPS performance (evaluated on an old version)
l Experimental settings
l Standalone vs. multiple servers
l Client processes: 1-4
l Server processes: 1-6
l Server threads: 1-6
l Results
l Classification scales linearly with #server-processes & threads
l Recommendation performance highly depends on collected #samples
Task            Operation  Max QPS
Classification  UPDATE     3,000
Classification  ANALYZE    6,500
Recommendation  UPDATE       400
Recommendation  ANALYZE    2,500
29. Agenda
l What is Jubatus? : Motivation and applications
l How Jubatus works? : The architecture
l How to use it : Quick-start steps
l Summary and future
31. Step (1): VM images and tutorial
http://download.jubat.us/event/handson_01/en/
l Hands-on tutorial
l Intro to ML, How to start, examples, configurations
l VM images running on any OS
l VirtualBox / VMware
32. Step (2) : Download from github
l https://github.com/jubatus/jubatus/
33. Step (3): Play with Jubatus examples
l https://github.com/jubatus/jubatus-example/
35. Agenda
l What is Jubatus? : Motivation and applications
l How Jubatus works? : The architecture
l How to use it : Quick-start steps
l Summary and future
36. Summary
l Jubatus is an OSS for online distributed machine learning
l UPDATE-MIX-ANALYZE for abstracting ML algorithms
l Most of the tasks
l Future plans
l Clustering
l P2P-like MIX method
l Time-series preprocessing in fv_converter
l Unlearning
Three trends: (1) bigger data, (2) more in real-time, (3) machine learning
37. Current: As a meta-data predictor
User  Time  Bet  Act.  Gain  Class  Est.  Cluster  Outlier
A33   5:34  40   ↑C    +20   Good   +18   C1       0.07
A33   5:34  10   ←B    +80   Good   -10   C3       0.92
A33   5:35  20   ↑B    -16   Bad    -15   C1       0.11
…     …     …    …     …
Figure: input data -> online learning and real-time prediction -> enriched data (input columns plus predicted columns). Storage shown: RDB, NoSQL, HDFS, Search. Downstream uses: aggregation, reporting, analytics.
l Apply Jubatus models before storing
l Adaptive and memory-efficient
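The "predict before storing" pattern can be sketched as a small enrichment step. The `enrich` helper, the `toy_class` rule, and the in-memory `store` list are hypothetical stand-ins; a real deployment would call a trained model and write to one of the stores named on the slide.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Predictor = Callable[[Record], object]


def enrich(record: Record, predictors: Dict[str, Predictor]) -> Record:
    """Return a copy of the record with one extra column per predictor."""
    enriched = dict(record)
    for column, predict in predictors.items():
        enriched[column] = predict(record)
    return enriched


def toy_class(record: Record) -> str:
    """Stand-in for a trained classifier: label a record by the sign of Gain."""
    return "Good" if record["Gain"] > 0 else "Bad"


store: List[Record] = []  # stand-in for RDB / NoSQL / HDFS
incoming = [
    {"User": "A33", "Time": "5:34", "Bet": 40, "Gain": 20},
    {"User": "A33", "Time": "5:35", "Bet": 20, "Gain": -16},
]
for rec in incoming:
    store.append(enrich(rec, {"Class": toy_class}))  # predict, then store

print(store)
```

Downstream aggregation and reporting can then read the predicted column like any other stored field.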
38. Future: For edge-heavy data
l Emerging apps that can’t collect data into one place
l Due to data intensity: video streams from millions of devices
l Due to latency: real-time decision within <100 msec
l Due to privacy: sensitive raw data cannot be shared
Examples: smartphones, intelligent cars, intelligent cameras, healthcare monitoring, bio-medical
40. We opened a subsidiary in San Jose
l Preferred Infrastructure America, Inc.
l Established in March, Office opened in April
l Next to the SJC airport
l Start doing business in the U.S.
41. Thank you
l Follow us
l github.com/jubatus
l jubatus@googlegroups.com
l Twitter: @JubatusOfficial
l We welcome your contribution and collaboration