2. Jubatus is…
• A Distributed Online Machine-Learning framework
– Open-source software developed in Japan
• GPL2.0
• Distributed
– Fault-Tolerance
– Scale out
• Online
– Fixed time computation
• Machine-Learning
– More than “word count”!
3. Architecture
• The ML model is combined with a feature extractor
[Diagram: a Jubatus Server containing a Feature Extractor and a Machine Learning Model, accessed via Jubatus RPC]
4. Architecture
• Multi-language client libraries
– Available via gem, pip, cpan, and maven
– Under the hood it is MessagePack-RPC.
• So you can use OCaml, Haskell, JavaScript, or Go at your own risk.
[Diagram: a Client connected to the Jubatus Server via Jubatus RPC]
6. Classifier
• Task: classify a datum
import sys
def fib(a):
    if a == 1 or a == 0:
        return 1
    else:
        return fib(a-1) + fib(a-2)
if __name__ == "__main__":
    print(fib(int(sys.argv[1])))

def fib(a)
  if a == 1 or a == 0
    1
  else
    return fib(a-1) + fib(a-2)
  end
end
if __FILE__ == $0
  puts fib(ARGV[0].to_i)
end
Sample task: classify which programming language a piece of source code is written in
[Speech bubbles: “It’s Python” for the first sample, “It’s Ruby” for the second]
7. Classifier
• Set the configuration in the Jubatus server
[Diagram: the Jubatus server's Classifier and Feature Extractor; the configuration below sets up the Feature Extractor]
"converter": {
  "string_types": {
    "bigram": {
      "method": "ngram",
      "char_num": "2"
    }
  },
  "string_rules": [
    {
      "key": "*",
      "type": "bigram",
      "sample_weight": "tf",
      "global_weight": "idf"
    }
  ]
}
8. Classifier
• Configuration JSON
– It defines the “feature vector design”
– A very important step in machine learning
"converter": {
  "string_types": {
    "bigram": {
      "method": "ngram",
      "char_num": "2"
    }
  },
  "string_rules": [
    {
      "key": "*",
      "type": "bigram",
      "sample_weight": "tf",
      "global_weight": "idf"
    }
  ]
}
Annotations:
– "string_types": settings for extracting features from strings
– "bigram": defines a function named “bigram”
– "method": "ngram": the built-in function “ngram”
– "char_num": "2": passes “2” to “ngram” to create “bigram”
– "key": "*": for all data
– "type": "bigram": apply “bigram”
– "sample_weight" / "global_weight": feature weights based on tf-idf (see the Wikipedia article on tf-idf)
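The tf-idf weighting named in the configuration can be illustrated with a small sketch. It uses the common textbook form (term count times log of N over document frequency); the exact formula Jubatus applies is not given in these slides, so treat this as an assumption. The helper names `char_bigrams` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Count every overlapping two-character substring of `text`."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def tf_idf(documents):
    """Weight each bigram by its count (tf) times log(N / document frequency)."""
    counts = [char_bigrams(doc) for doc in documents]
    # Document frequency: in how many documents each bigram appears at least once.
    doc_freq = Counter(key for c in counts for key in c)
    n = len(documents)
    return [
        {key: tf * math.log(n / doc_freq[key]) for key, tf in c.items()}
        for c in counts
    ]


docs = ["def fib(a):", "def fib(a)", "puts fib"]
weights = tf_idf(docs)
# "fi" occurs in every document, so it carries no weight;
# "):" occurs only in the first one, so it is weighted highest there.
print(weights[0]["fi"], weights[0]["):"])
```

Bigrams shared by every sample are pushed toward zero, while bigrams that appear in only a few samples stand out. That is the effect wanted when the goal is to tell languages apart.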
10. Feature Extractor
• What does the bigram extractor do?
bigram extractor
import sys
def fib(a):
    if a == 1 or a == 0:
        return 1
    else:
        return fib(a-1) + fib(a-2)
if __name__ == "__main__":
    print(fib(int(sys.argv[1])))
key value
im 1
mp 1
po 1
... ...
): 1
... ...
de 1
ef 1
... ...
Feature Vector
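As a rough illustration of the step above, character bigrams can be extracted in a few lines of Python. This is only a sketch of the idea, not Jubatus's implementation; the function name `char_bigrams` is hypothetical, and it stores raw occurrence counts where the slide shows 1 for every key.

```python
from collections import Counter


def char_bigrams(text):
    """Count every overlapping two-character substring of `text`."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


source = "import sys\ndef fib(a):\n"
feature_vector = char_bigrams(source)

# Print a few key/value pairs in the same style as the table above.
for key in ("im", "mp", "po", "):", "de", "ef"):
    print(key, feature_vector[key])
```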
11. Classifier
• Train the model with feature vectors
key value
im 1
mp 1
po 1
... ...
): 1
... ...
de 1
ef 1
... ...
Classifier
key value
pu 1
ut 1
... ...
{| ...
|m 1
m| 1
{| 1
en 1
nd 1
key value
@a 1
$_ 1
... ...
my ...
su 1
ub 1
us 1
se 1
... ...
12. Classifier
• Set the configuration in the Jubatus server
Classifier
"method": "AROW",
"parameter": {
  "regularization_weight": 1.0
}
Feature Extractor
bigram extractor
Classifier Algorithms
• Perceptron
• Passive Aggressive
• Confidence Weighted
• Adaptive Regularization of Weights (AROW)
• Normal Herd
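To give a feel for what an online classifier algorithm does, here is a minimal sketch of the simplest one in the list, a multi-class Perceptron over sparse feature vectors. It is not AROW and not Jubatus's code. The class name `OnlinePerceptron` and the tiny feature vectors are made up for illustration.

```python
from collections import defaultdict


class OnlinePerceptron:
    """Multi-class perceptron over sparse {feature: value} vectors."""

    def __init__(self):
        # weights[label][feature] -> float, created lazily on first use
        self.weights = defaultdict(lambda: defaultdict(float))

    def scores(self, features):
        return {
            label: sum(w.get(k, 0.0) * v for k, v in features.items())
            for label, w in self.weights.items()
        }

    def classify(self, features):
        scores = self.scores(features)
        return max(scores, key=scores.get) if scores else None

    def train(self, label, features):
        """Process one example: update weights only when the guess is wrong."""
        predicted = self.classify(features)
        target = self.weights[label]  # registers the label if it is new
        if predicted == label:
            return
        for k, v in features.items():
            target[k] += v
            if predicted is not None:
                self.weights[predicted][k] -= v


python_sample = {"de": 1, "ef": 1, "):": 1}
ruby_sample = {"de": 1, "ef": 1, "en": 1, "nd": 1}

model = OnlinePerceptron()
for _ in range(3):
    model.train("python", python_sample)
    model.train("ruby", ruby_sample)

print(model.classify({"):": 1}))
print(model.classify({"en": 1, "nd": 1}))
```

Each call to `train` touches only the features present in one example, which is what keeps the per-example cost fixed. The other algorithms in the list differ in how large an update they make and whether they track a confidence per feature.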
13. Classifier
• Use the model for a classification task
– Jubatus finds the clues for classification
AROW
key value
si 1
il 1
... ...
{| 1
... ...
It’s Ruby
14. Classifier
• Use the model for a classification task
– Jubatus finds the clues for classification
AROW
key value
re 1
): 1
... ...
s[ 1
... ...
It’s Python
15. Via RPC
• Invoke feature extraction and classification from the client via RPC
[Diagram: the datum passes through the bigram extractor and then the AROW classifier]
lang = client.classify([sourcecode])
import sys
def fib(a):
    if a == 1 or a == 0:
        return 1
    else:
        return fib(a-1) + fib(a-2)
if __name__ == "__main__":
    print(fib(int(sys.argv[1])))
key value
im 1
mp 1
po 1
... ...
): 1
... ...
de 1
ef 1
... ...
It may be Python
16. What can the classifier do?
• You can
– estimate the topic of tweets
– filter spam mail automatically
– detect server failures from syslog
– estimate a user's sentiment from blog posts
– detect malicious attacks
– find which feature is the best clue for classification
17. How to use it?
• See the examples in
http://github.com/jubatus/jubatus-example
– gender
– shogun
– malware classification
– language detection
Editor's Notes
Hello, I’ll talk about Jubatus.
You may have heard of Jubatus, but I suspect you don’t know it well.
In this talk, I hope you’ll come to see what Jubatus can do and how to use it for your own tasks.
Jubatus has three key features.
Jubatus is a distributed online machine-learning framework.
Distributed means it is resilient to machine failure.
Jubatus can also scale out its performance for your task by coordinating a multi-machine cluster.
Online means fixed-time computation.
The Jubatus developers carefully designed the Jubatus API so that users can balance performance against computation time.
Machine learning is a key factor in the Big Data age.
You’ll need more than “word count”.
This is an overview of a Jubatus process.
This red rectangle is one Jubatus process.
Inside the process, there are two components:
the Feature Extractor and the Machine-Learning Model.
You can connect your program to Jubatus via Jubatus RPC,
so you can do machine learning in a client-server model.
The Jubatus client library is implemented in many languages.
You can get the Jubatus client library via gem, pip, cpan, or maven.
If you want to use it from another language, you can use a MessagePack-RPC client at your own risk.
It will work! (I tried JavaScript.)
Jubatus also has many kinds of machine-learning modules.
You can start using them quickly.
Among the 6 machine-learning modules, Classifier, Recommender, and Anomaly Detection will be of great help to you.
I’ll introduce these 3 machine-learning modules.
The classifier classifies data.
As a sample task, suppose you want to detect the programming language of some source code.
In this case, you can classify the language from the text itself.
First of all, you have to set the configuration in the Jubatus server.
The configuration is written in JSON.
In this case, you choose the built-in ngram function and pass the number 2 to it, which gives you a bigram function.
Then you set a rule. With this rule, all inserted data is handled with bigram.
The weights of the words are adjusted with the tf-idf scheme.
Now the Feature Extractor has become a “bigram extractor”.
With this bigram extractor, every datum is split into two-character words.
“import” becomes “im”, “mp”, “po”, “or”, “rt” under the bigram scheme.
This form of datum representation is a Feature Vector.
The bigram extractor extracts bigrams from a datum and produces a Feature Vector.
You extract feature vectors from source code in many languages.
The Jubatus Classifier learns from the feature vectors and creates a model.
Next, the classifier algorithm should be configured.
You can select the classifier algorithm from Perceptron, Passive Aggressive, and the others.
The trained model can classify a datum from its feature vector.
In this case, the Jubatus classifier finds a characteristically Ruby feature such as "{|",
gives Ruby a high score, and estimates that this source code is Ruby.
For another datum, Jubatus finds a characteristically Python feature such as “):”.
Jubatus scores this feature highly and estimates that this source code is Python.
You can do all of these steps via Jubatus RPC.
Over RPC, you give a datum for classification and Jubatus returns the classification result.
All you have to do is write a precise JSON configuration and the client source code.
You can
estimate the topic of a tweet
filter spam mail automatically
detect server failures from syslog
estimate sentiment from blog posts
detect attacks over the network
work out which feature is the best clue for classification
More information on using the classifier is available in the official Jubatus example repository.
These 4 samples may be useful for study.
The next Jubatus algorithm is the recommender.
Given this “movie and review rating matrix”, which movie should we recommend to Ann?
Jubatus can answer.
Imagine a high-dimensional rating space.
Star Wars lovers and Star Trek lovers are relatively close.
Both movies are science fiction.
Ann and Emily are relatively close.
These distances are useful for recommendation,
because human preferences tend to be similar.
In this case, Ann would like Frozen.
The Jubatus recommender server consists of a Feature Extractor and a recommender engine.
The feature extractor is exactly the same as the classifier’s.
Jubatus calculates the distance between feature vectors.
As in the former example, the Jubatus recommender extracts feature vectors from source code, and the recommender engine maps each vector into the feature space.
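The idea of measuring closeness between feature vectors can be sketched with cosine similarity over sparse vectors. This is one common choice, used here as an assumption; the notes do not say which measure the recommender engine uses. The ratings below are invented for illustration, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {key: value} vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, others):
    """Name of the vector in `others` that is closest to `target`."""
    return max(others, key=lambda name: cosine_similarity(target, others[name]))


# Made-up ratings, only to show the mechanics.
ann = {"Star Wars": 5, "Star Trek": 4}
others = {
    "Emily": {"Star Wars": 5, "Star Trek": 5, "Frozen": 4},
    "Bob": {"Titanic": 5},
}

neighbor = most_similar(ann, others)
# Recommend what the nearest neighbor rated but Ann has not seen yet.
suggestions = [m for m in others[neighbor] if m not in ann]
print(neighbor, suggestions)
```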
You can
create a recommendation engine
calculate the similarity of tweets
find NBA players with similar tendencies
visualize the distance between “Star Wars” and “Star Trek”
Note that you can use the recommender for more than recommendation.
The recommender is based on an unsupervised algorithm.
So you cannot
label data (use the classifier!)
get a decision tree
And because it is nearest-neighbor-based recommendation, you cannot
get a-priori-based recommendations
Another algorithm is Anomaly Detection.
It calculates “How far is this datum from the others?”
Jubatus can detect outliers in a mass of data.
As an easy approach, you might use the recommender’s distance score to find outliers.
However, distances are not homogeneous across the data, so raw distance alone cannot be used to discover outliers.
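The point about non-homogeneous distances can be made concrete with a small sketch. It compares each point's mean distance to its nearest neighbors with the same quantity for those neighbors, a simplified local-density ratio in the spirit of the Local Outlier Factor. The notes do not name the algorithm Jubatus uses, so this is an assumption. The points are made up and the function names are hypothetical.

```python
import math


def knn_indices(i, data, k):
    """Indices of the k points nearest to data[i], excluding i itself."""
    others = [j for j in range(len(data)) if j != i]
    return sorted(others, key=lambda j: math.dist(data[i], data[j]))[:k]


def mean_knn_distance(i, data, k):
    """Average distance from data[i] to its k nearest neighbors."""
    neighbors = knn_indices(i, data, k)
    return sum(math.dist(data[i], data[j]) for j in neighbors) / len(neighbors)


def local_outlier_score(i, data, k=2):
    """Ratio of a point's neighbor distance to that of its neighbors (about 1 is normal)."""
    neighbors = knn_indices(i, data, k)
    typical = sum(mean_knn_distance(j, data, k) for j in neighbors) / len(neighbors)
    return mean_knn_distance(i, data, k) / typical


data = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # dense cluster
    (10, 10), (10, 14), (14, 10), (14, 14),  # sparse cluster
    (3, 3),                                  # sits just outside the dense cluster
]

for i, point in enumerate(data):
    print(point, round(mean_knn_distance(i, data, 2), 2), round(local_outlier_score(i, data), 2))
```

Points in the sparse cluster have a larger raw neighbor distance than (3, 3), yet only (3, 3) is far away relative to its own neighborhood. That is why a raw distance threshold is not enough.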
The anomaly detection server consists of a Feature Extractor and an anomaly detection engine.
The feature extractor is exactly the same as the classifier’s and the recommender’s.
Jubatus finds outliers from feature vectors.
As with the recommender, Jubatus detects anomalies from Feature Vectors.
You access this procedure via RPC too.
You can (possibly)
find outliers
detect or predict server failures
protect a service against zero-day attacks
know the trend of the entire data stream
You cannot
get the most common datum
get a cluster map of the data
automatically diagnose why a datum is an outlier
Jubatus has a built-in feature extractor alongside its algorithms.
Users should configure both the feature extractor and the algorithm properly.
Clients use the configured machine learning via Jubatus RPC.
Classifier, Recommender, and Anomaly Detection may be useful for your task.