What is Jubatus?
How does it work for you?
NTT SIC
Hiroki Kumazaki
Jubatus is…
• A Distributed Online Machine-Learning framework
– An OSS project developed in Japan
• GPL2.0
• Distributed
– Fault-Tolerance
– Scale out
• Online
– Fixed time computation
• Machine-Learning
– More than “word count”!
Architecture
• The ML model is combined with a feature extractor
[Diagram: a Jubatus Server containing a Feature Extractor and a Machine Learning Model, reached via Jubatus RPC]
Architecture
• Multi-language client library
– Available via gem, pip, cpan, and maven
– Under the hood it uses MessagePack-RPC
• So you can use OCaml, Haskell, JavaScript, or Go at your own risk
[Diagram: Client connected to the server via Jubatus RPC]
Architecture
• Many ML algorithms
– Classifier
– Recommender
– Anomaly Detection
– Clustering
– Regression
– Graph Mining
Useful!
Classifier
• Task: Classification of Datum
Sample task: classify which programming language a piece of source code is written in

It's Python:

    import sys
    def fib(a):
        if a == 1 or a == 0:
            return 1
        else:
            return fib(a-1) + fib(a-2)
    if __name__ == "__main__":
        print(fib(int(sys.argv[1])))

It's Ruby:

    def fib(a)
      if a == 1 or a == 0
        1
      else
        return fib(a-1) + fib(a-2)
      end
    end
    if __FILE__ == $0
      puts fib(ARGV[0].to_i)
    end
Classifier
• Set the configuration on the Jubatus server
[Diagram: Feature Extractor → Classifier]
Feature Extractor configuration:

    "converter": {
      "string_types": {
        "bigram": {
          "method": "ngram",
          "char_num": "2"
        }
      },
      "string_rules": [
        {
          "key": "*",
          "type": "bigram",
          "sample_weight": "tf",
          "global_weight": "idf"
        }
      ]
    }
Classifier
• Configuration JSON
– It defines the "feature vector design"
– This is a very important step in machine learning

    "converter": {
      "string_types": {
        "bigram": {
          "method": "ngram",
          "char_num": "2"
        }
      },
      "string_rules": [
        {
          "key": "*",
          "type": "bigram",
          "sample_weight": "tf",
          "global_weight": "idf"
        }
      ]
    }

Annotations on the configuration:
– Settings for extracting features from strings
– Define a function named "bigram"
– "ngram" is a built-in function
– Pass "2" to "ngram" to create "bigram"
– For all data ("key": "*"), apply "bigram"
– Feature weights are based on tf-idf (see the Wikipedia article on tf-idf)
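
The "tf" and "idf" settings above can be illustrated with a small sketch. This is not Jubatus code: it assumes the textbook definition idf = log(N / df), and the function name `tfidf` and the sample counts are made up for illustration. The exact weighting formula Jubatus applies may differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each feature by tf * idf.

    docs is a list of {feature: count} dicts (the counts are the "tf" part).
    Assumption: idf = log(N / df), the textbook form; Jubatus may use a variant.
    """
    n_docs = len(docs)
    df = Counter()                      # in how many documents each feature occurs
    for doc in docs:
        df.update(doc.keys())
    return [
        {f: tf * math.log(n_docs / df[f]) for f, tf in doc.items()}
        for doc in docs
    ]

# Hypothetical bigram counts for two source files.
docs = [
    {"de": 2, "ef": 2, "):": 1},        # Python-like
    {"de": 1, "en": 3, "nd": 3},        # Ruby-like
]
weights = tfidf(docs)
print(weights[0])
```

A bigram such as "de" that occurs in every document gets weight zero, while a bigram seen in only one document keeps a positive weight. That is why tf-idf helps the classifier focus on distinguishing features.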
Classifier
• Feature Extractor becomes “bigram extractor”
[Diagram: bigram extractor → Classifier]
Feature Extractor
• What does the bigram extractor do?
Input to the bigram extractor:

    import sys
    def fib(a):
        if a == 1 or a == 0:
            return 1
        else:
            return fib(a-1) + fib(a-2)
    if __name__ == "__main__":
        print(fib(int(sys.argv[1])))

Output (Feature Vector):

    key   value
    im    1
    mp    1
    po    1
    ...   ...
    ):    1
    ...   ...
    de    1
    ef    1
    ...   ...
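
As a rough illustration of this step, character bigrams can be counted in a few lines of Python. This is a minimal sketch, not the Jubatus implementation: the helper name `char_ngrams` is hypothetical, and how Jubatus treats whitespace or newlines is not covered here.

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Count every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

features = char_ngrams("import sys")
for key in ("im", "mp", "po"):
    print(key, features[key])
```

With `n=2` this mirrors passing "2" to the built-in "ngram" method in the configuration.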
Classifier
• Train the model with feature vectors
key value
im 1
mp 1
po 1
... ...
): 1
... ...
de 1
ef 1
... ...
Classifier
key value
pu 1
ut 1
... ...
{| ...
|m 1
m| 1
{| 1
en 1
nd 1
key value
@a 1
$_ 1
... ...
my ...
su 1
ub 1
us 1
se 1
... ...
Classifier
• Set the configuration on the Jubatus server
[Diagram: bigram extractor (Feature Extractor) → Classifier]
Classifier configuration:

    "method" : "AROW",
    "parameter" : {
      "regularization_weight" : 1.0
    }

Classifier Algorithms
• Perceptron
• Passive Aggressive
• Confidence Weighted
• Adaptive Regularization of Weights (AROW)
• Normal Herd
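
To give a feel for what an online algorithm such as AROW does with each feature vector, here is a minimal sketch of the binary AROW update with a diagonal covariance. It follows the published update rule, not the Jubatus source. The class name `DiagonalArow` is hypothetical, and how the hyperparameter `r` relates to Jubatus's "regularization_weight" is not established here. Jubatus itself handles more than two labels.

```python
from collections import defaultdict

class DiagonalArow:
    """Binary AROW with a diagonal covariance, over sparse feature dicts."""

    def __init__(self, r=1.0):
        self.r = r                               # regularization hyperparameter (assumed name)
        self.w = defaultdict(float)              # mean weight per feature
        self.sigma = defaultdict(lambda: 1.0)    # per-feature variance (uncertainty)

    def score(self, x):
        return sum(self.w.get(f, 0.0) * v for f, v in x.items())

    def train(self, x, y):
        """One online update; y is +1 or -1. Cost is proportional to len(x)."""
        margin = y * self.score(x)
        if margin >= 1.0:
            return                               # already correct with enough margin
        variance = sum(self.sigma[f] * v * v for f, v in x.items())
        beta = 1.0 / (variance + self.r)
        alpha = (1.0 - margin) * beta
        for f, v in x.items():
            self.w[f] += alpha * y * self.sigma[f] * v
            self.sigma[f] -= beta * (self.sigma[f] * v) ** 2

# Toy usage: +1 stands for Ruby, -1 for Python.
model = DiagonalArow(r=1.0)
model.train({"{|": 1.0}, +1)
model.train({"):": 1.0}, -1)
print(model.score({"{|": 1.0}), model.score({"):": 1.0}))
```

Each call to `train` touches only the features present in one datum, which is what makes the computation time per example fixed and the learning "online".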
Classifier
• Use the model for a classification task
– Jubatus finds the clues for classification
Input to the AROW model:

    key   value
    si    1
    il    1
    ...   ...
    {|    1
    ...   ...

Result: It's Ruby
Classifier
• Use the model for a classification task
– Jubatus finds the clues for classification
Input to the AROW model:

    key   value
    re    1
    ):    1
    ...   ...
    s[    1
    ...   ...

Result: It's Python
Via RPC
• Invoke feature extraction and classification from the client via RPC
[Diagram: client → bigram extractor → AROW]

    lang = client.classify([sourcecode])

Datum sent by the client:

    import sys
    def fib(a):
        if a == 1 or a == 0:
            return 1
        else:
            return fib(a-1) + fib(a-2)
    if __name__ == "__main__":
        print(fib(int(sys.argv[1])))

Feature vector extracted on the server:

    key   value
    im    1
    mp    1
    po    1
    ...   ...
    ):    1
    ...   ...
    de    1
    ef    1
    ...   ...

Result: It may be Python
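
Once the classification result comes back, the client usually wants the single most likely label. The sketch below assumes the result for one datum is a list of (label, score) pairs. That shape, the helper name `best_label`, and the sample scores are assumptions for illustration. The real client library's return type may differ, so check its documentation.

```python
def best_label(scored_labels):
    """Return the label with the highest score, or None for an empty result."""
    if not scored_labels:
        return None
    label, _score = max(scored_labels, key=lambda pair: pair[1])
    return label

# Hypothetical scores for one datum; not real Jubatus output.
result = [("ruby", -0.42), ("python", 1.37), ("perl", -0.95)]
print(best_label(result))
```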
What can the classifier do?
• You can:
– estimate the topic of tweets
– trash spam mail automatically
– detect server failures from syslog
– estimate user sentiment from blog posts
– detect malicious attacks
– find which feature is the best clue for classification
How do I use it?
• See the examples at http://github.com/jubatus/jubatus-example
– gender
– shogun
– malware classification
– language detection


Editor's Notes

  1. Hello, I'll be talking about Jubatus. You may have heard of Jubatus, but I suspect you don't know it well. In this talk, I hope you'll come to see what Jubatus can do and how to use it for your own tasks.
  2. Jubatus has three features: it is a distributed, online, machine-learning framework. Distributed means it is resilient to machine failure, and Jubatus can increase its performance on your task by coordinating a multi-machine cluster. Online means fixed-time computation. The Jubatus developers carefully designed the Jubatus API so that users can balance performance against computation time. Machine learning is a key factor in the Big Data age. You'll need more than "word count".
  3. This is an overview of a Jubatus process. The red rectangle is one Jubatus process. Inside the process there are two components: the Feature Extractor and the Machine Learning Model. You can connect your program to Jubatus via Jubatus RPC, so you can do machine learning in a client-server model.
  4. The Jubatus client library is implemented in many languages. You can get it via gem, pip, cpan, or maven. If you want to use another language, you can use a MessagePack-RPC client at your own risk. It will work! (I tried JavaScript.)
  5. Jubatus has many kinds of machine-learning modules, and you can start using them quickly. Among the six modules, Classifier, Recommender, and Anomaly Detection will be a great help to you. I'll introduce these three.
  6. The classifier classifies data. As a sample task, you may want to detect the programming language of a piece of source code. In this case, you can classify the language from a sequence of text.
  7. First of all, you have to set the configuration on the Jubatus server. The configuration is written in JSON.
  8. In this case, you choose the built-in ngram function and pass the number 2 to it, which gives you a bigram function. Then you set a rule. Under this rule, all inserted data is handled with bigram, and the feature weights are adjusted with the tf-idf scheme.
  9. Now the Feature Extractor becomes a "bigram extractor".
  10. With this bigram extractor, every datum is split into two-character words. "import" becomes "im", "mp", "po", "or", "rt" under the bigram scheme. This form of datum representation is the Feature Vector. The bigram extractor extracts bigrams from a datum and produces a Feature Vector.
  11. You extract feature vectors from source code in many languages. The Jubatus Classifier learns from the feature vectors and creates a model.
  12. Next, the classifier algorithm should be configured. You can select the classifier algorithm from Perceptron, Passive Aggressive, and the others.
  13. The trained model can classify a datum from its feature vector. In this case, the Jubatus classifier finds a feature characteristic of Ruby, such as "{|", and scores Ruby highly, so Jubatus estimates that this source code is Ruby.
  14. For another datum, Jubatus finds a feature characteristic of Python, such as "):". Jubatus scores this feature highly and estimates that this source code is Python.
  15. You can do all of this via Jubatus RPC. Over RPC, you give a datum for classification and Jubatus returns the classification result. All you have to do is write a precise JSON configuration and the client source code.
  16. You can estimate the topic of a tweet, trash spam mail automatically, monitor server failures from syslog, estimate sentiment from blog posts, detect attacks over the network, and calculate which feature is the best clue for classification.
  17. More information on using the classifier is available in the official Jubatus example repository. These four samples may be useful for study.
  18. The next Jubatus algorithm is the recommender. Given this "movie and review rating matrix", which movie should we recommend to Ann? Jubatus can answer.
  19. Imagine a high-dimensional rating space. Star Wars lovers and Star Trek lovers are relatively close, because both movies are science fiction. Ann and Emily are relatively close. These distances are useful for recommendation, because human preferences tend to be similar.
  20. In this case, Ann would like Frozen.
  21. The Jubatus recommender server consists of a Feature Extractor and a recommender engine. The Feature Extractor is exactly the same as the classifier's. Jubatus calculates the distance between feature vectors.
  22. As in the earlier example, the Jubatus recommender extracts feature vectors from source code, and the recommender engine maps each vector into feature space.
  23. You can create a recommendation engine, calculate the similarity of tweets, find NBA players with similar tendencies, and visualize the distance between "Star Wars" and "Star Trek". Note that you can use the recommender for more than recommendation.
  24. The recommender is based on an unsupervised algorithm, so you cannot label data (use the classifier!) or get a decision tree. It is also a nearest-neighbor-based recommendation, so you cannot get Apriori-based recommendations.
  25. Another algorithm is Anomaly Detection. It calculates how far a datum is from the others.
  26. Jubatus can detect outliers in a mass of data.
  27. As a simple approach, you might think of using the recommender's distance score to find outliers. But distance is not homogeneous, so it cannot be used to discover outliers.
  28. The anomaly detection server consists of a Feature Extractor and an anomaly detection engine. The Feature Extractor is exactly the same as the classifier's and the recommender's. Jubatus finds outliers from feature vectors.
  29. As with the recommender, Jubatus detects anomalies from Feature Vectors. You access this procedure via RPC too.
  30. You can (possibly) find outliers, detect or predict server failures, protect a service against zero-day attacks, and learn the trend of the entire data stream.
  31. You cannot get the most common datum, get a cluster map of the data, or automatically diagnose why a datum is an outlier.
  32. Jubatus has built-in feature extractors alongside its algorithms. The user should configure both the feature extractor and the algorithm properly. The client uses the configured machine learning via Jubatus RPC. Classifier, Recommender, and Anomaly Detection may be useful for your task.
  33. I will try running jubatus-example.
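
Notes 19 to 22 describe recommendation as finding nearby feature vectors. A minimal sketch of that idea follows, assuming cosine similarity as the closeness measure. The notes do not say which measure Jubatus uses, and the helper names, the ratings, and the "Cinderella" title are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(f, 0.0) for f, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest(query, rows):
    """Return the key of the row most similar to query."""
    return max(rows, key=lambda name: cosine(query, rows[name]))

# Hypothetical ratings; only the idea of a rating matrix comes from the notes.
ann = {"Cinderella": 5.0, "Star Wars": 1.0}
others = {
    "Emily": {"Cinderella": 4.0, "Frozen": 5.0},
    "Bob": {"Star Wars": 5.0, "Star Trek": 5.0},
}
print(nearest(ann, others))
```

In this toy data Ann's nearest neighbor is Emily, so a movie Emily rated highly that Ann has not seen, such as Frozen, becomes the recommendation.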