Real World Machine Learning in Java 8 at Fumankaitori.com

Mathieu Dumoulin, Chief Data Scientist at fumankaitori.com,
Data Science Team Manager at en-japan

Today’s menu
● About me and 不満買取センター (Fuman Kaitori Center)
● The business problem: Post pricing
● Project Overview
○ Why use ML
○ How to use ML in projects
○ How we used ML in this project
● Results
● Live code (depends on time)
● Conclusion

Presentation goals
● Any Java engineer can do machine learning
● Java is a great programming language for real-world
machine learning systems
● New ML APIs make it easy to focus on the problem
and the data, and get a well-performing model “for
free”
● You don’t need a PhD to use machine learning,
just some self-study, good tools and libraries, and
experience built one project at a time

[Figure: map of Quebec City]

My Work: Java SE, Hadoop Engineer, Data Scientist

● Launched in March 2015 as web, Android, and iOS
applications.
● An application that collects data about people's
dissatisfactions.
● Features:
○ Users can post any dissatisfaction with any product or service.
○ Users earn points as a reward for their posts. Points can be
exchanged for coupon codes at e-commerce (EC) sites.
● 250,000 users and 1,500,000 posts in total
(as of the end of Nov 2015)

Problem statement: post point value prediction
● Fuman user posts have a monetary value
● We want to give more points for “good”
posts
● At first, operations staff checked every
post, but they can’t check 10,000 posts
each day...
We made rules, but the point values got worse:
● Rules can’t check the content of the posts
● Rules always miss something
● Making hundreds or thousands of rules by
hand is ridiculous

ML is the best solution for 不満買取センター
● ML problem: Estimate the point value of a user post (0-25)
● Project goal: Estimate the value of posts to within 5 points of
human judgement
● Data: All user posts and user profile data
● Data with known output (labels): staff had already set points for 200k
posts manually
This is a classic case of supervised learning (see Wikipedia; Microsoft also has a reference on the topic).
Predicting a price calls for a regression model because the prediction is a number, as
opposed to a classification problem, which predicts which of a fixed set of classes each post belongs to.

Real world ML project overview
● Machine Learning Workflow
● Data Scientist and Java Engineer roles
● Java for production ML
● Java 8 benefits
● Our point prediction system details
● Results

Machine Learning Workflow
Training (iterate until the best model is found):
Load data → [data, labels (known result)] → Extract Features → [feature vectors, labels] → Train Model → [prediction, labels] → Evaluate vs. business goal
Prediction (uses the best model and the same feature extraction):
Load new data → [data] → Extract Features → [feature vectors] → Predict using model → [predictions] → Act on prediction

Workflow for machine learning system
1. Set a goal with business value
2. Get data (fuman user posts) with a price already set
3. Transform data for input into the machine learning algorithm
4. Train and evaluate the machine learning model until the goal is reached
5. Deploy the best model

Data Scientist’s role
Choose features
Build many models

Software Engineer’s role
Implement and integrate into production system
Get data from data source
Implement production code

But we don’t have a data scientist...

Java for production ML
● Easy integration with Java applications
● Fast (vs. Python or R)
● Easy to program (vs. C++)
● The most common enterprise programming language, with IDE support and
excellent supporting libraries
● Many state-of-the-art machine learning libraries have a Java API

Benefits of Java 8
● Java 8’s functional style is a very good match for ML operations
○ Feature extraction: data in → transform → data out
● Java 8’s streams and lambdas
○ Code is easier to understand and less verbose
● Easy parallel code
○ Faster “for free”
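
As a rough illustration of these points, the sketch below maps raw post text to a small feature vector with a stream pipeline and switches to parallel execution by changing one call. The class name, the two features, and the sample posts are hypothetical and are not taken from the production system.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StreamFeatureSketch {

    // Hypothetical extractor: data in -> transform -> data out.
    // Features: [number of code points, number of '!' or '?' marks].
    static double[] extract(String post) {
        long length = post.codePoints().count();
        long marks = post.codePoints().filter(c -> c == '!' || c == '?').count();
        return new double[] { length, marks };
    }

    // Sequential pipeline.
    static List<double[]> extractAll(List<String> posts) {
        return posts.stream()
                .map(StreamFeatureSketch::extract)
                .collect(Collectors.toList());
    }

    // Same pipeline run in parallel; the result keeps the input order.
    static List<double[]> extractAllParallel(List<String> posts) {
        return posts.parallelStream()
                .map(StreamFeatureSketch::extract)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> posts = Arrays.asList("Too slow!", "Why so expensive?!", "ok");
        extractAllParallel(posts).forEach(v -> System.out.println(Arrays.toString(v)));
    }
}
```

Parallel streams tend to pay off only when the per-post work is heavy enough to outweigh the coordination cost, so it is worth measuring both versions on real data.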

Post point prediction system: step by step
Fuman DB → [posts, label] → Feature Extraction → [CSV format] → modeling (steps below) → DR Prediction API → Prediction Service
● Train/Test split
● Categorical features transformation
● Select best features
● Try many algorithms
● Tune algorithms
● Evaluate models
● REST Prediction API
Iterate until results meet business goals

Feature Extraction details
● We added character and word statistics about each fuman user post
○ Number of hiragana, katakana, kanji, and alphabet characters and words
○ Number of words, length of words
○ Ratio of hiragana, katakana, kanji, and alphabet words to the number of tokens in a
post
● User profile information
○ age, gender, job category, etc.
● Bag-of-words models:
○ Words weighted with Tf-Idf, after removing stopwords (これ "this", あれ "that over there", それ "that", です "is", など "and so on", …)
○ Part-of-speech (noun, verb, adjective, …)
○ Word type features (hiragana word, katakana word, kanji word, …)
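
To make the Tf-Idf bullet concrete, here is a minimal sketch of the weighting on already-tokenized posts with stopword removal. It uses the plain formulation count × ln(N / df); libraries often apply smoothed variants, and the source does not say which one was used. The class name, tokens, and stopword set are hypothetical placeholders (romanized for simplicity).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class TfIdfSketch {

    // Document frequency: number of posts in which each non-stopword token appears.
    static Map<String, Long> documentFrequency(List<List<String>> posts, Set<String> stopwords) {
        return posts.stream()
                .flatMap(tokens -> tokens.stream().filter(t -> !stopwords.contains(t)).distinct())
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
    }

    // Tf-Idf weights for one post: (count in post) * ln(totalPosts / df).
    // Tokens never seen during training are treated as df = 1 (an assumption).
    static Map<String, Double> tfIdf(List<String> post, Map<String, Long> df,
                                     int totalPosts, Set<String> stopwords) {
        Map<String, Long> tf = post.stream()
                .filter(t -> !stopwords.contains(t))
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        Map<String, Double> weights = new HashMap<>();
        tf.forEach((token, count) ->
                weights.put(token, count * Math.log((double) totalPosts / df.getOrDefault(token, 1L))));
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> posts = Arrays.asList(
                Arrays.asList("potato", "fresh", "the"),
                Arrays.asList("potato", "cold"));
        Set<String> stopwords = Collections.singleton("the");
        Map<String, Long> df = documentFrequency(posts, stopwords);
        System.out.println(tfIdf(posts.get(0), df, posts.size(), stopwords));
    }
}
```

A token that appears in every post ("potato" here) gets weight 0, while a token unique to one post gets the highest weight, which is the property that makes the weighting useful as a feature.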

Feature Extraction: Example
マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。
(I asked for freshly fried McDonald's fries, but they weren't freshly fried.)

Feature Example: MeCab analyzer
マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。
マック	名詞,固有名詞,一般,*,*,*,マック,マック,マック
の	助詞,連体化,*,*,*,*,の,ノ,ノ
ポテト	名詞,一般,*,*,*,*,ポテト,ポテト,ポテト
揚げたて	名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
で	助詞,格助詞,一般,*,*,*,で,デ,デ
お願い	名詞,サ変接続,*,*,*,*,お願い,オネガイ,オネガイ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
のに	助詞,接続助詞,*,*,*,*,のに,ノニ,ノニ
、	記号,読点,*,*,*,*,、,、,、
揚げたて	名詞,一般,*,*,*,*,揚げたて,アゲタテ,アゲタテ
じゃ	助詞,副助詞,*,*,*,*,じゃ,ジャ,ジャ
なかっ	助動詞,*,*,*,特殊・ナイ,連用タ接続,ない,ナカッ,ナカッ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。	記号,句点,*,*,*,*,。,。,。
EOS

Feature Extraction: Example
Character counts
Hiragana: 20
Katakana: 6
Kanji: 3
Alpha: 0
Digits: 0
Marks (!,?): 0
Token type counts
Hiragana: 8
Katakana: 2
Kanji: 3
Alpha: 0
Digits: 0
Marks: 0
Token length
1: 5
2: 2
3: 4
4: 2
5+: 0
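
The character counts above can be reproduced with the JDK alone. The sketch below classifies each code point with `Character.UnicodeScript`, where kanji map to the `HAN` script. The class and method names are hypothetical, and this is not necessarily how the original system counted characters.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

class CharClassCounter {

    // Counts code points per Unicode script (HIRAGANA, KATAKANA, HAN, LATIN, ...).
    static Map<UnicodeScript, Integer> countScripts(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
                .mapToObj(UnicodeScript::of)
                .forEach(script -> counts.merge(script, 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // Pass the post text as command-line arguments.
        String post = String.join(" ", args);
        Map<UnicodeScript, Integer> counts = countScripts(post);
        System.out.println("Hiragana: " + counts.getOrDefault(UnicodeScript.HIRAGANA, 0));
        System.out.println("Katakana: " + counts.getOrDefault(UnicodeScript.KATAKANA, 0));
        System.out.println("Kanji: " + counts.getOrDefault(UnicodeScript.HAN, 0));
        System.out.println("Alpha: " + counts.getOrDefault(UnicodeScript.LATIN, 0));
    }
}
```

Run on the example post, this gives Hiragana 20, Katakana 6, and Kanji 3, matching the slide. One caveat: punctuation, digits, and the prolonged sound mark ー are classified as `COMMON` rather than as kana, so a production version would need extra rules for them.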

Training and evaluation of our model

We reached the project goal!
● DataRobot’s best model
○ eXtreme Gradient Boosted Trees
○ RMSE: 3.54
○ MSE: 12.53
Business result:
● Higher quality evaluation than rules
● Operations staff no longer need to check posts manually
● We can validate points every day
Our result: a typical difference (RMSE) of about 3.5 points from human judgement, within the 5-point goal
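
For readers new to these metrics: MSE is the mean of the squared differences between predicted and human-assigned points, and RMSE is its square root, which puts the error back in points. The sketch below computes both; the class name and sample values are hypothetical.

```java
import java.util.stream.IntStream;

class RegressionMetrics {

    // Mean squared error between labels and predictions.
    static double mse(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        return IntStream.range(0, actual.length)
                .mapToDouble(i -> Math.pow(predicted[i] - actual[i], 2))
                .average()
                .getAsDouble();
    }

    // Root mean squared error: same unit as the labels (points).
    static double rmse(double[] actual, double[] predicted) {
        return Math.sqrt(mse(actual, predicted));
    }

    public static void main(String[] args) {
        double[] staffPoints = { 10, 20 };   // hypothetical human labels
        double[] modelPoints = { 12, 16 };   // hypothetical predictions
        System.out.println("MSE  = " + mse(staffPoints, modelPoints));
        System.out.println("RMSE = " + rmse(staffPoints, modelPoints));
    }
}
```

The two figures reported above are consistent with each other: the square root of 12.53 is approximately 3.54.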

Deployment issues
● Problem: The Prediction API was very slow (>1s / post), so we
had to run it as a batch process each night.
● Goal: Make predictions locally with low latency, without losing
the good prediction performance we already have.
We solved this problem using the
excellent open source, distributed
machine learning library H2O by H2O.ai.
Co-founder: Cliff Click, who wrote the
Java HotSpot Server Compiler

Post point prediction system: Current system
Fuman DB → [posts, label] → Feature Extraction → [CSV format] → model training (steps below) → Prediction POJO → Prediction Service
● Train/Test split
● Categorical features transformation
● Distributed, fast and state-of-the-art algorithms
● POJO prediction class generation
Fuman Webapp and Prediction Service: get new post values, make feature vectors

Overview: Making Predictions
● Use the prediction POJO generated
by H2O
● For each new post, query the Prediction
Service
○ Convert to vector (Double[] for H2O)
○ Get prediction from prediction POJO
(Double value, rounded to integer)
○ Update database with predicted price
We reached the business goal!
Project goal: Get similar performance from H2O as from DataRobot
H2O is not ideal for exploring different models and features, but for
production it is FAST, with similar predictive performance. It is
implemented in pure Java (Github).
● H2O: Train a new model for
production
○ GBM (Gradient Boosting Machine)
○ MSE: 12.8
● DataRobot’s best model
○ eXtreme Gradient Boosted Trees
○ RMSE: 3.54
○ MSE: 12.53

Real world ML loves Java!
● Java is a top choice for building production machine
learning systems
● Java 8’s features make Java fun and relevant again
● Integration in a Java web application was not hard
● Java is not a good choice for experimentation
○ Start with a Python prototype with Scikit-learn
○ Use a Machine Learning service like DataRobot.com

You can use ML in your projects!
● Web API services are like a personal data
scientist
○ No need for a Data Scientist for simple uses of ML
○ But harder datasets will need expertise
● Real world ML projects need engineers:
○ Get data to train a good model (log files, sales results,
mail campaign results,…)
○ Transform data into input for ML library or web service
○ Deploy and integrate into production
● Most steps are just normal programming
○ Get data from DB
○ Transform data into a CSV
○ Call a REST API or Java POJO to make predictions
○ Integrate with the system that needs predictions

Feature engineering with streams and lambdas
The goal is to take raw data from the DB and create arrays of numerical or
categorical features.
1. Get Fuman user post data from DB -> UserPost
2. Learn the vocabulary of word types across all user posts
3. Create the dataset:
a. For each post,
i. Add the statistics features
ii. Add the word type features
4. Transform to CSV output (for DataRobot)
Instances are Weka SparseInstance objects (sparse vectors for memory efficiency). In
retrospect, I think a specialized vector library would have been better: Weka is a
terrible production library.
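
A simplified sketch of steps 2 to 4 with streams follows. It uses dense rows and plain strings instead of Weka SparseInstance, and one statistics feature (token count) instead of the full set. The `UserPost` class here is a hypothetical stand-in for the entity loaded from the DB, and tokens are assumed to contain no commas or quotes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CsvDatasetSketch {

    // Hypothetical stand-in for a post loaded from the DB.
    static class UserPost {
        final List<String> tokens;
        final int points;

        UserPost(List<String> tokens, int points) {
            this.tokens = tokens;
            this.points = points;
        }
    }

    // Step 2: the vocabulary is the sorted set of distinct tokens.
    static List<String> learnVocabulary(List<UserPost> posts) {
        return posts.stream()
                .flatMap(p -> p.tokens.stream())
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    // Step 3: statistics feature, then one count per vocabulary entry, then the label.
    static String toCsvRow(UserPost post, List<String> vocabulary) {
        Stream<String> stats = Stream.of(String.valueOf(post.tokens.size()));
        Stream<String> words = vocabulary.stream()
                .map(w -> String.valueOf(Collections.frequency(post.tokens, w)));
        Stream<String> label = Stream.of(String.valueOf(post.points));
        return Stream.concat(Stream.concat(stats, words), label)
                .collect(Collectors.joining(","));
    }

    // Step 4: header line plus one row per post.
    static String toCsv(List<UserPost> posts) {
        List<String> vocabulary = learnVocabulary(posts);
        String header = Stream.concat(
                Stream.concat(Stream.of("token_count"), vocabulary.stream().map(w -> "w_" + w)),
                Stream.of("points"))
                .collect(Collectors.joining(","));
        String rows = posts.stream()
                .map(p -> toCsvRow(p, vocabulary))
                .collect(Collectors.joining("\n"));
        return header + "\n" + rows;
    }

    public static void main(String[] args) {
        List<UserPost> posts = Arrays.asList(
                new UserPost(Arrays.asList("potato", "cold", "potato"), 5),
                new UserPost(Arrays.asList("slow"), 2));
        System.out.println(toCsv(posts));
    }
}
```

With a real vocabulary of thousands of words, most cells are zero, which is why the original used sparse vectors.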
