This document summarizes a presentation about using machine learning in Java 8 at Fumankaitori.com. The presentation introduces the speaker and their company, which collects user dissatisfaction posts and rewards users with points that can be exchanged for coupons. Their goal was to automate point assignment for posts using machine learning instead of manual rules. They trained an XGBoost model in DataRobot that achieved their goal of predicting points within 5 points of human labels. For production, they achieved similar performance using H2O to train a gradient boosted machine model and generate a prediction POJO for low-latency predictions. The presentation emphasizes that machine learning is possible for any Java engineer and that Java 8 features like streams make it a good choice for real-world machine learning systems.
Real World Machine Learning in Java 8 at Fumankaitori.com
1. Real World Machine Learning in Java 8 at Fumankaitori.com
Mathieu Dumoulin, Chief Data Scientist at fumankaitori.com,
Data Science Team manager at en-japan
2. Today’s menu
● About me and 不満買取センター (Fuman Kaitori Center)
● The business problem: Post pricing
● Project Overview
○ Why use ML
○ How to use ML in projects
○ How we used ML in this project
● Results
● Live code (depends on time)
● Conclusion
3. Presentation goals
● Any Java engineer can do machine learning
● Java is a great programming language for real-world machine learning systems
● New ML APIs make it easy to focus on the problem and the data, and get a well-performing model “for free”
● You don’t need a PhD to use machine learning, just some self-study, good tools and libraries, and experience built one project at a time
7. ● Launched in March 2015. Provides web, Android and iOS applications.
● An application that collects data about people's dissatisfactions.
● Features:
○ Users can post any dissatisfaction with any product or service.
○ Users get points as a reward for their posts. Points can be exchanged for coupon codes on e-commerce (EC) sites.
● 250,000 users and 1,500,000 posts (cumulative, as of the end of Nov 2015)
8. Problem statement: post point value prediction
● Fuman user posts have a monetary value
● We want to give more points for “good” posts
● At first, operations staff checked all posts, but they can’t check 10,000 posts each day...
We made rules, but the resulting point values were worse:
● Rules can’t check the content of the posts
● Rules always miss something
● Making hundreds or thousands of rules by hand is ridiculous
9. ML is the best solution for 不満買取センター
● ML problem: Estimate the point value of a user post (0-25)
● Project goal: Estimate the value of posts with less than 5 points of difference from human judgement
● Data: All user posts and user profile data
● Data with known output (labels): staff have already set points manually for 200k posts
This is a classic case of supervised learning (Wiki; another reference from Microsoft).
Predicting a price requires building a regression model because the prediction is a number, as opposed to a classification problem, which predicts which class (for example, one of two) each post belongs to.
10. Real world ML project overview
● Machine Learning Workflow
● Data Scientist and Java Engineer roles
● Java for production ML
● Java 8 benefits
● Our point prediction system details
● Results
11. Machine Learning Workflow
Training (iterate until the best model is found):
Load data → [data, labels (known result)] → Extract Features → [feature vectors, labels] → Train Model → [prediction, labels] → Evaluate vs. business goal
Prediction (with the best model):
Load new data → [data] → Extract Features (the same as in training) → [feature vectors] → Predict using model → [predictions] → Act on prediction
12. Workflow for machine learning system
1. Set a goal with business value
2. Get data (fuman user posts) with a price already set
3. Transform data for input into machine learning algorithm
4. Train and evaluate machine learning model until reach goal
5. Deploy best model
13. Data Scientist’s role
Within the same five-step workflow, the data scientist’s part is to:
● Choose features
● Build many models
14. Software Engineer’s role
Implement and integrate into production system. Within the same five-step workflow, the software engineer’s part is to:
● Get data from data source
● Implement production code
17. Java for production ML
● Easy integration with Java applications
● Fast (vs. Python or R)
● Easy to program (vs. C++)
● The most common enterprise programming language, with IDE support and excellent supporting libraries
● Many state-of-the-art machine learning libraries have a Java API
19. Benefits of Java 8
● Java 8’s functional style is a very good match with ML operations
○ Feature extraction: data in → transform → data out
● Java 8’s streams and lambdas
○ Code is easier to understand and less verbose
● Easy parallel code
○ Faster “for free”
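As an illustration of the "data in → transform → data out" style, here is a minimal sketch using Java 8 streams and lambdas. The class name PostLengthFeatures and the chosen feature (number of characters per post) are assumptions made for this example; they are not the code shown in the talk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical example: one numeric feature per post, computed with a stream.
final class PostLengthFeatures {

    // data in -> transform -> data out: each post becomes its length in code points.
    // The "parallel" flag only switches between stream() and parallelStream().
    static List<Integer> lengths(List<String> posts, boolean parallel) {
        return (parallel ? posts.parallelStream() : posts.stream())
                .map(post -> post.codePointCount(0, post.length()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> posts = Arrays.asList("too expensive", "slow delivery", "no parking");
        System.out.println(lengths(posts, false));
        System.out.println(lengths(posts, true)); // same result, same order
    }
}
```

Switching to parallelStream() is a one-line change, although whether it is actually faster depends on the amount of data and the cost of the per-post work.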
20. Post point prediction system: step by step
[Diagram] Components: Fuman DB, Feature Extraction, DR Prediction API, Prediction Service
Data passed between components: posts, label (CSV format)
Modeling steps:
● Train/Test split
● Categorical features transformation
● Select best features
● Try many algorithms
● Tune algorithms
● Evaluate models
● REST Prediction API
Iterate until results meet business goals
21. Feature Extraction details
● We added character and word statistics about each fuman user post
○ Number of hiragana, katakana, kanji, alphabet characters and words
○ Number of words, length of words
○ Ratio of hiragana, katakana, kanji, alphabet words to the number of tokens in a post
● User profile information
○ age, gender, job category, etc.
● Bag-of-words models:
○ Words using Tf-Idf, removing stopwords (これ "this", あれ "that", それ "it", です "is", など "etc.", …)
○ Part-of-speech (名詞 noun, 動詞 verb, 形容詞 adjective, …)
○ Word-type features (hiragana word, katakana word, kanji word, …)
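The character statistics above can be sketched with the JDK alone, since java.lang.Character.UnicodeScript classifies code points as HIRAGANA, KATAKANA, HAN (kanji) or LATIN. The class ScriptStats and its method names are hypothetical, and this sketch counts characters rather than tokenized words, which would need a Japanese tokenizer that is not shown here.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: count characters per writing system in one post.
final class ScriptStats {

    // Returns how many code points of the text belong to each Unicode script.
    static Map<UnicodeScript, Integer> count(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .mapToObj(UnicodeScript::of)
            .forEach(script -> counts.merge(script, 1, Integer::sum));
        return counts;
    }

    // Ratio of one script to all code points in the text; 0.0 for an empty text.
    static double ratio(String text, UnicodeScript script) {
        int total = text.codePointCount(0, text.length());
        return total == 0 ? 0.0 : count(text).getOrDefault(script, 0) / (double) total;
    }

    public static void main(String[] args) {
        // Unicode escapes for: 3 kanji, 2 hiragana, 1 katakana, followed by "ABC".
        String post = "\u5024\u6bb5\u304c\u9ad8\u3044\u30abABC";
        System.out.println(count(post));
        System.out.println(ratio(post, UnicodeScript.HAN));
    }
}
```

Note that some characters common in Japanese text, such as punctuation and the long-vowel mark, are reported as the COMMON script, so a production version would need to decide how to count them.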
26. We reached the project goal!
● DataRobot’s best model
○ eXtreme Gradient Boosted Trees
○ RMSE: 3.54
○ MSE: 12.53
Business result:
● Higher quality evaluation than rules
● Operations staff no longer need to check posts manually
● We can validate points every day
Our result: 3.5 point difference from human judgement
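The two reported metrics are linked: RMSE is the square root of MSE (√12.53 ≈ 3.54), and because RMSE is expressed in points it is the figure to compare against the 5-point goal. A minimal sketch of both metrics follows; the class RegressionMetrics is hypothetical and is not how DataRobot computes its scores internally.

```java
import java.util.stream.IntStream;

// Hypothetical helper: error metrics between predicted and human-assigned points.
final class RegressionMetrics {

    // Mean squared error: the average of the squared differences.
    static double mse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        return IntStream.range(0, predicted.length)
                .mapToDouble(i -> (predicted[i] - actual[i]) * (predicted[i] - actual[i]))
                .average()
                .getAsDouble();
    }

    // Root mean squared error: same unit as the labels (points).
    static double rmse(double[] predicted, double[] actual) {
        return Math.sqrt(mse(predicted, actual));
    }
}
```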
27. Deployment issues
● Problem: The Prediction API was very slow (>1s / post), so we had to run it as a batch process each night.
● We want: Local, low-latency predictions, without losing the good prediction performance we already have.
We solved this problem using the excellent open source, distributed machine learning library H2O by H2O.ai.
Co-founder: Cliff Click, who made the Java HotSpot Server Compiler
28. Post point prediction system: Current system
[Diagram] Components: Fuman DB, Feature Extraction, Prediction POJO, Prediction Service, Fuman Webapp
Data passed between components: posts, label (CSV format); get new post values; make feature vectors
Modeling steps:
● Train/Test split
● Categorical features transformation
● Distributed, fast and state-of-the-art algorithms
● POJO prediction class generation
30. Overview: Making Predictions
● Use the prediction POJO generated by H2O
● For each new post, query the Prediction Service
○ Convert to vector (Double[] for H2O)
○ Get prediction from prediction POJO (Double value, round to integer)
○ Update database with predicted price
31. We reached the business goal!
Project goal: Get similar performance from H2O as from DataRobot
H2O is not ideal to explore different models and features, but for
production, it is FAST with similar predictive performance. It is
implemented in pure Java (Github).
● H2O: Train a new model for production
○ GBM (Gradient Boosting Machine)
○ MSE: 12.8 (RMSE ≈ 3.58)
● DataRobot’s best model
○ eXtreme Gradient Boosted Trees
○ RMSE: 3.54
○ MSE: 12.53
32. Real world ML loves Java!
● Java is a top choice for making production machine learning systems
● The benefits of Java 8 make Java fun and relevant again
● Integration in a Java web application was not hard
● Java is not a good choice for experimentation
○ Start with a Python prototype with Scikit-learn
○ Use a Machine Learning service like DataRobot.com
33. You can use ML in your projects!
● Web API services are like a personal data scientist
○ No need for a Data Scientist for simple uses of ML
○ But harder datasets will need expertise
● Real world ML projects need Engineers:
○ Get data to train a good model (log files, sales results, mail campaign results,…)
○ Transform data into input for ML library or web service
○ Deploy and integrate into production
● Most steps are just normal programming
○ Get data from DB
○ Transform data into a CSV
○ Call a REST API or Java POJO to make predictions
○ Integrate with the system that needs predictions
36. Feature engineering with streams and lambdas
The goal is to take raw data from the DB and create arrays of numerical or categorical features.
1. Get Fuman user post data from DB -> UserPost
2. Learn the vocabulary of all user posts' word types
3. Create the dataset:
   a. For each post,
      i. Add the statistics features
      ii. Add the word-type features
4. Transform to CSV output (for DataRobot)
Instances are Weka SparseInstance (sparse vectors for memory efficiency), but in retrospect, I think a specialized vector library would have been better. Weka is a terrible production library.
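Steps 2 to 4 can be sketched with plain streams. This is a simplified illustration, not the Weka-based code from the talk: the class CsvDataset is hypothetical, posts are assumed to be already tokenized, the vocabulary is built from the tokens themselves rather than from word types, the only statistics feature is the token count, and the label column and CSV escaping are omitted (tokens are assumed to contain no commas or quotes).

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: tokenized posts in, dense CSV text out.
final class CsvDataset {

    // Step 2: the vocabulary is the sorted list of distinct tokens across all posts.
    static List<String> vocabulary(List<List<String>> tokenizedPosts) {
        return tokenizedPosts.stream()
                .flatMap(List::stream)
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    // Step 3: one row per post: the token count (a statistics feature),
    // followed by one occurrence count per vocabulary entry.
    static String toRow(List<String> tokens, List<String> vocabulary) {
        Stream<String> stats = Stream.of(String.valueOf(tokens.size()));
        Stream<String> counts = vocabulary.stream()
                .map(word -> String.valueOf(tokens.stream().filter(word::equals).count()));
        return Stream.concat(stats, counts).collect(Collectors.joining(","));
    }

    // Step 4: a header line followed by one CSV line per post.
    static String toCsv(List<List<String>> tokenizedPosts) {
        List<String> vocab = vocabulary(tokenizedPosts);
        String header = Stream.concat(Stream.of("token_count"), vocab.stream())
                .collect(Collectors.joining(","));
        return Stream.concat(Stream.of(header),
                        tokenizedPosts.stream().map(tokens -> toRow(tokens, vocab)))
                .collect(Collectors.joining("\n"));
    }
}
```

A dense row per post is fine for a small vocabulary; with a realistic vocabulary most counts are zero, which is the reason the original used sparse vectors.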