This document discusses the future of artificial intelligence on the Java Virtual Machine (JVM). It argues that today's machine learning frameworks are monolithic and make rigid assumptions about their input data, and proposes a micro-services approach that separates concerns such as data pipelines, scoring, model training, and evaluation, reducing lock-in and allowing greater flexibility. It also discusses why new hardware such as GPUs is better suited to deep learning, and the role frameworks like Spark and Akka could play in distributed, real-time machine learning applications on the JVM.
4. Problem Space
● Spam Classification
● Summarization
● Face Detection
● Eye Tracking
● Targeted Ads
● Recommendation Engines
5. Current State of ML
● Simpler models
● Most of industry rarely goes beyond logistic regression (minimal Spark sketch after this list)
● Many problems are binary
o e.g. fraud, spam
● Some unsupervised (clustering, recommendations)
● Lots of ML frameworks on JVM
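Since many of these binary problems (fraud, spam) come down to logistic regression, here is a minimal sketch of what that looks like with Spark's MLlib RDD API; the features and labels are made-up toy values, not a real spam dataset.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    object SpamBaseline {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("spam-baseline").setMaster("local[*]"))

        // Toy labeled points: label 1.0 = spam, 0.0 = ham; the two features are invented counts.
        val data = sc.parallelize(Seq(
          LabeledPoint(1.0, Vectors.dense(3.0, 1.0)),
          LabeledPoint(0.0, Vectors.dense(0.0, 2.0)),
          LabeledPoint(1.0, Vectors.dense(4.0, 0.0)),
          LabeledPoint(0.0, Vectors.dense(1.0, 3.0))
        ))

        // Hold out part of the data for testing (see the Training slide later on).
        val Array(train, test) = data.randomSplit(Array(0.75, 0.25), seed = 42L)

        val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train)
        test.collect().foreach(p => println(s"label=${p.label} predicted=${model.predict(p.features)}"))

        sc.stop()
      }
    }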
6. ML Frameworks on JVM...
● Apache Mahout
● Spark’s MLlib
● Weka (Java, not R)
9. Ring a Bell?
● We call that “Monolithic”
● Separate ML concerns:
o Data Pipelines/Vectorization
o Scoring
o Model Training
o Evaluation
10. Micro-Services + ML?
● Kinda like micro-services
● Reduce lock-in
● Take math, data cleaning, model training, choosing algorithms ...
● ... and separate them (hypothetical interface sketch below)
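One way that separation could look on the JVM, as a hypothetical sketch: each concern becomes its own small trait, so implementations can be swapped or deployed as independent services. None of these names come from an existing library.

    // Hypothetical interfaces: one per concern, so each can live in its own service.
    trait Vectorizer[Raw] {
      def vectorize(input: Raw): Array[Double]                    // data pipeline / vectorization
    }

    trait Trainer[Model] {
      def train(examples: Seq[(Array[Double], Double)]): Model    // model training
    }

    trait Scorer[Model] {
      def score(model: Model, features: Array[Double]): Double    // scoring
    }

    trait Evaluator[Model] {
      def evaluate(model: Model, test: Seq[(Array[Double], Double)]): Double  // evaluation metric
    }

Nothing here ties a pipeline to a particular trainer: a Spark-backed Trainer and an Akka-hosted Scorer could implement the same contracts.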
11. Math
● Parametric Models (Matrices!)
● Non-Parametric (Random Forests)
● Focusing on Matrices (the hard part of ML systems)
12. Matrices
● NDArrays (> 2-d)
● Tensors (think of pages of matrices)
● Example: 2 x 2 x 2 = two 2 x 2 matrices stacked along a third axis (plain-Scala sketch after this list)
● Applies to graphs w/ sparse representations
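To make the 2 x 2 x 2 example concrete, here is a plain-Scala sketch that treats a rank-3 tensor as pages of matrices (no tensor library assumed):

    // A 2 x 2 x 2 tensor: two "pages", each a 2 x 2 matrix.
    val tensor: Array[Array[Array[Double]]] = Array(
      Array(Array(1.0, 2.0), Array(3.0, 4.0)),   // page 0
      Array(Array(5.0, 6.0), Array(7.0, 8.0))    // page 1
    )

    // Indexing order is (page, row, column).
    println(tensor(1)(0)(1))   // page 1, row 0, column 1 => 6.0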
13. Chips/Hardware/Matrices
● CPUs - we work with these
● GPUs - we work with these too, via CUDA
● FPGAs
o Intel bought Altera, an FPGA maker, for $17 billion this month
o The edge, the cloud
15. Why New Chips?
● See the numbers yourself:
● http://www.slideshare.net/airbots/cuda-29330283
● http://devblogs.nvidia.com/parallelforall/bidmach-machine-learning-limit-gpus/
● http://jcuda.org
16. Mixed clusters
● GPUs aren’t good for all workloads
● Because latency
● Need to upload data to the device first: not good for small problems
● Mixed CPU/GPU clusters are the best bet (back-of-envelope sketch after this list)
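A back-of-envelope sketch of the latency point; all three throughput numbers are assumptions picked only to show the shape of the trade-off, not measurements.

    // Decide whether shipping a problem to the GPU is worth the transfer cost.
    val pcieGBps  = 12.0     // assumed PCIe transfer bandwidth, GB/s
    val cpuGflops = 100.0    // assumed CPU throughput, GFLOP/s
    val gpuGflops = 3000.0   // assumed GPU throughput, GFLOP/s

    def offloadWins(bytes: Double, flops: Double): Boolean = {
      val transferSec = bytes / (pcieGBps * 1e9)
      val cpuSec      = flops / (cpuGflops * 1e9)
      val gpuSec      = flops / (gpuGflops * 1e9)
      transferSec + gpuSec < cpuSec
    }

    println(offloadWins(bytes = 1e6, flops = 1e6))   // small problem: transfer dominates -> false
    println(offloadWins(bytes = 1e9, flops = 1e12))  // heavy matrix work: GPU wins -> true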
17. Data Pipelines
● More data will be binary
● Today's frameworks can't process binary data well
● Binary data has different semantics
● Moving windows for audio (sliding-window sketch after this list)
● 3-d volumes for images ...
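For the audio case, a moving window is just an overlapping, strided slice over samples. A minimal sketch with Scala's built-in sliding; the sample rate, window size, and hop size are arbitrary choices for illustration.

    // Hypothetical: one second of 8 kHz audio as raw samples.
    val samples: Array[Double] = Array.tabulate(8000)(i => math.sin(i * 0.01))

    // 256-sample windows, hopping 128 samples: the classic overlapping moving window.
    val windows: Iterator[Array[Double]] = samples.sliding(256, 128)

    // Each window becomes one feature vector for downstream vectorization.
    windows.take(3).foreach(w => println(s"window of ${w.length} samples, first=${w.head}"))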
18. People Roll Their Own b/c
● Current frameworks assume clean data :(
● Pipelines are brittle, hard to maintain
● Moving towards being composable (reuse); see the function-composition sketch below
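Composability can be as simple as ordinary function composition. A hypothetical sketch where each cleaning/vectorization step is a plain function that can be reused in other pipelines:

    // Each step is an ordinary function, so pipelines compose and steps get reused.
    val stripMarkup: String => String      = _.replaceAll("<[^>]*>", " ")
    val tokenize:    String => Seq[String] = _.toLowerCase.split("\\s+").toSeq.filter(_.nonEmpty)
    val termCounts:  Seq[String] => Map[String, Int] =
      _.groupBy(identity).map { case (t, ts) => (t, ts.size) }

    // One composed pipeline out of three reusable pieces.
    val pipeline: String => Map[String, Int] = stripMarkup andThen tokenize andThen termCounts

    println(pipeline("<p>Buy now! Buy cheap!</p>"))   // Map(buy -> 2, now! -> 1, cheap! -> 1)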
19. Dedicated Libraries
● Let’s focus on vectorization -- now!
● Because IoT
● Because more access to raw media
● Should fit into current big data frameworks
21. All independent
● These things work for different models
● Shouldn’t be tied to a particular system
● Should be embeddable
22. Training
● Split Train/Test (minimal split sketch after this list)
● Sample the data (no, not all the data ;) to validate the model
● Increasingly compute intensive
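A minimal, framework-free sketch of the train/test split; the 80/20 ratio and the fixed seed are conventional choices, not anything mandated.

    import scala.util.Random

    // Shuffle once, then hold out a fraction for testing; sample the rest if training is too costly.
    def trainTestSplit[A](data: Seq[A], trainFraction: Double = 0.8, seed: Long = 42L): (Seq[A], Seq[A]) = {
      val shuffled = new Random(seed).shuffle(data)
      shuffled.splitAt((shuffled.size * trainFraction).toInt)
    }

    val (train, test) = trainTestSplit((1 to 100).toVector)
    println(s"train=${train.size} test=${test.size}")   // train=80 test=20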
23. Deep Learning
● Most done in Python...
● Typical training time is measured in hours/days -- weeks!?
● Work being done in HPC (model parallelism)
● DistBelief (data parallelism; toy averaging sketch after this list)
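The DistBelief-style data parallelism mentioned above boils down to workers computing gradients on their own data shards while a parameter server averages them. A toy single-process sketch of one such step for a linear model with squared error; in the real system the shards live on separate machines.

    // One worker's gradient, computed only on its shard of the data.
    def localGradient(shard: Seq[(Array[Double], Double)], w: Array[Double]): Array[Double] = {
      val g = new Array[Double](w.length)
      for ((x, y) <- shard) {
        var pred = 0.0
        for (i <- w.indices) pred += w(i) * x(i)
        val err = pred - y
        for (i <- w.indices) g(i) += 2 * err * x(i) / shard.size
      }
      g
    }

    // The "parameter server": average the workers' gradients and take one step.
    def averagedUpdate(shards: Seq[Seq[(Array[Double], Double)]],
                       w: Array[Double], lr: Double): Array[Double] = {
      val grads = shards.map(s => localGradient(s, w))
      w.indices.map(i => w(i) - lr * grads.map(_(i)).sum / grads.size).toArray
    }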
24. Automatic Learning
● Good at unstructured data
● Images, Text, Audio and Sensors
● Quick, baseline feature engineering
● Not good at feature introspection
27. Where Does Scala Fit In?
● Akka - real-time streaming analytics / micro-services (actor sketch after this list)
● Spark - Dataframes/number crunching
● JVM Key/Value Stores
● Pistachio (powers Yahoo’s ad network)
o http://yahooeng.tumblr.com/post/118860853846/distributed-word2vec-on-top-of-pistachio
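As a sketch of the Akka point: real-time scoring as a tiny actor-based micro-service. This uses the classic Akka actor API; the message types and the linear "model" are hypothetical.

    import akka.actor.{Actor, ActorSystem, Props}

    // Hypothetical messages for a scoring micro-service.
    final case class Score(features: Array[Double])
    final case class Result(score: Double)

    // The actor owns the model; scoring requests stream through it in real time.
    class ScoringActor(weights: Array[Double]) extends Actor {
      def receive: Receive = {
        case Score(features) =>
          val s = features.zip(weights).map { case (f, w) => f * w }.sum   // toy linear model
          sender() ! Result(s)
      }
    }

    object ScoringService extends App {
      val system = ActorSystem("scoring")
      val scorer = system.actorOf(Props(new ScoringActor(Array(0.5, -1.2))), "scorer")
      scorer ! Score(Array(1.0, 2.0))   // fire-and-forget demo; the Result goes to dead letters here
    }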
28. The Way We Learn Now
● Monolithic ML frameworks
● No per-chip optimizations
● No Tensors (come on guys, it’s 2015...)
● Need isolation and less lock-in
● JVM is the platform to make it happen