Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1jupcVS.
Sean Owen provides examples of operational analytics projects in the field, presenting a reference architecture and algorithm design choices for a successful implementation based on his experience with customers and Oryx/Cloudera. Filmed at qconlondon.com.
Sean Owen is Director of Data Science at Cloudera, based in London. Before Cloudera, he founded Myrrix Ltd, a company commercializing large-scale real-time recommender systems on Apache Hadoop. He has been a primary committer and VP for Apache Mahout, and co-author of Mahout in Action. Previously, Sean was a senior engineer at Google.
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/pattern-real-time-learning
http://www.infoq.com/presentati
ons/nasa-big-data
http://www.infoq.com/presentati
ons/nasa-big-data
3. Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
4. 2
Design
Pa+erns
for
Large-‐Scale
Real-‐Time
Learning
QCon
London
2014
Sean
Owen
/
Director
of
Data
Science
/
Cloudera
9. Data
Science
Is
Exploratory
Analy-cs?
7
www.tc.umn.edu/~zief0002/Comparing-‐Groups/blog.html
thenextweb.com/microsoS/2013/07/08/microsoS-‐brings-‐the-‐office-‐store-‐to-‐22-‐new-‐markets-‐adds-‐power-‐bi-‐an-‐intelligence-‐tool-‐to-‐office-‐365/
10. Example:
Drug
InteracJons
8
Cloudera
analysis
of
FDA
drug
data:
“Our
analysis
revealed
a
few
drug
pairs
with
surprisingly
high
correlaJons
with
adverse
events
that
did
not
show
up
in
a
search
of
the
academic
literature:
gabapenJn
(a
seizure
medicaJon)
taken
in
conjuncJon
with
hydrocodone/paracetamol
was
correlated
with
memory
impairment,
and
haloperidol
in
conjuncJon
with
lorazepam
was
correlated
with
the
paJent
entering
into
a
coma.”
blog.cloudera.com/blog/2011/11/using-‐hadoop-‐to-‐analyze-‐adverse-‐drug-‐events/
12. Example:
Data
Science
in
the
Field
10
• [Large
European
e-‐commerce
site]
• Wants
real-‐Jme
recommendaJons
for
new
and
returning
users
• Data
streamed
from
web
server
via
Flume
to
HDFS
• MulJple
data
sources
• 100K+
products,
20M
users
Exploratory?
13. Example:
11
• Search,
ML
over
PaJent
Data
• MapReduce
for
indexing,
learning
• HBase
for
storage
and
fast
access
• Also:
Storm
for
incremental
update
• And:
relaJonal
DB
for
most
recent
derived
data
• API
façade
for
input;
API
for
querying
learning
engineering.cerner.com/2013/02/near-‐real-‐Jme-‐processing-‐over-‐hadoop-‐and-‐hbase/
Engineering
Machine
Learning
20. Gaps
to
fill,
and
Goals
18
• Model
Building
• Large-‐scale
• Con-nuous
• Apache
Hadoop™-‐based
• Few,
good
algorithms
• Model
Serving
• Real-‐-me
query
• Real-‐-me
update
• Algorithms
• Parallelizable
• Updateable
• Works
on
diverse
input
• Interoperable
• PMML
model
format
• Simple
REST
API
• Open
source
21. Large-‐Scale
or
Real-‐Time?
19
Large-‐Scale
Offline
Batch
Real-‐Time
Online
Streaming
vs
Why
Don’t
We
Have
Both?
λ!
24. Two
Layers
22
• Computa-on
Layer
• Java-‐based
server
process
• Client
of
Hadoop
2.x
• Periodically
builds
“generaJon”
from
recent
data
and
past
model
• Baby-‐sits
MapReduce*
jobs
(or,
locally
in-‐core)
• Publishes
models
• Serving
Layer
• Apache
Tomcat™-‐based
server
process
• Consumes
models
from
HDFS
(or
local
FS)
• Serves
queries
from
model
in
memory
• Updates
from
new
input
• Also
writes
input
to
HDFS
• Replicas
for
scale
*
Apache
Spark
later
25. CollaboraJve
Filtering
:
ALS
23
• AlternaJng
Least
Squares
• Latent-‐factor
model
• Accepts
implicit
or
explicit
feedback
• Real-‐Jme
update
via
fold-‐in
of
input
• No
cold-‐start
• Parallelizable
YT
X
27. ClassificaJon
/
Regression
:
RDF
25
• Random
Decision
Forests
• Ensemble
method
• Numeric,
categorical
features
and
target
• Very
parallel
• Nodes
updateable
• Works
well
on
many
problems
age$>$30
female? Yes
income$>$20000 Yes
Yes No
28. PMML
26
• PredicJve
Modeling
Markup
Language
• XML-‐based
format
for
predicJve
models
• Standardized
by
Data
Mining
Group
(www.dmg.org)
• Wide
tool
support
<PMML xmlns="http://www.dmg.org/PMML-4_1"!
version="4.1">!
<Header copyright="www.dmg.org"/>!
<DataDictionary numberOfFields="5">!
<DataField name="temperature"!
optype="continuous"!
dataType="double"/>!
…!
</DataDictionary>!
<TreeModel modelName="golfing"!
functionName="classification">!
<MiningSchema>!
<MiningField name="temperature"/>!
… !
</MiningSchema>!
<Node score="will play">!
<Node score="will play">!
<SimplePredicate field="outlook"!
operator="equal" !
value="sunny"/>!
…!
</Node>!
</Node>!
</TreeModel>!
</PMML>!
www.dmg.org/v4-‐1/TreeModel.html
29. Extra:
Apache
Spark
as
“Crossover
Hit”
27
• Exploratory-‐friendly
• REPL
• Scala
closures
• MLlib
• OperaJonal-‐friendly
• Distributed
• Hadoop
integraJon
• All
Java
libraries
available
blog.cloudera.com/blog/2014/03/why-‐apache-‐spark-‐is-‐a-‐crossover-‐hit-‐for-‐data-‐scienJsts/