4. Why Now?
• But Moore’s law has applied for a long time
• Why is Hadoop/Big Data exploding now?
• Why not 10 years ago?
• Why not 20?
6. Size Matters, but …
• If it were just the availability of data, then existing big companies would have adopted big data technology first.
They didn’t.
8. Or Maybe Cost
• If it were just a net positive value, then finance companies should have adopted first, because they have a higher opportunity value per byte.
They didn’t.
10. Backwards Adoption
• Under almost any threshold argument, startups would not have adopted big data technology first.
They did.
12. Everywhere at Once?
• Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
Why?
13. Analytics Scaling Laws
• Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
• The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
• Cost/performance has changed radically
– IF you can use many commodity boxes
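To make the shape argument concrete, here is a toy net-value model (a sketch; the curve shapes and constants are illustrative assumptions, not numbers from this deck): value grows with sharply diminishing returns, while cost grows either exponentially or linearly with a low constant.

    // Toy model of net value = value - cost at increasing data scale.
    // All functional forms and constants here are illustrative guesses.
    public class NetValueSketch {
        static double value(double s)   { return 1 - Math.exp(-s / 300); }    // 80-20 shape
        static double oldCost(double s) { return 0.002 * Math.exp(s / 150); } // old school: exponential
        static double newCost(double s) { return 0.0001 * s; }                // big data: linear, low constant

        public static void main(String[] args) {
            for (int s = 0; s <= 2000; s += 250) {
                System.out.printf("scale=%4d  old-school net=%+7.3f  big-data net=%+7.3f%n",
                        s, value(s) - oldCost(s), value(s) - newCost(s));
            }
        }
    }

Under these assumptions, the old-school net value peaks sharply near s ≈ 550 and then collapses, while the linear-cost net value keeps rising toward a higher optimum at far larger scale.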
14. You’re kidding, people do that?
[Chart: the value-vs-scale curve annotated with reactions at increasing scale: “We knew that”, “We should have known that”, “We didn’t know that!”, “You’re kidding, people do that?”]
15. NSA, non-proliferation
[Chart: value vs. scale (0–2,000), with example efforts marked along the curve: “Anybody with eyes”, “Intern with a spreadsheet”, “In-house analytics”, “Industry-wide data consortium”, and NSA non-proliferation work]
16. [Chart: net value vs. scale (0–2,000); the net value optimum has a sharp peak well before maximum effort]
24. [Chart: net value vs. scale (0–2,000) under linear cost scaling; initially, linear cost scaling actually makes things worse, but then a tipping point is reached and things change radically]
25. Prerequisites for Tipping
• To reach the tipping point:
• Algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
• Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
32. Classification in Detail
• Naive Bayes Family
– Hadoop-based training
• Decision Forests
– Hadoop-based training
• Logistic Regression (aka SGD)
– Fast on-line (sequential) training
– Now with MORE topping!
33. How it Works
• We are given “features”
– Often binary values in a vector
• The algorithm learns weights
– The weighted sum of feature × weight is the key (see the sketch below)
• Each weight is a single real value
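A minimal sketch of that scoring step in plain Java (the names are mine):

    // Sum the weights of the features that are present (feature * weight
    // with binary features), then squash the sum into a probability.
    static double score(boolean[] features, double[] weights) {
        double sum = 0;
        for (int i = 0; i < features.length; i++) {
            if (features[i]) {
                sum += weights[i];
            }
        }
        return 1 / (1 + Math.exp(-sum));   // logistic link
    }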
35. Features
From: Dr. Paul Acquah
Re: Proposal for over-invoice Contract Benevolence
Dear Sir,
Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 EUR (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor.
...

From: George <george@fumble-tech.com>
Date: Thu, May 20, 2010 at 10:51 AM
Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
36. But …
• Text and words aren’t suitable features
• We need a numerical vector
• So we use binary vectors with lots of slots
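One way to build such vectors is with Mahout's hashed feature encoders, roughly as in this sketch (the probe name and vector size are arbitrary choices of mine):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    // Hash each token into a large, mostly-empty vector: lots of slots,
    // only a few of them set, and no dictionary to maintain.
    public class HashedEncoding {
        public static Vector encode(Iterable<String> tokens) {
            StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
            Vector v = new RandomAccessSparseVector(100000);
            for (String token : tokens) {
                encoder.addToVector(token, v);
            }
            return v;
        }
    }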
42. Training Data
[Diagram: raw data is joined, combined, and transformed into training examples with target values; the examples are parsed into tokens, encoded into vectors, and fed to the training algorithm]
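The tail of this pipeline might look like the following sketch using Mahout's sequential SGD learner (the Example holder and the tuning value are hypothetical):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    // Sequential SGD training over encoded vectors with 0/1 target values.
    public class SgdTraining {
        // Hypothetical holder for one parsed, encoded training example.
        public static class Example {
            int target;      // 0 or 1
            Vector vector;   // output of the encoding step
        }

        public static OnlineLogisticRegression train(Iterable<Example> examples,
                                                     int numFeatures) {
            OnlineLogisticRegression model =
                    new OnlineLogisticRegression(2, numFeatures, new L1())
                            .learningRate(50);   // tuning value is a guess
            for (Example ex : examples) {
                model.train(ex.target, ex.vector);
            }
            return model;
        }
    }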
43. Full Scale Training
[Diagram: input passes through feature extraction, join, and down-sampling in map-reduce; the result feeds sequential SGD learning; side-data now arrives via NFS]
44. Hybrid Model Development
[Diagram: on the big-data cluster, logs are grouped by user and user sessions are counted for transaction patterns, producing training data on a shared filesystem; on the legacy modeling side, that training data is merged with account info and fed to PROC LOGISTIC to produce the model]
45. Enter the Pig Vector
• Pig UDFs for
– Vector encoding

    define EncodeVector
        org.apache.mahout.pig.encoders.EncodeVector(
            '10', 'x+y+1',
            'x:numeric, y:numeric, z:numeric');

– Model training

    vectors = foreach docs generate newsgroup, encodeVector(*) as v;
    grouped = group vectors all;
    model = foreach grouped generate 1 as key, train(vectors) as model;
46. Real-time Developments
• Storm + Hadoop + MapR
– Real-time with Storm
– Long-term with Hadoop
– State checkpoints with MapR
• Add the Bayesian Bandit for on-line learning
48. Mobile Network Monitor
[Diagram: transaction data flows from geo-dispersed ingest servers into HBase; batch aggregation feeds a retro-analysis interface, and the same data drives a real-time dashboard and alerts]
49. A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?
50. A First Conclusion
• Probability as expressed by humans is
subjective and depends on information and
experience
51. A Second Conclusion
• A single number is a bad way to express
uncertain knowledge
• A distribution of values might be better
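For the coin, one concrete way to express that knowledge as a distribution (a sketch using Apache Commons Math, which is my choice, not the deck's): after h heads and t tails, hold a Beta(h + 1, t + 1) belief over the coin's bias.

    import org.apache.commons.math3.distribution.BetaDistribution;

    // Hold a distribution over the coin's bias instead of a single number.
    public class CoinBelief {
        public static BetaDistribution bias(int heads, int tails) {
            return new BetaDistribution(heads + 1, tails + 1);
        }

        public static void main(String[] args) {
            BetaDistribution b = bias(7, 3);   // after 7 heads and 3 tails
            System.out.printf("mean=%.3f  90%% interval=[%.3f, %.3f]%n",
                    b.getNumericalMean(),
                    b.inverseCumulativeProbability(0.05),
                    b.inverseCumulativeProbability(0.95));
        }
    }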
55. Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
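A minimal sketch of that loop, i.e. Thompson sampling with Beta posteriors (Apache Commons Math does the sampling; the class and method names are mine):

    import org.apache.commons.math3.distribution.BetaDistribution;

    // Two-armed Bayesian bandit: keep win/loss counts per bandit, sample a
    // payoff probability from each posterior, and play the larger sample.
    public class BayesianBandit {
        private final int[] wins = new int[2];
        private final int[] losses = new int[2];

        public int choose() {
            double p1 = new BetaDistribution(wins[0] + 1, losses[0] + 1).sample();
            double p2 = new BetaDistribution(wins[1] + 1, losses[1] + 1).sample();
            return p1 > p2 ? 0 : 1;   // coin goes to the bandit with the bigger sample
        }

        public void update(int bandit, boolean paidOff) {
            if (paidOff) wins[bandit]++; else losses[bandit]++;
        }
    }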
56. [Graph: 25th/50th/75th-percentile payoff of sampled-probability experiments vs. number of trials]
57. [Graph: probability of picking the better bandit vs. number of trials]
58. The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and
exploitation
• Can be extended to more general response
models
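With the sketch above, the whole play loop is just sample, act, update: wide posteriors early on make the samples explore, and as the counts accumulate, the posteriors narrow and the same rule exploits (pull stands in for a hypothetical environment call):

    BayesianBandit bandit = new BayesianBandit();
    for (int t = 0; t < 1000; t++) {
        int arm = bandit.choose();     // wide posteriors -> exploration
        boolean won = pull(arm);       // hypothetical payoff observation
        bandit.update(arm, won);       // narrowing posteriors -> exploitation
    }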
59. Deployment with Storm/MapR
[Diagram: a Targeting Engine calls a Model Selector over RPC, which dispatches to one of several online models, also over RPC; impression logs, click logs, and a Conversion Detector feed online training, which updates the online models; a conversion dashboard tracks results; all state is managed transactionally in the MapR file system]
60. Service Architecture
[Diagram: the same pipeline under MapR pluggable service management, with the online components (Targeting Engine, Model Selector, online models, Conversion Detector) hosted by Storm and model training hosted by Hadoop, all on top of MapR lockless storage services]
61. Find Out More
• Me: tdunning@mapr.com
ted.dunning@gmail.com
tdunning@apache.org
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning
Editor's notes
The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
In classical analytics, the cost of doing analytics increases sharply.
The result is a net value that has a sharp optimum in the area where value is increasing rapidly and cost is not yet increasing so rapidly.
New techniques such as Hadoop result in linear scaling of cost. This change in shape causes a qualitative change in the way that cost trades off against value to give net value. As technology improves, the slope of this cost line is also dropping rapidly over time.
This next sequence shows how the net value changes with linear cost models of different slopes.
Notice how the best net value has jumped up significantly.
And as the cost line approaches horizontal, the highest net value occurs at dramatically larger data scale.
With no information, the relative expected payoff would be -0.25. This graph shows the 25th, 50th, and 75th-percentile results for sampled experiments with uniform random probabilities. Convergence to the optimum is nearly as fast as the theoretical optimum, sqrt(n). Note the log scale on the number of trials.
Here is how the system converges in terms of how likely it is to pick the better bandit when the two payoff probabilities are only slightly different. After 1000 trials, the system is already giving 75% of the bandwidth to the better option. This graph was produced by averaging several thousand runs with the same probabilities.