4. Why Now?
• But Moore’s law has applied for a long time
• Why is Hadoop/Big Data exploding now?
• Why not 10 years ago?
• Why not 20?
6. Size Matters, but …
• If it were just the availability of data, then existing big companies would have adopted big data technology first.
They didn’t.
8. Or Maybe Cost
• If it were just a net positive value, then finance companies should have adopted first, because they have a higher opportunity value per byte.
They didn’t.
10. Backwards Adoption
• Under almost any threshold argument, startups would not have adopted big data technology first.
They did.
12. Everywhere at Once?
• Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
Why?
13. Analytics Scaling Laws
• Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
• The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
• Cost/performance has changed radically
– IF you can use many commodity boxes
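To make the shape argument concrete, here is a toy net-value model (a sketch; the curve shapes and constants are illustrative assumptions, not numbers from this deck): value grows with sharply diminishing returns, while cost grows either exponentially or linearly with a low constant.

    // Toy model of net value = value - cost at increasing data scale.
    // All functional forms and constants here are illustrative guesses.
    public class NetValueSketch {
        static double value(double s)   { return 1 - Math.exp(-s / 300); }    // 80-20 shape
        static double oldCost(double s) { return 0.002 * Math.exp(s / 150); } // old school: exponential
        static double newCost(double s) { return 0.0001 * s; }                // big data: linear, low constant

        public static void main(String[] args) {
            for (int s = 0; s <= 2000; s += 250) {
                System.out.printf("scale=%4d  old-school net=%+7.3f  big-data net=%+7.3f%n",
                        s, value(s) - oldCost(s), value(s) - newCost(s));
            }
        }
    }

Under these assumptions, the old-school net value peaks sharply near s ≈ 550 and then collapses, while the linear-cost net value keeps rising toward a higher optimum at far larger scale.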
14. You’re kidding, people do that?
[Chart: the value-vs-scale curve annotated with reactions at increasing scale: “We knew that”, “We should have known that”, “We didn’t know that!”, “You’re kidding, people do that?”]
15. NSA, non-proliferation
[Chart: value vs. scale (0–2,000), with example efforts marked along the curve: “Anybody with eyes”, “Intern with a spreadsheet”, “In-house analytics”, “Industry-wide data consortium”, and NSA non-proliferation work]
16. [Chart: net value vs. scale (0–2,000); the net value optimum has a sharp peak well before maximum effort]
24. [Chart: net value vs. scale (0–2,000) under linear cost scaling; initially, linear cost scaling actually makes things worse, but then a tipping point is reached and things change radically]
25. Prerequisites for Tipping
• To reach the tipping point:
• Algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
• Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
32. Classification in Detail
• Naive Bayes Family
– Hadoop-based training
• Decision Forests
– Hadoop-based training
• Logistic Regression (aka SGD)
– Fast on-line (sequential) training
– Now with MORE topping!
33. How it Works
• We are given “features”
– Often binary values in a vector
• The algorithm learns weights
– The weighted sum of feature × weight is the key (see the sketch below)
• Each weight is a single real value
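A minimal sketch of that scoring step in plain Java (the names are mine):

    // Sum the weights of the features that are present (feature * weight
    // with binary features), then squash the sum into a probability.
    static double score(boolean[] features, double[] weights) {
        double sum = 0;
        for (int i = 0; i < features.length; i++) {
            if (features[i]) {
                sum += weights[i];
            }
        }
        return 1 / (1 + Math.exp(-sum));   // logistic link
    }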
35. Features
From: Dr. Paul Acquah
Re: Proposal for over-invoice Contract Benevolence
Dear Sir,
Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 EUR (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor.
...

From: George <george@fumble-tech.com>
Date: Thu, May 20, 2010 at 10:51 AM
Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
36. But …
• Text and words aren’t suitable features
• We need a numerical vector
• So we use binary vectors with lots of slots
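One way to build such vectors is with Mahout's hashed feature encoders, roughly as in this sketch (the probe name and vector size are arbitrary choices of mine):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    // Hash each token into a large, mostly-empty vector: lots of slots,
    // only a few of them set, and no dictionary to maintain.
    public class HashedEncoding {
        public static Vector encode(Iterable<String> tokens) {
            StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
            Vector v = new RandomAccessSparseVector(100000);
            for (String token : tokens) {
                encoder.addToVector(token, v);
            }
            return v;
        }
    }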
42. Training Data
[Diagram: raw data is joined, combined, and transformed into training examples with target values; the examples are parsed into tokens, encoded into vectors, and fed to the training algorithm]
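The tail of this pipeline might look like the following sketch using Mahout's sequential SGD learner (the Example holder and the tuning value are hypothetical):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    // Sequential SGD training over encoded vectors with 0/1 target values.
    public class SgdTraining {
        // Hypothetical holder for one parsed, encoded training example.
        public static class Example {
            int target;      // 0 or 1
            Vector vector;   // output of the encoding step
        }

        public static OnlineLogisticRegression train(Iterable<Example> examples,
                                                     int numFeatures) {
            OnlineLogisticRegression model =
                    new OnlineLogisticRegression(2, numFeatures, new L1())
                            .learningRate(50);   // tuning value is a guess
            for (Example ex : examples) {
                model.train(ex.target, ex.vector);
            }
            return model;
        }
    }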
43. Full Scale Training
[Diagram: input passes through feature extraction, join, and down-sampling in map-reduce; the result feeds sequential SGD learning; side-data now arrives via NFS]
44. Hybrid Model Development
[Diagram: on the big-data cluster, logs are grouped by user and user sessions are counted for transaction patterns, producing training data on a shared filesystem; on the legacy modeling side, that training data is merged with account info and fed to PROC LOGISTIC to produce the model]
45. Enter the Pig Vector
• Pig UDFs for
– Vector encoding

    define EncodeVector
        org.apache.mahout.pig.encoders.EncodeVector(
            '10', 'x+y+1',
            'x:numeric, y:numeric, z:numeric');

– Model training

    vectors = foreach docs generate newsgroup, encodeVector(*) as v;
    grouped = group vectors all;
    model = foreach grouped generate 1 as key, train(vectors) as model;
46. Real-time Developments
• Storm + Hadoop + MapR
– Real-time with Storm
– Long-term with Hadoop
– State checkpoints with MapR
• Add the Bayesian Bandit for on-line learning
48. Mobile Network Monitor
[Diagram: transaction data flows from geo-dispersed ingest servers into HBase; batch aggregation feeds a retro-analysis interface, and the same data drives a real-time dashboard and alerts]
49. A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?
50. A First Conclusion
• Probability as expressed by humans is
subjective and depends on information and
experience
51. A Second Conclusion
• A single number is a bad way to express
uncertain knowledge
• A distribution of values might be better
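For the coin, one concrete way to express that knowledge as a distribution (a sketch using Apache Commons Math, which is my choice, not the deck's): after h heads and t tails, hold a Beta(h + 1, t + 1) belief over the coin's bias.

    import org.apache.commons.math3.distribution.BetaDistribution;

    // Hold a distribution over the coin's bias instead of a single number.
    public class CoinBelief {
        public static BetaDistribution bias(int heads, int tails) {
            return new BetaDistribution(heads + 1, tails + 1);
        }

        public static void main(String[] args) {
            BetaDistribution b = bias(7, 3);   // after 7 heads and 3 tails
            System.out.printf("mean=%.3f  90%% interval=[%.3f, %.3f]%n",
                    b.getNumericalMean(),
                    b.inverseCumulativeProbability(0.05),
                    b.inverseCumulativeProbability(0.95));
        }
    }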
55. Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
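A minimal sketch of that loop, i.e. Thompson sampling with Beta posteriors (Apache Commons Math does the sampling; the class and method names are mine):

    import org.apache.commons.math3.distribution.BetaDistribution;

    // Two-armed Bayesian bandit: keep win/loss counts per bandit, sample a
    // payoff probability from each posterior, and play the larger sample.
    public class BayesianBandit {
        private final int[] wins = new int[2];
        private final int[] losses = new int[2];

        public int choose() {
            double p1 = new BetaDistribution(wins[0] + 1, losses[0] + 1).sample();
            double p2 = new BetaDistribution(wins[1] + 1, losses[1] + 1).sample();
            return p1 > p2 ? 0 : 1;   // coin goes to the bandit with the bigger sample
        }

        public void update(int bandit, boolean paidOff) {
            if (paidOff) wins[bandit]++; else losses[bandit]++;
        }
    }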
56. [Graph: 25th/50th/75th-percentile payoff of sampled-probability experiments vs. number of trials]
57. [Graph: probability of picking the better bandit vs. number of trials]
58. The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and
exploitation
• Can be extended to more general response
models
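With the sketch above, the whole play loop is just sample, act, update: wide posteriors early on make the samples explore, and as the counts accumulate, the posteriors narrow and the same rule exploits (pull stands in for a hypothetical environment call):

    BayesianBandit bandit = new BayesianBandit();
    for (int t = 0; t < 1000; t++) {
        int arm = bandit.choose();     // wide posteriors -> exploration
        boolean won = pull(arm);       // hypothetical payoff observation
        bandit.update(arm, won);       // narrowing posteriors -> exploitation
    }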
59. Deployment with Storm/MapR
[Diagram: a Targeting Engine calls a Model Selector over RPC, which dispatches to one of several online models, also over RPC; impression logs, click logs, and a Conversion Detector feed online training, which updates the online models; a conversion dashboard tracks results; all state is managed transactionally in the MapR file system]
60. Service Architecture
[Diagram: the same pipeline under MapR pluggable service management, with the online components (Targeting Engine, Model Selector, online models, Conversion Detector) hosted by Storm and model training hosted by Hadoop, all on top of MapR lockless storage services]
61. Find Out More
• Me: tdunning@mapr.com
ted.dunning@gmail.com
tdunning@apache.org
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning
Editor's notes
The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
In classical analytics, the cost of doing analytics increases sharply.
The result is a net value that has a sharp optimum in the area where value is increasing rapidly and cost is not yet increasing so rapidly.
New techniques such as Hadoop result in linear scaling of cost. This change in shape causes a qualitative change in the way that cost trades off against value to give net value. As technology improves, the slope of this cost line is also dropping rapidly over time.
This next sequence shows how the net value changes with linear cost models of different slopes.
Notice how the best net value has jumped up significantly.
And as the cost line approaches horizontal, the highest net value occurs at dramatically larger data scale.
With no information, the relative expected payoff would be -0.25. This graph shows the 25th, 50th, and 75th-percentile results for sampled experiments with uniform random probabilities. Convergence to the optimum is nearly as fast as the theoretical optimum, sqrt(n). Note the log scale on the number of trials.
Here is how the system converges in terms of how likely it is to pick the better bandit when the two payoff probabilities are only slightly different. After 1000 trials, the system is already giving 75% of the bandwidth to the better option. This graph was produced by averaging several thousand runs with the same probabilities.