Scaling by Cheating
Approximation, Sampling and Fault-Friendliness
for Scalable Big Learning
Sean Owen / Director, Data Science @ Cloudera

1
Two Big Problems

2
Grow Bigger

“Today’s big is just tomorrow’s small. We’re expected to process arbitrarily large data sets by just adding computers. You can’t tell the boss that anything’s too big to handle these days.”

David, Sr. IT Manager

3
And Be Faster

“Speed is king. People expect up-to-the-second results, and millisecond response times. No more overnight reporting jobs. My data grows 10x but my latency has to drop 10x.”

Shelly, CTO

4
Two Big Solutions

5
Plentiful Resources

“Disk and CPU are cheap, on-demand. Frameworks to harness them, like Hadoop, are free and mature. We can easily bring to bear plenty of resources to process data quickly and cheaply.”

“Scooter”, White Lab

6
Cheating
Not Right, but Close Enough

7
Kirk: What would you say the odds are on our getting out of here?
Spock: Difficult to be precise, Captain. I should say approximately seven thousand eight hundred twenty four point seven to one.
Kirk: Difficult to be precise? Seven thousand eight hundred and twenty four to one?
Spock: Seven thousand eight hundred twenty four point seven to one.
Kirk: That's a pretty close approximation.

Star Trek, “Errand of Mercy”
http://www.redbubble.com/people/feelmeflow

8
When To Cheat Approximate
• Only a few significant figures matter
• Least-significant figures are noise
• Only relative rank matters
• Only care about “high” or “low”

Do you care about 37.94% vs simply 40%?

9
10
Approximation

11
The Mean
• Huge stream of values: x_1, x_2, x_3, … *
• Finding entire population mean µ is expensive
• Mean of small sample of N is close: µ_N = (1/N) (x_1 + x_2 + … + x_N)
• How large must N be to get close enough?

* independent, roughly normal distribution
12
“Close Enough” Mean
• Want: with high probability p, at most ε error: µ = (1 ± ε) µ_N
• Use Student’s t-distribution (N-1 d.o.f.): t = (µ - µ_N) / (σ_N/√N)
• How unknown µ behaves relative to known sample stats

13
“Close Enough” Mean
• Critical value for one tail: t_crit = CDF^-1((1+p)/2)
• Use library like Commons Math3: TDistribution.inverseCumulativeProbability()
• Solve for critical µ_crit: CDF^-1((1+p)/2) = (µ_crit - µ_N) / (σ_N/√N)
• µ “probably” at most µ_crit
• Stop when (µ_crit - µ_N) / µ_N small (<ε)

14
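The stopping rule from the two “Close Enough” Mean slides can be sketched as follows. This is a minimal illustration, not the CloseEnoughMean.java from the talk's repository: the class name, the running-variance update (Welford's method) and the minimum sample count are assumptions added here. The critical value is passed in as a function of the degrees of freedom so the sketch has no dependencies; with Commons Math3 on the classpath it could plausibly be supplied as `dof -> new TDistribution(dof).inverseCumulativeProbability((1 + p) / 2)`.

```java
import java.util.Random;
import java.util.function.IntToDoubleFunction;

/** Running sample mean with a "close enough" stopping test (illustrative sketch). */
class CloseEnoughSketch {
    // Assumption: require a minimum sample before trusting the estimate,
    // so that a few identical early values cannot stop the loop.
    private static final int MIN_SAMPLES = 30;

    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the running mean

    /** Welford's online update of the sample mean and variance. */
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long count() { return n; }

    double mean() { return mean; }

    /** Sample standard deviation sigma_N (N-1 in the denominator). */
    double stdDev() {
        return n < 2 ? Double.NaN : Math.sqrt(m2 / (n - 1));
    }

    /**
     * True when (mu_crit - mu_N) / |mu_N| < epsilon, where
     * mu_crit = mu_N + t_crit * sigma_N / sqrt(N).
     * tCritForDof maps degrees of freedom (N-1) to CDF^-1((1+p)/2).
     */
    boolean isCloseEnough(double epsilon, IntToDoubleFunction tCritForDof) {
        if (n < MIN_SAMPLES || mean == 0.0) {
            return false; // too little data, or relative error undefined
        }
        double tCrit = tCritForDof.applyAsDouble((int) (n - 1));
        double halfWidth = tCrit * stdDev() / Math.sqrt(n);
        return halfWidth / Math.abs(mean) < epsilon;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in: 1.645 is the large-N (normal) limit for p = 0.90.
        IntToDoubleFunction tCrit = dof -> 1.645;
        CloseEnoughSketch m = new CloseEnoughSketch();
        Random random = new Random(42);
        while (!m.isCloseEnough(0.10, tCrit)) {
            m.add(random.nextDouble() < 0.4 ? 1.0 : 0.0); // 1 = "Capitalized"
        }
        System.out.println("Stopped after " + m.count() + " values, mean " + m.mean());
    }
}
```

The test is relative (divided by µ_N), so it is undefined when the sample mean is zero; the sketch simply keeps sampling in that case.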
Sampling

15
16
Word Count: Toy Example
• Input: text documents
• Exactly how many times does each word occur?
• Necessary precision?
• Interesting question?

Why?
17
Word Count: Useful Example
• About how many times does each word occur?
• Which 10 words occur most frequently?
• What fraction are Capitalized?

Hmm!

18
Common Crawl
• s3n://aws-publicdatasets/common-crawl/parse-output/segment/*/textData-*
• Count top words, Capitalized, zucchini in 35GB subset
• github.com/srowen/commoncrawl
• Amazon EMR, 4 c1.xlarge instances

19
Raw Results
• 40 minutes
• 40.1% Capitalized
• Most frequent words: the and to of a in de for is
• zucchini occurs 9,571 times

20
Sample 10% of Documents
• 21 minutes
• 39.9% Capitalized
• Most frequent words: the and to of a in de for is
• zucchini occurs 967 times (9,670 overall)

...
if (Math.random() >= 0.1)
  continue;
...

21
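As a rough illustration of the 10% sampling step, the sketch below counts a word in a random subset of documents and scales the count by 1/rate, which is how 967 sampled occurrences become an estimate of 9,670 overall. The class name, method name and whitespace tokenization are assumptions made for this example; the talk's actual job runs as a Hadoop Mapper and is not reproduced here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Counts a word in a random sample of documents and extrapolates (illustrative sketch). */
class SampledWordCount {

    /**
     * Counts occurrences of word in roughly a fraction `rate` of the documents,
     * then scales the count by 1/rate to estimate the total over all documents.
     */
    static double estimateCount(List<String> documents, String word, double rate, Random random) {
        long sampled = 0;
        for (String doc : documents) {
            if (random.nextDouble() >= rate) {
                continue; // skip about (1 - rate) of the documents
            }
            // Assumption: tokens are separated by whitespace.
            for (String token : doc.split("\\s+")) {
                if (token.equals(word)) {
                    sampled++;
                }
            }
        }
        return sampled / rate;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("zucchini soup", "the zucchini", "no squash here");
        // With so few documents the estimate is very noisy; sampling pays off at scale.
        System.out.println(estimateCount(docs, "zucchini", 0.1, new Random()));
    }
}
```

Passing the `Random` in, rather than calling `Math.random()`, is a choice made here so that a run can be repeated with a fixed seed.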
Stop When “Close Enough”
• CloseEnoughMean.java
• Stop mapping when % Capitalized is close enough
• 10% error, 90% confidence per Mapper
• 18 minutes
• 39.8% Capitalized

...
if (m.isCloseEnough()) {
  break;
}
...

22
More Sampling

23
24
Item-Item Similarity
• Input: user-item click counts
• Compute all-pairs item-item similarity
• Output size is (# Items x # Items)
• Far too large to consume in next job
• But, virtually all similarities are noise, near 0

[Figure: sparse User x Item matrix of click counts]

25
Pruning
• ItemSimilarityJob
• --threshold: Discard similarities < value
• --maxSimilaritiesPerItem: Keep only top n pairs per item
• --maxPrefsPerUser: Ignore excess from “prolific” users

[Figure: Item x Item similarity matrix, mostly zeros with 1 on the diagonal]

26
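To make the first two pruning options concrete, here is a small in-memory sketch of thresholding followed by keeping the top n similarities per item. It is not Mahout's ItemSimilarityJob code: the class, the `Neighbor` type, the `prune` method and the dense `double[][]` input are all assumptions chosen for illustration, and the per-user cap (--maxPrefsPerUser) is not shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Prunes a dense item-item similarity matrix (illustrative sketch, not Mahout's code). */
class SimilarityPruning {

    /** One surviving (otherItem, similarity) entry for an item. */
    static final class Neighbor {
        final int item;
        final double similarity;

        Neighbor(int item, double similarity) {
            this.item = item;
            this.similarity = similarity;
        }
    }

    /**
     * For each item (row), drops the self-similarity and any value below threshold,
     * then keeps at most maxPerItem of the largest remaining similarities.
     */
    static List<List<Neighbor>> prune(double[][] sim, double threshold, int maxPerItem) {
        List<List<Neighbor>> result = new ArrayList<>();
        for (int i = 0; i < sim.length; i++) {
            List<Neighbor> kept = new ArrayList<>();
            for (int j = 0; j < sim[i].length; j++) {
                if (j != i && sim[i][j] >= threshold) {
                    kept.add(new Neighbor(j, sim[i][j]));
                }
            }
            // Sort descending by similarity, then truncate to the top maxPerItem.
            kept.sort(Comparator.comparingDouble((Neighbor nb) -> nb.similarity).reversed());
            if (kept.size() > maxPerItem) {
                kept = new ArrayList<>(kept.subList(0, maxPerItem));
            }
            result.add(kept);
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] sim = {
            {1.0, 0.5, 0.0, 0.1},
            {0.5, 1.0, 0.2, 0.0},
            {0.0, 0.2, 1.0, -0.2},
            {0.1, 0.0, -0.2, 1.0},
        };
        List<List<Neighbor>> pruned = prune(sim, 0.15, 1);
        for (int i = 0; i < pruned.size(); i++) {
            for (Neighbor nb : pruned.get(i)) {
                System.out.println(i + " -> " + nb.item + " : " + nb.similarity);
            }
        }
    }
}
```

After pruning, each item carries a short neighbor list instead of a full row, which is what lets the output shrink from (# Items x # Items) to roughly (# Items x n).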
Pruning Experiment
• Líbímseti dating site data set
  • 135K users x 165K profiles
  • 17M data points
  • Rating on 1-10 scale
• Compute all item-item Pearson correlations
• Amazon EMR, 2 m1.xlarge

27
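For reference, the per-pair quantity in this experiment, a Pearson correlation between two items, can be sketched in a few lines. This is an assumption-laden illustration rather than Mahout's distributed implementation: ratings are held in plain arrays indexed by user, `Double.NaN` marks a missing rating, and only users who rated both items contribute.

```java
/** Pearson correlation between two items over co-rating users (illustrative sketch). */
class PearsonSketch {

    /**
     * ratingsA[u] and ratingsB[u] are user u's ratings of items A and B;
     * Double.NaN marks a missing rating. Only users who rated both are used.
     * Returns NaN when fewer than two users overlap or either item has zero variance.
     */
    static double pearson(double[] ratingsA, double[] ratingsB) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (int u = 0; u < ratingsA.length; u++) {
            double a = ratingsA[u];
            double b = ratingsB[u];
            if (Double.isNaN(a) || Double.isNaN(b)) {
                continue; // this user did not rate both items
            }
            n++;
            sumA += a;
            sumB += b;
            sumAA += a * a;
            sumBB += b * b;
            sumAB += a * b;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        if (varA <= 0 || varB <= 0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        double[] itemA = {1, 2, 3, Double.NaN};
        double[] itemB = {2, 4, 6, 5};
        System.out.println(pearson(itemA, itemB));
    }
}
```

Doing this for every pair of 165K profiles is what makes the unpruned job expensive, since the number of pairs grows with the square of the item count.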
Pruning Experiment
No Pruning
• 0 threshold
• <10000 pairs per item
• <1000 prefs per user
• 178 minutes
• 20,400 MB output

Pruning
• >0.3 threshold
• <10 pairs per item
• <100 prefs per user
• 11 minutes
• 2 MB output

28
Resources
• Apache Mahout: mahout.apache.org
• Commons Math: commons.apache.org/proper/commons-math/
• github.com/srowen/commoncrawl
• sowen@cloudera.com

29
October hug

  • 28. Pruning Experiment No Pruning • 0 threshold • <10000 pairs per item • <1000 prefs per user • 178 minutes • 20,400 MB output 28 Pruning • >0.3 threshold • <10 pairs per item • <100 prefs per user • 11 minutes • 2 MB output