SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
Scaling by Cheating
Approximation, Sampling and Fault-Friendliness
for Scalable Big Learning
Sean Owen / Director, Data Science @ Cloudera

1
Two Big Problems

2
Grow Bigger

“

Today’s big is just
tomorrow’s small.
“ Makeexpected to
We’re quotes look
process or different.”
interestingarbitrarily large
data sets by just adding
computers. You can’t tell
the boss that anything’s
too big to handle these
days.

”

David, Sr. IT Manager

3
And Be Faster

“Speed is king. People

expect up-to-the-second
results, and millisecond
response times. No
more overnight
reporting jobs. My data
grows 10x but my
latency has to drop 10x.

“ Make quotes look
interesting or different.”

”

Shelly, CTO

4
Two Big Solutions

5
Plentiful Resources

“

Disk and CPU are cheap,
on-demand.
“ Make quotesharness
Frameworks to look
them, like Hadoop, are
interesting or different.”
free and mature. We
can easily bring to bear
plenty of resources to
process data quickly and
cheaply.

”

“Scooter”, White Lab

6
Cheating
Not Right, but Close Enough

7
Kirk What would you say the odds are on
our getting out of here?
Spock Difficult to be precise, Captain. I
should say approximately seven thousand
eight hundred twenty four point seven to
one.
Kirk Difficult to be precise?
Seven thousand eight hundred
and twenty four to one?

Spock Seven thousand eight hundred twenty
four point seven to one.
Kirk That's a pretty close approximation.
Star Trek, “Errand of Mercy”
http://www.redbubble.com/people/feelmeflow

8
When To Cheat Approximate
Only a few significant
figures matter
• Least-significant figures
are noise
• Only relative rank matters
• Only care about
“high” or “low”
•

9

Do you care about
37.94% vs simply 40%?
10
Approximation

11
The Mean
Huge stream of values: x1 x2 x3 … *
• Finding entire population mean µ is expensive
• Mean of small sample of N is close:
•

µN = (1/N) (x1 + x2 + … + xN)
•

How much gets close enough?

* independent, roughly normal distribution
12
“Close Enough” Mean
Want: with high probability p, at most ε error
µ = (1± ε) µN
• Use Student’s t-distribution (N-1 d.o.f.)
t = (µ - µN) / (σN/√N)
• How unknown µ behaves relative
to known sample stats
•

13
“Close Enough” Mean
Critical value for one tail
tcrit = CDF-1((1+p)/2)
• Use library like Commons Math3:
•

TDistribution.inverseCumulativeProbability()

Solve for critical µcrit
CDF-1((1+p)/2) = (µcrit - µN) / (σN/√N)
• µ “probably” at most µcrit
• Stop when (µcrit - µN) / µN small (<ε)
•

14
Sampling

15
16
Word Count: Toy Example
Input: text documents
• Exactly how many times does
each word occur?
• Necessary precision?
• Interesting question?
•

Why?
17
Word Count: Useful Example
About how many times does
each word occur?
• Which 10 words occur
most frequently?
• What fraction are
Capitalized?
•

Hmm!

18
Common Crawl
•

•

Count top words, Capitalized, zucchini
in 35GB subset

•

github.com/srowen/commoncrawl

•

19

s3n://aws-publicdatasets/common-crawl/
parse-output/segment/*/textData-*

Amazon EMR
4 c1.xlarge instances
Raw Results
40 minutes
• 40.1% Capitalized
• Most frequent words:
the and to of a in de for is
• zucchini occurs 9,571 times
•

20
Sample 10% of Documents
21 minutes
• 39.9% Capitalized
• Most frequent words:
the and to of a in de for is
• zucchini occurs 967 times,
( 9,670 overall)
•

21

...
if (Math.random() >= 0.1)
continue;
...
Stop When “Close Enough”
•

CloseEnoughMean.java

Stop mapping when
% Capitalized is close
enough
• 10% error, 90% confidence
per Mapper
• 18 minutes
• 39.8% Capitalized
•

22

...
if (m.isCloseEnough()) {
break;
}
...
More Sampling

23
24
Item-Item Similarity
•
•
•
•
•

Input: user-item click counts
Compute all-pairs item-item similarity
Output size is
(# Items x # Items)
Far too large to consume
in next job
1
But, virtually all similarities
are noise, near 0

Item
1

9
7
2

2

User

1

3
1

1

8

8

4
3
1

2

2

1

4

25

2

1

3

1

2
Pruning
•
•

ItemSimilarityJob
--threshold

Discard similarities < value
•

Item

--maxSimilaritiesPerItem

0

0.5

0

0

1

0.1

0

0

0.2

0

0.1

0.5

0.1

1

0

-0.2

0

0

0

0

Item

0

0

0

1

0

0

0

0

0

-0.2

0

1

0.2

0

0.2

0.5

26

0.5

0

Keep only top n pairs per item
--maxPrefsPerUser
Ignore excess from
“prolific” users

0

0

•

1

0.2

0

0

0.2

1

0

0

0

0

0

0

0

0

1

0

0

0.1

0

0

0.2

0

0

1
Pruning Experiment
•

Líbímseti dating site data set
•
•
•

135K users x 165K profiles
17M data points
Rating on 1-10 scale

Compute all item-item
Pearson correlations
• Amazon EMR
2 m1.xlarge
•

27
Pruning Experiment
No Pruning
• 0 threshold
• <10000 pairs per item
• <1000 prefs per user
• 178 minutes
• 20,400 MB output

28

Pruning
• >0.3 threshold
• <10 pairs per item
• <100 prefs per user
• 11 minutes
• 2 MB output
Resources
•

Apache Mahout

•

github.com/srowen/
commoncrawl

•

sowen@cloudera.com

mahout.apache.org

•

Commons Math
commons.apache.org/pro
per/commons-math/

29
October hug

Weitere ähnliche Inhalte

Ähnlich wie October hug

Putting the Magic in Data Science
Putting the Magic in Data SciencePutting the Magic in Data Science
Putting the Magic in Data ScienceSean Taylor
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsRussell Jurney
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsRussell Jurney
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
 
Short URLs, Big Fun
Short URLs, Big FunShort URLs, Big Fun
Short URLs, Big FunHilary Mason
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structureselliando dias
 
Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"Discover Pinterest
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012c.titus.brown
 
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Data Con LA
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoopRussell Jurney
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-datac.titus.brown
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++Mike Acton
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"NUS-ISS
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera, Inc.
 

Ähnlich wie October hug (20)

2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
Putting the Magic in Data Science
Putting the Magic in Data SciencePutting the Magic in Data Science
Putting the Magic in Data Science
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
Short URLs, Big Fun
Short URLs, Big FunShort URLs, Big Fun
Short URLs, Big Fun
 
14 turing wics
14 turing wics14 turing wics
14 turing wics
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
 
Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"Five Ways To Do Data Analytics "The Wrong Way"
Five Ways To Do Data Analytics "The Wrong Way"
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 

Mehr von huguk

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifactahuguk
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introhuguk
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoophuguk
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...huguk
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watsonhuguk
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink huguk
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...huguk
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitchinghuguk
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoringhuguk
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startuphuguk
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapulthuguk
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysishuguk
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analyticshuguk
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Socialhuguk
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligencehuguk
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive huguk
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...huguk
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthyhuguk
 

Mehr von huguk (20)

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp intro
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitching
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoring
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startup
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapult
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysis
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analytics
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Social
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligence
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
 

Kürzlich hochgeladen

IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 

Kürzlich hochgeladen (20)

201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 

October hug

  • 1. Scaling by Cheating Approximation, Sampling and Fault-Friendliness for Scalable Big Learning Sean Owen / Director, Data Science @ Cloudera 1
  • 3. Grow Bigger “ Today’s big is just tomorrow’s small. “ Makeexpected to We’re quotes look process or different.” interestingarbitrarily large data sets by just adding computers. You can’t tell the boss that anything’s too big to handle these days. ” David, Sr. IT Manager 3
  • 4. And Be Faster “Speed is king. People expect up-to-the-second results, and millisecond response times. No more overnight reporting jobs. My data grows 10x but my latency has to drop 10x. “ Make quotes look interesting or different.” ” Shelly, CTO 4
  • 6. Plentiful Resources “ Disk and CPU are cheap, on-demand. “ Make quotesharness Frameworks to look them, like Hadoop, are interesting or different.” free and mature. We can easily bring to bear plenty of resources to process data quickly and cheaply. ” “Scooter”, White Lab 6
  • 7. Cheating Not Right, but Close Enough 7
  • 8. Kirk What would you say the odds are on our getting out of here? Spock Difficult to be precise, Captain. I should say approximately seven thousand eight hundred twenty four point seven to one. Kirk Difficult to be precise? Seven thousand eight hundred and twenty four to one? Spock Seven thousand eight hundred twenty four point seven to one. Kirk That's a pretty close approximation. Star Trek, “Errand of Mercy” http://www.redbubble.com/people/feelmeflow 8
  • 9. When To Cheat Approximate Only a few significant figures matter • Least-significant figures are noise • Only relative rank matters • Only care about “high” or “low” • 9 Do you care about 37.94% vs simply 40%?
  • 10. 10
  • 12. The Mean Huge stream of values: x1 x2 x3 … * • Finding entire population mean µ is expensive • Mean of small sample of N is close: • µN = (1/N) (x1 + x2 + … + xN) • How much gets close enough? * independent, roughly normal distribution 12
  • 13. “Close Enough” Mean Want: with high probability p, at most ε error µ = (1± ε) µN • Use Student’s t-distribution (N-1 d.o.f.) t = (µ - µN) / (σN/√N) • How unknown µ behaves relative to known sample stats • 13
  • 14. “Close Enough” Mean Critical value for one tail tcrit = CDF-1((1+p)/2) • Use library like Commons Math3: • TDistribution.inverseCumulativeProbability() Solve for critical µcrit CDF-1((1+p)/2) = (µcrit - µN) / (σN/√N) • µ “probably” at most µcrit • Stop when (µcrit - µN) / µN small (<ε) • 14
  • 16. 16
  • 17. Word Count: Toy Example Input: text documents • Exactly how many times does each word occur? • Necessary precision? • Interesting question? • Why? 17
  • 18. Word Count: Useful Example About how many times does each word occur? • Which 10 words occur most frequently? • What fraction are Capitalized? • Hmm! 18
  • 19. Common Crawl • • Count top words, Capitalized, zucchini in 35GB subset • github.com/srowen/commoncrawl • 19 s3n://aws-publicdatasets/common-crawl/ parse-output/segment/*/textData-* Amazon EMR 4 c1.xlarge instances
  • 20. Raw Results 40 minutes • 40.1% Capitalized • Most frequent words: the and to of a in de for is • zucchini occurs 9,571 times • 20
  • 21. Sample 10% of Documents 21 minutes • 39.9% Capitalized • Most frequent words: the and to of a in de for is • zucchini occurs 967 times, ( 9,670 overall) • 21 ... if (Math.random() >= 0.1) continue; ...
  • 22. Stop When “Close Enough” • CloseEnoughMean.java Stop mapping when % Capitalized is close enough • 10% error, 90% confidence per Mapper • 18 minutes • 39.8% Capitalized • 22 ... if (m.isCloseEnough()) { break; } ...
  • 24. 24
  • 25. Item-Item Similarity • • • • • Input: user-item click counts Compute all-pairs item-item similarity Output size is (# Items x # Items) Far too large to consume in next job 1 But, virtually all similarities are noise, near 0 Item 1 9 7 2 2 User 1 3 1 1 8 8 4 3 1 2 2 1 4 25 2 1 3 1 2
  • 26. Pruning • • ItemSimilarityJob --threshold Discard similarities < value • Item --maxSimilaritiesPerItem 0 0.5 0 0 1 0.1 0 0 0.2 0 0.1 0.5 0.1 1 0 -0.2 0 0 0 0 Item 0 0 0 1 0 0 0 0 0 -0.2 0 1 0.2 0 0.2 0.5 26 0.5 0 Keep only top n pairs per item --maxPrefsPerUser Ignore excess from “prolific” users 0 0 • 1 0.2 0 0 0.2 1 0 0 0 0 0 0 0 0 1 0 0 0.1 0 0 0.2 0 0 1
  • 27. Pruning Experiment • Líbímseti dating site data set • • • 135K users x 165K profiles 17M data points Rating on 1-10 scale Compute all item-item Pearson correlations • Amazon EMR 2 m1.xlarge • 27
  • 28. Pruning Experiment No Pruning • 0 threshold • <10000 pairs per item • <1000 prefs per user • 178 minutes • 20,400 MB output 28 Pruning • >0.3 threshold • <10 pairs per item • <100 prefs per user • 11 minutes • 2 MB output