Counting Big Data
by Streaming Algorithms
2013/10/26 @ Rakuten Technology Conference 2013
Rakuten Institute of Technology, Rakuten, Inc.,
Yusaku Kaneta
http://www.rakuten.co.jp/
Who am I?
• Yusaku Kaneta (@yusakukaneta)
– Joined Rakuten in April 2012.
– Rakuten Institute of Technology (RIT)

• Interests:
– String processing (esp. pattern matching)
– Hardware design using FPGA
– Bitwise tricks & techniques
• Love TAOCP 7.1.3 & Hacker's Delight
2
Problem: Count Big Data
• Counting:
– Fundamental operation in data analysis.

• Big data is difficult to just count
– Because counting it needs a huge amount of memory.
– E.g., 400GB+ is needed for one year of access logs.

3
Batch Processing
• Batch processing can solve this.
– E.g., [logo on the slide]

• Two issues:
– High latency
– Requirement for a cluster of machines = High cost

[Figure: six boxes labeled "Batch"]

4
Our Goals
1. Reduce memory
– Cost reduction.

2. Reduce latency
– Quick business decisions.

3. Achieve high accuracy
– Correct business decisions.
5
Our Approach
• Streaming algorithms
– Can fulfill all our goals!
– Are becoming common in Web companies.
• See the paper on Google’s PowerDrill & the code of Twitter’s Algebird for examples of how to use them.

• Keys:
– Limited memory
– Low latency
– Theoretical guarantee for accuracy
6
Streaming Algorithm Library
• RIT internally provides a C library
for streaming algorithms, libsketch.
• Three advantages:
– Memory efficient
– High speed
– High accuracy

• Bindings for [logo] & [logo]
7
Why C?
• Our target: Python & Ruby users!
– For data analysis
– For stream processing
– But most existing libraries are written in Scala (Algebird), Java (stream-lib), ...

• This is one reason why our library is written in C!
– It is easy to incorporate C libraries in Python & Ruby.
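
As a minimal sketch of how a C library can be called from Python, the example below uses the standard ctypes module. libsketch is internal and its API is not shown in these slides, so the C standard library function strlen stands in for it; the wrapper name c_strlen is hypothetical.

```python
import ctypes
import ctypes.util

# Load the C standard library as a stand-in for a C library such as libsketch
# (the real libsketch API is not shown in the slides).
# On POSIX systems, CDLL(None) falls back to the symbols of the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Return the length of a byte string as computed by C's strlen."""
    return libc.strlen(data)


print(c_strlen(b"rakuten"))
```

A binding for a real sketch library would follow the same pattern: load the shared object, declare argument and return types, and wrap the calls in a small Python class.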
8
Application
Count Query in Rakuten
• Example: We want to know...
1. How many unique users checked an item in one day (month, or year)?
2. How many products were sold in one day (month, or year)?

• Streaming algorithms for the queries
1. HyperLogLog algorithm
2. Count-Min Sketch algorithm
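
For the second query, a Count-Min Sketch keeps approximate per-key counts in a fixed-size table. The code below is a minimal sketch of the general technique, not libsketch's implementation; the class name, the width and depth defaults, and the use of BLAKE2b as the hash function are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counters in a fixed depth x width table."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, obtained by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum over the rows
        # is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._index(row, key)] for row in range(self.depth)
        )


cms = CountMinSketch()
for product in ["book", "camera", "book", "book"]:
    cms.add(product)
print(cms.estimate("book"), cms.estimate("camera"))
```

The memory use depends only on width and depth, not on the number of distinct products, which is what makes the structure suitable for long time windows.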
10
Problem: Unique Item Count
• Naïve approach:
– Uses a dict in Python: "dict[key] += 1"
– This can require a large amount of memory, because every distinct key is stored.

• Streaming algorithm: HyperLogLog
– Counts unique items approximately.
– This needs a fixed amount of memory.
• Google recently proposed an improved version of
HyperLogLog, called HyperLogLog++.
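
The naïve approach can be sketched as follows. This is a minimal illustration, with the function name count_unique_naive chosen here for the example; the dictionary holds one entry per distinct key, so memory grows with the number of unique items.

```python
from collections import defaultdict


def count_unique_naive(items):
    """Exact unique count: one dictionary entry per distinct key."""
    counts = defaultdict(int)
    for key in items:
        counts[key] += 1
    return len(counts)


print(count_unique_naive(["u1", "u2", "u1", "u3"]))
```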

12
HyperLogLog
• Basic ideas:

– Hash function
– Harmonic mean
– Stochastic averaging
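
To illustrate why a harmonic mean is used, the small example below compares it with the arithmetic mean on made-up per-bucket values with one outlier. The numbers are invented for illustration only; the point is that a single large value dominates the arithmetic mean but barely moves the harmonic mean.

```python
def arithmetic_mean(values):
    return sum(values) / len(values)


def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical per-bucket values with one outlier (1024).
per_bucket = [4, 4, 8, 4, 1024]

print(arithmetic_mean(per_bucket))  # pulled far up by the outlier
print(harmonic_mean(per_bucket))    # stays close to the typical values
```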

13
HyperLogLog
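• Algorithm
1. Set i to the upper bits of hash(key).
2. Set A[i] to max(j, A[i]), where j = (# leading 0s in the lower bits) + 1.
3. Estimate # unique items from E = 1/Σ(2^(-A[i])), scaled by a factor that depends only on the array size.
(In practice, we use heuristics for corrections.)

• Example:
– hash(Item1) = 0001 00000110 (upper bits: 0001, lower bits: 00000110)
– i = (0001)_2 = 1, j = (# leading 0s) + 1 = 6
– Array A after the update: A[0] = 2, A[1] = 6 (previously 4), ...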
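
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not libsketch's code: the names bucket_and_rank and HyperLogLog, the 64-bit BLAKE2b hash, and the default of 12 index bits are choices made for this example. The scaling constant and the small-range correction follow the original HyperLogLog paper [Flajolet et al., AOFA 2007].

```python
import hashlib
import math


def bucket_and_rank(h, hash_bits, p):
    """Split a hash into (i, j): i = upper p bits, j = leading 0s of the rest + 1."""
    lower_bits = hash_bits - p
    i = h >> lower_bits
    rest = h & ((1 << lower_bits) - 1)
    j = lower_bits - rest.bit_length() + 1
    return i, j


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                # number of upper bits used as the array index
        self.m = 1 << p           # array size
        self.A = [0] * self.m
        # Scaling constant from the original paper (valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")          # 64-bit hash value
        i, j = bucket_and_rank(h, 64, self.p)      # steps 1 and 2
        self.A[i] = max(self.A[i], j)

    def estimate(self):
        # Step 3: scaled harmonic mean of 2^A[i].
        raw = self.alpha * self.m * self.m / sum(2.0 ** -a for a in self.A)
        zeros = self.A.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# The worked example from the slide: 12-bit hash, 4 upper bits.
print(bucket_and_rank(0b000100000110, 12, 4))
```

With p = 12 the array has 4096 entries regardless of how many items are added, and the expected relative error is roughly 1.04/sqrt(4096), about 1.6%.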
14
Demo
• Naïve vs. HyperLogLog

15
Performance
• Task: Count unique items in an item set.

            Naïve     HyperLogLog   HyperLogLog vs. naïve
Memory      1193MB    5MB           1%   (memory efficient)
Time        419sec    108sec        4x   (high speed)
Accuracy    100%      99%           -1%  (high accuracy)

This data set is small,
but we are using HyperLogLog for bigger data.
16
Conclusion
• Streaming algorithms in Rakuten
– We are using them for data analysis.
– We have an internal C library with bindings.
• HyperLogLog, Count-Min Sketch, and so on.

– Future: We plan to implement other algorithms.

17
Reference
• HyperLogLog & HyperLogLog++
– [Flajolet et al., AOFA 2007], [Heule et al., EDBT 2013]

• Count-Min Sketch
– [Cormode, Muthukrishnan, J. Algorithms, 2005]

• An excellent slide deck by Alex Smola
– http://alex.smola.org/teaching/berkeley2012/slides/3_Streams.pdf

• AK TECH BLOG by Aggregate Knowledge
– http://blog.aggregateknowledge.com/

• Stream-lib by Clearspring
– https://github.com/clearspring/stream-lib

18

Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 

Kürzlich hochgeladen (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms

  • 1. Counting Big Data by Streaming Algorithms 2013/10/26 @ Rakuten Technology Conference 2013 Rakuten Institute of Technology, Rakuten, Inc., Yusaku Kaneta http://www.rakuten.co.jp/
  • 2. Who am I? • Yusaku Kaneta (@yusakukaneta) – Joined Rakuten in April 2012. – Rakuten Institute of Technology (RIT) • Interests: – String processing (esp. pattern matching) – Hardware design using FPGA – Bitwise tricks & techniques • Love TAOCP 7.1.3 & Hacker's Delight
  • 3. Problem: Count Big Data • Counting: – Fundamental operation in data analysis. • Big data is difficult to just count – Because it needs a huge amount of memory. – E.g., 400GB+ is needed for one-year access logs.
  • 4. Batch Processing • Batch processing can solve this. – E.g., [image] • Two issues: – High latency – Requirement for a cluster of machines = High cost
  • 5. Our Goals 1. Reduce memory – Cost reduction. 2. Reduce latency – Quick business decisions. 3. Achieve high accuracy – Correct business decisions.
  • 6. Our Approach • Streaming algorithms – Can fulfill all our goals! – Becoming common in Web companies. • See the paper on Google's PowerDrill & the code of Twitter's Algebird for examples of how to use them. • Keys: – Limited memory – Low latency – Theoretical guarantee for accuracy
  • 7. Streaming Algorithm Library • RIT internally provides a C library for streaming algorithms, libsketch. • Three advantages: – Memory efficient – High speed – High accuracy • Bindings for [image] & [image]
  • 8. Why C? • Our target: Python (for data analysis) & Ruby (for stream processing) users! – But most existing libraries are written in Scala (algebird), Java (stream-lib), ... • This is one reason why our library is written in C: it is easy to incorporate C libraries in Python & Ruby.
  • 10. Count Query in Rakuten • Example: We want to know... 1. How many unique users checked an item in one day (month, or year)? 2. How many products were sold in one day (month, or year)? • Streaming algorithms for the queries: 1. HyperLogLog algorithm 2. Count-Min Sketch algorithm
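The slides name Count-Min Sketch for the second query (per-product counts) but do not describe it. The following is a minimal sketch of the standard algorithm from [Cormode, Muthukrishnan, 2005], not the libsketch API: the class name, the `width` and `depth` parameters, and the use of SHA-1 as the hash are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counts in a fixed-size table (illustrative only)."""

    def __init__(self, width=2048, depth=4):
        # depth rows, each with its own hash function and width counters.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Derive one hash per row by prefixing the row number (an assumption;
        # any family of independent hash functions would do).
        data = ("%d:%s" % (row, item)).encode("utf-8")
        digest = hashlib.sha1(data).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def query(self, item):
        # Collisions can only inflate counters, so the minimum over rows
        # is the tightest estimate and never underestimates.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))


cms = CountMinSketch()
for product in ["item-a", "item-b", "item-a", "item-a"]:
    cms.add(product)
print(cms.query("item-a"))
```

Memory is fixed at `width * depth` counters regardless of how many distinct products appear in the stream; a wider table lowers the overestimation error and a deeper one lowers the probability of a bad estimate.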
  • 12. Problem: Unique Item Count • Naïve approach: – Uses dict in Python: "dict[key] += 1" – This can require a large amount of memory. • Streaming algorithm: HyperLogLog – Counts unique items approximately. – This needs a fixed amount of memory. • Google recently proposed an improved version of HyperLogLog, called HyperLogLog++.
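The naïve approach on this slide can be sketched as follows. The function name and the sample stream are made up for illustration. Note that a bare `dict[key] += 1` raises `KeyError` the first time a key is seen, so the sketch uses `dict.get` with a default.

```python
def count_naive(items):
    """Exact counting with a dict: one entry is stored per distinct key."""
    counts = {}
    for key in items:
        counts[key] = counts.get(key, 0) + 1
    return counts


stream = ["user1", "user2", "user1", "user3", "user2", "user1"]
counts = count_naive(stream)
print(len(counts))      # number of unique items → 3
print(counts["user1"])  # occurrences of one key → 3
```

The dict holds every distinct key, so its memory grows linearly with the number of unique items. That growth is what HyperLogLog avoids.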
  • 13. HyperLogLog • Basic ideas: – Hash function – Harmonic mean – Stochastic averaging
  • 14. HyperLogLog • Algorithm (for each key: Item1, Item2, Item3, ...): 1. Set i to the upper bits of hash(key). 2. Set A[i] to max(j, A[i]), where j = (# leading 0s in the lower bits) + 1. 3. Estimate # unique items from E = 1/Σ(2^(-A[i])). (In practice, we use heuristics for corrections.) • Example: hash(Item1) = 0001 00000110 (upper bits, lower bits), so i = (0001)_2 = 1 and j = (# leading 0s) + 1 = 6; A[1] is set to max(6, A[1]).
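The three steps on this slide can be sketched in Python as below. This is an illustrative version of the standard algorithm from [Flajolet et al., 2007], not the libsketch implementation. The class name, the precision parameter `p`, and the choice of a 64-bit value taken from SHA-1 are assumptions. The normalizing factor `alpha * m * m` in front of the harmonic-mean term and the small-range (linear counting) correction come from the paper; the slide only alludes to them as heuristics.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counting with m = 2**p small registers."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("p must be in 7..16 for the alpha formula used here")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m  # the array A on the slide

    def _hash64(self, item):
        # Assumption: the first 8 bytes of SHA-1 serve as a 64-bit hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash64(item)
        low_bits = 64 - self.p
        i = h >> low_bits                       # step 1: upper p bits
        rest = h & ((1 << low_bits) - 1)        # lower 64 - p bits
        j = low_bits - rest.bit_length() + 1    # (# leading 0s) + 1
        if j > self.registers[i]:               # step 2: A[i] = max(j, A[i])
            self.registers[i] = j

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant for m >= 128
        # Step 3: normalized harmonic mean of 2**A[i].
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog(p=12)
for n in range(20000):
    hll.add("user%d" % n)
    hll.add("user%d" % n)  # duplicates do not change the registers
print(round(hll.estimate()))
```

With `p=12` the sketch keeps 4096 registers no matter how many items are added. The relative standard error given in the paper is about 1.04/sqrt(m), roughly 1.6% for this setting.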
  • 15. Demo • Naïve vs. HyperLogLog
  • 16. Performance • Task: Count unique items in an item set (naïve vs. HyperLogLog). • Memory efficient: 1193MB vs. 5MB (1%) • High speed: 419sec vs. 108sec (4x speed-up) • High accuracy: 100% vs. 99% (-1%) • This data set is small, but we are using HyperLogLog for bigger data.
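The rounded figures on this slide (1% and 4x) can be checked from the raw numbers. The variable names below are made up; the four measurements are the ones on the slide, with the larger memory and time values taken to be the naïve run.

```python
naive_mem_mb, hll_mem_mb = 1193, 5
naive_sec, hll_sec = 419, 108

memory_ratio = hll_mem_mb / naive_mem_mb  # fraction of naive memory used
speedup = naive_sec / hll_sec             # how many times faster

print("memory: %.2f%% of naive" % (100 * memory_ratio))  # memory: 0.42% of naive
print("speed-up: %.2fx" % speedup)                       # speed-up: 3.88x
```

So the slide's "1%" is a conservative rounding of about 0.4%, and "4x" rounds up from about 3.9x.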
  • 17. Conclusion • Streaming algorithms in Rakuten – We are using them for data analysis. – We have an internal C library with bindings. • HyperLogLog, Count-Min Sketch, and so on. – Future: Plan to implement other algorithms.
  • 18. References • HyperLogLog & HyperLogLog++ – [Flajolet et al., AOFA 2007], [Heule et al., EDBT 2013] • Count-Min Sketch – [Cormode, Muthukrishnan, J. Algorithms, 2005] • An excellent slide deck by Alex Smola – http://alex.smola.org/teaching/berkeley2012/slides/3_Streams.pdf • AK TECH BLOG by Aggregate Knowledge – http://blog.aggregateknowledge.com/ • Stream-lib by Clearspring – https://github.com/clearspring/stream-lib