MapReduce:
Beyond Word Count
Jeff Patti
https://github.com/jepatti/mrjob_recipes
What is MapReduce?
“MapReduce is a programming model for processing large
data sets with a parallel, distributed algorithm on a cluster.”
- Wikipedia
Map - given a line of a file, yield key: value pairs
Reduce - given a key and all values with that key from the
prior map phase, yield key: value pairs
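The map and reduce contract above can be sketched in plain Python, without Hadoop or mrjob. This is only an in-memory illustration: `run_mapreduce`, `temp_mapper`, and `temp_reducer` are hypothetical names, and the "city temperature" input lines are invented for the example.

```python
from collections import defaultdict


def run_mapreduce(lines, mapper, reducer):
    """Run one map/shuffle/reduce pass over an iterable of lines, in memory."""
    grouped = defaultdict(list)
    for line in lines:
        # Map phase: each line may yield any number of (key, value) pairs.
        for key, value in mapper(line):
            # Shuffle: gather every value that shares a key.
            grouped[key].append(value)

    # Reduce phase: each key is processed together with all of its values.
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


def temp_mapper(line):
    # Hypothetical input format: "<city> <temperature>"
    city, temperature = line.split()
    yield city, int(temperature)


def temp_reducer(city, temperatures):
    yield city, max(temperatures)


result = run_mapreduce(["paris 21", "oslo 9", "paris 25"], temp_mapper, temp_reducer)
print(result)  # [('oslo', 9), ('paris', 25)]
```

On a real cluster the shuffle is done by the framework and the reducers run in parallel; only the mapper and reducer bodies are written by hand.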
Word Count
Problem: count frequencies of words in
documents
Word Count Using mrjob
def mapper(self, key, line):
    for word in line.split():
        yield word, 1

def reducer(self, word, occurrences):
    yield word, sum(occurrences)
Sample Output
"ligula" 4
"ligula." 2
"lorem" 5
"lorem." 4
"luctus" 3
"magna" 5
"magna," 3
"magnis" 1
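The sample output counts "ligula" and "ligula." as different words, because `line.split()` keeps punctuation attached. One possible refinement, not part of the original slides, is to extract words with a regular expression and lowercase them before counting. `WORD_RE` and `normalized_words` are names chosen here for illustration.

```python
import re
from collections import Counter

# Matches runs of word characters and apostrophes, dropping other punctuation.
WORD_RE = re.compile(r"[\w']+")


def normalized_words(line):
    """Return the lowercased words of a line with punctuation removed."""
    return [word.lower() for word in WORD_RE.findall(line)]


counts = Counter()
for line in ["Lorem ipsum ligula.", "ligula, lorem magna"]:
    counts.update(normalized_words(line))

print(counts["ligula"], counts["lorem"])  # 2 2
```

In an mrjob mapper the same idea would replace `line.split()` with `normalized_words(line)`.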
Monetate Background
● Core products are merchandising,
personalization, testing, etc.
● A/B & Multivariate testing to determine
impact of experiments
● Involved with >20% of ecommerce spend
each holiday season for the past 2 years
running
Monetate Stack
● Distributed across multiple availability zones
and regions for redundancy, scaling, and
lower round trip times
● Real time decision engine using MySQL
● Nightly processing of each day's data via
Hadoop using mrjob, a Python library for
writing MapReduce jobs
Beyond Word Count
● Activity stream sessionization
● Product recommendations
● User behavior statistics
Activity Stream Sessionization
Goal: collate user activity, splitting it into separate
sessions whenever the user is inactive for more than 5
minutes
Input format: timestamp, user_id
Collate user activity
def mapper(self, key, line):
    timestamp, user_id = line.split()
    yield user_id, timestamp

def reducer(self, uid, timestamps):
    yield uid, sorted(timestamps)
Sample Output
"998" ["1384389407", "1384389417", "1384389422",
"1384389425", "1384390407", "1384390417",
"1384391416", "1384392410", "1384392416",
"1384395420", "1384396405"]
"999" ["1384388414", "1384388425", "1384389419",
"1384389420", "1384390420", "1384391415",
"1384391418", "1384393413", "1384393425",
"1384394426", "1384395416", "1384396415",
"1384396422"]
Segment into Sessions
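MAX_SESSION_INACTIVITY = 60 * 5  # seconds
...
def reducer(self, uid, timestamps):
    # The collating mapper yields timestamps as strings; convert before subtracting.
    timestamps = sorted(int(t) for t in timestamps)
    start_index = 0
    for index, timestamp in enumerate(timestamps):
        if index > 0:
            if timestamp - timestamps[index - 1] > MAX_SESSION_INACTIVITY:
                # Gap too long: close the current session and start a new one.
                yield uid, timestamps[start_index:index]
                start_index = index
    yield uid, timestamps[start_index:]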
Sample Output
"999" [1384388414, 1384388425]
"999" [1384389419, 1384389420]
"999" [1384390420]
"999" [1384391415, 1384391418]
"999" [1384393413, 1384393425]
"999" [1384394426]
"999" [1384395416]
"999" [1384396415, 1384396422]
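The session-splitting logic can be tried outside of a MapReduce job. The sketch below is a standalone function, `split_sessions` (a name chosen here, not from the slides), that applies the same 5-minute rule to the first few timestamps of user "999".

```python
MAX_SESSION_INACTIVITY = 60 * 5  # seconds, as on the slide


def split_sessions(timestamps, max_gap=MAX_SESSION_INACTIVITY):
    """Split timestamps into sessions wherever the gap exceeds max_gap."""
    ordered = sorted(int(t) for t in timestamps)
    sessions = []
    current = []
    for ts in ordered:
        # A gap longer than max_gap closes the session in progress.
        if current and ts - current[-1] > max_gap:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions


sessions = split_sessions(
    ["1384388414", "1384388425", "1384389419", "1384389420", "1384390420"]
)
print(sessions)
# [[1384388414, 1384388425], [1384389419, 1384389420], [1384390420]]
```

Keeping the rule in a plain function like this makes it easy to unit test before wiring it into a reducer.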
Product Recommendations
Goal: For each product a client sells, generate
a ‘people who bought this also bought this’
recommendation
Input: product_id_1, product_id_2, ...
Coincident Purchase Frequency
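from itertools import permutations

def mapper(self, key, line):
    # set() ensures a product repeated within one order is counted once.
    purchases = set(line.split(','))
    for p1, p2 in permutations(purchases, 2):
        yield (p1, p2), 1

def reducer(self, pair, occurrences):
    p1, p2 = pair
    yield p1, (p2, sum(occurrences))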
Sample output
"8" ["5", 11]
"8" ["6", 19]
"8" ["7", 14]
"8" ["9", 11]
"9" ["1", 20]
"9" ["10", 22]
"9" ["11", 21]
"9" ["12", 13]
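The pair counting can be reproduced in memory with `collections.Counter`. This is a sketch with made-up orders; `count_copurchases` is a hypothetical helper, not code from the talk's repository.

```python
from collections import Counter
from itertools import permutations


def count_copurchases(orders):
    """Count, for each ordered pair (p1, p2), the orders that contain both."""
    pair_counts = Counter()
    for line in orders:
        # set() removes repeats so one order counts a given pair at most once.
        purchases = set(line.split(','))
        for p1, p2 in permutations(purchases, 2):
            pair_counts[(p1, p2)] += 1
    return pair_counts


pair_counts = count_copurchases(["1,2,3", "2,3", "3,3,4"])
print(pair_counts[("2", "3")], pair_counts[("3", "2")])  # 2 2
```

`permutations` emits both (p1, p2) and (p2, p1), so every product ends up as the first element of a key and can later be grouped with all of its co-purchased products.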
Top Recommendations
def reducer(self, purchase_pair, occurrences):
    p1, p2 = purchase_pair
    # Put the count first so that sorting ranks by frequency.
    yield p1, (sum(occurrences), p2)

def reducer_find_best_recos(self, p1, p2_occurrences):
    top_products = sorted(p2_occurrences, reverse=True)[:5]
    top_products = [p2 for occurrences, p2 in top_products]
    yield p1, top_products

def steps(self):
    # self.mr() is the step API of older mrjob releases; later releases use MRStep from mrjob.step.
    return [self.mr(mapper=self.mapper, reducer=self.reducer),
            self.mr(reducer=self.reducer_find_best_recos)]
Sample Output
"7" ["15", "18", "17", "16", "3"]
"8" ["14", "15", "20", "6", "3"]
"9" ["15", "17", "19", "6", "3"]
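The second step only needs the few largest counts per product, so a full sort is not strictly required. As an alternative sketch (not the code from the slides), `heapq.nlargest` returns the same result as `sorted(..., reverse=True)[:n]`. The function name `top_recommendations` and the sample counts are assumptions for illustration.

```python
import heapq
from collections import defaultdict


def top_recommendations(pair_counts, n=5):
    """Map each product to the n products most often bought with it."""
    by_product = defaultdict(list)
    for (p1, p2), count in pair_counts.items():
        # Count first, so tuples compare by frequency before product id.
        by_product[p1].append((count, p2))
    return {
        p1: [p2 for _count, p2 in heapq.nlargest(n, scored)]
        for p1, scored in by_product.items()
    }


pair_counts = {
    ("8", "5"): 11, ("8", "6"): 19, ("8", "7"): 14, ("8", "9"): 11,
    ("9", "10"): 22, ("9", "1"): 20,
}
recos = top_recommendations(pair_counts, n=2)
print(recos)  # {'8': ['6', '7'], '9': ['10', '1']}
```

Ties in the count are broken by comparing the product ids as strings, which matches the behavior of sorting the `(count, p2)` tuples.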
Top Recommendations
Multi Account
def mapper(self, key, line):
    account_id, purchases = line.split()
    purchases = set(purchases.split(','))
    for p1, p2 in permutations(purchases, 2):
        yield (account_id, p1, p2), 1

def reducer(self, purchase_pair, occurrences):
    account_id, p1, p2 = purchase_pair
    yield (account_id, p1), (sum(occurrences), p2)

The second-step reducer is unchanged.
Sample Output
["9", "20"] ["8", "14", "13", "10", "1"]
["9", "3"] ["2", "4", "16", "11", "17"]
["9", "4"] ["3", "18", "11", "16", "15"]
["9", "5"] ["2", "1", "7", "18", "17"]
["9", "6"] ["12", "3", "2", "17", "16"]
["9", "7"] ["18", "5", "17", "1", "9"]
["9", "8"] ["20", "14", "13", "10", "4"]
["9", "9"] ["18", "7", "6", "5", "4"]
User Behavior Statistics
Goal: compute statistics about user behavior
(conversion rate & time on site) by account and
experiment in an efficient manner
Input:
account_id, campaigns_viewed, user_id, purchased?,
session_start_time, session_end_time
Statistics Primer
With sample count, mean, and variance for
each side of an experiment we can compute all
the statistics our analytics package displays
Statistics Primer (cont.)
y = a session's metric value, e.g. time on site
● Sample count: count the number of sessions
that viewed the experiment
○ sum(y^0)

● Mean: sum the metric / sample count
○ sum(y^1)/sum(y^0)
Statistics Primer (cont.)
● Variance:

○ Variance = mean of square minus square of mean
○ Variance = sum(y^2)/sum(y^0) - (sum(y^1)/sum(y^0)) ^ 2

For each side of an experiment we only need to
generate: sum(y^0), sum(y^1), sum(y^2)
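A small sketch of why the three sums are enough: they can be added component-wise across any partition of the data, and the mean and variance fall out at the end. The function names are chosen here for illustration, and the variance is the population variance exactly as in the formula above.

```python
def combine(a, b):
    """Add two (sum(y^0), sum(y^1), sum(y^2)) triples component-wise."""
    return tuple(x + y for x, y in zip(a, b))


def power_sums(values):
    """Accumulate the three power sums for an iterable of metric values."""
    total = (0, 0, 0)
    for y in values:
        total = combine(total, (1, y, y * y))
    return total


def mean_and_variance(sums):
    """Recover mean and population variance from the three power sums."""
    n, s1, s2 = sums
    mean = s1 / n
    variance = s2 / n - mean ** 2  # mean of square minus square of mean
    return mean, variance


# Two partitions processed separately give the same totals as one pass.
merged = combine(power_sums([1, 2]), power_sums([3, 4]))
print(merged)                     # (4, 10, 30)
print(mean_and_variance(merged))  # (2.5, 1.25)
```

Because `combine` is associative and commutative, the sums can be computed per mapper or per day and merged later without revisiting the raw sessions.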
Statistics by account
statistic_rollup/statistic_summarize.py
Sample Output
["8", "average session length"] [99, 24463, 7968891]
["8", "conversion rate"] [99, 45, 45]
["9", "average session length"] [115, 29515, 10071591]
["9", "conversion rate"] [115, 55, 55]
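The slides do not show the contents of statistic_summarize.py, so the following is only a guess at the shape of such a job, written as plain functions. It assumes tab-separated input fields in the order listed earlier, a comma-separated campaigns field, `purchased?` encoded as 0 or 1, and start and end times in epoch seconds. All names are hypothetical.

```python
from collections import defaultdict


def parse_session(line):
    # Assumed layout: tab-separated fields, campaigns as a comma-separated list.
    account_id, campaigns, user_id, purchased, start, end = line.split('\t')
    return {
        'account_id': account_id,
        'campaigns': [int(c) for c in campaigns.split(',') if c],
        'purchased': int(purchased),
        'duration': int(end) - int(start),
    }


def mapper(line):
    session = parse_session(line)
    metrics = (('average session length', session['duration']),
               ('conversion rate', session['purchased']))
    for name, y in metrics:
        # Emit the triple (y^0, y^1, y^2) for this single session.
        yield (session['account_id'], name), (1, y, y * y)


def reducer(key, triples):
    # zip(*triples) regroups the triples into three columns, summed separately.
    yield key, [sum(column) for column in zip(*triples)]


lines = ["8\t0,1\tu1\t1\t100\t160", "8\t1\tu2\t0\t200\t230"]
grouped = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        grouped[key].append(value)
rollup = dict(kv for key in grouped for kv in reducer(key, grouped[key]))
print(rollup[("8", "average session length")])  # [2, 90, 4500]
print(rollup[("8", "conversion rate")])         # [2, 1, 1]
```

For conversion rate y is 0 or 1, so y^2 equals y; this is why the last two numbers are identical in the conversion-rate rows of the sample output.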
Statistics by experiment
statistic_rollup_by_experiment/statistic_summarize.py
Sample Output
["9", 0, "average session length"] [32, 8405, 3031009]
["9", 0, "conversion rate"] [32, 20, 20]
["9", 1, "average session length"] [23, 5405, 1770785]
["9", 1, "conversion rate"] [23, 14, 14]
["9", 2, "average session length"] [39, 9481, 2965651]
["9", 2, "conversion rate"] [39, 20, 20]
["9", 3, "average session length"] [25, 6276, 2151014]
["9", 3, "conversion rate"] [25, 13, 13]
["9", 4, "average session length"] [27, 5721, 1797715]
["9", 4, "conversion rate"] [27, 16, 16]
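One way to obtain per-experiment keys like those in the sample output is to emit each session once for every campaign it viewed. The sketch below assumes the middle key element is an experiment id taken from `campaigns_viewed` and that sessions are already parsed into tuples; neither assumption is confirmed by the slides, and `experiment_mapper` is a hypothetical name.

```python
from collections import defaultdict


def experiment_mapper(session):
    """Fan one session out to every campaign it viewed."""
    account_id, campaigns, purchased, duration = session
    for campaign in campaigns:
        yield ((account_id, campaign, 'average session length'),
               (1, duration, duration ** 2))
        yield ((account_id, campaign, 'conversion rate'),
               (1, purchased, purchased ** 2))


# Hypothetical parsed sessions: (account_id, campaigns, purchased, duration)
sessions = [("9", [0, 1], 1, 60), ("9", [1], 0, 30)]

totals = defaultdict(lambda: [0, 0, 0])
for session in sessions:
    for key, triple in experiment_mapper(session):
        # Component-wise sum, standing in for the reduce step.
        totals[key] = [a + b for a, b in zip(totals[key], triple)]

print(totals[("9", 1, "average session length")])  # [2, 90, 4500]
print(totals[("9", 0, "conversion rate")])         # [1, 1, 1]
```

A session that viewed several campaigns contributes to each of them, so the per-experiment sample counts can add up to more than the account-level count.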
Questions?
