SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Elastic MapReduce
   Wikipedia
http://ohkura.com

• 2008                  1
•
•              blog
•              2007
Python




     Wikipedia       (   120   )
MapReduce
• Hadoop
  o

• Hadoop Streaming
  o Mapper Reducer


  o                  OK   Python
  o            IO
• Amazon AWS (S3, EC2)
Elastic MapReduce

• Amazon          Cloud Computing
• MapReduce                    Hadoop


• Master                       Worker EC2
• S3

• http://aws.amazon.com/elasticmapreduce/
Step0:

• AWS
• Elastic MapReduce                                    1
• S3
  o   Ruby                     s3sync
      http://s3sync.net/wiki
• elastic-mapreduce
   o Amazon                       Ruby
  o   http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264
Step1:

• Wikipedia
    o wget "http://download.wikimedia.org/jawiki/latest/jawiki-
      latest-pages-articles.xml.bz2"
    o bunzip2 jawiki-latest-pages-articles.xml.bz2
•
    o   <page>      20000
    o   Hadoop Streaming        worker
• S3
    o   ohkura-wikipedia:jawiki/articles/part-00000, 00001, ...
    o   EC2
Step2:
Step2:

Mapper
 link_pat = re.compile(r"[[([^]|#]*?)[]|#]")

 for line in sys.stdin:
    for link in link_pat.findall(line):
        if ":" not in link:
            print "LongValueSum:%st1" % link

Reducer
  aggregate (Hadoop                   Reducer)
2007     92008
2006     88376
2008     82821
2005     77964
       76111
2000     68078
2004     64921
                 63660
        58081
2001     57419
2003     57130
Step3:
Step3:

Mapper
 timestamp_pat = re.compile("<timestamp>(.+?)</timestamp>")
 articles = ArticleExtractor(sys.stdin)
 for article in articles:
   for line in article:
      m = timestamp_pat.search(line)
      if m:
         dt = m.groups(0)[0]
         # eg. 2009-10-08T05:55:49Z
         t = datetime.datetime.strptime(dt, "%Y-%m-%dT%H:%M:%SZ")
         print "LongValueSum:%s t1" % t.year


Reducer
  aggregate (Hadoop                    Reducer)
JSON                   Wizard




$ elastic-mapreduce --create --num-instances 4
                  --instance-type m1.small
                  --json count-year-jobflow.json
2002: 1
2003: 4107
2004: 19630
2005: 44766
2006: 103018
2007: 151382
2008: 217252
2009: 683079
Step4: PageRank
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
PageRank                 MapReduce

• Step1
    o          Wikipedia
    o   M:


    o   R: Identity
• Step2
  o M:          /
    o   R:
•         Step2     10
    o                    HDFS
1803.63759701
1568.19638967
1029.67219551 2006
991.646816399 2007
930.652982148 2005
885.892964893
866.358526418 2008
798.668799871 2004
779.443042817
.
.
1803.63759701
1568.19638967
885.892964893
779.443042817
755.488775376
728.882441149
682.257070166
623.000478660
580.347125978
569.411885196
...
779.443042817
728.882441149
682.257070166
580.347125978
522.618667481
495.986145911
452.646283200
444.036370473
443.043952427
441.486349135
392.427995635
=100

0.00682557409174                785       ...
0.00682555111099 JR   700
0.00682544488688
0.00682540998664
0.00682540375114
0.00682528989653      (     )
0.00682524117061
0.00682521978481      (               )
0.00682521236658
0.00682517459662
0.00682512260620
• Wikipedia (JA)
  o 1,900,000 articles
  o 4.2GB
  o 20
  o   ~30
• Blog          from   blogeye.jp
  o   200,000,000 articles
  o   800GB
  o   80
  o   70
•
    o
    o                 Master
•
    o
    o
    o
    o
    o   1   1   0.1   1    100   1000
Q&A

Weitere ähnliche Inhalte

Ähnlich wie Quick Wikipedia Mining using Elastic Map Reduce

Traffic Analyzer for GPRS UMTS Networks (TAN)
Traffic Analyzer for GPRS UMTS Networks (TAN)Traffic Analyzer for GPRS UMTS Networks (TAN)
Traffic Analyzer for GPRS UMTS Networks (TAN)Muhannad Aulama
 
Ruby Outside Rails 2 (southfest)
Ruby Outside Rails 2 (southfest)Ruby Outside Rails 2 (southfest)
Ruby Outside Rails 2 (southfest)Victor Petrenko
 
marko_go_in_badoo
marko_go_in_badoomarko_go_in_badoo
marko_go_in_badooMarko Kevac
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in RubyAmazon Web Services
 
Panoramic Video in Environmental Monitoring Software Development and Applica...
Panoramic Video in Environmental Monitoring Software Development and Applica...Panoramic Video in Environmental Monitoring Software Development and Applica...
Panoramic Video in Environmental Monitoring Software Development and Applica...pycontw
 
mastodon API
mastodon APImastodon API
mastodon APItreby
 
Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices Daniel Berman
 
Edge trends mizuno-template
Edge trends mizuno-templateEdge trends mizuno-template
Edge trends mizuno-templateshintaro mizuno
 
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...MongoDB
 
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance FuckupsNETFest
 
MongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and MergingMongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and MergingJason Terpko
 
Log files: The Overlooked Source of SEO Opportunities
Log files: The Overlooked Source of SEO OpportunitiesLog files: The Overlooked Source of SEO Opportunities
Log files: The Overlooked Source of SEO OpportunitiesRobin Rozhon
 
Django REST Framework における API 実装プラクティス | PyCon JP 2018
Django REST Framework における API 実装プラクティス | PyCon JP 2018Django REST Framework における API 実装プラクティス | PyCon JP 2018
Django REST Framework における API 実装プラクティス | PyCon JP 2018Masashi Shibata
 
1404 app dev series - session 8 - monitoring & performance tuning
1404   app dev series - session 8 - monitoring & performance tuning1404   app dev series - session 8 - monitoring & performance tuning
1404 app dev series - session 8 - monitoring & performance tuningMongoDB
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nlbartzon
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nltieleman
 
クラウドを支えるハードウェア・ソフトウェア基盤技術
クラウドを支えるハードウェア・ソフトウェア基盤技術クラウドを支えるハードウェア・ソフトウェア基盤技術
クラウドを支えるハードウェア・ソフトウェア基盤技術Ryousei Takano
 
The Seven Wastes of Software Development
The Seven Wastes of Software DevelopmentThe Seven Wastes of Software Development
The Seven Wastes of Software DevelopmentMatt Stine
 
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017OWASP Kyiv
 
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...olberger
 

Ähnlich wie Quick Wikipedia Mining using Elastic Map Reduce (20)

Traffic Analyzer for GPRS UMTS Networks (TAN)
Traffic Analyzer for GPRS UMTS Networks (TAN)Traffic Analyzer for GPRS UMTS Networks (TAN)
Traffic Analyzer for GPRS UMTS Networks (TAN)
 
Ruby Outside Rails 2 (southfest)
Ruby Outside Rails 2 (southfest)Ruby Outside Rails 2 (southfest)
Ruby Outside Rails 2 (southfest)
 
marko_go_in_badoo
marko_go_in_badoomarko_go_in_badoo
marko_go_in_badoo
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby
 
Panoramic Video in Environmental Monitoring Software Development and Applica...
Panoramic Video in Environmental Monitoring Software Development and Applica...Panoramic Video in Environmental Monitoring Software Development and Applica...
Panoramic Video in Environmental Monitoring Software Development and Applica...
 
mastodon API
mastodon APImastodon API
mastodon API
 
Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices Machine Learning and Logging for Monitoring Microservices
Machine Learning and Logging for Monitoring Microservices
 
Edge trends mizuno-template
Edge trends mizuno-templateEdge trends mizuno-template
Edge trends mizuno-template
 
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
 
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
 
MongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and MergingMongoDB Chunks - Distribution, Splitting, and Merging
MongoDB Chunks - Distribution, Splitting, and Merging
 
Log files: The Overlooked Source of SEO Opportunities
Log files: The Overlooked Source of SEO OpportunitiesLog files: The Overlooked Source of SEO Opportunities
Log files: The Overlooked Source of SEO Opportunities
 
Django REST Framework における API 実装プラクティス | PyCon JP 2018
Django REST Framework における API 実装プラクティス | PyCon JP 2018Django REST Framework における API 実装プラクティス | PyCon JP 2018
Django REST Framework における API 実装プラクティス | PyCon JP 2018
 
1404 app dev series - session 8 - monitoring & performance tuning
1404   app dev series - session 8 - monitoring & performance tuning1404   app dev series - session 8 - monitoring & performance tuning
1404 app dev series - session 8 - monitoring & performance tuning
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
クラウドを支えるハードウェア・ソフトウェア基盤技術
クラウドを支えるハードウェア・ソフトウェア基盤技術クラウドを支えるハードウェア・ソフトウェア基盤技術
クラウドを支えるハードウェア・ソフトウェア基盤技術
 
The Seven Wastes of Software Development
The Seven Wastes of Software DevelopmentThe Seven Wastes of Software Development
The Seven Wastes of Software Development
 
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017
Serhiy Korolenko - The Strength of Ukrainian Users’ P@ssw0rds2017
 
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...
Weaving a Semantic Web across OSS repositories - a spotlight on bts-link, UDD...
 

Kürzlich hochgeladen

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 

Kürzlich hochgeladen (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 

Quick Wikipedia Mining using Elastic Map Reduce