Common failures to be avoided
 when we analyze Wikipedia
        public data


           Felipe Ortega
           GsyC/Libresoft

   Wikimania 2010, Gdańsk, Poland.
#1 Beware of special types of pages
●   Official page count includes articles with just one link.
    ●   You must consider if you need to filter out
        disambiguation pages.
●   Pay attention to redirects.
    ●   People often wonder why the number of pages in the
        main namespace of the dump is so high.
●   Break down evolution trends by namespace.
    ●   Articles are very different from other pages.
    ●   Explore % of already existing talk pages.
    ●   Connections from user pages.
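
A minimal sketch of the per-namespace breakdown suggested on this slide. It assumes a dump in the MediaWiki XML export format in which each <page> carries an <ns> element and redirect pages carry a <redirect> element. The sample XML and the function names are made up for illustration; real dumps declare an XML namespace, which the small helper strips.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up sample standing in for a real dump file.
SAMPLE = """<mediawiki>
  <page><title>Madrid</title><ns>0</ns>
    <revision><text>Capital of Spain.</text></revision></page>
  <page><title>Madrit</title><ns>0</ns><redirect title="Madrid" />
    <revision><text>#REDIRECT [[Madrid]]</text></revision></page>
  <page><title>Talk:Madrid</title><ns>1</ns>
    <revision><text>Discussion.</text></revision></page>
</mediawiki>"""


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def count_pages(stream):
    """Stream over <page> elements; count pages and redirects per namespace."""
    pages, redirects = Counter(), Counter()
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        ns = None  # stays None if the dump has no <ns> element (assumption)
        is_redirect = False
        for child in elem:
            name = local(child.tag)
            if name == "ns":
                ns = int(child.text)
            elif name == "redirect":
                is_redirect = True
        pages[ns] += 1
        if is_redirect:
            redirects[ns] += 1
        elem.clear()  # drop the page's content once it has been counted
    return pages, redirects


pages, redirects = count_pages(io.BytesIO(SAMPLE.encode("utf-8")))
print(dict(pages), dict(redirects))
```

Subtracting the redirect count from the page count in namespace 0 gives a figure closer to the number of actual articles; disambiguation pages would need a further filter that is not shown here.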
#2 Plan your hardware carefully
●   There are some general rules.
    ●   Parallelize as much as possible.
    ●   Buy more memory before buying more disk...
    ●   But take a look at your disk requirements.
        – It's very different when you can decompress
          the data on the fly.
    ●   Hardware RAID is not always the best solution.
        –   RAID 10 in Linux can perform decently in many
            average studies.
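
To illustrate the remark about disk requirements: Python's standard bz2 module can read a compressed file as a stream, so the decompressed text never has to be written to disk. This is only a sketch; the file name is hypothetical and the throwaway sample file is created just so the snippet runs on its own.

```python
import bz2
import os
import tempfile


def count_lines_containing(path, needle):
    """Decompress `path` on the fly and count the lines containing `needle`."""
    hits = 0
    with bz2.open(path, "rt", encoding="utf-8") as stream:
        for line in stream:  # data is decompressed lazily as the loop reads
            if needle in line:
                hits += 1
    return hits


# Hypothetical sample file; a real dump would be many gigabytes.
tmpdir = tempfile.mkdtemp()
sample_path = os.path.join(tmpdir, "sample-dump.xml.bz2")
with bz2.open(sample_path, "wt", encoding="utf-8") as out:
    out.write("<page>\n<title>A</title>\n</page>\n"
              "<page>\n<title>B</title>\n</page>\n")

page_count = count_lines_containing(sample_path, "<page>")
print(page_count)
```

Working this way trades disk space for CPU time, which is one reason memory and processor count can matter more than raw disk capacity.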
#3 Know your engines (DBs)
●   Correct configuration of DB engine is crucial.
    ●   You'll always fall short with standard configs.
    ●   Fine tune parameters according to your hardware.
    ●   Exploit memory as much as possible.
        – E.g. MEMORY engine in MySQL.
    ●   Avoid unnecessary backup...
    ●   But be sure that you have copies of relevant info
        elsewhere!
    ●   Think about your process:
        –   Read only vs. read-write.
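
The slide names the MEMORY engine in MySQL. The sketch below swaps in Python's built-in sqlite3 module with an in-memory database, purely so that it runs without a database server. It shows the same idea: keep a small, read-mostly lookup table in RAM while the durable copy lives elsewhere. The table and column names are invented.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; nothing is written to disk,
# so the contents vanish when the connection is closed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_ns (page_id INTEGER PRIMARY KEY, ns INTEGER NOT NULL)"
)

# Made-up rows standing in for data loaded from a dump.
rows = [(1, 0), (2, 0), (3, 1), (4, 2)]
conn.executemany("INSERT INTO page_ns VALUES (?, ?)", rows)
conn.commit()

# Read-only analysis query: number of pages per namespace.
per_ns = dict(
    conn.execute("SELECT ns, COUNT(*) FROM page_ns GROUP BY ns ORDER BY ns")
)
print(per_ns)
conn.close()
```

Because an in-memory table is lost on shutdown, it suits the read-only phase of a study; anything that cannot be rebuilt from the dump needs a copy on persistent storage.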
#4 Organize your code
●   Using an SCM is a must.
    ●   SVN, Git.
●   Upload your code to a public repository.
    ●   BerliOS, SourceForge, GitHub...
●   Document your code...
    ●   ...if you ever aspire to get interest from other
        developers.
●   Use consistent version numbers.
●   Test, test, test...
    ●   Include sanity checks and “toy tests”.
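
A "toy test" can be as small as a handful of assertions on hand-written inputs. The sketch below is one possible shape, not a prescribed one: the function name is invented, and the pattern assumes the English #REDIRECT keyword (other language editions also accept localized keywords).

```python
import re

# Assumption: English keyword only; localized wikis use other keywords too.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the target title if the text is a redirect, else None."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


def run_toy_tests():
    """Hand-written cases, cheap enough to run before every long job."""
    assert redirect_target("#REDIRECT [[Madrid]]") == "Madrid"
    assert redirect_target("#redirect[[Madrid|the capital]]") == "Madrid"
    assert redirect_target("  #REDIRECT [[Madrid#History]]") == "Madrid"
    assert redirect_target("Madrid is the capital of Spain.") is None
    assert redirect_target("") is None
    return True


print("toy tests passed:", run_toy_tests())
```

Running such checks first costs seconds, while discovering a parsing bug at the end of a multi-day run costs the whole run.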
#5 Use the right “spell”
●   Target data is well defined:
    ●   XML
    ●   Big portions of plain text
    ●   Inter-wiki links and outlinks.
●   Some alternatives
    ●   cElementTree (high-speed parsing)
    ●   Python (modules/short scripts) or Java (big
        projects).
    ●   Perl (regexps).
    ●   Sed & awk
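
As a small example of the regexp approach applied to links, the sketch below pulls link targets out of a piece of wikitext. The pattern is deliberately simple and does not handle templates, nested links, or <nowiki> sections; the set of language prefixes is a tiny hypothetical sample, not the real interwiki table.

```python
import re

# Matches [[Target]] and [[Target|label]]; captures only the target part.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

# Hypothetical sample of language prefixes, for illustration only.
LANG_PREFIXES = {"es", "de", "fr"}


def link_targets(wikitext):
    """Return link targets in order of appearance, without section anchors."""
    targets = []
    for raw in LINK_RE.findall(wikitext):
        title = raw.split("#", 1)[0].strip()
        if title:
            targets.append(title)
    return targets


def is_interwiki(target):
    """True if the target starts with one of the sample language prefixes."""
    return ":" in target and target.split(":", 1)[0] in LANG_PREFIXES


text = ("[[Madrid]] is the capital of [[Spain|the Kingdom of Spain]]; "
        "see [[es:Madrid]] and [[Madrid#History]].")
links = link_targets(text)
interwiki = [t for t in links if is_interwiki(t)]
print(links)
print(interwiki)
```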
#6 Avoid reinventing the wheel
●   Consider developing your own tool only if:
    ●   No available solution fits your needs.
        – Or you can only find proprietary/evaluation
          software.
    ●   Performance of other solutions is really bad.
●   Example: pywikipediabot
    ●   Simple library to query Wikipedia API.
    ●   Solves many simple needs of
        researchers/programmers.
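
For very simple needs, a request to the MediaWiki web API can even be assembled with the standard library alone. The sketch below only builds the request URL and does not contact the server. It uses the action=query module with prop=info; the helper name is invented, and this is not pywikipediabot's own interface.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_info_query(titles):
    """Build a URL asking the API for basic page info on the given titles."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),  # the API takes several titles joined by |
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_info_query(["Madrid", "Gdańsk"])
print(url)
```

An existing library adds what this sketch leaves out, such as throttling, continuation of long result sets, and error handling, which is the point of not reinventing the wheel.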
#7 Automate everything

●   Huge data repositories.
●   Even small samples are excessively time
    consuming if processed by hand.
●   You will end up chaining individual processes together.
●   You will save time for later executions.
●   Your study will be reproducible.
    ●   Updating results after several months
        becomes a no-brainer.
#8 Extreme case of Murphy's Law

●   Always expect the worst possible case.
    ●   Many caveats in each implementation.
    ●   Countless particular cases.
    ●   The “average solution” is not
        good enough.
         – Standard algorithms may take much
           more than expected to finish the job.
#9 Not many graphical interfaces

●   Some good reasons for that:
    ●   Difficult to automate.
    ●   Hard to display dynamic results in real time.
    ●   Almost impossible to compute all results in a
        reasonable time frame for huge data
        collections (e.g. English Wikipedia).
●   To the best of my knowledge, there are very
    few tools with graphical interfaces out there.
●   Is there a real need for that??
#10 Communication channels

●   Wikimedia-research-l
    ●   Mailing list about research on Wikimedia
        projects.
    ●   http://meta.wikimedia.org/wiki/Research
    ●   http://meta.wikimedia.org/wiki/Wikimedia_Research_Network
    ●   http://acawiki.org/Home
●   Final comments
    ●   Need for a consolidated info point, once and for all.
