SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Downloaden Sie, um offline zu lesen
Wikipedia Mining
Spring Freebase User Group meeting
2008-04-16 / zenkat
Why Mine Wikipedia?


• How can we automatically extract the
  unstructured content from Wikipedia …


• … to create a structured database of
  information …


• … that can be leveraged by users in
  applications and data loads

                                         2
A Remarkable Source of Information




      2.15 M articles as of April 2008
      Doubling every 12 - 18 months      3
Problem is …

• Wikipedia is written by humans, for humans.
 -   Great if you need to look up a fact, or learn about something



• But you can’t …
 -   Ask questions:
     “What movies by George Lucas has Harrison Ford starred in?”

 -   Search effectively:
     “Find me all companies that build personal computers.”

 -   Build applications:
     “Let’s make a social app that ranks consumer goods listed in
     wikipedia.”
                                                                     4
From unstructured …




                      5
… to structured




                  6
Searching for Structure: Topics




          Articles define a topic

                                    7
Searching for Structure: Types




     Categories & Lists provide type
                                       8
Searching for Structure: Types




     Categories & Lists provide type
                                       9
Searching for Structure: Properties




  Templates & Infoboxes give properties
                                          10
Searching for Structure: Properties




                       {
                           quot;queryquot; : [
                             {
                               quot;typequot; : quot;/architecture/structurequot;
                               quot;namequot; : null,
                               quot;height_metersquot; : null,
                               quot;sortquot; : quot;-height_metersquot;,
                               quot;limitquot; : 10,
                             }
                           ]
                       }




  What are the highest buildings in the world?
                                                                    11
Searching for Structure: Properties




{
    quot;queryquot; : [
      {
        quot;typequot; : quot;/location/countryquot;
        quot;namequot; : null,
        ”official_languagequot; : “English”,
        quot;limitquot; : 100
      }
    ]
}




    What are all the countries that speak English?
                                                     12
A Treasure Trove Waiting To Be Opened

•       2,150,000 articles (ie, topics)


•       7,100,000 category refs (ie, typings)
    -   Found within 280,000 categories



•       42,000,000 template values (ie, properties)
    -   Found within 10,000 templates and 56,000 template keys




•       All growing at ~2% every two weeks


•       Available information doubles every year!
                                                                 13
Topic Population From Wikipedia

       Topic Name
                                       Blurb




                    Wikipedia Attribution


                                             Image




                                            Wikipedia
                                              Link




                                                        14
Fresh Topic




              15
Similar, but different …

•       Many pages in wikipedia are not topics
    -   Disambiguation pages, lists, categories, images, docs, talk …


•       Only store a 1200-character blurb
    -   We’re not wikipedia, after all


•       Don’t need to add “(suffix)” to names
    -   “Python (genus)” vs “Python (programming language)”
    -   Freebase types disambiguate without names


•       Cities should be specified without state suffix
    -   “San Francisco” vs “San Francisco, California”
    -   Cleanup in progress, some exceptions remain


•       “Exclusionist” vs “Inclusionist”
    -   Exclusionists appear to be winning in Wikilandia
    -   Freebase is inherently more inclusionist                        16
You Can’t Read The Same Wikipedia Twice


Every 2 weeks …


 -   65,000   new pages     -   8,000   deletes
 -   30,000   new topics    -   5,000   name changes
 -   80,000   new aliases   -   1,000   page ID changes
 -   10,000   merges        -   1,000   splits



                            … change in Wikipedia

                                                          17
Keeping track of changes …

•       Store reference information within freebase
    -   Page_ids, article titles and redirects




    -   Page_id (WPID) is stored in /wikipedia/en_id
    -   Article titles and redirects are stored in /wikipedia/en
    -   “mwcl_wikipedia_en”, “mw_infobot” user


•       None of these IDs are stable in wiki-land …
                                                                   18
Determining actions by comparing keys

          case          action


          new topic     create a new topic


          name change   add new name as en key; if quot;untouchedquot;, rename the topic


          id change     change the en_id to the new value


          merge         move the en key to the new topic; if quot;untouchedquot;, merge the topics


          split         create new topic, move en key from old topic to new topic


          delete        keep topic, but delete en_id and en keys from topic




•       Because we are more inclusionist than wikipedia,
        we usually do not delete topics.
•       Topic renames only occur on “untouched” topics.
•       Merges occur automatically on “untouched” topics
    -   Otherwise, flagged for review in “pipeline”
                                                                                             19
Map Template Fields To Properties




                                    20
Map Template Fields To Properties
{{infobox Aircraft
 |subtemplate={{Infobox Boeing Aircraft}}
 |name =Boeing 777
 |manufacturer =[[Boeing Commercial Airplanes]]
 |first flight =[[June 12]] [[1994]]
 |introduction =[[June 7]] [[1995]] with [[United]]
 |primary user = [[Singapore Airlines]]               MediaWiki
 |more users = [[Air France-KLM]]
 |produced = 1993 - Present                           Template
 |number built = 723 as of March 2008
 |unit cost = US$187.5-253 million                    Rendering
}}




                                                                  21
Map Template Fields To Properties
{{infobox Aircraft
 |subtemplate={{Infobox Boeing Aircraft}}
 |name =Boeing 777
 |manufacturer =[[Boeing Commercial Airplanes]]
 |first flight =[[June 12]] [[1994]]
 |introduction =[[June 7]] [[1995]] with [[United]]
 |primary user = [[Singapore Airlines]]               MediaWiki
 |more users = [[Air France-KLM]]
 |produced = 1993 - Present                           Template
 |number built = 723 as of March 2008
 |unit cost = US$187.5-253 million                    Rendering
}}




            “manufacturer” -->
  /aviation/aircraft_model/manufacturer




                                                                  22
Just the Starting Point …

• Extracted to date from Wikipedia:

 - 2,365,000 topics
 - 2,895,000 typings
 - 5,638,000 properties


• A complement to user-entered data
 -   User data always takes precedence, won’t be overwritten



• Processes are being automated to keep in sync

                                                               23
Thanks!



          Tristan Buckner   Topic updater
           /user/tristan    Image loader


            Colin Evans
                                WEX
            /user/colin


             Al Marks       Category mapper
             /user/al       Template mapper
                                 WEX




                                              24

Weitere ähnliche Inhalte

Ă„hnlich wie Freebase: Wikipedia Mining 20080416

From Android NDK To AOSP
From Android NDK To AOSPFrom Android NDK To AOSP
From Android NDK To AOSPMin-Yih Hsu
 
MongoDB, E-commerce and Transactions
MongoDB, E-commerce and TransactionsMongoDB, E-commerce and Transactions
MongoDB, E-commerce and TransactionsSteven Francia
 
A Gentle Introduction To Object Oriented Php
A Gentle Introduction To Object Oriented PhpA Gentle Introduction To Object Oriented Php
A Gentle Introduction To Object Oriented PhpMichael Girouard
 
MySQL Schema Design in Practice
MySQL Schema Design in PracticeMySQL Schema Design in Practice
MySQL Schema Design in PracticeJaime Crespo
 
Aeliapedia: Knowledge Building with XWiki at AELIA
Aeliapedia: Knowledge  Building with XWiki at  AELIAAeliapedia: Knowledge  Building with XWiki at  AELIA
Aeliapedia: Knowledge Building with XWiki at AELIAXWiki
 
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)Emil Eifrem
 
Blogs And Wikis In Academia
Blogs And Wikis In AcademiaBlogs And Wikis In Academia
Blogs And Wikis In AcademiaBill Warters
 
Cohere: Towards Web 2.0 Argumentation
Cohere: Towards Web 2.0 ArgumentationCohere: Towards Web 2.0 Argumentation
Cohere: Towards Web 2.0 ArgumentationSimon Buckingham Shum
 
Crafting Evolvable Api Responses
Crafting Evolvable Api ResponsesCrafting Evolvable Api Responses
Crafting Evolvable Api Responsesdarrelmiller71
 
Mi Domain Wheel Slides
Mi Domain Wheel SlidesMi Domain Wheel Slides
Mi Domain Wheel Slideslancesfa
 
Collaborating with the Community
Collaborating with the CommunityCollaborating with the Community
Collaborating with the Communitytinacallahan
 
Semantic MediaWiki Workshop
Semantic MediaWiki WorkshopSemantic MediaWiki Workshop
Semantic MediaWiki WorkshopDan Bolser
 
Managing Social Content with MongoDB
Managing Social Content with MongoDBManaging Social Content with MongoDB
Managing Social Content with MongoDBMongoDB
 
LibX2.0-Code4Lib-2009AsPresented
LibX2.0-Code4Lib-2009AsPresentedLibX2.0-Code4Lib-2009AsPresented
LibX2.0-Code4Lib-2009AsPresentedtutorialsruby
 
LibX2.0-Code4Lib-2009AsPresented
LibX2.0-Code4Lib-2009AsPresentedLibX2.0-Code4Lib-2009AsPresented
LibX2.0-Code4Lib-2009AsPresentedtutorialsruby
 
DBpedia - An Interlinking-Hub in the Web of Data
DBpedia - An Interlinking-Hub in the Web of DataDBpedia - An Interlinking-Hub in the Web of Data
DBpedia - An Interlinking-Hub in the Web of DataJakob .
 
Stupid Index Block Tricks
Stupid Index Block TricksStupid Index Block Tricks
Stupid Index Block Trickshannonhill
 

Ă„hnlich wie Freebase: Wikipedia Mining 20080416 (20)

Tel Vortrag
Tel VortragTel Vortrag
Tel Vortrag
 
From Android NDK To AOSP
From Android NDK To AOSPFrom Android NDK To AOSP
From Android NDK To AOSP
 
Scalax
ScalaxScalax
Scalax
 
MongoDB, E-commerce and Transactions
MongoDB, E-commerce and TransactionsMongoDB, E-commerce and Transactions
MongoDB, E-commerce and Transactions
 
A Gentle Introduction To Object Oriented Php
A Gentle Introduction To Object Oriented PhpA Gentle Introduction To Object Oriented Php
A Gentle Introduction To Object Oriented Php
 
MySQL Schema Design in Practice
MySQL Schema Design in PracticeMySQL Schema Design in Practice
MySQL Schema Design in Practice
 
Aeliapedia: Knowledge Building with XWiki at AELIA
Aeliapedia: Knowledge  Building with XWiki at  AELIAAeliapedia: Knowledge  Building with XWiki at  AELIA
Aeliapedia: Knowledge Building with XWiki at AELIA
 
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
 
OMEKA
OMEKAOMEKA
OMEKA
 
Blogs And Wikis In Academia
Blogs And Wikis In AcademiaBlogs And Wikis In Academia
Blogs And Wikis In Academia
 
Cohere: Towards Web 2.0 Argumentation
Cohere: Towards Web 2.0 ArgumentationCohere: Towards Web 2.0 Argumentation
Cohere: Towards Web 2.0 Argumentation
 
Crafting Evolvable Api Responses
Crafting Evolvable Api ResponsesCrafting Evolvable Api Responses
Crafting Evolvable Api Responses
 
Mi Domain Wheel Slides
Mi Domain Wheel SlidesMi Domain Wheel Slides
Mi Domain Wheel Slides
 
Collaborating with the Community
Collaborating with the CommunityCollaborating with the Community
Collaborating with the Community
 
Semantic MediaWiki Workshop
Semantic MediaWiki WorkshopSemantic MediaWiki Workshop
Semantic MediaWiki Workshop
 
Managing Social Content with MongoDB
Managing Social Content with MongoDBManaging Social Content with MongoDB
Managing Social Content with MongoDB
 
LibX2.0-Code4Lib-2009AsPresented
LibX2.0-Code4Lib-2009AsPresentedLibX2.0-Code4Lib-2009AsPresented
LibX2.0-Code4Lib-2009AsPresented
 
LibX2.0-Code4Lib-2009AsPresented
LibX2.0-Code4Lib-2009AsPresentedLibX2.0-Code4Lib-2009AsPresented
LibX2.0-Code4Lib-2009AsPresented
 
DBpedia - An Interlinking-Hub in the Web of Data
DBpedia - An Interlinking-Hub in the Web of DataDBpedia - An Interlinking-Hub in the Web of Data
DBpedia - An Interlinking-Hub in the Web of Data
 
Stupid Index Block Tricks
Stupid Index Block TricksStupid Index Block Tricks
Stupid Index Block Tricks
 

KĂĽrzlich hochgeladen

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 

KĂĽrzlich hochgeladen (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 

Freebase: Wikipedia Mining 20080416

  • 1. Wikipedia Mining Spring Freebase User Group meeting 2008-04-16 / zenkat
  • 2. Why Mine Wikipedia? • How can we automatically extract the unstructured content from Wikipedia … • … to create a structured database of information … • … that can be leveraged by users in applications and data loads 2
  • 3. A Remarkable Source of Information 2.15 M articles as of April 2008 Doubling every 12 - 18 months 3
  • 4. Problem is … • Wikipedia is written by humans, for humans. - Great if you need to look up a fact, or learn about something • But you can’t … - Ask questions: “What movies by George Lucas has Harrison Ford starred in?” - Search effectively: “Find me all companies that build personal computers.” - Build applications: “Let’s make a social app that ranks consumer goods listed in wikipedia.” 4
  • 7. Searching for Structure: Topics Articles define a topic 7
  • 8. Searching for Structure: Types Categories & Lists provide type 8
  • 9. Searching for Structure: Types Categories & Lists provide type 9
  • 10. Searching for Structure: Properties Templates & Infoboxes give properties 10
  • 11. Searching for Structure: Properties { quot;queryquot; : [ { quot;typequot; : quot;/architecture/structurequot; quot;namequot; : null, quot;height_metersquot; : null, quot;sortquot; : quot;-height_metersquot;, quot;limitquot; : 10, } ] } What are the highest buildings in the world? 11
  • 12. Searching for Structure: Properties { quot;queryquot; : [ { quot;typequot; : quot;/location/countryquot; quot;namequot; : null, ”official_languagequot; : “English”, quot;limitquot; : 100 } ] } What are all the countries that speak English? 12
  • 13. A Treasure Trove Waiting To Be Opened • 2,150,000 articles (ie, topics) • 7,100,000 category refs (ie, typings) - Found within 280,000 categories • 42,000,000 template values (ie, properties) - Found within 10,000 templates and 56,000 template keys • All growing at ~2% every two weeks • Available information doubles every year! 13
  • 14. Topic Population From Wikipedia Topic Name Blurb Wikipedia Attribution Image Wikipedia Link 14
  • 16. Similar, but different … • Many pages in wikipedia are not topics - Disambiguation pages, lists, categories, images, docs, talk … • Only store a 1200-character blurb - We’re not wikipedia, after all • Don’t need to add “(suffix)” to names - “Python (genus)” vs “Python (programming language)” - Freebase types disambiguate without names • Cities should be specified without state suffix - “San Francisco” vs “San Francisco, California” - Cleanup in progress, some exceptions remain • “Exclusionist” vs “Inclusionist” - Exclusionists appear to be winning in Wikilandia - Freebase is inherently more inclusionist 16
  • 17. You Can’t Read The Same Wikipedia Twice Every 2 weeks … - 65,000 new pages - 8,000 deletes - 30,000 new topics - 5,000 name changes - 80,000 new aliases - 1,000 page ID changes - 10,000 merges - 1,000 splits … change in Wikipedia 17
  • 18. Keeping track of changes … • Store reference information within freebase - Page_ids, article titles and redirects - Page_id (WPID) is stored in /wikipedia/en_id - Article titles and redirects are stored in /wikipedia/en - “mwcl_wikipedia_en”, “mw_infobot” user • None of these IDs are stable in wiki-land … 18
  • 19. Determining actions by comparing keys case action new topic create a new topic name change add new name as en key; if quot;untouchedquot;, rename the topic id change change the en_id to the new value merge move the en key to the new topic; if quot;untouchedquot;, merge the topics split create new topic, move en key from old topic to new topic delete keep topic, but delete en_id and en keys from topic • Because we are more inclusionist than wikipedia, we usually do not delete topics. • Topic renames only occur on “untouched” topics. • Merges occur automatically on “untouched” topics - Otherwise, flagged for review in “pipeline” 19
  • 20. Map Template Fields To Properties 20
  • 21. Map Template Fields To Properties {{infobox Aircraft |subtemplate={{Infobox Boeing Aircraft}} |name =Boeing 777 |manufacturer =[[Boeing Commercial Airplanes]] |first flight =[[June 12]] [[1994]] |introduction =[[June 7]] [[1995]] with [[United]] |primary user = [[Singapore Airlines]] MediaWiki |more users = [[Air France-KLM]] |produced = 1993 - Present Template |number built = 723 as of March 2008 |unit cost = US$187.5-253 million Rendering }} 21
  • 22. Map Template Fields To Properties {{infobox Aircraft |subtemplate={{Infobox Boeing Aircraft}} |name =Boeing 777 |manufacturer =[[Boeing Commercial Airplanes]] |first flight =[[June 12]] [[1994]] |introduction =[[June 7]] [[1995]] with [[United]] |primary user = [[Singapore Airlines]] MediaWiki |more users = [[Air France-KLM]] |produced = 1993 - Present Template |number built = 723 as of March 2008 |unit cost = US$187.5-253 million Rendering }} “manufacturer” --> /aviation/aircraft_model/manufacturer 22
  • 23. Just the Starting Point … • Extracted to date from Wikipedia: - 2,365,000 topics - 2,895,000 typings - 5,638,000 properties • A complement to user-entered data - User data always takes precedence, won’t be overwritten • Processes are being automated to keep in sync 23
  • 24. Thanks! Tristan Buckner Topic updater /user/tristan Image loader Colin Evans WEX /user/colin Al Marks Category mapper /user/al Template mapper WEX 24