We develop a portal for ranking Wikipedia articles in various languages according to quality criteria. The following languages are currently covered: Belarusian, English, French, German, Polish, Russian, Ukrainian.
Lessons learnt:
1. Extraction methods should be improved.
2. Mapping to ontologies can be useful for comparison.
3. Identification of publications (better than hash) is needed.
4. External repositories are not open enough.
5. Distributions point at some problems with extraction.
6. There are plenty of use cases for analyses of citations.
Analysis of citation statistics can improve quality modelling of Wikipedia articles.
4. References and citation templates
<ref name="Trimble 1987">{{cite journal
|last=Trimble |first=V.
|date=1987
|title=Existence and nature of dark matter in the universe
|journal=[[Annual Review of Astronomy and Astrophysics]]
|volume=25|pages=425–472
|bibcode=1987ARA&A..25..425T
|doi=10.1146/annurev.aa.25.090187.002233
}}</ref>
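A minimal sketch of how such a template can be split into named fields. This is an illustration only, not the CitationExtractor or PyCiExtractor code; the helper names `split_top_level` and `parse_citation` are hypothetical, and nested templates inside values are handled only by depth counting.

```python
import re

def split_top_level(text, sep="|"):
    """Split on sep, ignoring separators nested inside [[...]] or {{...}}."""
    parts, buf, depth, i = [], [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            buf.append(pair)
            i += 2
        elif pair in ("]]", "}}"):
            depth -= 1
            buf.append(pair)
            i += 2
        elif text[i] == sep and depth == 0:
            parts.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(text[i])
            i += 1
    parts.append("".join(buf))
    return parts

def parse_citation(wikitext):
    """Return (template_name, {param: value}) for the first {{cite ...}} found."""
    match = re.search(r"\{\{\s*(cite [^|}]+)\|(.*)\}\}", wikitext, re.S | re.I)
    if match is None:
        return None
    name = match.group(1).strip().lower()
    params = {}
    for part in split_top_level(match.group(2)):
        key, eq, value = part.partition("=")
        if eq:
            # Collapse line breaks and repeated spaces inside a value.
            params[key.strip()] = " ".join(value.split())
    return name, params

SAMPLE = """<ref name="Trimble 1987">{{cite journal
|last=Trimble |first=V.
|date=1987
|title=Existence and nature of dark matter in the
universe
|journal=[[Annual Review of Astronomy and Astrophysics]]
|volume=25|pages=425–472
|bibcode=1987ARA&A..25..425T
|doi=10.1146/annurev.aa.25.090187.002233
}}</ref>"""

name, params = parse_citation(SAMPLE)
```

Normalising whitespace inside each value also repairs titles that wrap across lines in the wikitext, as in the sample above.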
4Krzysztof Węcel
14. Methods
• DBpedia Extraction Framework
– CitationExtractor
• adaptation to Polish templates for citation
• hard-coded rules
– several issues
• incorrect titles for some publications
– <http://doi.org/10.1051/aas:1999404> dc:title
"3.15576E8"^^<http://dbpedia.org/datatype/second> .
• processing limits
– JAXP00010004: The accumulated size of entities is "50 000 001" that exceeded the "50 000 000" limit set by "FEATURE_SECURE_PROCESSING"
• PyCiExtractor
– own implementation in Python
15. Specific issues
• titles can vary significantly
• given name and family name are sometimes distinguished
• specific naming of consecutive authors
– first1, last1, first2, last2, …
– imię1, nazwisko1, imię2, nazwisko2, …
• date field
– various formats
• access date is (and should be) different for individual items
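The numbered author parameters listed above can be grouped into one ordered author list. A minimal sketch, assuming the parameter names shown on this slide (first/last and imię/nazwisko) and assuming that an unnumbered parameter refers to the first author; the function name `collect_authors` is hypothetical.

```python
import re

# Assumed prefixes: English templates use first/last, Polish ones imię/nazwisko.
GIVEN = ("first", "imię")
FAMILY = ("last", "nazwisko")

def collect_authors(params):
    """Group numbered name parameters into an ordered list of (given, family)."""
    authors = {}
    for key, value in params.items():
        match = re.fullmatch(r"(\D+?)(\d*)", key.strip().lower())
        if match is None:
            continue
        prefix = match.group(1)
        index = int(match.group(2) or 1)  # unnumbered means first author (assumption)
        if prefix in GIVEN:
            authors.setdefault(index, {})["given"] = value.strip()
        elif prefix in FAMILY:
            authors.setdefault(index, {})["family"] = value.strip()
    return [(authors[i].get("given", ""), authors[i].get("family", ""))
            for i in sorted(authors)]

mixed = collect_authors({
    "last1": "Trimble", "first1": "V.",
    "nazwisko2": "Kowalski", "imię2": "Jan",
    "date": "1987",
})
```

Parameters that are not name fields, such as `date`, are ignored, so the same function can take the full parameter dictionary of a template.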
22. External citation databases
• benefits and tasks
– disambiguation of reference details
– fusion of references
– real statistics on a publication’s citations
– classification of publications (topic, quality, IF, stats)
• dereferencing identifiers:
– DOI, arXiv, bibcode, LCCN, …
• libraries/repositories
– Google Scholar, Mendeley, ResearchGate, BibSonomy, Microsoft
Academic Search, many more
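Dereferencing starts with turning an identifier into a resolvable URL. A minimal sketch for three of the identifier types above; the resolver base URLs in `RESOLVERS` are assumptions about the public resolvers, and no network request is made here. The function name `identifier_urls` is hypothetical.

```python
from urllib.parse import quote

# Assumed public resolver prefixes; verify against each service before use.
RESOLVERS = {
    "doi": "https://doi.org/",
    "arxiv": "https://arxiv.org/abs/",
    "bibcode": "https://ui.adsabs.harvard.edu/abs/",
}

def identifier_urls(params):
    """Map the identifier parameters of one citation to resolver URLs."""
    urls = {}
    for scheme, base in RESOLVERS.items():
        value = params.get(scheme, "").strip()
        if value:
            # Percent-encode characters such as '&' that occur in bibcodes.
            urls[scheme] = base + quote(value, safe="/:")
    return urls

urls = identifier_urls({
    "doi": "10.1146/annurev.aa.25.090187.002233",
    "bibcode": "1987ARA&A..25..425T",
})
```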
23. Our scenario: WorldCat
• the world’s largest library catalog
• collections of 72,000 libraries in 170 countries
• WorldCat Search API
25. Characteristics of citations
• focus on Polish citations
• other languages for comparison
• several aspects analysed:
– citing templates
– citing articles
– cited domains
• charts
– frequency vs. frequency rank (Zipf’s law)
– frequency vs. number of citations
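The frequency vs. rank chart can be checked numerically: under Zipf’s law, log(frequency) falls roughly linearly with log(rank), with a slope near -1. A minimal sketch using a plain least-squares fit; the function names are hypothetical and the data below is synthetic, not taken from the Wikipedia dumps.

```python
import math
from collections import Counter

def rank_frequency(items):
    """Return [(rank, frequency)] with rank 1 for the most frequent item."""
    freqs = sorted(Counter(items).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_slope(rank_freq):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic example: five hypothetical domains with frequency 120 / rank.
synthetic = []
for rank in range(1, 6):
    synthetic += [f"domain{rank}.example"] * (120 // rank)

rf = rank_frequency(synthetic)
slope = zipf_slope(rf)
```

The fit needs at least two distinct ranks; a slope far from -1, or a visible kink in the log-log plot, is one way a distribution can point at extraction problems.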
26. Frequency vs. number of citations (PL)
Observation: Zipf’s law is surprisingly accurate
40. Wikirank.net
• we develop a portal for ranking Wikipedia articles in various
languages according to quality criteria
• languages: Belarusian, English, French, German, Polish,
Russian, Ukrainian
• current modules:
– WikiRank
– Top Articles
– Citation Index
– Websites Rank
http://wikirank.net
45. CiteRank
• a new module with a goal to rank citations used within
various language editions of Wikipedia
http://cite.wikirank.net/ (DBpedia framework)
http://cite2.wikirank.net/ (PyCiExtractor)
46. Top titles
• still a problem with title extraction
• geography is the dominant topic
56. Surprise 2 – 1st place in English wiki
but: 404 Link broken
57. Lessons learnt
• Extraction methods should be improved.
• Mapping to ontologies can be useful for comparison.
• Identification of publications (better than hash) is needed.
• External repositories are not open enough.
• Distributions point at some problems with extraction.
• There are plenty of use cases for analyses of citations.
Citation statistics can improve quality modelling
of Wikipedia articles.
Editor's notes
{{Google books|7ydCAAAAIAAJ|History of the Western Insurrection|page=42}}
https://en.wikipedia.org/wiki/Template:Google_books
news, press,
citing templates – one citation can be used many times within an article
citing articles – only unique citations identified, i.e. one per article
cited domains – many web citations can point to a single source, thus increasing the "rank" of the source
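The three counting modes can be sketched as follows. This is an illustration under stated assumptions, not the CiteRank implementation: each input record is a hypothetical (article, citation URL) pair, one per use in the article text, and domains are counted over unique article and citation pairs.

```python
from collections import Counter
from urllib.parse import urlsplit

def citation_counts(occurrences):
    """Return (uses, articles, domains) counters for (article, url) records."""
    occurrences = list(occurrences)
    # Citing templates: every use counts, including repeats within an article.
    uses = Counter(url for _, url in occurrences)
    # Citing articles: each citation counts once per article.
    unique_pairs = set(occurrences)
    articles = Counter(url for _, url in unique_pairs)
    # Cited domains: different URLs on one host add up to one source.
    domains = Counter(urlsplit(url).hostname for _, url in unique_pairs)
    return uses, articles, domains

records = [
    ("A", "http://example.org/p1"),
    ("A", "http://example.org/p1"),  # same citation reused within article A
    ("B", "http://example.org/p1"),
    ("B", "http://example.org/p2"),
]
uses, articles, domains = citation_counts(records)
```

In this example the first URL has three uses but only two citing articles, while the single host collects three citations from its two URLs.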