DBpedia Citation Challenge
(Not only) Polish Citations in Wikipedia:
analysis, comparison, directions
Krzysztof Węcel, Włodzimierz Lewoniewski, Paweł Sobociński
DBpedia Community Meeting, Leipzig, 15.09.2016
Outline
• Extraction
• Linking
• Exploration
• Ranking
Krzysztof Węcel
Extraction
References and citation templates
<ref name="Trimble 1987">{{cite journal
|last=Trimble |first=V.
|date=1987
|title=Existence and nature of dark matter in the universe
|journal=[[Annual Review of Astronomy and Astrophysics]]
|volume=25|pages=425–472
|bibcode=1987ARA&A..25..425T
|doi=10.1146/annurev.aa.25.090187.002233
}}</ref>
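A minimal sketch of how such a template could be split into named fields. This is not the CitationExtractor or PyCiExtractor code; the function names `split_top_level` and `parse_citation_template` are made up for illustration, and nested templates or `<ref>` attributes are ignored.

```python
def split_top_level(body):
    """Split on "|" but not inside [[...]] links or nested {{...}} templates."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_citation_template(wikitext):
    """Return the template name and its named parameters as a dict."""
    start = wikitext.index("{{") + 2
    end = wikitext.rindex("}}")
    name, *params = split_top_level(wikitext[start:end])
    fields = {"template": name.strip().lower()}
    for param in params:
        key, sep, value = param.partition("=")
        if sep:  # positional parameters (no "=") are skipped in this sketch
            fields[key.strip()] = value.strip()
    return fields


SAMPLE = """<ref name="Trimble 1987">{{cite journal
|last=Trimble |first=V.
|date=1987
|title=Existence and nature of dark matter in the universe
|journal=[[Annual Review of Astronomy and Astrophysics]]
|volume=25|pages=425–472
|bibcode=1987ARA&A..25..425T
|doi=10.1146/annurev.aa.25.090187.002233
}}</ref>"""

citation = parse_citation_template(SAMPLE)
```

Tracking the bracket depth keeps a piped link such as `[[A|B]]` inside one field instead of splitting it at the inner "|".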
Citation rendering – external sites
Citation templates
• {{cite web …
• {{cite journal …
• {{cite book …
• {{cite conference …
but also
• {{Google books|ID|title|page=|keywords=|text=|plainurl=}}
Citation templates (cont’d)
• Polish
– {{cytuj …
– {{cytuj stronę …
– {{cytuj pismo …
– {{cytuj książkę …
• German
– {{Literatur …
– {{Internetquelle …
– but also
• {{DOI …
• {{ISSN …
Number of templates (charts, slides 8–13)
Methods
• DBpedia Extraction Framework
– CitationExtractor
• adaptation to Polish templates for citation
• hard-coded rules
– several issues
• incorrect titles for some publications
– <http://doi.org/10.1051/aas:1999404> dc:title "3.15576E8"^^<http://dbpedia.org/datatype/second> .
• processing limits
– JAXP00010004: The accumulated size of entities is "50 000 001" that exceeded the "50 000 000" limit set by "FEATURE_SECURE_PROCESSING"
• PyCiExtractor
– own implementation in Python
Specific issues
• titles can vary significantly
• given name and family name are sometimes distinguished
• specific naming of consecutive authors
– first1, last1, first2, last2, …
– imię1, nazwisko1, imię2, nazwisko2, …
• date field
– various formats
• access data is (and should be) different for individual items
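The numbered author parameters could be grouped along these lines. A sketch only: it covers just the parameter stems quoted on the slide (first/last, imię/nazwisko), treats an unnumbered parameter as author 1 (an assumption), and `collect_authors` is a hypothetical helper name.

```python
import re

# Parameter stems quoted on the slide; real templates accept more aliases.
GIVEN_STEMS = {"first", "imię"}
FAMILY_STEMS = {"last", "nazwisko"}
PARAM = re.compile(r"^(\D+?)(\d*)$")  # stem followed by an optional index


def collect_authors(fields):
    """Group firstN/lastN style parameters into one dict per author."""
    authors = {}
    for key, value in fields.items():
        match = PARAM.match(key.strip().lower())
        if not match:
            continue
        stem = match.group(1)
        index = int(match.group(2) or 1)  # assumption: no index means author 1
        if stem in GIVEN_STEMS:
            role = "given"
        elif stem in FAMILY_STEMS:
            role = "family"
        else:
            continue  # not an author parameter
        authors.setdefault(index, {})[role] = value.strip()
    return [authors[i] for i in sorted(authors)]


english = collect_authors(
    {"last1": "Smith", "first1": "J.", "last2": "Kowalski", "first2": "A.", "title": "X"}
)
polish = collect_authors({"imię": "Jan", "nazwisko": "Nowak"})
```

Mapping both language variants onto the same `given`/`family` keys makes citations from different Wikipedia editions comparable.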
Sample variants of title
Linking
Reuse of attributes
Completeness of attributes
Ontologies/Vocabularies
• bibo:
– The Bibliographic Ontology, http://bibliographic-ontology.org/, 2016
– http://purl.org/ontology/bibo/
• fabio:
– FaBiO, the FRBR-aligned Bibliographic Ontology, http://www.sparontologies.net/ontologies/fabio/source.html, 2016
– http://purl.org/spar/fabio
Mappings to ontologies
External citation databases
• benefits and tasks
– disambiguation of reference details
– fusion of references
– real statistics on a publication’s citations
– classification of publications (topic, quality, IF, stats)
• dereferencing identifiers:
– DOI, arXiv, bibcode, LCCN, …
• libraries/repositories
– Google Scholar, Mendeley, ResearchGate, BibSonomy, Microsoft Academic Search, many more
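Dereferencing an identifier usually starts with turning it into a URL. The sketch below assumes the public resolver patterns of doi.org, arXiv, NASA ADS and the Library of Congress; the slides do not say which resolvers were used, and `resolver_url` is a hypothetical helper.

```python
from urllib.parse import quote

# Assumed resolver URL patterns; none of them is named in the slides.
RESOLVERS = {
    "doi": "https://doi.org/{}",
    "arxiv": "https://arxiv.org/abs/{}",
    "bibcode": "https://ui.adsabs.harvard.edu/abs/{}",
    "lccn": "https://lccn.loc.gov/{}",
}


def resolver_url(scheme, identifier):
    """Build a URL at which the identifier could be looked up, or None."""
    pattern = RESOLVERS.get(scheme.lower())
    if pattern is None or not identifier.strip():
        return None
    # Keep "/" (DOIs contain it); percent-encode characters such as "&".
    return pattern.format(quote(identifier.strip(), safe="/"))


doi_url = resolver_url("doi", "10.1146/annurev.aa.25.090187.002233")
bibcode_url = resolver_url("bibcode", "1987ARA&A..25..425T")
```

The "&" in a bibcode has to be percent-encoded, otherwise it could be read as a query separator.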
Our scenario: WorldCat
• the world’s largest library catalog
• collections of 72,000 libraries in 170 countries
• WorldCat Search API
Exploration
Characteristics of citations
• focus on Polish citations
• other languages for comparison
• several aspects analysed:
– citing templates
– citing articles
– cited domains
• charts
– frequency vs. frequency rank (Zipf’s law)
– frequency vs. number of citations
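A frequency vs. frequency rank chart can be checked against Zipf’s law by fitting a straight line in log-log space. The sketch below uses made-up toy data, not the Wikipedia counts behind the charts; a slope close to -1 is the classic Zipf shape.

```python
import math
from collections import Counter


def rank_frequency(items):
    """Return (rank, frequency) pairs, most frequent item first."""
    frequencies = sorted(Counter(items).values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def log_log_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up toy data: source k is cited 60 / k times (an ideal Zipf shape).
toy_citations = [f"source{k}" for k in range(1, 6) for _ in range(60 // k)]
pairs = rank_frequency(toy_citations)
slope = log_log_slope(pairs)
```

On real data the fit is usually restricted to a middle range of ranks, because the head and the long tail tend to bend away from the line.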
Frequency vs. number of citations (PL)
Observation: Zipf’s law is surprisingly accurate
Frequency vs. frequency rank (PL)
Frequency rank – articles (PL)
Observation: Zipf’s law works for articles, too
Number of citations – articles (PL)
Frequency rank for domains (PL)
Comment: unique citation, i.e. counted only once per Wikipedia article
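Counting cited domains with the "unique citation" rule could look roughly like this. A sketch under assumptions: citations arrive as (article, URL) pairs, a leading "www." is dropped, and the example.org / example.net URLs are placeholders rather than data from the study.

```python
from collections import Counter
from urllib.parse import urlparse


def cited_domain(url):
    """Host name of a URL without a leading "www."; "" if there is none."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_counts(citations, unique_per_article=True):
    """Count cited domains from (article, url) pairs.

    With unique_per_article=True the same URL is counted only once per
    article, however often the article reuses that citation.
    """
    seen = set()
    counts = Counter()
    for article, url in citations:
        domain = cited_domain(url)
        if not domain:
            continue
        if unique_per_article:
            if (article, url) in seen:
                continue
            seen.add((article, url))
        counts[domain] += 1
    return counts


# Placeholder input, not data from the study.
toy = [
    ("Article A", "http://www.example.org/a"),
    ("Article A", "http://www.example.org/a"),  # same citation reused
    ("Article A", "http://www.example.org/b"),
    ("Article B", "http://example.net/"),
]
unique = domain_counts(toy)
raw = domain_counts(toy, unique_per_article=False)
```

Different pages of one site still add up under the same domain, which is why many web citations can raise the rank of a single source.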
Frequency rank for articles (PL)
Comment: ID’d, i.e. identified citation, e.g. by URL, ISBN or DOI
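One way to assign an identifier to a citation, with a hash as the fallback for the "other/hash" group. The priority order DOI, ISBN, URL and the use of SHA-1 are assumptions made for this sketch, not the rules used in the study; `citation_id` is a hypothetical helper and the ISBN below is a made-up number.

```python
import hashlib


def citation_id(fields):
    """Pick a stable key for a citation: DOI, then ISBN, then URL, else a hash."""
    doi = fields.get("doi", "").strip().lower()
    if doi:
        return "doi:" + doi
    # Keep only digits and the X check character of an ISBN.
    isbn = "".join(ch for ch in fields.get("isbn", "") if ch.isdigit() or ch in "xX")
    if isbn:
        return "isbn:" + isbn.upper()
    url = fields.get("url", "").strip()
    if url:
        return "url:" + url
    # Fallback: hash of all fields with whitespace collapsed and case folded.
    normalised = "|".join(
        f"{key}={' '.join(str(value).split()).lower()}"
        for key, value in sorted(fields.items())
    )
    return "hash:" + hashlib.sha1(normalised.encode("utf-8")).hexdigest()


by_doi = citation_id({"doi": "10.1146/ANNUREV.aa.25.090187.002233"})
by_isbn = citation_id({"isbn": "978-83-01-00000-1"})  # made-up ISBN
hashed_a = citation_id({"title": "Dark  Matter"})
hashed_b = citation_id({"title": "dark matter"})
```

A hash only merges citations whose fields match after normalisation, so any difference in the title produces a separate entry.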
Citations by type (PL)
Observation: books seem to dominate in Polish
Citations by type (EN)
Observation: other/hash sources seem to dominate in English
Identification of articles (EN)
Observation: there is probably an issue with hashed articles in English, i.e. no straight line
Comparison: freq rank for domains
Observation: more domains are cited in English
Comparison: freq rank for all articles
Observation: there are more citations in Polish than in English (cited at least 10 times)
New data, all languages - domains
Comment: data extracted using PyCiExtractor; numbers seem to better reflect reality
New data, all languages - articles
Ranking
Wikirank.net
• we are developing a portal for ranking Wikipedia articles in various languages according to quality criteria
• languages: Belarusian, English, French, German, Polish, Russian, Ukrainian
• current modules:
– WikiRank
– Top Articles
– Citation Index
– Websites Rank
http://wikirank.net
WikiRank – sample article
WikiRank – sample article (cont’d)
Citation Index
Websites Rank
CiteRank
• a new module that ranks citations used within various language editions of Wikipedia
http://cite.wikirank.net/ (DBpedia framework)
http://cite2.wikirank.net/ (PyCiExtractor)
Top titles
• still a problem with title extraction
• geography is a dominating topic
Top titles
• some titles are very popular
Top titles
• even for frequent references there are plenty of ambiguities
Completeness
• yes, we agree, it might be misleading…
Most cited in Poland
Most cited – plant taxonomy
Details of citation – author name variants
Surprise – 7th place in Polish Wiki
• www.navin.org.np – National Association of Village Development Committees in Nepal (NAVIN)
NAVIN citation – details
Sample article citing NAVIN
Surprise 2 – 1st place in English wiki
but: 404, link broken
Lessons learnt
• Extraction methods should be improved.
• Mapping to ontologies can be useful for comparison.
• Identification of publications (better than hash) is needed.
• External repositories are not open enough.
• Distributions point at some problems with extraction.
• There are plenty of use cases for analyses of citations.
Citation statistics can improve quality modelling of Wikipedia articles.


Editor’s notes

  1. {{Google books|7ydCAAAAIAAJ|History of the Western Insurrection|page=42}} https://en.wikipedia.org/wiki/Template:Google_books news, press,
  2. citing templates: one citation can be used many times within an article; citing articles: only unique citations are identified, i.e. one per article; cited domains: many web citations can point to a single source, thus increasing the “rank” of the source
  3. They are not evenly distributed
  4. There are just so many authors…