Analysing Structured Scholarly Data
Embedded in Web Pages
Pracheta Sahoo, Ujwal Gadiraju, Ran Yu,
Sriparna Saha and Stefan Dietze
WWW 2016
April 11th, 2016
Montreal, Canada
OVERVIEW
❏ INTRODUCTION
❏ MOTIVATION
❏ RESEARCH QUESTIONS
❏ ANALYSES
❏ CONCLUSIONS
❏ FUTURE WORK
INTRODUCTION (1/3)
The Web: nearly 46 trillion
Web pages indexed by Google
VS
Linked Data: approx. 1000
datasets & 100 billion
statements
● different order of
magnitude w.r.t. scale &
dynamics
Are there other semantics (structured facts) on the Web?
INTRODUCTION (2/3)
● Web pages embed structured data
(microdata, microformats and RDFa)
○ Interpretation of web documents
(search & retrieval)
● Increase in prevalence of embedded
markup (2014 Google study of 12 bn
pages estimates an adoption of 26%)
● “Web Data Commons” (Meusel et al.
[ISWC’14])
○ Markup from Common Crawl (2.2 bn
pages)
○ 17 billion RDF quads
○ Markup in 26% of pages, 14% of PLDs
in 2013 (increase from 6% in 2011)
Other semantics
(structured facts) on
the Web!
INTRODUCTION (3/3)
Characteristics of Markup Data
MOTIVATION
● Embedded markup ⇒ sparsely
linked, large % of coreferences,
redundant statements
● Uptake and reuse of embedded
markup is hindered by the lack
of dynamics, scale
● Lack of understanding of the
adoption of markup for
scholarly resource metadata
WHAT WE BRING TO THE TABLE ...
● Study of scholarly data
extracted from embedded
annotations (Web Data
Commons)
● Shape & characteristics of
entity descriptions
● Level of adoption of terms
& types, distributions
across TLDs, PLDs, data
publishers
RESEARCH QUESTIONS
RQ1 What are frequently used
terms & types for scholarly data?
RQ2 How are statements about
bibliographic data distributed
across the web? Who are the key
providers of bibliographic markup?
RQ3 What are the frequent errors
that can be observed?
DATASET
● Web Data Commons (WDC) 2014 dataset
● Subset ⇒ all statements describing entities
of type s:ScholarlyArticle or co-occurring
on the same document with any
s:ScholarlyArticle instance
○ 6,793,764 quads
○ 1,184,623 entities
○ 83 distinct classes
○ 429 distinct predicates
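
The subset selection above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes the data is available as N-Quads lines whose fourth element is the URL of the page, it uses a simplified regular expression instead of a full N-Quads parser, and the sample quads and the names `parse_quad` and `select_subset` are made up for the example.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHOLARLY_ARTICLE = "http://schema.org/ScholarlyArticle"

# Simplified pattern: subject, predicate, object, graph (page URL), final dot.
# Assumption: one quad per line; this is not a complete N-Quads parser.
QUAD_RE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s+(<[^>]*>)\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, page_url) or None if unparsable."""
    m = QUAD_RE.match(line.strip())
    if m is None:
        return None
    s, p, o, g = m.groups()
    return s, p.strip("<>"), o, g.strip("<>")


def select_subset(lines):
    """Keep every quad from pages that contain an s:ScholarlyArticle instance."""
    quads = [q for q in map(parse_quad, lines) if q is not None]
    pages = {g for s, p, o, g in quads
             if p == RDF_TYPE and o.strip("<>") == SCHOLARLY_ARTICLE}
    return [q for q in quads if q[3] in pages]


# Hypothetical sample: page /a has an article and a person, page /b has neither.
sample = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/ScholarlyArticle> <http://example.org/a> .',
    '_:n1 <http://schema.org/name> "A sample title"@en <http://example.org/a> .',
    '_:n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> <http://example.org/a> .',
    '_:n3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://example.org/b> .',
]

subset = select_subset(sample)
# Blank node labels are only unique per page, so entities are keyed by (page, subject).
entities = {(g, s) for s, p, o, g in subset}
predicates = {p for s, p, o, g in subset}
classes = {o.strip("<>") for s, p, o, g in subset if p == RDF_TYPE}
print(len(subset), len(entities), len(predicates), len(classes))
```

The four printed counts correspond to the kinds of figures listed on the slide (quads, entities, distinct predicates, distinct classes), computed here on the toy sample only.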
DATASET - Considerations
● s:ScholarlyArticle is the only type which
explicitly refers to scholarly articles
● We focus on schema.org, the most
widely used schema
● Types considered ⇒ s:ScholarlyArticle,
s:Person and s:Organization
○ 280,616 instances (s:ScholarlyArticle)
○ 847,417 instances (s:Person)
○ 3,798 instances (s:Organization)
SCHOLARLY TYPES & PREDICATES (1/2)
[Figure: Cumulative dist. of predicates over instances across
extracted types; annotated ranges: 1 to 14, 1 to 9, 1 to 4]
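
One plausible reading of this plot is: count the distinct predicates attached to each instance, then accumulate the fraction of instances having at most k predicates. The sketch below follows that reading; the function names and the sample pairs are hypothetical and not taken from the study.

```python
from collections import Counter, defaultdict


def predicates_per_instance(pairs):
    """pairs: iterable of (instance_id, predicate); returns {instance_id: distinct count}."""
    preds = defaultdict(set)
    for instance_id, predicate in pairs:
        preds[instance_id].add(predicate)
    return {instance_id: len(p) for instance_id, p in preds.items()}


def cumulative_distribution(counts):
    """Return [(k, fraction of instances with at most k distinct predicates)]."""
    hist = Counter(counts.values())
    total = sum(hist.values())
    running = 0
    cdf = []
    for k in sorted(hist):
        running += hist[k]
        cdf.append((k, running / total))
    return cdf


# Hypothetical sample: four instances of one type.
pairs = [
    ("a", "name"), ("a", "author"), ("a", "name"),
    ("b", "name"),
    ("c", "name"), ("c", "author"), ("c", "datePublished"),
    ("d", "name"),
]

cdf = cumulative_distribution(predicates_per_instance(pairs))
print(cdf)
```

Running this once per type (s:ScholarlyArticle, s:Person, s:Organization) would give one curve per type.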
SCHOLARLY TYPES & PREDICATES (2/2)
Top-10 Predicates for s:ScholarlyArticle
DOMAINS & DOCUMENTS (1/5)
Distribution of Entities & Statements across PLDs
DOMAINS & DOCUMENTS (2/5)
Top-10 PLDs (ranked by no. of entities)
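
A ranking of this kind can be approximated as below. Note the loud assumption: the pay-level domain is approximated by the last two labels of the host name, which is wrong for suffixes such as co.uk; a real analysis would use the Public Suffix List. The function names and the sample pairs are illustrative only.

```python
from collections import Counter
from urllib.parse import urlsplit


def approx_pld(url):
    """Crude PLD approximation: last two host labels (assumption, see lead-in)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def entities_per_pld(entity_pages):
    """entity_pages: iterable of (entity_id, page_url); each pair is counted once."""
    counts = Counter()
    for entity_id, url in set(entity_pages):
        counts[approx_pld(url)] += 1
    return counts


def top_k_share(counts, k):
    """Fraction of all entities contributed by the k largest PLDs."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


# Hypothetical sample of (entity, page) pairs.
entity_pages = [
    ("_:a", "http://journals.example.org/p1"),
    ("_:b", "http://journals.example.org/p1"),
    ("_:c", "http://www.example.org/p2"),
    ("_:a", "http://lib.example.com/x"),
]

counts = entities_per_pld(entity_pages)
print(counts.most_common(10))
print(top_k_share(counts, 1))
```

`top_k_share` gives a quick view of how concentrated the data is: a value close to 1 for small k means a few providers contribute most of the entities.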
DOMAINS & DOCUMENTS (3/5)
Distribution of Entities & Statements across TLDs
DOMAINS & DOCUMENTS (4/5)
Distribution of Entities & Statements across HTML
Documents
DOMAINS & DOCUMENTS (5/5)
Top-10 Documents Ranked According to
Embedded Entities
TOPICS & PUBLICATION TYPES (1/4)
Distribution of Scholarly Articles across Publishers
TOPICS & PUBLICATION TYPES (2/4)
Top-10 Publishers and corresponding no. of
Publications
TOPICS & PUBLICATION TYPES (3/4)
Top-10 Publication Types (genres) across WDC
TOPICS & PUBLICATION TYPES (4/4)
Top-10 Article Titles (ranked by frequency of occurrence)
FREQUENT ERRORS - Schema Violations
Top-10 Misused Predicates
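
As an illustration of what a schema violation check might look like, the sketch below flags predicates used on a type for which they are not defined. It is not the authors' method. The `ALLOWED` table is a small hand-written excerpt of schema.org properties, not the full vocabulary, so a valid property missing from the excerpt would be flagged wrongly; the sample statements are invented.

```python
from collections import Counter

# Hand-written excerpt (assumption): a few properties that schema.org defines
# for each type, including inherited ones. Not the complete vocabulary.
ALLOWED = {
    "ScholarlyArticle": {"name", "author", "datePublished", "publisher",
                         "about", "genre", "url", "description", "headline"},
    "Person": {"name", "affiliation", "email", "url", "jobTitle"},
    "Organization": {"name", "url", "address", "member"},
}


def misused_predicates(statements):
    """statements: iterable of (type_name, predicate); counts uses outside ALLOWED."""
    misuse = Counter()
    for type_name, predicate in statements:
        allowed = ALLOWED.get(type_name)
        if allowed is not None and predicate not in allowed:
            misuse[(type_name, predicate)] += 1
    return misuse


# Hypothetical sample: 'affiliation' belongs on s:Person, not on the article.
statements = [
    ("ScholarlyArticle", "author"),
    ("ScholarlyArticle", "affiliation"),
    ("Person", "name"),
    ("Person", "datePublished"),
    ("ScholarlyArticle", "affiliation"),
]

misuse = misused_predicates(statements)
print(misuse.most_common(10))
```

Types absent from `ALLOWED` are skipped rather than flagged, so the check stays silent where the excerpt has no information.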
CONCLUSIONS (1/2)
● First study on coverage & char. of
bibliographic metadata embedded
in web pages.
● Early adopters ⇒ publishers,
libraries, other providers of
bibliographic data.
● Usage of terms, types ⇒ dist.
across providers, domains and
topics follows a power law; few
providers & documents
contributing to majority of data.
CONCLUSIONS (2/2)
● Top-k genres & publishers indicate a
bias towards French and English data
providers.
● Article titles, PLDs & publishers ⇒
bias towards Computer Science and Life
Sciences.
● In this study we only consider entities
tagged explicitly as s:ScholarlyArticle;
a deeper analysis considering more
types (article, book, etc.) and other
creative works can shed light on the
true scale and potential of
embedded markup data.
FUTURE WORK
● Targeted crawl of typical
providers of scholarly data
(publishers, academic
orgs., libraries, etc.)
● Consider implicitly typed
bibliographic or creative
work as scholarly data
Contact Details:
gadiraju@l3s.de
http://www.L3S.de
LIMITATIONS
● Our study is limited to
schema.org & the types
s:ScholarlyArticle,
s:Person, s:Organization.
● We consider only explicitly
linked scholarly works.
