The current status of Linked Open Data (LOD) shows evidence of many datasets available on the Web in RDF. In the meantime, organizations still face many challenges on their journey towards publishing five-star datasets on the Web. Those challenges are not only technical but also organizational. At a moment when connectionist AI is riding a wave of popularity with many applications, LOD needs to go beyond the guarantee of the FAIR principles. One direction is to build a sustainable LOD ecosystem with FAIR-S principles. In parallel, LOD should serve as a catalyst for solving societal issues (LOD for Social Good) and for personal empowerment through data (Social Linked Data).
1. THE FUTURE OF LINKED OPEN DATA
Ghislain Atemezing, PhD
Director R&D - MONDECA
@gatemezing
ESSnet Linked Open Statistics - Sofia, Bulgaria - 28th May 2019
2. AGENDA
❖ Current status of LOD
❖ Challenges
➢ LOD is NOT (only) about Technology
❖ Signs of Hope
❖ Towards a sustainable LOD ecosystem - FAIRS (FAIR + Sustainable)
3. RDF: Simple or hard to use?
“RDF is hard to sell”
“RDF is heavy” - Eoin MacCuirc
“RDF is simple enough that you can build a complex system”
“It’s difficult to standardize vocabularies because of many egos”
“The Semantic Web is . . . an extension of the current one, in which information is given well-defined meaning.” “Meaning is expressed by RDF.”
Is RDF hard to use? Why?
4. Google Trends - RDF vs LOD - Last five years
LOD is more popular than RDF as a search term
LOD & RDF searches have been decreasing since 2014
5. LOD Evolution in the last decade
March 2008: 34 datasets
March 2014: 570 datasets, 2,909 links
March 2019: 1,239 datasets, 16,147 links
In the last five years:
- ~2X more datasets available
- ~5X more links in the LOD
6. LOD Stats by 2024 - Predictions
LOD will contain at least:
- 2,688 datasets
- 88,808 links
Is this realistic or not?
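The 2024 figures above come from a simple linear extrapolation: apply the rounded 2014-to-2019 growth factors (about 2.17x for datasets and 5.5x for links) to the 2019 counts once more. A minimal sketch of that arithmetic:

```python
# 2019 counts from the LOD Cloud, as on the previous slide.
datasets_2019, links_2019 = 1_239, 16_147

# Rounded 2014 -> 2019 growth factors (1,239/570 ~ 2.17; as used on the slide, 5.5 for links).
datasets_2024 = int(datasets_2019 * 2.17)  # -> 2688
links_2024 = int(links_2019 * 5.5)         # -> 88808

print(datasets_2024, links_2024)
```

Whether growth stays linear is exactly the "is this realistic?" question the slide asks.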
8. What’s up LOD? - State of the LOD Cloud in 2019
- How many datasets by domain are available?
- How many vocabularies per dataset?
- What are the most-used predicates for interlinking, by category?
- How many linked datasets?
- How many datasets use the Data Cube vocabulary?
- How many broken links?
In 2019, you can’t simply get an answer by looking at the LOD Cloud.
10. Publishers, do they ever know who is consuming their datasets?
Not always... why?
Are we building towers of knowledge?
How can we know who is consuming our datasets?
What are the incentives for the publishers?
11. Are we (really) data driven ORGs?
Many use cases of semantic technologies in industry
Why and how are people still sceptical about RDF?
The problem is maybe NOT about the technology
ORGs should show the path through massive data generation on the Web
12. (Some) Challenges to create LOD
Shared vocabulary management
Ontology creation: no clear methodology / lack of internal expertise
Mappings to ontologies are not trivial
Links to external datasets (which ones? Default: DBpedia?)
Pan-national interpretation and comparison is particularly challenging
13. More Challenges
Maintenance of tools: we can’t trust tools built by PhDs / interns
Versioning of datasets in the LOD cloud
Annual review of datasets (by whom?)
General commitment / finding a real business value
14. Organizational challenges: where is the CDO?
Lack of data governance in our ORGs
Minimal data sharing within the ORG
No existing practice for documenting knowledge
Lack of vision on harmonizing different “data lakes”
15. Challenges - Metadata / Versioning
Frequent releases of datasets in the LOD cloud
Managing versions and tracking diffs of datasets
Proper use of metadata to track changes / check data consistency
Data quality and provenance attached to datasets
Licensing issues (how to properly cite and reuse datasets)
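One pragmatic way to decide when a release is really a new version is to fingerprint the dataset's content and compare fingerprints across dumps. A minimal Python sketch, where the URIs and the DCAT-style record keys are illustrative, not a standard:

```python
import hashlib

def dataset_fingerprint(triples):
    """Order-independent hash over N-Triples lines, so reordered
    dumps of identical data produce the same fingerprint."""
    digest = hashlib.sha256()
    for line in sorted(triples):
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()[:16]

v1 = [
    '<http://example.org/ds> <http://purl.org/dc/terms/title> "Census 2018" .',
    '<http://example.org/ds> <http://purl.org/dc/terms/license> <http://example.org/cc0> .',
]
v2 = v1 + ['<http://example.org/ds> <http://purl.org/dc/terms/issued> "2019-05-28" .']

changed = dataset_fingerprint(v1) != dataset_fingerprint(v2)

# Bump the version only when the content actually changed; the metadata
# record mimics DCAT/OWL versioning terms in a plain dict.
record = {
    "dcat:dataset": "http://example.org/ds",
    "dct:issued": "2019-05-28",
    "owl:versionInfo": "1.1" if changed else "1.0",
}
print(record)
```

Attaching such a fingerprint to the release metadata gives consumers a cheap consistency check without downloading the full dump.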
16. Signs of Hope
Many semantic technology advances are reducing the barriers to querying billions of triples, even on a normal laptop
Photo by Ron Smith on Unsplash
17. Democratizing Access to LOD
“Fernández, J. D., Beek, W., Martínez-Prieto, M. A., and Arias, M. LOD-a-lot: A Queryable Dump of the LOD Cloud (2017). http://purl.org/HDT/lod-a-lot.”
28 billion unique triples from 650K datasets. All of LOD on a medium-size laptop: 524 GB of disk space; 15.7 GB of RAM
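LOD-a-lot builds on HDT, a compressed binary RDF format that stores every term once in a dictionary and represents triples as tuples of integer IDs; that compression is a large part of why billions of triples fit on one machine. A toy Python sketch of the dictionary-encoding idea (not the real HDT layout):

```python
class TinyHDT:
    """Toy dictionary-encoded triple store in the spirit of HDT."""

    def __init__(self):
        self.dict = {}      # term -> integer id
        self.terms = []     # integer id -> term
        self.triples = []   # (subject_id, predicate_id, object_id)

    def _id(self, term):
        if term not in self.dict:
            self.dict[term] = len(self.terms)
            self.terms.append(term)
        return self.dict[term]

    def add(self, s, p, o):
        self.triples.append((self._id(s), self._id(p), self._id(o)))

    def search(self, s=None, p=None, o=None):
        """Triple-pattern lookup; None acts as a wildcard."""
        want = tuple(None if t is None else self.dict.get(t, -1)
                     for t in (s, p, o))
        for t in self.triples:
            if all(w is None or w == v for w, v in zip(want, t)):
                yield tuple(self.terms[i] for i in t)

store = TinyHDT()
store.add("ex:Alice", "foaf:knows", "ex:Bob")
store.add("ex:Alice", "foaf:name", '"Alice"')
hits = list(store.search(s="ex:Alice", p="foaf:knows"))
print(hits)
```

The real format adds bitmap-compressed triple indexes on top of the dictionary, which is what makes pattern lookups over 28 billion triples feasible on commodity hardware.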
18. Google Data Search or Schema.org In Action
Google Dataset Search launched in 2018
Based on schema.org (cf. https://toolbox.google.com/datasetsearch/search?query=Site%3Adata.gouv.fr)
Uses DCAT and other structured metadata to discover open datasets
One DCAT file per dataset / Googlebot is not smart enough
Link: https://toolbox.google.com/datasetsearch
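Dataset Search discovers datasets through schema.org `Dataset` markup embedded in web pages as JSON-LD. A minimal Python sketch that generates such a record; all the values (name, URLs, license) are illustrative:

```python
import json

# Property names follow the schema.org Dataset type, which crawlers
# such as Googlebot pick up when the JSON-LD is embedded in a page.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Linked Open Statistics demo",  # illustrative values
    "description": "Example statistical dataset published as LOD.",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/turtle",
        "contentUrl": "https://example.org/data/demo.ttl",
    },
}

# Serialize for embedding in a <script type="application/ld+json"> tag.
jsonld = json.dumps(dataset, indent=2)
print(jsonld)
```

Publishing one such record per dataset page is the low-effort path to discoverability that the slide contrasts with richer DCAT catalogs.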
20. Beyond FAIR Principles
Findable: unique IDs that are resolvable
Accessible: common access methods
Interoperable: shared vocabularies & taxonomies
Reusable: provenance, license
FAIR + Sustainable => FAIRS
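The four principles above can be read as a concrete checklist against a dataset's metadata. A toy Python self-check, where the record keys are illustrative rather than any standard schema:

```python
# Which metadata fields back each FAIR principle (illustrative mapping).
REQUIRED = {
    "Findable":      ["identifier"],            # resolvable, unique ID
    "Accessible":    ["access_url"],            # common access method
    "Interoperable": ["vocabularies"],          # shared vocabularies
    "Reusable":      ["license", "provenance"],
}

def fair_report(meta):
    """True per principle iff all its backing fields are present and non-empty."""
    return {principle: all(meta.get(key) for key in keys)
            for principle, keys in REQUIRED.items()}

meta = {
    "identifier": "https://example.org/id/ds1",
    "access_url": "https://example.org/sparql",
    "vocabularies": ["dcat", "qb"],
    "license": "CC0-1.0",
    # provenance missing -> not fully Reusable
}
report = fair_report(meta)
print(report)
```

Sustainability (the "S" in FAIRS) is harder to automate: it is about who keeps these fields true over time, not whether they are filled in today.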
21. Wikidata - A Community for Wikibase
56M items, 700M statements, 400 languages, 20K active contributors per month, 900M edits, 8.5M daily SPARQL queries
A healthy community that helps write SPARQL queries, showing that the technology is mature
60M links to DBpedia, 7.7 billion triples
Many applications in chatbots (Apple Siri, research, scientists, etc.)
SPARQL is affordable and usable
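Those 8.5M daily queries hit the public Wikidata Query Service. A minimal Python sketch that builds such a request with the standard library; the endpoint is the real one, but the query itself is a stock example (items that are an instance of "house cat") and is not from the deck:

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

# wdt:P31 = "instance of", wd:Q146 = "house cat"; the label service
# resolves English labels for the results.
query = """SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} LIMIT 5"""

# Build the GET request URL; fetching it (e.g. with urllib.request)
# returns the results as JSON.
request_url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
print(request_url)
```

The point of the slide stands either way: the barrier to entry is a single HTTP GET, which is what "SPARQL is affordable and usable" means in practice.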
22. The future starts today: data is infinite, the Web is here to stay, semantic technologies are mature.
23. Towards a Sustainable LOD Ecosystem
Work on having a board of committed people from different areas of expertise (tech, academia, industry, government, etc.)
Gather and promote LOD tools and applications based on past experience
Learn from the errors of the past
Create a real community of publishers and consumers beyond the W3C
Liaise with the W3C to create a community group?
24. LOD for Social Good: Killer Apps ?
Develop and use LOD for solving societal issues
Apps that achieve any of the 17 Sustainable Development Goals (SDGs) included in the 2030 Agenda for Sustainable Development
LOD to enhance advances on misinformation issues on the Web
25. Solutions for future LOD
Create a forum with different stakeholders to discuss LOD issues and maintenance
Create a new way to manage and maintain datasets in the LOD cloud (a W3C community? A mix of communities? A foundation à la Apache?)
New enforcement rules for the LOD management life-cycle
More use cases of datasets with probabilistic and temporal models
26. Graph of Linked “Insights” Datasets?
Statistical models also find “insights” over datasets
Data scientists spend hours understanding the underlying data to generate reports, dashboards or applications
How do we model that knowledge and publish it on the Web?
“Current insights after data analysis get stored in a spreadsheet and then get lost. We want to create a graph of insights, link them and generate new insights” - Lambert Hogenhout, UN #kgc2019
https://twitter.com/juansequeda/status/1126144558683885569
27. Takeaway message
The maturity of semantic technologies is fully demonstrated in many real-world applications
The Web is a precious means of exchanging information, both for humans and machines
Versioning and dataset updates are still challenging
For exchanging knowledge, LOD is probably the (only) solution - the only way to make “AI intelligent”
New applications will combine LOD with AI (autonomous agents, chatbots, etc.)
28. The more you publish datasets as LOD, the more you are preparing the next generation of “prescriptive” autonomous agents. Classical (predictive) AI alone (neural networks, machine learning) can’t make this happen.
29. “The future [of the Web] is still so much bigger than the past.” - Tim BL (2018)
So will be the future of Linked Open Data…
Just publish and share your assets on the Web
https://inrupt.com/blog/one-small-step-for-the-web
Speaker notes
Results for why RDF is not easier for middle 30% of developers?
Popularity of searching terms RDF vs LOD since 2014. LOD search is more popular than RDF. (beware of RDF = Rwanda Defence Force)
RDF search decreasing since 2014. Same for LOD.
March 2019: 1,239 datasets with 16,147 links
March 2008: 34 datasets
March 2014: 570 datasets and 2,909 linkage relationships between the datasets
March 2024 (prediction): 1,239 × 2.17 → 2,688+ datasets; 16,147 × 5.5 → 88,808 links
“Data, Scientific; Astell, Mathias (2017): Benefits of Open Research Data Infographic. figshare. Figure” https://doi.org/10.6084/m9.figshare.5179006.v3
Figshare from Open Science community could be a way to go…
How many of you have ontologists team?
More Data governance from publishers is needed
Many tools built during a project are not maintained anymore after completion
Who are the culprits?
By using two technologies, the RDF binary format (HDT) and Linked Data Fragments, you can even deploy and run all of LOD on a medium-size laptop.
May 2019: 14M datasets from 3k repositories. Crawls DCAT and schema.org metadata
Link guidelines Google dataset search: https://developers.google.com/search/docs/data-types/dataset#sitemap
You can use one single DCAT file for all the datasets in your domain
Build a community for LOD as Wikidata did for Wikibase. Active users / an active community
See Grafana
https://grafana.wikimedia.org/d/000000489/wikidata-query-service?refresh=1m&orgId=1&from=now-1y&to=now
We need to review the current challenges to make a more sustainable LOD ecosystem
I really like the idea of having workshops on this topic of SDGs, see a call at ISWC 2019 https://sw4sg2019.github.io/iswc2019/
Todo: best practices for data aggregation and scale for publishing
This is similar to the idea of Eurostat to create a KG of explained statistics.
A need to find new ways to maintain and update data in the LOD cloud.