
SFScon 2020 - Peter Hopfgartner - Open Data de luxe


Linked Open Data is the most usable kind of Open Data. An example of a well-integrated source of Linked Open Data on tourism and mobility is the Open Data Hub operated by NOI. We will use the SPARQL query language, a W3C standard, to query the data and show how this differs from other access methods. The tour will start by querying the endpoint directly from the command line with tools like curl. Then, one by one, well-known data science software packages, such as R and Pandas, will be used to work directly with these datasets, to perform statistical calculations and generate graphs from the data.
In the final part, these software packages will be used to query data from other well-known data sources, such as Wikidata and DBpedia.


1. Open Data de luxe: Querying public SPARQL endpoints from the command line, R and Pandas
   Bolzano - 13 NOV 2020
   We make data actually usable
   Making the most of Open Data Hub, Wikidata, DBpedia and other sources of high quality data
2. Evolutions of Open Data: 5 star Open Data
   ★ available on the web (whatever format) but with an open licence
   ★★ plus: available as machine-readable structured data (e.g. Excel instead of an image scan of a table)
   ★★★ plus: non-proprietary format (e.g. CSV instead of Excel)
   ★★★★ plus: use open standards from W3C (RDF and SPARQL) to identify things
   ★★★★★ plus: link your data to other people’s data to provide context
   https://5stardata.info/
3. Evolutions of Open Data: FAIR
   Findable, Accessible, Interoperable and Reusable (FAIR)
   FAIR data is not always open data (personal data, competitive data etc.)
   ❖ It facilitates data interchange on the web
   ❖ It facilitates data integration across sources even when schemas are different
   ❖ It supports evolution of schemas over time with minimal disruption to data consumers
   https://www.go-fair.org
4. Technology of choice: 1 - RDF
   RDF is “a standard model for data interchange on the Web”.
   Large graphs are built from triples (subject - predicate - object).

    @prefix ab: <http://learningsparql.com/ns/addressbook#> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .

    ab:i0432 ab:firstName "Richard" ;
             ab:lastName "Mutt" ;
             ab:spouse ab:i9771 .
    ab:i8301 ab:firstName "Craig" ;
             ab:lastName "Ellis" ;
             ab:patient ab:i9771 .
    ab:i9771 ab:firstName "Cindy" ;
             ab:lastName "Marshall" .
    ab:spouse rdf:type owl:SymmetricProperty ;
              rdfs:comment "Identifies someone's spouse" .
    ab:patient rdf:type rdf:Property ;
               rdfs:comment "Identifies a doctor's patient" .
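To make the triple model concrete, here is a minimal sketch that loads the address-book snippet above with the rdflib Python package and queries it locally. The package, the file name addressbook.ttl and the query are illustrative assumptions, not part of the talk.

    # Sketch: parse the Turtle example above with rdflib (assumed installed).
    # "addressbook.ttl" is a hypothetical file containing exactly those triples.
    from rdflib import Graph

    g = Graph()
    g.parse("addressbook.ttl", format="turtle")

    # Every statement in the graph is a (subject, predicate, object) triple.
    for s, p, o in g:
        print(s, p, o)

    # The same SPARQL syntax used against remote endpoints also works locally.
    q = """
    PREFIX ab: <http://learningsparql.com/ns/addressbook#>
    SELECT ?first ?last WHERE {
        ?person ab:spouse ?spouse .
        ?spouse ab:firstName ?first ;
                ab:lastName ?last .
    }
    """
    for first, last in g.query(q):
        print(first, last)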
5. Technology of choice: 2 - SPARQL
   SPARQL is the language to select, update, create and delete triples.

    PREFIX schema: <http://schema.org/>
    PREFIX geo: <http://www.opengis.net/ont/geosparql#>

    SELECT * WHERE {
      ?t a schema:PerformingArtsTheater ;
         geo:asWKT ?pos ;
         schema:name ?posLabel .
    }
6. Technology of choice: 2 - SPARQL
   SPARQL is similar to SQL, but built for the web:
   ★ HTTP/S as transport protocol
   ★ No drivers required
   ★ Standardized by the W3C
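A small illustration of the "no drivers required" point: the sketch below talks to the Wikidata endpoint (used later in the talk) with nothing but the Python standard library. The query, headers and User-Agent string are illustrative assumptions.

    # Sketch only: a SPARQL request is plain HTTP, so the standard library
    # is enough -- no database driver involved.
    import json
    import urllib.parse
    import urllib.request

    endpoint = "https://query.wikidata.org/sparql"
    query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        # Wikidata asks for a descriptive User-Agent; this one is a placeholder.
        "User-Agent": "sfscon-demo/0.1 (example)",
    })

    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    # SPARQL JSON results: results -> bindings -> {variable: {"value": ...}}
    for binding in data["results"]["bindings"]:
        print({k: v["value"] for k, v in binding.items()})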
7. Your personal SPARQL database: Tracker
   Tracker is the file system indexer used by the GNOME desktop, e.g. for full text search.

    $ tracker sparql -q "SELECT DISTINCT ?performerName WHERE { ?s <http://www.tracker-project.org/temp/nmm#performer> ?performerName . }"
    Results:
    urn:artist:Yasmine%20Hamdan
    urn:artist:Otfried%20Preu%C3%9Fler
    urn:artist:Queens%20Of%20The%20Stone%20Age
    urn:artist:Guns%20N'Roses
    ...
8. Big SPARQL endpoints: Wikidata
   Wikidata handles the fact data for Wikipedia articles.
   (Screenshot: a Wikipedia article with callouts "Data from Wikidata" and "Link to Wikidata entry".)
9. Big SPARQL endpoints: Wikidata
10. Big SPARQL endpoints: DBpedia
    DBpedia extracts the data from Wikipedia and makes it available for querying and download.
11. Big SPARQL endpoints: Typical queries
    (The example queries are reproduced on slide 24.)
12. Big SPARQL endpoints: datacommons.org
    Operated by Google. Integrates many data sources:
    ★ United States Census
    ★ World Bank
    ★ US Bureau of Labor Statistics
    ★ Wikipedia
    ★ National Oceanic and Atmospheric Administration
    ★ Federal Bureau of Investigation
    ★ ...
13. 0 km endpoints: The Open Data Hub
    Operated by NOI Techpark (https://sparql.opendatahub.bz.it/)
14. How can I use these endpoints for my analyses?
15. Command line: cURL

    $ curl -X POST https://query.wikidata.org/sparql -H "Accept: text/csv" --data-urlencode query@countries.rq

    # countries.rq
    SELECT DISTINCT ?countryLabel ?population ?area WHERE {
      ?country wdt:P31 wd:Q6256 .
      ?country wdt:P1082 ?population .
      ?country wdt:P2046 ?area .
      MINUS { ?country wdt:P31 wd:Q3024240 . }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,[AUTO_LANGUAGE]". }
    }
    ORDER BY DESC(?population)
16. Command line: rsparql from Apache Jena

    $ ${JENA_DIR}/bin/rsparql --service 'https://query.wikidata.org/sparql' --query countries.rq --results=CSV > countries.csv

    # countries.rq
    SELECT DISTINCT ?countryLabel ?population ?area WHERE {
      ?country wdt:P31 wd:Q6256 .
      ?country wdt:P1082 ?population .
      ?country wdt:P2046 ?area .
      MINUS { ?country wdt:P31 wd:Q3024240 . }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,[AUTO_LANGUAGE]". }
    }
    ORDER BY DESC(?population)
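Once countries.csv exists, the usual data science tooling applies directly. A possible follow-up in Pandas (assuming Pandas is installed and the CSV produced above has the columns countryLabel, population and area):

    # Sketch: load the CSV written by rsparql above and do a small calculation.
    import pandas as pd

    df = pd.read_csv("countries.csv")

    # Population density in inhabitants per square kilometre.
    df["density"] = df["population"] / df["area"]

    print(df.sort_values("density", ascending=False).head(10))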
17. Directly from R

    library(WikidataQueryServiceR)
    r <- query_wikidata('
      SELECT DISTINCT ?countryLabel ?population ?area WHERE {
        ?country wdt:P31 wd:Q6256 .
        ?country wdt:P1082 ?population .
        ?country wdt:P2046 ?area .
        MINUS { ?country wdt:P31 wd:Q3024240 . }
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en,[AUTO_LANGUAGE]". }
      }
      ORDER BY DESC(?population)
    ')
    head(r)

    # A tibble: 6 x 3
      countryLabel                population    area
      <chr>                            <dbl>   <dbl>
    1 People's Republic of China  1409517397 9596961
    2 India                       1326093247 3287263
    3 United States of America     328239523 9826675
    ...
18. Python with the requests module

    import requests

    url = "https://sparql.opendatahub.bz.it/sparql"
    q = """
    PREFIX schema: <http://schema.org/>
    PREFIX geo: <http://www.opengis.net/ont/geosparql#>
    SELECT * WHERE {
      ?t a schema:PerformingArtsTheater ;
         geo:asWKT ?pos ;
         schema:name ?posLabel .
    }
    """
    r = requests.get(url,
                     params={'query': q},
                     headers={'Accept': 'application/sparql-results+json'})
    print(r.json())

    It works, but the returned results are not directly usable as a table.
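The "not directly usable as a table" gap can be closed by hand. A sketch that flattens the SPARQL JSON result from the request above into a Pandas DataFrame (it reuses the variable r from the slide and assumes Pandas is installed):

    # Sketch: turn SPARQL JSON result bindings into a Pandas DataFrame.
    import pandas as pd

    data = r.json()
    columns = data["head"]["vars"]  # e.g. ['t', 'pos', 'posLabel']
    rows = [
        {var: binding.get(var, {}).get("value") for var in columns}
        for binding in data["results"]["bindings"]
    ]
    df = pd.DataFrame(rows, columns=columns)
    print(df.head())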
19. Python with sparql_client

    import sparql

    endpoint = "https://sparql.opendatahub.bz.it/sparql"
    q = """
    PREFIX schema: <http://schema.org/>
    PREFIX geo: <http://www.opengis.net/ont/geosparql#>
    SELECT * WHERE {
      ?t a schema:PerformingArtsTheater ;
         geo:asWKT ?pos ;
         schema:name ?posLabel .
    }
    """
    result = sparql.query(endpoint, q)
    for row in result:
        print(row)

    (<IRI <http://noi.example.org/data/poi/9621F83525089644A0D47464D27D634E>>, <Literal "POINT (11.3534199999999998 46.4990740000000002)"^^<http://www.opengis.net/ont/geosparql#wktLiteral>>, <Literal "Kleinkunsttheater Carambolage">)
    ...

    Good, but needs some rework for Pandas.
20. Python with sparql-dataframe

    import sparql_dataframe

    endpoint = "https://sparql.opendatahub.bz.it/sparql"
    q = """
    PREFIX schema: <http://schema.org/>
    PREFIX geo: <http://www.opengis.net/ont/geosparql#>
    SELECT * WHERE {
      ?t a schema:PerformingArtsTheater ;
         geo:asWKT ?pos ;
         schema:name ?posLabel .
    }
    """
    df = sparql_dataframe.get(endpoint, q)

    Most comfortable solution for Pandas.
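From such a DataFrame, the statistical calculations and graphs mentioned in the abstract are one step away. A sketch, assuming the pos column holds plain WKT strings like "POINT (11.35 46.50)", that matplotlib is installed, and Python 3.9+ (for str.removeprefix):

    # Sketch: parse the WKT "POINT (lon lat)" strings returned above and plot
    # the theatre locations. Assumes df comes from the sparql_dataframe call
    # on the slide.
    import matplotlib.pyplot as plt

    def wkt_point_to_lon_lat(wkt):
        # "POINT (11.35342 46.49907)" -> (11.35342, 46.49907)
        lon, lat = wkt.strip().removeprefix("POINT").strip(" ()").split()
        return float(lon), float(lat)

    coords = df["pos"].map(wkt_point_to_lon_lat)
    lons = [c[0] for c in coords]
    lats = [c[1] for c in coords]

    plt.scatter(lons, lats)
    for (lon, lat), name in zip(coords, df["posLabel"]):
        plt.annotate(name, (lon, lat), fontsize=8)
    plt.xlabel("longitude")
    plt.ylabel("latitude")
    plt.title("Performing arts theatres from the Open Data Hub")
    plt.show()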
21. What makes RDF / SPARQL great for data exchange?
    ★ Data can really be queried, not only downloaded
    ★ Well structured data with rich data models, often standardized, and good metadata
    ★ Data is easy to integrate
    ★ Technology is easy to integrate
22. Thank you for your attention
23. The Team
    Diego Calvanese - Scientific advisor of the board, Full professor at unibz, ACM Fellow
    Benjamin Cogrel - CTO, Chair of the board
    Peter Hopfgartner - CEO
    Marco Montali - Scientific consultant, Assoc. professor at unibz
    Guohui Xiao - Chief scientist, Jun. professor at unibz
24. Big SPARQL endpoints: Typical queries

    # Wikidata: bands that start with "Radio"
    # try it on https://query.wikidata.org
    SELECT DISTINCT ?band ?bandLabel WHERE {
      ?band wdt:P31 wd:Q215380 .
      ?band rdfs:label ?bandLabel .
      FILTER(STRSTARTS(?bandLabel, 'Radio')) .
    }

    # DBpedia: facts about Joe Biden
    SELECT ?property ?hasValue ?isValueOf WHERE {
      { <http://dbpedia.org/resource/Joe_Biden> ?property ?hasValue }
      UNION
      { ?isValueOf ?property <http://dbpedia.org/resource/Joe_Biden> }
    }
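The same Python tooling shown on slides 18-20 works unchanged against these public endpoints. A sketch that runs the DBpedia query above through sparql-dataframe against the public endpoint at https://dbpedia.org/sparql (the LIMIT is an addition to keep the result small):

    # Sketch: query DBpedia exactly as the Open Data Hub was queried on slide 20.
    import sparql_dataframe

    endpoint = "https://dbpedia.org/sparql"
    q = """
    SELECT ?property ?hasValue ?isValueOf WHERE {
      { <http://dbpedia.org/resource/Joe_Biden> ?property ?hasValue }
      UNION
      { ?isValueOf ?property <http://dbpedia.org/resource/Joe_Biden> }
    }
    LIMIT 100
    """

    df = sparql_dataframe.get(endpoint, q)
    print(df.head())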
25. Evolutions of Open Data: Linked Data
    ❏ Use URIs to name (identify) things.
    ❏ Use HTTP URIs so that these things can be looked up (interpreted, “dereferenced”).
    ❏ Provide useful information about what a name identifies when it’s looked up, using open standards such as RDF, SPARQL, etc.
    ❏ Refer to other things using their HTTP URI-based names when publishing data on the Web.
    Tim Berners-Lee, 2006
