DataGraft
Data-as-a-Service for Open Data
Dumitru Roman
dumitru.roman@sintef.no
https://datagraft.net
About me
• Education
– Eng (2003), Technical University of Cluj-Napoca, Romania
– PhD (2008), University of Innsbruck, Austria
• Current positions
– Senior Research Scientist, SINTEF, Norway
– Associate Professor, University of Oslo, Norway
• Expertise and responsibilities
– Initiating, leading, and carrying out (research-intensive) projects on
data management and service-oriented topics
– Involved with over 20 large-scale R&D projects at the European level
during the past 12 years
2
“Technology for a better society”
• Public and private
companies
• Data owners
• Data publishers
• Data integrators and
aggregators
• Developers
• Improved data access
• Data-driven decision making
• Cost reduction when
working with data
• Reduction of the
dependency on generic
infrastructure providers
(e.g. generic cloud)
• Increase in the speed of
making data available
• Increase in the reuse of data
• Data cleaning
• Data transformation
• Data publication
• Data-as-a-Service
• Open data
• Linked data (RDF, SPARQL)
DataGraft
3
4
Outline
Session #1: Open Data
• Open Data
• (Open) Data Quality Issues
• Linked (Open) Data
– RDF, RDFS, SPARQL
Session #2: DataGraft
• Data-as-a-Service: DataGraft
• Examples and Demo
• Big Data and DataGraft
• Open Data in Malaysian
context (by Dennis Gan)
• (Optional: Hands on)
5
What is Open Data?
What is Linked Data?
Challenges in (Linked Open) Data?
How to publish Linked Open Data?
Linked Open Data Use Cases?
(Linked) Open Data and Big Data?
Open Data
What can open data do for you?
(Source: The ODI, https://vimeo.com/110800848)
7
Open Data
…is changing the nature of business
...reflects a cultural shift to a more open
society
8
Example: Personalized and Localized Urban
Quality Index (PLUQI)
The index includes data from various
domains:
Daily life satisfaction
weather, transportation, community, …
Healthcare level
number of doctors, hospitals, suicide statistics, …
Safety and security
number of police stations, fire stations, crimes
per capita, …
Financial satisfaction
prices, incomes, housing, savings, debt,
insurance, pension, …
Level of opportunity
jobs, unemployment, education, re-education, …
Environmental needs and efficiency
green space, air quality,…
9
PLUQI – potential usage
• Place recommendation for travel agencies or travelers
• Policy analysis and optimization for (local) government
• Understanding the citizen’s voice and demands regarding
environmental conservation
• Commercial impact analysis for retailer and franchises
• Location recommendation and understanding local issues
for real estate
• Risk analysis and management for insurance and
financial companies
• Local marketing and sales force optimization for
marketers
10
Open Data
• Businesses can develop new ideas, services and applications;
improve decision making, cost savings
• Can increase government transparency and accountability, quality
of public services
• Citizens get better and timely access to public services
11
Source: McKinsey
http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information
Gartner:
By 2016, the use of "open data" will continue to
increase — but slowly, and predominantly limited to
Type A enterprises.
By 2017, over 60% of government open data
programs that do not effectively use open data
internally, will be scaled back or discontinued.
By 2020, enterprises and governments will fail to
protect 75% of sensitive data and will declassify and
grant broad/public access to it.
Source: Gartner
http://training.gsn.gov.tw/uploads/news/6.Gartner+ExP+Briefing_Open+Data_JUN+2014_v2.pdf
Lots of open datasets on the Web…
• A large number of datasets have been published as open data in
recent years
• Many kinds of data: cultural, science, finance, statistics, transport,
environment, …
• Popular formats: tabular (e.g. CSV, XLS), HTML, XML, JSON, …
12
…but few actually used
• Few applications utilizing open
and distributed datasets at present
• Challenges for data consumers
– Data quality issues
– Difficult or unreliable data access
– Licensing issues
• Challenges for data publishers
– Lack of expertise & resources: not easy to publish & maintain high
quality data
– Unclear monetization & sustainability
13
Open Data Portal | Datasets  | Applications
data.gov         | ~ 200 000 | ~ 80
publicdata.eu    | ~ 48 000  | ~ 85
data.gov.uk      | ~ 31 000  | ~ 390
data.norge.no    | ~ 620     | ~ 60
data.gov.my      | ~ 1065    | ~ 10
Lots of datasets are in tabular format
– Records organized in silos of
collections
– Very few links within and/or
across collections
– Difficult to understand the nature
of the data
– Difficult to integrate / query
14
europeandataportal.eu
Tim Berners-Lee's 5 stars open data rating system
1 star: Openly available on the web as a document
2 stars: Available under structured format (XLS)
3 stars: Available under non-proprietary formats (CSV)
4 stars: Uses URIs to denote things
5 stars: Linked to other data to provide context
15
1-Star Benefits
Consumers:
• Ability to look at, print, store, modify and share data
• Ability to use data as input to a system
Publishers:
• Easily publish data
• Ensure transparency
5-Star Benefits
Consumers:
• Discover more (related) data while consuming the data
• Directly learn about the data schema
• Caveat: have to deal with broken data links
• Caveat: trust issues
Publishers:
• Make data discoverable
• Increase the value of data
• Gain the same benefits from the links as the consumers
• Caveat: need to invest resources to link data
• Caveat: may need to clean data
16
…
Tabular Data Graph Data
• Lots of open datasets are in tabular format
• CSV, Excel, TSV, etc.
• Records organized in silos of collections
• Very few links within and/or across
collections
• Difficult to understand the nature of the data
• Difficult to integrate / query
Based on Linked Data
• Method for publishing data on the Web
• Self-describing data and relations
• Interlinking
• Accessed using semantic queries
• Open standards by W3C
− Data format: RDF
− Knowledge representation: RDFS/OWL
− Query language: SPARQL
http://www.w3.org/standards/semanticweb/data
europeandataportal.eu
17
Tabular
Data
Graph
Data
18
(Open) Data Quality Issues
Tabular data
Tabular data is data that is structured into rows and columns
Correspondence with reality:
1) Each row represents an entity
2) Each column header represents an attribute of the entity
3) Each column value represents a value of that attribute
4) Each table represents a collection of entities
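A minimal Python sketch of this correspondence, using only the standard library. The sample rows and column names are invented for illustration and are not taken from the slides:

```python
import csv
import io

# Hypothetical sample data: each row is one entity (a registration record).
SAMPLE_CSV = """region,make,year,count
Oslo,Volvo,2014,120
Bergen,Toyota,2014,95
"""

def read_entities(text):
    """Parse CSV text into a list of dicts: one dict per row (entity),
    keyed by column header (attribute)."""
    return list(csv.DictReader(io.StringIO(text)))

# The list as a whole is the collection; each dict maps attribute -> value.
entities = read_entities(SAMPLE_CSV)
for entity in entities:
    print(entity)
```

Note that every value is still a string at this point; typing the values is one of the cleaning steps discussed later.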
20
Tabular data files
Tabular data can be stored in different formats:
• Tabular Text Formats (pure tabular data)
Delimiter-separated values:
- CSV – comma-separated values
- Less common variants, such as TSV (tab-separated values) and
colon-separated values
• Spreadsheet Formats (meta-data information about the document,
tabular data, formulas)
- XLS (Excel spreadsheet)
- XLSX (Excel 2007 format)
21
Tabular data quality issues
When a dataset does not satisfy specified data quality
criteria, it means that it contains data quality issues.
In order to provide higher data quality, these quality
issues should be detected and removed.
22
Types of quality issues
23
Types of quality issues
24
Types of quality issues
25
What types of data quality issues can occur?
26
Types of quality issues
27
Types of quality issues
Actual information model:
order
street
house
28
Types of quality issues
Actual information model:
order
has address
address
29
Types of quality issues
30
Types of quality issues
Data model:
observation
has make
make
31
Types of quality issues
Data model:
observation
make
year
number
32
Summary of data quality issues
33
How to resolve data quality issues?
Workflow:
1) Identify data quality issues
2) Define transformation functions to resolve them
3) Execute transformation and verify the result
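The three workflow steps can be sketched in plain Python as follows. This is only an illustration under simple assumptions (rows are dicts of strings, a "missing value" is an empty cell, a "duplicate" is an exact copy); the function names are hypothetical and are not DataGraft functions:

```python
def find_issues(rows):
    """Step 1: report indexes of rows with missing values and of exact duplicates."""
    issues = {"missing": [], "duplicate": []}
    seen = set()
    for i, row in enumerate(rows):
        if any(value.strip() == "" for value in row.values()):
            issues["missing"].append(i)
        key = tuple(sorted(row.items()))
        if key in seen:
            issues["duplicate"].append(i)
        seen.add(key)
    return issues

def clean(rows, default="unknown"):
    """Step 2: fill missing values with a default and drop exact duplicates."""
    result, seen = [], set()
    for row in rows:
        fixed = {k: (v if v.strip() else default) for k, v in row.items()}
        key = tuple(sorted(fixed.items()))
        if key not in seen:
            seen.add(key)
            result.append(fixed)
    return result

rows = [
    {"make": "Volvo", "year": "2014"},
    {"make": "Volvo", "year": "2014"},   # duplicate row
    {"make": "Toyota", "year": ""},      # missing value
]
before = find_issues(rows)
cleaned = clean(rows)
after = find_issues(cleaned)             # Step 3: verify the result
```

Re-running the detection on the transformed data is the verification step: the transformation is accepted only if the issue report comes back empty.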
34
Transformation function types
By scope:
• Functions on rows
• Functions on columns
• Functions transforming entire dataset
By effect:
• Data reordering functions
• Data extraction functions
• Data manipulation functions
• Data enrichment functions
35
Transformation functions
Scope | Name | Description | Effect
Rows | Add Row | Create a new record in a dataset | Data enrichment
Rows | Take/Drop Rows | Extract only relevant rows by index | Data extraction. Resolves issues: “Rows describing entities not belonging to a collection”
Rows | Shift Row | Change row's position inside a dataset | Data reordering, simplifies quality issues detection
Rows | Filter Rows | Extract only relevant rows by condition | Data extraction. Resolves issues: “Rows describing entities not belonging to a collection”
Entire dataset | Remove Duplicates | Remove similar rows | Data extraction. Resolves issues: “Duplicate rows”
Entire dataset | Sort Dataset | Sort dataset by given column names in given order | Data reordering, simplifies quality issues detection
Entire dataset | Reshape Dataset (Melt) | Move columns to rows | Data manipulation. Resolves issues: “Column headers containing attribute values”
Entire dataset | Reshape Dataset (Cast) | Move rows to columns by categorizing and aggregating | Data enrichment, simplifies quality issues detection
Entire dataset | Group and Aggregate | Group values by column or multiple columns and perform aggregation | Data enrichment, simplifies quality issues detection
Columns | Add Column | Add a column with a manually specified value | Data enrichment
Columns | Derive Column | Add a column with values computed from other columns | Data enrichment
Columns | Take/Drop Columns | Take or drop selected column(s) | Data extraction. Resolves issues: “Columns not related to model”
Columns | Shift Column | Arbitrarily change column's order | Data reordering, simplifies quality issues detection
Columns | Merge Columns | Merge columns using custom separator | Data manipulation. Resolves issues: “Single value is split across multiple columns”
Columns | Split Column | Split column using custom separator | Data manipulation. Resolves issues: “Multiple values stored in one column”
Columns | Rename Columns | Change column headers | Data manipulation. Resolves issues: “Incorrect column headers”
Columns | Map Columns | Apply function to all values in a column | Data manipulation. Resolves issues: “Illegal values”, “Missing values”, “Inconsistent values”
36
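As one example from the table, "Reshape Dataset (Melt)" can be sketched in a few lines of Python. The function signature and the sample data are assumptions made for this illustration; the table itself does not prescribe an implementation:

```python
def melt(rows, id_columns, variable_name, value_name):
    """Turn every non-id column into its own row: (ids..., variable, value)."""
    melted = []
    for row in rows:
        ids = {c: row[c] for c in id_columns}
        for column, value in row.items():
            if column not in id_columns:
                melted.append({**ids, variable_name: column, value_name: value})
    return melted

# Hypothetical wide table: the year headers are really attribute values.
wide = [
    {"make": "Volvo", "2013": 100, "2014": 120},
    {"make": "Toyota", "2013": 80, "2014": 95},
]
long_rows = melt(wide, ["make"], "year", "count")
for row in long_rows:
    print(row)
```

After melting, "year" is an ordinary column, so the issue "column headers containing attribute values" no longer applies.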
Tabular data cleaning tools
• CLI tools (e.g. Unix awk, csvkit, CSVfix) – lack a convenient user interface
• Programming languages and libraries for data analysis (R, agate for
Python) – users need programming knowledge
• Spreadsheet software (Microsoft Excel, LibreOffice Calc, Google
Spreadsheets) – not originally created for data cleaning, hard to debug,
code is mixed up with data
• Frameworks/tools designed for interactive data cleaning and
transformation in the ETL process
37
Example: vehicle registration data
https://www.ssb.no/statistikkbanken/selectvarval/Define.asp?subjectcode=&ProductId=&MainTable=RegKjoretoy&nvl=&PLanguage=1&nyTmpVar=true&CMSSubjectArea=transport-og-reiseliv&KortNavnWeb=bilreg&StatVariant=&checked=true
38
Example: vehicle registration data
(continued)
* Data obtained from StatBank Norway https://www.ssb.no/en/statistikkbanken
39
Map columns – applying a function to all
values in a column
Effect: data manipulation
Resolves anomalies: Illegal values, Missing values, Inconsistent values
Required parameters:
For all columns that should be mapped
1) Name of column to manipulate
2) Name of function to apply
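A minimal sketch of "Map columns" in Python, taking exactly the two parameters listed above (column name and function) as a dict. The helper names and the sample cell values are hypothetical:

```python
def map_columns(rows, functions):
    """Apply functions[column] to every value of that column; copy other columns."""
    return [
        {col: functions[col](val) if col in functions else val
         for col, val in row.items()}
        for row in rows
    ]

def parse_count(text):
    """Turn a cell such as '1 200' into an int; return None when it holds no number."""
    digits = text.replace(" ", "")
    return int(digits) if digits.isdigit() else None

rows = [{"make": " volvo ", "count": "1 200"}, {"make": "TOYOTA", "count": ".."}]
mapped = map_columns(rows, {"make": lambda s: s.strip().title(), "count": parse_count})
print(mapped)
```

Here one function normalises inconsistent spellings and the other turns illegal or missing values into an explicit None.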
40
Before:
Map columns – apply function to all values in
a column
41
After:
Map columns – apply function to all values in
a column
42
Derive column – add a column with values
computed from others
Effect: data enrichment
Adds new information to data
Required parameters:
1) Name of derived column
2) Column(s) to derive from
3) Function to derive with
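The three parameters above map directly onto a small Python function. This is a sketch with invented column names, not the DataGraft implementation:

```python
def derive_column(rows, new_name, source_columns, function):
    """Return copies of the rows with new_name = function(*source values)."""
    return [
        {**row, new_name: function(*(row[c] for c in source_columns))}
        for row in rows
    ]

# Hypothetical input: registrations split by fuel type.
rows = [{"make": "Volvo", "petrol": 70, "diesel": 50}]
derived = derive_column(rows, "total", ["petrol", "diesel"], lambda p, d: p + d)
print(derived)
```

The original rows are left untouched, which keeps the transformation repeatable.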
43
Before:
Derive column – add a column with values
computed from others
44
After:
Derive column – add a column with values
computed from others
45
Cast dataset – move rows to columns by
categorizing and aggregating
Effect: data enrichment
Adds new information to data, simplifies anomaly detection
Required parameters:
1) Column name for variable (what to categorize and put to headers)
2) Column name for value (on what to perform aggregations)
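A rough Python sketch of the cast operation. Besides the two parameters listed above, it assumes an explicit id column to group by and a default aggregation of sum; both are choices made for this illustration:

```python
from collections import defaultdict

def cast(rows, id_column, variable_column, value_column, aggregate=sum):
    """Pivot long rows to wide: one output row per id, one column per
    distinct variable value, cells combined with `aggregate`."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        groups[row[id_column]][row[variable_column]].append(row[value_column])
    return [
        {id_column: key, **{var: aggregate(vals) for var, vals in cells.items()}}
        for key, cells in groups.items()
    ]

# Hypothetical long table with two 2013 observations for the same make.
long_rows = [
    {"make": "Volvo", "year": "2013", "count": 60},
    {"make": "Volvo", "year": "2013", "count": 40},
    {"make": "Volvo", "year": "2014", "count": 120},
    {"make": "Toyota", "year": "2014", "count": 95},
]
wide = cast(long_rows, "make", "year", "count")
print(wide)
```

Cells with no observations are simply absent from the output row, which makes gaps in the data easy to spot.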
46
Before:
Cast dataset – move rows to columns by
categorizing and aggregating
47
After:
Cast dataset – move rows to columns by
categorizing and aggregating
48
RDF mapping
Reuse of existing vocabularies is encouraged, since it helps to interlink data.
49
50
RDF mapping
http://vocabs.datagraft.net/vehicles
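To illustrate what an RDF mapping does, the sketch below turns one table row into triples. Only the vocabulary URL comes from the slide; the "#" separator, the term names (Registration, make, year) and the base URI are assumptions made for this example:

```python
VEHICLES = "http://vocabs.datagraft.net/vehicles#"   # '#' and term names are assumed
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/registration/"            # hypothetical base URI

def row_to_triples(row_id, row):
    """Map one row to triples: the row becomes a subject URI, each column
    becomes a predicate from the vocabulary, each cell becomes a literal."""
    subject = BASE + str(row_id)
    triples = [(subject, RDF_TYPE, VEHICLES + "Registration")]
    for column, value in row.items():
        triples.append((subject, VEHICLES + column, str(value)))
    return triples

triples = row_to_triples(1, {"make": "Volvo", "year": 2014})
for triple in triples:
    print(triple)
```

Taking the predicates from a shared vocabulary rather than inventing new ones per dataset is what lets independently published datasets be queried together.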
51
Linked (Open) Data
RDF, RDFS, SPARQL
Linked Data
• Method for publishing data on the Web
• Self-describing data and relations
• Interlinking
• Accessed using semantic queries
http://www.w3.org/standards/semanticweb/data
53
Linked open data cloud
By Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak - http://lod-cloud.net/, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=36956792
54
Linked Data principles
• Every thing is represented by a URI
• URIs of things can be dereferenced
• Things are linked to other things by relating their URIs
55
Linked Data technology
• Data format: RDF
• Knowledge representation: RDFS/OWL
• Query language: SPARQL
• Linking medium: HTTP
56
Graph data structure
Alice
Jim
Peter
57
RDF in reality: using URLs to identify things
58
Resource Description Framework (RDF)
Basics
• RDF makes statements about resources (entities)
o Triple data model: subject -> predicate -> object (Alice's age is 34)
• Subjects and objects:
o Resources (URIs of entities) – can have properties related to them (http://my-domain.com/Alice)
o Literals – constant values ("female", "3.14159"); cannot be subjects
o Blank nodes – used to specify composite properties (e.g., an address, which is composed
of a country, city, street name, house number, zip code etc.)
• Relationships (a.k.a. predicates) – relate one subject to one object
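The triple model can be sketched with two small Python types. This is a simplified illustration (blank nodes are left out, URIs are plain strings); the use of foaf:age for "Alice's age is 34" is an assumption, since the slide does not name a predicate:

```python
from typing import NamedTuple

class Literal(NamedTuple):
    """A constant value; may only appear in the object position."""
    value: str

class Triple(NamedTuple):
    subject: str      # URI of a resource
    predicate: str    # URI of a relationship
    obj: object       # URI string or Literal

def make_triple(subject, predicate, obj):
    """Build a triple, rejecting literals in the subject position."""
    if isinstance(subject, Literal):
        raise ValueError("a literal cannot be the subject of a triple")
    return Triple(subject, predicate, obj)

ALICE = "http://my-domain.com/Alice"
AGE = "http://xmlns.com/foaf/0.1/age"   # assumed predicate for the age statement
statement = make_triple(ALICE, AGE, Literal("34"))
print(statement)
```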
59
RDF serialisation formats
• Turtle family of RDF languages (N-Triples, Turtle, TriG and N-Quads)
60
<http://example.org/bob#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://example.org/bob#me> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice#me> .
<http://example.org/bob#me> <http://schema.org/birthDate> "1990-07-04"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://example.org/bob#me> <http://xmlns.com/foaf/0.1/topic_interest> <http://www.wikidata.org/entity/Q12418> .
<http://www.wikidata.org/entity/Q12418> <http://purl.org/dc/terms/title> "Mona Lisa" .
<http://www.wikidata.org/entity/Q12418> <http://purl.org/dc/terms/creator> <http://dbpedia.org/resource/Leonardo_da_Vinci> .
<http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619> <http://purl.org/dc/terms/subject> <http://www.wikidata.org/entity/Q12418> .
• JSON-LD (JSON-based RDF syntax)
{
  "@context": "example-context.json",
  "@id": "http://example.org/bob#me",
  "@type": "Person",
  "birthdate": "1990-07-04",
  "knows": "http://example.org/alice#me",
  "interest": {
    "@id": "http://www.wikidata.org/entity/Q12418",
    "title": "Mona Lisa",
    "subject_of": "http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619",
    "creator": "http://dbpedia.org/resource/Leonardo_da_Vinci"
  }
}
RDF serialisation formats (continued)
• RDFa (for HTML and XML embedding)
61
<body prefix="foaf: http://xmlns.com/foaf/0.1/ schema: http://schema.org/ dcterms: http://purl.org/dc/terms/">
<div resource="http://example.org/bob#me" typeof="foaf:Person">
<p>Bob knows <a property="foaf:knows" href="http://example.org/alice#me">Alice</a>
and was born on the <time property="schema:birthDate" datatype="xsd:date">1990-07-04</time>.</p>
<p>Bob is interested in <span property="foaf:topic_interest"
resource="http://www.wikidata.org/entity/Q12418">the Mona Lisa</span>.</p>
</div>
<div resource="http://www.wikidata.org/entity/Q12418">
<p>The <span property="dcterms:title">Mona Lisa</span> was painted by
<a property="dcterms:creator" href="http://dbpedia.org/resource/Leonardo_da_Vinci">Leonardo da Vinci</a>
and is the subject of the video
<a href="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">'La Joconde à
Washington'</a>. </p>
</div>
<div resource="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">
<link property="dcterms:subject" href="http://www.wikidata.org/entity/Q12418"/>
</div>
</body>
RDF serialisation formats (continued)
• RDF/XML (XML syntax for RDF)
62
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:schema="http://schema.org/">
<rdf:Description rdf:about="http://example.org/bob#me">
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
<schema:birthDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1990-07-04</schema:birthDate>
<foaf:knows rdf:resource="http://example.org/alice#me"/>
<foaf:topic_interest rdf:resource="http://www.wikidata.org/entity/Q12418"/>
</rdf:Description>
<rdf:Description rdf:about="http://www.wikidata.org/entity/Q12418">
<dcterms:title>Mona Lisa</dcterms:title>
<dcterms:creator rdf:resource="http://dbpedia.org/resource/Leonardo_da_Vinci"/>
</rdf:Description>
<rdf:Description rdf:about="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">
<dcterms:subject rdf:resource="http://www.wikidata.org/entity/Q12418"/>
</rdf:Description>
</rdf:RDF>
RDF Schema (RDFS)
• basic capabilities for describing RDF vocabularies
• includes concepts to describe:
o classes, class hierarchies (sub-classes) and instances (typing)
o non-standard literal data types
o property hierarchies (sub-properties)
o predicate domain and range
o utility properties (labels, comments, additional information
about things, definitions of resources)
o …
63
Linked data vocabulary sources
64
Querying RDF: SPARQL
• RDF Query language
– Based on graph matching
• Uses SQL-like syntax
• Query types:
– SELECT – table of raw values
– CONSTRUCT, DESCRIBE – RDF graph
– ASK – boolean
65
SPARQL querying – example graph
[Figure: example graph with the nodes a:Alice, b:Peter and c:Jim, connected by foaf:knows edges, with an rdf:type edge to foaf:Person and the foaf:nick literals "Lissy", "Pety" and "Jimbo"]
66
SPARQL querying – query
Question: What are the nicknames of people that Alice knows?
Query:
PREFIX a: <http://alice.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?nickname WHERE {
  a:Alice foaf:knows ?someone .
  ?someone foaf:nick ?nickname .
}
Graph pattern: a:Alice -> foaf:knows -> ?someone -> foaf:nick -> ?nickname
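How graph matching answers this question can be sketched without any RDF library. The triples below are an approximation of the example graph: the URIs behind the b: and c: prefixes are hypothetical, and the assignment of nicknames to people is inferred from the names:

```python
FOAF = "http://xmlns.com/foaf/0.1/"
ALICE = "http://alice.org/Alice"
PETER = "http://example.org/b/Peter"   # hypothetical URI for b:Peter
JIM = "http://example.org/c/Jim"       # hypothetical URI for c:Jim

graph = {
    (ALICE, FOAF + "knows", PETER),
    (ALICE, FOAF + "knows", JIM),
    (ALICE, FOAF + "nick", "Lissy"),
    (PETER, FOAF + "nick", "Pety"),
    (JIM, FOAF + "nick", "Jimbo"),
}

def match(pattern, triple, bindings):
    """Extend bindings so that pattern equals triple; return None on mismatch."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                      # variable
            if result.setdefault(term, value) != value:
                return None
        elif term != value:                           # constant
            return None
    return result

def query(patterns, triples):
    """Evaluate triple patterns one after another, joining on shared variables."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions

patterns = [
    (ALICE, FOAF + "knows", "?someone"),
    ("?someone", FOAF + "nick", "?nickname"),
]
nicknames = sorted(s["?nickname"] for s in query(patterns, graph))
print(nicknames)
```

The shared variable ?someone is what joins the two patterns: only people reached through foaf:knows from Alice are asked for their nickname, so "Lissy" is not returned.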
67
SPARQL querying – matching to the graph
[Figure: the example graph (a:Alice, b:Peter, c:Jim with foaf:knows, foaf:nick and rdf:type edges) with the query pattern matched against it]
68
SPARQL querying – result
Query:
PREFIX a: <http://alice.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?nickname WHERE {
  a:Alice foaf:knows ?someone .
  ?someone foaf:nick ?nickname .
}
nickname
"Pety"
"Jimbo"
69
Data integration using Linked Data: using
URIs
Example: Relational DB or spreadsheet – dataset about scientific publications:
ID | Name  | Home page
1  | Alice | http://alice.org/
2  | Tim   | https://www.w3.org/People/Berners-Lee/

ID author | ISBN              | Publication topic
1         | 978-3-16-148410-0 | "On the frictional coefficient of bananas"
1         | 534-1-22-663975-1 | "Do woodpeckers get headaches?"
2         | 1-933019-33-6     | "The Semantic Web"
70
Data integration using Linked Data: using
URIs (continued)
Graph representation of new dataset:
[Figure: a:Alice is linked to http://.../978-3-16-148410-0 (foaf:topic "On the frictional coefficient of bananas") and http://.../534-1-22-663975-1 (foaf:topic "Do woodpeckers get headaches?"); t:Tim is linked via foaf:publications to http://.../1-933019-33-6 (foaf:topic "The Semantic Web")]
71
Data integration using Linked Data: Using
URIs (continued)
Same URI!
72
Data integration using Linked Data: Using
URIs (continued)
Resulting graph:
[Figure: the social graph (a:Alice, b:Peter, c:Jim with foaf:knows, foaf:nick "Lissy", "Pety", "Jimbo" and rdf:type foaf:Person) joined at a:Alice with the publications …978-3-16-148410-0 (foaf:topic "On the frictional coefficient of bananas") and …534-1-22-663975-1 (foaf:topic "Do woodpeckers get headaches?")]
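The integration step itself is just a set union of triples, which works because both datasets use the same URI for Alice. A minimal sketch, in which the publication base URI is hypothetical (the slides elide it) and the link from Alice to her publication is assumed to be foaf:publications:

```python
FOAF = "http://xmlns.com/foaf/0.1/"
ALICE = "http://alice.org/Alice"                 # same URI in both datasets
PUB = "http://example.org/publication/"          # hypothetical base URI

social = {
    (ALICE, FOAF + "nick", "Lissy"),
    (ALICE, FOAF + "knows", "http://example.org/b/Peter"),
}
publications = {
    (ALICE, FOAF + "publications", PUB + "978-3-16-148410-0"),
    (PUB + "978-3-16-148410-0", FOAF + "topic",
     "On the frictional coefficient of bananas"),
}

def describe(graph, subject):
    """Return all (predicate, object) pairs for a subject, sorted for stable output."""
    return sorted((p, o) for s, p, o in graph if s == subject)

merged = social | publications                   # integration = union of triples
about_alice = describe(merged, ALICE)
for predicate, obj in about_alice:
    print(predicate, obj)
```

No join keys or schema alignment are needed here: statements from both sources attach to the same node because the subject URI is identical.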
73
Query federation using SPARQL
74
Linked Data is great for Open Data
• Linked Data is a great means to represent data
– Semantics are part of the data
– Naturally linked to other data
– Querying language
• How Linked Data can improve Open Data:
– Easier integration, free data from silos
– Seamless interlinking of data
– Understand the data
– New ways to query and interact with data
75
… but has been ignored by the mainstream
• Difficult to make it accessible to people
– Publishers
– Developers
– Data workers
• Challenges with using Linked Data
– Lack of tooling and expertise to publish high quality Linked Data
– Lack of resources to host LOD endpoints / unreliable data access
• DataGraft: packaging Linked Data to make it more
approachable to the open data community
76
Data-as-a-Service: DataGraft
78
“Data is the new oil”
…but many of us just need gasoline
Data-as-a-Service
…is the new filling station
Data-as-a-Service
• Outsourcing of various data operations to the cloud
• Eliminates
– upfront costs on data infrastructure
– ongoing investment of time and resources in managing the data
infrastructure
• Complete package for
– transformation of raw data into meaningful data assets
– reliable delivery of data assets
79
DataGraft was developed to allow data workers to manage their data in a
simple, effective, and efficient way, with powerful data transformation and
reliable data access capabilities
80
DataGraft
Data Transformation and
RDF Publication Process
• Interactive design of transformations?
• Repeatable transformations?
• Reuse/share transformations (user-based access)?
• Cloud-based deployment of transformations?
• Self-serviced process?
• Data and Transformation as-a-Service?
81
[Figure: process diagram. Tabular data (raw data) is transformed into prepared data, mapped to ontologies (ontology mapping), and used to generate an RDF graph (graph data) that is stored in an RDF triple store]
DataGraft: Data-as-a-Service
For the Data Transformation and RDF Publication Process
82
83
https://www.ssb.no/statistikkbanken
Example: Using statistical data
Data records (rows)
Add row
Take row(s)
Drop row(s)
Shift row
Filter rows (grep)
Remove duplicate rows
Entire dataset
Sort
Reshape dataset
Group (categorize) and aggregate
Columns
Add column(s)
Take column(s)
Drop column(s)
Move column
Merge columns
Split column
Rename column(s)
Apply function to all values in a column
103
Data pages and federated querying
108
What is the
population of
locations and
total number of
persons employed
in Human health
and social work
activities?
Configuring data visualizations
109
APIs
DataGraft key feature:
Flexible management and sharing of data
and transformations
Fork, reuse and extend
transformations built by other
professionals from DataGraft’s
transformations catalog
Interactively build,
modify and share data
transformations
Share transformations
privately or publicly
Reuse transformations to
repeatably clean and
transform spreadsheet
data
Programmatically access transformations
and the transformation catalogue
114
Reuse of transformations in environmental
data publishing
TRAGSA Pilot
• Number of
transformations: 42
– Created via reuse: 25
• Number of triples:
– ~ 7.7M
ARPA Pilot
• Number of
transformations: 5
– Created via reuse: 2
• Number of triples:
– ~ 14K
115
Forking/reusing transformations helped us spend less
time on creating new transformations
DataGraft key feature:
Reliable data hosting and querying services
Host data on DataGraft’s
reliable, cloud-based
semantic graph database
Share data privately or
publicly
Query data through
your own SPARQL
endpoint
Programmatically
access the data
catalogue
116
Operations & maintenance
performed on behalf of users
DataGraft Enablers
• Grafter
• Grafterizer
• Semantic Graph DBaaS
• Data Portal
117
DataGraft – 1 package 2 audiences
DataGraft
Data Publisher Application Developer
Helping
integrating and
publishing data
Giving better,
easier tools
118
Examples and Demo
The context: Statsbygg
120
• A public sector administration
company
• Norwegian government's key
advisor in construction and
property affairs
• Building commissioner
• Property manager
• Property developer
• Interest:
Exploit/Share
property data in
novel ways
• For efficiency and sustainability of
the property included in the
government's civil estate
Example: Reporting state-owned
real estate properties in Norway
Example: Reporting state-owned
real estate properties in Norway (cont’)
Report
• Published as a 314-page hard copy and as a PDF file
• 6 Person-Months
• Data collection with spreadsheets
• Quality assurance through e-mails and phone correspondence
Pains
• Time consuming
• Poor data quality
• Static report without live updating
Reporting Service
• Live service
• Efficient sharing of data
• Simplified integration with external datasets
• Live updating
• Reliable access
• …
3rd party services
• Risk and vulnerability analysis, e.g. buildings affected by flooding
• Analysis of leasing prices
121
Sample data
122
Cleaning, Transformation, Publishing,
Integration, Querying, Visualization,
Service Access
Demo Scenario
• Interactively create tabular data transformations
• Reuse/extend data transformations (incl. data
annotations)
• RDF data publication and querying
• Integrating and visualising data from different
sources
• (Using 3rd party tools with DataGraft)
123
Demo sample data
124
Cleaning, Transformation, Publishing,
Integration, Querying, Visualization,
Service Access
Demo sample data
125
Cleaning, Transformation, Publishing,
Integration, Querying, Visualization,
Service Access
Benefits of DataGraft in use cases
• Simplified data publishing process
• Integration with external data sources using
established web standards
• Data that was not publicly available – now published
(e.g. air quality data in Oslo)
• Time-efficient publishing
• Repeatable data transformation process
126
DataGraft and Big Data
• Desired features:
– real-time interactivity
– large datasets batch transformation capability
We are developing a hybrid solution to work with both
batch and real-time processing.
127
DataGraft and Big Data:
High-level architecture
128
DataGraft – targeted impacts
Reduction in costs
for organisations which lack
sufficient expertise and resources to
make their data available
Reduction of the dependency
of data owners on generic Cloud platforms
to build, deploy and maintain their linked
data from scratch
Increase in the speed of
publishing
new datasets and updating existing
datasets
Reduction in the cost and
complexity of developing
applications that use data
Increase in the reuse of data
by providing reliable access to numerous
datasets hosted on DataGraft.net
129
Example: The benefit of DataGraft in PLUQI
Before: Datasets gathering -> Data transformation -> Data provisioning/access -> Implementing App
After (with DataGraft): Datasets gathering -> Data transformation -> Data provisioning/access -> Implementing App
1. 23% of development cost reduction
• Reducing cost for implementing transformations
• Integrating the process is simpler
2. Able to focus on service quality
• Gathering enough good datasets
• Designing/implementing
130
DataGraft in numbers
(as of end of Jan 2016)
131
• 238 registered users
• 607 registered data transformations (208 public)
• 1828 uploaded files
• 192 public data pages
DataGraft in the wild
• Investigating crime data in small geographies
• Used DataGraft to transform data and publish RDF
http://benproctor.co.uk/investigating-crime-data-at-small-geographies/
132
Data Science and DataGraft
Greater Data Science:
1. Data Exploration and
Preparation
2. Data Representation and
Transformation
3. Computing with Data
4. Data Visualization and
Presentation
5. Data Modeling
6. Science about Data Science
133
“50 years of Data Science” by David Donoho
http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf
DataGraft
https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/
134
135
Summary
• DataGraft – emerging Data-as-a-Service solution for
making (linked) data more accessible
– Platform, portal, methodology, APIs
– Online service, functional and documented
– Validated through several use cases
• Key features:
– Support for Sharable/Repeatable/Reusable Data
Transformations
– Reliable RDF Database-as-a-Service
136
https://datagraft.net
Thank you!
Contact: dumitru.roman@sintef.no
137
138

Weitere ähnliche Inhalte

Was ist angesagt?

Delivering Quality Open Data by Chelsea Ursaner
Delivering Quality Open Data by Chelsea UrsanerDelivering Quality Open Data by Chelsea Ursaner
Delivering Quality Open Data by Chelsea UrsanerData Con LA
 
Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
 Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
Big Data Fabric for At-Scale Real-Time Analysis by Edwin RobbinsData Con LA
 
An introduction to data virtualization in business intelligence
An introduction to data virtualization in business intelligenceAn introduction to data virtualization in business intelligence
An introduction to data virtualization in business intelligenceDavid Walker
 
Simplifying Cloud Architectures with Data Virtualization
Simplifying Cloud Architectures with Data VirtualizationSimplifying Cloud Architectures with Data Virtualization
Simplifying Cloud Architectures with Data VirtualizationDenodo
 
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Zaloni
 
Designing the Next Generation Data Lake
Designing the Next Generation Data LakeDesigning the Next Generation Data Lake
Designing the Next Generation Data LakeRobert Chong
 
Data Lakes: 8 Enterprise Data Management Requirements
Data Lakes: 8 Enterprise Data Management RequirementsData Lakes: 8 Enterprise Data Management Requirements
DataGraft: Data-as-a-Service for Open Data

  • 1. DataGraft Data-as-a-Service for Open Data Dumitru Roman dumitru.roman@sintef.no https://datagraft.net
  • 2. About me • Education – Eng (2003), Technical University of Cluj-Napoca, Romania – PhD (2008), University of Innsbruck, Austria • Current positions – Senior Research Scientist, SINTEF, Norway – Associate Professor, University of Oslo, Norway • Expertise and responsibilities – Initiating, leading, and carrying out (research-intensive) projects on data management and service-oriented topics – Involved with over 20 large-scale R&D projects at the European level during the past 12 years 2
  • 3. “Technology for a better society” • Public and private companies • Data owners • Data publishers • Data integrators and aggregators • Developers • Improved data access • Data-driven decision making • Cost reduction when working with data • Reduction on the dependency on generic infrastructures providers (e.g. generic cloud) • Increase in the speed of making data available • Increase in the reuse of data • Data cleaning • Data transformation • Data publication • Data-as-a-Service • Open data • Linked data (RDF, SPARQL) DataGraft 3
  • 5. Outline Session #1: Open Data • Open Data • (Open) Data Quality Issues • Linked (Open) Data – RDF, RDFS, SPARQL Session #2: DataGraft • Data-as-a-Service: DataGraft • Examples and Demo • Big Data and DataGraft • Open Data in Malaysian context (by Dennis Gan) • (Optional: Hands on) 5 What is Open Data? What is Linked Data? Challenges in (Linked Open) Data? How to publish Linked Open Data? Linked Open Data Use Cases? (Linked) Open Data and Big Data?
  • 7. What can open data do for you? (Source: The ODI, https://vimeo.com/110800848) 7
  • 8. Open Data …is changing the nature of business ...reflects a cultural shift to a more open society 8
  • 9. Example: Personalized and Localized Urban Quality Index (PLUQI). The index includes data from various domains: Daily life satisfaction (weather, transportation, community, …); Healthcare level (number of doctors, hospitals, suicide statistics, …); Safety and security (number of police stations, fire stations, crimes per capita, …); Financial satisfaction (prices, incomes, housing, savings, debt, insurance, pension, …); Level of opportunity (jobs, unemployment, education, re-education, …); Environmental needs and efficiency (green space, air quality, …)
  • 10. PLUQI – potential usage • Place recommendation for travel agencies or travelers • Policy analysis and optimization for (local) government • Understanding the citizen’s voice and demands regarding environmental conservation • Commercial impact analysis for retailer and franchises • Location recommendation and understanding local issues for real estate • Risk analysis and management for insurance and financial companies • Local marketing and sales force optimization for marketers 10
  • 11. Open Data • Businesses can develop new ideas, services and applications; improve decision making; achieve cost savings • Can increase government transparency and accountability, and the quality of public services • Citizens get better and more timely access to public services. Source: McKinsey http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information Gartner: By 2016, the use of "open data" will continue to increase, but slowly, and predominantly limited to Type A enterprises. By 2017, over 60% of government open data programs that do not effectively use open data internally will be scaled back or discontinued. By 2020, enterprises and governments will fail to protect 75% of sensitive data and will declassify and grant broad/public access to it. Source: Gartner http://training.gsn.gov.tw/uploads/news/6.Gartner+ExP+Briefing_Open+Data_JUN+2014_v2.pdf
  • 12. Lots of open datasets on the Web… • A large number of datasets have been published as open data in the recent years • Many kinds of data: cultural, science, finance, statistics, transport, environment, … • Popular formats: tabular (e.g. CSV, XLS), HTML, XML, JSON, … 12
  • 13. …but few actually used • Few applications utilizing open and distributed datasets at present • Challenges for data consumers – Data quality issues – Difficult or unreliable data access – Licensing issues • Challenges for data publishers – Lack of expertise & resources: not easy to publish & maintain high-quality data – Unclear monetization & sustainability • Open data portals (datasets / applications): data.gov ~200 000 / ~80; publicdata.eu ~48 000 / ~85; data.gov.uk ~31 000 / ~390; data.norge.no ~620 / ~60; data.gov.my ~1065 / ~10
  • 14. Lots of datasets are in tabular format – Records organized in silos of collections – Very few links within and/or across collections – Difficult to understand the nature of the data – Difficult to integrate / query 14 europeandataportal.eu
  • 15. Tim Berners-Lee's 5-star open data rating system: (1) Openly available on the web as a document; (2) Available in a structured format (XLS); (3) Available in a non-proprietary format (CSV); (4) Uses URIs to denote things; (5) Linked to other data to provide context
  • 16. 1-Star Benefits. Consumers: ✓ Ability to look at, print, store, modify and share data ✓ Ability to use data as input to a system. Publishers: ✓ Easily publish data ✓ Ensure transparency. 5-Star Benefits. Consumers: ✓ Discover more (related) data while consuming the data ✓ Directly learn about the data schema ? Have to deal with broken data links ? Trust issues. Publishers: ✓ Make data discoverable ✓ Increase the value of data ✓ Gain the same benefits from the links as the consumers ? Need to invest resources to link data ? May need to clean data
  • 17. Tabular Data vs. Graph Data. Tabular data (europeandataportal.eu): • Lots of open datasets are in tabular format • CSV, Excel, TSV, etc. • Records organized in silos of collections • Very few links within and/or across collections • Difficult to understand the nature of the data • Difficult to integrate / query. Graph data, based on Linked Data: • Method for publishing data on the Web • Self-describing data and relations • Interlinking • Accessed using semantic queries • Open standards by W3C − Data format: RDF − Knowledge representation: RDFS/OWL − Query language: SPARQL http://www.w3.org/standards/semanticweb/data
  • 20. Tabular data Tabular data is data that is structured into rows and columns Correspondence with reality: 1) Each row represents an entity 2) Each column header represents an attribute of entity 3) Each column value represents a value of attribute 4) Each table represents a collection of entities 20
  • 21. Tabular data files. Tabular data can be stored in different formats: • Tabular text formats (pure tabular data), i.e. delimiter-separated values: – CSV (comma-separated values) – Less common variants, including TSV (tab-separated values), colon-separated values, etc. • Spreadsheet formats (metadata about the document, tabular data, formulas): – XLS (Excel spreadsheet) – XLSX (Excel 2007 format)
  • 22. Tabular data quality issues When a dataset does not satisfy specified data quality criteria, it means that it contains data quality issues. In order to provide higher data quality, these quality issues should be detected and removed. 22
  • 23. Types of quality issues 23
  • 24. Types of quality issues 24
  • 25. Types of quality issues 25
  • 26. What types of data quality issues can occur? 26
  • 27. Types of quality issues 27
  • 28. Types of quality issues Actual information model: order street house 28
  • 29. Types of quality issues Actual information model: order has address address 29
  • 30. Types of quality issues 30
  • 31. Types of quality issues Data model: observation has make make 31
  • 32. Types of quality issues Data model: observation make year number 32
  • 33. Summary of data quality issues 33
  • 34. How to resolve data quality issues? Workflow: 1) Identify data quality issues 2) Define transformation functions to resolve them 3) Execute transformation and verify the result 34
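The three workflow steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `find_missing` and `fill_missing`, the sample rows, and the choice of "missing value" as the issue are all invented here and are not taken from DataGraft.

```python
# Hypothetical sketch of the identify / transform / verify workflow.
Row = dict  # each row maps a column name to a string value

def find_missing(rows, column):
    """Step 1: return the indexes of rows whose value in `column` is empty."""
    return [i for i, row in enumerate(rows) if not row.get(column, "").strip()]

def fill_missing(rows, column, default):
    """Step 2: a transformation that replaces empty values with `default`.

    Non-empty values are also stripped of surrounding whitespace.
    """
    return [
        {**row, column: row.get(column, "").strip() or default}
        for row in rows
    ]

# Invented sample data
rows = [
    {"make": "Volvo", "year": "2014"},
    {"make": "", "year": "2015"},
]

issues_before = find_missing(rows, "make")       # step 1: identify
cleaned = fill_missing(rows, "make", "unknown")  # steps 2-3: define and execute
issues_after = find_missing(cleaned, "make")     # step 3: verify
```

Running the detection function again after the transformation is what makes the result verifiable rather than assumed.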
  • 35. Transformation function types. By scope: • Functions on rows • Functions on columns • Functions transforming entire dataset. By caused effect: • Data reordering functions • Data extraction functions • Data manipulation functions • Data enrichment functions
  • 36. Transformation functions (by scope; each entry gives name, description and effect)
      Rows
      – Add Row: create a new record in a dataset. Effect: data enrichment.
      – Take/Drop Rows: extract only relevant rows by index. Effect: data extraction. Resolves issues: “Rows describing entities not belonging to a collection”.
      – Shift Row: change a row's position inside a dataset. Effect: data reordering; simplifies detection of quality issues.
      – Filter Rows: extract only relevant rows by condition. Effect: data extraction. Resolves issues: “Rows describing entities not belonging to a collection”.
      Entire dataset
      – Remove Duplicates: remove similar rows. Effect: data extraction. Resolves issues: “Duplicate rows”.
      – Sort Dataset: sort the dataset by given column names in a given order. Effect: data reordering; simplifies detection of quality issues.
      – Reshape Dataset (Melt): move columns to rows. Effect: data manipulation. Resolves issues: “Column headers containing attribute values”.
      – Reshape Dataset (Cast): move rows to columns by categorizing and aggregating. Effect: data enrichment; simplifies detection of quality issues.
      – Group and Aggregate: group values by one or multiple columns and perform aggregation. Effect: data enrichment; simplifies detection of quality issues.
      Columns
      – Add Column: add a column with a manually specified value. Effect: data enrichment.
      – Derive Column: add a column with values computed from other columns. Effect: data enrichment.
      – Take/Drop Columns: take or drop selected column(s). Effect: data extraction. Resolves issues: “Columns not related to model”.
      – Shift Column: arbitrarily change a column's position. Effect: data reordering; simplifies detection of quality issues.
      – Merge Columns: merge columns using a custom separator. Effect: data manipulation. Resolves issues: “Single value is split across multiple columns”.
      – Split Column: split a column using a custom separator. Effect: data manipulation. Resolves issues: “Multiple values stored in one column”.
      – Rename Columns: change column headers. Effect: data manipulation. Resolves issues: “Incorrect column headers”.
      – Map Columns: apply a function to all values in a column. Effect: data manipulation. Resolves issues: “Illegal values”, “Missing values”, “Inconsistent values”.
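As an illustration of one entry in this table, the "Reshape Dataset (Melt)" function can be sketched as follows. The function name `melt`, its parameters and the sample figures are assumptions made for this example; they are not DataGraft's API and the numbers are invented.

```python
def melt(rows, id_columns, variable_name, value_name):
    """Move every non-id column of each row into its own (variable, value) row."""
    melted = []
    for row in rows:
        ids = {column: row[column] for column in id_columns}
        for column, value in row.items():
            if column not in id_columns:
                melted.append({**ids, variable_name: column, value_name: value})
    return melted

# Invented data: the column headers "2014" and "2015" are really attribute values.
wide_rows = [
    {"region": "Oslo", "2014": 10, "2015": 12},
    {"region": "Bergen", "2014": 7, "2015": 9},
]

long_rows = melt(wide_rows, ["region"], "year", "registrations")
```

After melting, the year is an ordinary column value, which addresses the "column headers containing attribute values" issue named in the table.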
  • 37. Tabular data cleaning tools • CLI tools (e.g. Unix awk, csvkit, CSVfix): lack a convenient user interface • Programming languages and libraries for data analysis (R, agate for Python): users need programming knowledge • Spreadsheet software (Microsoft Excel, LibreOffice Calc, Google Spreadsheets): not initially created for data cleaning, hard to debug, code is mixed up with data • Frameworks/tools designed for interactive data cleaning and transformation in the ETL process
  • 38. Example: vehicle registration data https://www.ssb.no/statistikkbanken/selectvarval/Define.asp?subjectcode=&ProductId=&MainTable=RegKjoretoy&nvl=&PLanguage=1&nyTmpVar=true&CMSSubjectArea=transport-og-reiseliv&KortNavnWeb=bilreg&StatVariant=&checked=true
  • 39. Example: vehicle registration data (continued) * Data obtained from StatBank Norway https://www.ssb.no/en/statistikkbanken 39
  • 40. Map columns – applying a function to all values in a column Effect: data manipulation Resolves anomalies: Illegal values, Missing values, Inconsistent values Required parameters: For all columns that should be mapped 1) Name of column to manipulate 2) Name of function to apply 40
  • 41. Before: Map columns – apply function to all values in a column 41
  • 42. After: Map columns – apply function to all values in a column 42
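A minimal sketch of "Map columns", assuming rows are held as Python dictionaries. The helper names `map_columns` and `parse_count`, the sample rows, and the convention that ".." marks a missing count are hypothetical choices for this example.

```python
def map_columns(rows, functions):
    """Apply functions[column] to every value of that column.

    Columns without an entry in `functions` pass through unchanged.
    """
    return [
        {column: functions.get(column, lambda value: value)(value)
         for column, value in row.items()}
        for row in rows
    ]

def parse_count(value):
    """Turn '1 234' into 1234; anything that is not a plain number becomes None."""
    cleaned = value.replace(" ", "")
    return int(cleaned) if cleaned.isdigit() else None

# Invented sample rows with inconsistent and missing values
raw_rows = [
    {"make": " volvo ", "count": "1 234"},
    {"make": "SAAB", "count": ".."},
]

mapped_rows = map_columns(
    raw_rows,
    {"make": lambda value: value.strip().title(), "count": parse_count},
)
```

The two required parameters from the slide (column name and function to apply) correspond to the keys and values of the `functions` dictionary.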
  • 43. Derive column – add a column with values computed from others Effect: data enrichment Adds new information to data Required parameters: 1) Name of derived column 2) Column(s) to derive from 3) Function to derive with 43
  • 44. Before: Derive column – add a column with values computed from others 44
  • 45. After: Derive column – add a column with values computed from others 45
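"Derive column" takes the three parameters listed on the slide: the name of the new column, the source columns, and the function. A possible sketch, with the function name `derive_column` and the sample numbers invented for illustration:

```python
def derive_column(rows, new_name, source_columns, function):
    """Add column `new_name`, computed by `function` from `source_columns`."""
    return [
        {**row, new_name: function(*(row[column] for column in source_columns))}
        for row in rows
    ]

# Invented sample data
vehicle_rows = [
    {"registered": 120, "scrapped": 20},
    {"registered": 80, "scrapped": 95},
]

derived_rows = derive_column(
    vehicle_rows,
    "net_change",
    ["registered", "scrapped"],
    lambda registered, scrapped: registered - scrapped,
)
```

The original columns are kept, so the operation only adds information, which matches the "data enrichment" effect.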
  • 46. Cast dataset – move rows to columns by categorizing and aggregating Effect: data enrichment Adds new information to data, simplifies anomaly detection Required parameters: 1) Column name for variable (what to categorize and put to headers) 2) Column name for value (on what to perform aggregations) 46
  • 47. Before: Cast dataset – move rows to columns by categorizing and aggregating 47
  • 48. After: Cast dataset – move rows to columns by categorizing and aggregating 48
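"Cast dataset" can be sketched as grouping by the remaining identifier columns, then aggregating the value column per category. The function name `cast`, the extra `id_columns` parameter, the default of summing, and the data are assumptions for this example only.

```python
from collections import defaultdict

def cast(rows, id_columns, variable, value, aggregate=sum):
    """Turn each category found in `variable` into a column.

    Values from `value` are grouped per id and category, then aggregated.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        key = tuple(row[column] for column in id_columns)
        groups[key][row[variable]].append(row[value])

    result = []
    for key, by_category in groups.items():
        out = dict(zip(id_columns, key))
        for category, values in by_category.items():
            out[category] = aggregate(values)
        result.append(out)
    return result

# Invented sample data
fuel_rows = [
    {"region": "Oslo", "fuel": "petrol", "count": 5},
    {"region": "Oslo", "fuel": "diesel", "count": 3},
    {"region": "Oslo", "fuel": "petrol", "count": 2},
    {"region": "Bergen", "fuel": "diesel", "count": 4},
]

cast_rows = cast(fuel_rows, ["region"], "fuel", "count")
```

In this sketch a category that never occurs for a group (petrol in Bergen) simply has no key in that output row; a real tool would need to decide how to fill such cells.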
  • 49. RDF mapping Reusing of existing vocabularies is encouraged. Helps to interlink data. 49
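To make the RDF mapping step concrete, the sketch below turns one tabular row into N-Triples lines while reusing the existing FOAF vocabulary. The base URI `http://example.org/person/`, the column names and the helper names are hypothetical; only the FOAF and RDF term URIs are real.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )

def row_to_ntriples(row, base_uri):
    """Map one row (id, name, homepage) to three N-Triples statements."""
    subject = "<" + base_uri + str(row["id"]) + ">"
    name = escape_literal(row["name"])
    return [
        subject + " <" + RDF_TYPE + "> <" + FOAF + "Person> .",
        subject + " <" + FOAF + 'name> "' + name + '" .',
        subject + " <" + FOAF + "homepage> <" + row["homepage"] + "> .",
    ]

# Hypothetical row and base URI
person_row = {"id": 1, "name": "Alice", "homepage": "http://alice.org/"}
triples = row_to_ntriples(person_row, "http://example.org/person/")
```

Because the predicates come from a shared vocabulary, other datasets that also use FOAF can be queried together with this one without extra schema alignment.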
  • 52. Linked (Open) Data RDF, RDFS, SPARQL
  • 53. Linked Data • Method for publishing data on the Web • Self-describing data and relations • Interlinking • Accessed using semantic queries http://www.w3.org/standards/semanticweb/data
  • 54. Linked open data cloud By Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak - http://lod-cloud.net/, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=36956792 54
  • 55. Linked Data principles • Every thing is represented by a URI • URIs of things can be dereferenced • Things are linked to other things by relating their URIs 55
  • 56. Linked Data technology • Data format: RDF • Knowledge representation: RDFS/OWL • Query language: SPARQL • Linking medium: HTTP
  • 58. RDF in reality: using URLs to identify things 58
  • 59. Resource Description Framework (RDF) Basics • RDF: making statements about resources (entities) o Triple data model: subject -> predicate -> object (Alice's age is 34) • Subjects and objects: o Resources (URIs of entities) – can have properties related to them (http://my-domain.com/Alice) o Literals – constant values ("female", "3.14159"); cannot be subjects o Blank nodes – used to specify composite properties (e.g., an address composed of a country, city, street name, house number, zip code, etc.) • Relationships (a.k.a. predicates) – relate one subject to one object
  • 60. RDF serialisation formats • Turtle family of RDF languages (N-Triples, Turtle, TriG and N-Quads)
      <http://example.org/bob#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
      <http://example.org/bob#me> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice#me> .
      <http://example.org/bob#me> <http://schema.org/birthDate> "1990-07-04"^^<http://www.w3.org/2001/XMLSchema#date> .
      <http://example.org/bob#me> <http://xmlns.com/foaf/0.1/topic_interest> <http://www.wikidata.org/entity/Q12418> .
      <http://www.wikidata.org/entity/Q12418> <http://purl.org/dc/terms/title> "Mona Lisa" .
      <http://www.wikidata.org/entity/Q12418> <http://purl.org/dc/terms/creator> <http://dbpedia.org/resource/Leonardo_da_Vinci> .
      <http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619> <http://purl.org/dc/terms/subject> <http://www.wikidata.org/entity/Q12418> .
    • JSON-LD (JSON-based RDF syntax)
      {
        "@context": "example-context.json",
        "@id": "http://example.org/bob#me",
        "@type": "Person",
        "birthdate": "1990-07-04",
        "knows": "http://example.org/alice#me",
        "interest": {
          "@id": "http://www.wikidata.org/entity/Q12418",
          "title": "Mona Lisa",
          "subject_of": "http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619",
          "creator": "http://dbpedia.org/resource/Leonardo_da_Vinci"
        }
      }
  • 61. RDF serialisation formats (continued) • RDFa (for HTML and XML embedding)
      <body prefix="foaf: http://xmlns.com/foaf/0.1/ schema: http://schema.org/ dcterms: http://purl.org/dc/terms/">
        <div resource="http://example.org/bob#me" typeof="foaf:Person">
          <p>Bob knows <a property="foaf:knows" href="http://example.org/alice#me">Alice</a> and was born on the <time property="schema:birthDate" datatype="xsd:date">1990-07-04</time>.</p>
          <p>Bob is interested in <span property="foaf:topic_interest" resource="http://www.wikidata.org/entity/Q12418">the Mona Lisa</span>.</p>
        </div>
        <div resource="http://www.wikidata.org/entity/Q12418">
          <p>The <span property="dcterms:title">Mona Lisa</span> was painted by <a property="dcterms:creator" href="http://dbpedia.org/resource/Leonardo_da_Vinci">Leonardo da Vinci</a> and is the subject of the video <a href="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">'La Joconde à Washington'</a>.</p>
        </div>
        <div resource="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">
          <link property="dcterms:subject" href="http://www.wikidata.org/entity/Q12418"/>
        </div>
      </body>
  • 62. RDF serialisation formats (continued) • RDF/XML (XML syntax for RDF)
      <?xml version="1.0" encoding="utf-8"?>
      <rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"
               xmlns:foaf="http://xmlns.com/foaf/0.1/"
               xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:schema="http://schema.org/">
        <rdf:Description rdf:about="http://example.org/bob#me">
          <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
          <schema:birthDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1990-07-04</schema:birthDate>
          <foaf:knows rdf:resource="http://example.org/alice#me"/>
          <foaf:topic_interest rdf:resource="http://www.wikidata.org/entity/Q12418"/>
        </rdf:Description>
        <rdf:Description rdf:about="http://www.wikidata.org/entity/Q12418">
          <dcterms:title>Mona Lisa</dcterms:title>
          <dcterms:creator rdf:resource="http://dbpedia.org/resource/Leonardo_da_Vinci"/>
        </rdf:Description>
        <rdf:Description rdf:about="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">
          <dcterms:subject rdf:resource="http://www.wikidata.org/entity/Q12418"/>
        </rdf:Description>
      </rdf:RDF>
  • 63. RDF Schema (RDFS) • basic capabilities for describing RDF vocabularies • includes concepts to describe: o classes, class hierarchies (sub-classes) and instances (typing) o non-standard literal data types o property hierarchies (sub-properties) o predicate domain and range o utility properties (labels, comments, additional information about things, definitions of resources) o …
  • 65. Querying RDF: SPARQL • RDF Query language – Based on graph matching • Uses SQL-like syntax • Query types: – SELECT – table of raw values – CONSTRUCT, DESCRIBE – RDF graph – ASK – boolean 65
  • 66. SPARQL querying – example graph [Figure: graph with the nodes a:Alice, b:Peter and c:Jim, connected by foaf:knows edges, typed with rdf:type foaf:Person, and carrying the foaf:nick literals "Lissy", "Pety" and "Jimbo"]
  • 67. SPARQL querying – query Question: What are the nicknames of people that Alice knows? Query:
      PREFIX a: <http://alice.org/>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT ?nickname
      WHERE {
        a:Alice foaf:knows ?someone .
        ?someone foaf:nick ?nickname
      }
  • 68. SPARQL querying – matching to the graph [Figure: the example graph (a:Alice, b:Peter, c:Jim with foaf:knows, foaf:nick and rdf:type foaf:Person) with the parts matched by the query highlighted]
  • 69. SPARQL querying – result Query:
      PREFIX a: <http://alice.org/>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT ?nickname
      WHERE {
        a:Alice foaf:knows ?someone .
        ?someone foaf:nick ?nickname
      }
    Result (?nickname): "Pety", "Jimbo"
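The graph matching behind this query can be imitated with a naive pattern matcher in plain Python. This is a teaching sketch, not how a SPARQL engine is implemented. The triples below are a subset of the example graph; which nickname belongs to which person is inferred from the slide's result, and the remaining edges of the figure are left out.

```python
def is_variable(term):
    """Pattern terms starting with '?' are variables, as in SPARQL."""
    return term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new_binding = dict(binding)
    for pattern_term, triple_term in zip(pattern, triple):
        if is_variable(pattern_term):
            # setdefault binds a new variable or returns the existing value
            if new_binding.setdefault(pattern_term, triple_term) != triple_term:
                return None
        elif pattern_term != triple_term:
            return None
    return new_binding

def query(graph, patterns):
    """Return all variable bindings that satisfy every triple pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings

# Assumed subset of the example graph (prefixed names kept as plain strings)
graph = [
    ("a:Alice", "foaf:knows", "b:Peter"),
    ("a:Alice", "foaf:knows", "c:Jim"),
    ("a:Alice", "foaf:nick", '"Lissy"'),
    ("b:Peter", "foaf:nick", '"Pety"'),
    ("c:Jim", "foaf:nick", '"Jimbo"'),
]

patterns = [
    ("a:Alice", "foaf:knows", "?someone"),
    ("?someone", "foaf:nick", "?nickname"),
]

nicknames = [binding["?nickname"] for binding in query(graph, patterns)]
```

The shared variable `?someone` is what joins the two patterns: a binding survives only if the same node satisfies both.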
  • 70. Data integration using Linked Data: using URIs Example: Relational DB or spreadsheet – dataset about scientific publications:
      Authors: ID | Name | Home page
        1 | Alice | http://alice.org/
        2 | Tim | https://www.w3.org/People/Berners-Lee/
      Publications: ID author | ISBN | Publication topic
        1 | 978-3-16-148410-0 | "On the frictional coefficient of bananas"
        1 | 534-1-22-663975-1 | "Do woodpeckers get headaches?"
        2 | 1-933019-33-6 | "The Semantic Web"
  • 71. Data integration using Linked Data: using URIs (continued) Graph representation of new dataset: [Figure: a:Alice and t:Tim are linked via foaf:publications to publication URIs; http://.../978-3-16-148410-0 has foaf:topic "On the frictional coefficient of bananas", http://.../534-1-22-663975-1 has foaf:topic "Do woodpeckers get headaches?", and http://.../1-933019-33-6 has foaf:topic "The Semantic Web"]
  • 72. Data integration using Linked Data: Using URIs (continued) Same URI! 72
  • 73. Data integration using Linked Data: Using URIs (continued) Resulting graph: [Figure: the example social graph (a:Alice, b:Peter, c:Jim with foaf:knows, foaf:nick and rdf:type foaf:Person) joined at a:Alice with her publications …978-3-16-148410-0 (foaf:topic "On the frictional coefficient of bananas") and …534-1-22-663975-1 (foaf:topic "Do woodpeckers get headaches?")]
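Why a shared URI is enough to integrate two datasets can be shown with plain Python sets of triples. The sketch assumes that both datasets name Alice with the same identifier `a:Alice`; the publication URIs are kept elided exactly as on the slides, and the helper name `describe` is hypothetical.

```python
# Dataset 1: part of the social graph (subset assumed for illustration)
social = {
    ("a:Alice", "foaf:knows", "b:Peter"),
    ("a:Alice", "foaf:nick", '"Lissy"'),
}

# Dataset 2: publications; URIs are elided as on the slides
publications = {
    ("a:Alice", "foaf:publications", "http://.../978-3-16-148410-0"),
    ("http://.../978-3-16-148410-0", "foaf:topic",
     '"On the frictional coefficient of bananas"'),
}

# Integration is a set union: no join keys or schema mapping are needed,
# because both datasets already use the same URI for Alice.
merged = social | publications

def describe(graph, subject):
    """Return the sorted (predicate, object) pairs stated about `subject`."""
    return sorted((p, o) for s, p, o in graph if s == subject)

alice_predicates = [predicate for predicate, _ in describe(merged, "a:Alice")]
```

After the union, one lookup on `a:Alice` returns facts from both sources, which is the effect the "Same URI!" slide points at.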
  • 75. Linked Data is great for Open Data • Linked Data is a great means to represent data – Semantics are part of the data – Naturally linked to other data – Querying language • How Linked Data can improve Open Data: – Easier integration, free data from silos – Seamless interlinking of data – Understand the data – New ways to query and interact with data 75
  • 76. … but has been ignored by the mainstream • Difficult to make it accessible to people – Publishers – Developers – Data workers • Challenges with using Linked Data – Lack of tooling and expertise to publish high quality Linked Data – Lack of resources to host LOD endpoints / unreliable data access • DataGraft: packaging Linked Data to make it more approachable to the open data community 76
  • 78. “Data is the new oil” …but many of us just need gasoline. Data-as-a-Service …is the new filling station
  • 79. Data-as-a-Service • Outsourcing of various data operations to the cloud • Eliminates – upfront costs on data infrastructure – ongoing investment of time and resources in managing the data infrastructure • Complete package for – transformation of raw data into meaningful data assets – reliable delivery of data assets 79
  • 80. DataGraft was developed to allow data workers to manage their data in a simple, effective, and efficient way. Powerful data transformation and reliable data access capabilities
  • 81. Data Transformation and RDF Publication Process • Interactive design of transformations? • Repeatable transformations? • Reuse/share transformations (user-based access)? • Cloud-based deployment of transformations? • Self-serviced process? • Data and Transformation as-a-Service? [Figure: process diagram with the stages Raw Data, Prepared Data, RDF Graph and RDF Triple Store, and the steps Transform, Map (ontology mapping) and Generate RDF]
  • 82. Tabular Data Graph Data DataGraft: Data-as-a-Service For the Data Transformation and RDF Publication Process 82
  • 102. Data records (rows): Add row; Take row(s); Drop row(s); Shift row; Filter rows (grep); Remove duplicate rows. Entire dataset: Sort; Reshape dataset; Group (categorize) and aggregate. Columns: Add column(s); Take column(s); Drop column(s); Move column; Merge columns; Split column; Rename column(s); Apply function to all values in a column
  • 108. Data pages and federated querying. Example question: What is the population of locations and the total number of persons employed in human health and social work activities?
  • 114. DataGraft key feature: Flexible management and sharing of data and transformations • Fork, reuse and extend transformations built by other professionals from DataGraft’s transformations catalog • Interactively build, modify and share data transformations • Share transformations privately or publicly • Reuse transformations to repeatably clean and transform spreadsheet data • Programmatically access transformations and the transformation catalogue
  • 115. Reuse of transformations in environmental data publishing TRAGSA Pilot • Number of transformations: 42 – Created via reuse: 25 • Number of triples: – ~ 7.7M ARPA Pilot • Number of transformations: 5 – Created via reuse: 2 • Number of triples: – ~ 14K 115 Forking/reusing transformations helped us spend less time on creating new transformations
  • 116. DataGraft key feature: Reliable data hosting and querying services • Host data on DataGraft’s reliable, cloud-based semantic graph database • Share data privately or publicly • Query data through your own SPARQL endpoint • Programmatically access the data catalogue • Operations & maintenance performed on behalf of users
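Querying data through a SPARQL endpoint is usually done over HTTP as defined by the W3C SPARQL 1.1 Protocol. The sketch below only builds such a request with the Python standard library and does not send it. The endpoint URL is a placeholder, and it is an assumption here that a DataGraft endpoint accepts standard protocol requests.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder URL; substitute the address of a real SPARQL endpoint.
ENDPOINT = "https://example.org/sparql"

def build_sparql_request(endpoint, sparql_query):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": sparql_query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
request = build_sparql_request(ENDPOINT, SAMPLE_QUERY)

# Sending is left out on purpose. Against a real endpoint one could call
# urllib.request.urlopen(request) and parse the body with json.load().
```

Keeping request construction separate from sending makes the example testable offline and keeps the endpoint address in one place.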
  • 117. DataGraft enablers: Grafter, Grafterizer, Semantic Graph DBaaS, Data Portal
  • 118. DataGraft – 1 package, 2 audiences • Data Publisher: helping with integrating and publishing data • Application Developer: giving better, easier tools
  • 120. The context: Statsbygg 120 • A public sector administration company • Norwegian government's key advisor in construction and property affairs • Building commissioner • Property manager • Property developer • Interest: Exploit/Share property data in novel ways • For efficiency and sustainability of the property included in the government's civil estate Example: Reporting state-owned real estate properties in Norway
  • 121. Example: Reporting state-owned real estate properties in Norway (cont’) • Report: a hard copy of 314 pages and a PDF file; 6 person-months; data collection with spreadsheets; quality assurance through e-mail and phone correspondence • Pains: time consuming; poor data quality; static report without live updating • Reporting Service: live service; efficient sharing of data; simplified integration with external datasets; live updating; reliable access; … • 3rd party services: risk and vulnerability analysis, e.g. buildings affected by flooding; analysis of leasing prices
  • 122. Sample data 122 Cleaning, Transformation, Publishing, Integration, Querying, Visualization, Service Access
  • 123. Demo Scenario • Interactively create tabular data transformations • Reuse/extend data transformations (incl. data annotations) • RDF data publication and querying • Integrating and visualising data from different sources • (Using 3rd party tools with DataGraft) 123
  • 124. Demo sample data 124 Cleaning, Transformation, Publishing, Integration, Querying, Visualization, Service Access
  • 125. Demo sample data 125 Cleaning, Transformation, Publishing, Integration, Querying, Visualization, Service Access
  • 126. Benefits of DataGraft in use cases • Simplified data publishing process • Integration with external data sources using established web standards • Data that was not publicly available – now published (e.g. air quality data in Oslo) • Time-efficient publishing • Repeatable data transformation process 126
  • 127. DataGraft and Big Data • Desired features: – real-time interactivity – large datasets batch transformation capability We are developing a hybrid solution to work with both batch and real-time processing. 127
  • 128. DataGraft and Big Data: High-level architecture 128
  • 129. DataGraft – targeted impacts • Reduction in costs for organisations which lack sufficient expertise and resources to make their data available • Reduction in the dependency of data owners on generic Cloud platforms to build, deploy and maintain their linked data from scratch • Increase in the speed of publishing new datasets and updating existing datasets • Reduction in the cost and complexity of developing applications that use data • Increase in the reuse of data by providing reliable access to numerous datasets hosted on DataGraft.net
  • 130. Example: The benefit of DataGraft in PLUQI. 1. 23% reduction in development cost • Reducing cost for implementing transformations • Integrating the process is simpler. 2. Able to focus on service quality • Gathering enough good datasets • Designing/implementing. Before: Datasets gathering, Data transformation, Data provisioning/access, Implementing App. After (with DataGraft): Datasets gathering, Data transformation, Data provisioning/access, Implementing App
  • 131. DataGraft in numbers (as of end of Jan 2016): 238 registered users; 607 registered data transformations (208 public); 1828 uploaded files; 192 public data pages
  • 132. DataGraft in the wild • Investigating crime data in small geographies • Used DataGraft to transform data and publish RDF http://benproctor.co.uk/investigating-crime-data-at-small-geographies/
  • 133. Data Science and DataGraft Greater Data Science: 1. Data Exploration and Preparation 2. Data Representation and Transformation 3. Computing with Data 4. Data Visualization and Presentation 5. Data Modeling 6. Science about Data Science 133 “50 years of Data Science” by David Donoho http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf DataGraft
  • 136. Summary • DataGraft – emerging Data-as-a-Service solution for making (linked) data more accessible – Platform, portal, methodology, APIs – Online service, functional and documented – Validated through several use cases • Key features: – Support for Sharable/Repeatable/Reusable Data Transformations – Reliable RDF Database-as-a-Service 136