Open Data, by definition, provides the chance to re-shape and publish heterogeneous pieces and fragments of information that are open: anyone is free to use, reuse, and redistribute them. For users to fully benefit from this idea, the Open Data systems of tomorrow must provide high-quality data, relying on real-time and ubiquitous services, along with deep integration with mobile and smart devices and infrastructures.
In this session, we present a synthesis of the Whitehall proposal addressing this vision: building Open Data on a fully-fledged Big Data infrastructure, realized using graph-based and NoSQL technologies. The idea is shaped in a cultural heritage scenario, where data is aimed at valorizing one of the main assets of Italy: cultural heritage.
4. Obama Vision
"In the coming year, we’ll also work to rebuild people’s faith in
the institution of government. Because you deserve to know
exactly how and where your tax dollars are being spent, you’ll
be able to go to a website and get that information for the very
first time in history. Because you deserve to know when your
elected officials are meeting with lobbyists, I ask Congress
to do what the White House has already done -- put that
information online."
President Obama, The State of the Union Speech - 2011
http://www.whitehouse.gov/state-of-the-union-2011
5. What Open Data is
Open data is the idea that certain data should be freely available to
everyone to use and republish as they wish, without restrictions from
copyright, patents or other mechanisms of control.
The goals of the open data movement are similar to those of other
"Open" movements such as open source, open content, and open
access (Ref. Wikipedia)
Citizen centricity comes from citizen empowerment, namely
disintermediation with respect to traditional actors
6. Expected Payoff
• Ubiquitous access
• Reusability
• Optimization
• Social and Cultural enrichment
ROI: "A greater than 100X return on investment in direct Federal IT spending
through economies of scope is achievable by equipping agencies with an
Open Data platform that is the shared foundation for numerous programs that
are independently funded today"
[http://www.socrata.com/blog/open-data-as-a-platform/]
OD turns out to be a formidable tool for:
• Analyzing spending reviews of administrations' expenses
• Enforcing fact checking on declarations, policies and campaigns
7. Where Open Data is
http://census.okfn.org/
https://nycopendata.socrata.com/
https://dati.lombardia.it
... and counting
8. Italian Digital Agenda : Open Data + E-Gov
In 2012, Italy established a "control room" of experts aimed at promoting Open
Data in the context of a digital agenda
Open Data is integrated with E-Gov
1. Enabling Infrastructures
2. PA digital switchover
3. Purposive and regulative set of norms and rules
4. Communication plan
The challenge is optimizing services and costs:
• Digital Identities and related services, unified and web based registry
offices, e-payments, continuous census, interoperability of EU platforms
• Digital health, Cultural Heritage
• eLearning, eProcurement, eRecruitment
11. Roadmap to Open Data
First row of steps, leading to the LOD Feasibility report, Executive Plan and Road Map:
1. Data assets analysis
• Relevant Datasets
• Customer internal processes analysis
2. Identify Use Cases and Final Users
• Best Practices and similar datasets
• Linked Data Cloud
3. Identify ROI
• Risk assessment
• Savings
• Identify non quantifiable ROI
4. Architecture Definition
• Identify service level
• W3C Compliance
5. Legal Issues
• Copyright
• Licensing
• Liability of Data update
Second row of steps (Service Development, Knowledge Transfer), leading to LOD Services and Platform:
1. Identify Datasets
• Data Analysis
• Datasets Store
• Data transformation
• Normalization
2. Development
• Architecture Definition
• Internal linking
• SPARQL Endpoint
3. Data Enrichment
• Metadata description, Ontology, RDF
• External Linking
• External component (GIS, Data-Mining, BI Analysis modules)
4. Validation and Publication
• W3C Compliance
• Data Localization, History
• Communication Plan
5. Composition of Services
• Documentation
• Build Ecosystem
• Public API
13. 5★ Open Data
Tim Berners-Lee, the inventor of the Web and Linked Data
initiator, suggested a 5 star deployment scheme for Open Data.
★ make PUBLIC stuff available on the Web (whatever format, .jpeg .pdf) under an open license
★★ make it available as structured data (e.g., Excel instead of image scan of a table)
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context
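The star scheme is cumulative: each level presupposes the ones before it. A minimal sketch of that reading follows; the Dataset class and its field names are hypothetical and only mirror the five criteria listed above.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset (field names are assumptions)."""
    on_web_open_license: bool = False     # 1 star: on the Web under an open license
    structured: bool = False              # 2 stars: structured data (e.g. a spreadsheet)
    non_proprietary_format: bool = False  # 3 stars: e.g. CSV instead of Excel
    uses_uris: bool = False               # 4 stars: things are denoted by URIs
    linked_to_other_data: bool = False    # 5 stars: links to other data


def star_rating(ds: Dataset) -> int:
    """Count the levels satisfied in order; stop at the first one that is missing."""
    levels = [
        ds.on_web_open_license,
        ds.structured,
        ds.non_proprietary_format,
        ds.uses_uris,
        ds.linked_to_other_data,
    ]
    stars = 0
    for satisfied in levels:
        if not satisfied:
            break
        stars += 1
    return stars


scanned_pdf = Dataset(on_web_open_license=True)
csv_file = Dataset(on_web_open_license=True, structured=True,
                   non_proprietary_format=True)
print(star_rating(scanned_pdf), star_rating(csv_file))  # prints: 1 3
```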
14. W3C Roadmap
Having Standard Names/URIs for All Government
Objects aids in discoverability, improves
metadata, and ensures authenticity.
• Provide permanent, patterned and/or
discoverable URI/URLs to your data
• Create a web page with a plain language
description of the dataset to help search
engines find the data, so people can use it.
• Provide links out to other data and documentation.
• Ensure that data is findable and can be referenced for as long as
people need it
• Data published in industry standards like (X)HTML, XML and RDF
can be used as an object database or RESTful API
15. Linked Open Data
Recommended best practice for exposing, sharing, and connecting
pieces of data, information, and knowledge on the Semantic Web
using URIs, OWL, and RDF.
1. Requires Ontologies to be applied to
data
2. Allows heterogeneous Nodes to be
traversed in a semantically coherent
fashion
16. Botticelli Case
One may specify that the author of "La Primavera", as mentioned by the Uffizi
Museum, LINKS to exactly the same person as the one described on
DBpedia (the LOD version of Wikipedia)
http://live.dbpedia.org/page/Primavera_(painting)
http://live.dbpedia.org/page/Sandro_Botticelli
http://live.dbpedia.org/page/Adoration_of_the_Magi_of_1475_(Botticelli)
The link is not just a hyperlink because it is typed.
In the BOTTICELLI page, the information about his life and works is
structured, by means of the topology
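A typed link can be pictured as a (subject, predicate, object) triple, where the predicate is the type of the link. The sketch below is only an illustration: the example.org namespace and the ex:author predicate are hypothetical, owl:sameAs is the standard OWL identity property, and the DBpedia URLs are the ones shown on this slide.

```python
# Hypothetical namespace for the museum's own identifiers (an assumption).
UFFIZI = "http://example.org/uffizi/"
# DBpedia pages as listed on the slide.
DBPEDIA = "http://live.dbpedia.org/page/"

# Each statement is a (subject, predicate, object) triple.
triples = [
    (UFFIZI + "artwork/la-primavera", "ex:author", UFFIZI + "person/botticelli"),
    (UFFIZI + "person/botticelli", "owl:sameAs", DBPEDIA + "Sandro_Botticelli"),
    (UFFIZI + "artwork/la-primavera", "owl:sameAs", DBPEDIA + "Primavera_(painting)"),
]


def objects(subject, predicate):
    """Follow links of one given type from a subject."""
    return [o for s, p, o in triples if s == subject and p == predicate]


author = objects(UFFIZI + "artwork/la-primavera", "ex:author")[0]
print(objects(author, "owl:sameAs"))
```

Because the link is typed, a client can ask specifically for "the same person elsewhere" rather than for any outgoing hyperlink.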
17. Semantic Network
Enables reasoning: OWL-DL, based on Description Logics, represents a decidable
fragment of First Order Logic
Sandro_Botticelli category:Italian_Renaissance_painters
category:Italian_Renaissance_painters category:Quattrocento_painters
Sandro_Botticelli category:Quattrocento_painters
http://live.dbpedia.org/page/Category:Italian_Renaissance_painters
http://live.dbpedia.org/page/Sandro_Botticelli
http://live.dbpedia.org/page/Category:Quattrocento_painters
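The three statements above can be read as two asserted facts and one derived fact. The following is a naive forward-chaining sketch of that single inference rule; it is not an OWL-DL reasoner, and the function and variable names are assumptions made for illustration.

```python
# Asserted facts, as given on the slide.
member_of = {("Sandro_Botticelli", "Italian_Renaissance_painters")}
subcategory_of = {("Italian_Renaissance_painters", "Quattrocento_painters")}


def infer_memberships(member_of, subcategory_of):
    """Apply 'x in C and C under D implies x in D' until nothing new is derived."""
    inferred = set(member_of)
    changed = True
    while changed:
        changed = False
        for entity, category in list(inferred):
            for sub, sup in subcategory_of:
                if sub == category and (entity, sup) not in inferred:
                    inferred.add((entity, sup))
                    changed = True
    return inferred


derived = infer_memberships(member_of, subcategory_of) - member_of
print(derived)  # prints: {('Sandro_Botticelli', 'Quattrocento_painters')}
```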
18. Linked Open Data Cloud (2011)
Doubled in size every 10 months, since 2007
Domains in the cloud diagram: Media, User-generated, Geographic, Publications, Government, Cross-Domain, Life Sciences
19. Recipes for Serving Information as Linked Data
• Entities must be identified with referenceable HTTP URIs.
• When a client requests the MIME type application/rdf+xml, the data source must return
an RDF/XML description.
• RDF descriptions should also contain RDF links to resources
provided by other data sources, so that clients can navigate the Web
of Data as a whole by following RDF links.
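The second recipe amounts to content negotiation on the Accept header. A minimal sketch of the server-side decision follows; it ignores q-values, and falling back to text/html is an assumption not stated on the slide.

```python
def choose_representation(accept_header: str) -> str:
    """Pick the media type to serve for a given HTTP Accept header.

    Simplification: q-values are ignored; any mention of application/rdf+xml wins.
    """
    requested = [part.split(";")[0].strip() for part in accept_header.split(",")]
    if "application/rdf+xml" in requested:
        return "application/rdf+xml"
    # Assumed fallback for browsers and other clients.
    return "text/html"


print(choose_representation("application/rdf+xml"))
print(choose_representation("text/html, application/xhtml+xml;q=0.9"))
```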
21. Towards Government 2.0
"Government IT needs to redefine itself as Government as a Platform"
Open Data is the platform for Open Government.
Actors:
• Institutions: to better deliver services to citizens
• Civic-minded developers: to serve themselves and others by extending
the platform (e.g. mash-ups, applications)
What actors need: Open Data management platforms, consistent admin tools
and a powerful Open Data Catalog to consolidate the entire Open Data
lifecycle (STEP 1-5)
23. Open Data As-A-Service
Diagram: Mobile Apps and Web Apps each consume the data through a REST API
Data-on-Demand: data are not locked inside CMS applications but are
consumed on demand, As-a-Service
Data as Web Resources: RESTful APIs make it possible to retrieve data as
a web resource (through a URI)
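As a rough picture of "data as a web resource", the sketch below resolves a URI path to a JSON representation, the way a GET handler might. The paths and the sample record are made up for illustration and do not come from any real platform.

```python
import json

# Made-up sample data keyed by resource path (an assumption for illustration).
RESOURCES = {
    "/datasets/museums/1": {"name": "Uffizi", "city": "Florence"},
}


def get_resource(path: str):
    """Return an (HTTP status, JSON body) pair for a GET on the given path."""
    record = RESOURCES.get(path)
    if record is None:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps(record)


status, body = get_resource("/datasets/museums/1")
print(status, body)
```

Any client (mobile app, web app, mash-up) that knows the URI can fetch the same record without going through a CMS page.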
24. Socrata: GovStat Approach
Socrata is releasing fragments of the platform as Open Source
on GitHub
https://github.com/open-data-standards
The business model is moving toward advanced data analysis tools, mining, real-time
monitoring, and decision-making support systems
http://www.socrata.com/govstat/
26. Open Data in a Cultural Heritage Scenario
Art Galleries, Libraries, Archives and Museums (GLAMS) are exploring the
added value of sharing their data resources as LOD
Key facts:
• Rich and structured data sets accumulated over many years by experts
• Ability to reach out to audiences to both enrich datasets and evaluate
services
• Long-standing expertise in meta-data management and
(co-) curation
• Authoritative knowledge on a wide range of subjects
27. GLAMS LOD Examples
In Agora, the Rijksmuseum Amsterdam and the
Netherlands Institute for Sound and Vision collaborate
with the Computer Science and History departments at
the VU to integrate their collections and enrich them with
historical information to facilitate a more comprehensive
understanding of the historical dimension of objects in
online heritage collections. [http://agora.cs.vu.nl/]
The Amsterdam Museum was the first museum in the
Netherlands to convert its complete museum collection
database to RDF. The resulting resource consists of
more than 5 million RDF triples describing more than
70,000 cultural heritage objects. Several working
examples use this dataset, such as a mobile city
guide.
28. GLAMS LOD Examples
Europeana is a pan-European initiative that provides
access to millions of objects as LOD through an API. The
Europeana Thought Lab[5] search interface shows how
LOD principles can aid the search process. Europeana
has been a strong supporter of the uptake of CC0, the
"no rights reserved" option among Creative Commons licenses
[http://pro.europeana.eu]
Open Images provides access to a large and growing
collection of Creative Commons licensed archive
material. The meta-data is converted to RDF, allowing
the creation of rich semantic links between other
datasets such as the Amsterdam Museum dataset
[http://www.openimages.eu/]
29. PROS and CONS of LOD for GLAMS
PROS
• Driving users to online content held by GLAMS (e.g., by improved
search engine optimization);
• Stimulating collaboration in the library, archives and museums
domain and beyond, for instance by inviting people to clean/enrich
existing data;
• Enabling new scholarship that can only be done with open data;
• Allowing the creation of new services for discovery;
• quoting Verwayen (2011), "increas[ing] relevance to digital society."
CONS
• Loss of attribution to the "memory institution", which may in turn
decrease the value of the artworks
• Loss of potential income: open data may not be sold
30. Metrics of Success
Income: measured in money
Public Outreach: measured by the number of (online) visitors
Reuse: measured by the use of data and content by heritage
institutions themselves and by others
Public Participation: measured by the amount of added metadata
and content
32. Developing Open Linked Data
We may recognize a few trends in our scenario:
• Exponential growth in data volumes
• Rise of connectedness
• Increase in degrees of semi-structure
• Structures and schemas emerge rather than being defined
upfront
Key facts:
• Volume: the size of the stored data
• Velocity: the rate at which data changes over time
• Variety: the degree to which data is regularly or irregularly
structured, dense or sparse, and importantly connected or
disconnected
33. ER Approach
We do not know the structure of the documents at design time.
Adopting an ER approach, we have to define vertical tables
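A vertical table stores one row per (entity, attribute, value) instead of one column per attribute. The sketch below, using Python's built-in sqlite3 module, is only an illustration; the table name, column names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One generic table holds every attribute of every document.
conn.execute("CREATE TABLE attribute (entity_id INTEGER, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO attribute VALUES (?, ?, ?)",
    [
        (1, "title", "La Primavera"),
        (1, "author", "Sandro Botticelli"),
        (2, "title", "Adoration of the Magi"),
        (2, "technique", "tempera on panel"),  # attribute absent from entity 1
    ],
)


def load_entity(entity_id):
    """Rebuild one document from its attribute rows."""
    rows = conn.execute(
        "SELECT name, value FROM attribute WHERE entity_id = ?", (entity_id,)
    ).fetchall()
    return dict(rows)


# Even a simple question ("titles by this author") needs a self-join.
titles = conn.execute(
    """
    SELECT t.value
    FROM attribute AS a
    JOIN attribute AS t ON t.entity_id = a.entity_id
    WHERE a.name = 'author' AND a.value = ? AND t.name = 'title'
    """,
    ("Sandro Botticelli",),
).fetchall()
print(load_entity(1), titles)
```

The schema stays flexible, but every extra attribute in a query adds another self-join.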
34. Relational Model Weakness
In the ER model, relationships are semantics-free (no direction, no name)
• As the amount of semi-structured information increases, the
relational model becomes burdened
• Maintenance overheads: join tables and foreign key constraints must be maintained
just to make the database work.
• Large join tables, sparsely populated rows and lots of null-checking logic
• Reciprocal queries are difficult to handle in today's semi-structured,
real-world cases
• Recommendation systems, social networks
35. Aggregate Stores Weakness
Aggregates make it possible to mimic relationships by embedding cross-store
identifiers, but:
• It is up to the developer to manage, infer and reify useful knowledge
from them
• They do not provide index-free adjacency
• Deletions must be checked
• Traversing relationships is expensive, each link requiring an index
lookup
• Brute-force computation over an entire data set is O(n), since all n aggregates in
the data store must be considered. That’s far too costly where we’d prefer
O(log n)
• Impractical in real-time scenarios
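The points above can be seen in a toy aggregate store, here a plain Python dict standing in for a key-value or document database. The keys and records are made up; the sketch only illustrates why following a link costs a lookup and why the reverse direction degenerates into a full scan.

```python
# Each aggregate lives under its own key and embeds identifiers of related aggregates.
store = {
    "user:alice": {"name": "Alice", "friends": ["user:bob"]},
    "user:bob": {"name": "Bob", "friends": ["user:carol"]},
    "user:carol": {"name": "Carol", "friends": ["user:alice", "user:bob"]},
}


def friends_of(key):
    """Forward direction: one key lookup per embedded identifier.

    The 'in store' check guards against identifiers left dangling by a delete.
    """
    return [store[f]["name"] for f in store[key]["friends"] if f in store]


def who_lists_as_friend(key):
    """Reverse direction: nothing is stored, so all n aggregates are scanned."""
    return sorted(k for k, agg in store.items() if key in agg["friends"])


print(friends_of("user:alice"))          # prints: ['Bob']
print(who_lists_as_friend("user:bob"))   # prints: ['user:alice', 'user:carol']
```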
36. Storing data in Graphs
Graph theory was pioneered by Euler in the 18th century and has received
multidisciplinary contributions over the centuries
• Facebook, Google and Twitter have centered their business models
around their own proprietary distributed graph technologies
Facebook TAO
Twitter FlockDB
Graph databases store information in ways that much more closely
resemble the ways the world is organized and the ways humans "think
about" data.
Top 10 Gartner IT technologies in 2013: "[..] are designed to support
new transaction, interaction and observation use cases involving web
scale, mobile, cloud and clustered environments"
37. From Relational to Graph based Modeling
Graph DBs treat relationships as first-class abstractions of the data
model
• A graph contains nodes and relationships
• Nodes contain properties (key-value pairs)
• Relationships are named, directed
and always have a start and end
node
• Relationships can also contain
properties
A Graph -[:RECORDS_DATA_IN]-> Nodes -[:WHICH_HAVE]-> Properties.
Nodes -[:LINKED_BY]-> Relationships
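The model described above is small enough to sketch as two Python classes. This is an illustration of the property graph idea only, not Neo4j's API; the class names, the relate helper and the sample data are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(eq=False)  # identity-based equality avoids recursing through cycles
class Node:
    properties: Dict[str, Any] = field(default_factory=dict)
    outgoing: List["Relationship"] = field(default_factory=list)


@dataclass(eq=False)
class Relationship:
    name: str    # relationships are named
    start: Node  # and directed: they always have a start node
    end: Node    # and an end node
    properties: Dict[str, Any] = field(default_factory=dict)


def relate(start: Node, name: str, end: Node, **properties) -> Relationship:
    """Create a named, directed relationship and attach it to its start node."""
    rel = Relationship(name, start, end, properties)
    start.outgoing.append(rel)
    return rel


botticelli = Node({"name": "Sandro Botticelli"})
primavera = Node({"title": "La Primavera"})
relate(botticelli, "PAINTED", primavera, medium="tempera")

# Traversal follows object references directly, with no index lookup.
for rel in botticelli.outgoing:
    print(rel.name, rel.end.properties["title"], rel.properties)
```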
38. From Relational to Graph based Modeling
Shake an RDBMS while keeping all the relationships, and you’ll see a
graph
Where RDBMSs are optimized for aggregated data, Graph Databases
are optimized for highly connected data
39. Traversing Map Performances
Friend of Friend (FoF) problem: for any person in a social network,
look for a route to some other person in the graph at most depth=N
hops away.
For a social network containing 1,000,000 people, each with ~50 friends,
the results (*) show that graph databases are the best choice
Depth   RDBMS Execution Time (s)   Neo4j (s)   Returned Records
2       0.016                      0.01        ~2500
3       30.267                     0.168       ~110,000
4       1543.505                   1.359       ~600,000
5       Unfinished                 2.132       ~800,000
(*) Graph Databases, O’Reilly - To Appear
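The FoF question is a depth-limited traversal. The sketch below shows it as a breadth-first search over in-memory adjacency lists; it is not the benchmark code behind the figures above, and the four-person network is made up.

```python
from collections import deque


def within_depth(friends, start, target, max_depth):
    """Return True if target can be reached from start in at most max_depth hops."""
    if start == target:
        return True
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the allowed depth
        for friend in friends.get(person, ()):
            if friend == target:
                return True
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, depth + 1))
    return False


# Made-up chain: ann - bob - cid - dee
friends = {
    "ann": ["bob"],
    "bob": ["ann", "cid"],
    "cid": ["bob", "dee"],
    "dee": ["cid"],
}
print(within_depth(friends, "ann", "dee", 2))  # prints: False
print(within_depth(friends, "ann", "dee", 3))  # prints: True
```

Each hop only touches the neighbours of the current person, so the cost depends on the part of the graph actually visited rather than on the total number of people.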
41. Neo4j Graph DB
• intuitive, using a graph model for data representation
• reliable, with full ACID transactions
• durable and fast, using a custom disk-based, native storage engine
• massively scalable, up to several billion nodes/relationships/properties
• highly-available, when distributed across multiple machines
• expressive, with a powerful, human readable graph query language
• fast, with a powerful traversal framework for high-speed graph queries
• embeddable, with a few small jars
• simple, accessible by a convenient REST interface or an object-oriented Java API
42. Spring Data and Neo4J
Promotes POJO based development for the Graph Database Neo4j.
It maps annotated entity classes to the Neo4j Graph Database with
advanced mapping functionality.
Seamless integration of the Cypher Query Language
43. Spring Data Neo4j
It is possible to derive queries for domain entities from finder method
names like Iterable<T>
@Indexed fields will be converted into index lookups in the start
clause, navigation along relationships will be reflected in the match
clause, and properties with operators will end up as expressions in the
where clause
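The general idea of deriving a query from a method name can be shown with a toy parser. This is not Spring Data Neo4j's implementation: the derive_query function, the method name findByNameAndCity, the Person label and the shape of the generated Cypher-like text are all assumptions, and the sketch uses a plain MATCH/WHERE form rather than the start-clause index lookups mentioned above.

```python
import re


def derive_query(method_name: str, label: str) -> str:
    """Turn a 'findByXAndY' style name into a Cypher-like query string (toy version)."""
    if not method_name.startswith("findBy"):
        raise ValueError("only findBy... method names are handled in this sketch")
    # Split the remainder on 'And' when it is followed by a capitalised property name.
    parts = [p for p in re.split(r"And(?=[A-Z])", method_name[len("findBy"):]) if p]
    if not parts:
        raise ValueError("no property names found in method name")
    props = [p[0].lower() + p[1:] for p in parts]
    # Each property becomes an equality expression in the where clause.
    conditions = " AND ".join(f"n.{p} = ${p}" for p in props)
    return f"MATCH (n:{label}) WHERE {conditions} RETURN n"


print(derive_query("findByNameAndCity", "Person"))
```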