The following is a technical briefing to U.S. EPA's Chief Data Scientist on open data information architecture, the use of Linked Data, and the EPA Linked Data Management Service. The briefing was held in February 2016 and was educational in nature.
3 Round Stones Briefing to U.S. EPA's Chief Data Scientist on Open Data
1. Bernadette Hyland
CEO & co-founder
11911 Freedom Drive, Suite 850
Reston, VA 20190
Tel. +1-571-331-3758
bhyland@3RoundStones.com
@BernHyland
info@3RoundStones.com
@3RoundStones
Extend Your Reach.
Linked Data for
Smarter Decisions.
Follow up information prepared for
Robin Thottungal, Chief Data Scientist / Director of Analytics
US Environmental Protection Agency - Feb 26, 2016
2. Today’s reality at EPA
»Tens of thousands of sources
»Many formats - JSON, XML,
CSV, PDF, PPT, SHP, SHX, text,
binary…
»Thousands of data silos
»No single source of truth
»Varied interpretations
»Brittle interfaces - lack of
interoperability
Image Credit: Smart Data Collective
3. Wide Variety of Data at EPA
Image Credit: MarkLogic, see http://www.marklogic.com/resources/marklogic-semantics-datasheet/resource_download/datasheets/
4. Credit: Frederick Giasson, Data Scientist & Software Developer,
http://fgiasson.com/blog/index.php/2014/07/23/big-structures-where-the-semantic-web-meets-artificial-intelligence/
Potential at EPA …
• Findable data
• Accessible data
• Interoperable data
• Re-usable data
• Shared context
• Data Platforms
(HDFS, NoSQL)
5. Linked Data is helping to extend & augment
EPA’s significant investment in enterprise relational technologies
How?
By leveraging NoSQL Data Platforms that rigorously adhere to
international data interoperability standards. *
* Relevant international data exchange standards are published by the
W3C, OGC, IEEE
Image Credit: MarkLogic
6. Graph databases, as a subset of NoSQL databases, are the
most efficient way to look at the relationships between data
items, patterns of relationships and interactions.
Image Credit: Cray, see http://www.cray.com/blog/graph-databases-101/
Graph Databases 101
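As a rough illustration of why a graph model suits questions about relationships, the sketch below stores a handful of relationships as an adjacency mapping and walks them breadth-first. The facility, permit, chemical and program names are made up for illustration and are not EPA data; a real graph database would add indexing, persistence and a query language on top of this idea.

```python
from collections import deque

# Hypothetical relationships, stored as node -> list of (relationship, node).
edges = {
    "facility:A": [("reports", "chemical:benzene"), ("holds", "permit:123")],
    "facility:B": [("reports", "chemical:benzene")],
    "chemical:benzene": [("regulated_under", "program:TRI")],
    "permit:123": [],
    "program:TRI": [],
}

def reachable(start):
    """Return every node reachable from start, in breadth-first order."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _relationship, target in edges.get(node, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen

print(reachable("facility:A"))
```

Following relationships is a direct lookup per hop here, whereas a relational layout would typically need one join per hop.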
7. Hadoop Integration
»While over 90% of the world’s data has been created in the last two
years, EPA has a tremendous variety of data that requires the “right tool for
the job”
»Historic data (“short, wide, complex data”) vs.
»Granular sensor & GIS data (“long skinny data”)
»Core mission-based systems with robust historic data, includes:
»Toxics Release Inventory (TRI)
»Facilities Registry (FRS)
»RCRA Handler
»EPA’s enterprise information architecture should include a data
platform that leverages Hadoop: HDFS and MapReduce, and
accommodates EPA’s robust data landscape.
»Must support modern, open source tools for application development,
visualizations, crowdsourcing, and deployment on the Web
8. One option - MarkLogic
Integrates Hadoop Ecosystem &
EPA’s Robust Data Landscape
Image Credit: http://www.marklogic.com/what-is-marklogic/features/hadoop-integration/
9. EPA Robust Data Ecosystem is adaptable
using a Linked Data Approach
» Makes data integration faster and easier
» By using a global addressing scheme, HTTP URIs.
» Uses semantics to “glue” together data faster.
» Common semantic definitions link traditional relational
models.
» No more out-of-date documentation when using standard
vocabularies.
» Robust search and discovery by leveraging the semantic graph.
» Scales to the Web!
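The "global addressing scheme" point can be sketched in a few lines: when two data silos use the same HTTP URI for the same thing, combining them needs no mapping table. This is a minimal sketch using plain Python tuples rather than an RDF library; every URI below is a hypothetical example.org address, not a real EPA identifier.

```python
# Hypothetical URIs; none of these identify real EPA resources.
FACILITY = "http://example.org/facility/110000000001"
NAME = "http://example.org/vocab/name"
RELEASED = "http://example.org/vocab/released"

# Two "silos", each a set of (subject, predicate, object) statements.
registry = {(FACILITY, NAME, "Example Plant")}
releases = {(FACILITY, RELEASED, "http://example.org/chemical/benzene")}

# Both silos use the same URI for the facility, so integration is a set union.
merged = registry | releases

def describe(subject, triples):
    """Collect predicate -> object for one subject (assumes one value per predicate)."""
    return {p: o for s, p, o in triples if s == subject}

print(describe(FACILITY, merged))
```

The design choice to notice is that the identifier itself carries the join key; neither silo had to know about the other in advance.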
10. All modern data platforms
deployed at EPA should
»Support options for data modeling - Linked Data (JSON-LD, RDF), SQL (JSON, XML)
»Natively store and query documents, blobs and structured data.
»Standards-based query interface across documents and data, e.g., full support for
SPARQL 1.1
»Offer enterprise functionality including high availability & disaster recovery, scalability &
elasticity, ACID transactions
»Be deployable on a FedRAMP-certified cloud provider with certified controls for security, high
availability, and disaster recovery
»Scale to billions of statements, triples, etc.
»Store unstructured data across clusters like Hadoop, making it easy to move data
partitions.
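To make the JSON-LD modeling option concrete, here is a minimal sketch of one record expressed as JSON-LD using only the Python standard library. The vocabulary URIs, the identifier and the field values are hypothetical; the point is only that the `@context` maps ordinary JSON keys to globally unique property URIs.

```python
import json

# Hypothetical vocabulary and identifiers, for illustration only.
record = {
    "@context": {
        "name": "http://example.org/vocab/name",
        "state": "http://example.org/vocab/state",
    },
    "@id": "http://example.org/facility/110000000001",
    "name": "Example Plant",
    "state": "VA",
}

text = json.dumps(record, indent=2)
print(text)

# A consumer that ignores the context can still read the record as plain JSON.
loaded = json.loads(text)
print(loaded["name"])
```

Because the document is still ordinary JSON, existing JSON tooling keeps working while Linked Data tooling can interpret the same bytes as RDF statements.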
11. »Much but not all of
EPA’s data is well
suited for a Linked
Data approach.
»Linked Data is based
on a 20+ year old idea:
a system of linked
information systems
Book cover: Linked Data: Structured data on the Web, by David Wood, Marsha Zaidman and Luke Ruth, with Michael Hausenblas; foreword by Tim Berners-Lee (Manning)
16. Linked Data
Platform is in QA
now! https://usepa.3roundstones.net
Anticipated to
move to production
in 2016.
17. shared innovation™
Search for facilities
where we live. Unlike
many EPA Web portals,
linked data is human
AND machine readable
data. No screen
scraping is required.
Encourages re-use
(discourages data silos)
18. The EPA Linked
Data service
CONNECTS data
silos, and provides
familiar map and
table data views
19. Click to drill down
to pollution reports
that combine data
from 5 previously
unconnected data
silos.
22. Previously, only people
who employed complex
screen scraping
techniques could get at
this data. Now, EPA open
data is available using an
international data
standard, with one click!
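One common way a script obtains machine-readable data without screen scraping is HTTP content negotiation: it requests the same resource URI a browser would, but asks for an RDF serialization. The sketch below only builds such a request and does not send it. The resource URI is hypothetical, and whether a particular server honors the `Accept` header is an assumption to verify against that service.

```python
from urllib.request import Request

# Hypothetical resource URI; the path is an assumption, not a documented EPA URL.
resource_uri = "http://example.org/id/facility/110000000001"

# Ask for Turtle (an RDF serialization) instead of HTML via the Accept header.
request = Request(resource_uri, headers={"Accept": "text/turtle"})

print(request.get_method(), request.full_url)
print(request.get_header("Accept"))

# To actually fetch the data, pass the request to urllib.request.urlopen(request).
```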
37. Callimachus is a scalable Web application server for
publishing and consuming open data
Who uses it?
• Government, international publishers, healthcare / life sciences
What pain does Callimachus address?
• Integration of data silos where a graph approach is needed
• Rapid creation of visualizations, dashboards (mashups) & info graphics
• Less expensive solution to a data warehouse
Example apps?
• Collaborative knowledge management
• Publishing workflow
• Drug discovery / clinical trials
• Predictive Analytics
38. Callimachus is fanatical about data interoperability & portability
Supports:
• HTML5, XHTML5, CSS3, JavaScript
• XQuery, XProc, XPath, XSLT
• SPARQL 1.1 Query, Update, Federated Query,
Service Description, Property Paths, Graph Store
HTTP Protocol
• RDF/XML, RDF/Turtle, JSON-LD, SPARQL XML,
SPARQL JSON
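As a sketch of what the SPARQL 1.1 and SPARQL JSON items mean for a client, the example below builds a query URL in the style of the SPARQL 1.1 Protocol and flattens a response in the SPARQL 1.1 Query Results JSON Format. The endpoint address, the vocabulary URI and the response body are hypothetical and hand-written; nothing here is sent over the network or taken from a real Callimachus instance.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and vocabulary, for illustration only.
endpoint = "http://example.org/sparql"
query = (
    "SELECT ?facility ?name "
    "WHERE { ?facility <http://example.org/vocab/name> ?name } LIMIT 10"
)

# SPARQL 1.1 Protocol: a query may be sent as the "query" parameter of a GET request.
url = endpoint + "?" + urlencode({"query": query})

# A hand-written response in the SPARQL 1.1 Query Results JSON Format.
response_text = """
{
  "head": {"vars": ["facility", "name"]},
  "results": {"bindings": [
    {"facility": {"type": "uri", "value": "http://example.org/facility/1"},
     "name": {"type": "literal", "value": "Example Plant"}}
  ]}
}
"""

def rows(text):
    """Flatten SPARQL JSON results into a list of plain dicts."""
    data = json.loads(text)
    variables = data["head"]["vars"]
    return [
        {v: binding[v]["value"] for v in variables if v in binding}
        for binding in data["results"]["bindings"]
    ]

print(url)
print(rows(response_text))
```

Since the result format is standardized, the same `rows` helper would work against any endpoint that returns SPARQL JSON, which is the portability point the slide is making.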
39. Public
Application, Script or automated client
Web Browser
SPARQL endpoint, REST API, Resource URIs
Linked Data management system
located at a Tier 1 Cloud Provider
(FISMA compliant)
RDF Database
Registered developer
41. “Big Data Is Important, but
Open Data Is More Valuable”
As change agents, enterprise architects can help
their organizations become richer through
strategies such as open data.
David Newman, VP Research, Gartner