1. ResourceSync:
Web-Based
Resource
Synchronization
Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp
ResourceSync is funded by
The Sloan Foundation & JISC
ResourceSync – Herbert Van de Sompel
TICER Summer School, August 22 2012, Tilburg, The Netherlands
2. ResourceSync Core Team – NISO & OAI
Cornell University & OAI:
Bernhard Haslhofer, Carl Lagoze, Simeon Warner
Old Dominion University & OAI:
Michael L. Nelson
Los Alamos National Laboratory & OAI:
Martin Klein, Robert Sanderson, Herbert Van de Sompel
NISO:
Todd Carpenter, Nettie Lagace, Peter Murray
3. ResourceSync Technical Group
• Manuel Bernhardt, Delving B.V.
• Kevin Ford, Library of Congress
• Richard Jones, JISC
• Graham Klyne, JISC
• Stuart Lewis, JISC
• David Rosenthal, LOCKSS
• Christian Sadilek, Red Hat
• Shlomo Sanders, Ex Libris, Inc.
• Sjoerd Siebinga, Delving B.V.
• Ed Summers, Library of Congress
• Jeff Young, OCLC Online Computer Library Center
4. ResourceSync
ResourceSync: What & Why?
Problem Perspective & Conceptual Approach
Technical Details
Q&A
6. Synchronize What?
• Web resources – things with a URI that can be dereferenced and
are cache-able (no dependency on underlying OS, technologies
etc.)
• Small websites/repositories (a few resources) to large
repositories/datasets/linked data collections (many millions of
resources)
• That change slowly (weeks/months) or quickly (seconds), and
where latency needs may vary
• Focus on needs of research communication and cultural heritage
organizations, but aim for generality
7. Why?
… because lots of projects and services are doing synchronization
but have to resort to ad-hoc, case by case, approaches!
• Project team involved with projects that need this
• Experience with OAI-PMH: widely used in repos but
o XML metadata only
o Attempts at synchronizing actual content via OAI-PMH
(complex object formats, dc:identifier) not successful.
o Web technology has moved on since 1999
• Devise a shared solution for data, metadata, linked data?
8. Use Cases – The Basics
9. Use Cases - More
10. Out Of Scope (For Now)
• Bidirectional synchronization
• Destination-defined selective synchronization (query)
• Bulk URI migration
11. Use Case: arXiv Mirroring
• 1M article versions, ~800/day created or
updated at 8 PM US Eastern Time
• Metadata and full-text for each article
• Accuracy important
• Want low barrier for others to use
• Look for more general solution than current
homebrew mirroring (running with minor
modifications since 1994!) and occasional rsync
(filesystem layout specific, auth issues)
12. Use Case: DBpedia Live Duplication
• Average of 2 updates per second
• Want low latency => need a push technology
13. ResourceSync
ResourceSync: What & Why?
Problem Perspective & Conceptual Approach
Technical Details
Q&A
14. ResourceSync Problem
• Consideration:
• Source (server) A has resources that change over time: they
get created, modified, deleted
• Destinations (servers) X, Y, and Z leverage (some) resources
of Source A.
• Problem:
• Destinations want to keep in step with the resource changes
at Source A: resource synchronization.
• Goal:
• Design an approach for resource synchronization aligned
with the Web Architecture that has a fair chance of adoption
by different communities.
• The approach must scale better than recurrent HTTP
HEAD/GET on resources.
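To make the scaling concern concrete, here is a rough back-of-the-envelope sketch. It borrows the arXiv figures from the use case slide (about 1M resources, about 800 changes per day) and assumes, purely for illustration, that a destination checks the source once per day. The function names are hypothetical and the change list is only a stand-in for whatever change communication a source offers.

```python
def requests_per_day_polling(resource_count: int, polls_per_day: int) -> int:
    """HTTP HEAD requests needed to check every resource on each poll."""
    return resource_count * polls_per_day


def requests_per_day_change_list(changes_per_day: int, polls_per_day: int) -> int:
    """One GET per poll for a list of changes, plus one GET per changed resource."""
    return polls_per_day + changes_per_day


# Assumed figures: 1M resources, ~800 changes/day, one poll per day.
polling = requests_per_day_polling(1_000_000, 1)
change_list = requests_per_day_change_list(800, 1)
print(polling, change_list)
```

Under these assumptions the per-resource polling approach issues roughly three orders of magnitude more requests than fetching a list of changes and then only the changed resources.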
15. Destination: 3 Basic Synchronization Needs
1. Baseline synchronization – A destination must be able to
perform an initial load or catch up with a source
- avoid out-of-band setup
2. Incremental synchronization – A destination must have some
way to keep up-to-date with changes at a source
- subject to some latency; minimal: create/update/delete
- allow catching up after the destination has been offline
3. Audit – A destination should be able to determine whether it is
synchronized with a source
- subject to some latency
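The audit need can be sketched as a comparison between what the source says it holds and what the destination holds. This is a minimal illustration, not part of the ResourceSync draft: it assumes both sides can be summarized as a map from resource URI to a last-modified timestamp string, and the function name `audit` and the example URIs are hypothetical.

```python
from typing import Dict, List


def audit(source_state: Dict[str, str],
          destination_state: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {URI: last-modified} maps and report the differences."""
    # Resources the source has but the destination lacks.
    missing = sorted(uri for uri in source_state if uri not in destination_state)
    # Resources present on both sides whose timestamps disagree.
    stale = sorted(
        uri for uri, lastmod in source_state.items()
        if uri in destination_state and destination_state[uri] != lastmod
    )
    # Resources the destination still holds but the source no longer lists.
    extra = sorted(uri for uri in destination_state if uri not in source_state)
    return {"missing": missing, "stale": stale, "extra": extra}


source = {
    "http://example.com/res1": "2012-08-08T08:15:00Z",
    "http://example.com/res2": "2012-08-09T10:00:00Z",
}
destination = {
    "http://example.com/res1": "2012-08-01T00:00:00Z",
    "http://example.com/res3": "2012-07-01T00:00:00Z",
}
report = audit(source, destination)
in_sync = not any(report.values())  # True only when all three lists are empty
print(report, in_sync)
```

Because the source state itself is a snapshot, any such audit is only accurate up to the latency with which that snapshot was published.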
16. Source Capability 1: Describing Content
In order to advertise the resources that a source wants destinations
to know about, it may describe them:
o Publish an inventory of resource URIs and possibly
associated metadata
- Destination GETs the Content Description
- Destination GETs listed resources by their URI
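A destination-side sketch of reading such an inventory follows. It assumes the inventory is a plain Sitemap document, which is the format the framework slides later in this talk build on; the embedded XML, the example URIs, and the function name `list_resources` are illustrative only, and the actual HTTP GETs are left out so the example runs offline.

```python
import xml.etree.ElementTree as ET

# Namespace of the standard Sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Illustrative inventory; a destination would normally obtain this via GET.
EXAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
  </url>
</urlset>
"""


def list_resources(sitemap_xml: str):
    """Return (URI, lastmod or None) pairs for every <url> entry."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    resources = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        if loc is None:
            continue  # skip malformed entries without a location
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        resources.append((loc.strip(), lastmod))
    return resources


for uri, lastmod in list_resources(EXAMPLE_SITEMAP):
    # A real destination would now GET each URI; here we only list them.
    print(uri, lastmod)
```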
19. Source Capability 2: Communicating Change Events
In order to achieve lower latency, a source may communicate about
changes to its resources:
o 2.1. Change Set: Publish a list of recent change events
(create, update, delete resource)
- Destination acts upon change events, e.g. GETs created/
updated resources, removes deleted resources.
o 2.2. Push Change Set: Push a list of recent change events
(create, update, delete resource) towards (a) destination(s)
- Destination acts upon change events, e.g. GETs created/
updated resources, removes deleted resources.
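How a destination might act upon change events can be sketched as below. The event representation (a pair of change type and URI), the type names "created", "updated" and "deleted", and the `fetch` callback are assumptions made for this illustration; the slides do not fix a data model at this point. The same loop would apply whether the events were pulled from a Change Set or pushed to the destination.

```python
from typing import Callable, Dict, Iterable, Tuple

ChangeEvent = Tuple[str, str]  # (change type, resource URI), hypothetical shape


def apply_change_events(events: Iterable[ChangeEvent],
                        local_copy: Dict[str, bytes],
                        fetch: Callable[[str], bytes]) -> None:
    """Update local_copy in place, processing events in the order given."""
    for change_type, uri in events:
        if change_type in ("created", "updated"):
            # GET the current representation from the source.
            local_copy[uri] = fetch(uri)
        elif change_type == "deleted":
            # Remove the resource; ignore it if we never had it.
            local_copy.pop(uri, None)
        else:
            raise ValueError(f"unknown change type: {change_type!r}")


def fake_fetch(uri: str) -> bytes:
    """Stand-in for an HTTP GET so the sketch runs offline."""
    return f"content of {uri}".encode("utf-8")


local = {"http://example.com/res1": b"old", "http://example.com/res2": b"old"}
events = [
    ("updated", "http://example.com/res1"),
    ("deleted", "http://example.com/res2"),
    ("created", "http://example.com/res3"),
]
apply_change_events(events, local, fake_fetch)
print(sorted(local))
```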
25. Source Capability 3: Providing Access to Versions
In order to allow a destination to catch up with missed changes, a
source may support:
o 3.1. Historical Change Sets: Provide access to change events that
occurred prior to the ones listed in the current Change Set
o 3.2. Historical Content: Provide access to prior resource versions
30. Source Capability 4: Transferring Content
By default, content is transferred in response to a GET issued by a
destination against a URI of a source’s resource. But a source may
support additional mechanisms:
o 4.1. Dump: Publish a package of resource representations
and necessary metadata
- Destination GETs the Dump
- Destination unpacks the Dump
o 4.2. Alternate Content Transfer: Support alternative
mechanisms to optimize getting content, e.g. content via a
mirror site, only changes not the entire changed resource.
33. Source: Advertise Capabilities
A source needs to advertise the capabilities it supports to allow a
destination to discover them
• Some capabilities may be provided by a third party, not the
source itself
o e.g. Historical Change Sets, Historical Content
o But the source should still make those third party capabilities
discoverable - trust
34. ResourceSync
ResourceSync: What & Why?
Problem Perspective & Conceptual Approach
Technical Details
Q&A
35. So Many Choices
Push
DSNotify
OAI-PMH Pull
rsync
Crawl
OAI-ORE
RDFsync
WebDAV Col. Syn.
XMPP
Atom SWORD AtomPub
Sitemap RSS
SPARQLpush PubSubHubbub
SDShare XMPP
38. A Framework Based on Sitemaps
• Modular framework allowing selective deployment
• Sitemap is the core component throughout the
framework
o Introduce extension elements and attributes:
- In ResourceSync namespace (rs:) to
accommodate synchronization needs
- In XHTML namespace (xhtml:) mainly to
accommodate discovery needs
o Reuse Sitemap format for Change Sets (both
current and historical) and for manifest in Dump
39. Source Capabilities – Destination Needs
40. Source Capabilities – Destination Needs
41. Sitemap with Added Datetime
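The slide image is not preserved in this transcript. As a stand-in, the following sketch shows how a source might generate a Sitemap in which every entry carries a datetime in the standard `lastmod` element. It uses only the standard Sitemap namespace; the inventory, the URIs, and the function name `build_sitemap` are assumptions for illustration and are not taken from the draft specification.

```python
import xml.etree.ElementTree as ET
from typing import Dict

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(inventory: Dict[str, str]) -> str:
    """Serialize a {URI: last-modified datetime} map as a Sitemap document."""
    # Emit the Sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for uri, lastmod in sorted(inventory.items()):
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap({
    "http://example.com/res1": "2012-08-08T08:15:00Z",
    "http://example.com/res2": "2012-08-09T10:00:00Z",
})
print(sitemap_xml)
```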
42. Change Types: Extend lastmod, Use expires
43. Sitemap with lastmod and expires
44. Sitemap Discovery via robots.txt
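The slide image is not preserved here. The standard Sitemap protocol lets a site announce its Sitemap with a `Sitemap:` line in robots.txt, and a destination could discover it roughly as follows. The robots.txt content, the URL, and the function name are illustrative assumptions; fetching the file over HTTP is omitted so the sketch runs offline.

```python
from typing import List

# Illustrative robots.txt; a destination would GET this from the source's root.
EXAMPLE_ROBOTS_TXT = """\
# robots.txt for example.com
User-agent: *
Disallow: /private/
Sitemap: http://example.com/sitemap.xml
"""


def sitemap_urls_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs given in 'Sitemap:' lines, in order of appearance."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


print(sitemap_urls_from_robots(EXAMPLE_ROBOTS_TXT))
```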
45. Source Capabilities – Destination Needs
46. Change Set: An rs Typed Sitemap
47. More rs Extension Elements
48. Change Set with rs and xhtml Extensions
49. Change Set Discovery via Sitemap
50. Pushing Change Sets via XMPP PubSub
XMPP Publish-Subscribe: Client to Subscription Service,
Subscription Service to Client(s) communication
• One of the XMPP (Extensible Messaging and Presence Protocol)
extensions http://xmpp.org/extensions/xep-0060.html
• Apple Notifications based on XMPP PubSub
• Available tools, see http://xmpp.org/about-xmpp/technology-overview/pubsub/#impl-client
o XMPP Servers with PubSub support:
- ejabberd, OpenFire, Tigase, SleekXMPP
o XMPP libraries with PubSub support:
- Strophe (C, JavaScript), XMPP4R (Ruby), SleekXMPP
(Python), PubSub Client (Python)
51. Pushing Change Sets via XMPP PubSub
52. Change Set via XMPP
53. Push Change Set Discovery via Sitemap
54. Source Capabilities – Destination Needs
55. Discovering a Historical Change Set via a Current Change Set
56. Source Capabilities – Destination Needs
57. Discovering Historical Content – Link to Version Resource
59. Original Resources and Mementos
60. Bridge from Present to Past
61. Bridge from Past to Present
62. Memento Framework
63. Discovering Historical Content – Link to Memento TimeGate
64. Source Capabilities – Destination Needs
65. Dump
• Two formats currently under discussion:
o Format based on ZIP:
- Package content
- Add manifest (manifest.xml) expressed in
Sitemap format
- ZIP it up
o WARC files as used by the web archiving
community
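The ZIP-based option (package content, add a manifest in Sitemap format, ZIP it up) can be sketched as follows. This is an illustration under loud assumptions: the format was still under discussion, so the `rs:path` attribute placement, the placeholder namespace URI `http://example.org/hypothetical-rs-namespace#`, the `resources/N` file layout, and the function name `build_dump` are all hypothetical and not taken from the draft specification.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Placeholder only: NOT the real ResourceSync namespace URI.
RS_NS = "http://example.org/hypothetical-rs-namespace#"


def build_dump(resources: Dict[str, bytes]) -> bytes:
    """Package {URI: content} into a ZIP with a Sitemap-style manifest.xml."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("rs", RS_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, (uri, content) in enumerate(sorted(resources.items())):
            path = f"resources/{index}"  # assumed layout inside the package
            archive.writestr(path, content)
            url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
            # Hypothetical: record the URI-to-file mapping as an rs:path attribute.
            url.set(f"{{{RS_NS}}}path", path)
        archive.writestr("manifest.xml", ET.tostring(urlset, encoding="utf-8"))
    return buffer.getvalue()


dump = build_dump({
    "http://example.com/res1": b"first resource",
    "http://example.com/res2": b"second resource",
})

# Destination side: unpack the Dump and recover the URI-to-path mapping.
with zipfile.ZipFile(io.BytesIO(dump)) as archive:
    names = archive.namelist()
    manifest = ET.fromstring(archive.read("manifest.xml"))
    mapping = {
        url.findtext(f"{{{SITEMAP_NS}}}loc"): url.get(f"{{{RS_NS}}}path")
        for url in manifest
    }
    first = archive.read(mapping["http://example.com/res1"])
print(names, mapping)
```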
66. Mapping URI to File Path with rs:path
67. Manifest (manifest.xml) Expressed in Sitemap Format
68. Dump Discovery via Sitemap
69. Source Capabilities – Destination Needs
70. Alternate Location
71. Alternate Protocol, e.g. Obtain Changes Only
72. Timeline
• August 2012
o First draft spec shared for feedback with ResourceSync team
• September 2012
o In-person meeting of ResourceSync Team
o Revise spec, conduct experiments
o Solicit broad feedback
o Paper in D-Lib Magazine
• December 2012 – Finalize specification (?)
73. Pointers
• First draft spec:
http://www.openarchives.org/rs/0.1/resourcesync
• Simulator code on GitHub
http://github.org/resync/simulator
• NISO workspace
http://www.niso.org/workrooms/resourcesync/
• List for public comment coming soon