PhD defense
1. A Framework for Web Object Self-Preservation
A Ph.D. Defense
Chuck Cartledge
30 May 2014
2. A warning from Jeff Rothenberg
“Digital Information Lasts
Forever—Or Five Years,
Whichever Comes First.”
Jeff Rothenberg, Ensuring the Longevity of Digital Information, Scientific
American 272 (1995), 42 - 47.
3. A warning from William Arms
“Tomorrow we could see the
National Library of Medicine
abolished by Congress, Elsevier
dismantled by a corporate raider,
the Royal Society declared
bankrupt, or the University of
Michigan Press destroyed by a
meteor. All are highly unlikely,
but over a long period of time
unlikely events will happen.”
William Y. Arms, Preservation of Scientific Serials: Three Current
Examples, Journal of Electronic Publishing 5 (1999), no. 2.
4. Overview
• Warnings
• Preservation context
• Research questions
• Background and related
work
• Unsupervised Small-World
– Emergent behavior
– Graph theory
– Preservation
• Demonstration
• Questions and answers
5. Preservation in an analog age
• Benign neglect
– Don’t touch
– Keep away from sunshine
– Keep away from moisture
– Keep away from insects
• Last for hundreds of
years
Josie McClure picture taken Feb 30, 1907 at Poteau, I.T.
Fifteen years of age When this was taken weighed 140 lbs.
6. Preservation in a digital age
• Constant use
– Use often
– Exposure to lots of things
– Make lots of copies
– Monitor the integrity
• Last for ??? unknown
years
• This is a Brave New
World
Google image search, 31 March 2014, about 91,700,000 results (0.84 seconds)
7. Everything has a lifespan
• Exponential growth of
digital artifacts
• Representing increasing
portion of personal and
cultural heritage
• Short human lifetime to
manage data
• Potentially short
institutional lifetime
• Need to preserve
artifacts beyond human
lifespan and institutions
that create and house
artifacts
Dissertation, section 1.3
8. Research questions
• Can web objects (WOs)
be constructed to
outlive the people and
institutions that
created them?
• Can we leverage
aspects of naturally
occurring networks and
group behavior for
preservation?
A WO is a digital object that lives on the Web. A WO is a fundamental
element in this dissertation.
9. Unsupervised Small-World (USW) is at the nexus of multiple disciplines
Mathematical structures used to model pairwise relations between objects
Ensuring that digital information of continuing value remains accessible and usable
Movement of the inanimate
11. Emergent behavior: model
• Craig Reynolds – basis of herd
and flock behavior in
computer animations
– 3 rules
• Collision avoidance
• Velocity matching
• Flock centering
– No central control, everything
based on local knowledge only
• Simple rules
– Complex behavior
– Emergent behavior
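The three rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Reynolds' implementation: the `Boid` class, the `step` function, the neighborhood radius, and the rule weights are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Boid:
    x: float
    y: float
    vx: float
    vy: float

def step(boids, radius=10.0, min_dist=1.0,
         w_center=0.01, w_match=0.05, w_avoid=0.05):
    """Advance every boid one tick using only its local neighbors."""
    updated = []
    for b in boids:
        # Local knowledge only: neighbors within `radius` of this boid.
        near = [o for o in boids
                if o is not b
                and (o.x - b.x) ** 2 + (o.y - b.y) ** 2 <= radius ** 2]
        vx, vy = b.vx, b.vy
        if near:
            n = len(near)
            # Flock centering: steer toward the neighbors' average position.
            vx += w_center * (sum(o.x for o in near) / n - b.x)
            vy += w_center * (sum(o.y for o in near) / n - b.y)
            # Velocity matching: steer toward the neighbors' average velocity.
            vx += w_match * (sum(o.vx for o in near) / n - b.vx)
            vy += w_match * (sum(o.vy for o in near) / n - b.vy)
            # Collision avoidance: move away from neighbors that are too close.
            for o in near:
                if (o.x - b.x) ** 2 + (o.y - b.y) ** 2 < min_dist ** 2:
                    vx += w_avoid * (b.x - o.x)
                    vy += w_avoid * (b.y - o.y)
        updated.append(Boid(b.x + vx, b.y + vy, vx, vy))
    return updated
```

No boid reads global state; each update depends only on the neighbors inside its own radius, which is the property the USW approach borrows.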
Craig W. Reynolds, Computer Animation with Scripts and
Actors, ACM SIGGRAPH, vol. 16, ACM, 1982, pp. 289 - 296.
Images (flock centering, velocity matching, collision avoidance): http://www.red3d.com/cwr/boids/
12. Emergent behavior: communication
• Need to know what
my neighbors are
doing
• Need to tell
neighbors what I
am doing
• A school of fish does not have a Dagon that controls it
Dissertation, section 5.3
14. Preservation: primitives
William Y. Arms, Digital Libraries, The MIT Press, December 1999
[Figure: the preservation primitives replication, emulation, and migration, illustrated with png, tiff, eps, and bmp files.]
15. Preservation: OAIS model
• Provides standard model and terminology for archival systems
• Terms of interest
– Submission Information Package
– Ingest
– Data Management
– Archival Storage
– Access
– Dissemination Information Package
Council of the Consultative Committee for Space Data Systems (CCSDS), Reference Model for an Open
Archival Information System (OAIS), Tech. report, Consultative Committee for Space Data Systems 650.0-M-2,
Magenta Book, 2012.
Carl Lagoze, Herbert Van de Sompel, Pete Johnston, Michael Nelson, Robert Sanderson, and Simeon
Warner, ORE User Guide - Resource Map Implementation in Atom, Tech. report, Open Archives Initiative,
2004.
17. Graph theory: definitions
• Graph: G = (V,E)
• Graph can be connected or
disconnected
• Some graph metrics work
only with connected graphs
and not with disconnected
graphs
– Clustering coefficient (C(G))
– Average path length (L(G))
– Degree distribution
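As a minimal sketch of the two metrics, the code below computes C(G) as the average of the local clustering coefficients and L(G) as the mean shortest-path length over all vertex pairs. The function names and the adjacency-dictionary representation are assumptions made for the example, and vertices with fewer than two neighbors are counted as having a local coefficient of 0, which is one common convention.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(adj):
    """C(G): mean over vertices of (links among neighbors) / (possible links)."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # assumed convention: local coefficient is 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """L(G): mean shortest-path length; defined only for connected graphs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) != len(adj):
            raise ValueError("graph is disconnected; L(G) is undefined")
        total += sum(dist.values())
        pairs += len(adj) - 1
    return total / pairs

# A triangle (a, b, c) with one pendant vertex d attached to c.
example = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

The `ValueError` makes the slide's caveat concrete: the average path length has no finite value once the graph is disconnected.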
Reka Albert and Albert-Laszlo Barabasi, Statistical Mechanics of
Complex Networks, Reviews of Modern Physics 74 (2002), no. 1, 47.
18. Graph theory: Watts and Strogatz small-world
Duncan J. Watts and Steven H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature 393 (1998), 440 - 442.
Stanley Milgram, The Small-World Problem, Psychology Today 2 (1967), no. 1, 60 - 67.
19. Small-world graphs are common
                               Actual            Random graph
Graph        Nodes  Edges      C(G)    L(G)      C(G)     L(G)
WECC         4941   6594       0.0801  18.99     0.00054  8.7
C. elegans   248    511        0.21    2.87      0.05     2.62
Email        148    ~500,000   0.44    2.25      0.11     2.0
Ake J Holmgren, Using Graph Models to Analyze the Vulnerability of Electric Power Networks, Risk Analysis 26
(2006), no. 4, 955 - 969.
Lav R Varshney, Beth L Chen, Eric Paniagua, David H Hall, and Dmitri B Chklovskii, Structural Properties of the
Caenorhabditis elegans Neuronal Network, PLoS computational biology 7 (2011), no. 2, e1001066.
Shinako Matsuyama and Takao Terano, Analyzing the ENRON Communication Network Using Agent-Based
Simulation, Journal of Networks 3 (2008), no. 7.
20. Small-world: high C(G) and low L(G)
• The ubiquitous presence of small-world graphs
points to something inherently “correct” and
desirable about them.
Symbol Meaning
k Degree
n Order of the graph
Dissertation, section 5.2.3
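One way to read "high C(G) and low L(G)" is to compare each network with its random-graph counterpart. The sketch below does that arithmetic with the values reported for the three example networks; the `ratios` helper and the idea of summarizing the comparison as two ratios are assumptions made for illustration, not a measure defined in the slides.

```python
# (C_actual, L_actual, C_random, L_random), copied from the example networks.
networks = {
    "WECC": (0.0801, 18.99, 0.00054, 8.7),
    "C. elegans": (0.21, 2.87, 0.05, 2.62),
    "Enron e-mail": (0.44, 2.25, 0.11, 2.0),
}

def ratios(c_actual, l_actual, c_random, l_random):
    """Return (clustering ratio, path-length ratio) of actual vs. random."""
    return c_actual / c_random, l_actual / l_random

for name, values in networks.items():
    c_ratio, l_ratio = ratios(*values)
    print(f"{name}: C ratio {c_ratio:.1f}, L ratio {l_ratio:.2f}")
```

In all three cases the clustering coefficient is at least about four times that of the random graph, while the average path length stays within a small factor of the random graph's.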
21. USW is at the nexus of multiple disciplines
Creation of small-world graphs that are robust and resilient
Meet fundamental requirements of replication, migration, and data management
WO’s use of emergent behavior to create, monitor, and optimize the USW system
22. Euclidean geometry
• To draw a straight line from any point
to any point.
• To produce [extend] a finite straight
line continuously in a straight line.
• To describe a circle with any center
and distance [radius].
• That all right angles are equal to one
another.
• That, if a straight line falling on two
straight lines make the interior angles
on the same side less than two right
angles, the two straight lines, if
produced indefinitely, meet on that
side on which are the angles less than
the two right angles.
The sum of angles A, B, and
C is equal to 180 degrees
Euclid of Alexandria, The Elements, Alexandria, 300 BCE.
23. Non-Euclidean geometries
• Non-Euclidean geometry
– To draw a straight line from any point to any
point.
– To produce [extend] a finite straight line
continuously in a straight line.
– To describe a circle with any center and
distance [radius].
– That all right angles are equal to one another.
– That, if a straight line falling on two straight
lines make the interior angles on the same side
less than two right angles, the two straight
lines, if produced indefinitely, meet on that side
on which are the angles less than the two right
angles.
• Spherical geometry
– Two lines at right angles to the same line can
meet
– Triangle angle sums range from 180 to 540 degrees
– Great circles play the role of straight lines
24. Digital library world
• Digital libraries
– The technical framework exists within a legal
and social framework
– Understanding of digital library concepts is
hampered by terminology
– The underlying architecture should be separate
from the content stored in the library
– Names and identifiers are the basic building
block for the digital library
– Digital library objects are more than collections
of bits
– The digital library object that is used is
different from the stored object
– Repositories must look after the information
they hold
– Users want intellectual works, not digital
objects
• Basic digital library tenets
Robert Kahn and Robert Wilensky, A Framework for Distributed Digital Object Services,
International Journal on Digital Libraries 6 (2006), no. 2, 115 - 123.
William Y. Arms, Key Concepts in the Architecture of the Digital Library, D-Lib Magazine 1
(1995), no. 1.
25. Digital library worlds of possibilities
What if there were no repositories?
26. “No Repositories” → USW
• No global
knowledge
– No omnipotent
enforcer
– No omnipresent
monitor
• Opportunistic
preservation
• Self-describing
Web Objects
32. USW interpretation of flocking
Craig Reynolds’ “boids”   USW interpretation
Flock centering           Move with friends to new hosts
Velocity matching         Matching number of copies/family members
Collision avoidance       Each WO has a unique URI
Dissertation, Chapter 2
33. Building a USW graph
• Graph
exploration (b)
• Choosing
connections
• Detecting loss
Dissertation, Chapter 5
34. WOs wandering in the USW graph
• Wandering WO is
“introduced” to an
existing WO
• If a connection is not
made, then the wandering WO
attempts a connection with
another existing WO
• Process is repeated
until a connection is
made
• No global knowledge
– No omnipotent
enforcer
– No omnipresent
monitor
• No repositories
Dissertation, Chapter 5
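The wandering process can be sketched as a walk that uses only what each visited WO knows about its own friends. This is a simplified illustration, not the dissertation's algorithm: the `wander` function, the dictionary-of-sets graph, and the `accept` callback (which stands in for whatever rule decides that a connection is made) are all hypothetical.

```python
def wander(graph, newcomer, first_contact, accept):
    """Walk from WO to WO until `accept` lets the newcomer connect.

    graph: dict mapping each WO to the set of its friends.
    accept: hypothetical callback, accept(newcomer, candidate) -> bool.
    Returns (friend or None, list of WOs visited in order).
    """
    visited = []
    to_visit = [first_contact]  # the WO the newcomer is "introduced" to
    while to_visit:
        candidate = to_visit.pop(0)
        visited.append(candidate)
        if accept(newcomer, candidate):
            # Record the connection in both directions.
            graph.setdefault(newcomer, set()).add(candidate)
            graph[candidate].add(newcomer)
            return candidate, visited
        # No connection: learn about other WOs from the candidate's friends.
        for friend in graph[candidate]:
            if friend not in visited and friend not in to_visit:
                to_visit.append(friend)
    return None, visited  # every reachable WO declined
```

The sketch stops when it runs out of WOs to try, whereas the slide describes repeating until a connection is made; it never consults a global list of WOs, only the friends of the WO it is currently visiting.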
35. USW friend selection process
• Selection from possible sets
– WOset: WOs connected to candidate WO
– visitedSet: WOs that the wandering WO has explored
– toBeVisitedSet: WOs that the wandering WO has
discovered
• Selection approaches
– Random from visitedSet ∪ toBeVisitedSet
– FIFO from visitedSet ∪ toBeVisitedSet
– LIFO from visitedSet ∪ toBeVisitedSet
– Preferentially attach to WOset then random for remaining
Dissertation, section 6.7.5
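The four selection approaches can be sketched as follows. The function names, the string labels for each approach, and the reading of FIFO and LIFO as "earliest discovered first" and "latest discovered first" are assumptions made for the example; the slide does not specify these details.

```python
import random

def candidate_pool(visited_set, to_be_visited_set):
    """Union of the two sets, kept in discovery order without duplicates."""
    return list(dict.fromkeys(list(visited_set) + list(to_be_visited_set)))

def select_friends(approach, count, visited_set, to_be_visited_set,
                   wo_set=(), rng=None):
    """Pick up to `count` friends using one of the four approaches."""
    rng = rng or random.Random()
    pool = candidate_pool(visited_set, to_be_visited_set)
    if approach == "random":
        return rng.sample(pool, min(count, len(pool)))
    if approach == "fifo":
        return pool[:count]        # earliest discovered first
    if approach == "lifo":
        return pool[::-1][:count]  # latest discovered first
    if approach == "preferential":
        # Attach to the candidate's own connections first, then fill randomly.
        chosen = list(dict.fromkeys(wo_set))[:count]
        rest = [wo for wo in pool if wo not in chosen]
        extra = min(count - len(chosen), len(rest))
        return chosen + rng.sample(rest, extra)
    raise ValueError(f"unknown approach: {approach}")
```

Passing a seeded `random.Random` as `rng` makes the random and preferential approaches repeatable, which helps when comparing the graphs each approach produces.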
38. Robustness of USW graphs
• Definition: able to continue when
damaged
• Attack vs. failure
– Intentional vs. random
• Component selection
– Vertex
– Edge
• Selection attribute
– Degree
– Betweenness
• Attribute value
– High
– Low
• Attack profile notation: A{D|E|V}{H|L}
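As a small illustration of the notation, the sketch below implements only the two degree-based profiles, ADH and ADL, on an adjacency dictionary. The function names are hypothetical, ties are broken alphabetically as an arbitrary choice, and the betweenness-based profiles are deliberately left out because they need a betweenness computation that is beyond a short example.

```python
def select_target(adj, profile):
    """Pick the vertex to remove under a degree-based attack profile."""
    if profile not in ("ADH", "ADL"):
        raise NotImplementedError("only degree-based profiles are sketched")
    pick = max if profile == "ADH" else min
    # sorted() fixes the tie-break order; max/min keep the first extreme seen.
    return pick(sorted(adj), key=lambda v: len(adj[v]))

def remove_vertex(adj, victim):
    """Return a new graph with `victim` and its incident edges removed."""
    return {v: nbrs - {victim} for v, nbrs in adj.items() if v != victim}

# A star: hub "h" connected to leaves "a", "b", "c".
star = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h"}}
damaged = remove_vertex(star, select_target(star, "ADH"))
```

On the star, a single ADH deletion isolates every leaf, while an ADL deletion removes one leaf and leaves the rest connected.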
[Figure: sample graph]
Charles L. Cartledge and Michael L. Nelson, Connectivity Damage to a Graph
by the Removal of an Edge or Vertex, Tech. report, arXiv 1103.3075, ODU CS
Dept.
39. Different attack profile selections
[Figure: six panels showing the selection made by each attack profile: AEH, AEL, AVH, AVL, ADH, ADL.]
Charles L. Cartledge and Michael L. Nelson, Connectivity Damage to a Graph
by the Removal of an Edge or Vertex, Tech. report, arXiv 1103.3075, ODU CS
Dept.
40. Four attacks using AEL profile
[Figure: four panels, Deletion #1 through Deletion #4.]
41. Our damage vs. Albert, Jeong, and Barabasi’s damage
s = Reka Albert, Hawoong Jeong, and Albert-Laszlo Barabasi, Error and Attack Tolerance of
Complex Networks, Nature 406 (2000), no. 6794, 378 - 382.
50 … 5 20 … 20
16 … 1 10 … 10
42. Measuring damage
• Desired characteristics
– Different fragmentation cases result
in different values
– Useful without additional graph
state information
Charles L. Cartledge and Michael L. Nelson, Connectivity Damage to a Graph by the Removal of an Edge or
Vertex, Tech. report, arXiv 1103.3075, ODU CS Dept.
Dissertation Appendix D
44. Global A{DV}H damage (100 nodes)
• 100 node graph
• Execution time:
~36 hours
• Attacker has total
knowledge of the
graph
• Attacker has
unrestricted
resources to
damage the graph
• Results:
– Small-world: the most connected component is not the most valuable
– Random and scale-free: degree does not make a difference
45. Attack profile efficacy on sample graph
Attack profile   Attacks                      Efficacy
AEdge High       The core of the graph        1.43
AEdge Low        The periphery of the graph   1.00
AVertex High     The core of the graph        1.42
AVertex Low      The periphery of the graph   1.00
ADegree High     The core of the graph        1.40
ADegree Low      The periphery of the graph   1.00
• If the attacker's goal is to disconnect the graph by repeated use of
the same attack profile, then the most effective profiles, in order,
are: AEH, AVH, and ADH.
• HTTP/HTML does not support AE* attack profiles
Dissertation, section 5.6.6
46. Detecting loss of family members
• Each “active maintainer” WO
checks its family’s status
– Check family member
accessibility
– Check friend accessibility
• If family member is lost, use
friends to select candidate
host
• If too few candidate hosts, use
friends to explore and discover
new hosts
Dissertation, section 6.8
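One maintenance pass by an active maintainer might look like the sketch below. It is an illustration only: `maintenance_pass`, `host_of`, the `is_accessible` callback, and the `min_candidates` threshold are hypothetical names, and the rule that a candidate host is any friend's host not already holding a live family member is an assumption.

```python
from urllib.parse import urlparse

def host_of(uri):
    """Host part of a WO's URI, e.g. 'a.example' for 'http://a.example/wo'."""
    return urlparse(uri).netloc

def maintenance_pass(family, friends, is_accessible, min_candidates=1):
    """Return (lost family members, candidate hosts, whether to explore)."""
    # Check family member accessibility.
    lost = [uri for uri in family if not is_accessible(uri)]
    live_family_hosts = {host_of(uri) for uri in family if uri not in lost}
    # Check friend accessibility and use live friends to find candidate hosts.
    friend_hosts = {host_of(uri) for uri in friends if is_accessible(uri)}
    candidates = sorted(friend_hosts - live_family_hosts)
    # Too few candidates for the replacements needed: explore for new hosts.
    needed = max(min_candidates, len(lost))
    explore = bool(lost) and len(candidates) < needed
    return lost, candidates, explore
```

In a deployment `is_accessible` would presumably issue an HTTP request; it is left as a callback here so the sketch stays independent of any network access.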
48. When to make family members?
• What is a copy?
• Who makes the copies?
• How many to make? Answer:
defined by originating domain
– 0 to start
– Soft lower limit (c_soft)
– Hard upper limit (c_hard)
• Where to make them?
– Distributed across known hosts
– Too many or too few hosts
• When to make them?
Norman Paskin, On Making and Identifying a Copy, D-Lib Magazine 9 (2003), no. 1.
Henry M. Gladney and John L. Bennett, What Do We Mean by Authentic? What's the
Real McCoy?, D-Lib Magazine 9 (2003), no. 7/8.
49. USW preservation definitions
• Hierarchy of family WOs
– Progenitor – initial WO
– Copies – more recent WO copies
– Each WO is timestamped with creation time
• WO roles
– Active maintainer – eldest WO charged with
making copies and related housekeeping
– Passive maintainer – all other WOs
• Order of precedence
– If progenitor is accessible then it is the active
maintainer
– If declared active maintainer is accessible then it is
the active maintainer
– Otherwise, WO declares itself active maintainer
• If family is disconnected then multiple active
maintainers are possible until reconnection then the
eldest WO declares itself active maintainer
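The order of precedence can be written as a short decision function. The sketch below is an illustration under stated assumptions: the `WO` record, the `is_accessible` callback, and both function names are hypothetical, and creation timestamps are plain numbers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WO:
    uri: str
    created: float            # creation timestamp
    is_progenitor: bool = False

def active_maintainer(me, family, declared, is_accessible):
    """Decide, from the point of view of `me`, who the active maintainer is."""
    progenitor = next((wo for wo in family if wo.is_progenitor), None)
    if progenitor is not None and is_accessible(progenitor):
        return progenitor      # 1. an accessible progenitor always wins
    if declared is not None and is_accessible(declared):
        return declared        # 2. otherwise the declared active maintainer
    return me                  # 3. otherwise this WO declares itself

def resolve_after_reconnect(claimants):
    """After a partition heals, the eldest claimant remains active."""
    return min(claimants, key=lambda wo: wo.created)
```

Because each WO evaluates the rule with its own view of accessibility, two disconnected parts of a family can each end up with an active maintainer; `resolve_after_reconnect` models the tie-break once they can see each other again.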
[Figure: a progenitor and its copies]
Dissertation, Appendix A
50. Active and passive maintenance
activities
• Active maintainer (the WO with earliest timestamp) – currently
charged with making copies and related housekeeping
• Passive maintainer – all other WOs
[Figure: timeline of active and passive maintainer roles. Events: progenitor is lost; progenitor returns; progenitor declares act. as copy.]
51. Progenitor is lost
52. A new active maintainer
53. Progenitor returns and assumes
active maintainer role
54. Progenitor has made copies
[Figure: timeline of copy events: copy declares active; copies created; replacement created; excess copies; excess deleted.]
55. A copy is disconnected from the
family
56. Two active maintainers make
copies
57. Disconnected copy is
reconnected to the progenitor
58. Family has too many copies
• Copy management policies
– Active: explicit removal
– Passive: “natural attrition”
• Making and monitoring copies is
the equivalent of Reynolds’
velocity matching
59. USW copying policies
• Least aggressive – one at a time to c_hard
• Moderately aggressive – as quickly as possible to c_soft and then one at a time to c_hard
• Most aggressive – as quickly as possible to c_hard
• Different policies produce different results
[Figure legend. WO preservation status by number of copies N: none; N < c_soft; c_soft <= N < c_hard; N == c_hard. Host utilization status: 0%; < 25%; < 50%; < 75%; > 75%.]
Dissertation, section 6.7.4
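The three copying policies differ only in how many copies a WO asks for at each opportunity. The sketch below makes that concrete; the function names and policy labels are hypothetical, `c_soft` and `c_hard` stand for the soft lower and hard upper copy limits with `c_soft <= c_hard` assumed, and every request is assumed to succeed, which ignores limited host capacity.

```python
def copies_to_request(policy, current, c_soft, c_hard):
    """How many new copies to request this round, given `current` copies."""
    if current >= c_hard:
        return 0                      # never exceed the hard upper limit
    if policy == "least":
        return 1                      # one at a time to c_hard
    if policy == "moderate":
        # As quickly as possible to c_soft, then one at a time to c_hard.
        return c_soft - current if current < c_soft else 1
    if policy == "most":
        return c_hard - current       # as quickly as possible to c_hard
    raise ValueError(f"unknown policy: {policy}")

def rounds_to_hard_limit(policy, c_soft, c_hard):
    """Rounds needed to reach c_hard if every requested copy is created."""
    current, rounds = 0, 0
    while current < c_hard:
        current += copies_to_request(policy, current, c_soft, c_hard)
        rounds += 1
    return rounds
```

With a soft limit of 3 and a hard limit of 5, the least aggressive policy needs five rounds, the moderately aggressive policy three, and the most aggressive policy one.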
63. Least aggressive (t = 100)
A full YouTube video is available at: http://youtu.be/sHJGYphqtK4
64. Least aggressive (final)
• Results
– System
stabilized
– Host capacity
limited
– Some WOs
without any
copies
– Some hosts
unused
• “Least aggressive” is
not an effective
policy
65. Which policy to choose?
• Moderately aggressive
results in an additional 18%
of WOs meeting their
preservation goals and makes
more efficient use of limited
host resources sooner
• Most aggressive results in
almost the same percentage
of WOs meeting their goals,
but places a strain on the host
resources
Charles L. Cartledge and Michael L. Nelson,
When Should I Make Preservation Copies of
Myself?, arXiv preprint arXiv:1202.4185
(2012).
66. Make new family members on
new hosts
• Spreading copies
across hosts
increases the
WO’s
preservation
likelihood
• Learn about new
hosts from
friends
Reynolds’ flock centering: move with friends to new hosts
Dissertation, Appendix A
67. Crowd sourcing of family
member creation
• “Everyone is a curator …”
– Crowd sourced activity
– Unscheduled
– Willing to wait a long time
• Enlist humans in creation
and maintenance –
opposite of benign neglect
Frank McCown, Michael L. Nelson, and Herbert Van de Sompel,
Everyone is a Curator: Human-Assisted Preservation for ORE
Aggregations, Proceedings of the DigCCurr 2009 (2009).
68. USW simulation vs. implementation
                   USW theory      HTTP/HTML reality
Communications     Instantaneous   Asynchronous
Edges              Bidirectional   Directional
Temporal effects   None            Inconsistencies
69. Some WO reference
implementation details
Sawood Alam, HTTP Mailbox - Asynchronous RESTful Communication, Master's thesis, Old
Dominion University, Norfolk, VA, 2013.
Carl Lagoze, Herbert Van de Sompel, Pete Johnston, Michael Nelson, Robert Sanderson, and
Simeon Warner, ORE User Guide - Resource Map Implementation in Atom, Tech. report, Open
Archives Initiative, 2004.
Sawood Alam, Charles L. Cartledge, and Michael L. Nelson, Support for Various HTTP Methods
on the Web, Tech. Report arXiv:1405.2330 (2014).
WO memory: simulated via “edit”
service
Direct WO to WO communication:
simulated via the HTTP Mailbox
70. Demonstration of the reference
implementation
1. Selection of a web page to be preserved
2. Creation of a WO from the web page
3. Adding the WO to an existing USW graph
a. Pages copied from flickr.com, arXiv.org, radiolab.org, and
gutenburg.org
b. All pages instrumented to become USW WOs
4. Creating preservation copies
5. Detecting that a copy was lost
6. Creating a replacement copy
71. USW contributions
Expanded graph theory by
creating an algorithm that
creates small-world graphs
based on locally collected
data (chapter 6)
Developed a new way to
quantify damage in
connected and disconnected
graphs (section 5.2)
Developed techniques to
optimize when and where to
create preservation copies
(section 5.5)
Developed techniques
to achieve emergent
behavior in WOs
(section 6.2)
73. Preserve Me Viz! with new
connections
• New friend
connections
• New copy
locations
74. Preserve Me “Basic” on a copy
• Differences between
active and passive
maintainers.
• Active maintainer is
responsible for making
copies.
• Passive maintainer
sends alerts to the
active maintainer
• Passive maintainer
may assume active
maintainer role if
active is not available.
76. USW algorithm popup
• Written in
JavaScript
• Relies on domain
services
– Copy -> creates
copy of a WO
– Edit -> update
own REM
• Uses
communications
mechanism based
on Sawood Alam’s
master’s thesis
77. USW Preservation: copies (1 of 2)
• WO copies are not bit by
bit identical to the original
WO
• REsource Map (REM)
points to a resource
– Point to the “essence” of
the original
– Point to local copies of
the resources
– Can be used to recreate
the “essence” of the
original
• Resource has two
attributes:
– Size
– Update frequency
84. Comparing graphs
• Small-world graphs occur in natural
and man made systems
• Small-world graphs are robust
• How to algorithmically and
incrementally create small-world
graphs?
Symbol Meaning
K Degree
<k> Average degree
N Order of the graph
85. Quantifying damage
• Not all graph components
are equally valuable
• How to identify the most
valuable?
• Greedy repair is the
obverse of attack: instead of
identifying the most damaging
component to remove, identify
where to place the most
beneficial component
Charles L. Cartledge and Michael L. Nelson, Connectivity Damage to a Graph
by the Removal of an Edge or Vertex, Tech. report, arXiv 1103.3075, ODU CS
Dept.
86. Global normalized A{DV}H damage
(40 - 750 nodes)
• Arithmetic
series of
possible
solutions
• Early attacks
are most
effective, later
attacks are
incrementally
effective
87. Long term growth analysis of
USW graph
Based on the idea of a
game
– Create the graph
– Attack the graph using
AVH profile to remove
10% of the WOs
– Repair the graph,
every surviving WO
gets 2 opportunities
(may be unsuccessful
in repair attempts)
– Repeat until steady
state
90. Final states for copying policies
and named conditions
Dissertation,
Appendix H
91. Host capacity and WO desires
[Figure: host capacity versus WO desires under the named conditions Famine, Feast, Straddle, B.Low, and B.High.]
Dissertation,
Appendix H
92. Man-made small-world graph: Western Electricity Coordinating Council
                 Actual            Random graph
Nodes  Edges     C(G)    L(G)      C(G)     L(G)
4941   6594      0.0801  18.99     0.00054  8.7
Ake J Holmgren, Using Graph Models to Analyze the Vulnerability of
Electric Power Networks, Risk Analysis 26 (2006), no. 4, 955 - 969.
93. Naturally occurring small-world graph: C. elegans nematode
                 Actual            Random graph
Nodes  Edges     C(G)    L(G)      C(G)     L(G)
248    511       0.21    2.87      0.05     2.62
Lav R Varshney, Beth L Chen, Eric Paniagua, David H Hall, and Dmitri B
Chklovskii, Structural Properties of the Caenorhabditis elegans Neuronal
Network, PLoS computational biology 7 (2011), no. 2, e1001066.
94. Organic small-world graph: Enron e-mail
                  Actual            Random graph
Nodes  Edges      C(G)    L(G)      C(G)     L(G)
148    ~500,000   0.44    2.25      0.11     2.0
Shinako Matsuyama and Takao Terano, Analyzing the ENRON
Communication Network Using Agent-Based Simulation, Journal of
Networks 3 (2008), no. 7.
95. USW WO determines number of
friends
• Number of
connections
Dissertation, section 6.7
Symbol     Meaning
ln, log2   Natural and base 2 logarithms
n          Order of the discovered USW graph
g          Simple scalar