Bibliotheca Digitalis Summer school: Beyond the Page: enriching the digital library - Lou Burnard

Bibliotheca Digitalis
Reconstitution of Early Modern Cultural Networks
From Primary Source to Data
DARIAH / Biblissima Summer School
Le Mans, 4-8 July 2017
Beyond the Page:
enriching the digital library
Lou Burnard
1st day, July 4th – Digital sources: theoretical fundamentals


The Textual Trinity
A document can be described in
terms of...
its physical state (because
texts are made up of glyphs
arranged in particular ways)
its linguistic nature (because
texts are made of words
used in particular ways)
its intentions (because texts
are supposed to tell us
something about the world)
(Burnard 1987, Burnard 1989,
Burnard & Greenstein 1994)

(Or maybe it’s more than a trinity)

Software families
Existing software systems tend to specialize ...
document management and production systems
image management and production systems
linguistic analysis and management
database systems

Convergence
But convergence is now on everyone’s digital agenda. When you
make a mashup combining
a GIS database about places in the Aegean sea
a historical gazetteer of placenames in the same area
a corpus of texts mentioning those placenames
you need to combine the strengths of a database with tools for
linguistic analysis, and with tools for rendering spatial information.
A few examples:
https://pleiades.stoa.org/places/109236
http://www.mappingpaintings.org
https://mapoflondon.uvic.ca/map.htm

The problem
Today’s digital library
applications still focus on
serving up virtual pages for
the reader: the metaphor of
the book is so pervasive that
we can barely see it.
Self-evidently, digitization
makes it possible to offer
cheaper and more accessible
simulations of printed or
written pages.
But this is not enough...
digital texts should aim to
go ‘beyond the page’

What use is a digital text?
Digital applications enable us to do more with a text, and especially
with a collection of texts!
more than simply read it from beginning to end
more than attach annotations to it for others to read
more than perform brute-force “text mining” on it.
The content of the digital library must therefore be enriched, even if
this requires the use of techniques which are not currently
automatable.

What’s that noise in the digital library?
A digital edition should capture the intentions and meaning of
a text, not simply its appearance
Otherwise, there can be no analysis beyond the documentary
level, no ‘conversation between books’

Enrichment or Representation?
When we go from this... ... to this, what is happening?

Editing
It’s customary to distinguish (at least) these types or levels of
interpretation:
paleographic level: identifying the characters and other
graphemic components
documentary or diplomatic level: determining what was
originally written
editorial or semantic level: determining how it ought to be
read
Digitization provides an opportunity to make each step explicit,
complex, and reversible

The hermeneutic circle of digital enrichment

Enrichment
Adding markup to a document determines how it can be
processed. It can concern many different aspects:
the presentation of the document – its use of writing styles or
typefaces, its rendering and layout
the rhetorical organization of a document – its sections and
subsections, its paragraphs and lists and headings and
footnotes
metatextual aspects of the document – its corrections and
additions and deletions and errors and lacunae
linguistic properties of a document – its syntax and
morphology and semantics
the document as an object – information about its origins and
custodial history, its transmission and reception, its social
function and category...
and many others.

Let’s focus on just one aspect: the treatment of names occurring in
a document.

Some background theory
Reference is a fundamental semiotic concept
Natural languages often distinguish words associated with
abstract concepts from words associated with (concepts
concerning) specific objects
Proper names, technical terms, etc. behave differently from
other kinds of word and often have a different linguistic status
they do not appear in lexicons
they are often ‘non-translatable’
What distinguishes them is chiefly their association with real
(or fictive) entities. ‘king’ is a noun with no particular referent;
‘Martin Luther King’ refers to a specific person, as does (in
context) ‘the king’.
Likewise with places, ‘city’ refers to a type of place, not a
particular one; ‘City of London’ refers to a particular place, as
does (in context) ‘the city’

Named entity recognition is a multi-stage operation
decide which input strings reference named entities
decide which particular entities are intended
(optionally) assemble and associate other information about
each referenced entity
Only the first of these is (more or less) automatable, despite
decades of research.

The NLP (MUC) ‘Named Entity Recognition’ paradigm
input strings are linguistically analysed (parsed,
morphologically analysed, etc.) for candidate tokens
candidates are resolved and disambiguated using a
(pre-existing) ‘knowledge base’ such as Wikipedia
data mining and language modelling systems work similarly,
though the knowledge base may be less structured
The real challenge is to build the knowledge base ...
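
A minimal sketch of the first two stages, in Python. The capitalisation heuristic, the tiny knowledge base, and its identifiers are all invented for illustration; real systems use proper linguistic analysis and a far larger knowledge base.

```python
import re

# Hypothetical, hand-made knowledge base: surface form -> (entity id, kind).
KNOWLEDGE_BASE = {
    "Lou Burnard": ("pers:burnard", "person"),
    "Le Mans": ("place:le_mans", "place"),
    "Prussia": ("place:prussia", "place"),
}

# Stage 1 heuristic (an assumption, not the MUC method):
# a run of capitalised words is treated as a candidate name.
CANDIDATE = re.compile(r"\b[A-Z][\w-]*(?: [A-Z][\w-]*)*")


def find_candidates(text):
    """Return every run of capitalised words, in order of appearance."""
    return CANDIDATE.findall(text)


def resolve(candidates, knowledge_base=KNOWLEDGE_BASE):
    """Stage 2: map each candidate to a known entity, or to None."""
    return {c: knowledge_base.get(c) for c in candidates}


sentence = "The lecture by Lou Burnard took place in Le Mans."
print(resolve(find_candidates(sentence)))
```

The sentence-initial ‘The’ is picked up as a candidate and left unresolved: the heuristic is cheap, and everything useful depends on what the knowledge base contains.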

Kinds of entity
persons, historical or fictional: ‘Lou Burnard’, ‘Harry Potter’,
‘Pseudo-Dionysius the Areopagite’
named places, of any kind: ‘Le Mans’, ‘Atlantis’, ‘Prussia’, ‘the
Eiffel Tower’
named groupings of people: ‘The Drones’, ‘Gallimard’, ‘the
Thracians’
physical objects, works of art, etc.: ‘the Alfred Jewel’, ‘Excalibur’,
‘the Mona Lisa’
etc. (Are animals objects or people?)

Entity properties
What might you want to know about an entity? Some things are
obvious, but the list is in principle unbounded:
the various names associated with them at different times
their chronology (birth, death, creation etc.)
their composition, dimensions, classifications, etc.
their associations with other entities
identifiers used for them in standard authority control lists
The last is particularly important for work in the LOD paradigm.
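
One way to picture such a record, sketched in Python. The class and field names are hypothetical. The values are those used for Clara Schumann elsewhere in these slides, except the start date for ‘Clara Schumann’, which is an assumption mirroring the end date given for ‘Clara Wieck’.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical record holding the kinds of property listed above."""
    kind: str
    names: list = field(default_factory=list)         # (name, not_before, not_after)
    chronology: dict = field(default_factory=dict)    # event -> ISO date
    associations: list = field(default_factory=list)  # (relation, other entity)
    identifiers: dict = field(default_factory=dict)   # authority -> identifier

    def names_on(self, date):
        """Names in use on an ISO date; None means an open-ended bound."""
        return [name for name, start, end in self.names
                if (start is None or start <= date)
                and (end is None or date <= end)]


clara = Entity(
    kind="person",
    names=[("Clara Wieck", None, "1840-09-12"),
           ("Clara Schumann", "1840-09-12", None)],  # start date is assumed
    chronology={"birth": "1819-09-13"},
    associations=[("born in", "Leipzig")],
    identifiers={"VIAF": "http://viaf.org/viaf/44499359"},
)

print(clara.names_on("1830-05-01"))
```

ISO dates of equal length compare correctly as plain strings, which keeps the sketch short; partial or uncertain dates would need more care.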

Kinds of entity reference
TEI provides several elements for the markup of names and nominal
expressions:
<rs> (‘referring string’) – any phrase which refers to a person or
place, e.g. ‘the girl you mentioned’, ‘10 miles Northeast of
Attica’ ...
<name> – any lexical item recognized as a proper name e.g.
‘Budleigh Salterton’, ‘Bouallebec’, ‘John Doe’ ...
<persName>, <placeName>, <orgName>: specific types of
name: ‘syntactic sugar’ for <name type="person"> etc.
A rich set of proposals for the components of such elements
A project must decide which approach best suits its needs
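
The ‘syntactic sugar’ relationship can be made concrete with a short Python sketch that rewrites the specific elements as typed <name> elements. It assumes a namespace-free fragment (real TEI documents use the TEI namespace), and the type values for places and organizations are assumptions patterned on the ‘person’ example above.

```python
import xml.etree.ElementTree as ET

# Specific element -> value of @type on the generic <name>.
# "person" is from the slide; "place" and "org" are assumed by analogy.
SUGAR = {"persName": "person", "placeName": "place", "orgName": "org"}


def desugar(root):
    """Rewrite <persName> etc. in place as <name type="...">."""
    for element in root.iter():
        # Leave elements that already carry their own @type untouched.
        if element.tag in SUGAR and "type" not in element.attrib:
            element.set("type", SUGAR[element.tag])
            element.tag = "name"
    return root


fragment = ET.fromstring(
    "<p><persName>John Doe</persName> visited "
    "<placeName>Budleigh Salterton</placeName></p>"
)
print(ET.tostring(desugar(fragment), encoding="unicode"))
```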

Nominal expressions
often have internal structure
are sometimes ambiguous (same referent, different target)
are often multiform (different referent, same target)
TEI XML markup can help...

Components of personal names
<persName xml:lang="de">
<forename type="first">Johann</forename>
<forename type="middle">Sebastian</forename>
<surname>Bach</surname>
</persName>
<persName xml:lang="fr">
<forename type="composé">Jean-Sébastien</forename>
<surname>Bach</surname>
</persName>
Not to mention... <roleName> (‘Emperor’, ‘conseiller’), <genName>
(‘the Elder’) <addName> (‘Hammer of the Scots’), <nameLink> (‘van
der’) ...
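
Once the components are marked up, software can use them. The Python sketch below builds an index form (‘surname, forenames’) from the two encodings above; the function names are hypothetical and the fragments are treated as namespace-free XML.

```python
import xml.etree.ElementTree as ET

GERMAN = """<persName xml:lang="de">
  <forename type="first">Johann</forename>
  <forename type="middle">Sebastian</forename>
  <surname>Bach</surname>
</persName>"""

FRENCH = """<persName xml:lang="fr">
  <forename type="composé">Jean-Sébastien</forename>
  <surname>Bach</surname>
</persName>"""


def parts(pers_name, tag):
    """Text of every direct child with the given tag, in document order."""
    return [(child.text or "").strip() for child in pers_name if child.tag == tag]


def index_form(pers_name):
    """Build 'Surname, Forename(s)' from the marked-up components."""
    surname = " ".join(parts(pers_name, "surname"))
    forenames = " ".join(parts(pers_name, "forename"))
    return f"{surname}, {forenames}"


for fragment in (GERMAN, FRENCH):
    print(index_form(ET.fromstring(fragment)))
```

Both forms file under the same surname, something a plain string such as ‘Jean-Sébastien Bach’ cannot guarantee.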

Components of place names
names of a specific geo-political type (<district>,
<settlement>, <region>, <country>, <bloc>)
<placeName>
<district>6ème arr.</district>
<settlement type="city">Paris, </settlement>
<country>France</country>
</placeName>
names of geographical features such as mountains or rivers
and terms for such features (<geogName> and <geogFeat>)
<placeName>
<geogFeat>Mont</geogFeat>
<geogName>Blanc</geogName>
</placeName>
a relational expression
<rs type="place">
<measure>10 miles</measure>
<offset>Northeast of</offset>
<settlement>Attica</settlement>
</rs>

Resolving referents
Within a single language, in a single document, the same person is
referred to in different ways:
<persName>Clara Schumann</persName> ....
<persName>Clara</persName> ....
<persName>Frau Schumann</persName>
The @ref attribute can be used to show that these are all references
to the same person
<persName ref="#CS">Clara Schumann</persName> ....
<persName ref="#CS">Clara</persName> ....
<persName>Clara Wieck</persName> ...
<persName ref="#CS">Frau Schumann</persName>
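
With @ref in place, collecting every form under which a person appears becomes a simple grouping operation. A Python sketch, assuming a namespace-free fragment and a hypothetical function name:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEXT = """<p><persName ref="#CS">Clara Schumann</persName> ....
<persName ref="#CS">Clara</persName> .... <persName>Clara Wieck</persName> ...
<persName ref="#CS">Frau Schumann</persName></p>"""


def variants_by_ref(root):
    """Group the distinct name forms found in the text by their @ref value."""
    variants = defaultdict(list)
    for element in root.iter("persName"):
        # Normalise whitespace so names split across lines compare equal.
        form = " ".join("".join(element.itertext()).split())
        ref = element.get("ref")  # None when the name is not yet resolved
        if form not in variants[ref]:
            variants[ref].append(form)
    return dict(variants)


print(variants_by_ref(ET.fromstring(TEXT)))
```

Names without @ref are gathered under None, which gives an editor a list of the references still to be resolved.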

Associating reference and entity
the value of @ref can be any form of URI, pointing to a place
where there is more information about this entity, provided
locally or externally
<persName ref="https://en.wikipedia.org/wiki/Clara_Schumann">Clara Schumann</persName>
<persName ref="#CS">Clara Schumann</persName>
<persName ref="myBib:CS">Clara Schumann</persName>
All we want to say about CS can be provided using a <person>
element somewhere
<person xml:id="CS">
<persName notAfter="1840-09-12">Clara Wieck</persName>
<birth when="1819-09-13">
<placeName>Leipzig</placeName>
</birth>
<ref type="VIAF"
target="http://viaf.org/viaf/44499359"/>
<idno type="ISNI">ISN:0000000121305653</idno>

</person>
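
The three forms of @ref value shown above call for different handling. A rough Python sketch of how a processor might tell them apart; the category labels are invented for illustration, and how a prefix such as ‘myBib’ is expanded is left to the project to define.

```python
def classify_ref(ref):
    """Sort a @ref value into one of the three forms shown on this slide."""
    if ref.startswith("#"):
        # Pointer to an xml:id in the same document, e.g. a <person>.
        return ("local", ref[1:])
    if ref.startswith(("http://", "https://")):
        # Full web address of an external description.
        return ("external", ref)
    prefix, colon, rest = ref.partition(":")
    if colon and prefix and rest:
        # Private prefix scheme; its expansion must be defined elsewhere.
        return ("prefixed", (prefix, rest))
    return ("unknown", ref)


for value in ("https://en.wikipedia.org/wiki/Clara_Schumann", "#CS", "myBib:CS"):
    print(classify_ref(value))
```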

Resolving ambiguity
Person or place?
<s>Jean likes
<name>Nancy</name>
</s>
We could clarify this by using a more precise tag (<persName> or
<placeName>) rather than <name>. Or we could resolve it by
supplying the appropriate target for the @ref attribute on <name>:
<s>Jean likes
<name ref="#PLACE123">Nancy</name>
</s>

<person xml:id="PERS123">
<persName>
<forename>Nancy</forename>
<surname>Ide</surname>
</persName>

</person>
<place xml:id="PLACE123">
<placeName notBefore="1400">Nancy</placeName>
<placeName notAfter="0056">Nantium</placeName>
</place>
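
Following the pointer settles the question mechanically: the kind of element that @ref targets tells us whether ‘Nancy’ is a person or a place. A Python sketch, in which the <entities> wrapper and the function name are hypothetical and the XML is namespace-free:

```python
import xml.etree.ElementTree as ET

# The parser expands the xml:id attribute to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# <entities> is a hypothetical wrapper used only for this sketch.
ENTITIES = ET.fromstring("""<entities>
  <person xml:id="PERS123">
    <persName><forename>Nancy</forename><surname>Ide</surname></persName>
  </person>
  <place xml:id="PLACE123">
    <placeName notBefore="1400">Nancy</placeName>
  </place>
</entities>""")


def kind_of(name, entities=ENTITIES):
    """Return the tag of the element a local @ref points to, or None."""
    ref = name.get("ref", "")
    if not ref.startswith("#"):
        return None
    for candidate in entities.iter():
        if candidate.get(XML_ID) == ref[1:]:
            return candidate.tag
    return None


sentence = ET.fromstring('<s>Jean likes <name ref="#PLACE123">Nancy</name></s>')
print(kind_of(sentence.find("name")))
```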

Data vs. Text
TEI distinguishes names from things.
The assumption is that names are found in source texts, whereas
things exist in the real world, and are described by additional data.
Data can take a semi-textual form structured in XML, though it need
not do so.
‘Text is not a special type of data; data is a special type of text.’

For example
Extract from Histoire Chronologique de la Chancelerie de France..., p. 5
personal names (Odolric, Adalric, Gezon, Lothaire, Adaleron,
Arnoul) ...
names of social positions (Grand Chancelier, Secretaire, Roi...)
a nickname (‘dit Le Faineant’)
titles of other sources (pour la donation de l’Abbaie de
Bonneval, Antiquitez de Troyes)
explicit quotation (‘Sinum Lotarii gloriosissimi Regis... ’)
The formatting helps... but only a bit: we need to make these things
explicit.

Another example: Paris, BnF, ms. français 16753
First page of Registres de permis d’imprimer...

One possible encoding...
This seems to be text as data...

.... continued
... and this seems to be data as text...

Tentative conclusions, intended to provoke debate
reading a text involves identifying and understanding its data
reading many texts at a distance contributes to, but does not
replace, an understanding of the data they represent
data is itself a kind of text, requiring the same nuanced
interpretive judgment
