Magnus Huber - The Old Bailey Corpus: Spoken English in the 18th and 19th Centuries
1. The Old Bailey Corpus
Spoken English in the 18th and
19th centuries
The use of historical court records in
the investigation of language change
Digital History Seminar, 21 February 2012
Magnus Huber
Department of English
University of Giessen
Otto-Behaghel-Str. 10B
D-35394 Giessen, Germany
magnus.huber@anglistik.uni-giessen.de
2. Structure
1. Introduction
1.1 Corpus linguistics, sociolinguistics and
sociohistorical linguistics
1.2 The Proceedings of the Old Bailey
1.3 Turning the Proceedings into a linguistic corpus
2. How linguistically accurate is OBC?
2.1 Comparison with alternative accounts
2.2 Language event and its representation
2.3 Internal consistency: negative contraction
2.4 Sociolinguistic potential: relative clauses
3. Brief summary
3. 1. Introduction
1.1 Corpus linguistics, sociolinguistics and
sociohistorical linguistics
Definition of linguistic corpus
Generally speaking, a
(usually large) collection of
machine-readable texts used
as a database in linguistic
analyses
Importance of
spoken language
Spoken language precedes
written language
4. Peter Trudgill (1974)
The social differentiation of English in Norwich
[Chart: percentage of (ng):[n] by social class and sex; y-axis 0-100; x-axis MMC, LMC, UWC, MWC, LWC; series Female, Male]
MMC middle middle class
LMC lower middle class
UWC upper working class
MWC middle working class
LWC lower working class
Example: drinking with (ng):[n] = [drɪnkɪn]
5. Historical linguistics: language change
ye > you in subject position
when ye come set it in sech rewle as ye seeme best (1465)
And thus in hast fare you hartely well (1545)
7. 1.2 The Proceedings of the Old Bailey
• Old Bailey = London's Central Criminal Court
• meets 8 times/year, from 1830s 10 times/year
• "Proceedings" published 1674-1913
• start as a commercial enterprise: publishers
send scribes into courtroom
• proceedings taken down in shorthand
• sold privately by publishers
• City of London gains more and more control
during 18th century
10. Original computerized Proceedings (Sheffield)
<unit id="t17330510-1"><trial><info><identifier>t17330510-1</identifier>
<source>173305100002</source><header>Sarah Sanders, theft: specified place, 10 May 1733.</header>
<pfro>17330510</pfro><ntrial>2</ntrial><psession>17330404</psession>
<nsession>17330628</nsession></info>
<p>1. <person gender="f"><defend
gender="f"><given>Sarah </given><surname>Sanders
</surname></defend></person>, was indicted for <off><theft
type="specified place">stealing a Portugal Piece of Gold,
value 36 s. a Gold Ring, value 10 s. a Gold Ring set with
Vermillion Stones, value 7 s. 6d. a Silver Girdle Buckle, value
10 s. three Aprons, a Shirt, a Shift, and 2 Ells of Holland, the
Goods of <person gender="m"><victim
gender="m"><given>John </given><surname>Underwood
</surname></victim> </person>, in his House</theft></off>,
<cd>March 4</cd>.</p>
<p>John Underwood. The Prisoner was my
<deflabel>Servant</deflabel>, she came to me very well
recommended, but had not staid above ten Weeks before
several [. . .]
12. Sociolinguistically useful XML-tags
in Sheffield Proceedings
• name
<given>Sarah</given> <surname>Sanders</surname>
• year
<identifier>t17180110-1</identifier>
• gender
<defend gender="f">
• age
<age>43</age>
• profession
<deflabel>Servant</deflabel>
• origin
<crimeloc>Tottenham</crimeloc>
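The year is not stored in a tag of its own but can be read off the trial identifier. Judging from the Sanders example (identifier t17330510-1, header date 10 May 1733), the identifier seems to embed the session date as YYYYMMDD. A minimal Python sketch under that assumption; the function name trial_date and the pattern are illustrative, not part of the Sheffield markup:

```python
import re
from datetime import date

# Assumed identifier layout: "t" + YYYYMMDD + "-" + running trial number.
ID_PATTERN = re.compile(r"^t(\d{4})(\d{2})(\d{2})-(\d+)$")


def trial_date(identifier: str) -> date:
    """Return the session date encoded in a trial identifier."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    year, month, day = (int(part) for part in match.groups()[:3])
    return date(year, month, day)


print(trial_date("t17180110-1").year)  # 1718
```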
13. 1.3 Turning the Proceedings
into a linguistic corpus of
early spoken English
14. <unit id="t17330510-1"><trial><info><identifier>t17330510-1</identifier>
<source>173305100002</source><header>Sarah Sanders, theft: specified place, 10 May 1733.</header>
<pfro>17330510</pfro><ntrial>2</ntrial><psession>17330404</psession>
<nsession>17330628</nsession></info>
<p>1. <person gender="f"><defend
gender="f"><given>Sarah </given><surname>Sanders
</surname></defend></person>, was indicted for
<off><theft type="specified place">stealing a Portugal Piece
of Gold, value 36 s. a Gold Ring, value 10 s. a Gold Ring set
with Vermillion Stones, value 7 s. 6d. a Silver Girdle Buckle,
value 10 s. three Aprons, a Shirt, a Shift, and 2 Ells of
Holland, the Goods of <person gender="m"><victim
gender="m"><given>John </given><surname>Underwood
</surname></victim> </person>, in his House</theft></off>,
<cd>March 4</cd>.</p>
<p>John Underwood. The Prisoner was my
<deflabel>Servant</deflabel>, she came to me very well
recommended, but had not staid above ten Weeks before
several [. . .]
[Slide callout overlaid on this excerpt: <speech>]
15. Tagging spoken language
• Need for automatic annotation
• Perl script identifying non-linguistic
patterns indicating spoken language
in the original proceedings
– layout
– metalinguistic information
• Linguistic markers indicating spoken
language? > 1st + 2nd person pronouns
16. Automatic speech tagging
e.g. "Q. – A."-sequences
Q. <speech>Did you see him on Sunday night?</speech> - A.
<speech>Yes, at Walworth, on Sunday night, the
12th of January, at one o'clock - I am sure
of that.</speech></p>
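The tagging of "Q. – A." sequences can be illustrated with a short sketch. The slides mention a Perl script; the version below is written in Python instead and is not the project's code. It assumes that turns are introduced by "Q." or "A.", separated by " - ", and end at the closing </p>; the name tag_speech and the regular expression are illustrative.

```python
import re

# A turn starts with "Q." or "A." and runs up to the next " - Q."/" - A."
# separator, a closing </p>, or the end of the string (assumed layout).
TURN = re.compile(r"\b([QA])\.\s+(.*?)(?=\s+-\s+[QA]\.|</p>|$)", re.DOTALL)


def tag_speech(paragraph: str) -> str:
    """Wrap the utterance of every Q./A. turn in <speech> tags."""
    return TURN.sub(
        lambda m: f"{m.group(1)}. <speech>{m.group(2)}</speech>", paragraph
    )


source = (
    "Q. Did you see him on Sunday night? - A. Yes, at Walworth, on Sunday "
    "night, the 12th of January, at one o'clock - I am sure of that.</p>"
)
print(tag_speech(source))
```

A dash inside an answer ("at one o'clock - I am sure") is left alone because only a dash followed by "Q." or "A." counts as a turn boundary.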
18. Social data file
• XML format
• attributes of every speaker in OBC
• plus: scribe, printer, publisher
<xml>
<document name="19100426">
...
<speaker id="271">
<sex>m</sex>
<age></age>
<given>Thomas</given>
<surname>Tuckey</surname>
<occupation>Warder</occupation>
<occupation2></occupation2>
<hiscolabel>Prison Guard</hiscolabel>
<hiscocode>58930</hiscocode>
<hiscolabel2></hiscolabel2>
<hiscocode2></hiscocode2>
<crimescene></crimescene>
<birthplace></birthplace>
<workplace>Wormwood Scrubs Prison</workplace>
<placeofresidence></placeofresidence>
<role>witness</role>
</speaker>
...
</document>
</xml>
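A speaker record of this shape can be read with a few lines of Python. The sketch below assumes the element names shown on the slide and uses a shortened copy of the Tuckey record as sample data; the function name load_speakers is hypothetical, and the real file layout may differ.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the record shown on the slide (elided parts left out).
SAMPLE = """<xml>
<document name="19100426">
<speaker id="271">
<sex>m</sex>
<age></age>
<given>Thomas</given>
<surname>Tuckey</surname>
<occupation>Warder</occupation>
<hiscolabel>Prison Guard</hiscolabel>
<hiscocode>58930</hiscocode>
<workplace>Wormwood Scrubs Prison</workplace>
<role>witness</role>
</speaker>
</document>
</xml>"""


def load_speakers(xml_text: str) -> dict:
    """Map each speaker id to a dict of its non-empty attributes."""
    root = ET.fromstring(xml_text)
    speakers = {}
    for speaker in root.iter("speaker"):
        fields = {
            child.tag: child.text.strip()
            for child in speaker
            if child.text and child.text.strip()
        }
        speakers[speaker.get("id")] = fields
    return speakers


speakers = load_speakers(SAMPLE)
print(speakers["271"]["hiscocode"])  # 58930
```

Empty elements such as <age></age> are dropped, so a missing value shows up as an absent key rather than an empty string.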
19. 2. How linguistically accurate is OBC?
2.1. Comparison with alternative accounts, e.g.
trial of John Ayliffe, 17591024-27, vs. alternative
account The tryal at large of John Ayliffe
Proceedings (718 words) vs. Tryal (1290 words)
Proceedings: Thomas. I am clerk to Mr Jones, a Stationer in the Temple.
Tryal: Henry Thomas. I am clerk to Mr Jones, a Stationer, in the Temple.
Proceedings: Hargrave. By Mr Ayliffe: I saw him seal and deliver it.
Tryal: Walter Hargrave. By Mr Ayliffe. – I saw him sign, seal, and deliver it, as his act and deed.
Proceedings: ./.
Tryal: John Fannen. I am not sure; but to the best of my remembrance, it was sometime the beginning of December last, at Mr Fox's house.
20. Proceedings (718 words) vs. Tryal (1290 words)
Proceedings: Hargrave. Because he said he was not willing Mr Fox should know of it?
Tryal: Walter Hargrave. The reason Mr Ayliffe gave, was, that he would not on any account have it come to Mr Fox's ears.
Proceedings: Thomas. I can't particularly say that; sometimes we leave a blank by the gentlemens desire, perhaps they may add another covenant, or something of that sort, I can't recollect the reason for that.
Tryal: Henry Thomas. I cannot positively say. – We sometimes leave out the conclusion by gentlemen's desire, in order that they may add a covenant, or some such thing, if it should be thought necessary; but I cannot particularly recollect the reason why the conclusion was omitted in this case.
21. 2.2 Language event ↔ written representation
Letters: formulation → writing
Trial proceedings (e.g. Old Bailey Proceedings): speech event → perception by scribe → shorthand script → expanding shorthand → proof reading → type setting
22. Gurney (1752)
Brachygraphy: or short-writing
'to take a Speech, or Sermon verbatim, as a Person talks in common' (p. 3)
Scribes
Thomas Gurney (1749-1770)
Joseph Gurney (1770-1782)
23. Recording linguistic details
• no distinction between inflected and
uninflected auxiliaries
[shorthand sign] = 'may' or 'mayst'
[shorthand sign] = 'can' or 'canst'
[shorthand sign] = 'should' or 'shouldst'
• dot placed on the top left of the noun phrase
= allomorphs a and an
• auxiliary contractions
'you will' (you w-il) vs. 'you'll' (you-l)
but │ 'it will' ~ 'twill' (│= <t> and it)
24. 2.3 Internal consistency:
negative contraction
e.g. do not > don't, need not > needn't, was not > wasn't
N = 1,344,244
[Chart: NEG contraction in % (y-axis 0-18) by period: 1732-1759, 1760-1789, 1790-1819, 1820-1849, 1850-1879, 1880-1913]
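The contraction percentage for one auxiliary can be sketched as the share of contracted tokens among all full and contracted tokens. The slides do not document how the OBC counts were actually made, so the Python below is an assumption-laden illustration: the function name contraction_rate is hypothetical, matching is case-insensitive, and only the straight apostrophe is recognised.

```python
import re


def contraction_rate(text: str, full: str, contracted: str) -> float:
    """Percentage of contracted tokens among full + contracted forms."""

    def count(form: str) -> int:
        # Require that the form is not glued to a neighbouring word
        # character or apostrophe on either side.
        pattern = r"(?<![\w'])" + re.escape(form) + r"(?![\w'])"
        return len(re.findall(pattern, text, re.IGNORECASE))

    n_full, n_contracted = count(full), count(contracted)
    total = n_full + n_contracted
    return 100.0 * n_contracted / total if total else 0.0


sample = "I do not know him. I don't know. I do not remember. We do not live there."
print(round(contraction_rate(sample, "do not", "don't"), 1))  # 25.0
```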
25. Negative contraction in the OBC, 1732-1912
1. Lexeme?
AUX form     % contr.   N
do not       28.9       189,776
will not     27.7       17,302
shall not    20.6       4,172
cannot       13.3       106,005
are not      3.2        11,552
dare not     3.1        260
need not     0.6        2,136
did not      0.4        429,143
does not     0.4        9,539
have not     0.4        44,038
could not    0.2        85,361
is not       0.2        47,142
must not     0.2        1,620
would not    0.2        52,123
had not      0.1        72,395
has not      0.1        9,244
should not   0.1        20,192
was not      0.1        64,574
may not      0.0        1,271
might not    0.0        2,404
ought not    0.0        1,221
26. Negative contraction in the OBC, 1732-1912
2. Frequency?
[Same table as on slide 25]
27. Negative contraction in the OBC, 1732-1912
3. Tense?
[Same table as on slide 25]
28. Explaining the absence of
negative contraction
• combination of phonology and genre
• n't is phonetically reduced, less salient than not
• do-don't [u - o(u)] vs. did-didn't [ɪ - ɪ]
can-can't vs. could-couldn't
will-won't vs. would-wouldn't
shall-shan't vs. should-shouldn't
• negative contraction is (near) absent where the
context (e.g. change in the stem vowel in the
negative) does not allow disambiguation
29. Hierarchy of perceptive difference
between positive and negative
contracted forms
                 V change   C change/addition   Score
do-don('t)       1          1                   2
will-won('t)     1          1                   2
shall-shan('t)   0.5        1                   1.5
can-can('t)      0.5        0                   0.5
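In the table, each score equals the vowel-change value plus the consonant-change value. The slide does not state this rule explicitly; it is inferred from the four rows. A small Python sketch that recomputes the scores and ranks the pairs (the names PAIRS and score are illustrative):

```python
# (vowel change, consonant change/addition) values copied from the slide.
PAIRS = {
    "do-don('t)": (1, 1),
    "will-won('t)": (1, 1),
    "shall-shan('t)": (0.5, 1),
    "can-can('t)": (0.5, 0),
}


def score(v_change: float, c_change: float) -> float:
    """Perceptual-difference score, assumed to be a plain sum."""
    return v_change + c_change


# Highest score first; ties keep the order of the slide.
ranking = sorted(PAIRS, key=lambda pair: score(*PAIRS[pair]), reverse=True)
for pair in ranking:
    print(pair, score(*PAIRS[pair]))
```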
30. 2.4 Sociolinguistic potential: relative
clauses
• random extracts of speech events from OBC:
20,000 words/decade (10,000 w. each for m + f)
• 2500+ relative clauses, of which 1533 restrictive
         1720-1779   %      1780-1839   %      1840-1913   %      ∑      %
that     259         53.8   240         45.4   136         26.0   635    41.4
zero     107         22.2   118         22.3   201         38.4   426    27.8
which    70          14.6   97          18.3   92          17.6   259    16.9
who      38          7.9    69          13.0   89          17.0   196    12.8
whom     6           1.2    2           0.4    5           1.0    13     0.8
whose    1           0.2    3           0.6    0           0.0    4      0.3
∑        481                529                523                1533
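The percentages in the table are column percentages: each count divided by the total number of restrictive relative clauses in its period. A short Python sketch that recomputes them from the counts on the slide (variable and function names are illustrative):

```python
PERIODS = ["1720-1779", "1780-1839", "1840-1913"]

# Counts of restrictive relativizers per period, copied from the slide.
COUNTS = {
    "that": [259, 240, 136],
    "zero": [107, 118, 201],
    "which": [70, 97, 92],
    "who": [38, 69, 89],
    "whom": [6, 2, 5],
    "whose": [1, 3, 0],
}


def column_percentages(counts: dict) -> dict:
    """Share of each relativizer within its period, rounded to one decimal."""
    totals = [sum(column) for column in zip(*counts.values())]
    return {
        form: [round(100 * n / total, 1) for n, total in zip(row, totals)]
        for form, row in counts.items()
    }


percentages = column_percentages(COUNTS)
for form, row in percentages.items():
    print(form, dict(zip(PERIODS, row)))
```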
31. Diagram 1 Distribution of that with regard to
animacy of the head
[Chart: share in %, y-axis 0-100%, by period]
            1720-1779   1780-1839   1840-1913
non-human   121         164         105
human       137         76          31
1720-1779 vs 1780-1839 p = 0.000
1720-1779 vs 1840-1913 p = 0.000
1780-1839 vs 1840-1913 p = 0.070
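The slides do not say which significance test produced these p-values. As one possibility, the sketch below runs a Pearson chi-square test without continuity correction on a 2x2 table of two periods by animacy. This choice is an assumption, so the results need not match the slide's figures to the last digit. The function name chi_square_2x2 is illustrative, and only the standard library is used.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom the upper-tail probability of the
    # chi-square distribution is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: 1720-1779 and 1780-1839; columns: non-human and human heads.
stat, p = chi_square_2x2(121, 137, 164, 76)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```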
32. Diagram 2 Distribution of that and pronominal
relativizers with human heads
[Chart: share in %, y-axis 0-100%, by period]
       1720-1779   1780-1839   1840-1913
PRN    49          72          93
that   137         76          31
1720-1779 vs 1780-1839: p = 0.000
1720-1779 vs 1840-1913: p = 0.000
1780-1839 vs 1840-1913: p = 0.000
33. Diagram 3 Relativizers by gender (excl. genitives)
[Chart: share in %, y-axis 0-100%, by gender (f, m) within each period; p-values printed above the three periods, left to right: p = 0.135, p = 0.001, p = 0.000]
        1720-1779     1780-1839     1840-1913
        f      m      f      m      f      m
PRN     43     71     56     112    66     119
zero    53     54     66     52     110    73
that    124    134    108    132    72     64
f 1720-1779 vs 1780-1839: p = 0.135
f 1720-1779 vs 1840-1913: p = 0.000
f 1780-1839 vs 1840-1913: p = 0.000
m 1720-1779 vs 1780-1839: p = 0.033
m 1720-1779 vs 1840-1913: p = 0.000
m 1780-1839 vs 1840-1913: p = 0.000
34. Diagram 4 Zero relativizer by gender (excl. genitives)
[Chart: share in %, y-axis 0-100%, by gender (f, m) within each period]
        1720-1779     1780-1839     1840-1913
        f      m      f      m      f      m
other   167    205    164    244    138    173
zero    53     54     66     52     110    73
f 1720-1779 vs 1780-1839: p = 0.268
f 1720-1779 vs 1840-1913: p = 0.000
f 1780-1839 vs 1840-1913: p = 0.000
m 1720-1779 vs 1780-1839: p = 0.326
m 1720-1779 vs 1840-1913: p = 0.022
m 1780-1839 vs 1840-1913: p = 0.001
36. References
• Gurney, Thomas. 1752. Brachygraphy: or short-writing.
2nd ed. London: [no publisher].
• Nevalainen, Terttu & Raumolin-Brunberg, Helena (eds).
1996. Sociolinguistics and language history: studies
based on the corpus of early English correspondence.
Amsterdam: Rodopi.
• Trudgill, Peter. 1974. The Social Differentiation of
English in Norwich. Cambridge: Cambridge University
Press.
• van Leeuwen, Marco H.D., Ineke Maas and Andrew
Miles. 2002. HISCO: Historical international standard
classification of occupations. Leuven: Leuven University
Press.