1. Statistical methods for Historical Linguistics
Robin J. Ryder
Centre de Recherche en Math´matiques de la D´cision
e e
Universit´ Paris-Dauphine
e
7 March 2013
UCLA
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 1 / 98
2. Introduction
A large number of recent papers describe computationally-intensive
statistical methods for Historical Linguistics
Increased computational power
Advances in statistical methodology
Complex linguistic questions which cannot be answered with
traditional methods
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 2 / 98
3. Caveats
I am not a linguist
I am a statistician
These papers were not written by me; most figures were created by
the papers’ authors
I use the word ”evolution” in a broad sense
”All models are wrong, but some are useful”
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 3 / 98
4. Aims of this talk
Review of several recent papers on statistical models for Historical
Linguistics
Statisticians won’t replace linguists
When done correctly, collaborations between statisticians and linguists
can provide useful results
Brief introduction to some major statistical concepts + statistical
comments
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 4 / 98
5. Statistics slides
Different colour
Feel free to ignore the equations
Useful (essential?) if you wish to collaborate with statisticians
Limitations of statistical tools
Statistics should not be a black box
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 5 / 98
6. Advantages of statistical methods
Analyse (very) large datasets
Test multiple hypotheses
Cross-validation
Estimate uncertainty
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 6 / 98
7. Statistical method in a nutshell
1 Collect data
2 Design model
3 Perform inference (MCMC, ...)
4 Check convergence
5 In-model validation (is our inference method able to answer questions
from our model?)
6 Model mis-specification analysis (do we need a more complex model?)
7 Conclude
In general, it is more difficult to perform inference for a more complex
model.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 7 / 98
8. Contents
1 Swadesh: glottochronology
2 Gray and Atkinson: Language phylogenies
3 Lieberman et al., Pagel et al.: Frequency of use
4 Perreault and Mathew, Atkinson: Phoneme expansion
5 Dunn et al.: Word-order universals
6 Bouchard-Cˆt´ et al: automated cognates
oe
7 Overall conclusions
Tomorrow: phylogenetics for core vocabulary
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 8 / 98
9. Outline
1 Swadesh: glottochronology
2 Gray and Atkinson: Language phylogenies
3 Lieberman et al., Pagel et al.: Frequency of use
4 Perreault and Mathew, Atkinson: Phoneme expansion
5 Dunn et al.: Word-order universals
6 Bouchard-Cˆt´ et al: automated cognates
oe
7 Overall conclusions
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 9 / 98
10. Swadesh (1952)
LEXICO-STATISTIC DATING OF PREHISTORIC ETHNIC CONTACTS
WithSpecialReference North
to American
Indiansand Eskimos
MORRIS SWADESH
PREHISTORY refersto the long period of early measuring amountof radioactivity going
the still
humansociety before writing was availableforthe on. Consequently, is possible to determine
it
recording events. In a fewplaces it gives way
of within certainlimitsof accuracythetimedepthof
to the modernepoch of recorded history much
as any archeological whichcontains suitablebit
site a
as six or eightthousand yearsago; in manyareas of bone, wood, grass, or any otherorganic sub-
this happened only in the last few centuries. stance.
Everywhere prehistory represents greatobscure
a Lexicostatistic dating makes use of very dif-
depthwhichscienceseeks to penetrate. And in- ferent materialfromcarbondating, but the broad
deed powerful means have been foundforillumi- theoretical prin~iple similar. Researchesby the
is
natingthe unrecorded past,including evidence
the presentauthorand several otherscholarswithin
of archeological findsand that of the geographic the last few years have revealedthat the funda-
distribution culturalfactsin the earliestknown
of mentaleveryday vocabularyof any language-as
periods. Much dependson thepainstaking analy- againstthe specializedor "cultural"vocabulary-
sis and comparison data, and on the effective
of changes at a relatively constantrate. The per-
readingof theirimplications. Very important is centage of retainedelementsin a suitable test
thecombined of all theevidence,
use linguistic and vocabularytherefore indicatesthe elapsed time.
ethnographic well as archeological,
as biological, a
Wherever speechcommunity comesto be divided
and geological. And it is essentialconstantly to into two or more parts so that linguistic change
seek new means of expandingand rendering more goes separatewaysin each of thenew speechcom-
accurateour deductions about prehistory. munities, percentage commonretainedvo-
the of
One of the mostsignificant recenttrendsin the cabulary gives an index of theamountof timethat
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 10 / 98
11. Question at hand
Aim: dating the MRCA (Most Recent Common Ancestor) of a pair of
languages.
Data: ”core vocabulary” (Swadesh lists). 215 or 100 words.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 11 / 98
12. all dog good live right spit two
and drink grass liver (side) split vomit
animal dry green long river
road squeeze walk
ashes dull guts louse
at root stab warm
dust hair man
back rope stand
ear hand many wash
bad rotten star
earth he meat water
bark round stick
eat head moon
rub stone we
egg hear
because mother salt wet
eye heart
belly sand straight
fall heavy suck what
big mountain say
far here sun when
bird mouth
fat hit scratch swell
bite name where
father hold sea swim
black narrow
fear horn white
blood near see tail
how seed ten who
blow neck
feather hunt
bone new sew that wide
few
breast night sharp there
fight husband wife
nose short they
breathe fire I wind
not sing thick
burn fish ice wing
old sit thin
child five if
one skin think wipe
claw float in
other sky this
flow kill with
cloud sleep thou
person
cold flower knee
play small three
fly know woman
come smell throw
pull
Robin count fog
J. Ryder (Paris Dauphine) lake
Statistics and Historical Linguistics smoke UCLA March 2013 woods 98
12 /
tie
13. Assumptions
Swadesh assumed that core vocabulary evolves at a constant rate (through
time, space and meanings). Given a pair of languages with percentage C
of shared cognates, and a constant retention rate r , the age t of the
MRCA is
log C
t=
2 log r
The constant r was estimated using a pair of languages for which the age
of the MRCA is known.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 13 / 98
14. Issues with glottochronology
Many statistical shortcomings. Mainly:
1 Simplistic model
2 No evaluation of uncertainty of estimates
3 Only small amounts of data are used
See Bergsland and Vogt (1962) for debunking of glottochronology.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 14 / 98
15. What has changed?
More elaborate models + model misspecification analyses
We can estimate the uncertainty (⇒easier to answer ”I don’t know”)
Large amounts of data
Tomorrow: new analysis of Bergsland and Vogt’s data.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 15 / 98
16. Outline
1 Swadesh: glottochronology
2 Gray and Atkinson: Language phylogenies
3 Lieberman et al., Pagel et al.: Frequency of use
4 Perreault and Mathew, Atkinson: Phoneme expansion
5 Dunn et al.: Word-order universals
6 Bouchard-Cˆt´ et al: automated cognates
oe
7 Overall conclusions
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 16 / 98
18. Swadesh lists, better analysed
Use Swadesh lists for 87 Indo-European languages, and a phylogenetic
model from Genetics
Assume a tree-like model of evolution
Bayesian inference via MCMC (Markov Chain Monte Carlo)
Reconstruct trees and dates
Main parameter of interest: age of the root (Proto-Indo-European,
PIE)
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 18 / 98
19. Lexical trees
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 19 / 98
20. Correcting the issues with glottochronology
Returning to the issues with glottochronology:
1 Simplistic model → Slightly better, but the model of evolution is
rudimentary
2 No evaluation of uncertainty of estimates → Bayesian inference
3 Only small amounts of data are used → Large number of languages
reduces variability of estimates
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 20 / 98
21. Why be Bayesian?
In the settings described in this talk, it usually makes sense to use
Bayesian inference, because:
The models are complex
Estimating uncertainty is paramount
The output of one model is used as the input of another
We are interested in complex functions of our parameters
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 21 / 98
22. Frequentist statistics
Statistical inference deals with estimating an unknown parameter θ
given some data D.
In the frequentist view of statistics, θ has a true fixed (deterministic)
value.
Uncertainty is measured by confidence intervals, which are not
intuitive to interpret: if I get a 95% CI of [80 ; 120] (i.e. 100 ± 20)
for θ, I cannot say that there is a 95% probability that θ belongs to
the interval [80 ; 120].
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 22 / 98
23. Frequentist statistics
Statistical inference deals with estimating an unknown parameter θ
given some data D.
In the frequentist view of statistics, θ has a true fixed (deterministic)
value.
Uncertainty is measured by confidence intervals, which are not
intuitive to interpret: if I get a 95% CI of [80 ; 120] (i.e. 100 ± 20)
for θ, I cannot say that there is a 95% probability that θ belongs to
the interval [80 ; 120].
Frequentist statistics often use the maximum likelihood estimator: for
which value of θ would the data be most likely (under our model)?
L(θ|D) = P[D|θ]
ˆ
θ = arg max L(θ|D)
θ
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 22 / 98
24. Bayesian statistics
In the Bayesian framework, the parameter θ is seen as inherently
random: it has a distribution.
Before I see any data, I have a prior distribution on π(θ), usually
uninformative.
Once I take the data into account, I get a posterior distribution,
which is hopefully more informative.
π(θ|D) ∝ π(θ)L(θ|D)
Different people have different priors, hence different posteriors. But
with enough data, the choice of prior matters little.
We are now allowed to make probability statements about θ, such as
”there is a 95% probability that θ belongs to the interval [78 ; 119]”
(credible interval)
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 23 / 98
25. Advantages and drawbacks of Bayesian statistics
More intuitive interpretation of the results
Easier to think about uncertainty
In a hierarchical setting, it becomes easier to take into account all the
sources of variability
Prior specification: need to check that changing your prior does not
change your result
Computationally intensive
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 24 / 98
26. Effect of prior
Example: Bernoulli model (biased coin). θ=probability of success.
Observe y = 72 successes out of n = 100 trials.
ˆ
Frequentist estimate: θ = 0.72
Confidence interval: [0.63 0.81].
Bayesian estimate: will depend on the prior.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 25 / 98
27. Effect of prior
prior(x)
8
4
0
0.0 0.2 0.4 0.6 0.8 1.0
x
prior(x)
8
4
0
0.0 0.2 0.4 0.6 0.8 1.0
x
prior(x)
8
4
0
0.0 0.2 0.4 0.6 0.8 1.0
x
y = 72, n = 100
Black:prior; green: likelihood; red: posterior.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 26 / 98
28. Effect of prior
prior(x)
8
4
0
0.0 0.2 0.4 0.6 0.8 1.0
x
prior(x)
8
4
0
0.0 0.2 0.4 0.6 0.8 1.0
x
prior(x)
8
4
0
0.0 0.2 0.4 0.6 0.8 1.0
x
y = 7, n = 10
Black:prior; green: likelihood; red: posterior.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 27 / 98
29. Effect of prior
prior(x)
25
0 10
0.0 0.2 0.4 0.6 0.8 1.0
x
25
prior(x)
0 10
0.0 0.2 0.4 0.6 0.8 1.0
x
25
prior(x)
0 10
0.0 0.2 0.4 0.6 0.8 1.0
x
y = 721, n = 1000
Black:prior; green: likelihood; red: posterior.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 28 / 98
30. Bayesian inference for lexical trees
The tree parameter is seen as random: it has a distribution
Via MCMC, G & A get a sample of possible trees, with associated
probabilities, rather than a single tree
The uncertainty in trees is thus made explicit
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 29 / 98
31. G & A: Assumptions
Tree-like evolution
Genetics model
Independence of cognates
Constant rate of change
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 30 / 98
32. G & A: conclusions
Age of PIE: 7800-9800 BP (Before Present)
Large error bars, but this is a good thing
Reconstruct many known features of the tree of Indo-European
languages
Little validation of the model, no model misspecification analysis
Much more on these data and more elaborate models and analyses
tomorrow
These trees can also be used as a building block to answer other
questions
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 31 / 98
33. Outline
1 Swadesh: glottochronology
2 Gray and Atkinson: Language phylogenies
3 Lieberman et al., Pagel et al.: Frequency of use
4 Perreault and Mathew, Atkinson: Phoneme expansion
5 Dunn et al.: Word-order universals
6 Bouchard-Cˆt´ et al: automated cognates
oe
7 Overall conclusions
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 32 / 98
34. Lieberman et al. (2007)
Vol 449 | 11 October 2007 | doi:10.1038/nature06137
LETTERS
Quantifying the evolutionary dynamics of language
Erez Lieberman1,2,3*, Jean-Baptiste Michel1,4*, Joe Jackson1, Tina Tang1 & Martin A. Nowak1
Human language is based on grammatical rules1–4. Cultural evolu- applied to three-quarters of all Old English verbs21. The excep-
tion allows these rules to change over time5. Rules compete with tions—ancestors of the modern irregular verbs—were mostly
each other: as new rules rise to prominence, old ones die away. To members of the so-called ‘strong’ verbs. There are seven different
quantify the dynamics of language evolution, we studied the regu- classes of strong verbs with exemplars among the Modern English
larization of English verbs over the past 1,200 years. Although an irregular verbs, each with distinguishing markers that often include
elaborate system of productive conjugations existed in English’s characteristic vowel shifts. Although stable coexistence of multiple
proto-Germanic ancestor, Modern English uses the dental suffix, rules is one possible outcome of rule dynamics, this is not what
‘-ed’, to signify past tense6. Here we describe the emergence of this occurred in English verb inflection22. We therefore define regularity
linguistic rule amidst the evolutionary decay of its exceptions, with respect to the modern ‘-ed’ rule, and call all these exceptional
known to us as irregular verbs. We have generated a data set of forms ‘irregular’.
verbs whose conjugations have been evolving for more than a We consulted a large collection of grammar textbooks describing
millennium, tracking inflectional changes to 177 Old-English verb inflection in these earlier epochs, and hand-annotated every
irregular verbs. Of these irregular verbs, 145 remained irregular irregular verb they described (see Supplementary Information).
in Middle English and 98 are still irregular today. We study how This provided us with a list of irregular verbs from ancestral
the rate of regularization depends on the frequency of word usage. forms of English. By eliminating verbs that were no longer part of
The half-life of an irregular verb scales as the square root of its Modern English, we compiled a list of 177 Old English irregular
usage frequency: a verb that is 100 times less frequent regularizes verbs that remain part of the language to this day. Of these 177 Old
10 times as fast. Our study provides a quantitative analysis of the English irregulars, 145 remained irregular in Middle English and 98
regularization process by which ancestral forms gradually yield to are still irregular in Modern English. Verbs such as ‘help’, ‘grip’ and
an emerging linguistic rule. ‘laugh’, which were once irregular, have become regular with the
Natural languages comprise elaborate systems of rules that enable
Robin J. Ryder (Paris Dauphine) passing of time.
7 Statistics and Historical Linguistics UCLA March 2013 33 / 98
35. Question at hand
Wish to verify that the rate of change tends to be slower for words which
are used more frequently
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 34 / 98
36. Data
Regularization of irregular verbs
Old English, Middle English and Modern English (1200 years of
evolution)
177 irregular verbs, of which 79 regularize
Frequency of use
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 35 / 98
37. Results
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 36 / 98
38. Conclusions
Regularization rate changes as the square root of usage frequency
A verb used 4 times more often regularizes at half the rate
Evolutionary history known
Sub-optimal: binning
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 37 / 98
39. Pagel et al.
Vol 449 | 11 October 2007 | doi:10.1038/nature06176
LETTERS
Frequency of word-use predicts rates of lexical
evolution throughout Indo-European history
Mark Pagel1,2, Quentin D. Atkinson1 & Andrew Meade1
´
Greek speakers say ‘‘oura’’, Germans ‘‘schwanz’’ and the French paired meanings in the Bantu languages. This indicates that variation
‘‘queue’’ to describe what English speakers call a ‘tail’, but all of in the rates of lexical replacement among meanings is not merely an
these languages use a related form of ‘two’ to describe the number historical accident, but rather is linked to some general process of
after one. Among more than 100 Indo-European languages and language evolution.
dialects, the words for some meanings (such as ‘tail’) evolve Social and demographic factors proposed to affect rates of
rapidly, being expressed across languages by dozens of unrelated language change within populations of speakers include social status11,
words, while others evolve much more slowly—such as the num- the strength of social ties12, the size of the population13 and levels of
ber ‘two’, for which all Indo-European language speakers use the outside contact14. These forces may influence rates of evolution on a
same related word-form1. No general linguistic mechanism has local and temporally specific scale, but they do not make general
been advanced to explain this striking variation in rates of lexical predictions across language families about differences in the rate of
replacement among meanings. Here we use four large and diver- lexical replacement among meanings. Drawing on concepts from
gent language corpora (English2, Spanish3, Russian4 and Greek5) theories of molecular15 and cultural evolution16–18, we suggest that
and a comparative database of 200 fundamental vocabulary mean- the frequency with which different meanings are used in everyday
ings in 87 Indo-European languages6 to show that the frequency language may affect the rate at which new words arise and become
with which these words are used in modern language predicts their adopted in populations of speakers. If frequency of meaning-use is
rate of replacement over thousands of years of Indo-European a shared and stable feature of human languages, then this could
language evolution. Across all 200 meanings, frequently used provide a general mechanism to explain the large differences across
words evolve at slower rates and infrequently used words evolve meanings in observed rates of lexical replacement. Here we test this
Robin J. Ryder (Paris Dauphine) holds separately and identically idea by examining the relationship between the rates at which Indo-
more rapidly. This relationship Statistics and Historical Linguistics UCLA March 2013 38 / 98
40. Question at hand
Check link between frequency of use and rate of change for vocabulary.
Hypothesis: when a meaning is used more often, the corresponding word
has less chances of changing.
Problem: since this rate is expected to be much slower than that of strong
verb regularization, we need to look at much deeper history. But then the
evolutionary history is unknown.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 39 / 98
41. Workaround
Use Indo-European core vocabulary data, and frequencies from
English, Greek, Russian and Spanish
Get a sample from the distribution on trees and ancestral ages using
G&A’s method
For each tree in the sample, estimate the rate of change for each
meaning.
Average across all trees.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 40 / 98
42. Results
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 41 / 98
43. Comments
Even if there is high uncertainty in the phylogenies, we can still
answer other questions (integrating out the tree)
Similar results for Bantu (Pagel and Meade 2006)
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 42 / 98
44. Sampling from a distribution
Given a parameter θ with probability density π(θ), we often wish to
estimate the expected value E [θ], or the probability mass of an interval
P[a < θ < b].
If we can sample n independent and identically distributed (iid) values
θ1 , . . . , θn from π(·), then (under general assumptions) we have a good
estimated of E [θ]:
n
1
θi −→ E [θ]
n n→∞
i=1
(and the same holds for Ih = h(θ) dθ for any measurable function h).
Convergence is fast.
But in general, getting an iid sample is difficult or impossible.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 43 / 98
45. MCMC
If we are able to compute the value of the density π(θ) at any point θ, we
can use MCMC (Markov Chain Monte Carlo) to generate correlated
θ1 , . . . , θn from π(·). The convergence still holds:
n
1
θi −→ E [θ]
n n→∞
i=1
but is much slower. Since we use a finite value of n, we need to check the
convergence of the algorithm. In general, this is done by:
1 Checking the trace of the algorithm
2 Checking the autocorrelations
3 Running the algorithm for a very large value of n
Curse of dimensionality: when the dimensionality of the parameter space
increases, convergence slows dramatically.
More during the TraitLab tutorial.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 44 / 98
46. ABC
If we are not able to compute π(θ), approximate inference may still be
possible using a likelihood-free method such as Approximate Bayesian
Computation (ABC).
The idea is to draw many values of the parameters; for each value,
simulate a new dataset; keep only parameter values which lead to
simulated data ”close” to the observed data.
The theoretical properties are still mostly unknown. This method induces
an approximation, and it is difficult to control the size of the
approximation.
But sometimes, this is the only possibility.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 45 / 98
47. Outline
1 Swadesh: glottochronology
2 Gray and Atkinson: Language phylogenies
3 Lieberman et al., Pagel et al.: Frequency of use
4 Perreault and Mathew, Atkinson: Phoneme expansion
5 Dunn et al.: Word-order universals
6 Bouchard-Cˆt´ et al: automated cognates
oe
7 Overall conclusions
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 46 / 98
48. Perreault and Mathew
Dating the Origin of Language Using Phonemic Diversity
Charles Perreault1., Sarah Mathew2*.
1 Santa Fe Institute, Santa Fe, New Mexico, United States of America, 2 Centre for the Study of Cultural Evolution, Stockholm University, Stockholm, Sweden
Abstract
Language is a key adaptation of our species, yet we do not know when it evolved. Here, we use data on language phonemic
diversity to estimate a minimum date for the origin of language. We take advantage of the fact that phonemic diversity
evolves slowly and use it as a clock to calculate how long the oldest African languages would have to have been around in
order to accumulate the number of phonemes they possess today. We use a natural experiment, the colonization of
Southeast Asia and Andaman Islands, to estimate the rate at which phonemic diversity increases through time. Using this
rate, we estimate that present-day languages date back to the Middle Stone Age in Africa. Our analysis is consistent with the
archaeological evidence suggesting that complex human behavior evolved during the Middle Stone Age in Africa, and does
not support the view that language is a recent adaptation that has sparked the dispersal of humans out of Africa. While
some of our assumptions require testing and our results rely at present on a single case-study, our analysis constitutes the
first estimate of when language evolved that is directly based on linguistic data.
Citation: Perreault C, Mathew S (2012) Dating the Origin of Language Using Phonemic Diversity. PLoS ONE 7(4): e35289. doi:10.1371/journal.pone.0035289
Editor: Michael D. Petraglia, University of Oxford, United Kingdom
Received September 8, 2011; Accepted March 14, 2012; Published April 27, 2012
Copyright: ß 2012 Perreault, Mathew. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Funding was provided by the Swedish Research Council [grant number 2009-2390]. The funders had no role in study design, data collection and
analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: sarah.mathew@arklab.su.se
. These authors contributed equally to this work.
Introduction of the globe to be colonized. The loss of phonemes through serial
founder effect is consistent with other lines of evidence that
A capacity for language is a hallmark of our species [1,2], yet we indicate that phonemic diversity is determinedMarch 2013
Robin J. Ryderlittle about Dauphine)its appearance. Language appears in
know (Paris the timing of Statistics and Historical Linguistics UCLA by cultural 47 / 98
50. Premises
Two small populations B and C disperse from the same parent A. B
colonizes a large territory, C a small isolated island.
B will tend to accumulate new phonemes at a faster rate than C
(PB > PC ).
If the island is small enough, C will have about the same phonemic
diversity as the ancestral language A (E (PC ) = PA ).
The difference in modern-day phonemic diversity is due entirely to B,
so if we know the splitting time t, we can estimate the rate of
phoneme accumulation in a large territory.
Two models of phoneme accumulation: linear and exponential.
PB − PC ln(PB ) − ln(PC )
r= k=
t t
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 49 / 98
51. Idea
Estimate r or k
Estimate modern phonemic diversity in Africa (PAfrica ) and guess
initial phonemic diversity Pinitial (take Pinitial = 11 or 29, minimum
and median modern phonemic diversity).
Then we can deduce the age of the initial language.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 50 / 98
52. Parameter estimation
Single natural experiment: Andaman islands (colonized 45-65 kya).
Estimate PB and PC by averaging over 20 languages.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 51 / 98
53. Results
(Linear/exponential model; 2 estimates for colonization of Andaman
islands; 2 guesses of Pinitial ).
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 52 / 98
54. Comments
Highly dependent on strong assumptions
Needs model validation
The parameter estimates are very noisy, since they are based on
related languages.
Intriguing
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 53 / 98
55. 1. N. C. Seeman, J. Theor. Biol. 99, 237 (1982). 16. S. M. Douglas et al., Nature 459, 414 (2009). Center for Bio-Inspired Solar Fuel Production, an
2. N. C. Seeman, Nature 421, 427 (2003). 17. Y. Ke et al., J. Am. Chem. Soc. 131, 15903 (2009). Energy Frontier Research Center funded by the U.S.
Atkinson 3.
4.
5.
T. J. Fu, N. C. Seeman, Biochemistry 32, 3211 (1993).
J. H. Chen, N. C. Seeman, Nature 350, 631 (1991).
E. Winfree, F. Liu, L. A. Wenzler, N. C. Seeman,
18. H. Dietz, S. M. Douglas, W. M. Shih, Science 325,
725 (2009).
19. Materials and methods are available as supporting
Department of Energy, Office of Science, Office of
Basic Energy Sciences under award DE-SC0001016.
Nature 394, 539 (1998). material on Science Online.
6. H. Yan, S. H. Park, G. Finkelstein, J. H. Reif, T. H. LaBean, 20. P. W. K. Rothemund et al., J. Am. Chem. Soc. 126,
Supporting Online Material
16344 (2004). www.sciencemag.org/cgi/content/full/332/6027/342/DC1
Science 301, 1882 (2003).
Materials and Methods
7. W. M. Shih, J. D. Quispe, G. F. Joyce, Nature 427, 618 21. P. J. Hagerman, Annu. Rev. Biophys. Chem. 17, 265
SOM Text
(2004). (1988).
Figs. S1 to S39
8. F. Mathieu et al., Nano Lett. 5, 661 (2005). Acknowledgments: We thank the AFM applications group
Tables S1 to S19
9. R. P. Goodman et al., Science 310, 1661 (2005). at Bruker Nanosurfaces for assistance in acquiring some
Reference 22
10. F. A. Aldaye, H. F. Sleiman, J. Am. Chem. Soc. 129, of the high-resolution AFM images, using Peak Force
13376 (2007). Tapping with ScanAsyst on the MultiMode 8. This research 18 January 2011; accepted 4 March 2011
11. Y. He et al., Nature 452, 198 (2008). was partly supported by grants from the Office of Naval 10.1126/science.1202998
Downloaded from www.sciencemag.org on April 19, 2011
0.131, df = 503, P = 0.003). To account for any
Phonemic Diversity Supports a Serial non-independence within language families, the
analysis was repeated, first using mean values at
Founder Effect Model of Language the language family level (table S2) and then
using a hierarchical linear regression framework
Expansion from Africa to model nested dependencies in variation at the
family, subfamily, and genus levels (15). These
analyses confirm that, consistent with a founder
Quentin D. Atkinson1,2* effect model, smaller population size predicts
reduced phoneme inventory size both between
Human genetic and phenotypic diversity declines with distance from Africa, as predicted by a families (family-level analysis r = 0.468, df = 49,
serial founder effect in which successive population bottlenecks during range expansion P < 0.001; fig. S1B) and within families, con-
progressively reduce diversity, underpinning support for an African origin of modern humans. trolling for taxonomic affiliation {hierarchical lin-
Recent work suggests that a similar founder effect may operate on human culture and ear model: fixed-effect coefficient (b) = 0.0338
language. Here I show that the number of phonemes used in a global sample of 504 languages to 0.0985 [95% highest posterior density (HPD)],
is also clinal and fits a serial founder–effect model of expansion from an inferred origin in P = 0.009}.
Africa. This result, which is not explained by more recent demographic history, local language Figure 1B shows clear regional differences in
diversity, or statistical non-independence within language families, points to parallel phonemic diversity, with the largest phoneme
mechanisms shaping genetic and linguistic diversity and supports an African origin of inventories in Africa and the smallest in South
modern human languages. America and Oceania. A series of linear regres-
sions was used to predict phoneme inventory size
he number of phonemes—perceptually ing the evolution of phonemes (11, 16) and lan- from the log of speaker population size and dis-
T distinct units of sound that differentiate
words—in a language is positively corre-
lated with the size of its speaker population (1) in
guage generally (17–20). This raises the possibility
that the serial founder–effect model used to trace
our genetic origins to a recent expansion from Africa
tance from 2560 potential origin locations around
the world (15). Incorporating modern speaker
population size into the model controls for geo-
such a way that small populations have fewer (4–9) could also be applied to global phonemic graphic patterning in population size and means
phonemes. Languages continually gain and lose diversity to investigate the origin and expansion that the analysis is conservative about the amount
phonemes because of stochastic processes (2, 3). of modern human languages. Here I examine geo- of variation attributed to ancient demography.
If phoneme distinctions are more likely to be lost graphic variation in phoneme inventory size using Model fit was evaluated with the Bayesian in-
Robin in small founder populations, then a succession
J. Ryder (Paris Dauphine) Statistics and Historical Linguistics
data on vowel, consonant, and tone inventories formation criterionUCLA March 2013
(BIC) (22). Following previ- 54 / 98
56. Question at hand
Can we use phonemic data to reconstruct the expansion of language?
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 55 / 98
57. Premises
Premise 1: Size of population is positively correlated with number of
phonemes.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 56 / 98
58. Premises
Premise 1: Size of population is positively correlated with number of
phonemes.
The effect is small, but significant. Still holds when you restrict to
population < 5000.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 56 / 98
59. Premises
Premise 1: Size of population is positively correlated with number of
phonemes.
Premise 2: Language expansion occurred with a series of small founder
groups.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 56 / 98
60. Premises
Premise 1: Size of population is positively correlated with number of
phonemes.
Premise 2: Language expansion occurred with a series of small founder
groups.
Conclusion: During language expansion, the number of phonemes
decreases at the frontier of expansion.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 56 / 98
61. Idea
Test many different possible points of origin. From the true point of
origin, you should observe a cline in phoneme diversity.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 57 / 98
62. Data
Data from WALS, 504 languages (all data with languages). Three traits,
grouped into classes by WALS:
Tone structure (No tone / Simple tone / Complex tone)
Number of vowels (2–4 / 5–6 / 7–14)
Number of consonants (6–14 / 15–18 / 19–25 / 26–33 / 34+)
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 58 / 98
63. Map of languages
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 59 / 98
64. Tone map
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 60 / 98
65. Vowel map
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 61 / 98
66. Consonant map
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 62 / 98
67. Measure of phonemic diversity
A measure of phonemic diversity is created.
Scale and center the three components: tone, vowels, consonants
(mean=0, variance=1)
Diversity is measured as the unweighted sum of the three components
This is probably the most contentious aspect of the paper. The issues this
raises are dealt with in the Supplementary Material.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 63 / 98
68. Main result
For each possible origin, perform a linear regression of phoneme diversity
against distance from origin.
Best regression:
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 64 / 98
69. Map of best origin
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 65 / 98
70. Validation
Several other factors could have an influence. The author checks:
Dependence within language families
Modern population size
Geographic area
Population density
Local language diversity (number of languages within N km,
N = 500, 1000, 2000, 3000)
A second origin
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 66 / 98
71. Validation
Several other factors could have an influence. The author checks:
Dependence within language families
Modern population size
Geographic area
Population density
Local language diversity (number of languages within N km,
N = 500, 1000, 2000, 3000)
A second origin
Other factors certainly play a role. However, as long as they are
uncorrelated with distance to Africa, they will not bias the results.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 66 / 98
72. Tones vs Vowels vs Consonants
The measure of diversity is fishy.
The author also checks for Tones, Vowels and Consonants separately.
The conclusions are unchanged.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 67 / 98
73. Comments
Statistically sound.
The data are what they are...
Assumptions: see multiple comments in Linguistic Typology (2011)
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 68 / 98
74. Bayesian model choice
The evidence for or against a model given data is summarized in the Bayes
factor:
P[M = 2|y ]/P[M = 1|y ]
B21 (y ) = ]
P[M = 2]/P[M = 1]
m2 (y )
=
m1 (y )
where
mk (y ) = L(θk |y )πk (θk )dθk
Θk
This includes an automatic penalty for complexity: when a model is more
complex, the integral is over a larger space, where the likelihood is most of
the time very small.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 69 / 98
75. Interpreting the Bayes factor
Jeffrey’s scale of evidence states that
If log10 (B21 ) is between 0 and 0.5, then the evidence in favor of
model 2 is weak
between 0.5 and 1, it is substantial
between 1 and 2, it is strong
above 2, it is decisive
(and symmetrically for negative values)
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 70 / 98
76. Outline
1 Swadesh: glottochronology
2 Gray and Atkinson: Language phylogenies
3 Lieberman et al., Pagel et al.: Frequency of use
4 Perreault and Mathew, Atkinson: Phoneme expansion
5 Dunn et al.: Word-order universals
6 Bouchard-Cˆt´ et al: automated cognates
oe
7 Overall conclusions
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 71 / 98
77. Dunn et al.
LETTER doi:10.1038/nature09923
Evolved structure of language shows lineage-specific
trends in word-order universals
Michael Dunn1,2, Simon J. Greenhill3,4, Stephen C. Levinson1,2 & Russell D. Gray3
Languages vary widely but not without limit. The central goal of after the noun, whereas dominant object–verb ordering predicts post-
linguistics is to describe the diversity of human languages and positions, relative clauses and genitives before the noun4. One general
explain the constraints on that diversity. Generative linguists fol- explanation for these observations is that languages tend to be consist-
lowing Chomsky have claimed that linguistic diversity must be ent (‘harmonic’) in their order of the most important element or ‘head’
constrained by innate parameters that are set as a child learns a of a phrase relative to its ‘complement’ or ‘modifier’3, and so if the verb
language1,2. In contrast, other linguists following Greenberg have is first before its object, the adposition (here preposition) precedes the
claimed that there are statistical tendencies for co-occurrence of noun, while if the verb is last after its object, the adposition follows the
traits reflecting universal systems biases3–5, rather than absolute noun (a ‘postposition’). Other functionally motivated explanations
constraints or parametric variation. Here we use computational emphasize consistent direction of branching within the syntactic struc-
phylogenetic methods to address the nature of constraints on ture of a sentence9 or information structure and processing efficiency5.
linguistic diversity in an evolutionary framework6. First, contrary To demonstrate that these correlations reflect underlying cognitive
to the generative account of parameter setting, we show that the or systems biases, the languages must be sampled in a way that controls
evolution of only a few word-order features of languages are for features linked only by direct inheritance from a common
strongly correlated. Second, contrary to the Greenbergian general- ancestor10. However, efforts to obtain a statistically independent sample
izations, we show that most observed functional dependencies of languages confront several practical problems. First, our knowledge
between traits are lineage-specific rather than universal tendencies. of language relationships is incomplete: specialists disagree about high-
These findings support the view that—at least with respect to word level groupings of languages and many languages are only tentatively
order—cultural evolution is the primary factor that determines assigned to language families. Second, a few large language families
linguistic structure, with the current state of a linguistic system contain the bulk of global linguistic variation, making sampling purely
shaping and constraining future states. from unrelated languages impractical. Some balance of related, unre-
Human language is unique amongst animal communication sys- lated and areally distributed languages has usually been aimed for in
Robin J. Ryderonly for itsDauphine)
tems not (Paris structural complexity but also for its diversity at practice11,12.
Statistics and Historical Linguistics UCLA March 2013 72 / 98
78. Question at hand
Do the following traits evolve independently?
1 Adjective-Noun order
2 Adposition-Noun phrase order
3 Demonstrative-Noun order
4 Genitive-Noun order
5 Numeral-Noun order
6 Object-Verb order
7 Relative clause-Noun order
8 Subject-Verb order
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 73 / 98
79. Idea
Difficult to get an independent sample of languages
Instead, look at languages known to be related and check for
co-evolution
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 74 / 98
80. Data
4 language families:
Indo-European
Bantu
Uto-Aztecan
Austronesian
Word-order data mainly from WALS.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 75 / 98
81. Lexical tree
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 76 / 98
82. Methods
Start with a sample of trees constructed with lexical data.
Put two traits at the leaves
Perform Bayesian model choice between independent model and
co-evolution model
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 77 / 98
83. Two models
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 78 / 98
84. Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 79 / 98
85. Results
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 80 / 98
86. Conclusions
The networks look very different.
Co-evolution seems to be very different between different language
families.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 81 / 98
87. What went wrong?
1 Nothing about borrowing or geography
2 No attempt at cross-validation
3 Over-interpreting the data
4 Testing too many hypotheses
5 They did not test the hypothesis that different families have the same
dependencies!
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 82 / 98
88. Outline
1 Swadesh: glottochronology
2 Gray and Atkinson: Language phylogenies
3 Lieberman et al., Pagel et al.: Frequency of use
4 Perreault and Mathew, Atkinson: Phoneme expansion
5 Dunn et al.: Word-order universals
6 Bouchard-Cˆt´ et al: automated cognates
oe
7 Overall conclusions
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 83 / 98
89. Automatic cognate detection?
All models of lexical data take as granted cognacy judgements (and
family membership)
It would be nice to do without these: we could increase the number
of languages under study, as well as the number of meanings
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 84 / 98
90. Levenshtein distance
The Levenshtein distance (or edit distance) between two words is the
number of changes (substitutions, insertions, deletions) necessary to
change one word into the other.
Swedish tre (/tre:/) and Latin tres (/tre:s/): distance 1 (1 insertion)
Swedish tre and Dutch drie (/dri/): distance 2 (2 substitutions)
(Can be normalized by word length)
By calculating the distance between all words in a meaning list, get a
distance between two languages. Deduce a tree (several possible
algorithms: Neighbour joining, UPGMA...)
Unlike other methods, this does not require cognacy judgements, and
it takes into account the phonetics (more information).
Serva and Petroni 2008, van der Ark et al. 2007: ”the resulting tree looks
close to the truth” (except it doesn’t)
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 85 / 98
91. Levenshtein distance: so many problems!
Does not require that compared words be cognates
Does not require that compared languages be cousins
No model of phonetic change (hence no model validation or
mis-specification)
Some changes are more likely that others
Changes can be systematic (e.g. vowel shift), not independent
See Greenhill (2010) for a detailed rebuttal.
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 86 / 98
92. Bouchard-Cˆt´ et al. (2012)
oe
Automated reconstruction of ancient languages using
probabilistic models of sound change
Alexandre Bouchard-Côtéa,1, David Hallb, Thomas L. Griffithsc, and Dan Kleinb
a
Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; bComputer Science Division and cDepartment of Psychology,
University of California, Berkeley, CA 94720
Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for review
March 19, 2012)
One of the oldest problems in linguistics is reconstructing the building on work on the reconstruction of ancestral sequences
words that appeared in the protolanguages from which modern and alignment in computational biology (8–12). Several groups
languages evolved. Identifying the forms of these ancient lan- have recently explored how methods from computational biology
guages makes it possible to evaluate proposals about the nature can be applied to problems in historical linguistics, but such work
of language change and to draw inferences about human history. has focused on identifying the relationships between languages
Protolanguages are typically reconstructed using a painstaking (as might be expressed in a phylogeny) rather than reconstructing
manual process known as the comparative method. We present a the languages themselves (13–18). Much of this type of work has
family of probabilistic models of sound change as well as algo- been based on binary cognate or structural matrices (19, 20), which
rithms for performing inference in these models. The resulting discard all information about the form that words take, simply in-
system automatically and accurately reconstructs protolanguages dicating whether they are cognate. Such models did not have the
from modern languages. We apply this system to 637 Austronesian goal of reconstructing protolanguages and consequently use a rep-
COMPUTER SCIENCES
languages, providing an accurate, large-scale automatic reconstruc- resentation that lacks the resolution required to infer ancestral
tion of a set of protolanguages. Over 85% of the system’s recon- phonetic sequences. Using phonological representations allows us
structions are within one character of the manual reconstruction to perform reconstruction and does not require us to assume that
provided by a linguist specializing in Austronesian languages. Be- cognate sets have been fully resolved as a preprocessing step. Rep-
ing able to automatically reconstruct large numbers of languages resenting the words at each point in a phylogeny and having a model
provides a useful way to quantitatively explore hypotheses about the of how they change give a way of comparing different hypothesized
factors determining which sounds in a language are likely to change cognate sets and hence inferring cognate sets automatically.
over time. We demonstrate this by showing that the reconstructed The focus on problems other than reconstruction in previous
SYCHOLOGICAL AND
COGNITIVE SCIENCES
Austronesian protolanguages provide compelling support for a hy- computational approaches has meant that almost all existing
pothesis about the relationship between the function of a sound protolanguage reconstructions have been done manually. How-
and its probability of changing that was first proposed in 1955. ever, to obtain more accurate reconstructions for older languages,
large numbers of modern languages need to be analyzed. The
Proto-Austronesian language, for instance, has over 1,200 de-
Robin ancestral | computational | diachronic
J. Ryder (Paris Dauphine) Statistics and Historical Linguistics (21). All of these languages could potentially
scendant languages UCLA March 2013 87 / 98
93. Aim
Model of sound change along a tree (including systematic changes)
Very large number of parameters
Attempt to perform automatically the comparative method
Possible uses: reconstruction of ancient forms, cognate detection,
phylogenies...
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 88 / 98
94. Model
Assume that the phylogeny is known.
/tre:s/
Latin /ste:lla/
/tre/ /tRes/
Italian Spanish /stella/ /estReLa/
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 89 / 98
95. Model
Assume that the phylogeny is known.
/tre:s/
Latin /ste:lla/
/tre/ /tRes/
Italian Spanish /stella/ /estReLa/
Latin to Italian: /tre:s/ → /tre/
P[t → t] = 0.98
P[r → r] = 0.99
P[e: → e] = 0.95
P[s → −] = 0.20
(These numbers are made up.)
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 89 / 98
96. Transition probabilities
Branch Latin to Italian, transition probabilities for /t/
t 0.98
d 0.01
k 0.01
anything else 0
We would like to estimate the transition probabilities from any phoneme to
any other phoneme, on each branch. Huge number of parameters (number
of phonemes×number of phonemes×number of branches).
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 90 / 98
97. Context dependent
In fact, the probability that /s/ is deleted depends on the position within
the word.
The transition probabilities become context dependent:
P[s → −|preceded by e and is last letter] = 0.70
P[r → r|between t and e] = 0.99
Even more parameters!
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 91 / 98
98. Reducing the number of parameters
Generic contexts of the form ”CrV”
Dirichlet prior: force most parameters to be zero (sparsity)
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 92 / 98
99. Implementation
Possible to use large datasets: dumps from Wiktionary, Bible...
Expectation-Maximization algorithm
Only point estimates of the parameters, not uncertainty
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 93 / 98
100. Results: reconstruction
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 94 / 98
101. Results: Most probable transitions
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 95 / 98
102. Bouchard et al.: Comments
Impressive first step
Much more refinement is possible, especially in defining the context
Allows quantitative tests of open questions (e.g. functional load,
universality of changes)
Joint estimation of sound changes and phylogenies
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 96 / 98
103. Outline
1 Swadesh: glottochronology
2 Gray and Atkinson: Language phylogenies
3 Lieberman et al., Pagel et al.: Frequency of use
4 Perreault and Mathew, Atkinson: Phoneme expansion
5 Dunn et al.: Word-order universals
6 Bouchard-Cˆt´ et al: automated cognates
oe
7 Overall conclusions
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 97 / 98
104. Overall conclusions
When done right, statistical methods can provide new insight into
linguistic history
The conclusions will only ever be as good as the data
Importance of collaboration in building the model and in checking for
mis-specification
Major avenues for future research. Challenges in finding relevant
data, building models, and statistical inference:
Models for morphosyntactical traits
Putting together lexical, phonemic and morphosyntactic traits
Escape the tree-like model (borrowing, networks, diffusion processes...)
Incorporate geography
Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 98 / 98