Statistical Methods in Historical Linguistics

Statistical methods for Historical Linguistics

Robin J. Ryder

Centre de Recherche en Math´matiques de la D´cision
e e
Universit´ Paris-Dauphine
e

7 March 2013
UCLA

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 1 / 98

Introduction

A large number of recent papers describe computationally-intensive
statistical methods for Historical Linguistics
Increased computational power
Advances in statistical methodology
Complex linguistic questions which cannot be answered with
traditional methods


Caveats

I am not a linguist
I am a statistician
These papers were not written by me; most ﬁgures were created by
the papers’ authors
I use the word ”evolution” in a broad sense
”All models are wrong, but some are useful”


Aims of this talk

Review of several recent papers on statistical models for Historical
Linguistics
Statisticians won’t replace linguists
When done correctly, collaborations between statisticians and linguists
can provide useful results
Brief introduction to some major statistical concepts + statistical
comments


Statistics slides

Diﬀerent colour
Feel free to ignore the equations
Useful (essential?) if you wish to collaborate with statisticians
Limitations of statistical tools
Statistics should not be a black box


Advantages of statistical methods

Analyse (very) large datasets
Test multiple hypotheses
Cross-validation
Estimate uncertainty


Statistical method in a nutshell

1 Collect data
2 Design model
3 Perform inference (MCMC, ...)
4 Check convergence
5 In-model validation (is our inference method able to answer questions
from our model?)
6 Model mis-speciﬁcation analysis (do we need a more complex model?)
7 Conclude
In general, it is more diﬃcult to perform inference for a more complex
model.


Contents

1 Swadesh: glottochronology

2 Gray and Atkinson: Language phylogenies

3 Lieberman et al., Pagel et al.: Frequency of use

4 Perreault and Mathew, Atkinson: Phoneme expansion

5 Dunn et al.: Word-order universals

6 Bouchard-Cˆt´ et al: automated cognates
oe

7 Overall conclusions
Tomorrow: phylogenetics for core vocabulary

Outline






oe



Swadesh (1952)

LEXICO-STATISTIC DATING OF PREHISTORIC ETHNIC CONTACTS
WithSpecialReference North
to American
Indiansand Eskimos
MORRIS SWADESH

PREHISTORY refersto the long period of early measuring amountof radioactivity going
the still
humansociety before writing was availableforthe on. Consequently, is possible to determine
it
recording events. In a fewplaces it gives way
of within certainlimitsof accuracythetimedepthof
to the modernepoch of recorded history much
as any archeological whichcontains suitablebit
site a
as six or eightthousand yearsago; in manyareas of bone, wood, grass, or any otherorganic sub-
this happened only in the last few centuries. stance.
Everywhere prehistory represents greatobscure
a Lexicostatistic dating makes use of very dif-
depthwhichscienceseeks to penetrate. And in- ferent materialfromcarbondating, but the broad
deed powerful means have been foundforillumi- theoretical prin~iple similar. Researchesby the
is
natingthe unrecorded past,including evidence
the presentauthorand several otherscholarswithin
of archeological findsand that of the geographic the last few years have revealedthat the funda-
distribution culturalfactsin the earliestknown
of mentaleveryday vocabularyof any language-as
periods. Much dependson thepainstaking analy- againstthe specializedor "cultural"vocabulary-
sis and comparison data, and on the effective
of changes at a relatively constantrate. The per-
readingof theirimplications. Very important is centage of retainedelementsin a suitable test
thecombined of all theevidence,
use linguistic and vocabularytherefore indicatesthe elapsed time.
ethnographic well as archeological,
as biological, a
Wherever speechcommunity comesto be divided
and geological. And it is essentialconstantly to into two or more parts so that linguistic change
seek new means of expandingand rendering more goes separatewaysin each of thenew speechcom-
accurateour deductions about prehistory. munities, percentage commonretainedvo-
the of
One of the mostsignificant recenttrendsin the cabulary gives an index of theamountof timethat

Question at hand

Aim: dating the MRCA (Most Recent Common Ancestor) of a pair of
languages.
Data: ”core vocabulary” (Swadesh lists). 215 or 100 words.


all dog good live right spit two
and drink grass liver (side) split vomit
animal dry green long river
road squeeze walk
ashes dull guts louse
at root stab warm
dust hair man
back rope stand
ear hand many wash
bad rotten star
earth he meat water
bark round stick
eat head moon
rub stone we
egg hear
because mother salt wet
eye heart
belly sand straight
fall heavy suck what
big mountain say
far here sun when
bird mouth
fat hit scratch swell
bite name where
father hold sea swim
black narrow
fear horn white
blood near see tail
how seed ten who
blow neck
feather hunt
bone new sew that wide
few
breast night sharp there
fight husband wife
nose short they
breathe fire I wind
not sing thick
burn fish ice wing
old sit thin
child five if
one skin think wipe
claw float in
other sky this
flow kill with
cloud sleep thou
person
cold flower knee
play small three
fly know woman
come smell throw
pull
Robin count fog
J. Ryder (Paris Dauphine) lake
Statistics and Historical Linguistics smoke UCLA March 2013 woods 98
12 /
tie

Assumptions

Swadesh assumed that core vocabulary evolves at a constant rate (through
time, space and meanings). Given a pair of languages with percentage C
of shared cognates, and a constant retention rate r , the age t of the
MRCA is
log C
t=
2 log r
The constant r was estimated using a pair of languages for which the age
of the MRCA is known.


Issues with glottochronology

Many statistical shortcomings. Mainly:
1 Simplistic model
2 No evaluation of uncertainty of estimates
3 Only small amounts of data are used
See Bergsland and Vogt (1962) for debunking of glottochronology.


What has changed?

More elaborate models + model misspeciﬁcation analyses
We can estimate the uncertainty (⇒easier to answer ”I don’t know”)
Large amounts of data
Tomorrow: new analysis of Bergsland and Vogt’s data.


Outline






oe



nature of effusive silicic lavas , and with textural observations at Competing interests statement The authors declare that they have no competing financial
the outcrop scale down to the microscale9–11 (Fig. 1). As opposed to interests.

the common view that explosive volcanism “is defined as involving
Gray and Atkinson
fragmentation of magma during ascent”1, we conclude that frag-
mentation may play an equally important role in reducing the
Correspondence and requests for materials should be addressed to H.M.G.
(hmg@seismo.berkeley.edu).

likelihood of explosive behaviour, by facilitating magma degassing.
Because shear-induced fragmentation depends so strongly on the
rheology of the ascending magma, our findings are in a broader
sense equivalent to Eichelberger’s hypothesis1 that “higher viscosity ..............................................................
of magma may favour non-explosive degassing rather than
hinder it”, albeit with the added complexity of shear-induced
Language-tree divergence times
fragmentation. A support the Anatolian theory
Received 19 May; accepted 15 November 2003; doi:10.1038/nature02138.
1. Eichelberger, J. C. Silicic volcanism: ascent of viscous magmas from crustal reservoirs. Annu. Rev.
of Indo-European origin
Earth Planet. Sci. 23, 41–63 (1995).
2. Dingwell, D. B. Volcanic dilemma: Flow or blow? Science 273, 1054–1055 (1996). Russell D. Gray & Quentin D. Atkinson
3. Papale, P. Strain-induced magma fragmentation in explosive eruptions. Nature 397, 425–428 (1999).
4. Dingwell, D. B. & Webb, S. L. Structural relaxation in silicate melts and non-Newtonian melt rheology
Department of Psychology, University of Auckland, Private Bag 92019,
in geologic processes. Phys. Chem. Miner. 16, 508–516 (1989).
5. Webb, S. L. & Dingwell, D. B. The onset of non-Newtonian rheology of silcate melts. Phys. Chem.
Auckland 1020, New Zealand
.............................................................................................................................................................................
Miner. 17, 125–132 (1990).
6. Webb, S. L. & Dingwell, D. B. Non-Newtonian rheology of igneous melts at high stresses and strain Languages, like genes, provide vital clues about human history1,2.
rates: experimental results for rhyolite, andesite, basalt, and nephelinite. J. Geophys. Res. 95, The origin of the Indo-European language family is “the most
15695–15701 (1990).
intensively studied, yet still most recalcitrant, problem of his-
7. Newman, S., Epstein, S. & Stolper, E. Water, carbon dioxide and hydrogen isotopes in glasses from the
ca. 1340 A.D. eruption of the Mono Craters, California: Constraints on degassing phenomena and torical linguistics”3. Numerous genetic studies of Indo-European
initial volatile content. J. Volcanol. Geotherm. Res. 35, 75–96 (1988). origins have also produced inconclusive results4,5,6. Here we
8. Villemant, B. & Boudon, G. Transition from dome-forming to plinian eruptive styles controlled by analyse linguistic data using computational methods derived
H2O and Cl degassing. Nature 392, 65–69 (1998).
9. Polacci, M., Papale, P. & Rosi, M. Textural heterogeneities in pumices from the climactic eruption of
from evolutionary biology. We test two theories of Indo-
Mount Pinatubo, 15 June 1991, and implications for magma ascent dynamics. Bull. Volcanol. 63, European origin: the ‘Kurgan expansion’ and the ‘Anatolian
83–97 (2001). farming’ hypotheses. The Kurgan theory centres on possible
10. Tuffen, H., Dingwell, D. B. & Pinkerton, H. Repeated fracture and healing of silicic magma generates archaeological evidence for an expansion into Europe and the
flow banding and earthquakes? Geology 31, 1089–1092 (2003).
11. Stasiuk, M. V. et al. Degassing during magma ascent in the Mule Creek vent (USA). Bull. Volcanol. 58,
Near East by Kurgan horsemen beginning in the sixth millen-
117–130 (1996). nium BP 7,8. In contrast, the Anatolian theory claims that Indo-
12. Goto, A. A new model for volcanic earthquake at Unzen Volcano: Melt rupture model. Geophys. Res. European languages expanded with the spread of agriculture
Lett. 26, 2541–2544 (1999).
from Anatolia around 8,000–9,500 years BP 9. In striking agree-
13. Mastin, L. G. Insights into volcanic conduit flow from an open-source numerical model. Geochem.
Geophys. Geosyst. 3, doi:10.1029/2001GC000192 (2002). ment with the Anatolian hypothesis, our analysis of a matrix of
14. Proussevitch, A. A., Sahagian, D. L. & Anderson, A. T. Dynamics of diffusive bubble growth in 87 languages with 2,449 lexical items produced an estimated age
magmas: Isothermal case. J. Geophys. Res. 3, 22283–22307 (1993). range for the initial Indo-European divergence of between 7,800
15. Lensky, N. G., Lyakhovsky, V. & Navon, O. Radial variations of melt viscosity around growing bubbles
and gas overpressure in vesiculating magmas. Earth Planet. Sci. Lett. 186, 1–6 (2001).
and 9,800 years BP. These results were robust to changes in coding
16. Rust, A. C. & Manga, M. Effects of bubble deformation on the viscosity of dilute suspensions. procedures, calibration points, rooting of the trees and priors in
J. Non-Newtonian Fluid Mech. 104, 53–63 (2002). the bayesian analysis.

NATURE | VOL 426 | 27 NOVEMBER 2003 | www.nature.com/nature © 2003 Nature Publishing Group 435


Swadesh lists, better analysed

Use Swadesh lists for 87 Indo-European languages, and a phylogenetic
model from Genetics
Assume a tree-like model of evolution
Bayesian inference via MCMC (Markov Chain Monte Carlo)
Reconstruct trees and dates
Main parameter of interest: age of the root (Proto-Indo-European,
PIE)


Lexical trees


Correcting the issues with glottochronology

Returning to the issues with glottochronology:
1 Simplistic model → Slightly better, but the model of evolution is
rudimentary
2 No evaluation of uncertainty of estimates → Bayesian inference
3 Only small amounts of data are used → Large number of languages
reduces variability of estimates


Why be Bayesian?

In the settings described in this talk, it usually makes sense to use
Bayesian inference, because:
The models are complex
Estimating uncertainty is paramount
The output of one model is used as the input of another
We are interested in complex functions of our parameters


Frequentist statistics

Statistical inference deals with estimating an unknown parameter θ
given some data D.
In the frequentist view of statistics, θ has a true ﬁxed (deterministic)
value.
Uncertainty is measured by conﬁdence intervals, which are not
intuitive to interpret: if I get a 95% CI of [80 ; 120] (i.e. 100 ± 20)
for θ, I cannot say that there is a 95% probability that θ belongs to
the interval [80 ; 120].


Frequentist statistics

Statistical inference deals with estimating an unknown parameter θ
given some data D.
In the frequentist view of statistics, θ has a true ﬁxed (deterministic)
value.
Uncertainty is measured by conﬁdence intervals, which are not
intuitive to interpret: if I get a 95% CI of [80 ; 120] (i.e. 100 ± 20)
for θ, I cannot say that there is a 95% probability that θ belongs to
the interval [80 ; 120].
Frequentist statistics often use the maximum likelihood estimator: for
which value of θ would the data be most likely (under our model)?

L(θ|D) = P[D|θ]
ˆ
θ = arg max L(θ|D)
θ


Bayesian statistics

In the Bayesian framework, the parameter θ is seen as inherently
random: it has a distribution.
Before I see any data, I have a prior distribution on π(θ), usually
uninformative.
Once I take the data into account, I get a posterior distribution,
which is hopefully more informative.

π(θ|D) ∝ π(θ)L(θ|D)

Different people have different priors, hence different posteriors. But
with enough data, the choice of prior matters little.
We are now allowed to make probability statements about θ, such as
”there is a 95% probability that θ belongs to the interval [78 ; 119]”
(credible interval)


Advantages and drawbacks of Bayesian statistics

More intuitive interpretation of the results
Easier to think about uncertainty
In a hierarchical setting, it becomes easier to take into account all the
sources of variability
Prior speciﬁcation: need to check that changing your prior does not
change your result
Computationally intensive


Eﬀect of prior

Example: Bernoulli model (biased coin). θ=probability of success.
Observe y = 72 successes out of n = 100 trials.
ˆ
Frequentist estimate: θ = 0.72
Conﬁdence interval: [0.63 0.81].
Bayesian estimate: will depend on the prior.


Eﬀect of prior
prior(x)
8
4
0

0.0 0.2 0.4 0.6 0.8 1.0
x
prior(x)
8
4
0

0.0 0.2 0.4 0.6 0.8 1.0
x
prior(x)
8
4
0

0.0 0.2 0.4 0.6 0.8 1.0
x

y = 72, n = 100
Black:prior; green: likelihood; red: posterior.

Eﬀect of prior
prior(x)
8
4
0

0.0 0.2 0.4 0.6 0.8 1.0
x
prior(x)
8
4
0

0.0 0.2 0.4 0.6 0.8 1.0
x
prior(x)
8
4
0

0.0 0.2 0.4 0.6 0.8 1.0
x

y = 7, n = 10

Eﬀect of prior
prior(x)
25
0 10

0.0 0.2 0.4 0.6 0.8 1.0
x
25
prior(x)
0 10

0.0 0.2 0.4 0.6 0.8 1.0
x
25
prior(x)
0 10

0.0 0.2 0.4 0.6 0.8 1.0
x

y = 721, n = 1000

Bayesian inference for lexical trees

The tree parameter is seen as random: it has a distribution
Via MCMC, G & A get a sample of possible trees, with associated
probabilities, rather than a single tree
The uncertainty in trees is thus made explicit


G & A: Assumptions

Tree-like evolution
Genetics model
Independence of cognates
Constant rate of change


G & A: conclusions

Age of PIE: 7800-9800 BP (Before Present)
Large error bars, but this is a good thing
Reconstruct many known features of the tree of Indo-European
languages
Little validation of the model, no model misspeciﬁcation analysis
Much more on these data and more elaborate models and analyses
tomorrow
These trees can also be used as a building block to answer other
questions


Outline






oe



Lieberman et al. (2007)
Vol 449 | 11 October 2007 | doi:10.1038/nature06137

LETTERS
Quantifying the evolutionary dynamics of language
Erez Lieberman1,2,3*, Jean-Baptiste Michel1,4*, Joe Jackson1, Tina Tang1 & Martin A. Nowak1

Human language is based on grammatical rules1–4. Cultural evolu- applied to three-quarters of all Old English verbs21. The excep-
tion allows these rules to change over time5. Rules compete with tions—ancestors of the modern irregular verbs—were mostly
each other: as new rules rise to prominence, old ones die away. To members of the so-called ‘strong’ verbs. There are seven different
quantify the dynamics of language evolution, we studied the regu- classes of strong verbs with exemplars among the Modern English
larization of English verbs over the past 1,200 years. Although an irregular verbs, each with distinguishing markers that often include
elaborate system of productive conjugations existed in English’s characteristic vowel shifts. Although stable coexistence of multiple
proto-Germanic ancestor, Modern English uses the dental suffix, rules is one possible outcome of rule dynamics, this is not what
‘-ed’, to signify past tense6. Here we describe the emergence of this occurred in English verb inflection22. We therefore define regularity
linguistic rule amidst the evolutionary decay of its exceptions, with respect to the modern ‘-ed’ rule, and call all these exceptional
known to us as irregular verbs. We have generated a data set of forms ‘irregular’.
verbs whose conjugations have been evolving for more than a We consulted a large collection of grammar textbooks describing
millennium, tracking inflectional changes to 177 Old-English verb inflection in these earlier epochs, and hand-annotated every
irregular verbs. Of these irregular verbs, 145 remained irregular irregular verb they described (see Supplementary Information).
in Middle English and 98 are still irregular today. We study how This provided us with a list of irregular verbs from ancestral
the rate of regularization depends on the frequency of word usage. forms of English. By eliminating verbs that were no longer part of
The half-life of an irregular verb scales as the square root of its Modern English, we compiled a list of 177 Old English irregular
usage frequency: a verb that is 100 times less frequent regularizes verbs that remain part of the language to this day. Of these 177 Old
10 times as fast. Our study provides a quantitative analysis of the English irregulars, 145 remained irregular in Middle English and 98
regularization process by which ancestral forms gradually yield to are still irregular in Modern English. Verbs such as ‘help’, ‘grip’ and
an emerging linguistic rule. ‘laugh’, which were once irregular, have become regular with the
Natural languages comprise elaborate systems of rules that enable
Robin J. Ryder (Paris Dauphine) passing of time.
7 Statistics and Historical Linguistics UCLA March 2013 33 / 98

Question at hand

Wish to verify that the rate of change tends to be slower for words which
are used more frequently


Data

Regularization of irregular verbs
Old English, Middle English and Modern English (1200 years of
evolution)
177 irregular verbs, of which 79 regularize
Frequency of use


Results


Conclusions

Regularization rate changes as the square root of usage frequency
A verb used 4 times more often regularizes at half the rate
Evolutionary history known
Sub-optimal: binning


Pagel et al.
Vol 449 | 11 October 2007 | doi:10.1038/nature06176

LETTERS
Frequency of word-use predicts rates of lexical
evolution throughout Indo-European history
Mark Pagel1,2, Quentin D. Atkinson1 & Andrew Meade1

´
Greek speakers say ‘‘oura’’, Germans ‘‘schwanz’’ and the French paired meanings in the Bantu languages. This indicates that variation
‘‘queue’’ to describe what English speakers call a ‘tail’, but all of in the rates of lexical replacement among meanings is not merely an
these languages use a related form of ‘two’ to describe the number historical accident, but rather is linked to some general process of
after one. Among more than 100 Indo-European languages and language evolution.
dialects, the words for some meanings (such as ‘tail’) evolve Social and demographic factors proposed to affect rates of
rapidly, being expressed across languages by dozens of unrelated language change within populations of speakers include social status11,
words, while others evolve much more slowly—such as the num- the strength of social ties12, the size of the population13 and levels of
ber ‘two’, for which all Indo-European language speakers use the outside contact14. These forces may influence rates of evolution on a
same related word-form1. No general linguistic mechanism has local and temporally specific scale, but they do not make general
been advanced to explain this striking variation in rates of lexical predictions across language families about differences in the rate of
replacement among meanings. Here we use four large and diver- lexical replacement among meanings. Drawing on concepts from
gent language corpora (English2, Spanish3, Russian4 and Greek5) theories of molecular15 and cultural evolution16–18, we suggest that
and a comparative database of 200 fundamental vocabulary mean- the frequency with which different meanings are used in everyday
ings in 87 Indo-European languages6 to show that the frequency language may affect the rate at which new words arise and become
with which these words are used in modern language predicts their adopted in populations of speakers. If frequency of meaning-use is
rate of replacement over thousands of years of Indo-European a shared and stable feature of human languages, then this could
language evolution. Across all 200 meanings, frequently used provide a general mechanism to explain the large differences across
words evolve at slower rates and infrequently used words evolve meanings in observed rates of lexical replacement. Here we test this
Robin J. Ryder (Paris Dauphine) holds separately and identically idea by examining the relationship between the rates at which Indo-
more rapidly. This relationship Statistics and Historical Linguistics UCLA March 2013 38 / 98

Question at hand

Check link between frequency of use and rate of change for vocabulary.
Hypothesis: when a meaning is used more often, the corresponding word
has less chances of changing.
Problem: since this rate is expected to be much slower than that of strong
verb regularization, we need to look at much deeper history. But then the
evolutionary history is unknown.


Workaround

Use Indo-European core vocabulary data, and frequencies from
English, Greek, Russian and Spanish
Get a sample from the distribution on trees and ancestral ages using
G&A’s method
For each tree in the sample, estimate the rate of change for each
meaning.
Average across all trees.


Results


Comments

Even if there is high uncertainty in the phylogenies, we can still
answer other questions (integrating out the tree)
Similar results for Bantu (Pagel and Meade 2006)


Sampling from a distribution

Given a parameter θ with probability density π(θ), we often wish to
estimate the expected value E [θ], or the probability mass of an interval
P[a < θ < b].
If we can sample n independent and identically distributed (iid) values
θ1 , . . . , θn from π(·), then (under general assumptions) we have a good
estimated of E [θ]:
n
1
θi −→ E [θ]
n n→∞
i=1

(and the same holds for Ih = h(θ) dθ for any measurable function h).
Convergence is fast.
But in general, getting an iid sample is diﬃcult or impossible.


MCMC

If we are able to compute the value of the density π(θ) at any point θ, we
can use MCMC (Markov Chain Monte Carlo) to generate correlated
θ1 , . . . , θn from π(·). The convergence still holds:
n
1
θi −→ E [θ]
n n→∞
i=1

but is much slower. Since we use a ﬁnite value of n, we need to check the
convergence of the algorithm. In general, this is done by:
1 Checking the trace of the algorithm
2 Checking the autocorrelations
3 Running the algorithm for a very large value of n
Curse of dimensionality: when the dimensionality of the parameter space
increases, convergence slows dramatically.
More during the TraitLab tutorial.

ABC

If we are not able to compute π(θ), approximate inference may still be
possible using a likelihood-free method such as Approximate Bayesian
Computation (ABC).
The idea is to draw many values of the parameters; for each value,
simulate a new dataset; keep only parameter values which lead to
simulated data ”close” to the observed data.
The theoretical properties are still mostly unknown. This method induces
an approximation, and it is diﬃcult to control the size of the
approximation.
But sometimes, this is the only possibility.


Outline






oe



Perreault and Mathew

Dating the Origin of Language Using Phonemic Diversity
Charles Perreault1., Sarah Mathew2*.
1 Santa Fe Institute, Santa Fe, New Mexico, United States of America, 2 Centre for the Study of Cultural Evolution, Stockholm University, Stockholm, Sweden

Abstract
Language is a key adaptation of our species, yet we do not know when it evolved. Here, we use data on language phonemic
diversity to estimate a minimum date for the origin of language. We take advantage of the fact that phonemic diversity
evolves slowly and use it as a clock to calculate how long the oldest African languages would have to have been around in
order to accumulate the number of phonemes they possess today. We use a natural experiment, the colonization of
Southeast Asia and Andaman Islands, to estimate the rate at which phonemic diversity increases through time. Using this
rate, we estimate that present-day languages date back to the Middle Stone Age in Africa. Our analysis is consistent with the
archaeological evidence suggesting that complex human behavior evolved during the Middle Stone Age in Africa, and does
not support the view that language is a recent adaptation that has sparked the dispersal of humans out of Africa. While
some of our assumptions require testing and our results rely at present on a single case-study, our analysis constitutes the
first estimate of when language evolved that is directly based on linguistic data.

Citation: Perreault C, Mathew S (2012) Dating the Origin of Language Using Phonemic Diversity. PLoS ONE 7(4): e35289. doi:10.1371/journal.pone.0035289
Editor: Michael D. Petraglia, University of Oxford, United Kingdom
Received September 8, 2011; Accepted March 14, 2012; Published April 27, 2012
Copyright: ß 2012 Perreault, Mathew. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Funding was provided by the Swedish Research Council [grant number 2009-2390]. The funders had no role in study design, data collection and
analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: sarah.mathew@arklab.su.se
. These authors contributed equally to this work.

Introduction of the globe to be colonized. The loss of phonemes through serial
founder effect is consistent with other lines of evidence that
A capacity for language is a hallmark of our species [1,2], yet we indicate that phonemic diversity is determinedMarch 2013
Robin J. Ryderlittle about Dauphine)its appearance. Language appears in
know (Paris the timing of Statistics and Historical Linguistics UCLA by cultural 47 / 98

Aim

Date Proto-Human.


Premises

Two small populations B and C disperse from the same parent A. B
colonizes a large territory, C a small isolated island.
B will tend to accumulate new phonemes at a faster rate than C
(PB > PC ).
If the island is small enough, C will have about the same phonemic
diversity as the ancestral language A (E (PC ) = PA ).
The diﬀerence in modern-day phonemic diversity is due entirely to B,
so if we know the splitting time t, we can estimate the rate of
phoneme accumulation in a large territory.
Two models of phoneme accumulation: linear and exponential.

PB − PC ln(PB ) − ln(PC )
r= k=
t t


Idea

Estimate r or k
Estimate modern phonemic diversity in Africa (PAfrica ) and guess
initial phonemic diversity Pinitial (take Pinitial = 11 or 29, minimum
and median modern phonemic diversity).
Then we can deduce the age of the initial language.


Parameter estimation

Single natural experiment: Andaman islands (colonized 45-65 kya).
Estimate PB and PC by averaging over 20 languages.


Results

(Linear/exponential model; 2 estimates for colonization of Andaman
islands; 2 guesses of Pinitial ).


Comments

Highly dependent on strong assumptions
Needs model validation
The parameter estimates are very noisy, since they are based on
related languages.
Intriguing


1. N. C. Seeman, J. Theor. Biol. 99, 237 (1982). 16. S. M. Douglas et al., Nature 459, 414 (2009). Center for Bio-Inspired Solar Fuel Production, an
2. N. C. Seeman, Nature 421, 427 (2003). 17. Y. Ke et al., J. Am. Chem. Soc. 131, 15903 (2009). Energy Frontier Research Center funded by the U.S.

Atkinson 3.
4.
5.
T. J. Fu, N. C. Seeman, Biochemistry 32, 3211 (1993).
J. H. Chen, N. C. Seeman, Nature 350, 631 (1991).
E. Winfree, F. Liu, L. A. Wenzler, N. C. Seeman,
18. H. Dietz, S. M. Douglas, W. M. Shih, Science 325,
725 (2009).
19. Materials and methods are available as supporting
Department of Energy, Office of Science, Office of
Basic Energy Sciences under award DE-SC0001016.

Nature 394, 539 (1998). material on Science Online.
6. H. Yan, S. H. Park, G. Finkelstein, J. H. Reif, T. H. LaBean, 20. P. W. K. Rothemund et al., J. Am. Chem. Soc. 126,
Supporting Online Material
16344 (2004). www.sciencemag.org/cgi/content/full/332/6027/342/DC1
Science 301, 1882 (2003).
Materials and Methods
7. W. M. Shih, J. D. Quispe, G. F. Joyce, Nature 427, 618 21. P. J. Hagerman, Annu. Rev. Biophys. Chem. 17, 265
SOM Text
(2004). (1988).
Figs. S1 to S39
8. F. Mathieu et al., Nano Lett. 5, 661 (2005). Acknowledgments: We thank the AFM applications group
Tables S1 to S19
9. R. P. Goodman et al., Science 310, 1661 (2005). at Bruker Nanosurfaces for assistance in acquiring some
Reference 22
10. F. A. Aldaye, H. F. Sleiman, J. Am. Chem. Soc. 129, of the high-resolution AFM images, using Peak Force
13376 (2007). Tapping with ScanAsyst on the MultiMode 8. This research 18 January 2011; accepted 4 March 2011
11. Y. He et al., Nature 452, 198 (2008). was partly supported by grants from the Office of Naval 10.1126/science.1202998

Downloaded from www.sciencemag.org on April 19, 2011
0.131, df = 503, P = 0.003). To account for any
Phonemic Diversity Supports a Serial non-independence within language families, the
analysis was repeated, first using mean values at
Founder Effect Model of Language the language family level (table S2) and then
using a hierarchical linear regression framework
Expansion from Africa to model nested dependencies in variation at the
family, subfamily, and genus levels (15). These
analyses confirm that, consistent with a founder
Quentin D. Atkinson1,2* effect model, smaller population size predicts
reduced phoneme inventory size both between
Human genetic and phenotypic diversity declines with distance from Africa, as predicted by a families (family-level analysis r = 0.468, df = 49,
serial founder effect in which successive population bottlenecks during range expansion P < 0.001; fig. S1B) and within families, con-
progressively reduce diversity, underpinning support for an African origin of modern humans. trolling for taxonomic affiliation {hierarchical lin-
Recent work suggests that a similar founder effect may operate on human culture and ear model: fixed-effect coefficient (b) = 0.0338
language. Here I show that the number of phonemes used in a global sample of 504 languages to 0.0985 [95% highest posterior density (HPD)],
is also clinal and fits a serial founder–effect model of expansion from an inferred origin in P = 0.009}.
Africa. This result, which is not explained by more recent demographic history, local language Figure 1B shows clear regional differences in
diversity, or statistical non-independence within language families, points to parallel phonemic diversity, with the largest phoneme
mechanisms shaping genetic and linguistic diversity and supports an African origin of inventories in Africa and the smallest in South
modern human languages. America and Oceania. A series of linear regres-
sions was used to predict phoneme inventory size
he number of phonemes—perceptually ing the evolution of phonemes (11, 16) and lan- from the log of speaker population size and dis-

T distinct units of sound that differentiate
words—in a language is positively corre-
lated with the size of its speaker population (1) in
guage generally (17–20). This raises the possibility
that the serial founder–effect model used to trace
our genetic origins to a recent expansion from Africa
tance from 2560 potential origin locations around
the world (15). Incorporating modern speaker
population size into the model controls for geo-
such a way that small populations have fewer (4–9) could also be applied to global phonemic graphic patterning in population size and means
phonemes. Languages continually gain and lose diversity to investigate the origin and expansion that the analysis is conservative about the amount
phonemes because of stochastic processes (2, 3). of modern human languages. Here I examine geo- of variation attributed to ancient demography.
If phoneme distinctions are more likely to be lost graphic variation in phoneme inventory size using Model fit was evaluated with the Bayesian in-
Robin in small founder populations, then a succession
J. Ryder (Paris Dauphine) Statistics and Historical Linguistics
data on vowel, consonant, and tone inventories formation criterionUCLA March 2013
(BIC) (22). Following previ- 54 / 98

Question at hand

Can we use phonemic data to reconstruct the expansion of language?


Premises
Premise 1: Size of population is positively correlated with number of
phonemes.


Premises
phonemes.

The eﬀect is small, but signiﬁcant. Still holds when you restrict to
population < 5000.

Premises
phonemes.

Premise 2: Language expansion occurred with a series of small founder
groups.


Premises
phonemes.

Premise 2: Language expansion occurred with a series of small founder
groups.

Conclusion: During language expansion, the number of phonemes
decreases at the frontier of expansion.


Idea

Test many diﬀerent possible points of origin. From the true point of
origin, you should observe a cline in phoneme diversity.


Data

Data from WALS, 504 languages (all data with languages). Three traits,
grouped into classes by WALS:
Tone structure (No tone / Simple tone / Complex tone)
Number of vowels (2–4 / 5–6 / 7–14)
Number of consonants (6–14 / 15–18 / 19–25 / 26–33 / 34+)


Map of languages


Tone map


Vowel map


Consonant map


Measure of phonemic diversity

A measure of phonemic diversity is created.
Scale and center the three components: tone, vowels, consonants
(mean=0, variance=1)
Diversity is measured as the unweighted sum of the three components
This is probably the most contentious aspect of the paper. The issues this
raises are dealt with in the Supplementary Material.


Main result
For each possible origin, perform a linear regression of phoneme diversity
against distance from origin.
Best regression:


Map of best origin


Validation

Several other factors could have an inﬂuence. The author checks:
Dependence within language families
Modern population size
Geographic area
Population density
Local language diversity (number of languages within N km,
N = 500, 1000, 2000, 3000)
A second origin


Validation

Several other factors could have an inﬂuence. The author checks:
Dependence within language families
Modern population size
Geographic area
Population density
Local language diversity (number of languages within N km,
N = 500, 1000, 2000, 3000)
A second origin

Other factors certainly play a role. However, as long as they are
uncorrelated with distance to Africa, they will not bias the results.


Tones vs Vowels vs Consonants

The measure of diversity is ﬁshy.
The author also checks for Tones, Vowels and Consonants separately.

The conclusions are unchanged.


Comments

Statistically sound.
The data are what they are...
Assumptions: see multiple comments in Linguistic Typology (2011)


Bayesian model choice

The evidence for or against a model given data is summarized in the Bayes
factor:
P[M = 2|y ]/P[M = 1|y ]
B21 (y ) = ]
P[M = 2]/P[M = 1]
m2 (y )
=
m1 (y )
where
mk (y ) = L(θk |y )πk (θk )dθk
Θk

This includes an automatic penalty for complexity: when a model is more
complex, the integral is over a larger space, where the likelihood is most of
the time very small.


Interpreting the Bayes factor

Jeﬀrey’s scale of evidence states that
If log10 (B21 ) is between 0 and 0.5, then the evidence in favor of
model 2 is weak
between 0.5 and 1, it is substantial
between 1 and 2, it is strong
above 2, it is decisive
(and symmetrically for negative values)


Outline






oe



Dunn et al.

LETTER doi:10.1038/nature09923

Evolved structure of language shows lineage-specific
trends in word-order universals
Michael Dunn1,2, Simon J. Greenhill3,4, Stephen C. Levinson1,2 & Russell D. Gray3

Languages vary widely but not without limit. The central goal of after the noun, whereas dominant object–verb ordering predicts post-
linguistics is to describe the diversity of human languages and positions, relative clauses and genitives before the noun4. One general
explain the constraints on that diversity. Generative linguists fol- explanation for these observations is that languages tend to be consist-
lowing Chomsky have claimed that linguistic diversity must be ent (‘harmonic’) in their order of the most important element or ‘head’
constrained by innate parameters that are set as a child learns a of a phrase relative to its ‘complement’ or ‘modifier’3, and so if the verb
language1,2. In contrast, other linguists following Greenberg have is first before its object, the adposition (here preposition) precedes the
claimed that there are statistical tendencies for co-occurrence of noun, while if the verb is last after its object, the adposition follows the
traits reflecting universal systems biases3–5, rather than absolute noun (a ‘postposition’). Other functionally motivated explanations
constraints or parametric variation. Here we use computational emphasize consistent direction of branching within the syntactic struc-
phylogenetic methods to address the nature of constraints on ture of a sentence9 or information structure and processing efficiency5.
linguistic diversity in an evolutionary framework6. First, contrary To demonstrate that these correlations reflect underlying cognitive
to the generative account of parameter setting, we show that the or systems biases, the languages must be sampled in a way that controls
evolution of only a few word-order features of languages are for features linked only by direct inheritance from a common
strongly correlated. Second, contrary to the Greenbergian general- ancestor10. However, efforts to obtain a statistically independent sample
izations, we show that most observed functional dependencies of languages confront several practical problems. First, our knowledge
between traits are lineage-specific rather than universal tendencies. of language relationships is incomplete: specialists disagree about high-
These findings support the view that—at least with respect to word level groupings of languages and many languages are only tentatively
order—cultural evolution is the primary factor that determines assigned to language families. Second, a few large language families
linguistic structure, with the current state of a linguistic system contain the bulk of global linguistic variation, making sampling purely
shaping and constraining future states. from unrelated languages impractical. Some balance of related, unre-
Human language is unique amongst animal communication sys- lated and areally distributed languages has usually been aimed for in
Robin J. Ryderonly for itsDauphine)
tems not (Paris structural complexity but also for its diversity at practice11,12.
Statistics and Historical Linguistics UCLA March 2013 72 / 98

Question at hand

Do the following traits evolve independently?
1 Adjective-Noun order
2 Adposition-Noun phrase order
3 Demonstrative-Noun order
4 Genitive-Noun order
5 Numeral-Noun order
6 Object-Verb order
7 Relative clause-Noun order
8 Subject-Verb order


Idea

Diﬃcult to get an independent sample of languages
Instead, look at languages known to be related and check for
co-evolution


Data

4 language families:
Indo-European
Bantu
Uto-Aztecan
Austronesian
Word-order data mainly from WALS.


Lexical tree


Methods

Start with a sample of trees constructed with lexical data.
Put two traits at the leaves
Perform Bayesian model choice between independent model and
co-evolution model


Two models


Results


Conclusions

The networks look very different.
Co-evolution seems to be very different between different language
families.


What went wrong?

1 Nothing about borrowing or geography
2 No attempt at cross-validation
3 Over-interpreting the data
4 Testing too many hypotheses
5 They did not test the hypothesis that diﬀerent families have the same
dependencies!


Outline






oe



Automatic cognate detection?

All models of lexical data take as granted cognacy judgements (and
family membership)
It would be nice to do without these: we could increase the number
of languages under study, as well as the number of meanings


Levenshtein distance

The Levenshtein distance (or edit distance) between two words is the
number of changes (substitutions, insertions, deletions) necessary to
change one word into the other.
Swedish tre (/tre:/) and Latin tres (/tre:s/): distance 1 (1 insertion)
Swedish tre and Dutch drie (/dri/): distance 2 (2 substitutions)
(Can be normalized by word length)
By calculating the distance between all words in a meaning list, get a
distance between two languages. Deduce a tree (several possible
algorithms: Neighbour joining, UPGMA...)
Unlike other methods, this does not require cognacy judgements, and
it takes into account the phonetics (more information).
Serva and Petroni 2008, van der Ark et al. 2007: ”the resulting tree looks
close to the truth” (except it doesn’t)


Levenshtein distance: so many problems!

Does not require that compared words be cognates
Does not require that compared languages be cousins
No model of phonetic change (hence no model validation or
mis-speciﬁcation)
Some changes are more likely that others
Changes can be systematic (e.g. vowel shift), not independent
See Greenhill (2010) for a detailed rebuttal.


Bouchard-Cˆt´ et al. (2012)
oe

Automated reconstruction of ancient languages using
probabilistic models of sound change
Alexandre Bouchard-Côtéa,1, David Hallb, Thomas L. Grifﬁthsc, and Dan Kleinb
a
Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; bComputer Science Division and cDepartment of Psychology,
University of California, Berkeley, CA 94720

Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for review
March 19, 2012)

One of the oldest problems in linguistics is reconstructing the building on work on the reconstruction of ancestral sequences
words that appeared in the protolanguages from which modern and alignment in computational biology (8–12). Several groups
languages evolved. Identifying the forms of these ancient lan- have recently explored how methods from computational biology
guages makes it possible to evaluate proposals about the nature can be applied to problems in historical linguistics, but such work
of language change and to draw inferences about human history. has focused on identifying the relationships between languages
Protolanguages are typically reconstructed using a painstaking (as might be expressed in a phylogeny) rather than reconstructing
manual process known as the comparative method. We present a the languages themselves (13–18). Much of this type of work has
family of probabilistic models of sound change as well as algo- been based on binary cognate or structural matrices (19, 20), which
rithms for performing inference in these models. The resulting discard all information about the form that words take, simply in-
system automatically and accurately reconstructs protolanguages dicating whether they are cognate. Such models did not have the
from modern languages. We apply this system to 637 Austronesian goal of reconstructing protolanguages and consequently use a rep-

COMPUTER SCIENCES
languages, providing an accurate, large-scale automatic reconstruc- resentation that lacks the resolution required to infer ancestral
tion of a set of protolanguages. Over 85% of the system’s recon- phonetic sequences. Using phonological representations allows us
structions are within one character of the manual reconstruction to perform reconstruction and does not require us to assume that
provided by a linguist specializing in Austronesian languages. Be- cognate sets have been fully resolved as a preprocessing step. Rep-
ing able to automatically reconstruct large numbers of languages resenting the words at each point in a phylogeny and having a model
provides a useful way to quantitatively explore hypotheses about the of how they change give a way of comparing different hypothesized
factors determining which sounds in a language are likely to change cognate sets and hence inferring cognate sets automatically.
over time. We demonstrate this by showing that the reconstructed The focus on problems other than reconstruction in previous

SYCHOLOGICAL AND
COGNITIVE SCIENCES
Austronesian protolanguages provide compelling support for a hy- computational approaches has meant that almost all existing
pothesis about the relationship between the function of a sound protolanguage reconstructions have been done manually. How-
and its probability of changing that was ﬁrst proposed in 1955. ever, to obtain more accurate reconstructions for older languages,
large numbers of modern languages need to be analyzed. The
Proto-Austronesian language, for instance, has over 1,200 de-
Robin ancestral | computational | diachronic
J. Ryder (Paris Dauphine) Statistics and Historical Linguistics (21). All of these languages could potentially
scendant languages UCLA March 2013 87 / 98

Aim

Model of sound change along a tree (including systematic changes)
Very large number of parameters
Attempt to perform automatically the comparative method
Possible uses: reconstruction of ancient forms, cognate detection,
phylogenies...


Model

Assume that the phylogeny is known.
/tre:s/
Latin /ste:lla/

/tre/ /tRes/
Italian Spanish /stella/ /estReLa/


Model

Assume that the phylogeny is known.
/tre:s/
Latin /ste:lla/

/tre/ /tRes/
Italian Spanish /stella/ /estReLa/

Latin to Italian: /tre:s/ → /tre/
P[t → t] = 0.98
P[r → r] = 0.99
P[e: → e] = 0.95
P[s → −] = 0.20
(These numbers are made up.)


Transition probabilities

Branch Latin to Italian, transition probabilities for /t/
t 0.98
d 0.01
k 0.01
anything else 0
We would like to estimate the transition probabilities from any phoneme to
any other phoneme, on each branch. Huge number of parameters (number
of phonemes×number of phonemes×number of branches).


Context dependent

In fact, the probability that /s/ is deleted depends on the position within
the word.
The transition probabilities become context dependent:
P[s → −|preceded by e and is last letter] = 0.70
P[r → r|between t and e] = 0.99
Even more parameters!


Reducing the number of parameters

Generic contexts of the form ”CrV”
Dirichlet prior: force most parameters to be zero (sparsity)


Implementation

Possible to use large datasets: dumps from Wiktionary, Bible...
Expectation-Maximization algorithm
Only point estimates of the parameters, not uncertainty


Results: reconstruction


Results: Most probable transitions


Bouchard et al.: Comments

Impressive first step
Much more refinement is possible, especially in defining the context
Allows quantitative tests of open questions (e.g. functional load,
universality of changes)
Joint estimation of sound changes and phylogenies


Outline






oe



Overall conclusions

When done right, statistical methods can provide new insight into
linguistic history
The conclusions will only ever be as good as the data
Importance of collaboration in building the model and in checking for
mis-specification
Major avenues for future research. Challenges in finding relevant
data, building models, and statistical inference:
Models for morphosyntactical traits
Putting together lexical, phonemic and morphosyntactic traits
Escape the tree-like model (borrowing, networks, diffusion processes...)
Incorporate geography


Statistical Methods in Historical Linguistics

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Ähnlich wie Statistical Methods in Historical Linguistics

Ähnlich wie Statistical Methods in Historical Linguistics (20)

Mehr von Robin Ryder

Mehr von Robin Ryder (9)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Statistical Methods in Historical Linguistics