Effects of variation on the computation of numerical likelihood ratios for forensic voice comparison

Vincent Hughes
Paul Foulkes
Department of Language and Linguistic Science

1. introduction
• Likelihood Ratio (LR) = “logically and legally correct
framework” for assessing forensic comparison
evidence (Rose & Morrison 2009: 143)

$$\mathrm{LR} = \frac{p(E \mid H_p)}{p(E \mid H_d)}$$

Hughes & Foulkes
IAFPA 2012

• assessment of similarity of observed features in the
criminal and known samples, and their typicality
• typicality = dependent on patterns in the relevant
population (Aitken & Taroni 2004)
– defined by the defence hypothesis
– quantified relative to a sampled sub-section of
that population (reference data)
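The interplay of similarity and typicality can be illustrated with a deliberately simplified, univariate sketch. It assumes normal distributions, treats the known sample's mean as the speaker's true mean, and uses invented F2 values in Hz; the function names and all numbers are hypothetical, and this is not the multivariate method used in the study.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def simple_lr(x, known_mean, within_sd, pop_mean, between_sd):
    """Toy LR for one questioned-sample value x (assumed model, see lead-in)."""
    # Similarity: how probable is x if it came from the known speaker? p(E|Hp)
    similarity = normal_pdf(x, known_mean, within_sd)
    # Typicality: how probable is x for a random speaker in the reference
    # population? p(E|Hd), with between- and within-speaker variance combined.
    typicality = normal_pdf(x, pop_mean, math.sqrt(between_sd**2 + within_sd**2))
    return similarity / typicality


# Both pairs differ by 10 Hz (equal similarity), but the second value is
# far from the hypothetical population mean (less typical).
typical = simple_lr(1500.0, 1510.0, 60.0, 1500.0, 200.0)
atypical = simple_lr(2000.0, 2010.0, 60.0, 1500.0, 200.0)
print(typical, atypical)
```

With equal similarity, the less typical value gives the larger LR, which is why the choice of reference data matters.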
[Figure: Hp and Hd, from Berger (2012)]

• Rose (2004: 4) default Hd:
“same-sex speaker(s) of the language”
• ‘logical relevance’ (Kaye 2004, 2008)
REFERENCE DATA (speech style, N speakers, age, language) by study:
Study | Feature | Speech style | N speakers | Age | Language
Rose et al (2003) | /ɕ/ /o/ /N/ | Read | 60 | 20-50 | Japanese
Rose et al (2006) | /aI/ | Read | 166 | 19-64 | Australian English
Morrison (2008) | /aI/ | Read | 27 | 19-64 | Australian English
Kinoshita et al (2009) | f0 | Controlled spontaneous | 201 | No control | Japanese

• collecting reference data
– bespoke case-by-case data
– ‘off-the-shelf’ data
• inevitable mismatch between the off-the-shelf
data and the facts of the case at trial
• LRs necessarily vary with different reference
data

2. research questions
to what extent are LRs affected by…
i. varying N speakers in the reference data?
ii. varying N tokens per speaker in the
reference data?
iii. dialect mismatch between target voice and
reference data?

Raw LR | Log10 LR | Verbal expression | Supports
>10000 | 5 | Very strong evidence | Hp
1000-10000 | 4 | Strong evidence | Hp
100-1000 | 3 | Moderately strong evidence | Hp
10-100 | 2 | Moderate evidence | Hp
1-10 | 1 | Limited evidence | Hp
1-0.1 | -1 | Limited evidence | Hd
0.1-0.01 | -2 | Moderate evidence | Hd
0.01-0.001 | -3 | Moderately strong evidence | Hd
0.001-0.0001 | -4 | Strong evidence | Hd
<0.0001 | -5 | Very strong evidence | Hd
Source: Champod and Evett (2000)
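As a rough sketch, the Champod and Evett (2000) scale can be written as a lookup from a raw LR to a verbal expression and the hypothesis it supports. The function name, the handling of values that fall exactly on a band boundary, and the label returned for LR = 1 are assumptions made for illustration; the scale itself does not specify them.

```python
import math

# Upper bound of |log10 LR| for each band of the verbal scale.
BANDS = [
    (1.0, "Limited evidence"),
    (2.0, "Moderate evidence"),
    (3.0, "Moderately strong evidence"),
    (4.0, "Strong evidence"),
]


def verbal_expression(lr: float) -> tuple[str, str]:
    """Return (verbal expression, supported hypothesis) for a raw LR."""
    if lr <= 0:
        raise ValueError("LR must be positive")
    log10_lr = math.log10(lr)
    if log10_lr == 0:
        # Assumption: LR = 1 is treated as neutral.
        return ("No support either way", "neither")
    supported = "Hp" if log10_lr > 0 else "Hd"
    magnitude = abs(log10_lr)
    for upper, label in BANDS:
        # Assumption: a value exactly on a boundary goes to the weaker band.
        if magnitude <= upper:
            return (label, supported)
    return ("Very strong evidence", supported)


print(verbal_expression(500))    # LR between 100 and 1000
print(verbal_expression(0.005))  # LR between 0.01 and 0.001
```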

3. method
• 1 set of reference data
• 4 sets of test data
• GOOSE /u:/
• dynamic time-normalised measurements of F1
and F2 (McDougall 2004, 2006)
[Figure: formant tracks labelled F2 and F1]
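Time-normalised measurement can be sketched as sampling each formant track at fixed proportions of the vowel's duration, so that tokens of different lengths yield the same number of values. The function name, the number of points, and the use of linear interpolation are assumptions for illustration; the slides do not state the exact sampling scheme.

```python
import numpy as np


def time_normalise(times, formant_hz, n_points=9):
    """Sample a formant track at n_points equally spaced proportions
    of the token's duration (0% to 100%), by linear interpolation."""
    times = np.asarray(times, dtype=float)
    formant_hz = np.asarray(formant_hz, dtype=float)
    duration = times[-1] - times[0]
    targets = times[0] + duration * np.linspace(0.0, 1.0, n_points)
    return np.interp(targets, times, formant_hz)


# Hypothetical, unevenly sampled F1 track rising from 300 Hz to 500 Hz.
times = [0.00, 0.02, 0.10, 0.20]
f1 = [300.0, 320.0, 400.0, 500.0]
normalised = time_normalise(times, f1, n_points=5)
print(normalised)
```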

• reference data:
– New Zealand English (NZE) from Canterbury
Corpus (ONZE)
– 120 male speakers (born 1932-1987)
– min 10 tokens per speaker (coded for context)
– auto-generated formant data
• test data:
– NZE / Manchester / Newcastle / York
– 8 male speakers per set (aged 16-31)
– 16 tokens per speaker (coded for context)

• why GOOSE /u:/?
– not a regional stereotype (Labov 1971) of any of the
test set dialects
[Figure: F1 (Hz) against F2 (Hz); legend: Manchester, Newcastle, York, ONZE]

• data reduction using quadratic polynomials
• LR calculated using Multivariate Kernel Density
formula (Aitken and Lucy 2004, Morrison 2007)
• accuracy of output assessed using log
likelihood-ratio cost function (Cllr) (Brümmer and
du Preez 2006)
$$y = a_0 + a_1 x + a_2 x^2$$
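Data reduction with a quadratic polynomial, y = a0 + a1·x + a2·x², replaces each time-normalised formant track with three coefficients. A minimal sketch using NumPy's least-squares polynomial fit follows; the track values are invented and the function name is hypothetical.

```python
import numpy as np


def fit_quadratic(x, y):
    """Fit y = a0 + a1*x + a2*x**2 and return (a0, a1, a2)."""
    # np.polyfit returns coefficients from the highest degree down.
    a2, a1, a0 = np.polyfit(x, y, deg=2)
    return a0, a1, a2


# Hypothetical F2 track sampled at nine time-normalised points.
x = np.linspace(0.0, 1.0, 9)
f2_track = 1800.0 - 600.0 * x + 400.0 * x**2
a0, a1, a2 = fit_quadratic(x, f2_track)
print(a0, a1, a2)
```

The three coefficients per formant then stand in for the full trajectory when LRs are computed.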

4. results
i. number of reference speakers
– test data combined
• 32 same-speaker comparisons
• 992 different-speaker comparisons
– starting with 120 speakers
• 10 tokens per speaker
– ten speakers removed at a time
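The comparison counts follow from combining the four test sets of eight speakers: 32 speakers give 32 same-speaker comparisons, and 992 equals 32 × 31, which is consistent with every ordered pair of different speakers being compared. The sketch below reproduces these counts and the schedule of reference-set sizes; the speaker labels are hypothetical, and the ordered-pair reading is an inference from the arithmetic rather than something the slides state.

```python
from itertools import permutations

n_test_sets = 4        # NZE, Manchester, Newcastle, York
speakers_per_set = 8

# Hypothetical speaker labels for the combined test data.
test_speakers = [
    f"set{s}_spk{i}" for s in range(n_test_sets) for i in range(speakers_per_set)
]

same_speaker = [(spk, spk) for spk in test_speakers]
# Assumption: ordered pairs, so (A, B) and (B, A) both count.
different_speaker = list(permutations(test_speakers, 2))

# Reference-set sizes when ten speakers are removed at a time from 120.
reference_sizes = list(range(120, 0, -10))

print(len(same_speaker), len(different_speaker), reference_sizes)
```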

[Figure: N speakers, same-speaker pairs. Mean Log10 LR and standard deviation against Number of Speakers (0-120); y-axis: Log10 LR (-6 to 4). Verbal expressions by Log10 LR: +/- 1 limited, +/- 2 moderate, +/- 3 moderately strong, +/- 4 strong, +/- 5 very strong evidence.]

[Figure: N speakers, same-speaker pairs. Mean Log10 LR and standard deviation against Number of Speakers (0-120); y-axis: Log10 LR (-6 to 4).]
• stable mean > 20 speakers
• increased variance < 40 speakers

[Figure: N speakers, different-speaker pairs. Mean Log10 LR and standard deviation against Number of Speakers (0-120); y-axis: Log10 LR (-6 to 4).]

[Figure: N speakers. Mean Log10 LR and standard deviation against Number of Speakers (0-120); y-axis: Log10 LR (-6 to 4).]
• stable mean
• Cllr:
- Lowest = 0.606 (120 speakers)
- Highest = 1.203 (10 speakers)
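For readers unfamiliar with Cllr, the log likelihood-ratio cost averages a penalty over same-speaker (SS) and different-speaker (DS) comparisons: Cllr = ½ [ mean over SS of log2(1 + 1/LR) + mean over DS of log2(1 + LR) ]. The sketch below implements that formula from log10 LRs; the function name and the example values are hypothetical and are not the study's data.

```python
import numpy as np


def cllr(ss_log10_lrs, ds_log10_lrs):
    """Log likelihood-ratio cost from same-speaker and
    different-speaker log10 LRs."""
    ss = np.power(10.0, np.asarray(ss_log10_lrs, dtype=float))
    ds = np.power(10.0, np.asarray(ds_log10_lrs, dtype=float))
    # Same-speaker pairs are penalised for low LRs.
    ss_penalty = np.mean(np.log2(1.0 + 1.0 / ss))
    # Different-speaker pairs are penalised for high LRs.
    ds_penalty = np.mean(np.log2(1.0 + ds))
    return 0.5 * (ss_penalty + ds_penalty)


uninformative = cllr([0.0, 0.0], [0.0, 0.0])  # every LR equals 1
good = cllr([2.0, 3.0], [-2.0, -3.0])         # LRs point the right way
misleading = cllr([-2.0], [2.0])              # LRs point the wrong way
print(uninformative, good, misleading)
```

A system that always returns LR = 1 scores exactly 1, so values above 1 indicate output that is on balance misleading, and lower values are better.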

ii. number of tokens per speaker in the
reference data
– test data combined
• 32 same-speaker comparisons
• 992 different-speaker comparisons
– max N tokens shared by 102 speakers = 13
– LRs calculated 11 times with 1 token per
reference speaker removed at each stage
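One possible reading of this schedule, assuming the first run uses all 13 tokens, is 11 runs with 13 down to 3 tokens per reference speaker. The sketch below builds that schedule and truncates each speaker's token list accordingly; the helper name, the toy data, and the choice to drop tokens from the end of each list are assumptions, since the slides do not say which token is removed at each stage.

```python
def truncate_tokens(tokens_by_speaker, n_tokens):
    """Keep only the first n_tokens tokens for every reference speaker."""
    return {spk: toks[:n_tokens] for spk, toks in tokens_by_speaker.items()}


# Assumed schedule: 13, 12, ..., 3 tokens per speaker (11 runs).
token_schedule = list(range(13, 2, -1))

# Hypothetical reference data: 102 speakers with 13 token ids each.
reference = {f"spk{i}": list(range(13)) for i in range(102)}

reduced_sets = [truncate_tokens(reference, n) for n in token_schedule]
print(token_schedule)
```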

[Figure: N tokens, same-speaker pairs. Mean Log10 LR and standard deviation against Number of Tokens per Speaker (0-13); y-axis: Log10 LR (-20 to 4).]
• mean LRs = stable
• standard deviation = stable

[Figure: N tokens. Mean Log10 LR and standard deviation against Number of Tokens per Speaker (0-13); y-axis: Log10 LR (-20 to 4).]
• continual increase in mean & SD as N tokens decreases

[Figure: N tokens. Mean Log10 LR and standard deviation.]
• massive increase in strength of evidence
• Cllr:
- Lowest = 0.648 (13 tokens)
- Highest = 0.762 (5 tokens)

iii. dialect mismatch
– 4 independent test sets
• ONZE, Manchester, Newcastle and York
– 102 speakers in the reference data
– 13 tokens per reference speaker

[Figure: dialect mismatch. Cumulative Proportion (0-1) against Log10 Likelihood Ratio (-5 to 5). Positive values: support for prosecution (same speaker); negative values: support for defence (different speakers).]
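Plots of cumulative proportion against log10 LR can be computed directly from the two sets of LRs. The sketch below calculates, for a set of log10 LRs, the proportion at or below each threshold, plus the rate of contrary-to-fact support among different-speaker pairs (log10 LR above 0). The function names and values are hypothetical, and the "at or below" convention is an assumption; such plots are also often drawn with "at or above" for one of the curves.

```python
import numpy as np


def cumulative_proportion(log10_lrs, thresholds):
    """Proportion of log10 LRs that are <= each threshold."""
    values = np.sort(np.asarray(log10_lrs, dtype=float))
    counts = np.searchsorted(values, thresholds, side="right")
    return counts / values.size


def contrary_to_fact_rate(ds_log10_lrs):
    """Share of different-speaker pairs that support the prosecution."""
    ds = np.asarray(ds_log10_lrs, dtype=float)
    return float(np.mean(ds > 0.0))


# Hypothetical log10 LRs for illustration only.
ss = [-0.5, 1.0, 2.0, 3.0]
ds = [-3.0, -1.0, 0.5, 2.0]
thresholds = np.array([-2.0, 0.0, 2.0])

ss_curve = cumulative_proportion(ss, thresholds)
ds_curve = cumulative_proportion(ds, thresholds)
ds_error = contrary_to_fact_rate(ds)
print(ss_curve, ds_curve, ds_error)
```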

[Figure: dialect mismatch: F1 and F2. Same-speaker pairs and different-speaker pairs; legend: ONZE (match), Newcastle, Manchester, York.]

[Figure: cumulative proportion (0-1) against Log10 LR (-12 to 4). Same-speaker pairs and different-speaker pairs; legend: ONZE (match), Newcastle, Manchester, York.]

[Figure: cumulative proportion (0-1) against Log10 LR (-12 to 4). Same-speaker pairs and different-speaker pairs; legend: ONZE (match), Newcastle, Manchester, York; annotations: 71%, 58%.]

5. discussion
i. number of reference speakers
– evidence of “population size effect” (Ishihara and
Kinoshita 2008)
• unrepresentative estimates of the strength of
evidence with small N speakers in reference data
– mean LRs & variance stable > ca. 40 speakers
• different-speaker pairs more sensitive

ii. number of tokens per reference
speaker
– mean and SD for same-speaker pairs robust
– different-speaker pairs very sensitive to the
removal of even a single token
– What if your reference data doesn’t match the
case at trial?

iii. dialect mismatch
- same-speaker strength of evidence overestimated
• generally equivalent to one verbal category
- multitude of issues with different-speaker pairs
• overestimation of LRs for York (BUT issues of between-speaker variation)
• high levels of contrary-to-fact support for the
prosecution for Manchester and Newcastle
• potential miscarriages of justice

6. conclusion
• positive practical implications
- mean and variance of LRs stable until only small N
speakers in the reference data
- good Cllr, even with relatively small N tokens per speaker
- but the more speakers and the more tokens the better
• predictably, dialect matters
- even for features which aren’t expected to display
considerable variation according to region
- default Hd needs to account for this
- how narrowly do we need to define dialect?
- what about other ‘logically relevant’ class factors?

Thanks
Questions?
Vincent Hughes vh503@york.ac.uk

References
Aitken, C. G. G. and Taroni, F. (2004) Statistics and the evaluation of evidence for forensic
scientists (2nd edition). Chichester: John Wiley & Sons.
Berger, C. (2012) Modern evidential interpretation, reporting and fallacies. Lecture given at the
BBfor2 Summer School in Forensic Evidence Evaluation and Validation. Universidad
Autonoma de Madrid, Spain. 18-21 July 2012.
Brümmer, N. and du Preez, J. (2006) Application independent evaluation of speaker detection.
Computer Speech and Language 20: 230-275.
Champod, C. and Evett, I. W. (2000) Commentary on A. P. A. Broeders (1999) ‘Some
observations on the use of probability scales in forensic identification’. Forensic
Linguistics 7(2): 238-243.
Ishihara, S. and Kinoshita, Y. (2008) How many do we need? Exploration of the Population Size
Effect on the performance of forensic speaker classification. Paper presented at the 9th
Annual Conference of the International Speech Communication Association
(Interspeech). Brisbane, Australia. 1941-1944.
Kaye, D. H. (2004) Logical relevance: problems with the reference population and DNA mixtures
in People v. Pizarro. Law, Probability and Risk 3: 211-220.
Kaye, D. H. (2008) DNA probabilities in People v. Prince: When are racial and ethnic statistics
relevant? In Speed, T. and Nolan, D. (eds.) Probability and Statistics: Essays in Honour
of David A Freedman. Beachwood, OH: Institute of Mathematical Statistics. 289-301.
Kinoshita, Y., Ishihara, S. and Rose, P. (2009) Exploring the discriminatory potential of F0
distribution parameters in traditional speaker recognition. International Journal of
Speech, Language and the Law 16(1): 91-111.
Labov, W. (1971) The study of language in its social context. In Fishman, J. A. (ed.) Advances in
the Sociology of Language (vol. 1). The Hague: Mouton. 152-216.
Loakes, D. (2006) A forensic phonetic investigation into the speech patterns of identical and
non-identical twins. PhD Dissertation, University of Melbourne.
McDougall, K. (2004) Speaker-specific formant dynamics: An experiment on Australian English
/aɪ/. International Journal of Speech, Language and the Law 11(1): 103-130.
McDougall, K. (2006) Dynamic features of speech and the characterisation of speakers: towards
a new approach using formant frequencies. International Journal of Speech, Language
and the Law 13(1): 89-126.
Morrison, G. S. (2007) Matlab implementation of Aitken and Lucy’s (2004) Forensic
Likelihood-Ratio Software Using Multivariate-Kernel-Density Estimation [software].
Available: http://geoff-morrison.net.
Morrison, G. S. (2008) Forensic voice comparison using likelihood ratios based on polynomial
curves fitted to the formant trajectories of Australian English /aI/. International
Journal of Speech, Language and the Law 15(2): 249-266.

Rose, P. (2004) Technical Forensic Speaker Identification from a Bayesian Linguist's
Perspective. Keynote paper, Forensic Speaker Recognition Workshop, Speaker
Odyssey ’04. 31 May - 3 June 2004, Toledo, Spain. 3-10.
Rose, P. (2011) Forensic voice comparison with Japanese vowel acoustics – a likelihood
ratio-based approach using segmental cepstra. Proceedings of the 17th International
Congress of Phonetic Sciences. 17-21 August 2011, Hong Kong. 1718-1721.
Rose, P., Osanai, T. and Kinoshita, Y. (2003) Strength of forensic speaker identification evidence:
multispeaker formant- and cepstrum-based segmental discrimination with a Bayesian
likelihood ratio as threshold. Forensic Linguistics 10(2): 179-202.
Rose, P., Kinoshita, Y. and Alderman, T. (2006) Realistic extrinsic forensic speaker
discrimination with the diphthong /aI/. Proceedings of the 10th Australian
Conference on Speech Science and Technology, 8-10 December 2004, Sydney:
Macquarie University. 329-334.
Rose, P. and Morrison, G. S. (2009) A response to the UK Position Statement on forensic speaker
comparison. International Journal of Speech, Language and the Law 16(1): 139-163.
Wells, J. C. (1982) Accents of English (3 vols). Cambridge: Cambridge University Press.
