Crowdsourcing Speech Intelligibility Judgements

ASA 173, Boston
Maria K Wolters, University of Edinburgh
Karl B Isaac, freelance researcher
Contact: maria.wolters@ed.ac.uk, @mariawolters
With many thanks to Steve Renals & the EPSRC MultiMemoHome team

Key Questions
❖ What can we know about the context of the judgements people make?
❖ How might that context affect performance?
    ❖ It could explain some of the increased variation in results
    ❖ It could yield new hypotheses about real-world intelligibility
❖ How can we improve the experience?

Data
❖ Series of 14 lab and Amazon Mechanical Turk experiments on speech synthesis intelligibility (Isaac, 2015, PhD thesis)
    ❖ Lab vs Mechanical Turk
    ❖ Effect of type of test sentences
    ❖ Effect of noise and reverberation

Experiment Overview
Study     Complete  Not complete  Aim
amt       167       62            Semantically unpredictable sentences, AMT vs Lab, 4 systems
matrix    61        40            testing matrix sentences
newvoice  61        49            three new voices
lowrev    68        NA            effects of low reverberation
highrev   36        NA            effects of high reverberation
noiserev  78        183           noise x reverberation
Total     471       334
No exclusions or filtering

Important aspects of context
❖ People’s hearing
❖ How they are listening
❖ Where they are listening
❖ Experience with speech tested
❖ Did they do what they were supposed to do?

Hearing Issues
❖ Self-report does not correlate very well with actual hearing loss (Wolters, Isaac, Johnson 2011)
❖ Yet there are many instances of self-reported hearing difficulties that affect the ability to understand speech in noise, with no measurable hearing loss (Bharadwaj et al., 2015)

How people are listening
❖ Headphones versus no headphones
❖ Type of headphones (earbuds, on ear, full ear …)
❖ Features of headphones
❖ Configuration of listening device (phone / computer; browser; volume)

Where they are listening
❖ Room acoustics
❖ Public / private
❖ Interruptions
❖ Background noise
    ❖ source
    ❖ loudness
    ❖ fluctuating / constant / bursty

Experience with Speech Type
❖ Dialect
❖ Life history
❖ exposure to target speech

Did They Do What They Were Supposed To Do?
❖ Manipulation checks, such as a very easy sentence
❖ A different task / item that jolts people out of "tick-box" mode
❖ Instructions at the start, then questions about aspects of the instructions at the end (people are surprisingly honest!)
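
A manipulation check based on a very easy sentence could be scored along these lines. This is only a sketch: the function names, the exact-match rule after normalisation, and the example sentence are assumptions for illustration, not the procedure used in the studies.

```python
import string


def normalise(text):
    """Lower-case, strip punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def failed_check(responses, expected):
    """Return the participants whose transcript of the easy sentence is wrong.

    responses: dict mapping participant id -> typed transcript (hypothetical format).
    expected:  the easy check sentence as it was played.
    """
    target = normalise(expected)
    return sorted(p for p, typed in responses.items() if normalise(typed) != target)


# Hypothetical example data
responses = {"p1": "The cat sat.", "p2": "asdf", "p3": "the  cat sat"}
flagged = failed_check(responses, "the cat sat")
```

Requiring an exact match is a deliberately strict choice; a tolerance of one wrong word may be more appropriate for longer check sentences.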

Effect on Performance
❖ Context variables:
    ❖ self-reported hearing problems
    ❖ self-reported loudness of background noise
❖ Performance variables:
    ❖ mean word error rate (WER) for each within-participant condition
    ❖ self-reported performance
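
As an illustration of the WER performance variable, the sketch below computes a word-level edit distance per sentence and averages it per participant and condition. It assumes the common definition of WER (substitutions, deletions and insertions divided by the number of reference words); the slides do not state the exact scoring rules, and the function names and the input format are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def word_error_rate(reference, response):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = response.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (initially none) and the first j response words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_condition(trials):
    """trials: iterable of (participant, condition, reference, response)."""
    scores = defaultdict(list)
    for participant, condition, reference, response in trials:
        scores[(participant, condition)].append(word_error_rate(reference, response))
    return {key: mean(values) for key, values in scores.items()}
```

Under this definition a response with many inserted words can score above 1, so a WER greater than 1 is not necessarily a data error.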

Self-Reported Hearing (Hearing Handicap Inventory for Adults, HHIA)
Study     mean  median  IQR  Max  Score >=10: n (%)
amt       3     0       0    38   21 (13%)
matrix    3     0       0    34   4 (7%)
newvoice  3.5   0       4    36   10 (16%)
lowrev    1     0       0    18   5 (7%)
highrev   1.5   0       0    28   2 (6%)
noiserev  1.5   0       0    20   6 (8%)

Self-Reported Noise Loudness
Study     1 (none)  2   3   4  5 (LOUD)  median  IQR
matrix    25        29  4   3  0         2       1
newvoice  29        20  7   4  1         2       1
lowrev    36        16  11  4  0         1       1
highrev   18        15  1   1  1         1.5     1
noiserev  44        22  5   1  6         1       1
Not captured in the AMT study

Mean WER
Study min mean median IQR Max
amt 0.06 0.20 0.18 0.8 1.00
matrix 0 0.09 0.08 0.40 0.32
newvoice 0 0.14 0.14 0.15 0.42
lowrev 0 0.05 0.04 0.06 0.5
highrev 0 0.15 0.08 0.22 0.92
noiserev 0 0.50 0.48 0.88 1.16

Self-Reported Intelligibility
Study     Usually all  Usually most  Worse     Link with mean WER
amt       7 (4%)       125 (75%)     35 (21%)  p<0.0001
matrix    27 (44%)     33 (54%)      1 (2%)    p<0.005
newvoice  10 (16%)     47 (77%)      4 (6.5%)  p<0.01
lowrev    45 (66%)     21 (31%)      1 (1%)    p<0.001
highrev   11 (31%)     22 (61%)      3 (8%)    p<0.05
noiserev  7 (9%)       31 (40%)      40 (51%)  p<0.0001
Link with mean WER assessed using Kruskal-Wallis test
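
For readers unfamiliar with the Kruskal-Wallis test, here is a minimal sketch of how mean WER could be compared across the three self-report groups. It is not the analysis code behind the table. It assumes exactly three groups, so the chi-square approximation has 2 degrees of freedom and its upper tail is exp(-H/2), and the function names are illustrative. Every group must be non-empty and the pooled values must not all be identical.

```python
import math
from collections import Counter


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def kruskal_wallis_three_groups(group_a, group_b, group_c):
    """Return (H, p); p uses the chi-square approximation with 2 df."""
    groups = [group_a, group_b, group_c]
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    offset = 0
    for g in groups:
        rank_sum = sum(ranks[offset:offset + len(g)])
        total += rank_sum ** 2 / len(g)
        offset += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Standard correction for tied observations
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1 - ties / (n ** 3 - n)
    p = math.exp(-h / 2)  # upper tail of chi-square with 2 df
    return h, p


# Hypothetical mean WER values for the three self-report groups
h, p = kruskal_wallis_three_groups([0.02, 0.05, 0.04], [0.10, 0.12, 0.15], [0.30, 0.45, 0.50])
```

With small groups, as in several "worse" cells of the table, the chi-square approximation is rough and an exact or permutation version of the test may be preferable.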

Checking for Correlations
❖ Spearman test as implemented in the R package coin
❖ Stratified by relevant experimental variables
❖ H0 is that mean WER and HHIA score / loudness are independent, given the experimental variable
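
The idea of a stratified Spearman test can be sketched as a permutation test in which values are only shuffled within each level of the experimental variable. The analysis itself used the R package coin; the Python code below is an illustrative approximation, not a port of coin. Ranking within strata, the Monte Carlo p-value, and the names `centred_ranks`, `stratified_spearman_test`, `n_perm` and `seed` are all assumptions of this sketch.

```python
import random


def centred_ranks(values):
    """Average ranks (ties share a rank), shifted to have mean zero."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    centre = (len(values) + 1) / 2
    return [r - centre for r in ranks]


def stratified_spearman_test(x, y, strata, n_perm=2000, seed=0):
    """Two-sided Monte Carlo p-value for H0: x and y are independent within every stratum."""
    by_stratum = {}
    for xi, yi, s in zip(x, y, strata):
        xs, ys = by_stratum.setdefault(s, ([], []))
        xs.append(xi)
        ys.append(yi)
    ranked = [(centred_ranks(xs), centred_ranks(ys)) for xs, ys in by_stratum.values()]

    def statistic(pairs):
        # Sum over strata of the products of centred ranks
        return sum(a * b for rx, ry in pairs for a, b in zip(rx, ry))

    observed = statistic(ranked)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        shuffled = []
        for rx, ry in ranked:
            ry_perm = ry[:]
            rng.shuffle(ry_perm)  # permute only within the stratum
            shuffled.append((rx, ry_perm))
        if abs(statistic(shuffled)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical data: HHIA score, mean WER, and the condition each mean belongs to
hhia = [0, 2, 4, 10, 18, 30, 0, 2, 6, 12, 20, 28]
wer = [0.05, 0.06, 0.08, 0.10, 0.15, 0.30, 0.20, 0.22, 0.25, 0.33, 0.40, 0.60]
condition = ["low"] * 6 + ["high"] * 6
stat, p_value = stratified_spearman_test(hhia, wer, condition)
```

Shuffling within strata keeps the effect of the experimental variable itself out of the null distribution, which is the point of conditioning on it.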

HHIA vs Mean WER
Study     by System  by Reverb  by SNR
amt       p=0.55
matrix    p=0.08
newvoice  p<0.01
lowrev    p=0.37     p=0.44
highrev   p=0.88     p=0.85
noiserev  p=0.11     p<0.01     p<0.005
Self-reported hearing becomes relevant
* in the most difficult study (noiserev)
* in the study with the highest number of people over threshold
Loudness vs WER
matrix p=0.08
newvoice p=0.30
lowrev p=0.11 p=0.17
highrev p=0.14 p<0.07
noiserev p<0.05 p=0.14 p=0.18
No evidence for a strong influence

Loudness vs Self-Reported Understanding
matrix p<0.01
newvoice p<0.005
lowrev p<0.005 p<0.005
highrev p<0.005 p<0.005
noiserev p<0.001 p<0.001 p<0.001
Self-reported loudness of environment noise relates to self-reported difficulty, not WER

Effects of Context on Performance
• can be subtle
• may depend on whether performance is self-reported or measured
• may depend on who shows up for your study: we need a better understanding of possible confounders!
Suggestion: build up a library of context data across studies

How Can We Make it Easier?
❖ Use between-subject rather than within-subject designs: 90 sentences in the final study was a killer
❖ Pay a living wage
❖ Encourage free comments that can be mined for useful information (think canary in a coal mine)
❖ Offer more information on the goal of the study, and an opt-in to receive a results summary

Canaries in the Comment Coalmine
❖ issues with the software
❖ issues with their memory
❖ typing while listening
❖ issues with UK accent for US listeners
❖ how they adjusted the volume at their end

Conclusion
❖ Use consistent brief questions regarding context to better characterise your samples across all your studies
❖ Use free comments to look for aspects you hadn’t considered before
❖ Be kind to your participants
Questions?
Contact: maria.wolters@ed.ac.uk, @mariawolters,
http://mariawolters.net
Dr Karl B Isaac
