This talk looks at variation among the participants who take part in speech intelligibility studies, and explores how that variability can be characterised and taken into account when interpreting and discussing results.
1. ASA 173, Boston
Crowdsourcing Speech Intelligibility Judgements
Maria K Wolters, University of Edinburgh
Karl B Isaac, freelance researcher
Contact: maria.wolters@ed.ac.uk, @mariawolters
with many thanks to Steve Renals & the EPSRC MultiMemoHome team
2. Key Questions
❖ What can we know about the context of the judgements people make?
❖ How might that context affect performance?
❖ could explain some of the increased variation in results
❖ could yield new hypotheses about real-world intelligibility
❖ How can we improve the experience?
3. Data
❖ Series of 14 lab and Amazon Mechanical Turk
experiments on speech synthesis intelligibility (Isaac,
2015, PhD thesis)
❖ Lab vs Mechanical Turk
❖ effect of type of test sentences
❖ effect of noise and reverberation
4. Experiment Overview
Study     Complete  Not complete  Aim
amt       167       62            Semantically unpredictable sentences, AMT vs Lab, 4 systems
matrix    61        40            testing matrix sentences
newvoice  61        49            three new voices
lowrev    68        NA            effects of low reverberation
highrev   36        NA            effects of high reverberation
noiserev  78        183           noise x reverberation
Total     471       334
no exclusions and filtering
5. Important aspects of context
❖ People’s hearing
❖ How they are listening
❖ Where they are listening
❖ Experience with speech tested
❖ Did they do what they were supposed to do?
6. Hearing Issues
❖ Self-report does not correlate very well with actual
hearing loss (Wolters, Isaac, Johnson 2011)
❖ Yet, many instances of self-reported hearing difficulties
that affect ability to understand speech in noise, with no
hearing loss (Bharadwaj et al., 2015)
7. How people are listening
❖ Headphones versus no headphones
❖ Type of headphones (earbuds, on ear, full ear …)
❖ Features of headphones
❖ configuration of listening device (phone / computer;
browser; volume)
8. Where they are listening
❖ Room acoustics
❖ Public / private
❖ Interruptions
❖ background noise
❖ source
❖ loudness
❖ fluctuating / constant / bursty
10. Did They Do What They Were Supposed To Do?
❖ Manipulation checks, such as a very easy sentence
❖ A different task or item that jolts people out of "tickybox" mode
❖ Instructions at the start, then questions about aspects of
instructions at the end (people are surprisingly honest!)
11. Effect on Performance
❖ Context Variables:
❖ self-reported hearing problems
❖ self-reported loudness of background noise
❖ Performance Variables:
❖ Word error rate (WER) mean for each within-participant
condition
❖ self-reported performance
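As a rough illustration of the WER performance variable, the sketch below scores a typed response against the reference sentence with a word-level edit distance, then averages per participant and condition. The function names, the lower-casing, and the choice to average per-sentence WER (rather than pool errors over all words) are assumptions for illustration; the slides do not say how the scoring was implemented.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance: substitutions + deletions + insertions."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Errors divided by the number of reference words (can exceed 1)."""
    return word_errors(reference, hypothesis) / len(reference.split())


def mean_wer_by_condition(responses):
    """responses: iterable of (participant, condition, reference, typed).

    Assumption: the mean is taken over per-sentence WER values.
    """
    scores = {}
    for participant, condition, reference, typed in responses:
        scores.setdefault((participant, condition), []).append(wer(reference, typed))
    return {key: sum(v) / len(v) for key, v in scores.items()}
```

Because inserted words count as errors, a response longer than the reference can score above 1, which is one way a maximum WER above 1 can arise.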
12. Self-Reported Hearing
(Hearing Handicap Inventory for Adults)
Study mean median IQR Max >=10
amt 3 0 0 38 21 (13%)
matrix 3 0 0 34 4 (7%)
newvoice 3.5 0 4 36 10 (16%)
lowrev 1 0 0 18 5 (7%)
highrev 1.5 0 0 28 2 (6%)
noiserev 1.5 0 0 20 6 (8%)
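The columns of this table can be reproduced from raw questionnaire scores with a few lines of code. The sketch below is an assumption about how such a summary might be computed, not the authors' script: the function name is hypothetical, the example scores are made up, and the IQR depends on the quantile convention, which the slides do not specify.

```python
import statistics


def summarise_hhia(scores, threshold=10):
    """Summary statistics in the style of the table above."""
    # "inclusive" is one common quantile convention; others give other IQRs.
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    over = sum(s >= threshold for s in scores)
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "iqr": q3 - q1,
        "max": max(scores),
        "n_over": over,                         # participants at or above threshold
        "pct_over": 100 * over / len(scores),
    }


# Made-up scores for ten hypothetical participants.
example = summarise_hhia([0, 0, 0, 0, 0, 0, 2, 4, 12, 38])
```

A median of 0 alongside a large maximum, as in most rows of the table, is typical of such skewed scores, which is why the count above the threshold is reported separately.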
13. Self-Reported Noise Loudness
Study     1 (none)  2   3   4  5 (LOUD)  median  IQR
matrix    25        29  4   3  0         2       1
newvoice  29        20  7   4  1         2       1
lowrev    36        16  11  4  0         1       1
highrev   18        15  1   1  1         1.5     1
noiserev  44        22  5   1  6         1       1
not captured in AMT study
14. Mean WER
Study min mean median IQR Max
amt 0.06 0.20 0.18 0.8 1.00
matrix 0 0.09 0.08 0.40 0.32
newvoice 0 0.14 0.14 0.15 0.42
lowrev 0 0.05 0.04 0.06 0.5
highrev 0 0.15 0.08 0.22 0.92
noiserev 0 0.50 0.48 0.88 1.16
15. Self-Reported Intelligibility
Study     usually all  usually most  worse     link with mean WER
amt       7 (4%)       125 (75%)     35 (21%)  p<0.0001
matrix    27 (44%)     33 (54%)      1 (2%)    p<0.005
newvoice  10 (16%)     47 (77%)      4 (6.5%)  p<0.01
lowrev    45 (66%)     21 (31%)      1 (1%)    p<0.001
highrev   11 (31%)     22 (61%)      3 (8%)    p<0.05
noiserev  7 (9%)       31 (40%)      40 (51%)  p<0.0001
Link with mean WER assessed using Kruskal-Wallis test
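For readers unfamiliar with the Kruskal-Wallis test, the sketch below computes the tie-corrected H statistic for the mean WER values of three self-report groups and the usual chi-square approximation to the p-value. It is a minimal pure-Python illustration, not the code behind the table; the function name and the example numbers are made up. With three groups the chi-square distribution has 2 degrees of freedom, so its upper tail is simply exp(-H/2).

```python
import math


def kruskal_wallis_3(groups):
    """Tie-corrected H and approximate p-value for exactly three groups."""
    assert len(groups) == 3
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank_of = {}   # value -> mid-rank (ties share the average rank)
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                               # size of this tie group
        rank_of[pooled[i]] = (i + 1 + j) / 2    # average of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    h = 12 / (n * (n + 1)) * sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    h /= 1 - tie_term / (n ** 3 - n)            # correction for ties
    p = math.exp(-h / 2)                        # chi-square upper tail, 2 df
    return h, p


# Made-up mean WER values: "usually all", "usually most", "worse".
h, p = kruskal_wallis_3([[0.02, 0.04, 0.05],
                         [0.10, 0.12, 0.15],
                         [0.30, 0.40, 0.55]])
```

The chi-square approximation is rough for very small groups, such as the single-participant "worse" cells in some rows of the table.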
16. Checking for Correlations
❖ Spearman test as implemented in R package coin
❖ stratified by relevant experimental variables
❖ H0 is that mean WER and HHIA score / loudness are
independent, given the experimental variable
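The idea of a stratified test can be sketched as a permutation test in which scores are ranked and shuffled only within each stratum (for example, within each system or SNR level). This is a simplified Monte Carlo approximation written for illustration, not the implementation in the R package coin; the function names, the test statistic, and the number of permutations are all assumptions.

```python
import random


def midranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in order[i:j]:
            ranks[k] = (i + 1 + j) / 2
        i = j
    return ranks


def stratified_spearman_test(x, y, strata, n_perm=2000, seed=0):
    """Two-sided permutation p-value for H0: x and y independent given strata."""
    rng = random.Random(seed)
    blocks = []
    for s in sorted(set(strata)):
        idx = [k for k, v in enumerate(strata) if v == s]
        centre = (len(idx) + 1) / 2             # mean rank within the stratum
        rx = [r - centre for r in midranks([x[k] for k in idx])]
        ry = [r - centre for r in midranks([y[k] for k in idx])]
        blocks.append((rx, ry))

    def statistic(bs):
        # Sum over strata of the products of centred ranks.
        return sum(a * b for rx, ry in bs for a, b in zip(rx, ry))

    observed = statistic(blocks)
    hits = 0
    for _ in range(n_perm):
        permuted = []
        for rx, ry in blocks:
            shuffled = ry[:]
            rng.shuffle(shuffled)               # permute only within the stratum
            permuted.append((rx, shuffled))
        if abs(statistic(permuted)) >= abs(observed) - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Ranking within strata means that a difference in overall WER between, say, two SNR levels cannot by itself create an apparent link between HHIA score and WER.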
17. HHIA vs Mean WER
Study by System by Reverb by SNR
amt p=0.55
matrix p=0.08
newvoice p<0.01
lowrev p=0.37 p=0.44
highrev p=0.88 p=0.85
noiserev p=0.11 p<0.01 p<0.005
self-reported hearing becomes relevant
* in the most difficult study (noiserev)
* in the study with the highest number of people over threshold
19. Loudness vs WER
Study by System by Reverb by SNR
matrix p=0.08
newvoice p=0.30
lowrev p=0.11 p=0.17
highrev p=0.14 p<0.07
noiserev p<0.05 p=0.14 p=0.18
no evidence for a strong influence
20. Loudness vs Self-Reported Understanding
Study by System by Reverb by SNR
matrix p<0.01
newvoice p<0.005
lowrev p<0.005 p<0.005
highrev p<0.005 p<0.005
noiserev p<0.001 p<0.001 p<0.001
Self-reported loudness of environment noise
relates to self-reported difficulty, not WER
22. Effects of Context on Performance
• can be subtle
• may depend on whether self-reported or measured
performance
• may depend on who shows up for your study: better
understanding of possible confounders!
Suggestion: build up library of context data across studies
23. How Can We Make it Easier?
❖ Use a between-subjects rather than a within-subjects design: 90 sentences
in the final study was a killer
❖ Pay a living wage
❖ encourage free comments that can be mined for useful
information (think canary in a coal mine)
❖ offer more info on goal of study, opt-in to receive results
summary
24. Canaries in the Comment Coalmine
❖ issues with the software
❖ issues with their memory
❖ typing while listening
❖ issues with UK accent for US listeners
❖ how they adjusted the volume at their end
25. Conclusion
❖ Use consistent brief questions regarding context to better characterise your
samples across all your studies
❖ Use free comments to look for aspects you hadn’t considered before
❖ Be kind to your participants
Questions?
Contact:
maria.wolters@ed.ac.uk, @mariawolters,
http://mariawolters.net
Dr Karl B Isaac