ASA 2009 Poster, Portland, OR
EXP 2 RESULTS
Voice learning
• Voice ID significantly improved across training sessions. This was
true for performance in the training task (F(3, 38) = 80.15, p < .0001) and
in the post-training ID task (F(3, 38) = 43.95, p < .0001, See Fig. 6).
• Voice recognition in the training task was significantly better for AV
than the A group (F(1, 38) = 6.96, p < .05).
• Performance in the post-training voice identification task did not differ
as a function of training format (F(1, 38) = 1.807, p > .05).
Auditory word recognition
• Word recognition performance did not differ significantly for trained vs.
untrained voices (F(1, 38) = .211, p > .05, See Fig. 7).
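The F-values reported above (e.g., F(3, 38) = 43.95) come from analysis of variance. As a toy illustration of how an F-statistic and p-value are obtained (a simple one-way ANOVA on invented scores, not the study's data or its actual mixed design):

```python
# Toy one-way ANOVA illustrating F-test reporting such as F(3, 38) = 43.95.
# Scores are invented for demonstration: one group per training session.
from scipy.stats import f_oneway

session1 = [40, 42, 38, 41]
session2 = [55, 53, 57, 54]
session3 = [66, 64, 67, 65]
session4 = [72, 70, 73, 71]

# F grows with between-session variance relative to within-session variance.
F, p = f_oneway(session1, session2, session3, session4)
print(f"F = {F:.2f}, p = {p:.4f}")
```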
EXP 1 RESULTS
Voice learning
• Voice identification was significantly more difficult for Talker Set
1 than Talker Set 2 (See Fig. 3).
• F0s for Talker Set 1 overlapped within the same gender
whereas within-gender F0s for Talker Set 2 were spread
apart.
• When a less stringent criterion was used for the training task,
there was a significant difference in mean total sessions
between the auditory and auditory-visual groups in Talker Set 1
(See Table 5).
Table 5. Total sessions required to achieve 75% criterion performance in
the voice training task.
Auditory word recognition
• There was no significant difference in auditory word
recognition between the trained voices and the untrained voices
(See Fig. 4). This may be due to: 1) intelligibility differences
between Talker Sets 1 and 2; and 2) different numbers of
training sessions across subjects and conditions.
FIG 3. Average performance in
the voice identification task. Error
bars represent 95% confidence
interval. The dotted line
represents chance level (25%).
STIMULI
• 320 sentences created by Bell & Wilson (2001).
• 7-9 words in each sentence, semantically neutral.
• 3 key words per sentence (e.g., She felt a rough spot on her
skin.).
• 5 male and 5 female talkers produced the sentences
(See Table 1); recordings were in an audiovisual HD
format.
Table 1. Talker Acoustics.
• All sentences were processed via a 12-channel noise-band CI
simulator (See Fig. 1 & Table 2).
Table 2. Corner frequencies
for frequency analysis filters
for CI simulation.
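The noise-band CI simulation described above can be sketched as a generic noise-band vocoder: each channel's envelope is extracted and used to modulate band-limited noise. The corner frequencies follow Table 2; the sampling rate, filter orders, and 160 Hz envelope cutoff are assumptions, since the poster does not specify these processing details.

```python
# Minimal noise-band vocoder sketch (CI simulation), assuming the
# Table 2 corner frequencies. Assumed parameters: 22.05 kHz sampling
# rate, 4th-order band filters, 160 Hz envelope low-pass.
import numpy as np
from scipy.signal import butter, sosfilt

FS = 22050  # assumed sampling rate (Hz)
EDGES = [188, 313, 563, 813, 1063, 1313, 1688,
         2188, 2813, 3688, 4813, 6188, 7938]  # 12 channels (Table 2)

def bandpass(low, high, fs, order=4):
    return butter(order, [low, high], btype="band", fs=fs, output="sos")

def vocode(signal, fs=FS):
    lp = butter(2, 160, btype="low", fs=fs, output="sos")  # envelope filter
    out = np.zeros(len(signal))
    rng = np.random.default_rng(0)
    for low, high in zip(EDGES[:-1], EDGES[1:]):
        sos = bandpass(low, high, fs)
        band = sosfilt(sos, signal)            # analysis band
        env = sosfilt(lp, np.abs(band))        # rectified, smoothed envelope
        noise = sosfilt(sos, rng.standard_normal(len(signal)))
        out += env * noise                     # envelope-modulated noise band
    return out
```

The summed output preserves the per-channel temporal envelopes while discarding fine spectral structure, which is what makes the voices "spectrally degraded."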
1pPP25. Effects of Training Format on Voice Learning of Spectrally Degraded Voices
Sangsook Choi1, Karen Iler Kirk1, Tom Talavage2, Vidya Krull1, Christopher Smalt2, Steve Baker2
1. Dept. of Speech Language and Hearing Science, 2. Dept. of Electrical and Computer Engineering
BACKGROUND
• Perception of talker-specific information (i.e., indexical
information such as a talker’s voice) is closely related to
linguistic processing of an utterance (for a review, see Nygaard,
2005).
• Audiovisual speech facilitates speech perception in adverse
listening environments (Sumby & Pollack, 1954). Similar
facilitative effects were found in voice identification in NH
listeners (Sheffert & Olson, 2004).
• It is well known that audiovisual speech provides large
benefits to individuals with hearing impairment (Erber, 1972;
1975), including CI recipients (Tyler et al., 1997).
• Little is known about indexical perception in CI recipients
and their abilities to integrate indexical and linguistic
information in speech recognition.
PURPOSE OF THE STUDY
In the present study, CI-simulated speech was used to
examine the effect of presentation format (auditory vs.
auditory-visual) on learning of spectrally degraded voices and
the transfer of voice learning to auditory recognition of
spectrally degraded speech.
EXPERIMENT 1
Participants: 32 native speakers of General American English
with normal hearing and normal or corrected-to-normal vision.
Procedures:
Voice learning
• Familiarized with voices in A or AV presentation format.
• Half of the group in each presentation format trained on Talker Set 1,
and the remainder trained on Talker Set 2.
• After familiarization, voice identification was tested in an A format
and the feedback was provided on every trial.
• Voice identification training continued until criterion performance
was reached or 10 sessions were completed.
Auditory word recognition
• Upon completion of voice ID training, auditory word recognition
was tested using novel sentences produced by all talkers in Sets 1
and 2 (1/2 trained voices, 1/2 untrained voices).
Hypothesis 1: Auditory-visual familiarization will accelerate
the learning of spectrally degraded voices. Therefore, fewer
training sessions should be necessary to achieve voice
identification criterion performance (80% in 2 consecutive
sessions) in the AV training group compared to the A training
group.
Hypothesis 2: Familiarity with a talker’s voice will facilitate
subsequent spoken word recognition. That is, word
recognition performance for the trained voices will be greater
than for the untrained voices.
EXP 1 RESULTS
Voice learning
• 75% of the listeners met the criterion after an average of
4.83 sessions (See Tables 3 & 4).
• All who did not meet the criterion were trained with
Talker Set 1, and 75% of the listeners who did not meet
the criterion were in the auditory group (See Table 3).
Table 3. Total number of subjects who met and did not meet criterion.
• Overall, there was no significant difference in mean total
training sessions between the auditory and auditory-visual
training formats (t(21.99) = .451, p > .05, See Table 4).
Table 4. Total sessions required to achieve 80% criterion performance
in the voice training task.
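The fractional degrees of freedom in the reported t-statistic (21.99) indicate Welch's unequal-variance t-test. A minimal sketch of that comparison with hypothetical session counts (not the study's data):

```python
# Welch's t-test (unequal variances), the form behind a report like
# t(21.99) = .451, p > .05. Session counts below are hypothetical.
from scipy.stats import ttest_ind

auditory = [6, 4, 7, 3, 5, 6, 4, 5]          # hypothetical sessions to criterion
auditory_visual = [4, 5, 3, 6, 4, 5, 4, 3]   # hypothetical sessions to criterion

# equal_var=False selects Welch's test, which yields fractional df.
t, p = ttest_ind(auditory, auditory_visual, equal_var=False)
print(f"t = {t:.3f}, p = {p:.3f}")
```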
EXPERIMENT 2
Participants: A new group of 40 native speakers of General American
English with NH and normal or corrected-to-normal vision.
Stimuli: 6 talkers in the trained set and 4 talkers in the untrained set.
Sets were matched on average intelligibility.
Procedures:
Voice learning
• Familiarized with voices in A or AV format
• Voice ID training was carried out in an A format, but feedback about the
accuracy of each answer was provided in A or AV format.
• Upon completion of each training, subjects completed an auditory-only voice
ID task consisting of 20 new sentences spoken by 6 trained voices (total 120
tokens).
Auditory word recognition
• Pre- and post-training word recognition testing was carried out in an A format
using sentences spoken by both trained and untrained talkers.
Hypothesis 1: Auditory-visual voice training will facilitate voice learning.
Therefore, voice ID performance should be better for subjects trained in
AV compared with those trained in A.
Hypothesis 2: Talker-specific knowledge learned from voice ID training
should facilitate subsequent word recognition. Therefore, recognition of
spectrally degraded speech should be greater for trained than untrained
voices.
FIG 4. Average auditory word
recognition performance. Error
bars represent 95%
confidence interval.
CONCLUSIONS
• Auditory-visual talker information facilitated recognition of
spectrally degraded voices.
• Voice ID training significantly improved perception of spectrally
degraded voices.
• Improved familiarity with a talker’s voice through voice training
did not improve subsequent auditory recognition of spectrally
degraded speech.
FIG 6. Mean voice ID
performance. Error bars
represent 95% confidence
interval. The dotted line is at
chance performance (33%).
FIG 7. Pre- and post-training mean
key-word recognition performance.
Error bars represent 95% confidence
interval.
ACKNOWLEDGEMENTS
This work was supported by NIH NIDCD grant DC008875.
Channel   Start Freq (Hz)   Stop Freq (Hz)
1           188     313
2           313     563
3           563     813
4           813    1063
5          1063    1313
6          1313    1688
7          1688    2188
8          2188    2813
9          2813    3688
10         3688    4813
11         4813    6188
12         6188    7938

FIG 1. Overview of the CI simulator.
                  Auditory                      Auditory-visual
                  Talker Set 1   Talker Set 2   Talker Set 1   Talker Set 2   Total
Criterion met     2              8              6              8              24
Criterion unmet   6              0              2              0               8
              Talker Set 1                  Talker Set 2
              Auditory   Auditory-visual   Auditory   Auditory-visual
Mean          7          4.13              3.88       4.13
SD            2.08       1.25              1.36       2.75
t-statistic   t(9.55) = -3.19, p < .05     t(14) = .231, p > .05
FIG 2. A schematic of voice ID training and word recognition tasks.

FIG 5. A schematic of voice training and word recognition tasks used
for EXP 2: pre-training word recognition (A-only sentence
transcription) → familiarization (A or AV) → voice ID training with
feedback (A or AV), repeated for 4 sessions → voice ID test (A-only)
→ post-training word recognition (A-only sentence transcription).
Talker   Sex   Age   Mean F0 (Hz)   Min F0 (Hz)   Max F0 (Hz)
01       f     25    207             77           366
02       f     23    201             78           308
03       f     21    183            104           322
04       f     24    172             78           452
05       f     36    145             84           334
06       m     27    137             80           295
07       m     20    124             90           181
08       m     27    116             81           191
09       m     28    113            103           175
10       m     28    113             76           271
                 Auditory                      Auditory-visual
                 Talker Set 1   Talker Set 2   Talker Set 1   Talker Set 2   Total
Mean             6              4.25           5.67           4.5            4.83
SD               1.41           1.75           2.73           2.51           2.24
Mean by format   5                             4.6
SD by format     2.57                          1.78
t-statistic      t(21.99) = .451, p > .05