Our talk at CHI 2015 in Seoul, South Korea. Find more information at www.kotarohara.com.
YouTube: https://youtu.be/isqsYLkX9gA
Makeability Lab: http://www.cs.umd.edu/~jonf/
Microsoft Research: http://research.microsoft.com/
ABSTRACT
Language barriers are the primary challenge for effective cross-lingual conversations. Spoken language translation (SLT) is perceived as a cost-effective alternative to less affordable human interpreters, but little research has been done on how people interact with such technology. Using a prototype translator application, we performed a formative evaluation to elicit how people interact with the technology and adapt their conversation style. We conducted two sets of studies with a total of 23 pairs (46 participants). Participants worked on storytelling tasks to simulate natural conversations with three different interface settings. Our findings show that collocutors naturally adapt their style of speech production and comprehension to compensate for inadequacies in SLT. We conclude the paper with the design guidelines that emerged from the analysis.
8. “The infrequent use of interpreters* in the delivery ward was among the most important reasons for the reduced quality of [health] care.”
Kale and Syed (2010)
* Interpreter: a person who translates the words that someone is speaking into a different language (Merriam-Webster)
11. Speech Recognition & Machine Translation Performance
Fügen et al. (2007) noted that “automatic systems can already provide usable information for people.”
12. Little research has paid attention to how people interact with spoken language translation technology.
16. Spoken Language Translation System (diagram: speech input → automatic speech recognition → machine translation → speech synthesis → translated text and speech; e.g., “Hello!” → “こんにちは”)
Hamon O., et al., End-to-End Evaluation in Simultaneous Translation, EACL 2009
18. (Photo of a sign: “Danger. Do not enter”)
Effect of machine translation on text-based communication
Gao et al. 2014; Yamashita et al. 2009; Yamashita and Ishida 2006
21. Key Point: The effect of spoken language translation systems on interlingual communication is under-explored.
22. Evaluation of NESPOLE! (Costantini et al. 2002)
• Examined usability of a spoken language translation system
• No detailed discussion on how people adapted in using the system
• Push-to-talk interface made it hard to analyze turn-taking behavior
23. Goal: Explore how spoken language translation affects a natural conversation between people speaking different languages.
26. Translator Tool | Demo
Video showing how the system works: one side speaks in English, the other in German.
Skype Translator Demo, English-German Conversation
28. Translator Tool | Interface Components
Video chat interface
29. Translator Tool | Interface Components
Closed Caption (CC)
30. Translator Tool | Interface Components
Translated Text-to-Speech (TTS) Audio
31. Translator Tool | Interface Components
Headset and Microphone
35. Study Method | Participants
8 French-German pairs (N=16 participants): no common language
15 English-German pairs (N=30 participants): with common language (English)
All participants could speak English reasonably well.
36. Study Method | Participants
I will mainly talk about the French-German (no common language) study.
37. Study Method | Conversation Task
German speaker: “Olivia trainierte für den Tanzwettbewerb.” (Olivia was practicing for the dance-off.)
38. Study Method | Conversation Task
A starting sentence was provided to the participant; the speech was then translated.
39. Study Method | Conversation Task
French speaker: “Je ne savais pas Olivia dansé. Combien de temps elle s'y adonne ?” (I didn’t know Olivia danced. How long has she been practicing?)
40. Study Method | Conversation Task
German speaker: “Seit sechs Jahren.” (For six years.)
French speaker: “C'est une longue période.” (That’s a long time.)
…
41. Study Method | Conversation Task
We asked participants to perform 9 conversation tasks (~3 min each) in their respective languages and an additional conversation task in English.
The goal of each task was to collaboratively construct a coherent story.
Tasks were conducted using three different settings of the translator tool.
42. Study Method | Interface Settings
• Closed Caption & Text-to-Speech (CC & TTS)
• Closed Caption only (CC)
• Text-to-Speech only (TTS)
43. With Closed Caption and Text-to-Speech
49. Study Method | Interface Settings
Round 1: Closed Caption & Text-to-Speech → Closed Caption → Text-to-Speech
Order was permuted for each pair for counterbalancing.
50. Study Method | Interface Settings
Rounds 1–3: 9 tasks with different interface settings and different starting sentences.
51. Study Method | Interface Settings
Eng: a baseline English task to compare with/without-translation settings.
52. Study Method | Data
Post-task survey (after every task). E.g., “We had a successful conversation.” (Strongly Disagree to Strongly Agree)
53. Study Method | Data
Post-round survey (after every 3 tasks): interface setting preference ranking (Best / Neutral / Worst) among CC & TTS, CC, and TTS.
54. Study Method | Data
Post-session survey and interview.
57. Quantitative Analysis | Analysis Method
3 × 3 × 2 mixed design study:
• Three interface settings (within-subject)
• Three rounds, nine tasks (within-subject)
• Two languages (between-subject)
58. Quantitative Analysis | Analysis Method
We transformed ordinal survey responses with the aligned rank transformation and analyzed the data with a restricted maximum likelihood model.
59. Quantitative Analysis | Interface Preference Results
Which interface setting did you favor the most and least?
(Closed Caption & Text-to-Speech, Closed Caption only, Text-to-Speech only)
60. Quantitative Analysis | Interface Preference Results
Most Favored Interface Setting by French-German Group (bar chart; percentage of people who favored each setting): Closed Caption & Text-to-Speech 59%, Closed Caption only 35%, Text-to-Speech only 6%.
61. Interface settings with closed caption were preferred.
62. 24% more people preferred to have text-to-speech.
63. Set Closed Caption & Text-to-Speech as the default, but allow users to turn the settings on/off.
65. Quantitative Analysis | Perceived Conversation Quality
Perceived Conversation Quality over Rounds (bar chart; 5-point Likert scale, higher is better): Round 1: 2.2, Round 2: 2.2, Round 3: 2.7, English: 4.8.
Participants felt that their conversation quality improved (Round 3 vs. Rounds 1 and 2, both p < 0.01; F2,112 = 6.275; error bars are standard errors).
66. Perceived conversation quality is not as good as that of the English task.
69. Content Analysis | Method
Post-session interviews were transcribed by the authors. Interview transcripts and survey responses were coded with a content analysis method; recurring themes were extracted and grouped together.
70. Content Analysis | Proper Noun Recognition Errors
62.5% of the participants noted that proper noun recognition errors hindered the conversation.
71. Content Analysis | Proper Noun Recognition Errors
“‘Ava’, there was no way it was getting ‘Ava’ no matter how many times I tried. So in German, when you say ‘but,’ it sounds very similar. So that’s what it was picking up.”
French-German Pair 4, German speaker
72. Lack of an adaptation technique
73. Content Analysis | Grammar and Word Order Errors
37.5% of the participants noted that grammatical errors and misplacement of words were problematic.
74. Content Analysis | Grammar and Word Order Errors
“Because [...] German sentence structure is so different, [the translated] sentence wouldn’t make any sense when [synthesized speech] said it, but when you see it on the screen, you can kind of reorganize the words in the way it supposed to go. And then it makes sense.”
French-German Pair 3, German speaker
75. People could find a fallback strategy.
76. Content Analysis | Information from Original Speech
31.3% of the participants noted that the original speech was useful.
77. Content Analysis | Information from Original Speech
“[Having partner’s original voice] humanizes the relationships as well, because if you only get the robotic voice, I think it just cuts you off from the person you are engaged with. Whereas you talk, then it feels much more human relationship.”
French-German Pair 1, French speaker
Original speech can be useful even when a person does not know the language.
78. Content Analysis | Forgiving
31.3% of the participants mentioned that they forgave the errors of the translator tool.
79. Content Analysis | Forgiving
“My sentence became shorter, I pronounced more clearly, I was speaking ... I mean, if I were speaking to someone who’s not a native German, and if I didn’t have a translator, then I would also slow down, obviously.”
French-German Pair 5, German speaker
81. Content Analysis | Speech Overlap
“My partner and I would begin to speak, then the translation from the previous statement would catch up; translation and speaker would then all be talking at the same time.”
English-German Pair 4, English speaker
82. Content Analysis | Speech Overlap
25.0% of the participants in the French-German group mentioned that speech overlap was a problem.
83. Content Analysis | Speech Overlap
French-German group: 25.0%; English-German group: 50.0%.
Participants from the English-German group perceived speech overlap as a more severe problem. This was partly because German speakers could respond without seeing/hearing the translation.
85. Design Guidelines
• Support Users’ Adaptation for Speech Production
• Offer Fallback Strategies with Non-verbal Input
• Support Comprehension of Messages
• Support Users’ Turn Taking
89. Summary
We used Skype Translator to elicit interaction
problems in using spoken language translation for
interlingual conversation.
Our analysis revealed how people adjusted
(or could not adjust) their behaviors to overcome
the problems.
makeability lab
90. Kotaro Hara, Shamsi T. Iqbal
Microsoft Skype Translator team
92. Image and Icon Credit
- Friendlies by Mo Riza (https://www.flickr.com/photos/moriza/2565606353)
- 133/2011 euroscola by rosipaw (https://www.flickr.com/photos/rosipaw/5768035647/)
- Japantimes (http://www.japantimes.co.jp/news/2014/05/08/world/social-issues-world/births-fall-economies-may-falter)
- Windows8core.com (http://www.windows8core.com/bing-translator-app-for-windows-8rt-lands-in-store-review/)
- Jigsaw puzzle by James Petts
- Speech bubble from Wikimedia Commons (http://commons.wikimedia.org/wiki/File:Speech_bubble.svg)
- Head silhouette from Channelweb (http://www.channelweb.co.uk/crn-uk/news/2244637/good-week-bad-week)
- Microphone from Uxrepo.com (http://uxrepo.com/icon/mic-by-mfg-labs)
- Document from IconArchive (http://www.iconarchive.com/show/mono-general-2-icons-by-custom-icon-design/document-icon.html)
- Funny Signs from Fanpop (http://www.fanpop.com/clubs/picks/images/1771904/title/funny-signs-photo)
- Compass Zoomed by Gwgs (https://www.flickr.com/photos/gwgs/346806882)
- Beaker by Emily van den Heever (https://thenounproject.com/EmvdHeever/)
Language is an important component of spoken communication. Successful communication is key for education, business, public health, and medicine. But language barriers can be a challenge.
https://www.flickr.com/photos/moriza/2565606353
Human interpreters can translate multilingual conversations,
https://www.flickr.com/photos/rosipaw/5768035647/in/photolist-9MGG3t-8eBocz-dXBimX-8JZ3M9-cTZyQQ-cTZphm-cTZyqj-cTZsdw-cTZrhd-cTZqnU-cTZqLh-cTZvvY-cTZngE-cTZuUw-cTZoqf-cTZC2L-cTZAX9-cTZkYu-cTZq4f-cTZveo-cTZzt9-cTZoEJ-cTZtMm-cTZwQf-cTZvLf-cTZpJw-cTZzPm-cTZx3m-cTZAxG-cTZu2G-cTZy1o-cTZz7C-nz7Hr4-fTxJ4a-cTZoUh-cTZqZu-cTZrvb-cTZBFG-cTZtq9-cTZvYf-cTZo8W-cTZxJN-cTZAHU-cTZumA-cTZrXo-cTZwvN-cTZxqb-cTZmBN-cTZnBm-cTZwdj
but they are usually expensive and are not readily available.
https://www.flickr.com/photos/rosipaw/5768035647/in/photolist-9MGG3t-8eBocz-dXBimX-8JZ3M9-cTZyQQ-cTZphm-cTZyqj-cTZsdw-cTZrhd-cTZqnU-cTZqLh-cTZvvY-cTZngE-cTZuUw-cTZoqf-cTZC2L-cTZAX9-cTZkYu-cTZq4f-cTZveo-cTZzt9-cTZoEJ-cTZtMm-cTZwQf-cTZvLf-cTZpJw-cTZzPm-cTZx3m-cTZAxG-cTZu2G-cTZy1o-cTZz7C-nz7Hr4-fTxJ4a-cTZoUh-cTZqZu-cTZrvb-cTZBFG-cTZtq9-cTZvYf-cTZo8W-cTZxJN-cTZAHU-cTZumA-cTZrXo-cTZwvN-cTZxqb-cTZmBN-cTZnBm-cTZwdj
This is a problem in some domains. For example, researchers reported that “the infrequent use of interpreters in the delivery ward was among the most important reasons for the reduced quality of health care.”
-- Image
http://www.japantimes.co.jp/news/2014/05/08/world/social-issues-world/births-fall-economies-may-falter/#.VTJmZmSqqko
Automatic spoken language translation is expected to be a cost-effective alternative.
http://www.windows8core.com/bing-translator-app-for-windows-8rt-lands-in-store-review/
Decades of research have improved the performance of the technology, and some researchers note that it is usable in limited domains.
But it turns out that limited research has examined how people interact with automatic spoken language translation systems.
This is the void we are trying to fill.
https://www.flickr.com/photos/14730981@N08/11919275515/in/photolist-98g5LU-VsJi-jagpmg-7h1rhP-8235Xr-6oGa5H-2PDGfv-dLfwLV-cJtbW-dCt5fq-4je4XQ-9aTAwj-cHWLp-andneR-6RG4Y8-TxcXa-8e1tCA-dVTZKV-5ubCmB-rQ7V3-5g6PTY-5s7AQJ-6jBtmv-bbSdSF-4qFLRz-5TZQTM-4qB8zB-d4W64-9hGRpw-5mueLj-nYFZGZ-drdDdf-okMnQt-4gthEa-5fa48-5ks8JE-9U3veb-e3KE4-5g2tSn-6NiA1j-euJWx-HQB4-doc317-4tmQPr-65AJiP-65EZSJ-gUw58c-E6fPg-7bfimU-6stS7R/
Spoken language translation systems have been studied and developed for a long time. A typical system works like this: first, you say something in your language, for example, “Hello.” A machine automatically recognizes the speech. The recognized speech gets translated and turned into speech, text, or both.
-- Images
Speech bubble: http://commons.wikimedia.org/wiki/File:Speech_bubble.svg
Head silhouette: http://www.channelweb.co.uk/crn-uk/news/2244637/good-week-bad-week
Microphone: http://uxrepo.com/icon/mic-by-mfg-labs
Document: http://www.iconarchive.com/show/mono-general-2-icons-by-custom-icon-design/document-icon.html
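The ASR, MT, and TTS stages described above can be sketched in code. This is an illustrative toy, not Skype Translator's actual API: all three stage functions are hypothetical stand-ins (a real system replaces each with a trained model), and the tiny English-Japanese lexicon exists only so the example runs end to end.

```python
# A minimal sketch of the spoken language translation pipeline:
# ASR -> MT -> TTS. Every function here is a hypothetical placeholder.

def recognize_speech(audio: str) -> str:
    """Hypothetical ASR stage: audio -> source-language text."""
    return audio  # placeholder: assume the audio is already transcribed

def translate(text: str, src: str, dst: str) -> str:
    """Hypothetical MT stage: source text -> target-language text."""
    toy_lexicon = {("en", "ja"): {"Hello!": "こんにちは"}}
    return toy_lexicon.get((src, dst), {}).get(text, text)

def synthesize(text: str) -> bytes:
    """Hypothetical TTS stage: target text -> audio bytes."""
    return text.encode("utf-8")  # placeholder for a waveform

def spoken_language_translation(audio: str, src: str, dst: str):
    text = recognize_speech(audio)          # 1. automatic speech recognition
    translated = translate(text, src, dst)  # 2. machine translation
    speech = synthesize(translated)         # 3. speech synthesis
    return translated, speech               # caption (CC) and audio (TTS)

caption, audio = spoken_language_translation("Hello!", "en", "ja")
print(caption)  # こんにちは
```

Returning both the translated text and the synthesized audio mirrors the tool's two output channels: the closed caption and the text-to-speech playback, each of which could be toggled independently in the study.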
Previous work studied the effect of machine translation on text-based communication.
Yamashita et al. found that machine translation can make it difficult for people to understand each other; they also found that people are unaware of miscommunication.
[8] Gao, G., Xu, B., Cosley, D., and Fussell, S.R. How Beliefs About the Presence of Machine Translation Impact Multilingual Collaborations. Proc. of CSCW (2014), 1549–1560.
[25] Yamashita, N., Inaba, R., Kuzuoka, H., and Ishida, T. Difficulties in Establishing Common Ground in Multiparty Groups Using Machine Translation. Proc. of CHI (2009), 679– 688.
[26] Yamashita, N. and Ishida, T. Effects of Machine Translation on Collaborative Work. Proc. of CSCW (2006), 515–524.
Previous work by Gao et al. reported benefits of presenting real-time transcription of English conversation between native and non-native speakers. Change in native speakers’ speaking behavior and transcription helped non-native speakers to better comprehend conversation.
[9] Gao, G., Yamashita, N., Hautasaari, A.M.J., Echenique, A., and Fussell, S.R. Effects of Public vs. Private Automated Transcripts on Multiparty Communication Between Native and Non-native English Speakers. Proc. of CHI (2014), 843–852.
[17] Pan, Y., Jiang, D., Picheny, M., and Qin, Y. Effects of Real-time Transcription on Non-native Speaker’s Comprehension in Computer-mediated Communications. Proc. of CHI (2009), 2353–2356.
Although informative, these works investigated how each technology affects interlingual conversation,
-- Images
Speech bubble: http://commons.wikimedia.org/wiki/File:Speech_bubble.svg
Head silhouette: http://www.channelweb.co.uk/crn-uk/news/2244637/good-week-bad-week
Microphone: http://uxrepo.com/icon/mic-by-mfg-labs
Document: http://www.iconarchive.com/show/mono-general-2-icons-by-custom-icon-design/document-icon.html
but they did not study the effect of a spoken language translation system as a whole.
Most relevant to this talk is the evaluation of NESPOLE, a spoken-language translation system.
But there was no detailed discussion about how people adapted to using such systems.
Also, their push-to-talk interface made it difficult to understand its effect on turn taking in natural conversation.
[3] Costantini, E., Pianesi, F., and Burger, S. The added value of multimodality in the NESPOLE! speech-to-speech translation system: an experimental study. IEEE ICMI, (2002), 235–240.
So the goal of this research is to explore how spoken language translation affects natural interlingual conversation.
- What are the overall impressions on quality and challenges in using the translator tool?
- Do people adapt the way they use the translator tool?
- How do people cope with erroneous speech recognition and translation?
-- Images
https://www.flickr.com/photos/gwgs/346806882/in/photolist-8bpcwF-8vehvb-5yM2pE-e4fun8-apgpCT-4Rfevi-8CATmQ-2ayKFB-8zGRD5-di4W8P-6mKLG7-5GPFbd-gMUP-8r5kN-8MrtxC-4qUQRH-oTP2cU-26RpWu-3KBnLW-8mHxhM-ikWUr8-gMZJ-bGyQr4-64KPXS-4TLNQY-5Cb7xa-5CsUyY-bUZpUc-6J91J4-8vbd3v-adssy-oY1gpY-nUJWgi-hPRLK-B4X9F-bPHTU-ikW2yY-6uT8i4-4cUNBm-msahsP-bHX4zT-9BRZ3-64KEES-4mpLN-49d1GA-9y2ppL-6KL7Vz-7QvR1Y-p5ywhm-wDtB1
In the rest of my talk, I will introduce the translator tool that we used in this study. I will describe the study method. Then I will talk about quantitative analysis and content analysis.
For this study, we used a prototype spoken language translation tool called Skype Translator.
Let me show you how it worked. I will show you a video from our study. Here, an English speaker and a German speaker talk to each other using the tool. A person starts a conversation with a starting sentence we gave to them, then they have a natural conversation.
As you have seen, the translator has a video chat interface.
A closed caption shows the recognized and translated speech.
Translated speech could also be played through text-to-speech synthesis. Both text-to-speech synthesis and the closed caption could be turned on and off.
In this study, people were asked to wear a microphone.
To investigate how the translator tool affects interlingual conversation, we recruited 8 French-German pairs and 15 English-German pairs.
The French-German group represented the case where people did not share any common language.
The English-German group represented the case where both people spoke English, but one side preferred not to speak it.
Everyone in the study could speak English reasonably well.
Because both study settings were similar, I will mainly talk about the French-German study.
To simulate a natural conversation between two speakers, we asked them to perform the following conversation task.
Let’s say a German speaker started a conversation with “Olivia was practicing for the dance-off”.
The starting sentence was provided to the speaker. The speech was then translated into French.
Then the French speaker responded. “I didn’t know Olivia danced!” The response got translated.
And so on.
The goal of the task was to collaboratively construct a coherent story
We asked participants to perform 9 conversation tasks in their languages. And after the 9th task, they performed the same task in English.
Tasks were conducted using three different settings of the translator tool.
To study the effect of closed caption and text-to-speech features, we asked each participant to use three different interface settings.
Closed-caption with Text-to-Speech synthesis, Closed caption only, and text-to-speech only.
With both closed caption and text-to-speech on, participants could both see and hear the translated speech.
In the closed caption only setting, participants could see the translation, but did not hear the synthesized translation.
In the text-to-speech only setting, participants could hear the translated audio, but could not read the translation.
Participants were exposed to each setting. In this case, a pair was first exposed to closed caption and text-to-speech setting. Then closed caption only setting. Then text-to-speech only setting.
Then they repeated the same thing for two more rounds. In total, they worked on 9 conversation tasks.
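The per-pair permutation of setting orders (slide 49) can be sketched as follows. This is a hypothetical assignment scheme shown only to illustrate counterbalancing; the paper does not specify the exact procedure used to permute orders across pairs.

```python
from itertools import permutations

SETTINGS = ("CC & TTS", "CC", "TTS")

def setting_order(pair_id: int):
    """Cycle each pair through one of the 3! = 6 possible setting orders."""
    orders = sorted(permutations(SETTINGS))
    return list(orders[pair_id % len(orders)])

# Each pair keeps its assigned order across all three rounds
# (3 rounds x 3 settings = 9 conversation tasks).
for pair_id in range(6):
    print(pair_id, setting_order(pair_id))
```

Cycling through all six orderings spreads any learning or fatigue effects evenly across the three interface settings.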
Then at the very end, they performed a baseline English task.
Our primary goals were to understand users’ interface preference and if they became used to using the tool.
As I described, there were three interface settings, three rounds, and two languages. This was a 3 × 3 × 2 mixed design; interface settings and rounds were within-subject factors and language was a between-subject factor.
We transformed ordinal data using the aligned rank transformation and analyzed the data with a restricted maximum likelihood model.
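As a rough illustration of this analysis pipeline, here is a simplified sketch. It is not the paper's actual analysis: a full aligned rank transformation (as in Wobbrock et al.'s ARTool) aligns the data separately for each effect before ranking, whereas this toy simply rank-transforms synthetic Likert scores and fits a mixed-effects model with statsmodels' MixedLM under REML. All column names and data below are made up.

```python
# Simplified stand-in for rank-transform + REML mixed-model analysis.
# Synthetic data: 10 pairs x 3 rounds x 3 interface settings.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import rankdata

rng = np.random.default_rng(0)
rows = []
for pair in range(10):                      # random intercept per pair
    for rnd in (1, 2, 3):                   # within-subject: round
        for ui in ("CCTTS", "CC", "TTS"):   # within-subject: interface
            score = min(5, max(1, 2 + (rnd == 3) + rng.integers(-1, 2)))
            rows.append({"pair": pair, "round": rnd, "ui": ui, "score": score})
df = pd.DataFrame(rows)

df["ranked"] = rankdata(df["score"])        # rank transform (ties -> mean rank)
model = smf.mixedlm("ranked ~ C(ui) * C(round)", df, groups=df["pair"])
result = model.fit(reml=True)               # restricted maximum likelihood
print(result.summary())
```

Ranking before fitting keeps the ordinal Likert responses from being treated as interval data, while the random intercept per pair accounts for the repeated measures.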
We first looked at the interface preference that we collected after each round. We basically asked, “Which interface setting did you favor most and least?”
I will show you which setting people preferred. X-axis represents interface settings. Y-axis shows the percentage of people who preferred the setting.
59% preferred having both closed caption and text-to-speech. 35% preferred having only closed caption. Only 6% preferred having text-to-speech only.
The preference was split between two settings with closed caption.
But closed caption and text-to-speech setting was preferred more.
So if you are a system designer, you should set closed caption and text-to-speech to the default, but allow people to turn on and off the setting.
I will show you if people thought the conversation quality improved. X-axis shows rounds. Y axis shows the average score.
In the first round, the average score was 2.2, the same as in round 2. But the score improved to 2.7 in the third round.
The third round was rated significantly more successful than the first and the second, which means people thought they had better conversations.
But compared to the baseline English conversation, it was not as good.
I will discuss why people were satisfied or dissatisfied with the experience. I will also present how people adapted to using the translator.
We transcribed the post-session interview.
We coded the transcript and free text response in the survey. We extracted the recurring themes and grouped them together.
One of the most prominent complaints was proper noun recognition: 62.5% of participants mentioned it was a problem.
One participant said: Ava, there was no way it was getting Ava no matter how many times I tried. So in German, when you say ‘but,’ it sounds very similar. So that’s what it was picking up.
This was a severe problem because there wasn’t any good way to work around it.
Another prominent problem was grammar and word order errors.
A German speaker noted that: “Because German sentence structure is so different, translated sentence wouldn’t make any sense when synthesized speech said it”
“But when you see it on the screen, you can kind of reorganize the words in the way it supposed to go. And then it makes sense.”
This was a problem, but people could cope with it by reading the text on screen once they got used to it.
Other adaptation techniques included getting information from the original speech. 31.3% of people found it helpful.
One person said “having partner’s original voice humanizes the relationships as well, because if you only get the robotic voice, I think it just cuts you off from the person you are engaged with. Whereas you talk, then it feels much more human relationship.”
People were also forgiving toward the imperfect system.
One person said “My sentence became shorter, I pronounced more clearly, I was speaking.. I mean, if I were speaking to someone who’s not a native German, and if I didn’t have a translator, then I would also slow down, obviously.”
So far I have only talked about the study results from the French-German group. We found some differences between the two groups, so I want to introduce them.
An English speaker said that “my partner and I would begin to speak then the translation from the previous statement would catch up, translation and speaker would then all be talking at the same time.”
This happened because of system latency. 25% of participants from the French-German group noted that speech overlap was a problem. This is a big portion, but…
A higher portion of people from the English-German group said it was a problem, showing that the problem was more severe in the latter group.
Some of these guidelines have been applied to later versions of Skype Translator.
Our German participants in the English-German group had reasonable English proficiency. This may have impacted their experience with using the tool.
We did not compare people’s experience using the translator tool with their experience conversing through an interpreter. This would have been a better control setting.
Our analysis looked into data from surveys and interviews. Conversation analysis would complement this work.
Follow-up summative studies should be conducted to confirm and extend our findings.
https://www.flickr.com/photos/moriza/2565606353