This document summarises Himanshu Bansal's internship project on developing a Speech and Eye Tracking Enabled Computer Assisted Translation (SEECAT) system. The internship involved developing techniques for word-fixation remapping and for mutual disambiguation between eye gaze and speech recognition, to improve machine translation and automatic speech recognition. Experiments on a reading task and on translation tasks from English into other languages found significant improvements in eye gaze and speech recognition accuracy when information from eye tracking and speech recognition was integrated, compared to using either alone. The best performing configuration paired an unweighted gaze bag-of-words with a weighted ASR lattice (UGWA).
Intern presentation
1. Presentation on Internship Work
Speech and Eye Tracking Enabled Computer Assisted Translation
(SEECAT)
Copenhagen Business School
By: Himanshu Bansal
4. BACKGROUND
We need translation
To convey our thoughts to foreign-language speakers
To understand foreign language text and speech
-------------------------------------------------
Training data for automated systems
To prepare high-quality manuscripts of the same text in different languages
8. BACKGROUND
SEECAT as an extension of CASMACAT
Translator reads a source text on a computer screen and speaks out the translation
in the target language, a process called sight translation. This sight translation
process is supported by an Automatic Speech Recognition (ASR) and a Machine
Translation (MT) system, which transcribe the spoken speech signal into the target
text and which assist the translator with partial translation proposals, predictions
and completions on the computer monitor. An eye-tracking device follows the
translator's gaze path on the screen, detects where he or she faces translation
problems, and triggers reactive assistance.
9. STRUCTURE OF INTERNSHIP
21 May- 7 Jun
Lectures and hands on sessions (CBS)
8 Jun- 28 Jun
Divided into teams, worked at a summer house (Nykøbing Falster)
29 Jun- 21 July
Integration (CBS)
# Excursions planned for every weekend
10. GAZE TEAM
Himanshu, Kritika and Rucha
Part -1
Word- Fixation Remapping
Part-2
Mutual disambiguation between gaze and speech
11. Word- Fixation Remapping
Word-fixation mapping is useful for cognitive/linguistic research and usability
studies, and most importantly for adding interactivity to the system
12. Word- Fixation Remapping
Issues
Identification of the Fixations in a stream of gaze samples.
Mapping the Fixations to words/characters (Dealing with variable error)
Evaluation scheme for the fixation mapping
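One common way to handle the first issue, identifying fixations in the raw stream of gaze samples, is dispersion-based detection (I-DT). The sketch below is illustrative and not necessarily the algorithm used in the project; the dispersion and duration thresholds are assumptions.

```python
# Illustrative dispersion-threshold (I-DT) fixation detection.
# samples: list of (timestamp_ms, x, y); thresholds are assumptions.

def detect_fixations(samples, max_dispersion=35, min_duration=100):
    """Return fixations as (start_ms, end_ms, centroid_x, centroid_y)."""
    fixations = []
    i, n = 0, len(samples)
    while i < n:
        # Grow an initial window covering at least min_duration
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j >= n:
            break
        xs = [s[1] for s in samples[i:j + 1]]
        ys = [s[2] for s in samples[i:j + 1]]
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) <= max_dispersion:
            # Extend the window while the dispersion stays under threshold
            while j + 1 < n:
                xs.append(samples[j + 1][1])
                ys.append(samples[j + 1][2])
                if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                    xs.pop()
                    ys.pop()
                    break
                j += 1
            fixations.append((samples[i][0], samples[j][0],
                              sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j + 1
        else:
            i += 1  # no fixation starts here; slide the window by one sample
    return fixations
```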
17. Word- Fixation Remapping
Our Approach -> Evaluation
● Input:
– Manually annotated fixation to word mapping (Gold Standard)
– Machine computed fixation to word mapping
● Output:
– The average character/word error.
● Method:
– Compute the overlaps in the gaze fixation durations in the manual and
machine annotations.
– For the overlapping fixations, compute the absolute differences in the
cursor positions of the two mappings.
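The evaluation method above can be sketched as follows; the fixation representation (start, end, mapped word index) and the use of word indices as cursor positions are simplifying assumptions.

```python
# Sketch of the evaluation: for every pair of temporally overlapping
# fixations in the gold and machine mappings, accumulate the absolute
# difference of the mapped cursor (word-index) positions.

def average_mapping_error(gold, machine):
    """gold, machine: lists of (start_ms, end_ms, word_index)."""
    total_err, n_overlaps = 0, 0
    for g_start, g_end, g_word in gold:
        for m_start, m_end, m_word in machine:
            overlap = min(g_end, m_end) - max(g_start, m_start)
            if overlap > 0:  # the two fixations overlap in time
                total_err += abs(g_word - m_word)
                n_overlaps += 1
    return total_err / n_overlaps if n_overlaps else 0.0
```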
20. Mutual disambiguation between gaze and speech
Motivation
Ambiguity in gaze
Variable error
- Midas touch
System errors
- Eye tracker
- Algorithm
- Calibration
Ambiguity in ASR
Domain of training data
Accent of speaker
Morphology of language
Speaking style
- Co-articulation
21. Mutual disambiguation between gaze and speech
Motivation
Consider a simple example:
● User reads the text “where is the bat”
● ASR can help map gaze points to words
● Gaze can help disambiguate ASR output
(Figure: the stimulus dialogue reads "Where is the bat. / There it is, behind the
door. / I can't find it! Where is it? / Look properly! It's right there." The
possible words being gazed include "Here is the mat" and "Here is the bat"; the
ASR hypotheses are "where is the bat" and "where is a pat". The intersection of
the gazed words with the ASR hypotheses resolves both ambiguities.)
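The intersection idea on this slide can be sketched as a simple two-way re-ranking; the bag-of-words contents below are illustrative, not the project's actual data structures.

```python
# The gaze bag-of-words re-ranks the ASR hypotheses, and the chosen
# hypothesis in turn narrows down the possible gazed words.

def rerank_asr(hypotheses, gaze_bow):
    """Pick the hypothesis sharing the most words with the gaze bag-of-words."""
    return max(hypotheses, key=lambda h: len(set(h.split()) & gaze_bow))

def filter_gaze(gaze_bow, hypothesis):
    """Keep only the gazed-word candidates that the hypothesis supports."""
    return gaze_bow & set(hypothesis.split())

gaze_bow = {"here", "where", "is", "the", "bat", "mat"}
hypotheses = ["where is a pat", "where is the bat"]

best = rerank_asr(hypotheses, gaze_bow)  # gaze resolves the ASR ambiguity
gazed = filter_gaze(gaze_bow, best)      # ASR resolves the gaze ambiguity
```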
22. Mutual disambiguation between gaze and speech
Inspiration from literature research
Meyer et al. studied eye movements in an object naming task and showed that
people consistently fixate objects prior to naming them.
Griffin showed that when multiple objects were named in a single utterance,
speech about one object was produced while the next object was being fixated
and lexically processed.
23. Mutual disambiguation between gaze and speech
Experiments -> Reading Task
• 5 participants read English Text
• Eye Gaze and Speech Recorded
• 6 sets of 11 sentences
• 5 sets in domain and 1 out of domain
24. Mutual disambiguation between gaze and speech
Experiments -> Translation Task
• 4 participants translated English Text
• 4 sets of 10 very simple sentences each
• Target languages: Hindi (Hi), Spanish (Sp), Danish (Da), Italian (It)
• Eye gaze on source language words and speech in target languages recorded
25. Mutual disambiguation between gaze and speech
Approach
ASR word lattice
Reference sentence: Leaving next day in the morning
28. Mutual disambiguation between gaze and speech
Approach
Composed word-lattice
Reference sentence: Leaving next day in the morning
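The slides compose the ASR word lattice with the gaze information using OpenFST; the plain-Python sketch below only mimics that effect on an n-best list rather than a true lattice, with illustrative costs in a tropical-semiring style (costs add, lower total wins).

```python
# n-best rescoring in the spirit of lattice composition: each hypothesis
# carries an ASR cost, and every word unsupported by gaze adds a penalty.
# All costs here are made up for illustration.

def compose_scores(nbest, gaze_bow, gaze_penalty=1.0):
    """nbest: list of (hypothesis, asr_cost); returns the list re-sorted by total cost."""
    rescored = []
    for hyp, asr_cost in nbest:
        unsupported = [w for w in hyp.split() if w not in gaze_bow]
        rescored.append((hyp, asr_cost + gaze_penalty * len(unsupported)))
    return sorted(rescored, key=lambda pair: pair[1])

nbest = [("leaving next day in a morning", 2.0),
         ("leaving next day in the morning", 2.5)]
gaze_bow = {"leaving", "next", "day", "in", "the", "morning"}

best_hyp, best_cost = compose_scores(nbest, gaze_bow)[0]
```

Here the gaze support flips the ranking: the acoustically cheaper hypothesis contains the unsupported word "a" and ends up more expensive than the one matching the reference sentence.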
29. Mutual disambiguation between gaze and speech
System
• Performed experiments on Translog
• Speech hypotheses are provided by the AT&T Watson server
• Transformed this format into the word-lattice format using Python
• Generated the bag of words from x, y coordinates using our Part 1 algorithm, in C#
and Python
• In the translation tasks, the gaze bag of words is transformed into a target-language
bag of words using lexicons (one more level of ambiguity)
• Composed these lattices using OpenFST
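The lexicon step for the translation tasks can be sketched as follows; the toy English-Spanish lexicon is purely illustrative.

```python
# Translating the gazed source-language bag of words into a target-language
# bag of words through a lexicon; every translation alternative is kept,
# which is the extra level of ambiguity the slide mentions.

def translate_bow(source_bow, lexicon):
    """Expand each gazed source word into all of its lexicon translations."""
    target_bow = set()
    for word in source_bow:
        target_bow.update(lexicon.get(word, []))
    return target_bow

# Toy English-Spanish lexicon (illustrative only)
lexicon = {"where": ["donde"], "is": ["es", "esta"], "the": ["el", "la"]}
```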
30. Mutual disambiguation between gaze and speech
System -> Experiments with the algorithm
• Weights of gaze words: consider them or not
• Weights of ASR words: consider them or not
Then used a Latin square ->
• Unweighted ASR with Weighted Gaze Bag-of-words (WGUA)
• Unweighted ASR with Unweighted Gaze Bag-of-words (UGUA)
• Weighted ASR with Weighted Gaze Bag-of-words (WGWA)
• Weighted ASR with Unweighted Gaze Bag-of-words (UGWA)
31. Mutual disambiguation between gaze and speech
Evaluation
ASR
SCLITE was used to get the word accuracies of the n-best hypotheses
with respect to the reference sentence.
Eye Gaze
Precision = |Wg ∩ Wr| / |Wg|
Recall = |Wg ∩ Wr| / |Wr|
F-Measure (harmonic mean of precision and recall)
Sentence Recognition Error (SRI; 1 or 0)
Wg = unique words in the gazed words
Wr = unique words in the reference sentence
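The gaze evaluation metrics (precision, recall and F-measure over the unique gazed words Wg versus the unique reference words Wr) can be sketched as below; the exact 0/1 criterion behind SRI is an assumption here.

```python
# Gaze evaluation metrics over unique word sets, as defined on the slide.
# The 0/1 SRI criterion used here (exact set match) is an assumption.

def gaze_metrics(gazed_words, reference_words):
    wg, wr = set(gazed_words), set(reference_words)
    inter = wg & wr
    precision = len(inter) / len(wg) if wg else 0.0
    recall = len(inter) / len(wr) if wr else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    sri = 1 if wg == wr else 0
    return precision, recall, f_measure, sri
```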
32. Mutual disambiguation between gaze and speech
Research Design
Reading Task
Independent Variables
Domain of test data
Weights of ASR
Weights of Eye Gaze
Dependent Variables
Gaze f-measure
Gaze SRI
ASR Word accuracy
Translation Task
Independent Variables
Target language
Weights of ASR
Weights of Eye Gaze
Dependent Variables
Gaze f-measure
Gaze SRI
ASR Word accuracy
33. Mutual disambiguation between gaze and speech
Reading - Paired t-test (p-values; "no improvement" where scores did not improve)

WGUA            In domain     Out of domain   All
Gaze_Precision  8.63075E-07   no improvement  9.2475E-07
Gaze_Recall     0.048194288   no improvement  0.048220416
Gaze_F-Measure  1.68557E-06   no improvement  1.7924E-06
ASR_WrdAccr     0.040033133   0.86206786      0.067110316

UGWA            In domain     Out of domain   All
Gaze_Precision  8.63075E-07   no improvement  9.2475E-07
Gaze_Recall     0.048194288   no improvement  0.048220416
Gaze_F-Measure  1.68557E-06   no improvement  1.7924E-06
ASR_WrdAccr     0.007594247   0.86206786      0.017268861

UGUA            In domain     Out of domain   All
Gaze_Precision  8.63075E-07   no improvement  9.2475E-07
Gaze_Recall     0.048194288   no improvement  0.048220416
Gaze_F-Measure  1.68557E-06   no improvement  1.7924E-06
ASR_WrdAccr     0.040033133   0.86206786      0.067110316

WGWA            In domain     Out of domain   All
Gaze_Precision  8.63075E-07   no improvement  9.2475E-07
Gaze_Recall     0.048194288   no improvement  0.048220416
Gaze_F-Measure  1.68557E-06   no improvement  1.7924E-06
ASR_WrdAccr     0.040033133   0.363217468     0.099456245
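The p-values above come from paired t-tests of scores before versus after integration. A stdlib-only sketch of the paired t statistic (the p-value would then be read off a t-distribution with n-1 degrees of freedom, e.g. via scipy.stats.ttest_rel):

```python
import math
from statistics import mean, stdev

# Paired t statistic for per-item scores before vs. after integration.
# scipy.stats.ttest_rel computes both the statistic and the p-value at once.

def paired_t(before, after):
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```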
37. Mutual disambiguation between gaze and speech
Conclusions
Reading Task
• Significant improvement in both Gaze F-measure and ASR accuracy after
integration
• Gaze recall fell significantly
• SRI also improved
• The improvement on the in-domain task was much larger than on the out-of-domain task
• Of the four experiments, UGWA was observed to be the best
38. Mutual disambiguation between gaze and speech
Conclusions
Translation Task
• Significant improvement in Gaze F-measure only, for all languages
• ASR accuracy improved, but not significantly
• For Hindi and Danish, SRI decreased considerably
• Again UGWA was observed to be the best (for 3 of the 4 languages)
39. Mutual disambiguation between gaze and speech
Overview flowchart
Gaze pipeline: a static text reading experiment is run; eye gaze data is captured
with Translog, and x, y coordinates are taken from the already-logged files. The
fixation-word remapping algorithm turns these into a bag of words. EVALUATION:
fixation-duration intersection between machine and manual mappings (3 manual
annotations, 1 machine), and comparison of the bag of words with the bag of words
of the reference sentence (precision and recall).
ASR pipeline: audio is recorded at sentence level and sent to the Watson server,
which returns the 10-best hypotheses; these are converted into word lattices.
EVALUATION: the 1st-best hypothesis is compared with the reference text (edit
distance, SCLITE).
Disambiguation: the gaze bag of words is composed with weighted and unweighted
ASR lattices, producing an improved bag of words (eye gaze disambiguation) and an
improved hypothesis (ASR disambiguation), each evaluated as above. A majority-based
sentence identification was also tried.
40. LEARNING
Academic
Worked with Tobii T-60
Experiment Design
Python
Latex
Moses
Translog
Putty
Cygwin
Audacity
OpenFST
Got an idea of MT and ASR
Personal
Communication Skills
Project management
morning reporting
presentations
weekly targets and checks
Kayaking
Two string kite
Bit of Cooking