2. Problem Statement & Motivation
Importance of spoken English
English has a very high socio-economic impact: people who speak the language fluently
are reported to earn 30-50% more than peers who don't.
Grading spoken English in a scalable way is needed by companies, training organizations, and
individuals.
Problem Statement
Grade spontaneous English speech at scale, as accurately as human experts.
3. Why are automated methods not accurate?
Speaker-independent speech recognition for spontaneous speech is a hard problem!
6. Crowdsourcing task
Worker quality control
• Each worker is assigned a risk level that reflects the
quality of their past work.
• Based on this risk state, the system determines how many
gold-standard tasks to give the worker, and when.
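The slide gives the idea but not the exact state machine, so the sketch below is an illustrative assumption: risk is derived from a worker's gold-standard accuracy, and higher-risk workers are shown gold-standard tasks more often. The thresholds and rates are made up for illustration.

```python
import random

# Hypothetical gold-standard rates per risk state (not from the deck).
GOLD_RATE = {"low_risk": 0.1, "medium_risk": 0.3, "high_risk": 0.6}

class Worker:
    def __init__(self):
        self.passed = 0   # gold-standard tasks answered correctly
        self.failed = 0   # gold-standard tasks answered incorrectly

    @property
    def risk(self):
        total = self.passed + self.failed
        if total < 5:
            return "high_risk"          # too little history to trust
        accuracy = self.passed / total
        if accuracy >= 0.9:
            return "low_risk"
        return "medium_risk" if accuracy >= 0.7 else "high_risk"

    def next_task_is_gold(self, rng=random.random):
        # Higher-risk workers see gold-standard tasks more often,
        # which controls the money spent on verification.
        return rng() < GOLD_RATE[self.risk]
```

A high-reliability worker thus sees few paid gold-standard tasks, keeping verification cost proportional to the actual risk each worker poses.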
7. Supervised learning setup
Experiment Details
• Sample size: 566
• 319 from India
• 247 from the Philippines
Expert Grading
• Two expert raters
• Overall score based on Pronunciation, Fluency,
Content-Organization, and Grammar.
• Inter-rater correlation ~0.8.
The learning task
• Modelling done separately for the Indian and Philippine
sets.
• Linear ridge regression, neural networks, and SVM
regression with different kernels were used to build
the models.
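A minimal sketch of the ridge-regression setup described above, with k-fold cross-validation scored by correlation against expert grades (the metric quoted on the results slide). The feature matrix and grades here are synthetic stand-ins, not the paper's data, and the closed-form ridge solver is one of several techniques the slides mention.

```python
import numpy as np

# Synthetic stand-ins for the real features (FA + NLP + crowd grades)
# and the expert grades used as the supervised-learning target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = rng.normal(size=10)
y = X @ true_w + rng.normal(scale=0.1, size=200)

def ridge_fit(X, y, alpha=1.0):
    # Closed-form ridge solution: w = (X^T X + alpha I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def cv_correlation(X, y, alpha=1.0, k=5):
    # k-fold cross-validation; report the mean Pearson correlation
    # between predicted and expert grades on the held-out folds.
    folds = np.array_split(np.arange(len(y)), k)
    corrs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], alpha)
        corrs.append(np.corrcoef(X[test] @ w, y[test])[0, 1])
    return float(np.mean(corrs))

print(f"mean CV correlation: {cv_correlation(X, y):.2f}")
```

The same cross-validation loop applies unchanged if the ridge model is swapped for an SVM regressor or a neural network.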
8. Case study
• Studied a deployment of the proposed algorithm in the
Philippines.
• The event had 500 applicants for the role of
customer support executive. The scoring
algorithm was tested on a subset of 150 candidates.
• An internal expert graded each candidate’s speech as
hireable or not-hireable.
9. Features used
We use three classes of features:
• Forced Alignment features (FA)
• The speech sample is force-aligned against the crowdsourced transcription.
• Features such as rate of speech, position and length of pauses, log-likelihood of recognition, posterior probability,
hesitations, and repetitions are derived.
• Natural Language Processing features (NLP)
• Surface-level features: number of words, complexity or difficulty of words, and the number of common words
used.
• Semantic features: coherency of the text, context of the words spoken, sentiment of the text, and grammatical
correctness.
• Crowd Grades (CG)
• The crowd provides scores on pronunciation, fluency, content organization, and grammar.
• These grades are combined into a composite score.
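Two of these feature families can be sketched concretely. The alignment tuple format, the pause threshold, and the equal-weight composite below are illustrative assumptions, not the paper's exact definitions.

```python
# Forced-alignment output for one response: (word, start_sec, end_sec).
alignment = [("i", 0.0, 0.2), ("enjoy", 0.5, 0.9), ("speaking", 1.4, 2.0)]

def fa_features(alignment, min_pause=0.3):
    """Rate of speech and pause statistics from word timings."""
    duration = alignment[-1][2] - alignment[0][1]
    rate = len(alignment) / duration                  # words per second
    pauses = [b[1] - a[2] for a, b in zip(alignment, alignment[1:])]
    long_pauses = [p for p in pauses if p >= min_pause]
    return {"rate_of_speech": rate,
            "num_long_pauses": len(long_pauses),
            "mean_pause": sum(pauses) / len(pauses)}

def composite_crowd_grade(grades, weights=None):
    """Weighted combination of per-dimension crowd grades."""
    weights = weights or {k: 1.0 for k in grades}     # equal by default
    total = sum(weights.values())
    return sum(grades[k] * weights[k] for k in grades) / total

print(fa_features(alignment))
print(composite_crowd_grade(
    {"pronunciation": 4, "fluency": 3, "content_org": 4, "grammar": 5}))
```

In practice the weights would be chosen, or learned, to best predict the expert grades.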
10. Experiment and Results
Crowdsourced transcriptions + crowd grades outperform all other methods.
Accuracy nears inter-expert agreement (~0.8).
11. Summing it up
• Svar provides an automated assessment of a candidate’s pronunciation and fluency.
• Crowdsourcing, in addition to NLP features, yields reliable composite scores.
• Speech assessments can be made scalable with accuracy nearly matching experts’ opinion.
Editor's notes
- A high number of jobs in knowledge economies across the globe require English.
- Companies want to be able to test at scale.
- Training institutions need to test at scale and provide feedback.
A good transcription is needed to know what was spoken;
once we know what was spoken, we can compare the candidate's pronunciation with a good pronunciation of each word. But because automatic transcription is poor, we don't get to know what was spoken; this makes feature derivation inaccurate.
Purely automatic ML methods: ~0.5 correlation.
So we use two sets of features: one derived from aligning the speech sample with the crowd transcription, and the other taken directly from crowd grades.
Easy usability
This is where people transcribe
This is where people grade
We had a novel idea: give every Turker a state that reflects the worker's current reliability, based on their past gold-standard performance. A high-reliability worker sees fewer gold standards and a low-reliability worker sees more. This helps manage risk against the money spent on gold standards.
SUPERVISED LEARNING; the output is the expert grades, which our system tries to predict.
We use several techniques like NN, SVM, etc., with cross-validation.