Speech Recognition
Amit Sharma
1310751033
CSE 8th
SPEECH RECOGNITION
A process that enables computers to recognize
and translate spoken language into text. It is also
known as "automatic speech recognition" (ASR),
"computer speech recognition", or just "speech to text"
(STT).
APPLICATIONS
• Medical Transcription
• Military
• Telephone and similar domains
• Serving the disabled
• Home automation system
• Automobile
• Voice dialing (“Call home” )
• Data entry (“A pin number”)
• Speech to text processing (“word processors, emails”)
RECOGNITION PROCESS
Voice Input → Analog-to-Digital Conversion → Acoustic Model → Language Model → Speech Engine → Output (with feedback)
HOW DO HUMANS DO IT ?
Articulation produces sound
waves which the ear conveys
to the brain for processing
HOW MIGHT COMPUTERS DO IT ?
Acoustic waveform → Acoustic signal → Speech recognition
• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
FLOW SUMMARY OF RECOGNITION PROCESS
 User Input:
The system captures the user’s voice as an analog acoustic signal.
 Digitization:
Digitize the analog signal.
 Phonetic Breakdown:
Break the signal into phonemes.
FLOW SUMMARY OF RECOGNITION PROCESS
 Statistical Modeling:
Map phonemes to their phonetic representation using a statistical model.
 Matching:
Based on the grammar, phonetic representation, and dictionary, the system
returns a word plus a confidence score.
TYPES OF SPEECH RECOGNITION
• SPEAKER INDEPENDENT:
Recognizes the speech of a large group of people
• SPEAKER DEPENDENT:
Recognizes speech patterns from only one person
• SPEAKER ADAPTIVE:
The system usually begins with a speaker-independent model and
adjusts it more closely to each individual during a brief training period
Approaches to SR
• Template-based
• Statistics-based
Template-based approach
• Store examples of units (words, phonemes),
then find the example that most closely fits the
input
• Just a complex similarity matching problem (see the sketch below)
• OK for discrete utterances and a single user
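A minimal sketch of the matching idea above, assuming each utterance has already been reduced to a fixed-length feature vector (a simplification; real template systems use dynamic time warping or similar alignment):

```python
# Pick the stored template whose feature vector is closest to the input.
import numpy as np

def closest_template(utterance_features, templates):
    """templates: dict mapping a word to its feature vector (np.ndarray)."""
    best_word, best_dist = None, float("inf")
    for word, template in templates.items():
        dist = np.linalg.norm(utterance_features - template)  # Euclidean distance
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word, best_dist

# Hypothetical usage: templates recorded by a single user
templates = {"yes": np.array([0.9, 0.1, 0.3]), "no": np.array([0.2, 0.8, 0.5])}
print(closest_template(np.array([0.85, 0.15, 0.35]), templates))  # -> ('yes', ...)
```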
Template-based approach
• Hard to distinguish very similar templates
• Quickly degrades when the input differs from
the template
Statistics based approach
• Collect a large corpus of transcribed speech
recordings
• Train the computer to learn the correspondences and their
probabilities (machine learning)
• At run time, apply statistical processes to search
through the space of all possible solutions and pick
the statistically most likely one (sketched below)
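In code form, "pick the statistically most likely one" is an argmax over candidate transcriptions, scored by an acoustic model and a language model. The two scoring functions below are hypothetical stand-ins, not a real model:

```python
def acoustic_log_prob(audio, words):
    # Hypothetical stand-in for log P(audio | words) from an acoustic model.
    return 0.0

def language_log_prob(words):
    # Hypothetical stand-in for log P(words) from a language model.
    return -len(words)  # toy score only

def best_transcription(audio, candidates):
    # Score each candidate word sequence and keep the statistically most likely one.
    return max(candidates,
               key=lambda words: acoustic_log_prob(audio, words) + language_log_prob(words))
```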
What’s Hard About That ?
• Digitization:
Converting analog signals into a digital representation
• Signal Processing:
Separating speech from background noise
• Phonetics:
Variability in human speech
• Channel Variability:
The quality and position of the microphone and the background
environment affect the output
SPEECH RECOGNITION THROUGH THE
DECADES
- 1950-60s (Baby-Talk)
• Early systems focused on NUMBERS
• They recognized only DIGITS
• In 1962, IBM developed ‘SHOEBOX’, which could recognize 16 words
spoken in English
SPEECH RECOGNITION THROUGH THE
DECADES
- 1970s (SR Takes Off)
• The U.S. DoD’s DARPA initiated a research program called the Speech
Understanding Research program.
• Its code name was ‘HARPY’, and it could understand 1,011 words.
• The first commercial speech recognition company, Threshold
Technology, was set up, and Bell Laboratories introduced
a system that could interpret multiple people's voices.
SPEECH RECOGNITION THROUGH THE
DECADES
- 1980s (SR Turns Toward Prediction)
• SR vocabulary jumped from about a few hundred words to several
thousand words
• One major reason was a new statistical method known as the hidden
Markov model.
• Rather than simply using templates for words and looking for sound
patterns, the HMM considered the probability of unknown sounds
being words.
• Programs took discrete dictation, so you had … to … pause … after …
each … and … every … word.
SPEECH RECOGNITION THROUGH THE
DECADES
⁻ 1990s (Automatic Speech Recognition)
• In the '90s, computers with faster processors finally
arrived, and speech recognition software became
viable for ordinary people.
• Dragon NaturallySpeaking arrived. The application
recognized continuous speech, so one could speak, well,
naturally, at about 100 words per minute. However,
about 45 minutes of training was required of the user.
SPEECH RECOGNITION THROUGH THE
DECADES
- 2000s
• Accuracy topped out at about 80%
• In 2002, Google Voice Search was released, allowing users to
use Google Search by speaking on a mobile phone or computer
• In 2011, Apple’s Siri was released. It’s a built-in "intelligent assistant" that
enables Apple users to speak voice commands in order to operate the
mobile device and its apps
• In 2014, MS Cortana was released. It’s also a built-in “intelligent personal
assistant” which can set reminders, recognize natural speech without
the requirement for keyboard input, and answer questions using
information from the Bing search engine.
Artificial Neural Net
Sound wave saying ‘Hello’
• But we aren’t quite there yet.
• The big problem is that speech varies in speed
• One person might say “hello!” very quickly and another
person might say “heeeelllllllllllllooooo!” very slowly,
producing a much longer sound file with much more
data. Both sound files should be recognized as exactly
the same text — “hello!”
• Automatically aligning audio files of various lengths to a
fixed-length piece of text turns out to be pretty hard
• To work around this, we have to use some special tricks
and extra processing in addition to a deep neural
network. Let’s see how it works!
Artificial Neural Net
- The first step in speech recognition is obvious —
we need to feed sound waves into a computer.
- But sound is transmitted as waves. How do we turn
sound waves into numbers?
Turning Sounds into Bits
A waveform of saying “Hello”
Let’s zoom in on one tiny part of the sound wave and
take a look:
To turn this sound wave into numbers, we just record
the height of the wave at equally-spaced points:
• This is called sampling.
• We are taking a reading thousands of times a second
and recording a number representing the height of the
sound wave at that point in time.
• Sampled at 16 kHz (16,000 samples/sec).
• Let’s sample our “Hello” sound wave 16,000 times per
second. Here are the first 100 samples:
Each number represents the amplitude of the sound wave at 1/16,000th-of-a-second intervals
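A minimal sketch of this sampling step in Python, assuming a 16 kHz, 16-bit mono recording named hello.wav (the file name and format are illustrative):

```python
# Read a 16 kHz mono WAV file and look at its raw samples.
import wave
import numpy as np

with wave.open("hello.wav", "rb") as wav:        # hypothetical file name
    assert wav.getframerate() == 16000           # 16,000 samples per second
    raw = wav.readframes(wav.getnframes())

samples = np.frombuffer(raw, dtype=np.int16)     # one amplitude per 1/16,000 s (16-bit samples assumed)
print(samples[:100])                             # the first 100 samples
```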
DIGITAL SAMPLING
A Quick Sidebar
- Are we losing data by sampling, because of the gaps between readings? No: by the Nyquist theorem, as long as we sample at least twice as fast as the highest frequency we want to record, the original wave can be reconstructed exactly from the samples.
Pre-processing our Sampled Sound Data
- We now have an array of numbers with each
number representing the sound wave’s amplitude
at 1/16,000th of a second intervals.
- Some pre-processing is done on the audio data
instead of feeding these numbers straight into a
neural network.
- Let’s start by grouping our sampled audio into 20-
millisecond-long chunks.
• Here’s our first 20 milliseconds of audio (i.e., our first 320
samples):
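A minimal sketch of this chunking, assuming `samples` is the 16 kHz sample array from the earlier sketch:

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK = SAMPLE_RATE // 50                        # 20 ms of audio = 320 samples at 16 kHz

samples = np.zeros(SAMPLE_RATE, dtype=np.int16)  # stand-in; use the array from the previous sketch

n_chunks = len(samples) // CHUNK
chunks = samples[: n_chunks * CHUNK].reshape(n_chunks, CHUNK)
first_20ms = chunks[0]                           # our first 320 samples
```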
• Plotting those numbers as a simple line graph gives us a
rough approximation of the original sound wave for that
20 millisecond period of time:
• To make this data easier for a neural network to process,
we are going to break apart this complex sound wave
into its component parts.
• We’ll break out the low-pitched parts, the next-lowest-
pitched parts, and so on. Then by adding up how much
energy is in each of those frequency bands (from low to
high), we create a fingerprint for this audio snippet.
• We do this using a mathematical operation called
a Fourier transform.
• It breaks apart the complex sound wave into the simple
sound waves that make it up. Once we have those
individual sound waves, we add up how much energy is
contained in each one.
• Each number below represents how much energy was in
each 50 Hz band of our 20 millisecond audio clip:
• It’s a lot easier to see on a chart:
• If we repeat this process on every 20 millisecond chunk
of audio, we end up with a spectrogram (each column
from left-to-right is one 20ms chunk):
The full spectrogram of the “hello” sound clip
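A minimal NumPy sketch of these two steps: a Fourier transform of each 20 ms chunk, with the resulting energy summed into 50 Hz bands, repeated over all chunks to form the spectrogram. The windowing and band-grouping details are simplifications, not the exact recipe behind the figures:

```python
import numpy as np

SAMPLE_RATE = 16000
BAND_HZ = 50

def band_energies(chunk):
    """Energy per 50 Hz band for one 20 ms (320-sample) chunk."""
    spectrum = np.abs(np.fft.rfft(chunk * np.hanning(len(chunk))))
    freqs = np.fft.rfftfreq(len(chunk), d=1.0 / SAMPLE_RATE)
    n_bands = int(freqs[-1] // BAND_HZ) + 1
    bands = np.zeros(n_bands)
    for f, e in zip(freqs, spectrum):
        bands[int(f // BAND_HZ)] += e ** 2       # sum the energy falling into each band
    return bands

def spectrogram(chunks):
    """One column of band energies per 20 ms chunk, left to right."""
    return np.column_stack([band_energies(c) for c in chunks])
```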
Recognizing Characters from Short Sounds
• Now that we have our audio in a format that’s easy to
process, we will feed it into a deep neural network.
• The input to the neural network will be 20 millisecond
audio chunks.
• For each little audio slice, it will try to figure out
the letter that corresponds to the sound currently being
spoken.
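One way to realize such a per-chunk letter predictor is a small recurrent network over the spectrogram columns. The sketch below uses PyTorch; the framework, layer sizes, and alphabet are assumptions for illustration, not the deck's exact model:

```python
import torch
import torch.nn as nn

ALPHABET = "_abcdefghijklmnopqrstuvwxyz "       # '_' acts as the blank symbol
N_BANDS = 161                                    # energy bands per 20 ms chunk (from the spectrogram step)

class ChunkToLetters(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(input_size=N_BANDS, hidden_size=128, batch_first=True)
        self.out = nn.Linear(128, len(ALPHABET))

    def forward(self, spectrogram):              # shape: (batch, n_chunks, N_BANDS)
        hidden, _ = self.rnn(spectrogram)
        return self.out(hidden).log_softmax(-1)  # per-chunk log-probabilities over letters

# Hypothetical usage: one clip made of 100 chunks of 20 ms each
model = ChunkToLetters()
log_probs = model(torch.randn(1, 100, N_BANDS))
letters = [ALPHABET[i] for i in log_probs[0].argmax(-1).tolist()]  # most likely letter per chunk
```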
• After we run our entire audio clip through the neural
network (one chunk at a time), we’ll end up with a
mapping of each audio chunk to the letters most likely
spoken during that chunk.
• Here’s what that mapping looks like saying “Hello”:
• Our neural net predicts that one likely thing that was
said was “HHHEE_LL_LLLOOO”. But it also thinks that it
could have been “HHHUU_LL_LLLOOO” or
even “AAAUU_LL_LLLOOO”.
• We have some steps we follow to clean up this output.
First, we’ll replace any repeated characters with a single
character:
o HHHEE_LL_LLLOOO becomes HE_L_LO
o HHHUU_LL_LLLOOO becomes HU_L_LO
o AAAUU_LL_LLLOOO becomes AU_L_LO
• Then we’ll remove any blanks:
o HE_L_LO becomes HELLO
o HU_L_LO becomes HULLO
o AU_L_LO becomes AULLO
• That leaves us with three possible transcriptions —
 “Hello”, “Hullo” and “Aullo”.
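A minimal sketch of this two-step cleanup (merging repeats and then dropping blanks is the core of CTC-style decoding):

```python
from itertools import groupby

def collapse(prediction, blank="_"):
    """'HHHEE_LL_LLLOOO' -> 'HELLO': merge repeated characters, then drop blanks."""
    deduped = "".join(ch for ch, _ in groupby(prediction))  # HHHEE_LL_LLLOOO -> HE_L_LO
    return deduped.replace(blank, "")                       # HE_L_LO -> HELLO

for p in ["HHHEE_LL_LLLOOO", "HHHUU_LL_LLLOOO", "AAAUU_LL_LLLOOO"]:
    print(collapse(p))   # HELLO, HULLO, AULLO
```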
• The trick is to combine these pronunciation-based
predictions with likelihood scores based on a large
database of written text.
• Of our possible transcriptions “Hello”, “Hullo” and “Aullo”,
obviously “Hello” will appear more frequently in a
database of text and thus is probably correct. So we’ll
pick “Hello” as our final transcription instead of the
others. Done!
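A minimal sketch of that final step, using word counts from a hypothetical text corpus as the likelihood score:

```python
# Hypothetical word frequencies taken from a large corpus of written text.
word_counts = {"hello": 250_000, "hullo": 1_200, "aullo": 0}

candidates = ["hello", "hullo", "aullo"]
final = max(candidates, key=lambda w: word_counts.get(w, 0))
print(final)   # -> 'hello'
```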
What the Future Holds
• Voice will be a primary interface for the connected home, providing a
natural means to communicate with alarm systems, lights, kitchen
appliances, sound systems and more, as users go about their day-
to-day lives.
• More and more major cars on the market will adopt intelligent, voice-
driven systems for entertainment and location-based search,
keeping drivers’ and passengers’ eyes and hands free.
• Small-screened and screenless wearables will continue their
upward climb in popularity.
• Voice-controlled devices will also dominate workplaces that require
hands-free mobility, such as hospitals, warehouses, laboratories and
production plants.
• Intelligent virtual assistants built into mobile operating systems keep
getting better.
[~] $ Questions_?
