
Deep Learning in practice: Speech recognition and beyond - Meetup

The presentation from our Meetup of 27 September 2017, given by our collaborator Abdelwahab HEBA: Deep Learning in practice: Speech recognition and beyond.


  1. Deep learning in practice: Speech recognition and beyond - Abdel HEBA, 27 September 2017
  2. Outline
     ● Part 1: Basics of Machine Learning (Deep and Shallow) and of Signal Processing
     ● Part 2: Speech Recognition
        ● Acoustic representation
        ● Probabilistic speech recognition
     ● Part 3: Neural Network Speech Recognition
        ● Hybrid neural networks
        ● End-to-End architecture
     ● Part 4: Kaldi
  3. Reading Material
  4. Reading Material
     ● Bengio, Yoshua (2009). "Learning Deep Architectures for AI".
     ● L. Deng and D. Yu (2014). "Deep Learning: Methods and Applications". http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf
     ● D. Yu and L. Deng (2014). "Automatic Speech Recognition: A Deep Learning Approach" (Springer).
  5. Reading Material
  6. Part I: Machine Learning (Deep/Shallow) and Signal Processing
  7. Current view of Artificial Intelligence, Machine Learning & Deep Learning (Edureka blog – what-is-deep-learning)
  8. Current view of Machine Learning foundations & disciplines (Edureka blog – what-is-deep-learning)
  9. Machine Learning Paradigms: An Overview [diagram: machine learning at the intersection of programs and data analysis/statistics]
  10. Supervised Machine Learning (classification) - Training phase (usually offline): a training data set of measurements (features) and associated class labels (colors used to show class labels) feeds a training algorithm, which produces the learned model's parameters/weights (and sometimes its structure).
  11. Supervised Machine Learning (classification) - Test phase (run time, online): an input test data point, with measurements (features) only, goes through the learned model (structure + parameters), whose output is the predicted class label or label sequence (e.g. a sentence).
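     A minimal sketch of these two phases; the dataset and classifier choice are illustrative, not from the talk:

        # Train/test illustration of the two phases above.
        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.linear_model import LogisticRegression

        X, y = load_iris(return_X_y=True)          # features + class labels
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

        model = LogisticRegression(max_iter=1000)  # the "training algorithm"
        model.fit(X_train, y_train)                # training phase (offline)

        y_pred = model.predict(X_test)             # test phase (run time)
        print("accuracy:", (y_pred == y_test).mean())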
  12. What Is Deep Learning? Deep learning (deep machine learning, deep structured learning, hierarchical learning, or sometimes DL) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple non-linear transformations. [1](p198)[2][3][4]
  13. Evolution of Machine Learning (slide from: Yoshua Bengio)
  14. Face Recognition
  15. [Diagram: models arranged from shallow to deep - Perceptron, RBM, GMM, BayesNP, SVM, Sparse Coding, Decision Tree, Boosting, Neural Net, Conv. Net, RNN, D-AE, DBN, DBM, AE, Bayes Nets. Modified from Y. LeCun & M.A. Ranzato.]
  16. [Same diagram, with the models split into Neural Networks vs. Probabilistic Models. Modified from Y. LeCun & M.A. Ranzato.]
  17. [Same diagram, further annotated Supervised vs. Unsupervised. Modified from Y. LeCun & M.A. Ranzato.]
  18. Part II: Speech Recognition
  19. Human Communication: verbal & non-verbal information
  20. Speech recognition problem
  21. Speech recognition problem
     ● Automatic speech recognition
     ● Spontaneous vs. read speech
     ● Large vocabulary
     ● In noise
     ● Low resource
     ● Far-field
     ● Accent-independent
     ● Speaker-adaptive
     ● Speaker identification
     ● Speech enhancement
     ● Speech separation
  22. Speech representation
     ● Same word: « Appeler » (French for "to call")
  23. Speech representation
     We want a low-dimensional representation that is invariant to speaker, background noise, rate of speaking, etc.
     ● Fourier analysis shows the energy in different frequency bands.
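     As a concrete illustration of the Fourier-analysis step, a sketch computing a log-magnitude spectrogram with SciPy; the 25 ms window and 10 ms hop are typical values, chosen here for illustration:

        # Log-magnitude spectrogram via the short-time Fourier transform.
        import numpy as np
        from scipy.signal import stft

        fs = 16000                                    # 16 kHz sampling rate
        x = np.random.randn(fs)                       # stand-in for 1 s of speech
        f, t, Z = stft(x, fs=fs, nperseg=400, noverlap=240)  # 25 ms window, 10 ms hop
        log_spec = 20 * np.log10(np.abs(Z) + 1e-10)   # energy per frequency band
        print(log_spec.shape)                         # (freq bins, time frames)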
  24. Acoustic representation - The vowel triangle as seen from formants 1 & 2
  25. Acoustic representation
     ● Features used in speech recognition:
        ● Mel Frequency Cepstral Coefficients (MFCC)
        ● Perceptual Linear Prediction (PLP)
        ● RASTA-PLP
        ● Filter Bank Coefficients (F-BANKs)
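     A sketch of extracting two of these feature types with librosa; the parameter values (13 MFCCs, 40 mel bands) are common choices assumed here, not taken from the slides:

        # MFCC and filter-bank (F-BANK) features with librosa.
        import numpy as np
        import librosa

        x, sr = np.random.randn(16000), 16000        # stand-in for 1 s of speech

        mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13)
        fbank = librosa.feature.melspectrogram(y=x, sr=sr, n_mels=40)
        log_fbank = librosa.power_to_db(fbank)

        print(mfcc.shape, log_fbank.shape)           # (coeffs, frames), (bands, frames)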
  26. Speech Recognition as transduction: from signal to language
  27. Speech Recognition as transduction: from signal to language
  28. Speech Recognition as transduction: from signal to language
  29. Probabilistic speech recognition
     ● The speech signal is represented as an acoustic observation sequence.
     ● We want to find the most likely word sequence W.
     ● We model this with a Hidden Markov Model:
        ● the system has a set of discrete states,
        ● it transitions from state to state according to transition probabilities (Markovian: memoryless),
        ● the acoustic observation emitted on a transition is conditioned on the state alone, P(o|c).
     ● We seek to recover the state sequence, and consequently the word sequence.
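     Spelled out, this objective is the standard noisy-channel decomposition of speech recognition (notation added here; O is the observation sequence):

        W^* = \arg\max_W P(W \mid O)
            = \arg\max_W \frac{P(O \mid W)\, P(W)}{P(O)}
            = \arg\max_W \underbrace{P(O \mid W)}_{\text{acoustic model}} \cdot \underbrace{P(W)}_{\text{language model}}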
  30. Speech Recognition as transduction - Phone Recognition
     ● Training algorithm (N iterations):
        ● align data & text,
        ● compute the probability P(o|p) of each segment o,
        ● update the boundaries.
  31. Speech Recognition as transduction - Lexicon
     ● Construct the graph using Weighted Finite State Transducers (WFSTs).
  32. Speech Recognition as transduction
     ● Compose the Lexicon FST with the Grammar FST: L ∘ G
     ● Transduction via composition:
        ● map the output labels of the lexicon to the input labels of the language model,
        ● join and optimize the end-to-end graph (see the sketch below).
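     A sketch of this composition using OpenFst's Python wrapper; the file names are placeholders, and the exact calls are an assumption based on standard pywrapfst usage rather than anything shown in the talk:

        # Compose a lexicon FST (phones -> words) with a grammar FST
        # (word-level LM) into one decoding graph, then optimize it.
        import pywrapfst as fst

        L = fst.Fst.read("L.fst")          # lexicon transducer
        G = fst.Fst.read("G.fst")          # grammar / language-model acceptor

        L.arcsort(sort_type="olabel")      # composition needs sorted arcs:
        G.arcsort(sort_type="ilabel")      # L's outputs match G's inputs

        LG = fst.compose(L, G)             # map lexicon outputs to LM inputs
        LG = fst.determinize(LG)           # optimize the end-to-end graph
        LG.minimize()
        LG.write("LG.fst")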
  33. Different steps of acoustic modeling
  34. Decoding
  35. Decoding
     ● We want to find the most likely word sequence W given the observations o, searching through the graph.
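     This search is typically done with the Viterbi algorithm; a minimal sketch over a toy HMM (sizes and random parameters are for illustration only):

        # Viterbi decoding: most likely state sequence given observations,
        # in log space to avoid underflow.
        import numpy as np

        def viterbi(log_init, log_trans, log_emit):
            # log_init: (S,), log_trans: (S, S), log_emit: (T, S)
            T, S = log_emit.shape
            delta = log_init + log_emit[0]           # best score per end state
            back = np.zeros((T, S), dtype=int)       # backpointers
            for t in range(1, T):
                scores = delta[:, None] + log_trans  # (from-state, to-state)
                back[t] = scores.argmax(axis=0)
                delta = scores.max(axis=0) + log_emit[t]
            path = [int(delta.argmax())]
            for t in range(T - 1, 0, -1):            # follow backpointers
                path.append(int(back[t][path[-1]]))
            return path[::-1]

        rng = np.random.default_rng(0)
        S, T = 3, 10
        pi, A, B = rng.random(S), rng.random((S, S)), rng.random((T, S))
        print(viterbi(np.log(pi), np.log(A), np.log(B)))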
  36. Part III: Neural Networks for Speech Recognition
  37. Three main paradigms for neural networks for speech
     ● Use neural networks to compute nonlinear feature representations ("bottleneck" or "tandem" features).
     ● Use neural networks to estimate phonetic-unit probabilities (hybrid networks).
     ● Use end-to-end neural networks.
  38. Neural network features
     ● Train a neural network to discriminate between classes.
     ● Use the output, or a low-dimensional bottleneck-layer representation, as features.
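     A sketch of the bottleneck idea in PyTorch: train a classifier with one narrow hidden layer, then read that layer out as the feature vector. Layer sizes are illustrative assumptions:

        # Bottleneck features: after training the classifier, the 40-dim
        # bottleneck activations serve as features for a downstream system.
        import torch
        import torch.nn as nn

        class BottleneckNet(nn.Module):
            def __init__(self, n_in=440, n_bottleneck=40, n_classes=2000):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Linear(n_in, 1024), nn.ReLU(),
                    nn.Linear(1024, n_bottleneck), nn.ReLU(),  # bottleneck
                )
                self.classifier = nn.Linear(n_bottleneck, n_classes)

            def forward(self, x):
                z = self.encoder(x)        # bottleneck representation
                return self.classifier(z), z

        net = BottleneckNet()
        frames = torch.randn(8, 440)       # e.g. 11 stacked 40-dim frames
        logits, feats = net(frames)
        print(feats.shape)                 # torch.Size([8, 40]) -> features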
  39. Hybrid Speech Recognition System
     ● Train the network as a classifier with a softmax across the phonetic units.
  40. Hybrid Speech Recognition System
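     One standard detail behind hybrid decoding, not written on the slide: the softmax outputs are posteriors P(s|o), while HMM decoding expects likelihoods, so decoders use "scaled likelihoods" obtained by dividing out the state prior:

        p(o \mid s) \propto \frac{P(s \mid o)}{P(s)}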
  41. Neural network architectures for speech recognition
     ● Fully connected networks
     ● Convolutional networks (CNNs)
     ● Recurrent neural networks (RNNs)
     ● LSTMs
     ● GRUs
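     As an example of the recurrent options in this list, a minimal bidirectional-LSTM acoustic model in PyTorch; all sizes are illustrative, not from the talk:

        # A small bidirectional LSTM mapping feature frames to per-frame
        # scores over phonetic units.
        import torch
        import torch.nn as nn

        class LSTMAcousticModel(nn.Module):
            def __init__(self, n_feats=40, n_hidden=320, n_units=2000):
                super().__init__()
                self.lstm = nn.LSTM(n_feats, n_hidden, num_layers=3,
                                    bidirectional=True, batch_first=True)
                self.out = nn.Linear(2 * n_hidden, n_units)

            def forward(self, x):              # x: (batch, time, feats)
                h, _ = self.lstm(x)
                return self.out(h)             # (batch, time, units)

        model = LSTMAcousticModel()
        scores = model(torch.randn(4, 100, 40))   # 4 utterances, 100 frames
        print(scores.shape)                       # torch.Size([4, 100, 2000])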
  42. Neural network architectures for speech recognition
     ● Convolutional Neural Network
  43. Neural network architectures for speech recognition
     ● Recurrent Neural Network
  44. Neural network architectures for speech recognition
     ● Recurrent Neural Network
  45. Neural network architectures for speech recognition
     ● Recurrent Neural Network
  46. Neural network architectures for speech recognition
     ● Recurrent Neural Network
  47. End-to-End Neural Networks for Speech Recognition: CTC Loss Function
  48. End-to-End Speech Recognition: CTC Input
     ● Grapheme-based model: c ∈ {A, B, C, …, Z, blank, space}
     ● P(c = HHH_E_LL_LO___ | x) = P(c₁ = H | x) · P(c₂ = H | x) · … · P(c₆ = blank | x) · …
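     The path on this slide maps to the transcript "HELLO" by collapsing repeats and then removing blanks; a minimal sketch of that collapse map (often written B), with '_' standing for the blank symbol as above:

        # Collapse a CTC path: merge repeated symbols, then drop blanks,
        # so "HHH_E_LL_LO___" becomes "HELLO".
        from itertools import groupby

        def ctc_collapse(path, blank="_"):
            return "".join(sym for sym, _ in groupby(path) if sym != blank)

        print(ctc_collapse("HHH_E_LL_LO___"))   # -> HELLO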
  49. Connectionist Temporal Classification (CTC)
     ● CTC Loss Function:
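     In its standard form, the CTC loss sums over every frame-level path π that collapses to the target transcription y under the map B sketched above, and takes the negative log:

        \mathcal{L}_{\text{CTC}}(x, y) = -\log P(y \mid x)
                                       = -\log \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} P(\pi_t \mid x)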
  50. Connectionist Temporal Classification (CTC)
     ● Updating the network with the CTC loss function
     ● Backpropagation:
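     In practice the loss and its gradients come ready-made; a sketch with PyTorch's nn.CTCLoss, where the shapes follow its documented convention and the sizes are toy values:

        # CTC training step with PyTorch's built-in loss. nn.CTCLoss expects
        # log-probabilities of shape (time, batch, classes); class 0 = blank.
        import torch
        import torch.nn as nn

        T, N, C, L = 50, 4, 28, 10            # frames, batch, classes, target len
        log_probs = torch.randn(T, N, C).log_softmax(dim=-1).requires_grad_()
        targets = torch.randint(1, C, (N, L))  # labels, excluding the blank (0)
        input_lengths = torch.full((N,), T, dtype=torch.long)
        target_lengths = torch.full((N,), L, dtype=torch.long)

        ctc = nn.CTCLoss(blank=0)
        loss = ctc(log_probs, targets, input_lengths, target_lengths)
        loss.backward()                        # backpropagation through the net
        print(loss.item())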
  51. Take-home message
     ● Speech recognition systems:
        ● HMM-GMM traditional system
        ● Hybrid ASR system
           ● uses neural networks for feature representation,
           ● or uses neural networks for phoneme recognition
        ● End-to-end neural network system
           ● grapheme-based model
           ● needs a lot of data to perform well
           ● complex modeling
  52. Part IV: Kaldi
  53. The Kaldi Toolkit
     ● Kaldi is specifically designed for speech recognition research applications.
     ● Kaldi training tools:
        ● data preparation (link text to wav, speaker to utterance, ...)
        ● feature extraction: MFCC, PLP, F-BANKs, pitch, LDA, HLDA, fMLLR, MLLT, VTLN, etc.
        ● scripts for building finite state transducers: converting the lexicon & language model to fst format
        ● HMM-GMM traditional system
        ● hybrid system
        ● online decoding
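     A sketch of how these stages chain together in a typical recipe, driven from Python; the script names follow Kaldi's standard egs/*/s5 layout, but the data paths and argument choices here are illustrative assumptions:

        # Typical Kaldi recipe stages invoked in order; paths are placeholders.
        import subprocess

        def run(cmd):
            print("+", cmd)
            subprocess.run(cmd, shell=True, check=True)

        run("steps/make_mfcc.sh data/train exp/make_mfcc/train mfcc")   # features
        run("steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc")
        run("utils/prepare_lang.sh data/local/dict '<unk>' data/local/lang data/lang")
        run("steps/train_mono.sh data/train data/lang exp/mono")        # HMM-GMM
        run("utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph")  # L o G graph
        run("steps/decode.sh exp/mono/graph data/test exp/mono/decode") # decoding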
  54. Kaldi Architecture
  55. LinSTT uses Kaldi
      Site             CLIPS   ENST    IRENE   LIA     LIMSI     LIUM     LORIA   Linagora
      WER              40.7    45.4    35.4    26.7    11.9      23.6     27.6    26.23
      Audio Corpus     90h     90h     90h     90h     90h+100h  90h+90h  90h     90h
      #states          1,500   114     6,000   3,600   12,000    7,000    6,000   15,000
      #gaussians       24k     14k     200k    230k    370k      154k     90k     500k
      #pronunciations  38k     118k    118k    130k    276k      107k     112k    105k
  56. Thanks for your attention
      LINAGORA – headquarters: 80, rue Roque de Fillol, 92800 PUTEAUX, FRANCE
      Phone: +33 (0)1 46 96 63 63 | Info: info@linagora.com | Web: www.linagora.com
      facebook.com/Linagora/ | @linagora
