Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Who doesn't know the iconic scenes in "Minority Report": intelligent machines and innovative user interfaces driven by speech and gestures?
In this deep dive, we discuss how deep learning can enable such interactions, using Microsoft projects in the area of NUI (Natural User Interfaces) as examples: Kinect, Handpose, Skype Translator, etc. Which predictive models are used? What do we do if we don't have sufficient data? Finally, we venture an outlook on how new and innovative human-machine interaction concepts can change our user experience with computers, also in light of Industry 4.0.


  1. 1. Deep Learning for New User Interactions (Gestures, Speech and Emotions). Olivia Klose, Software Development Engineer, Microsoft; Dr. Marcel Tilly, Program Manager, Microsoft
  2. 2. https://www.technologyreview.com/lists/technologies/2013/
  3. 3. Deep Neural Networks … are inspired by the neural network in the brain. Number of neurons in the brain (~100 billion) compared with the number of trees in the Amazon rainforest (~300 billion); number of synapses (~100 to 1000 trillion) compared with the number of leaves in the Amazon rainforest
  4. 4. https://www.youtube.com/watch?v=V1eYniJ0Rnk
  5. 5. Scale in Compute; Scale in Data; Better Algorithms; More Investment
  6. 6. Chart: word error rate (WER %, axis 0% to 100%) by year, 1993 to 2012. Annotation: improving domain knowledge
  7. 7. Same WER chart, 1993 to 2012. Annotation: stuck
  8. 8. Same WER chart, 1993 to 2012. Annotation: deep learning + big data + scalable tools
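
The WER % axis on these charts is the word error rate: the word-level edit distance between the recognizer output and a reference transcript, divided by the number of reference words. The following is a minimal sketch of that calculation, not the evaluation code behind the slides. The function name `word_error_rate` is an assumption for illustration, and the example strings are borrowed from the Skype Translator slides further down.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One inserted word ("hum") and one substituted word ("pig" for "big")
    # against a three-word reference.
    print(word_error_rate("this is big", "this is hum pig"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.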
  9. 9. http://arxiv.org/abs/1609.03528 http://blogs.microsoft.com/next/2016/10/18/historic-achievement-microsoft-researchers-reach-human-parity-conversational-speech-recognition
  10. 10. Speech Recognition Breakthrough for the Spoken, Translated Word
  11. 11. Skype Translator. Components: Skype Translator Bots, Skype Service, Automatic Speech Recognition, Speech Correction, Translation, Text To Speech
  12. 12. Skype Translator Bots: software "robots" that separate and manage audio streams
  13. 13. Automatic Speech Recognition: machine learning; deep neural network; new language = new training. Example output: "this is hum pig"
  14. 14. Speech Correction: punctuation, capitalization, disfluency removal, lattice rescoring. Example: "this is hum pig" → "this is hum pig." → "This is hum pig." → "This is pig." → "This is big."
  15. 15. Skype Translator (overview with example): "this is hum pig" → "this is hum pig." → "This is hum pig." → "This is pig." → "This is big."
  16. 16. Translation: Microsoft Translator core API; statistical machine translation; 45 supported languages. Example: "This is big." → "C’est grand."
  17. 17. Text To Speech: Microsoft Translator TTS API. Example: "C’est grand."
  18. 18. Skype Translator (overview with example): "this is hum pig" → "This is big." → "C’est grand."
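
The speech correction example on the slides ("this is hum pig" ending as "This is big.") can be mimicked with a toy pipeline. This is only a sketch under loud assumptions: the filler list, the table of confusable words, and the sentence scores are all made up for illustration, and the last stage is a simple pick-the-best-candidate step standing in for real lattice rescoring. Nothing here reflects how Skype Translator is actually implemented.

```python
import re

# Hypothetical filler words; a real system would learn disfluencies from data.
FILLERS = {"hum", "um", "uh", "er"}
# Hypothetical alternatives the recognizer kept for acoustically similar words.
ALTERNATIVES = {"pig": ["big"]}
# Hypothetical language-model scores (higher means more plausible).
LM_SCORES = {"This is pig.": 0.1, "This is big.": 0.9}


def add_punctuation(text: str) -> str:
    return text if text.endswith((".", "!", "?")) else text + "."


def capitalize(text: str) -> str:
    return text[:1].upper() + text[1:]


def remove_disfluencies(text: str) -> str:
    kept = [w for w in text.split() if w.strip(".,!?").lower() not in FILLERS]
    return " ".join(kept)


def rescore(text: str) -> str:
    """Stand-in for lattice rescoring: pick the best-scoring candidate."""
    candidates = [text]
    for word, options in ALTERNATIVES.items():
        for option in options:
            swapped = re.sub(rf"\b{re.escape(word)}\b", option, text)
            if swapped != text:
                candidates.append(swapped)
    return max(candidates, key=lambda c: LM_SCORES.get(c, 0.0))


def correct(text: str) -> list[str]:
    """Run the four stages in slide order and return every intermediate result."""
    steps = []
    for stage in (add_punctuation, capitalize, remove_disfluencies, rescore):
        text = stage(text)
        steps.append(text)
    return steps


if __name__ == "__main__":
    for step in correct("this is hum pig"):
        print(step)
```

Keeping each stage as a separate function mirrors the slide's bullet list and makes it easy to inspect the intermediate strings.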
  19. 19. Panels: input depth, inferred body parts (front view, side view, top view; no tracking or smoothing) https://www.microsoft.com/en-us/research/video/real-time-human-pose-recognition-in-parts-from-single-depth-images-2/
  20. 20. Kinect Gesture Data Set
  21. 21. https://www.microsoft.com/en-us/research/video/handpose-fully-articulated-hand-tracking/
  22. 22. Semantic segmentation example labels: bicycle, road, building, cat, car, grass, water, cow https://www.microsoft.com/en-us/research/publication/semantic-segmentation-as-image-representation-for-scene-recognition/
  23. 23. ImageNet classification top-5 error (%): ILSVRC 2010 NEC America 28.2; ILSVRC 2011 Xerox 25.8; ILSVRC 2012 AlexNet 16.4; ILSVRC 2013 Clarifai 11.7; ILSVRC 2014 VGG 7.3; ILSVRC 2014 GoogleNet 6.7; human performance 5.1; ILSVRC 2015 ResNet 3.5. Microsoft researchers win ImageNet computer vision challenge
  24. 24. Network architectures (layer diagrams):
      AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool/2; 5x5 conv, 256, pool/2; 3x3 conv, 384; 3x3 conv, 384; 3x3 conv, 256, pool/2; fc, 4096; fc, 4096; fc, 1000
      VGG, 19 layers (ILSVRC 2014): 2 × 3x3 conv, 64 (pool/2); 2 × 3x3 conv, 128 (pool/2); 4 × 3x3 conv, 256 (pool/2); 4 × 3x3 conv, 512 (pool/2); 4 × 3x3 conv, 512 (pool/2); fc, 4096; fc, 4096; fc, 1000
      GoogleNet, 22 layers (ILSVRC 2014): Conv 7x7, MaxPool 3x3, LocalRespNorm, Conv 1x1, Conv 3x3, LocalRespNorm, MaxPool 3x3; then stacked modules of parallel 1x1, 3x3 and 5x5 convolutions and 3x3 max pooling joined by DepthConcat; AveragePool 7x7, FC; outputs softmax0, softmax1 and softmax2
      ResNet, 152 layers (ILSVRC 2015): 7x7 conv, 64, /2, pool/2; then repeated blocks of 1x1 conv, 3x3 conv, 1x1 conv with 64/256, 128/512, 256/1024 and 512/2048 channels; ave pool, fc 1000
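
The ResNet column repeats triples of 1x1, 3x3 and 1x1 convolutions. A short worked count shows how those triples add up to the quoted 152 layers. The sketch assumes the standard ResNet-152 layout of 3, 8, 36 and 3 such blocks per stage, which is taken from the published architecture rather than read off this transcript; the names `BLOCKS_PER_STAGE` and `resnet_depth` are illustrative.

```python
# Blocks per stage, keyed by the width of the 3x3 convolution in that stage.
# Assumption: the standard ResNet-152 configuration (3, 8, 36, 3).
BLOCKS_PER_STAGE = {64: 3, 128: 8, 256: 36, 512: 3}
CONVS_PER_BLOCK = 3  # 1x1 conv, 3x3 conv, 1x1 conv


def resnet_depth(blocks_per_stage: dict[int, int]) -> int:
    """Count weighted layers: first 7x7 conv + block convolutions + final fc."""
    first_conv = 1
    final_fc = 1
    block_convs = CONVS_PER_BLOCK * sum(blocks_per_stage.values())
    return first_conv + block_convs + final_fc


if __name__ == "__main__":
    print(resnet_depth(BLOCKS_PER_STAGE))  # prints 152
```

Pooling layers carry no weights and are not counted, which is the usual convention behind the layer numbers on this slide.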
  25. 25. Open-source, cross-platform toolkit for learning and evaluating deep neural networks. Expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks. Production-ready: state-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server. http://cntk.ai
  26. 26. Diagram: input X; hidden layer with W(1), b(1), P(1) and sigmoid S(1); output layer with W(2), b(2), P(2) and softmax; output O
  27. 27. Network description:
      B1=Parameter(HDim)
      W1=Parameter(HDim, SDim)
      X=Input(SDim)
      labels=Input(LDim)
      T1=Times(W1, X)
      P1=Plus(T1, B1)
      S1=Sigmoid(P1)
      B2=Parameter(LDim, 1)
      W2=Parameter(LDim, HDim)
      T2=Times(W2, S1)
      P2=Plus(T2, B2)
      CrossEntropy=CrossEntropyWithSoftmax(labels, P2)
      ErrPredict=ErrorPrediction(labels, P2)
      FeatureNodes=(X)
      LabelNodes=(labels)
      CriteriaNodes=(CrossEntropy)
      EvalNodes=(ErrPredict)
      OutputNodes=(P2)
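
To make the data flow of this network description concrete, here is a plain-Python sketch of the same forward pass: one sigmoid hidden layer followed by a linear output layer, scored with softmax cross-entropy and a 0/1 prediction error. It is not CNTK code. The helper functions are hypothetical stand-ins named after the nodes on the slide, and the weights in the demo are arbitrary.

```python
import math


def times(W, x):
    """Matrix-vector product, W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def plus(v, b):
    return [vi + bi for vi, bi in zip(v, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]


def cross_entropy_with_softmax(labels, logits):
    """Cross-entropy of softmax(logits) against one-hot labels, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - log_sum_exp) for y, z in zip(labels, logits))


def error_prediction(labels, logits):
    """1.0 if the highest-scoring class differs from the labelled class."""
    predicted = max(range(len(logits)), key=logits.__getitem__)
    actual = max(range(len(labels)), key=labels.__getitem__)
    return 0.0 if predicted == actual else 1.0


def forward(x, labels, W1, B1, W2, B2):
    T1 = times(W1, x)
    P1 = plus(T1, B1)
    S1 = sigmoid(P1)          # hidden layer activation, length HDim
    T2 = times(W2, S1)
    P2 = plus(T2, B2)         # output layer uses its own bias B2, length LDim
    return P2, cross_entropy_with_softmax(labels, P2), error_prediction(labels, P2)


if __name__ == "__main__":
    # Arbitrary demo sizes: SDim=2, HDim=3, LDim=2.
    W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
    B1 = [0.0, 0.1, -0.1]
    W2 = [[0.2, -0.5, 0.3], [-0.4, 0.6, 0.1]]
    B2 = [0.05, -0.05]
    print(forward([1.0, 2.0], [0.0, 1.0], W1, B1, W2, B2))
```

The softmax is folded into the loss, as in the `CrossEntropyWithSoftmax` node, so `P2` holds unnormalized scores.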
  28. 28. https://github.com/azure/ObjectDetectionUsingCntk
  29. 29. https://github.com/azure/ObjectDetectionUsingCntk
  30. 30. https://github.com/azure/ObjectDetectionUsingCntk
  31. 31. https://github.com/azure/ObjectDetectionUsingCntk
  32. 32. Cognitive Services: Give your solutions a human side. http://microsoft.com/cognitive
      Vision: Computer Vision | Emotion | Face | Video
      Speech: Computer Recognition | Speaker Recognition | Speech | Translator
      Language: Bing Spell Check | Language Understanding | Linguistic Analysis | Text Analytics | Web Language Model
      Knowledge: Academic Knowledge | Entity Linking | Knowledge Exploration | Recommendations
      Search: Bing Auto Suggest | Bing Image Search | Bing News Search | Bing Video Search | Bing Web Search
  33. 33. Computer Vision API. Content of image:
      Categories v0: [{ "name": "animal", "score": 0.9765625 }]
      v1: [
        { "name": "grass", "confidence": 0.9999992847442627 },
        { "name": "outdoor", "confidence": 0.9999072551727295 },
        { "name": "cow", "confidence": 0.99954754114151 },
        { "name": "field", "confidence": 0.9976195693016052 },
        { "name": "brown", "confidence": 0.988935649394989 },
        { "name": "animal", "confidence": 0.97904372215271 },
        { "name": "standing", "confidence": 0.9632768630981445 },
        { "name": "mammal", "confidence": 0.9366017580032349, "hint": "animal" },
        { "name": "wire", "confidence": 0.8946959376335144 },
        { "name": "green", "confidence": 0.8844101428985596 },
        { "name": "pasture", "confidence": 0.8332059383392334 },
        { "name": "bovine", "confidence": 0.5618471503257751, "hint": "animal" },
        { "name": "grassy", "confidence": 0.48627158999443054 },
        { "name": "lush", "confidence": 0.1874018907546997 },
        { "name": "staring", "confidence": 0.165890634059906 }
      ]
      Describe:
        0.975 "a brown cow standing on top of a lush green field"
        0.974 "a cow standing on top of a lush green field"
        0.965 "a large brown cow standing on top of a lush green field"
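
A client consuming a tag list like the v1 output above would typically keep only the confident tags. The sketch below parses a shortened copy of that list with the standard `json` module and filters it. It does not call the Computer Vision API; the function name `confident_tags` and the 0.5 threshold are assumptions chosen for illustration.

```python
import json

# Shortened copy of the v1 tag list shown on the slide.
RESPONSE = """
[
  {"name": "grass",  "confidence": 0.9999992847442627},
  {"name": "cow",    "confidence": 0.99954754114151},
  {"name": "mammal", "confidence": 0.9366017580032349, "hint": "animal"},
  {"name": "bovine", "confidence": 0.5618471503257751, "hint": "animal"},
  {"name": "grassy", "confidence": 0.48627158999443054},
  {"name": "lush",   "confidence": 0.1874018907546997}
]
"""


def confident_tags(tags: list[dict], threshold: float = 0.5) -> list[str]:
    """Return the names of all tags at or above the confidence threshold."""
    return [tag["name"] for tag in tags if tag["confidence"] >= threshold]


if __name__ == "__main__":
    tags = json.loads(RESPONSE)
    print(confident_tags(tags))
```

Raising the threshold trades recall for precision: at 0.9 only "grass", "cow" and "mammal" remain from this sample.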
  34. 34. https://www.youtube.com/watch?v=R2mC-NUAmMk
  35. 35. marcel.tilly@microsoft.com olivia.klose@microsoft.com
