Comparing ML-Based Audio
with ML-Based Vision: An
Introduction to ML Audio
for ML Vision Engineers
Josh Morris
Engineering Manager, Machine Learning
DSP Concepts
The Audio of Things Approach
DSP Concepts helps product makers deliver remarkable audio experiences through a flexible, modular approach
within a design platform environment. This system makes the entire workflow faster and easier across prototyping,
design, debugging, tuning, production, and even over-the-air updates.
The Audio Experience Design
platform that accelerates feature
development and makes audio
innovation easy
Audio IP
• Building blocks to product-ready
solutions
Prototyping Hardware
• Test and verify performance
Debug Tools
• Pinpoint issues and reduce risk
Tuning Tools
• Optimize playback and voice control
Expertise
• Algorithm, software, tuning, testing
Worldwide Support
Santa Clara, Stuttgart, Taipei, Seoul
© 2022 DSP Concepts 2
Edge processors are becoming more and more powerful, making it possible to run workloads that were
once reserved for the cloud
Audio is becoming increasingly popular
• Standalone applications
• Smart assistant
• Voice control
• Denoising
• Multi-modal applications
• Industrial sensing
• Anomaly detection
Motivation for Talk
© 2022 DSP Concepts 3
• Feature engineering
• How is audio different from
vision?
• How is audio like vision?
• Implementation
• Similarities
• Differences
• Common problems in vision and
their audio analogues
• Classification
• Sequence decoding
• Restoration/denoising
Agenda
© 2022 DSP Concepts 4
Features
• Audio is a sequence of
samples
• Somewhere between
video and images
• Data has an inherent left-to-right
(time) structure
• Sample rate
• Bit depth
How Are Audio Features Different from Vision
Features?
© 2022 DSP Concepts
https://en.wikipedia.org/wiki/Sampling_(signal_processing)
https://en.wikipedia.org/wiki/File:Mike_Austin_Sequence.JPG
6
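To make sample rate and bit depth concrete, here is a minimal NumPy sketch. The 16 kHz rate, 16-bit depth, and the helper names `make_tone` and `quantize` are illustrative assumptions, not values from the talk.

```python
import numpy as np

SAMPLE_RATE = 16_000   # samples per second (assumed for illustration)
BIT_DEPTH = 16         # bits per sample (assumed for illustration)

def make_tone(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Return a sine tone as a 1D float array with values in [-1, 1]."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return np.sin(2 * np.pi * freq_hz * t)

def quantize(signal, bit_depth=BIT_DEPTH):
    """Map a float signal in [-1, 1] to signed integers of the given bit depth."""
    max_int = 2 ** (bit_depth - 1) - 1
    return np.round(signal * max_int).astype(np.int32)

tone = make_tone(440.0, 0.5)   # half a second of audio
pcm = quantize(tone)
print(tone.shape)              # a single axis: time
print(pcm.min(), pcm.max())    # bounded by the bit depth
```

Unlike an image, the array has only one axis, and the sample rate is what ties an index to a point in time.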
Many techniques used in ML-based audio solutions are borrowed from
vision. This means we need to take a 1D sequence of samples and make it look
like an image.
Manipulating Audio from Time Domain to Frequency
Domain
© 2022 DSP Concepts
https://en.wikipedia.org/wiki/Sampling_(signal_processing)
https://upload.wikimedia.org/wikipedia/commons/c/c5/Spectrogram-19thC.png
7
• Fast Fourier transform (FFT)
• Shows the frequency content of
a block of samples
• Linearly spaced frequency
bins
• Can apply processing in the
frequency domain and then
use an inverse fast Fourier
transform (IFFT) to get back to
time-domain audio
Fast Fourier Transform
© 2022 DSP Concepts
https://commons.wikimedia.org/wiki/File:FFT-Time-Frequency-View.png
8
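The round trip described on this slide can be sketched with NumPy's FFT routines. The sample rate, block size, and the 1 kHz cutoff are arbitrary choices for illustration.

```python
import numpy as np

sample_rate = 8_000
n = 1_024
t = np.arange(n) / sample_rate

# Pick a tone that lands exactly on bin 64 so the peak is unambiguous.
tone_hz = 64 * sample_rate / n            # 500 Hz
x = np.sin(2 * np.pi * tone_hz * t)

# Forward transform: one complex value per linearly spaced frequency bin.
spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(n, d=1 / sample_rate)
peak_hz = freqs[np.argmax(np.abs(spectrum))]

# Simple frequency-domain processing: zero every bin above 1 kHz.
spectrum[freqs > 1_000] = 0

# Inverse transform: back to time-domain audio.
y = np.fft.irfft(spectrum, n=n)

print(peak_hz)             # close to 500 Hz
print(np.allclose(x, y))   # the tone sits below the cutoff, so it survives
```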
• Built from the short-time Fourier
transform (STFT)
• Repeat the Fourier transform with a
set window and hop size
• Take the magnitude squared of the
frequency bins
• Common for classification
tasks
Power Spectrogram
© 2022 DSP Concepts
https://upload.wikimedia.org/wikipedia/commons/9/99/Mount_Rainier_soundscape.jpg
9
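A power spectrogram along these lines can be sketched in a few lines of NumPy. The frame length, hop, and Hann window are assumed values, and `power_spectrogram` is a hypothetical helper rather than a library function.

```python
import numpy as np

def power_spectrogram(x, frame_len=256, hop=128):
    """Return |STFT|^2 with shape (num_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    # Cut the signal into overlapping frames, one row per frame.
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(num_frames)])
    # One windowed Fourier transform per frame is the STFT.
    stft = np.fft.rfft(frames * window, axis=1)
    # Magnitude squared of each frequency bin gives power.
    return np.abs(stft) ** 2

sample_rate = 8_000
t = np.arange(sample_rate) / sample_rate    # one second of audio
x = np.sin(2 * np.pi * 1_000 * t)           # 1 kHz test tone
spec = power_spectrogram(x)
print(spec.shape)                           # (time frames, frequency bins)
```

The result is a 2D array that can be handed to a model in the same way as a single-channel image.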
• Constant Q transform (CQT)
• Logarithmically spaced
frequency bins
• Popular for musical
applications
• Compact representation, since
fewer bins are needed to cover
a given frequency range
Constant Q Transform
© 2022 DSP Concepts
https://en.wikipedia.org/wiki/Constant-Q_transform#/media/File:CQT-piano-chord.png
10
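The bin-count argument can be checked numerically. This sketch only computes where the bins sit, not the transform itself. The 55 Hz starting frequency, seven octaves, and 12 bins per octave are assumptions chosen to match a piano-like range.

```python
import numpy as np

def cqt_frequencies(f_min, num_octaves, bins_per_octave=12):
    """Logarithmically spaced center frequencies: f_k = f_min * 2^(k / B)."""
    k = np.arange(num_octaves * bins_per_octave)
    return f_min * 2.0 ** (k / bins_per_octave)

f_min, num_octaves = 55.0, 7
f_max = f_min * 2 ** num_octaves            # 7040 Hz
freqs = cqt_frequencies(f_min, num_octaves)

# Q is the ratio of center frequency to bin spacing; it is the same for every bin.
q_factor = 1.0 / (2.0 ** (1 / 12) - 1)

# A linear grid fine enough to separate the two lowest CQT bins
# needs that same spacing across the whole range.
linear_bins = int(np.ceil((f_max - f_min) / (freqs[1] - freqs[0])))

print(len(freqs), linear_bins)              # far fewer log-spaced bins
```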
• Mel spectrogram
• Triangular frequency windows
• Filter banks that attempt to
approximate human hearing
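• Humans struggle to distinguish
frequencies that are close together,
especially at higher frequencies.
A related effect, where a louder sound
hides a quieter one at a nearby
frequency, is known as masking.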
Mel Spectrogram
© 2022 DSP Concepts
http://www.ifp.illinois.edu/~minhdo/teaching/speaker_recognition/speaker_recognition.html
11
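A triangular mel filter bank can be sketched as follows. The conversion uses the common 2595 * log10(1 + f / 700) form of the mel scale, which is one of several variants in use. The filter count, FFT size, and sample rate are assumed values.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)

def mel_filter_bank(num_filters, n_fft, sample_rate):
    """Triangular filters with shape (num_filters, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # num_filters + 2 edge points, evenly spaced on the mel scale.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), num_filters + 2)
    edges_hz = mel_to_hz(mel_edges)
    bank = np.zeros((num_filters, len(fft_freqs)))
    for i in range(num_filters):
        lo, center, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        # The triangle is the lower of the two slopes, floored at zero.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

bank = mel_filter_bank(num_filters=40, n_fft=512, sample_rate=16_000)
power_frame = np.ones(512 // 2 + 1)     # placeholder for one power-spectrum frame
mel_frame = bank @ power_frame          # 257 linear bins reduced to 40 mel bands
print(bank.shape, mel_frame.shape)
```

Applying `bank` to every frame of a power spectrogram yields a mel spectrogram whose bands widen with frequency.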
• Feature normalization
• Multi-channel features for
complex audio
• The audio analogue of color channels
• Real and imaginary
components in the time-frequency domain
How Are Audio Features the Same as Vision?
© 2022 DSP Concepts
https://en.wikipedia.org/wiki/Fourier_transform#/media/File:Rising_circular.gif
https://en.wikipedia.org/wiki/Grayscale#/media/File:Beyoglu_4671_tricolor.png
12
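One way to read the color-channel analogy is to stack the real and imaginary parts of a complex STFT as two channels and then normalize each channel, much as RGB images are normalized. The random frames and the (channels, time, frequency) layout below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((61, 256))     # stand-in for windowed audio frames
stft = np.fft.rfft(frames, axis=1)          # complex values, shape (61, 129)

# Real and imaginary parts become two "color" channels.
features = np.stack([stft.real, stft.imag], axis=0)

# Per-channel normalization to zero mean and unit variance.
mean = features.mean(axis=(1, 2), keepdims=True)
std = features.std(axis=(1, 2), keepdims=True)
normalized = (features - mean) / (std + 1e-8)

print(normalized.shape)                     # (channels, time, frequency)
```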
Implementation
• Take audio and create an
image-like input with fixed
dimensions.
• Transform the input signal into the
frequency domain
• Take a window of feature
vectors
• Slide the window with a hop that is
usually smaller than the input
dimension of the model
How Do Implementations of Audio Solutions Differ
from Vision Solutions?
https://www.jonnor.com/2021/12/audio-classification-with-machine-learning-europython-2019/
© 2022 DSP Concepts 14
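The window-and-hop step can be sketched as below. `sliding_windows` is a hypothetical helper, and the sizes (100 feature frames of 40 bins, a 32-frame model input, a hop of 8 frames) are assumed values.

```python
import numpy as np

def sliding_windows(features, window_frames, hop_frames):
    """Cut (num_frames, num_bins) features into (num_windows, window_frames, num_bins)."""
    num_frames = features.shape[0]
    starts = range(0, num_frames - window_frames + 1, hop_frames)
    return np.stack([features[s : s + window_frames] for s in starts])

# Stand-in for a stream of feature vectors, e.g. mel spectrogram frames.
features = np.arange(100 * 40, dtype=np.float32).reshape(100, 40)

# Each window is one fixed-size, image-like model input.
batch = sliding_windows(features, window_frames=32, hop_frames=8)
print(batch.shape)
```

Because the hop is smaller than the window, consecutive inputs overlap, so an event that straddles one window boundary is still seen whole in a neighboring window.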
• Use the same layers
• Convolutional layers
• Conventional
• Depth-wise separable
• Residual blocks
• RNN
• Use the same training tricks
How Are Implementations of Audio Solutions Like
Vision Solutions?
© 2022 DSP Concepts
https://www.semanticscholar.org/paper/Singing-Voice-Separation-with-Deep-U-Net-Networks-Jansson-Humphrey/83ea11b45cba0fc7ee5d60f608edae9c1443861d
15
• Some popular vision model
architectures show up
• EfficientNet
• Similar concepts
• Encoder-decoder network
• Transfer learning
How Are Implementations of Audio Solutions Like
Vision Solutions?
https://ai.googleblog.com/2021/08/soundstream-end-to-end-neural-audio.html
© 2022 DSP Concepts 16
Common Problems in Vision and Their Audio
Analogues
• Very similar
• Convolution layers pull out structural
information
• Trained on frequency domain features
• In audio, we usually have a sliding
window
• Like working with a camera stream
• Can be important for energy savings in
complex systems
• Motion detection
• Voice activity detection
Classification
https://www.tensorflow.org/tutorials/audio/simple_audio
© 2022 DSP Concepts 18
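The energy-saving idea can be sketched as a cheap gate in front of an expensive classifier. The energy threshold here is only a stand-in, since practical voice activity detectors are often small models themselves. `run_gated`, `fake_classifier`, and the -40 dB threshold are hypothetical.

```python
import numpy as np

def frame_energy_db(frame):
    """Mean power of a frame in decibels."""
    return 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)

def run_gated(frames, classifier, threshold_db=-40.0):
    """Call `classifier` only on frames whose energy exceeds the threshold."""
    results = []
    for frame in frames:
        if frame_energy_db(frame) > threshold_db:
            results.append(classifier(frame))
        else:
            results.append(None)        # quiet frame: skip the expensive model
    return results

calls = []

def fake_classifier(frame):
    """Placeholder for the expensive model; records that it was invoked."""
    calls.append(1)
    return "speech"

rng = np.random.default_rng(0)
silence = 1e-4 * rng.standard_normal((3, 400))   # roughly -80 dB
loud = 0.1 * rng.standard_normal((2, 400))       # roughly -20 dB
frames = np.concatenate([silence, loud])

results = run_gated(frames, fake_classifier)
print(results, len(calls))
```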
• Vision
• Optical character recognition
• Audio
• Automatic speech recognition
• Both look for structural information in
their inputs and decode them to a
character sequence
• Similar architectures
• CRNN (convolutional recurrent neural network)
• Transformer
Sequence Decoding
© 2022 DSP Concepts
https://distill.pub/2017/ctc/
19
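The slide links to an article on connectionist temporal classification (CTC), one common way to turn per-frame outputs into a character sequence for both OCR and speech recognition. Below is a minimal greedy decoder, assuming a toy alphabet in which index 0 is the blank symbol. Production systems typically use beam search and often a language model.

```python
import numpy as np

ALPHABET = ["-", "a", "c", "t"]     # index 0 is the CTC blank (assumed layout)

def ctc_greedy_decode(logits, alphabet=ALPHABET, blank=0):
    """Pick the best symbol per frame, collapse repeats, then drop blanks."""
    best = np.argmax(logits, axis=1)
    decoded = []
    previous = blank
    for idx in best:
        if idx != previous and idx != blank:
            decoded.append(alphabet[idx])
        previous = idx
    return "".join(decoded)

# One-hot "model outputs" for the frame-level path c c - a a - t.
path = [2, 2, 0, 1, 1, 0, 3]
logits = np.eye(len(ALPHABET))[path]
print(ctc_greedy_decode(logits))

# A blank between two identical symbols keeps both of them.
print(ctc_greedy_decode(np.eye(len(ALPHABET))[[1, 0, 1]]))
```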
• Vision
• Directly regress the image
• Audio
• Regress a gain mask which is
applied to audio stream
• Applied in a streaming fashion
• Window and hop
Restoration/Denoising
© 2022 DSP Concepts
https://www.mathworks.com/help/audio/ug/denoise-speech-using-deep-learning-networks.html
20
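The streaming gain-mask approach can be sketched with a window, a hop, and overlap-add. In a real system the mask would come from a trained network. Here `threshold_mask` is a crude stand-in, and the frame size, hop, and test signal are assumptions.

```python
import numpy as np

FRAME = 256
HOP = FRAME // 2
# Periodic Hann window: copies shifted by HOP sum to 1, so overlap-add
# reconstructs the signal when the mask is all ones.
WINDOW = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(FRAME) / FRAME)

def apply_gain_mask(noisy, predict_mask):
    """Frame by frame: FFT, multiply by a gain mask, inverse FFT, overlap-add."""
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - FRAME + 1, HOP):
        spectrum = np.fft.rfft(noisy[start : start + FRAME] * WINDOW)
        mask = predict_mask(np.abs(spectrum))      # one gain in [0, 1] per bin
        out[start : start + FRAME] += np.fft.irfft(spectrum * mask, n=FRAME)
    return out

def threshold_mask(magnitude):
    """Stand-in for a model: keep bins well above the median magnitude."""
    return (magnitude > 4.0 * np.median(magnitude)).astype(float)

rng = np.random.default_rng(0)
t = np.arange(8_192) / 8_000
clean = np.sin(2 * np.pi * 1_000 * t)
noisy = clean + 0.3 * rng.standard_normal(len(t))
denoised = apply_gain_mask(noisy, threshold_mask)

interior = slice(HOP, -HOP)                        # samples covered by two frames
print(np.mean((noisy[interior] - clean[interior]) ** 2))
print(np.mean((denoised[interior] - clean[interior]) ** 2))
```

Regressing a mask instead of the waveform keeps the model output bounded, and the noisy phase is reused unchanged.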
• Feature engineering
• After some preprocessing, audio and
vision features are more similar than not
• Implementation
• Windowing with a set window size and
hop allows us to deal with
streams of data
• Common problems in vision and
their audio analogues
• Classification
• Sequence decoding
• Restoration/denoising
Conclusion
© 2022 DSP Concepts 21
Resources
Getting Started with Audio
Audio Classification using Transfer Learning
https://www.tensorflow.org/tutorials/audio/transfer_learning_audio
Speech Command Recognition
https://www.tensorflow.org/tutorials/audio/simple_audio
Get a 30 Day Trial of Audio Weaver
https://w.dspconcepts.com/audio-weaver
© 2022 DSP Concepts 22
Audio Weaver accelerates audio feature development and enables collaboration across product teams. With
over 550 optimized processing modules, audio designs can be developed and implemented on hardware
without writing any DSP code.
The Audio Weaver Framework: Overview
© 2022 DSP Concepts
AWE Designer
Windows-based graphical design environment
• Standard Edition: Design GUI
• Pro Edition: Works with the MathWorks® MATLAB® platform
Audio IP Modules
Building blocks for product developers
• From low-level primitives to complete designs
• From DSP Concepts and our third-party partners
AWE Core
The embedded processing engine
• Optimized target-specific libraries
• Available for multiple processors
• Supports multicore and multi-instance implementation
23
