Fraunhofer IAIS Audio Mining: Automatic Metadata Generation of Audio Streams - Köhler, Joachim


  1. Fraunhofer IAIS Audio Mining: Automatic metadata generation of audio streams
     FIAT/IFTA Media Management Seminar, Lugano 2017
     Dr. Joachim Köhler, Head of Department NetMedia, Fraunhofer Institute for Intelligent Analysis and Information Systems
  2. Fraunhofer is the largest organization for applied research in Europe
     - More than 80 research institutions, including 69 Fraunhofer institutes
     - More than 24,500 employees, the majority educated in the natural sciences or engineering
     - An annual research volume of 2.1 billion euros, of which 1.9 billion euros is generated through contract research
     - 2/3 of this research revenue derives from contracts with industry and from publicly financed research projects
     - 1/3 is contributed by the German federal government and the Länder governments in the form of institutional financing
     - International collaboration through representative offices in Europe, the US, Asia and the Middle East
     © Fraunhofer IAIS
  3. Fraunhofer Institute Centre Schloss Birlinghoven
     International research in big data and cognitive computing
     - 600 interdisciplinary scientists in 3 institutes:
       - Fraunhofer Institute for Applied Information Technology FIT
       - Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS
       - Fraunhofer Institute for Algorithms and Scientific Computing SCAI
     - One of the largest research locations for applied computer science and mathematics in Germany
     - Close cooperation with regional universities
  4. Fraunhofer & digital archiving and broadcasting: activities, portfolio, networking
     - Several Fraunhofer institutes have contributed to many seminars of the German VFM on automatic metadata generation
     - Fraunhofer IAIS produced a study on future technologies for media archives and a concept for an innovative archive system: Media Data Hub
     - Participation in many European research projects (LIVE, AXES, CubRIK, LinkedTV, MiCO)
     - Workshop with directors of broadcast archives, 2012
     - Technology portfolio:
       - Music & Video Analytics (IDMT)
       - Audio Mining (IAIS)
       - Media Data Hub (IAIS)
       - Quality control and fingerprinting (IDMT)
     VFM Technology workshop 2012
  5. The Future of Media Archives: Strategic & Conceptual
     - Native crossmedia
       - Crossmedia from data model to UI
       - Using graph-based data models (e.g. Europeana)
     - Media Data Hub
       - Linking and integration of data silos
       - Bringing all metadata sources into one application (archive, legal, )
     - Massive automation of documentation
       - Manual annotation will be reduced and process-optimized
       - Future: up to 100% automatic annotation (as in press archives)
     - Close to the production environment
       - Search and access immediately after the production process
       - Interfacing to production systems (OpenMedia, Avid, etc.)
  6. Mining Technologies for Media Archiving
     (Report »Archive system of the future: Strategic innovation concept«, Fraunhofer 2014); technology readiness level (TRL)
     - Text mining
     - Audio mining
     - Video mining
     - Object & face recognition
     - Video OCR
     - Image similarity
     - Audio and video fingerprinting
     - Recommendation technologies
     - Interactive data visualization
     - Personalization and contextualization
     - Faceted search
     - Linking of information items
     Example rating by criterion, for the application "assisted dossier creation": maturity 4-5; integration and operating effort 3-5; added value 5
  7. Results of the 2nd FIAT/IFTA MAM Survey
  8. Fraunhofer IAIS Audio Mining
     - B2B speech recognition solution for the media industry
     - Key facts
       - Large-vocabulary continuous speech recognition (1,000,000 words) optimized for media content
       - Automatic structuring of audio-visual content
     - Applications along the media asset chain
       - Archive: indexing and transcription of media archive content
       - Online: search functionalities for media portals (e.g. InClip-Search) and content-based recommendation
       - TV distribution: subtitling for TV content
       - SocialTV: second-screen information enrichment
       - Advertising/marketing: video search engine optimization (VSEO) and contextualized advertisement
     © Fraunhofer Joachim.Koehler@iais.fraunhofer.de
  9. Audio Mining Solution: SPEECH TECHNOLOGY AND SOLUTION
  10. Audio Mining powered by Fraunhofer IAIS
      Feature | Advantage for customer
      Automatic speech segmentation | Fast browsing through long videos; finding relevant segments quickly
      Speaker clustering / speaker detection | Searching for segments with a specific speaker; searching for statements by person
      Speech recognition | Search for relevant videos; search within videos for the relevant section
      Keyword generation | Generate a tag cloud; get a rough summary of the video
  11. Speaker Diarization
      - From an unstructured audio recording to homogeneous segments
      - Stages shown in the diagram:
        - Detection of speech (segments labelled "Speech")
        - Detection of gender (male voice, female voice)
        - Speaker recognition (Speaker 1, Speaker 2)
      - Jingle recognition (e.g. program): start of news show
  12. Automatic Speech Recognition
      - Converts the speech signal into written text
      - Prerequisite for further steps (text mining)
      - Based on statistical models that are trained on large amounts of data
      - Three components:
        - Acoustic model (How do phonemes sound?)
        - Lexicon (How are words pronounced?)
        - Language model (Which words are probable?)
      - Automatic speech recognition computes the most probable word sequence
      Diagram labels: language model, lexicon, acoustic model, recognized text
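As a minimal sketch of how these components combine, the toy example below picks the most probable word sequence from a handful of candidates by adding an acoustic log-score and a weighted language-model log-score. The candidate sentences, all scores, and the `lm_weight` parameter are illustrative assumptions, not values or names from the Fraunhofer system, and a real recognizer searches a far larger hypothesis space.

```python
import math

# Hypothetical candidates: (word sequence, acoustic log-likelihood log P(X|W)).
candidates = [
    ("recognize speech", -12.0),
    ("wreck a nice beach", -11.5),
]

# Hypothetical language-model probabilities P(W) for the same candidates.
lm_probs = {
    "recognize speech": 1e-3,
    "wreck a nice beach": 1e-6,
}


def decode(candidates, lm_probs, lm_weight=1.0):
    """Return the candidate maximizing log P(X|W) + lm_weight * log P(W)."""
    def score(item):
        words, acoustic_logp = item
        return acoustic_logp + lm_weight * math.log(lm_probs[words])

    return max(candidates, key=score)[0]


best = decode(candidates, lm_probs)
acoustic_only = decode(candidates, lm_probs, lm_weight=0.0)
```

With the language model switched off (`lm_weight=0.0`), the acoustically slightly better but implausible candidate wins, which is why the language model matters.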
  13. Progress in Speech Recognition
      - Massive use of deep learning technology:
        - Improvement of acoustic modelling (many speakers, many speaking styles, etc.)
        - Gaussian mixture models (GMM) => deep neural networks (DNN)
      - Microsoft Research: Dahl, Deng, Acero (2012): Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
        - Reduction of error rate from 23% to 13%
  14. DNNs for Speech Recognition
  15. Speech recognition is currently one of the top technologies
      DNN-based applications from Amazon, Microsoft, Google & Co.
      - Amazon Alexa / Echo: 2016
      - Apple Siri: 2015
      - Google Now: 2015
      - Microsoft Cortana: 2016
  16. Deep Learning: a game changer towards artificial intelligence
      big data + machine learning = progress in AI
      - Speech recognition
      - Image recognition
      - Text understanding
      - Machine translation
      - Breast cancer diagnostics
      - Game play
      Sources: Y. Bengio, ML tutorial, KDD 2014; S. Jones, nvidia blog, 2014; Microsoft Research, 2014; Ciresan et al., Proc MICCAI, 2013; Mnih et al., Nature, 2015; Xiong et al., Science 2015
  17. Speech Recognition System Setup (German) powered by Fraunhofer IAIS
      - Acoustic training data: GER-TV 1000h (LREC 2014)
      - Language model training data: 71.8 M words (news domain)
      - Competitive on the German market, English system in progress
      - Uses deep neural networks (DNNs) for acoustic modelling (instead of Gaussian mixture models)
      - Stable, continuous improvement; integration of up-to-date research results
      Year | System | Acoustic model | Language model | Training data [h] | WER [%] planned speech | WER [%] spontaneous speech
      2012/13 | 1 | GMM | 3-gram, 200k | 105 | 26.4 | 33.5
      2013/14 | 2 | GMM | 3-gram, 200k | 323 | 24.0 | 31.1
      2014 | 3 | DNN | 3-gram, 200k | 323 | 18.4 | 22.6
      2015 | 4 | DNN | 5-gram, 510k | 1005 | 13.3 | 16.5
      2016 | 5 | RNN | 5-gram, 510k | 1005 | 11.9 | 14.5
      GMM: Gaussian mixture model; DNN: deep neural network; RNN: recurrent neural network; WER: word error rate
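The WER figures in the table above are word error rates. As a sketch of the standard definition (word-level edit distance divided by the number of reference words), the following example computes WER for a short phrase and the relative improvement between two table rows. The function names are chosen here for illustration, and the slides do not say which scoring tool Fraunhofer used.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


def relative_reduction(old_wer, new_wer):
    """Relative WER improvement when moving from old_wer to new_wer."""
    return (old_wer - new_wer) / old_wer


# One substituted word out of four reference words.
wer = word_error_rate("in die Vorlesung kam", "in der Vorlesung kam")

# Planned-speech WER of system 4 (DNN, 13.3%) versus system 5 (RNN, 11.9%).
gain = relative_reduction(13.3, 11.9)
```

Here `wer` is 0.25, and `gain` is roughly 0.105, which appears to be in line with the "about 10% relative reduction" reported for RNN-CTC.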
  18. Ongoing Research on RNN-CTC: Beyond HMM and HMM-DNN Approaches
      - RNN-CTC: recurrent neural network with Connectionist Temporal Classification. What's new: speech recognition is solved as an end-to-end machine learning task; everything is a (deep) recurrent neural network (RNN)
      - 1000 h speech corpus, ~2 weeks training time on a GPU cluster
      - About 10% relative reduction in WER on average with RNN-CTC
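To illustrate what CTC output looks like, here is a small sketch of the usual post-processing step: the network emits one label per audio frame, including a special blank label, and the final text is obtained by merging consecutive repeats and then dropping blanks. The `_` blank symbol and the frame labels are made up for this example. Greedy (best-path) decoding is assumed here for simplicity; the slides do not describe the decoder actually used.

```python
from itertools import groupby

BLANK = "_"  # symbol chosen here to stand for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove blank labels."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# Hypothetical per-frame best labels for the word "hallo".
# The blank between the two "l" frames keeps the double letter from merging.
text = ctc_collapse("hh_al_lo")
```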
  19. Speaker Recognition using iVectors
      [Figure: two iVectors displayed as arrays of numeric components]
      iVector comparison: Sebastian Kurz, confidence: 0.05
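The slide does not state how the two iVectors are compared. One common choice for fixed-length speaker vectors is cosine similarity, sketched below under that assumption. The vectors and the `threshold` value are invented for illustration; real iVectors have several hundred dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.5):
    """Hypothetical decision rule: accept if the similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


# Made-up low-dimensional stand-ins for an enrolled and a test speaker vector.
enrolled = [2.5, -3.9, -1.6, -2.8]
test_vec = [2.4, -3.5, -1.9, -2.6]
match = same_speaker(enrolled, test_vec)
```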
  20. Fraunhofer IAIS Audio Mining: Technology
      - Speaker diarization to structure recordings automatically (e.g. speaker information)
      - ASR system based on the KALDI open-source package
      - Using deep neural networks
      - Completely speaker-independent
      - Real-time processing
      - Trained on a large-scale German broadcast database of 1000 hours
      - Service-oriented architecture to control and run the recognition engine
      Architecture diagram labels: Web services, Messaging, Audio Mining core, Audio Mining Monitor, iFinder (Structural Analysis x3, Automatic Speech Recognition x3)
  21. Audio Mining Solution: USER INTERFACE
  22. GUI: Media Search Interface
      Search functionality: Find audio and video files with specific keywords, specific words in the title or the transcript, or a specific series name.
  23. GUI: Segmentation, Subtitles, Preview
      - Preview functionality: Select a media file from the right-hand side to watch or listen to it.
      - Subtitles: Audio Mining creates subtitles based on the transcript and the structural analysis results.
      - Segmentation/speaker clustering: Audio Mining detects whenever the speaker changes and divides the media file into multiple segments. Jump to a specific segment by clicking the timeline.
  24. GUI: Word Positioning, Snippets
      - Advanced search functionality: You can also search for a specific word inside the transcript.
      - Word occurrences: Marks indicate the occurrences of the search term. Click on a mark to jump to the corresponding position in the video.
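Jumping to word occurrences relies on the transcript carrying a time code for every recognized word. A minimal sketch of that lookup, assuming the transcript is available as a list of `(word, start_seconds)` pairs (a data layout invented here, not the product's actual format):

```python
def find_occurrences(transcript, term):
    """Return the start times (in seconds) of every word equal to term, ignoring case."""
    needle = term.lower()
    return [start for word, start in transcript if word.lower() == needle]


# Hypothetical word-level transcript with start times in seconds.
transcript = [
    ("bevor", 12.4),
    ("der", 12.8),
    ("Numerus", 13.0),
    ("clausus", 13.5),
    ("in", 14.1),
    ("der", 14.3),
    ("Medizin", 14.5),
]

marks = find_occurrences(transcript, "Der")
```

Each returned time could be drawn as a mark on the timeline and used as the seek position when the mark is clicked.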
  25. GUI: Keywords
      Keywords: Audio Mining generates keywords for every media file, based on particularly relevant words in the transcript. Again, marks indicate the occurrences of the keyword. Click on a mark to jump to the corresponding position in the video.
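The slides do not say how "particularly relevant" words are chosen. A simple and widely used heuristic is TF-IDF: words that are frequent in one transcript but rare across the collection score highest. The sketch below uses that heuristic as an assumption; the function name and the toy transcripts are illustrative only.

```python
import math
from collections import Counter


def tfidf_keywords(transcripts, doc_index, top_n=3):
    """Rank the words of one transcript by term frequency times inverse document frequency."""
    docs = [text.lower().split() for text in transcripts]
    n_docs = len(docs)
    # Number of transcripts each word appears in.
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))
    term_freq = Counter(docs[doc_index])
    scores = {
        word: count * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


transcripts = [
    "the election results the election campaign",
    "the weather forecast the rain",
    "the football match the goal",
]
keywords = tfidf_keywords(transcripts, 0, top_n=2)
```

A word such as "the" that occurs in every transcript gets a score of zero and drops to the bottom of the ranking.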
  26. GUI: Full Transcript
      Transcript: Audio Mining provides a transcript for every media file. Again, the video or audio file is divided into segments. Different colours indicate different speakers. You can export the transcript to different file formats.
  27. GUI: Recommendation
      Recommendations: Have you just watched an exciting video and are now looking for a similar one? No problem! Audio Mining recommends related media files, based on the similarity of their keywords.
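How keyword similarity is measured is not specified in the slides. One straightforward option is the Jaccard index (shared keywords divided by all keywords of both files), sketched below as an assumption. Asset names and keyword sets are made up.

```python
def jaccard(a, b):
    """Jaccard index of two keyword collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(keywords_by_asset, asset_id, top_n=2):
    """Return up to top_n other assets, most similar first, skipping unrelated ones."""
    query = keywords_by_asset[asset_id]
    scored = [
        (jaccard(query, keywords), other)
        for other, keywords in keywords_by_asset.items()
        if other != asset_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored[:top_n] if score > 0]


# Hypothetical keyword sets as they might come out of keyword generation.
assets = {
    "news_1": ["election", "parliament", "vote"],
    "news_2": ["election", "vote", "campaign"],
    "sports_1": ["football", "goal"],
    "talk_1": ["parliament", "debate"],
}
related = recommend(assets, "news_1")
```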
  28. Audio Mining: Status
      Demo system: https://nm-demo.iais.fraunhofer.de/customer_demo/
      - Fraunhofer IAIS provides web-based test accounts for interested customers
        - https://nm-demo.iais.fraunhofer.de/$TV-station
        - HR, SWR, BR, RBB, ZDF, …
      - Easy to use, simple upload functionality
      - Positive feedback
        - Segmentation and speaker diarization are very useful (improvement possible)
        - ASR quality is good for many types of radio and TV programmes
        - Keyword search and keyword access are rated very positively
        - Full transcript is useful
        - Keyword generation is an interesting alternative to a summary and a fixed semantic vocabulary
        - Export in several formats is possible
  29. Audio Mining: Challenges and Research Issues
      Feedback from media archive professionals of ARD
      - Overlapping speech segments, voice-over
        - Short speaker turns are difficult to detect
        - Overlapping speech segments reduce ASR quality (“talk show”)
        - Voice-over: starts in language 1, continues in language 2
        - Hard to solve
      - Background noise, noisy conditions
        - Noise degrades ASR quality
        - Solutions: data augmentation, speech enhancement
      - Very open domains, unlimited vocabulary, out-of-vocabulary problem, names
        - Regular updates of the language models required (e.g. “Incirlik”, “James Comey”)
      - Mixed/multiple languages
        - Foreign names (ARD pronunciation dictionary)
      - Dialects
        - BR provides several dialects of the German language for research work
      - Punctuation marks are required
  30. Audio Mining Solution: SYSTEM ARCHITECTURE
  31. System architecture
      Components: Clients (e.g. AREMA), Web interface, Web services, Messaging, Audio Mining Monitor, Audio Mining core, iFinder
      Message flows shown in the diagram:
      - Import, analysis, status and deletion requests (↓); asset status, details (↑)
      - Analysis priorities (↓, ←); asset details, processing updates, deletion updates (↑, →)
      - Analysis requests (↓); analysis results (↑)
  32. System architecture: Audio Mining core
      Components: Clients (e.g. AREMA), Web interface, Web services, Messaging, Audio Mining Monitor, Audio Mining core, iFinder
      Audio Mining core: database, search index, file system
  33. System architecture: iFinder
      Components: Clients (e.g. AREMA), Web interface, Web services, Messaging, Audio Mining Monitor, Audio Mining core, iFinder
      iFinder: Structural Analysis (x3), Automatic Speech Recognition (x3)
  34. System architecture: Audio Mining Monitor
      Components: Clients (e.g. AREMA), Web interface, Web services, Messaging, Audio Mining Monitor, Audio Mining core, iFinder
      Audio Mining Monitor: database, HTTP server, messaging server
  35. Infrastructure and Scalability
      - Server (1): scheduling and media repository
        - VM, ≥ 2 cores (≥ 2 GHz, 64-bit), 30 GB RAM
        - SLES, JRE 8, MySQL, Bash 4
      - Server (2): audio analyses
        - Processing capacity per core (AMD Opteron 6234): 17 h of audio material per day; 4 GB RAM
        - For 20 h of audio data per day:
          - ≥ 2 cores (≥ 2 GHz, 64-bit), 8 GB RAM
          - SLES
      - Audio processing is fully scalable
        - Tested on 480 cores to process several thousand hours/day
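The sizing figures above can be checked with simple arithmetic. The sketch below assumes throughput scales linearly with the number of cores, as the "fully scalable" claim suggests, and uses the quoted 17 hours of audio per core per day. The function names are illustrative.

```python
import math

HOURS_PER_CORE_PER_DAY = 17  # figure quoted on the slide for one AMD Opteron 6234 core


def cores_needed(audio_hours_per_day, hours_per_core=HOURS_PER_CORE_PER_DAY):
    """Smallest whole number of cores that can process the daily audio volume."""
    return math.ceil(audio_hours_per_day / hours_per_core)


def daily_capacity(cores, hours_per_core=HOURS_PER_CORE_PER_DAY):
    """Hours of audio that the given number of cores can process per day."""
    return cores * hours_per_core


small_setup = cores_needed(20)       # the slide's 20 h/day example
large_setup = daily_capacity(480)    # the slide's 480-core test
```

`small_setup` comes out as 2, matching the "≥ 2 cores" recommendation, and `large_setup` as 8160 hours, which fits "several thousand hours/day".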
  36. Audio Mining Solution: REFERENCE PROJECTS
  37. Speech Recognition for Media Archiving powered by Fraunhofer IAIS
      Customer: WDR, German broadcaster (archive department)
      Project facts:
      - Integration of the Fraunhofer IAIS Audio Mining system into the WDR IT environment (ARCHIMEDES and IVZ)
      - Content mining of large amounts of AV data, immediately!
      - Better navigation and segmentation of radio and TV material
      - Search in spoken utterances
      - Full transcription and keyword generation
      Technology provided by Fraunhofer:
      - Broadcast speech recognition
      - Automatic speech segmentation
      Diagram labels: Structured preparation, Speech Recognition, Structured Segmentation
  38. Content Analytics for ARD Mediathek: Artificial Intelligence powered by Fraunhofer
      - Content analytics of 200,000 media assets
      - Advanced search and retrieval capabilities
      - Full transcription of multimedia content
      - Daily processing of 2000 new media assets from radio and TV
      - Core technology for recommendation and personalization services
      - Link: http://www.ardmediathek.de
  39. Speech Recognition for the “ARD-Mediathek” powered by Fraunhofer IAIS
      Customer: SWR / Redaktion ARD.de (link: www.ardmediathek.de), 2014/15
      Project facts:
      - Processing of 200,000 media assets (average duration 15 minutes/asset)
      - Service-based (crawling, processing, metadata transfer)
      - Daily amount: 2000 assets (update mechanism every 60 minutes)
      Technology provided by Fraunhofer:
      - Speaker diarization, speech recognition, keyword extraction
  40. News-Stream: real-time analysis of heterogeneous news streams
      Objectives
      - Big data infrastructure for efficient, real-time analysis of heterogeneous news streams
      - Semantic analysis of multimodal and unstructured news data
      - Piloting in real-life scenarios
      Technologies and applications
      - Real-time speaker recognition
      - Audio “citation” search
      - Heatmap & social media monitoring, …
      - Project duration: 09/2014 to 12/2017
      http://newsstreamproject.org/
  41. KA3: Cologne Centre for Analysis and Archiving of AV Data
      Centre project of the German BMBF eHumanities programme
      - Project objectives
        - Creation of a centre for eHumanities research in Germany with a focus on AV data
      - Contribution of Fraunhofer IAIS
        - Developing and providing tools for automatic analysis of speech and audio recordings (oral history scenario, interaction scenario)
        - Use case 1: oral history
        - Use case 2: interaction scenario
      - Duration: 10/2015 – 09/2018
      - Partners: Univ. Köln, MPI for Psycholinguistics, Fernuniversität in Hagen
  42. KA3: Use Case Interaction Scenario
      Challenges:
      - Very fast dialogues, short speaker turns
      - Backchannel sounds (“mmh”, “hmm”, “ja”, …)
      - Overlapping speech segments
      Technologies:
      - Improved speaker clustering
      - Speech/non-speech segmentation with deep learning
      - Overlapping speech segments with RNN
      - Automatic segmentation of speech recordings
      Figure labels: arbitrary number of speakers; max. 2 speakers; 2 speakers; references
  43. KA3: Use Case Oral History
      Speech recognition: reference and ASR output. Example: Kruse (clean recording)
      - Passage 1, reference: zwischendrin hatte ich natürlich auch versucht noch mit bei der Medizin zu landen das war aber damals deswegen so schwierig weil das glaube ich ein Jahr war bevor der Numerus clausus in der Medizin eingeführt wurde und man musste so mit sechshundert Anfängern ungefähr um sechs Uhr auf der Treppe sitzen damit man um acht Uhr in die Vorlesung kam und das war für mich
      - Passage 1, ASR output: zwischendrin hatte ich natürlich auch versucht sich noch beim bei der Medizin zu landen das war aber damals deswegen so schwierig weil das glaube ich ein Jahr war bevor der Numerus clausus in der Medizin eingeführt wurde und man musste somit sechshundert Anfängern ungefähr um sechs Uhr auf der Treppe sitzen damit man um acht Uhr in der Vorlesung kam und das war für mich
      - Passage 2, reference: dann habe ich dieses Studium abgeschlossen und hatte mich kurz auch mal dafür interessiert in eine Berufstätigkeit im Entwicklungsdienst deutscher Entwicklungsdienst hieß das glaube ich einzusteigen hatte aber auch gleichzeitig so einen Hiwi-Job am Institut und so blieb ich dann hängen und hatte eben einfach die Chance weil man dann auch gefördert wird oder die Chance hat in einem bestimmten Projekt zu arbeiten dass ich dann daran gedacht habe zu promovieren
      - Passage 2, ASR output: dann habe ich dieses Studium abgeschlossen und hatte mich kurz auch mal dafür interessiert in eine Berufstätigkeit im indem ein Entwicklungsland Deutscher Entwicklungsdienst hieß das glaube ich einzusteigen hatte aber auch gleichzeitig so ein ein Hiwi Job am Institut und so lieblich dann hängen und hatte eben einfach die Chance weil man dann auch gefördert wird oder die Chance hat _ einen bestimmten Projekt zu arbeiten dass ich dann daran gedacht habe zu promovieren
  44. KA3/Newsstream: Forced Alignment & Editing of Transcripts
      - If a complete and almost perfect transcript is available, the missing time codes are generated by forced alignment
      - Input: audio file, transcript
      - Output: segmentation file (MPEG-7, ELAN)
      - Part of iFinder 3.0
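The core idea of forced alignment can be sketched as a small dynamic program: the word order is fixed by the transcript, so only the word boundaries in time have to be found. The toy version below assumes a matrix of per-frame log-scores for each word, which in a real system would come from the acoustic model and be computed at phoneme or HMM-state level. The scores, function names, and `frame_seconds` parameter are invented for illustration, and no MPEG-7 or ELAN file is written.

```python
def forced_align(frame_scores):
    """Viterbi alignment of frames to words in fixed order.

    frame_scores[t][w] is a log-score that frame t belongs to word w.
    Every word receives at least one frame. Returns one word index per frame.
    """
    n_frames = len(frame_scores)
    n_words = len(frame_scores[0])
    if n_frames < n_words:
        raise ValueError("need at least one frame per word")
    neg_inf = float("-inf")
    best = [[neg_inf] * n_words for _ in range(n_frames)]
    advanced = [[False] * n_words for _ in range(n_frames)]
    best[0][0] = frame_scores[0][0]  # the first frame must belong to the first word
    for t in range(1, n_frames):
        for w in range(n_words):
            stay = best[t - 1][w]
            move = best[t - 1][w - 1] if w > 0 else neg_inf
            if move > stay:
                best[t][w] = move + frame_scores[t][w]
                advanced[t][w] = True
            else:
                best[t][w] = stay + frame_scores[t][w]
    # Backtrack from the last frame, which must belong to the last word.
    path = [0] * n_frames
    w = n_words - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = w
        if advanced[t][w]:
            w -= 1
    return path


def to_segments(words, path, frame_seconds=0.01):
    """Turn a frame-to-word path into (word, start_seconds, end_seconds) tuples."""
    segments = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((words[path[start]], start * frame_seconds, t * frame_seconds))
            start = t
    return segments


# Made-up scores: frames 0-1 fit the first word, frames 2-3 the second.
scores = [[0, -5], [0, -5], [-5, 0], [-5, 0]]
path = forced_align(scores)
segments = to_segments(["guten", "tag"], path, frame_seconds=0.5)
```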
  45. Summary and Outlook
      Summary
      - Deep learning and large corpora have led to massive progress in Speech2Text
      - Speech2Text provides good, though not perfect, transcription quality for broadcast speech (about 10% error)
      - Audio Mining is more than S2T: speech segmentation, speaker recognition, citations, …
      - Many advantages: annotation costs, immediate availability, more details and time codes
      - Some disadvantages: challenging recording conditions, explosion of metadata
      - Conclusion: Audio Mining/S2T has gained acceptance!
      - Test account possible: https://nm-demo.iais.fraunhofer.de/customer_demo
      Outlook
      - Several research issues are still open (dialects, overlapping speech segments, …)
      - Further improvement is expected (evaluation of deep learning, more data, engineering)
      - Important issue: integration into MAM workflows
  46. Let's do more with your data!
      Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS
      www.iais.fraunhofer.de
      Link: https://www.iais.fraunhofer.de/audiomining.html
      Contact: Dr. Joachim Köhler, Head of Image Processing, +49 (0)2241 14-1900, joachim.koehler@iais.fraunhofer.de
  47. Disclaimer
      Copyright © by Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V., Hansastraße 27 c, 80686 Munich, Germany. All rights reserved.
      Responsible contact is: Katrin Berkler | Silke Loh | Public Relations | pr@iais.fraunhofer.de
      All copyrights for this presentation and its content are owned in full by the Fraunhofer-Gesellschaft, unless expressly indicated otherwise. Each presentation may be used for personal editorial purposes only. Modifications of images and text are not permitted. Any download or printed copy of this presentation material shall not be distributed or used for commercial purposes without prior consent of the Fraunhofer-Gesellschaft.
      Notwithstanding the above, the presentation may only be used for reporting on Fraunhofer-Gesellschaft and its institutes free of charge, provided that source references to Fraunhofer’s copyright are included correctly and that two free copies of the publication are sent to the above-mentioned address.
      The Fraunhofer-Gesellschaft undertakes reasonable efforts to ensure that the contents of its presentations are accurate, complete and kept up to date. Nevertheless, the possibility of errors cannot be entirely ruled out. The Fraunhofer-Gesellschaft does not give any warranty in respect of the timeliness, accuracy or completeness of material published in its presentations, and disclaims all liability for (material or non-material) loss or damage arising from the use of content obtained from the presentations. The aforementioned disclaimer includes damages of third parties.
      Registered trademarks, names, and copyrighted text and images are not generally indicated as such in the presentations of the Fraunhofer-Gesellschaft. However, the absence of such indications in no way implies that these names, images or text belong to the public domain and may be used unrestrictedly with regard to trademark or copyright law.
