This document discusses using a limited speech corpus of recordings from Dutch news anchor Philip Bloemendal to develop a text-to-speech (TTS) engine. It evaluates how much of the Dutch language can be synthesized using the corpus and methods to improve it, like finding synonyms and decompounding compounds. It also explores using neural networks to colorize old black-and-white video footage from the archive to make it more engaging for viewers. While the TTS engine works well for common words, full sentences have lower coverage, and colorization introduces artifacts but can increase attention to the archive's collection.
3. Beeld en Geluid
• collects, preserves and opens up the Dutch audiovisual heritage to as
many users as possible
• one of the largest audiovisual archives in Europe. The institute
manages over 70 percent of the Dutch audiovisual heritage
• Was interested in ways to re-use old Polygoonjournaals footage
• Text-To-Speech engine based on Philip Bloemendal
5. Research
• Can the current corpus of audio recordings of Bloemendal be used
to construct a TTS engine?
• What percentage of the Dutch language can be constructed with the
current corpus?
• What can we do to improve?
• How recognizable is the text-to-speech engine as Philip Bloemendal?
• How comprehensible are the constructed audio files?
6. What percentage of the Dutch language can
be constructed with the current corpus?
• Constructing the corpus
• How many ‘Polygoonjournaals’
• Openbeelden – OAI (Open Archives Initiative)
• Extract audio
• Speech analysis – roughly 35000 distinct words
• XML files
• Evaluation
• Metrics
• Corpora
• Language changes
7. What percentage of the Dutch language can
be constructed with the current corpus?
• Approach: 4 corpora to test against
• Contemporary news articles (same domain, different time) | 50 articles
• News articles from the 1970s (same domain, time) | 50 articles
• E-books (different domain, various times) | 6 books
• Tweets (different domain, different time) | 1000 tweets
• Evaluation
• Number of distinct words
• Number of sentences
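The two evaluation metrics above can be sketched as follows. This is a minimal illustration, not the code used in the study: the tokenizer, the sentence splitter and the tiny `vocabulary` set are assumptions made for the example. A word counts as covered when it occurs in the corpus vocabulary; a sentence counts as covered only when every one of its words does.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters only, Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (an assumption)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def word_coverage(text, vocabulary):
    """Fraction of distinct words in the text that occur in the vocabulary."""
    words = set(tokenize(text))
    if not words:
        return 0.0
    return len(words & vocabulary) / len(words)

def sentence_coverage(text, vocabulary):
    """Fraction of sentences in which every word occurs in the vocabulary."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    covered = sum(1 for s in sentences if set(tokenize(s)) <= vocabulary)
    return covered / len(sentences)

# Hypothetical toy vocabulary standing in for the Bloemendal corpus.
vocabulary = {"de", "koningin", "bezoekt", "het", "schip"}
sample = "De koningin bezoekt het schip. De koningin twittert."

print(word_coverage(sample, vocabulary))      # 5 of 6 distinct words are known
print(sentence_coverage(sample, vocabulary))  # 1 of 2 sentences is fully known
```

A single unknown word makes a whole sentence unusable, which is why sentence coverage is expected to be much lower than word coverage.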
8. What can we do to improve performance?
• It is to be expected that many (contemporary) words have not
been pronounced by Philip
• Various approaches
• Change format (lowercase, diaereses)
• Numbers
• Finding synonyms
• Decompounding
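The format change and the synonym fallback can be sketched as below. This is an illustration under stated assumptions: the `synonyms` dictionary is a hypothetical stand-in for a real synonym resource, the vocabulary is assumed to be stored in normalized form, and number handling is left out. Note that stripping combining marks removes all diacritics, not only the diaeresis.

```python
import unicodedata

def normalize(word):
    """Lowercase a word and strip diacritics such as the diaeresis (ë -> e)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def find_in_corpus(word, vocabulary, synonyms):
    """Return a corpus word that can stand in for `word`, or None."""
    candidate = normalize(word)
    if candidate in vocabulary:
        return candidate
    # Fall back to synonyms (hypothetical lookup table) that are in the corpus.
    for alternative in synonyms.get(candidate, []):
        alternative = normalize(alternative)
        if alternative in vocabulary:
            return alternative
    return None

# Hypothetical toy data for illustration only.
vocabulary = {"belgie", "auto"}
synonyms = {"wagen": ["auto"]}

print(find_in_corpus("België", vocabulary, synonyms))
print(find_in_corpus("Wagen", vocabulary, synonyms))
print(find_in_corpus("smartphone", vocabulary, synonyms))
```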
10. Decompounding
• Dutch language allows for compounding words
• School, hoofd -> schoolhoofd
• Regen, water -> regenwater
• Staat, hoofd -> staatshoofd (with linking "s")
• Each word is distinct in the corpus
• Decompounding is computationally expensive, especially for large corpora and long words
• Constructed Bigrams and Trigrams
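A minimal sketch of dictionary-based decompounding, assuming a set of known corpus words. The source does not specify the algorithm that was used, so the recursive split, the `min_len` threshold and the handling of the linking "s" are assumptions made for illustration.

```python
def decompound(word, vocabulary, min_len=3):
    """Split `word` into known parts, allowing a linking 's' between parts.

    Returns a list of parts, or None when no split into known words exists.
    """
    word = word.lower()
    if word in vocabulary:
        return [word]
    # Try every split point that leaves at least `min_len` letters on each side.
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head not in vocabulary:
            continue
        rest = decompound(tail, vocabulary, min_len)
        if rest is not None:
            return [head] + rest
        # Retry with a linking 's' dropped (staat + s + hoofd).
        if tail.startswith("s") and len(tail) - 1 >= min_len:
            rest = decompound(tail[1:], vocabulary, min_len)
            if rest is not None:
                return [head] + rest
    return None

# Hypothetical toy vocabulary.
vocabulary = {"school", "hoofd", "regen", "water", "staat"}

print(decompound("schoolhoofd", vocabulary))
print(decompound("regenwater", vocabulary))
print(decompound("staatshoofd", vocabulary))
print(decompound("computer", vocabulary))
```

Every candidate split point triggers a further recursive search, which illustrates why the cost grows quickly with word length and corpus size.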
11. Results (words)
Dataset | Unique words | Unique words found | After synsets | After decompounding
Contemporary news | 2743 | 2019 | 2106 | 2448
Old news | 16191 | 7703 | 8261 | 11541
Tweets | 27180 | 7692 | 8446 | 13440
Books | 26575 | 11440 | 12922 | 20207
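The counts in the table can be turned into coverage percentages with a few lines of arithmetic. The sketch below only restates the numbers above; the function name and the rounding to one decimal are choices made for this example.

```python
# (unique words, found, after synsets, after decompounding), taken from the table.
results = {
    "Contemporary news": (2743, 2019, 2106, 2448),
    "Old news": (16191, 7703, 8261, 11541),
    "Tweets": (27180, 7692, 8446, 13440),
    "Books": (26575, 11440, 12922, 20207),
}

def coverage_percentages(total, found, after_synsets, after_decompounding):
    """Return the three coverage figures as percentages of the unique words."""
    return tuple(
        round(100 * n / total, 1)
        for n in (found, after_synsets, after_decompounding)
    )

for name, row in results.items():
    print(name, coverage_percentages(*row))
```

For contemporary news, for instance, coverage goes from roughly 73.6% of unique words to 89.2% after decompounding.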
13. How comprehensible / recognizable are
the sentences?
• 8 people tested the software
• Philip was recognized (or ‘that news guy’)
• Words with more consonants were easier to recognize
• Recognition was higher when users entered their own sentences
• Recognition was lower when sentences were played without subtitles
• Speed of software / GUI limited testing capabilities
14. The use of Deep Neural
Networks in colorizing video
Rudy Marsman | VU University | NISV
15. Neural Networks
• Recent progress in computational power made implementation of
Deep Neural Nets possible
• A neural net trained on a large training set can make accurate
predictions on real-world examples
16. Zhang et al.
• Richard Zhang et al. trained a neural net to colorize images
• Trained on over a million images
• Fools humans into thinking a colorized photo is the original 20% of the time
• Resizes image to fit input layer of 200x200 pixels
• Gained popularity on news websites and forums
18. Implementation on video
• Extract individual frames from video using FFmpeg
• Colorize each individual frame
• Re-compile video and attach original audio file
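The three steps above can be sketched as a small wrapper around the FFmpeg command line. This is an illustration, not the original implementation: the file names, the frame rate of 25 fps, the H.264 output settings and the `colorize_frame` callback (a placeholder for the neural network) are all assumptions, and copying the audio stream unchanged only works when the output container supports the source audio codec.

```python
import subprocess
from pathlib import Path

def extract_frames_cmd(video, frames_dir):
    """FFmpeg command that writes every frame as a numbered PNG."""
    return ["ffmpeg", "-i", str(video), str(Path(frames_dir) / "%06d.png")]

def rebuild_video_cmd(frames_dir, original_video, output, fps=25):
    """FFmpeg command that joins frames and reuses the original audio track."""
    return [
        "ffmpeg",
        "-framerate", str(fps),                    # assumed source frame rate
        "-i", str(Path(frames_dir) / "%06d.png"),  # input 0: colorized frames
        "-i", str(original_video),                 # input 1: original (audio)
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "copy",
        "-shortest",
        str(output),
    ]

def colorize_video(video, output, colorize_frame, workdir="work", fps=25):
    """Extract frames, colorize each one independently, then rebuild the video.

    `colorize_frame(src_path, dst_path)` is a hypothetical callback that wraps
    the colorization network.
    """
    gray_dir = Path(workdir) / "gray"
    color_dir = Path(workdir) / "color"
    gray_dir.mkdir(parents=True, exist_ok=True)
    color_dir.mkdir(parents=True, exist_ok=True)

    subprocess.run(extract_frames_cmd(video, gray_dir), check=True)
    for frame in sorted(gray_dir.glob("*.png")):
        colorize_frame(frame, color_dir / frame.name)
    subprocess.run(rebuild_video_cmd(color_dir, video, output, fps), check=True)
```

Building the commands in separate functions keeps them easy to inspect and adjust before anything is executed.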
20. Applications
• Colorized videos are more ‘tangible’ and ‘alive’ than black/white
• Showing colorized Polygoonjournaals can augment TTS engine
• General positive responses on technology may increase attention
to NISV collection
• NISV employees were enthusiastic
21. Issues
• Each frame is treated as independent and colorized separately
• Artifacts appear between frames
• Slow performance without use of Nvidia GPU
• Low resolution
• Predicted colors still far from perfect
22. Conclusions
• Current corpus covers many frequently used words
• The various implemented approaches increase coverage
• Low coverage for sentences -> real world approach may need
improvement
• Audio is recognizable and understandable
• Neural Networks may be used to colorize video footage