The document discusses using speech technology to help disclose and provide access to audiovisual archives by automatically generating time-stamped content descriptions. The main challenge is a large backlog of undisclosed analog material with minimal descriptions. The approach involves digitizing content, adding metadata, and using speech recognition to generate content descriptions when transcripts are not available. This allows online retrieval of archive fragments and reduces the human effort needed for annotation. The project tests the approach on the Radio Rijnmond archives, over 60,000 hours of broadcasts.
15. [Diagram: Trusted Digital Repository architecture. Components: Feeder System, Workflow Controller, Job Queue, File Storage, Characterisation, Preservation Planning, Migration, Technical Registry, Active Preservation, Passive Preservation, Data Management, Access, Reporting, Storage Adaptor, Ingest Toolkit, Preservation Controller, Metadata Store. Actors: User, Administrator, Archivist.]
21. Alignment: speech signal + typed transcript ("Landgenooten waar ik enkele ...")

    Begin frame #   End frame #   Word
    00000           54400         -silence-
    54400           65280         Landgenooten
    65280           69120         waar
    69120           73600         ik
    73600           79520         enkele
    …               …             …
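A minimal sketch of how such an alignment result can be represented and queried. The frame numbers are taken from the table above; the frame rate itself is not specified here, and the AlignedWord/locate names are illustrative, not part of any real toolkit:

    from dataclasses import dataclass

    @dataclass
    class AlignedWord:
        begin_frame: int  # first frame of the word in the audio
        end_frame: int    # first frame after the word
        word: str

    # Entries taken from the alignment table above.
    index = [
        AlignedWord(0, 54400, "-silence-"),
        AlignedWord(54400, 65280, "Landgenooten"),
        AlignedWord(65280, 69120, "waar"),
        AlignedWord(69120, 73600, "ik"),
        AlignedWord(73600, 79520, "enkele"),
    ]

    def locate(index, query):
        """Return the frame spans where the query word was spoken."""
        return [(w.begin_frame, w.end_frame)
                for w in index if w.word.lower() == query.lower()]

    print(locate(index, "landgenooten"))  # [(54400, 65280)]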
22. Automatic speech recognition
    Pre-processing: classification speech/non-speech; segmentation of speakers
    Speech recognition: acoustic model (50+ hours of audio), language model (250-500 M words), pronunciation dictionary
    2nd recognition pass with adapted models
    Output: word-level index
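Real systems use trained classifiers for the speech/non-speech step; as a toy stand-in that only illustrates the idea, a short-time energy threshold can mark which frames are worth feeding to the recognizer (the frame and hop sizes below assume 16 kHz audio and are illustrative):

    def frame_energies(samples, frame_len=400, hop=160):
        """Short-time energy per frame (25 ms frames, 10 ms hop at 16 kHz)."""
        energies = []
        for start in range(0, len(samples) - frame_len + 1, hop):
            frame = samples[start:start + frame_len]
            energies.append(sum(s * s for s in frame) / frame_len)
        return energies

    def speech_segments(samples, threshold):
        """Threshold the per-frame energy and merge consecutive
        speech frames into (start_frame, end_frame) segments."""
        flags = [e >= threshold for e in frame_energies(samples)]
        segments, start = [], None
        for i, is_speech in enumerate(flags):
            if is_speech and start is None:
                start = i
            elif not is_speech and start is not None:
                segments.append((start, i))
                start = None
        if start is not None:
            segments.append((start, len(flags)))
        return segments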
Editor's Notes
The content descriptions are dense, generally one per hour of audio, so when a query matches, the user is pointed to a large chunk of audio to explore.
The undisclosed part of the collection cannot be accessed, and its content is largely unknown.
For ‘disclosure’ the speech technology researchers want “to automatically generate a time-stamped content description”. The automation reduces the human annotation effort, and because the annotations are time-stamped, words are linked to locations in the audio recording, so that fragments can be retrieved in addition to entire audiovisual documents. The technology used for disclosure depends on (1) the available metadata, and (2) the availability of context documents, i.e., documents that are directly related either to the recording or to its topic. When a transcript of the recording is available, the words in the transcript can be aligned to the audio. During this process the locations of the known words are determined in the audio signal. The result is a fairly accurate index of which word was said where in the audio. When it is unknown exactly what was said in the recording, ASR can be used to generate hypotheses of what was said where. Context documents can be valuable here to improve the models used for speech recognition. Speech recognizers generate output that is generally not error-free, but up to word error rates of 30 to 40% (that is, three or four out of every ten words recognized incorrectly) the automatically generated content descriptions can still be used successfully as search indexes. This is explained by the fact that speech is redundant (when something is on-topic it will be referred to more than once) and that many of the words with a high risk of being misrecognized contribute relatively little to the information content, e.g., prepositions (in, at) and determiners (a, the).
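The word error rate mentioned above is the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the reference length. A small self-contained sketch (the example sentences are made up):

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # substitution/match
        return dp[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("the queen addressed the nation",
                          "the clean addressed nation"))  # 0.4, i.e. 40% WER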
How does CHoral technology fit into the archiving workflow? This is of course a simplified representation, but it gives the general idea. After content has been produced, it is transferred to the archives for preservation. The data are stored, archivists index the collection, and users may search the index for recordings of possible interest. <start animation> CHoral uses the recordings and the existing metadata to give the user a new kind of access. In addition to searching the catalogue for recordings that can be listened to in the archive’s listening room, search results come with audio fragments that can be listened to online, e.g., from the searcher’s home or workplace. <animation 2> The technology consists of automatic speech recognition for index generation, information retrieval technology for finding relevant audio fragments in the collection, and new user interface components that support interaction with the audio fragments.
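To illustrate the retrieval side, a sketch of fragment search over a collection of word-level indexes. The recording names, index contents, and frame numbers here are invented for illustration:

    # Hypothetical collection: recording id -> list of (begin_frame, end_frame, word).
    collection = {
        "rijnmond-1993-04-12": [(0, 500, "storm"), (500, 900, "over"), (900, 1400, "rotterdam")],
        "rijnmond-1994-01-30": [(0, 600, "haven"), (600, 1100, "rotterdam")],
    }

    def search_fragments(collection, query, context_frames=1000):
        """Return (recording, start, end) fragments whose index contains the
        query word, padded with context so the user hears the hit in context."""
        hits = []
        for rec_id, entries in collection.items():
            for begin, end, word in entries:
                if word.lower() == query.lower():
                    hits.append((rec_id, max(0, begin - context_frames),
                                 end + context_frames))
        return hits

    print(search_fragments(collection, "rotterdam"))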
During alignment the locations of known words are determined in the speech signal. By matching the acoustics in the speech signal to the expected acoustics of the individual words, each word in the transcript is mapped to the location in the audio where it is most likely to occur. This results in an index that gives the exact position of every word in the transcript, and the accuracy of the resulting index is very high.
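The matching can be pictured as a dynamic-programming search for the word boundaries that maximize the total acoustic match. A toy sketch, assuming a hypothetical segment_score(word, begin, end) that scores how well a word fits a span of frames (in a real aligner this score would come from the acoustic model):

    def align(words, n_frames, segment_score):
        """Choose boundaries 0 = b0 < b1 < ... < bn = n_frames maximizing the
        summed segment scores; returns one (begin, end) span per word.
        best[i][t] = best total score aligning the first i words to frames [0, t)."""
        NEG = float("-inf")
        n = len(words)
        best = [[NEG] * (n_frames + 1) for _ in range(n + 1)]
        back = [[0] * (n_frames + 1) for _ in range(n + 1)]
        best[0][0] = 0.0
        for i in range(1, n + 1):
            for t in range(i, n_frames + 1):    # each word needs >= 1 frame
                for s in range(i - 1, t):       # candidate previous boundary
                    if best[i - 1][s] == NEG:
                        continue
                    score = best[i - 1][s] + segment_score(words[i - 1], s, t)
                    if score > best[i][t]:
                        best[i][t], back[i][t] = score, s
        # Trace back the chosen boundaries.
        spans, t = [], n_frames
        for i in range(n, 0, -1):
            s = back[i][t]
            spans.append((s, t))
            t = s
        return list(reversed(spans))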
The following type of speech recognition system is used. Before the actual recognition process starts, some pre-processing is done: (1) the audio document is classified into speech and non-speech segments, so that the parts of the recording that do not contain speech (e.g., music, street noise) are not fed to the recognizer, and (2) the speech may be segmented into coherent chunks per speaker, so that models can be adapted to individual speakers. The speech recognition system itself consists of three components: (1) an acoustic model that models the speech sounds of a language, (2) a language model that models which word sequences are likely, and (3) a pronunciation dictionary that specifies which speech sounds make up each word. Developing an acoustic model requires over 50 hours of annotated speech; developing a language model requires texts of hundreds of millions of words. The output of the ASR system is a word-level index: a hypothesis of which words were spoken where in the audio document. Instead of running the recognition process just once, the output of the first round may be used to better choose the models used during recognition, so a second pass is often run with adapted models to arrive at a more accurate index.
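The two-pass idea can be sketched as follows; every callable here is a hypothetical placeholder for a real ASR component, not an actual API:

    def recognize_two_pass(audio, acoustic_model, language_model, dictionary,
                           decode, adapt_acoustic_model, adapt_language_model):
        """Two-pass decoding: the first-pass hypothesis is used to adapt the
        models (e.g., to the speaker or the topic), then decoding is repeated.
        All callables are hypothetical stand-ins for real ASR components."""
        # Pass 1: decode with the generic models.
        first_hypothesis = decode(audio, acoustic_model, language_model, dictionary)

        # Adapt: e.g., speaker adaptation of the acoustic model and
        # topic adaptation of the language model, guided by the first pass.
        am2 = adapt_acoustic_model(acoustic_model, audio, first_hypothesis)
        lm2 = adapt_language_model(language_model, first_hypothesis)

        # Pass 2: decode again with the adapted models for a more accurate index.
        return decode(audio, am2, lm2, dictionary)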
The output of an ASR system can take several forms. The best-known form of output is sentences, reflecting the most likely word sequence recognized by the system. For indexing purposes, however, other output types should be considered. One candidate is the lattice structure, which stores not only the most likely word sequence for a fragment of audio but also alternative words that are likely at certain positions. In this way, the alternatives remain available.
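A lattice can be pictured as a set of arcs over time spans, each carrying a word and a score; indexing all sufficiently likely arcs, rather than only the best path, keeps the alternatives searchable. A minimal sketch with made-up arcs and probabilities:

    # Each arc: (begin_frame, end_frame, word, probability). Invented example:
    lattice = [
        (0, 50, "queen", 0.6),
        (0, 50, "clean", 0.3),
        (0, 50, "keen", 0.1),
        (50, 120, "visits", 0.8),
        (50, 120, "visit", 0.2),
    ]

    def index_lattice(lattice, min_prob=0.05):
        """Index every sufficiently likely arc, so a search for 'queen' still
        succeeds even when 'clean' happened to win the best path."""
        posting_list = {}
        for begin, end, word, prob in lattice:
            if prob >= min_prob:
                posting_list.setdefault(word, []).append((begin, end, prob))
        return posting_list

    print(index_lattice(lattice)["queen"])  # [(0, 50, 0.6)]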
For successful take-up of the technology some investments are needed. Thanks to ongoing digitization and the standardization of formats, audio documents should increasingly be fit for automatic processing without further adaptation. The quality of the automatic annotations depends on the quality of the ASR models, which can be tuned to different domains with accurate transcriptions of representative samples and/or (large amounts of) text data on the same or a strongly related topic. But when an ASR system is used to automatically generate time-stamped content descriptions, should those descriptions be validated by archivists? And if so, how?
A surrogate is a textual or visual representation of the content of a spoken-word document that searchers can use to assess the document’s contents before deciding to listen to the audio.
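As an illustration, one simple textual surrogate is a keyword-in-context snippet built from the time-stamped index; the index format follows the earlier sketches and is assumed rather than prescribed:

    def make_surrogate(entries, query, context=3):
        """Build keyword-in-context snippets: a few words around each hit,
        plus the hit's begin frame so the searcher can jump into the audio.
        `entries` is a list of (begin_frame, end_frame, word) tuples."""
        words = [w for _, _, w in entries]
        snippets = []
        for i, (begin, _, word) in enumerate(entries):
            if word.lower() == query.lower():
                lo, hi = max(0, i - context), min(len(words), i + context + 1)
                snippets.append((begin, " ".join(words[lo:hi])))
        return snippets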