The document discusses using speech technology to help disclose and provide access to audiovisual archives by automatically generating time-stamped content descriptions. The main challenge is a large backlog of undisclosed analog material with minimal descriptions. The approach involves digitizing content, adding metadata, and using speech recognition to generate content descriptions when transcripts are not available. This allows online retrieval of archive fragments and reduces the human effort needed for annotation. The project tests the approach on the Radio Rijnmond archives, over 60,000 hours of broadcasts.
15. [Diagram: Trusted Digital Repository architecture. Components: Feeder System, Workflow Controller, Job Queue, File Storage, Characterisation, Preservation Planning, Migration, Technical Registry, Active Preservation, Passive Preservation, Data Management, Access, Reporting, Storage Adaptor, Ingest Toolkit, Preservation Controller, Metadata Store. Actors: User, Administrator, Archivist.]
21. Alignment: speech signal + typed transcript ("Landgenooten waar ik enkele ...")

    Begin frame #   End frame #   Word
    00000           54400         -silence-
    54400           65280         Landgenooten
    65280           69120         waar
    69120           73600         ik
    73600           79520         enkele
    …               …             …
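A minimal sketch of how such an alignment result can be represented and queried. The frame numbers are taken from the table above; the frame rate itself is not specified here, and the AlignedWord/locate names are illustrative, not part of any real toolkit:

    from dataclasses import dataclass

    @dataclass
    class AlignedWord:
        begin_frame: int  # first frame of the word in the audio
        end_frame: int    # first frame after the word
        word: str

    # Entries taken from the alignment table above.
    index = [
        AlignedWord(0, 54400, "-silence-"),
        AlignedWord(54400, 65280, "Landgenooten"),
        AlignedWord(65280, 69120, "waar"),
        AlignedWord(69120, 73600, "ik"),
        AlignedWord(73600, 79520, "enkele"),
    ]

    def locate(index, query):
        """Return the frame spans where the query word was spoken."""
        return [(w.begin_frame, w.end_frame)
                for w in index if w.word.lower() == query.lower()]

    print(locate(index, "landgenooten"))  # [(54400, 65280)]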
22. Automatic speech recognition
    Pre-processing: classification speech/non-speech; segmentation of speakers
    Speech recognition: acoustic model (50+ hours of audio), language model (250-500 M words), pronunciation dictionary
    2nd recognition pass with adapted models
    Output: word-level index
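Real systems use trained classifiers for the speech/non-speech step; as a toy stand-in that only illustrates the idea, a short-time energy threshold can mark which frames are worth feeding to the recognizer (the frame and hop sizes below assume 16 kHz audio and are illustrative):

    def frame_energies(samples, frame_len=400, hop=160):
        """Short-time energy per frame (25 ms frames, 10 ms hop at 16 kHz)."""
        energies = []
        for start in range(0, len(samples) - frame_len + 1, hop):
            frame = samples[start:start + frame_len]
            energies.append(sum(s * s for s in frame) / frame_len)
        return energies

    def speech_segments(samples, threshold):
        """Threshold the per-frame energy and merge consecutive
        speech frames into (start_frame, end_frame) segments."""
        flags = [e >= threshold for e in frame_energies(samples)]
        segments, start = [], None
        for i, is_speech in enumerate(flags):
            if is_speech and start is None:
                start = i
            elif not is_speech and start is not None:
                segments.append((start, i))
                start = None
        if start is not None:
            segments.append((start, len(flags)))
        return segments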
Editor's Notes
The content descriptions are dense, generally one per hour of audio, so when a query matches, the user is pointed to a large chunk of audio to explore.
The undisclosed part of the collection cannot be accessed, and its content is largely unknown.
For ‘disclosure’ the speech technology researchers want “to automatically generate a time-stamped content description”. The automation reduces the human annotation effort, and because the annotations are time-stamped, words are linked to locations in the audio recording, so that fragments can be retrieved in addition to entire audiovisual documents. The technology used for disclosure depends on (1) the available metadata, and (2) the availability of context documents, i.e., documents that are directly related either to the recording or to its topic. When a transcript of the recording is available, the words in the transcript can be aligned to the audio. During this process the locations of the known words are determined in the audio signal. The result is a fairly accurate index of which word was said where in the audio. When it is unknown exactly what was said in the recording, ASR can be used to generate hypotheses of what was said where. Context documents can be valuable here to improve the models used for speech recognition. Speech recognizers generate output that is generally not error-free, but up to word error rates of 30 to 40% (that is, three or four out of every ten words recognized incorrectly) the automatically generated content descriptions can still be used successfully as search indexes. This is explained by the fact that speech is redundant (when something is on-topic it will be referred to more than once) and that many of the words with a high risk of being misrecognized contribute relatively little to the information content, e.g., prepositions (in, at) and determiners (a, the).
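The word error rate mentioned above is the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the reference length. A small self-contained sketch (the example sentences are made up):

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # substitution/match
        return dp[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("the queen addressed the nation",
                          "the clean addressed nation"))  # 0.4, i.e. 40% WER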
How does CHoral technology fit into the archiving workflow? This is of course a simplified representation, but it gives the general idea. After content has been produced, it is transferred to the archives for preservation. The data are stored, archivists index the collection, and users may search the index for recordings of possible interest. <start animation> CHoral uses the recordings and the existing metadata to give the user a new kind of access. In addition to searching the catalogue for recordings that can be listened to in the archive’s listening room, search results come with audio fragments that can be listened to online, e.g., from the searcher’s home or workplace. <animation 2> The technology consists of automatic speech recognition for index generation, information retrieval technology for finding relevant audio fragments in the collection, and new user interface components that support interaction with the audio fragments.
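To illustrate the retrieval side, a sketch of fragment search over a collection of word-level indexes. The recording names, index contents, and frame numbers here are invented for illustration:

    # Hypothetical collection: recording id -> list of (begin_frame, end_frame, word).
    collection = {
        "rijnmond-1993-04-12": [(0, 500, "storm"), (500, 900, "over"), (900, 1400, "rotterdam")],
        "rijnmond-1994-01-30": [(0, 600, "haven"), (600, 1100, "rotterdam")],
    }

    def search_fragments(collection, query, context_frames=1000):
        """Return (recording, start, end) fragments whose index contains the
        query word, padded with context so the user hears the hit in context."""
        hits = []
        for rec_id, entries in collection.items():
            for begin, end, word in entries:
                if word.lower() == query.lower():
                    hits.append((rec_id, max(0, begin - context_frames),
                                 end + context_frames))
        return hits

    print(search_fragments(collection, "rotterdam"))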
During alignment the locations of known words are determined in the speech signal. By matching the acoustics in the speech signal to the expected acoustics of the individual words, each word in the transcript is mapped to the location in the audio where it is most likely to occur. This results in an index that gives the exact position of every word in the transcript, and the accuracy of the resulting index is very high.
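The matching can be pictured as a dynamic-programming search for the word boundaries that maximize the total acoustic match. A toy sketch, assuming a hypothetical segment_score(word, begin, end) that scores how well a word fits a span of frames (in a real aligner this score would come from the acoustic model):

    def align(words, n_frames, segment_score):
        """Choose boundaries 0 = b0 < b1 < ... < bn = n_frames maximizing the
        summed segment scores; returns one (begin, end) span per word.
        best[i][t] = best total score aligning the first i words to frames [0, t)."""
        NEG = float("-inf")
        n = len(words)
        best = [[NEG] * (n_frames + 1) for _ in range(n + 1)]
        back = [[0] * (n_frames + 1) for _ in range(n + 1)]
        best[0][0] = 0.0
        for i in range(1, n + 1):
            for t in range(i, n_frames + 1):    # each word needs >= 1 frame
                for s in range(i - 1, t):       # candidate previous boundary
                    if best[i - 1][s] == NEG:
                        continue
                    score = best[i - 1][s] + segment_score(words[i - 1], s, t)
                    if score > best[i][t]:
                        best[i][t], back[i][t] = score, s
        # Trace back the chosen boundaries.
        spans, t = [], n_frames
        for i in range(n, 0, -1):
            s = back[i][t]
            spans.append((s, t))
            t = s
        return list(reversed(spans))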
The following type of speech recognition system is used. Before the actual recognition process starts, some pre-processing is done: (1) the audio document is classified into speech and non-speech segments, so that the parts of the recording that do not contain speech (e.g., music, street noise) are not fed to the recognizer, and (2) the speech may be segmented into coherent chunks per speaker, so that models can be adapted to individual speakers. The speech recognition system itself consists of three components: (1) an acoustic model that models the speech sounds of a language, (2) a language model that models which word sequences are likely, and (3) a pronunciation dictionary that specifies which speech sounds make up each word. Developing an acoustic model requires over 50 hours of annotated speech; developing a language model requires texts of hundreds of millions of words. The output of the ASR system is a word-level index: a hypothesis of which words were spoken where in the audio document. Instead of running the recognition process just once, the output of the first round may be used to better choose the models used during recognition, so a second pass is often run with adapted models to arrive at a more accurate index.
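The two-pass idea can be sketched as follows; every callable here is a hypothetical placeholder for a real ASR component, not an actual API:

    def recognize_two_pass(audio, acoustic_model, language_model, dictionary,
                           decode, adapt_acoustic_model, adapt_language_model):
        """Two-pass decoding: the first-pass hypothesis is used to adapt the
        models (e.g., to the speaker or the topic), then decoding is repeated.
        All callables are hypothetical stand-ins for real ASR components."""
        # Pass 1: decode with the generic models.
        first_hypothesis = decode(audio, acoustic_model, language_model, dictionary)

        # Adapt: e.g., speaker adaptation of the acoustic model and
        # topic adaptation of the language model, guided by the first pass.
        am2 = adapt_acoustic_model(acoustic_model, audio, first_hypothesis)
        lm2 = adapt_language_model(language_model, first_hypothesis)

        # Pass 2: decode again with the adapted models for a more accurate index.
        return decode(audio, am2, lm2, dictionary)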
The output of an ASR system can take several forms. The best-known form of output is sentences, reflecting the most likely word sequence recognized by the system. For indexing purposes, however, other output types should be considered. One candidate is the lattice structure, which stores not only the most likely word sequence for a fragment of audio but also alternative words that are likely at certain positions. In this way, the alternatives remain available.
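A lattice can be pictured as a set of arcs over time spans, each carrying a word and a score; indexing all sufficiently likely arcs, rather than only the best path, keeps the alternatives searchable. A minimal sketch with made-up arcs and probabilities:

    # Each arc: (begin_frame, end_frame, word, probability). Invented example:
    lattice = [
        (0, 50, "queen", 0.6),
        (0, 50, "clean", 0.3),
        (0, 50, "keen", 0.1),
        (50, 120, "visits", 0.8),
        (50, 120, "visit", 0.2),
    ]

    def index_lattice(lattice, min_prob=0.05):
        """Index every sufficiently likely arc, so a search for 'queen' still
        succeeds even when 'clean' happened to win the best path."""
        posting_list = {}
        for begin, end, word, prob in lattice:
            if prob >= min_prob:
                posting_list.setdefault(word, []).append((begin, end, prob))
        return posting_list

    print(index_lattice(lattice)["queen"])  # [(0, 50, 0.6)]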
For successful take-up of the technology some investments are needed. Thanks to ongoing digitization and the standardization of formats, audio documents should increasingly be fit for automatic processing without further adaptation. The quality of the automatic annotations depends on the quality of the ASR models, which can be tuned to different domains with accurate transcriptions of representative samples and/or (large amounts of) text data on the same or a strongly related topic. But when an ASR system is used to automatically generate time-stamped content descriptions, should those descriptions be validated by archivists? And if so, how?
A surrogate is a textual or visual representation of the content of a spoken-word document that searchers can use to assess the document’s contents before deciding to listen to the audio.
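As an illustration, one simple textual surrogate is a keyword-in-context snippet built from the time-stamped index; the index format follows the earlier sketches and is assumed rather than prescribed:

    def make_surrogate(entries, query, context=3):
        """Build keyword-in-context snippets: a few words around each hit,
        plus the hit's begin frame so the searcher can jump into the audio.
        `entries` is a list of (begin_frame, end_frame, word) tuples."""
        words = [w for _, _, w in entries]
        snippets = []
        for i, (begin, _, word) in enumerate(entries):
            if word.lower() == query.lower():
                lo, hi = max(0, i - context), min(len(words), i + context + 1)
                snippets.append((begin, " ".join(words[lo:hi])))
        return snippets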