1. About OpenSoNaR-CGN
OpenSoNaR-CGN makes the SoNaR-500 and CGN corpora accessible through a
web application, WhiteLab, with which these collections can be explored
and searched using the information contained in the metadata and
linguistic annotations.
WhiteLab
• Web application for exploring and searching large text collections
• Provides direct access to the texts, audio, transcriptions, and linguistic
annotations
• Uses the CQP (Corpus Query Processor) query language
• Offers user interfaces for novice, advanced, and expert users
• Developed by de Taalmonsters in collaboration with Tilburg University
and INL; the current version (2.0) was developed with Radboud
University/CLST.
Explore
• View the composition of a collection or corpus through the tree map
view
• Retrieve statistics: frequency lists of (word) tokens, lemmas, parts of
speech, and phonetic forms
• Retrieve n-grams (max. n=5); combinations of words, lemmas, parts of
speech and/or phonetic forms
• Retrieve specific samples (CGN) or documents (SoNaR)
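To make the n-gram statistics concrete, the following is a minimal Python sketch of how n-grams and their frequencies could be counted over a tokenised text. It is an illustration only, not WhiteLab's implementation; the function name `ngrams`, the constant `MAX_N`, and the sample tokens are assumptions.

```python
from collections import Counter

MAX_N = 5  # the n-gram view is limited to n=5


def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    if not 1 <= n <= MAX_N:
        raise ValueError(f"n must be between 1 and {MAX_N}")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample; any list of words, lemmas, or POS tags works the same way.
tokens = ["de", "kat", "zit", "op", "de", "mat"]

# Frequency lists are simply counts over the extracted n-grams.
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
```

The same function applies to any annotation layer: pass a list of lemmas or POS tags instead of word forms to obtain lemma or POS n-grams.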
Search
• Selection of subcorpus by means of metadata filter(s)
• Specification of search pattern or query involving
  - one or more word(s)
  - POS tag(s)
  - lemma(s)
• Queries make use of CQP; however, users can opt to specify their queries
without having to use CQP: search patterns formulated in the simple or
extended version of the interface are interpreted and converted to CQP
automatically.
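The conversion from a form-based search pattern to CQP can be pictured with the short Python sketch below. It is a hedged illustration, not the code WhiteLab uses: the function names are hypothetical, and the attribute names `word`, `lemma`, and `pos` as well as the tag pattern `N.*` are assumptions about how the annotation layers are named.

```python
def token_to_cqp(constraints):
    """Render one token's constraints, e.g. {"lemma": "lopen"}, as a CQP token.

    Values are inserted verbatim; this sketch performs no escaping.
    """
    if not constraints:
        return "[]"  # an empty token pattern matches any single token
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def pattern_to_cqp(pattern):
    """Join a sequence of token constraints into one CQP query string."""
    return " ".join(token_to_cqp(token) for token in pattern)


# Hypothetical pattern: the lemma "lopen", then any token, then a noun.
query = pattern_to_cqp([{"lemma": "lopen"}, {}, {"pos": "N.*"}])

# Two constraints on the same token are combined with "&".
combined = token_to_cqp({"word": "loopt", "pos": "WW.*"})
```

Here `query` holds `[lemma="lopen"] [] [pos="N.*"]`, which is the kind of string an expert user would type directly.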
Presentation of results
• Concordance (KWIC), sorted on the basis of lexical information or
metadata
• Link to larger context in which result was found
• (For CGN data) link to aligned audio file
• Graphical display of frequencies and other statistics
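A concordance (keyword in context, KWIC) lists each hit together with a few tokens to its left and right. The Python sketch below shows the idea on a plain token list; it is an assumption-laden illustration rather than WhiteLab's implementation, and the function name `kwic`, the `window` parameter, and the sample sentence are hypothetical.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, hit, right context) for each match of `keyword`."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text.
tokens = "de kat zit op de mat en de hond ligt ernaast".split()

hits = kwic(tokens, "de", window=2)

# Sorting on lexical information: here, alphabetically on the right context.
hits_sorted = sorted(hits, key=lambda hit: hit[2])
```

Sorting on metadata would work the same way, with a sort key taken from the document or speaker record attached to each hit.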
Export of results
Retrieved lists of (meta)data may be exported in TSV format.
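TSV (tab-separated values) files are easy to process with standard tools. As a minimal sketch, the Python code below writes a small frequency list as TSV and reads it back with the standard `csv` module; the column names `word` and `frequency` and the sample rows are assumptions and do not describe the actual layout of an OpenSoNaR-CGN export.

```python
import csv
import io


def to_tsv(rows, header):
    """Serialise `rows` as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def from_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by the header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Hypothetical frequency list.
rows = [("de", 3), ("kat", 1)]
tsv_text = to_tsv(rows, header=("word", "frequency"))
records = from_tsv(tsv_text)
```

Values read back from TSV are strings, so numeric columns such as the frequency need an explicit `int(...)` conversion before further calculation.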
SoNaR-500
• Reference corpus of contemporary written Dutch as encountered in
texts originating from the Dutch speaking language area in the
Netherlands and Flanders as well as Dutch translations published in and
targeted at this area.
• Comprises 500+ M words (~ 2 M documents) and covers various
genres and text types, including books, magazines, newspapers,
discussion fora, web sites, autocues, and subtitles
• Comes with metadata relating to authors and texts
• Linguistic annotations available: POS tagging, lemmatisation
CGN
• Corpus of contemporary spoken standard Dutch as spoken by adults in
the Netherlands and Flanders
• ~ 9 M words (800+ hours of speech), including various types of speech,
ranging from prepared monologues to spontaneous conversations
• Audio recordings & orthographic transcriptions
• Phonetic transcriptions are also available for a subset of the data
• Comes with metadata relating to speakers (e.g. gender, age) and
recordings
• Linguistic annotations include POS tagging and lemmatisation
OpenSoNaR-CGN was developed by de Taalmonsters in collaboration with
Radboud University Nijmegen/CLST, Tilburg University, and INL.
We gratefully acknowledge the feedback we received from our user group
and the funding provided by CLARIN NL under grant number
CLARIN-NL-15-005.
Figures
• Tree map view of CGN
• Metadata filters for specifying subcorpora
• Query specification in “extended” mode
• Results presented in the form of a concordance
• Query in CQP