Open sonar martinreynaert


CLARIAH

Poster presented at the CLARIAH 2016 day

About OpenSoNaR-CGN
OpenSoNaR-CGN makes SoNaR-500 and CGN accessible through a web application,
WhiteLab, which lets users explore and search these collections using the
information contained in the metadata and linguistic annotations.
WhiteLab
• Web application for exploring and searching large text collections
• Provides direct access to the texts, audio, transcriptions, and linguistic
annotations
• Uses the CQP query language
• Offers user interfaces for novice, advanced, and expert users
• Developed by de Taalmonsters in collaboration with Tilburg University
and INL; the current version (2.0) was developed with Radboud University/CLST.
Explore
• View the composition of a collection or corpus through the tree map
view
• Retrieve statistics: frequency lists of (word) tokens, lemmas, parts of
speech, and phonetic forms
• Retrieve n-grams (max. n = 5): combinations of words, lemmas, parts of
speech and/or phonetic forms
• Retrieve specific samples (CGN) or documents (SoNaR)
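The frequency lists and n-grams mentioned above can be illustrated with a minimal sketch. This is not WhiteLab's implementation: the helper name `ngram_frequencies`, the constant `MAX_N`, and the sample sentence are assumptions made for illustration; only the limit of n = 5 comes from the poster.

```python
from collections import Counter

MAX_N = 5  # the poster states that n-grams are limited to n = 5


def ngram_frequencies(tokens, n):
    """Count contiguous n-grams in a token list (hypothetical helper)."""
    if not 1 <= n <= MAX_N:
        raise ValueError(f"n must be between 1 and {MAX_N}")
    # zip over n shifted copies of the list yields every window of length n
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Illustrative sentence, not taken from SoNaR or CGN
tokens = "de kat zit op de mat en de kat slaapt".split()
bigrams = ngram_frequencies(tokens, 2)
print(bigrams.most_common(1))  # [(('de', 'kat'), 2)]
```

The same function applies unchanged to lemmas, POS tags, or phonetic forms, since it only needs a sequence of strings.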
Search
• Selection of subcorpus by means of metadata filter(s)
• Specification of search pattern or query involving
  - one or more word(s)
  - POS tag(s)
  - lemma(s)
• Queries make use of CQP; however, users can opt to specify their queries
without having to use CQP: search patterns formulated in the simple or
extended version of the interface are interpreted and converted to CQP
automatically.
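How a form-based search pattern might be turned into a CQP query can be sketched as follows. This is an assumption-laden illustration, not the conversion WhiteLab actually performs: the function names and the attribute names `word`, `lemma`, and `pos` are hypothetical, and the tag values are examples only. The bracketed `[attribute="regex"]` token notation is standard CQP syntax.

```python
def token_to_cqp(constraints):
    """Render one token's constraints, e.g. {"lemma": "lopen"}, as a CQP token."""
    if not constraints:
        return "[]"  # in CQP an empty pair of brackets matches any token
    parts = []
    for attribute, value in constraints.items():
        if '"' in value:
            # Quoting rules are out of scope for this sketch
            raise ValueError("double quotes are not supported in this sketch")
        parts.append(f'{attribute}="{value}"')
    return "[" + " & ".join(parts) + "]"


def pattern_to_cqp(tokens):
    """Join consecutive token descriptions into one CQP sequence query."""
    return " ".join(token_to_cqp(token) for token in tokens)


# An adjective followed by any form of the lemma "huis" (tag value assumed)
query = pattern_to_cqp([{"pos": "ADJ.*"}, {"lemma": "huis"}])
print(query)  # [pos="ADJ.*"] [lemma="huis"]
```

Values are passed through unchanged because CQP treats them as regular expressions; a real converter would also need to escape special characters in plain-word searches.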
Presentation of results
• Concordance (KWIC), sorted on the basis of lexical information or
metadata
• Link to larger context in which result was found
• (For CGN data) link to aligned audio file
• Graphical display of frequencies and other statistics
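A keyword-in-context (KWIC) concordance with sorting on lexical information can be sketched in a few lines. The functions `kwic` and `sort_by_right_context`, the window size, and the sample sentence are assumptions for illustration and do not reflect how WhiteLab builds or sorts its concordances.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, hit, right context) for each match of keyword."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows


def sort_by_right_context(rows):
    """Sort concordance lines alphabetically on the text after the hit."""
    return sorted(rows, key=lambda row: row[2].lower())


# Illustrative sentence, not taken from SoNaR or CGN
tokens = "De kat zit op de mat en de hond ligt naast de kat".split()
for left, hit, right in sort_by_right_context(kwic(tokens, "kat", window=2)):
    print(f"{left:>10} | {hit} | {right}")
```

Sorting on metadata instead would only require carrying a metadata field in each row and changing the sort key.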
Export of results
Retrieved lists of (meta)data may be exported in TSV format.
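As an illustration of the tab-separated export format, the sketch below writes a small frequency list with Python's standard `csv` module. The function name `to_tsv`, the column names, and the data are assumptions; the layout of the files WhiteLab actually produces is not described on the poster.

```python
import csv
import io


def to_tsv(rows, header):
    """Serialise a header and rows as tab-separated text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative frequency list, not real corpus counts
frequencies = [("de", 3), ("kat", 2)]
tsv_text = to_tsv(frequencies, ["token", "frequency"])
print(tsv_text)
```

Files in this format open directly in spreadsheet software and can be read back with `csv.reader(..., delimiter="\t")`.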
SoNaR-500
• Reference corpus of contemporary written Dutch as encountered in
texts originating from the Dutch-speaking language area in the
Netherlands and Flanders, as well as Dutch translations published in and
targeted at this area.
• Comprises 500+ M words (~ 2 M documents) and includes various
genres and text types, incl. books, magazines, newspapers, discussion
fora, web sites, autocues, and subtitles
• Comes with metadata relating to authors and texts
• Linguistic annotations available: POS tagging, lemmatisation
CGN
• Corpus of contemporary spoken standard Dutch as spoken by adults in
the Netherlands and Flanders
• ~ 9 M words (800+ hours of speech), including various types of speech,
ranging from prepared monologues to spontaneous conversations
• Audio recordings & orthographic transcriptions
• Phonetic transcriptions are also available for a subset of the data
• Comes with metadata relating to speakers (e.g. gender, age) and
recordings
• Linguistic annotations include POS tagging and lemmatisation
OpenSoNaR-CGN was developed by de Taalmonsters in collaboration with
Radboud University Nijmegen/CLST, Tilburg University, and INL.
We gratefully acknowledge the feedback we received from our user group
and the funding provided by CLARIN NL under grant number
CLARIN-NL-15-005.
Screenshots shown on the poster:
• Tree map view of CGN
• Metadata filters for specifying subcorpora
• Query specification in “extended” mode
• Results presented in the form of a concordance
• Query in CQP
