Contribution to the 'Opening up speech archives' conference, February 7, 2013.
By Johan Oomen, Roeland Ordelman, Erwin Verbruggen
Context: http://lukemckernan.com/2013/02/05/opening-up-speech-archives/
1. Audiovisual archives and digital humanities
Netherlands Institute for Sound and Vision
Johan Oomen
Head of R&D (+ researcher VU University)
Roeland Ordelman
Policy advisor audiovisual access (+ researcher University of Twente)
Erwin Verbruggen
Project manager EUscreen
http://www.walkerart.org/calendar/2009/benches-binoculars
contact: joomen@beeldengeluid.nl
8 February 2013
#ousa2013
4. Agenda
Johan Oomen
- Open archives for Digital Humanities
Roeland Ordelman
- Speech search and Digital Humanities
Erwin Verbruggen
- EUscreen and DH
6. Images for the Future
http://imagesforthefuture.com/en/news/images-future-90-seconds
@johanoomen
7. It would take over 6 million years to watch the amount of video that will cross global IP networks each month in 2016.
Every second, 1.2 million minutes of video content will cross the network in 2016.
goal:
...be the best provider of your content
http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827
white_paper_c11-481360_ns827_Networking_Solutions_White_Paper.htm
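The two figures quoted on this slide can be checked against each other with simple arithmetic. The sketch below assumes a 31-day month and a 365-day year; neither assumption is stated on the slide, and the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60


def years_to_watch_one_month(minutes_per_second=1.2e6, days_in_month=31):
    """Years of continuous viewing needed for one month of network video."""
    # Total minutes of video crossing the network in one month.
    minutes_per_month = minutes_per_second * SECONDS_PER_DAY * days_in_month
    # Convert viewing time from minutes to years.
    return minutes_per_month / MINUTES_PER_YEAR


print(f"{years_to_watch_one_month() / 1e6:.1f} million years")
```

Under these assumptions the result is a little over 6 million years, consistent with the first figure; with a 30-day month it comes out slightly under 6 million.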
9. Explorative search
Bron M., van Gorp J., Nack F., de Rijke M., de Leeuw S., "A Subjunctive Exploratory Search Interface to Support Media Studies Researchers", SIGIR '12: 35th international ACM SIGIR conference on Research and development in information retrieval, Portland, Oregon, ACM, pp. 425-434, August 2012.
13. Vocabularies
Over 20 million records and growing.
14. Archives and DH
1. Digitisation as driver for change
• Towards a cultural commonwealth
• Archives as a bridge to CS and DH
2. Mutual benefit
• digging into data ↔ adding meaning
3. From pilots to sustainable solutions
• Standards (W3C)
• In-house production system
• Shared infrastructures (e.g. CLARIAH.eu)
15. Audiovisual collections, the spoken word and user needs of scholars in the Humanities
Observations based on related work in The Netherlands 2005-2012
Roeland Ordelman
@roelandordelman
16. E-research
• New and/or rapid ways to gain knowledge
• Digital resources and information technology
• Big data & data mining (social sciences)
• Digital Humanities / E-Humanities
• Digitization, Infra, Tools, Standards
• CLARIN.eu / DARIAH.eu
17. Emerging focus on audiovisual
• Multi-modal, multi-semiotic:
• multiple layers of meaning / interpretation
• E.g., “quote + intonation + images + discourse”
• New dimensions for scholarly research
• Large investments in digitization:
• Images for the Future: 200k hours of film, video and audio
• Various digitization projects for scientific collections
21. Spoken word search 2005-2012
• Wide range of projects in various domains
• Radio
• Daily ingest: selection of programs
• Woord.nl: public access to radio content
• Historical video collections with sparse data
• “Oral History”
• Development of an ASR service for cultural heritage institutions
22. First experiment on ASR for humanities: access to personal recordings of Dutch novelist W.F. Hermans
26. Access to Radio interviews
Experiments with various types of access and result presentation: speaker changes, speaking rate, search strategies, word clouds
28. Access to distributed Oral History collections
• Infrastructure for searching collections at various institutes in The Netherlands
• Harvesting of metadata (OAI-PMH)
• ASR as a service
• Evaluated with Oral Historians
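Metadata harvesting over OAI-PMH can be sketched as follows. This is a minimal illustration, not the infrastructure described on the slide: the base URL, record identifiers, and sample response are hypothetical, and only the `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments, and the XML namespace come from the OAI-PMH 2.0 protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL; a resumption token replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Hypothetical response from one institute's repository (illustrative only).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:interview-1</identifier></header></record>
    <record><header><identifier>oai:example.org:interview-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

identifiers, token = parse_list_records(SAMPLE)
next_url = list_records_url("http://example.org/oai", resumption_token=token)
```

A harvester would fetch each URL in turn, collecting records from every institute until a response carries no resumption token.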
29. Observations on speech search
• Large variation in ASR performance
• Performance (and decisions on use) should be assessed in context of application: audiovisual search
• Usefulness in audiovisual search should be assessed in context of use scenarios
• Use scenarios require specific presentation/visualization requests
30. Usefulness of results
• Perception of usefulness
• Usefulness in context of search/data exploration
• Educate / Expectation management
• Guide searching
• Show why (errors, confidence, trust-levels, cut-offs)
• Focus on research needs
• Improve on ASR quality
• Educate: how to record an interview (Oral History)
• Use available textual resources (alignment, vocab optimization)
• Improve on search application
• Visualization
• Result presentation
• documents versus segments
• combination of information sources
• cross/within-collection linking
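One way to "show why" is to expose word-level confidence and the cut-off directly in the result presentation. The sketch below is an assumption about how that could look, not a description of the systems in these projects: the `RecognizedWord` structure, the 0.6 cut-off, and the `[?word]` marking are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    text: str          # word as output by the recognizer
    start_s: float     # start time in seconds
    confidence: float  # recognizer confidence between 0 and 1


def render_transcript(words, cutoff=0.6):
    """Show every word, marking those below the cut-off so users can see uncertainty."""
    return " ".join(
        w.text if w.confidence >= cutoff else f"[?{w.text}]" for w in words
    )


def searchable_terms(words, cutoff=0.6):
    """Keep only words at or above the cut-off for indexing."""
    return [w.text for w in words if w.confidence >= cutoff]


words = [
    RecognizedWord("interview", 0.0, 0.92),
    RecognizedWord("recorded", 0.7, 0.41),
    RecognizedWord("amsterdam", 1.3, 0.80),
]
print(render_transcript(words))
```

Keeping uncertain words visible but marked, rather than dropping them silently, is one option for managing expectations about ASR errors.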
31. Methodology (1)
• E-research is an intervention in current practices!
• Promise:
• increased efficiency, relevance, novelty
• Interest of scholars:
• tools that facilitate or simplify existing practice (RIN report, 2011)
• Co-development ICT-researchers & scholars to adjust expectations. Examples:
• Finding more in less time may not be a goal in itself for humanities researchers
• Deep engagement with primary texts versus results on the segment level
32. Methodology (2)
• 4 stages:
1. Preliminary archival search
• Browsing as a general interest
• Purpose driven (checking details, complementary resources)
• Item-oriented (finding the first mention of something)
• Collection-oriented (thematic, source, person, event)
2. Content analysis
• Visualization, compression, aggregation
• (optionally) go back to (1)
3. Presentation and dissemination
• Enhanced publications (persistent identifiers on segment level)
4. Curation
• Trusted digital repository
• (spoken) search scenarios: facilitate these stages
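Segment-level references for enhanced publications could, for example, use the temporal dimension of the W3C Media Fragments URI syntax (`#t=start,end`, in seconds). The sketch below assumes an item already has a persistent URI; the example URI and the function name are hypothetical.

```python
def _seconds(value):
    """Format seconds without trailing zeros (90.5 -> '90.5', 75 -> '75')."""
    return format(value, ".3f").rstrip("0").rstrip(".")


def segment_uri(item_uri, start_s, end_s):
    """Append a temporal media fragment identifying one segment of an item."""
    if start_s < 0 or end_s <= start_s:
        raise ValueError("segment must satisfy 0 <= start < end")
    return f"{item_uri}#t={_seconds(start_s)},{_seconds(end_s)}"


# Hypothetical persistent identifier for an archived interview.
print(segment_uri("http://example.org/item/42", 75, 90.5))
```

A citation built this way points at the quoted passage itself rather than at the whole recording.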
33. ASR for research
• Triple-A: Accessible, Affordable, Accurate
• Individual researchers sending files to ASR?
• Embedded in suite of research tools?
• What about integration in search applications?
• Stagnation due to inadequate local infrastructures
• Variation across collections requires ‘tailor-made’ approaches: e.g., speaker adaptation, vocabulary adaptation, alignment, collection of related resources (information trail)
34. ASR service
Upload: via HTTP, FTP, API
Model of use:
• Free test bundle (10h)
• Various small/medium/large bundles
• Reduced costs (only hardware and maintenance)
• Management by CH body
• Maintenance by industry partner
38. Metadata
mint.image.ece.ntua.gr/
Based on EBUcore
Mapped to the Europeana Data Model
Mapping tool
• Massive uploads
• Schema Mapping Service
• Quality Control
• Europeana Preview Services
Annotation tool
• Item and Group Level Annotation
• Connection with EUscreen Thesauri
• Search and Browsing Services