Improving Access to
Historic Public Broadcasting
through Speech-to-Text,
Crowdsourcing, and Machine Learning
Casey E. Davis Kaufman
Senior Project Manager, WGBH Media Library and Archives
Project Manager, American Archive of Public Broadcasting
Jim Bodor
Senior Director, Digital Product Development
WGBH Digital
americanarchive.org
@amarchivepub
facebook.com/amarchivepub
Quantifying the need for
preservation
• Over 537 million sound recordings held in collecting institutions in
the United States
• 57% are either rare or unique
• Only 17% had been digitized by the time the paper was published
in 2014
• No such study has been conducted for film and video materials, but
it is likely that the numbers will be similar or higher
The need is imminent
• The 2012 National Recording Preservation Plan stated that “many
endangered analog formats must be digitized within the next 15 or
20 years before further degradation makes preservation efforts all
but impossible.”
• As this report was years in the making, we may now have no more
than 10 to 15 years to preserve this material.
Public television has been responsible for the production, broadcast,
and dissemination of some of the most important programs which in
aggregate form the richest audiovisual source of cultural history in the
United States. . . . [I]t is still not easy to overstate the immense
cultural value of this unique audiovisual legacy, whose loss would
symbolize one of the great conflagrations of our age, tantamount to
the burning of Alexandria’s library in the age of antiquity.
-Television and Video Preservation (1997), a Library of Congress
report
Growing the Collection
• Acquiring up to 25,000 hours of new content per year
• Working with donors and vendors
• Dealing with collections in different stages
• Already digitized material
• Born-digital material
• Material that needs to be digitized
• PBS NewsHour
• American Masters
• NET programs
• NHPR
• Don Voegeli Collection
• Southern California Public Radio
• Ken Burns’ The Civil War interviews
• Eyes on the Prize Interviews
• Peabody Awards Collection
• KBOO Community Radio
Goal: A Centralized Web
Portal for Discovery
• All AAPB digitized content discoverable through single searches
• Direct links to historic public media available on other sites
• One-stop shopping for users
• Helps solve the separate silos syndrome
• Digital Public Library of America (DPLA) as a model
Launched New Projects
• National Educational Television (NET) Collection Catalog Project
• Build a national inventory of titles created for National Educational Television (pre-PBS)
• AAPB National Digital Stewardship Residency
• Expand the NDSR program to include geographically diverse residencies at
organizations with public media collections
• Improving Access to Time-Based Media through Crowdsourcing & Machine Learning
Public Broadcasting Metadata Dictionary
(PBCore)
• Continuing the development of PBCore as a
metadata schema for media materials
• Engaging the PBCore community for input in
the development
• Outreach to new adopters of PBCore
• Collaborating with EBUCore to utilize their
RDF ontology
For more information: pbcore.org
the situation
■ 72,000 digitized television and radio programs
■ incomplete, inaccurate metadata records
■ limited staff resources
■ we need to know what we have in the collection
■ we have a responsibility to users to provide access to the collection
■ continued growth of the collection (content and sparse metadata)
potential: transforming
content into data
• Computational Tools
• Speech-to-text
• Audio analysis
• Image Analysis
• Visualization of Data
• How can we use them?
Speech-to-text Transcription
github.com/popuparchive/american-archive-kaldi
a crowdsourcing game
once corrected…
• JSON transcripts will be stored on AAPB’s Amazon S3 account
• Transcripts will be indexed for keyword searching on the AAPB
website
• Transcripts will be made available alongside the media on the
record page
• Transcripts can play as captions within the player
• Transcripts can be harvested via an API and used as a dataset for
research such as a digital humanities project
• Transcripts can be provided to the stations that contributed the
content
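The uses listed above can be sketched in a few lines of Python. This is only an illustration: the transcript layout below (a "parts" list whose segments carry "start_time" and "end_time" in seconds plus "text") is an assumption, since the actual AAPB JSON schema is not shown in these slides, and the function names are hypothetical. The sketch renders segments as WebVTT captions for a player and does a simple keyword lookup.

```python
import json

# Hypothetical transcript layout; the real AAPB JSON schema may differ.
SAMPLE = json.loads("""
{"id": "example-record",
 "parts": [
   {"start_time": 0.0, "end_time": 4.2, "text": "Good evening from Boston."},
   {"start_time": 4.2, "end_time": 9.75, "text": "Tonight we look at public broadcasting."}
 ]}
""")


def vtt_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(transcript):
    """Render transcript segments as the text of a WebVTT caption file."""
    lines = ["WEBVTT", ""]
    for part in transcript["parts"]:
        start = vtt_timestamp(part["start_time"])
        end = vtt_timestamp(part["end_time"])
        lines.append(f"{start} --> {end}")
        lines.append(part["text"])
        lines.append("")  # a blank line separates cues
    return "\n".join(lines)


def find_keyword(transcript, keyword):
    """Return (start_time, text) for each segment containing the keyword."""
    needle = keyword.lower()
    return [(p["start_time"], p["text"])
            for p in transcript["parts"]
            if needle in p["text"].lower()]


if __name__ == "__main__":
    print(to_webvtt(SAMPLE))
    print(find_keyword(SAMPLE, "boston"))
```

Keeping timestamps on every segment is what lets one corrected transcript serve both keyword search (jump to the matching moment) and captions.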
The LAPPS Grid
The LAPPS Grid Project
• Collaborative effort among US partners
• Brandeis University
• Vassar College
• Carnegie Mellon University
• Linguistic Data Consortium (University of Pennsylvania)
• Funded by the US National Science Foundation
• Builds on
• foundation laid in several projects
• SILT (Brandeis/Vassar), The Language Grid, PANACEA, LinguaGrid… momentum
toward a comprehensive network of web services and resources within the NLP
community
Overall Goals
• Design, develop, and promote a Language Application Grid based on Service Grid
Software
• Support development and deployment of integrated natural language applications
• Enable federation of grids and services
• Provide an open advancement (OA) framework for component- and application-based evaluation
• Provide access to language resources for members of the NLP community as well
as researchers in a wide range of social science and humanities disciplines
• Enable easy navigation through licensing issues
• Actively promote adoption, use, and community involvement with the LAPPS Grid
• Actively pursue creation of an interoperable global network of grids and frameworks
Functionality
• Provides access to
• basic NLP processing tools (search, indexing, named entity
spotting, semantic tagging)
• language resources such as mono- and multi-lingual corpora and
lexicons
• Enables pipelining tools to create custom NLP applications and “black
box” composite services
• Ultimately a community-based project
• Services contributed by members of the community
• Existing service repositories and grids federated to enable universal
access
www.lappsgrid.org
Current State of the Project
• Galaxy as workflow engine
• Evaluation services (CMU) - UIMA
• Multiple versions of basic NLP pipeline components (tokenizers, POS taggers,
named entity taggers, coreference modules, phrase-structure, dependency
parse, relation extraction, etc.)
• Currently, Stanford tools, OpenNLP, GATE tools, LingPipe, others
• Using open source tools for development
• Can mix and match, find best set using evaluation tools
• Data sources: MASC (plain text, GrAF, JSON-LD), Gigaword (XML)
• Will soon add all LDC holdings, OANC
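The "mix and match" idea can be illustrated with a minimal Python sketch. It does not use the LAPPS Grid, Galaxy, or any of the tools named above; the component functions and the document dictionary are toy stand-ins invented for illustration. The point is only that components sharing one interchange format can be swapped without changing the rest of the pipeline.

```python
import re

# Toy stand-ins, not LAPPS Grid services. Each component takes a document
# dict and returns it with one more annotation layer added.


def whitespace_tokenizer(doc):
    """Split on whitespace only, so punctuation stays attached to words."""
    doc["tokens"] = doc["text"].split()
    return doc


def regex_tokenizer(doc):
    """Split words from punctuation marks."""
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def capitalized_entity_tagger(doc):
    """Naive entity spotting: capitalized tokens other than the first."""
    doc["entities"] = [tok for i, tok in enumerate(doc["tokens"])
                       if i > 0 and tok[:1].isupper()]
    return doc


def run_pipeline(text, components):
    """Pass one document through the components in order."""
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


if __name__ == "__main__":
    text = "The archive is in Boston."
    for tokenizer in (whitespace_tokenizer, regex_tokenizer):
        result = run_pipeline(text, [tokenizer, capitalized_entity_tagger])
        print(tokenizer.__name__, result["entities"])
```

Swapping the tokenizer changes what the downstream tagger finds, which is why evaluation tools are needed to choose the best combination.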
Audio Waveform Analysis
The State of Recorded Sound
Preservation in the United States:
A National Legacy at Risk in the
Digital Age (2010)
• Suggested that if scholars and students do not use sound archives,
cultural heritage institutions will be less inclined to preserve them.
• Archives and libraries must collaborate with patrons and scholars to
understand how recordings are and might be used.
• Scholars need to know what kinds of analysis are possible in an
age of large, freely available collections and advanced
computational analysis.
http://www.hipstas.org
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
HiPSTAS: Primary Goals
• To develop a virtual research environment in which users can better
access and analyze spoken word collections of interest to
humanists through:
• an assessment of scholarly requirements for analyzing sound
• an assessment of technological infrastructures needed to support
discovery
• preliminary tests that demonstrate the efficacy of using such tools
in humanities scholarship
• a freely available, open-source, API-driven version for general use
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
ARLO (Adaptive Recognition with
Layered Optimization)
Spectrogram: frequency (Hz, a unit of frequency) against time
Energy represented by a heat-based color scheme:
White – hottest, most intense
Yellow
Red
Green
Blue
Black – coolest, least intense
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
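The view described above, frequency against time with energy shown as color, can be sketched with the Python standard library alone. This is not ARLO's implementation: the frame size, hop, Hann window, and the linear mapping of energy onto the six color names are all assumptions chosen to keep the example small (spectrogram tools often use a logarithmic scale instead).

```python
import cmath
import math

# Color names from the slide, ordered coolest to hottest.
HEAT_SCALE = ["Black", "Blue", "Green", "Red", "Yellow", "White"]


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of per-frequency-bin energies for each time frame."""
    # A Hann window reduces leakage between neighbouring frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        energies = []
        for k in range(frame_size // 2 + 1):  # bins from 0 Hz up to Nyquist
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            energies.append(abs(coeff) ** 2)
        frames.append(energies)
    return frames


def bin_frequency(k, sample_rate, frame_size=64):
    """Centre frequency in Hz of bin k."""
    return k * sample_rate / frame_size


def heat_color(energy, max_energy):
    """Map an energy onto the color scale (linear scaling is an assumption)."""
    if max_energy <= 0:
        return HEAT_SCALE[0]
    level = int(energy / max_energy * (len(HEAT_SCALE) - 1) + 0.5)
    return HEAT_SCALE[min(level, len(HEAT_SCALE) - 1)]


if __name__ == "__main__":
    sample_rate = 8000
    # A pure 1000 Hz test tone.
    tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
    first = spectrogram(tone)[0]
    loudest = max(range(len(first)), key=lambda k: first[k])
    print(bin_frequency(loudest, sample_rate), heat_color(first[loudest], max(first)))
```

Each frame becomes one vertical column of the image and each bin one pixel in that column, colored by its energy.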
AAPB dataset
• 50 selected sound characteristics and speaker voices
• sounds such as NPR intro, Independent Television News (ITN)
intro, classical music, dog barks, audience clapping
• Voices such as John F. Kennedy, Eleanor Roosevelt, Martin
Luther King, Jr., etc.
• 4,000 hours of video and audio which includes samples of each
speaker and sound
• metadata & transcripts
Supervised Classification
Searching for Sound with Sound
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
Unsupervised Classification
Searching for Sound with Sound
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
Get Results
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
Visualize Results
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
Visualize Results
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
Blue = sung; green = spoken; red = instrumental
55 John Alan Lomax recordings 1926-1941
Visualize Results
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
Questions
• Features: What are we measuring?
• Ground Truth: What’s the answer? How do we know when we’re
accurate?
• Optimization: Accuracy vs. Efficiency – how do you balance the
accuracy of your results against the computational resources you
need to achieve that level of accuracy?
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
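One common way to approach the ground-truth question is to compare a classifier's output with hand-labeled clips and report precision and recall. The sketch below is a generic illustration, not the HiPSTAS or ARLO evaluation procedure; the function name and the example labels are made up.

```python
def precision_recall(predicted, truth):
    """Compare per-clip predictions with hand-labeled ground truth.

    Both arguments are equal-length lists of booleans, one entry per clip,
    where True means "the target sound is present".
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    pairs = list(zip(predicted, truth))
    true_pos = sum(1 for p, t in pairs if p and t)
    false_pos = sum(1 for p, t in pairs if p and not t)
    false_neg = sum(1 for p, t in pairs if not p and t)
    # Precision: of the clips flagged, how many were right?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of the clips that truly contain the sound, how many were found?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up labels for five clips, e.g. "contains audience clapping".
    truth = [True, True, False, False, True]
    predicted = [True, False, True, False, True]
    print(precision_recall(predicted, truth))
```

Reporting the two numbers separately makes the trade-off visible: a tool tuned to miss nothing (high recall) will usually flag more false matches (lower precision).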
Questions
• Literacy: How much do we need to know about the technology of audio, of
computational methods, and of humanist inquiry to do new kinds of research in this
area?
• Usability: What kinds of interfaces and tools facilitate AV analysis in a diverse range
of disciplines and communities? Who gets access to these tools and for what kinds
of questions?
• Accuracy: Is good enough, good enough?
• Scalability: How much storage and processing power do users need to conduct local
and large-scale AV analyses? A Laptop? A Supercomputer?
• Sustainability: What are local, national, and global scale issues? How does this work
fit back into the access infrastructure already in place in archives, libraries,
classrooms? Is data enough to get us over the hump of our limited means for
discovery?
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
Future Possibilities
• Searching for sound with Sound
• Metadata improvement and validation
• Re-allocation of staff
Credit: Tanya Clement, HiPSTAS,
tclement@ischool.utexas.edu
AAPB can be of value for scholarship because...
• scholarship pertaining to the period of 1973 onwards is “limited, fragmentary,
and politically conflicted”
• for the 1980s, “the archival and monographic work … has not yet been done”
• accounts about the 1990s and later have “not really been history”
Kim Phillips-Fein, “1973 to the Present,”
in American History Now (2011)
The Importance
of Local History
• “emphasis on diversity”
• “the history of the nation is many
different stories, no one of which
can be considered the ‘main’ story”
• a “skepticism about finding common
definitions of American nationalism
or discovering common values”
among many historians of the
1960s and 1970s
History from the
bottom up
(quotes from
Alan Brinkley)
The Importance of Local History for...
• relating “national experiences to larger processes
and local resolutions.”
Thomas Bender
Rethinking American History in a Global Age (2002)
Get your
station on
the map!
Why get involved in the AAPB?
So that your station can begin to identify, manage and preserve your collection
So that other producers can find and potentially license your content
So that scholars can refer to your content in their research
So that educators and students can access your content as primary source material
So that lifelong learners can watch and listen to the programs they remember from
the past
Your station has created an archival record of your community and our shared
cultural heritage. Making this historic content available fulfills public media’s core
mission to educate, inspire, and enlighten.
Contributing digital files to AAPB
AAPB can acquire a certain number of hours of digital content per year.
For your collection to be considered, fill out our Collection Acquisitions Form:
Download:
https://s3.amazonaws.com/americanarchive.org/resources/AAPB_collection_acquisitions_form.pdf
We also ask all donors to sign our AAPB Deed of Gift, an agreement between the
collection donor and WGBH and the Library of Congress.
Download:
https://s3.amazonaws.com/americanarchive.org/resources/AAPB_model-deed-of-gift.docx
AAPB Technical Specifications
Once a collection has been approved for submission to AAPB, AAPB works
closely with collection donors and their digitization vendors to clarify
required and requested technical specifications.
These include file formats, naming conventions, storage media for
delivery of files, associated technical metadata, etc.
We would be happy to work with you to develop your digitization
vendor RFP.
Current AAPB tech specs for donors are available for download here:
https://s3.amazonaws.com/americanarchive.org/resources/AAPB_TechSpecs_160606.pdf
How can my station support this work?
Hire an archivist.
Advocate for staff resources to be allocated to archiving your content, and get them trained on best
practices. Licensing and your own reuse of existing footage are potential ROIs.
Hire Library Science graduate student interns. Check out ALA’s accredited library programs in
your area: http://www.ala.org/accreditedprograms/directory
Partner with a local university to preserve your analog collection.
Seek funding from local foundations or donors interested in media and/or local history.
Seek federal or other national grants targeted toward archiving.
Grant programs
Institute of Museum and Library Services – imls.gov
Council on Library and Information Resources – clir.org
National Endowment for the Humanities – neh.gov
National Endowment for the Arts – arts.gov
Association for Recorded Sound Collections – arsc-audio.org
NARA’s National Historical Publications & Records Commission – archives.gov/nhprc
Grammy Foundation – grammy.org/grammy-foundation/grants
americanarchive.org
@amarchivepub
facebook.com/amarchivepub
Thank you!
Casey E. Davis Kaufman
Casey_Davis-Kaufman@wgbh.org
@caseyedavis1
Jim Bodor
Jim_Bodor@wgbh.org

 
FIX IT - A Transcript Game to Make Historic Public Broadcasting More Discover...
FIX IT - A Transcript Game to Make Historic Public Broadcasting More Discover...FIX IT - A Transcript Game to Make Historic Public Broadcasting More Discover...
FIX IT - A Transcript Game to Make Historic Public Broadcasting More Discover...
 
Can the Computer and the Public Do the Metadata Work?
Can the Computer and the Public Do the Metadata Work?Can the Computer and the Public Do the Metadata Work?
Can the Computer and the Public Do the Metadata Work?
 
Let the Public and the Computer do the Metadata Work!
Let the Public and the Computer do the Metadata Work!Let the Public and the Computer do the Metadata Work!
Let the Public and the Computer do the Metadata Work!
 
Put it on your Bucket List: Navigating Copyright to Expose Digital AV Collect...
Put it on your Bucket List: Navigating Copyright to Expose Digital AV Collect...Put it on your Bucket List: Navigating Copyright to Expose Digital AV Collect...
Put it on your Bucket List: Navigating Copyright to Expose Digital AV Collect...
 
NET Collection Catalog Project
NET Collection Catalog ProjectNET Collection Catalog Project
NET Collection Catalog Project
 
PBCore RDF Ontology Hackathon | Code4Lib 2015
PBCore RDF Ontology Hackathon | Code4Lib 2015PBCore RDF Ontology Hackathon | Code4Lib 2015
PBCore RDF Ontology Hackathon | Code4Lib 2015
 
Challenges, Workflows, and Insights in the Collaboration to Preserve America'...
Challenges, Workflows, and Insights in the Collaboration to Preserve America'...Challenges, Workflows, and Insights in the Collaboration to Preserve America'...
Challenges, Workflows, and Insights in the Collaboration to Preserve America'...
 
American Archive of Public Broadcasting: Preservation and Content Continuity
American Archive of Public Broadcasting: Preservation and Content ContinuityAmerican Archive of Public Broadcasting: Preservation and Content Continuity
American Archive of Public Broadcasting: Preservation and Content Continuity
 

Kürzlich hochgeladen

Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 

Kürzlich hochgeladen (20)

Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 

Improving Access to Historic Public Broadcasting through Speech-to-Text, Crowdsourcing, and Machine Learning

  • 1. Improving Access to Historic Public Broadcasting through Speech-to-Text, Crowdsourcing, and Machine Learning Casey E. Davis Kaufman Senior Project Manager, WGBH Media Library and Archives Project Manager, American Archive of Public Broadcasting Jim Bodor Senior Director, Digital Product Development WGBH Digital
  • 5. Quantifying the need for preservation • Over 537 million sound recordings held in collecting institutions in the United States • 57% are either rare or unique • Only 17% had been digitized by the time the paper was published in 2014 • No such study has been conducted for film and video materials, but it is likely that the numbers will be similar or higher
  • 6. The need is imminent • The 2012 National Recording Preservation Plan stated that “many endangered analog formats must be digitized within the next 15 or 20 years before further degradation makes preservation efforts all but impossible.” • As this report was years in the making, we may now have no more than 10 to 15 years to preserve this material.
  • 7. Public television has been responsible for the production, broadcast, and dissemination of some of the most important programs which in aggregate form the richest audiovisual source of cultural history in the United States. . . . [I]t is still not easy to overstate the immense cultural value of this unique audiovisual legacy, whose loss would symbolize one of the great conflagrations of our age, tantamount to the burning of Alexandria’s library in the age of antiquity. -Television and Video Preservation (1997), a Library of Congress report
  • 8. Growing the Collection • Acquiring up to 25,000 hours of new content per year • Working with donors and vendors • Dealing with collections in different stages • Already digitized material • Born-digital material • Material that needs to be digitized
  • 9. • PBS NewsHour • American Masters • NET programs • NHPR • Don Voegeli Collection • Southern California Public Radio • Ken Burns’ The Civil War interviews • Eyes on the Prize Interviews • Peabody Awards Collection • KBOO Community Radio
  • 10. Goal: A Centralized Web Portal for Discovery • All AAPB digitized content discoverable through single searches • Direct links to historic public media available on other sites • One-stop shopping for users • Helps solve the separate silos syndrome • Digital Public Library of America (DPLA) as a model
  • 11. Launched New Projects • National Educational Television (NET) Collection Catalog Project • Build a national inventory of titles created for National Educational Television (pre-PBS) • AAPB National Digital Stewardship Residency • Expand the NDSR program to include geographically diverse residencies at organizations with public media collections • Improving Access to Time-Based Media through Crowdsourcing & Machine Learning
  • 12. Public Broadcasting Metadata Dictionary (PBCore) • Continuing the development of PBCore as a metadata schema for media materials • Engaging the PBCore community for input in the development • Outreach to new adopters of PBCore • Collaborating with EBUCore to utilize their RDF ontology For more information: pbcore.org
  • 14. the situation ■ 72,000 digitized television and radio programs ■ incomplete, inaccurate metadata records ■ limited staff resources ■ we need to know what we have in the collection ■ we have a responsibility to users to provide access to the collection ■ continued growth of the collection (content and sparse metadata)
  • 15. potential: transforming content into data • Computational Tools • Speech-to-text • Audio analysis • Image Analysis • Visualization of Data • How can we use them?
  • 30. once corrected… • JSON transcripts will be stored on AAPB’s Amazon S3 account • Transcripts will be indexed for keyword searching on the AAPB website • Transcripts will be made available alongside the media on the record page • Transcripts can play as captions within the player • Transcripts can be harvested via an API and used as a dataset for research such as a digital humanities project • Transcripts can be provided to the stations that contributed the content
  • 32. The LAPPS Grid Project • Collaborative effort among US partners • Brandeis University • Vassar College • Carnegie Mellon University • Linguistic Data Consortium (University of Pennsylvania) • Funded by the US National Science Foundation • Builds on • the foundation laid in several projects: SILT (Brandeis/Vassar), The Language Grid, PANACEA, LinguaGrid… • momentum toward a comprehensive network of web services and resources within the NLP community
  • 33. Overall Goals • Design, develop, and promote a Language Application Grid based on Service Grid Software • Support development and deployment of integrated natural language applications • Enable federation of grids and services • Provide an open advancement (OA) framework for component- and application-based evaluation • Provide access to language resources for members of the NLP community as well as researchers in a wide range of social science and humanities disciplines • Enable easy navigation through licensing issues • Actively promote adoption, use, and community involvement with the LAPPS Grid • Actively pursue creation of an interoperable global network of grids and frameworks
  • 34. Functionality • Provides access to • basic NLP processing tools (search, indexing, named entity spotting, semantic tagging) • language resources such as mono- and multi-lingual corpora and lexicons • Enables pipelining tools to create custom NLP applications and “black box” composite services • Ultimately a community-based project • Services contributed by members of the community • Existing service repositories and grids federated to enable universal access
  • 40. Current State of the Project • Galaxy as workflow engine • Evaluation services (CMU) - UIMA • Multiple versions of basic NLP pipeline components (tokenizers, POS taggers, named entity taggers, coreference modules, phrase-structure and dependency parsers, relation extraction, etc.) • Currently: Stanford tools, OpenNLP, GATE tools, LingPipe, others • Using open source tools for development • Can mix and match, and find the best set using evaluation tools • Data sources: MASC (plain text, GrAF, JSON-LD), Gigaword (XML) • Will soon add all LDC holdings, OANC
  • 42. The State of Recorded Sound Preservation in the United States: A National Legacy at Risk in the Digital Age (2010) • Suggested that if scholars and students do not use sound archives, cultural heritage institutions will be less inclined to preserve them. • Archives and libraries must collaborate with patrons and scholars to understand how recordings are and might be used. • Scholars need to know what kinds of analysis are possible in an age of large, freely available collections and advanced computational analysis.
  • 43. http://www.hipstas.org Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 44. HiPSTAS: Primary Goals • To develop a virtual research environment in which users can better access and analyze spoken word collections of interest to humanists through: • an assessment of scholarly requirements for analyzing sound • an assessment of technological infrastructures needed to support discovery • preliminary tests that demonstrate the efficacy of using such tools in humanities scholarship • a freely available, open-source, API-driven version for general use Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 45. ARLO (Adaptive Recognition with Layered Optimization) • Figure labels: Hz (a unit of frequency) and time • Energy is represented by a heat-based color scheme: white (hottest, most intense), yellow, red, green, blue, black (coolest, least intense) Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 47. AAPB dataset • 50 selected sound characteristics and speaker voices • sounds such as NPR intro, Independent Television News (ITN) intro, classical music, dog barks, audience clapping • Voices such as John F. Kennedy, Eleanor Roosevelt, Martin Luther King, Jr., etc. • 4,000 hours of video and audio which includes samples of each speaker and sound • metadata & transcripts
  • 48. Supervised Classification • Searching for Sound with Sound Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 49. Unsupervised Classification • Searching for Sound with Sound Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 50. Get Results Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 51. Visualize Results Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 52. Visualize Results • 55 John Alan Lomax recordings, 1926-1941 • Blue = sung; green = spoken; red = instrumental Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 53. Visualize Results Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 54. Questions • Features: What are we measuring? • Ground Truth: What’s the answer? How do we know when we’re accurate? • Optimization: Accuracy vs. Efficiency – how do you balance the accuracy of your results against the computational resources you need to achieve that level of accuracy? Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 55. Questions • Literacy: How much do we need to know about the technology of audio, of computational methods, and of humanist inquiry to do new kinds of research in this area? • Usability: What kinds of interfaces and tools facilitate AV analysis in a diverse range of disciplines and communities? Who gets access to these tools and for what kinds of questions? • Accuracy: Is good enough, good enough? • Scalability: How much storage and processing power do users need to conduct local and large-scale AV analyses? A Laptop? A Supercomputer? • Sustainability: What are local, national, and global scale issues? How does this work fit back into the access infrastructure already in place in archives, libraries, classrooms? Is data enough to get us over the hump of our limited means for discovery? Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 56. Future Possibilities • Searching for sound with Sound • Metadata improvement and validation • Re-allocation of staff Credit: Tanya Clement, HiPSTAS, tclement@ischool.utexas.edu
  • 57. AAPB can be of value for scholarship because... • scholarship pertaining to the period of 1973 onwards is “limited, fragmentary, and politically conflicted” • for the 1980s, “the archival and monographic work … has not yet been done” • accounts about the 1990s and later have “not really been history” Kim Phillips-Fein, “1973 to the Present,” in American History Now (2011)
  • 58. The Importance of Local History • “emphasis on diversity” • “the history of the nation is many different stories, no one of which can be considered the ‘main’ story” • a “skepticism about finding common definitions of American nationalism or discovering common values” among many historians of the 1960s and 1970s History from the bottom up (quotes from Alan Brinkley)
  • 59. The Importance of Local History for... • relating “national experiences to larger processes and local resolutions.” Thomas Bender Rethinking American History in a Global Age (2002)
  • 61. Why get involved in the AAPB? So that your station can begin to identify, manage and preserve your collection So that other producers can find and potentially license your content So that scholars can refer to your content in their research So that educators and students can access your content as primary source material So that lifelong learners can watch and listen to the programs they remember from the past Your station has created an archival record of your community and our shared cultural heritage. Making this historic content available fulfills public media’s core mission to educate, inspire, and enlighten.
  • 62. Contributing digital files to AAPB AAPB can acquire a certain number of hours of digital content per year. For your collection to be considered, fill out our Collection Acquisitions Form: Download: https://s3.amazonaws.com/americanarchive.org/resources/AAPB_collection_acquisitions_form.pdf We also ask all donors to sign our AAPB Deed of Gift, an agreement between the collection donor and WGBH and the Library of Congress. Download: https://s3.amazonaws.com/americanarchive.org/resources/AAPB_model-deed-of-gift.docx
  • 63. AAPB Technical Specifications Once a collection has been approved for submission to AAPB, AAPB works closely with collection donors and their digitization vendors to clarify required and requested technical specifications. These include file formats, naming conventions, storage media for delivery of files, associated technical metadata, etc. We would be happy to work with you to develop your digitization vendor RFP. Current AAPB tech specs for donors are available for download here: https://s3.amazonaws.com/americanarchive.org/resources/AAPB_TechSpecs_160606.pdf
  • 64. How can my station support this work? Hire an archivist. Advocate for staff resources to be allocated to archiving your content, and get them trained on best practices. Licensing and your own reuse of existing footage are potential ROIs. Hire Library Science graduate student interns. Check out ALA’s accredited library programs in your area: http://www.ala.org/accreditedprograms/directory Partner with a local university to preserve your analog collection. Seek funding from local foundations or donors interested in media and/or local history. Seek federal or other national grants targeted toward archiving.
  • 65. Grant programs Institute of Museum and Library Services – imls.gov Council on Library and Information Resources – clir.org National Endowment for the Humanities – neh.gov National Endowment for the Arts – arts.gov Association for Recorded Sound Collections – arsc-audio.org NARA’s National Historical Publications & Records Commission – archives.gov/nhprc Grammy Foundation – grammy.org/grammy-foundation/grants
  • 67. Thank you! Casey E. Davis Kaufman Casey_Davis-Kaufman@wgbh.org @caseyedavis1 Jim Bodor Jim_Bodor@wgbh.org
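Slide 30 says that corrected JSON transcripts can play as captions in the player and be indexed for keyword searching. The Python sketch below shows one way that could work. The JSON layout (`parts`, `start_time`, `end_time`, `text`) and the sample sentences are assumptions made for illustration, not the AAPB transcript schema, and WebVTT is used only as one common caption format.

```python
import json


def to_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def transcript_to_webvtt(transcript: dict) -> str:
    """Turn a transcript (assumed layout, see above) into WebVTT caption text."""
    lines = ["WEBVTT", ""]
    for part in transcript["parts"]:
        start = to_timestamp(part["start_time"])
        end = to_timestamp(part["end_time"])
        lines.append(f"{start} --> {end}")
        lines.append(part["text"].strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


def keyword_hits(transcript: dict, keyword: str) -> list:
    """Return the start times of every part whose text contains the keyword."""
    needle = keyword.lower()
    return [
        part["start_time"]
        for part in transcript["parts"]
        if needle in part["text"].lower()
    ]


# Invented sample data, standing in for a corrected transcript.
SAMPLE = json.loads("""
{
  "parts": [
    {"start_time": 0.0, "end_time": 4.2, "text": "Good evening."},
    {"start_time": 4.2, "end_time": 9.75,
     "text": "Tonight, a report from the state house."}
  ]
}
""")

if __name__ == "__main__":
    print(transcript_to_webvtt(SAMPLE))
    print(keyword_hits(SAMPLE, "report"))
```

Keeping the start time with each hit is what would let a search result jump to the matching moment in the media rather than to the top of the record page.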

Editor's Notes

  1. As we move beyond the original 40,000 hours, we are beginning to get a better handle on how much material we can accept per year. As we work with donors, we are dealing with collections that are in different stages. For example, some have already been digitized from original analog, while others have yet to be. We also have born-digital content and collections that might have material in more than one of these stages. Casey will talk about how to get your material into the AAPB in more depth, but we are happy to work with you to find the best solution. I’m going to talk a little bit about the material we’ve accepted since the original 40,000 were ingested.
  2. Here are some collections we have recently accepted or are in the process of acquiring. We are working with NewsHour Productions to digitize, preserve, and make publicly accessible on the AAPB website 32 years of NewsHour predecessor programs, from October 1975 to December 2007, that currently exist on obsolete analog formats. We have selected a digitization vendor, accessioned content at WETA, and instituted quality control procedures to ensure that all digitized files will be properly preserved for present and future generations. Thirteen Productions LLC has agreed to provide AAPB with metadata and nearly 2,500 files containing more than 1,800 hours of complete interviews conducted with key figures in American culture and the arts for American Masters between 1993 and 2015. They have also agreed to give us 67 programs distributed by National Educational Television, the precursor to PBS. We acquired from NHPR a collection of interviews and speeches by presidential candidates from 1996-2012. This equals about 100 hours of content and was added to the AAPB in January of this year. AAPB has made arrangements with Ken Burns to preserve and make accessible digital master copies of eight complete interviews with prominent historians and others conducted for The Civil War that have not previously been seen by the public. We will receive programs from Southern California Public Radio, including award-winning stories chronicling endangered species and environmental issues in California; recordings of music written for public radio by Donald Voegeli, composer of the All Things Considered theme; and award-winning public radio programs on conservation issues produced by James Voegeli. Each of these collections presented its own challenges. Some are very large, while others are very small. The larger collections require a lot of planning, especially when grants are involved. Civil War – we’re taking the original files, not preservation files.
NHPR – this one was relatively easy, as it was a small collection of radio programs. Radio is always easier to process on our end than video. The files are smaller and less complex than video files, so they go through the ingestion process faster. American Masters and NewsHour – these two situations are more complex. Both involve analog tapes that need to be digitized. The NewsHour also involves multiple entities that hold the physical analog tapes, and it is a grant project through CLIR, so there are additional requirements that need to be met. A commonality among all these collections is the need for a Deed of Gift. These collections need to be viewable on the AAPB website, and it often takes a lot of time for this process to complete. With that, I’ll hand it over to Casey, who will go over how you can get your material into the AAPB.
  3. Initial CPB funding has been extended for another year, which will partially cover core staff. In addition, we’ve raised close to $2 million to launch three new projects that will help sustain our core staff for another three years and further our goals. The first focuses on the National Educational Television collection of pre-PBS programming, which contains some of the earliest public television programs covering historic events and issues of the 1950s and 1960s. The second will place seven master’s-level graduates as paid fellows for 10 months to work on digital preservation projects at seven stations or institutions holding public media content across the country. And third, working with Pop Up Archive, we will create transcripts of all 40,000 hours using speech-to-text tools to improve searchability, and create games for the public to correct and enhance the transcripts.
  4. As part of the American Archive, we are also revitalizing PBCore, a metadata schema geared toward audiovisual collections. In particular, we hope to map it to a similar schema, EBUCore, so PBCore users can take advantage of the RDF ontology EBUCore has developed.
  5. As audiovisual archives, we have sound. The sound does not always describe the visual, but it is a start. If we can put the sound into text, we can use language tools to work with it as data. Here are some of the tools and work we’ve been doing to help better describe our collections and determine what’s in them.
  6. Working with Pop Up Archive and speech-to-text tools. Many of these computational tools use vocabulary we don’t tend to use in the archive world, like “pipeline,” “lexicon,” and “phonemes.” This slide is their description of what is happening with the tool; we need to decode it to understand it and then give input on how it can help us as archivists. They take the sounds of speech and match them to letters and words. And the tool teaches itself as corrections are fed back into the decoding process.
  7. This is an example of decoding one of our transcripts. There is a list of words with their sound pronunciations. The output contains only words recognized as part of the vocabulary, so we need to feed in new words for more accurate recognition.
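A toy Python illustration of the vocabulary point above. The lexicon entries and phoneme spellings are invented for this sketch and are not Pop Up Archive's or Kaldi's actual data or API. It only shows why a word missing from the pronunciation list cannot appear in the output until someone adds it.

```python
# Hypothetical pronunciation lexicon: word -> phoneme sequence (illustrative only).
lexicon = {
    "public": ["P", "AH", "B", "L", "IH", "K"],
    "radio": ["R", "EY", "D", "IY", "OW"],
}

def out_of_vocabulary(words, lexicon):
    """Return the words (lowercased, in order, no repeats) missing from the lexicon."""
    missing = []
    for word in words:
        key = word.lower()
        if key not in lexicon and key not in missing:
            missing.append(key)
    return missing

def add_word(lexicon, word, phonemes):
    """Register a new word so a decoder using this lexicon could recognize it."""
    lexicon[word.lower()] = list(phonemes)

spoken = ["Public", "radio", "NewsHour"]
print(out_of_vocabulary(spoken, lexicon))   # "newshour" is reported as unknown

# Feeding in the new word (with a made-up pronunciation) closes the gap.
add_word(lexicon, "NewsHour", ["N", "UW", "Z", "AW", "ER"])
print(out_of_vocabulary(spoken, lexicon))   # nothing left to report
```

In practice, program titles, local place names, and personal names are the kinds of words an archive would most likely need to supply.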
  8. This is a sample of one of our transcripts from the tool, with a time stamp. This is pretty clean, with the exception of punctuation marks in the wrong places. It’s not parsing sentences very well. Transcripts can be exported in multiple formats using their API – this is an example from their website.
  9. This is an example of a transcript with entities identified. It can pull together entities into strings that make sense and identify categories. We are not sure all these strings are necessarily important to highlight, like “Halloween.”
  10. Built upon something already existing called Kaldi. Pop Up Archive has just released the code on GitHub, but it is not yet tested and we are not sure about documentation. It is just for English.
  11. Who is behind the project and the work
  12. Orchestrates access to and deployment of language resources and processing functions available from servers around the globe. Enables users to add their own language resources, services, and even service grids. Provides a critical missing layer of functionality for NLP. Current frameworks (e.g., GATE, UIMA) do not provide general support for service discovery, composition, and reuse. Communication among tools is based on a specific internal format (e.g., UIMA CAS). LAPPS Grid enables calling tools and pipelines within GATE, UIMA, etc. as services themselves, so they are interoperable with all other LAPPS Grid services.
  13. What it does
  14. Front page with a tutorial and tools available on the left. Different tools have different input file requirements, but most use the JSON-LD file format. There is a tool in the grid to which you can upload your JSON file, and it will normalize or wrap it into the appropriate JSON format. There are limits on the size of files you can upload and the number of files you can compute in the cloud. You could set up your own instance and run it on your own computing power.
  15. Tools available: Named Entity Recognizers extract information from strings of text to locate and classify named entities into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, etc. Other tools normalize the output of name recognition by identifying what should be brought together. For example, the Stanford tool recognized that Sam and Jones were both separate names, and the normalizer recognized that they should be brought together as a full name. Phrase chunking is a natural language process that separates and segments a sentence into its subconstituents, such as noun, verb, and prepositional phrases. Tokenizers are used to break a stream of text into words, phrases, symbols or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing and text mining. Taggers match a vocabulary that you load against the words in the document.
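A rough Python sketch of two of the steps described above: tokenizing, and then normalizing recognizer output so that adjacent name tokens are brought together. This is not the Stanford tool or any LAPPS Grid service. The labels are assigned by hand to stand in for a recognizer's output, and the function names are made up for illustration.

```python
import re

def tokenize(text):
    """Break a stream of text into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def merge_adjacent_entities(tagged):
    """Join consecutive tokens that share an entity label into one entity.

    `tagged` is a list of (token, label) pairs; the label "O" means "not an entity".
    Simplification: two different entities of the same type that happen to sit
    side by side would also be merged.
    """
    merged = []
    for token, label in tagged:
        if label != "O" and merged and merged[-1][1] == label:
            merged[-1] = (merged[-1][0] + " " + token, label)
        else:
            merged.append((token, label))
    return merged

tokens = tokenize("Sam Jones spoke on Radio Moscow.")
# Hand-assigned labels standing in for what a named entity recognizer might emit.
labels = ["PERSON", "PERSON", "O", "O", "ORG", "ORG", "O"]
print(merge_adjacent_entities(list(zip(tokens, labels))))
```

Here "Sam" and "Jones" come out as the single entity "Sam Jones", which is the behavior attributed to the normalizer in the notes.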
  16. Example of Stanford Named Entity Recognizer output and visualization – Radio Moscow
  17. Workflow builder – pipeline – you can line up tools and create a pipeline to run sequential tools. Parallel workflow = files run through tools at the same time.
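The two workflow styles can be sketched in Python as follows. This is a generic illustration, not the LAPPS Grid workflow builder. The three step functions are hypothetical stand-ins for grid tools.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(document, steps):
    """Sequential workflow: each tool's output becomes the next tool's input."""
    result = document
    for step in steps:
        result = step(result)
    return result

def run_parallel(documents, steps):
    """Parallel workflow: several files go through the same pipeline at once."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda doc: run_pipeline(doc, steps), documents))

# Hypothetical stand-ins for tools lined up in a pipeline.
def split_sentences(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize_sentences(sentences):
    return [s.split() for s in sentences]

def count_tokens(token_lists):
    return sum(len(tokens) for tokens in token_lists)

steps = [split_sentences, tokenize_sentences, count_tokens]
print(run_pipeline("This is public radio. Stay tuned.", steps))
print(run_parallel(["One file here.", "Another file. Two sentences."], steps))
```

The order of the steps matters in the sequential case, because each tool has to accept whatever format the previous one produced.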
  18. So basically these tools are being created, and they are being refined for use. We, as archivists, don’t know about many of them, and don’t fully understand a lot of them. But we have an opportunity to help them be refined for potential use on archival collections. And develop clear instructions on use and output. Often when helping clarify you can also help steer outcomes and design, and additional refinement. We should take advantage of these computational folks being hungry for big data sets, because we have them with our ever increasing digital collections.
  19. This report suggests that if scholars and students do not use sound archives, our cultural heritage institutions will not preserve them. Librarians and archivists need to know what scholars and students want to do with sound artifacts in order to make these collections more accessible; as well, scholars and students need to know what kinds of analysis are possible in an age of large, freely available collections and advanced computational analysis and visualization.
  20. To this end, the School of Information at the University of Texas at Austin and the Illinois Informatics Institute at the University of Illinois at Urbana-Champaign received a 2012 NEH Institutes in Advanced Technologies in the Digital Humanities grant to host two rounds of an Institute on High Performance Sound Technologies for Access and Scholarship (HiPSTAS) in May 2013 and May 2014. Humanists interested in sound scholarship, stewards of sound collections, and computer scientists and technologists versed in computational analytics and visualizations of sound gathered together to consider how more productive tools for advancing scholarship in spoken text audio could be developed. We also received a Preservation and Access Research and Development grant through December 2015 to specifically develop ARLO as an open source tool for use in archives and special collections.
  21. ARLO was developed for classifying bird calls and using visualizations to help scholars classify pollen grains. Initially developed by David Tcheng for ornithological research, ARLO was extended to the poetry world through the NEH-funded HiPSTAS project, which provided access, training, and technical support.[1] At a simplified level, ARLO works by producing images of the audio spectra and then comparing these visualized time-slices with others across a range of pre-selected audio files. In the analysis of bird song, the goal is to find matches that locate particular calls within longer field recordings. This matching process helps the ornithologist find the proverbial needle in the haystack without painstakingly auditing hundreds of hours of audio. Some of the HiPSTAS researchers are interested in comparable problems: identifying and tagging unnoticed elements within a file, such as moments of laughter or applause, or the "audio signature" that marks the provenance of remediated poems, which Chris Mustazza proposes in his Clipping piece “The Noise is the Content.” ARLO has the ability to extract basic prosodic features such as pitch, rhythm and timbre for discovery (clustering) and automated classification (prediction or supervised learning), as well as visualizations. The source code for ARLO is open-source and will be made available for research purposes for this and subsequent projects on Bitbucket: https://bitbucket.org/arloproject/ As part of HiPSTAS, the I3 team has done some minor development of ARLO’s interface (and documentation) to allow humanities users to test the machine learning system and perform exploratory discovery (clustering) and automated classification (prediction or supervised learning) processes, as well as visualizations, so that we can identify user needs.
This development work for HiPSTAS has included limited interface development with ARLO for non-birding humanities users, such as the ability to analyze longer files, adding short keys for play, stop, fast-forward, etc. as well as infrastructure development that allows multiple users to use multiple collections and create separate tags or annotation sets and share them.
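The idea of comparing spectral time-slices, described above at a simplified level, can be sketched in plain Python. This is not ARLO's code or algorithm. It assumes synthetic sine tones in place of real recordings, a fixed slice length, a plain discrete Fourier transform, and cosine similarity as the comparison, all chosen only to keep the example small.

```python
import cmath
import math

FRAME = 64  # samples per time-slice (an arbitrary choice for this sketch)

def tone(bin_index, n_samples):
    """Synthetic test audio: a sine wave completing `bin_index` cycles per slice."""
    return [math.sin(2 * math.pi * bin_index * t / FRAME) for t in range(n_samples)]

def spectrum(frame):
    """Magnitude of the discrete Fourier transform for bins 0..len(frame)//2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def cosine(a, b):
    """Cosine similarity between two spectra (1.0 means the same shape)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_matches(signal, template, threshold=0.9):
    """Return the indices of time-slices whose spectrum resembles the template's."""
    target = spectrum(template)
    matches = []
    for start in range(0, len(signal) - FRAME + 1, FRAME):
        if cosine(spectrum(signal[start:start + FRAME]), target) >= threshold:
            matches.append(start // FRAME)
    return matches

# A "field recording": two slices of one tone, one slice of the "call", one more slice.
recording = tone(4, 2 * FRAME) + tone(10, FRAME) + tone(4, FRAME)
print(find_matches(recording, tone(10, FRAME)))
```

The search returns only the slice that contains the template sound, which is the needle-in-the-haystack step in miniature.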
  22. ARLO's backend handles the heavy lifting: modeling, searching, classification and so on. The backend is specifically designed to enable numerous machines to split up large tasks when one machine just can't cope. Users interact with ARLO through a web browser, meaning there is no software to install on individuals' machines. We don't even use client-side Flash or Java. A RESTful API allows users to script additional tools, automate tasks, and embed ARLO's methods in other tools.
  23. The majority of practical machine learning uses supervised learning. In supervised learning, we are given a data set and already know what our correct output should look like, on the assumption that there is a relationship between the input and the output. The goal is to approximate the mapping function so well that when you have new input data (x) you can predict the output variables (Y) for that data. It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance. Supervised learning problems can be further grouped into regression and classification problems. Classification: A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease.”
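A minimal Python sketch of supervised classification, assuming made-up training data with a single numeric feature per audio clip. The classifier here is a simple nearest-average rule picked for brevity. It is not the method ARLO uses.

```python
def train(examples):
    """Learn one average feature value per label from (x, label) training pairs."""
    sums, counts = {}, {}
    for x, label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, x):
    """Predict the label whose learned average is closest to the new input x."""
    return min(model, key=lambda label: abs(model[label] - x))

# Invented training set: the "teacher" has already supplied the correct labels.
training = [(0.9, "spoken"), (1.1, "spoken"), (4.8, "sung"), (5.2, "sung")]
model = train(training)

print(predict(model, 1.3))  # close to the "spoken" examples
print(predict(model, 4.0))  # close to the "sung" examples
```

The labels are categories, so this is a classification problem in the sense defined above.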
  24. Unsupervised learning, on the other hand, allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables. The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data. It is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.
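For contrast, a small Python sketch of unsupervised learning: the same kind of invented one-number-per-clip data, but with no labels supplied. The two-group averaging procedure below is a simplified form of k-means clustering, used here as an assumed example of "discovery" and not as a description of any tool in this talk.

```python
def two_means(values, iterations=10):
    """Split numbers into two groups by alternating assignment and averaging."""
    centers = [min(values), max(values)]  # deterministic starting guesses
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the average of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

# No labels are given; the structure is found from the numbers alone.
print(two_means([0.9, 5.2, 1.1, 4.8, 1.0]))
```

The algorithm finds two clusters but does not know what they mean. Someone still has to listen and decide whether a group corresponds to, say, speech or singing.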
  25. Blue = sung; green = spoken; red = instrumental
  26. John A. Lomax Collection in UT Folklore Center Archives, Small Multiples. Instrumental sections are in red, spoken sections are in green, and sung sections are in blue.
  27. While it remains to be seen whether this research will be successful and if it’s even technically possible, I think it’s worth covering the implications should this experiment turn out to be a success. It will help scholars find what they are looking for. Currently scholars can only find content related to their search if the archivist has included this information in the metadata or if a searchable transcript exists. It will help archivists provide better access to their collections. Until now we have needed our audiovisual recordings and audio transcribed into text to search the recordings. But if this is successful, we may not need the transcripts to search the audio. Or if we do have transcripts and other metadata, we can use the audio analysis and that metadata to validate each other. We can search the audio with the audio itself.
  28. The material in the AAPB collection is especially important because of the era it reflects. There remains much basic excavation and interpretive work in recent American history for the present generation of scholars to accomplish. A recent essay noted that American history scholarship pertaining to the period of 1973 onwards is “limited, fragmentary, and politically conflicted.” Accounts about later periods, the author concluded, have “not really been history.”
  29. The American Archive collection contains a wealth of material produced locally for local audiences. These programs represent an untapped important resource. During the 1960s and 1970s, many historians began to focus on social history, history from the bottom up, instead of on national elites. This “emphasis on diversity,” Alan Brinkley has written, presumed “that the history of the nation is many different stories, no one of which can be considered the ‘main’ story.”
  30. More recently, some historians also have advocated for integrating the national story into wider contexts. The goal is to relate “national experiences to larger processes and also to local resolutions,” Thomas Bender has written.