This document discusses a project to catalog the National Educational Television (NET) collection to improve discoverability. The project involves:
1) Creating a comprehensive catalog of 8,000-10,000 entries covering 1953-1972 NET content with descriptive data and location information.
2) Designing a new web interface for the catalog to accommodate different types of data and user search behaviors.
3) Enriching the catalog with linked open data by reconciling descriptions with library authority files and assigning identifiers.
The goal is to make this historically significant but scattered public media collection more accessible through a centralized online portal. Challenges include the complexity of the content and lack of original metadata.
4. • A collaboration between WGBH and the Library
of Congress
• An unprecedented and historic collection of
American public radio and television content
• Dates back through the 1940s
• Preserved and made available to the public
What is the AAPB?
7. • Be a focal point for discoverability of historical public media content;
• Coordinate a national effort to preserve and make accessible historical public media content;
• Provide content creators with standards and best practices, guidance, training, and advice
for storing, processing, preserving, and making accessible their historical content, and for
raising funds in order to accomplish these tasks;
• Disseminate content widely by facilitating the use of archival public media content by
scholars, educators, students, journalists, media producers, researchers, and the public, for the
purpose of learning, informing, and teaching;
• Increase public awareness of the significance of historical public media and the need to
preserve and make accessible significant public broadcasting programs; and
• Ensure the perpetuation of the archive by working toward financial sustainability.
Mission
8. • 40,000 hours of digital material
initially from over 100 stations
• 2.5 million inventory records from
120 stations
• Identified over 3 million items kept
at stations, archives, producers,
university collections across the
country
Initial Collection
9. Collection Growth
• Growing the collection by up to 25,000 hours of digitized content per year
• Assisting collection holders with digitization grant proposals and ingesting digital
files into our systems
• Recent acquisitions
– PBS NewsHour and predecessor series
– American Masters raw interviews
– Ken Burns’ The Civil War raw interviews
– Eyes on the Prize raw interviews
– NHPR presidential primary collection
– KBOO community radio programs
– NPACT coverage of Senate Watergate Hearings
– Southern California Public Radio environmental collection
– Vision Maker Media films
10. Goal: A Centralized Web Portal for Discovery
• All AAPB digitized content on specific
topics discoverable through single
searches
• Direct links to public media on other sites
• One-stop shopping for users
• Helps solve the separate silos syndrome
• Digital Commonwealth and DPLA as a
model
12. Current Status
• Over 30,000 videos and sound recordings
available to anyone in the US
• Public access to over 50,000 hours (over
80,000 assets) on location at WGBH and
the Library
• 2.5 million searchable metadata records
25. State of the Collection
• Programs are scattered
• No complete or publicly accessible list of
titles
• Descriptions are limited, in obscure
sources, and not always
reliable/authoritative
27. What?
• Comprehensive catalog of 8,000-10,000 entries
covering 1953-72
• Descriptive data/location of assets
• Process the Library of Congress’s physical NET
holdings
For Whom?
• For collection managers: inform prioritization for
preservation and more description
Project Goals
28. Project Team
WGBH
• Project Manager
• Cataloger
• Web Designer
• Web Developer
Library of Congress
• Project Coordinator
• Catalogers
• Metadata
Specialist
30. Challenge: Working from Opposite Ends
WGBH
• Start with a list of titles
• Search for
descriptions
• Search for copies of
the titles
Library of Congress
• Start with films and
tapes
• Match with titles
• Search for descriptions
31. Challenge: Meeting in the Middle?
• No Original Identifiers
• Title matching’s not an exact
science
• Distributed team means hosted
solution
32. Hosted Working Space
• Hosted FileMaker Web App
– Multiple users at the same time
– Access at a URL through a browser
– Export data in a format our repository accepts
– Add link to record in MAVIS for batch import of
other fields
35. Challenge: Too Many Types of Data
Current Design
• Descriptive
– Title
– Date
– Short summary
– Keywords
– Creator/Rights
• Citation
New Design
• Descriptive
– More complicated
titles
– Longer summaries
– Longer credit list
• Citation
• Holdings
36. Core Content and Collapsible Sections
• Core content always stays on the page
• Other content broken into sections
– Descriptive, Holdings, Citation, (and maybe
Credits)
• Can expand or collapse any combination
of sections at any time
40. Challenge: Users Search Differently
Current Design
• One search results view
– Thumbnail
– Abridged description
– Take up a lot of real
estate
• Access facet below the
fold
• Really long lists in facets
New Design
• Three results views
– Main view
– Gallery view
– List (or skim) view
• Access facet more
prominent
• Scrollable or type-ahead
lists for facets
47. Challenge: Particularities of the Collection
• NET Collection isn’t like other collections
– Titles can be more complicated
– Not always intuitive how everything in the
collection hangs together
– Provenance of films and tape is complicated
– Provenance of descriptive information is more
complicated
48. NET Special Collection
• Collection Summary
• Collection Background
• Suggested Search Strategies
• Featured Items
• Other Resources
51. Step Three: Linked Data
By Cygri - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2768615
52. What is Linked Data?
• Using the web to link concepts and other
data (expressed as URIs) with each other
• Example:
[Stephen King] [is creator of] [The Shining]
[Stephen King] [was born] [1947-09-21]
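The statements above have the classic subject-predicate-object shape. As a minimal sketch (not how any AAPB system actually stores data), a triple can be modeled as a tuple whose parts are URIs; the URIs below are illustrative placeholders, not real identifiers.

```python
# A sketch of the triple idea from the slide: each statement is a
# (subject, predicate, object) tuple, with concepts expressed as URIs.
# All URIs here are made-up placeholders for illustration only.

triples = [
    ("http://example.org/person/stephen-king",
     "http://example.org/rel/isCreatorOf",
     "http://example.org/work/the-shining"),
    ("http://example.org/person/stephen-king",
     "http://example.org/rel/wasBorn",
     "1947-09-21"),
]

def objects_for(subject, predicate, data):
    """Return every object linked to a subject by the given predicate."""
    return [o for s, p, o in data if s == subject and p == predicate]

print(objects_for("http://example.org/person/stephen-king",
                  "http://example.org/rel/wasBorn", triples))
# prints ['1947-09-21']
```

Because the subjects and predicates are URIs rather than plain strings, anyone else on the web can make statements about the same things and the data can be linked together.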
53. Challenge: How Can We Use Linked Data?
• Using Authorities and their URIs
– Library of Congress Name Authority Files
– EIDR IDs
• Using OpenRefine to reconcile existing data
with authorities and grab URIs
• Keeping URIs even if we’re not storing linked
data right now
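Reconciliation services in the OpenRefine style accept a batch of queries as JSON. As a hedged sketch of just the payload shape (endpoint URLs and extra parameters such as type or property constraints vary by service and are omitted here):

```python
import json

# Sketch of an OpenRefine-style reconciliation query batch:
# {"q0": {"query": ..., "limit": ...}, "q1": ...}. This only builds the
# payload; sending it to a real service (and the service's URL) is out
# of scope and service-specific.

def build_reconciliation_queries(names, limit=3):
    return {f"q{i}": {"query": name, "limit": limit}
            for i, name in enumerate(names)}

payload = build_reconciliation_queries(
    ["National Educational Television", "Stephen King"])
print(json.dumps(payload, sort_keys=True))
```

Each response candidate carries an identifier that can be stored as a URI alongside the original text value, which is the "keeping URIs even if we're not storing linked data right now" idea.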
55. Challenge: How Can We Make Linked Data?
• We’ve done a lot of the intellectual work to
define new relationships
• How can we publish things as linked data,
so that other people can talk about things
we know about?
• Report forthcoming!
You can find our website at americanarchive.org. You can also find us on various social media platforms as amarchivpub. Don’t worry, if you can’t get all of these down right now, I’ll show this slide again at the end of the presentation.
So, before I talk about this specific project, I want to introduce the American Archive of Public Broadcasting. It’s a great project, but not everyone is always clear on how we’re structured and what we actually do. At our core, we’re a collection of radio and TV materials created by or for public TV and radio in the US dating back to the 1950s. We’re a team of people at both WGBH and the Library of Congress, working to preserve these materials and make them accessible to the public.
WGBH is Boston’s public television station. We produce fully one third of the content broadcast on PBS, including the series you see here, as well as Masterpiece, both Classic and Mystery. In addition to television, we have 2 radio stations and a large, award-winning Digital department that is the number one producer for the sites you’ll find on PBS.org. As you can see, we produce a wide variety of programming, from public affairs, to history and science, to children’s programs, arts, culture, drama, and how-tos. We have been on the air since 1951 with radio and 1955 with television.
At heart and through our mission we are an educational and cultural institution. We originated out of a consortium of academic universities in the Boston area. Because we have produced so much we have a large archive of educational programming that is of interest to scholars and researchers, in addition to the public.
I’m assuming most of you know what the Library of Congress is. The AAPB team at the Library (which is probably what I’ll call them throughout this presentation) is specifically situated in the Moving Image and Recorded Sound department, and split between their Capitol Hill and Culpeper, Virginia campuses. In the AAPB, WGBH is primarily responsible for providing access to the collection through cataloging and our website, while the Library is responsible for the long-term preservation of the digital files. We do, however, collaborate on most decisions and projects.
The American Archive of Public Broadcasting seeks to preserve and make accessible significant historical content created by public media, and to coordinate a national effort to save at-risk public media before its content is lost to posterity.
Our mission and goals are challenging. In addition to preserving, we want to ensure discoverability and access. We want to guide and support current content creators and stewards of the materials with best practices to protect this historic programming. We want to facilitate the use of the materials and increase public awareness of its importance. And of course we want to be able to sustain these goals into the future.
Initially, CPB funded an inventory project and then a large digitization project. Stations that participated in the inventory had the opportunity to choose items to be digitized – items important to them, or items they could only identify by digitizing them and then watching or listening. CPB chose a single vendor – Crawford Media – to do all the digitization. Tapes were sent to Crawford in Atlanta. In addition, about 5,000 hours of already-digital content was identified to be added to the collection. So in the end, the initial collection consists of about 40,000 hours of content from about 100 organizations, which totaled around 68,000 files. In 2013 CPB chose the collaboration between the Library of Congress and WGBH to be the future stewards of the Archive.
In the past few years we’ve been able to continue the growth of the collection. We have a goal of adding up to 25,000 hours of digitized content per year, and we help organizations with grant proposals and other specifics around digitization and ingestion of the files into our systems. So far, we’ve added content from PBS NewsHour, American Masters, Ken Burns, Eyes on the Prize, KBOO Community Radio, Vision Maker Media, the Watergate Hearings, and others. We are also currently engaged in discussions with many other organizations to see where we can coordinate efforts for preservation and access.
AAPB hopes to provide a centralized web portal of discovery where researchers, educators, students – really anyone – can find relevant public broadcasting programs existing either on our own site or on sites belonging to other archives and stations. With approximately 1,250 public radio and television stations in existence in America, having one access point will aid scholars interested in researching how national or even international topics have been covered in divergent localities over the past 60+ years. AAPB has made a start at becoming that portal. If stations and archives operating their own websites will send us metadata, we will provide direct links from AAPB to digitized files on the other sites. For a researcher, this would be one-stop shopping. This is how Digital Commonwealth and Digital Public Library of America (DPLA) operate, and we hope that it helps solve the separate silos syndrome prevalent in public media archives.
Through the website, there are many ways users can explore and interact with our collection. I’ll talk about some of these specifically as we go on. We’ve recently launched a Help Preserve Public Media section, where we highlight our Citizen Archivist Toolkit: 3 crowdsourcing efforts to help describe our content. We’ve also created a Public Library toolkit for librarians who want to highlight our collection to their patrons. That’s not publicly available yet, but if you sign up here at the front, I’ll pass your information to our Engagement and Use manager, who will get you the materials.
Our entire collection, even where we just have metadata records and no digitized files, is available for anyone to search on our website. That consists of about 2.5 million records total. All of the digitized content that we manage can be accessed by anyone if they come to WGBH or the Library of Congress. That’s about 80,000 assets that people can watch and listen to. Of those, we’ve been able to evaluate over 30,000 assets that we’ve cleared to go into what we call our Online Reading Room. These assets can be viewed or listened to anywhere within the US for free. All people have to do is click an accept button when we present them with the terms of use. We’re regularly adding new content into the Online Reading Room. For example, when we launched the website in 2015, there were around 7,000 assets in the ORR. In 3 years, we’ve been able to increase that to over 30,000, and we hope that number keeps going up!
Our website at americanarchive.org is where users can go to search the collection. If they already know what they’re looking for, they can search by keyword or advanced search.
They can also browse our special collections. These highlight some of the great content in the AAPB and provide some collection-level information, like provenance and search strategies specific to the collection. We think of them as super-user-friendly versions of finding aids.
Users can also click through any of our curated exhibits. These take all kinds of content from our collection and use it to talk about historic and cultural topics and trends such as the civil rights movement, presidential elections, and historic preservation and urban renewal. We even have a few exhibits about broadcasting itself, such as the evolution of the news magazine format.
Users that are especially interested in content produced in their state, or by a specific station they remember watching growing up, can search the collection that way. You’ll notice we have a bit of a gap between the Rockies and the Midwest. We’re working on fixing that as we continue to grow the collection.
And for users who are interested in general, but aren’t sure what our collection has to offer, we list some of the main topics to make browsing easier.
Users are presented with search results that they can scroll through to find what they’re most interested in.
And then they can go to specific record pages to watch or listen to the content and see more of the metadata.
Some of our content even has searchable and synced transcripts, which can help researchers find the exact point of an interview they want to see.
Okay, so I’ve talked your ear off about the AAPB. Now I want to tell you about this specific project. And to do that I need to explain what National Educational Television was.
NET existed from the mid-1950s through the early 1970s. It went through several phases as it grew, hence the multiple logos. At first, it was started by the Ford Foundation as a way to support this new thing called educational television, which grew into public television as we know it now. These stations were all very local; they made somewhere around 5 hours of content a week, which obviously isn’t enough to even fill the daytime hours. So the Ford Foundation started the National Educational Television and Radio Center, which coordinated between local stations in Boston, in Texas, in San Francisco, in Philadelphia, etc., so that these stations could share content. So even if your station could only make 5 hours of programming a week, if you got a few shows from other stations too, then you had more content to broadcast. Now this was before they had interconnected satellites for broadcast, so a big part of NET’s role was getting a few copies of each program and sending them around from station to station. They were essentially going on tour. The film reels would spend a week or two in each city and then move on to another one. And as more cities started having more educational stations, there was more content to share.
These are the first educational TV stations. You’ll see that as we get later into the ’50s, more and more cities start stations. Eventually, in the ’60s, NET started not only distributing content between the stations, but producing content of national significance. A lot of these were public affairs documentaries, which would be invaluable for historical and media studies research by scholars today.
When Lyndon B. Johnson signed the Public Broadcasting Act of 1967 into law, the Corporation for Public Broadcasting was formed. A few years later, the Public Broadcasting Service (aka PBS) was created and essentially took over for NET. As PBS ramped up and NET wound down, there wasn’t a lot of importance put on establishing an archive of what NET did, what content NET distributed, etc.
So when people go to research NET today, there’s no good central resource. There are papers in various archives, and a lot of stations have copies of programs that they think were distributed by NET, but there’s nothing authoritative or complete. The few things we do have to go on are a donation that PBS gave to the Library of Congress of material that was mostly from NET, although some early PBS got mixed in. We also have program files that NET left to PBS. PBS had these microfiched, and for years the only way to access the information on the microfiche was to go to the Library of Congress or WNET, the public broadcasting station in New York, which also had a copy. There was good data in these files, but it wasn’t very accessible. The other thing that was frustrating was that there was no real way for stations or other archives to talk to each other to see who had a copy of what. For the most part, these are programs that aired once and were never seen again, so who even knows if copies still exist? And if more than one person has a copy, who has the best copy? And if someone already digitized their copy, maybe I shouldn’t waste money digitizing my copy, and should digitize something else instead.
Okay, so with that scattered situation to set the stage, let’s get into the actual project that we’ve been doing. This project started in 2015 and was funded by the Council on Library and Information Resources. We’re wrapping up the project this year, although we’re not totally done yet, so some of the things I talk about will be plans, rather than actions we’ve already taken.
The main goal of the project is to publish a catalog of all known NET titles. This will bring together all of the titles into one place, rather than forcing researchers to look for information at each of the individual stations that originally produced the programs. The catalog should deduplicate titles, so that if a program aired in 1958 under one title and then re-aired in 1965 under a different title, or as part of a series, there shouldn’t be multiple records. Instead there should be one record that notes all of the different broadcast dates and titles.
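As a rough sketch of that deduplication step, assume each broadcast record has already been matched to a catalog entry (the hard, manual research part); the merge itself then just accumulates every known title and air date into one record. The field names and the sample records below are hypothetical, not the project's actual data model.

```python
def merge_matched(records):
    """Collapse broadcast records sharing a catalog_id into one entry
    listing every known title and air date. Assumes the matching
    (assigning catalog_id) was already done by a cataloger."""
    merged = {}
    for r in records:
        entry = merged.setdefault(r["catalog_id"],
                                  {"titles": [], "broadcast_dates": []})
        if r["title"] not in entry["titles"]:
            entry["titles"].append(r["title"])
        if r["broadcast_date"] not in entry["broadcast_dates"]:
            entry["broadcast_dates"].append(r["broadcast_date"])
    return merged

# Hypothetical records for one program that re-aired under a new title:
recs = [
    {"catalog_id": "net-0001", "title": "A Program Title",
     "broadcast_date": "1958-05-04"},
    {"catalog_id": "net-0001", "title": "A Program Title, Revisited",
     "broadcast_date": "1965-02-14"},
]
print(merge_matched(recs)["net-0001"]["titles"])
# prints ['A Program Title', 'A Program Title, Revisited']
```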
The catalog also includes as much descriptive information about each program as we’ve been able to gather. Most of these descriptions come from the microfiched program files that I mentioned earlier. Those were scanned, OCRed, corrected, and then key pieces of information were pulled into the catalog records. While we researched these titles to better describe them, we also did our best to find copies of the programs. We reached out to original NET stations, local university archives, and the Moving Image Archiving community at large to gather information about which programs they have, on what format, and in what condition. The final goal of the project is to process the NET programs that are in the Library of Congress’s collection. Once processed, these will also be added to the catalog as copies of the programs.
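Pulling key pieces of information out of corrected OCR text can be partly mechanized. Here is a hedged sketch of the idea for one field; the pattern and the sample text are invented for illustration, since the real program files vary widely and were reviewed by catalogers.

```python
import re

# Sketch only: find a broadcast date in corrected OCR text from a
# program file. The label wording and date formats are hypothetical.
DATE_PATTERN = re.compile(
    r"(?:air|broadcast)\s+date[:\s]+(\d{1,2}/\d{1,2}/\d{2,4})",
    re.IGNORECASE)

def extract_broadcast_date(ocr_text):
    match = DATE_PATTERN.search(ocr_text)
    return match.group(1) if match else None

print(extract_broadcast_date(
    "Series: An Example Series. Broadcast date: 4/12/1960"))
# prints 4/12/1960
```

In practice a script like this can only pre-fill candidate values; a person still has to confirm them against the source document.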
There are two main audiences we’re trying to serve with this project, besides members of the general public interested in this topic. The first I’ve already alluded to, and that’s collection managers. Anyone who has NET programs in their collection could benefit from the intellectual work we’ve done. We have already sent a few stations the descriptions we have because they didn’t know anything besides the titles. Collection managers can also use the information that we’ve gathered about copies of the programs. If we know that WGBH has a copy on a low-quality tape and the Library of Congress has it on 16mm film, then we have more information to decide whether it’s better to get the higher-quality but more expensive film digitization, or the lower-quality but cheaper tape digitization. Typically we go for the higher quality, but the main point is that we’re giving people more information on which to base their decisions.
We’re also trying to serve researchers. Putting more information on our website is the first step of that. We’ve also made some web design decisions to help them search the records in better ways. And much like collection managers wanting to know where copies of a program exist and on what format, researchers are interested in that too. If a program hasn’t been digitized, our website will show the researcher where known copies exist, so they can arrange travel to those archives or work with the station to digitize it.
To accomplish all of those goals, we’ve got a pretty diverse set of skills on our team: managers handling the logistics, catalogers handling the metadata, a metadata specialist working on Linked Data (which I’ll talk about later), web designers helping us think through the best way to present information, and developers making those designs something that people can take advantage of on the website.
So, the first thing we had to do was process and catalog. We started this first because it’s the most time-consuming. We actually are still wrapping up some of it, even though we’ve moved on to other parts of the project as well. The other reason we started this first was because if we didn’t have records, there would be no point in designing a website to display them.
For each step of the project, I’ll talk about the challenges and how we worked through them. For cataloging, most of our challenges stemmed from the fact that WGBH and the Library were working from 2 different ends of the description process. WGBH’s goal was to make that list of titles and then add descriptive records for each. So everything we were starting with was super abstract. At this point we didn’t even care if there was a copy of a program in existence. We just wanted to know whether it was distributed by NET and what it was about.
On the other hand, the Library has a huge collection of unprocessed film and 2-inch videotape that they needed to know about. They started from the film can or tape, and then worked their way up to that more abstract, descriptive area. If you’re familiar with the bibliographic FRBR model: WGBH was starting with a Work or an Expression, and the Library was starting from an Item.
It would have been much easier to start with the abstract records, get those totally squared away, and then use them as a resource while processing the Library’s holdings. It also would have been easier to start with people’s holdings and describe them as we processed. But this was a grant project, so there were enforced deadlines. Because the collection and information about it was so scattered, we knew we’d need all of the grant period to do both activities, so we had to do them at the same time.
So we tried to meet in the middle. The easiest way to match up two separate collections is when they have identifiers in common that you can match on. Something like an ISBN or some other standardized way of identifying a work. Unfortunately, there were no consistently applied identifiers in the original NET collection. So as we worked, both teams were creating identifiers for each program. But since we were working at the same time, we were both making our own identifiers for our own cataloging systems. Ugh. We came up with a solution where the Library added their identifiers to the records in our system, so that there was a link to match the two systems together on. That’s been a good solution for us, but the hang-up is adding that identifier at the beginning. The only way we know we’re talking about the same thing is by matching the titles and other descriptions. Now the main problem with that is that broadcast programs can have different titles, and when people write about programs in paperwork, which is where we were getting most of our data, they don’t always use the correct title. Here’s a real example. For a while there were records for programs called British Public Schools and Public Schools of England. It wasn’t until we discovered that these were produced by the same people and broadcast on the same day that we were sure they were the same program.
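Without shared identifiers, title comparison becomes a triage step. A hedged sketch of that idea: a similarity score over normalized titles can flag likely pairs, but (as the British Public Schools example shows) it only surfaces candidates for a cataloger to confirm against other evidence like producer and broadcast date; this is not the project's actual matching tool.

```python
from difflib import SequenceMatcher

# Sketch of title-based triage: score pairs of normalized titles so a
# human can review likely matches. Not authoritative on its own.

def normalize(title):
    return " ".join(title.lower().split())

def title_similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# The real example pair from the talk scores well below an exact match,
# which is exactly why human confirmation was still needed:
score = title_similarity("British Public Schools",
                         "Public Schools of England")
print(round(score, 2))
```

A threshold on such a score could be used to generate a review queue, with everything above it checked by hand.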
And as for us working in 2 separate systems, well, there are 2 reasons for that. One is that the Library was processing these titles into their main collection, so obviously that’s a different system than ours. The other is that we were 400 miles apart.
We ended up using a homegrown system to solve this problem. We created a FileMaker database with a data model complementary to the AAPB’s main system. By using FileMaker we were able to more efficiently make batch changes to the records as we were researching the NET programs. FileMaker also has a version that can act as a hosted web application. By using this version, we were able to have multiple users editing at the same time without requiring catalogers to install FileMaker software on their computers. That was really big for us because the Library of Congress staff have very strict regulations on what they can install on their computers, and it takes a long time to get approval for new things. This way we could all see the same records, edit them regardless of location, and add those links to the Library’s MAVIS system that let us batch import the rest of the data we need from their system. Because we designed it specifically for this project, the working space database exports data that can go straight into the AAPB’s main metadata management system with no additional transformations.
So now that we have records, which are deduplicated, have well-researched descriptions, and indicate where copies of each program exist across disparate collections, we need a way to show users that information.
This is just a quick look at one of our current record pages, the data that we display, and how we lay it out.
So the main challenge on the record page is that we have new types of data to share and we have more data than we did before. Our current page mainly just shows basic descriptions. There’s a title and a short summary. A date, some keywords, and a few names of creators. At the end of the page we show the citation for the resource in 3 formats.
Now that we have these new records, we want to not only show the description, but also information about which collections have holdings (kind of like a WorldCat, but for public media). We want to keep the citations, and for the fields we already showed, now we have even more information. The credit lists we have can be really long, the summaries are much longer than what we had before, and we have more shows that have complicated title situations.
So the main way we address these issues is by separating out the core metadata fields, so that they are always visible, and then putting the rest of the metadata into collapsible sections, so the user can manage what’s on the page.
Current results for reference. You’ll notice that even with this much of the screen, I can only see 2 search results. You’ll also notice those big boxes along the left. That’s what our access facet looks like right now. It basically tells users how many records they are searching through. The first category is the most narrow, limiting results to only things in the Online Reading Room, which they can view or listen to. The second is all of the things we have digitized, which means they might not be able to play some of the media if the rights haven’t been cleared for putting it in the Online Reading Room. And the third option is all of the records, which means they would also get metadata-only records for media that hasn’t been digitized yet. A lot of users don’t see these facets, and so don’t realize that they can change the category they’re searching in.
First, we cleaned up the main view, which still shows a thumbnail, title, etc., but doesn’t leave quite so much white space around everything.
List or skim view. It only shows title, date, and organization. Rather than thumbnails, which take a lot of room, it uses simple iconography to show whether an item is video or audio and whether it’s digitized.
Gallery view shows just thumbnails and titles, for a more visual representation of the results.
You’ve been able to see the more prominent access facet along the top of these last few slides. Here it is with the help text expanded, which users get by hovering over the question marks.