Leslie Johnston: Big Data at Libraries, Georgetown University Law School Symposium on Big Data, January 2013
1. Big Data: New Challenges for
Digital Preservation and Digital
Services
Leslie Johnston
Acting Director, National Digital Information
Infrastructure & Preservation Program
Library of Congress
2. What are the Biggest Insights that we
have Learned in Fifteen Years of
Building Digital Collections?
We can never guess every way that our
collections will be used.
Researchers do not use digital
collections the same way that they use
analog collections.
4. Data is not just generated by satellites,
identified during experiments, or collected
during surveys.
Datasets are not just scientific and business
tables and spreadsheets.
We have Big Data in our Libraries, Archives
and Museums.
5. What are examples of some
of the challenges of
collecting and preserving
large scale collections in
many formats, and making
them usable as collections
and as data?
6. More and more researchers want to use
collections as a whole, mining and organizing
the information in novel ways.
Researchers use algorithms to mine the rich
information and tools to create pictures that
translate that information into knowledge.
Researchers may want to interact with a
collection of artifacts, or they may want to
work with a data corpus.
7. We still have collections. But what we also
have is Big Data, which requires us to rethink
the infrastructure that is needed to support
Big Data services. Our community used to
expect researchers to come to us, ask us
questions about our collections, and use our
digital collections in our environment.
Now our collections are, more often than not,
self-serve.
9. National Digital
Newspaper Program
chroniclingamerica.loc.gov/
Some researchers want to search for stories in historic
newspapers.
Some researchers want to mine newspaper OCR for trends
across time periods and geographic areas.
Requests have come in to analyze all 5 million pages.
The site gets approximately 4 million hits per day.
The program has:
Multiple producers (25 now, ultimately 54)
Free and open public access
APIs for machine access and automated processes
Over 5.25 million newspaper pages ingested to date
Over 250 TB of data
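The trend mining described on this slide can be illustrated with a minimal sketch. It assumes the OCR text has already been retrieved and reduced to (year, state, text) records; the sample rows, the record layout, and the function name `term_counts_by` are all hypothetical and are not part of the program's actual APIs.

```python
from collections import Counter

# Hypothetical OCR records: (year, state, ocr_text). Real pages would come
# from the program's machine-access APIs; these rows are invented.
PAGES = [
    (1898, "NY", "War with Spain declared. The fleet sails for Cuba."),
    (1898, "CA", "Gold rush news from the Klondike; war rumors grow."),
    (1905, "NY", "City council debates the new subway line."),
    (1917, "CA", "War declared; the draft begins across the state."),
]


def term_counts_by(pages, term, key):
    """Count occurrences of `term`, grouped by key(year, state)."""
    counts = Counter()
    term = term.lower()
    for year, state, text in pages:
        # Crude tokenization: split on whitespace, strip edge punctuation.
        words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
        counts[key(year, state)] += words.count(term)
    return dict(counts)


# Trend across time periods, then across geographic areas.
by_year = term_counts_by(PAGES, "war", lambda year, state: year)
by_state = term_counts_by(PAGES, "war", lambda year, state: state)
print(by_year)
print(by_state)
```

At the scale of millions of pages the same grouping would need to run as a batch or distributed job, but the shape of the question (a term, a grouping key, a count) stays the same.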
10. Packard Campus National
Audio-Visual Center
Preserving Film, Broadcast Television, and
Audio
The Packard Campus supports a variety of preservation
workflows, including those for obsolete physical
formats such as wire recordings, wax cylinders,
and 2" videotape. The Campus is fully equipped
to play back and preserve all antique film, video
and sound formats, and to maintain that
capability far into the future.
The facility also handles born-digital video and
audio received directly from producers and
copyright owners.
Over 3 PB of files.
11. WEB ARCHIVES
http://www.loc.gov/webarchiving/
lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The Library has been archiving the web since 2000. Subject area
specialists curate the collections, and Library catalogers create
collection-level metadata records. Permission requirements vary by
site category.
The collections include:
• U.S. elections
• Web sites created by members of the House and Senate
• Thematic collections around events, such as elections in the
Philippines, the Iraq war, and the appointment of Supreme Court
Justices.
• Collections around an area of study, such as Legal “Blawgs”
When we began archiving election web sites, we imagined users
browsing through the web pages, studying the graphics or use of
phrases or links. When our first researchers came to the Library,
they did want to know about all those topics, but they used scripts to
query for them and sort them into categories. They had little
interest in reading the web pages themselves.
Approximately 6 billion files
Over 300 TB
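The kind of script those first researchers used might look like the sketch below, which pulls the links out of an archived page and tallies them by host. It is only an illustration: the sample HTML and the names `LinkCollector` and `link_hosts` are invented here, and real web archive files would first need to be unpacked from their container format.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def link_hosts(html):
    """Return a Counter of outbound link hosts; relative links are skipped."""
    parser = LinkCollector()
    parser.feed(html)
    hosts = (urlparse(link).netloc for link in parser.links)
    return Counter(host for host in hosts if host)


# Invented sample page standing in for one archived campaign site.
SAMPLE = (
    '<p><a href="http://example.org/a">A</a> '
    '<a href="http://example.org/b">B</a> '
    '<a href="http://example.com/">C</a> '
    '<a href="/local">L</a></p>'
)
hosts = link_hosts(SAMPLE)
print(hosts.most_common())
```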
12. THE TWITTER ARCHIVE
Every public tweet since Twitter’s launch in March
2006.
Research requests have included users looking for
their own Twitter history, the study of the
geographic spread of news, the study of the
spread of epidemics, and the study of the
transmission of new uses of language.
The collection comprises only a few TB, but holds
tens of billions of tweets.
A White Paper is available online at:
http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/
[Figure labels: social, science, visualization, social media status, events, personal, privacy, commercial]
13. eSerials
Copyright Mandatory Deposit represents a large
acquisitions channel for the Library. In general, all
U.S. publishers are legally required to submit for
deposit two copies of each of their publications to
the Copyright Office. This mechanism has allowed
the Library to build the collection and to preserve
the publications.
eSerials became subject to mandatory deposit in
January 2010, with the publication of a new interim
regulation. Demands began in June 2010 and files
began to arrive in October 2010.
The files must come to the Library “as published” – in
whatever their original formats are.
Articles may be accompanied by their associated
datasets.
14. RESEARCH DATASETS
The datasets generated/used in the research
process.
Datasets can be:
Small, such as surveys of a small sample
population
Medium, such as a corpus of images
Big Data, such as years of observational
astronomical data.
15. It is not enough to be collecting
publications.
We have to collect and preserve
research data, in addition to
recognizing that the collections
we already have are Big Data to
be mined.
16. Are our institutions ready?
We are building large digital
collections and must consider
new ways in which they should
be managed and used.
17. I will mention infrastructure only in passing.
There are scale issues related to:
Storage
Backup and tape archiving
Bandwidth
Software development
Staffing for processing
18. Library of Congress Preservation Infrastructure
The Library developed the BagIt transfer specification for the
movement of files between and within organizations.
http://www.digitalpreservation.gov/documents/bagitspec.pdf
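The general layout of a bag (a payload directory, a declaration file, and a checksum manifest) can be sketched as follows. This is a simplified illustration, not the Library's own tooling: the helper names `make_bag` and `verify_bag` are hypothetical, and the version string and the choice of SHA-256 for the manifest are assumptions. It also ignores optional tag files such as bag-info.txt.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, files):
    """Write payload files under data/ plus bagit.txt and a manifest."""
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True)
    lines = []
    for name, content in sorted(files.items()):
        (bag / "data" / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"{digest}  data/{name}")
    # Bag declaration; the version number here is an assumption.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Recompute each payload checksum and compare it to the manifest."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / "bag", {"page1.txt": b"first page"})
    ok = verify_bag(bag)
    # Simulate corruption in transit, then verify again.
    (bag / "data" / "page1.txt").write_bytes(b"damaged")
    tampered_ok = verify_bag(bag)

print(ok, tampered_ok)
```

Because the manifest travels with the payload, the receiving side can confirm that every file arrived intact without any prior knowledge of the content.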
The Library inventories incoming files, and is gradually inventorying all
digital content.
The Library maintains multiple copies of files on servers and on tape,
in geographically distributed locations.
The Library has documented sustainability factors for file formats.
http://www.digitalpreservation.gov/formats/
For cases where we do have control over what comes in, we have a
“Best Edition” Preferred Formats statement, which is currently being
updated.
http://www.copyright.gov/circs/circ07b.pdf
19. There are many new
activities to be planned for
with new researcher uses
and expectations.
20. How much ingest processing should be done with
data collections, or collections that can be treated as
data?
Should collections be processed to create a variety of
derivatives that might be used in various forms of
analysis before ingesting them?
Do libraries have sufficient infrastructure to create
full-text indexes for millions/billions of files to support full
discovery?
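To make the full-text indexing question concrete, here is a toy inverted index in memory. It is a sketch under stated assumptions: the document IDs and texts are invented, the names `build_index` and `search` are hypothetical, and an index for billions of files would need a dedicated search platform rather than a Python dictionary.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in TOKEN.findall(text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every term in the query."""
    terms = TOKEN.findall(query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented sample documents standing in for OCR or extracted text.
DOCS = {
    "page-1": "Election results for the Senate",
    "page-2": "Senate hearing on the budget",
    "page-3": "Local election news",
}
index = build_index(DOCS)
print(sorted(search(index, "senate election")))
print(sorted(search(index, "election")))
```

The index grows with the vocabulary and the number of files, which is why the storage and processing questions on this slide arise well before any researcher issues a query.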
Do libraries support analysis? Analytical tools that work at
the scale of large datasets are still in their early days.
21. And what are the service models?
If libraries decide that they will simply provide access to
data, do they limit it to the native format or provide pre-
processed or on-the-fly format transformation services for
downloads?
Can libraries handle the download traffic?
Can staff develop the expertise to provide guidance to
researchers in using analytical tools? Or is the expectation
that researchers will fend for themselves?
22. Libraries are increasingly looking towards self-service
– researchers need not ask to download or tell us that
they have. We may never know.
BUT, libraries do have collections that are limited to
on-site only access due to licenses or gift agreements.
In that case, libraries may have to consider providing
high-powered workstations with analytical tools for
researchers to work with these collections and take
analysis outputs away with them.
Both have policy implications and implications for
public service staffing.
24. Libraries are managing and preserving
the datasets and big data necessary for
re-use and replicability.
This is an important new role for libraries
in enabling new research.
And libraries need to make the deposit
and management of such data easier to
accomplish.