This is my portion of a lecture from a much larger discussion with my research collaborators looking at how public housing has been represented in the media. As one of the researchers on this project, I worked with my colleagues to handle much of the data processing and initial visualization work.
2. Dealing with Data Problems
◦ While the Library licenses the content via a content provider, access to the underlying data for aggregated research is only inconsistently supported.
◦ In this case, access to content is limited both by our subscriptions and by the newspaper publishers themselves.
◦ For this project, licensing many of the sources David and Patrick were interested in working with would have required fees of ~$25,000–$50,000 per newspaper.
3. Big “little” data
We worry a lot about big research data in the library and how this information will be preserved and made accessible into the future.
◦ But equally concerning is big “little” data.
Big “little” data has very specific problems:
1. Acquisition of the data can be really difficult
2. Storage tends to be inefficient and difficult
3. It’s incredibly hard to move around
4. For purposes of aggregation, it limits the types of tools that can be used for evaluation
5. When the data is closed, finding undocumented inconsistencies is hard
6. Data processing methodology
Created two data sets (a sketch of this filtering step follows below):
1. The first data set focused on any digital object (excluding classifieds) that included references to public housing
2. The second data set focused on any digital object (excluding classifieds) that included public housing and four agreed-upon synonyms for public housing
One of the benefits of using the resources that we did was that there was very little article duplication across them (i.e., very little reliance on the Associated Press), meaning that little data filtering was needed to account for duplicate articles across newspapers.
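For illustration, here is a minimal C# sketch of what that filtering step could look like. The directory layout, the one-object-per-.txt-file format, and the synonym list are all assumptions of mine rather than the project's actual configuration, and classifieds are assumed to have been excluded upstream.

using System.IO;
using System.Linq;

class DataSetBuilder
{
    // Hypothetical stand-ins for the four agreed-upon synonyms; the
    // project's actual terms are not reproduced here.
    static readonly string[] Synonyms =
    {
        "public housing", "housing project", "housing authority",
        "low-income housing", "subsidized housing"
    };

    static void Main()
    {
        Directory.CreateDirectory("set1");
        Directory.CreateDirectory("set2");

        // Assumed layout: one OCR'ed digital object per .txt file,
        // with classifieds already excluded upstream.
        foreach (var path in Directory.EnumerateFiles("ocr_objects", "*.txt"))
        {
            var text = File.ReadAllText(path).ToLowerInvariant();

            // Data set 1: any object that references public housing.
            if (text.Contains("public housing"))
                File.Copy(path, Path.Combine("set1", Path.GetFileName(path)), overwrite: true);

            // Data set 2: the term itself or any of the synonyms.
            if (Synonyms.Any(s => text.Contains(s)))
                File.Copy(path, Path.Combine("set2", Path.GetFileName(path)), overwrite: true);
        }
    }
}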
7. Data processing methodology
From these sets, I wrote a suite of tools in C# that measured:
1) Presence of positive terms
2) Presence of negative terms
3) Presence of neutral terms
4) Frequency of negative and positive terms
5) Proximity to positive and negative terms, to provide weighting
These tools used stemming so that different forms of a word would still be captured; a rough sketch of the approach follows below.
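As a rough illustration only, here is a minimal C# sketch of the general approach. The term lists, the crude suffix-stripping stemmer, the choice of "housing" as the anchor word, and the 1/(1 + distance) weighting are all illustrative assumptions on my part; the actual suite could use a fuller stemmer (e.g., Porter's) and its own weighting scheme.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class TermScorer
{
    // Illustrative term lists (stemmed forms), not the project's actual lists.
    static readonly HashSet<string> Positive = new() { "improv", "safe", "renew" };
    static readonly HashSet<string> Negative = new() { "crime", "blight", "danger" };

    // Crude iterative suffix-stripper standing in for a real stemmer, so
    // "improved", "improving", and "improvements" all reduce to "improv".
    static string Stem(string w)
    {
        bool changed = true;
        while (changed)
        {
            changed = false;
            foreach (var suf in new[] { "ement", "ing", "ed", "es", "s" })
            {
                if (w.EndsWith(suf) && w.Length > suf.Length + 3)
                {
                    w = w[..^suf.Length];
                    changed = true;
                    break;
                }
            }
        }
        return w;
    }

    static void Main()
    {
        string article =
            "Residents said the improvements made the housing project feel safe, " +
            "though crime remained a danger near the towers.";

        var tokens = Regex.Matches(article.ToLowerInvariant(), "[a-z]+")
                          .Select(m => Stem(m.Value))
                          .ToList();

        // 1-4) Presence and frequency of positive and negative terms.
        int posCount = tokens.Count(t => Positive.Contains(t));
        int negCount = tokens.Count(t => Negative.Contains(t));
        Console.WriteLine($"positive: {posCount}, negative: {negCount}");

        // 5) Proximity weighting: sentiment terms closer to the anchor
        // ("housing" stems to "hous"; the anchor choice is an assumption)
        // contribute more to the score.
        var anchors = Enumerable.Range(0, tokens.Count)
                                .Where(i => tokens[i] == "hous")
                                .ToList();
        double score = 0;
        for (int i = 0; i < tokens.Count; i++)
        {
            int sign = Positive.Contains(tokens[i]) ? 1
                     : Negative.Contains(tokens[i]) ? -1 : 0;
            if (sign == 0 || anchors.Count == 0) continue;
            int dist = anchors.Min(a => Math.Abs(a - i));
            score += sign / (1.0 + dist);   // closer terms weigh more
        }
        Console.WriteLine($"proximity-weighted score: {score:F3}");
    }
}

Run over the sample sentence, this reports two positive and two negative terms and a positive proximity-weighted score, because the positive terms sit closer to "housing" than the negative ones do.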
One thing that this work highlighted, however, was the limitations of the data due to data quality. These resources are OCR’ed representations of a particular newspaper article, classified ad, etc., and OCR data quality varies significantly across the titles. A secondary research project that I’ve begun uses these data sets to test the OCR quality of the set by utilizing word frequency to map unique words across a digital object; a sketch of that check follows below.
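Here is a minimal C# sketch of that check, under my working assumption that genuine words recur across a large corpus while OCR garbage tends to be unique, so the share of tokens in a digital object that appear only once across the entire set becomes a rough proxy for OCR quality. The path and file layout are again illustrative.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class OcrQualityCheck
{
    static IEnumerable<string> Tokenize(string text) =>
        Regex.Matches(text.ToLowerInvariant(), "[a-z]+").Select(m => m.Value);

    static void Main()
    {
        // Assumed layout: one OCR'ed digital object per .txt file.
        var files = Directory.EnumerateFiles("ocr_objects", "*.txt").ToList();

        // Pass 1: word frequencies across the whole set.
        var freq = new Dictionary<string, int>();
        foreach (var f in files)
            foreach (var w in Tokenize(File.ReadAllText(f)))
                freq[w] = freq.TryGetValue(w, out var n) ? n + 1 : 1;

        // Pass 2: per-object share of tokens that occur only once in the
        // entire set; a high share is a rough signal of poor OCR quality.
        foreach (var f in files)
        {
            var words = Tokenize(File.ReadAllText(f)).ToList();
            if (words.Count == 0) continue;
            double uniqueShare = words.Count(w => freq[w] == 1) / (double)words.Count;
            Console.WriteLine($"{Path.GetFileName(f)}: {uniqueShare:P1} corpus-unique tokens");
        }
    }
}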
14. Data processing methodology
Potential additional areas of inquiry:
• Representation of public housing in:
• letters to the editor
• editorials
• featured articles
Editor's Notes
David Staley was the first faculty member outside of OSUL that I met when I first moved to Ohio, so when he and Patrick approached me with this particular problem, I was definitely interested.
I approached the content provider, and they allowed us to grandfather this project into a data pilot.
More researchers want this data, and the current system doesn’t make this process easy. So, to support researcher requests, the content provider has been testing a program where all data is loaded to Amazon, and researchers can then be granted access to these files for processing, for a nominal fee.
Based on the Library’s subscriptions and publisher license data, the content provider was able to make available content from ~1880 to the present for the eight historical African American newspapers.
I let David and Patrick know, and they tweaked their initial project scope with the idea that we could evaluate the data we had and maybe expand to other resources for later comparison.
Big data: astrometric data, physics data, etc.
Big “little” data has a number of problems particular to it
Getting the data can be a real challenge. In our case, the data needed to be downloaded, one object at a time, from the content provider (this took 1 month).
Difficult to move around (a full copy of our data set takes 3 weeks)
There are a lot of great Python libraries for doing text mining and evaluation, and they simply wouldn’t work over this data set.