This is my portion of a lecture from a much larger discussion with my research collaborators looking at how public housing has been represented in the media. As one of the researchers on this project, I worked with my colleagues to handle much of the data processing and initial visualization work.
2. Dealing with Data Problems
◦ While the Library licenses the content via a content provider, access to the underlying data for aggregated research is only inconsistently supported.
◦ In this case, access to content is limited both by our subscriptions and by the newspaper publishers themselves.
◦ For this project, licensing many of the sources David and Patrick were interested in working with would have required fees of ~$25,000–$50,000 per newspaper.
3. Big “little” data
We worry a lot about big research data in the library and how this information will be preserved and made accessible into the future.
◦ But equally concerning is big “little” data.
Big “little” data has very specific problems:
1. Acquisition of the data can be really difficult
2. Storage tends to be inefficient and difficult
3. It’s incredibly hard to move around
4. For purposes of aggregation, it limits the types of tools that can be used for evaluation
5. When the data is closed, finding undocumented inconsistencies is hard
6. Data processing methodology
Created two data sets (a sketch of this filtering step follows below):
1. The first data set focused on any digital object (excluding classifieds) that included references to public housing
2. The second data set focused on any digital object (excluding classifieds) that included public housing and four agreed-upon synonyms for public housing
One of the benefits of using the resources that we did was that there was very little article duplication across them (i.e., very little reliance on the Associated Press), meaning that little data filtering was needed to account for duplicate articles across newspapers.
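For illustration, here is a minimal C# sketch of what that filtering step could look like. The directory layout, the one-object-per-.txt-file format, and the synonym list are all assumptions of mine rather than the project's actual configuration, and classifieds are assumed to have been excluded upstream.

using System.IO;
using System.Linq;

class DataSetBuilder
{
    // Hypothetical stand-ins for the four agreed-upon synonyms; the
    // project's actual terms are not reproduced here.
    static readonly string[] Synonyms =
    {
        "public housing", "housing project", "housing authority",
        "low-income housing", "subsidized housing"
    };

    static void Main()
    {
        Directory.CreateDirectory("set1");
        Directory.CreateDirectory("set2");

        // Assumed layout: one OCR'ed digital object per .txt file,
        // with classifieds already excluded upstream.
        foreach (var path in Directory.EnumerateFiles("ocr_objects", "*.txt"))
        {
            var text = File.ReadAllText(path).ToLowerInvariant();

            // Data set 1: any object that references public housing.
            if (text.Contains("public housing"))
                File.Copy(path, Path.Combine("set1", Path.GetFileName(path)), overwrite: true);

            // Data set 2: the term itself or any of the synonyms.
            if (Synonyms.Any(s => text.Contains(s)))
                File.Copy(path, Path.Combine("set2", Path.GetFileName(path)), overwrite: true);
        }
    }
}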
7. Data processing methodology
From these sets, I wrote a suite of tools in C# that measured:
1) Presence of positive terms
2) Presence of negative terms
3) Presence of neutral terms
4) Frequency of negative and positive terms
5) Proximity to positive and negative terms, to provide weighting
These tools used stemming so that different forms of a word would still be captured; a rough sketch of the approach follows below.
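As a rough illustration only, here is a minimal C# sketch of the general approach. The term lists, the crude suffix-stripping stemmer, the choice of "housing" as the anchor word, and the 1/(1 + distance) weighting are all illustrative assumptions on my part; the actual suite could use a fuller stemmer (e.g., Porter's) and its own weighting scheme.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class TermScorer
{
    // Illustrative term lists (stemmed forms), not the project's actual lists.
    static readonly HashSet<string> Positive = new() { "improv", "safe", "renew" };
    static readonly HashSet<string> Negative = new() { "crime", "blight", "danger" };

    // Crude iterative suffix-stripper standing in for a real stemmer, so
    // "improved", "improving", and "improvements" all reduce to "improv".
    static string Stem(string w)
    {
        bool changed = true;
        while (changed)
        {
            changed = false;
            foreach (var suf in new[] { "ement", "ing", "ed", "es", "s" })
            {
                if (w.EndsWith(suf) && w.Length > suf.Length + 3)
                {
                    w = w[..^suf.Length];
                    changed = true;
                    break;
                }
            }
        }
        return w;
    }

    static void Main()
    {
        string article =
            "Residents said the improvements made the housing project feel safe, " +
            "though crime remained a danger near the towers.";

        var tokens = Regex.Matches(article.ToLowerInvariant(), "[a-z]+")
                          .Select(m => Stem(m.Value))
                          .ToList();

        // 1-4) Presence and frequency of positive and negative terms.
        int posCount = tokens.Count(t => Positive.Contains(t));
        int negCount = tokens.Count(t => Negative.Contains(t));
        Console.WriteLine($"positive: {posCount}, negative: {negCount}");

        // 5) Proximity weighting: sentiment terms closer to the anchor
        // ("housing" stems to "hous"; the anchor choice is an assumption)
        // contribute more to the score.
        var anchors = Enumerable.Range(0, tokens.Count)
                                .Where(i => tokens[i] == "hous")
                                .ToList();
        double score = 0;
        for (int i = 0; i < tokens.Count; i++)
        {
            int sign = Positive.Contains(tokens[i]) ? 1
                     : Negative.Contains(tokens[i]) ? -1 : 0;
            if (sign == 0 || anchors.Count == 0) continue;
            int dist = anchors.Min(a => Math.Abs(a - i));
            score += sign / (1.0 + dist);   // closer terms weigh more
        }
        Console.WriteLine($"proximity-weighted score: {score:F3}");
    }
}

Run over the sample sentence, this reports two positive and two negative terms and a positive proximity-weighted score, because the positive terms sit closer to "housing" than the negative ones do.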
One thing that this work highlighted, however, was the limitations of the data due to data quality. These resources are OCR’ed representations of a particular newspaper article, classified ad, etc., and OCR data quality varies significantly across the titles. A secondary research project that I’ve begun uses these data sets to test the OCR quality of the set by utilizing word frequency to map unique words across a digital object; a sketch of that check follows below.
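Here is a minimal C# sketch of that check, under my working assumption that genuine words recur across a large corpus while OCR garbage tends to be unique, so the share of tokens in a digital object that appear only once across the entire set becomes a rough proxy for OCR quality. The path and file layout are again illustrative.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class OcrQualityCheck
{
    static IEnumerable<string> Tokenize(string text) =>
        Regex.Matches(text.ToLowerInvariant(), "[a-z]+").Select(m => m.Value);

    static void Main()
    {
        // Assumed layout: one OCR'ed digital object per .txt file.
        var files = Directory.EnumerateFiles("ocr_objects", "*.txt").ToList();

        // Pass 1: word frequencies across the whole set.
        var freq = new Dictionary<string, int>();
        foreach (var f in files)
            foreach (var w in Tokenize(File.ReadAllText(f)))
                freq[w] = freq.TryGetValue(w, out var n) ? n + 1 : 1;

        // Pass 2: per-object share of tokens that occur only once in the
        // entire set; a high share is a rough signal of poor OCR quality.
        foreach (var f in files)
        {
            var words = Tokenize(File.ReadAllText(f)).ToList();
            if (words.Count == 0) continue;
            double uniqueShare = words.Count(w => freq[w] == 1) / (double)words.Count;
            Console.WriteLine($"{Path.GetFileName(f)}: {uniqueShare:P1} corpus-unique tokens");
        }
    }
}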
14. Data processing methodology
Potential additional areas of inquiry:
• Representation of public housing in:
• letters to the editor
• editorials
• featured articles
Editor's Notes
David Staley was the first faculty member outside of OSUL that I met when I first moved to Ohio, so when he and Patrick approached me with this particular problem, I was definitely interested.
I approached the content provider, and they allowed us to grandfather this project into a data pilot.
More researchers want this data, and the current system doesn’t make this process easy. So, to support researcher requests, the content provider has been testing a program where all data is loaded to Amazon, and researchers can then be granted access to these files for processing, for a nominal fee.
Based on the Library’s subscriptions and publisher license data, the content provider was able to make available content from ~1880 to the present for the eight historical African American newspapers.
I let David and Patrick know, and they tweaked their initial project scope with the idea that we could evaluate the data we had and maybe expand to other resources for later comparison.
Big data: astrometric data, physics data, etc.
Big “little” data has a number of problems particular to it
Getting the data can be a real challenge. In our case, the data needed to be downloaded, one object at a time, from the content provider (this took 1 month).
Difficult to move around (a full copy of our data set takes 3 weeks)
There are a lot of great Python libraries for doing text mining and evaluation, and they simply wouldn’t work over this data set.