OSUL and Digital Humanities
Dealing with Data Problems
◦ While the Library licenses the content via
a content provider, access to the
underlying data for aggregated research is
and isn’t supported.
◦ In this case, access to content is limited
through both our subscriptions and
newspaper publishers themselves.
◦ For this project, licensing many of the
sources David and Patrick were interested
in working with required fees of roughly
$25,000 to $50,000 per newspaper.
Big “little” data
We worry a lot about big research data in the library and how this information will be preserved
and made accessible into the future.
◦ But equally concerning is big “little” data.
Big “little” data has very specific problems:
1. Acquisition of the data can be really difficult
2. Storage tends to be inefficient and difficult
3. It’s incredibly hard to move around
4. For purposes of aggregation, it limits the types of tools that can be used for evaluation
5. When the data is closed, finding undocumented inconsistencies is hard
Sample Data Set
Newspaper processing tool
Data processing methodology
Created two data sets:
1. The first data set covered any digital object (excluding classifieds) that included references to public
housing
2. The second data set covered any digital object (excluding classifieds) that matched public housing
or one of 4 agreed-upon synonyms for public housing
One benefit of the resources we used was that there was very little article
duplication across them (i.e., very little reliance on the Associated Press), so
little filtering was needed to account for duplicate articles across newspapers.
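The two selection rules above can be sketched in a few lines of Python. This is only an illustration, not the processing tool used in the project: the record layout (`type`, `text`), the function names, and the sample records are hypothetical, and the synonyms shown are placeholders because the slides do not list the four terms that were agreed upon.

```python
import re

def mentions_any(text, phrases):
    """Return True if any phrase occurs in text as whole words, ignoring case."""
    for phrase in phrases:
        # Allow any whitespace between words so OCR line breaks do not hide a match.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False

def build_data_set(objects, phrases):
    """Keep every non-classified object whose text mentions one of the phrases."""
    return [
        obj for obj in objects
        if obj["type"] != "classified" and mentions_any(obj["text"], phrases)
    ]

# Hypothetical records; the real objects came from the content provider.
objects = [
    {"id": 1, "type": "article", "text": "New public housing opens downtown."},
    {"id": 2, "type": "classified", "text": "Rooms near public housing for rent."},
    {"id": 3, "type": "editorial", "text": "The housing project needs repairs."},
    {"id": 4, "type": "article", "text": "City council debates road repairs."},
]

PRIMARY = ["public housing"]
# Placeholder synonyms (assumption): the four agreed-upon terms are not given in the slides.
EXTENDED = PRIMARY + ["housing project", "slum clearance"]

first_set = build_data_set(objects, PRIMARY)
second_set = build_data_set(objects, EXTENDED)
```

With these sample records, the first set keeps only the article that names public housing, while the second set also picks up the editorial through a placeholder synonym; the classified is excluded from both.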
Data processing methodology
From these sets, I wrote a suite of tools in C# that measured:
1) Presence of positive terms
2) Presence of negative terms
3) Neutral terms
4) Frequency of negative and positive terms
5) Proximity to positive and negative terms, to provide weight
These tools used stemming so that they could capture the different forms of a word.
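A minimal Python sketch of the measures listed above, for illustration only. It is not the C# suite described here: the term lists are hypothetical, `crude_stem` is a simple suffix stripper standing in for a real stemming algorithm, neutral terms are left out for brevity, and the weighting rule (terms within `window` tokens of the target word count more the closer they are) is one assumed reading of "proximity to provide weight".

```python
import re

SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common English suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase the text, split it into alphabetic words, and stem each word."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]

def score(text, positive, negative, target, window=5):
    """Count positive and negative terms and weight them by distance to the target."""
    tokens = tokenize(text)
    pos_stems = {crude_stem(w) for w in positive}
    neg_stems = {crude_stem(w) for w in negative}
    target_stem = crude_stem(target)
    target_positions = [i for i, t in enumerate(tokens) if t == target_stem]
    result = {"positive": 0, "negative": 0, "weighted": 0.0}
    for i, token in enumerate(tokens):
        if token in pos_stems:
            sign, key = 1, "positive"
        elif token in neg_stems:
            sign, key = -1, "negative"
        else:
            continue
        result[key] += 1  # plain frequency count
        if target_positions:
            distance = min(abs(i - p) for p in target_positions)
            if distance <= window:
                # Assumed weighting: closer terms contribute more.
                result["weighted"] += sign / (1 + distance)
    return result

# Hypothetical term lists and text.
sample = "The clean, modern housing opened today. Years later, neighbors feared crowding."
result = score(sample, ["clean", "modern"], ["fear", "crowd"], "housing")
```

Because both the text and the term lists pass through the same stemmer, "feared" and "crowding" match the listed terms "fear" and "crowd". A production tool would more likely use an established stemmer, since a suffix stripper like this one mishandles words such as "improve" and "improved".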
One thing this work highlighted, however, was the limitations imposed by data quality. These
resources are OCR’ed representations of a particular newspaper article, classified, etc., and OCR
quality varies significantly across the titles. A secondary research project that I’ve begun uses these
data sets to test the OCR quality of the set, using word frequency to map unique words across a digital
object.
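One possible reading of that word-frequency test, sketched in Python under stated assumptions: OCR errors tend to produce one-off misspellings and fragments, so an object with an unusually high share of unique words may be poorly recognized. The function name, the metrics, and the sample strings are hypothetical; the slides do not describe how the actual test works.

```python
import re
from collections import Counter

def word_frequency_profile(text):
    """Summarize how many distinct and one-off words a digital object contains."""
    # Letters only, so digits injected by OCR split a word into fragments.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {
        "total_words": total,
        "unique_words": len(counts),
        "unique_ratio": len(counts) / total if total else 0.0,
        "singletons": sorted(w for w, n in counts.items() if n == 1),
    }

# Hypothetical text: the same sentence cleanly recognized and with OCR noise.
clean = word_frequency_profile("public housing plan approved public housing plan funded")
noisy = word_frequency_profile("pubhc housing p1an approved public hous1ng plan funded")
```

In this toy case the noisy version has a higher unique-word ratio than the clean one because each misread word becomes a new one-off token. On real articles the ratio also depends on length and vocabulary, so it would need to be compared across objects of similar size or against a reference word list.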
Just Public Housing: Cleveland
[Chart: Cleveland Call Post, More Positive vs. More Negative, 1930 to 2000]
Just Public Housing: Cleveland
[Chart: Article Content: Positive Over Negative, 1930 to 2000]
Extended Terms: Cleveland
[Chart: Cleveland Call Post, More Positive vs. More Negative, 1930 to 2000]
Extended Terms: Cleveland
[Chart: Article Content: Positive Over Negative, 1930 to 2000]
Public Housing vs Extended Terms
[Two charts: Article Content: Positive Over Negative, 1930 to 2000]
Public Housing vs Extended Terms: NY
[Two charts: Article Content: Positive Over Negative, 1910 to 2000]
Data processing methodology
Potential additional areas of inquiry:
• Representation of public housing in:
• Letters to the editor
• Editorials
• Featured articles

Reframing Public Housing: Visualization and Data Analytics in History


Editor's notes

  1. David Staley was the first faculty member outside of OSUL that I met when I first moved to Ohio, so when he and Patrick approached me with this particular problem, I was definitely interested. I approached the content provider, and they allowed us to grandfather this project into a data pilot. More researchers want this data, and the provider’s current system doesn’t make this process easy. So, to support researcher requests, the content provider has been testing a program where all data is loaded to Amazon, and researchers can then be granted access to these files, for a nominal fee, for processing. Based on the Library’s subscriptions and publisher license data, the content provider was able to make available content from ~1880 to the present for the 8 historical African American newspapers. I let David and Patrick know, and they tweaked their initial project scope, with the idea that we could evaluate the data we had and maybe expand to other resources for later comparison.
  2. Big data: astrometric data, physics data, etc. Big “little” data has a number of problems particular to it. Getting the data can be really challenging: in our case, data needed to be downloaded, one by one, from the content provider (1 month). It is difficult to move around (our data set takes 3 weeks to do a full copy). There are a lot of great Python libraries for doing text mining and evaluation, and they simply wouldn’t work over the data set.