ALA 2010 -- Jeremy York
1. HATHI TRUST
A Shared Digital Repository
Delivering Data for New Generations of Research
Strategies and Challenges
Jeremy York
NISO/BISG Forum
ALA 2010
2. Introduction
• Digital Repository
– Initial focus on digitized book and journal content
– “Light” archive
• Collections and Collaboration
– Comprehensive collection
– Shared strategies
– Local services
– Public Good
4. Language Distribution (1)
The top 10 languages make up ~86% of all content
[Pie chart: share of content by language]
English 48%
German 8%
French 7%
Russian 5%
Chinese 4%
Japanese 4%
Spanish 4%
Italian 3%
Arabic 2%
Polish 1%
Remaining Languages 14%
* As of June 15, 2010
5. Language Distribution (2)
The next 40 languages make up ~13% of total
[Pie chart: share of each language within this group, with slices ranging from 1% to 6%. Labels: Ancient Greek, Armenian, Bengali, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Finnish, Greek, Hebrew, Hindi, Hungarian, Indonesian, Korean, Latin, Malay, Malayalam, Multiple, Norwegian, Panjabi, Persian, Portuguese, Romanian, Sanskrit, Serbian, Slovak, Slovenian, Swedish, Tamil, Thai, Turkish, Ukrainian, Undetermined, Unknown, Urdu, Vietnamese, Yiddish]
* As of June 15, 2010
6. Originating Institution
[Pie chart: share of content by originating institution]
University of Michigan 65%
University of California 25%
Indiana University, Penn State University, University of Minnesota, and University of Wisconsin: the remaining slices of 6%, 3%, 1%, and 0% (slice-to-label pairing not recoverable)
* As of June 15, 2010
7. Content over time
[Chart: share of content by originating institution over time. Y-axis: 0% to 100%. X-axis: months, starting Sep-04. Series: Michigan, Wisconsin, Indiana, California, Penn State, Minnesota.]
* As of June 15, 2010
10. Data Distribution & APIs
• OAI-PMH
• Metadata files
• Bibliographic API
• Data API
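As a rough illustration of the OAI-PMH bullet: a harvest is a loop of `ListRecords` requests that follows each `resumptionToken` until none is returned. The sketch below only builds request URLs and parses a response. The base URL `https://example.org/oai` and the sample XML are placeholders, not HathiTrust's actual endpoint or data; `oai_dc` is used because it is the metadata prefix every OAI-PMH repository must support.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE_URL = "https://example.org/oai"  # hypothetical placeholder endpoint


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, no other argument besides the verb may be sent.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Made-up response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would fetch `list_records_url(BASE_URL)`, pass the body to `parse_list_records`, and repeat with the returned token until it is `None`.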
11. Extended Services
• Community Development Environment
• Non-Google Ingest
• Non-Book/Non-Journal Ingest
• Computational Research
14. SEASR Architecture
[Architecture diagram. Labels: User Interfaces (Visualizations, Web Apps, Plugins, Services, Apps); Developer Tools (Meandre Workbench); Meandre Data-Intensive Flows; Repositories; Components (Data, Data Analysis, Analytics, Visualization); Component Repository; Component Discovery; Flows; Meandre Infrastructure; Virtualization Infrastructure; Cloud Computing.]
15. SEASR @ Work – Tag Cloud
• Count tokens
• Filter options supported
• Stem words
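The three steps on this slide can be sketched in a few lines. This is an illustrative stand-in, not SEASR's component code: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a deliberately naive suffix stripper rather than a real stemmer such as Porter's.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # hypothetical sample list


def crude_stem(word):
    """Strip a few common English suffixes (naive; not the Porter algorithm)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tag_cloud_counts(text, stop_words=STOP_WORDS, stem=True):
    """Tokenize, filter stop words, optionally stem, then count tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    if stem:
        kept = [crude_stem(t) for t in kept]
    return Counter(kept)
```

The resulting counts are what a tag cloud would map to font size, for example via `tag_cloud_counts(text).most_common(50)`.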
16. SEASR @ Work – Entity Mash-up
• Entity Extraction with OpenNLP or Stanford NER
• Locations viewed on Google Map
• Dates viewed on Simile Timeline
17. SEASR @ Work – Entities To Network
• Identify entities
• Define relationships between entities within the same sentence
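One way to read the second bullet: every pair of entities that appears in the same sentence becomes an edge, weighted by the number of sentences the pair shares. The sketch assumes entities have already been identified. The `KNOWN_ENTITIES` set and the punctuation-based sentence splitter are simplifications introduced here for illustration, not part of SEASR.

```python
import re
from collections import Counter
from itertools import combinations

KNOWN_ENTITIES = {"Michigan", "California", "Google", "Indiana"}  # hypothetical entity list


def cooccurrence_edges(text, entities=KNOWN_ENTITIES):
    """Count sentence-level co-occurrences between known entities.

    Returns a Counter mapping an alphabetically ordered (entity, entity)
    pair to the number of sentences in which both appear.
    """
    edges = Counter()
    for sentence in re.split(r"[.!?]+", text):
        words = set(re.findall(r"\w+", sentence))
        present = sorted(words & entities)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges
```

The edge counts can be handed to any graph library or visualization as a weighted edge list.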
18. SEASR @ Work – Text Clustering
• Clustering of Text by token counts
• Filtering options for stop words, Part of Speech
• Dendrogram Visualization
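A minimal sketch of clustering texts by token counts, under assumptions the slide does not state: cosine distance between count vectors and single-linkage merging are choices made here for illustration, and stop-word and part-of-speech filtering are omitted. The merge history it returns is the information a dendrogram draws.

```python
import math
import re
from collections import Counter


def token_counts(text):
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity of two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def agglomerate(docs):
    """Single-linkage agglomerative clustering over documents.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of document indices.
    """
    vectors = [token_counts(d) for d in docs]
    clusters = [(i,) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(
                    cosine_distance(vectors[x], vectors[y])
                    for x in clusters[i]
                    for y in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```

Each tuple in the result corresponds to one join in the dendrogram, with the distance giving the height of the join.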
19. SEASR @ Work – Audio Analysis
• NEMA: Executes a SEASR flow for each run
– Loads audio data
– Extracts features for every 10 sec moving window of audio
– Loads and applies the models
– Sends results back to the WebUI
• NESTER: Annotation of Audio via Spectral Analysis
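The feature-extraction step can be pictured as a window sliding over the audio samples. The slide gives only the 10-second window length. The 5-second hop and the choice of RMS energy as the feature are assumptions made here for illustration, not details of NEMA.

```python
import math


def moving_windows(samples, sample_rate, window_sec=10.0, hop_sec=5.0):
    """Yield (start_time_sec, window_samples) for each full window.

    hop_sec is an assumed value; the source states only the window length.
    """
    size = int(window_sec * sample_rate)
    hop = int(hop_sec * sample_rate)
    for start in range(0, len(samples) - size + 1, hop):
        yield start / sample_rate, samples[start:start + size]


def rms(window):
    """Root-mean-square energy, a stand-in for a real audio feature."""
    return math.sqrt(sum(x * x for x in window) / len(window))


def extract_features(samples, sample_rate, window_sec=10.0, hop_sec=5.0):
    """Return one (start_time_sec, feature) pair per window."""
    return [
        (t, rms(w))
        for t, w in moving_windows(samples, sample_rate, window_sec, hop_sec)
    ]
```

In a full flow, each feature vector would then be passed to the loaded models and the predictions returned to the web UI.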
20. SEASR @ Work – Zotero
• Plugin to Firefox
• Zotero manages the collection
• Launch SEASR Analytics
– Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
– Zotero Export to Fedora through SEASR
– Saves results from SEASR Analytics to a Collection
• Launch MONK Processing
– MONK DB Ingestion Workflow
21. SEASR @ Work – Emotion Tracking
The goal is a visualization of this type that tracks emotions across a text document (leveraging flare.prefuse.org)