2. June 10, 2012
• Teresa A. Sullivan announces resignation…
…Gretchen M. Gueguen, Digital Archivist at UVa, prepares to attend
Rare Book School the next day
3. June 11-14, 2012
• Reactions grow increasingly vocal around
Grounds as both town &
gown become suspicious of
the motives and actions of the
Board of Visitors…
…Meanwhile, Gretchen Gueguen turns OFF her computer in order
to fully pay attention to her RARE BOOK SCHOOL class about the
NINETEENTH CENTURY BOOK TRADE
4. June 18, 2012
• Decision is made to form a cross-
departmental group within the library to
discuss saving the historic record related
to these events
5. • University Archives: already decided to attend the first
rally in order to collect signs, have been in touch with
faculty senate
… and the Digital Archivist is increasingly interested in
nineteenth century illustration techniques!
• Digital Curation Services: convening meetings to discuss
coordinating activities and outreach
• Scholar’s Lab: creates Omeka-based digital collection
where anyone can contribute material related to the
events
• University Records Manager: consults with group about
her activities/buried in FOIA requests
8. What’s the Big Deal?
• Digital is THE publishing platform
• “Twitter Revolution”
• Event was important both for the historic
nature of the events (message) and for
HOW it was communicated (medium)
10. Twitter API
• Allows you to download tweets as data
for a given hashtag, user, or keyword
search (#woo-hoo!)
• Has many tools available for doing all
kinds of neat stuff (#woo-hoo!)
• Limits you to just the last 1500 tweets for
any given search (#d’oh!)
11. Goals for Twitter Archive
1. Find a good tool for finding and saving
tweets
2. Figure out how to get the oldest possible
tweets related to these events
3. Figure out what people are posting and
linking to via Twitter
12. 1. Find a good tool…
• The Archivist
• RSS reader
• Google docs script
14. The Archivist
• Benefits:
– XML and tab-delimited
– Searches both tweet and username
• Drawbacks:
– Limited data captured (tweet and user, but
not geotag, in-reply-to, etc.)
– Would only capture when opened
16. RSS Feed
• Benefit
– Passive collection of tweets
• Drawback
– Unsure if data can be exported and in what
format
– Some readers don’t collect if they aren’t open,
leading to missing tweets
18. TAGS Google Spreadsheet
• Benefits
– Passive collection at all times
– Spreadsheet format that can be exported
– Most complete data retrieval of all tools
• Drawbacks
– Can crash if it gets too full
– Given volume during the height of the crisis
had to export and delete from live version
once per day
20. 3. Posted content
• Links, pictures, video related to the story
• Could not find a tool to just extract these
to look through later
• Many shortened links that had to be
clicked on to find out what they held
• Many links were retweeted
28. Blogs and other web content
• How to capture everything else
• Tools for web capture
– Difficult to implement
– Don’t do exactly what is needed
– I’m running out of time!
• Solution:
– I have to look at it anyway to select, so
• Firefox “Save As”
• Screengrab plugin for screenshot
39. Facebook & “Privacy”
• Rallies on Grounds were organized
through Facebook “groups.”
• Some posts are visible
only to members of the group.
All others are only visible to
those with a Facebook account.
42. Facebook & Privacy
• Facebook accounts are free
• But my Facebook account
was my personal account
45. News
• Relatively easy to capture
• Overwhelming in volume
• Why capture the online version?
– Some things only appear online, some only in
print
– Online version, for many sources, allows
commenting
• Why capture this when it will be saved
elsewhere?
– Reference collection
– Databases may capture content but not
commentary
48. User Contributions
• Capture what the
public thought was
important
• Possible violations
of privacy or
intellectual property
51. Final Tally
• Tweets: 80,000
• News articles: 572
• Blog posts: 147
• Other web content: 196
• Twitter pictures: 243
• Video: 69
• Documents: 21
• User-Contributed Items: 118
52. What’s Next?
• Filling in some collection gaps
• Collection finding aid
• Working with OLE and
others to try and get into
repository and ensure
digital preservation
• Likely, reading room only
access for much of the
content
Editor’s notes
I’d like to take you back a couple of months now, to a more carefree, innocent time… It was Sunday, the sun was shining, UVa was just getting into the swing of summer courses, when on June 10, Teresa A. Sullivan, the President of the University, suddenly and unexpectedly announced her resignation… … meanwhile, in a less closely observed part of town, Gretchen M. Gueguen, the Digital Archivist at UVa, prepared to attend Rare Book School for the coming week
During the week that followed, reactions to the resignation grew increasingly vocal and the community began to question the actions of the Board of Visitors in the resignation… … meanwhile, I, Gretchen Gueguen, turned OFF my computer for the week in order to fully pay attention to my RARE BOOK SCHOOL class on the NINETEENTH CENTURY BOOK TRADE… needless to say, I was not the most informed person on campus about what was going on
By June 18th, activities around the library began to coalesce. Bradley Daigle, with the approval of both Karin and Martha, reached out to several people across the libraries to discuss how to work together to save the historic record related to the events.
And there was a lot of activity across all units of the library. Special Collections, which includes the University Archives, had already decided to attend the rally on the 18th to collect signs and other materials and had been in touch with the Faculty Senate. … meanwhile, I was back at work with a newfound interest in nineteenth-century illustration techniques! Digital Curation Services was convening meetings to discuss coordination and outreach. The Scholar’s Lab worked on setting up an online collection for the public to contribute their materials related to the protests. The University Records Manager, who doesn’t work in the library but instead in the President’s Office, was also busily working to fulfill Freedom of Information Act email requests, but she consulted with us about what she would eventually be able to send to the archives in the future.
So it was that around 9 a.m. on June 19th, I finally thought… wait a minute, did the president of the university get fired or something? I’m over-emphasizing here to be funny, but the events really did catch me a bit by surprise because I had been out of the office. In addition, we had never discussed whether or not my role here would include any capture of current social media or web-based information. It was something that I knew would have to be addressed, but it hadn’t been yet.
So I’m going to spend the rest of this morning talking about the work that I did over the following weeks to try and capture some of the online content created by the university community in response to the crisis. (This is my desktop during the reinstatement BOV meeting.) I want to state that I was by no means the only person here working on things related to these events. I was not in charge of anything and I don’t speak for anyone’s actions but my own. But I do think that the issues I had to figure out are really emblematic of the kinds of work that libraries, and particularly those that gather archives of unique materials, are going to have to face. Hopefully they won’t face them in the midst of a crisis, but they will have to face them eventually.
And they will have to face them because the internet is no longer “ephemera.” It is a publishing platform, it is a space for interaction and creation, it is a storage medium; it is all these things. The public events here at UVa, the emergency Faculty Senate meeting, the protests, even the content gathered for articles in the newspapers were all based in one way or another on web-based media. The catchy term for these kinds of things is now “Twitter Revolution,” a term coined in the wake of protests in Iran in 2009 that were largely organized through the use of social media tools like Twitter and Facebook. Because of the use of these emerging technologies, I felt that capturing material related to campus events would be important not just because they documented undoubtedly historic events, but also because of the novel use of these technologies to communicate. In some sense, the medium was at least partly the message here.
So, on the 19th I set to work trying to document these various online sources. I’m going to spend the rest of my time this morning talking with you about attempts to harvest content from these sources: Twitter, blogs and other web objects, Facebook, news sources, and video.
Twitter was the first and most important source to figure out. It was also really difficult. Twitter has an application programming interface (API), which is basically a set of open protocols that allow people to build tools that work with Twitter’s data. This means that third parties can build tools that allow you to download tweets in different data forms like XML or spreadsheets. The API limits you to no more than 1,500 tweets at a time. 1,500 tweets is a lot, but when a topic is really popular, 1,500 tweets can go by in no time at all. So time was of the essence…
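That 1,500-tweet ceiling shaped the whole harvesting workflow. As a rough illustration of the constraint only (not any particular tool's code), the sketch below shows how a harvester paging through search results has to cut off; `fetch_page` is a hypothetical stand-in for a real API call.

```python
# Sketch of paging through a tweet search under a hard result cap,
# as the 2012-era Twitter search API imposed (~1,500 tweets per search).
# fetch_page is a hypothetical stand-in for a real API call: it returns
# a list of tweets (newest first) for a given page number, or [] when
# the API refuses to go further back.

API_CAP = 1500        # oldest reachable tweet, per search
PAGE_SIZE = 100       # tweets returned per request

def harvest(fetch_page):
    """Collect as many tweets as the cap allows, newest first."""
    tweets = []
    page = 1
    while len(tweets) < API_CAP:
        batch = fetch_page(page)
        if not batch:          # API ran out of history
            break
        tweets.extend(batch[:API_CAP - len(tweets)])
        page += 1
    return tweets
```

At, say, 3,000 new tweets per hour on a busy hashtag, the cap covers only half an hour of history, which is why captures had to happen roughly hourly at the height of the crisis.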
My goals then for creating a twitter archive were first to find a good tool for finding and saving tweets. Secondly, I needed to figure out how to get the oldest tweets related to the subject that I could. Third, I wanted to use the tweets to figure out what people were talking about, posting links to, and organizing.
It took some experimentation to find a tool that really worked well… The first I tried was called “The Archivist”
It does a search, just as you would on Twitter. This can be for a hashtag, keyword, or user profile (which will get both that user’s tweets and those that reference them). It can save the output as XML or as tab-delimited text that can be imported into Excel.
So the benefits of using this tool were that it created both XML and tab-delimited output, and that it searched both the tweet content itself and the Twitter handle (so if you searched for “Rector Dragas” you would get both tweets that mentioned her and her account, which was a parody account, by the way). The drawbacks, however, were that the data captured was somewhat limited: the tweet, the user, and the time were captured, but not some of the other data that Twitter makes available. The other drawback was that The Archivist wasn’t set up to run simultaneous searches and had to be actually open and running to capture anything. This meant that sometimes, once an hour if it was a busy day, I had to open the tool and do a dozen or so searches to get what had been posted in the last hour. This was obviously time-consuming, and I did want to occasionally eat, sleep, or leave my desk. If it was really busy and I waited too long between captures, I would exceed the API’s limit of 1,500 tweets and be unable to capture some of them. I continued to use The Archivist, but also continued searching for another option.
The next one I considered was using an RSS reader. The Twitter API can also expose tweets via RSS (Really Simple Syndication), an XML schema.
The nice thing about RSS was that I could just set up the feeds and let them collect without having to go in and search every time I wanted to capture them. However, there wasn’t a very good way to export the data, and the RSS schema was even more limited than the XML created by The Archivist. In addition, the readers weren’t collecting if they weren’t open, so I was still missing tweets.
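Because RSS output is just XML, extracting what little it does carry (tweet text and timestamp) is straightforward with a standard library parser. A minimal sketch; the sample feed below is invented for illustration, standing in for what a 2012-era search feed for a tag like #UVA looked like.

```python
import xml.etree.ElementTree as ET

# Invented two-item RSS 2.0 feed, for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Twitter / Search - #UVA</title>
    <item>
      <title>Rally on the Lawn at noon #UVA</title>
      <pubDate>Mon, 18 Jun 2012 14:02:00 +0000</pubDate>
    </item>
    <item>
      <title>Emergency Faculty Senate meeting today #UVA #BOV</title>
      <pubDate>Mon, 18 Jun 2012 13:40:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

def items(feed_xml):
    """Return (title, pubDate) pairs for each item in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [(i.findtext("title"), i.findtext("pubDate"))
            for i in root.iter("item")]
```

Note how little survives the round trip: no username field, no reply chains, no geotags — exactly the "even more limited" schema problem described above.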
The last tool I tried was a script created for Google Spreadsheet, recommended to me by Eric Johnson. You just open this customized Google spreadsheet, tweak a couple of lines in the associated script, and let it do its thing.
This ended up being the best option. It would capture tweets even when I didn’t have the spreadsheet open. It was saved in a spreadsheet, so it was exportable data, and it captured the most complete data of all the tools. It would crash if it became too full, which it did about once a day early on. But I quickly figured out that I could export the data that had actually been saved, then delete it from the live version and start again. So now, instead of having to go through a series of procedures every hour, I only had to do it once a day, if that. I did still continue to use The Archivist as a backup; The Archivist was actually also better at searching individual accounts. But once the initial uproar died down, I didn’t need to back up using The Archivist as frequently. And I obviously felt much better having a backup of the data. Overall, I estimate that we probably collected around 80,000 unique tweets. I have no analysis right now, though, of how much of that was retweeted content or irrelevant.
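That daily export-and-delete routine is essentially rotating a live buffer into dated archive files. A schematic version of the workflow (the file naming and data shapes are my own invention, not anything TAGS itself does):

```python
import csv
import datetime
import io

def rotate(live_rows, archives, today=None):
    """Move everything in the live spreadsheet into a dated CSV archive,
    then empty the live copy so collection can continue.

    live_rows: list of tweet rows currently in the live spreadsheet
    archives:  dict mapping date string -> CSV text already exported
    """
    if today is None:
        today = datetime.date.today().isoformat()
    buf = io.StringIO()
    csv.writer(buf).writerows(live_rows)
    # Append, in case we already exported once today.
    archives[today] = archives.get(today, "") + buf.getvalue()
    live_rows.clear()          # the "delete from live version" step
    return archives
```

Running this once a day keeps the live sheet small enough not to crash while nothing is ever discarded, which is the whole point of the routine described above.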
Once the tools were working well, the next issue to figure out was how to get older content. A search for the most prominent hashtags like #UVA or #BOV exceeded 1,500 tweets every day or so. By monitoring the feeds, though, I identified some key people whose accounts would provide a view further back; in addition, some hashtags that weren’t as heavily used didn’t exceed the 1,500 number, so things like #whatthebovisdoing filled in some gaps from the early hours of the first rally on Grounds. Siva Vaidhyanathan, Larry Sabato, the Cavalier Daily, and WUVAonline also provided a good sample of Twitter activity for the week of the resignation, prior to the rallies. A side note is that it was a challenge to try and capture some of the parody accounts before they were shut down, but I did get a few of them, and they of course provide a lot of interesting color to the data.
Finally, I realized that the Twitter content could be a great lead-in to other web-based content related to the events. These were the websites, videos, pictures, and articles that people were really talking about and which formed their conception of events. The issue was that I didn’t readily have on hand a tool that would extract these links for me for review. I have used a tool like that in the past, but it was open source and is no longer available because it was bought by a third party who discontinued that service. I didn’t have a lot of time to exhaustively look for tools that performed this activity, so I spent a lot of time clicking on links. The two main issues with this were that many people shorten their URLs, so there was no way to tell what they led to without clicking on them, and that many people retweeted links, so I saw the same sources again and again when I clicked on them.
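Even without a full extraction tool, the retweet-duplication half of the problem can be handled mechanically: pull anything URL-shaped out of the tweet text and keep each short link only once, however many times it was retweeted. A sketch (the regex is deliberately simple, and the example tweets are invented):

```python
import re

# Anything that looks like a link; good enough for t.co/bit.ly shorteners.
URL_RE = re.compile(r"https?://\S+")

def unique_links(tweets):
    """Return each distinct link once, in order of first appearance."""
    seen, ordered = set(), []
    for text in tweets:
        for url in URL_RE.findall(text):
            url = url.rstrip(".,)!")   # strip trailing punctuation
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered
```

The other half of the problem — learning what each shortened link actually points to — still requires a request per link (following redirects), which is why that part stayed a manual, click-through job.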
This is an example (this is just from my current Twitter feed, by the way) that shows what it looks like when someone posts a link. The one item in the square there is a link to a picture. Twitter allows you to post pictures natively, so instead of posting them somewhere like Instagram or Facebook and linking to them, they really only exist within your Twitter account. The only way to ensure these weren’t lost was to try and grab them when I saw them.
This is an example of a Twitter picture. As you can see, the picture just shows up within your Twitter wallpaper, so that adds somewhat to the context of how these were presented.
Here are just some others. Siva’s page advertises his book.
Some of these pictures were really great and showed a view of the events that didn’t really make it into official sources
And it captured some things, like the slogans on Beta Bridge, that didn’t really last very long
Other links were really important for the context of how they were introduced. This is one of my favorites. Without the introductory statement “What would have happened had President Sullivan not been reinstated at UVA: everyone fired, replaced with internet cats,” the following website doesn’t make as much sense.
So, I began collecting links to sources, but they needed to be captured very soon in order to ensure that they were grabbed before they possibly disappeared. I knew that tools existed to set up web crawls, and that these tools were the basis of efforts at other institutions to capture their online presence. However, these tools are somewhat difficult to implement, requiring some sophisticated configuration, and the output they produce is somewhat limited as well. The type of capture that I wanted to do (one post on a blog, for example, not the entire thing) was also not exactly the same as web crawling. In the end, I realized that since I was looking at each of these sources anyway to decide if they should be included, I could just use the “Save As” command in Firefox to save them as an HTML document with a folder of associated content. In addition, I used a Firefox plugin called Screengrab to make a screenshot of the entire page. After Twitter and Facebook (which I’ll discuss in a minute), this content is some of the best that was captured. It was completely unmediated and therefore was sometimes interesting, sometimes completely uninformed and biased, occasionally hilarious, and really captures the essence of reactions to the situation.
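One practical detail with hundreds of "Save As" captures is keeping the files identifiable later. A small helper like this (my own convention, not part of any tool mentioned here) derives a dated, filesystem-safe name from the source URL:

```python
import re
from urllib.parse import urlparse

def capture_name(url, date):
    """Build a filename like '2012-06-20_example-com_some-post.html'
    from a captured page's URL and the capture date."""
    parts = urlparse(url)
    host = parts.netloc.replace("www.", "").replace(".", "-")
    # Reduce the path to lowercase alphanumerics joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", parts.path.lower()).strip("-") or "index"
    return "{}_{}_{}.html".format(date, host, slug)
```

Sorting a directory of such names groups captures by date and then by source, which matters once the collection runs to hundreds of pages.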
A lot of the approach to the events was ironic and humorous, and it encouraged a sense of ownership and interaction. So people didn’t just post pictures of Sullivan and Dragas hugging; they included fake word bubbles and encouraged creativity in imagining what they were saying (some of these were funny).
This is in reference to one of my all-time favorite films, Rushmore, in which the protagonist, a high school student named Max Fischer, petitions to save his high school’s Latin curriculum to impress the classics-loving kindergarten teacher he is trying to woo. He tells his rival, “I saved Latin. What did you ever do?”
On the other hand, there were some really serious and well-researched analyses of events on blogs and other unvetted sources.
Some, like this particular blog post by a UVA alum really galvanized people to discuss what was going on (and share conspiracy theories) in the comments
Other people used tools to document some of the bigger themes: this is a Zotero list of all the articles and web pages mentioned by Dragas and Kington in their emails to each other. It highlights the roots of the thinking of the Rector and Vice Rector, and these articles went on to be analyzed in the news media repeatedly.
Other things were interesting in how they took advantage of the media
How they tried to take back the narrative of events
How they visualized and synthesized the events
And how they tried to use social media as a tool to effect change in a grass roots way (this is a petition on Change.org)
Overall, they add a really human element and give the face of the everyday public embroiled in the event. This kind of narrative of the average person is something that archivists and historians really prize. The public statements of great figures tend to be valued and kept, while those of the common person can slip through the cracks. It also highlights how much of the message of these events was about personalizing them (“I AM UVA”) and how much people self-identify with the university.
A subset of this kind of content with some particularly unique characteristics is Facebook. The main rallies that took place on Grounds were largely organized through Facebook “groups”… anyone on Facebook can start a group, and it’s just another wall where the members can post content as a way of discussing it with each other. Group administration allows the administrators and members to make some of their content visible to their members only. Collecting this would seem to be a violation of privacy, since membership had to be requested and granted by an administrator. However, none of the content at all was visible to someone without a Facebook account.
So this is what the group for Students, Families, & Friends United to Reinstate Teresa Sullivan looks like if you aren’t logged in to Facebook.
But if I just sign in as a Facebook user, though not as a member of this group, I can see all of these posts as well as who is a member.
So there are a lot of questions here. Facebook accounts are free and anyone can get one, so the default of having groups invisible to the public but visible to Facebook users seems somewhat contradictory (and a way for Facebook to get more people to join). On the other hand, I needed to log in as myself in order to see these at all, since we didn’t have a departmental account, and even if we did, I doubt we would have set it up for this reason (although we may do so in the future). This is kind of embarrassing for me as, at the time, my Facebook profile picture was me and my dad in 1979 after I got a bath… (removing this from these pages is one of the next steps I want to work on…) In retrospect, setting up a Facebook account for the department in order to capture pages could be a better solution. The profile could be public or private depending on what else we wanted to do with it.
Question: if this IS private, then is it ethical to see who RSVP’d to this event?
This group evolved over time, and I tried to capture it as it changed its look and message. By this point things were far more organized and focused on planning events rather than just joining together to share outrage. I also changed my facebook profile picture.
The other big source of online content was from news sources: papers, radio, and TV. In general this content isn’t much different from the other web content, but it did get to be really overwhelming in volume. The nice thing about it, though, was that since these were more established sources with significant resources, I was less worried about the content disappearing, so much of the material I’ve collected from these sources has been gathered after the fact. The question of why to capture this content is an intriguing one. We have also collected the paper versions of the Daily Progress and some of the local weeklies, and a lot of the content is redundant between those two sources. In addition, the content of many of the papers is aggregated into databases like LexisNexis. Some things do appear only in one source or the other, so gathering both web and paper for things that are not preserved elsewhere makes sense. The paper also captured a lot of intangible factors. For example, seeing the huge bold REINSTATED headline on the top of the Washington Post there (this is a scan of the front page grabbed from the Newseum) carried a different message than the online version. The other question, of why save it locally when it is saved elsewhere, has two answers. First, we are creating an easier access point for researchers; in this case the capture really has more to do with access than preservation. Second, the databases only grab content, not comments. The commentary is really interesting, and this is one of those places where the medium is really shaping the message. That sort of content would not have typically been captured in the past unless someone wrote a letter to the editor and it was published. Even then it would only be one side of the story, not the ongoing dialogue that is found in some articles (and, to be honest, a lot of trolling, spam, and other nonsense as well).
We decided that capturing this was most important for the local papers, and so have tried to be exhaustive with those. We are also trying to capture them at least twice, in case details are updated over time.
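Capturing an article twice is only useful if the two snapshots can be compared afterwards, and the standard library's difflib does this directly. A sketch, with invented snapshot text:

```python
import difflib

# Two invented captures of the same article, taken a day apart.
first  = ["UVA President resigns.", "Comments: 12"]
second = ["UVA President resigns.", "Correction appended.", "Comments: 87"]

# ndiff marks added lines with '+', removed lines with '-';
# unchanged lines carry two leading spaces, hint lines start with '?'.
changes = [line for line in difflib.ndiff(first, second)
           if line.startswith(("+", "-"))]
```

Filtering to the `+`/`-` lines gives exactly what changed between captures, so only articles whose diff is non-empty need a second look.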
The only issue we’ve found is with The Daily Progress, which requires an online subscription to access its “premium” content. Even when the article is downloaded during a free trial, the script which triggers this authentication is still saved, and so it pops up and requires authentication before you can read the content in the HTML view. The HTML content is still saved, however, and can be read in the code directly. We are still working on a solution to this particular issue.
A number of online sources involved some type of audio or video, and these presented some particular difficulties for capture. The most prevalent one was YouTube, which seems pretty obvious. While these are publicly posted and there isn’t any privacy violation in capturing them, there is not an easy way to download them. YouTube’s basic license states that the owner is placing them on YouTube for access, and basically says that they are not there to be downloaded. Users can opt to use a Creative Commons license instead, which doesn’t have this restriction, but then the issue of how to download is still present. I found another Firefox plugin called “Download Helper” that enabled me to download the ones we felt we could. News sources also do not encourage downloading, so we kept a list of these videos and can therefore ask for them at a later time. Audio podcasts are easier to capture, and WINA in particular did a lot of interesting interviews. However, they limit the number that are available online, and so I did not capture them all. I am discussing the matter with them because they are interested in participating. Several events were actually streamed online, which, by default, means there is no download. However, Public Affairs is really interested in handing off a lot of their archives of some of these events, such as the Board meetings, and we should get those in the future.
Finally, although I didn’t mention this at the top, I want to also note the creation of the online contribution site created by the Scholar’s Lab. Everyone involved agreed that this would be a great way to capture what the public thought was important, especially since we couldn’t have eyes to see everything. We realized that there would be issues, though, if we didn’t protect ourselves from the possibility of people posting content that infringed on someone else’s copyright. To protect against this, Bradley worked with Madelyn to get approval of language, and Eric Johnson worked with Wayne Graham in the Scholar’s Lab so that the contribution form could allow people to indicate whether or not they wanted an item to be public, and so that it required one of us on the staff to approve an item before it became public.
So far, we’ve had over 100 contributions. There have been pictures, video, copies of emails, and links to online sources.
A lot of this content is not duplicated among the images I’ve collected from other sources, and in some cases it provides a better source than anything else I’ve collected. This picture, for example, documents one of the signs that we really liked but haven’t yet gotten as a donation.
So the final tally of what we have in a digital format so far: Tweets: 80,000 News articles: 571 Blog posts: 147 Other web content: 196 Twitter pictures: 243 Video: 69 Documents: 21 User-Contributed Items: 118 These numbers will continue to grow, I’m sure. For example, this does not take into account pictures and video from public affairs. And this is in addition to the 100 or so rally signs and a couple dozen newspapers. Some of the rally signs are simply too large or fragile for us to properly store, so we will probably scan them for access and dispose of them, thereby growing the collection more.
So, I thought I’d end with a look at what we are hoping to do next with these materials. The first step is to continue to fill in some gaps. I’m going to be going back and recapturing local news sources to get the complete commentary if I can. In addition, we’ll work with Public Affairs to figure out what to do about the video, which is quite large. The next step will be for Special Collections to create a finding aid for the collection which will provide basic organization and description of the different groups of materials. Then we will be working with OLE and others to try and get this material into the repository to ensure digital preservation and management. Likely, much of the content will be fully accessible in the reading room only. Even though it is on a computer and we COULD publish it online for the world to see, we are still bound by intellectual property concerns and want to work within Fair Use. We gathered this material for use as a research resource and that will be our primary motivation.
So, with that, I would be happy to take any questions and I thank you for being here today.