This talk was given at the Visual Resources Association conference on March 13, 2015. The moderator was Trish Rose-Sandler, and the speakers were Robert Guralnick, Gaurav Vaidya, and Trish Rose-Sandler. Notes from the talk are visible when downloaded.
Crowdsourcing your cultural heritage collections: considerations when choosing a platform
1. Crowdsourcing your cultural heritage
collections: considerations when
choosing a platform
Robert Guralnick, University of Florida
Gaurav Vaidya, University of Colorado, Boulder
Trish Rose-Sandler, Missouri Botanical Garden
Image credit: Opensourceway
https://www.flickr.com/photos/opensourceway/4370250237/
March 13 2015 Visual Resources Association conference
6. March 13 2015 Visual Resources Association conference
Factors to consider when choosing a platform
• Fit for purpose
• Size of user community
• Copyright restrictions
• Analytics
• User engagement
• System interoperability
8. March 13 2015 Visual Resources Association conference
Flickr as both an image sharing
and crowdsourcing platform: the
Biodiversity Heritage Library
experience
Trish Rose-Sandler, Missouri Botanical Garden
9. March 13 2015 Visual Resources Association conference
BHL portal
10. March 13 2015 Visual Resources Association conference
BHL Book Viewer
11. March 13 2015 Visual Resources Association conference
BHL Book Viewer
12. March 13 2015 Visual Resources Association conference
BHL Crowdsourcing of image descriptions
BHL images available since 2011
BHL images available since 2011
BHL images available since last week!
13. March 13 2015 Visual Resources Association conference
BHL’s latest crowdsourcing venture
14. March 13 2015 Visual Resources Association conference
Science Gossip UI
15. March 13 2015 Visual Resources Association conference
Flickr basics
• Image hosting site created in 2004
• Acquired by Yahoo in 2005
• 87 million registered members and 3.5 million new images uploaded daily (Mar ‘13)
• Hosts 6 billion images (Aug ‘11)
http://en.wikipedia.org/wiki/Flickr
16. March 13 2015 Visual Resources Association conference
BHL Flickr stream
17. March 13 2015 Visual Resources Association conference
Internet Archive Book Images stream
18. March 13 2015 Visual Resources Association conference
BHL Flickr stream
19. March 13 2015 Visual Resources Association conference
How to get the word out?
20. March 13 2015 Visual Resources Association conference
Flickr machine tags
BHL asked folks to tag scientific and common
names as machine tags
Takes form of Namespace:predicate=value
Examples
taxonomy:binomial=Aegotheles savesi
taxonomy:common=owl
21. March 13 2015 Visual Resources Association conference
Flickr machine tags: searching and re-use
22. March 13 2015 Visual Resources Association conference
Getting data out of Flickr
Via APIs
• Flickr limits API calls to 3600 requests per hour
• 24 hrs to extract tags for 90k images
• Can use multiple API keys to get around
Flickr limitations
23. March 13 2015 Visual Resources Association conference
Flickr – success or failure for
crowdsourcing?
Google Analytics
18% of images in BHL Flickr stream have at least 1 tag added by users
Total views = 88 million views of BHL images in last 4 yrs
Science Gossip since Mar 6 2015 - 140k
images classified by 330 users – huge
success!
Editor's notes
Crowdsourcing as a method for gathering and transcribing information about cultural heritage objects has been used effectively to improve access to these collections for the past decade. Large numbers of the public can be harnessed to accomplish a task too large for institutional staff to complete. It's also a great way to build deeper connections to the content we produce by letting the public generate conversations around the content and make it more accessible.
This image I borrowed from opensource.com and an article on their site by Chris Grams called “2 reasons why the term ‘crowdsourcing’ bugs me”: http://opensource.com/business/10/1/2-reasons-why-term-crowdsourcing-bugs-me?sc_cid=70160000000IDmjAAG
I like the image because it shows that there are two mindsets when it comes to crowdsourcing.
The first is what Grams calls the “manufacturing mindset” or factory model, where the goal of crowdsourcing is simply getting a task done cheaper and faster. The illustration on the left side reflects that mindset. The crowd is simply there to serve a single individual or organization.
The second he describes as a meritocracy, “where the best ideas win” and less of a socialist or collective approach. I would agree with him that some crowdsourcing projects use a peer to peer verification where an idea is not accepted until a consensus forms. Yet others take the collective approach where every idea is considered valid. I think the illustration on the right is more reflective of this mindset and demonstrates how everyone benefits and feeds off of the ideas of others in a collective crowdsourcing project.
There are lots of crowdsourcing applications out there that tackle all sorts of problems from fundraising to folding protein strings. For this talk we’ll focus on those that are designed for sharing and describing images and that have been used fairly widely in the cultural heritage community. Some examples include: Flickr, Wikimedia Commons, Zooniverse, and Metadata Games
Flickr was set up as an image sharing platform in 2004, and Flickr Commons (FC) was developed in 2008 to “provide access to publically held photo archives”.
Library of Congress was the first cultural heritage institution to join FC, and many others have followed suit, such as the US National Archives, the Powerhouse Museum, and the New York Public Library. Most of these institutions have posted small numbers of images (LC 23k, NYPL 2,500, National Archives 12k). The British Library came along in Dec 2013 and uploaded over 1 million images to FC, one of the largest batches of images to FC since it began. Internet Archive followed a year later with over 2.5 million images.
Wikimedia Commons (WC) was launched in Sept 2004 to create a central repository for media files so they could be uploaded once but then referenced as many times as needed in different Wikimedia projects. Many galleries, libraries, archives and museums (GLAMs) wanted to bulk upload their content to WC, but there was no tool to do it. The GLAMwikiToolset project was a European initiative to make that easier. With this bulk upload tool, many GLAMs are now sharing their content in WC.
Zooniverse began in 2007 as a science-focused crowdsourcing platform for one project called Galaxy Zoo and now has over 2 dozen projects covering a wide variety of domains, including climate, nature, and humanities.
Metadata Games is a site designed by a game design dept at Dartmouth called Tiltfactor that combines crowdsourcing of metadata with gaming. It is used with over 44 collections represented at 10 institutions, including the British Library, Boston Public Library, and DPLA.
Fit for Purpose:
In other words what is the platform designed primarily to do?
Both Zooniverse and Metadata Games were designed first and foremost as crowdsourcing applications and therefore have functionality better suited for this purpose, e.g. “peer to peer verification” [http://www.tiltfactor.org/wp-content/uploads2/tiltfactor_citizenArchivistsAtPlay_digra2013.pdf]. This is where each image has more than one user looking at and tagging it, which allows a score to be assigned to a word or phrase to assess its accuracy. This can help with quality control, which is a concern for some organizations venturing into using the public as catalogers.
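To make the scoring idea concrete, here is a minimal sketch in Python. It is not the algorithm Metadata Games or Zooniverse actually uses; it simply assumes a hypothetical list of (image, user, tag) submissions and scores each tag by the share of that image's taggers who entered it.

```python
from collections import defaultdict


def score_tags(submissions):
    """Score each tag on an image by the share of taggers who entered it.

    submissions: iterable of (image_id, user_id, tag) tuples.
    This input format is an assumption made for illustration.
    Returns {image_id: {tag: score}} with scores between 0 and 1.
    """
    users_per_image = defaultdict(set)
    votes = defaultdict(set)  # (image_id, normalized tag) -> users who entered it
    for image_id, user_id, tag in submissions:
        users_per_image[image_id].add(user_id)
        # Normalize case and whitespace so "Owl" and "owl " count as one tag.
        votes[(image_id, tag.strip().lower())].add(user_id)

    scores = defaultdict(dict)
    for (image_id, tag), users in votes.items():
        scores[image_id][tag] = len(users) / len(users_per_image[image_id])
    return dict(scores)


# Three hypothetical volunteers tag the same image.
submissions = [
    ("img1", "ann", "Owl"),
    ("img1", "bob", "owl "),
    ("img1", "cat", "bird"),
]
scores = score_tags(submissions)
```

A tag that two of three volunteers agree on scores higher than one entered by a single person, so an institution could, for example, accept only tags above a threshold of its choosing.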
Size of User community
It's important to consider the size of the community using the platform: what type of exposure will your content get, and how many people could potentially want to tag your content?
As of 2014, Zooniverse had 1 million registered volunteers
Wikipedia is the 6th most visited website in the world
As of 2013 Flickr had 87 million registered members and as of August 2011 the site reported that it was hosting more than 6 billion images
Copyright restrictions:
Does the application impose any copyright restrictions?
WC only accepts images that are either in the public domain or for which you, as the copyright holder, are willing to apply a Creative Commons license without any commercial restrictions. This is because one of the core values of Wikimedia is that content shared there should be free to use, reuse, and change without permission.
Analytics
Does the site provide any stats about folks interacting with your data? e.g. number of images tagged, number of tags per image, when folks are tagging most (month, day, year)
User engagement
Does the platform have tools for users to interact with the data and for you to interact with the volunteers?
This is where a platform like Zooniverse really stands out: it has a blog and talk pages. It also offers incentives, giving volunteers feedback on how many items they’ve classified (some projects have badges).
System interoperability:
data input (bulk uploads), data output (exports, querying APIs)
With these factors in mind, let's begin with our first speaker.
Rob is a biodiversity scientist and informatician whose focus is on documenting the pace of global change and impacts on wildlife. Much of the data he uses for his work comes from natural history collections and citizen science naturalists. He has been deeply involved in ecological and biodiversity informatics initiatives to increase the quality, availability and utility of such datasets at the global scale. His particular informatics interest is building web-based tools to enhance discovery and curate content of natural history collections data; this will be the key topic he discusses today.
Gaurav is a graduate student at the University of Colorado Boulder, where he studies how often taxonomists update catalogues of species definitions and what that means for our understanding of global biodiversity. This is a convenient excuse for him to read old books about cranky taxonomists and revel in the thrill of discovering something new for the first time. He's been an editor on Wikipedia since 2002, and although he barely edits any pages himself, he still thinks it's the greatest thing since sliced bread.
Trish Rose-Sandler has over two decades of experience working in libraries, archives and museums. Since 2010, she’s worked at the Missouri Botanical Garden in St Louis, where she provides data management assistance for the Biodiversity Heritage Library and is principal investigator for 2 BHL grant funded projects: Art of Life and Purposeful Gaming. She has been a VRA member for the past decade and is finishing up a 4yr run as the co-chair for the VRA Core Oversight Committee.
I’ve talked about BHL with this community in past conferences but for anyone who is not familiar let me give you a brief overview.
BHL is a consortium of natural history libraries and museums who collaborate to digitize their historic literature and make it available for free online as part of a global biodiversity commons. We have digitized over 45 million pages of text about plants and animals
This is our portal where our content can be searched at biodiversitylibrary.org
Here is the viewer for navigating through books and journals once you’ve identified an item.
While many people are aware of BHL’s rich textual resources, many are less aware of our hidden natural history illustrations. We estimate we have millions of images in the books but have not had descriptive metadata about them that allows them to be searched.
To address this challenge we have been identifying pages with images, both manually and automatically through the development of algorithms, and pushing them out to crowdsourcing environments.
We’ve utilized multiple crowdsourcing platforms for sharing and describing BHL images, including Wikimedia Commons as Gaurav talked about, Flickr, and most recently Zooniverse.
I just want to plug the Zooniverse site before I jump into my discussion on Flickr because it just went live last week and we’re really excited about the level of participation we’re seeing.
It's called Science Gossip and can be found at sciencegossip.org. Our Zooniverse opportunity came about because BHL partnered with another project called Constructing Scientific Communities, based in the UK, which investigates Victorian citizen science periodicals. BHL contains lots of periodicals from this period, and by serving up pages of BHL journals from this period within Zooniverse and asking the public to help us identify the images' content and creators we can better understand the range of individuals who made science through their images. This is the first Zooniverse project where citizen scientists are both the researchers and the subject of the research.
Here is the UI for Science Gossip, which as you can see is pretty different from the UI for Notes from Nature and shows how customizable the platform is for the needs of different materials.
We are asking folks to help us tag contributors such as illustrators or engravers, add species, record hand-written transcriptions and add keywords about the general subject matter. We would love to have the VRA community participate in ScienceGossip, so please do have a look.
Today I’ll focus my talk on Flickr since that is the platform we’ve been using the longest for both sharing images and crowdsourcing
Many of you are probably very familiar with this platform but here are some basic stats
Image hosting site created in 2004,
acquired by Yahoo in 2005,
As of March 2013 Flickr had 87 million registered members and 3.5 million new images uploaded daily
As of Aug 2011 site reported it was hosting 6 billion images
http://en.wikipedia.org/wiki/Flickr
stream created in 2011
https://www.flickr.com/photos/biodivlibrary
over 95k images, manually curated sets, full page plates
We are also part of the Internet Archive Book Images stream in Flickr. Thanks to the work of researcher Kalev Leetaru and developers at Smithsonian Libraries (SIL), Missouri Botanical Garden (MBG), and the Internet Archive (IA), over 1 million images from BHL are being added to the IA's Book Images Flickr stream. This work began in the summer of 2014 when Leetaru extracted over 14 million images from 2 million IA public domain books and pushed them to the Flickr Commons. BHL images are a subset of this collection because, as a digitization partner for BHL, IA not only scans many of BHL’s books and journals but also hosts all of its content at the Internet Archive as a mirror of the content found at the BHL portal.
The URL is on the screen but rather long, so the easiest way to find this content is to search on Flickr for the term bookcollectionbiodiversity (all one word)
https://www.flickr.com/search/?tags=bookcollectionbiodiversity
Staff identify books or journals heavily illustrated with full plate pages and upload the item as a set.
We created a script that partly automates the upload process so that bibliographic metadata does not have to be added by hand.
Metadata in this stream contains:
Basic bibliographic information about the source from which the image came (in our case books or journals)
URL to get to the page within the BHL portal, because not only do we want the public to view our images in Flickr but also visit our site (promotional tool)
Copyright status: BHL only uploads public domain
We upload some tags ourselves: the BHL page id, the URL encoded in a DC id, and subject keywords that are pulled from the MARC records. In this case: catalog, flowers, gardening, seeds.
All of the photos you upload can be tagged by anyone as long as you give permission in your account settings under privacy and permission
Crowdsourcing is much more successful when you get the word out on an ongoing basis - BHL does regular blog and FB posts, tweets, and even some Flickr tagging parties.
We ask Flickr users to help us tag scientific and common names for species
If you are not familiar with what machine tags are, it's basically a way to not only add a term to an image but also specify the type of term it is. This allows machines to read and understand them better.
Form of Namespace:predicate=value
e.g. “taxonomy:binomial=Aegotheles savesi”
taxonomy:common=owl
You can develop your own set of machine tags and ask people to use them or reuse some that already exist on Flickr. The taxonomic binomial tag was one that already existed on Flickr
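As an illustration of why the Namespace:predicate=value form is easy for machines to read, here is a small Python sketch that splits a tag into its three parts. The allowed characters in the pattern are an assumption on my part, not Flickr's official rules, and `parse_machine_tag` is a hypothetical helper name.

```python
import re
from typing import NamedTuple, Optional


class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str


# Assumed shape: namespace and predicate start with a letter and contain
# only letters, digits, or underscores; the value is everything after "=".
_PATTERN = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.+)$")


def parse_machine_tag(raw: str) -> Optional[MachineTag]:
    """Return the parts of a machine tag, or None for a plain tag."""
    match = _PATTERN.match(raw.strip())
    if match is None:
        return None
    namespace, predicate, value = match.groups()
    # Strip optional surrounding quotes from multi-word values.
    return MachineTag(namespace.lower(), predicate.lower(), value.strip().strip('"'))


binomial = parse_machine_tag("taxonomy:binomial=Aegotheles savesi")
common = parse_machine_tag("taxonomy:common=owl")
plain = parse_machine_tag("owl")  # not a machine tag
```

Once a tag is split this way, software can tell a scientific name from a common name without guessing, which a plain tag like "owl" does not allow.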
By tagging scientific names not only does it allow for searching by people but also searching by machines
Machine tagging of species has allowed the Encyclopedia of Life, a BHL partner, to pull those images into its platform and affiliate them with their species pages
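A rough sketch of what searching by machine could look like: the Python below only builds a request URL for Flickr's `flickr.photos.search` REST method, which accepts a `machine_tags` parameter. It does not send the request. The API key is a placeholder, the helper name is hypothetical, and quoting the multi-word value is my reading of the Flickr documentation, so treat the details as assumptions. This is not a description of how Encyclopedia of Life actually harvests images.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"


def machine_tag_search_url(api_key, machine_tag, per_page=100):
    """Build (but do not send) a flickr.photos.search request URL."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,           # placeholder; a real key is required
        "machine_tags": machine_tag,  # e.g. taxonomy:binomial="Genus species"
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,          # ask for plain JSON rather than JSONP
    }
    return FLICKR_REST + "?" + urlencode(params)


url = machine_tag_search_url(
    "YOUR_API_KEY", 'taxonomy:binomial="Aegotheles savesi"'
)
```

Fetching that URL with any HTTP client would return the matching photos as JSON, which a partner site could then link to its own species pages.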
Quality of tags? (as of Sept 2014 over 22,000 of these tags have been added to 14,000 BHL images)
75% are machine tags
taxonomy:binomial=Eurystomus glaucurus
geo:country=Australia
rest are just values – common names, illustrators, geographic locations
We haven’t come across any tags that are irrelevant or offensive so far
I already mentioned the data import a bit and the script we created. Many cultural institutions will want to bring that metadata back into their local system for searching, which is what BHL wants to do.
Exporting data is done via APIs. It is slow, but it is the only way to get data out.
We have just begun experimenting with exporting via the APIs. Flickr has limits on how many requests can be made to its APIs per hour (3600). With 90k images in this stream, it takes us about 24 hrs to extract it all. We have heard of others setting up multiple API keys to get around Flickr’s limits, so we’ll probably need to go that route as we begin extracting data from the IA stream, which is well over 1 million images; otherwise it would take us about 12 days to extract everything.
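The timings above follow from simple arithmetic, sketched here in Python. It assumes one API request per image and that each additional API key adds another 3600 requests per hour; both are simplifying assumptions, and the function name is hypothetical.

```python
REQUESTS_PER_HOUR = 3600  # hourly limit quoted in the talk


def extraction_hours(n_images, requests_per_image=1, api_keys=1):
    """Estimate hours needed to pull tags for n_images under the hourly limit.

    Assumes each key gets its own independent quota (an assumption).
    """
    total_requests = n_images * requests_per_image
    return total_requests / (REQUESTS_PER_HOUR * api_keys)


bhl_hours = extraction_hours(90_000)                   # 25.0 hours, roughly a day
ia_days = extraction_hours(1_000_000) / 24             # about 11.6 days
ia_days_4_keys = extraction_hours(1_000_000, api_keys=4) / 24  # about 2.9 days
```

With four keys, the estimate for a million images drops from about 12 days to about 3, which is why the multiple-key route looks attractive for the IA stream.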
So how do you gauge if a platform is succeeding or failing?
For Flickr, you can query the API and pull out the tags and assess them as we have done.
You can also look at the Google Analytics for the site. We track those monthly
For BHL Flickr stream there are 17k images with one or more tags added by users (18% of all our images)
Many of those images can have more than 1 tag but analytics don’t tell us the total number of tags per image and don’t tell us how many individual users have added tags.
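Once tags have been pulled out via the API, the gaps left by the analytics can be filled with a few counts. The Python sketch below assumes a hypothetical export format, a dictionary mapping each image id to the list of user-added tags, and uses made-up sample data rather than BHL's real numbers.

```python
def tag_summary(tags_by_image):
    """Summarize user tagging from {image_id: [tag, ...]} (assumed format)."""
    total_images = len(tags_by_image)
    tagged = {img: tags for img, tags in tags_by_image.items() if tags}
    all_tags = [t for tags in tagged.values() for t in tags]
    # Treat anything shaped like namespace:predicate=value as a machine tag.
    machine = [t for t in all_tags if ":" in t and "=" in t.split(":", 1)[1]]
    return {
        "images": total_images,
        "images_tagged": len(tagged),
        "share_tagged": len(tagged) / total_images if total_images else 0.0,
        "total_tags": len(all_tags),
        "tags_per_tagged_image": len(all_tags) / len(tagged) if tagged else 0.0,
        "share_machine_tags": len(machine) / len(all_tags) if all_tags else 0.0,
    }


# Made-up sample: two of four images carry user tags.
sample = {
    "p1": ["taxonomy:binomial=Eurystomus glaucurus", "geo:country=Australia", "roller"],
    "p2": ["owl"],
    "p3": [],
    "p4": [],
}
summary = tag_summary(sample)
```

The same counts could be broken down by user or by month if the export also carried who added each tag and when.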
To judge success you can look at other factors
total views on your content (since 2011 we’ve had 88 million views of BHL content on Flickr)
Point of comparison with Zooniverse.
Since we went live on March 6th we’ve had 140k images classified by 330 users – a huge success!
We attribute this difference to “fit for purpose”: Zooniverse is designed for crowdsourcing, has an active community of classifiers, and offers lots of ways for users to engage with images, through talk etc.
Flickr has no way for users to interact with content other than to tag or fav it, and no way for content providers to interact with taggers.
So success yes but over a much longer period of time than Zooniverse