Bl labs sfu-dhi_lab-dhilab-2019-workshop

  1. @BL_Labs @mahendra_mahey @dhil_sfu @SFU @britishlibrary @BL_DigiSchol. Funded by the Andrew W. Mellon Foundation and the British Library. Running since March 2013. A hands-on data exploration and challenge to become a derived dataset author on the British Library’s open dataset platform. Mahendra Mahey, Manager of British Library Labs, British Library, London, UK. Monday 25 February 2019, 10:30 to 12:00 (Keynote). Burnaby, Bennett Library, Wosk Seminar Room 7100 (inside Special Collections), Simon Fraser University, Vancouver, Canada.
  2. Who do we work with? Researchers, artists, librarians, curators, software developers, archivists, educators and entrepreneurs. Working and communicating. Surprises of serendipity and creating luck?
  3. How? Competition (2013-16): tell us your ideas of what to do with our digital content. Awards: show us what you have already done with our digital content in the research, artistic, commercial, learning and teaching, and staff categories. Projects: talk to us about working on collaborative projects. New! Digital Research Support: tell us your ideas of what to do with our digital content. Engagement: roadshows, events, meetings, conversations.
  4. Collections: not just books! More than 180 million* items, including more than 14 million* books, 0.8 million* serial titles, 8 million* stamps, 6 million* sound recordings, 4 million* maps, 1.6 million* musical scores, 0.3 million* manuscripts and 60 million* patents. Image: the King’s Library. *Estimates.
  5. Have you got X? Looking for physical content in the British Library.
  6. What percentage of our physical collections is digitised? About 3%* is digitised. Digitisation and curating born-digital content cost money, time and resources. Digital partnerships: commercial and other organisations. Bias in digitisation: Sample Generator. Over 720 digital collections: 15%* openly licensed (most online), 85%* available onsite only at the moment. Born digital. Research-driven digitisation: Heritage Made Digital. #bldigital *Estimate.
  7. Have you got X digitised / in digital form? Looking for digitised / digital content in the BL.
  8. Our audience and collections. Audience research and digital interests on one side, the digital collections we have on the other. This is where Labs works. It starts with making connections, engagement, talking to people! All Labs need to do this!
  9. Finding open British Library cultural heritage datasets. Collection Guides (234 as of 25/02/2019). Over 120 datasets available, in these groups: datasets about our collections (bibliographic datasets relating to our published and archival holdings); datasets for content mining (content suitable for use in text and data mining research); datasets for image analysis (image collections suitable for large-scale image-analysis-based research); datasets from the UK Web Archive (data and API services available for accessing the UK Web Archive); digital mapping (geospatial data, cartographic applications, digital aerial photography and scanned historic map materials). Download collections as zips, no API. Each dataset has a Digital Object Identifier (DOI) that can be referenced for research.
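Since the datasets are described as zip downloads with no API, a first step after downloading is usually to unpack the archive and see what it contains. The following is a minimal sketch under stated assumptions: the function names and the demo archive are hypothetical, and no real dataset URL or file layout is implied.

```python
import io
import zipfile


def list_and_extract(zip_bytes: bytes, target_dir: str) -> list[str]:
    """Extract a downloaded dataset archive and return its member names."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        archive.extractall(target_dir)
    return names


def make_demo_zip() -> bytes:
    """Build a tiny in-memory zip standing in for a real dataset download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Invented content, used only to exercise the function.
        archive.writestr("metadata.csv", "identifier,title\n1,Example\n")
    return buffer.getvalue()
```

In practice `zip_bytes` would be the bytes of a file you downloaded by hand from the dataset page; the demo archive only shows the call pattern.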
  10. Explore our data! Examples: CSV of metadata; 19th Century Books, Book Metadata, 01/09/2013; Digitised Books, Flickr Tag History, Dec 2013 to March 2016 (TSV); Digitised Hebrew Manuscripts, Metadata; Digitised Hebrew Manuscripts: Or 2210 - Or 2364; Theatrical playbills from Britain and Ireland (OCR text only); Portraits of actors, views of theatres and playbills (covering 1750 - 1821 in a single volume); Volumes of Lysons Collectanea (Amusements), comprising broadsides, cuttings and advertisements on amusements, 1660-1840.
  11. The story of the digital collection… It is good to know the background ‘story’ of a digital collection if you want to use it for projects. Questions to ask: Who paid for the digitisation? Who did the digitisation? What technology was used? Is it born digital? Published or unpublished? Where is it? Access / API? Can it still be accessed? Does it generate income? Is there a reputational risk in using it? Legalities / ethics / morality? Politics when digitised, e.g. Brexit? Personalities involved? Surprises (e.g. gaps)? Descriptive information: present, missing or inconsistent? Old format not supported? What media was the digitisation done from? Is there any background documentation? Is the digital collection curator still there?
  12. What engagement does the BL have with researchers wanting to use our digital content? Dialogue is typically: you are ‘lucky’ and we have the digital content / data relevant to your research; or we don’t have exactly what you’re looking for, but is there anything of interest? Let’s talk… Engagement is hard work and it’s constantly required to maintain interest in our digital collections! Artists find this dialogue easier… We also tend to attract researchers with ‘fuzzier’ research boundaries who are possibly open to more interdisciplinary / collaborative research.
  13. Challenges of access to digital collections at the BL. Over 720 digital collections: 15%* openly licensed (most online), 85%* available onsite only at the moment. Access routes: open; paid (£); onsite at the British Library (reading room, not online). Labs Residency Model: competition / Digital Research Support application.
  14. Accessing digital collections onsite. ‘Onsite’ (interpretations vary): in a reading room at a specific site, for example? There is an application process to be ‘security cleared’ or ‘trusted’ for some collections, hence the ‘Researcher in Residence Model’: hot desks / reading room digital research spaces, and remote access in a secure environment such as Citrix and virtual machines. Often further permissions are required, depending on what agreements are in place. Digital material can be on various media formats (not always online) and may need to be mounted on obsolete devices or transferred onto more modern equipment. We are getting better at providing access to our digital collections.
  15. Phases of interaction at BL Labs. Submit idea for support. Ideas always change once people experience the data and culture of the organisation.
  16. eResearch SA Open Data Directory
  17. URLs to download sample files not on • • •
  18. Working with British Library digitised newspapers. Digitised through public / private means. Commercial products can be used to look for content manually; they have search interfaces but no APIs. They are a useful starting point, though, and manual methods can translate into computational ones. OCR quality is not great and metadata is OK, but there is plenty of hidden material, so approaches need to take this into account, e.g. ‘Good, Bad and Ugly’ OCR. You can purchase drives with content from Gale Cengage (dependent on subscription).
  19. Good, Bad, Ugly: image quality / OCR. The original capture of newspaper images can affect the quality of the OCR. A poor image is very difficult to re-OCR; good image quality gives a much better chance for re-OCR. Bi-tonal, greyscale or colour capture can affect the quality of the OCR. A methodology for working with a collection at scale needs to acknowledge OCR and image quality.
  20. Breaking Black Boxes – Melodee Beals
  21. Burney Collection. Gathered by the Reverend Charles Burney (1757-1817). 700 volumes of newspapers and news pamphlets: papers published in London, English provincial, Irish and Scottish papers, and a few examples from the American colonies. 1,271 titles. Around 1 million digitised page images, made from microfilm from around 2006. OCR quality is mixed; a custom XML format is used. Bi-tonal.
  22. Web Interface – Burney Collection
  23. OCR quality can be very poor!
  24. 1268 Folders
  25. burney_summary.xls
  28. Example files. The ‘service’ folder contains page-level images and corresponding OCR XML. BurneyB0001ORIWEEJO17151119service
  29. APPLEBEE’S ORIGINAL WEEKLY JOURNAL FROM SATURDAY NOVEMBER 19 TO SATURDAY NOVEMBER 26 1715. WO2_B0001ORIWEEJO_1715_11_19-0001.tiff
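The page image file name on the slide above appears to pack a title code, an issue date and a page number into one string. A minimal sketch of pulling those parts out, assuming the pattern holds for other Burney files; the pattern is inferred from this single example, the field names are hypothetical, and the meaning of the `WO2` prefix is not stated in the slides.

```python
import re
from datetime import date

# Pattern inferred from one file name only: PREFIX_TITLECODE_YYYY_MM_DD-PAGE.ext
BURNEY_NAME = re.compile(
    r"^(?P<prefix>[A-Z0-9]+)_(?P<title_code>[A-Z0-9]+)_"
    r"(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})-(?P<page>\d{4})\.(?P<ext>\w+)$"
)


def parse_burney_name(filename: str) -> dict:
    """Split a Burney page image file name into its apparent components."""
    match = BURNEY_NAME.match(filename)
    if match is None:
        raise ValueError(f"Unrecognised file name: {filename}")
    return {
        "prefix": match["prefix"],
        "title_code": match["title_code"],
        "issue_date": date(int(match["year"]), int(match["month"]), int(match["day"])),
        "page": int(match["page"]),
        "extension": match["ext"],
    }
```

Calling `parse_burney_name("WO2_B0001ORIWEEJO_1715_11_19-0001.tiff")` yields the title code `B0001ORIWEEJO`, the issue date 19 November 1715 and page 1, which matches the masthead dates shown on the slide.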
  30. JISC 1 and JISC 2 Newspapers
  31. Private BL NAS. Accessible onsite, or remotely via Citrix if security cleared.
  32. Accessing digitised newspapers onsite at the BL (JISC 1): 12 volumes, 80 TB of data.
  33. Accessing digitised newspapers onsite at the BL: the ‘service’ copy (post-processed) and the results of OCR, available as XML.
  34. Accessing digitised newspapers onsite at the BL: the ‘service’ copy (post-processed).
  35. Accessing digitised newspapers onsite at the BL: OCR as XML.
  36. jisc_1.xls: 79 titles, 2 million pages.
  37. Metadata from BL (JISC 1 and 2). Title metadata: title, as written; normalised title across all variants; standardised title abbreviation; variant titles, with associated dates; place of publication; dates of publication; genre, such as newspaper; sub-collection, such as Regional Daily. Issue metadata: volume number; issue number; date as printed; normalised date (YYYY.MM.DD); number of pages; the microfilm reel number; the OCR quality. Page image data: the number of the image within that issue; the filename; the spatial coordinates for the page within the image; the degree of page skew.
  38. Metadata from Gale (JISC 1 and 2). Standardised identifier; newspaper title; standardised title abbreviation; project codes; digitized collection name; issue number; date as printed; standardised date (Month, DD, YYYY); standardised date (YYYYMMDD); day of the week; number of pages; copyright holder; language; unique ID for publication; holding library; citation of the physical item. Title metadata: title as recorded in the MARC Library Catalogue; dates of publication; genre, such as newspaper; conversion credit, usually a vendor. Article: unique ID; OCR quality; SC, or standardized category of article; unique ID(s) of page(s); unique ID(s) of individual column(s); column number; headline; article type.
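The two metadata sources above record issue dates in different normalised forms: YYYY.MM.DD from the BL and YYYYMMDD from Gale. A minimal sketch of bringing both into one comparable type, assuming the values are plain strings in exactly those layouts; the function names are hypothetical, and the sample date is the Belfast News-Letter issue of 14 November 1871 listed among the JISC 1 samples.

```python
from datetime import date, datetime


def parse_bl_date(value: str) -> date:
    """Parse the BL normalised issue date, e.g. '1871.11.14'."""
    return datetime.strptime(value, "%Y.%m.%d").date()


def parse_gale_date(value: str) -> date:
    """Parse the Gale standardised issue date, e.g. '18711114'."""
    return datetime.strptime(value, "%Y%m%d").date()


def same_issue_date(bl_value: str, gale_value: str) -> bool:
    """Check whether a BL record and a Gale record refer to the same day."""
    return parse_bl_date(bl_value) == parse_gale_date(gale_value)
```

Parsing into `date` objects rather than comparing strings makes it harder to match records by accident, and malformed values raise `ValueError` instead of passing silently.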
  39. Samples for JISC 1. ‘master’ contains high-res TIFF; ‘service’ contains post-processed TIFF and OCR XML. BNWL - The Belfast News-Letter - 1871 - November 14; BNWL - The Belfast News-Letter - 1885 - September 12; DNLN - Daily News - 21 Jan 1846 - 31 Dec 1900.
  40. JISC 2 Collection: 22 titles; regional titles; 1,020,550 pages.
  41. jisc_2.xls
  42. JISC 2: 40 TB; stored differently locally; 192,353 folders.
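The JISC 2 figures above give a feel for how heavy each page is on disk. A rough back-of-envelope sketch, assuming the 40 TB covers exactly the 1,020,550 pages (all copies and derivatives included) and that a terabyte is 10^12 bytes; both are assumptions, not statements from the slides.

```python
JISC2_TOTAL_BYTES = 40 * 10**12  # 40 TB, decimal terabytes assumed
JISC2_PAGES = 1_020_550          # page count from the JISC 2 slide


def megabytes_per_page(total_bytes: int, pages: int) -> float:
    """Average storage per page in decimal megabytes."""
    return total_bytes / pages / 10**6


average = megabytes_per_page(JISC2_TOTAL_BYTES, JISC2_PAGES)
print(f"About {average:.1f} MB per page")
```

At roughly 39 MB per page, copying even a few titles for local analysis is a planning question in its own right, which is one reason onsite or remote access matters.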
  43. Samples for JISC 2: organised differently.
  44. Samples for JISC 2: Lancaster Gazetter, And General Advertiser For Lancashire West; Southampton Herald; Berrows Worcester Journal. Folder A contains post-processed files; M contains JP2; O contains ALTO XML.
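ALTO XML stores recognised words as `String` elements with a `CONTENT` attribute, grouped into `TextLine` elements. A minimal sketch of recovering plain text from such a file, assuming nothing about the ALTO version used for JISC 2; the sample document is invented for illustration and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented ALTO fragment for illustration; real files carry layout attributes too.
SAMPLE_ALTO = (
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"><Layout><Page>'
    "<PrintSpace><TextBlock>"
    '<TextLine><String CONTENT="ORIGINAL"/><SP/><String CONTENT="WEEKLY"/></TextLine>'
    '<TextLine><String CONTENT="JOURNAL"/></TextLine>'
    "</TextBlock></PrintSpace></Page></Layout></alto>"
)


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_text(xml_source: str) -> str:
    """Return the OCR text of an ALTO document, one output line per TextLine."""
    root = ET.fromstring(xml_source)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib["CONTENT"]
            for child in element
            if local_name(child.tag) == "String" and "CONTENT" in child.attrib
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)
```

Matching on the local tag name sidesteps the namespace differences between ALTO versions, at the cost of ignoring hyphenation and reading-order details that a fuller parser would handle.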
  45. Previous ideas for using the collection: Bob Nicholson – finding jokes; Katrina Navickas – political meetings; Hannah Murray – Black abolitionist performances; Jennifer Batt – finding poetry; Surendra Singh – finding suicide articles; Melodee Beals – evidence of copy and paste; Ryan Cordell – Viral Texts; Paul Fyfe – snipping out images.
  46. Useful resources • • • • chaeology.VPR.pdf?sequence=1
  47. Re-OCR: use of Overproof OCR correction? Re-OCR with ABBYY FineReader?
  48. Virtual infrastructure for OCR text. OCR text ‘scraped’ from digitised newspapers and put in the cloud. Jupyter notebook: write Python code and see the results in a web browser. Access available for researchers ‘in residence’.
  49. 65,000 digitised 19th Century books, 1789 - 1876. Subjects include philosophy, poetry, history and literature. Image: artwork by Alicia Martin. 2007 / 2008. Paid for by: For a full list:
  50. Working with the MS Books Collection: metadata; page-level images; OCR text; Flickr Commons (images snipped out and user-generated tags for images); 19th Century Books Collection data.
  51. 30 August 2012
  52. Metadata: MicrosoftBooks.xls, over 65,000 titles.
  53. MS Books – Finish Titles
  54. Fiction / Non Fiction
  55. Latin American Studies
  56. ALTO XML – Sample Files – 1800 - 1809: 1502 zip files.
  57. OCR Text – JSON File
  58. 002819694
  61. Optical Character Recognition (OCR) generated text. Scanned page image on Flickr Commons.
  62. Snipping out images from 65,000 digitised books*. Work at the BL by Ben O’Steen, Labs and Digital Research Team; Matt Prior. Mechanical Curator with an algorithmic brain (circles, squares, slanty, etc.): posts an image every 30 minutes (Tumblr). Since Dec 2013: Flickr Commons (individual URL and API) and Wikimedia. 1,020,418 images need tagging! More than 1,000,000,000* views and more than 17,000,000* tags. Press. Creative uses of images. Face recognition: algorithms based on photos worked better for female faces than men’s. More demand to see physical items. *Estimates.
  63. British Library Flickr Commons. Flickr Commons has items from Galleries, Libraries, Archives and Museums (GLAM), mostly public domain.
  64. Flickr Commons (100+ GLAMs as of 25/09/18)
  65. Getting an account on Flickr. Get a Flickr / Yahoo account. You can then tag, organise favourites, and make your own albums and galleries from Flickr images online or uploaded. You get 1 TB for free!
  66. British Library Flickr Commons. Why Flickr Commons? Free! Each image has its own unique web address, easy to share. Can tag images. Has an Application Programming Interface (API). Late August 2013.
  67. Using British Library Flickr Commons. How do we find things in this collection? Remember: snipped-out images from books, with no description? Not straightforward…
  68. How is Flickr Commons organised? Photostream, albums, faves, galleries, tags.
  69. Flickr Photostream. Kind of the home page for the collection! Usually displays images with most recent activity!
  70. Flickr Albums. Curated by the British Library, specifically Nora McGregor. She works with the public to add images or create new ones! Over 450 albums as of 25/09/18, mostly maps!
  71. Flickr Faves. Most favourited image first, in descending order. Favouriting an image requires an account.
  72. Flickr Galleries. More useful if you have an account. You can create a gallery of Flickr images to share with everyone. A gallery is tied to your account.
  73. Flickr Groups. Community based, for sharing and discussing images. We might create a group for the competition: watch this space!
  74. Adding tags in Flickr. Be the next ‘Chico45’!
  75. Get Tags!
  76. Searching within the collection!
  77. The anatomy of a BL Flickr record. Download high-res 300 dpi image.
  79. When you log in to Flickr Commons
  81. Opportunities: increasing traffic to Library services. You can purchase a ‘high res’ copy; view in the Library Item Viewer; download .pdf; all illustrations in book; other illustrations in books published in same year; view the item in the Library Catalogue. Tags: auto-generated and user-generated. Tag grouping for image.
  82. Refers to the physical copy of the item.
  85. Physical and digital copy. Number relates to physical copy.
  92. Warning: can be a large file! It’s a PDF. You can do Ctrl+F in it to find text, but health warning about OCR!
  94. Page numbers don’t always correspond! Page 132 on Flickr? That is the page number in the PDF of the book, which may differ from the page number printed in the book.
  96. Plain Text from Books? Not working But can be obtained from
  97. All illustrations in book / books in same year! All the illustrations in this book. Other illustrations in books published in the same year.
  98. Views and Favourites
  99. Galleries: personal galleries which you can share.
  100. Exchangeable Image File (Exif) information! For geeks only!
  101. Tags!
  102. Tagging a million images. Iterative crowdsourcing; Cardiff University’s Lost Visions project; Metadata Games; use computational methods. Top British Library Flickr Commons taggers: James Heald, Mario Klingemann, Chico 45. Human tagger. 18 hard-core taggers: how to reward this small group and keep it motivated? Average for the ‘crowd’ is 1 tag per person: what kind of ‘task’ can this ‘crowd’ do? Mobile games for ‘Ships’, ‘Covers’ and ‘Portraits’. Interface for tagging.
  103. Adding tags! You have to have an account to add tags! Could you be the next Chico 45?
  104. Generated from book. Description. Generated from user.
  105. Generated by Flickr
  106. Flickr Commons API
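The Flickr API is a REST service, and `flickr.photos.search` is its method for finding photos by account and tag. A minimal sketch of building such a request with the standard library only; the API key and user ID are placeholders you would supply yourself, the function name is hypothetical, and no request is sent here.

```python
from urllib.parse import urlencode

FLICKR_REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def build_search_url(api_key: str, user_id: str, tag: str, per_page: int = 20) -> str:
    """Build a flickr.photos.search URL for one account and one tag."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,      # your own key, obtained from Flickr
        "user_id": user_id,      # the account whose photos you want to search
        "tags": tag,
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,     # ask for plain JSON rather than JSONP
    }
    return f"{FLICKR_REST_ENDPOINT}?{urlencode(params)}"
```

For example, `build_search_url("YOUR_KEY", "ACCOUNT_ID", "map")` gives a URL you could fetch with any HTTP client; results come back in pages, so a full harvest of a tag would loop over the `page` parameter.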
  107. Generated by SherlockNet!
  108. SherlockNet has a search interface!
  109. SherlockNet: search for ‘people’
  110. Advanced search in SherlockNet! Tags available for download.
  111. 19th Century Books metadata: 1.9 million records of 19th Century books; used for the Sample Generator project.
  112. Using the Wikimedia Synoptic Index. Created to help find all the maps in the books. Great resource if you want to find things by place!
  113. Google Fusion Table • ySKC0gnPk-pSvrDqqnA7&pli=1#rows:id=1
  114. Geodata: flickr_geodata.csv
  115. Alston Index. Internal document: a 925-page document of BL / British Museum pressmarks. Pages 55-602: Topical Index. Pages 603-925: Pressmark Sequence.
  116. Alston Index. Internal document (not to be externally shared). Published in 1987, dot-matrix printed. Refers to British Museum and British Library pressmarks / shelfmarks. Shelfmarks are used internally to identify
  117. Topical Index: OCR problems. Re-do? Manually correct?
  118. Augment Library Catalogue?
  119. Libcrowds – In the Spotlight
  120. Libcrowds – Spotlight - Data
  121. Data journey. Choose one or two datasets maximum. Explore the collection and make notes about any challenges and issues. See if you can curate a smaller collection from the larger collection. Tell us what you have done. We will consider publishing it on
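The "curate a smaller collection" step in the data journey above can be as simple as filtering a metadata CSV and saving the matching rows as a new file. A minimal sketch, assuming a CSV with a header row; the column names and the three sample records are invented for illustration and do not come from any British Library dataset.

```python
import csv
import io

# Invented sample rows, used only to show the call pattern.
SAMPLE_METADATA = """identifier,title,date
0001,A History of Vancouver Island,1865
0002,Poems of the Sea,1871
0003,Travels in British Columbia,1872
"""


def curate_subset(csv_text: str, column: str, keyword: str) -> list[dict]:
    """Keep rows whose chosen column contains the keyword (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    needle = keyword.lower()
    return [row for row in reader if needle in (row.get(column) or "").lower()]


def rows_to_csv(rows: list[dict], fieldnames: list[str]) -> str:
    """Serialise the curated rows back to CSV text, header included."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the filter criteria in code, rather than editing a spreadsheet by hand, means the derived dataset can be regenerated and its selection rule stated plainly when you tell us what you have done.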

Editor’s notes

  1. Morning everyone. I’m Mahendra Mahey, from the British Library in London, England. ‘Hello’. I am here to tell you my personal story about the experiences and lessons I’ve had working for my institution, as well as with other Galleries, Libraries, Archives and Museums, or ‘GLAMs’, at national, state, public and university organisations and charitable and commercial organisations around the world. My story will focus particularly on how we have engaged with researchers, artists, educators and entrepreneurs, from school children to adults, who have used digitised and born-digital cultural heritage collections and data to create innovative, fun and inspiring projects. I would love it if my experiences could help other organisations build better ‘GLAM’ Labs, but I am also here to learn from you. For the last 6 years, I have been running ‘British Library Labs’, a digital laboratory that encourages anyone to experiment with our vast, incredible, sometimes totally unique and mind-blowing digital collections and data. Our work has been generously funded over these years by the Andrew W. Mellon Foundation and the British Library. We are in fact waiting for news any day now to see if the Library will be moving our work into its core business on a long-term basis. During and after my presentation, please feel free to use Twitter to amplify anything that resonates with you. My slides include much more information than I will have time to talk about, including links for you to delve deeper into the subject. On the right-hand side of the footer there is a link to download all my slides; it appears on every slide. Please feel free to reuse them, but it would be great if you could attribute me and the Library when you do. My presentation will last about an hour. I will do my best to keep your attention. If you have any questions, or something springs to mind, please make a note, as I will take questions at the end of my presentation, though I may ask you some quick ones along the way. So, like all stories, let’s start at the beginning and let me take you on my personal journey.
  2. The picture you can see is inside the main building in London. It’s the King’s Library: King George the Third’s personal library, Mad King George! Sometimes known as the ‘stack’, I walk past this every day; it gives me a sense of awe and reminds me that the collections the British Library has are truly staggering and almost impossible to comprehend. We currently estimate them to exceed 180 million items, representing every age of written civilisation and every known language. Our archives contain the earliest surviving dated printed book in the world, the Diamond Sutra, printed in Chinese and dating from 868 AD. Only around 8% of our collections are books and, as you can see, we have so much more. Please note that the numbers are only really guesses as to exactly what we do have. If you saw 5 items a day it would take you over 80,000 years to see the whole collection.
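The two figures in this note can be checked against the slide estimates with a few lines of arithmetic. A minimal sketch, assuming the slide’s rounded estimates of 180 million items and 14 million books and a 365-day year; the function names are hypothetical.

```python
TOTAL_ITEMS = 180_000_000  # slide estimate for the whole collection
BOOKS = 14_000_000         # slide estimate for books


def years_to_view(items: int, items_per_day: int) -> float:
    """Years needed to see every item at a fixed daily rate."""
    return items / items_per_day / 365


def percentage(part: int, whole: int) -> float:
    """Share of the whole, as a percentage."""
    return 100 * part / whole


print(f"{years_to_view(TOTAL_ITEMS, 5):,.0f} years at 5 items a day")
print(f"{percentage(BOOKS, TOTAL_ITEMS):.1f}% of items are books")
```

With these inputs the viewing time comes to roughly 98,600 years, comfortably "over 80,000", and books make up about 7.8% of items, in line with "around 8%".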
  3. For me, this is what it is like trying to find a physical item at the Library. It feels like a huge hypermarket, or perhaps even a factory or warehouse. It stocks a random assortment of things and, if you ask the assistants, they can tell you about things that are not visible on the shelves but sit in huge storage facilities.
  4. Moving on to our digital collections, which is where my work largely sits. What percentage of our physical collections is digitised? What surprises many people is that only an estimated 3% of our physical holdings are digitised. This is because digitisation costs time and money, and we have to achieve it through partnerships with commercial and other organisations, including philanthropic ones. Through one of the first BL Labs projects, ‘Sample Generator’, we discovered that our digital collections are not truly representative of our physical collections. There are all sorts of reasons why certain items get digitised and others do not. In reality, all our collections, be they digital or physical, have selection biases. Our collections reflect hundreds of years of decisions made by people as to which items to buy, which to keep and which to discard. In terms of licensing and using or reusing digital collections, a Lab like ours has further challenges. Out of our more than 720 digital collections, only around 15% have an open licence. The remainder are only available onsite at the moment. This is in part because many legacy digitisation projects didn’t consider licensing when items were digitised. Trying to establish rights and licensing retrospectively on previously digitised collections costs time and money. As a national library, we have been collecting born-digital items for many decades. We are the home of the UK Web Archive, which periodically captures billions of UK websites to keep for posterity and research, and we are the home of the Alan Turing Institute for AI and data science, where we are an active research partner. For example, Living with Machines is a five-year, £9.2 million research project that will take a fresh look at the well-known history of the Industrial Revolution using data-driven approaches. A new digitisation programme, Heritage Made Digital, is trying to learn from past digitisation projects, especially by digitising collections based on research demand.
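To put the licensing percentages into counts, the split of the 720 digital collections can be worked out directly. A minimal sketch, assuming exactly 720 collections and the rounded 15% / 85% estimates quoted in this note; since the talk says "over 720", the results are lower bounds.

```python
TOTAL_COLLECTIONS = 720  # the talk says "over 720", so treat this as a floor
OPEN_PERCENT = 15        # estimated share with an open licence


def split_by_licence(total: int, open_percent: int) -> tuple[int, int]:
    """Return (openly licensed, onsite only) counts using integer arithmetic."""
    open_count = total * open_percent // 100
    return open_count, total - open_count


open_count, onsite_count = split_by_licence(TOTAL_COLLECTIONS, OPEN_PERCENT)
print(f"{open_count} openly licensed, {onsite_count} onsite only")
```

That is about 108 open collections against roughly 612 that need onsite access, which is the gap the residency model is meant to bridge.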
  5. It can often feel like this… The shop is much smaller; we have some free stuff, some can only be consumed on site, and some you need to buy. If you speak to the shopkeeper, they may be able to show you what’s under the counter, because they couldn’t display it. You might be able to get special permission to look in the warehouse at the back of the shop, which has even more goodies. If you are looking for vegetables, you have come to the wrong shop!
  6. How do you find our open cultural heritage collections? One way is to use our collection guides, which offer subject pathways into our collections. Each guide has a section on what’s available digitally, if anything.
  7. What’s important to understand is that if you really want to work with our digital collections, it sometimes pays to learn the ‘back story’ of how the collection came about. This was a really early and important lesson I learned. Knowing it can have a significant impact on what you might want to do with the collection. On the screen you can see many factors. I simply haven’t got time to go into them all, but perhaps the most important one is the last one: is there a human being around in the organisation who can tell you about the collection? Communicating with them may be the quickest way to learn more about the digital collection you want to work with. Often, they will have access to important information that isn’t written down.
  8. As a Labs manager, I faced a significant challenge: how would we enable access for those who want to use the 85% of our digital collections that are only available onsite at the moment? If we look at this, we can see that onsite at the Library may mean that the digital materials are only available in the reading room on a specific PC, or that the materials are still on their original storage media and may need obsolete equipment to access them, or that they have yet to be transferred onto a more modern system. <CLICK> Some digital materials are only available through payment. <CLICK> Only a small fraction of digital materials are on the shiny, happy, carefree open web (more about how to access these later). <CLICK> What we developed to tackle this situation was a ‘Residency Model’, initially through an annual competition that we ran. This has now evolved into an application process where researchers can apply to carry out digital research onsite at the British Library. These researchers in residence have special access to digital collections that our regular readers do not have. Access is strictly controlled, depending on what they would like to access and what they want to do with the materials.
  9. At the Library, interpretations vary of what ‘onsite’ means. Does it mean only on a PC in a specific room, or on all PCs in all reading rooms on all sites? <CLICK> In order for Labs researchers to be resident, they need to be security cleared to gain access to specific collections. Once that is done, they can get access to a hot desk in a staff area, and we are trialling something similar in the reading room in a secure space. Sometimes they can get access to their desktop remotely too, via Citrix and Virtual Machines. <CLICK> Sometimes further permissions are required even after security clearance, depending on what they want to get access to and what they want to do with it. <CLICK> Digital material can be on various media formats (not always online), and these need to be mounted on obsolete devices or transferred onto more modern equipment. <CLICK> We are getting better at providing access to our digital collections, but we still have a lot of work to do.
  10. 24 seconds (72 words) Let’s look a little further at the types of interactions we have with our researchers. We have summarised these as three phases: ‘Exploration’, where people often ‘rethink’ their ideas of what they want to do with the data; ‘Query-Focused’, where they often have to iterate to come up with a realistic proposal of what they want to do; and ‘Wrap-up’, which ends their project with us, if it is relevant.
  11. 970 files from a selection of 19th century newspaper titles from the BL corpus for us to correct using the overProof post-OCR correction software. The best way to measure the improvement made by the correction process is to compare the OCR'ed text and the automatically corrected text with a perfect correction made by a human (known as the "ground truth"). Hannah-Rose's 5 small human-corrected samples are shown as green dots. These are not only smaller than the other files, but their raw error rate is much lower, at 13.3%. overProof was measured as reducing this to 5.4%, a removal of almost 60% of errors. The red dotted line indicates the correction "break-even" point: the further under the line, the better the quality of the document after correction. In the graph below, the grey line shows the distribution of files across error rates before correction and the green line the distribution after correction.
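As a rough check on the "almost 60%" figure, the relative reduction can be worked out from the two error rates quoted above. This is only a minimal sketch: the function name `relative_error_reduction` is made up for illustration, and it assumes the reduction is calculated as (before - after) / before, which the notes do not state explicitly.

```python
def relative_error_reduction(before: float, after: float) -> float:
    """Return the fraction of errors removed, given error rates before and after correction.

    Assumption: reduction = (before - after) / before. The notes above do not
    say exactly how the "almost 60%" figure was derived.
    """
    if before <= 0:
        raise ValueError("the 'before' error rate must be positive")
    return (before - after) / before


# Error rates quoted for the 5 human-corrected samples, in percent.
raw_error_rate = 13.3        # OCR'ed text compared with the ground truth
corrected_error_rate = 5.4   # after automatic correction

reduction = relative_error_reduction(raw_error_rate, corrected_error_rate)
print(f"Errors removed: {reduction:.1%}")
```

A result just under 0.6 is consistent with "almost 60%". On this measure, a file sitting exactly on the break-even line would give a reduction of zero, and a file made worse by correction would give a negative value.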
  12. Posts small illustrations taken almost at random from the digitised book corpus to a Tumblr blog. This experiment with undirected engagement was a by-product of work to uncover the hidden wealth of illustrations within the digitised pages.
  13. 50 seconds Here is the anatomy of a Flickr record. Importantly, we have created links to many of the Library’s services. <click> Some of this lovely traffic is going back to the Library and hopefully generating more interest in our services, from downloading a PDF of the book to purchasing a high-res scan of the image. <click> Tags are added from the original book record, including the approximate page number the image came from. <click> Users of Flickr can add their own tags and, as I have mentioned, they have already started doing so.