BHL Technical Director’s Report
William Ulate
New York Botanical Garden
March 10, 2014
More Online Content
[Chart: Pages (Millions) and Volumes (in Thousands) included in BHL, Oct-08 to Oct-13]
Volumes (K): 22.00, 40.00, 84.86, 94.6, 105.85, 120.09, 132.86
Pages (M): 9.2, 16.4, 31.8, 35.4, 38.9, 41.9, 42.8
Technical Group at MBG
Mike Lichtenberg
Developer
Trish Rose-Sandler
Data Analyst
William Ulate
Technical Director
Technical Support
MBG IT Division
• Manages servers, systems and
telecommunications.
• Installs the software needed
And others:
• MBL
• Internet Archive
• BHL-Australia
• BHL-Europe
Technical Advisory Group
Technical Support
• BHL-Australia
• BHL-Europe
• MBL
Projects
• Global Names
• Art of Life
• Purposeful Gaming
• Digging into Data
Scientific Name Extraction
• TaxonFinder algorithm in production since
2008
– More than 100 million candidate name strings
– More than 1.5 million unique, verified names
– Available through UI, APIs, Data Exports & Internet
Archive
• New collaboration with Global Names project
– Improved algorithm, better precision & recall
– More data with TaxonFinder and Neti Neti!
– http://gnrd.globalnames.org/
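Precision and recall, the two measures named above, can be illustrated with a minimal sketch. It assumes a hand-checked "gold" set of names for a page; the function name and the sample names are hypothetical and are not taken from TaxonFinder or Global Names.

```python
def precision_recall(found, gold):
    """Compare the names a finder returned against a hand-checked gold set."""
    found, gold = set(found), set(gold)
    true_positives = len(found & gold)
    # Precision: share of returned names that are correct.
    precision = true_positives / len(found) if found else 0.0
    # Recall: share of the correct names that were returned.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: three names found, four names actually on the page.
found_names = ["Homo sapiens", "Quercus alba", "Rosa"]
gold_names = ["Homo sapiens", "Quercus alba", "Felis catus", "Canis lupus"]
precision, recall = precision_recall(found_names, gold_names)
```

Here two of the three returned names are correct (precision 2/3) and two of the four true names were found (recall 1/2).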
Taxon Names
BEFORE
Name Instances | 101,591,803 | 101,288,804
Unique Names | 7,498,554 | 7,464,924
Verified Names | 1,905,507 | 1,902,803
EOL Names | 63,130,350 | 62,963,582
EOL Pages | 13,579,868 | 13,532,684
AFTER
Name Instances | 151,222,182 | 150,066,425
Unique Names | 29,246,382 | 29,091,767
Verified Names | 10,153,165 | 10,109,540
EOL Names | 87,791,695 | 87,135,089
EOL Pages | 15,466,713 | 15,342,867
Article-level metadata
Chapter-level metadata
Treatment-level metadata
Part-level metadata
Articles in the BHL UI
See also:
Related Titles
Art of Life
Macaw
https://github.com/cajunjoel/macaw-book-metadata-tool
Reviewing Metadata
Manually built:
1,714 sets
89,457 images
Purposeful Gaming
[Slide shows a full page of garbled, unreadable OCR output]
OCR Improvements
• Gaming
• Transcription
OCR Improvements
• Transcription
• Purposeful Gaming
• Looking at…
– Crowdsource Markup
Purposeful Gaming
DIGITALKOOT
• Joint project run by the National
Library of Finland and Microtask to
index the library's enormous archives
so that they are searchable on the
Internet for easier access to the
Finnish cultural heritage.
Purposeful Gaming
DIGITALKOOT
• Launched on Feb. 8, 2011; by Nov. 29, 2012, nearly 110,000
participants had completed over 8 million
word-fixing tasks.
• DigiTalkoot enabled volunteers to participate
in this fixing work by playing games.
Purposeful gaming and BHL:
engaging the public in improving and
enhancing access to digital texts
• IMLS Grant Program:
National Leadership Grants for Libraries
• Partners:
– Missouri Botanical Garden
– Harvard University
– Cornell University
– New York Botanical Garden
• P.I.: Trish Rose-Sandler, Missouri Botanical Garden
• Dates: Dec 2013 – Nov. 2015
Project objectives and benefits
• Test new means of crowdsourcing to support the
enhancement of content in BHL
• Demonstrate whether digital games are an effective tool for
analyzing and improving digital outputs from OCR and
transcription
• Benefits of gaming include:
– improved access to content by providing richer and more
accurate data;
– an extension of limited staff resources; and
– exposure of library content to communities who may not
know about the collections otherwise.
OCR Improvements
German text interpreted by the OCR process as:
“unb auf ben ©elnrgen be6 fublic{)en”
OCR Improvements
Different resulting texts from parsing the phrase:
“und auf den Gebirgen des südlichen Deutschlands”
(“and on the mountains of southern Germany”)
# | IA OCR | OCR 2 | Transcription 1 | Transcription 2 |
1 | unb | und | und | und | Ok
2 | den | ben | den | den | Ok
3 | ©elnrgen | ©ebirgen | Bebirgen | Gebirgen | X
4 | be6 | des | de5 | des | Chk
5 | fublic{)en | fublichen | Füdlichen | Südlichen | X
6 | £)eittfc{)(anb6 | Deutfchlanbs | Deutfchlands | Deutschlands | X
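The Ok / Chk / X marks in the comparison above are consistent with a simple voting rule across the four readings of each word. The slides do not state the rule, so the sketch below is an assumption: a strict majority is accepted ("Ok"), a two-way agreement is flagged for checking ("Chk"), and no agreement is left unresolved ("X"). The function name is hypothetical.

```python
from collections import Counter


def consensus(readings):
    """Vote across several readings of one word (assumed rule, see lead-in)."""
    word, votes = Counter(readings).most_common(1)[0]
    if votes * 2 > len(readings):   # strict majority agrees
        return word, "Ok"
    if votes >= 2:                  # some agreement, but not a majority
        return word, "Chk"
    return None, "X"                # every source disagrees


# Rows copied from the comparison on the slide.
rows = [
    ["unb", "und", "und", "und"],
    ["den", "ben", "den", "den"],
    ["©elnrgen", "©ebirgen", "Bebirgen", "Gebirgen"],
    ["be6", "des", "de5", "des"],
]
results = [consensus(row) for row in rows]
```

Words that end in "X" are the ones a game or a human transcriber would still need to resolve.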
Purposeful Gaming
Currently…
• Evaluating Transcription Tools…
• Setting up the Workflow for
iDigBio’s aOCR Hackathon
• Improve OCR parsing of labels with clear metrics
(datasets, output formats, scoring algorithm)
• Libraries of regular expressions to clean up each field
(different error correction for latitude/longitude
coordinates than personal names or herbarium
catalog numbers)
• Tool for classifying segments of the image before
submitting to OCR
• Do a first pass of OCR to clean images before
sending them to a second, 'real' pass of OCR
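The idea of field-specific clean-up can be sketched as follows. This is only an illustration, not code from the hackathon: the look-alike table, the field names and the function names are all assumptions, and each cleaner expects a single-line string.

```python
import re

# Hypothetical table of letters that OCR often substitutes for digits.
DIGIT_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "B": "8"})


def clean_coordinate(raw):
    """Repair digit look-alikes in a latitude/longitude string."""
    # Split off an optional trailing hemisphere letter so it is not rewritten.
    match = re.match(r"^\s*(.*?)\s*([NSEW])?\s*$", raw)
    body, hemisphere = match.group(1), match.group(2) or ""
    body = body.translate(DIGIT_LOOKALIKES)
    # Drop anything that cannot appear in a degrees/minutes/seconds value.
    body = re.sub(r"[^0-9.°'\" ]", "", body)
    return f"{body} {hemisphere}".strip()


def clean_catalog_number(raw):
    """Catalog numbers: remove stray whitespace, normalise case, keep letters."""
    return re.sub(r"\s+", "", raw).upper()


# Each field gets its own cleaner; unknown fields are only trimmed.
CLEANERS = {
    "latitude": clean_coordinate,
    "longitude": clean_coordinate,
    "catalog_number": clean_catalog_number,
}


def clean_record(record):
    return {name: CLEANERS.get(name, str.strip)(value) for name, value in record.items()}
```

Keeping one cleaner per field means a digit fix that is right for coordinates is never applied to a collector's name.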
iDigBio’s CITScribe Hackathon
1. Interoperability between public participation
tools and biodiversity data systems,
2. Transcription quality assessment/quality
control (QA/QC) and the reconciliation of
replicate transcriptions,
3. Integration of optical character recognition
(OCR) into the transcription workflow
4. User engagement
NfN & iDigBio’s CITScribe Hackathon
• Jason Best’s DarwinScore
• Ben Brumfield’s Handwriting Gibberish Detector
• Dictionaries to improve crowdsourcing consensus
(e.g., names of collectors, scientific names)
• Word clouds created using n-gram scoring and
faceting, with Solr for indexing and Carrot2 for
specimen selection (visualize and explore the uses
of a word of interest from the word cloud), plus a
data cleaning step (the system highlights
infrequent words).
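The data cleaning step, highlighting infrequent words, could look roughly like the sketch below. It is not the hackathon's implementation (which used Solr and Carrot2); the function name, the threshold and the sample labels are assumptions.

```python
import re
from collections import Counter


def infrequent_words(transcriptions, max_count=1):
    """Return words seen at most max_count times across all transcriptions."""
    counts = Counter(
        word
        for text in transcriptions
        for word in re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    )
    return sorted(word for word, count in counts.items() if count <= max_count)


# Hypothetical label transcriptions; "qucrcus" is a likely misreading of "quercus".
labels = ["Quercus alba L.", "Quercus rubra", "Qucrcus alba"]
suspects = infrequent_words(labels)
```

Rare words are not necessarily wrong (here "rubra" is valid), so the output is a list to review rather than a list of errors.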
NESCent EOL-BHL Research Sprint
There is no place like home: Defining “habitat” for
biodiversity science
Robert D. Stevenson
UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston,
MA 02125-3393
Carl Nordman (Natureserve) and
Evangelos Pafilis
Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion,
71003, Crete, Greece
NESCent EOL-BHL Research Sprint
Assessing Risk Status of Mexican Amphibians Through Data
Mining.
Esther Quintero and Bárbara Ayala
National Commission for Knowledge and Use of Biodiversity
(CONABIO)
and
Anne Thessen
Marine Biological Laboratory and Arizona State University
Planning for global change: using species interactions in conservation
Nicole F. Angeli, Emma P. Gomez, Margot A. Wood,
Applied Biodiversity Sciences Program, Texas A&M University, College
Station, Texas
nangeli1@jhu.edu
Tweet me @auratus_nicole
and
Javier Otegui
University of Colorado-Boulder
There is no place like home: Defining “habitat” for biodiversity science
Robert D. Stevenson
UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA
02125-3393
Carl Nordman (Natureserve)
Evangelos Pafilis
Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003,
Crete, Greece
http://epafilis.info/ , vagpafilis@gmail.com
Evolution in the usage of anatomical concepts in
the biodiversity literature
Todd Vision (tjv@bio.unc.edu),
Prashanti Manda (manda.prashanti@gmail.com), and
Dongye Meng (dmeng@cs.unc.edu)
University of North Carolina at Chapel Hill
NESCent EOL-BHL Research Sprint
Evolution in the usage of anatomical concepts in the
biodiversity literature
Todd Vision (tjv@bio.unc.edu),
Prashanti Manda (manda.prashanti@gmail.com), and
Dongye Meng
University of North Carolina at Chapel Hill
Some preliminary observations…
• Our API seemed to work fine
• Access via a taxon (or a group), for example:
“I want to harvest all pages with names from this taxon (Chordata) or this common
name (Vertebrate)”.
• Groups started getting results after 2.5 days.
• The structure of BHL was explained so researchers could
understand the title, item, page and part levels and define what
they wanted. Ex: one group was looking for terms in the titles and
the parts’ titles.
• Others said they would harvest the OCR from IA, although
they would not be able to harvest the text at page-by-page
granularity (only at item level).
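The title, item, page and part levels mentioned above can be pictured with a small data model. This is a sketch for orientation only: the class and field names are hypothetical and do not reproduce the BHL database schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    page_id: int
    ocr_text: str = ""


@dataclass
class Part:
    """An article, chapter or treatment inside an item."""
    title: str
    page_ids: list = field(default_factory=list)


@dataclass
class Item:
    """One scanned volume."""
    item_id: int
    pages: list = field(default_factory=list)
    parts: list = field(default_factory=list)


@dataclass
class Title:
    """A bibliographic work, made up of one or more items."""
    name: str
    items: list = field(default_factory=list)


def find_term(titles, term):
    """Yield (level, text) for every title or part title containing term."""
    needle = term.lower()
    for title in titles:
        if needle in title.name.lower():
            yield ("title", title.name)
        for item in title.items:
            for part in item.parts:
                if needle in part.title.lower():
                    yield ("part", part.title)
```

A search such as the one the group described, looking for terms in titles and in parts' titles, then walks two of the four levels and ignores page text entirely.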
Mining Biodiversity
• Mining Biodiversity: Enriching Biodiversity Heritage
with Text Mining and Social Media
• One of the international projects that won in the
third round of the 2013 Digging Into Data Challenge
• Promote the development of innovative
computational techniques to apply to big data in
the humanities and social sciences
– The National Centre for Text Mining (UK)
– Missouri Botanical Garden (US)
– Dalhousie University's Big Data Analytics
Institute (Canada)
– Social Media Lab (Canada)
MiBIO: Mining Biodiversity
1. Automatic error correction of OCR text errors.
2. Crowdsource annotation of legacy texts with semantic metadata.
3. Adapt text mining techniques to extract terminology, entities and
significant events automatically and to track terminology evolution
over time.
4. Use interactive visualization techniques to help users manage
search results through next generation browsing capabilities,
assisted by a semantic similarity network of important terms and
entities.
5. Design of a social media layer, serving as an environment for
diverse users to interact and collaborate on science, public
education, awareness and outreach.
Crowdsource Markup
Display text | Species Profile Model category
General/summary | TaxonBiology
Geographic range | Distribution
Habitat | Habitat
Food sources and feeding behavior | TrophicStrategy
Physical description (general) | Description
Physical description (detailed morphology) | DiagnosticDescription
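In a markup tool, the pairing of display text and Species Profile Model category could be held as a simple lookup. The sketch below copies the pairs from the slide; the variable and function names are hypothetical.

```python
# Label shown to the volunteer -> Species Profile Model category stored with the markup.
SPM_CATEGORY = {
    "General/summary": "TaxonBiology",
    "Geographic range": "Distribution",
    "Habitat": "Habitat",
    "Food sources and feeding behavior": "TrophicStrategy",
    "Physical description (general)": "Description",
    "Physical description (detailed morphology)": "DiagnosticDescription",
}


def to_spm(display_text):
    """Return the SPM category for a display label, or None if it is unknown."""
    return SPM_CATEGORY.get(display_text)
```

Volunteers would only ever see the plain-language labels, while the stored markup uses the SPM terms.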
Visit to NaCTeM, Feb. 17, 2014
NaCTeM’s Biodiversity-relevant tools
ANNOTATION PLATFORM
Remote Processing
Workflows processed on remote
machines. No attendance needed
Workflows
GUI for creating single-flow and
multi-branch workflows
Workflow Designer
User Interaction
Annotation Editor allows for
making changes while processing
Annotator/Curator
WebService
Third-party
applications
Processing Components
Data (de)serialisation, search
engines, NLP, NER, etc.
Developers
Workflows view
Processes View
Documents view
Workflow editor
Workflow as a Web service
http://argo.nactem.ac.uk/test/services/webservice/314
INPUT
OUTPUT
NAMED ENTITY RECOGNISERS AND
NORMALISERS
Automatically recognised
named entities
Linking to external dictionaries
Species and habitat recognition
EVENT EXTRACTORS
Events: associations between entities
SEMANTIC SEARCH
TERM EXTRACTION
Dalhousie SocialLab’s Netlytic.org
http://miningbiodiversity.com/
http://miningbiodiversity.org/
Thank you
William Ulate
BHL Technical Director
Missouri Botanical Garden
william.ulate@mobot.org
Skype: william_ulate_r

BHL Technical Director's Report, Mar. 2014

  • 1. BHL Technical Director’s Report William Ulate New York Botanical Garden March 10, 2014
  • 2.
  • 3. 22.00 40.00 84.86 94.6 105.85 120.09 132.86 9.2 16.4 31.8 35.4 38.9 41.9 42.8 - 20 40 60 80 100 120 140 Oct-08 Oct-09 Oct-10 Oct-11 Oct-12 Oct-13 Pages (Millions) and Volumes (in Thousands) included in BHL Volumes (K) Pages (M) More Online Content
  • 4. Technical Group at MBG Mike Lichtenberg Developer Trish Rose-Sandler Data Analyst William Ulate Technical Director
  • 5. Technical Support MBG IT Division • Manage servers, systems and telecommunications. • Installs software needed And others: • MBL • Internet Archive • BHL-Australia • BHL-Europe
  • 6.
  • 9. Projects • Global Names • Art of Life • Purposeful Gaming • Digging into Data
  • 10. Scientific Name Extraction • TaxonFinder algorithm in production since 2008 – More than 100 million candidate name strings – More than 1.5 million unique, verified names – Available through UI, APIs, Data Exports & Internet Archive • New collaboration with Global Names project – Improved algorithm, better precision & recall – More data with TaxonFinder and Neti Neti! – http://gnrd.globalnames.org/
  • 11. Taxon Names BEFORE Name Instances 101,591,803 101,288,804 Unique Names 7,498,554 7,464,924 Verified Names 1,905,507 1,902,803 EOL Names 63,130,350 62,963,582 EOL Pages 13,579,868 13,532,684 AFTER Name Instances 151,222,182 150,066,425 Unique Names 29,246,382 29,091,767 Verified Names 10,153,165 10,109,540 EOL Names 87,791,695 87,135,089 EOL Pages 15,466,713 15,342,867
  • 12.
  • 13.
  • 15. Articles in the BHL UI
  • 16.
  • 19.
  • 25.
  • 30.
  • 32.
  • 34. *E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cix bIa � S &3rn~ 41X a�m cv(f b1air�'o�et ert oiensr �; �', :�hlrfc�c wa ff�4am.diug bist a 6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn�ciblatGteaM w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl�Oiff ;Bruet wacfttc n qmcx b1a bl: bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t � B Rn "� trv W1Rt' ?Cm c blas waIwutr Ober �ci ti 1V Ces ' wt gbtiemwwajfu tpctt, afferain 9 c: b�titbfof �r f eran m rs bra wlg auig4;f aer�m *mc vrt blatcabtfm wfru an'deg~m rt blas Iaum bwWt� run f ncmai b14ianf tJobrrfan ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W�e�&mcyfbq4 Mabtt mmw rc a iiu bc Jcn ncI.end.*, blat s. a u:�rprd3 rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
  • 36. OCR Improvements • Transcription • Purposeful Gaming • Looking at… – Crowdsource Markup
  • 37. Purposeful Gaming DIGITALKOOT • Joint project run by the National Library of Finland and Microtask to index the library's enormous archives so that they are searchable on the Internet for easier access to the Finnish cultural heritage. .
  • 38. Purposeful Gaming DIGITALKOOT • Launched on Feb 8 2011, nearly 110 000 participants completed over 8 million word fixing tasks by Nov 29 2012 • DigiTalkoot enabled volunteers to participate in this fixing work by playing games. • .
  • 39. Purposeful gaming and BHL: engaging the public in improving and enhancing access to digital texts • IMLS Grant Program: National Leadership Grants for Libraries • Partners: – Missouri Botanical Garden – Harvard University – Cornell University – New York Botanical Garden • P.I.: Trish Rose-Sandler, Missouri Botanical Garden • Dates: Dec 2013 – Nov. 2015
  • 40. Project objectives and benefits • Test new means of crowdsourcing to support the enhancement of content in BHL • Demonstrate if digital games are an effective tool for analyzing and improving digital outputs from OCR and transcription • Benefits of gaming include: – improved access to content by providing richer and more accurate data; – an extension of limited staff resources; and – exposure of library content to communities who may not know about the collections otherwise.
  • 41. OCR Improvements German text interpreted by the OCR process as: “unb auf ben ©elnrgen be6 fublic{)en”
  • 42. OCR Improvements Different resulting texts from parsing the phrase: “und auf den Gebirgen des südlichen Deutschlands” (“and on the mountains of southern Germany”) IA OCR OCR 2 Transcription 1 Transcription 2 1 unb und und und Ok 2 den ben den den Ok 3 ©elnrgen ©ebirgen Bebirgen Gebirgen X 4 be6 des de5 des Chk 5 fublic{)en fublichen Füdlichen Südlichen X 6 £)eittfc{)(anb6 Deutfchlanbs Deutfchlands Deutschlands X
  • 44. Currently… • Evaluating Transcription Tools… • Setting up the Workflow for
  • 45. iDigBio’s aOCR Hackathon • Improve OCR parsing of labels with clear metrics (datasets, output formats, scoring algorithm) • Libraries of regular expr. to clean up each field (different error correction for latitude/longitude coordinates than personal names or herbarium catalog numbers) • Tool for classifying segments of the image before submitting to OCR • Do a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR
  • 46. iDigBio’s CITScribe Hackathon 1. Interoperability betweenpublic participation tools and biodiversity data systems, 2. Transcription quality assessment/quality control (QA/QC) and the reconciliation of replicatetranscriptions, 3. Integration of optical character recognition (OCR) into thetranscription workflow 4. User engagement
  • 47. NfN & iDigBio’s CITScribe Hackathon • Jason Best’s DarwinScore • Ben Brumfield’s Handwriting Gibberish Detector • Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names) • Word clouds created using n-gram scoring, faceting, and Solr for indexing, plus Carrot2 for specimen selection (visualize and explore the usage of a word of interest from the word cloud), and a data cleaning step (the system highlights infrequent words).
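The data cleaning step mentioned above (highlighting infrequent words) can be approximated with a plain frequency count. This sketch stands in for the n-gram scoring and the Solr/Carrot2 tooling named on the slide and does not reproduce them; the function name, the `max_count` threshold and the `min_length` filter are assumptions.

```python
import re
from collections import Counter

def infrequent_words(transcriptions, max_count=1, min_length=3):
    """Return words that occur at most `max_count` times across all texts.

    Rare words are candidates for review: they are often OCR or
    transcription slips, although they can also be legitimate rare terms.
    """
    counts = Counter(
        word
        for text in transcriptions
        for word in re.findall(r"[^\W\d_]+", text.lower())  # runs of letters
        if len(word) >= min_length
    )
    return sorted(w for w, c in counts.items() if c <= max_count)

labels = ["Quercus alba L.", "Quercus alba", "Quercvs alba"]
suspects = infrequent_words(labels)
```

Here `suspects` contains only `quercvs`, the one spelling that no other label supports.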
  • 48. NESCent EOL-BHL Research Sprint There is no place like home: Defining “habitat” for biodiversity science Robert D. Stevenson UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA 02125-3393 Carl Nordman (Natureserve) and Evangelos Pafilis Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003, Crete, Greece
  • 49. NESCent EOL-BHL Research Sprint Assessing Risk Status of Mexican Amphibians Through Data Mining. Esther Quintero and Bárbara Ayala National Commission for Knowledge and Use of Biodiversity (CONABIO) and Anne Thessen Marine Biological Laboratory and Arizona State University
  • 50. Planning for global change: using species interactions in conservation Nicole F. Angeli, Emma P. Gomez, Margot A. Wood, Applied Biodiversity Sciences Program, Texas A&M University, College Station, Texas nangeli1@jhu.edu Tweet me @auratus_nicole and Javier Otegui University of Colorado-Boulder
  • 51. There is no place like home: Defining “habitat” for biodiversity science Robert D. Stevenson UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA 02125-3393 Carl Nordman (Natureserve) Evangelos Pafilis Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003, Crete, Greece http://epafilis.info/ , vagpafilis@gmail.com
  • 52. Evolution in the usage of anatomical concepts in the biodiversity literature Todd Vision (tjv@bio.unc.edu), Prashanti Manda (manda.prashanti@gmail.com), and Dongye Meng (dmeng@cs.unc.edu) University of North Carolina at Chapel Hill
  • 53. NESCent EOL-BHL Research Sprint Evolution in the usage of anatomical concepts in the biodiversity literature Todd Vision (tjv@bio.unc.edu), Prashanti Manda (manda.prashanti@gmail.com), and Dongye Meng University of North Carolina at Chapel Hill
  • 56. Some preliminary observations… • Our API seemed to work fine • Access via a taxon (or a group), for example: “I want to harvest all pages with names from this taxon (Chordata) or this common name (Vertebrate)”. • Groups started getting results after 2.5 days. • The structure of BHL was explained so researchers could understand the title, item, page and part levels and define what they wanted. Example: one group was looking for terms in the titles and the parts’ titles. • Others said they would harvest the OCR from IA, although they would not be able to harvest the text at page-by-page granularity (only at item level).
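The title, item, page and part levels mentioned above can be pictured with a small in-memory model. This is an illustrative sketch only: the class and field names are assumptions, a part is assumed to be an article or chapter spanning pages of one item, and the set of wanted names is assumed to have been expanded from the taxon beforehand. It does not use or describe the BHL API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Set, Tuple

@dataclass
class Page:
    page_id: int
    names: Set[str] = field(default_factory=set)  # names found on the page

@dataclass
class Part:                      # e.g. an article or chapter (assumption)
    title: str
    page_ids: List[int] = field(default_factory=list)

@dataclass
class Item:                      # a scanned volume
    item_id: int
    pages: List[Page] = field(default_factory=list)
    parts: List[Part] = field(default_factory=list)

@dataclass
class Title:                     # the bibliographic work
    title: str
    items: List[Item] = field(default_factory=list)

def pages_with_names(titles: List[Title], wanted: Set[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (title, item_id, page_id) for pages mentioning any wanted name."""
    wanted_lower = {n.lower() for n in wanted}
    for t in titles:
        for item in t.items:
            for page in item.pages:
                if wanted_lower & {n.lower() for n in page.names}:
                    yield t.title, item.item_id, page.page_id

def titles_and_parts_matching(titles: List[Title], term: str) -> List[str]:
    """Return title and part titles that contain the search term."""
    term = term.lower()
    hits = [t.title for t in titles if term in t.title.lower()]
    hits += [p.title for t in titles for i in t.items for p in i.parts
             if term in p.title.lower()]
    return hits
```

The two functions mirror the two access patterns on the slide: harvesting pages by name, and searching the titles and the parts’ titles for a term.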
  • 59. Mining Biodiversity • Mining Biodiversity: Enriching Biodiversity Heritage with Text Mining and Social Media • One of the international projects that won in the third round of the 2013 Digging Into Data Challenge • Promote the development of innovative computational techniques to apply to big data in the humanities and social sciences – The National Centre for Text Mining (UK) – Missouri Botanical Garden (US) – Dalhousie University's Big Data Analytics Institute (Canada) – Social Media Lab (Canada)
  • 60. MiBIO: Mining Biodiversity 1. Automatic correction of OCR text errors. 2. Crowdsource annotation of legacy texts with semantic metadata. 3. Adapt text mining techniques to extract terminology, entities and significant events automatically and to track terminology evolution over time. 4. Use interactive visualization techniques to help users manage search results through next generation browsing capabilities, assisted by a semantic similarity network of important terms and entities. 5. Design a social media layer, serving as an environment for diverse users to interact and collaborate on science, public education, awareness and outreach.
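As a rough illustration of the first objective (automatic correction of OCR errors), a token can be matched against a lexicon and replaced by its closest entry when the similarity is high enough. This is a generic dictionary-lookup sketch using Python's standard `difflib`, not the method used by the MiBIO project; the lexicon, the function name and the 0.8 cutoff are assumptions.

```python
import difflib

# Tiny illustrative lexicon; a real one would be far larger.
LEXICON = ["und", "auf", "den", "Gebirgen", "des", "südlichen", "Deutschlands"]

def correct_token(token, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon entry, or the token unchanged.

    Tokens already in the lexicon are kept. Otherwise the best match
    is used only if its similarity ratio reaches `cutoff`, so badly
    garbled tokens are left for human review instead of being guessed.
    """
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token

corrected = [correct_token(t) for t in ["und", "Bebirgen", "Deutfchlands", "fublichen"]]
```

With this cutoff, “Bebirgen” and “Deutfchlands” are corrected, while “fublichen” is too far from any lexicon entry and stays as it is.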
  • 62. Crowdsource Markup
      Display text                              Species Profile Model category
      General/summary                           TaxonBiology
      Geographic range                          Distribution
      Habitat                                   Habitat
      Food sources and feeding behavior         TrophicStrategy
      Physical description (general)            Description
      Physical description (detailed morphology)  DiagnosticDescription
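The pairing of display text and Species Profile Model category on this slide amounts to a lookup table. A minimal sketch, with the dictionary and function names chosen here for illustration:

```python
# Display text shown to contributors -> Species Profile Model category,
# as listed on the slide.
SPM_CATEGORY = {
    "General/summary": "TaxonBiology",
    "Geographic range": "Distribution",
    "Habitat": "Habitat",
    "Food sources and feeding behavior": "TrophicStrategy",
    "Physical description (general)": "Description",
    "Physical description (detailed morphology)": "DiagnosticDescription",
}

def markup_category(display_text):
    """Return the category for a display label, or None if it is not listed."""
    return SPM_CATEGORY.get(display_text)
```

Keeping the labels and categories in one table means the wording shown to contributors can change without touching the stored category values.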
  • 63. Visit to NaCTeM, Feb. 17, 2014
  • 66. • Remote Processing: Workflows processed on remote machines. No attendance needed • Workflows: GUI for creating single-flow and multi-branch workflows (Workflow Designer) • User Interaction: Annotation Editor allows for making changes while processing (Annotator/Curator) • WebService: Third-party applications • Processing Components: Data (de)serialisation, search engines, NLP, NER, etc. (Developers)
  • 71. Workflow as a Web service
  • 72. Workflow as a Web service http://argo.nactem.ac.uk/test/services/webservice/314 INPUT OUTPUT
  • 73. NAMED ENTITY RECOGNISERS AND NORMALISERS
  • 76. Linking to external dictionaries
  • 77. Species and habitat recognition
  • 89. Thank you William Ulate BHL Technical Director Missouri Botanical Garden william.ulate@mobot.org Skype: william_ulate_r