4. Technical Group at MBG
Mike Lichtenberg
Developer
Trish Rose-Sandler
Data Analyst
William Ulate
Technical Director
5. Technical Support
MBG IT Division
• Manages servers, systems and
telecommunications.
• Installs needed software
And others:
• MBL
• Internet Archive
• BHL-Australia
• BHL-Europe
10. Scientific Name Extraction
• TaxonFinder algorithm in production since
2008
– More than 100 million candidate name strings
– More than 1.5 million unique, verified names
– Available through UI, APIs, Data Exports & Internet
Archive
• New collaboration with Global Names project
– Improved algorithm, better precision & recall
– More data with TaxonFinder and Neti Neti!
– http://gnrd.globalnames.org/
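As a rough illustration of the difference between "candidate name strings" and "unique, verified names", the sketch below flags capitalized-genus plus lowercase-epithet pairs with a regular expression and then keeps only those found in a reference set. This is not the TaxonFinder or Neti Neti algorithm; the pattern, the function names and the sample reference set are assumptions made for illustration.

```python
import re

# Hypothetical pattern: a capitalized genus followed by a lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

def candidate_names(text):
    """Return every 'Genus species'-shaped string found in the text."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(text)]

def verified_names(candidates, known_names):
    """Keep only the unique candidates that appear in a reference set."""
    return sorted(set(candidates) & set(known_names))

if __name__ == "__main__":
    text = "Specimens of Quercus alba grew near Acer rubrum. The tree was tall."
    found = candidate_names(text)
    print(found)  # includes the false positive "The tree"
    print(verified_names(found, {"Quercus alba", "Acer rubrum"}))
```

A pattern this simple produces false positives such as "The tree", which is why candidates need a separate verification step.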
34. [Slide shows a sample of raw, uncorrected OCR output; the text is not recoverable]
37. Purposeful Gaming
DIGITALKOOT
• Joint project run by the National
Library of Finland and Microtask to
index the library's enormous archives
so that they are searchable on the
Internet for easier access to the
Finnish cultural heritage.
38. Purposeful Gaming
DIGITALKOOT
• Launched on Feb 8, 2011; by Nov 29, 2012, nearly
110,000 participants had completed over 8 million
word-fixing tasks
• DigiTalkoot enabled volunteers to participate
in this fixing work by playing games.
39. Purposeful gaming and BHL:
engaging the public in improving and
enhancing access to digital texts
• IMLS Grant Program:
National Leadership Grants for Libraries
• Partners:
– Missouri Botanical Garden
– Harvard University
– Cornell University
– New York Botanical Garden
• P.I.: Trish Rose-Sandler, Missouri Botanical Garden
• Dates: Dec. 2013 – Nov. 2015
40. Project objectives and benefits
• Test new means of crowdsourcing to support the
enhancement of content in BHL
• Demonstrate whether digital games are an effective tool for
analyzing and improving digital outputs from OCR and
transcription
• Benefits of gaming include:
– improved access to content by providing richer and more
accurate data;
– an extension of limited staff resources; and
– exposure of library content to communities who may not
know about the collections otherwise.
45. iDigBio’s aOCR Hackathon
• Improve OCR parsing of labels with clear metrics
(datasets, output formats, scoring algorithm)
• Libraries of regular expressions to clean up each field
(latitude/longitude coordinates need different error
correction than personal names or herbarium
catalog numbers)
• Tool for classifying segments of the image before
submitting to OCR
• Do a first pass of OCR to clean images before
sending them to a second, 'real' pass of OCR
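The idea of field-specific cleanup can be sketched as follows: each field gets its own substitutions and pattern, because an "O" is probably a zero in a catalog number but a "1" is probably an "l" in a personal name. This is a minimal sketch, not code from the hackathon; the function names, the confusion tables and the accepted latitude format are all assumptions.

```python
import re

# Assumed OCR confusions for fields that should contain digits.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})
# Assumed OCR confusions for fields that should contain letters.
LETTER_FIXES = str.maketrans({"0": "o", "1": "l", "5": "s"})

def clean_catalog_number(raw):
    """Take the trailing digit-like run and map look-alike letters to digits."""
    match = re.search(r"[0-9OolISB]+$", raw.strip())
    return match.group(0).translate(DIGIT_FIXES) if match else None

def clean_latitude(raw):
    """Parse a degrees/minutes latitude such as "38° 30' N" into signed decimal degrees."""
    fixed = raw.translate(str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"}))
    match = re.search(r"(\d{1,2})\D+(\d{1,2})\D*([NS])", fixed)
    if match is None:
        return None
    value = int(match.group(1)) + int(match.group(2)) / 60
    return -value if match.group(3) == "S" else value

def clean_person_name(raw):
    """Map look-alike digits back to letters and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", raw.translate(LETTER_FIXES)).strip()

if __name__ == "__main__":
    print(clean_catalog_number("No. 1O2l5"))
    print(clean_latitude("38° 3O' N"))
    print(clean_person_name("A.  Wi1son"))
```

Keeping one small function per field makes it easy to grow each rule set independently as new error patterns turn up.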
46. iDigBio’s CITScribe Hackathon
1. Interoperability between public participation
tools and biodiversity data systems,
2. Transcription quality assessment/quality
control (QA/QC) and the reconciliation of
replicate transcriptions,
3. Integration of optical character recognition
(OCR) into the transcription workflow,
4. User engagement
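One simple way to reconcile replicate transcriptions of the same field is a majority vote after light normalization, with low-agreement fields routed to QA/QC review. The sketch below assumes that approach; the function names and the 0.5 threshold are illustrative choices, not something specified by the hackathon.

```python
from collections import Counter

def normalize(value):
    """Ignore case and spacing differences when comparing transcriptions."""
    return " ".join(value.split()).casefold()

def reconcile(replicates, min_agreement=0.5):
    """Return (consensus, agreement) for one field transcribed several times.

    consensus is None when no value is shared by more than min_agreement
    of the replicates, which signals that the field needs manual review.
    """
    if not replicates:
        return None, 0.0
    counts = Counter(normalize(r) for r in replicates)
    winner, votes = counts.most_common(1)[0]
    agreement = votes / len(replicates)
    if agreement <= min_agreement:
        return None, agreement
    # Report the first original spelling that matches the winning value.
    original = next(r for r in replicates if normalize(r) == winner)
    return original, agreement

if __name__ == "__main__":
    print(reconcile(["Quercus alba", "quercus  alba", "Quercus alta"]))
    print(reconcile(["St. Louis", "Saint Louis"]))
```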
47. NfN & iDigBio’s CITScribe Hackathon
• Jason Best’s DarwinScore
• Ben Brumfield’s Handwriting Gibberish Detector
• Dictionaries to improve crowdsourcing consensus
(e.g., names of collectors, scientific names)
• Word clouds created using n-gram scoring,
faceting, and Solr for indexing, plus Carrot2 for
specimen selection (visualize and explore the
specimens that use a word of interest from the
word cloud) and a data cleaning step (the system
highlights infrequent words).
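The data cleaning idea, highlighting infrequent words unless a dictionary vouches for them, can be sketched in a few lines. This is an assumed, simplified version: it counts plain word frequencies rather than using Solr or n-gram scoring, and the function name, tokenization and threshold are illustrative.

```python
import re
from collections import Counter

def infrequent_words(transcriptions, dictionary=frozenset(), max_count=1):
    """Return words seen at most max_count times that are not in the dictionary."""
    words = [w for text in transcriptions for w in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(words)
    return sorted(w for w, c in counts.items() if c <= max_count and w not in dictionary)

if __name__ == "__main__":
    labels = [
        "Collected by Steyermark",
        "collected by Steyermark",
        "Collectcd by Steyermark near Allenton",
    ]
    # Without a dictionary, the rare place name is flagged along with the typo.
    print(infrequent_words(labels))
    # A dictionary of known names keeps legitimate rare words from being flagged.
    print(infrequent_words(labels, dictionary={"allenton", "near"}))
```

The dictionary argument reflects the point above about collector and scientific names: rare but valid words should not be treated as errors.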
48. NESCent EOL-BHL Research Sprint
There is no place like home: Defining “habitat” for
biodiversity science
Robert D. Stevenson
UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston,
MA 02125-3393
Carl Nordman (Natureserve) and
Evangelos Pafilis
Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion,
71003, Crete, Greece
49. NESCent EOL-BHL Research Sprint
Assessing Risk Status of Mexican Amphibians Through Data
Mining.
Esther Quintero and Bárbara Ayala
National Commission for Knowledge and Use of Biodiversity
(CONABIO)
and
Anne Thessen
Marine Biological Laboratory and Arizona State University
50. Planning for global change: using species interactions in conservation
Nicole F. Angeli, Emma P. Gomez, Margot A. Wood,
Applied Biodiversity Sciences Program, Texas A&M University, College
Station, Texas
nangeli1@jhu.edu
Tweet me @auratus_nicole
and
Javier Otegui
University of Colorado-Boulder
51. There is no place like home: Defining “habitat” for biodiversity science
Robert D. Stevenson
UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA
02125-3393
Carl Nordman (Natureserve)
Evangelos Pafilis
Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003,
Crete, Greece
http://epafilis.info/ , vagpafilis@gmail.com
52. Evolution in the usage of anatomical concepts in
the biodiversity literature
Todd Vision (tjv@bio.unc.edu),
Prashanti Manda (manda.prashanti@gmail.com), and
Dongye Meng (dmeng@cs.unc.edu)
University of North Carolina at Chapel Hill
53. NESCent EOL-BHL Research Sprint
Evolution in the usage of anatomical concepts in the
biodiversity literature
Todd Vision (tjv@bio.unc.edu),
Prashanti Manda (manda.prashanti@gmail.com), and
Dongye Meng
University of North Carolina at Chapel Hill
56. Some preliminary observations…
• Our API seemed to work fine
• Access via a taxon (or a group), for example:
“I want to harvest all pages with names from this taxon (Chordata) or this common
name (Vertebrate)”.
• Groups started getting results after 2.5 days.
• The structure of BHL was explained so researchers could
understand the title, item, page and part levels and define what
they wanted. For example, one group was looking for terms in the
titles and the parts’ titles.
• Others said they would harvest the OCR from the Internet Archive
(IA), although there they can only harvest text at the item level,
not page by page.
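The title, item, page and part levels can be pictured as nested records. The sketch below is an assumed, simplified model for illustration only; the class and field names are hypothetical and do not reproduce the BHL API or database schema. It shows a search over titles and part titles, and how item-level text is just the pages concatenated, which is why page boundaries are lost when only item-level text is available.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    page_id: int
    ocr_text: str = ""

@dataclass
class Part:  # e.g. an article or chapter within an item
    title: str
    page_ids: list = field(default_factory=list)

@dataclass
class Item:  # e.g. one scanned volume
    item_id: int
    pages: list = field(default_factory=list)
    parts: list = field(default_factory=list)

@dataclass
class Title:  # e.g. a journal or monograph
    title: str
    items: list = field(default_factory=list)

def find_term(titles, term):
    """Yield (level, text) for every title or part title containing the term."""
    needle = term.casefold()
    for t in titles:
        if needle in t.title.casefold():
            yield ("title", t.title)
        for item in t.items:
            for part in item.parts:
                if needle in part.title.casefold():
                    yield ("part", part.title)

def item_text(item):
    """Item-level text: all page OCR joined together, page boundaries lost."""
    return "\n".join(page.ocr_text for page in item.pages)
```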
59. Mining Biodiversity
• Mining Biodiversity: Enriching Biodiversity Heritage
with Text Mining and Social Media
• One of the international projects that won in the
third round of the 2013 Digging Into Data Challenge
• Promote the development of innovative
computational techniques to apply to big data in
the humanities and social sciences
– The National Centre for Text Mining (UK)
– Missouri Botanical Garden (US)
– Dalhousie University's Big Data Analytics
Institute (Canada)
– Social Media Lab (Canada)
60. MiBIO: Mining Biodiversity
1. Automatic correction of OCR text errors.
2. Crowdsource annotation of legacy texts with semantic metadata.
3. Adapt text mining techniques to extract terminology, entities and
significant events automatically and to track terminology evolution
over time.
4. Use interactive visualization techniques to help users manage
search results through next generation browsing capabilities,
assisted by a semantic similarity network of important terms and
entities.
5. Design of a social media layer, serving as an environment for
diverse users to interact and collaborate on science, public
education, awareness and outreach.
66. Remote Processing
• Workflows processed on remote machines. No attendance needed.
Workflows
• GUI for creating single-flow and multi-branch workflows (Workflow Designer)
User Interaction
• Annotation Editor allows for making changes while processing (Annotator/Curator)
WebService
• Third-party applications
Processing Components
• Data (de)serialisation, search engines, NLP, NER, etc. (Developers)