BHL Markup Efforts and Plans

pro-iBiosphere Markup Workshop

Efforts and plans towards
Markup of the BHL Content
William Ulate R.
BHL Technical Director
Missouri Botanical Garden
Berlin, Feb. 10, 2014

More Online Content
Pages (Millions) and Volumes (in Thousands) included in BHL
[Chart: pages (M) and volumes (K) included in BHL, Oct-08 to Oct-13]

Scientific Name Extraction
• TaxonFinder algorithm in production since
2008
– More than 100 million candidate name strings
– More than 1.5 million unique, verified names
– Available through UI, APIs, Data Exports & Internet
Archive

• New collaboration with Global Names project
– Improved algorithm, better precision & recall
– More data with TaxonFinder and Neti Neti!
– http://gnrd.globalnames.org/
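
To make the idea of "candidate name strings" versus "verified names" concrete, here is a minimal sketch. It is not the TaxonFinder or Neti Neti algorithm: the regular expression, the `candidate_names` helper, and the tiny genus list are all hypothetical stand-ins for illustration.

```python
import re

# Hypothetical, deliberately naive pattern: a capitalized word (possible genus)
# followed by a lowercase word (possible species epithet).
# This is NOT how TaxonFinder or Neti Neti work internally.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def candidate_names(text, known_genera):
    """Return candidate binomials whose genus appears in a reference list."""
    found = []
    for genus, epithet in BINOMIAL.findall(text):
        # "Verification" here is only a lookup against an assumed genus list.
        if genus in known_genera:
            found.append(f"{genus} {epithet}")
    return found


page_text = "The plate shows Amandava formosa resting on reeds. Compare Picea abies."
genera = {"Amandava", "Picea"}  # assumed reference data
names = candidate_names(page_text, genera)
print(names)
```

The pattern alone also matches ordinary phrases such as "The plate", which is why a verification step against reference data matters.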

Taxon Names

BEFORE
Name Instances     101,591,803    101,288,804
Unique Names         7,498,554      7,464,924
Verified Names       1,905,507      1,902,803
EOL Names           63,130,350     62,963,582
EOL Pages           13,579,868     13,532,684

AFTER
Name Instances     151,222,182    150,066,425
Unique Names        29,246,382     29,091,767
Verified Names      10,153,165     10,109,540
EOL Names           87,791,695     87,135,089
EOL Pages           15,466,713     15,342,867

Article-level metadata
Chapter-level metadata
Treatment-level metadata

Part-level metadata

Global Replication & Serving
Replicated Data Center

Portal Application

Taxonomic Literature II (TL-2)

BioStor articles marked up with JATS

Macaw

https://github.com/cajunjoel/macaw-book-metadata-tool

Manually built:
1,693 sets
87,879 images

The Art of Life schema: describing and providing access to natural history
illustrations from the Biodiversity Heritage Library (BHL)
by William Ulate, Trish Rose-Sandler, Gaurav Vaidya, Robert Guralnick
Example of illustration described using Art of Life schema
Title: Stictospiza formosa
Type: Illustrations
Date: Publication: 1898
Agent: Author: Arthur G. Butler (1844-1925)
       Illustrator: F.W. Frohawk (1861-1946)
Description: A pair of finches with green and yellow bodies resting on reeds
Subjects: Scientific name: Amandava formosa (Latham, 1790)
          Vernacular Name: Green Avadavat or Green Munia
          Accepted Name: Amandava formosa (Latham, 1790)
          Birds, finches
Inscriptions: bottom center: Green Amaduvade Waxbill (Stictospiza formosa)
Source: Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and
        Clarke, limited, 1889 (2nd edition). This image comes from the Biodiversity Heritage
        Library, and is available online at biodiversitylibrary.org/page/17195895
Rights: Public domain

Art of Life schema elements required in Red
Element: Agents
Definition: Person or corporate entity involved in the creation, design,
production, or publication of a visual resource.
Repeat: Y
Example:
<vra:agent>
  <vra:name type="personal" vocab="LCNAF" refid="89015596">Curtis, John</vra:name>
  <vra:dates type="life">
    <vra:earliestDate>1791</vra:earliestDate>
    <vra:latestDate>1862</vra:latestDate>
  </vra:dates>
  <vra:role vocab="AAT" refid="300025574">publisher</vra:role>
</vra:agent>

Element: Copyright
Definition: The copyright status of the visual resource.
Repeat: N
Example:
<vra:rights refid="http://creativecommons.org/licenses/by-nc/2.0/deed.en">Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)</vra:rights>

Element: Date
Definition: Date or range of dates associated with the creation or
publication of the visual resource.
Repeat: Y
Example:
<vra:date type="creation">
  <vra:earliestDate>1945</vra:earliestDate>
  <vra:latestDate>1955</vra:latestDate>
</vra:date>

Element: Description
Definition: A free-text note about the content of the image, including
comments, description, or interpretation, that gives additional
information not recorded in other categories.
Repeat: Y
Example:
<vra:description>This illustration shows a scale, coloured illustration
of Sepsis annulipes (now known as Encita annulipes) beside the
Trifolium ochroleucum plant. Several dissections from Sepsis
cylindrica Fab. (all these details are provided on the next page of this
book and the subsequent page).</vra:description>

Element: Inscriptions
Definition: All marks, captions, or written words added to the object at
the time of production or in its subsequent history, including
signatures, dates, dedications, texts, and colophons, as well as marks,
such as the stamps of silversmiths, publishers, or printers.
Repeat: Y
Example:
<vra:inscription>
  <vra:position>bottom</vra:position>
  <vra:text>Radula of L. souleyetianum on a more reduced scale</vra:text>
</vra:inscription>

Element: Source
Definition: A citation for the book, journal, or resource that hosts the
visual resource.
Repeat: N
Example:
<vra:source>
  <vra:name type="book">Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and Clarke, limited, 1889 (2nd edition).</vra:name>
  <vra:refid type="URI">http://biodiversitylibrary.org/page/17195895</vra:refid>
</vra:source>

Element: Subject
Definition: Terms or phrases that describe, identify, or interpret the
visual resource.
Repeat: Y
Examples:
<vra:subject><vra:term type="personalName">Carl Linnaeus</vra:term></vra:subject>
<dwc:scientificName>Plant: Picea abies</dwc:scientificName>
<dwc:acceptedName>Plant: Picea abies</dwc:acceptedName>
<dwc:vernacularName>Plant: Norway spruce</dwc:vernacularName>

Element: Title
Definition: The title or identifying phrase given to an image.
Repeat: Y
Examples:
<vra:title xml:lang="la">Sepsis annulipes</vra:title>
<vra:title type="alternate">Orangutan</vra:title>
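
As a rough illustration of how a record using these elements might be consumed, the sketch below reads the agent example with Python's standard library. The namespace URI bound to the `vra:` prefix is an assumption (the slides show only the prefix), and the `summary` dictionary is a hypothetical shape, not part of the schema.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace URI for the "vra:" prefix; the slides do not state it.
NS = {"vra": "http://www.vraweb.org/vracore4.htm"}

record = """
<vra:agent xmlns:vra="http://www.vraweb.org/vracore4.htm">
  <vra:name type="personal" vocab="LCNAF" refid="89015596">Curtis, John</vra:name>
  <vra:dates type="life">
    <vra:earliestDate>1791</vra:earliestDate>
    <vra:latestDate>1862</vra:latestDate>
  </vra:dates>
  <vra:role vocab="AAT" refid="300025574">publisher</vra:role>
</vra:agent>
""".strip()

agent = ET.fromstring(record)

# Pull the fields out into a plain dictionary (hypothetical shape).
summary = {
    "name": agent.find("vra:name", NS).text,
    "born": agent.find("vra:dates/vra:earliestDate", NS).text,
    "died": agent.find("vra:dates/vra:latestDate", NS).text,
    "role": agent.find("vra:role", NS).text,
    "role_vocab": agent.find("vra:role", NS).get("vocab"),
}
print(summary)
```

Because the element is repeatable, a full record could hold several `vra:agent` blocks; `findall` would replace `find` in that case.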

We welcome your feedback on the schema! http://tinyurl.com/9hm7nsb

[Slide image: sample of raw, largely unreadable OCR output from a scanned page]

OCR Improvements
• Transcription
• Purposeful Gaming
• Looking at…
– Crowdsource Markup

Purposeful Gaming
DIGITALKOOT

• Joint project run by the National
Library of Finland and Microtask to
index the library's enormous archives
so that they are searchable on the
Internet for easier access to the
Finnish cultural heritage.


Purposeful Gaming
DIGITALKOOT
• Launched on Feb. 8, 2011; by Nov. 29, 2012, nearly 110,000
participants had completed over 8 million word-fixing tasks
• DigiTalkoot enabled volunteers to take part in this
correction work by playing games.


Purposeful gaming and BHL:
engaging the public in improving and
enhancing access to digital texts
• IMLS Grant Program:
National Leadership Grants for Libraries
• Partners:
– Harvard University
– Cornell University
– New York Botanical Garden

• P.I.: Trish Rose-Sandler, Missouri Botanical Garden
• Dates: Dec 2013 – Nov. 2015

Project objectives and benefits
• Test new means of crowdsourcing to support the
enhancement of content in BHL
• Demonstrate whether digital games are an effective tool for
analyzing and improving digital outputs from OCR and
transcription
• Benefits of gaming include:
– improved access to content by providing richer and more
accurate data;
– an extension of limited staff resources; and
– exposure of library content to communities who may not
know about the collections otherwise.

OCR Improvements

German text interpreted by the OCR process as:
“unb auf ben ©elnrgen be6 fublic{)en”

OCR Improvements

#   IA OCR             OCR 2           Transcription 1   Transcription 2
1   unb                und             und               und               Ok
2   den                ben             den               den               Ok
3   ©elnrgen           ©ebirgen        Bebirgen          Gebirgen          X
4   be6                des             de5               des               Chk
5   fublic{)en         fublichen       Füdlichen         Südlichen         X
6   £)eittfc{)(anb6    Deutfchlanbs    Deutfchlands      Deutschlands      X

Different resulting texts from parsing the phrase:
“und auf den Gebirgen des südlichen Deutschlands”
(“and on the mountains of southern Germany”)
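
One way to reconcile several readings of the same word is a simple vote, sketched below. The thresholds and the labels `accept`, `review`, and `no consensus` are assumptions made for this illustration; the slides do not state how the final column was derived.

```python
from collections import Counter

# Readings per word, in the order: IA OCR, OCR 2, Transcription 1, Transcription 2.
readings = [
    ["unb", "und", "und", "und"],
    ["den", "ben", "den", "den"],
    ["©elnrgen", "©ebirgen", "Bebirgen", "Gebirgen"],
    ["be6", "des", "de5", "des"],
    ["fublic{)en", "fublichen", "Füdlichen", "Südlichen"],
    ["£)eittfc{)(anb6", "Deutfchlanbs", "Deutfchlands", "Deutschlands"],
]


def vote(candidates, accept=3):
    """Pick the most common reading and grade the agreement (assumed thresholds)."""
    winner, count = Counter(candidates).most_common(1)[0]
    if count >= accept:
        return winner, "accept"
    if count >= 2:
        return winner, "review"
    return None, "no consensus"


results = [vote(row) for row in readings]
for number, (word, status) in enumerate(results, start=1):
    print(number, word, status)
```

With these assumed thresholds the six words come out as accept, accept, no consensus, review, no consensus, no consensus. Note that a plain vote cannot recover "Gebirgen" or "Südlichen": the correct reading is present but no two sources agree on it.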

iDigBio’s aOCR Hackathon
• Improve OCR parsing of labels with clear metrics
(datasets, output formats, scoring algorithm)
• Libraries of regular expressions to clean up each field
(error correction for latitude/longitude coordinates
differs from that for personal names or herbarium
catalog numbers)
• Tool for classifying segments of the image before
submitting to OCR

• Do a first pass of OCR to clean images before
sending them to a second, 'real' pass of OCR
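
The idea of field-specific cleanup can be sketched as follows. The patterns, the character substitutions, and the function names `clean_coordinate` and `clean_name` are hypothetical examples, not the hackathon's actual regular expression libraries.

```python
import re

# Assumed look-alike substitutions that make sense only inside numeric fields.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def clean_coordinate(raw):
    """Repair digit look-alikes in a 'degrees minutes hemisphere' string."""
    m = re.fullmatch(
        r"\s*([\dOolIS]{1,3})[°\s]+([\dOolIS]{1,2})['′\s]*([NSEW])\s*", raw
    )
    if not m:
        return None  # not recognizable as a coordinate; leave it for review
    degrees, minutes, hemisphere = m.groups()
    # Only the numeric parts are translated; the hemisphere letter is kept.
    return f"{degrees.translate(DIGIT_FIXES)}°{minutes.translate(DIGIT_FIXES)}' {hemisphere}"


def clean_name(raw):
    """Names get the opposite fix: a zero between letters becomes 'o'."""
    collapsed = re.sub(r"\s+", " ", raw).strip()
    return re.sub(r"(?<=[A-Za-z])0(?=[a-z])", "o", collapsed)


print(clean_coordinate("3O° l5' N"))
print(clean_name("A.  B0wen"))
```

The same character ("O" versus "0") is corrected in opposite directions depending on the field, which is the reason for keeping one rule set per field.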

iDigBio’s CITScribe Hackathon
1. Interoperability between public participation
tools and biodiversity data systems,
2. Transcription quality assessment/quality
control (QA/QC) and the reconciliation of
replicate transcriptions,
3. Integration of optical character recognition
(OCR) into the transcription workflow,
4. User engagement

NfN & iDigBio’s CITScribe Hackathon
• Jason Best’s DarwinScore
• Ben Brumfield’s Handwriting Gibberish Detector
• Dictionaries to improve crowdsourcing consensus
(e.g., names of collectors, scientific names)
• Word clouds built with n-gram scoring and faceting, with
Solr for indexing plus Carrot2, used for specimen selection
(visualize and explore the use of a word of interest from
the word cloud) and for a data cleaning step (the system
highlights infrequent words).
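
The data cleaning step, highlighting infrequent words, can be approximated with a plain frequency count. This sketch uses only the Python standard library rather than Solr or Carrot2, and the sample labels and the `infrequent_words` helper are invented for illustration.

```python
import re
from collections import Counter


def infrequent_words(transcriptions, max_count=1):
    """Return words seen at most max_count times across all transcriptions."""
    words = [
        word.lower()
        for text in transcriptions
        for word in re.findall(r"[^\W\d_]+", text)  # runs of letters only
    ]
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count <= max_count)


# Invented sample labels; one contains a likely transcription slip.
labels = [
    "Collected by J. Smith near Boston",
    "Collected by J. Smith near Bostan",
    "Collected by J. Smith near Boston",
]
flagged = infrequent_words(labels)
print(flagged)
```

Rare words are not necessarily wrong (a new collector name is also rare), so the output is better treated as a review queue than as a list of errors.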

NESCent EOL-BHL Research Sprint
There is no place like home: Defining “habitat” for
biodiversity science
Robert D. Stevenson
UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston,
MA 02125-3393
Carl Nordman (NatureServe) and
Evangelos Pafilis
Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion,
71003, Crete, Greece

Assessing Risk Status of Mexican Amphibians Through Data
Mining.
Esther Quintero and Bárbara Ayala
National Commission for Knowledge and Use of Biodiversity
(CONABIO)
and
Anne Thessen
Marine Biological Laboratory and Arizona State University

Evolution in the usage of anatomical concepts in the
biodiversity literature
Todd Vision (tjv@bio.unc.edu),
Prashanti Manda (manda.prashanti@gmail.com), and
Dongye Meng
University of North Carolina at Chapel Hill

MiBIO: Mining Biodiversity
• Mining Biodiversity: Enriching Biodiversity Heritage
with Text Mining and Social Media
• One of the international projects that won in the
third round of the 2013 Digging Into Data Challenge
• Promote the development of innovative
computational techniques to apply to big data in
the humanities and social sciences
– The National Centre for Text Mining (UK)
– Missouri Botanical Garden (US)
– Dalhousie University's Big Data Analytics
Institute (Canada)
– Social Media Lab (Canada)

1. Automatic correction of OCR errors.

2. Crowdsourced annotation of legacy texts with semantic metadata.

3. Adapt text mining techniques to extract terminology, entities and
significant events automatically and to track terminology evolution
over time.

4. Use interactive visualization techniques to help users manage
search results through next-generation browsing capabilities,
assisted by a semantic similarity network of important terms and
entities.

5. Design a social media layer that serves as an environment for
diverse users to interact and collaborate on science, public
education, awareness and outreach.
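
Tracking terminology over time (item 3) reduces, in its simplest form, to counting how often a term appears per time period. The sketch below is a toy version under that assumption; the documents, the `term_share_by_decade` function, and the substring match are all made up for illustration and are far simpler than real text mining.

```python
from collections import defaultdict


def term_share_by_decade(documents, term):
    """documents: iterable of (year, text) pairs.

    Returns {decade: fraction of that decade's documents mentioning term}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in documents:
        decade = (year // 10) * 10
        totals[decade] += 1
        if term.lower() in text.lower():  # naive substring match
            hits[decade] += 1
    return {decade: hits[decade] / totals[decade] for decade in sorted(totals)}


# Invented sample documents.
docs = [
    (1851, "the carapace is broad"),
    (1858, "shell smooth and rounded"),
    (1902, "carapace and plastron"),
    (1907, "Carapace ridged"),
]
shares = term_share_by_decade(docs, "carapace")
print(shares)
```

Normalizing by the number of documents per decade matters because the volume of digitized literature differs widely between periods.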


Crowdsource Markup
Display text                                    Species Profile Model category
General/summary                                 TaxonBiology
Geographic range                                Distribution
Habitat                                         Habitat
Food sources and feeding behavior               TrophicStrategy
Physical description (general)                  Description
Physical description (detailed morphology)      DiagnosticDescription
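
In a markup tool, the table above would most likely live as a simple lookup from the label shown to volunteers to the category stored with the annotation. The sketch below assumes exactly that; the `SPM_CATEGORY` name and the `tag_passage` helper are hypothetical.

```python
# Display text shown to the user -> Species Profile Model category (from the table).
SPM_CATEGORY = {
    "General/summary": "TaxonBiology",
    "Geographic range": "Distribution",
    "Habitat": "Habitat",
    "Food sources and feeding behavior": "TrophicStrategy",
    "Physical description (general)": "Description",
    "Physical description (detailed morphology)": "DiagnosticDescription",
}


def tag_passage(passage, display_text):
    """Attach the category for a user-chosen label to a text passage."""
    if display_text not in SPM_CATEGORY:
        raise ValueError(f"Unknown label: {display_text}")
    return {"text": passage, "category": SPM_CATEGORY[display_text]}


annotation = tag_passage("Feeds mainly on grass seeds.", "Food sources and feeding behavior")
print(annotation)
```

Keeping the friendly labels separate from the stored categories lets the wording shown to volunteers change without touching existing annotations.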

Thank you
William Ulate
Global BHL Project Manager / Technical Director
william.ulate@mobot.org
Skype: william_ulate_r
