The document discusses improving the scalability and standardization of taxon name identification and disambiguation APIs. The current API has limitations and depends on a single source. A new proposed API would [1] take plain text as input and return possible name strings and offsets, separating name identification from ID resolution. It would [2] allow multiple name resolution services to integrate via a standard specification. The next steps are to [3] finalize the API specifications and build out services according to the new framework.
8. issues
the TF API is doing jobs it shouldn’t do
NameBank is a large but outdated dataset
“taxonfinder” has no idea what a namebank ID actually is, it only knows strings
current code is completely dependent on www.ubio.org and is not scalable
9. why change?
scaling - we can run 10,000 taxonfinding processes using any algorithm
that supports the standard. Super fast indexing of BHL
future-proofing for devs - any new namefinding tool can take advantage
of the API and doesn’t need to write a webservice or API of its own
future-proofing for BHL - any new namefinding tool can be added with
one parameter
(&client=taxonfinder | &client=neti)
reliability - existing TF API goes down when Rod runs a screen scraping
tool on ubio.org.
10. new API spec
API specs
Request
input (string)
type (text, url)
format (xml=default, json)
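A minimal sketch of how a client might assemble a request from the three parameters above, plus the optional `client` switch mentioned on the previous slide. The endpoint `http://example.org/namefinder` and the function name `build_request_url` are assumptions made for illustration; the slides do not name a URL or a client library.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not specify one.
BASE_URL = "http://example.org/namefinder"

def build_request_url(input_value, input_type="text", fmt="xml", client=None):
    """Build a GET URL from the request parameters listed in the spec."""
    if input_type not in ("text", "url"):
        raise ValueError("type must be 'text' or 'url'")
    if fmt not in ("xml", "json"):
        raise ValueError("format must be 'xml' or 'json'")
    params = {"input": input_value, "type": input_type, "format": fmt}
    if client is not None:
        # e.g. "taxonfinder" or "neti", per the &client= parameter
        params["client"] = client
    return BASE_URL + "?" + urlencode(params)

url = build_request_url("Mus musculus ate the cheese", client="neti")
```

Keeping the name finder selectable through a single parameter is what lets a new tool be added without changing callers.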
Response
XML Response
A response example that corresponds to the xml schema:
<names xmlns="http://globalnames.org/namefinder" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<name>
<verbatim>T. rotundata</verbatim>
<dwc:scientificName>Tillandsia rotundata</dwc:scientificName>
<!-- 0-100 -->
<score>100</score>
<offset start="4550" end="4573" />
</name>
</names>
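A response like the one above could be read with Python's standard library as sketched below. The function name `parse_names` and the dictionary keys are illustrative choices, not part of the spec; the namespaces and element names are taken from the sample.

```python
import xml.etree.ElementTree as ET

# Namespaces as declared in the sample response.
NS = {
    "gn": "http://globalnames.org/namefinder",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
}

SAMPLE = """<names xmlns="http://globalnames.org/namefinder" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<name>
<verbatim>T. rotundata</verbatim>
<dwc:scientificName>Tillandsia rotundata</dwc:scientificName>
<score>100</score>
<offset start="4550" end="4573" />
</name>
</names>"""

def parse_names(xml_text):
    """Return one dict per <name> element in a namefinder response."""
    root = ET.fromstring(xml_text)
    results = []
    for name in root.findall("gn:name", NS):
        offset = name.find("gn:offset", NS)
        results.append({
            "verbatim": name.findtext("gn:verbatim", namespaces=NS),
            "scientific_name": name.findtext("dwc:scientificName", namespaces=NS),
            "score": int(name.findtext("gn:score", namespaces=NS)),  # 0-100
            "start": int(offset.get("start")),
            "end": int(offset.get("end")),
        })
    return results

names = parse_names(SAMPLE)
```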
11. New API
you give us text, we give you strings and offsets. This is the limit of
what a “namefinding” tool can and should do
separately you also need IDs: NameBank, EOL, Tropicos, gn*, GBIF...
once you know Mus musculus is EOL ID “9872332” you don’t need to know
that again. If a book on mice has 40,000 instances of Mus musculus, you
need to know where they are, but not the NameBank ID 40,000 times.
(this is a scaling problem)
Where do we get these? GNI has 19.3m names & IDs.
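The scaling point on this slide can be sketched as follows: keep every offset, but resolve each distinct name string only once. The helper names and the stub resolver are hypothetical; a real resolver would call whichever ID service is in use, and the ID value is the example from the slide.

```python
from collections import defaultdict

def group_offsets(found):
    """Group (name, start, end) hits by name string, keeping every offset."""
    by_name = defaultdict(list)
    for name, start, end in found:
        by_name[name].append((start, end))
    return dict(by_name)

def resolve_once(names, resolver):
    """Call the resolver once per distinct name, not once per occurrence."""
    return {name: resolver(name) for name in names}

# Stub standing in for a real ID resolution service (assumption).
STUB_IDS = {"Mus musculus": "9872332"}
calls = []

def stub_resolver(name):
    calls.append(name)  # record each lookup so the saving is visible
    return STUB_IDS.get(name)

found = [
    ("Mus musculus", 10, 22),
    ("Mus musculus", 340, 352),
    ("Mus musculus", 901, 913),
]
offsets = group_offsets(found)
ids = resolve_once(offsets, stub_resolver)
```

Here three occurrences lead to a single lookup; with 40,000 occurrences the lookup count would still be one.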
13. next?
we could make a tool that hacks together IDs and names...
... but that’s not dev time well spent
we could participate in a process to check off the latter two categories
of the name finding -> ID resolution process
... yes we can
Let’s make a spec, build some APIs.
silver lining - we can start now