Scaling Namefinding

The document discusses improving the scalability and standardization of taxon name identification and disambiguation APIs. The current API has limitations and depends on a single source. A new proposed API would [1] take plain text as input and return possible name strings and offsets, separating name identification from ID resolution. It would [2] allow multiple name resolution services to integrate via a standard specification. The next steps are to [3] finalize the API specifications and build out services according to the new framework.

process
identify -> disambiguate / reconcile -> ID lookup

identify: mature & scalable, well defined and standardized

disambiguate / reconcile: in progress, needs API & standard

ID lookup: GNI has API, needs standards

issues
the TF (TaxonFinder) API is doing jobs it shouldn’t do

NameBank is a large but outdated dataset

“taxonfinder” has no idea what a NameBank ID actually is; it only knows strings

the current code is completely dependent on www.ubio.org and is not scalable

why change?
scaling - we can run 10,000 taxonfinding processes using any algorithm that supports the standard. Super fast indexing of BHL

future-proofing for devs - any new namefinding tool can take advantage of the API and doesn’t need to write a webservice or API of its own

future-proofing for BHL - any new namefinding tool can be added with one parameter
(&client=taxonfinder | &client=neti)

reliability - the existing TF API goes down when Rod runs a screen scraping tool on ubio.org.
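
The "one parameter" idea can be sketched as a small registry that maps the value of client to a namefinding function. This is only an illustration: the registry, the function names and the regex-based "toy" engine are assumptions made for the example and are not TaxonFinder, NetiNeti or any part of the proposed service.

```python
import re
from typing import Callable, Dict, List, Tuple

# (verbatim string, start offset, end offset)
Match = Tuple[str, int, int]


def toy_binomial_finder(text: str) -> List[Match]:
    """Stand-in engine (hypothetical): matches 'Genus species' shaped tokens.

    It will also match ordinary phrases such as 'The house'; a real engine
    would be plugged in here instead.
    """
    pattern = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")
    return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]


# Adding a new engine means adding one entry here; callers only change
# the value they pass as `client`.
ENGINES: Dict[str, Callable[[str], List[Match]]] = {
    "toy": toy_binomial_finder,
}


def find_names(text: str, client: str = "toy") -> List[Match]:
    """Dispatch to the engine selected by `client`."""
    try:
        engine = ENGINES[client]
    except KeyError:
        raise ValueError(f"unknown client: {client!r}")
    return engine(text)
```

Every engine returns the same shape (string plus offsets), so the caller does not need to know which algorithm produced the result.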

new API spec
Request
  input (string)
  type (text, url)
  format (xml = default, json)
Response
  XML response
  A response example that corresponds to the XML schema:

<names xmlns="http://globalnames.org/namefinder" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <name>
    <verbatim>T. rotundata</verbatim>
    <dwc:scientificName>Tillandsia rotundata</dwc:scientificName>
    <!-- 0-100 -->
    <score>100</score>
    <offset start="4550" end="4573" />
  </name>
</names>
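
A client could read a response like the one above with Python's standard xml.etree.ElementTree. The sketch below assumes the response has exactly the structure of the example; the function name parse_names and the dictionary keys are made up for illustration, and the slides do not define how a missing element should be handled.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the example response; the "gn" prefix is
# our own local alias for the default namespace.
NS = {
    "gn": "http://globalnames.org/namefinder",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
}

SAMPLE = """<names xmlns="http://globalnames.org/namefinder" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <name>
    <verbatim>T. rotundata</verbatim>
    <dwc:scientificName>Tillandsia rotundata</dwc:scientificName>
    <score>100</score>
    <offset start="4550" end="4573" />
  </name>
</names>"""


def parse_names(xml_text):
    """Return one dict per <name> element in the response."""
    root = ET.fromstring(xml_text)
    results = []
    for name in root.findall("gn:name", NS):
        offset = name.find("gn:offset", NS)
        results.append({
            "verbatim": name.findtext("gn:verbatim", namespaces=NS),
            "scientific_name": name.findtext("dwc:scientificName", namespaces=NS),
            "score": int(name.findtext("gn:score", namespaces=NS)),
            "start": int(offset.get("start")),
            "end": int(offset.get("end")),
        })
    return results


for entry in parse_names(SAMPLE):
    print(entry)
```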

New API
you give us text, we give you strings and offsets. This is the limit of what a “namefinding” tool can and should do

separately you also need IDs: NameBank, EOL, Tropicos, gn*, GBIF...

once you know Mus musculus is EOL ID “9872332” you don’t need to look that up again. If a book on mice has 40,000 instances of Mus musculus, you need to know where they are, but you don’t need the NameBank ID 40,000 times (this is a scaling problem)

Where do we get these? GNI has 19.3m names & IDs.
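
The scaling point can be sketched in a few lines: keep every offset, but resolve each distinct name only once. The function names and the lookup callable are hypothetical; lookup stands in for whatever ID service (GNI or another) ends up being used.

```python
from collections import defaultdict


def index_occurrences(found):
    """Group (name, start, end) tuples by name, keeping every offset."""
    positions = defaultdict(list)
    for name, start, end in found:
        positions[name].append((start, end))
    return dict(positions)


def resolve_ids(names, lookup):
    """Call `lookup` once per distinct name, however often it occurs."""
    return {name: lookup(name) for name in set(names)}


# Three occurrences in the text, but only one distinct name to resolve.
found = [
    ("Mus musculus", 10, 22),
    ("Mus musculus", 140, 152),
    ("Mus musculus", 900, 912),
]
occurrences = index_occurrences(found)
# ids = resolve_ids(occurrences, lookup)  # `lookup` is supplied by the caller
```

With 40,000 occurrences of one name, index_occurrences still records 40,000 offsets, while resolve_ids makes a single lookup call.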

issues

misspellings etc. need to be “reconciled”

this definitely isn’t the job of a namefinding tool
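
As a rough illustration of what a separate reconciliation step might do, the sketch below maps a misspelled string to the closest entry in a list of known names using difflib from the Python standard library. The slides do not say how reconciliation should work; the function name, the cutoff value and the choice of difflib are all assumptions for the example.

```python
import difflib


def reconcile(candidate, known_names, cutoff=0.85):
    """Return the closest known name, or None if nothing is similar enough.

    `cutoff` is an arbitrary similarity threshold between 0 and 1.
    """
    matches = difflib.get_close_matches(candidate, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


known = ["Mus musculus", "Tillandsia rotundata"]
print(reconcile("Mus muscullus", known))
```

Keeping this in its own function, outside the namefinder, mirrors the separation the slides argue for: the finder reports strings and offsets, and a later step decides which known name each string refers to.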

next?
we could make a tool that hacks together IDs and names...
... but that’s not dev time well spent

we could participate in a process to check off the latter two categories of the name finding -> ID resolution process
... yes we can

Let’s make a spec, build some APIs.

silver lining - we can start now

1. current BHL ubio.org

2. future

7. current response

<entity>
  <nameString>Abietineae</nameString>
  <namebankID>8401003</namebankID>
  <weblinks>
    <website>
      <title>Tropicos</title>
      <link>http://mobot.mobot.org/W3T/Search/vast.html</link>
      <logo>http://names.ubio.org/tools/image/tropicos.png</logo>
      <links>
        <link nameString="Abietineae Eichler">http://mobot.mobot.org/cgi-bin/search_vast?onda=N50205444</link>
      </links>
    </website>
  </weblinks>
</entity>
