SlideShare ist ein Scribd-Unternehmen logo
1 von 5
Downloaden Sie, um offline zu lesen
Copyedited by: GS MANUSCRIPT CATEGORY: Article
[13:42 7/9/2012 Sysbio-sys069.tex] Page: 1 1–5
Software for Systematics and Evolution
Syst. Biol. 0(0):1–5, 2012
© The Author(s) 2012. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved.
For Permissions, please email: journals.permissions@oup.com
DOI:10.1093/sysbio/sys069
IKey+: A New Single-Access Key Generation Web Service
THOMAS BURGUIERE∗, FLORIAN CAUSSE, VISOTHEARY UNG, AND RÉGINE VIGNES-LEBBE
Laboratoire Informatique et Systématique, Museum National d’Histoire Naturelle, Département Histoire de la Terre,
UMR 7207 – C.N.R.S. - M.N.H.N. - U.P.M.C., Paris, France
∗Correspondence to be sent to: Thomas Burguiere, Laboratoire Informatique et Systématique, M.N.H.N Département Histoire de la Terre,
Bâtiment de Géologie, CP48, 57 rue Cuvier, 75005 Paris, France,
E-mail: thomas.burguiere@upmc.fr
Thomas Burguiere and Florian Causse contributed equally to this article.
Received 2 April 2012; reviews returned 29 June 2012; accepted 25 July 2012
Associate Editor: David Posada
Abstract.—Single-access keys are a major tool for biologists who need to identify specimens. The construction process of these
keys is particularly complex (especially if the input data set is large) so having an automatic single-access key generation tool
is essential. As part of the European project ViBRANT, our aim was to develop such a tool as a web service, thus allowing
end-users to integrate it directly into their workflow.
IKey+generates single-access keys on demand, for single users or research institutions. It receives user input data (using
the standard SDD format), accepts several key-generation parameters (affecting the key topology and representation), and
supports several output formats.
IKey+ is freely available (sources and binary packages) at www.identificationkey.fr. Furthermore, it is deployed on our
server and can be queried (for testing purposes) through a simple web client also available at www.identificationkey.fr
(last accessed 13 August 2012). Finally, a client plugin will be integrated to the Scratchpads biodiversity networking tool
(scratchpads.eu). [Systematics; taxonomy; single-access key; web service; biodiversity informatics.]
During the last decade, taxonomy has been going
through a revolution towards cybertaxonomy (Wheeler
2008). This is particularly obvious when looking at
the numerous initiatives dedicated to the management
and monitoring of biodiversity data such as Global
Biodiversity Information Facility (GBIF), European
Distributed Institute of Taxonomy (EDIT), BIOTA,
Catalogue of Life, Encyclopedia of Life (EOL) or Key
To Nature, and now ViBRANT. A significant increase
of digitized information should be expected in a near
future, and the deep social and scientific impact of web
2.0 stimulates sharing of digital data and proposals
of cybertaxonomy projects to study global biodiversity
and climate changes. In the current context of
distributed knowledge and long-distance collaborations,
the European project ViBRANT (http://vbrant.eu, last
accessed 13 August 2012) has emerged to bridge the
existing gap between tools and services, providing a
powerful exemplar platform, the Scratchpads, for the
entire community.
ViBRANT is about connecting the people, data and science of
biodiversity (ViBRANT - Objectives)
As studying and monitoring biodiversity remains
a strong challenge for biologists, descriptive data
management and identification tools are needed to assist
them in their daily work. For instance, we developed a
complete descriptive data management tool: Xper2 (Ung
et al. 2010).
Our contribution to ViBRANT is the implementation
of an identification key generation web service, because
we also have a significant experience in developing key
generation tools such as MyKey (Gérard and Vignes-
Lebbe 2010). The most common identification keys
(described in Hagedorn et al. 2010) are the printed
single-access keys published in almost all taxonomic
revisionary work. But these printed keys may be difficult
to use while collecting samples in the field or when
consulting incomplete specimens. Hence, electronic
single-access keys, which can be quickly generated
with different parameters in order to bypass potentially
problematic steps (i.e., steps with characters for which
the specimen studied in the field cannot provide
information) are a good alternative.
APPROACH
Web Service
There are currently several tools designed to generate
single-access keys such as, among others, the Delta
software suite (Dallwitz et al. 1993), Pankey (Pankhurst
1988), and MyKey. These tools are well established and
produce valid identification keys, but are not without
drawbacks. For instance, both Delta and Pankey need to
be installed on the end-user’s machine and are not cross-
platform compatible (Windows, Apple, and Linux).
Mykey is a web application, which makes it platform-
agnostic, but the end-users cannot submit their own data
set to generate a key (the data set has to be manually
uploaded by someone in our laboratory). Furthermore,
these 3 tools do not use the biodiversity description
standard file format, Structured Descriptive Data (SDD)
(Hagedorn 2006), for data input.
Our objective, within the ViBRANT project, was to
design a free and open-source key generation service
using the SDD file format that can be used by
independent users or through the ViBRANT OBOE
service (also available through Scratchpads). It was
thus decided to develop a web service (implementing
both the SOAP and REST web service communication
1
Systematic Biology Advance Access published September 16, 2012
atFreieUniversitaetBerlinonSeptember18,2012http://sysbio.oxfordjournals.org/Downloadedfrom
Copyedited by: GS MANUSCRIPT CATEGORY: Article
[13:42 7/9/2012 Sysbio-sys069.tex] Page: 2 1–5
2 SYSTEMATIC BIOLOGY
FIGURE 1. IKey+use cases.
protocols), which is particularly suitable for these two
usecases(cf.Fig.1)andallowstheclienttobuildcomplex
workflows, with a high level of automation.
Architecture
Because the main purpose of ViBRANT is to provide
a platform that is both open source and easily reusable,
we organized IKey+ in two parts:
• an Application Programming Interface (API)
consisting of 3 distinct modules; the Data Model,
that is, the computer representation of the
descriptive data, the Input-Output (IO) module,
whichparsestheSDDinputfilesandloadsthedata
in the Model and generates the output files, and
the Algorithm, which uses the data contained in
the Model to generate a key and returns it through
the IO module.
• a Simple Object Access Protocol (SOAP) or a
REpresentational State Transfer (REST) Service
Layer encapsulating the API, which manages the
communication between the client and the API.
This allows potential developers to integrate the service
into their workflows and adapt it to their specific needs
(e.g., integrating the API in a standalone software). The
whole application was developed using the J2EE (Java 2,
Enterprise Edition) programming environment.
Web Service Input Parameters
IKey+ has several parameters available to the end-
user. These parameters allow the end-user to change the
topology of the key (e.g., pruning the key, promoting
some characters), change the visual aspect of the key, or
the output format. A complete list of these parameters
can be found in the user documentation, available here:
http://www.identificationkey.fr/resources/docs/ident
ificationKeyGeneratorWS_UserGuide.pdf (last accessed
13 August 2012).
Algorithm
The single-access key generation algorithm is a
recursive depth-first graph construction algorithm. Its
arguments are a list of taxa, taxaList and the list
of character considered, charList (cf. Supplementary
Materials, appendices 1, 2 and 3, available at http://
datadryad.org, doi:10.5061/dryad.3ft19). The resulting
graph (cf. Fig. 2) is a directed acyclic graph, consisting
of non-terminal nodes labelled with a character, edges
labelled with the states of the character of the previous
node, and terminal nodes labelled with taxa. At each
step of the algorithm, the best character among those
available is selected by the BEST_CHAR function, which
iterates over the available characters and returns the
character with the greatest discriminant power.
Discriminant Power of a Character
During the last 50 years, estimating a character’s
discriminant power has been the central point of key-
construction algorithms. Many measurement methods
have been suggested, and a review of these methods can
be found in Gower and Payne (1975), Pankhurst (1991),
and Delgado-Calvo-Flores et al. (2006). In a statistical
context, examples include the Bayesian probability and
generalized entropy, such as the Shannon entropy used
in the ID3 algorithm developed by Quinlan (1986) or the
Gini index used in Breiman et al. (1984). In a context
where there is no probability associated with the states
of the characters, one can use the separation coefficient
or the variance as measurement methods of a character’s
discriminant power.
In IKey+, the discriminant power of a given character
C is calculated by the DPOWER function and is
an estimate of C’s ability to differentiate the taxa
of the current list. DPOWER is an extension of the
Gyllenberg separation factor (Gyllenberg 1963) and
is a measurement of the number of pairs of taxa
discriminated by C, which is a generalization of the
variance. Indeed, the variance of a variable can be
computed by comparing all pairs of values (here,
comparing all pairs of taxa description for C). With
different comparison functions, each adapted to a certain
type of character (e.g., a categorical or a numerical
character), it is possible to obtain a generalized formula
to estimate the discriminant power of a given character,
even with polymorphic characters or characters with
missing data. For categorical characters, DPOWER can
use a binary function (boolean comparison) or other
atFreieUniversitaetBerlinonSeptember18,2012http://sysbio.oxfordjournals.org/Downloadedfrom
Copyedited by: GS MANUSCRIPT CATEGORY: Article
[13:42 7/9/2012 Sysbio-sys069.tex] Page: 3 1–5
2012 BURGUIERE ET AL.—Ikey+ 3
Character I
Character II
State I-a
Taxon 1
State I-b
Character III
State II-a
Taxon 2
State II-b
Character IV
State II-c OR State II-d
Taxon 3
State III-a
Taxon 4
State III-b
Taxon 5
State IV-a
Taxon 6, Taxon 7
State IV-b
FIGURE 2. Formal identification key example.
functions such as the Sokal and Michener coefficient
(Sokal and Michener 1958) or the Jaccard coefficient
(Jaccard 1901). For numerical characters (e.g., a size
measured in milllimetre), the set of values (i.e., all the
values entered for C for all taxa) is split in two intervals.
In order to determine the threshold value separating
these two intervals, we consider the list of min and max
value of C for each of the remaining taxa. We then choose
from this list the value that separates the remaining
taxa into two groups of equal size (±1 taxon). These
two intervals are considered as 2 discrete states for the
calculation of the discriminant power.
DPOWER iterates over the available taxa (i.e., those
that are compatible with the current description), and
determines, for each pair of taxa Ta and Tb), the
dissimilarity(i.e.,thepossibilityofdiscriminatingTa and
Tb with C) that is based on the number of common states
of C for Ta and Tb (n11), the number of states of C that
occur only for Ta ( n10), the number of states of C that
occur only for Tb (n01), and the number of states of C that
do not occur for neither Ta or Tb (n00). The discriminant
power is then calculated as
DP=
a b
SCORE(n11,n10,n01,n00).
Three different SCORE measurements are currently
available in IKey+, the Xper coefficient (Ung et al. 2010;
Vignes et al. 1989), the Jaccard coefficient, and the Sokal
and Michener coefficient.
BENCHMARKS
Tests
In order to assess the performance of our web
service, we conducted a series of tests using the
same input file for every test. This file contains a
data set representing the subfamily Cichorieae of the
plant family Asteraceae with 303 observable characters,
144 taxa, and their descriptions. This data set was
generated and curated as an exemplar group for the
EDIT project (Hand et al. 2009) [the file is available
here: http://www.infosyslab.fr/vibrant/project/test/
Cichorieae-fullSDD.xml (last accessed 13 August 2012)].
The aim of these tests was to evaluate the overall
performance of the algorithm, to assess the impact
of the various parameters available to the end-user
on performance, and to ensure that IKey+ would be
robust in a production environment. We measured the
time necessary for the web service to respond to 100
identification keys generation queries, while counting
the number of rejected queries due to CPU overload
(IKey+ rejects any query received when the CPU load
of the host server is greater than 80%). We ran one
reference test (described in Supplementary Materials,
appendix 4, doi:10.5061/dryad.3ft19), and several tests
with variations on the input parameters, the web service
communication protocol used or the parallelization
setups. The complete list of performance test setups
is available in Supplementary Materials, appendix 5,
doi:10.5061/dryad.3ft19. We also measured the time
necessarytogenerateasinglekey,usingthereferencetest
configuration described in Supplementary Materials,
appendix 4, doi:10.5061/dryad.3ft19. Finally, we tested
the usability of IKey+ from an end-user perspective,
when generating the Cichorieae identification key using
the web interface. We asked 11 persons who were not
involved in the development of IKey+ to test the web
service and the web interface. They were given the
SDD formated Cichorieae data set and were asked to
use the web interface to create the identification key.
We measured the time needed by each test subject to
generate the identification key.
Results
The average length of the paths leading to taxa
that were identified by the algorithm is 4.67 steps,
with the shortest path being 1 step long, and the
longest being 10 steps long. The generation of a single
atFreieUniversitaetBerlinonSeptember18,2012http://sysbio.oxfordjournals.org/Downloadedfrom
Copyedited by: GS MANUSCRIPT CATEGORY: Article
[13:42 7/9/2012 Sysbio-sys069.tex] Page: 4 1–5
4 SYSTEMATIC BIOLOGY
key took roughly 1.8 s. The results of the other tests
are shown in Supplementary Materials, appendix 6,
doi:10.5061/dryad.3ft19. The reference test took roughly
50 s to complete, that is, 500 ms per query, which is
consistent with the time measured for the generation
of a single key, because the reference test uses 4
simultaneous threads to generate the keys. In this test,
few queries were rejected due to CPU overload. Among
the parameters available to the end-user, only the score
method parameter had a significant influence: when
using the Sokal and Michener score method or the
Jaccard score method, the tests took longer to finish
(∼80 s instead of 50 s). The score method parameter had
no impact on the number of rejected queries. Our tests
showed that the communication protocol used (SOAP or
REST) had no influence on the performance of IKey+.
Some taxa do not appear in the generated key, due to
insufficient data in the input file. Indeed, the Cichorieae
data set we used for our tests was created by an
external team. This is because some errors were made
during creation of the Cichorieae data set: some taxa
with unknown data were not specifically marked as
“unknown data”, but were left unspecified instead.
Regardless, although these ambiguously coded taxa do
not appear in the resulting identification key by default,
the end-user can choose to have them appear in the key.
When modifying the parallelization of the queries,
we observed significant performance variations. As can
be expected, when launching 100 queries sequentially
(instead of using 4 simultaneous threads launching 25
queries each), the test was 4 times longer, and no queries
were rejected. When we augmented the parallelism
of the queries (e.g., 25 or 100 simultaneous queries),
more queries were rejected (up to 50%). However, when
launching 100 simultaneous queries, with a random
delay at the beginning of each thread, the results (both in
time and number of rejected queries) were comparable to
the performance of the reference test. In the usability test,
the average time needed to generate the identification
key was slightly above 60 s (60.36 s), with the shortest
time measured at 26 s, and the longest time measured
at 121 s. The complete results of the usability test
are available in Supplementary Materials, appendix 7,
doi:10.5061/dryad.3ft19.
DISCUSSION
IKey+ is available and can be installed on any J2EE
application server (e.g., Apache Tomcat). It can generate
single-access keys using a tree or a flat representation
in several output formats (HTML, Wiki, SDD, plain-
text, etc.).
Our test showed that IKey+ performs sufficiently well
to handle a large data set in a relatively short amount of
time and can generate well-optimized key files (average
number of steps to identify a taxon: 4.67). This, combined
with the web service accessibility, makes it possible to
integrate IKey+ in many workflows that might require
a fast and automated key generation process (e.g., a batch
key generation script). Furthermore, an end-user can use
the web interface available at www.identificationkey.fr
to quickly generate a customized identification key,
using the numerous parameters available (affecting the
topology, representation, file format, etc.).
Finally, our tests showed that IKey+ is likely to
be robust in a production environment (i.e., many
simultaneous queries) as it is able to withstand
simultaneous key generation queries (e.g., 4 threads
launching 25 consecutive queries). It is also protected
from cryptic failure, because we implemented a CPU-
load-watching mechanism that automatically rejects a
query (with an explicit error message) whenever the
CPU load exceeds a given threshold (80%). This prevents
a crash of the service, or the generation of incomplete or
corrupt key files.
CONCLUSION
IKey+ is the first key-generation tool available as
a web service with standardized input and output
formats. Our test showed that IKey+ is able to generate
keys rapidly and that it can also be used by an end-user
with the web interface. Finally, the modular and open-
source nature of IKey+ makes it possible for anyone to
reuseitscomponents.Forinstance,weplantoreusesome
components of the API to develop another web service
that would provide free-access key identification.
LICENSING
As part of the ViBRANT project, IKey+’s source
code is freely available and is licensed under the GNU
General Public License version 2. It is already available
on our google code SVN repository: http://ikey-plus.
googlecode.com/svn/trunk/ (last accessed 13 August
2012). It will be actively maintained by our team for the
next 2 years.
SUPPLEMENTARY MATERIAL
Supplementary material, including Algorithms and
appendices, can be found at http://www.sysbio
.oxfordjournals.org.
FUNDING
This work was supported by the European Union
funded FP7 ViBRANT Project (Contract number RI-
261532, Period, December 2010 to November 2013).
ACKNOWLEDGEMENTS
We sincerely thank Gregor Hagedorn (Julius Kühn
Institute, Berlin, Germany) and Andreas Müller
(Botanical Garden and Botanical Museum, Berlin,
atFreieUniversitaetBerlinonSeptember18,2012http://sysbio.oxfordjournals.org/Downloadedfrom
Copyedited by: GS MANUSCRIPT CATEGORY: Article
[13:42 7/9/2012 Sysbio-sys069.tex] Page: 5 1–5
2012 BURGUIERE ET AL.—Ikey+ 5
Germany) for sharing their knowledge on the SDD
format. We are also grateful to Dave Roberts (Natural
History Museum, London, UK) for reviewing an early
version of the article and providing style improvements.
REFERENCES
BIOTA. Available from: URL http://www.edinburgh.ceh.ac.uk/
biota/ (last accessed 13 August 2012).
Breiman L., Friedman J.H., Olshen R.A. Stone C.J. 1984. Classification
and regression trees. Belmont, CA:Wadsworth International Group.
Catalogue of Life. Available from: URL http://www.catalogueoflife.
org/ (last accessed 13 August 2012).
Dallwitz M.J., Paine T.A., Zurcher E.J. 1993. User’s guide to the delta
system: a general system for coding taxonomic descriptions. 4th
ed. Available from: URL http://delta-intkey.com (last accessed
13 August 2012).
Delgado-Calvo-Flores M., Fajardo-Contreras W., Gibaja-Galindo E.L.,
Perez-Perez R. 2006. Xkey: a tool for the generation of identification
keys. Expert Syst. Appl. 30:337–351.
EDIT. European Distributed Institute of Taxonomy. Available from:
URL http://www.e-taxonomy.eu/ (last accessed 13 August 2012).
EOL. Encyclopaedia of Life. Available from: URL http://eol.org/.
GBIF. Global Biodiversity Information Facility. Available from: URL
http://www.gbif.org/ (last accessed 13 August 2012).
Gérard D., Vignes-Lebbe R. 2010. Mykey: a server-side software to
create customized decision trees. In: Nimis P.L., Vignes-Lebbe R.,
editors. Tools for identifying biodiversity: progress and problems.
Edizioni Università di Trieste, Trieste, Italy. p. 107–112.
Gower J.C., Payne R.W. 1975. A comparison of different criteria
for selecting binary tests in diagnostic keys. Biometrika
62:665–672.
Gyllenberg H.G. 1963. A general method for deriving determinative
schemes for random collections of microbial isolates. Ann. Acad.
Scient. Fenn. Ser. A IV. Biologica 1(69):1–23.
Hagedorn G. 2006. The structured descriptive data (SDD) w3c-xml-
schema. Version 1.1 Available from: URL http://wiki.tdwg.org/
twiki/bin/view/SDD/Version1dot1 (last accessed 13 August 2012).
Hagedorn G., Rambold G., Martellos S. 2010. Types of identification
keys. In: Nimis P.L., Vignes-Lebbe R. editors. Tools for identifying
biodiversity: progress and problems. Edizioni Università di Trieste,
Trieste, Italy. p. 59–64.
Hand R., Kilian N., Raab-Straube E. 2009. International cichorieae
network: Cichorieae portal. Available from: URL http://wp6-
cichorieae.e-taxonomy.eu/portal/ (last accessed 13 August 2012).
Jaccard P. 1901. Étude comparative de la distribution florale dans une
portion des alpes et des jura. Bull. Soc. Vaud. Sci. Nat. 37:547–579.
J2EE. Java 2, Enterprise Edition. Available from: URL http://www.
oracle.com/technetwork/java/javaee/overview/index.html (last
accessed 13 August 2012).
KeyToNature. Available from: URL http://www.keytonature.
eu/wiki/ (last accessed 13 August 2012).
Pankhurst R.J. 1988. Pankey programs. DELTA Newsletter 1:2.
Pankhurst R.J. 1991. Practical taxonomic computing. Cambridge
University Press, Cambridge, UK.
Quinlan J.R. 1986. Induction of decision trees. Mach. Learn. 1:81–
106. ISSN 0885-6125. Available from: URL http://dx.doi.org/
10.1007/BF00116251 (last accessed 13 August 2012).
REST Architecture. Available from: URL http://www.oracle.com/
technetwork/articles/javase/index-137171.html (last accessed
13 August 2012).
Scratchpads. Biodiversity Online. Available from: URL http://
scratchpads.eu/ (last accessed 13 August 2012).
SOAP. Simple Object Access Protocol, W3C Recommandation. Version
1.2. Available from: URL http://www.w3.org/TR/soap/ (last
accessed 13 August 2012).
Sokal R., Michener C. 1958. A statistical method for evaluating
systematic relationships. Univ. Kansas Sci. Bull., (38):1409–1438.
Ung V., Dubus G., Zaragüeta-Bagils R., Vignes-Lebbe R. 2010. Xper2:
introducing e-taxonomy. Bioinformatics 26(5):703–704.
ViBRANTa. Objectives. Available from: URL http://vbrant.eu/
node/1 (last accessed 13 August 2012).
ViBRANTb. Virtual Biodiversity Research and Access Network for
Taxonomy. Available from: URL http://vbrant.eu (last accessed
13 August 2012).
Vignes R., Lebbe J., Darmoni S. 1989. Symbolic-numeric approach
for biological knowledge representation: a medical example with
creation of identification graphs. In E. Diday, editor, Proceedings
of the conference on Data analysis, learning symbolic and numeric
knowledge.NovaSciencePublishers,Inc.Commack,NY,USA.ISBN
0-941743-64-0. p. 389–398.
Wheeler, Q.D., editor 2008. The new taxonomy. CRC Press Inc. New
York, USA.
atFreieUniversitaetBerlinonSeptember18,2012http://sysbio.oxfordjournals.org/Downloadedfrom

Weitere ähnliche Inhalte

Ähnlich wie Syst biol 2012-burguiere-sysbio sys069

FAIR Computational Workflows
FAIR Computational WorkflowsFAIR Computational Workflows
FAIR Computational Workflows Carole Goble
 
Linking data, models and tools an overview
Linking data, models and tools an overviewLinking data, models and tools an overview
Linking data, models and tools an overviewGennadii Donchyts
 
OSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications databaseOSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications databaseOpen Science Fair
 
FAIR Computational Workflows
FAIR Computational WorkflowsFAIR Computational Workflows
FAIR Computational WorkflowsCarole Goble
 
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...Kim Daniels
 
e-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCubee-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCubeFAO
 
A_Middleware_based_on_Service_Oriented_Architectur.pdf
A_Middleware_based_on_Service_Oriented_Architectur.pdfA_Middleware_based_on_Service_Oriented_Architectur.pdf
A_Middleware_based_on_Service_Oriented_Architectur.pdf12rno
 
Semantic Interoperability - grafi della conoscenza
Semantic Interoperability - grafi della conoscenzaSemantic Interoperability - grafi della conoscenza
Semantic Interoperability - grafi della conoscenzaGiorgia Lodi
 
03. revised paper edit iq
03. revised paper edit iq03. revised paper edit iq
03. revised paper edit iqIAESIJEECS
 
PATHS state of the art monitoring report
PATHS state of the art monitoring reportPATHS state of the art monitoring report
PATHS state of the art monitoring reportpathsproject
 
A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...CSCJournals
 
Scientific Workflows: what do we have, what do we miss?
Scientific Workflows: what do we have, what do we miss?Scientific Workflows: what do we have, what do we miss?
Scientific Workflows: what do we have, what do we miss?Paolo Romano
 
Ingredients for Semantic Sensor Networks
Ingredients for Semantic Sensor NetworksIngredients for Semantic Sensor Networks
Ingredients for Semantic Sensor NetworksOscar Corcho
 
A SESERV methodology for tussle analysis in Future Internet technologies - In...
A SESERV methodology for tussle analysis in Future Internet technologies - In...A SESERV methodology for tussle analysis in Future Internet technologies - In...
A SESERV methodology for tussle analysis in Future Internet technologies - In...ictseserv
 
The Revolution Of Cloud Computing
The Revolution Of Cloud ComputingThe Revolution Of Cloud Computing
The Revolution Of Cloud ComputingCarmen Sanborn
 
A study of existing ontologies in the io t domain
A study of existing ontologies in the io t domainA study of existing ontologies in the io t domain
A study of existing ontologies in the io t domainSof Ouni
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021Gérard Dupont
 
Tell Me Quality Documentation
Tell Me Quality DocumentationTell Me Quality Documentation
Tell Me Quality DocumentationMarco Berlot
 

Ähnlich wie Syst biol 2012-burguiere-sysbio sys069 (20)

FAIR Computational Workflows
FAIR Computational WorkflowsFAIR Computational Workflows
FAIR Computational Workflows
 
Linking data, models and tools an overview
Linking data, models and tools an overviewLinking data, models and tools an overview
Linking data, models and tools an overview
 
OSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications databaseOSFair2017 Workshop | EGI applications database
OSFair2017 Workshop | EGI applications database
 
FAIR Computational Workflows
FAIR Computational WorkflowsFAIR Computational Workflows
FAIR Computational Workflows
 
Reproducible Science and Deep Software Variability
Reproducible Science and Deep Software VariabilityReproducible Science and Deep Software Variability
Reproducible Science and Deep Software Variability
 
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...
 
e-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCubee-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCube
 
A_Middleware_based_on_Service_Oriented_Architectur.pdf
A_Middleware_based_on_Service_Oriented_Architectur.pdfA_Middleware_based_on_Service_Oriented_Architectur.pdf
A_Middleware_based_on_Service_Oriented_Architectur.pdf
 
UCIAD overview
UCIAD overviewUCIAD overview
UCIAD overview
 
Semantic Interoperability - grafi della conoscenza
Semantic Interoperability - grafi della conoscenzaSemantic Interoperability - grafi della conoscenza
Semantic Interoperability - grafi della conoscenza
 
03. revised paper edit iq
03. revised paper edit iq03. revised paper edit iq
03. revised paper edit iq
 
PATHS state of the art monitoring report
PATHS state of the art monitoring reportPATHS state of the art monitoring report
PATHS state of the art monitoring report
 
A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...
 
Scientific Workflows: what do we have, what do we miss?
Scientific Workflows: what do we have, what do we miss?Scientific Workflows: what do we have, what do we miss?
Scientific Workflows: what do we have, what do we miss?
 
Ingredients for Semantic Sensor Networks
Ingredients for Semantic Sensor NetworksIngredients for Semantic Sensor Networks
Ingredients for Semantic Sensor Networks
 
A SESERV methodology for tussle analysis in Future Internet technologies - In...
A SESERV methodology for tussle analysis in Future Internet technologies - In...A SESERV methodology for tussle analysis in Future Internet technologies - In...
A SESERV methodology for tussle analysis in Future Internet technologies - In...
 
The Revolution Of Cloud Computing
The Revolution Of Cloud ComputingThe Revolution Of Cloud Computing
The Revolution Of Cloud Computing
 
A study of existing ontologies in the io t domain
A study of existing ontologies in the io t domainA study of existing ontologies in the io t domain
A study of existing ontologies in the io t domain
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
 
Tell Me Quality Documentation
Tell Me Quality DocumentationTell Me Quality Documentation
Tell Me Quality Documentation
 

Kürzlich hochgeladen

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Kürzlich hochgeladen (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Syst biol 2012-burguiere-sysbio sys069

  • 1. Copyedited by: GS MANUSCRIPT CATEGORY: Article [13:42 7/9/2012 Sysbio-sys069.tex] Page: 1 1–5 Software for Systematics and Evolution Syst. Biol. 0(0):1–5, 2012 © The Author(s) 2012. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com DOI:10.1093/sysbio/sys069 IKey+: A New Single-Access Key Generation Web Service THOMAS BURGUIERE∗, FLORIAN CAUSSE, VISOTHEARY UNG, AND RÉGINE VIGNES-LEBBE Laboratoire Informatique et Systématique, Museum National d’Histoire Naturelle, Département Histoire de la Terre, UMR 7207 – C.N.R.S. - M.N.H.N. - U.P.M.C., Paris, France ∗Correspondence to be sent to: Thomas Burguiere, Laboratoire Informatique et Systématique, M.N.H.N Département Histoire de la Terre, Bâtiment de Géologie, CP48, 57 rue Cuvier, 75005 Paris, France, E-mail: thomas.burguiere@upmc.fr Thomas Burguiere and Florian Causse contributed equally to this article. Received 2 April 2012; reviews returned 29 June 2012; accepted 25 July 2012 Associate Editor: David Posada Abstract.—Single-access keys are a major tool for biologists who need to identify specimens. The construction process of these keys is particularly complex (especially if the input data set is large) so having an automatic single-access key generation tool is essential. As part of the European project ViBRANT, our aim was to develop such a tool as a web service, thus allowing end-users to integrate it directly into their workflow. IKey+generates single-access keys on demand, for single users or research institutions. It receives user input data (using the standard SDD format), accepts several key-generation parameters (affecting the key topology and representation), and supports several output formats. IKey+ is freely available (sources and binary packages) at www.identificationkey.fr. Furthermore, it is deployed on our server and can be queried (for testing purposes) through a simple web client also available at www.identificationkey.fr (last accessed 13 August 2012). Finally, a client plugin will be integrated to the Scratchpads biodiversity networking tool (scratchpads.eu). [Systematics; taxonomy; single-access key; web service; biodiversity informatics.] During the last decade, taxonomy has been going through a revolution towards cybertaxonomy (Wheeler 2008). This is particularly obvious when looking at the numerous initiatives dedicated to the management and monitoring of biodiversity data such as Global Biodiversity Information Facility (GBIF), European Distributed Institute of Taxonomy (EDIT), BIOTA, Catalogue of Life, Encyclopedia of Life (EOL) or Key To Nature, and now ViBRANT. A significant increase of digitized information should be expected in a near future, and the deep social and scientific impact of web 2.0 stimulates sharing of digital data and proposals of cybertaxonomy projects to study global biodiversity and climate changes. In the current context of distributed knowledge and long-distance collaborations, the European project ViBRANT (http://vbrant.eu, last accessed 13 August 2012) has emerged to bridge the existing gap between tools and services, providing a powerful exemplar platform, the Scratchpads, for the entire community. ViBRANT is about connecting the people, data and science of biodiversity (ViBRANT - Objectives) As studying and monitoring biodiversity remains a strong challenge for biologists, descriptive data management and identification tools are needed to assist them in their daily work. For instance, we developed a complete descriptive data management tool: Xper2 (Ung et al. 2010). Our contribution to ViBRANT is the implementation of an identification key generation web service, because we also have a significant experience in developing key generation tools such as MyKey (Gérard and Vignes- Lebbe 2010). The most common identification keys (described in Hagedorn et al. 2010) are the printed single-access keys published in almost all taxonomic revisionary work. But these printed keys may be difficult to use while collecting samples in the field or when consulting incomplete specimens. Hence, electronic single-access keys, which can be quickly generated with different parameters in order to bypass potentially problematic steps (i.e., steps with characters for which the specimen studied in the field cannot provide information) are a good alternative. APPROACH Web Service There are currently several tools designed to generate single-access keys such as, among others, the Delta software suite (Dallwitz et al. 1993), Pankey (Pankhurst 1988), and MyKey. These tools are well established and produce valid identification keys, but are not without drawbacks. For instance, both Delta and Pankey need to be installed on the end-user’s machine and are not cross- platform compatible (Windows, Apple, and Linux). Mykey is a web application, which makes it platform- agnostic, but the end-users cannot submit their own data set to generate a key (the data set has to be manually uploaded by someone in our laboratory). Furthermore, these 3 tools do not use the biodiversity description standard file format, Structured Descriptive Data (SDD) (Hagedorn 2006), for data input. Our objective, within the ViBRANT project, was to design a free and open-source key generation service using the SDD file format that can be used by independent users or through the ViBRANT OBOE service (also available through Scratchpads). It was thus decided to develop a web service (implementing both the SOAP and REST web service communication 1 Systematic Biology Advance Access published September 16, 2012 atFreieUniversitaetBerlinonSeptember18,2012http://sysbio.oxfordjournals.org/Downloadedfrom
  • 2. Copyedited by: GS MANUSCRIPT CATEGORY: Article [13:42 7/9/2012 Sysbio-sys069.tex] Page: 2 1–5 2 SYSTEMATIC BIOLOGY FIGURE 1. IKey+use cases. protocols), which is particularly suitable for these two usecases(cf.Fig.1)andallowstheclienttobuildcomplex workflows, with a high level of automation. Architecture Because the main purpose of ViBRANT is to provide a platform that is both open source and easily reusable, we organized IKey+ in two parts: • an Application Programming Interface (API) consisting of 3 distinct modules; the Data Model, that is, the computer representation of the descriptive data, the Input-Output (IO) module, whichparsestheSDDinputfilesandloadsthedata in the Model and generates the output files, and the Algorithm, which uses the data contained in the Model to generate a key and returns it through the IO module. • a Simple Object Access Protocol (SOAP) or a REpresentational State Transfer (REST) Service Layer encapsulating the API, which manages the communication between the client and the API. This allows potential developers to integrate the service into their workflows and adapt it to their specific needs (e.g., integrating the API in a standalone software). The whole application was developed using the J2EE (Java 2, Enterprise Edition) programming environment. Web Service Input Parameters IKey+ has several parameters available to the end- user. These parameters allow the end-user to change the topology of the key (e.g., pruning the key, promoting some characters), change the visual aspect of the key, or the output format. A complete list of these parameters can be found in the user documentation, available here: http://www.identificationkey.fr/resources/docs/ident ificationKeyGeneratorWS_UserGuide.pdf (last accessed 13 August 2012). Algorithm The single-access key generation algorithm is a recursive depth-first graph construction algorithm. Its arguments are a list of taxa, taxaList and the list of character considered, charList (cf. Supplementary Materials, appendices 1, 2 and 3, available at http:// datadryad.org, doi:10.5061/dryad.3ft19). The resulting graph (cf. Fig. 2) is a directed acyclic graph, consisting of non-terminal nodes labelled with a character, edges labelled with the states of the character of the previous node, and terminal nodes labelled with taxa. At each step of the algorithm, the best character among those available is selected by the BEST_CHAR function, which iterates over the available characters and returns the character with the greatest discriminant power. Discriminant Power of a Character During the last 50 years, estimating a character’s discriminant power has been the central point of key- construction algorithms. Many measurement methods have been suggested, and a review of these methods can be found in Gower and Payne (1975), Pankhurst (1991), and Delgado-Calvo-Flores et al. (2006). In a statistical context, examples include the Bayesian probability and generalized entropy, such as the Shannon entropy used in the ID3 algorithm developed by Quinlan (1986) or the Gini index used in Breiman et al. (1984). In a context where there is no probability associated with the states of the characters, one can use the separation coefficient or the variance as measurement methods of a character’s discriminant power. In IKey+, the discriminant power of a given character C is calculated by the DPOWER function and is an estimate of C’s ability to differentiate the taxa of the current list. DPOWER is an extension of the Gyllenberg separation factor (Gyllenberg 1963) and is a measurement of the number of pairs of taxa discriminated by C, which is a generalization of the variance. Indeed, the variance of a variable can be computed by comparing all pairs of values (here, comparing all pairs of taxa description for C). With different comparison functions, each adapted to a certain type of character (e.g., a categorical or a numerical character), it is possible to obtain a generalized formula to estimate the discriminant power of a given character, even with polymorphic characters or characters with missing data. For categorical characters, DPOWER can use a binary function (boolean comparison) or other atFreieUniversitaetBerlinonSeptember18,2012http://sysbio.oxfordjournals.org/Downloadedfrom
  • 3. Copyedited by: GS MANUSCRIPT CATEGORY: Article [13:42 7/9/2012 Sysbio-sys069.tex] Page: 3 1–5 2012 BURGUIERE ET AL.—Ikey+ 3 Character I Character II State I-a Taxon 1 State I-b Character III State II-a Taxon 2 State II-b Character IV State II-c OR State II-d Taxon 3 State III-a Taxon 4 State III-b Taxon 5 State IV-a Taxon 6, Taxon 7 State IV-b FIGURE 2. Formal identification key example. functions such as the Sokal and Michener coefficient (Sokal and Michener 1958) or the Jaccard coefficient (Jaccard 1901). For numerical characters (e.g., a size measured in milllimetre), the set of values (i.e., all the values entered for C for all taxa) is split in two intervals. In order to determine the threshold value separating these two intervals, we consider the list of min and max value of C for each of the remaining taxa. We then choose from this list the value that separates the remaining taxa into two groups of equal size (±1 taxon). These two intervals are considered as 2 discrete states for the calculation of the discriminant power. DPOWER iterates over the available taxa (i.e., those that are compatible with the current description), and determines, for each pair of taxa Ta and Tb), the dissimilarity(i.e.,thepossibilityofdiscriminatingTa and Tb with C) that is based on the number of common states of C for Ta and Tb (n11), the number of states of C that occur only for Ta ( n10), the number of states of C that occur only for Tb (n01), and the number of states of C that do not occur for neither Ta or Tb (n00). The discriminant power is then calculated as DP= a b SCORE(n11,n10,n01,n00). Three different SCORE measurements are currently available in IKey+, the Xper coefficient (Ung et al. 2010; Vignes et al. 1989), the Jaccard coefficient, and the Sokal and Michener coefficient. BENCHMARKS Tests In order to assess the performance of our web service, we conducted a series of tests using the same input file for every test. This file contains a data set representing the subfamily Cichorieae of the plant family Asteraceae with 303 observable characters, 144 taxa, and their descriptions. This data set was generated and curated as an exemplar group for the EDIT project (Hand et al. 2009) [the file is available here: http://www.infosyslab.fr/vibrant/project/test/ Cichorieae-fullSDD.xml (last accessed 13 August 2012)]. The aim of these tests was to evaluate the overall performance of the algorithm, to assess the impact of the various parameters available to the end-user on performance, and to ensure that IKey+ would be robust in a production environment. We measured the time necessary for the web service to respond to 100 identification keys generation queries, while counting the number of rejected queries due to CPU overload (IKey+ rejects any query received when the CPU load of the host server is greater than 80%). We ran one reference test (described in Supplementary Materials, appendix 4, doi:10.5061/dryad.3ft19), and several tests with variations on the input parameters, the web service communication protocol used or the parallelization setups. The complete list of performance test setups is available in Supplementary Materials, appendix 5, doi:10.5061/dryad.3ft19. We also measured the time necessarytogenerateasinglekey,usingthereferencetest configuration described in Supplementary Materials, appendix 4, doi:10.5061/dryad.3ft19. Finally, we tested the usability of IKey+ from an end-user perspective, when generating the Cichorieae identification key using the web interface. We asked 11 persons who were not involved in the development of IKey+ to test the web service and the web interface. They were given the SDD formated Cichorieae data set and were asked to use the web interface to create the identification key. We measured the time needed by each test subject to generate the identification key. Results The average length of the paths leading to taxa that were identified by the algorithm is 4.67 steps, with the shortest path being 1 step long, and the longest being 10 steps long. The generation of a single atFreieUniversitaetBerlinonSeptember18,2012http://sysbio.oxfordjournals.org/Downloadedfrom
  • 4. Copyedited by: GS MANUSCRIPT CATEGORY: Article [13:42 7/9/2012 Sysbio-sys069.tex] Page: 4 1–5 4 SYSTEMATIC BIOLOGY key took roughly 1.8 s. The results of the other tests are shown in Supplementary Materials, appendix 6, doi:10.5061/dryad.3ft19. The reference test took roughly 50 s to complete, that is, 500 ms per query, which is consistent with the time measured for the generation of a single key, because the reference test uses 4 simultaneous threads to generate the keys. In this test, few queries were rejected due to CPU overload. Among the parameters available to the end-user, only the score method parameter had a significant influence: when using the Sokal and Michener score method or the Jaccard score method, the tests took longer to finish (∼80 s instead of 50 s). The score method parameter had no impact on the number of rejected queries. Our tests showed that the communication protocol used (SOAP or REST) had no influence on the performance of IKey+. Some taxa do not appear in the generated key, due to insufficient data in the input file. Indeed, the Cichorieae data set we used for our tests was created by an external team. This is because some errors were made during creation of the Cichorieae data set: some taxa with unknown data were not specifically marked as “unknown data”, but were left unspecified instead. Regardless, although these ambiguously coded taxa do not appear in the resulting identification key by default, the end-user can choose to have them appear in the key. When modifying the parallelization of the queries, we observed significant performance variations. As can be expected, when launching 100 queries sequentially (instead of using 4 simultaneous threads launching 25 queries each), the test was 4 times longer, and no queries were rejected. When we augmented the parallelism of the queries (e.g., 25 or 100 simultaneous queries), more queries were rejected (up to 50%). However, when launching 100 simultaneous queries, with a random delay at the beginning of each thread, the results (both in time and number of rejected queries) were comparable to the performance of the reference test. In the usability test, the average time needed to generate the identification key was slightly above 60 s (60.36 s), with the shortest time measured at 26 s, and the longest time measured at 121 s. The complete results of the usability test are available in Supplementary Materials, appendix 7, doi:10.5061/dryad.3ft19. DISCUSSION IKey+ is available and can be installed on any J2EE application server (e.g., Apache Tomcat). It can generate single-access keys using a tree or a flat representation in several output formats (HTML, Wiki, SDD, plain- text, etc.). Our test showed that IKey+ performs sufficiently well to handle a large data set in a relatively short amount of time and can generate well-optimized key files (average number of steps to identify a taxon: 4.67). This, combined with the web service accessibility, makes it possible to integrate IKey+ in many workflows that might require a fast and automated key generation process (e.g., a batch key generation script). Furthermore, an end-user can use the web interface available at www.identificationkey.fr to quickly generate a customized identification key, using the numerous parameters available (affecting the topology, representation, file format, etc.). Finally, our tests showed that IKey+ is likely to be robust in a production environment (i.e., many simultaneous queries) as it is able to withstand simultaneous key generation queries (e.g., 4 threads launching 25 consecutive queries). It is also protected from cryptic failure, because we implemented a CPU- load-watching mechanism that automatically rejects a query (with an explicit error message) whenever the CPU load exceeds a given threshold (80%). This prevents a crash of the service, or the generation of incomplete or corrupt key files. CONCLUSION IKey+ is the first key-generation tool available as a web service with standardized input and output formats. Our test showed that IKey+ is able to generate keys rapidly and that it can also be used by an end-user with the web interface. Finally, the modular and open- source nature of IKey+ makes it possible for anyone to reuseitscomponents.Forinstance,weplantoreusesome components of the API to develop another web service that would provide free-access key identification. LICENSING As part of the ViBRANT project, IKey+’s source code is freely available and is licensed under the GNU General Public License version 2. It is already available on our google code SVN repository: http://ikey-plus. googlecode.com/svn/trunk/ (last accessed 13 August 2012). It will be actively maintained by our team for the next 2 years. SUPPLEMENTARY MATERIAL Supplementary material, including Algorithms and appendices, can be found at http://www.sysbio .oxfordjournals.org. FUNDING This work was supported by the European Union funded FP7 ViBRANT Project (Contract number RI- 261532, Period, December 2010 to November 2013). ACKNOWLEDGEMENTS We sincerely thank Gregor Hagedorn (Julius Kühn Institute, Berlin, Germany) and Andreas Müller (Botanical Garden and Botanical Museum, Berlin, atFreieUniversitaetBerlinonSeptember18,2012http://sysbio.oxfordjournals.org/Downloadedfrom
  • 5. Copyedited by: GS MANUSCRIPT CATEGORY: Article [13:42 7/9/2012 Sysbio-sys069.tex] Page: 5 1–5 2012 BURGUIERE ET AL.—Ikey+ 5 Germany) for sharing their knowledge on the SDD format. We are also grateful to Dave Roberts (Natural History Museum, London, UK) for reviewing an early version of the article and providing style improvements. REFERENCES BIOTA. Available from: URL http://www.edinburgh.ceh.ac.uk/ biota/ (last accessed 13 August 2012). Breiman L., Friedman J.H., Olshen R.A. Stone C.J. 1984. Classification and regression trees. Belmont, CA:Wadsworth International Group. Catalogue of Life. Available from: URL http://www.catalogueoflife. org/ (last accessed 13 August 2012). Dallwitz M.J., Paine T.A., Zurcher E.J. 1993. User’s guide to the delta system: a general system for coding taxonomic descriptions. 4th ed. Available from: URL http://delta-intkey.com (last accessed 13 August 2012). Delgado-Calvo-Flores M., Fajardo-Contreras W., Gibaja-Galindo E.L., Perez-Perez R. 2006. Xkey: a tool for the generation of identification keys. Expert Syst. Appl. 30:337–351. EDIT. European Distributed Institute of Taxonomy. Available from: URL http://www.e-taxonomy.eu/ (last accessed 13 August 2012). EOL. Encyclopaedia of Life. Available from: URL http://eol.org/. GBIF. Global Biodiversity Information Facility. Available from: URL http://www.gbif.org/ (last accessed 13 August 2012). Gérard D., Vignes-Lebbe R. 2010. Mykey: a server-side software to create customized decision trees. In: Nimis P.L., Vignes-Lebbe R., editors. Tools for identifying biodiversity: progress and problems. Edizioni Università di Trieste, Trieste, Italy. p. 107–112. Gower J.C., Payne R.W. 1975. A comparison of different criteria for selecting binary tests in diagnostic keys. Biometrika 62:665–672. Gyllenberg H.G. 1963. A general method for deriving determinative schemes for random collections of microbial isolates. Ann. Acad. Scient. Fenn. Ser. A IV. Biologica 1(69):1–23. Hagedorn G. 2006. The structured descriptive data (SDD) w3c-xml- schema. Version 1.1 Available from: URL http://wiki.tdwg.org/ twiki/bin/view/SDD/Version1dot1 (last accessed 13 August 2012). Hagedorn G., Rambold G., Martellos S. 2010. Types of identification keys. In: Nimis P.L., Vignes-Lebbe R. editors. Tools for identifying biodiversity: progress and problems. Edizioni Università di Trieste, Trieste, Italy. p. 59–64. Hand R., Kilian N., Raab-Straube E. 2009. International cichorieae network: Cichorieae portal. Available from: URL http://wp6- cichorieae.e-taxonomy.eu/portal/ (last accessed 13 August 2012). Jaccard P. 1901. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bull. Soc. Vaud. Sci. Nat. 37:547–579. J2EE. Java 2, Enterprise Edition. Available from: URL http://www. oracle.com/technetwork/java/javaee/overview/index.html (last accessed 13 August 2012). KeyToNature. Available from: URL http://www.keytonature. eu/wiki/ (last accessed 13 August 2012). Pankhurst R.J. 1988. Pankey programs. DELTA Newsletter 1:2. Pankhurst R.J. 1991. Practical taxonomic computing. Cambridge University Press, Cambridge, UK. Quinlan J.R. 1986. Induction of decision trees. Mach. Learn. 1:81– 106. ISSN 0885-6125. Available from: URL http://dx.doi.org/ 10.1007/BF00116251 (last accessed 13 August 2012). REST Architecture. Available from: URL http://www.oracle.com/ technetwork/articles/javase/index-137171.html (last accessed 13 August 2012). Scratchpads. Biodiversity Online. Available from: URL http:// scratchpads.eu/ (last accessed 13 August 2012). SOAP. Simple Object Access Protocol, W3C Recommandation. Version 1.2. Available from: URL http://www.w3.org/TR/soap/ (last accessed 13 August 2012). Sokal R., Michener C. 1958. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull., (38):1409–1438. Ung V., Dubus G., Zaragüeta-Bagils R., Vignes-Lebbe R. 2010. Xper2: introducing e-taxonomy. Bioinformatics 26(5):703–704. ViBRANTa. Objectives. Available from: URL http://vbrant.eu/ node/1 (last accessed 13 August 2012). ViBRANTb. Virtual Biodiversity Research and Access Network for Taxonomy. Available from: URL http://vbrant.eu (last accessed 13 August 2012). Vignes R., Lebbe J., Darmoni S. 1989. Symbolic-numeric approach for biological knowledge representation: a medical example with creation of identification graphs. In E. Diday, editor, Proceedings of the conference on Data analysis, learning symbolic and numeric knowledge.NovaSciencePublishers,Inc.Commack,NY,USA.ISBN 0-941743-64-0. p. 389–398. Wheeler, Q.D., editor 2008. The new taxonomy. CRC Press Inc. New York, USA. atFreieUniversitaetBerlinonSeptember18,2012http://sysbio.oxfordjournals.org/Downloadedfrom