1. The structure of insect-plant host data
as derived from museum collections:
An analysis based on data from the
NSF-funded Tritrophic Database —
Thematic Collections Network
(TTD-TCN)
Randall T. Schuh
Katja Seltmann
Christine A. Johnson
American Museum of Natural History
2. TTD-TCN Rationale
“The data captured via ADBC funding will
dramatically improve our understanding of the
relationships among the more than 11,000
species of North American Hemiptera (scale
insects, aphids, leafhoppers, true bugs, and
relatives), their food plants, and the wasps that
parasitize the hemipterans.”
3. The data we will evaluate today were captured
through a Web-based application developed with
NSF Planetary Biodiversity Inventory funding and
used by the TTD-TCN. This software application,
known as Arthropod Easy Capture (AEC), is built in
open-source code, is being implemented as an
appliance by the ADBC-funded Home Uniting Biocollections (HUB, iDigBio), and through that
implementation will be able to be installed with a
“one-click” installation application. Server code is online at SourceForge:
http://sourceforge.net/projects/arthropodeasy/
6. Data on insect-plant relationships is available
primarily from labels on insect specimens, as
opposed to labels on plant specimens.
Substantial amounts of data were captured for
the family Miridae on a world basis under NSF
Planetary Biodiversity Inventory funding
between 2003 and 2011.
The TTD-TCN is a collaboration among 17 US
entomological institutions. The institutional
contributions from these two projects, as
represented by numbers of specimen records,
are seen in the following graph.
The TTD-TCN is defining the field structure for
host data as used by iDigBio and by other
Web aggregators such as DiscoverLife.org.
9. In order to evaluate the nature of insect-host plant data
derived from collections, we need to look at groups
that offer large data sets. Necessary attributes are:
1. Large numbers of specimen records with host
information
2. Large numbers of collecting events
3. Substantial diversity of host taxa
At the present time the following taxa in our database
meet those criteria:
10. Hemiptera
Sternorrhyncha
Aphididae (4,400 species worldwide)
Auchenorrhyncha
Membracidae (3,200 species worldwide)
Heteroptera
Miridae (11,000 species worldwide)
Raw data for each taxon are distributed as seen in
the following four graphs.
18. COLLECTING EVENT DATA:
The occurrence of an insect
species on a plant genus
ANALYSIS: evaluate insect/plant
associations with different scores
Modify algorithm to improve fit
of model to data based on results
Compute frequency
of occurrence on a
particular plant genus
Compare with all insect
collecting events on any plant
Scores: High, Medium, or Low
confidence in insect-plant
association
HEURISTIC DATA:
Larvae present?
Multiple specimens?
Voucher specimen available?
19. Decision tree
Data: specimen collecting events (x = y′ + y)
Analysis:
High: f(y) ≥ 15.00% and y ≥ 5
Medium: (f(y) ≥ 2.00% and y ≥ 3) ∨ (f(y) ≥ 15.00% and y ≥ 2)
Low: not high or medium
Single collecting event: x = 1
Heuristics (biological: larvae, vouchers, # specimens)
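The thresholds on this slide can be read as a small scoring function. The sketch below is illustrative only and is not the TTD-TCN code. It assumes that x is the total number of collecting events for an insect species, that y is the number of those events on one plant genus, and that f(y) = y / x; the slide gives only x = y′ + y, so that formula is an assumption. The function names and the "single" label for x = 1 are also hypothetical.

```python
def frequency(y: int, x: int) -> float:
    """Assumed definition: share of an insect species' x collecting
    events that fall on one particular plant genus (y of them)."""
    if x <= 0 or not 0 <= y <= x:
        raise ValueError("need x > 0 and 0 <= y <= x")
    return y / x


def score(y: int, x: int) -> str:
    """Apply the slide's thresholds to one insect-plant association."""
    f = frequency(y, x)  # also validates the inputs
    if x == 1:
        # A single collecting event: no confidence limit can be assessed.
        return "single"
    if f >= 0.15 and y >= 5:
        return "high"
    if (f >= 0.02 and y >= 3) or (f >= 0.15 and y >= 2):
        return "medium"
    return "low"  # not high or medium


if __name__ == "__main__":
    # (y, x) pairs are made-up inputs, not values from the database.
    for y, x in [(5, 20), (3, 100), (2, 10), (1, 10), (1, 1)]:
        print(y, x, score(y, x))
```

Checking x = 1 before the frequency thresholds keeps single-event records in their own category, as in the confidence graphs, rather than scoring them as low.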
26. Reasons for Low Host Scores and
Methods for Improving Data Quality
31. Reasons for Low Scores
1. Actual low host specificity: Indicated when a large number of
collecting events are distributed across many plant taxa.
2. Movement of adult specimens to alternative food sources:
Algorithm points out apparent vagility when there are multiple
hosts and little or no host repetition across collecting events.
3. Commingling of specimens in the field: Algorithm points out
problem when insect specimen numbers are low for a host
taxon and when there is lack of repetition of host occurrence.
4. Mislabeling of insects for hosts from a collecting event: Difficult
to distinguish from actual polyphagy in cases where all
specimens from an event are mislabeled. Often seen as a
unique host for a given insect taxon. More fieldwork needed.
5. Single collecting events: Indistinguishable from absolute host
fidelity based on multiple events, except no confidence limit can
be assessed. Heuristics such as presence of larvae and large
numbers of specimens give credence to presumed association.
Resolved only by further fieldwork.
35. 1. Insect collections offer substantial data on host
relationships even though a majority of the specimens
lack such information.
2. Our algorithm demonstrates a method for assessing
data quality on a large scale. Our initial analyses show
that:
- We can have confidence in a significant proportion
of the available information
- The data demonstrate a substantial degree of host
specificity in our three target groups.
3. Degree of host specificity requires a scoring method
that takes into account biological attributes, collecting
techniques, and approaches to data capture in the field.
36. Acknowledgments
•Participating TCN and PBI Institutions
•iDigBio
•AMNH Database Data-entry Personnel
•Participating TCN Data-entry Personnel
•Michael D. Schwartz
•National Science Foundation
Editor's notes
Good morning. Today I would like to speak to you about data on insect-plant associations as derived from insect collections. This presentation is a joint effort by Katja Seltmann, Christine Johnson, and me as part of our work on a TCN award from the NSF.
In this talk we will use TCN data to examine host data for three families of herbivorous hemipterans and evaluate three propositions:
The degree to which collections contain information on host relationships
The degree of confidence we are able to place in that information, and
The degree to which those data demonstrate host specificity or the lack thereof
The AEC database has supported data capture for a number of NSF-supported projects. This slide shows the relative proportion of data captured by these projects, which in aggregate represent more than 1 million specimen records. The largest numbers come from the TCN project, which represents about two-thirds of the red slice of the pie.
Here we see the institutions with more than 10,000 specimen records, which have therefore made the most significant contributions to our knowledge of host relationships.
These graphs plot specimens against time, with each point representing a collecting event. The graph in the lower right is the sum of collecting events for all three groups. Note that the scale for each graph is different, with the Miridae having a much greater number of specimens per collecting event than the Aphididae and Membracidae. These data represent all collecting events, irrespective of whether host data are involved.
Here we see the data in the prior graphs transformed to show the numbers of collecting events with host records for each taxon, as well as information for remaining taxa in the database. Comparison of the right-hand bar with the remaining three gives a clear indication of the reasons for choice of taxa for this analysis.
Blue is for records without host information;
Brown is for non-unique hosts;
Yellow represents unique hosts, in other words, all host records for the insect taxon are from the same host genus in this analysis.
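The three color categories can be illustrated with a short sketch. It assumes each insect taxon comes with a list of host genus names taken from its labels, with None or an empty string where a label carries no host. The function name and the category strings are hypothetical, and the graph itself counts collecting events, which this sketch does not attempt to reproduce.

```python
def host_category(host_genera):
    """Classify one insect taxon by the host genera on its labels.

    host_genera: iterable of genus names; None or "" means the label
    carried no host information.
    """
    genera = {g for g in host_genera if g}
    if not genera:
        return "no host information"   # blue
    if len(genera) == 1:
        return "unique host"           # yellow: all records on one genus
    return "non-unique hosts"          # brown


if __name__ == "__main__":
    print(host_category([None, "", None]))
    print(host_category(["Larrea", "Larrea", None]))
    print(host_category(["Larrea", "Ambrosia"]))
```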
Aphids almost always come with host information as a result of the collecting methods that are used in the group.
Here we can see the numbers of plant families on the left, and plant genera on the right, occupied by each of the three groups we have chosen to analyze. Relative to the size of the taxon sample, the Aphididae show the highest diversity of host information at both the family and generic levels. The family data also support the proposition that all three taxa are specializing on many of the same plant families, a phenomenon that is reinforced in the following graphs.
Here we see numbers of collecting events by plant family for the Miridae,
For the Membracidae, and
For the Aphididae. You will note that a few families loom large as hosts, usually in all three groups, notably Asteraceae, Fabaceae, Fagaceae, Rosaceae, and Pinaceae, with most other families occurring in lower frequencies.
Our approach to assessing the strength of host data is through a DECISION TREE: the first set of decisions is based on the frequencies and collecting event counts; the second set of decisions is based on the heuristic properties. Scores are based on fit of the model to the data and ranked from high to low.
The main contributor to a score is frequency (f). Low frequency does not argue that information for a taxon should be completely disregarded. The score for an insect-plant association can be increased through information from the heuristics component, for example, having insect larvae collected on a given plant species, which would indicate a strong association even when there is a low number of collecting events. The existence of large numbers of specimens or of authoritatively identified plant vouchers would also improve the score for a given association. Associations with a frequency of 1 make no argument for whether the data are strong or not because no confidence limits can be established. The only way to bolster the score is through more collecting. The single-event data do suggest that when going to the field the first host to be investigated should be the one for which we already have a presumed association.
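One way to picture the heuristics step is as an adjustment applied after the frequency-based score. The sketch below is only an assumption about how such an adjustment might work. The one-level promotion rule, the `many_specimens` cutoff of 10, and all names are hypothetical; the talk does not state how much each heuristic raises a score. Single-event associations are left unchanged, following the statement that only more collecting can bolster them.

```python
LEVELS = ["low", "medium", "high"]


def adjust_with_heuristics(base, larvae_present=False, specimen_count=0,
                           voucher_available=False, many_specimens=10):
    """Raise a frequency-based score by one level when any heuristic
    supports the association (hypothetical rule, for illustration)."""
    if base not in LEVELS:
        # e.g. "single": no confidence limit, so heuristics do not rescore it
        return base
    supported = (larvae_present
                 or voucher_available
                 or specimen_count >= many_specimens)
    if not supported:
        return base
    return LEVELS[min(LEVELS.index(base) + 1, len(LEVELS) - 1)]


if __name__ == "__main__":
    print(adjust_with_heuristics("low", larvae_present=True))
    print(adjust_with_heuristics("medium", specimen_count=3))
    print(adjust_with_heuristics("single", voucher_available=True))
```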
For example, in order to get in the high category, the frequency of y, f(y), has to be greater than or equal to 15% AND y has to be greater than or equal to 5.
In order to get a medium score, an association needs to meet either of two less stringent combinations: f(y) of at least 2% with y of at least 3, or f(y) of at least 15% with y of at least 2.
The value of f(y), the frequency of y, is obtained by the following formula:
Here we see confidence values for the three families we have analyzed. As a proportion, the Membracidae have the most high scores (in yellow). In absolute terms the Miridae present the most data on host fidelity, with 842 associations with high scores and, in blue, 1844 associations with medium scores; but they also possess the greatest number of host data points based on a single collecting event (pink), a situation that obviously demands further fieldwork but may nonetheless be an indicator of a large number of valid host associations. All three insect families have large numbers of putative host associations with low scores (gray), a situation we will return to later in the presentation.
Here we see a histogram showing all species of Miridae known to occur on Larrea (Zygophyllaceae) in the American Southwest. The gray portion of the bar indicates the proportion of collecting events known from Larrea, while the other colors indicate the proportions of collecting events from other plant families. What does our decision tree approach tell us about these data?
Larrea served as the model from which we developed the decision-tree criteria.
Here we see those same data plotted in the form of a graph: gray nodes represent insect species, green nodes represent plant genera, the large node representing Larrea. Red lines (edges) represent associations for which the decision tree indicates a high level of confidence in the host association. One might ask “Just what does this graph tell us?”
The size of the balls is determined by the number of collecting events.
This slide makes clear that in order to make sense of these data we need to tease apart the noise from the real signal. This graph shows the signal, whereas the prior graph commingles noise and signal. Even though specimen labels indicate that many taxa had been collected on Larrea, only 5 of those taxa actually appear to be host specific, as seen by their connections to the large green node. All other insect taxa are shown to have their actual (breeding) host associations with taxa other than Larrea, a result that may not be clear from a naïve interpretation of the data. We might therefore wish to look at the reasons why we get these spurious answers. This graph is based on high scores only.
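Separating signal from noise in this way amounts to keeping only the edges with high scores. The following is a minimal sketch under stated assumptions: associations are held as dictionaries with hypothetical keys `insect`, `plant`, and `score`, and the records shown are invented placeholders, not data from the TTD-TCN database.

```python
def high_confidence_edges(associations):
    """Keep only (insect, plant) pairs whose score is 'high'."""
    return [(a["insect"], a["plant"])
            for a in associations if a["score"] == "high"]


def specialists_on(associations, plant_genus):
    """Insect taxa with a high-confidence edge to the given plant genus."""
    return sorted({insect
                   for insect, plant in high_confidence_edges(associations)
                   if plant == plant_genus})


if __name__ == "__main__":
    # Invented example records; insect names are placeholders.
    records = [
        {"insect": "mirid sp. A", "plant": "Larrea", "score": "high"},
        {"insect": "mirid sp. B", "plant": "Larrea", "score": "low"},
        {"insect": "mirid sp. B", "plant": "Ambrosia", "score": "high"},
        {"insect": "mirid sp. C", "plant": "Larrea", "score": "medium"},
    ]
    print(high_confidence_edges(records))
    print(specialists_on(records, "Larrea"))
```

In this toy input, species B is recorded on Larrea but its only high-confidence edge leads elsewhere, which is the pattern the filtered graph makes visible.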
When the noise is filtered out for the Miridae as a group by the elimination of low scores, and insect genera are plotted against plant genera, we see distinct patterns develop in the group. Here we see plant groups which have high herbivore diversity and in which we also have high confidence in the data. If this graph were done on a species-by-species comparison, the strength of the signal would probably be even greater, although more complex in terms of presentation.