This study evaluated the reliability of four binocular vision measurements used in diagnosing convergence insufficiency: near heterophoria, positive fusional vergence, near point of convergence, and accommodative amplitude. Two examiners measured 20 children on two separate occasions one week apart. The near heterophoria, near point of convergence, and accommodative amplitude measurements showed good within-session and between-session reliability. However, positive fusional vergence measurements were found to have only fair reliability, with clinically significant differences between sessions. Large potential test-retest differences could complicate clinical diagnosis and treatment decisions for convergence insufficiency.
abnormalities, trends of deterioration or of spontaneous improvement,
and the effects of therapy. Tests should be repeatable, with
the same examiner at different times (intraexaminer reliability) as
well as with different examiners (interexaminer reliability) obtain-
ing similar results. Reliability is critical information for both the
clinician and researcher who want to obtain an accurate time
course of the patient’s condition.
Reliability of von Graefe Heterophoria
Hirsch and Bing5 reported the reliability of the near von Graefe
method using 38 adult subjects (optometry students) measured by
two examiners on two separate occasions. The exact time interval
between the two sessions was not specified. Hirsch and Bing found
good-to-excellent reliability for both intraexaminer (r = 0.88 for
examiners 1 and 2) and interexaminer (r = 0.94) measurements. They
also reported relatively small intraexaminer mean differences of
2.16 Δ (SD = 1.84) for one examiner, 2.05 Δ (SD = 1.75) for the
other examiner, and a small interexaminer mean difference of 2.00 Δ.
Morgan6 reported good intraexaminer reliability (r = 0.81) on
23 optometry students who first served as subject and then examiner
each week over a 5-week period. Rainey et al.7 evaluated
interexaminer repeatability of heterophoria tests on 72 second- and
third-year optometry students. They reported fair-to-good reliability
(r = 0.75) for the near von Graefe method, with a small mean
difference of −0.20 Δ, but clinically large 95% limits of agreement
of (−0.20 ± 6.7 Δ).
Reliability of von Graefe Fusional Vergence
There have been few investigations regarding the reliability of
fusional vergence measurements. The general opinion is that when
fusional vergence tests are repeated on the same patient, the second
value found may be quite different from the first.8 Sheedy9
suggested that “A difference of 10 prism diopters from one fusional
vergence amplitude measurement to another is not unusual unless
rigorous controls are applied.”
Brozek et al.10 examined the PFV at distance on six occasions in
six subjects between the ages of 20 and 30 years. A Risley prism was
held in front of one eye while the subject fixated a spot of light at
6 m. It was not clear whether single or multiple examiners were
used. Brozek et al. found good consistency among the six measures
(rc = 0.81, where rc is a modified intraclass correlation coefficient
[ICC]). The actual ICC (which we calculated from their data) for
these data is 0.72, which still indicates good reliability. Assuming
that the mean difference between two vergence measurements is
zero and using our ICC calculation, we estimated the 95% limits of
agreement for Brozek’s data to be ±5.06 Δ. Penisten et al.11
recently completed a similar study, but of phoropter-mounted Risley
prism fusional vergence at 4 m and 40 cm on eight young adult
subjects. The authors reported that the distance PFV break and
near PFV blur and recovery were the least repeatable, with an
estimated intrasubject SD on replicated measurements of about 2.75 Δ
(compared with 3.45 Δ for Brozek et al.), whereas the distance
PFV blur and recovery were slightly more repeatable (SD = 2.00
to 2.25 Δ). The PFV break at near had the smallest SD of 1.5 to 2 Δ.
Feldman et al.12 compared the near PFV taken twice within a
single session (5 min apart) by a single examiner. Subjects were
adults (optometric students, faculty, and staff) with a mean age of
25 years. They reported good-to-excellent within-session reliability
for both PFV break (r = 0.87) and PFV recovery (r = 0.86).
Reliability of Nearpoint of Convergence
Brozek et al.10 also examined the nearpoint of convergence on
six occasions in six subjects between the ages of 20 and 30 years. A
Prentice rule with a white circular target, 2 mm in diameter, was
brought in from the distance of clear vision to the point of binocular
diplopia. It was not clear whether single or multiple examiners
were used. They reported good consistency (rc = 0.79), but the
corresponding ICC was only 0.65, which reflects only a fair level of
reliability.
Reliability of Accommodative Amplitude
Brozek et al.10 also examined the nearpoint of accommodation
on six occasions in six subjects between the ages of 20 and 30 years.
A Prentice rule with a 20/30 line of letters was brought in from a
distance of clear vision to the point of first blur. It was not clear
whether single or multiple examiners were used. Three measurements
were taken and averaged on each of six occasions, and good
consistency (rc = 0.76) was reported, although the ICC was only
0.51, which suggests only fair reliability. The AA values of the six
subjects were fairly homogeneous, which may have artificially
lowered the ICC.
Rosenfield and Cohen13 evaluated the pushup method of
accommodative amplitude on five occasions separated by at least
24 h. The maximum separation between the sessions was not
reported. It was also not clear whether single or multiple examiners
were used. Thirteen adult subjects (mean age of 24 years) viewed a
single optotype within the smallest line of letters that could be
resolved at a viewing distance of 40 cm, and the target was brought
from clear vision to first sustained blur. They reported that the
range over which 95% of accommodative amplitude values would
be predicted to lie was 10.11 ± 1.44 D (i.e., mean = 10.11 D, SD
= 0.73 D, and 1.96 × 0.73 = 1.44 D). These authors inappropriately
characterized this range as the Bland-Altman 95% limits of
agreement.14 In this case the Bland and Altman limits of agreement
should provide an interval in which 95% of the differences between
two measurements of amplitude, not the actual amplitude values,
would be predicted to lie. From the results of Rosenfield and Cohen
and by making certain reasonable assumptions, estimated values
of the Bland-Altman limits of agreement can be calculated. In
particular, assuming that there is no bias between two measurements
and that the ICC is a moderate 0.70, the 95% limits of
agreement can be estimated to be ±1.11 D.
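The two estimated limits of agreement quoted in this review (±5.06 Δ for Brozek et al. and ±1.11 D for Rosenfield and Cohen) can be reproduced with a short calculation. The sketch below is our reading rather than a stated method: it assumes zero bias, and it assumes that the quoted SD (3.45 Δ and 0.73 D, respectively) is the total SD of single measurements, so that the within-subject variance is (1 − ICC) × SD². The function name is hypothetical.

```python
import math

def estimated_limits_of_agreement(total_sd, icc, z=1.96):
    """Half-width of the 95% limits of agreement, assuming zero bias.

    Assumption (ours): total_sd is the SD of single measurements, so the
    within-subject variance is (1 - icc) * total_sd**2. The difference of
    two measurements on the same subject then has twice that variance.
    """
    within_subject_var = (1.0 - icc) * total_sd ** 2
    sd_of_differences = math.sqrt(2.0 * within_subject_var)
    return z * sd_of_differences

# Rosenfield and Cohen: SD = 0.73 D with an assumed ICC of 0.70
rosenfield = round(estimated_limits_of_agreement(0.73, 0.70), 2)
# Brozek et al.: SD = 3.45 prism diopters with the recalculated ICC of 0.72
brozek = round(estimated_limits_of_agreement(3.45, 0.72), 2)
print(rosenfield, brozek)  # 1.11 5.06
```

Under these assumptions the half-width scales with the square root of (1 − ICC), so a lower assumed ICC widens the estimated limits.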
Chen and O’Leary15 measured accommodative amplitude on
18 adults on two separate occasions (the exact time period was not
reported). A modified pushup method of blur to first detection was
used with a target size of N8 reduced Lea symbols. They reported
a correlation coefficient of 0.99, with a mean difference of 0.07 D
and 95% limits of agreement of 0.07 ± 1.22 D.
The literature review reveals heterogeneity in the reporting of
reliability (or repeatability) study results on binocular measures,
making direct and clear comparisons between studies difficult.
Classification of Convergence Insufficiency—Rouse et al. 255
Optometry and Vision Science, Vol. 79, No. 4, April 2002
Many of the above-cited studies inappropriately have used the
Pearson product-moment correlation coefficient (r) as an index of
reliability. Other studies utilized the methods of Bland and
Altman14 in reporting the limits of agreement, a range of values in
which it is reasonable to expect the difference between two
measurements of the same parameter to occur just by chance. The
studies by Rosenfield and Cohen13 and Penisten et al.11
inappropriately used the Bland-Altman “limits of agreement”
terminology, but in fact reported ranges of values on a single measurement
of a parameter that one might expect from a typical patient. This
indirect view of reliability is difficult to compare with the
Bland-Altman approach, which is a direct view of the distribution of
differences between replicated measurements.
There is a clear need to evaluate intraexaminer and interexam-
iner reliability of common binocular vision measurements in
school-aged children. Although a few studies suggest that there is
good reliability for some measurements (near heterophoria and
accommodative amplitude), it may not be appropriate to apply
adult results to children because binocular function depends on
both examiner instructions and the patient’s subjective response.
Children may be poorer observers, have more trouble understand-
ing instructions or expected endpoints, or be slower to respond
than adults. The purpose of this paper is to evaluate the reliability
of the primary binocular vision measurements used in determining
the diagnosis of CI in school-aged children.
METHODS
This study was approved by the Southern California College of
Optometry institutional review board, and informed consent was
obtained for all subjects in the study.
Study Population
Fifth and sixth graders were screened in a school setting by two
CIRS examiners according to a standard protocol. Screening cri-
teria were as follows:
• either no glasses or had worn glasses or contact lenses ≥1 month
by subject report;
• visual acuity 20/30 or better in each eye with habitual correction
using a Snellen wall chart;
• uncorrected refractive error equal to or between −0.50 and
+1.00 D, and ≤1.00 D of astigmatism in either eye or ≤1.00
D of anisometropia by retinoscopy;
• no strabismus at 3 m or 30 cm by unilateral cover test.
Data Collection
The first 20 consecutive children who passed the vision screen-
ing were used as subjects. Intraexaminer and interexaminer reli-
ability were evaluated for the following measurements:
• von Graefe heterophoria at 30 cm using a single line of 20/30
reduced Snellen target;
• von Graefe PFV and NFV at 30 cm (blur/break/recovery) using
a 2 × 5 block of 20/30 reduced Snellen target;
• NPC (break/recovery) using a single line of 20/30 reduced
Snellen target on an Astron International (ACR/21)
Accommodative Rule;
• monocular accommodative amplitude (Donders’ Pushup
Method) of the right eye only using a single line of 20/30
reduced Snellen target on an Astron International (ACR/21)
Accommodative Rule.
Each examiner took three consecutive measurements on each
subject according to the standard protocol outlined in the Appen-
dix. The exception was vergence measures, where a counterbal-
anced method of negative and then positive fusional vergence was
conducted until three measurements of each were reached. The
examiners performed independent measurements on the same sub-
jects without knowledge of the other examiner’s results. Measure-
ments were taken again by the same examiners on the same subjects
1 week later.
One problem we noted while reviewing the literature was a lack
of detail in the Methods sections. Because there is some variation in
the literature and especially among practitioners as to the exact
procedure for our four measures, we are providing detailed meth-
ods as an Appendix to the manuscript as outlined from the CIRS
Manual of Procedures.1
a
Data Analysis
This study design allows for the consideration of intraexaminer
reliability both within and between sessions as well as within-
session interexaminer reliability for each of the four principal CI
diagnostic variables: NH, PFV break, NPC break, and AA.
Within-session intraexaminer reliability was assessed using both
the within-session ranges and the intraclass correlation coefficient
(ICC). The range for each subject is the difference between the
maximum and minimum of the three within-session measure-
ments. We will report first the sample mean range, which provides
a measure of a typical patient’s within-session difference in mea-
sures; second, the 95th percentile of the ranges (R95), which gives
a practical upper limit on the differences between within-session
measurements. We estimate that 95% of all patients would have a
maximum within-session difference in measures no greater than
this limit.
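As an illustration of the within-session summaries described above, the following minimal sketch computes each subject's range, the sample mean range, and R95. The readings are hypothetical, the function names are ours, and the percentile rule (linear interpolation between order statistics) is an assumption, because the text does not say which percentile definition was used.

```python
def percentile(values, pct):
    """Percentile by linear interpolation between order statistics.

    Assumption (ours): other percentile definitions exist and would give
    slightly different R95 values on small samples.
    """
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def within_session_summary(measurements):
    """measurements: one list of repeated readings per subject."""
    # Range = maximum minus minimum of a subject's within-session readings.
    ranges = [max(m) - min(m) for m in measurements]
    mean_range = sum(ranges) / len(ranges)
    return mean_range, percentile(ranges, 95)

# Hypothetical readings (prism diopters): five subjects, three readings each.
example = [[4, 5, 4], [6, 6, 8], [2, 3, 3], [10, 9, 12], [5, 5, 5]]
mean_range, r95 = within_session_summary(example)
print(mean_range, r95)
```

With only a handful of subjects, R95 is driven almost entirely by the one or two largest ranges, which is worth keeping in mind when reading R95 values from small samples.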
The ICC is an overall index of reliability ranging between zero
and one. A value of one indicates perfect repeatability—meaning,
in this case, each subject obtained the same value on the three
within-session measures. A value of zero indicates no reproducibil-
ity of the measurement and, hence, no reliability. The ICC is
commonly interpreted as follows16:
ICC < 0.4 indicates poor reliability;
0.4 < ICC < 0.75 indicates fair-to-good reliability;
ICC > 0.75 indicates good-to-excellent reliability.
The ICC depends on both the between- and within-subject
variability. It will be high when the within-subject variability is low
relative to the total of between- and within-subject variability. It
will be low when within-subject variability is high relative to this
total variability. Hence, a sample that either overestimates or un-
derestimates the population variability may result in a distorted
ICC estimate. Consequently, it is important that the ICC is
interpreted in conjunction with measures of variability like the range.
Low values of the range will correspond typically to a higher ICC.
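The dependence of the ICC on within-subject variability can be illustrated with a small sketch. It uses the one-way random-effects form of the ICC, which is an assumption on our part because the text does not state which form was computed; the data and function names are hypothetical. The labels follow the interpretation scale quoted above, and the handling of values exactly at 0.4 and 0.75 is also our choice.

```python
def icc_oneway(measurements):
    """One-way random-effects ICC for a single measurement.

    Assumption (ours): this form of the ICC is used for illustration only.
    measurements: one list per subject, each with the same number k of readings.
    """
    n = len(measurements)
    k = len(measurements[0])
    grand_mean = sum(sum(m) for m in measurements) / (n * k)
    subject_means = [sum(m) / k for m in measurements]
    # Between-subject and within-subject mean squares from one-way ANOVA.
    ms_between = k * sum((sm - grand_mean) ** 2 for sm in subject_means) / (n - 1)
    ms_within = sum(
        (x - sm) ** 2 for m, sm in zip(measurements, subject_means) for x in m
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def interpret_icc(icc):
    """Labels from the quoted scale; boundary handling is our assumption."""
    if icc < 0.4:
        return "poor"
    if icc <= 0.75:
        return "fair-to-good"
    return "good-to-excellent"

# Hypothetical data: same subject means, different within-subject scatter.
tight = [[4, 4, 5], [8, 8, 8], [12, 13, 12]]
noisy = [[1, 4, 8], [5, 8, 11], [9, 13, 15]]
print(interpret_icc(icc_oneway(tight)), interpret_icc(icc_oneway(noisy)))
```

Both data sets have identical subject means, so the drop in ICC for the second set comes entirely from the larger within-subject scatter.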
For intraexaminer between-session reliability, each examiner’s ses-
sion 1 and session 2 means were compared. A principal focus is the
distribution of the between-session differences in these means.
Here the methods of Bland and Altman14 are useful in the case
where the distribution of differences is approximately normal and
the mean difference is close to zero. To check these assumptions, a
preliminary Anderson-Darling test for normality17 and a
matched-pairs t-test were conducted. If both of these tests were
nonsignificant, then we proceeded with the Bland-Altman methodology.
The mean difference, the SD of the differences, the coefficient
of repeatability (COR), which is 1.96 × SD, and the 95% limits of
agreement (mean difference ± COR) were computed. In cases
where either of the preliminary tests was significant, we
considered the distribution of absolute differences. In place of the COR,
we computed the 95th percentile of the absolute differences
(AD95). This 95th percentile provides, as does the COR in the
case of normality and a zero mean, a threshold for differences of
successive measures that would have to be exceeded to conclude
that a true shift in value has likely occurred, as opposed to an
observed difference that can be explained by the natural variability
in the measure. In both cases, we find it useful to compute the
median absolute difference (MAD). It provides a measure of
the typical difference in mean between the two sessions where the
distribution of differences may not be normal. The ICC also was
computed as an index of agreement between the means of the two
sessions. These same methods were also used in assessing within-
session interexaminer reliability.
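The between-session quantities defined above can be sketched in a few lines. This is a minimal illustration with hypothetical session means, not the analysis code used in the study; the function name is ours, the preliminary normality and t-tests are omitted, and the "inclusive" percentile rule used for AD95 is an assumption.

```python
import statistics

def between_session_summary(session1_means, session2_means):
    """Bland-Altman style summary of paired session means."""
    diffs = [s2 - s1 for s1, s2 in zip(session1_means, session2_means)]
    abs_diffs = [abs(d) for d in diffs]
    mean_diff = statistics.mean(diffs)
    cor = 1.96 * statistics.stdev(diffs)  # coefficient of repeatability
    return {
        "mean_diff": mean_diff,
        "COR": cor,
        "limits": (mean_diff - cor, mean_diff + cor),  # 95% limits of agreement
        "MAD": statistics.median(abs_diffs),  # median absolute difference
        # 95th percentile of absolute differences; percentile rule is assumed.
        "AD95": statistics.quantiles(abs_diffs, n=20, method="inclusive")[-1],
    }

# Hypothetical session means for five subjects.
s1 = [4.0, 6.0, 2.0, 10.0, 5.0]
s2 = [5.0, 5.0, 4.0, 9.0, 8.0]
summary = between_session_summary(s1, s2)
print(summary)
```

The same arithmetic links the reported values in this paper: the Fig. 1 limits of −6.93 and 6.63 correspond to a mean difference of −0.15 Δ plus or minus the COR of 6.78 Δ given in Table 1.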
Sample Size
The sample size of 20 was selected based on the resources
available to CIRS at the time of testing. In testing the null hypothesis of
just a fair level of reliability, say H0: ICC = 0.50, at the 0.05
level of significance, 23 subjects would be required to have 80%
power to detect the alternative of excellent reliability, say
ICC = 0.80. This assumes that a one-tailed test is to be conducted.
For the same test conducted at 70% power, 18 subjects would be
required. Hence, our sample size of 20 renders our test slightly
underpowered, but is also a substantial improvement on several of
the frequently referenced studies (e.g., Brozek et al.,10 Penisten et
al.,11 and Rosenfield and Cohen13) that were described previously.
RESULTS
Twenty fifth and sixth graders (8 males, 12 females; mean age
10.8 years, SD 0.34 years, range 10.2 to 11.5 years) served as the
subjects.
Near Heterophoria
Table 1 provides a summary of reliability measures for the NH.
The results indicate a high level of intraexaminer reliability, both
within and between sessions. The within-session ICC’s are
excellent (0.95 or higher), and the mean ranges are <2 Δ with R95 ≤4 Δ.
Intraexaminer between-session reliability was good for both
examiners (ICC = 0.81) with MAD’s ≤2 Δ. The COR for examiner 1,
which was ~7 Δ, and the corresponding limits of agreement
are illustrated in Fig. 1. The interexaminer within-session
reliability was excellent for session 1 (ICC = 0.91) and good for session 2
(ICC = 0.72). The COR’s were <9 Δ, with the MAD’s ≤2.5 Δ.
Positive Fusional Vergence Break
Table 2 shows the summary of reliability measures for PFV
break. The within-session measurements for examiners 1 and 2
indicate different levels of intraexaminer reliability depending on
the testing session. Both examiners had good session 1 reliability
(ICC: 0.76 and 0.71) but excellent session 2 reliability (ICC: 0.94
and 0.93). Consistent with the ICC’s, for examiners 1 and 2,
session 1 mean ranges and R95’s were higher (means: 5.30 Δ and
5.40 Δ) than the corresponding session 2 values (means: 3.80 Δ
and 2.45 Δ). Intraexaminer between-session reliability was fair
(ICC: 0.59 and 0.53) with COR’s of 14.07 Δ and 12 Δ. The 95%
TABLE 1.
Near heterophoria.a

Intraexaminer Within-Session Reliability
                   Session 1      Session 2
Examiner 1
  Mean             4.20 Δ XP      4.05 Δ XP
  Mean range       1.95 Δ         1.65 Δ
  R95              4.00 Δ         4.00 Δ
  ICC              0.95           0.95
Examiner 2
  Mean             4.82 Δ XP      4.23 Δ XP
  Mean range       1.70 Δ         0.90 Δ
  R95              3.00 Δ         2.00 Δ
  ICC              0.97           0.99

Intraexaminer Between-Session Reliability
                   ICC      COR        MAD
E1S1 vs. E1S2      0.81     6.78 Δ     1.67 Δ
E2S1 vs. E2S2      0.81     7.64 Δ     2.00 Δ

Interexaminer Within-Session Reliability
                   ICC      COR        MAD
E1S1 vs. E2S1      0.91     4.86 Δ     1.33 Δ
E1S2 vs. E2S2      0.72     8.86 Δ     2.50 Δ

a For intraexaminer within-session reliability, mean is the average
of the 60 (20 patients times 3 measurements per patient)
within-session measures, mean range is the average of the 20
individual patient ranges, R95 is the 95th percentile of those 20
ranges, and ICC is the intraclass correlation coefficient. For both
the intraexaminer between-session and interexaminer within-session
reliability, ICC is the intraclass correlation coefficient for the
session means, COR is the coefficient of repeatability, which is
1.96 times the SD of the session differences, and MAD is the
median absolute difference of those session differences. In cases
where the COR value is asterisked, the 95th percentile of the
absolute differences is being substituted. E1S1, examiner 1/session 1;
E2S2, examiner 2/session 2.
limits of agreement for examiner 2 are illustrated in Fig. 2.
Interexaminer within-session reliability was also fair. The ICC’s were
0.64 (session 1) and 0.53 (session 2), with COR’s of 10.30 Δ or
higher.
NPC Break
The summary of reliability measures for NPC break is shown in
Table 3. In all three comparisons, NPC break has excellent
reliability, with ICC’s no lower than 0.86. Intraexaminer
within-session reliability is especially high, with all ICC’s ≥0.94 and mean
ranges ≤1.25 cm. The intraexaminer between-session reliability
was excellent (ICC: 0.92 and 0.89) with MAD’s of ~1 cm. The
limits of agreement for examiner 2 are illustrated in Fig. 3. This
plot also shows a fairly strong positive trend (r = 0.78) between the
differences between the two measures and their means, indicating
FIGURE 1.
Examiner 1 between-session reliability on near heterophoria; the plot of
the difference between the two session averages (session 2 − session 1) vs.
the mean of those two averages. The lines at L = −6.93 and U = 6.63
show, respectively, the lower and upper 95% limits of agreement.
TABLE 2.
Positive fusional vergence break.a

Intraexaminer Within-Session Reliability
                   Session 1      Session 2
Examiner 1
  Mean             22.10 Δ        24.10 Δ
  Mean range       5.30 Δ         3.80 Δ
  R95              8.00 Δ         8.00 Δ
  ICC              0.76           0.94
Examiner 2
  Mean             22.78 Δ        19.06 Δ
  Mean range       5.40 Δ         2.45 Δ
  R95              12.00 Δ        6.00 Δ
  ICC              0.71           0.93

Intraexaminer Between-Session Reliability
                   ICC      COR         MAD
E1S1 vs. E1S2      0.59     14.07 Δ     3.67 Δ
E2S1 vs. E2S2      0.53     12.00 Δ*    4.00 Δ

Interexaminer Within-Session Reliability
                   ICC      COR         MAD
E1S1 vs. E2S1      0.64     10.30 Δ     3.33 Δ
E1S2 vs. E2S2      0.53     16.00 Δ*    5.67 Δ

a See notes for Table 1.
FIGURE 2.
Examiner 2 between-session reliability on PFV break; the plot of the
difference between the two session averages (session 2 − session 1) vs. the
mean of those two averages. The lines at L = −12.00 and U = 12.00
show, respectively, the lower and upper empirical 95% limits of
agreement.
TABLE 3.
Nearpoint of convergence break.a

Intraexaminer Within-Session Reliability
                   Session 1      Session 2
Examiner 1
  Mean             5.45 cm        5.72 cm
  Mean range       1.10 cm        0.80 cm
  R95              2.00 cm        2.00 cm
  ICC              0.98           0.98
Examiner 2
  Mean             4.54 cm        5.68 cm
  Mean range       0.78 cm        1.25 cm
  R95              2.00 cm        3.00 cm
  ICC              0.98           0.94

Intraexaminer Between-Session Reliability
                   ICC      COR         MAD
E1S1 vs. E1S2      0.92     5.33 cm*    1.17 cm
E2S1 vs. E2S2      0.89     5.00 cm*    1.00 cm

Interexaminer Within-Session Reliability
                   ICC      COR         MAD
E1S1 vs. E2S1      0.86     4.43 cm     1.68 cm
E1S2 vs. E2S2      0.97     2.55 cm     0.67 cm

a See notes for Table 1.
a tendency for the difference to increase with the NPC break. A
similar pattern (not shown) is evident for examiner 1, but two
highly influential outliers lower the correlation (r = 0.01, but r =
0.58 with the outliers excluded). The interexaminer within-session
reliability was also excellent, with smaller COR’s than the
intraexaminer between-session reliability.
Accommodative Amplitude
Table 4 is a summary of the reliability measures for AA.
Intraexaminer within-session reliability is excellent with ICC’s ≥0.88,
mean ranges ≤2.29 D, and R95 of 5.00 D in all cases.
Intraexaminer between-session reliability differed by examiner (ICC: 0.82
vs. 0.69) with MAD’s of ~2 D or less. The limits of agreement for
examiner 1 are illustrated in Fig. 4. The interexaminer within-session
reliability was good (ICC: 0.81 and 0.85), with slightly higher MAD’s and
smaller COR’s than the intraexaminer between-session reliability.
Positive Fusional Vergence Recovery and
NPC Recovery
Although PFV recovery and NPC recovery are not used in our
diagnostic classification system, they typically are measured in a
clinical assessment of binocular vision. In Tables 5 and 6, the
summary of reliability information is provided for these binocular
measures.
DISCUSSION
Our multifaceted data analysis approach provides different per-
spectives on the issue of reliability or repeatability. The ICC is a
reliability index ranging from zero to one regardless of the units of
the measure under consideration. It readily allows for direct com-
parison of reliability between different measurements. The ICC
takes into account intrasubject and intersubject variability, but it
does not directly convey the level of intrasubject variability. For
example, the ICC was 0.81 for the intraexaminer between-session
reliability of NH (examiner 1). This means that 81% of the variability in these
measurements is due to intersubject variability and only 19% is
due to intrasubject variability. It is this intrasubject variability that
is more clinically relevant to the practitioner. A more direct clinical
summary of these data is provided by the MAD, which is 1.67 Δ,
and the COR, which is 6.78 Δ. That is, NH measurements
taken 1 week apart would typically differ by <2 Δ, but it would be
FIGURE 3.
Examiner 2 between-session reliability on NPC break; the plot of the
difference between the two session averages (session 2 − session 1) vs. the
mean of those two averages. The lines at L = −5.00 and U = 5.00 show,
respectively, the lower and upper empirical 95% limits of agreement.
TABLE 4.
Accommodative amplitude.a

Intraexaminer Within-Session Reliability
                   Session 1      Session 2
Examiner 1
  Mean             14.18 D        15.46 D
  Mean range       2.04 D         2.29 D
  R95              5.00 D         5.00 D
  ICC              0.88           0.90
Examiner 2
  Mean             14.41 D        15.17 D
  Mean range       2.25 D         1.70 D
  R95              5.00 D         5.00 D
  ICC              0.90           0.95

Intraexaminer Between-Session Reliability
                   ICC      COR         MAD
E1S1 vs. E1S2      0.82     5.32 D      1.63 D
E2S1 vs. E2S2      0.69     10.48 D     2.06 D

Interexaminer Within-Session Reliability
                   ICC      COR         MAD
E1S1 vs. E2S1      0.81     4.13 D*     1.82 D
E1S2 vs. E2S2      0.85     6.86 D      2.58 D

a See notes for Table 1.
FIGURE 4.
Examiner 1 between-session reliability on accommodative amplitude; the
plot of the difference between the two session averages (session 2 −
session 1) vs. the mean of those two averages. The lines at L = −4.04 and
U = 6.60 show, respectively, the lower and upper 95% limits of
agreement.
possible for the difference to be as large as ~7 Δ. Fig. 1 shows these
95% limits of agreement. Any finding outside this range has only a
5% probability of being due to measurement error alone. We feel,
as do others,18 that the Bland-Altman approach gives a more relevant
clinical picture of measurement error because it details the nature
of the intrasubject variability. However, Bland and Altman have
acknowledged the appropriateness of the ICC in reliability
studies.19 In addition, most of the older studies have used the standard
product moment correlation coefficient (r) to evaluate reliability.
The ICC is preferable because it is a measure of agreement between
measures, whereas r is a measure of association. Because the ICC is
usually close to r and always less than or equal to it, we are still able
to draw comparisons between this and previous studies.
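The distinction between agreement and association can be made concrete with a small sketch. The data are hypothetical: a second set of readings that is exactly 3 units higher than the first. The ICC form used here (one-way random effects) is an assumption on our part, and the function names are ours.

```python
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def icc_oneway_pairs(x, y):
    """One-way random-effects ICC for paired readings (assumed form)."""
    n = len(x)
    pair_means = [(a + b) / 2 for a, b in zip(x, y)]
    grand_mean = statistics.mean(pair_means)
    ms_between = 2 * sum((m - grand_mean) ** 2 for m in pair_means) / (n - 1)
    ms_within = sum(
        (a - m) ** 2 + (b - m) ** 2 for a, b, m in zip(x, y, pair_means)
    ) / n
    return (ms_between - ms_within) / (ms_between + ms_within)

first = [2.0, 4.0, 6.0, 8.0, 10.0]
second = [v + 3.0 for v in first]  # hypothetical constant 3-unit shift
r_value = pearson_r(first, second)
icc_value = icc_oneway_pairs(first, second)
print(round(r_value, 2), round(icc_value, 2))
```

In this example the two sets of readings are perfectly associated, yet the ICC falls well below the value of r because the systematic shift counts as disagreement.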
Near Heterophoria
The intraexaminer within-session reliability was found to be
excellent (0.95 to 0.99). Hence, repeated measurements within a
single testing session are very repeatable in children. In the previous
adult studies, intraexaminer reliability ranged from 0.81
to 0.88, with a MAD of ~2 Δ.5, 6 Our intraexaminer
between-session reliability results (0.81) using the ICC are similar to these
previous studies. Therefore, the clinician can expect typical
differences of ~2 Δ (MAD) but can measure differences as large as 6 to 7
Δ (COR). Our interexaminer reliability was also similar (0.91 and
0.72) to previous adult studies, which ranged from 0.75 to 0.94.5, 7
In general, children appear to respond as reliably as adults on this
near heterophoria measure. However, most clinicians would
probably be uncomfortable with the large COR values for
intraexaminer reliability between sessions.
Positive Fusional Vergence
Intraexaminer within-session reliability varied between the two
testing sessions, with session 1 being lower (0.76 and 0.71) and
session 2 higher (0.94 and 0.93) than that reported by Feldman et
al.12 (0.87). The mean ranges were higher for session 1, which
resulted in the lower ICC’s. The initial testing session may have
served as training, and the children may have learned to respond
better to the PFV on the second testing session where the mean
range and R95 are smaller.
Our findings do indicate that the intraexaminer between-session
and interexaminer within-session reliability is at best fair. The
intraexaminer between-session reliability results are lower (0.59
and 0.53) than Brozek et al.10 (0.72). However, Brozek et al. (and
Penisten et al.11) took their measurements on consecutive days,
evaluated distance PFV, and used adult subjects, which makes
direct comparison difficult. One might expect PFV at distance to
be more stable, thus more repeatable than PFV at near where the
accommodative-convergence relationship is more complex,
although Penisten et al. found intrasubject variability to be lowest
with the near PFV break. Based on our results, the clinician can
expect typical differences of 3 to 4 Δ, but can measure differences
as large as 12 Δ on follow-up visits. The large differences may cause
problems with accurately classifying patients as CI and monitoring
treatment outcomes. It may also explain why some patients appear
to have CI, based for example on Sheard’s criteria, and are
asymptomatic or vice versa. Additionally, when evaluating the effects of
vision therapy, a single examiner would need a change of 12 Δ,
whereas different examiners might need changes as large as 10 to
16 Δ to be confident that the change was real and not the result of
measurement variability.

TABLE 5.
Positive fusional vergence recovery.a

Intraexaminer Within-Session Reliability
                   Session 1      Session 2
Examiner 1
  Mean             6.72 Δ         9.08 Δ
  Mean range       5.70 Δ         4.75 Δ
  R95              13.00 Δ        8.00 Δ
  ICC              0.71           0.88
Examiner 2
  Mean             6.47 Δ         5.78 Δ
  Mean range       5.15 Δ         3.55 Δ
  R95              13.00 Δ        7.00 Δ
  ICC              0.68           0.90

Intraexaminer Between-Session Reliability
                   ICC      COR         MAD
E1S1 vs. E1S2      0.27     16.54 Δ     3.00 Δ
E2S1 vs. E2S2      0.50     12.15 Δ     4.00 Δ

Interexaminer Within-Session Reliability
                   ICC      COR         MAD
E1S1 vs. E2S1      0.57     10.62 Δ     4.17 Δ
E1S2 vs. E2S2      0.65     10.00 Δ*    4.17 Δ

a See notes for Table 1.

TABLE 6.
Nearpoint of convergence recovery.a

Intraexaminer Within-Session Reliability
                   Session 1      Session 2
Examiner 1
  Mean             7.88 cm        8.33 cm
  Mean range       1.20 cm        1.10 cm
  R95              3.00 cm        2.00 cm
  ICC              0.97           0.98
Examiner 2
  Mean             6.03 cm        7.38 cm
  Mean range       1.03 cm        1.25 cm
  R95              2.00 cm        3.00 cm
  ICC              0.97           0.97

Intraexaminer Between-Session Reliability
                   ICC      COR         MAD
E1S1 vs. E1S2      0.90     5.15 cm     1.00 cm
E2S1 vs. E2S2      0.84     7.33 cm*    0.92 cm

Interexaminer Within-Session Reliability
                   ICC      COR         MAD
E1S1 vs. E2S1      0.80     6.00 cm*    2.17 cm
E1S2 vs. E2S2      0.96     2.70 cm*    1.00 cm

a See notes for Table 1.
The large PFV break differences could be due to children having
more difficulty with the psychophysical aspects of this test. Chil-
dren may be poorer observers, have more trouble understanding
the instructions or expected endpoints, or be slower to respond
than adults. Presently we are evaluating the PFV in adults to ad-
dress this issue. If the large break differences are not due to subject
age, then the differences may be related to the fusional vergence
system being inherently variable over time.
Nearpoint of Convergence
The intraexaminer within-session reliability was found to be
excellent (0.94 to 0.98) for the NPC break. Hence, the
measurements within a single testing session are very repeatable in children.
Regarding intraexaminer between-session reliability, the only
previous study10 with adult subjects reported fair intraexaminer
reliability (ICC: 0.65). We found higher ICC values (0.92 and 0.89),
suggesting that the NPC break is a reliable measure over time in
children. The clinician can expect typical differences of ~1 cm, but
differences as large as ~5 cm may be measured. One caveat is that
patients with receded NPC’s (>6 cm) will generally have larger
differences when tested over time. The results for the NPC
recovery showed similarly high ICC’s, suggesting it is also a reliable
measure.
Accommodative Amplitude
The intraexaminer within-session reliability was found to be
excellent (0.88 to 0.95) for the AA. Hence, the measurements
within a single testing session are very repeatable in children. Our
intraexaminer between-session results in children are consistent
with the previous study by Chen and O’Leary15
in adults showing
excellent (r ϭ 0.99) reliability. Our results show a higher level of
reliability than Brozek’s adult study in which the ICC was 0.51.10
Based on our results, a clinician can expect typical within-ses-
sion differences of ~2 D, but differences as large as ~5 D may be
measured. These differences are difficult to compare with the re-
sults reported in the two most often quoted adult studies.10, 13
These studies reported the typical patient would have a SD of
about 0.75 D and, hence, a range of values of about ±1.5 D. From
this, Rosenfield and Cohen13 suggested that the typical difference
between two AA measurements on the same subject would be
within 1.5 D of each other. This conclusion is erroneous and
overstates the reliability of the AA measure: a range of ±1.5 D
means that the typical patient could actually have AA measurements
spanning ~3 D. Thus, the two AA measurements on such
a patient could readily be more than 1.5 D (and up to 3 D)
different and still be within the bounds of the natural variability of
this patient. This 3 D difference for adults is lower than this study’s
5 D difference for children, which indicates adults tend to have
better within-session reliability than children.
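As a worked check on this arithmetic, the short Python sketch below (not part of the original analysis) converts an SD into the ± range and into the full span between the lowest and highest expected readings. The function name and the multiplier of 2 SDs are assumptions chosen to match the 0.75 D and ±1.5 D figures quoted above.

```python
def spread_from_sd(sd, k=2.0):
    """Return (half_width, full_span) for readings within +/- k SDs.

    k = 2.0 is an assumption that reproduces the +/-1.5 D range
    quoted for an SD of about 0.75 D.
    """
    half_width = k * sd          # the "+/-" figure
    full_span = 2 * half_width   # lowest to highest expected reading
    return half_width, full_span


half_width, full_span = spread_from_sd(0.75)
print(f"+/-{half_width} D range, so two readings can differ by up to {full_span} D")
```

The point of the sketch is only that the ± figure is half of the span: two readings taken at opposite ends of the range differ by twice the ± value.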
In both of these previous adult studies, the intervening time
between measurements varies from several hours to a day or more,
whereas in our study, the between-session measurements were
taken 1 week apart. Our within-session reliability is more compa-
rable to the short-term repeatability results of these adult studies.
There are no comparable studies for estimating the between-ses-
sion differences that this study found. Based on our results, a
clinician can expect typical between-session differences of ~1 to 2 D,
but differences as large as ~5 to 10 D may be measured. The large
COR values for intraexaminer reliability between sessions are be-
yond the comfort level of most clinicians.
CONCLUSIONS
The study and analysis of measurement reliability is extensive
and intricate, and different authors have divergent views on which
methods are most appropriate. We elected to use a multifaceted
approach in presenting our reliability results because there is no
one accepted mode of analysis, and each method gives a different
and useful perspective on the problem.
The ICC, perhaps the most common index of reliability in the
health science literature,4, 20
provides a method to compare the
reliability of tests that have different units of measurement (in our
case, tests using prism diopters, lens diopters, and centimeters).
We can view the relative reliability of the group of tests typically
used in evaluating the syndrome diagnosis of CI. The ICC also
allows us to compare our results with older literature that may have
only used the correlation coefficient in their analysis. Three of the
four measures (NH, NPC, and AA) often used in the classification
of CI generally have good-to-excellent intraexaminer and interex-
aminer reliability based on the ICC evaluation. The PFV break was
found to have only fair intraexaminer and interexaminer reliability.
A difficulty with the ICC is that its interpretation is problematic
for the clinician. Knowing that the ICC for a test is 0.90 does not
help the clinician with the question of “how much difference
should I reasonably expect between two measurements of that
same test?” The Bland-Altman approach provides a more clinician-
friendly view of reliability. We have presented both the typical
difference between measurements (mean range for within session
and MAD for between sessions) and what the clinician may think
of as the worst-case difference, or as we have described in the results
section, “the difference can be as large as” (R95 within session or
the COR between sessions).
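The two between-session summaries can be illustrated with a minimal Python sketch. It assumes that MAD is the mean absolute difference between paired session values and that the COR is 1.96 times the SD of those differences, one common Bland-Altman convention; the exact formulas used in this study are given in its methods section and may differ. The function name and the NPC values are hypothetical, for illustration only.

```python
import statistics


def between_session_summary(session1, session2):
    """Return (typical, worst_case) differences for paired measurements.

    typical    = mean absolute difference (assumed meaning of MAD)
    worst_case = 1.96 * SD of the differences (assumed form of COR)
    """
    diffs = [b - a for a, b in zip(session1, session2)]
    typical = statistics.mean([abs(d) for d in diffs])
    worst_case = 1.96 * statistics.stdev(diffs)
    return typical, worst_case


# Hypothetical NPC break values (cm) for five subjects, one week apart.
week1 = [4.0, 5.0, 6.0, 3.0, 7.0]
week2 = [5.0, 4.0, 6.0, 4.0, 9.0]

typical, worst_case = between_session_summary(week1, week2)
print(f"typical difference: {typical:.1f} cm")
print(f"worst-case difference: {worst_case:.1f} cm")
```

Both values come out in the units of the test itself, which is what makes them easier for a clinician to use than a unitless ICC.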
We feel that the clinician who routinely takes these binocular
measurements on children will find the typical differences within
session and between sessions to be in line with what they generally
expect. See the summary in Table 7. The worst-case difference will
be greater, and in some cases much greater (two to five times the
typical differences) than those differences expected by that same
clinician. These “worst-case” differences represent the largest
difference between measurements that a clinician would expect to
observe in nearly all patients.
It may be unfair to look at each new patient in light of the
worst-case difference scenario. Most patients are close to typical,
but of course, a few problematic patients are not! We suggest
viewing a patient using the typical difference in most cases and
asking the following question: would the diagnosis be altered if the
observed measurement changed by as much as the typical differ-
ence? What if it changed by as much as the worst-case difference? It
is especially important to consider the worst-case differences when
there are inconsistencies in the case findings; for example, when a
patient’s clinical findings support the diagnosis of CI, but the
patient has few or no symptoms, or when a patient presents with
CI-type symptoms, but has clinical findings that appear within
acceptable limits.
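The question posed above can be sketched as a simple check, shown here in Python. The function name is hypothetical, and the 6 cm cutoff is used purely for illustration (it echoes the receded-NPC value mentioned in the discussion and is not a diagnostic criterion established by this study). The 1 cm and 5 cm values are the between-session typical and worst-case NPC differences from Table 7.

```python
def diagnosis_could_change(measured, cutoff, difference):
    """Return True if a shift of up to `difference` in either direction
    would move the measurement across the cutoff.

    A value is classed as abnormal when it is greater than the cutoff.
    """
    low_abnormal = (measured - difference) > cutoff
    high_abnormal = (measured + difference) > cutoff
    return low_abnormal != high_abnormal


npc_break_cm = 8.0   # hypothetical measured NPC break
cutoff_cm = 6.0      # illustrative cutoff only
for label, diff in (("typical", 1.0), ("worst-case", 5.0)):
    changed = diagnosis_could_change(npc_break_cm, cutoff_cm, diff)
    print(f"{label} difference of {diff} cm: classification could change = {changed}")
```

In this hypothetical case the classification is stable under the typical difference but not under the worst-case difference, which is the situation in which the remaining case findings and symptoms deserve closer scrutiny.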
Unfortunately, clinicians treating and monitoring this condition
will need to use the worst-case differences to feel confident that
the changes being seen are not just natural variation in the
between-session measurements. The large potential test-retest
differences found could complicate clinical decision-making with
regard to diagnosis and treatment. Changes in the testing protocol
used in this study, as well as other PFV procedures, should be
investigated in an attempt to improve both intraexaminer and
interexaminer reliability.
APPENDIX
von Graefe Near Heterophoria Test
A table stand with phoropter (B and L style) was used. Risley
prisms were marked in 2 Δ increments from 0 to 30 Δ. The fixation
target was a vertical column of 20/30 reduced Snellen letters.
Illumination was provided by a floor-stand lamp with 100- to
150-ft-cd/m2 on the card face. The patient’s interpupillary distance was
taken by pupillometer and dialed into the phoropter. The subject’s
habitual distance refractive correction was placed in the phoropter
if the subject was wearing glasses.
Before testing, the subject was shown a two-picture demonstra-
tion of the test responses. The first picture showed the initial pre-
sentation. The subject was told, “First you will see two lines of
letters with one being higher and to the right.” The second picture
showed vernier alignment of the two lines. The subject was told,
“The upper line of letters will flash on and off several times. Each
time they come on tell me whether they are to the right, to the left,
or directly above the lower letters as shown here.”
The examiner then introduced 4 to 6 Δ base-up over the left eye
(OS) for dissociation and 12 Δ base-in over the right eye (OD) for
biasing. The subject was asked to “Please read the line of letters”
first with the OS and then with the OD. The occluder was removed,
and the subject was asked, “Do you see two lines of letters with one
being higher and to the right?” (The subject’s right shoulder was
tapped to reinforce the concept of the ‘right’ direction.) If only
one target was seen, the prisms were readjusted. If the subject was
still unable to see the two targets, the suppressing eye was deter-
mined and testing was stopped.
The subject was then instructed to “Keep the letters as clear as
you can. The upper letters will flash on and off several times. Each
time they come on tell me whether they are to the right, to the left,
or directly above the lower letters.” The OD was occluded, and then
the OD was uncovered and re-covered (~1 s flash exposure time). If
the upper target was seen to the right, the OD prism was reduced by
4 Δ and the flashing was repeated until the upper target was first
seen to the left. The prism was then changed in 2 Δ increments until the
subject reported alignment. The prism amount and direction of
the deviation (eso, exo, and ortho) were recorded. The procedure
was repeated two additional times, and results were recorded.
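The prism-adjustment sequence described above can be sketched as a small simulation. This is only an illustration in Python under stated assumptions: the simulated subject, the sign convention (positive = base-in, so a positive result corresponds to exo), and the 1 Δ alignment tolerance are all hypothetical and are not part of the study protocol.

```python
def simulated_response(prism, true_phoria, tolerance=1.0):
    """Hypothetical subject: reports where the upper line appears.

    prism and true_phoria are in prism diopters, positive = base-in.
    """
    offset = prism - true_phoria
    if abs(offset) <= tolerance:
        return "aligned"
    return "right" if offset > 0 else "left"


def von_graefe_sketch(true_phoria, start_prism=12.0, max_flashes=50):
    """Follow the flash sequence: 4-diopter steps down while the upper
    line is seen to the right, then 2-diopter steps until alignment."""
    prism = start_prism
    response = simulated_response(prism, true_phoria)
    flashes = 1
    # Coarse phase: reduce base-in by 4 until the line is no longer "right".
    while response == "right" and flashes < max_flashes:
        prism -= 4.0
        response = simulated_response(prism, true_phoria)
        flashes += 1
    # Fine phase: 2-diopter steps toward the reported alignment.
    while response != "aligned" and flashes < max_flashes:
        prism += 2.0 if response == "left" else -2.0
        response = simulated_response(prism, true_phoria)
        flashes += 1
    return prism


print(von_graefe_sketch(6.0))  # hypothetical subject with 6 exo
```

Because the prism is only ever set to even values, the endpoint in this sketch can differ from the simulated subject's true value by up to the assumed tolerance, which is one simple way to see how step size alone contributes to measurement differences.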
von Graefe Fusional Vergence Tests
The initial setup for the von Graefe fusional vergence measures
was the same as for the von Graefe heterophoria measurement
except the fixation target was a two letters by five letters block of
20/30 reduced Snellen letters.
Before testing, the subject was shown a three-picture demon-
stration of the test responses. The first picture had the block of
letters in focus. The subject was told, “This is what I mean by
clear.” The second picture had the block of letters photographically
blurred to a level of 0.50 to 0.75 D to simulate the first sustained
blur point. The subject was told, “This is what I mean by blurred.”
The third picture had two blocks of letters, one printed on paper and
one overlaid on a plastic sheet. The subject was told, “This is what
I mean by double” (as the examiner slowly slid the targets apart).
“I will separate them a little more, like this...” (as the examiner slid
the targets so they were distinctly separate). “Then tell me when the
two blocks come together into one block, like this...” (as the examiner
slid the two targets together until they were one).
The subject was positioned behind the phoropter and was in-
structed to “Read aloud the letters in the top row of the block of
letters you see in front of you.” Then the subject was instructed, “I
want you to say ‘now or blur’ when the letters become blurred as I
showed you in the picture.” Base-in prism was added at the
approximate rate of 4 Δ/s.
Once blur was reported or if no blur was reported, the subject
was instructed, “I want you to say ‘now or double’ when the block
of letters are seen as double as I showed you in the last picture.”
Once again, base-in prism was added at the approximate rate of
4 Δ/s until double vision was reported. Once diplopia was reported,
the subject was instructed, “Now I want you to tell me when you
see only one block of letters. The letters may either be clear or
blurred, but they must be single.” Base-in prism was reduced at the
approximate rate of 4 Δ/s until single vision was reported. The results
were recorded in prism diopters for the blur, break, and recovery
findings. A 20-s “wait” period was used between fusional vergence
measurements. PFV was measured next using base-out prisms with
instructions as above. In alternate order, NFV and then PFV were
repeated two additional times and recorded.
TABLE 7.
Summary of the typical differences and worst-case differences that
clinicians might expect for the four binocular measures.a

             Typical Difference                 Worst-Case Difference
             Within Session   Between Session   Within Session   Between Session
NH           1–2 Δ            1–2 Δ             2–4 Δ            7–8 Δ
PFV-break    2–5 Δ            4 Δ               6–12 Δ           12–14 Δ
NPC          1 cm             1 cm              2–3 cm           5 cm
AA           2 D              1–2 D             5 D              5–10 D

a NH, near heterophoria; PFV-break, positive fusional vergence
break point; NPC, near point of convergence; AA, accommodative
amplitude. Worst-case difference: within session = R95, between
session = COR (or LoA). Typical difference: within session = mean
range, between session = MAD.
Nearpoint of Convergence Test
The accommodative rule was used with a fixation target consist-
ing of a single vertical column of five 20/30 reduced Snellen letters.
Subjects wore their habitual spectacle or contact lens prescription
during the testing.
The subject was given the following instructions: “This proce-
dure is designed to measure your ability to converge your eyes; that
is, turn your eyes in toward your nose. Look directly at the line of
letters on this card (examiner pointed to target card on accommo-
dative rule) with both eyes open as the card is moved toward you.
The image may appear to blur. That is okay. However, if you see
the letter double (that is, split into two), say ‘two.’ I will then pull
the card back. If you see the two images join into a single image
again, say ‘one.’ This procedure will be repeated three times.”
The examiner sat slightly to the side of the subject and viewed
the subject from a slightly elevated position. The examiner held the
accommodative rule in the horizontal position with the ruler po-
sitioned against the middle of the subject’s forehead approximately
1 cm above the eyebrow line and tested convergence along the
anterior-posterior or z axis. The target card was started at
approximately 20 cm. The target card was moved along the
accommodative rule in a smooth linear manner at the rate of approximately 1
cm/s toward the subject. The slider was stopped when the subject’s
eyes were observed to fail to converge or when the subject reported
diplopia. The target card was stopped at this point, and the subject
was asked if the images remained double. If the images remained
double or the subject remained strabismic relative to the target, the
centimeter position of the card on the accommodative rule was
recorded. If the images became fused into one or the subject was
observed to be bifixating the target, the target was moved toward
the subject’s face until the eyes were objectively observed to fail to
converge or the subject reported diplopia. This cycle was contin-
ued until the subject remained diplopic or strabismic relative to the
target.
The target was then moved away from the subject at approxi-
mately 1 cm/s until the eyes were observed to reestablish bifixation
or the subject reported the two images fused into a single image.
The centimeter position of the card on the accommodative rule
was recorded. The target card was moved back to the starting
position of 20 cm, and the subject was given a 10-s rest period
before starting the next measurement. The procedure was repeated
two additional times and recorded.
Pushup Method for Monocular Accommodative
Amplitude
The accommodative rule was used with a fixation target consist-
ing of a single vertical column of five 20/30 reduced Snellen letters.
Subjects wore their habitual spectacle or contact lens prescription
during the testing.
The subject was given the following instructions: “This proce-
dure is designed to measure your ability to focus your eyes on a
target that is slowly moved closer to your eyes. I am going to cover
your left eye with this patch. Look directly at the line of letters on
this card (pointing to the target card on the accommodative rule) as
the card is moved toward you. Tell me when the target first be-
comes blurry. This is what I mean by blurry (demo card shown).
This procedure will be repeated three times.”
The examiner sat in front of the subject and viewed the
subject from a slightly elevated position. The examiner held the
accommodative rule in the horizontal position with the ruler
positioned against the subject’s forehead (approximately 1 cm
above the eyebrow line) and above the right eye. The examiner
tested the accommodative amplitude of the right eye only. The
target card was started at approximately 20 cm. The target card
was moved along the accommodative rule in a smooth linear
manner at the rate of approximately 1 cm/s toward the subject’s
face. The target was stopped when the subject first reported that
the letters were blurry. The subject was then asked if the letters
remained blurry or became clear. If the letters remained blurry,
the centimeter position of the target on the accommodative rule
was recorded. If the print became clear, the target was moved
toward the subject until the print first became blurry. This cycle
was continued until the subject reported a sustained blur. The
target was moved back to the starting position of approximately
20 cm, and the subject was given a 10-s rest period before
starting the next measurement. The procedure was repeated two
additional times and recorded.
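Because the pushup endpoint is recorded in centimeters, it has to be converted to diopters to express an amplitude. The Python sketch below uses the standard reciprocal relationship (diopters = 100 / distance in cm). It assumes no further correction is applied to the recorded distance, which may differ from the conversion actually used in this study. The function name and example readings are hypothetical.

```python
def cm_to_diopters(distance_cm):
    """Convert a near-point distance in centimeters to diopters."""
    if distance_cm <= 0:
        raise ValueError("distance must be positive")
    return 100.0 / distance_cm


# Three hypothetical pushup readings (cm) from one session.
readings_cm = [8.0, 10.0, 12.5]
amplitudes_d = [cm_to_diopters(cm) for cm in readings_cm]
print(amplitudes_d)  # [12.5, 10.0, 8.0]
```

The reciprocal relationship means that small changes in the recorded position at close distances produce large dioptric changes: in this example, a shift from 10 cm to 8 cm changes the computed amplitude by 2.5 D.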
ACKNOWLEDGMENTS
The Convergence Insufficiency and Reading Study (CIRS) group is Michael
W. Rouse, Leslie Hyman, Mohamed Hussein, Harold Solan, Eric Borsting,
Susan Cotter, David Grisham, Leonard Press, and Mitchell Scheiman.
Received June 26, 2000; revision received December 20, 2001.
REFERENCES
1. Rouse MW, Borsting E, Hyman L, Hussein M, the Convergence
Insufficiency and Reading Study (CIRS) group. Pilot study to evalu-
ate convergence insufficiency in a school-aged population. Optom
Vis Sci 1995;72(Suppl):218.
2. Rouse MW, Hyman L, Hussein M, Solan H, the Convergence Insuf-
ficiency and Reading Study (CIRS) Group. Frequency of conver-
gence insufficiency in optometry clinic settings. Optom Vis Sci 1998;
75:88–96.
3. Rouse MW, Borsting E, Hyman L, Hussein M, Cotter SA, Flynn M,
Scheiman M, Gallaway M, De Land PN, the Convergence Insuffi-
ciency and Reading Study (CIRS) Group. Frequency of convergence
insufficiency among fifth and sixth graders. Optom Vis Sci 1999;76:
643–9.
4. Streiner DL, Norman GR. Health Measurement Scales: a Practical
Guide to Their Development and Use. New York: Oxford University
Press, 1995:104–27.
5. Hirsch MJ, Bing LB. The effect of testing method on values obtained
for phoria at forty centimeters. Am J Optom Arch Am Acad Optom
1948;25:407–16.
6. Morgan MW. The reliability of clinical measurements with special
reference to distance heterophoria. Am J Optom Arch Am Acad Op-
tom 1955;32:167–79.
7. Rainey BB, Schroeder TL, Goss DA, Grosvenor TP. Inter-examiner
repeatability of heterophoria tests. Optom Vis Sci 1998;75:719–26.
8. Larson WL. Vergence breaks with a stepping prism. Am J Optom
Arch Am Acad Optom 1972;49:569–74.
9. Sheedy JE, Saladin JJ. Validity of diagnostic criteria and case analysis
in binocular vision disorders. In: Schor CM, Ciuffreda KJ, eds.
Vergence Eye Movements: Basic and Clinical Aspects. Boston:
Butterworth, 1983:517–40.
10. Brozek J, Simonson E, Bushard JW, Peterson HJ. Effects of practice
and the consistency of repeated measurements of accommodation
and vergence. Am J Ophthalmol 1948;31:191–8.
11. Penisten DK, Hofstetter HW, Goss DA. Reliability of rotary prism
fusional vergence ranges. Optometry 2001;72:117–22.
12. Feldman JM, Cooper J, Carniglia P, Schiff FM, Skeete JN. Compar-
ison of fusional ranges measured by Risley prisms, vectograms, and
computer orthopter. Optom Vis Sci 1989;66:375–82.
13. Rosenfield M, Cohen AS. Repeatability of clinical measurements of the am-
plitude of accommodation. Ophthalmic Physiol Opt 1996;16:247–9.
14. Bland JM, Altman DG. Statistical methods for assessing agreement be-
tween two methods of clinical measurement. Lancet 1986;1:307–10.
15. Chen AH, O’Leary DJ. Validity and repeatability of the modified
push-up method for measuring the amplitude of accommodation.
Clin Exper Optom 1998;81:63–71.
16. Fleiss JL. The Design and Analysis of Clinical Experiments. New
York: Wiley, 1986.
17. D’Agostino RB, Stephens MA. Tests for the Normal Distribution.
New York: Marcel Dekker, 1986:367–419.
18. Zadnik K, Mutti DO, Bullimore MA. Use of statistics for comparing
two measurement methods. Optom Vis Sci 1994;71:539–41.
19. Bland JM, Altman DG. A note on the use of the intraclass correlation
coefficient in the evaluation of agreement between two methods of
measurement. Comput Biol Med 1990;20:337–40.
20. Shoukri MM, Pause CA. Statistical Methods for Health Sciences,
2nd ed. Boca Raton, FL: CRC Press, 1999:19–42.
Michael W. Rouse
Southern California College of Optometry
2575 Yorba Linda Blvd.
Fullerton, California 92831
e-mail: mrouse@scco.edu
264 Classification of Convergence Insufficiency—Rouse et al.
Optometry and Vision Science, Vol. 79, No. 4, April 2002