This document discusses using eye movement data to infer image relevance in a content-based image retrieval (CBIR) system. An experiment was conducted in which 10 participants viewed images from 101 categories to perform image searching tasks. Eye movement data including fixation duration, fixation count, and number of revisits were analyzed. The results show that fixation duration and fixation count were significantly higher for positively relevant images than for irrelevant images, suggesting that eye movement data could provide an implicit form of relevance feedback to improve image retrieval systems.
A decision tree was then developed to predict user feedback during the tasks from the eye tracking measures. It achieved over 87% accuracy, demonstrating the potential of natural eye movement as a robust source of relevance feedback for bridging the semantic gap in content-based image retrieval.
…a concrete and substantial foundation for involving natural eye movement as a robust RF source [Zhou and Huang 2003].

The rest of the paper is organized as follows. Section 2 introduces the experimental design and setting for the relevance feedback tasks and the corresponding eye movement data collection. In Section 3, we report a thorough investigation of fixation duration, fixation count and the number of revisits for the prediction of relevant images; ANOVA tests are applied to these factors to reveal their significance and interconnections. Section 4 proposes a decision tree model to predict the user's input during the image searching tasks. Finally, we conclude the results and propose future work.

2 Design of Experiments

2.1 Task Setup

We study an image searching task which reflects the kinds of activities occurring in a complete CBIR system. In total, 882 images are randomly selected from 101 object categories. The image set was obtained by collecting images through the Google image search engine [Li 2005]. The design and an example of the searching task interface are shown in Fig. 1. On the top left is the query image. Twenty candidate images are arranged in a 4x5 grid display. All of the images are from 101 categories such as landscapes, animals, buildings, human faces, and home appliances. The red blocks in Fig. 1(a) denote the locations of the positive images in Fig. 1(b) (Class No. 22: Pyramid). The others are negative images, and their image classes differ from each other; that is, apart from the query image's category, no two images in the grid are from the same category. The candidate images in one searching stimulus are randomly arranged.

Figure 1. Image searching stimulus. (a) the layout of the searching stimulus with 5 positive images; (b) an example.

Such a simulated relevance feedback task asks each participant to use his or her eyes to locate the positive images on each stimulus. On locating a positive image, the participant selects the target by fixating on it for a short period of time. A set of the task is composed of 21 such stimuli, whose numbers of positive images vary from 0 to 20. Thus, a task set contains 21x21 = 441 images, and the total numbers of negative and positive images are equal (210 images each).

2.2 Apparatus and Procedure

Eye tracking data are collected by a Tobii X120 eye tracker, whose accuracy is α = 0.5° with drift β = 0.3°. Each candidate image has a resolution of 300x300 pixels, and thus an image stimulus has 1800x1200 pixels. Each stimulus is displayed on the screen at a viewing distance of R = 600 mm; the screen's resolution is 1920x1280 pixels and its pixel pitch is h = 0.264 mm. Hence the output uncertainty is only R·tan(α + β)/h ≈ 30 pixels, which ensures that the error of the gaze data is no larger than 1% of the area of each candidate image.

Ten participants took part in the study, four females and six males, aged from 20 to 32, all with an academic background. All of them are proficient computer users, and half of them have had experience of using an eye tracking system. Their vision is either normal or corrected to normal. The participants were asked to complete two sets of the above-mentioned image searching tasks, and the gaze data were recorded at a 60 Hz sampling rate. Afterwards the participants were asked to indicate which images they had chosen as positive images, to ensure the accuracy of the further analysis of their eye movement data. The eye tracker is non-intrusive and allows a 300x220x300 mm free head movement space. Distinct candidate images and positive image locations are ensured within and between the task sets; in other words, no two images are the same and no two stimuli have the same positive image locations. This reduces memory effects and simulates a natural relevance feedback situation.

3 Analysis of Gaze Data in Image Searching

Raw gaze data are preprocessed by finding the fixations with the built-in filter provided by Tobii Technology. The filter maps a series of raw coordinates to a single fixation if the coordinates stay sufficiently long within a sphere of a given radius. We used an interval threshold of 150 ms and a radius of 1° visual angle.

3.1 Fixation Duration and Fixation Count

The main features used in eye-tracking-related information retrieval are fixations and saccades [Jacob and Karn 2003]. Two groups of metrics derived from the fixation, fixation duration and fixation count, are thoroughly studied to support the possibility of inferring the relevance of images from eye movements [Goldberg et al. 2002; Gołofit 2008]. Suppose that FDP(m) and FDN(m) are the fixation durations on the positive and the negative images observed by subject m, respectively, and FCP(m) and FCN(m) are the fixation counts on the positive and the negative images observed by subject m, respectively. In our searching task, FDP(m) and FDN(m) are defined as

    FDP(m) = Σ_{i,j,k} FD^m_{i,j,k} · sgn(x^m_{i,j,k}) / Σ_{i,j,k} sgn(x^m_{i,j,k}),

    FDN(m) = Σ_{i,j,k} FD^m_{i,j,k} · (1 − sgn(x^m_{i,j,k})) / Σ_{i,j,k} (1 − sgn(x^m_{i,j,k})),    (1)

where i = 1, 2, …, 20 indexes the image candidate in each searching stimulus interface; j = 1, 2, …, 21 indexes the stimulus in each searching task (it also corresponds to the number of positive images in the current stimulus); k = 1, 2 denotes the task set; m = 1, 2, …, 10 represents the subject; and sgn(x) is the signum function. Consequently, FD^m_{i,j,k} is the fixation duration on the i-th image candidate of the j-th stimulus of the k-th task set from subject m, and

    x^m_{i,j,k} = 1 if subject m regards candidate image i as positive,
                  0 if subject m regards candidate image i as negative.

In a similar manner, FCP(m) and FCN(m) are defined as

    FCP(m) = Σ_{i,j,k} FC^m_{i,j,k} · sgn(x^m_{i,j,k}) / Σ_{i,j,k} sgn(x^m_{i,j,k}),

    FCN(m) = Σ_{i,j,k} FC^m_{i,j,k} · (1 − sgn(x^m_{i,j,k})) / Σ_{i,j,k} (1 − sgn(x^m_{i,j,k})),    (2)

where FC^m_{i,j,k} is the fixation count on the i-th image candidate of the j-th stimulus of the k-th task set from subject m. The two pairs of fixation-related variables were monitored and recorded during the experiment.
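Since sgn(x) is simply x for x ∈ {0, 1}, Eqs. (1) and (2) reduce to averages over the positively and negatively marked candidates. A minimal Python sketch, assuming per-candidate records of (fixation duration, fixation count, subject's mark) are available (this data layout is our assumption, not the paper's):

```python
# Sketch (not the authors' code): computing FDP(m), FDN(m), FCP(m), FCN(m)
# for one subject m, following Eqs. (1)-(2).
# Each record is (fd, fc, x): total fixation duration in seconds,
# fixation count, and x = 1 if the subject marked the candidate positive, else 0.

def fixation_stats(records):
    """Return (FDP, FDN, FCP, FCN) for one subject's records."""
    pos = [(fd, fc) for fd, fc, x in records if x == 1]
    neg = [(fd, fc) for fd, fc, x in records if x == 0]
    fdp = sum(fd for fd, _ in pos) / len(pos)   # Eq. (1), positive images
    fdn = sum(fd for fd, _ in neg) / len(neg)   # Eq. (1), negative images
    fcp = sum(fc for _, fc in pos) / len(pos)   # Eq. (2), positive images
    fcn = sum(fc for _, fc in neg) / len(neg)   # Eq. (2), negative images
    return fdp, fdn, fcp, fcn

# Toy data: four candidates viewed by one subject.
demo = [(1.5, 3, 1), (0.5, 1, 0), (1.0, 2, 1), (0.25, 1, 0)]
print(fixation_stats(demo))  # (1.25, 0.375, 2.5, 1.0)
```

In the actual experiment the sums run over all candidates i, stimuli j and task sets k seen by the subject, as in the equations above.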
The average values and standard deviations for the ten participants are summarized in Table 1.

Table 1 Statistics on the fixation duration and fixation count on positive and negative images

Sub.   FDP(m)        FDN(m)        FCP(m)    FCN(m)
1      1.410±1.081   0.415±0.481   2.5±1.9   1.3±1.3
2      1.332±0.394   0.283±0.247   2.7±1.4   1.2±0.9
3      2.582±1.277   0.418±0.430   5.6±3.3   1.7±1.5
4      0.805±0.414   0.356±0.328   2.4±1.2   1.5±1.2
5      1.154±0.484   0.388±0.284   2.6±1.4   1.5±1.0
6      1.880±0.926   0.402±0.338   3.0±1.9   1.4±1.0
7      0.987±0.397   0.166±0.283   1.7±0.8   0.6±0.7
8      0.704±0.377   0.358±0.254   2.2±1.1   1.3±0.9
9      1.125±0.674   0.329±0.403   3.0±2.0   1.4±1.5
10     1.101±0.444   0.392±0.235   2.7±1.3   1.5±0.8
AVG.   1.308±0.891   0.351±0.345   2.8±2.0   1.3±1.1

Analysis of variance (ANOVA) tests were performed to find out whether there are discriminating visual behaviors between the observation of positive and negative images. Given the individual differences in eye movements, we designed two groups of two-way ANOVA among three factors: test subject, fixation duration and fixation count. The results are shown in Table 2.

Table 2 ANOVA test results among three factors: test subject, fixation duration and fixation count.

GROUP I
Factor                 Levels                    Test result
(A) Test Subjects      10 levels (10 subjects)   F(9,9) = 1.26, p < 0.37
(B) Fixation Duration  2 levels (FDP & FDN)      F(1,9) = 32.84, p < 0.0003

GROUP II
Factor                 Levels                    Test result
(A) Test Subjects      10 levels (10 subjects)   F(9,9) = 2.03, p < 0.15
(B) Fixation Count     2 levels (FCP & FCN)      F(1,9) = 28.28, p < 0.0005

As illustrated in Table 2, both fixation duration and fixation count revealed significant effects between positive and negative images during the simulated relevance feedback tasks. Concretely speaking, the fixation durations on each positive image across all subjects (1.30 seconds) are longer than those on negative images (0.35 seconds). Correspondingly, the analysis of fixation count produces similar results: subjects visit a positive image more often (2.8 times) than a negative one (1.3 times). On the other hand, the variation among subjects has no significant effect in either group (in GROUP I, 0.37 > α = 0.05; in GROUP II, 0.15 > α = 0.05).

3.2 Number of Revisits

A revisit is defined as a re-fixation on an AOI previously fixated. Much human-computer interaction and usability research shows that re-fixation on, or revisiting of, a target may indicate special interest in the target. Therefore, the analysis of revisits during the relevance feedback process may reveal the correlation between the eye movement pattern and positive image candidates.

Figure 2 shows the general status of the overall visit frequency (no. of revisits = no. of visits − 1) throughout the whole image searching task. We can see that (1) some of the candidate images are never visited, which indicates the use of pre-attentive vision at the very beginning of the visual search [Salojärvi et al. 2004]; during the pre-attentive process, all the candidate images are examined to decide the successive fixation locations; and (2) in our experiments, revisits happen on both positive and negative images: the majority of images are visited just once, while some are revisited during the image searching.

Figure 2 The total revisit histogram. The X-axis denotes the number of re-fixations and the Y-axis is the corresponding count of image candidates. [Bars: No Visit, 2149; 1 time, 878; 2 times, 403; 3 times, 306; 4 times, 119; 5 times, 65; >6 times, 80.]

Table 3 Overall revisits on positive and negative images

A1   1     2     3     4    5    6    >7
A2   549   196   88    55   34   13   27
A3   329   110   31    10   3    2    1
A4   878   306   119   65   37   15   28
A5   63%   64%   74%   85%  92%  87%  100%

A1 = the number of revisits on an image candidate; A2 = revisit counts on positive images; A3 = revisit counts on negative images; A4 = the total number of revisits; A5 = the percentage of the total revisits occurring on positive images.

To compare with Oyekoya and Stentiford's work [2006], we investigate whether the revisit count has a different effect on positive and on negative image candidates over all the participants (as shown in Table 3). When the revisit count is ≥ 3, the result of a one-way ANOVA is significant, with F(1,8) = 5.73, p < 0.044. That is to say, the probability that a revisit falls on a positive image increases with the revisit count. For example, when an image is revisited more than three times, it has a very high probability (over 74%) of being a positive image candidate. As a result, the number of revisits is also a feasible implicit relevance feedback signal to drive an image retrieval engine.

4 Feature Extraction and Results

The primary focus of this paper is on evaluating the possibility of inferring the relevance of images from eye movement data. Features such as fixation duration, fixation count and the number of revisits have shown discriminating power between positive and negative images. Consequently, we composed a simple set of 11 features (f_1, f_2, …, f_11), an eye movement feature vector, to predict the positive images in each returned 4x5 image candidate set in the simulated relevance feedback task, where j = 1, 2, …, 20 denotes the number of positive images in the current stimulus and m = 1, 2, …, 10 represents the subject. The features f_1, …, f_11 are listed in Table 4, where i = 1, …, 20 and FL_i = FD_i/FC_i.
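As a minimal sketch of the per-candidate feature extraction (the function name and record layout are our assumptions, not the authors' code), each image candidate contributes its fixation duration, fixation count, derived fixation length FL_i = FD_i/FC_i, and revisit count:

```python
# Sketch (our assumed data layout, not the authors' code): building the
# per-candidate eye-movement features described in Table 4.
# For candidate i: fd = total fixation duration (s), fc = fixation count,
# r = number of revisits.

def candidate_features(fd, fc, r):
    """Return (FD_i, FC_i, FL_i, R_i) for one image candidate."""
    fl = fd / fc if fc > 0 else 0.0   # FL_i = FD_i / FC_i (mean fixation length)
    return (fd, fc, fl, r)

# One 4x5 stimulus yields a feature row per candidate for the classifier.
stimulus = [(1.5, 3, 1), (0.5, 1, 0), (2.0, 4, 2)]  # toy data, 3 of 20 candidates
matrix = [candidate_features(fd, fc, r) for fd, fc, r in stimulus]
print(matrix[0])  # (1.5, 3, 0.5, 1)
```

Candidates that were never fixated (fc = 0) are given a fixation length of 0.0 here; how the paper handles unvisited candidates is not specified.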
Table 4 Features used in relevance feedback to predict positive images

Feature             Description
FD_i                Fixation duration on the i-th image inside the 4x5 image candidate set interface
FC_i                Fixation count on the i-th image inside the 4x5 image candidate set interface
FL_i = FD_i/FC_i    Fixation length (mean fixation duration) on the i-th image inside the 4x5 image candidate set interface
R_i                 Number of revisits to the i-th image inside the 4x5 image candidate set interface

Different from Klami et al.'s work [Klami et al. 2008], we use a decision tree (DT) as a classifier to automatically learn the prediction rules. The data set mentioned in Section 2 is divided into a training set and a testing set to evaluate the prediction accuracy. Two different methods are used to train the DT, as illustrated in Table 5 (the prediction precisions are 87.3% and 93.5%, respectively), and an example of predicted positive images from a 4x5 candidate set is shown in Figure 3.

Table 5 Training methods and testing results of decision trees

Method I
Training Data Set      {1, 2, …, 5}
Testing Data Set       {5, 6, …, 10}
Prediction Precision   87.3%

Method II
Training Data Set      {1, 3, 5, …, 19}
Testing Data Set       {2, 4, 6, …, 20}
Prediction Precision   93.5%

Figure 3 An example of predicted positive images from a 4x5 candidate set in the simulated relevance feedback task. The query image is "hedgehog", and the DT model returned 8 predicted positive images (in red frames) based on the 11-feature vector, with 100% accuracy.

5 Conclusion and Further Work

An eye tracking system can potentially be integrated into a CBIR system as a more efficient input mechanism for implementing the user's relevance feedback process. In this paper, we mainly concentrate on a group of fixation-related measurements which show static eye movement patterns. In fact, dynamic characteristics such as saccades and scan paths can also manifest human organizational behavior and decision processes, revealing the pre-attention and cognition processes of a human being while viewing an image. In our further work, we will continue to develop a more comprehensive study which includes both the static and dynamic features of eye movements. At its origin, image viewing is a unity of conscious and unconscious visual cognition behavior, which can be used not only for relevance feedback but also as a new source of image representation. Human image viewing automatically bridges low-level features, such as color, texture, shape, and spatial information, to human attention, such as AOIs. As a result, eye tracking data can be a rich new source for improving image representation [Wu et al. 2009]. Our future work is to develop an eye-tracking-based CBIR system in which human beings' natural eye movements will be effectively exploited in the modules of image representation, similarity measurement and relevance feedback.

Acknowledgments

The work reported in this paper is substantially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project code: PolyU 5141/07E) and the PolyU Grant (Project code: 1-BBZ9).

References

TAO, D., TANG, X. AND LI, X. 2008. Which Components are Important for Interactive Image Searching? IEEE Transactions on Circuits and Systems for Video Technology 18, 3-11.

FLICKNER, M., SAWHNEY, H., NIBLACK, W., ASHLEY, J., HUANG, Q., DOM, B., GORKANI, M., HAFNER, J., LEE, D., PETKOVIC, D., STEELE, D. AND YANKER, P. 1995. Query by Image and Video Content: The QBIC System. Computer 28, 23-32.

GOLDBERG, J.H., STIMSON, M.J., LEWENSTEIN, M., SCOTT, N. AND WICHANSKY, A.M. 2002. Eye tracking in web search tasks: design implications. In ETRA '02: Proceedings of the 2002 Symposium on Eye Tracking Research & Applications, ACM, New York, NY, USA, 51-58.

GOŁOFIT, K. 2008. Click Passwords Under Investigation. In Computer Security - ESORICS 2007, 343-358.

JACOB, R. AND KARN, K. 2003. Eye Tracking in Human-Computer Interaction and Usability Research: Ready to Deliver the Promises. In The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Research, HYONA, RADACH AND DEUBEL, Eds. Elsevier Science, Oxford, England.

KLAMI, A., SAUNDERS, C., DE CAMPOS, T.E. AND KASKI, S. 2008. Can relevance of images be inferred from eye movements? In MIR '08: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, ACM, New York, NY, USA, 134-140.

LI, F. 2005. Visual recognition: computational models and human psychophysics. PhD thesis, California Institute of Technology.

LIU, D., HUA, K., VU, K. AND YU, N. 2006. Fast Query Point Movement Techniques with Relevance Feedback for Content-Based Image Retrieval. In Advances in Database Technology - EDBT 2006, 700-717.

OYEKOYA, O. AND STENTIFORD, F. 2004. Exploring Human Eye Behaviour using a Model of Visual Attention. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), Volume 4, IEEE Computer Society, Washington, DC, USA, 945-948.

OYEKOYA, O. AND STENTIFORD, F. 2006. Perceptual Image Retrieval Using Eye Movements. In Advances in Machine Vision, Image Processing, and Pattern Analysis, 281-289.

SALOJÄRVI, J., PUOLAMÄKI, K. AND KASKI, S. 2004. Relevance feedback from eye movements for proactive information retrieval. In Workshop on Processing Sensory Information for Proactive Systems (PSIPS 2004), 14-15.

WU, L., HU, Y., LI, M., YU, N. AND HUA, X.-S. 2009. Scale-Invariant Visual Language Modeling for Object Categorization. IEEE Transactions on Multimedia 11, 286-294.

ZHOU, X.S. AND HUANG, T.S. 2003. Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems 8, 536-544.