Lucene and Juru at Trec 2007: 1-Million Queries Track
Doron Cohen, Einat Amitay, David Carmel
IBM Haifa Research Lab
Haifa 31905, Israel
Email: {doronc,einat,carmel}@il.ibm.com
ABSTRACT
Lucene is an increasingly popular open source search library. However, our search-quality experiments on TREC data with out-of-the-box Lucene indicated inferior quality compared to other systems participating in TREC. In this work we investigate the differences in measured search quality between Lucene and Juru, our home-brewed search engine, and show how Lucene scoring can be modified to improve its measured search quality for TREC.
Our scoring modifications to Lucene were trained over the 150 topics of the terabyte tracks. Evaluations of these modifications with the new, sample-based 1-Million Queries Track measures, NEU-Map and ε-Map, indicate the robustness of the scoring modifications: modified Lucene performs well when compared to stock Lucene and when compared to other systems that participated in the 1-Million Queries Track this year, both for the training set of 150 queries and for the new measures. As such, this also supports the robustness of the new measures tested in this track.
This work reports our experiments and results and describes the modifications involved, namely normalizing term frequencies, a different choice of document length normalization, phrase expansion, and proximity scoring.
1. INTRODUCTION
Our experiments this year for the TREC 1-Million Queries Track focused on the scoring function of Lucene, an Apache open-source search engine [4]. We used our home-brewed search engine, Juru [2], to compare with Lucene on the track's 10K queries over the gov2 collection.
Lucene¹ is an open-source library for full-text indexing and searching in Java. The default scoring function of Lucene implements the cosine similarity function, while search terms are weighted by the tf-idf weighting mechanism. Equation 1 describes the default Lucene score for a document d with respect to a query q:
Score(d, q) = \sum_{t \in q} tf(t, d) \cdot idf(t) \cdot boost(t, d) \cdot norm(d)    (1)

where
• tf(t, d) is the term frequency factor for the term t in the document d, computed as \sqrt{freq(t, d)}.
• idf(t) is the inverse document frequency of the term, computed as 1 + \log \frac{numDocs}{docFreq(t) + 1}.
• boost(t.field, d) is the boost of the field of t in d, as set during indexing.
• norm(d) is the normalization value, given the number of terms within the document.
For more details on Lucene scoring see the official Lucene scoring² document.

¹ We used Lucene Java (http://lucene.apache.org/java/docs), the first and main Lucene implementation. Lucene defines a file format, and so there are ports of Lucene to other languages. Throughout this work, "Lucene" refers to "Apache Lucene Java".
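As a practical note, the factors above are supplied by Lucene's Similarity class, and the same Similarity instance must be set on both the indexer and the searcher. The following is a minimal sketch against the Lucene 2.3 Java API; the index path and the analyzer are illustrative, and DefaultSimilarity stands in for the alternative implementations discussed in Section 4.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Similarity;

public class SimilaritySetup {
    public static void main(String[] args) throws Exception {
        Similarity sim = new DefaultSimilarity(); // replace with a custom subclass (Section 4)

        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        writer.setSimilarity(sim);   // norms are computed with this Similarity at indexing time
        // ... add documents here ...
        writer.close();

        IndexSearcher searcher = new IndexSearcher("/tmp/index");
        searcher.setSimilarity(sim); // scores are computed with this Similarity at search time
        // ... run queries here ...
        searcher.close();
    }
}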
Our main goal in this work was to experiment with the scoring mechanism of Lucene in order to bring it to the same level as state-of-the-art ranking formulas such as OKAPI [5] and the SMART scoring model [1]. In order to study the scoring model of Lucene in full detail we ran Juru and Lucene over the 150 topics of the terabyte tracks, and over the 10K queries of this year's 1-Million Queries Track. We then modified Lucene's scoring function to include better document length normalization, and a better term-weight setting according to the SMART model. Equation 2 describes the term frequency (tf) and the normalization scheme used by Juru, based on the SMART scoring mechanism, which we followed to modify Lucene's scoring model.
tf(t, d) = \frac{\log(1 + freq(t, d))}{\log(1 + avg(freq(d)))}    (2)

norm(d) = 0.8 \cdot avg(\#uniqueTerms) + 0.2 \cdot \#uniqueTerms(d)
As we later describe, we were able to get better results with a simpler length normalization function that is available within Lucene, though not as the default length normalization.
For both Juru and Lucene we used the same HTML parser to extract the content of the Web documents of the gov2 collection. In addition, the same anchor text extraction process took place for both systems. The anchor data was used for two purposes: (1) to set a document static score according to the number of in-links pointing to it, and (2) to enrich the document content with the anchor text associated with its in-links.
Last, for both Lucene and Juru we used a similar query parsing mechanism, which included stop-word removal, synonym expansion and phrase expansion.
² http://lucene.apache.org/java/2_3_0/scoring.html
In the following we describe these processes in full detail, along with the results both for the 150 topics of the terabyte tracks and for the 10K queries of the 1-Million Queries Track.
2. ANCHOR TEXT EXTRACTION
Extracting anchor text is a necessary task for indexing Web collections: the text surrounding a link is added to the indexed document representing the linked page. With inverted indexes it is often inefficient to update a document once it has been indexed. In Lucene, for example, updating an indexed document involves removing that document from the index and then (re)adding the updated document. Therefore, anchor extraction is done prior to indexing, as a global computation step. Still, for large collections, this is a nontrivial task. While this is not a new problem, describing a timely extraction method may be useful for other researchers.
2.1 Extraction method
Our gov2 input is a hierarchical directory structure of about 27,000 compressed files, each containing multiple documents (or pages), altogether about 25,000,000 documents. Our output is a similar directory structure, with one compressed anchor-text file for each original page-text file. Within each file, anchor texts are ordered the same as page texts, allowing the entire collection to be indexed in a single combined scan of the two directories. We now describe the extraction steps.
• (i) Hash by URL: Input pages are parsed, emitting two types of result lines: page lines and anchor lines. Result lines are hashed into separate files by URL, so that for each page, all anchor lines referencing it are written to the same file as its single page line.
• (ii) Sort lines by URL: This groups together every page line with all anchors that reference that page. Note that sort complexity depends on the file sizes, and hence can be controlled by the hash function used.
• (iii) Directory structure: Using the file path info of the page lines, data is saved in new files, creating a directory structure identical to that of the input.
• (iv) Documents order: Sort each file by the document numbers of its page lines. This allows indexing pages together with their anchors.
Note that the extraction speed relies on serial IO in the splitting steps (i) and (iii), and on sorting files that are not too large in steps (ii) and (iv). A minimal sketch of the hashing step (i) is given below.
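The sketch below only illustrates step (i): routing page and anchor lines into bucket files by target URL, so that a later per-file sort groups each page with all anchors referencing it. The tab-separated line format, the bucket count, and the class name are our own illustrative assumptions, not the exact format used in our runs.

import java.io.*;

public class HashByUrl {
    private static final int BUCKETS = 512; // controls the size of each bucket file
    private final PrintWriter[] out = new PrintWriter[BUCKETS];

    public HashByUrl(File dir) throws IOException {
        for (int i = 0; i < BUCKETS; i++)
            out[i] = new PrintWriter(new BufferedWriter(
                    new FileWriter(new File(dir, "bucket-" + i))));
    }

    private int bucketOf(String url) {
        return (url.hashCode() & 0x7fffffff) % BUCKETS;
    }

    // A "page" line carries the page itself; an "anchor" line carries the text of a
    // link pointing at that URL. Both are routed by the target URL.
    public void emitPage(String url, String docno, String filePath) {
        out[bucketOf(url)].println("P\t" + url + "\t" + docno + "\t" + filePath);
    }

    public void emitAnchor(String targetUrl, String anchorText) {
        out[bucketOf(targetUrl)].println("A\t" + targetUrl + "\t" + anchorText);
    }

    public void close() {
        for (PrintWriter w : out) w.close();
    }
}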
2.2 Gov2 Anchors Statistics
Creating the anchors data for gov2 took about 17 hours on a 2-way Linux machine. Step (i) above ran in 4 parallel threads and took most of the time: 11 hours. The total size of the anchors data is about 2 GB.
We now list some interesting statistics on the anchors data:
• There are about 220 million anchors for about 25 million pages, an average of about 9 anchors per page.
• The maximum number of anchors for a single page is 1,275,245; that many .gov pages reference the page www.usgs.gov.
• 17% of the pages have no anchors at all.
• 77% of the pages have 1 to 9 anchors.
• 5% of the pages have 10 to 99 anchors.
• 159 pages have more than 100,000 anchors.
Obviously, for pages with many anchors only part of the anchors data was indexed.
2.3 Static scoring by link information
The links into a page may indicate how authoritative that page is. We employed a simplistic approach of only counting the number of such in-links, and using that count to boost candidate documents, so that if two candidate pages agree on all quality measures, the page with more incoming links is ranked higher.
Static score (SS) values are linearly combined with textual scores. Equation 3 shows the static score computation for a document d linked by in(d) other pages.

SS(d) = \min\left(1, \frac{in(d)}{400}\right)    (3)
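For concreteness, the sketch below computes Equation 3 and a linear interpolation with a textual score. The interpolation weight LAMBDA is an assumption of ours for illustration; the exact combination weight is not specified here.

public class StaticScore {
    private static final double LAMBDA = 0.1; // assumed interpolation weight (illustrative)

    static double staticScore(int inLinks) {
        return Math.min(1.0, inLinks / 400.0);          // Equation 3
    }

    static double combinedScore(double textualScore, int inLinks) {
        // linear combination of the textual score with the static score
        return (1 - LAMBDA) * textualScore + LAMBDA * staticScore(inLinks);
    }
}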
It is interesting to note that our experiments showed conflicting effects of this static scoring: while SS greatly improves Juru's quality results, SS has no effect with Lucene. To this moment we do not understand the reason for this difference in behavior. Therefore, our submissions include static scoring by in-links count for Juru but not for Lucene.
3. QUERY PARSING
We used a similar query parsing process for both search engines. The terms extracted from the query include single query words (stemmed by the Porter stemmer) that pass the stop-word filtering process, lexical affinities, phrases, and synonyms.
3.1 Lexical affinities
Lexical affinities (LAs) represent the correlation between words co-occurring in a document. LAs are identified by looking at pairs of words found in close proximity to each other. It has been described elsewhere [3] how LAs improve the precision of search by disambiguating terms.
During query evaluation, the query profile is constructed to include the query's lexical affinities in addition to its individual terms. This is achieved by identifying all pairs of words found close to each other in a window of some predefined small size (the sliding window is only defined within a sentence). For each LA=(t1,t2), Juru creates a pseudo posting list by finding all documents in which these terms appear close to each other. This is done by merging the posting lists of t1 and t2. If such a document is found, it is added to the posting list of the LA with all the relevant occurrence information. After creating the posting list, the new LA is treated by the retrieval algorithm as any other term in the query profile.
In order to add lexical affinities into Lucene scoring we used Lucene's SpanNearQuery, which matches spans of the query terms that occur within a given window in the text.
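For illustration, a single lexical affinity (t1, t2) can be expressed roughly as follows against the Lucene 2.3 span query API. The field name "contents" is hypothetical, and the boost mirrors the example query of Section 3.4, where the 4.0 boost is actually applied to the group of LA clauses rather than to each clause.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class LexicalAffinity {
    static SpanNearQuery la(String field, String t1, String t2, int window) {
        SpanQuery[] pair = new SpanQuery[] {
            new SpanTermQuery(new Term(field, t1)),
            new SpanTermQuery(new Term(field, t2))
        };
        // slop = window, inOrder = false: the two terms may appear in either order
        SpanNearQuery q = new SpanNearQuery(pair, window, false);
        q.setBoost(4.0f); // LA matches weighted higher than single-word matches
        return q;
    }
}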
3.2 Phrase expansion
The query is also expanded to include the query text as a phrase. For example, the query 'U.S. oil industry history' is expanded to 'U.S. oil industry history. "U.S. oil industry history"'. The idea is that documents containing the query as a phrase should be favored over other documents. The posting list of the query phrase is created by merging the postings of all terms, considering only documents containing the query terms at adjacent offsets and in the right order. Similarly to the LA weight, which specifies the relative weight between an LA term and a simple keyword term, a phrase weight specifies the relative weight of a phrase term. Phrases are matched in Lucene using the PhraseQuery class.
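A rough sketch of such a phrase clause follows, assuming a hypothetical "contents" field and the slop and boost values of the example in Section 3.4.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhraseExpansion {
    static PhraseQuery phrase(String field, String[] stemmedTerms) {
        PhraseQuery q = new PhraseQuery();
        for (String t : stemmedTerms) {
            q.add(new Term(field, t)); // terms must appear in this order
        }
        q.setSlop(1);      // allows a gap where a stop word was removed
        q.setBoost(0.75f); // phrase matches count slightly less than single words
        return q;
    }
}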
3.3 Synonym Expansion
The 10K queries for the 1-Million Queries Track are strongly related to the .gov domain, hence many abbreviations are found in the query list. For example, the U.S. states are usually referred to by abbreviations (e.g. "ny" for New York, "ca" or "cal" for California). However, in many cases relevant documents refer to the full proper name rather than its acronym. Thus, looking only for documents containing the original query terms will fail to retrieve those relevant documents.
We therefore used a synonym table for common abbreviations in the .gov domain to expand the query. Given a query term with an existing synonym in the table, expansion is done by adding the original query phrase while replacing the original term with its synonym. For example, the query "ca veterans" is expanded to "ca veterans. California veterans".
It is however interesting to note that the evaluations of our Lucene submissions indicated no measurable improvements due to this synonym expansion.
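A minimal sketch of this table-driven expansion is shown below; the table entries and helper names are hypothetical, for illustration only.

import java.util.*;

public class SynonymExpansion {
    private static final Map<String, String> SYNONYMS = new HashMap<String, String>();
    static {
        SYNONYMS.put("ca", "california");
        SYNONYMS.put("ny", "new york");
        SYNONYMS.put("fbi", "federal bureau of investigation");
    }

    // For each query term with a synonym, add a copy of the query with that term replaced.
    static List<String> expand(String query) {
        List<String> variants = new ArrayList<String>();
        variants.add(query);
        String[] terms = query.toLowerCase().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            String syn = SYNONYMS.get(terms[i]);
            if (syn != null) {
                String[] copy = terms.clone();
                copy[i] = syn;
                variants.add(join(copy));
            }
        }
        return variants; // e.g. ["ca veterans", "california veterans"]
    }

    private static String join(String[] terms) {
        StringBuilder sb = new StringBuilder();
        for (String t : terms) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t);
        }
        return sb.toString();
    }
}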
3.4 Lucene query example
To demonstrate our choice of query parsing, for the original topic "U.S. oil industry history", the following Lucene query was created:
oil industri histori
(
spanNear([oil, industri], 8, false)
spanNear([oil, histori], 8, false)
spanNear([industri, histori], 8, false)
)^4.0
"oil industri histori"~1^0.75
The resulting Lucene query illustrates some aspects of our choice of query parsing:
• "U.S." is considered a stop word and was removed from the query text.
• Only stemmed forms of words are used.
• The default query operator is OR.
• Words found in a document up to 7 positions apart form a lexical affinity (8 in this example because of the stopped word).
• Lexical affinity matches are boosted 4 times more than single word matches.
• Phrase matches are counted slightly less than single word matches.
• Phrases allow fuzziness when words were stopped.
For more information see the Lucene query syntax³ document.
4. LUCENE SCORE MODIFICATION
Our TREC quality measures for the gov2 collection revealed that the default scoring⁴ of Lucene is inferior to that of Juru. (Detailed run results are given in Section 6.)
We were able to improve Lucene's scoring by changing the scoring in two areas: document length normalization and term frequency (tf) normalization.
Lucene's default length normalization⁵ is given in Equation 4, where L is the number of words in the document.

lengthNorm(L) = \frac{1}{\sqrt{L}}    (4)

The rationale behind this formula is to prevent very long documents from "taking over" just because they contain many terms, possibly many times. However, a negative side effect of this normalization is that long documents are "punished" too much, while short documents are preferred too much. The first two modifications below remedy this.
4.1 Sweet Spot Similarity
Here we used the document length normalization of "Sweet Spot Similarity"⁶. This alternative Similarity function is available as a Lucene "contrib" package. Its normalization value is given in Equation 5, where L is the number of document words.

lengthNorm(L) = \frac{1}{\sqrt{steepness \cdot (|L - min| + |L - max| - (max - min)) + 1}}    (5)

We used steepness = 0.5, min = 1000, and max = 15,000. This computes to a constant norm for all lengths in the [min, max] range (the "sweet spot"), and smaller norm values for lengths outside this range. Documents shorter or longer than the sweet spot range are "punished".
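A minimal configuration sketch with the parameter values above follows. We assume the setLengthNormFactors(min, max, steepness) setter of the 2.3 contrib class and an illustrative index path; as in Section 1, the same Similarity must also be set on the IndexWriter so that norms are computed consistently.

import org.apache.lucene.misc.SweetSpotSimilarity;
import org.apache.lucene.search.IndexSearcher;

public class SweetSpotSetup {
    public static void main(String[] args) throws Exception {
        SweetSpotSimilarity sim = new SweetSpotSimilarity();
        sim.setLengthNormFactors(1000, 15000, 0.5f); // flat norm inside the sweet spot

        IndexSearcher searcher = new IndexSearcher(args[0]); // path to an existing index
        searcher.setSimilarity(sim);
        // ... run queries here ...
        searcher.close();
    }
}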
4.2 Pivoted length normalization
Our pivoted document length normalization follows the approach of [6], as depicted in Equation 6, where U is the number of unique words in the document and pivot is the average of U over all documents.

lengthNorm(L) = \frac{1}{(1 - slope) \cdot pivot + slope \cdot U}    (6)

For Lucene we used slope = 0.16.
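A sketch of this normalization as a DefaultSimilarity override is given below. One caveat: Lucene's lengthNorm callback supplies the total number of terms in the field, whereas Equation 6 is defined over the number of unique terms, so the sketch is only an approximation and the corpus-wide pivot value must be computed separately and passed in.

import org.apache.lucene.search.DefaultSimilarity;

public class PivotedLengthNormSimilarity extends DefaultSimilarity {
    private static final float SLOPE = 0.16f;
    private final float pivot; // average number of unique terms per document, computed offline

    public PivotedLengthNormSimilarity(float pivot) {
        this.pivot = pivot;
    }

    public float lengthNorm(String fieldName, int numTerms) {
        // numTerms stands in for U, the unique-term count of Equation 6
        return (float) (1.0 / ((1.0 - SLOPE) * pivot + SLOPE * numTerms));
    }
}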
³ http://lucene.apache.org/java/2_3_0/queryparsersyntax.html
⁴ http://lucene.apache.org/java/2_3_0/scoring.html
⁵ http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/search/DefaultSimilarity.html
⁶ http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/misc/SweetSpotSimilarity.html
4.3 Term Frequency (tf) normalization
The tf-idf formula used in Lucene computes tf(t, d) as \sqrt{freq(t, d)}, where freq(t, d) is the frequency of t in d. Let avgFreq(d) denote the average of freq(t, d) over all terms t of document d. We modified the tf computation to take into account the average term frequency in d, similar to Juru, as shown in Equation 7.

tf(t, d) = \frac{\log(1 + freq(t, d))}{\log(1 + avgFreq(d))}    (7)

This distinguishes between terms that highly represent a document and terms that are part of duplicated text. In addition, the logarithmic formula has much smaller variation than Lucene's original square root based formula, and hence the effect of freq(t, d) on the final score is smaller.
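Since Lucene's Similarity.tf(freq) callback does not receive per-document statistics such as avgFreq(d), wiring this change in requires modifying Lucene's scoring code, as noted in Section 7. The standalone sketch below therefore only illustrates the formula itself.

public class AvgTfNormalization {
    // Equation 7: term frequency normalized by the document's average term frequency
    static double tf(int freq, double avgFreqInDoc) {
        return Math.log(1.0 + freq) / Math.log(1.0 + avgFreqInDoc);
    }
}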
4.4 Query fewer fields
Throughout our experiments we saw that splitting the document text into separate fields, such as title, abstract, body, anchor, and then (OR) querying all these fields performs worse than querying a single, combined field. We have not fully investigated the causes for this, but we suspect that this too is due to poor length normalization, especially in the presence of very short fields.
Our recommendation therefore is to refrain from splitting the document text into fields in cases where all the fields are searched at the same time.
While splitting into fields allows easy boosting of certain fields at search time, a better alternative is to boost the terms themselves, either statically by duplicating them at indexing time, or perhaps even better dynamically at search time, e.g. by using term payload values, a new Lucene feature. In this work we took the first, more static approach.
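For illustration, the sketch below builds a document with a single combined field and boosts the title statically by duplicating it; the field names and the duplication factor are hypothetical.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class CombinedFieldDoc {
    static Document build(String title, String body, String anchors) {
        Document doc = new Document();
        // Duplicating the title text is the static, index-time way of boosting it
        String combined = title + " " + title + " " + body + " " + anchors;
        doc.add(new Field("contents", combined, Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }
}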
5. QUERY LOG ANALYSIS
This year's task resembled a real search environment mainly because it provided us with a substantial query log. The log itself provides a glimpse into what users of the .gov domain actually want to find. This is important since log analysis has become one of the better ways to improve search quality and performance.
For example, for our index we created a stopword list from the most frequent terms appearing in the queries, crossed with their frequency in the collection. The most telling characteristic of the queries is the term state, which is ranked as the 6th most frequent query term with 342 mentions, and which is mentioned 10,476,458 times in the collection. This demonstrates the bias of both the collection and the queries toward content pertaining to state affairs. The queries also include over 270 queries containing us, u.s, usa, or the phrase "united states", and nearly 150 queries containing the term federal. The underlying assumption in most of the queries in the log is thus that the requested information describes US government content. This may sound obvious since it is, after all, the gov2 collection; however, it also means that queries containing the terms US, federal, and state actually restrict the results much more than required in a collection dedicated to exactly that content. So removing those terms from the queries makes much sense. We experimented with the removal of those terms and found that the improvement over the test queries is substantial.
Term         Frequency in query log   Frequency in index
of           1228                     Originally stopped
in           838                      Originally stopped
and          653                      Originally stopped
for          586                      Originally stopped
the          512                      Originally stopped
state        342                      10476458
to           312                      Originally stopped
a            265                      Originally stopped
county       233                      3957936
california   204                      1791416
tax          203                      1906788
new          195                      4153863
on           194                      Originally stopped
department   169                      6997021
what         168                      1327739
how          143                      Originally stopped
federal      143                      4953306
health       135                      4693981
city         114                      2999045
is           114                      Originally stopped
national     104                      7632976
court        103                      2096805
york         103                      1804031
school       100                      2212112
with         99                       Originally stopped

Table 1: Most frequent terms in the query log.
The log information allowed us to stop general terms along with collection-specific terms such as US, state, United States, and federal.
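As an illustration of how such a list can be derived, the sketch below counts term frequencies in the query log and keeps terms that are also very frequent in the collection. The thresholds, the "contents" field name, and the file paths are assumptions for illustration.

import java.io.*;
import java.util.*;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class LogStopwords {
    public static void main(String[] args) throws Exception {
        // args[0]: query log, one query per line; args[1]: path to the collection index
        Map<String, Integer> queryCounts = new HashMap<String, Integer>();
        BufferedReader log = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = log.readLine()) != null) {
            for (String t : line.toLowerCase().split("\\s+")) {
                Integer c = queryCounts.get(t);
                queryCounts.put(t, c == null ? 1 : c + 1);
            }
        }
        log.close();

        IndexReader reader = IndexReader.open(args[1]);
        for (Map.Entry<String, Integer> e : queryCounts.entrySet()) {
            if (e.getValue() < 100) continue;              // frequent in the query log
            long collectionFreq = 0;
            TermDocs td = reader.termDocs(new Term("contents", e.getKey()));
            while (td.next()) collectionFreq += td.freq(); // total occurrences in the collection
            td.close();
            if (collectionFreq > 1000000) {                // and frequent in the collection
                System.out.println(e.getKey());            // candidate stop word
            }
        }
        reader.close();
    }
}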
Another issue that arose from using such a diverse log was that we detected over 400 queries containing abbreviations marked by a dot, such as fed., gov., dept., st. and so on. We also found numbers, such as in street addresses or dates of birth, that were sometimes given as digits and sometimes spelled out. Many of the queries also contained abbreviations for state names, which is the accepted written form for referring to states in the US.
For this purpose we prepared a list of synonyms to expand the queries to contain all the possible forms of the terms. For example, for the query "Orange County CA" our system produced two queries, "Orange County CA" and "Orange County California", which were submitted as a single query combining both strings. Overall we used a list of about 250 such synonyms, which consisted of state abbreviations (e.g. RI, Rhode Island), common federal authorities' abbreviations (e.g. FBI, FDA, etc.), and translations of numbers to words (e.g. 1, one, 2, two, etc.).
Although we are aware of the many queries in different languages appearing in the log, mainly in Spanish, we chose to ignore cross-language techniques and submitted them as they are. Another aspect we did not cover, and that could possibly have made a difference, is the processing of over 400 queries that appeared in some form of wh-question format. We speculate that analyzing the syntax of those queries could have provided a good filtering mechanism that might have improved our results.
6. RESULTS
Search quality results of our runs for the 1-Million Queries Track are given in Table 2. For a description of the various options see Section 4. The top results in each column are emphasized. The MAP column refers to the 150 terabyte track queries that were part of the query pool. We next analyze these results.
Figure 1: MAP and NEU-Map
Figure 2: ε-MAP
Figure 3: Precision cut-offs 5, 10, 20
Figure 4: ε-MAP Recomputed
Figure 5: Search time (seconds/query)

Run 1, the only Juru run, was used as a target. All the other runs, 2 to 10, used Lucene.
Runs 1 (Juru) and 9 (Lucene), marked with '(*)', are the only runs in this table that were submitted to the 1-Million Queries Track. Two more Lucene runs were submitted, with variations of synonym expansion and spell correction, but their results were nearly identical to Run 9, and these two runs are therefore not discussed separately in this report. Also note that Lucene Run 10, which was not submitted, outperforms the submitted Lucene runs in all measures but NEU-Map.
Run 2 is the default, out-of-the-box Lucene. This run was used as a baseline for improving Lucene's quality measures. Comparing run 2 with run 1 shows that Lucene's default behavior does not perform very well, with a MAP of 0.154 compared to Juru's MAP of 0.313.
Runs 3, 4, and 5 add proximity (LA), phrase, and both proximity and phrase, respectively, on top of the default Lucene1 configuration (run 2). The results for Run 5 reveal that adding both proximity and phrase to the query evaluation is beneficial, and we therefore pick this configuration as the base for the rest of the evaluation and name it Lucene4. Note that runs 6 to 9 are marked as Lucene4 as well, indicating proximity and phrase in all of them.
Lucene's sweet spot similarity run performs much better than the default Lucene behavior, bringing MAP from 0.214 in run 5 to 0.273 in run 7. Note that the only change here is a different document length normalization. Replacing the document length normalization with a pivoted one in run 6 improves MAP further, to 0.284.
Run 10 is definitely the best run: combining the document length normalization of Lucene's sweet spot similarity with normalizing the tf by the average term frequency in a document. This combination brings Lucene's MAP to 0.306, and even outperforms Juru for the P@5, P@10, and P@20 measures.
Run                                     MAP    ε-Map   NEU-Map  P@5    P@10   P@20   ε-Map Recomputed  sec/q
1. Juru (*)                             0.313  0.1080  0.3199   0.592  0.560  0.529  0.2121            -
2. Lucene1 Base                         0.154  0.0746  0.1809   0.313  0.303  0.289  0.0848            1.4
3. Lucene2 LA                           0.208  0.0836  0.2247   0.409  0.382  0.368  0.1159            5.6
4. Lucene3 Phrase                       0.191  0.0789  0.2018   0.358  0.347  0.341  0.1008            4.1
5. Lucene4 LA + Phrase                  0.214  0.0846  0.2286   0.409  0.390  0.383  0.1193            6.7
6. Lucene4 Pivot Length Norm            0.284  -       -        0.572  0.540  0.503  -                 -
7. Lucene4 Sweet Spot Similarity        0.273  0.1059  0.2553   0.593  0.565  0.527  0.1634            6.9
8. Lucene4 TF avg norm                  0.194  0.0817  0.2116   0.404  0.373  0.370  0.1101            7.8
9. Lucene4 Pivot + TF avg norm (*)      0.294  0.1031  0.3255   0.587  0.542  0.512  0.1904            -
10. Lucene4 Sweet Spot + TF avg norm    0.306  0.1091  0.3171   0.627  0.589  0.543  0.2004            8.0

Table 2: Search Quality Comparison. ((*) marks submitted runs; '-' denotes values not reported.)
The ε-Map and NEU-Map measures introduced in the 1-Million Queries Track support the improvements to Lucene scoring: runs 9 and 10 are the best Lucene runs in these measures as well, with a slight disagreement in the NEU-Map measure, which prefers run 9.
Figures 1 and 2 show that the improvements over stock Lucene are consistent for all three measures: MAP, ε-Map and NEU-Map. Figure 3 demonstrates similar consistency for precision cut-offs at 5, 10 and 20.
Since most of the runs analyzed here were not submitted, we applied the supplied expert utility to generate a new expert.rels file, using all our runs. As a result, the evaluation program that computes ε-Map was able to take into account any new documents retrieved by the new runs; however, this ignored any information provided by any other run. Note: it is not that new unjudged documents were treated as relevant, but rather that the probabilities that a non-judged document retrieved by the new runs is relevant were computed more accurately this way.
The ε-Map Recomputed column in Table 2 shows this re-computation of ε-Map, and, together with Figure 4, we can see that the improvements to Lucene search quality are consistent with the previous results, and run 10 is the best Lucene run.
Finally, we measured the search time penalty for the quality improvements. This is depicted in the sec/q column of Table 2 as well as in Figure 5. We can see that the search time grew by a factor of almost 6. This is a significant cost that some applications would surely not be able to afford. However, it should be noted that in this work we focused on search quality, and so improving search time is left for future work.
7. SUMMARY
We started by measuring Lucene's out-of-the-box search quality for TREC data and found that it is significantly inferior to that of other search engines that participate in TREC, and in particular compared to our search engine Juru.
We were able to improve Lucene's search quality as measured for TREC data by (1) adding phrase expansion and proximity scoring to the query, (2) a better choice of document length normalization, and (3) normalizing tf values by the document's average term frequency. We would also like to note the strong performance of improved Lucene over the new query set: our three submitted Lucene runs were ranked first according to the NEU measure and in places 2-4 (after the Juru run) according to ε-Map.
The improvements, which were trained on the 150 terabyte queries of previous years, were shown to be consistent with the 1755 judged queries of the 1-Million Queries Track, and with the new sampling measures ε-Map and NEU-Map.
Application developers using Lucene can easily adopt the document length normalization part of our work, simply by choosing a different Similarity. Phrase expansion of the query as well as proximity scoring should also be relatively easy to add. However, applying the tf normalization would require some changes to Lucene code.
Additional considerations that search application developers should take into account are the search time cost, and whether the improvements demonstrated here are also relevant for the domain of the specific search application.
8. ACKNOWLEDGMENTS
We would like to thank Ian Soboroff and Ben Carterette for sharing their insights in discussions about how the results, all the results as well as ours, can be interpreted for the writing of this report.
Thanks also to Nadav Har'El for helpful discussions on anchor text extraction, and for his code to add proximity scoring to Lucene.
9. REFERENCES
[1] C. Buckley, A. Singhal, and M. Mitra. New retrieval approaches using SMART: TREC 4. In TREC, 1995.
[2] D. Carmel and E. Amitay. Juru at TREC 2006: TAAT versus DAAT in the Terabyte Track. In Proceedings of the 15th Text REtrieval Conference (TREC 2006). National Institute of Standards and Technology (NIST), 2006.
[3] D. Carmel, E. Amitay, M. Herscovici, Y. S. Maarek, Y. Petruschka, and A. Soffer. Juru at TREC 10 - Experiments with Index Pruning. In Proceedings of the Tenth Text REtrieval Conference (TREC-10). National Institute of Standards and Technology (NIST), 2001.
[4] O. Gospodnetic and E. Hatcher. Lucene in Action. Manning Publications Co., 2005.
[5] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne. Okapi at TREC-4. In TREC, 1995.
[6] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-29, 1996.

More Related Content

What's hot

Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
GokulD
Ā 
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
Shaghayegh (Sherry) Sahebi
Ā 

What's hot (17)

Ir 02
Ir   02Ir   02
Ir 02
Ā 
Ir 08
Ir   08Ir   08
Ir 08
Ā 
Unit 3 lecture-2
Unit 3 lecture-2Unit 3 lecture-2
Unit 3 lecture-2
Ā 
Efficient Record De-Duplication Identifying Using Febrl Framework
Efficient Record De-Duplication Identifying Using Febrl FrameworkEfficient Record De-Duplication Identifying Using Febrl Framework
Efficient Record De-Duplication Identifying Using Febrl Framework
Ā 
Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
Ā 
Ir 09
Ir   09Ir   09
Ir 09
Ā 
An Efficient Search Engine for Searching Desired File
An Efficient Search Engine for Searching Desired FileAn Efficient Search Engine for Searching Desired File
An Efficient Search Engine for Searching Desired File
Ā 
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
Ā 
Environment Canada's Data Management Service
Environment Canada's Data Management ServiceEnvironment Canada's Data Management Service
Environment Canada's Data Management Service
Ā 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
Ā 
Cross-Domain Recommendation for Large-Scale Data
Cross-Domain Recommendation for Large-Scale DataCross-Domain Recommendation for Large-Scale Data
Cross-Domain Recommendation for Large-Scale Data
Ā 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
Ā 
Enhancing Big Data Analysis by using Map-reduce Technique
Enhancing Big Data Analysis by using Map-reduce TechniqueEnhancing Big Data Analysis by using Map-reduce Technique
Enhancing Big Data Analysis by using Map-reduce Technique
Ā 
Extractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachExtractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised Approach
Ā 
Coling2014:Single Document Keyphrase Extraction Using Label Information
Coling2014:Single Document Keyphrase Extraction Using Label InformationColing2014:Single Document Keyphrase Extraction Using Label Information
Coling2014:Single Document Keyphrase Extraction Using Label Information
Ā 
Reference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural NetworkReference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural Network
Ā 
LOD2: State of Play WP2 - Storing and Querying Very Large Knowledge Bases
LOD2: State of Play WP2 - Storing and Querying Very Large Knowledge BasesLOD2: State of Play WP2 - Storing and Querying Very Large Knowledge Bases
LOD2: State of Play WP2 - Storing and Querying Very Large Knowledge Bases
Ā 

Similar to Ibm haifa.mq.final

Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search Engine
NIKHIL NAIR
Ā 
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Edmond Lepedus
Ā 
Lucene basics
Lucene basicsLucene basics
Lucene basics
Nitin Pande
Ā 

Similar to Ibm haifa.mq.final (20)

ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
Ā 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
Ā 
How a search engine works slide
How a search engine works slideHow a search engine works slide
How a search engine works slide
Ā 
Lucene
LuceneLucene
Lucene
Ā 
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
Ā 
RDBMS
RDBMSRDBMS
RDBMS
Ā 
Apache lucene
Apache luceneApache lucene
Apache lucene
Ā 
Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search Engine
Ā 
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKSTEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
Ā 
An Improved Web Explorer using Explicit Semantic Similarity with ontology and...
An Improved Web Explorer using Explicit Semantic Similarity with ontology and...An Improved Web Explorer using Explicit Semantic Similarity with ontology and...
An Improved Web Explorer using Explicit Semantic Similarity with ontology and...
Ā 
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKSTEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
Ā 
Seminar report(rohitsahu cs 17 vth sem)
Seminar report(rohitsahu cs 17 vth sem)Seminar report(rohitsahu cs 17 vth sem)
Seminar report(rohitsahu cs 17 vth sem)
Ā 
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Ā 
How a search engine works report
How a search engine works reportHow a search engine works report
How a search engine works report
Ā 
Lucene basics
Lucene basicsLucene basics
Lucene basics
Ā 
TEXT CLUSTERING.doc
TEXT CLUSTERING.docTEXT CLUSTERING.doc
TEXT CLUSTERING.doc
Ā 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
Ā 
Search engine. Elasticsearch
Search engine. ElasticsearchSearch engine. Elasticsearch
Search engine. Elasticsearch
Ā 
Implementation of query optimization for reducing run time
Implementation of query optimization for reducing run timeImplementation of query optimization for reducing run time
Implementation of query optimization for reducing run time
Ā 
How web searching engines work
How web searching engines workHow web searching engines work
How web searching engines work
Ā 

More from Pranav Prakash

Apple banana oranges_peaches
Apple banana oranges_peachesApple banana oranges_peaches
Apple banana oranges_peaches
Pranav Prakash
Ā 
Banana oranges peaches
Banana oranges peachesBanana oranges peaches
Banana oranges peaches
Pranav Prakash
Ā 
Apple oranges peaches
Apple oranges peachesApple oranges peaches
Apple oranges peaches
Pranav Prakash
Ā 

More from Pranav Prakash (20)

Data engineering track module 2
Data engineering track module 2Data engineering track module 2
Data engineering track module 2
Ā 
Data engineering track module 2
Data engineering track module 2Data engineering track module 2
Data engineering track module 2
Ā 
Machine Learning Introduction
Machine Learning IntroductionMachine Learning Introduction
Machine Learning Introduction
Ā 
Test document
Test documentTest document
Test document
Ā 
Webtech1b
Webtech1bWebtech1b
Webtech1b
Ā 
Solidry @ bakheda2
Solidry @ bakheda2Solidry @ bakheda2
Solidry @ bakheda2
Ā 
To Infinity and Beyond - OSDConf2014
To Infinity and Beyond - OSDConf2014To Infinity and Beyond - OSDConf2014
To Infinity and Beyond - OSDConf2014
Ā 
#comments
#comments#comments
#comments
Ā 
Apple banana oranges_peaches
Apple banana oranges_peachesApple banana oranges_peaches
Apple banana oranges_peaches
Ā 
Oranges
OrangesOranges
Oranges
Ā 
Oranges peaches
Oranges peachesOranges peaches
Oranges peaches
Ā 
Banana
BananaBanana
Banana
Ā 
Banana peaches
Banana peachesBanana peaches
Banana peaches
Ā 
Banana oranges
Banana orangesBanana oranges
Banana oranges
Ā 
Banana oranges peaches
Banana oranges peachesBanana oranges peaches
Banana oranges peaches
Ā 
Apple
AppleApple
Apple
Ā 
Apple peaches
Apple peachesApple peaches
Apple peaches
Ā 
Apple oranges
Apple orangesApple oranges
Apple oranges
Ā 
Apple oranges peaches
Apple oranges peachesApple oranges peaches
Apple oranges peaches
Ā 
Apple banana
Apple bananaApple banana
Apple banana
Ā 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(ā˜Žļø+971_581248768%)**%*]'#abortion pills for sale in dubai@
Ā 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
Ā 

Recently uploaded (20)

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
Ā 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
Ā 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
Ā 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Ā 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
Ā 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
Ā 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
Ā 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
Ā 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
Ā 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
Ā 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Ā 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
Ā 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Ā 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
Ā 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
Ā 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Ā 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
Ā 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
Ā 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
Ā 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Ā 

Ibm haifa.mq.final

  • 1. Lucene and Juru at Trec 2007: 1-Million Queries Track Doron Cohen, Einat Amitay, David Carmel IBM Haifa Research Lab Haifa 31905, Israel Email: {doronc,einat,carmel}@il.ibm.com ABSTRACT Lucene is an increasingly popular open source search library. However, our experiments of search quality for TREC data and evaluations for out-of-the-box Lucene indicated inferior quality comparing to other systems participating in TREC. In this work we investigate the diļ¬€erences in measured search quality between Lucene and Juru, our home-brewed search engine, and show how Lucene scoring can be modiļ¬ed to improve its measured search quality for TREC. Our scoring modiļ¬cations to Lucene were trained over the 150 topics of the tera-byte tracks. Evaluations of these mod- iļ¬cations with the new - sample based - 1-Million Queries Track measures - NEU-Map and -Map - indicate the ro- bustness of the scoring modiļ¬cations: modiļ¬ed Lucene per- forms well when compared to stock Lucene and when com- pared to other systems that participated in the 1-Million Queries Track this year, both for the training set of 150 queries and for the new measures. As such, this also sup- ports the robustness of the new measures tested in this track. This work reports our experiments and results and de- scribes the modiļ¬cations involved - namely normalizing term frequencies, diļ¬€erent choice of document length normaliza- tion, phrase expansion and proximity scoring. 1. INTRODUCTION Our experiments this year for the TREC 1-Million Queries Track focused on the scoring function of Lucene, an Apache open-source search engine [4]. We used our home-brewed search engine, Juru [2], to compare with Lucene on the 10K trackā€™s queries over the gov2 collection. Lucene1 is an open-source library for full-text indexing and searching in Java. The default scoring function of Lucene implements the cosine similarity function, while search terms are weighted by the tf-idf weighting mechanism. Equation 1 describes the default Lucene score for a document d with respect to a query q: Score(d, q) = tāˆˆq tf(t, d) Ā· idf(t) Ā· (1) boost(t, d) Ā· norm(d) where 1 We used Lucene Java: http://lucene.apache.org/java/docs, the ļ¬rst and main Lucene implementation. Lucene deļ¬nes a ļ¬le for- mat, and so there are ports of Lucene to other languages. Throughout this work, ā€Luceneā€ refers to ā€Apache Lucene Javaā€. ā€¢ tf(t, d) is the term frequency factor for the term t in the document d, computed as freq(t, d). ā€¢ idf(t) is the inverse document frequency of the term, computed as 1 + log numDocs docF req(t)+1 . ā€¢ boost(t.field, d) is the boost of the ļ¬eld of t in d, as set during indexing. ā€¢ norm(d) is the normalization value, given the number of terms within the document. For more details on Lucene scoring see the oļ¬ƒcial Lucene scoring2 document. Our main goal in this work was to experiment with the scoring mechanism of Lucene in order to bring it to the same level as the state-of-the-art ranking formulas such as OKAPI [5] and the SMART scoring model [1]. In order to study the scoring model of Lucene in full details we ran Juru and Lucene over the 150 topics of the tera-byte tracks, and over the 10K queries of the 1-Million Queries Track of this year. We then modiļ¬ed Luceneā€™s scoring function to include better document length normalization, and a better term-weight setting according to the SMART model. 
Equa- tion 2 describes the term frequency (tf) and the normaliza- tion scheme used by Juru, based on SMART scoring mech- anism, that we followed to modify Lucene scoring model. tf(t, d) = log(1+freq(t,d)) log(1+avg(freq(d))) (2) norm(d) = 0.8 avg(#uniqueTerms) + 0.2 #uniqueTerms(d) As we later describe, we were able to get better results with a simpler length normalization function that is avail- able within Lucene, though not as the default length nor- malization. For both Juru and Lucene we used the same HTML parser to extract content of the Web documents of the gov2 collec- tion. In addition, for both systems the same anchor text ex- traction process took place. Anchors data was used for two purposes: 1) Set a document static score according to the number of in-links pointing to it. 2) Enrich the document content with the anchor text associated with its in-links. Last, for both Lucene and Juru we used a similar query parsing mechanism, which included stop-word removal, syn- onym expansion and phrase expansion. 2 http://lucene.apache.org/java/2_3_0/scoring.html
  • 2. In the following we describe these processes in full de- tails and the results both for the 150 topics of the tera- byte tracks, and for the 10K queries of the 1-Million Queries Track. 2. ANCHOR TEXT EXTRACTION Extracting anchor text is a necessary task for indexing Web collections: adding text surrounding a link to the in- dexed document representing the linked page. With inverted indexes it is often ineļ¬ƒcient to update a document once it was indexed. In Lucene, for example, updating an indexed document involves removing that document from the index and then (re) adding the updated document. Therefore, an- chor extraction is done prior to indexing, as a global com- putation step. Still, for large collections, this is a nontrivial task. While this is not a new problem, describing a timely extraction method may be useful for other researchers. 2.1 Extraction method Our gov2 input is a hierarchical directory structure of about 27,000 compressed ļ¬les, each containing multiple doc- uments (or pages), altogether about 25,000,000 documents. Our output is a similar directory structure, with one com- pressed anchors text ļ¬le for each original pages text ļ¬le. Within the ļ¬le, anchors texts are ordered the same as pages texts, allowing indexing the entire collection in a single com- bined scan of the two directories. We now describe the ex- traction steps. ā€¢ (i) Hash by URL: Input pages are parsed, emitting two types of result lines: page lines and anchor lines. Result lines are hashed into separate ļ¬les by URL, and so for each page, all anchor lines referencing it are written to the same ļ¬le as its single page line. ā€¢ (ii) Sort lines by URL: This groups together every page line with all anchors that reference that page. Note that sort complexity is relative to the ļ¬les size, and hence can be controlled by the hash function used. ā€¢ (iii) Directory structure: Using the ļ¬le path info of page lines, data is saved in new ļ¬les, creating a directory structure identical to that of the input. ā€¢ (iv) Documents order: Sort each ļ¬le by document numbers of page lines. This will allow to index pages with their anchors. Note that the extraction speed relies on serial IO in the splitting steps (i), (iii), and on sorting ļ¬les that are not too large in steps (ii), (iv). 2.2 Gov2 Anchors Statistics Creating the anchors data for gov2 took about 17 hours on a 2-way Linux. Step (i) above ran in 4 parallel threads and took most of the time: 11 hours. The total size of the anchors data is about 2 GB. We now list some interesting statistics on the anchors data: ā€¢ There are about 220 million anchors for about 25 mil- lion pages, an average of about 9 anchors per page. ā€¢ Maximum number of anchors of a single page is 1,275,245 - that many .gov pages are referencing the page www.usgs.gov. ā€¢ 17% of the pages have no anchors at all. ā€¢ 77% of the pages have 1 to 9 anchors. ā€¢ 5% of the pages have 10 to 99 anchors. ā€¢ 159 pages have more than 100,000 anchors. Obviously, for pages with many anchors only part of the anchors data was indexed. 2.3 Static scoring by link information The links into a page may indicate how authoritative that page is. We employed a simplistic approach of only count- ing the number of such in-links, and using that counter to boost up candidate documents, so that if two candidate pages agree on all quality measures, the page with more incoming links would be ranked higher. Static score (SS) values are linearly combined with textual scores. 
Equation 3 shows the static score computation for a document d linked by in(d) other pages. SS(d) = min(1, in(d) 400 ) (3) It is interesting to note that our experiments showed con- ļ¬‚icting eļ¬€ects of this static scoring: while SS greatly im- proves Juruā€™s quality results, SS have no eļ¬€ect with Lucene. To this moment we do not understand the reason for this diļ¬€erence of behavior. Therefore, our submissions include static scoring by in-links count for Juru but not for Lucene. 3. QUERY PARSING We used a similar query parsing process for both search engines. The terms extracted from the query include single query words, stemmed by the Porter stemmer which pass the stop-word ļ¬ltering process, lexical aļ¬ƒnities, phrases and synonyms. 3.1 Lexical afļ¬nities Lexical aļ¬ƒnities (LAs) represent the correlation between words co-occurring in a document. LAs are identiļ¬ed by looking at pairs of words found in close proximity to each other. It has been described elsewhere [3] how LAs improve precision of search by disambiguating terms. During query evaluation, the query proļ¬le is constructed to include the queryā€™s lexical aļ¬ƒnities in addition to its in- dividual terms. This is achieved by identifying all pairs of words found close to each other in a window of some prede- ļ¬ned small size (the sliding window is only deļ¬ned within a sentence). For each LA=(t1,t2), Juru creates a pseudo posting list by ļ¬nding all documents in which these terms appear close to each other. This is done by merging the posting lists of t1 and t2. If such a document is found, it is added to the posting list of the LA with all the relevant occurrence information. After creating the posting list, the new LA is treated by the retrieval algorithm as any other term in the query proļ¬le. In order to add lexical aļ¬ƒnities into Lucene scoring we used Luceneā€™s SpanNearQuery which matches spans of the query terms which are within a given window in the text.
  • 3. 3.2 Phrase expansion The query is also expanded to include the query text as a phrase. For example, the query ā€˜U.S. oil industry historyā€™ is expanded to ā€˜U.S. oil industry history. ā€U.S. oil industry historyā€ ā€™. The idea is that documents containing the query as a phrase should be biased compared to other documents. The posting list of the query phrase is created by merging the postings of all terms, considering only documents con- taining the query terms in adjacent oļ¬€sets and in the right order. Similarly to LA weight which speciļ¬es the relative weight between an LA term and simple keyword term, a phrase weight speciļ¬es the relative weight of a phrase term. Phrases are simulated by Lucene using the PhraseQuery class. 3.3 Synonym Expansion The 10K queries for the 1-Million Queries Track track are strongly related to the .gov domain hence many abbrevia- tions are found in the query list. For example, the u.s. states are usually referred by abbreviations (e.g. ā€œnyā€ for New- York, ā€œcaā€ or ā€œcalā€ for California). However, in many cases relevant documents refer to the full proper-name rather than its acronym. Thus, looking only for documents containing the original query terms will fail to retrieve those relevant documents. We therefore used a synonym table for common abbre- viations in the .gov domain to expand the query. Given a query term with an existing synonym in the table, expansion is done by adding the original query phrase while replacing the original term with its synonym. For example, the query ā€œca veteransā€ is expanded to ā€œca veterans. California vet- eransā€. It is however interesting to note that the evaluations of our Lucene submissions indicated no measurable improvements due to this synonym expansion. 3.4 Lucene query example To demonstrate our choice of query parsing, for the origi- nal topic ā€“ ā€U.S. oil industry historyā€, the following Lucene query was created: oil industri histori ( spanNear([oil, industri], 8, false) spanNear([oil, histori], 8, false) spanNear([industri, histori], 8, false) )^4.0 "oil industri histori"~1^0.75 The result Lucene query illustrates some aspects of our choice of query parsing: ā€¢ ā€U.S.ā€ is considered a stop word and was removed from the query text. ā€¢ Only stemmed forms of words are used. ā€¢ Default query operator is OR. ā€¢ Words found in a document up to 7 positions apart form a lexical aļ¬ƒnity. (8 in this example because of the stopped word.) ā€¢ Lexical aļ¬ƒnity matches are boosted 4 times more than single word matches. ā€¢ Phrase matches are counted slightly less than single word matches. ā€¢ Phrases allow fuzziness when words were stopped. For more information see the Lucene query syntax3 doc- ument. 4. LUCENE SCORE MODIFICATION Our TREC quality measures for the gov2 collection re- vealed that the default scoring4 of Lucene is inferior to that of Juru. (Detailed run results are given in Section 6.) We were able to improve Luceneā€™s scoring by changing the scoring in two areas: document length normalization and term frequency (tf) normalization. Luceneā€™s default length normalization5 is given in equa- tion 4, where L is the number of words in the document. lengthNorm(L) = 1 āˆš L (4) The rational behind this formula is to prevent very long documents from ā€taking overā€ just because they contain many terms, possibly many times. 
4. LUCENE SCORE MODIFICATION

Our TREC quality measures for the gov2 collection revealed that the default scoring of Lucene (http://lucene.apache.org/java/2_3_0/scoring.html) is inferior to that of Juru. (Detailed run results are given in Section 6.) We were able to improve Lucene's scoring by changing it in two areas: document length normalization and term frequency (tf) normalization.

Lucene's default length normalization, implemented in DefaultSimilarity (http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/search/DefaultSimilarity.html), is given in equation 4, where L is the number of words in the document.

  lengthNorm(L) = 1 / sqrt(L)                                (4)

The rationale behind this formula is to prevent very long documents from "taking over" just because they contain many terms, possibly many times. However, a negative side effect of this normalization is that long documents are "punished" too much, while short documents are preferred too much. The first two modifications below remedy this.

4.1 Sweet Spot Similarity

Here we used the document length normalization of "Sweet Spot Similarity" (http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/misc/SweetSpotSimilarity.html). This alternative Similarity function is available in Lucene's "contrib" area. Its normalization value is given in equation 5, where L is the number of document words.

  lengthNorm(L) = 1 / sqrt(steepness * (|L - min| + |L - max| - (max - min)) + 1)    (5)

We used steepness = 0.5, min = 1000, and max = 15,000. This yields a constant norm for all lengths in the [min, max] range (the "sweet spot"), and smaller norm values for lengths outside this range. Documents shorter or longer than the sweet spot range are "punished".

4.2 Pivoted length normalization

Our pivoted document length normalization follows the approach of [6], as depicted in equation 6, where U is the number of unique words in the document and pivot is the average of U over all documents.

  lengthNorm(L) = 1 / ((1 - slope) * pivot + slope * U)      (6)

For Lucene we used slope = 0.16.
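As a concrete starting point, the following is a minimal sketch of plugging in the sweet-spot length normalization, assuming the contrib SweetSpotSimilarity class and its setLengthNormFactors(min, max, steepness) setter as shipped with Lucene 2.x (the exact signature may differ across versions). Because length norms are computed at indexing time, the same Similarity has to be set on the IndexWriter used to build the index as well as on the Searcher.

import org.apache.lucene.misc.SweetSpotSimilarity;
import org.apache.lucene.search.Similarity;

public class LengthNormSetup {

  /** SweetSpotSimilarity configured with the parameters used in this work. */
  public static Similarity sweetSpotSimilarity() {
    SweetSpotSimilarity sim = new SweetSpotSimilarity();
    sim.setLengthNormFactors(1000, 15000, 0.5f);   // sweet spot: 1,000 - 15,000 words
    return sim;
  }

  // Usage (sketch):
  //   writer.setSimilarity(sweetSpotSimilarity());   // before adding documents
  //   searcher.setSimilarity(sweetSpotSimilarity()); // at query time
}

The pivoted variant of equation 6 can similarly be obtained by subclassing DefaultSimilarity and overriding lengthNorm; note, however, that the stock lengthNorm hook in this Lucene version receives the field's token count rather than the number of unique terms, so implementing equation 6 exactly may require further changes.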
4.3 Term Frequency (tf) normalization

The tf-idf formula used in Lucene computes tf(t, d) as sqrt(freq(t, d)), where freq(t, d) is the frequency of t in d. Let avgFreq(d) denote the average of freq(t, d) over all terms t of document d. We modified the computation to take this average into account, similar to Juru, as shown in equation 7.

  tf(t, d) = log(1 + freq(t, d)) / log(1 + avgFreq(d))       (7)

This distinguishes between terms that highly represent a document and terms that are part of duplicated text. In addition, the logarithmic formula has much smaller variation than Lucene's original square-root-based formula, hence the effect of freq(t, d) on the final score is smaller. For example, with freq(t, d) = 16 and avgFreq(d) = 4, the modified tf is log(17)/log(5), roughly 1.76, whereas the original formula gives sqrt(16) = 4.

4.4 Query fewer fields

Throughout our experiments we saw that splitting the document text into separate fields, such as title, abstract, body, and anchor, and then (OR) querying all these fields performs worse than querying a single, combined field. We have not fully investigated the causes, but we suspect that this too is due to poor length normalization, especially in the presence of very short fields.

Our recommendation therefore is to refrain from splitting the document text into fields in those cases where all the fields are searched at the same time anyhow. While splitting into fields allows easy boosting of certain fields at search time, a better alternative is to boost the terms themselves, either statically by duplicating them at indexing time, or, perhaps even better, dynamically at search time, e.g. by using term payload values, a new Lucene feature. In this work we took the first, more static, approach.

5. QUERY LOG ANALYSIS

This year's task resembled a real search environment mainly because it provided us with a substantial query log. The log provides a glimpse into what users of the .gov domain actually want to find, which is important since log analysis has become one of the better ways to improve search quality and performance. For example, for our index we created a stopword list from the most frequent terms appearing in the queries, crossed with their frequency in the collection.

The most obvious characteristic of the queries is the term "state", which is ranked the 6th most frequent query term with 342 mentions and appears 10,476,458 times in the collection. This demonstrates the bias of both the collection and the queries toward content pertaining to state affairs. The log also contains over 270 queries with us, u.s, usa, or the phrase "united states", and nearly 150 queries containing the term federal. The underlying assumption in most of the queries is that the requested information describes US government content. This may sound obvious, since it is after all the gov2 collection; however, it also means that queries containing the terms US, federal, and state actually restrict the results much more than required in a collection dedicated to exactly that content. Removing those terms from the queries therefore makes much sense. We experimented with the removal of those terms and found that the improvement over the test queries is substantial.
Term         Frequency in query log   Frequency in index
of           1228                     Originally stopped
in           838                      Originally stopped
and          653                      Originally stopped
for          586                      Originally stopped
the          512                      Originally stopped
state        342                      10,476,458
to           312                      Originally stopped
a            265                      Originally stopped
county       233                      3,957,936
california   204                      1,791,416
tax          203                      1,906,788
new          195                      4,153,863
on           194                      Originally stopped
department   169                      6,997,021
what         168                      1,327,739
how          143                      Originally stopped
federal      143                      4,953,306
health       135                      4,693,981
city         114                      2,999,045
is           114                      Originally stopped
national     104                      7,632,976
court        103                      2,096,805
york         103                      1,804,031
school       100                      2,212,112
with         99                       Originally stopped

Table 1: Most frequent terms in the query log.

The log information allowed us to stop general terms along with collection-specific terms such as US, state, United States, and federal.

Another problem that arose from using such a diverse log was that we detected over 400 queries containing abbreviations marked by a dot, such as fed., gov., dept., st., and so on. We also found numbers, such as street addresses or dates of birth, that were sometimes given as digits and sometimes spelled out as words. Many of the queries also contained abbreviations for state names, which is the accepted written form for referring to states in the US. For this purpose we prepared a list of synonyms to expand the queries so that they contain all the possible forms of the terms. For example, for the query "Orange County CA" our system produced two queries, "Orange County CA" and "Orange County California", which were submitted as a single query combining both strings. Overall we used a list of about 250 such synonyms, consisting of state abbreviations (e.g. RI, Rhode Island), common federal authorities' abbreviations (e.g. FBI, FDA, etc.), and translations of numbers to words (e.g. 1, one, 2, two, etc.).

Although we are aware of the many queries in different languages appearing in the log, mainly in Spanish, we chose to ignore cross-language techniques and submit them as they are. Another aspect we did not cover, and that could possibly have made a difference, is the processing of the over 400 queries that appeared in some form of wh-question format. We speculate that analyzing the syntax of those queries could have provided a good filtering mechanism that might have improved our results.
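The abbreviation-driven synonym expansion described above amounts to a table lookup over the query terms. The sketch below is an illustration only, not the code used for our runs, and the few table entries shown are an assumed subset of the roughly 250 synonyms mentioned.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SynonymExpander {

  private final Map<String, String> synonyms = new HashMap<String, String>();

  public SynonymExpander() {
    // A tiny assumed subset of the ~250 entries used for the runs.
    synonyms.put("ca", "california");
    synonyms.put("ri", "rhode island");
    synonyms.put("fbi", "federal bureau of investigation");
    synonyms.put("1", "one");
  }

  /** e.g. "ca veterans" -> "ca veterans. california veterans" */
  public String expand(String query) {
    List<String> variants = new ArrayList<String>();
    variants.add(query);
    String[] terms = query.toLowerCase().split("\\s+");
    for (int i = 0; i < terms.length; i++) {
      String synonym = synonyms.get(terms[i]);
      if (synonym != null) {
        // Copy the query, replacing only the matched term with its synonym.
        StringBuilder variant = new StringBuilder();
        for (int j = 0; j < terms.length; j++) {
          if (j > 0) variant.append(' ');
          variant.append(j == i ? synonym : terms[j]);
        }
        variants.add(variant.toString());
      }
    }
    // The variants are submitted together as one combined query string.
    StringBuilder combined = new StringBuilder();
    for (int k = 0; k < variants.size(); k++) {
      if (k > 0) combined.append(". ");
      combined.append(variants.get(k));
    }
    return combined.toString();
  }
}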
6. RESULTS

Search quality results of our runs for the 1-Million Queries Track are given in Table 2. For a description of the various options see Section 4. The top results in each column are emphasized. The MAP column refers to the 150 terabyte-track queries that were part of the query pool. We next analyze these results.

Figure 1: MAP and NEU-Map
Figure 2: ε-MAP
Figure 3: Precision cut-offs 5, 10, 20
Figure 4: ε-MAP Recomputed
Figure 5: Search time (seconds/query)

Run 1, the only Juru run, was used as a target. All the other runs - 2 to 10 - used Lucene. Runs 1 (Juru) and 9 (Lucene), marked with '(*)', are the only runs in this table that were submitted to the 1-Million Queries Track. Two more Lucene runs were submitted, with variations of synonym expansion and spell correction, but their results were nearly identical to those of Run 9, so these two runs are not discussed separately in this report. Also note that Lucene Run 10, which was not submitted, outperforms the submitted Lucene runs in all measures but NEU-Map.

Run 2 is the default, out-of-the-box Lucene. This run was used as a baseline for improving Lucene's quality measures. Comparing run 2 with run 1 shows that Lucene's default behavior does not perform very well, with a MAP of 0.154 compared to Juru's MAP of 0.313.

Runs 3, 4, and 5 add proximity (LA), phrase, and both proximity and phrase, respectively, on top of the default Lucene1 run. The results for Run 5 reveal that adding both proximity and phrase to the query evaluation is beneficial, so we pick this configuration as the base for the rest of the evaluation and name it Lucene4. Note that runs 6 to 10 are marked as Lucene4 as well, indicating proximity and phrase in all of them.

Lucene's sweet spot similarity run performs much better than the default Lucene behavior, bringing MAP from 0.214 in run 5 to 0.273 in run 7. Note that the only change here is a different document length normalization. Replacing the document length normalization with the pivoted one in run 6 improves MAP further, to 0.284.

Run 10 is definitely the best run: it combines the document length normalization of Lucene's sweet spot similarity with normalizing tf by the average term frequency in a document. This combination brings Lucene's MAP to 0.306, and even outperforms Juru on the P@5, P@10, and P@20 measures.
Run                                    MAP    ε-Map   NEU-Map  P@5    P@10   P@20   ε-Map Recomputed  sec/q
1.  Juru (*)                           0.313  0.1080  0.3199   0.592  0.560  0.529  0.2121            -
2.  Lucene1  Base                      0.154  0.0746  0.1809   0.313  0.303  0.289  0.0848            1.4
3.  Lucene2  LA                        0.208  0.0836  0.2247   0.409  0.382  0.368  0.1159            5.6
4.  Lucene3  Phrase                    0.191  0.0789  0.2018   0.358  0.347  0.341  0.1008            4.1
5.  Lucene4  LA + Phrase               0.214  0.0846  0.2286   0.409  0.390  0.383  0.1193            6.7
6.  Lucene4  Pivot Length Norm         0.284  -       -        0.572  0.540  0.503  -                 -
7.  Lucene4  Sweet Spot Similarity     0.273  0.1059  0.2553   0.593  0.565  0.527  0.1634            6.9
8.  Lucene4  TF avg norm               0.194  0.0817  0.2116   0.404  0.373  0.370  0.1101            7.8
9.  Lucene4  Pivot + TF avg norm (*)   0.294  0.1031  0.3255   0.587  0.542  0.512  0.1904            -
10. Lucene4  Sweet Spot + TF avg norm  0.306  0.1091  0.3171   0.627  0.589  0.543  0.2004            8.0

Table 2: Search Quality Comparison.

The ε-Map and NEU-Map measures introduced in the 1-Million Queries Track support the improvements to Lucene scoring: runs 9 and 10 are the best Lucene runs in these measures as well, with a slight disagreement in the NEU-Map measure, which prefers run 9. Figures 1 and 2 show that the improvements over stock Lucene are consistent for all three measures: MAP, ε-Map, and NEU-Map. Figure 3 demonstrates similar consistency for precision cut-offs at 5, 10, and 20.

Since most of the runs analyzed here were not submitted, we applied the supplied expert utility to generate a new expert.rels file using all our runs. As a result, the evaluate program that computes ε-Map was able to take into account new documents retrieved by these runs, although it ignored information provided by any runs other than ours. Note that new unjudged documents were not simply treated as relevant; rather, the probability that a non-judged document retrieved by the new runs is relevant was computed more accurately this way. The ε-Map Recomputed column of Table 2 shows this re-computation and, together with Figure 4, confirms that the improvements to Lucene's search quality are consistent with the previous results, with run 10 being the best Lucene run.

Finally, we measured the search-time penalty of the quality improvements, shown in the sec/q column of Table 2 and in Figure 5. Search time grew by a factor of almost 6. This is a significant cost that some applications would not be able to afford; however, this work focused on search quality, and improving search time is left for future work.

7. SUMMARY

We started by measuring Lucene's out-of-the-box search quality for TREC data and found that it is significantly inferior to that of other search engines participating in TREC, and in particular to our search engine Juru.

We were able to improve Lucene's search quality as measured on TREC data by (1) adding phrase expansion and proximity scoring to the query, (2) a better choice of document length normalization, and (3) normalizing tf values by the document's average term frequency. We would also like to note the strong performance of the improved Lucene over the new query set: the three submitted Lucene runs were ranked first according to the NEU-Map measure and in places 2-4 (after the Juru run) according to ε-Map. The improvements, trained on the 150 terabyte queries of previous years, were shown to be consistent over the 1755 judged queries of the 1-Million Queries Track and under the new sampling measures ε-Map and NEU-Map.
Application developers using Lucene can easily adopt the document length normalization part of our work, simply by choosing a different Similarity. Phrase expansion of the query as well as proximity scoring should also be relatively easy to add. However, applying the tf normalization would require some changes to Lucene's code.

Additional considerations that search application developers should take into account are the search-time cost, and whether the improvements demonstrated here are also relevant for the domain of the specific search application.

8. ACKNOWLEDGMENTS

We would like to thank Ian Soboroff and Ben Carterette for sharing their insights in discussions on how the results - all the results and ours - can be interpreted for the writing of this report. Thanks also to Nadav Har'El for helpful discussions on anchor text extraction, and for his code for adding proximity scoring to Lucene.

9. REFERENCES

[1] C. Buckley, A. Singhal, and M. Mitra. New retrieval approaches using SMART: TREC 4. In TREC, 1995.
[2] D. Carmel and E. Amitay. Juru at TREC 2006: TAAT versus DAAT in the Terabyte Track. In Proceedings of the 15th Text REtrieval Conference (TREC 2006). NIST, 2006.
[3] D. Carmel, E. Amitay, M. Herscovici, Y. S. Maarek, Y. Petruschka, and A. Soffer. Juru at TREC 10 - Experiments with Index Pruning. In Proceedings of the Tenth Text REtrieval Conference (TREC-10). NIST, 2001.
[4] O. Gospodnetic and E. Hatcher. Lucene in Action. Manning Publications Co., 2005.
[5] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne. Okapi at TREC-4. In TREC, 1995.
[6] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-29, 1996.