Essential Elements of Excellent Multilingual Search
1. February 22, 2012
Essential Elements of
Excellent Multilingual Search
Boosting Search Quality with the Rosette Linguistics Platform
We put the World in the World Wide WebĀ®
3. ABSTRACT
If search is vital to the competitive advantage or value proposition of your business, then this
whitepaper is a must read. In the survival of the ļ¬ttest, companies like Netflix and many
businesses today thrive or die depending on whether they can:
a) Provide āexcellent searchā ā comprehensive, accurate, and relevant to the user
b) Serve each user equally well in his or her preferred language
The three essential elements of a business-dependable search solution on which to build
applications or workļ¬ows to generate revenue are: (1) search customizability and the speed-
scalability-cost trio, (2) high search qualityāwith emphasis on ārecallā or maximizing the number
of good results foundāand (3) reliability and support. This paper will examine what these three
elements mean, and examine Basis Technologyās Rosette linguistic platform as one possible
solution.
THE THREE ELEMENTS OF āEXCELLENT SEARCHā
More and more, revenues and business longevity rely on cost-eļ¬ective, accurate, and scalable
search in English and other languages. E-commerce, job or real estate hunting sites, e-discovery/
e-disclosure solutions, ļ¬nancial compliance solutions, online information providers, media and
entertainment portal sites all begin with users ļ¬nding what they are looking for. Choose the right
search and see revenues go up, satisļ¬ed customers increase, and a business well-positioned for
the future. Choose the wrong search and costs constrict growth and poor relevancy chases away
customers. Savvy businesses recognize that their customers, partners, suppliers and employees
increasingly create and consume content in their native language.
For most enterprises, improving search quality means improving (1) recallāi.e., maximizing the
number of relevant results foundāand to a slightly lesser degree, (2) precisionāi.e., maximizing
the ratio of good to bad results returned.
Weāll now examine each of these three elements of search in detail:
ā¢ Customizability of search is critical to ļ¬t it to your particular use case or create the value add
of your business. Speed, Scale, and Cost can quickly strangle the success of your business
if search is struggling to maintain speed and scale, and choking proļ¬ts with escalating per-
query or per-document costs.
ā¢ Maximizing Search Recall and Precision is the 50,000 pound gorilla. Speed is important,
but if your users arenāt ļ¬nding the results they want to see, it doesnāt matter how fast it
gets delivered. For enterprise applications and search engines, maximizing recall (without
diminishing precision) is usually the greater concern. Good language support is an intrinsic
ingredient for improved recall
ā¢ Reliability and Technical Support ensures that expert resources are available for solving any
serious issues in the service.
Essential Elements of Excellent Multilingual Search 3
4. ELEMENT 1: CUSTOMIZABILITY, SPEED, SCALE, COST
The basic question a company has to ask is: Do we want/like/need/know how to bend technology
to our business processes and business model in order to achieve competitive advantage? And
most importantly: Is search part of that competitive advantage? If the answer is āyesā then search
is not a plug-in-a-solution problem, but a software development problem.
Because search is such a broad-scope problem, there are all kinds of edge cases that are diļ¬cult to
support using a proprietary, commercial search solution. The technology is not easily changed
without expensive consultation and modiļ¬cations to the core software that are dependent on the
vendorās product release cycle.
For companies where better search means more money earned or saved, the frustration can
increase geometrically because their use case may be unique, but the software isn't extensible
enough to deal with it, and they have to spend more and more for less and less.
Like three legs of a triangle, speed, scalability and cost are inextricably linked. Assuming that the
load on the search engine will only increase over time, search performance must be scalable to
keep the query response times low and to ensure index updating times do not aļ¬ect searches.
At the same time, pricing models of commercial search engines based on number of queries or
number of documents indexed will eventually constrict growth by sheer cost.
ELEMENT 2: MAXIMIZING SEARCH RECALL AND PRECISION DEPENDS ON LANGUAGE SUPPORT
Search is really all about ļ¬nding, whether itās a shopper on an e-commerce site, a banker checking a
wire transfer request against a list of money launderers, or home buyers looking through real estate
listings.
The concepts of precision and recall measure how āgoodā search is. More precise results with greater
recall mean better quality search, but frequently increasing one means a decrease in the other, so the
trick is how to maximize each metric while minimizing the impact on the other. (See sidebar.)
Letās now look at the language-speciļ¬c processing required to increase recall and/or precision for three
categories of notoriously diļ¬cult languages to process: Chinese/Japanese/Korean, Germanic and
Scandinavian languages, and Arabic. Weāll ļ¬rst look at one aspect of linguistic processing that improves
recall in most languages, including English and European languages: lemmas and stems.
Improving Recall via Lemmas and Stems
Lemmatization and stemming are two methods of increasing the recall of search results. Stemming
improves recall, but can diminish precision. Lemmatization is the preferred method since it increases
recall while maintaining precision.
Lemmatization, ļ¬nding the dictionary form of a word, broadens search to add relevant search results,
in ways that stemmingāmore commonly availableācannot. A user searching for āPresident Obama
speaking on healthcareā would likely also want results containing āPresident Obama spoke on
healthcareā and āPresident Obama was speaking on healthcare.ā By lemmatizingā spokeā and āwas
speakingā and storing the lemma āspeakā in the search index, the latter two results will (correctly!) get
picked up by a search for āPresident Obama speaking on healthcare.
Essential Elements of Excellent Multilingual Search 4
5. Stemming uses some basic rules to look for common stems in words. For example, ādecodā is the
stem of ādecoding,ā ādecoder,ā and ādecodes.ā However, no matter how many letters a stemmer
chops oļ¬, āspokeā will never become āspeak.ā Lemmas are found using dictionary data and
recognizing the context of a word. The lemma is very diļ¬erent when āspokeā is used as a noun or a
verb. Stemming, on the other hand, applies a set of rules regardless of a wordās part-of-speech or
context in a sentence.
Since lemmas go back to the intrinsic meaning of a word, irrelevant results are kept to a minimum.
On the other hand, stemming may sometimes produce unpredictable results. Stemming turns
āseveralā into āsever,ā and āarsenicā and āarsenalā share the same stem āarsen.ā And, whereas
āapplesā to āappleā is reasonable, āLos Angelesā to āLos Angeleā is not.
Although stemming is more readily accessible as a set of language-speciļ¬c rules than the
dictionary-requiring lemmatization, the greater recall and precision from Lemmatization is the
diļ¬erence between good and excellent search.
Stemming vs. Lemmatization in a Nutshell
Search Query Traditional Stemming Lemmatization using Rosette Comparison
animals anim animal Two unrelated words may
share a stem, in this case
animated anim animate āanimā
arsenal arsen arsenal
arsenic arsen arsenic
several sever several Stemming may have
unintended consequences
organization organ organization
children children child Irregular verbs and nouns
stump the stemmer
spoke spoke speak (when used as past tense
verb)
spoke (when used as noun)
Improving Precision in Chinese, Japanese, and Korean
Asian languages like Chinese, Japanese, and Korean are fundamentally more diļ¬cult to process
than European languages because their words are not consistently separated by spaces. Chinese
and Japanese use no spaces between words, and Korean uses some spaces, but not between
every word, and inconsistently depending on the writer.
For a search engine to build an index, it has to have words. N-gram and Tokenization are methods
used to produce indexable units in these languages, each resulting in diļ¬erent degrees of search
precision.
N-gram vs. Tokenization
Some engines may use the n-gram technique to break up streams of text into overlapping 2-4
character units, but though recall will be high, precision will be low and indexes will be very large,
impacting speed.
Here is a Japanese example: ę±äŗ¬é½ć®č¦³å å° (translation: sightseeing spots in Tokyo)
Essential Elements of Excellent Multilingual Search 5
7. Improving Recall with Pan-Chinese Search
Chinese search gains an enormous boost in recall if both the simpliļ¬ed and traditional Chinese
scripts are searched. Mainland China and Singapore use simpliļ¬ed Chinese, whereas Taiwan and
Hong Kong use traditional Chinese. Chinese speakers expect one query in one script to ļ¬nd results
from both scripts.
Pan-Chinese searchāsearching documents in both Chinese scriptsāis accomplished by converting
all the text of one script into the other at query and index
The diļ¬erences between the two scripts fall into three categories:
Category Simpliļ¬ed Chinese Traditional Chinese English Translation
big
Same character used in both scriptsāthis is true
where the character was simple to start with 大 大
Two characters with same pronunciation (but
diļ¬erent base meaning)ācollapsed to one 夓å é é«® emit
character in simpliļ¬ed Chinese hair
åŗå åŗē¼
Diļ¬erent vocabulary (analogous to American computer
and British English diļ¬erences such as ātruckā vs. č®”ē®ęŗ é»č ¦
ālorryā)āoccurs in mostly modern words
In the ļ¬rst case, the change is trivial, and amounts to making sure all the text is in the same
encoding; however, the second and third categories require dictionary data to ensure that the
conversion is context-sensitive, so that the correct traditional character is chosen.
Improving Recall in Germanic and Scandinavian Languages
Linguistic processing required by German, Dutch, Scandinavian languages (Danish, Norwegian,
Swedish, and Finnish), and Korean is very similar in that these languages freely use compound
words which, frequently need to be broken up to increase recall.
Take these two German compound words:
German compound German decompounded English translation
Jugendarbeitslosigkeit Jugend + arbeitslosigkeit Youth unemployment
(Youth + unemployment)
Samstagmorgen Samstag + morgen Saturday morning
(Saturday + morning)
Itās reasonable to imagine that a person looking for information about youth unemployment in
German would also welcome search results which included āyouthā and āunemploymentā as
separate words. Similarly, the concepts of āSaturdayā and āmorningā are very likely to appear as a
compound word or as separate words in related documents.
Decompounding increases search recall with minimal impact on precision, but again, requires
dictionary data to do properly.
Additionally, to maximize recall, character normalization for plurals āGartenā (garden, singular
form) and its plural form āGƤrtenā is needed.
Essential Elements of Excellent Multilingual Search 7
8. Improving Recall in Arabic
Arabic is one language which especially suļ¬ers from low recall if a search engine does not
perform Arabic-speciļ¬c linguistic processing, starting from the basic character normalization up to
stemming of proper nouns and Lemmatization of common nouns. Arabic attaches so many aļ¬xes
āto the beginning, end and middle of wordsāthat without lemmatizing before searching, many
relevant results are never found. One root could form the basis for up to ļ¬fteen diļ¬erent verbs!
Character Normalization to Improve Precision and Recall
In English, search engines perform some basic normalization such as lowercasing all the words. In
Arabic, normalization is much more complex and will improve both precision and recall. Types of
character normalization required include:
ā¢ Words with additional vocalization marks such as: ā«ļŗ³ļ»²Łā¬ vs. ā«ļ»²Łļŗ³ļŗļƾŁļŗ³ā¬
ā«ļŗ³Ł ļƾļŗā¬
ā¢ Words containing certain letters with dots added or removed such as: ā« Ų±Ł Łļ»ļƾļ»Ŗ Łā¬vs. ā«Ų±Ł Łļ»ļƾļŗ Łā¬
ā¢ Wordsāincluding ambiguous casesācontaining certain letters with symbols added or
removed such as:
ā«Ł±ŲÆļŗŁļ»£ļŗļ»ā¬ vs. ā«Ų§ŲÆļŗŁļ»£ļŗļ»ā¬
ā«Ų§ŲÆŲ¤ŁŲÆā¬ vs. ā«Ų§ŲÆŁŁŲÆā¬
ā«Ų§ļ»£Łļŗā¬ vs. (ā« Ų¢ļ»£Łļŗā¬or ā«Ų„ļ»£Łļŗā¬ or ā«)Ų£ļ»£Łļŗā¬
Improving Recall with Lemmatization and Stemming
In Arabic as other languages, Lemmatization increases the recall of search results (cf. āImproving
Recall via Lemmas vs. Stemsā), but with a twist. In Arabic, Lemmatization is only applicable to verbs
and common nouns (e.g., apple, book, table) and is not applicable to proper nouns (e.g., names of
people such as, Abul-qassem El-Chabby, Baqah Al-Sharqiyyah).
Proper nouns require stemming to remove prepositions and conjunctions which are often
attached to them. Searches for names of people, places, and organizations are seriously hampered
without stemming. For example, these phrases in English appear as one word in Arabic: āfor
Othman,ā āwith Othman,ā āas Othman,ā and āand Othman.ā Thus a search for just āOthmanā
would not ļ¬nd the above variations without stemming.
Additionally, since Arabic names are frequently the same words as common nouns, part-of-speech
tagging becomes critical to diļ¬erentiating between nouns and proper nouns, particularly when a
wordās part-of-speech varies depending on where it appears in a sentence.
Search Recall and Precision Enhancers for Many Languages
For English and most European languages, better recall and precision means:
ā¢ Adding lemmas (the dictionary form of a word) to the search index to increase recall, while
minimizing the negative impact on precision (cf. previous section āImproving Recall via
Lemmas vs. Stemsā)
ā¢ Boosting the ranking of relevant documents on the results page through the use of
document metadata such as entities.
ā¢ Increasing recall through comprehensive name search that goes beyond the āDid you
mean?ā functionality of most search engines.
Essential Elements of Excellent Multilingual Search 8
9. Boosting Relevant Results
Once the search engine is ļ¬nding more good results (better recall) and returning fewer bad results
(better precision), the good results need to be at the top of the search results.
Frequently the most critical words in a search have to do with entitiesāproper nouns such as
names of people, places, and organizations. But when is āChristianā a religious aļ¬liation or the
name of a person āChristian Diorā? A good entity extractor will know. At indexing time, entities can
be added to a document recordās metadata to boost the rank of documents which have entities
matching those in the query.
Comprehensive Name Search to Improve Recall
The āDid you mean?ā function of most modern search engines will ļ¬nd actress āCate Blanche ā
even if the user types āKate .ā However, for less famous people, and to cover a gamut of
name variations, a true name matching function is needed to handle āChuck Berryā vs. āCharles
Berryā; āJohn Kearnsā vs. āJon Cairnsā or even āBaqah Al-Sharqiyyahā and āā«.āļŗļ»ļŗļŗ Ų±ļŗ·ļ» Ų§ļ»ļŗļƾā¬
The Rosette Option for Language Support
Rosette is a software development kit (SDK) designed for a wide range of large-scale applications
that need to identify, classify, analyze, index, and search unstructured text from various sources,
and its linguistic analysis capabilities are widely used by search engines to both increase
precision and recall It uses a combination of statistical models, dictionary data, and sophisticated
computational linguistics to parse digital text in English and over 25 major European, Asian,
and Middle Eastern languages. Years of development and linguistic analysis have gone into the
development of Rosette to satisfy the most demanding customers, both for quality, robustness,
and speed.
Nearly every major web and enterprise search engine since 1999, including Bing, Endeca, goo
(Japanese), Google, Microsoft/FAST, and Yahoo! has used Rosette. New generations of search-
based applications for e-discovery, ļ¬nancial compliance, and other ļ¬elds also use Rosette.
Rosette is a cross-platform SDK available for Windows and Unix, and oļ¬ers all its capabilities via a
single API, in C, C++, Java, or .NET.
Rosette Capabilities
ā¢ Language Identiļ¬cationāidentiļ¬cation of the primary language of a documentāor language
regions within a multilingual documentāand the ļ¬leās encoding in 55 languages and 45
encodings.
ā¢ Unicode Conversionāconverts documents in legacy encodings to Unicode
ā¢ Character Normalizationānormalizes characters to a single representation (e.g. character+
diacritic to single character with diacritics, half/full-size variants of ASCII characters in Asian
languages, and several language-speciļ¬c normalizations.)
ā¢ Linguistic AnalysisāLemmatization, tokenization, part-of-speech tagging, decompounding,
and more in over 25 languages.
ā¢ Entity Extractionāextracts entities such as people, places, and organizations in 15 languages
via statistical models with customizable user-deļ¬ned entities via regular expressions and
entity databases
ā¢ Name Matchingāreturns matching names despite spelling variations, initials, nicknames,
missing name components, missing spaces, out-of-order name components, the same name
written in diļ¬erent languages, and moreāsupported in multiple languages including Arabic,
Chinese, English, Korean, and Persian.
Essential Elements of Excellent Multilingual Search 9
10. ā¢ Name Translationātranslates names from non-Latin script languages to English and standardizes
already translated namesāsupported in multiple languages including Arabic, Chinese, English,
Korean, Russian, and Persian.
Lemmatization and Stemming
Needing to support a variety languages can stretch the capabilities of out of the box search
solutions, requiring additional eļ¬ort to custom-conļ¬gure the handling of each language and
tune its accuracy one by one.
Chinese, Japanese, and Korean Search
Normalization of Chinese, Japanese and Korean characters is done by the Rosette Core Library for
Unicode (RCLU):
ā¢ For Chinese, Japanese, and Korean: Converting full-width ASCII characters (ļ¼”ļ¼³ļ¼¤ļ¼¦ļ¼ļ¼ ļ¼) to
the usual half-width characters (ASDF&%$).
ā¢ For Japanese: Converting half-width Japanese katakana characters (ļ½±ļ½²ļ½³ļ½“ļ½µ; one-byte
characters invented in the early days of Japanese computing to save on storage) to the usual
full-width characters (ć¢ć¤ć¦ćØćŖ).
Pan-Chinese Search
Rosetteās Chinese script converter can convert text into either simpliļ¬ed or traditional
Chinese.
Germanic and Scandinavian Languages
Character normalization of plurals is part of the Lemmatization of German within Rosette.
Arabic
Rosette provides speech tagging and Lemmatization for Arabic.
Essential Elements of Excellent Multilingual Search 10
11. Boosting Relevant Results with Entities
Rosetteās entity extractor can push results higher in the rankings.
Comprehensive Name Search
Rosetteās name matching functionality including name variations due to nicknames, initials, missing
name components, missing spaces between names, out-of-order name components, the same
name represented in diļ¬erent languages, and more. Thus, besides handling āRobertā vs. āBob,ā
and āAbdul Rasheedā vs. āAbd-al-Rasheedā vs. āAbd Ar-Rashid,ā Rose e also matches āMao Tse-
Tungā, āMao Zedongā, and āęÆę³½äøā (simpliļ¬ed Chinese).
PRECISION VS. RECALL VS. F-SCORE IN A NUTSHELL
Precision and recall are metrics used to evaluate the quality of search results, whether searching
for articles on a topic, or ļ¬nding name matches.
Supposed you are searching for red balls from a box that contains 7 red balls and 8 green balls.
Blindfolded, you pull out 8 balls, of which 4 are red and 4 are green.
Precision is the number of correct items over the total number of items found. That is, you found
4 red (which you wanted) and pulled out 8 balls in total, thus precision is 4/8 (half of your results
were correct) or 50%.
Recall asks the question: āOf all the correct items, how many did I ļ¬nd?ā In this case, there were 7
correct items (because there are 7 red balls) and you found 4, thus your recall is 4/7 or 57%.
F-score is a measure that attempts to balance precision and recall and is often called the
ārelevancyā of a system.
F-score = 2*(Precision*Recall) Ć· (Precision+Recall)
Thus our F-score above = 8/15 = .53, or 53%
Essential Elements of Excellent Multilingual Search 11
12. ELEMENT 3: RELIABILITY AND SUPPORT
Even in a technologically sophisticated company, the core business logic will be the main focus of
all business operations, so technical support as a safety net is essential. Just as ļ¬rms rely on
calling the vendor when the copier is acting up, a few judicious pieces of advice from the search
or linguistics experts to settle a nettlesome problem lends assurance to both engineers and upper
management.
On the language support side, the wide language coverage of Basis Technologyās Rosetteāover 25
languages covering Asia, Europe, and the Middle Eastāensure there will be one point of contact
to address questions about any languages, instead of a multitude of single-language vendors with
varying support contracts. Rosette developers are on the front lines, providing high quality support
to customer inquiries.
Additionally as a single platform, Rosetteās benchmarks for speed and accuracy are readily
available, and implementing one or over 25 languages is the same amount of work.
Essential Elements of Excellent Multilingual Search 12
13. SUMMARY & CONCLUSION: ROSETTE LINGUISTICS PLATFORM
For companies that derive competitive advantage from a better quality or better customized
search, search is a development problem, and not a āplug-in-a-solutionā problem, and the four
search engine essentials are:
1. Customizabilityāthe ability to innovate on top of the chosen search engine to build value-
added features
2. Speedāthe search engineās query response rates and indexing and updating times
3. Scalabilityāthe ability to easily handle increased number of queries or volumes of content
requiring indexing
4. Costātotal cost of ownership; return on investment; scaling of cost to search needs
The Rosette platform ļ¬lls that gap, through solid linguistic analysis technology and comprehensive
dictionary data to perform:
ā¢ Language Identiļ¬cationāThe language identiļ¬er detects 55 languages with speed and
accuracy, having been trained on gigabytes of hand-veriļ¬ed data. It covers a broad range of
Asian, Indo-European, and Middle Eastern languages.
ā¢ Linguistic AnalysisāRosette performs a complete linguistic analysis to improve search recall
and precision for some of the most linguistically diļ¬cult languages.
ā Lemmatizationāfor boosting recall while maintaining precision in many languages
ā Tokenizationāfor Chinese, Japanese, and Korean which do not have spaces between
words
ā Chinese script conversionāfor pan-Chinese search
ā Character normalizationāspecialized transforms for Arabic and Asian languages
ā Decompoundingāfor Germanic and Scandinavian languages which form words from
multiple words
ā Part-of-speech taggingāfor context-sensitive Lemmatization, and particularly for Arabic
search to distinguish between common nouns requiring Lemmatization, and proper nouns,
requiring stemming
ā¢ Entity ExtractionāTo locate names, places, organizations, and other entities, both to boost
ranking of results whose entities match the search query or as a base for building faceted
search. Rosette plugs into the UpdateRequestProcessor at indexing to populate each
document record with entities.
ā¢ Name MatchingāGoing beyond āDid you mean?ā, Rosette ļ¬nds matching names that diļ¬er
due to nicknames, initials, missing name components, missing spaces between names, out-
of-order name components, the same name represented in diļ¬erent languages, and more.
NEXT STEPS
ā¢ Request a free product evaluation of Rosette at:
http://www.basistech.com/products/requests/evalua -request.html
Essential Elements of Excellent Multilingual Search 13