February 22, 2012

Essential Elements of
Excellent Multilingual Search
Boosting Search Quality with the Rosette Linguistics Platform




                      We put the World in the World Wide Web®
ABOUT BASIS TECHNOLOGY
Basis Technology provides software solutions for text analytics, information retrieval,
digital forensics, and identity resolution in over forty languages. Our Rosette® linguistics
platform is a widely used suite of interoperable components that power search, business
intelligence, e-discovery, social media monitoring, financial compliance, and other
enterprise applications. Our linguistics team is at the forefront of applied natural language
processing using a combination of statistical modeling, expert rules, and
corpus-derived data. Our forensics team pioneers better, faster, and cheaper techniques
to extract forensic evidence, keeping government and law enforcement ahead of
exponential growth of data storage volumes.

Software vendors, content providers, financial institutions, and government agencies
worldwide rely on Basis Technology’s solutions for Unicode compliance, language
identification, multilingual search, entity extraction, name indexing, and name
translation. Our products and services are used by over 250 major firms, including Cisco,
EMC, Exalead/Dassault Systems, Hewlett-Packard, Microsoft, Oracle, and Symantec.
Our text analysis products are widely used in the U.S. defense and intelligence industry
by such firms as CACI, Lockheed Martin, Northrop Grumman, SAIC, and SRI. We are
the top provider of multilingual technology to web and e-commerce search engines,
including Amazon.com, Bing, Google, and Yahoo!.

Company headquarters are in Cambridge, Massachusetts, with branch offices
in San Francisco, Washington, London, and Tokyo. For more information, visit
www.basistech.com.


© 2012 Basis Technology Corporation. “Basis Technology”, “Geoscope”, “Odyssey Digital Forensics”, “Rosette”, and “We put the World in the
World Wide Web” are registered trademarks of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the
property of their respective owners. (2012-08-30)
ABSTRACT
If search is vital to the competitive advantage or value proposition of your business, then this
whitepaper is a must-read. In the survival of the fittest, companies like Netflix and many
businesses today thrive or die depending on whether they can:
   a) Provide “excellent search” that is comprehensive, accurate, and relevant to the user
   b) Serve each user equally well in his or her preferred language

The three essential elements of a business-dependable search solution on which to build
applications or workflows to generate revenue are: (1) search customizability and the
speed-scalability-cost trio, (2) high search quality, with emphasis on “recall,” or maximizing the number
of good results found, and (3) reliability and support. This paper examines what these three
elements mean, and looks at Basis Technology’s Rosette linguistics platform as one possible
solution.


THE THREE ELEMENTS OF “EXCELLENT SEARCH”
More and more, revenues and business longevity rely on cost-effective, accurate, and scalable
search in English and other languages. E-commerce, job or real estate hunting sites, e-discovery/
e-disclosure solutions, financial compliance solutions, online information providers, and media and
entertainment portal sites all begin with users finding what they are looking for. Choose the right
search and revenues go up, satisfied customers increase, and the business is well positioned for
the future. Choose the wrong search and costs constrict growth while poor relevancy chases away
customers. Savvy businesses recognize that their customers, partners, suppliers, and employees
increasingly create and consume content in their native language.

For most enterprises, improving search quality means improving (1) recall, i.e., maximizing the
number of relevant results found, and to a slightly lesser degree, (2) precision, i.e., maximizing
the share of returned results that are relevant.

We’ll now examine each of these three elements of search in detail:
   • Customizability of search is critical to fit it to your particular use case or create the value add
     of your business. Speed, scale, and cost can quickly strangle the success of your business
     if search is struggling to maintain speed and scale, and choking profits with escalating
     per-query or per-document costs.
   • Maximizing search recall and precision is the 50,000 pound gorilla. Speed is important,
     but if your users aren’t finding the results they want to see, it doesn’t matter how fast those
     results get delivered. For enterprise applications and search engines, maximizing recall (without
     diminishing precision) is usually the greater concern. Good language support is an intrinsic
     ingredient for improved recall.
   • Reliability and technical support ensure that expert resources are available for solving any
     serious issues in the service.




ELEMENT 1: CUSTOMIZABILITY, SPEED, SCALE, COST
The basic question a company has to ask is: Do we want/like/need/know how to bend technology
to our business processes and business model in order to achieve competitive advantage? And
most importantly: Is search part of that competitive advantage? If the answer is “yes,” then search
is not a plug-in-a-solution problem, but a software development problem.

Because search is such a broad-scope problem, there are all kinds of edge cases that are difficult to
support using a proprietary, commercial search solution. The technology is not easily changed
without expensive consultation and modifications to the core software that are dependent on the
vendor’s product release cycle.

For companies where better search means more money earned or saved, the frustration can
increase geometrically: their use case may be unique, but the software isn’t extensible
enough to deal with it, and they have to spend more and more for less and less.

Like the three sides of a triangle, speed, scalability, and cost are inextricably linked. Assuming that the
load on the search engine will only increase over time, search performance must be scalable to
keep query response times low and to ensure index updating times do not affect searches.
At the same time, pricing models of commercial search engines based on number of queries or
number of documents indexed will eventually constrict growth by sheer cost.


ELEMENT 2: MAXIMIZING SEARCH RECALL AND PRECISION DEPENDS ON LANGUAGE SUPPORT
Search is really all about finding, whether it’s a shopper on an e-commerce site, a banker checking a
wire transfer request against a list of money launderers, or home buyers looking through real estate
listings.

The concepts of precision and recall measure how “good” search is. More precise results with greater
recall mean better quality search, but frequently increasing one means a decrease in the other, so the
trick is to maximize each metric while minimizing the impact on the other. (See the sidebar
“Precision vs. Recall vs. F-score in a Nutshell.”)

Let’s now look at the language-specific processing required to increase recall and/or precision for three
categories of languages that are notoriously difficult to process: Chinese/Japanese/Korean, Germanic and
Scandinavian languages, and Arabic. We’ll first look at one aspect of linguistic processing that improves
recall in most languages, including English and European languages: lemmas and stems.


Improving Recall via Lemmas and Stems
Lemmatization and stemming are two methods of increasing the recall of search results. Stemming
improves recall, but can diminish precision. Lemmatization is the preferred method since it increases
recall while maintaining precision.

Lemmatization, finding the dictionary form of a word, broadens search to add relevant search results
in ways that stemming, though more commonly available, cannot. A user searching for “President Obama
speaking on healthcare” would likely also want results containing “President Obama spoke on
healthcare” and “President Obama was speaking on healthcare.” By lemmatizing “spoke” and “was
speaking” and storing the lemma “speak” in the search index, the latter two results will (correctly!) get
picked up by a search for “President Obama speaking on healthcare.”




Stemming uses some basic rules to look for common stems in words. For example, “decod” is the
stem of “decoding,” “decoder,” and “decodes.” However, no matter how many letters a stemmer
chops off, “spoke” will never become “speak.” Lemmas are found using dictionary data and
recognizing the context of a word. The lemma is very different when “spoke” is used as a noun or a
verb. Stemming, on the other hand, applies a set of rules regardless of a word’s part of speech or
context in a sentence.

Since lemmas go back to the intrinsic meaning of a word, irrelevant results are kept to a minimum.
Stemming, on the other hand, may sometimes produce unpredictable results. It turns
“several” into “sever,” and “arsenic” and “arsenal” share the same stem, “arsen.” And whereas
“apples” to “apple” is reasonable, “Los Angeles” to “Los Angele” is not.

Although stemming is more readily accessible as a set of language-specific rules than the
dictionary-requiring lemmatization, the greater recall and precision from lemmatization is the
difference between good and excellent search.

Stemming vs. Lemmatization in a Nutshell
Search Query     Traditional Stemming     Lemmatization using Rosette              Comparison

animals          anim                     animal                                   Two unrelated words may share a
animated         anim                     animate                                  stem, in this case “anim” or
arsenal          arsen                    arsenal                                  “arsen”
arsenic          arsen                    arsenic

several          sever                    several                                  Stemming may have unintended
organization     organ                    organization                             consequences

children         children                 child                                    Irregular verbs and nouns stump
spoke            spoke                    speak (when used as past-tense verb)     the stemmer
                                          spoke (when used as noun)
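
The contrast in the table can be sketched in a few lines of Python. This is a toy illustration only: the suffix list, the `LEMMAS` dictionary, and both function names are assumptions made up for this example, and neither function reflects how Rosette or any production stemmer is actually implemented.

```python
# Toy contrast between rule-based stemming and dictionary-based lemmatization.
# SUFFIXES and LEMMAS are hypothetical sample data, not a real rule set.

SUFFIXES = ["ing", "er", "es", "ed", "al", "ic", "s", "e"]

# (surface form, part of speech) -> dictionary form
LEMMAS = {
    ("spoke", "VERB"): "speak",
    ("spoke", "NOUN"): "spoke",
    ("speaking", "VERB"): "speak",
    ("children", "NOUN"): "child",
}


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, ignoring context and part of speech."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str, pos: str) -> str:
    """Look up the dictionary form; fall back to the word itself if unknown."""
    return LEMMAS.get((word.lower(), pos), word.lower())


if __name__ == "__main__":
    for w in ["decoding", "decoder", "decodes", "several", "arsenal", "arsenic"]:
        print(w, "->", naive_stem(w))
    print("spoke (verb) ->", lemmatize("spoke", "VERB"))
    print("spoke (noun) ->", lemmatize("spoke", "NOUN"))
```

The stemmer never needs a dictionary, which is why it is cheap to build, but it also has no way to map “spoke” to “speak.” The lemmatizer gets that right only because it is given both dictionary data and the word’s part of speech.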


Improving Precision in Chinese, Japanese, and Korean
Asian languages like Chinese, Japanese, and Korean are fundamentally more difficult to process
than European languages because their words are not consistently separated by spaces. Chinese
and Japanese use no spaces between words, and Korean uses some spaces, but not between
every word, and inconsistently depending on the writer.

For a search engine to build an index, it has to have words. N-gram indexing and tokenization are methods
used to produce indexable units in these languages, each resulting in different degrees of search
precision.

N-gram vs. Tokenization
Some engines may use the n-gram technique to break up streams of text into overlapping units of 2 to 4
characters. Recall will be high, but precision will be low and indexes will be very large,
impacting speed.

Here is a Japanese example: 東京都の観光地 (translation: sightseeing spots in Tokyo)




Bigram          English translation

東京            Tokyo
京都            Kyoto
都の            capital’s
の観            not a word
観光            sightseeing
光地            not a word


A seven-character phrase produces six items to index via the n-gram technique, but also introduces
a false word, “Kyoto.” Since many of the bigrams are not actual words in the original text, many
other unrelated strings might accidentally match them, too, and a search becomes a statistical
process: Do enough of these two-character segments appear in a given document to flag it as a
“match,” while not accidentally matching the wrong strings?

This approach also produces many more entries in the index than words, unnecessarily increasing
its size. Processing a 100-character buffer (about 13-17 words) adds 99 bigram entries to the index,
or 294 entries if overlapping 2-, 3-, and 4-character units are all indexed.
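
To make the index-size argument concrete, here is a minimal sketch of overlapping character n-grams. The function names are illustrative assumptions, not part of any search engine’s API; the only fact relied on is that a string of length L contains L - n + 1 overlapping windows of n characters.

```python
from typing import Iterable, List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return every overlapping n-character window of `text`."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]


def ngram_entry_count(length: int, sizes: Iterable[int]) -> int:
    """Number of n-gram entries produced by a buffer of `length` characters."""
    return sum(max(length - n + 1, 0) for n in sizes)


if __name__ == "__main__":
    phrase = "東京都の観光地"  # "sightseeing spots in Tokyo"
    print(char_ngrams(phrase, 2))           # six bigrams, including 京都 (Kyoto)
    print(ngram_entry_count(100, [2]))      # bigrams only
    print(ngram_entry_count(100, [2, 3, 4]))  # 2-, 3-, and 4-grams together
```

Every window is indexed whether or not it is a real word, which is where both the false “Kyoto” match and the inflated index come from.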

A better choice is true tokenization based on morphological analysis [1], which breaks up text into
real words. With tokenization of the example above, there are at most three items to index, and
possibly only two if the possessive marker is treated as a “stop word” [2].

Words           English translation

東京都          Tokyo
の              (possessive marker)
観光地          sightseeing spots
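
As a rough sketch of how dictionary-driven segmentation differs from n-grams, the following uses greedy longest-match against a word list. This is a deliberate simplification: real morphological analysis weighs context and statistics, and the three-entry `LEXICON`, the `STOP_WORDS` set, and the `tokenize` function are all hypothetical names invented for this example.

```python
from typing import List, Set

# Hypothetical sample dictionary covering only the example phrase.
LEXICON = {"東京都", "の", "観光地"}
STOP_WORDS = {"の"}  # possessive marker, optionally dropped from the index


def tokenize(text: str, lexicon: Set[str]) -> List[str]:
    """Split `text` into the longest dictionary words, scanning left to right."""
    max_len = max(len(word) for word in lexicon)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + size]
            # Unknown single characters are emitted as-is so the scan advances.
            if candidate in lexicon or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    words = tokenize("東京都の観光地", LEXICON)
    print(words)
    print([w for w in words if w not in STOP_WORDS])
```

Because only real words reach the index, the spurious bigram “Kyoto” never appears, and the phrase contributes three entries (or two with the stop word removed) instead of six.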


Character Normalization to Improve Recall
Chinese, Japanese, and Korean have additional character normalization needs beyond making sure
ASCII characters are all uppercase or lowercase. In digital files, these languages also use full-width
variants of ASCII letters and punctuation, which need to be normalized to the half-width ASCII form.

                                          ＡＳＣＩＩｗｏｒｄｓ％＄＆ → ASCIIwords%$&

In the case of Japanese, a half-width version of Japanese katakana characters also needs to be
normalized to the usual full-width katakana. The half-width version was invented in the early days
of computer processing to reduce data size, but it is still found in documents and webpages.

                                                   ｶﾀｶﾅ → カタカナ
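
Both width conversions above happen to be covered by Unicode compatibility normalization, so one way to sketch them is with Python’s standard `unicodedata` module. The wrapper name `normalize_width` is an assumption for this example, and NFKC is a swapped-in general-purpose technique, not a description of how Rosette performs the step.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold full-width ASCII and half-width katakana to their usual forms.

    NFKC also folds other compatibility characters (ligatures, circled
    digits, and so on), which may be more aggressive than a given
    application wants.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(normalize_width("ＡＳＣＩＩｗｏｒｄｓ％＄＆"))
    print(normalize_width("ｶﾀｶﾅ"))
```

Whatever normalization is chosen, it has to be applied identically to documents at index time and to queries at search time, or the two sides will not meet.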




[1]   Morphological analysis in linguistics examines the smallest units of meaning within a word,
      called “morphemes,” such as “un-” in “unremarkable.”
[2]   A “stop word” is one which occurs so frequently within a language as to not help in finding
      search results. Some search engines may choose to ignore them.




Improving Recall with Pan-Chinese Search
Chinese search gains an enormous boost in recall if both the simplified and traditional Chinese
scripts are searched. Mainland China and Singapore use simplified Chinese, whereas Taiwan and
Hong Kong use traditional Chinese. Chinese speakers expect one query in one script to find results
from both scripts.

Pan-Chinese search (searching documents in both Chinese scripts) is accomplished by converting
all the text of one script into the other at query and index time.

The differences between the two scripts fall into three categories:
Category                                               Simplified Chinese   Traditional Chinese       English Translation

Same character used in both scripts (true where        大                   大                        big
the character was simple to start with)

Two characters with the same pronunciation (but        头发                 頭髮                      hair (髮)
different base meaning) collapsed to one               出发                 出發                      emit (發)
character in simplified Chinese

Different vocabulary (analogous to American            计算机               電腦                      computer
and British English differences such as “truck” vs.
“lorry”), occurring mostly in modern words


In the first case, the change is trivial, and amounts to making sure all the text is in the same
encoding; however, the second and third categories require dictionary data to ensure that the
conversion is context-sensitive, so that the correct traditional character is chosen.
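
The need for dictionary data can be illustrated with a toy simplified-to-traditional converter in which word-level entries take priority over per-character fallbacks. Both tables below are tiny hypothetical samples and `to_traditional` is an invented name; a real converter needs far larger dictionaries and better disambiguation than longest-match.

```python
# Word-level entries resolve ambiguous characters and vocabulary differences.
WORD_MAP = {
    "头发": "頭髮",    # hair: 发 becomes 髮
    "出发": "出發",    # 发 becomes 發 (emit)
    "计算机": "電腦",  # computer: different vocabulary, not a character swap
}
# Per-character fallback; the default for the ambiguous 发 is only a guess.
CHAR_MAP = {"头": "頭", "发": "發", "计": "計", "机": "機"}


def to_traditional(text: str) -> str:
    """Convert using the longest matching word entry, else one character."""
    max_len = max(len(word) for word in WORD_MAP)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i : i + size]
            if chunk in WORD_MAP:
                out.append(WORD_MAP[chunk])
                break
        else:
            # No word entry matched: convert a single character on its own.
            size = 1
            out.append(CHAR_MAP.get(text[i], text[i]))
        i += size
    return "".join(out)


if __name__ == "__main__":
    for sample in ["大", "头发", "出发", "计算机"]:
        print(sample, "->", to_traditional(sample))
```

A character-only table would have to pick one traditional form for 发 and would get either “hair” or “emit” wrong, and it could never turn 计算机 into 電腦 at all.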

Improving Recall in Germanic and Scandinavian Languages
The linguistic processing required by German, Dutch, Scandinavian languages (Danish, Norwegian,
Swedish, and Finnish), and Korean is very similar in that these languages freely use compound
words, which frequently need to be broken up to increase recall.

Take these two German compound words:
German compound                               German decompounded                  English translation

Jugendarbeitslosigkeit                        Jugend + Arbeitslosigkeit            Youth unemployment
                                                                                   (Youth + unemployment)

Samstagmorgen                                 Samstag + Morgen                     Saturday morning
                                                                                   (Saturday + morning)


It’s reasonable to imagine that a person looking for information about youth unemployment in
German would also welcome search results which included “youth” and “unemployment” as
separate words. Similarly, the concepts of “Saturday” and “morning” are very likely to appear as a
compound word or as separate words in related documents.

Decompounding increases search recall with minimal impact on precision, but again, requires
dictionary data to do properly.
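
A minimal sketch of dictionary-based decompounding follows. The four-word `LEXICON` and the `decompound` function are hypothetical, and the sketch ignores linking elements (such as the German “Fugen-s”) and ambiguous splits that a production decompounder has to handle.

```python
from typing import List, Optional, Set

# Hypothetical sample lexicon, stored in lowercase.
LEXICON = {"jugend", "arbeitslosigkeit", "samstag", "morgen"}


def decompound(word: str, lexicon: Set[str]) -> List[str]:
    """Split `word` into lexicon entries, or return it whole if impossible."""
    lowered = word.lower()

    def split(rest: str) -> Optional[List[str]]:
        if not rest:
            return []
        # Try the longest known prefix first, then recurse on the remainder.
        for end in range(len(rest), 0, -1):
            if rest[:end] in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [rest[:end]] + tail
        return None

    return split(lowered) or [lowered]


if __name__ == "__main__":
    print(decompound("Jugendarbeitslosigkeit", LEXICON))
    print(decompound("Samstagmorgen", LEXICON))
    print(decompound("Haus", LEXICON))  # not in the lexicon, returned whole
```

In an index, the parts would typically be stored alongside the original compound so that a query for either form can match.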

Additionally, to maximize recall, singular and plural forms that differ only by an umlaut, such as
“Garten” (garden) and its plural “Gärten,” need to be normalized to the same index term.




Improving Recall in Arabic
Arabic is one language which especially suffers from low recall if a search engine does not
perform Arabic-specific linguistic processing, starting from basic character normalization up to
stemming of proper nouns and lemmatization of common nouns. Arabic attaches so many affixes
(to the beginning, end, and middle of words) that without lemmatizing before searching, many
relevant results are never found. One root could form the basis for up to fifteen different verbs!

Character Normalization to Improve Precision and Recall
In English, search engines perform some basic normalization such as lowercasing all the words. In
Arabic, normalization is much more complex and will improve both precision and recall. Types of
character normalization required include:
   • Words with additional vocalization marks, such as vocalized vs. unvocalized spellings of سياسي
   • Words containing certain letters with dots added or removed, such as: رقيه vs. رقية
   • Words (including ambiguous cases) containing certain letters with symbols added or
     removed, such as:
                     ٱعتماد    vs.   اعتماد
                     داؤود     vs.   داوود
                     اثم       vs.   (آثم or إثم or أثم)
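
The three categories above can be sketched as a small folding function. The choice of which letters to fold, and the name `normalize_arabic`, are assumptions made for illustration; they are not taken from Rosette, and a production normalizer would treat the ambiguous cases with more care than a blanket fold.

```python
import re

# Vocalization marks (U+064B to U+0652) plus the tatweel stretching character.
_MARKS = re.compile("[\u064B-\u0652\u0640]")

# Hypothetical folding table: letter variants mapped to one base form.
_LETTER_FOLDS = str.maketrans({
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0671": "\u0627",  # alef wasla            -> bare alef
    "\u0624": "\u0648",  # waw with hamza        -> waw
    "\u0629": "\u0647",  # teh marbuta           -> heh (dots removed)
})


def normalize_arabic(text: str) -> str:
    """Drop vocalization marks and fold letter variants to a single form."""
    return _MARKS.sub("", text).translate(_LETTER_FOLDS)


if __name__ == "__main__":
    # Alef wasla and plain alef spellings fold to the same string.
    print(normalize_arabic("\u0671\u0639\u062A\u0645\u0627\u062F"))
    print(normalize_arabic("\u0627\u0639\u062A\u0645\u0627\u062F"))
```

Folding trades a little precision for recall: after normalization the three hamza spellings in the last example can no longer be told apart, so any disambiguation has to come from later linguistic analysis.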

Improving Recall with Lemmatization and Stemming
In Arabic as in other languages, lemmatization increases the recall of search results (cf. “Improving
Recall via Lemmas and Stems”), but with a twist. In Arabic, lemmatization is only applicable to verbs
and common nouns (e.g., apple, book, table) and is not applicable to proper nouns (e.g., names of
people and places such as Abul-qassem El-Chabby and Baqah Al-Sharqiyyah).

Proper nouns require stemming to remove prepositions and conjunctions which are often
attached to them. Searches for names of people, places, and organizations are seriously hampered
without stemming. For example, these phrases in English appear as one word in Arabic: “for
Othman,” “with Othman,” “as Othman,” and “and Othman.” Thus a search for just “Othman”
would not find the above variations without stemming.
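
The “Othman” example can be sketched as prefix stripping guarded by a name list. The prefix tuple, the one-entry `KNOWN_NAMES` set, and the function name are illustrative assumptions; real systems rely on part-of-speech tagging rather than a fixed list to decide whether a token is a proper noun.

```python
# Single-letter particles written attached to the following word.
PREFIXES = (
    "\u0644",  # lam:  "for"
    "\u0628",  # beh:  "with"
    "\u0643",  # kaf:  "as"
    "\u0648",  # waw:  "and"
)
KNOWN_NAMES = {"\u0639\u062B\u0645\u0627\u0646"}  # "Othman" (sample entry)


def strip_name_prefix(token: str) -> str:
    """Return the bare name if `token` is prefix + known name, else `token`."""
    for prefix in PREFIXES:
        remainder = token[len(prefix):]
        if token.startswith(prefix) and remainder in KNOWN_NAMES:
            return remainder
    return token


if __name__ == "__main__":
    othman = "\u0639\u062B\u0645\u0627\u0646"
    for prefix in PREFIXES:
        print(strip_name_prefix(prefix + othman) == othman)
```

Checking the remainder against known names keeps the sketch from mangling ordinary words that merely begin with one of these letters.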

Additionally, since Arabic names are frequently the same words as common nouns, part-of-speech
tagging becomes critical to differentiating between common nouns and proper nouns, particularly when a
word’s part of speech varies depending on where it appears in a sentence.

Search Recall and Precision Enhancers for Many Languages
For English and most European languages, better recall and precision mean:
   • Adding lemmas (the dictionary form of a word) to the search index to increase recall, while
     minimizing the negative impact on precision (cf. previous section “Improving Recall via
     Lemmas and Stems”)
   • Boosting the ranking of relevant documents on the results page through the use of
     document metadata such as entities
   • Increasing recall through comprehensive name search that goes beyond the “Did you
     mean?” functionality of most search engines




Boosting Relevant Results
Once the search engine is finding more good results (better recall) and returning fewer bad results
(better precision), the good results need to be at the top of the search results.

Frequently the most critical words in a search have to do with entities: proper nouns such as
names of people, places, and organizations. But when is “Christian” a religious affiliation, and when is
it part of a person’s name, as in “Christian Dior”? A good entity extractor will know. At indexing time, entities can
be added to a document record’s metadata to boost the rank of documents which have entities
matching those in the query.
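
One simple way to picture entity boosting is as a multiplier on a document’s base relevance score. The formula, the default `boost` value, and the function name below are assumptions chosen for illustration; actual search engines expose their own boosting mechanisms.

```python
from typing import Set


def boosted_score(base_score: float, query_entities: Set[str],
                  doc_entities: Set[str], boost: float = 0.5) -> float:
    """Raise a document's score for each entity it shares with the query."""
    shared = query_entities & doc_entities
    return base_score * (1.0 + boost * len(shared))


if __name__ == "__main__":
    query = {"christian dior"}
    fashion_doc = {"christian dior", "paris"}
    religion_doc = {"rome"}
    print(boosted_score(2.0, query, fashion_doc))
    print(boosted_score(2.0, query, religion_doc))
```

Two documents that both contain the word “Christian” start with the same base score, but only the one whose metadata holds the matching person entity is pushed up the results page.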

Comprehensive Name Search to Improve Recall
The “Did you mean?” function of most modern search engines will find actress “Cate Blanchett”
even if the user misspells her first name as “Kate.” However, for less famous people, and to cover a gamut of
name variations, a true name matching function is needed to handle “Chuck Berry” vs. “Charles
Berry”; “John Kearns” vs. “Jon Cairns”; or even “Baqah Al-Sharqiyyah” and “باقة الشرقية.”
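
A very small slice of name matching, nickname handling and component order, can be sketched as follows. The `NICKNAMES` table and both function names are hypothetical, and the sketch deliberately does not attempt phonetic matching (“Kearns” vs. “Cairns”) or cross-script matching, which need far more machinery.

```python
from typing import Tuple

# Hypothetical nickname table; a real system would use a much larger one.
NICKNAMES = {"chuck": "charles", "bob": "robert"}


def name_key(full_name: str) -> Tuple[str, ...]:
    """Canonicalize nicknames and ignore the order of name components."""
    parts = [NICKNAMES.get(part, part) for part in full_name.lower().split()]
    return tuple(sorted(parts))


def same_name(a: str, b: str) -> bool:
    """True when two names agree after nickname and order normalization."""
    return name_key(a) == name_key(b)


if __name__ == "__main__":
    print(same_name("Chuck Berry", "Charles Berry"))
    print(same_name("Berry Chuck", "Charles Berry"))
    print(same_name("John Kearns", "Jon Cairns"))  # needs phonetic matching
```

The last pair fails here, which is the point: exact comparison after light normalization covers only the easiest variations.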

The Rosette Option for Language Support
Rosette is a software development kit (SDK) designed for a wide range of large-scale applications
that need to identify, classify, analyze, index, and search unstructured text from various sources,
and its linguistic analysis capabilities are widely used by search engines to increase both
precision and recall. It uses a combination of statistical models, dictionary data, and sophisticated
computational linguistics to parse digital text in English and over 25 major European, Asian,
and Middle Eastern languages. Years of development and linguistic analysis have gone into
Rosette to satisfy the most demanding customers in quality, robustness,
and speed.

Nearly every major web and enterprise search engine since 1999, including Bing, Endeca, goo
(Japanese), Google, Microsoft/FAST, and Yahoo!, has used Rosette. New generations of
search-based applications for e-discovery, financial compliance, and other fields also use Rosette.

Rosette is a cross-platform SDK available for Windows and Unix, and offers all its capabilities via a
single API, in C, C++, Java, or .NET.

Rosette Capabilities
   • Language Identification: identifies the primary language of a document (or the language
     regions within a multilingual document) and the file’s encoding, covering 55 languages and 45
     encodings.
   • Unicode Conversion: converts documents in legacy encodings to Unicode.
   • Character Normalization: normalizes characters to a single representation (e.g., character +
     diacritic to a single character with diacritic, half/full-width variants of ASCII characters in Asian
     languages, and several language-specific normalizations).
   • Linguistic Analysis: lemmatization, tokenization, part-of-speech tagging, decompounding,
     and more in over 25 languages.
   • Entity Extraction: extracts entities such as people, places, and organizations in 15 languages
     via statistical models, with customizable user-defined entities via regular expressions and
     entity databases.
   • Name Matching: returns matching names despite spelling variations, initials, nicknames,
     missing name components, missing spaces, out-of-order name components, the same name
     written in different languages, and more. Supported in multiple languages including Arabic,
     Chinese, English, Korean, and Persian.
   • Name Translation: translates names from non-Latin script languages to English and standardizes
     already translated names. Supported in multiple languages including Arabic, Chinese, English,
     Korean, Russian, and Persian.



     Lemmatization and Stemming
     Needing to support a variety of languages can stretch the capabilities of out-of-the-box search
     solutions, requiring additional effort to custom-configure the handling of each language and
     tune its accuracy one by one.


     Chinese, Japanese, and Korean Search
     Normalization of Chinese, Japanese, and Korean characters is done by the Rosette Core Library for
     Unicode (RCLU):
        • For Chinese, Japanese, and Korean: Converting full-width ASCII characters (ＡＳＤＦ＆％＄) to
          the usual half-width characters (ASDF&%$).
        • For Japanese: Converting half-width Japanese katakana characters (ｱｲｳｴｵ; one-byte
          characters invented in the early days of Japanese computing to save on storage) to the usual
          full-width characters (アイウエオ).


     Pan-Chinese Search
       Rosette’s Chinese script converter can convert text into either simplified or traditional
       Chinese.

     Germanic and Scandinavian Languages

       Character normalization of plurals is part of the lemmatization of German within Rosette.

      Arabic
       Rosette provides part-of-speech tagging and lemmatization for Arabic.




Boosting Relevant Results with Entities
Rosette’s entity extractor can push results higher in the rankings.




Comprehensive Name Search
Rosette’s name matching functionality handles name variations due to nicknames, initials, missing
name components, missing spaces between names, out-of-order name components, the same
name represented in different languages, and more. Thus, besides handling “Robert” vs. “Bob,”
and “Abdul Rasheed” vs. “Abd-al-Rasheed” vs. “Abd Ar-Rashid,” Rosette also matches
“Mao Tse-Tung,” “Mao Zedong,” and “毛泽东” (simplified Chinese).




PRECISION VS. RECALL VS. F-SCORE IN A NUTSHELL
Precision and recall are metrics used to evaluate the quality of search results, whether searching
for articles on a topic, or finding name matches.

Suppose you are searching for red balls in a box that contains 7 red balls and 8 green balls.
Blindfolded, you pull out 8 balls, of which 4 are red and 4 are green.

Precision is the number of correct items over the total number of items found. That is, you found
4 red (which you wanted) and pulled out 8 balls in total, thus precision is 4/8 (half of your results
were correct) or 50%.

Recall asks the question: “Of all the correct items, how many did I find?” In this case, there were 7
correct items (because there are 7 red balls) and you found 4, thus your recall is 4/7 or 57%.

F-score is a measure that attempts to balance precision and recall (it is their harmonic mean) and is
often called the “relevancy” of a system.

F-score = 2*(Precision*Recall) ÷ (Precision+Recall)

Thus our F-score above = 2*(4/8 * 4/7) ÷ (4/8 + 4/7) = 8/15 = .53, or 53%
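
The arithmetic in this sidebar can be checked with a short sketch using exact fractions. The function and parameter names are illustrative assumptions; the formulas are the ones given above.

```python
from fractions import Fraction
from typing import Tuple


def precision_recall_f(correct_found: int, total_found: int,
                       total_correct: int) -> Tuple[Fraction, Fraction, Fraction]:
    """Return (precision, recall, F-score) as exact fractions."""
    precision = Fraction(correct_found, total_found)
    recall = Fraction(correct_found, total_correct)
    if precision + recall == 0:
        # Nothing correct was found, so the F-score is defined here as zero.
        return precision, recall, Fraction(0)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # 4 red balls among the 8 drawn, out of 7 red balls in the box.
    p, r, f = precision_recall_f(correct_found=4, total_found=8, total_correct=7)
    print(p, r, f)
    print(round(float(f), 2))
```

Working in fractions avoids rounding the intermediate values, so the result can be compared directly with the 8/15 derived by hand.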




ELEMENT 3: RELIABILITY AND SUPPORT
Even in a technologically sophisticated company, the core business logic will be the main focus of
all business operations, so technical support as a safety net is essential. Just as firms rely on
calling the vendor when the copier is acting up, a few judicious pieces of advice from search
or linguistics experts to settle a nettlesome problem lend assurance to both engineers and upper
management.

On the language support side, the wide language coverage of Basis Technology’s Rosette (over 25
languages covering Asia, Europe, and the Middle East) ensures there will be one point of contact
to address questions about any language, instead of a multitude of single-language vendors with
varying support contracts. Rosette developers are on the front lines, providing high-quality support
for customer inquiries.

Additionally, because Rosette is a single platform, its benchmarks for speed and accuracy are readily
available, and implementing one language or all 25-plus takes the same amount of work.




SUMMARY & CONCLUSION: ROSETTE LINGUISTICS PLATFORM
    For companies that derive competitive advantage from a better quality or better customized
    search, search is a development problem, and not a ā€œplug-in-a-solutionā€ problem, and the four
    search engine essentials are:
       1. Customizabilityā€”the ability to innovate on top of the chosen search engine to build value-
          added features
       2. Speedā€”the search engineā€™s query response rates and indexing and updating times
       3. Scalabilityā€”the ability to easily handle increased number of queries or volumes of content
          requiring indexing
       4. Costā€”total cost of ownership; return on investment; scaling of cost to search needs

The Rosette platform ļ¬lls that gap, through solid linguistic analysis technology and comprehensive
dictionary data to perform:
   ā€¢ Language Identiļ¬cationā€”The language identiļ¬er detects 55 languages with speed and
     accuracy, having been trained on gigabytes of hand-veriļ¬ed data. It covers a broad range of
     Asian, Indo-European, and Middle Eastern languages.
   ā€¢ Linguistic Analysisā€”Rosette performs a complete linguistic analysis to improve search recall
     and precision for some of the most linguistically diļ¬ƒcult languages.
     ā—‹ Lemmatizationā€”for boosting recall while maintaining precision in many languages
     ā—‹ Tokenizationā€”for Chinese, Japanese, and Korean which do not have spaces between
       words
     ā—‹ Chinese script conversionā€”for pan-Chinese search

     ā—‹ Character normalizationā€”specialized transforms for Arabic and Asian languages
     ā—‹ Decompoundingā€”for Germanic and Scandinavian languages which form words from
       multiple words
     ā—‹ Part-of-speech taggingā€”for context-sensitive Lemmatization, and particularly for Arabic
       search to distinguish between common nouns requiring Lemmatization, and proper nouns,
       requiring stemming

   • Entity Extraction: Locates names, places, organizations, and other entities, either to boost
     the ranking of results whose entities match the search query or as a base for building faceted
     search. Rosette plugs into the UpdateRequestProcessor at indexing time to populate each
     document record with entities.
   • Name Matching: Going beyond "Did you mean?", Rosette finds matching names that differ
     due to nicknames, initials, missing name components, missing spaces between names,
     out-of-order name components, the same name represented in different languages, and more.
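
To make two of the items above concrete, here is a minimal Python sketch that uses only the
standard library. It does not call Rosette: `char_bigrams` and `normalize_width` are hypothetical
helper names, and Unicode NFKC normalization stands in for the specialized transforms mentioned
above. The first helper shows what happens when Japanese text is split into overlapping character
pairs instead of being tokenized into real words; the second folds full-width ASCII and half-width
katakana into their usual forms.

```python
import unicodedata


def char_bigrams(text: str) -> list[str]:
    """Return overlapping two-character units, as a naive n-gram indexer would."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def normalize_width(text: str) -> str:
    """Fold full-width ASCII and half-width katakana to their usual forms.

    Assumption: Unicode NFKC is used as a stand-in. It is not Rosette's own
    transform, and it also folds other compatibility characters.
    """
    return unicodedata.normalize("NFKC", text)


phrase = "東京都の観光地"  # "sightseeing spots in Tokyo"
bigrams = char_bigrams(phrase)
print(bigrams)

# The pair 京都 ("Kyoto") is indexed even though the text is about Tokyo.
print("京都" in bigrams)

# Full-width Latin letters and half-width katakana, before and after folding.
print(normalize_width("ＡＳＣＩＩ"), normalize_width("ｶﾀｶﾅ"))
```

A word-level tokenizer would instead index units such as 東京都 and 観光地, so a query for Kyoto
would not match this phrase by accident.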


NEXT STEPS

   • Request a free product evaluation of Rosette at:
     http://www.basistech.com/products/requests/evalua         -request.html





Essential Elements of Excellent Multilingual Search

  • 1. February 22, 2012 Essential Elements of Excellent Multilingual Search Boosting Search Quality with the Rosette Linguistics Platform We put the World in the World Wide WebĀ®
  • 2. ABOUT BASIS TECHNOLOGY Basis Technology provides software solutions for text analytics, information retrieval, digital forensics, and identity resolution in over forty languages. Our RosetteĀ® linguistics platform is a widely used suite of interoperable components that power search, business intelligence, e-discovery, social media monitoring, ļ¬nancial compliance, and other enterprise applications. Our linguistics team is at the forefront of applied natural language processing using a combination of statistical modeling, expert rules, and corpus-derived data. Our forensics team pioneers better, faster, and cheaper techniques to extract forensic evidence, keeping government and law enforcement ahead of exponential growth of data storage volumes. Software vendors, content providers, ļ¬nancial institutions, and government agencies worldwide rely on Basis Technologyā€™s solutions for Unicode compliance, language identiļ¬cation, multilingual search, entity extraction, name indexing, and name translation. Our products and services are used by over 250 major ļ¬rms, including Cisco, EMC, Exalead/Dassault Systems, Hewlett-Packard, Microsoft, Oracle, and Symantec. Our text analysis products are widely used in the U.S. defense and intelligence industry by such ļ¬rms as CACI, Lockheed Martin, Northrop Grumman, SAIC, and SRI. We are the top provider of multilingual technology to web and e-commerce search engines, including Amazon.com, Bing, Google, and Yahoo!. Company headquarters are in Cambridge, Massachusetts, with branch oļ¬ƒces in San Francisco, Washington, London, and Tokyo. For more information, visit www.basistech.com. Ā© 2012 Basis Technology Corporation. ā€œBasis Technologyā€, ā€œGeoscopeā€, ā€œOdyssey Digital Forensicsā€, ā€œRosetteā€, and ā€œWe put the World in the World Wide Webā€ are registered trademarks of Basis Technology Corporation. 
All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2012-08-30)
  • 3. ABSTRACT If search is vital to the competitive advantage or value proposition of your business, then this whitepaper is a must read. In the survival of the ļ¬ttest, companies like Netflix and many businesses today thrive or die depending on whether they can: a) Provide ā€œexcellent searchā€ ā€“ comprehensive, accurate, and relevant to the user b) Serve each user equally well in his or her preferred language The three essential elements of a business-dependable search solution on which to build applications or workļ¬‚ows to generate revenue are: (1) search customizability and the speed- scalability-cost trio, (2) high search qualityā€”with emphasis on ā€œrecallā€ or maximizing the number of good results foundā€”and (3) reliability and support. This paper will examine what these three elements mean, and examine Basis Technologyā€™s Rosette linguistic platform as one possible solution. THE THREE ELEMENTS OF ā€œEXCELLENT SEARCHā€ More and more, revenues and business longevity rely on cost-eļ¬€ective, accurate, and scalable search in English and other languages. E-commerce, job or real estate hunting sites, e-discovery/ e-disclosure solutions, ļ¬nancial compliance solutions, online information providers, media and entertainment portal sites all begin with users ļ¬nding what they are looking for. Choose the right search and see revenues go up, satisļ¬ed customers increase, and a business well-positioned for the future. Choose the wrong search and costs constrict growth and poor relevancy chases away customers. Savvy businesses recognize that their customers, partners, suppliers and employees increasingly create and consume content in their native language. For most enterprises, improving search quality means improving (1) recallā€”i.e., maximizing the number of relevant results foundā€”and to a slightly lesser degree, (2) precisionā€”i.e., maximizing the ratio of good to bad results returned. 
Weā€™ll now examine each of these three elements of search in detail: ā€¢ Customizability of search is critical to ļ¬t it to your particular use case or create the value add of your business. Speed, Scale, and Cost can quickly strangle the success of your business if search is struggling to maintain speed and scale, and choking proļ¬ts with escalating per- query or per-document costs. ā€¢ Maximizing Search Recall and Precision is the 50,000 pound gorilla. Speed is important, but if your users arenā€™t ļ¬nding the results they want to see, it doesnā€™t matter how fast it gets delivered. For enterprise applications and search engines, maximizing recall (without diminishing precision) is usually the greater concern. Good language support is an intrinsic ingredient for improved recall ā€¢ Reliability and Technical Support ensures that expert resources are available for solving any serious issues in the service. Essential Elements of Excellent Multilingual Search 3
  • 4. ELEMENT 1: CUSTOMIZABILITY, SPEED, SCALE, COST The basic question a company has to ask is: Do we want/like/need/know how to bend technology to our business processes and business model in order to achieve competitive advantage? And most importantly: Is search part of that competitive advantage? If the answer is ā€œyesā€ then search is not a plug-in-a-solution problem, but a software development problem. Because search is such a broad-scope problem, there are all kinds of edge cases that are diļ¬ƒcult to support using a proprietary, commercial search solution. The technology is not easily changed without expensive consultation and modiļ¬cations to the core software that are dependent on the vendorā€™s product release cycle. For companies where better search means more money earned or saved, the frustration can increase geometrically because their use case may be unique, but the software isn't extensible enough to deal with it, and they have to spend more and more for less and less. Like three legs of a triangle, speed, scalability and cost are inextricably linked. Assuming that the load on the search engine will only increase over time, search performance must be scalable to keep the query response times low and to ensure index updating times do not aļ¬€ect searches. At the same time, pricing models of commercial search engines based on number of queries or number of documents indexed will eventually constrict growth by sheer cost. ELEMENT 2: MAXIMIZING SEARCH RECALL AND PRECISION DEPENDS ON LANGUAGE SUPPORT Search is really all about ļ¬nding, whether itā€™s a shopper on an e-commerce site, a banker checking a wire transfer request against a list of money launderers, or home buyers looking through real estate listings. The concepts of precision and recall measure how ā€œgoodā€ search is. 
More precise results with greater recall mean better quality search, but frequently increasing one means a decrease in the other, so the trick is how to maximize each metric while minimizing the impact on the other. (See sidebar.) Letā€™s now look at the language-speciļ¬c processing required to increase recall and/or precision for three categories of notoriously diļ¬ƒcult languages to process: Chinese/Japanese/Korean, Germanic and Scandinavian languages, and Arabic. Weā€™ll ļ¬rst look at one aspect of linguistic processing that improves recall in most languages, including English and European languages: lemmas and stems. Improving Recall via Lemmas and Stems Lemmatization and stemming are two methods of increasing the recall of search results. Stemming improves recall, but can diminish precision. Lemmatization is the preferred method since it increases recall while maintaining precision. Lemmatization, ļ¬nding the dictionary form of a word, broadens search to add relevant search results, in ways that stemmingā€”more commonly availableā€”cannot. A user searching for ā€œPresident Obama speaking on healthcareā€ would likely also want results containing ā€œPresident Obama spoke on healthcareā€ and ā€œPresident Obama was speaking on healthcare.ā€ By lemmatizingā€œ spokeā€ and ā€œwas speakingā€ and storing the lemma ā€œspeakā€ in the search index, the latter two results will (correctly!) get picked up by a search for ā€œPresident Obama speaking on healthcare. Essential Elements of Excellent Multilingual Search 4
  • 5. Stemming uses some basic rules to look for common stems in words. For example, ā€œdecodā€ is the stem of ā€œdecoding,ā€ ā€œdecoder,ā€ and ā€œdecodes.ā€ However, no matter how many letters a stemmer chops oļ¬€, ā€œspokeā€ will never become ā€œspeak.ā€ Lemmas are found using dictionary data and recognizing the context of a word. The lemma is very diļ¬€erent when ā€œspokeā€ is used as a noun or a verb. Stemming, on the other hand, applies a set of rules regardless of a wordā€™s part-of-speech or context in a sentence. Since lemmas go back to the intrinsic meaning of a word, irrelevant results are kept to a minimum. On the other hand, stemming may sometimes produce unpredictable results. Stemming turns ā€œseveralā€ into ā€œsever,ā€ and ā€œarsenicā€ and ā€œarsenalā€ share the same stem ā€œarsen.ā€ And, whereas ā€œapplesā€ to ā€œappleā€ is reasonable, ā€œLos Angelesā€ to ā€œLos Angeleā€ is not. Although stemming is more readily accessible as a set of language-speciļ¬c rules than the dictionary-requiring lemmatization, the greater recall and precision from Lemmatization is the diļ¬€erence between good and excellent search. Stemming vs. Lemmatization in a Nutshell Search Query Traditional Stemming Lemmatization using Rosette Comparison animals anim animal Two unrelated words may share a stem, in this case animated anim animate ā€œanimā€ arsenal arsen arsenal arsenic arsen arsenic several sever several Stemming may have unintended consequences organization organ organization children children child Irregular verbs and nouns stump the stemmer spoke spoke speak (when used as past tense verb) spoke (when used as noun) Improving Precision in Chinese, Japanese, and Korean Asian languages like Chinese, Japanese, and Korean are fundamentally more diļ¬ƒcult to process than European languages because their words are not consistently separated by spaces. 
Chinese and Japanese use no spaces between words, and Korean uses some spaces, but not between every word, and inconsistently depending on the writer. For a search engine to build an index, it has to have words. N-gram and Tokenization are methods used to produce indexable units in these languages, each resulting in diļ¬€erent degrees of search precision. N-gram vs. Tokenization Some engines may use the n-gram technique to break up streams of text into overlapping 2-4 character units, but though recall will be high, precision will be low and indexes will be very large, impacting speed. Here is a Japanese example: ę±äŗ¬éƒ½ć®č¦³å…‰åœ° (translation: sightseeing spots in Tokyo) Essential Elements of Excellent Multilingual Search 5
  • 6.
Bigram | English translation
東京 | Tokyo
京都 | Kyoto
都の | capital’s
の観 | not a word
観光 | sightseeing
光地 | not a word

A seven-character phrase produces six items to index via the n-gram technique, but also introduces a false word, “Kyoto.” Since many of the bigrams are not actual words in the original text, many other unrelated strings might accidentally match them too, and a search becomes a statistical process: do enough of these two-character segments appear in a given document to flag it as a “match,” while not accidentally matching the wrong strings? This approach also produces many more entries in the index than there are words, unnecessarily increasing its size. Processing a 100-character buffer (about 13-17 words) adds 999 entries to the index.

A better choice is true tokenization based on morphological analysis¹, which breaks up text into real words. With tokenization of the example above, there are at most three items to index, and possibly only two if the possessive marker is treated as a “stop word.”²

Words | English translation
東京都 | Tokyo
の | (possessive marker)
観光地 | sightseeing spots

Character Normalization to Improve Recall

Chinese, Japanese, and Korean have additional character normalization needs beyond making sure ASCII characters are all uppercase or lowercase. In digital files, these languages also use full-width variants of ASCII letters and punctuation, which need to be normalized to the half-width ASCII form.

ＡＳＣＩＩｗｏｒｄｓ％＄＆ → ASCIIwords%$&

In the case of Japanese, a half-width version of katakana characters also needs to be normalized to the usual full-width katakana. The half-width version was invented in the early days of computer processing to reduce data size, but it is still found in documents and webpages.
ｶﾀｶﾅ → カタカナ

¹ Morphological analysis in linguistics examines the smallest units of meaning within a word, called “morphemes,” such as “un-” in “unremarkable.”
² A “stop word” is one that occurs so frequently within a language that it does not help in finding search results. Some search engines may choose to ignore such words.
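Both width conversions described above are covered by Unicode compatibility normalization (form NFKC), which Python exposes through the standard `unicodedata` module. The sketch below is one way to perform this step and is not a description of Rosette's own implementation; `normalize_width` is a hypothetical helper name. NFKC also folds other compatibility characters (ligatures, circled digits, and so on), which may or may not be wanted in a given index.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold full-width ASCII and half-width katakana to their usual forms via NFKC."""
    return unicodedata.normalize("NFKC", text)


print(normalize_width("ＡＳＣＩＩｗｏｒｄｓ％＄＆"))  # ASCIIwords%$&
print(normalize_width("ｶﾀｶﾅ"))  # カタカナ
```

Applying the same function to both the indexed text and the query keeps the two sides comparable.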
  • 7. Improving Recall with Pan-Chinese Search

Chinese search gains an enormous boost in recall if both the simplified and traditional Chinese scripts are searched. Mainland China and Singapore use simplified Chinese, whereas Taiwan and Hong Kong use traditional Chinese. Chinese speakers expect a query in one script to find results in both scripts. Pan-Chinese search (searching documents in both Chinese scripts) is accomplished by converting all the text of one script into the other at query and index time.

The differences between the two scripts fall into three categories:

Category | Simplified Chinese | Traditional Chinese | English translation
Same character used in both scripts (true where the character was simple to start with) | 大 | 大 | big
Two characters with the same pronunciation (but different base meaning) collapsed to one character in simplified Chinese | 头发 | 頭髮 | hair
 | 出发 | 出發 | emit
Different vocabulary (analogous to American and British English differences such as “truck” vs. “lorry”); occurs mostly in modern words | 计算机 | 電腦 | computer

In the first case, the change is trivial and amounts to making sure all the text is in the same encoding. The second and third categories, however, require dictionary data to ensure that the conversion is context-sensitive, so that the correct traditional character is chosen.

Improving Recall in Germanic and Scandinavian Languages

The linguistic processing required by German, Dutch, the Scandinavian languages (Danish, Norwegian, Swedish, and Finnish), and Korean is very similar, in that these languages freely use compound words, which frequently need to be broken up to increase recall.
Take these two German compound words:

German compound | German decompounded | English translation
Jugendarbeitslosigkeit | Jugend + arbeitslosigkeit | Youth unemployment (Youth + unemployment)
Samstagmorgen | Samstag + morgen | Saturday morning (Saturday + morning)

It’s reasonable to imagine that a person looking for information about youth unemployment in German would also welcome search results that include “youth” and “unemployment” as separate words. Similarly, the concepts of “Saturday” and “morning” are very likely to appear either as a compound word or as separate words in related documents. Decompounding increases search recall with minimal impact on precision but, again, requires dictionary data to do properly. Additionally, to maximize recall, character normalization is needed for plurals formed with an umlaut, such as “Garten” (garden, singular) and its plural “Gärten.”
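The dictionary-driven nature of decompounding can be sketched as a longest-match split with backtracking. This is a minimal illustration under stated assumptions, not Rosette's algorithm: `TOY_LEXICON` and `decompound` are hypothetical names, the lexicon is a four-word stand-in for real dictionary data, and linking elements between parts (such as the German connecting “s”) are not handled.

```python
from typing import Optional

# Hypothetical stand-in for real dictionary data (lowercased entries).
TOY_LEXICON = {"jugend", "arbeitslosigkeit", "samstag", "morgen"}


def decompound(word: str, lexicon: set) -> list:
    """Split word into lexicon entries; return [word] unchanged if no full split exists."""

    def split(rest: str) -> Optional[list]:
        if not rest:
            return []
        # Try the longest known prefix first, then back off to shorter ones.
        for end in range(len(rest), 0, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    parts = split(word.lower())
    return parts if parts else [word]


print(decompound("Jugendarbeitslosigkeit", TOY_LEXICON))  # ['jugend', 'arbeitslosigkeit']
print(decompound("Samstagmorgen", TOY_LEXICON))  # ['samstag', 'morgen']
```

An index would typically store the parts alongside the full compound, so that an exact match on the compound still ranks well while the parts add recall.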
  • 8. Improving Recall in Arabic

Arabic is one language that especially suffers from low recall if a search engine does not perform Arabic-specific linguistic processing, starting from basic character normalization and extending to stemming of proper nouns and lemmatization of common nouns. Arabic attaches so many affixes (to the beginning, end, and middle of words) that without lemmatizing before searching, many relevant results are never found. One root could form the basis for up to fifteen different verbs!

Character Normalization to Improve Precision and Recall

In English, search engines perform some basic normalization such as lowercasing all the words. In Arabic, normalization is much more complex and will improve both precision and recall. Types of character normalization required include:
• Words with additional vocalization marks, such as: سياسي vs. سِيَاسِيّ
• Words containing certain letters with dots added or removed, such as: رقيه vs. رقية
• Words (including ambiguous cases) containing certain letters with symbols added or removed, such as: ٱعتماد vs. اعتماد; داؤود vs. داوود; اثم vs. (آثم or إثم or أثم)

Improving Recall with Lemmatization and Stemming

In Arabic, as in other languages, lemmatization increases the recall of search results (cf. “Improving Recall via Lemmas vs. Stems”), but with a twist. In Arabic, lemmatization is only applicable to verbs and common nouns (e.g., apple, book, table) and is not applicable to proper nouns (e.g., names such as Abul-qassem El-Chabby and Baqah Al-Sharqiyyah). Proper nouns require stemming to remove the prepositions and conjunctions that are often attached to them. Searches for names of people, places, and organizations are seriously hampered without stemming.
For example, these phrases in English each appear as one word in Arabic: “for Othman,” “with Othman,” “as Othman,” and “and Othman.” Thus a search for just “Othman” would not find these variations without stemming. Additionally, since Arabic names are frequently the same words as common nouns, part-of-speech tagging becomes critical to differentiating between common nouns and proper nouns, particularly when a word’s part of speech varies depending on where it appears in a sentence.

Search Recall and Precision Enhancers for Many Languages

For English and most European languages, better recall and precision means:
• Adding lemmas (the dictionary form of a word) to the search index to increase recall, while minimizing the negative impact on precision (cf. previous section “Improving Recall via Lemmas vs. Stems”).
• Boosting the ranking of relevant documents on the results page through the use of document metadata such as entities.
• Increasing recall through comprehensive name search that goes beyond the “Did you mean?” functionality of most search engines.
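The first point in the list above, adding lemmas to the index alongside the words as written, can be illustrated with a toy inverted index. This is a sketch under stated assumptions: `TOY_LEMMAS`, `lemma`, `build_index`, and `search` are hypothetical names, and the hand-written lemma table stands in for real dictionary data and context-aware analysis (it cannot, for instance, tell “spoke” the verb from “spoke” the noun).

```python
from collections import defaultdict

# Hypothetical stand-in for dictionary data; a real lemmatizer also uses context.
TOY_LEMMAS = {"children": "child", "spoke": "speak", "animals": "animal"}


def lemma(token: str) -> str:
    """Return the dictionary form of token, or the token itself if unknown."""
    return TOY_LEMMAS.get(token, token)


def build_index(docs: dict) -> dict:
    """Map each surface form and each lemma to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)         # exact form as written
            index[lemma(token)].add(doc_id)  # dictionary form, adds recall
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents matching the query word or its lemma."""
    q = query.lower()
    return index.get(q, set()) | index.get(lemma(q), set())


docs = {1: "she spoke to the children", 2: "they speak often", 3: "one child"}
index = build_index(docs)
print(sorted(search(index, "speak")))  # [1, 2]
print(sorted(search(index, "child")))  # [1, 3]
```

Keeping the exact form in the index as well lets a ranking function prefer exact matches, which is one way to limit the effect on precision.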
  • 9. Boosting Relevant Results

Once the search engine is finding more good results (better recall) and returning fewer bad results (better precision), the good results need to be at the top of the search results. Frequently the most critical words in a search have to do with entities: proper nouns such as names of people, places, and organizations. But when is “Christian” a religious affiliation, and when is it part of a person’s name, as in “Christian Dior”? A good entity extractor will know. At indexing time, entities can be added to a document record’s metadata to boost the rank of documents that have entities matching those in the query.

Comprehensive Name Search to Improve Recall

The “Did you mean?” function of most modern search engines will find the actress “Cate Blanchett” even if the user misspells her first name as “Kate.” However, for less famous people, and to cover a gamut of name variations, a true name matching function is needed to handle “Chuck Berry” vs. “Charles Berry”; “John Kearns” vs. “Jon Cairns”; or even “Baqah Al-Sharqiyyah” and “باقة الشرقية.”

The Rosette Option for Language Support

Rosette is a software development kit (SDK) designed for a wide range of large-scale applications that need to identify, classify, analyze, index, and search unstructured text from various sources, and its linguistic analysis capabilities are widely used by search engines to increase both precision and recall. It uses a combination of statistical models, dictionary data, and sophisticated computational linguistics to parse digital text in English and over 25 major European, Asian, and Middle Eastern languages. Years of development and linguistic analysis have gone into Rosette to satisfy the most demanding customers in quality, robustness, and speed. Nearly every major web and enterprise search engine since 1999, including Bing, Endeca, goo (Japanese), Google, Microsoft/FAST, and Yahoo!, has used Rosette.
New generations of search-based applications for e-discovery, financial compliance, and other fields also use Rosette. Rosette is a cross-platform SDK available for Windows and Unix, and offers all its capabilities via a single API, in C, C++, Java, or .NET.

Rosette Capabilities
• Language Identification: identifies the primary language of a document (or the language regions within a multilingual document) and the file’s encoding, covering 55 languages and 45 encodings.
• Unicode Conversion: converts documents in legacy encodings to Unicode.
• Character Normalization: normalizes characters to a single representation (e.g., character plus diacritic to a single character with diacritic, half-width and full-width variants of ASCII characters in Asian languages, and several language-specific normalizations).
• Linguistic Analysis: lemmatization, tokenization, part-of-speech tagging, decompounding, and more in over 25 languages.
• Entity Extraction: extracts entities such as people, places, and organizations in 15 languages via statistical models, with user-defined entities customizable through regular expressions and entity databases.
• Name Matching: returns matching names despite spelling variations, initials, nicknames, missing name components, missing spaces, out-of-order name components, the same name written in different languages, and more. Supported in multiple languages including Arabic, Chinese, English, Korean, and Persian.
  • 10.
• Name Translation: translates names from non-Latin script languages to English and standardizes already translated names. Supported in multiple languages including Arabic, Chinese, English, Korean, Russian, and Persian.

Lemmatization and Stemming

Needing to support a variety of languages can stretch the capabilities of out-of-the-box search solutions, requiring additional effort to custom-configure the handling of each language and tune its accuracy one by one.

Chinese, Japanese, and Korean Search

Normalization of Chinese, Japanese, and Korean characters is done by the Rosette Core Library for Unicode (RCLU):
• For Chinese, Japanese, and Korean: converting full-width ASCII characters (ＡＳＤＦ＆％＄) to the usual half-width characters (ASDF&%$).
• For Japanese: converting half-width katakana characters (ｱｲｳｴｵ; one-byte characters invented in the early days of Japanese computing to save on storage) to the usual full-width characters (アイウエオ).

Pan-Chinese Search

Rosette’s Chinese script converter can convert text into either simplified or traditional Chinese.

Germanic and Scandinavian Languages

Character normalization of plurals is part of the lemmatization of German within Rosette.

Arabic

Rosette provides part-of-speech tagging and lemmatization for Arabic.
  • 11. Boosting Relevant Results with Entities

Rosette’s entity extractor can push results higher in the rankings.

Comprehensive Name Search

Rosette’s name matching functionality handles name variations due to nicknames, initials, missing name components, missing spaces between names, out-of-order name components, the same name represented in different languages, and more. Thus, besides handling “Robert” vs. “Bob,” and “Abdul Rasheed” vs. “Abd-al-Rasheed” vs. “Abd Ar-Rashid,” Rosette also matches “Mao Tse-Tung,” “Mao Zedong,” and “毛泽东” (simplified Chinese).

PRECISION VS. RECALL VS. F-SCORE IN A NUTSHELL

Precision and recall are metrics used to evaluate the quality of search results, whether searching for articles on a topic or finding name matches.

Suppose you are searching for red balls in a box that contains 7 red balls and 8 green balls. Blindfolded, you pull out 8 balls, of which 4 are red and 4 are green.

Precision is the number of correct items over the total number of items found. You found 4 red balls (which you wanted) and pulled out 8 balls in total, so precision is 4/8, or 50% (half of your results were correct).

Recall asks the question: “Of all the correct items, how many did I find?” In this case, there were 7 correct items (because there are 7 red balls) and you found 4, so your recall is 4/7, or 57%.

F-score is a measure that attempts to balance precision and recall and is often called the “relevancy” of a system.

F-score = 2 × (Precision × Recall) ÷ (Precision + Recall)

Thus our F-score above is 2 × (4/8 × 4/7) ÷ (4/8 + 4/7) = 8/15 = 0.53, or 53%.
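The red-ball arithmetic can be checked with a few lines of Python. The sketch uses exact fractions so that the 8/15 result is visible before rounding; the function name `precision_recall_f` is a hypothetical helper, not part of any product.

```python
from fractions import Fraction


def precision_recall_f(correct_found: int, total_found: int, total_correct: int):
    """Return (precision, recall, F-score) as exact fractions."""
    precision = Fraction(correct_found, total_found)
    recall = Fraction(correct_found, total_correct)
    if precision + recall == 0:
        # Nothing correct was found, so the harmonic mean is defined as zero.
        return precision, recall, Fraction(0)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# 4 red balls among the 8 drawn, out of 7 red balls in the box.
p, r, f = precision_recall_f(correct_found=4, total_found=8, total_correct=7)
print(p, r, f)  # 1/2 4/7 8/15
print(f"{float(p):.0%} {float(r):.0%} {float(f):.0%}")  # 50% 57% 53%
```

Because the F-score is a harmonic mean, it sits closer to the lower of the two values, so a system cannot score well by maximizing only precision or only recall.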
  • 12. ELEMENT 3: RELIABILITY AND SUPPORT

Even in a technologically sophisticated company, the core business logic will be the main focus of all business operations, so technical support as a safety net is essential. Just as firms rely on calling the vendor when the copier is acting up, a few judicious pieces of advice from search or linguistics experts to settle a nettlesome problem lend assurance to both engineers and upper management.

On the language support side, the wide language coverage of Basis Technology’s Rosette (over 25 languages covering Asia, Europe, and the Middle East) ensures there will be one point of contact to address questions about any language, instead of a multitude of single-language vendors with varying support contracts. Rosette developers are on the front lines, providing high-quality support for customer inquiries. Additionally, because Rosette is a single platform, its benchmarks for speed and accuracy are readily available, and implementing one language or more than 25 is the same amount of work.
  • 13. SUMMARY & CONCLUSION: ROSETTE LINGUISTICS PLATFORM

For companies that derive competitive advantage from better quality or better customized search, search is a development problem, not a “plug-in-a-solution” problem, and the four search engine essentials are:
1. Customizability: the ability to innovate on top of the chosen search engine to build value-added features
2. Speed: the search engine’s query response rates and indexing and updating times
3. Scalability: the ability to easily handle an increased number of queries or volume of content requiring indexing
4. Cost: total cost of ownership; return on investment; scaling of cost to search needs

The Rosette platform fills that gap, through solid linguistic analysis technology and comprehensive dictionary data, to perform:
• Language Identification: The language identifier detects 55 languages with speed and accuracy, having been trained on gigabytes of hand-verified data. It covers a broad range of Asian, Indo-European, and Middle Eastern languages.
• Linguistic Analysis: Rosette performs a complete linguistic analysis to improve search recall and precision for some of the most linguistically difficult languages.
  ○ Lemmatization: for boosting recall while maintaining precision in many languages
  ○ Tokenization: for Chinese, Japanese, and Korean, which do not have spaces between words
  ○ Chinese script conversion: for pan-Chinese search
  ○ Character normalization: specialized transforms for Arabic and Asian languages
  ○ Decompounding: for Germanic and Scandinavian languages, which form words from multiple words
  ○ Part-of-speech tagging: for context-sensitive lemmatization, and particularly for Arabic search, to distinguish between common nouns, which require lemmatization, and proper nouns, which require stemming
• Entity Extraction: To locate names, places, organizations, and other entities, either to boost the ranking of results whose entities match the search query or as a base for building faceted search. Rosette plugs into the UpdateRequestProcessor at indexing time to populate each document record with entities.
• Name Matching: Going beyond “Did you mean?”, Rosette finds matching names that differ due to nicknames, initials, missing name components, missing spaces between names, out-of-order name components, the same name represented in different languages, and more.

NEXT STEPS
• Request a free product evaluation of Rosette at: http://www.basistech.com/products/requests/evalua -request.html