For enterprise search requirements, we assess the Google Search Appliance along with SharePoint, social search, and AutoSuggest, in contrast to Apache Lucene and Solr.
Cognizant 20-20 Insights
Designing an Effective Enterprise Search Solution
Executive Summary
There are many diverse requirements for search capabilities that emerge within an enterprise. This white paper addresses the top five most desired enterprise search requirements. The solution discussed herein is based on various implementation experiences we have gained over years of using multiple search tools. The top five search requirements include:
• Diverse Content: Ability to crawl, index and search diverse content repository.
>> The Web, Microsoft SQL database and SharePoint content management systems.
• Secured Search: Ability to crawl secured content and make it accessible to only authorized people and/or groups.
>> Single sign-on, forms-based authentication.
• User Interface: Ability to provide various user interface (UI) components to serve end users with precise results.
>> Guided navigation, related search terms, related articles and best bets.
>> AutoSuggest with terms combined from real-time search and custom (user configurable) terms in data stores.
• Desktop Search: Ability to integrate with content stored in the desktop.
• Social Search: Ability to find other people, ratings and expertise within the organization.
Defining the Enterprise Search Engine
There are quite a few proprietary and open source enterprise search tools available in the market. The Google Search Appliance is chosen here for its ease of use and its ability to handle most of the aforementioned requirements. (Refer to the Appendix for the architecture diagram that provides an alternate approach using Apache Solr 3.1 and Nutch 1.3.)
The Google Search Appliance provides quite a few traditionally requested enterprise search features out-of-the-box (OOTB). Even though the appliance fits the hardware plug-and-play model, it provides a flexible framework for integrating with external systems such as content management systems, document management systems, security systems, federated search and both Google and non-Google services on the cloud. The following lists the features required to implement the top five requirements. For the sake of gauging the complexity of the implementation, our list differentiates custom components from Google out-of-the-box (GOOTB) capabilities. A sample approach is also provided for developing the required custom components to architect an effective enterprise search solution using the Google Appliance.
• Google Web crawler for crawling and indexing Web content (GOOTB).
• Google DB connector for crawling and indexing Microsoft SQL database (GOOTB).
cognizant 20-20 insights | february 2012
• Google SharePoint connector for crawling and indexing SharePoint content (GOOTB).
1. Google forms authentication for index time authorization and serve time authentication (GOOTB).
2. Google front-end configuration for:
>> Faceted search, aka guided navigation (limited OOTB).
>> Related search terms (GOOTB).
>> Related articles (GOOTB).
>> Best bets (GOOTB).
>> AutoSuggest (GOOTB and custom application).
3. Google desktop search component integration (external Google component).
4. Google results integration with internal rating system.
>> Integration with Google people search (GOOTB).
>> Integration with expertise system (custom).
>> Integration with custom rating system (custom).
Component Description
The following components are critical to a Google-powered enterprise search architecture (see Figure 1 for a schematic view).
Figure 1: Enterprise Search Component Diagram (using GSA). [Diagram not reproduced. Labels: GSA, People Search, LDAP Server, Web Crawling, Google Desktop Tool, Client Desktop, Database Connector, Search Configuration, TermsFederator, SharePoint Connector, <<XML Feeds>>, Forms Authentication, ExpertsConnect, Best Bets, InfoValuator, Related Terms, <<Stores>>, Microsoft SharePoint CMS, Intranet Web, Microsoft SQL. Legend: Google OOTB Configurations, Content Repositories, Custom Applications, Google Component.]
• Google Search Appliance: The Google Search Appliance (GSA) is at the center of this proposed search architecture. GSA is a packaged hardware search engine. Its plug-and-play model provides many necessary enterprise search features out of the box. The tool also provides a flexible (connector) framework with which developers/implementers can integrate the appliance with other content sources.
Just like Google.com, the Google search appliance is extremely adept at crawling, indexing and serving Web pages. Additionally, GSA provides connectors to index and search relational databases, content management systems (e.g., EMC Documentum, Microsoft SharePoint, Open Text Livelink, IBM FileNet) and local/network file systems or file shares. Connectors to other, non-supported content management system (CMS) tools can either be developed from scratch using Google’s connector framework or purchased as a product from Google’s partner Websites.
• Google Web Crawler: The intranet Web can be indexed using Google’s Web crawler. GSA allows implementers to configure the “start URL,” “follow URL pattern” and “do not crawl pattern” details, with additional options that allow us to force a recrawl of specific patterns as and when required. The crawl status can be tracked using the “crawler diagnostics,” which details if a URL was successfully
indexed or not. While crawling, if the Googlebot encounters an error or if the pages are not crawler friendly, appropriate error messages are displayed in the diagnostics to aid in troubleshooting.
>> Disadvantage: As efficient as it sounds, one disadvantage of the Web crawler is Google’s inability to reveal the exact page that is currently being processed.
>> Alternative: The OS console monitor and/or tracking log files are some ways that could help track URL crawl status. At any point in time, a developer should be able to view the current URL being crawled and any issues faced with security. Almost all tools, such as Solr, FAST, Endeca and Autonomy, provide this feature.
• Database Connector: DB crawling is again made trivial in Google: the implementer fills in just the bare minimum of details before triggering the database indexing. The key configurations needed are username, password, JDBC URL, database name, results URL hostname, JDBC driver class name, SQL query and primary key field. Stylesheets can be used to allow users to preview a document in a specific format in the search results page.
>> Disadvantage: A few key disadvantages faced during the implementation are:
»» Google’s inability to allow end implementers to schedule the DB crawl.
»» Google’s way of removing content from the index is quite primitive and time-consuming. The only way is to include the URL in the do not crawl (DNC) list and wait for the appliance to pick up the change. It is even more complicated for connector/XML-fed content.
»» Poor diagnostics for connector/XML-fed content.
>> Alternative: Compared to GSA, we found Apache Solr to be a better option for indexing the database via the data import handler.
»» Indexing 2,000 documents in Solr would take about 5-10 seconds.
»» The Solr console provides a very good overview of documents crawled and status of crawl. Log files can be configured to capture the crawler status as well.
»» Delta query in Solr can be achieved using the regular “deltaQuery” with the TIMESTAMP column. Additionally, XML import (using the /update handler) can be used to import new or modified documents in the form of an XML file.
»» Solr provides an effective way to remove content from the index, either via the admin console or via XML import (/update with the delete option).
• SharePoint Connector: With its introduction of SP2010, Microsoft provides an easy Web-based medium for uploading, maintaining and sharing documents within the enterprise. The portal-collaboration-CMS-DMS (document management system) apparatus provides tight integration with Integrated Windows Authentication (IWA) for securing sites and/or pages and/or documents.
Google provides connectors to very few CMS systems out of the box. But one such system is Microsoft’s SharePoint, which makes sense since SharePoint is gaining momentum within the enterprise due to its ease of use and cost benefits. For seamless integration with SharePoint, GSA uses the content feed, which handles authorization along with the connector configuration. For a smooth handshake between the Google connector and SharePoint sites, an additional Google services component must be installed on the SharePoint server. The connector supports late security binding with bulk authentication at query time.
Google Services (GS) is a component that must be installed at the SharePoint server end to ensure site/sub-site indexing and bulk authorization at index and query time. For more information, the step-by-step approach is easily available on the Google open source site.
>> Disadvantage: Even if Google is executing a bulk late binding, performance issues at query time are inevitable when the document volume is high.
>> Alternative: One alternate is to treat the site/page/document level security as additional metadata and develop an application that post-filters the results based on end-user security attributes. This is again a primitive method and has its own disadvantages in terms of query time latency.
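The post-filtering alternative described for the SharePoint connector can be sketched as follows. This is a minimal illustration under stated assumptions, not GSA or SharePoint code: the `Result` type, its `allowed_groups` field and the `post_filter` function are hypothetical names, and we assume the security groups were stored as metadata with each document at index time.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Result:
    """One search hit plus the security groups indexed as its metadata (hypothetical)."""
    url: str
    allowed_groups: FrozenSet[str] = field(default_factory=frozenset)


def post_filter(results: Iterable[Result], user_groups: Iterable[str]) -> List[Result]:
    """Keep only the hits that share at least one group with the end user.

    A hit with no group metadata is dropped (fail closed), so unlabeled
    content is never shown by accident.
    """
    groups = frozenset(user_groups)
    return [hit for hit in results if hit.allowed_groups & groups]


if __name__ == "__main__":
    hits = [
        Result("http://intranet/hr/policy", frozenset({"hr"})),
        Result("http://intranet/eng/design", frozenset({"eng", "hr"})),
        Result("http://intranet/unlabeled"),
    ]
    for hit in post_filter(hits, ["eng"]):
        print(hit.url)
```

Because the filter runs after the engine has returned its hits, its cost grows with the result count, which is the query time latency concern noted above.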
»» There are quite a few commercial off-the-shelf (COTS) products, such as Vivisimo, Microsoft FS4SP (FAST Search for SharePoint) and FAST ESP (now FAST Search for Internet Sites, or FSIS), that are designed to handle SharePoint SP2010 better, at a higher cost. We have found FAST ESP to be an effective tool that handles diverse CMS including SharePoint and offers an early binding security model for better performance. Now, the tightly integrated SP2010 search, FS4SP, seems to be preferred by many enterprise intranets. SP2010 has its own cost benefits if the enterprise’s primary technology stack includes Windows and other Microsoft product suites.
• Forms Authentication: Google provides a simple way to define forms authentication at both index time and query time. Forms authentication at index time allows the GoogleBot to register the security cookie based on the domain of the secured website that requires indexing. This cookie is carried with the GoogleBot, allowing the bot to crawl and index secured content.
At query time, Google uses the query time configuration to make a HEAD request for each result, so that the logged-in user (within a specific domain) sees only the content that they are authorized to view. This late binding security model has its disadvantages in terms of query time latency and performance issues at the hosting server end when too many HEAD requests are made to the web server. Performance degradation is inevitable with higher queries per second (QPS) and/or higher results count.
>> Alternative: There are tools that support an early binding security model that allows the search engine to cache the user security groups along with the content. The disadvantage of early binding is that real-time security changes are not immediately reflected in the system.
>> Note: One disadvantage with Apache Solr is that it does not handle secured content. The only way to serve secured content is to store the security tags/groups as metadata and implement a field (or metadata) constrained search.
• AutoSuggest: GSA provides an open source component called “search-as-you-type” which allows end implementers to fetch real-time results from the appliance (see Figure 2). In order to integrate the results from Google with custom (suggestive) terms, developers need to build a special component.
TermFederator is a custom component that fetches user configurable (suggestive) terms stored in the database and/or any other CMS. (Note: TermFederator is only a consumer of the terms stored in the database and any governance for these terms is outside the scope of this architecture.) Adequate caching can be used at this component-end to improve retrieval performance of the database/CMS query. The TermFederator is configured in GSA as a “OneBox” module. At the time of end-user query, real-time results from Google and results from the TermFederator are merged into a single unified suggestion terms list. The key advantage is that GSA handles on-the-fly federation between real-time results and custom terms with minimal overhead at query time.
Figure 2: How AutoSuggest Works. [Diagram not reproduced. Labels: TermsFederator, Cache, Database; connections marked <<HTTP>>, <<JSON>> and <<OBXML>>.]
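The merge of real-time and custom terms into one suggestion list can be sketched as follows. This is a minimal illustration, not GSA or OneBox code: the function name `merge_suggestions`, its parameters and the ordering rule (real-time terms first, then custom terms) are assumptions made for the example.

```python
from typing import Iterable, List


def merge_suggestions(realtime: Iterable[str], custom: Iterable[str],
                      prefix: str, limit: int = 10) -> List[str]:
    """Merge two suggestion sources into one de-duplicated list.

    Terms are matched against the typed prefix case-insensitively.
    Real-time terms are listed before custom terms; this ordering is an
    arbitrary choice for the sketch.
    """
    if limit <= 0:
        return []
    prefix_lc = prefix.strip().lower()
    merged: List[str] = []
    seen = set()
    for term in list(realtime) + list(custom):
        cleaned = term.strip()
        key = cleaned.lower()
        # Skip blanks, non-matching terms and case-insensitive duplicates.
        if not key or not key.startswith(prefix_lc) or key in seen:
            continue
        seen.add(key)
        merged.append(cleaned)
        if len(merged) == limit:
            break
    return merged


if __name__ == "__main__":
    from_appliance = ["search appliance", "Search Engine"]
    from_database = ["search engine", "secure search", "desktop search"]
    print(merge_suggestions(from_appliance, from_database, "se", limit=5))
```

If the custom source is slow or unavailable, passing an empty list for `custom` still yields the real-time suggestions, so the suggestion box degrades gracefully.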
»» Disadvantage: OneBox modules are designed to respond within one second. This could result in no results from TermFederator if there is any delay at the database.
»» Alternative: “TermsComponent” in Apache Solr is an effective autosuggest tool. Terms stored in any local text file can be made available to Solr at startup. A separate component designed to merge alphabetically (or sequentially) the top N terms from Solr and from the local text file would address this requirement as well.
User Interface:
• Best Bets, aka KeyMatches, aka AdWords.
• Related search terms, same as synonyms.
• Query expansion, same as dictionary-based search, allows users to do bidirectional search based on terms maintained as a part of this configuration.
• Faceted search, aka Guided Navigation: GSA does not support faceted search. But this feature can be achieved via metadata constrained search at query time, similar to how it is implemented in Solr. The structure of content within the appliance is flat and there is no hierarchy and/or taxonomy maintained.
>> Disadvantage: Facet count in GSA is not available OOTB.
>> Alternative: As a key requirement in most enterprise search implementations, faceted search is currently GSA’s primary drawback. But again, if we think of alternates, here are a few options:
»» Continue using GSA: We could develop an application that can return counts based on certain fields (metadata) that are available. If not carefully planned, this could be another query time overhead. Additionally, GSA partners commercially sell components supporting “parametric search” that can be evaluated for our requirement.
»» Considering other COTS/Open Source: (Oracle) Endeca and (HP) Autonomy maintain content hierarchy for guided navigation. Indexing hierarchical content is unsupported with future Microsoft FS4SP, but faceted search will continue to be supported. Faceted search is one of Apache Solr’s strongest features and is implemented within many e-commerce Websites.
• Related Results: Collective results from different and/or the same domain that refer to the same content are again OOTB with Google.
• External Data Federation: The OneBox module in GSA allows us to federate search results from an external data source/store. GSA internally handles uncluttered merging of the OneBox results with organic results in sequence. (Note: no relevancy or algorithm is applied for merging.) This is one powerful feature that Google provides in comparison to other tools on the market.
• Assessing Social Search: Social search (or rather, personalized social search) has emerged as a key requirement within today’s enterprises. Within social search, expertise search and expert rating are common requirements. Enterprises are keen to find out those who specialize or have expertise in specific fields. This information is used to assist internal research, consulting, resource allocation (across departments) and/or simple networking. Papers published by experts are most sought after within an enterprise. Research papers rated by even a contact or an expert are valued more highly than documents that are not rated at all.
People search is an OOTB GSA feature. People search in GSA becomes easy with provision for integration with Lightweight Directory Access Protocol (LDAP) servers. But for social search, we would need to develop additional components that would allow us to tag people as experts and allow end users to rate content. The data that is captured and stored via any of these custom components is imported as external metadata into the GSA appliance. GSA links any external metadata imported to the content already available in the index based on the content uniform resource identifier (URI). We can thus link the people information that is already indexed with the expert and rating meta information.
ExpertsConnect is a custom component that keeps track of people and contacts within GSA. These contacts provide a link to “prospective” experts. This information is fed to GSA as external metadata and hence is linked to the people data that is indexed from a directory residing on an LDAP server.
Figure 3: The Anatomy of Social Search. [Diagram not reproduced. Labels: PeopleSearch, LDAP server, ExpertsConnect, InfoValuator, Database; connections marked <<HTTP>>, <<JSON>>, <<OBXML>> and <<Stores>>.]
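The URI-based linking of external metadata to already-indexed entries can be sketched as follows. This is a minimal illustration of the idea, not the appliance's implementation or its feed format: the in-memory `index` dictionary, the `(uri, name, value)` record shape and the function name `link_external_metadata` are hypothetical.

```python
from typing import Dict, List, Tuple

# Metadata of one indexed entry: field name -> list of values.
Metadata = Dict[str, List[str]]


def link_external_metadata(index: Dict[str, Metadata],
                           external: List[Tuple[str, str, str]]) -> List[str]:
    """Attach (uri, name, value) records to entries that are already indexed.

    Records are matched on the URI only. The return value lists the URIs
    that could not be linked because no indexed entry has that URI.
    """
    unmatched: List[str] = []
    for uri, name, value in external:
        entry = index.get(uri)
        if entry is None:
            unmatched.append(uri)
            continue
        values = entry.setdefault(name, [])
        if value not in values:  # ignore repeated feeds of the same record
            values.append(value)
    return unmatched


if __name__ == "__main__":
    people_index = {"ldap://directory/jdoe": {"department": ["Research"]}}
    expert_tags = [
        ("ldap://directory/jdoe", "expertise", "enterprise search"),
        ("ldap://directory/unknown", "expertise", "logistics"),
    ]
    print(link_external_metadata(people_index, expert_tags))
    print(people_index)
```

Returning the unmatched URIs gives a feeding component a simple way to report records that refer to content not yet in the index.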
At query time, any people information associated with organic results is tagged. Additional details, pertaining to designation, department, etc., are fetched as a separate OneBox result based on the information that is indexed from LDAP. These details are linked with the contact details that are externally fed, and this augmented set of “prospective” experts is displayed as a single OneBox result along with the organic search engine results page (SERP).
The organic results, when integrated with a rating system, would allow people to rate and/or tag content. This can be achieved using the “InfoValuator” component.
• InfoValuator: The InfoValuator component captures end-user ratings and saves a combination of user identity, content URI and rating value in the backend data store. On a scheduled basis, this data is fed into the appliance as external metadata and is associated with the content and/or people information. The rating is fetched along with Google’s organic search results. The UI can be designed to allow any logged-in user to rate the content on the fly, which the InfoValuator component would capture and store.
• Accessing Desktop Search: The Google component for desktop search was decommissioned as of September 2011. But it was one solid platform-independent tool that allowed implementers to index and search desktop content. The tool came with a provision to configure an internal Google appliance as one of the search sources. Any search made from the Google toolbar for desktops returned results from both the desktop and the appliance. Note: The Google desktop search tool maintained an index within the user’s own file system.
>> Alternates include:
»» COTS vendors like Autonomy and Microsoft FAST ESP provide options to implement specific independent components for desktop search. But these components are not cost-effective for a simple desktop search.
»» Apache Lucene and/or Solr is a good option for desktop search. The tool, which is easy to install and configure, comes with a default Jetty server and can be used to index local file systems. Again, Lucene/Solr are limited to searching files within a desktop and cannot handle indexing e-mail servers out of the box.
Conclusions
There is no one search engine that fulfills all enterprise search requirements. HP Autonomy claims this lofty perch but it comes with a huge cost overhead, with the base cost crossing half a million dollars. Open source search engines are widely used, but large enterprises still fear open source products for their lack of professional product support. Google Search Appliance has been the preferred tool for many medium to large enterprises for its ease of use, ready-to-go model and all-inclusive support package that comes with the purchase of the appliance. But
again, even Google is not the right fit for many requirements that we have seen so far. Custom search application development is inevitable and, if well planned, we can basically use any tool in the market to implement enterprise search as a full-fledged application. Identifying the TCO (total cost of ownership) and ROI (return on investment) ahead of time would help enterprises strike the right balance between choosing the right search tool and investing time and money on extending the tool for required customizations.
Figure 4: Architecture Alternate Using Apache Solr 3.1 & Nutch 1.3. [Diagram not reproduced. Labels: Search Frontend Application (Application Server, WebAPP): Frontend Renderer, Terms Federator, InfoValuator, People Content Builder, Experts Connect, Security Post filter, Plug-in, Solr Search, People Search, XML Content Feeder; Solr (Application Server, WebAPP, Solr HTTP Servlet, Solr Core): DisMax Handler, Search, More like this, Spell, People Search Handler (custom), Social Request Handler (custom), /Update handler, XML, Schema, Solr Config, Synonyms, Caching, Admin analytics, Data import, Dictionary, Query parser; Nutch: regex-urlfilter.txt, nutch-site.xml, local-fs, Schema; repositories: The Web, SharePoint, Desktop Files, Database, LDAP; Apache Lucene Indexer & Search Engine. Legend: Solr Components, Content Repositories, Custom Applications, Nutch Components.]
About the Author
Aruna Vaidyanathan is an Enterprise Search Architect within Cognizant’s Portal Content Collaboration
(PCC) Practice. With 10 years of IT experience, Aruna has spent seven-plus years integrating enterprise
search projects within various industries such as manufacturing and logistics, consumer goods and life
sciences. As a part of an 800-member practice with over 80 enterprise search professionals, Aruna’s
primary role is to evaluate top enterprise search products in the market and provide domain-specific
consultation and search product implementation recommendations to Cognizant clients. Aruna holds a
master’s degree in computer applications and can be reached at Aruna.Vaidyanathan@cognizant.com.