http://www.lucidimagination.com/developer/whitepapers/search-readiness-checklist
Abstract
Search was once considered a black-box application that ingested content and delivered results to users
opaquely. However, driven by the opportunities and demands of the growing universe of content and by
the versatility of Solr/Lucene open source search technology, search applications are evolving from a
standalone facility to an enabling framework.
Good search is hard. While the basics of search technology can be deceptively simple, the art and science
of applying that technology to relevant business and content processing problems is daunting. By its very
nature, search can span an almost infinite variety of content, formats, subject matter, relevancy criteria,
and more.
This Open Source Search Readiness Checklist is organized into four broad categories:
Why do you need a search application?
What are the key technical characteristics of your search application?
What is your search application’s technology environment?
How can you ensure the best fit between Solr/Lucene and your ongoing business needs?
Each category details key issues to consider in moving to open source search. Whether you are
undertaking a new search application or have a working search application running on a platform you
are considering leaving behind, this checklist provides a working foundation to help you make the
transition smoothly.
Working with Lucid Imagination, the commercial company for Solr/Lucene open source search
technology, offers you packaged solutions that simplify and streamline search application development;
lower the cost of growth through flexible, adaptable architecture; and deliver reliable backing of
unmatched expertise in enterprise search and open source.
Lucene/Solr Open Source Search Readiness Checklist
A Lucid Imagination White Paper • September 2010 Page i
Contents
Introduction
I. Why Do You Need a Search Application?
II. What Are the Key Technical Characteristics of Your Search Application?
III. What Is the Technology Environment in Which You Are Building Your Search Application?
IV. How can you ensure fit between Solr/Lucene and your ongoing business needs?
Summary of Questions
About Lucid Imagination
Recommended Reading
Appendix: Solr/Lucene Features and Benefits
Introduction
Whether you are undertaking a new search application or have a working search application running on
a platform you are considering leaving behind, there are a lot of questions you’ll need to answer to be
prepared for the effort.
Good search is hard. While the basics of search technology can be deceptively simple, the art and science
of applying that technology to relevant business and content processing problems are daunting. By its
very nature, search can span an almost infinite variety of content, formats, subject matter, relevancy
criteria, and more. Add in the fact that there are almost as many ways to judge relevant results as there
are individual end users, and you can see the challenge.
This Open Source Search Readiness Checklist is organized into four broad categories, each with a discussion of
the issues and opportunities you’ll need to consider as you prepare for your search application. Where
applicable, we’ll provide additional references for further study or research.
Why do you need a search application?
What are the key technical characteristics of your search application?
What is your search application’s technology environment?
How can you ensure the best fit between Solr/Lucene and your ongoing business needs?
This guide is not intended to replace a design strategy, architectural rigor, or a formal requirements
document. By considering answers for the issues it sets forth, we believe you’ll be better prepared for
getting your Solr/Lucene application up and running.
If you are replacing a legacy commercial platform, you may wonder: Can Solr/Lucene be a complete
search platform if you can’t just “drop it in” and replace what you now have, function for function,
feature for feature? Consider first that, owing to the great variation of search problems, search
technology providers have historically taken different approaches to developing their own toolkit: An
effort to imitate one with the other will not cut it. We believe you will be best served by a fresh look at the
problem search was meant to solve, unburdened by the details of prior implementations. More
importantly, the flexibility and adaptive nature of Solr/Lucene open source will both enable immediate
transition and lay the foundation for evolving your application to meet emerging needs.
The key measure of readiness for the transition is a solid grip on the value of the effort. Lucid
Imagination’s customers report that Solr/Lucene technology delivers tremendous benefits in flexibility,
result quality, performance—and most importantly, an ability to control their business and technology
destiny with search. Those same customers use Lucid Imagination’s services and solutions to lock in
those gains, and cement the competitive advantage achieved with Solr/Lucene.
We believe an understanding of these advantages will lead you to apply Solr/Lucene most effectively, and
identify where it is that Lucid Imagination can help you design, develop, and deploy your search
application with confidence.
I. Why Do You Need a Search Application?
In understanding the motivation behind your search application, consider how best to align three factors:
your users, your data, and your business objectives.
When you build a search application, you face end users with expectations driven by their experience
with the large consumer search engines on the public Internet, such as Google, Bing, and Yahoo. Certainly,
the billions of dollars spent on billions of end users searching trillions of documents have delivered
broad-ranging innovations.
It’s a fundamentally different proposition to build your own search application. Internet searches may
produce millions of results in milliseconds, but they rely on measures like website popularity or on URLs
and domain names—not generally applicable to purpose-built applications for businesses. Relying on
generalized relevancy for a global population of all Internet users, the big Internet search engines are not
tied to your business rules, business process logic, or the opportunity cost of improved precision for your
specific set of data or your search users—and their business interests are not yours.
Retrieval of unstructured, heterogeneous documents and data is where Lucene/Solr search technology
excels. Much of that data has been stored in relational databases, which offer robust storage and
stability, but whose query and retrieval model is ill-suited to the more varied, dynamic modern data
landscape.
Solr/Lucene search technology offers extraordinarily broad applicability, flexibility, scalability, and
adaptability. Open source provenance contributes directly to those benefits in many ways. It provides a
broad community of professional developers, testing and perfecting the technology against tremendous
variation in use cases, as well as changes and improvements that are strictly peer-reviewed, creating a
broad foundation of innovation. That community has also delivered faceting, geo-search, numeric range
queries, speed and scalability into the billions of documents, near-real-time indexing, and many more
innovations that have broken barriers to building effective search applications.
RECOMMENDED READING:
Starting a Search Application, Marc Krellenstein, CTO and Founder, Lucid Imagination
The Case for Lucene/Solr: Real World Open Source Search Applications, A Lucid Imagination White Paper
Another great capability inherent in the Solr/Lucene platform is anticipating the future needs of the
broad range of users. With adaptive and editorial boosting relevancy techniques, query corrections and
suggestions, recommended results, and faceted search, search applications built with Solr/Lucene help
your business control the quality of experience between your users and your data—and fit that
experience to your business objectives.
1. What business objectives are (or should be) achieved with your search application?
Free software, such as Lucene and Solr open source search, does not mean search is free of effort. If
your search project is successful, consider how you will prove it: Which of these would you be able to
point to?
(a) Save money? How much or how much more?
(b) Save time? How much or how much more?
(c) Increase revenue? How much or how much more?
(d) Increase end user satisfaction? Which ones?
(e) Create advantage over competitors?
(f) Decrease risk? How much or how much more?
(g) More than one of the above?
2. What objectives are (or are not) being met with your current search implementation?
Most organizations have a system for finding information, often a legacy commercial search system.
Why is it unsatisfactory? If you were to replace or improve it, which of the results in the previous
question would it affect? By how much?
3. Which improvements in search behavior contribute to improved business results?
Which of the following properties of your search application (one or more) would have the most
impact on the business results you are looking for?
(a) Speed with which new content is available.
(b) Likelihood the user’s chosen result is in the top n results returned.
(c) Completeness of the full set of results the system delivers.
(d) Speed with which queries deliver result sets.
(e) Flexibility with which the system handles different types of queries.
(f) Ability of the system to never deliver “zero” results.
(g) Ranking of particular results for particular queries.
(h) Reduced effort required for users to find previously unknown content.
(i) Likelihood the user will return to use the search system again and again.
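Item (b) above can be turned into a number you track over time. The following is a minimal sketch, assuming you can export a log of sessions in which each entry holds the ranked result IDs shown to the user and the ID the user finally chose. The function name `success_at_n` and the sample log are hypothetical and only illustrate the calculation.

```python
from typing import Sequence, Tuple


def success_at_n(sessions: Sequence[Tuple[Sequence[str], str]], n: int) -> float:
    """Fraction of sessions whose chosen document appears in the top n results."""
    if not sessions:
        return 0.0
    hits = sum(1 for ranked, chosen in sessions if chosen in ranked[:n])
    return hits / len(sessions)


# Hypothetical log entries: (ranked result ids, id the user finally chose).
sample_log = [
    (["d1", "d2", "d3", "d4"], "d1"),
    (["d7", "d5", "d9", "d2"], "d9"),
    (["d4", "d8", "d6", "d3"], "d3"),
    (["d2", "d1", "d5", "d7"], "d6"),  # the chosen document was never shown
]

print(success_at_n(sample_log, 3))  # 0.5
print(success_at_n(sample_log, 4))  # 0.75
```

Comparing this figure before and after a relevancy change gives a simple way to connect search behavior to the business results discussed in this section.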
4. How much control do you need over the results that end users see?
Within the realm of search behaviors, special attention needs to be paid to the control of search
results. Often, the application of algorithms, business rules, and access rights ties directly to the
economic benefits of search. Solr/Lucene offers great depth in this dimension. The previous question
asked about general changes in search behavior; here, consider specifically how important direct
control of results is to the success of the application.
(a) Do you need to adjust the likelihood that particular results or documents appear at a certain
time, or in relationship to other results?
(b) Are there certain documents or data that should be delivered to certain users, but proscribed
from others?
(c) Are there algorithms that you need the system to account for programmatically, in automated
fashion during the course of search, such as performing probability calculations?
(d) How important is it that you understand why the search returned a particular set of results,
and be able to adjust the search behavior as a result?
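To make these control points concrete, here is a minimal sketch of how a single Solr request might combine them: a query-time boost (`^3`) for item (a), a filter query (`fq`) for item (b), and `debugQuery=true` for item (d), which asks Solr to explain how each result was scored. The host, the `/solr/select` path of a default local install, and the field names `title`, `body`, and `acl_group` are assumptions for illustration; the sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed location of a local Solr instance; adjust for your deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def controlled_query_url(terms: str, user_group: str) -> str:
    """Build a search URL that boosts title matches and restricts by access group."""
    params = [
        # Matches in the (hypothetical) title field count three times as much.
        ("q", f"title:({terms})^3 OR body:({terms})"),
        # Only documents visible to this user's (hypothetical) access group.
        ("fq", f"acl_group:{user_group}"),
        # Ask Solr to include scoring explanations in the response.
        ("debugQuery", "true"),
        ("wt", "json"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = controlled_query_url("flat screen", "sales")
print(url)
print(parse_qs(urlsplit(url).query)["fq"])  # ['acl_group:sales']
```

Keeping access rules in a filter query rather than in the main query means they restrict the result set without influencing relevancy scores.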
5. How much do your end users know about the content they are searching for?
The behavior of your search application will be judged by its end users; how much do you know
about those users and the queries they are likely to submit? Consider the following contrasts. Are
your end users likely to:
(a) Express their queries in terms or phrases that will narrow in on results quickly, or submit
broad, general words that retrieve broad results?
(b) Spell the terms they are searching for correctly?
(c) Search for known results in an unknown location (e.g., “Find the e-mail I sent to Carol on
Tuesday, August 10” )? Or undertake a search without knowing which content they might
find?
(d) Browse through interim sets of results in order to narrow or refine their search queries?
(e) Specify quantitative parameters, such as distances, prices, locations, or dates, as part of their
search?
(f) Use logic-oriented language (e.g., Boolean queries or wildcard characters) or natural
language?
II. What Are the Key Technical Characteristics of Your Search Application?
Given the flexibility and broad applicability of Solr/Lucene open source search technology, there is a rich
set of design decisions to be made in setting up the application to meet your business objectives within
the scope of your technology. In this section, we’ll explore some of the key inputs you’ll need to consider
before you begin the exercise of architecture and design of your search application. In most, if not all, of
the permutations of search needs implied by the questions below, the flexibility of Solr/Lucene search
can address your needs.
It’s important to note that these questions are not intended to replace a formal design process or
substitute for rigorous architectural assessment of how you can use Solr/Lucene to build a successful
search application. Rather, they will help establish your intent with respect to key functional and system
behaviors.
More than in the previous sections, you may find that the answers to the scoping questions below change
over time. As you familiarize yourself with the capabilities and possibilities available with the
Solr/Lucene search platform, you may well want to refine or revise your understanding of what
constitutes desired behavior.
Often, organizations build a working prototype of their search application in order to validate the
assumptions, as well as the design and implementation of the system intended to put those assumptions
into action. While there are many nuances to formal development methodologies that exploit this
discover-by-doing effect, they share a common pattern of implementation, iteration, learning,
improvement, and change.
RECOMMENDED READING:
Faceted Search with Solr, Yonik Seeley, creator of Apache Solr and co-founder of Lucid Imagination
Optimizing Findability in Lucene and Solr, Grant Ingersoll, Chair, Apache Lucene PMC and co-founder of Lucid Imagination
It is strongly recommended that you consider at least two sets of answers to the questions below: first for
a prototype implementation, and then for one or more revisions of that implementation going forward,
once you accumulate experience and discover the full range of possibilities.
1. In what formats are the documents and data you will search?
Much as documents and data can live in different repositories, they come packaged in different
formats, based on where they originated and who created them. A good understanding of these
formats enables successful content processing for search. Different format types require different
levels of interpretation and composition to separate out searchable text content and metadata
(information about the document or its content), which can inform a search, from visual presentation
details such as colors, fonts, or software-specific content. For each of the formats, there are further
considerations of version; to cite just one example, the formatting and file structure of Microsoft
Word 97 *.doc documents differs from the Office 2007 *.docx version.
Solr/Lucene can leverage a range of tools—built-in as well as extensions, including both open source
and commercial source. Which of the following document format types will you be indexing and
searching?
(a) XML documents
(b) Database records
(c) HTML documents
(d) Microsoft Office documents: *.doc or *.docx for Word; *.ppt or *.pptx for PowerPoint; *.xls or
*.xlsx for Excel
(e) PDF documents
(f) CSV (comma-separated values) or TSV (tab-separated values) files
(g) Open Office documents
(h) Engineering drawings from CAD/CAM/CAE systems
(i) Others
2. Document collection composition: how big are documents?
Configuring your search system requires an understanding of your document sizes, as performance
and throughput depend heavily on accounting for the size of documents to be indexed. What
percentage or fraction of your documents are:
(a) Under 1 KB
(b) 1 KB to 100 KB
(c) 100 KB to 500 KB
(d) 500 KB to 1 MB
(e) 1 MB to 5 MB
(f) 5 MB to 10 MB
(g) 10 MB to 50 MB
(h) 50 MB to 100 MB
(i) 100 MB to 250 MB
(j) 250 MB and up
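If you already have a list of document sizes in bytes (for example, gathered from a file share), the percentages asked for here can be computed directly. The following is a minimal sketch; the function names are hypothetical, 1 KB is taken as 1024 bytes, and bucket (i) is read as 100 MB to 250 MB.

```python
from bisect import bisect_right
from collections import Counter

KB = 1024
MB = 1024 * KB

# Exclusive upper bounds for buckets (a) through (i); (j) holds everything larger.
BOUNDS = [1 * KB, 100 * KB, 500 * KB, 1 * MB, 5 * MB,
          10 * MB, 50 * MB, 100 * MB, 250 * MB]
LABELS = "abcdefghij"


def size_bucket(size_bytes: int) -> str:
    """Return the checklist letter for a document of the given size."""
    return LABELS[bisect_right(BOUNDS, size_bytes)]


def size_profile(sizes):
    """Percentage of documents per bucket, omitting empty buckets."""
    counts = Counter(size_bucket(s) for s in sizes)
    total = len(sizes)
    return {label: 100.0 * counts[label] / total
            for label in LABELS if counts[label]}


print(size_profile([500, 2000, 3000, 2 * MB]))
# {'a': 25.0, 'b': 50.0, 'e': 25.0}
```

A profile dominated by the largest buckets suggests paying particular attention to content extraction and indexing throughput during prototyping.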
3. How much new content do you presently add per unit time? How many documents are updated per unit time?
The quality of your search results can be affected by the interval between when a document is
complete or ready, and when it appears in the index for searching.
(a) Millions of very small documents—in the form of tweets, comments, messages, log files, etc.—
appear continuously as users or systems create these content snippets.
(b) Existing documents are revised, either by users, or by machines—in the latter case, examples
such as reports and data output indexed by your search application.
(c) New documents are available less frequently, perhaps even on a regular schedule, which in
turn drives user expectations of when they can be searched.
(d) Changes to content come in particular windows, busier at some times than others.
Consider the question of change to your collection in two ways: First, at what interval does the
amount of content in your collection change? Second, what fraction of the total documents are you
adding to the overall collection within each interval?
(a) From minute to minute
(b) About to four times per hour
(c) No more than two per hour
(d) No more than once every 4 hours
(e) Daily
(f) Weekly
(g) Monthly
4. What is the rate of queries you expect from your user population?
Consider the population of users who drive your search application. How many are they, and what
number of queries might be submitted? Consider especially that queries in the search application do
not always map one-to-one with a single string entered by a user in a search box. Use these questions
to characterize how many queries your search application will need to handle per unit time, typically
in queries per second.
(a) How often do they need access to the application?
(b) Will they submit queries one at a time on an occasional or ad-hoc basis, or will they rely on
the search application for continuous constant use?
(c) Do they have the expertise necessary to narrow quickly on search results, or will they require
continuous iteration, using one set of results to inform a series of subsequent queries?
(d) Will they have the expertise to write queries that conform precisely to the search
application’s expectation, or will you rely on the search application to analyze and decompose
their terms and phrases to ensure efficient execution and relevant results?
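The answers to these questions can be combined into a rough queries-per-second estimate. The sketch below is a back-of-the-envelope calculation, not a sizing method from this paper; the parameter names and the sample figures are assumptions. The `requests_per_search` factor reflects the point above that one user search does not always map to one query (autosuggest, facet refinement, and paging each add requests).

```python
def peak_queries_per_second(active_users: int,
                            searches_per_user_per_hour: float,
                            requests_per_search: float = 1.0,
                            peak_factor: float = 1.0) -> float:
    """Estimate query load from user counts and behavior.

    requests_per_search: back-end queries generated by one user search.
    peak_factor: how much busier the peak period is than the average.
    """
    per_hour = active_users * searches_per_user_per_hour * requests_per_search
    return per_hour / 3600.0 * peak_factor


# Hypothetical workload: 2,000 active users, 12 searches per hour each,
# 3 back-end requests per search, peaks twice as busy as the average.
print(peak_queries_per_second(2000, 12, requests_per_search=3.0, peak_factor=2.0))  # 40.0
```

Even a rough figure like this helps decide whether a prototype should be load-tested on a single server or on a replicated configuration.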
5. Does your content require faceting or a taxonomy in order to support productive navigation and discovery by end-users?
Faceted search provides an effective way to allow users to refine search results, continually drilling
down until the desired items are found. For example, on an e-commerce site, Solr/Lucene can present
a list of different brands of a flat-screen television, or let the user navigate into results. Facets can
span virtually any list of attributes, from sets of terms within a field to dates to numeric ranges and
the like. In addition to document-driven faceting, some search applications add an external taxonomy
platform to derive metadata—i.e., to extract what documents are about and append fields that
support guided navigation through results.
(a) Do documents contain data or metadata that allow users to narrow results?
(b) Are there consistent rules of document analysis you can create and apply to derive attributes
from documents?
(c) If documents lack native metadata, can you use a third party taxonomy platform to identify
attributes for faceted navigation?
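As an illustration of the flat-screen television example, the sketch below builds a Solr request that asks for facet counts and then narrows the results once the user picks a value. The `facet`, `facet.field`, `facet.mincount`, and `fq` parameters are standard Solr request parameters; the host, the `/solr/select` path, and the field names `brand` and `screen_size` are assumptions. The code only constructs URLs and does not contact a server.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed location of a local Solr instance; adjust for your deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def facet_query_url(query, facet_fields, selected=None):
    """Build a faceted search URL.

    facet_fields: fields to count values for (shown as navigation links).
    selected: facet values the user has already clicked, as {field: value}.
    """
    params = [("q", query), ("rows", "10"), ("wt", "json"),
              ("facet", "true"), ("facet.mincount", "1")]
    params += [("facet.field", field) for field in facet_fields]
    for field, value in (selected or {}).items():
        # Each selection becomes a filter query that narrows the result set.
        params.append(("fq", f'{field}:"{value}"'))
    return SOLR_SELECT_URL + "?" + urlencode(params)


# First request: show brand and screen-size counts for the query.
print(facet_query_url("flat screen television", ["brand", "screen_size"]))

# Follow-up request after the user clicks a (hypothetical) brand.
drill_down = facet_query_url("flat screen television",
                             ["brand", "screen_size"],
                             selected={"brand": "Acme"})
print(parse_qs(urlsplit(drill_down).query)["fq"])  # ['brand:"Acme"']
```

This only works if the fields being faceted on exist in the index, which is why questions (a) through (c) focus on whether that metadata is present or can be derived.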
6. Which advanced search features do you expect to use in order to improve how users can submit queries and choose?
Solr/Lucene offers a broad set of powerful query and search tools that can help users quickly choose
from available options, either before or after they submit a query. Which of the following features can
help improve the speed and efficacy of the experience for your end users?
(a) Autosuggest/as-you-type: The search application prompts the user with possible alternate
queries implied by a partial or complete search term, as they type in the search box.
(b) Spellchecking: The search application can interpret search terms that are not necessarily
spelled correctly, either prompting the user with correctly spelled alternatives, and/or
automatically retrieving results that match terms that most closely resemble the misspelled
word in the query.
(c) Did you mean: Similar to spell checking, the search application can offer alternate matches to
terms that resemble the user’s query, even when those terms were not typed in explicitly.
(d) More like this: The search application allows the user to drill down into a particular element
of one result set to find additional results that resemble it.
(e) Hit highlighting: The search application can mark or emphasize specific terms from the
query in snippets of the document result, showing the user which terms match the query.
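To show what two of these behaviors look like from the user's side, here is a small conceptual sketch of a spelling suggestion and of hit highlighting using only the Python standard library. It is not how Solr implements these features; the vocabulary, function names, and `<em>` markers are assumptions chosen for illustration.

```python
import re
from difflib import get_close_matches

# Hypothetical set of terms known to the index.
VOCABULARY = ["television", "telephone", "tablet", "hdmi"]


def did_you_mean(term, vocabulary=VOCABULARY):
    """Suggest the closest known term, or None if the term is already known
    or nothing is similar enough."""
    term = term.lower()
    if term in vocabulary:
        return None
    matches = get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else None


def highlight(snippet, query_terms):
    """Wrap whole-word, case-insensitive matches of the query terms in <em> tags."""
    if not query_terms:
        return snippet
    alternatives = "|".join(re.escape(t) for t in query_terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(r"<em>\1</em>", snippet)


print(did_you_mean("televison"))  # television
print(highlight("Flat-screen Television with HDMI", ["television", "hdmi"]))
# Flat-screen <em>Television</em> with <em>HDMI</em>
```

In a Solr application these behaviors would normally come from the platform's own spellchecking and highlighting support rather than from application code like this.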
III. What Is the Technology Environment in Which You Are Building Your Search Application?
Driven by the opportunities and demands of the growing universe of available content and by the
versatility of Solr/Lucene open source search, search is evolving from a standalone facility to an enabling
framework.
Search was once considered a black-box application that ingested content and delivered results to users
opaquely. No more. Today, developers are turning to Solr/Lucene to extend the data access and
management power of their applications into the realm of unstructured text: documents, articles,
product descriptions, case studies, informal notes, websites, forums, wikis, inventory data, patient
records, e-mail messages, resumes, patents, legal decisions, tweets, log files, traditional relational data
stores, and nontraditional data infrastructure. The examples are endless. Effective retrieval of timely,
actionable content in the face of such diversity means treating search as an application development
platform or an enabling framework, not an end-unto-itself application.
RECOMMENDED READING:
Full Text Search Engine vs. RDBMS, Marc Krellenstein, CTO and co-founder, Lucid Imagination
Scaling Lucene and Solr, Mark Miller, Lucid Imagination; Apache Lucene and Solr Committer
Like any application development effort, the exercise of creating search applications and enabling existing
applications with search must be driven by business considerations. With an understanding of your
business needs in hand from the previous section, we now turn to the constraints and capabilities of the
technology context in which the search application is to be developed and deployed, and explore key
attributes of your technology environment tied to search application development.
1. What Programming Skills Do Your Developers Bring to Your Search Application?
Solr and Lucene search applications are typically developed as web applications. High-level search
functions that can be accessed programmatically include queries, indexing commands, relevance
algorithms, performance, and the like, generally presented by Solr as services and configuration
options. Solr offers a particularly broad base of client libraries, which means it can be accessed
through a large variety of programming languages.
In which of the following languages/environments supported by Solr is your application development
team skilled and experienced?
(a) JSON
(b) Java
(c) Ruby
(d) PHP
(e) Ajax
(f) Python
(g) .Net
(h) C#
(i) Perl
(j) JavaScript
2. Is your development team skilled and experienced in working with Open Source?
For most intents and purposes, open-source software has “crossed the chasm” into mainstream
usage, with a broad range of government, nonprofit, and corporate sectors running well-established
portions of their IT infrastructure on the LAMP stack—Linux, Apache, MySQL, and PHP/Perl/Python.
A recent survey of 300 large corporations by the global consultancy firm Accenture shows the
majority of respondents committing strategic technology initiatives to open source. To gauge the
depth of open source utilization, which of the following major open source projects are broadly
utilized in your organization?
(a) Linux for server operating systems
(b) MySQL or Postgres for RDBMS
(c) Eclipse for integrated software development
(d) PHP for web application integration
(e) Apache for HTTP services
(f) Tomcat for web application containers
(g) JBoss for application business logic
3. How and where are the data and documents stored, independent of format?
Most individuals are acquainted with searching for content stored either in the context of their own
personal computer environments, such as a file system, in e-mail, or in one of the popular,
advertising-driven consumer-facing commercial Internet search services. In the context of enterprise
or commercial search, the diversity of data storage methods spans a much broader range of
technologies, not necessarily tied to formats for individual file objects. Which of the following data
repositories will your search application access?
(a) Traditional directory-oriented file servers, fileshares, and filesystems
(b) Web servers
(c) Relational databases, including Oracle, MySQL, SQL Server, Informix, Postgres, DB2
(d) Nonrelational (AKA NoSQL) data stores, such as Hadoop, Cassandra, Memcached
(e) Proprietary collaboration stores, e.g., Lotus Notes, SharePoint
(f) Open Source content management systems, e.g., Drupal, Joomla, Alfresco
(g) Proprietary Enterprise content management systems, e.g., Documentum, Vignette, OpenText
(h) XML-oriented data stores, such as MarkLogic
4. On what operating system platform(s) or environments will your search application run?
IT organizations are able to achieve significant setup/deployment economies by standardizing
hardware and software practices at the platform level, along with operating practices. Because Solr
runs in a Java servlet container, with indexes portable across platforms, it can operate in any of the
mix of mainstream operating system environments, virtualized environments and cloud platforms
available in today’s marketplace.
(a) Linux
(b) Solaris
(c) Windows/NT Server/.Net framework
(d) Mac OS
(e) Amazon EC2 (including the above OS environments)
(f) VMware (including the above OS environments)
5. Should you use Lucene or Solr?
Solr and Lucene are complementary technologies that offer very similar underlying capabilities. Solr
is the Lucene search server; Lucene is the set of Java libraries that run inside the Solr search server,
also available independent of the server implementation.
As the Lucene search server, Solr presents a web service layer built atop Lucene using the Lucene
search library and extending it to provide application users with a ready-to-use search platform. Solr
offers search speed, relevancy ranking, complete query capabilities, portability, scalability, low
overhead indexes, and rapid incremental indexing, from its Lucene core. Its server encapsulation of
Lucene adds operational and administrative capabilities like web services, faceting, configurable
schema, caching, replication, and administrative tools for configuration, data loading, statistics,
logging, cache management, and more.
Lucene gives Solr its search power. In all but a small number of exceptions, organizations building
search applications should start with Solr rather than a direct implementation of the Lucene libraries.
Applications that do otherwise often began their efforts prior to the availability of Solr.
Solr provides the starting point for most developers who are building a Lucene-based search
application. Organizations who build with Solr find themselves better able to adapt their application
to changing data structures, query needs, user behaviors, and infrastructure configuration. These
benefits accrue in lower “costs of ownership,” improved flexibility, and a broader available pool of
search application developers in the marketplace.
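One practical consequence of Solr's web service layer is that indexing can be driven from any language that can produce XML and make an HTTP request, with no Java code required. The sketch below builds a Solr XML update message for a single document. The `<add><doc><field name="...">` structure is Solr's XML update format; the function name and the `id` and `title` fields are hypothetical and would need to match your schema.

```python
import xml.etree.ElementTree as ET


def solr_add_message(document):
    """Build a Solr XML update message for one document.

    document: mapping of field name to value; the names must exist in the
    Solr schema of the target index.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in document.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


message = solr_add_message({"id": "doc-1", "title": "Search readiness checklist"})
print(message)
```

Such a message would typically be sent in an HTTP POST to Solr's update handler, followed by a commit so that the new document becomes searchable. With Lucene alone, the equivalent step means writing Java code against the indexing library.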
6. Are application development practices in your organization structured to address time to market constraints or technical complexity?
Successful application development depends on the professional practice of software development.
While there are many theories, approaches, and development models, there is a key set of
development disciplines practiced by successful application development organizations. Does your
application development team understand the tools, methodologies, and mechanisms
involved in the following software development competencies?
(a) Requirements analysis
(b) Iterative design
(c) Documentation
(d) Test planning
(e) Change control
(f) Architectural description
(g) Formal design
(h) Build and release engineering
7. What service level availability does your search application need to deliver to end users? What is the cost or impact of outages or service unavailability?
Solr’s ability to run on a distributed infrastructure provides robust application availability and
performance at scale, allowing you to expand to meet growth in both your document collection and
your user workload. As with all infrastructure, it is important to understand in advance what impact a
service outage would have on your end users, in order to ensure that the system is as strong as its
weakest link, so that you can make appropriate choices about networking, servers, storage, and
operating procedures. What is the longest interval during which your end users can be productive
without access to your search application? And how often can they tolerate such unscheduled
outages?
Duration                     Frequency
(a) 1 minute                 (a) Once per year
(b) 30 minutes               (b) Once per month
(c) 1 hour                   (c) Once per week
(d) 4 hours                  (d) Once per day
(e) 1 day                    (e) Once an hour or more
(f) Longer than 1 day
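To make the duration and frequency choices above concrete, the following minimal sketch converts one combination into expected downtime per year and an availability fraction. The function and table names are hypothetical, the year is taken as 365 days, and the open-ended options ("Longer than 1 day", "Once an hour or more") are left out because they have no single numeric value.

```python
# Illustrative sketch only: names and structure are assumptions, not part of Solr.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

# Closed-ended options from the checklist, expressed as numbers.
DURATION_MINUTES = {
    "1 minute": 1,
    "30 minutes": 30,
    "1 hour": 60,
    "4 hours": 240,
    "1 day": 1440,
}
OUTAGES_PER_YEAR = {
    "once per year": 1,
    "once per month": 12,
    "once per week": 52,
    "once per day": 365,
}


def availability(duration, frequency):
    """Return (downtime minutes per year, availability as a fraction of 1)."""
    downtime = DURATION_MINUTES[duration] * OUTAGES_PER_YEAR[frequency]
    downtime = min(downtime, MINUTES_PER_YEAR)  # cannot be down longer than a year
    return downtime, 1 - downtime / MINUTES_PER_YEAR


if __name__ == "__main__":
    minutes, fraction = availability("30 minutes", "once per month")
    print(f"{minutes} minutes of downtime per year, availability {fraction:.4%}")
```

A 30-minute outage once per month, for example, adds up to 360 minutes per year, or roughly 99.93% availability. Working the numbers this way helps translate end-user tolerance into a target that networking, server, and storage choices can be measured against.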
IV. How can you ensure fit between Solr/Lucene
and your ongoing business needs?
The best test of technology in the enterprise is in its ability to deliver on business needs consistently. It
must strike the optimal balance between features/functions and the continuous achievement of
competitive advantage for the business paying for it. Search is the same, only more so: It must constantly
do a better and better job of delivering results that derive competitive advantage from matching end
users to valuable information in a timely fashion.
Open source can be a two-edged sword: Its innovation is unmatched, but the timing of that innovation (as is
often the case with innovation in any domain) is not always predictable. While the marketplace
challenges a company faces are constant and dynamic, its technology infrastructure demands a strong
degree of stability and predictability. The design, building, and maintenance of applications must handle
change without adding instability to the problems they aim to solve.
At Lucid Imagination, we specialize in capturing the best that open source Solr/Lucene search offers and
delivering it into business-critical application development efforts in a way that improves stability and
provides predictability without sacrificing the power, scalability, or flexibility of open source. With
time-driven support, deep expertise, and a broad solution platform of stable value-added software, we
transform open source search into a stable foundation that lets you accelerate with confidence.
In this section, we’ll present considerations for you in taking advantage of the power of open source in
the context of the enterprise. Unlike previous sections that were shaped by various options, these
questions are designed to help you consider risks and dynamics of your development effort and its ability
to bridge the gap between the open source innovation you need to compete and the enterprise
foundation you rely on to effectively reap the benefits of that innovation.
1. What is your “bench-depth” in designing and deploying search applications?
If there is one element all search applications share, it is their diversity: Each set of content, queries,
and end user requirements is unique. One of the great strengths of open source search is as a robust,
general purpose platform capturing inputs from a broad variety of search use cases.
Even when you have top talent, your search application may be limited by their experience; others
inside or outside the public open source discussion archives might have experience that could benefit
their efforts.
For example, the foundations of ambition for your search application are built in early: Your
development team must make critical architecture and design decisions, with significant downstream
impact throughout subsequent releases of your application to customers. Breadth of experience will
make a critical difference in whether those assumptions will lend themselves to necessary future
changes, or introduce unnecessary constraints that hobble your application when the time comes to
seize new opportunities.
2. How does your organization find and incorporate changes to code or source code
fixes for your applications?
Open source code is the raw material of your application development effort. The less it costs to
ensure inbound quality and stability, the more you reduce risks to the application you are building.
Open source software does not stand still. Even between major releases, the team of committers and
programmers developing fixes and improvements is constantly adding new ideas and features to
their project. Some of these changes are available as patches, others are built into trunk and available
through nightly builds, and they may or may not meet your acceptance criteria.
Solr/Lucene is no different: Driven by a consensus-leveraged meritocracy, the project produces changes
that may or may not be compatible with your implementation. Identifying which of those to incorporate
into development and assessing their impact on other elements of the system is a critical success
factor, and the right choice may or may not be obvious at the point in time they become available.
3. What is the cost-benefit tradeoff of timely fixes and availability of expertise?
In building prototypes, you may or may not be able to wait for the community of experts to work on
your need or provide advice; once you reach a production, business-critical scenario, you’ll need
things done on your timetable, not theirs. Or, you may not wish your particular effort to have any
public exposure at all—in which case you’ll want a communications channel that meets the needs of
your business in your marketplace.
Many problems can be solved given enough time and effort. If your design and deployment efforts
conform to a schedule where speed has value, consider the relative cost-benefits of internal trial-and-
error vs. predictably delivered expertise available on demand.
4. Does the cost-benefit tradeoff of fix timeliness change once your application moves into a
production environment?
Once an application’s user base extends beyond the developers who built it, its owners must be ready
to deliver consistent, predictable availability, performance, and scalability. Meeting the service needs
of end users cannot always be done in real time by the person who wrote the software; developers
move on to other projects or leave the company.
The heterogeneity of your content collection, particularly as it changes and grows, can introduce new,
unanticipated sources of anomalies in its performance. Similarly, it is difficult to anticipate the full
range of user queries and demands on the system, which often leads to the application's inability to
meet new, previously unaccounted-for requirements. Ensuring timeliness of fixes to accommodate
these organic changes may well be beyond the reach of your development team or your IT
organization.
Last, and not least, ensuring that the release process itself can meet its intended thresholds of
performance, throughput, and other systemic qualities can benefit from lessons learned by experts
experienced across a diverse range of deployments.
5. How will you ensure a consistent, authoritative base of
knowledge and skills for your development team to work from?
Critical mass of expertise in development is directly correlated with the overall effectiveness and
velocity of your development efforts. The Solr/Lucene open source community provides developers
with a rich, diverse base of resources to use in bootstrapping their skills, including mailing list
forums, examples, peer-to-peer resources, and much more. The enterprise developer can swim far
and wide in the sea of information, learning by wandering among other implementations and other
discussions.
At the same time, organizations driven by a development and business timetable need a more
structured, organized, and directed approach to building a solid, consistent foundation based on
authoritative sources. Working from a pedagogically oriented set of materials, developers can not
only acquire a clearer sense of what the technology is and does, but also how best to apply search
engine technologies to business requirements. Best practices distilled from years of experience of a
broad base of experts can give your team a quicker start, reduce the setup and execution time, and
improve how effectively they contend with problems as and when they emerge.
Summary of Questions
I. Why do you need a search application?
1. What business objectives are (or should be) achieved with your search application?
2. What objectives are (or are not) being met with your current search implementation?
3. Which improvements in search behavior contribute to improved business results?
4. How much control do you need over the results that end users see?
5. How much do your end users know about the content they are searching for?
II. What are the key technical characteristics of your search application?
1. In what formats are the documents and data you will search?
2. Document composition: how big are documents?
3. How much new content do you presently add per unit time? How many documents are updated per unit time?
4. What is the rate of queries you expect from your user population?
5. Does your content require faceting or a taxonomy in order to support productive navigation and discovery by end-users?
6. Which advanced search features do you expect to use in order to improve how users can submit queries and choose?
III. What is the technology environment in which you are building your search application?
1. What programming skills do your developers bring to your search application?
2. Is your development team skilled and experienced in working with Open Source?
3. How and where are the data and documents stored, independent of format?
4. On what operating system platform(s) or operating environments will your search application run?
5. Should you use Lucene or Solr?
6. Are application development practices in your organization structured to address time-to-market constraints or technical complexity?
7. What service level availability does your search application need to deliver to end users? What is the cost or impact of outages or service unavailability?
IV. How can you ensure continuous fit between Solr/Lucene and your business needs?
1. What is your “bench-depth” in designing and deploying search applications?
2. How does your organization find and incorporate changes to code or source code fixes for your applications?
3. What is the cost-benefit tradeoff of timely fixes and availability of expertise?
4. Does the cost-benefit tradeoff of fix timeliness change once your application moves into a production environment?
5. How will you ensure a consistent, authoritative base of knowledge and skills for your development team to work from?
About Lucid Imagination
Lucid Imagination can help you use Solr/Lucene to get the most from your search applications. Lucid
Imagination has the world-class expertise, resources, support, and services needed to cost-effectively
architect, implement, and optimize Solr/Lucene-based solutions. We provide commercial-grade support,
training, and consulting and offer certified, tested versions of Lucene and Solr. Lucid Imagination’s goal is
to serve as a central resource for the entire Lucene community and marketplace, to make enterprise
search application developers more productive. We also provide access to Solr/Lucene experts, well-
organized information, and documentation.
We’ve helped hundreds of companies get the most out of their search infrastructure. Customers include
AT&T, Buy.com, Cisco, Ford, Macy’s, Sears, Shopzilla, The Motley Fool, Verizon, Edmunds.com, GSI
Commerce, Zappos (Amazon), and many other household names. Lucid Imagination is a privately held
venture-funded company. The investors include Granite Ventures, Walden International, In-Q-Tel, and
Shasta Ventures. To learn more please visit http://www.lucidimagination.com or
http://www.lucidimagination.com/solutions.
For more information on what Lucid Imagination can do to help your employees, customers, and partners
get the most out of your e-commerce efforts, contact sales@lucidimagination.com or call
+1.650.353.4057.
Recommended Reading
Starting a Search Application by Marc Krellenstein
http://www.lucidimagination.com/developers/whitepapers/starting-search-application
The Case for Lucene/Solr Real World Open Source Search Applications
http://www.lucidimagination.com/solutions/whitepapers/Managers-Guide-to-Real-World-Open-Source-Search-Applications
Faceted Search with Solr by Yonik Seeley
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
Optimizing Findability in Lucene and Solr by Grant Ingersoll
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr
Full Text Search Engine vs. RDBMS by Marc Krellenstein
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
Scaling Lucene and Solr by Mark Miller
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
Appendix: Solr/Lucene Features and Benefits
Lucene and Solr are complementary technologies that offer very similar underlying capabilities. In
choosing a search solution that is best suited for your requirements, key factors to consider are
application scope, development environment, and software development preferences.
Lucene is a Java technology-based search library that offers speed, relevancy ranking, complete query
capabilities, portability, scalability, low-overhead indexes, and rapid incremental indexing.
Solr is the Lucene search server. It presents a web service layer built atop Lucene using the Lucene search
library and extending it to provide application users with a ready-to-use search platform. Solr brings with
it operational and administrative capabilities like web services, faceting, configurable schema, caching,
replication, and administrative tools for configuration, data loading, statistics, logging, cache
management, and more.
Lucene presents a collection of directly callable Java libraries and requires coding and solid information
retrieval experience. Solr extends the capabilities of Lucene to provide an enterprise-ready search
platform, eliminating the need for extensive programming.
Solr provides the starting point for most developers who are building a Lucene-based search application.
It comes ready to run in a servlet container such as Tomcat or Jetty, making it ready to scale in a
production Java environment.
With convenient ReST-like/web-service interfaces callable over HTTP, and transparent XML-based
configuration files, Solr can greatly accelerate application development and maintenance. In fact, Lucene
programmers have often reported that they find Solr contains “the same features I was going to build
myself as a framework for Lucene, but already very well implemented.” Using Solr, enterprises can
customize the search application according to their requirements, without incurring the cost and risk of
writing the code from scratch.
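As a rough illustration of what "callable over HTTP" means in practice, the sketch below assembles a query URL for Solr's select handler. The host, port, and path are assumptions based on a typical local example installation and will differ in a real deployment; the helper name is hypothetical. The sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10, response_format="json"):
    """Build a URL for Solr's select handler (base_url is an assumed location)."""
    params = {
        "q": query,             # the user's query string
        "rows": rows,           # maximum number of results to return
        "wt": response_format,  # response writer, e.g. "json" or "xml"
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed local address; replace with the address of your own Solr server.
    print(build_select_url("http://localhost:8983/solr", "title:lucene"))
```

Because the interface is plain HTTP, any language with an HTTP client can issue the same request, which is a large part of why Solr applications are not tied to Java on the client side.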
Lucene provides greater control of your source code and works best in development environments
where resources need to be controlled exclusively by Java API calls. It works best when constructing and
embedding a state-of-the-art search engine, allowing programmers to assemble and compile inside a
native Java application. While working with Lucene, programmers can directly control the large set of
sophisticated features with low-level access, data, or state manipulation.
Enterprises that do not require strict control of low-level Java libraries generally prefer Solr, as it
provides ease of use and scalable search power out of the box.
As functional siblings, Lucene and Solr have become popular alternatives for search applications; the two
differ mainly in the style of application development used. Key benefits of search with Solr/Lucene
include:
Search Quality: Speed, Relevance, and Precision
Solr/Lucene provides near-real-time search and strong relevance ranking to deliver contextually
relevant and accurate results very quickly. Tailor-made coding for relevancy ranking and
sophisticated search capabilities like faceted search help
users in sorting, organizing, classifying, and structuring retrieved information to ensure that search
delivers desired results. Search with Solr/Lucene also provides proximity operators, wildcards,
fielded searching, term/field/document weights, find-similar functions, spell checking, multilingual
search, and much more.
Lower Cost and Greater Flexibility, Plug and Play Architecture
Solr/Lucene reduces recurring
and nonrecurring costs, lowering your TCO. As open source software, it does not require purchase of
a license and is freely available for use. The open source code can be used as is, modified, customized,
and updated as appropriate to your needs. Solr is easily embedded in your enterprise’s existing
infrastructure, reducing costs of installation, configuration, and management.
Open Source Platform for Portability and Easy Deployment
Because Solr/Lucene is an open source software solution, it is based on open standards and
community-driven development
processes. It is highly portable and can run on any platform that supports Java. For instance, you can
build an index on Linux and copy it to a Microsoft Windows machine and search there. This
unsurpassed portability enables you to keep your search application and your company’s evolving
infrastructure in tandem. Lucene, in turn, has been implemented in other environments, including C#,
C, Python, and PHP. At deployment time, Solr offers very flexible options; it can be easily deployed on
a single server as well as on distributed, multiserver systems.
Largest Installed Base of Applications, Increasing Customer Base
Solr/Lucene is the most widely
used open source search system and is installed in around 4,000 organizations worldwide. Publicly
visible search sites that use Solr/Lucene include CNET, LinkedIn, Monster, Digg, Zappos, MySpace,
Netflix, and Wikipedia. Solr/Lucene is also in use at Apple, HP, IBM, Iron Mountain, and Los Alamos
National Laboratories.
Large Developer Base and Adaptability
As community-developed software, Solr/Lucene provides
transparent development and easy access to updates and releases. Developers can work with open
source code and customize the software according to business-specific needs and objectives. Its open
source paradigm lets Solr/Lucene provide developers with the freedom and flexibility to evolve the
software with changing requirements, liberating them from the constraints of commercial vendors.
Commercial-Grade Support for Mission Critical Search Applications from Lucid Imagination
Lucid Imagination provides the expertise, resources, and services needed to help enterprises deploy
and develop Lucene-based search solutions efficiently and cost-effectively. Lucid helps enterprises
achieve optimal search performance and accuracy with its broad range of expertise, which includes
indexing and metadata management, content analysis, business rule application, and natural
language processing. Lucid Imagination also offers certified distributions of Lucene and Solr,
commercial-grade SLA-based support, training, high-level consulting, and value-added software
extensions to enable customers to create powerful and successful search applications.
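Several of the query features named under Search Quality above (fielded searching, wildcards, proximity operators, term weights) and faceting can be illustrated with a few query strings. This is a minimal sketch assuming the standard Lucene query parser syntax and Solr's `facet` and `facet.field` request parameters; the field names `title` and `category` are hypothetical and would come from your own schema.

```python
from urllib.parse import urlencode

# Example query strings in standard Lucene query parser syntax.
# Field names are hypothetical placeholders.
QUERY_EXAMPLES = {
    "fielded search": "title:solr",            # match only in the title field
    "wildcard": "luc*",                        # terms starting with "luc"
    "proximity": '"open source"~5',            # the two words within 5 positions
    "term weight (boost)": "solr^2 lucene",    # weight "solr" twice as heavily
}


def facet_params(query, facet_field):
    """Encode request parameters asking Solr to facet results on one field."""
    return urlencode({"q": query, "facet": "true", "facet.field": facet_field})


if __name__ == "__main__":
    for feature, query in QUERY_EXAMPLES.items():
        print(f"{feature}: {query}")
    print(facet_params("title:solr", "category"))
```

The encoded parameter string can be appended to a select request so that the response includes result counts per value of the chosen field alongside the matching documents.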