Why MarkLogic: Addressing the Challenges of Unstructured Information with Purpose-built Technology

Table of Contents
Introduction
Characteristics of Unstructured Information
MarkLogic Addresses Unstructured Information
Summary
About MarkLogic
Abstract
Rapidly changing conditions are forcing organizations to rethink how they use information
to meet their objectives. Whether battling in the marketplace or on the battlefield, the
need for flexibility and agility with information has never been greater. Organizations are
looking to integrate and enrich information to create additional value for users. User
expectations are changing too, as users demand Web 2.0 and Enterprise 2.0 style applications
that provide modern search capabilities, as well as an ability to interact with information
through tagging and user-generated comments. Various distribution channels also present
new challenges for information providers in exposing their information through rich user
interfaces or through syndicated services like RSS and Atom feeds, allowing users to explore
and access information in their own context.
Choosing the right technology at the core of its application architecture is critical for
any organization that needs the agility to meet these goals and rapidly respond to
unforeseen changes. XML servers such as MarkLogic Server provide that agility by offering
a single unified platform for storing, manipulating, and delivering XML and for
building innovative information applications.
This paper provides a technical overview of MarkLogic Server, the industry’s leading XML
server, and also discusses some of the challenges facing organizations today for storing,
repurposing, and dynamically delivering information.
Introduction
MarkLogic Server is a purpose-built database for unstructured information.
In this context, “unstructured information” refers to all information
that does not fit well in the rows and columns of a relational database
management system (RDBMS). In some cases, unstructured information
might be semi-structured or even highly structured but, due to the specific
characteristics discussed in this paper, requires significant effort to load,
store, and query in an RDBMS.
Most organizations recognize unstructured information as documents,
such as policies, manuals, contracts, reports, articles, cables, journals,
and legal briefs. Even media such as user-generated content, RSS feeds,
emails, social graphs, metadata, images, videos, and audio files are widely
used forms of unstructured information.
Most existing tools, such as RDBMSs, were not built to handle the challenges
of unstructured information. These tools either require rigid adherence
to a specific structure or ignore any existing structure altogether. In other
words, they treat unstructured information as a second-class citizen. This
prevents organizations from leveraging their information effectively.
1 | MarkLogic whitepaper
Characteristics of Unstructured Information
To understand why today’s most common tools are insufficient for leveraging unstructured information, it is useful to review the specific characteristics of unstructured information that require it to be treated differently than structured information. This section discusses these characteristics, while the next section discusses how MarkLogic addresses them.

Heterogeneous
The first important characteristic of unstructured information is that it is heterogeneous. In other words, not only does it look different from structured information, but the many formats of unstructured information vary significantly from one another. Unstructured information includes non-discrete data types such as words, sentences, and concepts, in conjunction with discrete data types such as numbers, dates, and identifiers. Many combinations of these data types are possible, so standards are created to maintain manageability. However, the gains are not always clear, since great variance still exists, as evidenced by the many domain-specific standards such as:
• FpML – Financial products Markup Language
• OOXML – Office Open XML for Microsoft Office 2007/2010
• ISO 20022 – the ISO Standard for Financial Services Messaging
• XBRL – eXtensible Business Reporting Language
• RixML – Research Information Markup Language
• DocBook – a popular markup language for documentation
• MDDL – Market Data Definition Language
• DDMS – Department of Defense Discovery Metadata Specification

Also consider the different document formats such as PDF, HTML, Microsoft Office, RTF, etc. These options represent the different ways unstructured information is stored.

Contrast this heterogeneity with the homogeneity of structured information, which is stored in a consistent, tabular form. The data types in structured information primarily consist of numbers, dates, and fixed-length text strings, which limits its format variation. Database tables were invented with this limited variation in mind.

Since unstructured information varies greatly, it is not easily stored in tables. The challenge is that unstructured information must be mapped into tables and discrete data types, which entails an unnatural and time-consuming effort. As an alternative, RDBMS data types such as character and binary large objects (CLOBs and BLOBs) were created to overcome the limitations of the discrete data types, but they facilitate only storage, not querying. Therefore, CLOBs/BLOBs are only marginally better than storage on a filesystem. The problem remains that RDBMSs treat unstructured information as a second-class citizen. The monolithic approach of CLOBs/BLOBs ignores the important context in unstructured information, and thus precludes analysis, retrieval, and updates at a granular level.

Complex
In addition to heterogeneity, unstructured information is also very complex. There are several characteristics that contribute to complexity, any combination of which may be found in unstructured information.
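
The storage-versus-querying limitation of CLOBs described under Heterogeneous can be made concrete with a small sketch. The following Python example is illustrative only: the contract document, its element names, and the helper functions are hypothetical and are not taken from MarkLogic or any RDBMS. It contrasts treating a document as one opaque block of text with treating the same document as markup that can be read and updated at the level of a single clause.

```python
import xml.etree.ElementTree as ET

# Hypothetical contract document; the element names are assumptions for illustration.
CONTRACT_XML = """<contract id="c-001">
  <party role="buyer">Acme Corp</party>
  <party role="seller">Globex Inc</party>
  <clause num="1"><title>Term</title><text>The term is 12 months.</text></clause>
  <clause num="2"><title>Payment</title><text>Payment is due in 30 days.</text></clause>
</contract>"""


def clob_contains(blob: str, needle: str) -> bool:
    """Opaque-blob access: we can only ask whether the text occurs somewhere."""
    return needle in blob


def clause_text(doc: ET.Element, title: str):
    """Granular access: return the text of the clause with the given title."""
    for clause in doc.findall("clause"):
        if clause.findtext("title") == title:
            return clause.findtext("text")
    return None


def update_clause(doc: ET.Element, title: str, new_text: str) -> bool:
    """Granular update: change one clause without rewriting the whole document."""
    for clause in doc.findall("clause"):
        if clause.findtext("title") == title:
            clause.find("text").text = new_text
            return True
    return False


doc = ET.fromstring(CONTRACT_XML)
print(clob_contains(CONTRACT_XML, "Payment"))  # found, but with no context about where
print(clause_text(doc, "Payment"))             # the text of exactly one clause
update_clause(doc, "Payment", "Payment is due in 45 days.")
print(clause_text(doc, "Payment"))
```

The blob check can only report that a word appears somewhere, while the markup-aware functions can locate and amend one clause, which is the kind of granular analysis, retrieval, and update the text says a monolithic CLOB precludes.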
For one, unstructured information is typically hierarchical, with nested parent/child relationships. Often these relationships are not obvious, but examples include subsections in a chapter of a book or sub-clauses in a contract. Structured information, on the other hand, typically has flat, tabular relationships that may be expressed as one-to-one, one-to-many, or many-to-many. Since RDBMSs were not designed for hierarchies, a query that joins rows to recreate the hierarchy is slow and inefficient.

Unstructured information is also irregular, meaning it does not fit in neat, predefined data elements. Information may vary greatly in length, with no predefined or bounded data lengths. It might also be sparsely populated, meaning that across a collection of information there might be thousands of known data elements, many of which are blank. These characteristics are inconsistent with what RDBMSs expect, which is that most columns are filled with values.

Finally, unstructured information may or may not conform to a predefined schema. If it does conform, the schema might be poorly defined, not followed strictly, or not known in advance. Even in the case of predefined schemas, large variances may be allowed, making each item appear very different from the next. RDBMSs expect rigid, predefined schemas with predefined data elements, so unstructured information is a poor fit.

While some organizations try to map unstructured information into rows and columns, they face huge tradeoffs. Either data accessibility is compromised, or the system takes a significant performance hit due to inefficient storage and indexing.

Changing in Unpredictable Ways
When unstructured information evolves, it changes in unpredictable and unannounced ways. New standards, new sources, and new applications are created continually, and there are generally no restrictions on how information is updated. Take a contract as an example. If an attorney amends a contract to revise terms, she updates it in any way she desires without formatting restrictions. She is not limited by the number of words or sentences, or even by the location of the amended text. She typically uses a word processing program like Microsoft Word to make updates, and the user interface does not have hard rules on how the contract should be changed. There also is no preparation required by IT staff to plan for the changes, as the attorney makes the changes ad hoc.

Contrast this with structured information, which changes in well-known ways. For example, each value in an RDBMS changes in an expected way: numbers are increased or decreased, dates are replaced with other dates, and text strings are updated within predefined lengths. And when the schema changes, the system is first updated to accommodate that change. Schema changes must be announced before they can be handled by the system. The IT staff necessarily knows what type of changes will be made by users to structured information before the changes can be made. RDBMSs are good for predictable and announced changes, but are not efficient for the changes that unstructured information undergoes.

Text-Centric
Unstructured information is heavily text-centric. It contains language ambiguities
typically not clear for processing by computers. For example, a word such as “foot” can have several different meanings including a body part, the bottom of something, or 12 inches. The definition is dependent on the context. Without proper context, users may encounter many false positives, in which they retrieve irrelevant information. They may also encounter many false negatives, in which they miss relevant information described using different terminology.

Also, text within unstructured information lacks specific identifiers to help define various data elements. In comparison, column names such as “first_name” in an RDBMS table leave no ambiguity about the meaning of the data values. While human readers can easily find names in unstructured information such as a contract, this is far less obvious when the text is processed by a computer. Since RDBMSs were designed for tabular data, they do not have the functionality to properly handle the text-centric nature of unstructured information.

Exponentially Growing
Analysts estimate unstructured information grows 10 to 50 times faster than structured information. Information in general continues to grow at a tremendous rate, with one estimate at 800% over the next five years. This rapid growth of unstructured information requires new approaches and strategies pertaining to performance and scalability. Though hardware advancements help with scaling, those are only part of the solution. Software must be optimized with modern hardware in mind to maximize efficiency. Organizations that rely on older technologies must choose between excessive expenditures and insufficient functionality when facing today’s unstructured information loads.

MarkLogic Addresses Unstructured Information
Based on the characteristics of unstructured information in the previous section, it is clear that today’s most popular technologies are not able to fully leverage unstructured information. RDBMSs lack the flexibility to efficiently handle unstructured information, and search engines lack the management and update capabilities that applications require. Content management systems, which are largely workflow-oriented applications built on RDBMSs and search engines, suffer the same challenges because of the limitations of the underlying platform.

Despite this, many organizations still try to use their current tools, with limited success. But now organizations no longer have to compromise. Since MarkLogic was designed for leveraging unstructured information, it has important features that lead to significant benefits. Some of those key features are described below.

Universal Index
MarkLogic’s Universal Index is a key feature for addressing the heterogeneity of unstructured information. It captures all information users need for precise, high-performance queries. Application development teams spend less time on data modeling, re-modeling, and performance tuning, thus expediting time-to-market and lowering total cost of ownership. Unstructured information wants to be unrestricted, and the Universal Index allows that.

The Universal Index allows users to query all information that the system sees, rather than only the information the system is told to see. In other words, the Universal Index enables MarkLogic to make no presumptions around what
information should be expected and enables the system to store information “as is” without requiring time-consuming data modeling to standardize disparate information formats. This is also referred to as being “schema-agnostic” or “schema-permissive”: information with any schema, or even no schema, can be loaded into MarkLogic with no prior planning. The index automatically captures all elements in information, including words, structure, dates, and numbers. This means no information is lost, and all elements can be queried and retrieved.

In addition to effectively handling heterogeneous information, the Universal Index also addresses the complexity of unstructured information due to hierarchy, irregularity, and poor schema definition. It also provides the flexibility to accommodate the wide variety of changes end users make to their information.

XML Documents as the Data Model
To properly handle the complexity of unstructured information, MarkLogic uses a data model based on XML documents, which is more efficient and effective for storing unstructured information than the relational model. Support for W3C-standard XSLT and XQuery, both purpose-built for XML, enables fast and easy querying and transformation. MarkLogic customers have experienced significant improvements in agility and efficiency by eliminating the resource drain of trying to model and store unstructured information in an RDBMS.

An XML data model gives MarkLogic several important advantages for leveraging unstructured information. First, embedded markup in XML creates context to enable granularity for access, updates, reuse, and repurposing. Second, XML is extensible, so new data elements can be added ad hoc without having to redesign a schema. Third, XML has the flexibility to fully capture and model the unpredictable and irregular aspects of unstructured information, including non-discrete data elements, hierarchical elements, variable-length text, and sparseness of data.

Using XML documents as the data model was a natural architectural decision for MarkLogic Server. XML is ideal for fully exploiting unstructured information despite the heterogeneity, complexity, and unpredictable change. MarkLogic’s use of XML ensures it can handle current and future requirements around unstructured information.

Transaction Controller
Delays in access to information are often due to limitations in technology. With unpredictable changes in unstructured information, including those pertaining to standards, formats, and content, the potential for delay is increased. MarkLogic Server was designed to immediately accommodate those types of changes, thus eliminating the latency found in structured technologies. As mentioned earlier, MarkLogic’s Universal Index and XML data model provide the flexibility to offset the design overhead for new information types.

Those features represent only part of the real-time access capability. MarkLogic’s ACID (atomicity, consistency, isolation, durability) transaction controller ensures newly inserted information is indexed in real time and available to users immediately. Its multi-version concurrency control (MVCC) ensures rapid insertion with minimal resource contention. Indexing can be done simultaneously with heavy query loads with no blocking so
organizations do not have to settle for delayed information access. And for the most time-sensitive information, MarkLogic’s real-time alerting quickly and efficiently processes millions or billions of queries against a fast incoming feed of new information.

Search and Analytics Capabilities
Resolving language ambiguities is an important requirement in handling text-centric unstructured information. MarkLogic Server helps end users find and make sense of the information they have in two ways. First, it provides features to make information clearer. Second, it provides several techniques for finding evidence as the basis for relevance.

To make information clearer, MarkLogic helps with the identification of meaning and context in information. For example, integration with entity enrichment tools enables identification of entities such as people, places, and things. Range indexes provide structure around specific values to enable precise and fast retrievals, as well as sorting, aggregations, and lookups. Support for extensible metadata schemas allows adding any type of identifying data to existing documents.

To improve relevance in searches, MarkLogic Server provides capabilities found in leading enterprise search engines such as phrase, proximity, and thesaurus searches. In addition, MarkLogic supports highly tunable relevance ranking to more precisely match the end user’s needs. The Universal Index captures all components of information to enable a higher level of specificity, granularity, and structure in searches. Range indexes enable classification and faceted navigation, to help organize information in meaningful and structured ways for faster discovery by end users. Geospatial searching enables location-based information retrieval. And finally, built-in co-occurrence analysis reveals hidden relationships between various entities in a collection of information.

Shared Nothing Architecture
MarkLogic’s shared nothing architecture allows high performance and massive scalability to address the unanticipated growth of unstructured information. MarkLogic is optimized for commodity hardware, and exhibits linear scaling to easily and efficiently grow to handle future needs. As the user or information load increases, performance and response times can be maintained by adding servers to a cluster.

MarkLogic has been deployed in clusters of over 100 hardware servers, with expectations of customers moving well beyond that in the near future. Not only do customers gain cost savings by leveraging commodity hardware, and fewer servers overall, but the lower administrative overhead has resulted in the ability to reallocate human resources to higher-value activities. At one customer site, only one-half of a full-time equivalent is required to administer the 100-server MarkLogic cluster.

Summary
The focus on unstructured information has increased over the years, but the ubiquity of RDBMSs has misled many organizations into making tradeoffs around functionality, time-to-market, total costs, and performance. Since RDBMSs were designed for structured information, which is greatly different from unstructured information, there is a clear mismatch that leads to costly inefficiencies.

With its Universal Index, XML data model, transaction controller, search and
analytics capabilities, and shared nothing
architecture, MarkLogic is the right choice
for tackling the challenges of unstructured
information. Customers report significant
gains with MarkLogic Server, including 10- to
100-fold performance improvements,
time-to-market in weeks instead of years, and
scaling to hundreds of terabytes today
and petabytes tomorrow.
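
As a rough illustration of two ideas summarized above, storing documents “as is” and indexing both words and structure, the following Python sketch builds a toy index over two differently shaped XML documents without declaring any schema. It is not MarkLogic’s Universal Index or its API: the sample documents, the function name build_index, and the index layout are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two hypothetical documents with unrelated structures; no schema is declared.
DOCS = {
    "memo.xml": "<memo><to>Legal</to><body>Review the lease contract</body></memo>",
    "contract.xml": (
        "<contract><clause><title>Lease</title>"
        "<text>Rent is due monthly</text></clause></contract>"
    ),
}


def build_index(docs):
    """Index every word, and every (element path, word) pair, by document URI.

    Simplification: only text directly inside an element is indexed;
    text that trails a child element (elem.tail) is ignored.
    """
    words = defaultdict(set)
    path_words = defaultdict(set)

    def walk(elem, parent_path, uri):
        path = f"{parent_path}/{elem.tag}"
        for token in re.findall(r"\w+", (elem.text or "").lower()):
            words[token].add(uri)
            path_words[(path, token)].add(uri)
        for child in elem:
            walk(child, path, uri)

    for uri, xml_text in docs.items():
        walk(ET.fromstring(xml_text), "", uri)
    return words, path_words


words, path_words = build_index(DOCS)

# Word-only query: every document that mentions "lease" anywhere.
print(sorted(words["lease"]))  # ['contract.xml', 'memo.xml']

# Word-in-context query: "lease" only where it appears in a clause title.
print(sorted(path_words[("/contract/clause/title", "lease")]))  # ['contract.xml']
```

The second query shows how structure narrows a plain word search, which is one way to reduce the false positives discussed in the Text-Centric section; a production index would also need to handle updates, ranking, and scale, none of which this sketch attempts.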
About MarkLogic
MarkLogic Corporation is revolutionizing
the way organizations leverage information.
Our flagship product is a purpose-built database
for unstructured information. Based on
patented innovations, MarkLogic Server
enables customers in industries including
media, government, and financial services
to develop and deploy information applications
in a fraction of the time and at a fraction of
the cost of conventional approaches.
The company is led by pioneers in search
engine technologies, database management
systems, and business intelligence software.
Our founder saw that the traditional ways of
managing and delivering information using
relational databases and search engines
were no longer sufficient. The increasing
volume and variety of information necessary
for enterprises to leverage required a
radically new approach.