Contents
Introduction
What is NoSQL?
Key-Value Stores
Document Stores
Wide Column Stores
Graph Databases
From Relational to Relationships
Graphs and ORM
NoSQL Database Common Traits
Shared Legacy: MapReduce, Hadoop, BigTable and HBase
NoSQL Database Consistency
Logical Models, Physical Models, and the Ubiquity of Key-Value Pairs
NoSQL Indexing
NoSQL options on the Windows Azure Platform
Azure Table Storage
SQL Azure XML Columns
SQL Azure Federation
OData
What the Support Means
Running NoSQL Database Products using Azure Worker Roles, VM Roles and Azure Drive
On-Premise Technologies
SQL Server 2008/2008R2 “Beyond Relational” Features
SQL Server Parallel Data Warehouse Edition
Microsoft Research Dryad
NoSQL Upsides, Downsides
Upsides
Lightweight, low-friction
Minimalist tool requirements
Sharding & Replication
Web Developer-Friendliness
Cross-Platform, Cross-Device Operation
Downsides
Optimizations Have a Price
Requirement to Query using a Procedural Language
Necessity to Scale Manually
Primitive Tooling
Lack of ACID Transactional Capabilities in Some Products
Conclusion: Relational’s Continued Indispensability in Line-of-Business
Introduction
Just at the time when the database market seemed to many to be almost completely mature, a group of
non-relational data stores, collectively categorized as “NoSQL” databases, has attracted significant
attention. These databases are often employed in public, massively scaled Web site scenarios, where
traditional database features matter less, and fast fetching of relatively simple data sets matters most.
Many of these databases employ parallelized query mechanisms and horizontal partitioning, and allow
storage of heterogeneous, loosely-schematized data records.
With so much developer mindshare being focused on the Web these days, and with the constant thirst for
performance amongst technologists, especially for large Web applications, it’s no wonder that NoSQL
databases are seen favorably and used by an enthusiastic population of developers. As Cloud computing
grows, and given the proclivity of developers to conflate Web computing and scale with Cloud computing
and elasticity, interest in NoSQL databases amongst cloud developers is equally unsurprising. Together,
these streams of interest and visibility are significant; understandably, then, even users of traditional,
relational databases are exploring the question of whether NoSQL technology is something they should
use, too.
There’s no free lunch though. Although NoSQL databases do facilitate the performance and availability
that public Web properties sometimes require, the cost can be great. Things that users of a Relational
Database Management System (RDBMS) would take for granted, including some or all of: transactional,
atomic writes; indexing of non-key columns; query optimizers; and declarative, set-oriented query, are
sacrificed in the NoSQL world. In certain scenarios, that sacrifice is justified and acceptable. But in many
others, including line-of-business applications, that sacrifice is much less reasonable.
As with anything in the software world, when technologies enter the realm of phenomena, the prudent
thing to do is deconstruct and demystify them, understand and enumerate their various capabilities, then
judge if those capabilities merit the enthusiasm and justify a disruption. Specifically, in the realm of cloud
computing with the Microsoft stack, i.e. Windows Azure and SQL Azure, important questions arise with
respect to NoSQL, and need to be answered.
What exactly is NoSQL, and what characterizes its various subcategories? Are individual facets of NoSQL
database architectures available to Azure developers? Are they sufficient or will only a full-blown NoSQL
technology fulfill most requirements? Where in the Azure stack do these NoSQL technologies sit? For the
types of applications that .NET and SQL Server practitioners build, is NoSQL better than relational? Is it
even as good? These questions must be explored and answered before the larger question of NoSQL’s
(or relational’s) overall efficacy can be judged.
In this paper, we will define NoSQL, explore some of its history, review the various types of NoSQL
databases, and understand their respective features. We will determine the commonalities between the
various NoSQL subcategories and try to determine what basket of features seems to attract developers the
most. We’ll examine the scenarios where use of NoSQL makes the most sense. We’ll distill the
enumeration of NoSQL features down to the overall tradeoffs between NoSQL and relational databases.
We will also review the various components of the Azure stack that offer NoSQL technology, or
capabilities that are comparable to those found in NoSQL databases. We will look at Windows Azure
Storage, new and imminent features in SQL Azure, and even ways to deploy non-Microsoft, NoSQL
databases to the Azure cloud, to make them usable from .NET code that is also deployed there. By the
end of this paper, readers should have a good understanding of what NoSQL is all about and whether
individual NoSQL features, full-fledged NoSQL databases or continued use of relational technology will
work best for them.
Let’s now define NoSQL, by examining the general use cases that it serves. We’ll also discuss the
subcategories of NoSQL and take a more detailed look at each of them.
What is NoSQL?
There are scenarios in the software development world where data management is required, but what
many of us might think of as a full-fledged database is not. Think of that application you wrote once that
had a small amount of data to store, and did it using flat files, so you could avoid creating a database.
Maybe you needed to store a few bits of information about the current user; maybe you needed to store
application settings, or application state information, like window size and position; or perhaps you
needed to store and retrieve actual content – be it raw text, images, or media – and the file system
seemed to make more sense than a relational database as the repository.
Now imagine an application like that one you wrote, but which ran on the Web and needed to serve a
vast array of users distributed across the globe, many of them concurrently. You would find that your
database needs, while still technically modest in terms of query complexity, would almost certainly
outstrip what you could do comfortably using the file system. You’d need a server, or even a globally
distributed cluster of servers. The server or cluster would need to be highly scalable to meet the demands
of a popular Web-based application, and very fast at performing these relatively simple discrete store and
fetch operations. You would need a database, but probably not the relational one you’re used to.
The grouping of database engines collectively referred to as “NoSQL” is optimized for these workloads.
Most of them sport distributed architectures as a core feature. Many of them are Apache or independent
open source projects.
NoSQL databases are good at what they do, primarily by dispensing with many of the tenets of relational
database management. Many NoSQL databases trade off “ACID” (atomicity, consistency, isolation and
durability) guarantees in favor of very high performance in the broad-scale, simple store-and-retrieve
scenario. And as we mentioned already, NoSQL databases, to varying degrees, even allow for the
schema of data to differ from record to record. The “CAP” theorem says that a distributed database can
fully provide at most two of the following three attributes: consistency, availability and partition tolerance.
Relational databases favor the first and last of those three properties; NoSQL databases favor the last two.
In other words, NoSQL intentionally de-emphasizes the rules and functionality of consistency that many
database administrators and developers think of as the very prerequisites of database management.
In his paper Amazon's Dynamo [1] (Dynamo is the online retailer’s foundational NoSQL database), Werner
Vogels, Amazon.com’s Chief Technology Officer, describes why such an approach is appropriate: “Most of
these services only store and retrieve data by primary key and do not require the complex querying and
management functionality offered by an RDBMS.” In other words, various systems on the Web, many of
which are consumer-facing, don’t have sophisticated database needs, but they nonetheless have a huge
burden. They must carry out their simple needs very, very quickly.
NoSQL databases handle these workloads well, but to do so they make serious concessions on otherwise
mainstream database needs. That is well-justified, but not always well-understood; in
fact there exist NoSQL practitioners who advocate the usage of NoSQL as a general database technology
applicable to the mainstream of application database needs. Such advocacy has led some relational
database customers to worry that they should perhaps switch to NoSQL databases even for
line-of-business (LOB) applications.
Customers have these concerns despite the fact that most LOB apps require transactional guarantees, and
are well-served by normalized design and formal schema. This can be a controversial state of affairs and
we hope to sort out that controversy. For now though, let’s just say that NoSQL databases work well in
certain scenarios, and that sketching out what those scenarios are, and what they are not, is an important
goal of this paper.
To help enumerate those scenarios, it’s best that we discuss the four subcategories that NoSQL databases
tend to break down into. Enumerations of such subcategories tend to vary, but they usually include
Key-Value Stores, Document Stores, Wide Column Stores and Graph Databases. Each NoSQL subcategory
serves certain scenarios best. To understand core NoSQL scenarios as well as we can, let’s explore the
various NoSQL subcategories and the specific types of applications and workloads they support most
ably.
Key-Value Stores
The Key-Value Store subcategory (summarized
graphically in Figure 1) is perhaps the mother of all
NoSQL database types. Most NoSQL databases
feature key-value mechanisms, even if only behind
the scenes. NoSQL databases that belong to the
explicit Key-Value Store category use their namesake
construct as the basic unit of storage. A key-value
pair might consist of a key like “Phone Number” that
is associated with a value like “(212) 555-1212.” Key-
Value Stores contain records whose entire content is
made up of such pairs; the structure of one record
can differ from the others in the same collection.
[1] http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
Figure 1: Key-Value Stores often use the nomenclature of
tables and rows, but the latter simply contain collections
of key-value pairs, which vary from row to row.
If you do much programming, you’ll recognize this construct right away. That’s because collections,
dictionaries and associative arrays in the programming world work on the same principle. Data caches
work on the key-value principle as well. In fact, one prominent Key-Value Store, MemcacheDB, is API-
compatible with the Memcached open source cache.
The parallels between Key-Value Stores on the one hand, and collections, dictionaries, associative arrays
and caches on the other, are more than academic; they’re significant. They show that NoSQL databases work well
in circumstances where data retrieval needs to be cache-like in speed and where the data which must be
stored and retrieved consists of small, simple collections of attributes and values.
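To make the dictionary analogy concrete, here is a minimal sketch of a Key-Value Store modeled with plain Python dictionaries. The function names (put, get), the record keys and the sample data are all hypothetical and chosen only for illustration; a real product would add distribution, persistence and replication on top of this idea.

```python
# A toy, in-memory stand-in for a Key-Value Store (illustrative only).
# Each record is addressed by a key and holds its own set of key-value pairs.
store = {}

def put(record_key, pairs):
    """Store a record as a collection of key-value pairs."""
    store[record_key] = dict(pairs)

def get(record_key, default=None):
    """Fetch a whole record by its key; no query language is involved."""
    return store.get(record_key, default)

# Hypothetical sample records; note that they do not share the same keys.
put("user:1", {"Name": "Chris", "Phone Number": "(212) 555-1212"})
put("user:2", {"Name": "Pat", "Favorite Color": "Green", "Landing Page": "/home"})

keys_user1 = set(get("user:1"))
keys_user2 = set(get("user:2"))
```

Lookups here are by key only, which is the cache-like access pattern described above; anything resembling a join or a filter on a non-key attribute would have to be written by hand in application code.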
Applications where Key-Value Stores would work well include anything where lists, like product
categories, individual product attributes, shopping cart contents and top n best-selling products, or
individual values like color schemes, a landing page URI, or a default account number, must be
maintained. Values can consist of long text content, not just numeric and short string data. As such,
content like comments, reviews, status messages or even private emails can be stored in a Key-Value
Store. Most of this data is non-hierarchical, so the lack of relational logic or join constructs is acceptable.
Some of this key-value-appropriate data (though probably not the long text content) is akin to lookup
data, or configuration and preference data, in smaller applications. For a desktop app, we could imagine
this data might be stored in a configuration file or a small, offline database. We could also imagine that
much of it might do well to be loaded in memory upon application startup. For a consumer-facing Web
app, the data is similarly straightforward, but the storage technology itself must be more capable. The
data must live in a repository that is distributed, fault tolerant, fast and highly available.
Beyond MemcacheDB and Dynamo lie other Key-Value Stores. Project Voldemort is an open source Key-
Value store that originated at LinkedIn; and Dynomite, Kai and Riak are open source derivatives of
Dynamo (which is not open source, nor publicly available, even though its architecture has been disclosed
through published papers).
Before we go on to describe other NoSQL database types, we must reiterate that almost all of them,
whether physically or conceptually, build upon Key-Value Store principles. Therefore you should expect
their applications to be more specialized than, but not wholly distinct from, those of Key-Value Stores
themselves.
Document Stores
Document Stores are NoSQL databases which treat what might be otherwise called “records” or “rows” as
“documents.” As with Key-Value Stores, each record can have a structure widely differentiated from the
others. Each document consists of a set of keys and values, which can be compared to a relational table’s
field names and values. The Document Store data structure is summarized in Figure 2.
Two leading Document Stores, CouchDB and MongoDB, each use JavaScript data types for the values
stored in their documents. Because of this, their documents can be thought of as JavaScript objects and
can, in fact, be written and read in JSON (JavaScript Object Notation) format. That doesn’t mean
Document Stores equate to Object Databases, but it does mean that Document Stores have an affinity
with JavaScript programming and programmers. In fact, the native stored procedure/scripting language
for both CouchDB and MongoDB is JavaScript itself.
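As a rough illustration of documents as JSON objects, the sketch below builds two differently shaped documents and round-trips them through JSON text. It uses Python's standard json module rather than any CouchDB or MongoDB client, and the field names and values are hypothetical sample data.

```python
import json

# Two hypothetical documents in the same collection. Apart from "_id" and
# "type", they share no fields; one nests a list of sub-objects.
documents = [
    {"_id": "post-1", "type": "blog_post", "title": "Hello", "tags": ["intro", "nosql"]},
    {"_id": "event-7", "type": "appointment", "when": "2011-06-01T09:00",
     "attendees": [{"name": "Chris"}, {"name": "Pat"}]},
]

# Write each document as JSON text, then read it back.
serialized = [json.dumps(doc) for doc in documents]
restored = [json.loads(text) for text in serialized]
```

The round trip preserves each document's individual structure, which is the property that lets records in a Document Store differ widely from one another.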
Documents can also contain attachments, making document stores useful for content management. The
fact that certain Document Stores feature versioning of their documents (i.e. old versions are retained and
all versions are numbered) makes this all the more so.
CouchDB and MongoDB have been used for an array of public-facing Web application types including
blog engines, event logs, appointment calendars, media stores, chat applications, cloud bookmark storage
and even Twitter clients.
An important facet of Document Stores is that
the documents themselves can be addressed
by unique URLs. And given the HTTP and URL
orientation, document databases are
automatically REST-friendly, as their APIs bear
out. In the case of CouchDB, the HTTP
orientation is developed to the point where
the database can function as its own Web
application server.
Here’s how: so-called Show Functions in
CouchDB – JavaScript functions that render
HTML with the return statement – can be
stored in special documents called design
documents, and each function within is
accessible via URL. This means that entire
Web applications can be implemented in a
document database. Users visit a URL, code
runs on the server and content is returned via the HTTP response stream, just as it would be with classic
ASP, node.js, ASP.NET Web Pages or PHP.
This HTTP and application orientation distinguishes Document Stores from Key-Value Stores, the latter of
which are more general purpose in their implementation and application. That said, there are some
NoSQL taxonomies which do not recognize the Document Store category and instead label its members
as Key-Value Stores.
As you will see, the remaining two NoSQL subcategories utilize key-value technology as well.
Figure 2: Document Stores contain JSON objects, referred to as
documents, each of which has a schema-free set of properties
and values. Values may contain attachments, point to other
documents, or directly contain them.
Wide Column Stores
Wide Column Stores, also known as Column Family Stores, manage key-value pairs, but they organize
their storage in a semi-schematized and hierarchical pattern. Perhaps fittingly then, some of their
nomenclature correlates with that of RDBMS technology. For example, the keys in a Wide Column Store
are referred to as columns, and are stored in structures that are sometimes referred to as tables. In
between the table and column level lie various intermediate structures that vary by product. For example,
Apache Cassandra (originated by Facebook) features Super Columns. Hypertable and Apache HBase
feature Column Families, and Google’s BigTable features Tablets. The hierarchical structure and some of
the varying nomenclature of Wide Column Stores are summarized in Figure 3.
Although the schema within the intermediate structures can vary from row to row, tables and the
intermediate structures themselves must be declared. Therefore, Wide Column Stores, while they tolerate
schema variation at the “leaf” column level, are not completely schema-free. One could reasonably argue,
in fact, that schema changes at the non-leaf level in Wide Column Stores are more disruptive than
changes to table schemas in relational databases.
Wide Column Stores work well for a subset of
requirements that Key-Value Stores accommodate
and many adopters of this category of NoSQL
database cite the performance factors, over the
structural ones, as reasons they chose it. But,
clearly, Wide Column Stores are best for semi-
structured data, rather than data whose structure is
completely variable from row to row.
As an example, in a product catalog, we may have
a collection of items, each of which has a size and
a rating associated with it, and we may want to
store these items together in a table. But certain
items’ sizes may be represented by height, width
and depth, others by radius, and still others by
weight. The rating may be a star rating on a 1-5
scale (e.g. for a book), or collection of sub-ratings
on various attributes (e.g. freshness, flavor, color,
moistness). Accommodating a grouping of entities
with high-level characteristics in common, but with
differing context-specific attributes, is one area
where Wide Column Stores do well.
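The product catalog example can be sketched with nested Python dictionaries. This is only a conceptual model, not the API of Cassandra, HBase or any other product: the DECLARED_FAMILIES set, the put_column helper and the sample rows are hypothetical. It assumes the intermediate level (here, column families named "size" and "rating") must be declared up front, while the columns inside each family are free to vary by row.

```python
# Conceptual layout: row key -> column family -> {column name: value}.
# Families must be declared in advance; columns within them need not be.
DECLARED_FAMILIES = {"size", "rating"}
catalog = {}

def put_column(row_key, family, column, value):
    """Write one column, rejecting families that were never declared."""
    if family not in DECLARED_FAMILIES:
        raise KeyError(f"column family {family!r} has not been declared")
    catalog.setdefault(row_key, {}).setdefault(family, {})[column] = value

# A book: sized by height, width and depth, rated in stars.
put_column("book-1", "size", "height", 9.0)
put_column("book-1", "size", "width", 6.0)
put_column("book-1", "size", "depth", 1.2)
put_column("book-1", "rating", "stars", 4)

# A ball: sized by radius only, with no rating yet.
put_column("ball-7", "size", "radius", 11.0)

# A cake: sized by weight, rated on several attributes.
put_column("cake-3", "size", "weight", 1.5)
put_column("cake-3", "rating", "freshness", 5)
put_column("cake-3", "rating", "flavor", 4)

# Every row lives in the same table, yet each has its own "size" columns.
sizes = {row_key: sorted(row.get("size", {})) for row_key, row in catalog.items()}
```

All three items sit in one table and come back in one result set, even though no two of them describe size with the same columns.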
In the relational world, traditionally, such context-specific attributes would each need to be stored in
separate tables, with a foreign key in the main table to link them [2]. Joins and application-level merging of
the datasets might be necessary. But Wide Column Stores allow such differently nuanced data to
commingle in the same tables and query result sets.
[2] Recent versions of major RDBMS products offer new features to accommodate this requirement without
resorting to separate attribute tables. Such features in SQL Server and SQL Azure will be discussed later in
this paper.
Figure 3: Wide Column Stores contain tables
(indicated above as “T”); Cassandra calls them “super-
column families” (shown as “SCF”). These contain a key
and columns (“C”) which consist of name/value pairs.
Columns are subdivided into column families (“CF”),
which are known as “super columns” (“SC”) in
Cassandra. Columns are schema-free, but higher-level
objects must be declared.
Graph Databases
Graph databases recognize entities in a business or other domain, and explicitly track the relationships
between them. In the graph database world, these entities are called nodes and the relationships between
them are called edges; all of these terms come from mathematical graph theory as does this NoSQL
database subcategory’s name. An example of a graph database assertion (the fundamental atomic unit of
data expression) might be:
Chris city Auckland
Where Chris and Auckland are nodes and city is an edge.
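Assertions of this kind can be sketched as (subject, edge, object) tuples. The example below is a minimal illustration in plain Python, not the interface of any particular graph database; only the first assertion comes from the text above, and the other assertions and the objects helper are hypothetical.

```python
# Each assertion is a (subject node, edge, object node) triple.
assertions = [
    ("Chris", "city", "Auckland"),    # the assertion from the text
    ("Chris", "friend", "Pat"),       # hypothetical
    ("Pat", "city", "Wellington"),    # hypothetical
]

def objects(subject, edge):
    """Follow one kind of edge outward from a node."""
    return [o for s, e, o in assertions if s == subject and e == edge]

# Traverse two edges: the cities in which Chris's friends live.
friend_cities = [
    city
    for friend in objects("Chris", "friend")
    for city in objects(friend, "city")
]
```

Adding a new kind of relationship means appending another triple; no schema change or intermediate table is involved.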
From Relational to Relationships
As we try to orient ourselves to graph
databases from a relational frame of
reference, we could think of an edge in a
graph database (a predicate) as a join, and
the subject and the object of that predicate
(the Chris node and the Auckland node,
respectively, in the above case) as rows in a
table. Attributes of a node that have scalar
values (for example the attribute Age
might have a value of 45) can also be
represented using edges and nodes, or as
properties and values, depending on the
specific graph database in use. In the
former case, an edge might be thought of as
a column, in a broad sense, rather than as a
join. A collection of assertions is kept
together in a graph. The structure of Graph
Databases is illustrated in Figure 4.
New edges can be added (or old ones removed) at any time, allowing one-to-many and many-to-many
relationships to be expressed easily and avoiding anything like an intermediate relationship table that you
might use in a relational database to accommodate many-to-many joins.
Social graphs fit into the graph database rubric nicely (as does the name). Constructs like friends,
followers, degrees of separation, lists, endorsements, status messages and responses to them are very
naturally accommodated in graph databases. Semantic Web data also maps quite nicely onto the graph
database structure.
Figure 4: Graph databases, like those in other NoSQL
subcategories, may be key-value based, but they excel at tracking
relationships (edges) between entities (nodes), in addition to the
entities, keys and values, themselves. Sometimes even the key-value
pairs are represented as edges and nodes.
Graphs and ORM
As we consider the concepts of properties, values and relationships, it starts to become clear that graph
database theory has some alignment with object-relational modeling and ORM programming. This then
raises the question of whether object databases belong in the NoSQL camp or even of whether they are in
fact synonymous with graph databases. There really are no rules or strict definitions to provide
authoritative answers to these questions, but there are differences in intent between graph and object
databases. Object databases typically are schema based (even if the schema describes a class rather than
a table) and are focused on entities and their properties. Graph databases are designed to accommodate
slowly- or even rapidly-changing schemas and focus on relationships between entities more than the
entities themselves.
Popular graph databases include AllegroGraph, Neo4j and Twitter’s FlockDB.
NoSQL Database Common Traits
Having now covered the four main NoSQL subcategories, and what distinguishes them, let’s take a look at
the qualities which each category’s products have in common. We’ll first look at a pair of technologies
from Google (and their Apache project counterparts) whose design principles pervade all NoSQL
subcategories. We’ll continue with a general look at the data consistency models employed in NoSQL
databases and the split between NoSQL’s physical and logical implementations. We’ll finish with a look at
NoSQL indexing and we’ll then be able to move to the next section and review the various features and
products within Windows Azure and SQL Azure that provide NoSQL functionality.
Shared Legacy: MapReduce, Hadoop, BigTable and HBase
It’s a good idea for us to take a look at two technologies which underlie, or have provided inspiration for,
many of the individual products in each NoSQL subcategory: Google’s MapReduce and BigTable,
along with their open source counterparts, Apache Hadoop and Apache HBase. Google MapReduce
and the open source Hadoop project provide generalized parallel job processing engines; Google
BigTable and the open source HBase are Wide Column Stores whose tables can serve as sources and
destinations for the MapReduce and Hadoop jobs, respectively.
Why are the job processing engines necessary? Because the less structured, less formal approaches
employed by NoSQL databases make querying them less straightforward than in the relational world, and
MapReduce/Hadoop help mitigate the burden.
Think about it: although explicit joins are not necessary in the NoSQL world, the permissive environment
and resulting inconsistency across records/entities/documents make for quite a bit more hunting and
gathering in order to satisfy a query. This is especially true for distributed NoSQL databases which store
their data across various servers, typically using a partitioning pattern called sharding (more on that later).
The lack of query optimizers, and corresponding query efficiencies, in NoSQL databases cries out for some
help.
NoSQL databases often require queries to be broken up and executed across multiple repositories on
different servers. At some point, the resulting segmented result sets need to be collected and unified. An
approach called map-reduce acknowledges and addresses this conundrum. Specifically, the process of
distributing the query across multiple agents is the Map step, and the process of coalescing the results
into a single result set is the Reduce step.
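The two steps can be sketched in a few lines of Python. This is a single-process illustration of the general map-reduce idea, not Google MapReduce or Hadoop: the shards list stands in for data partitioned across servers, and the shard contents, map_shard and reduce_partials are hypothetical names and data. In a real deployment each map call would run on the server that holds the shard.

```python
from collections import Counter
from functools import reduce

# Hypothetical order records, partitioned across three "servers".
shards = [
    [{"product": "book", "qty": 2}, {"product": "pen", "qty": 5}],
    [{"product": "book", "qty": 1}],
    [{"product": "pen", "qty": 3}, {"product": "lamp", "qty": 1}],
]

def map_shard(records):
    """Map step: answer the query against one shard only."""
    partial = Counter()
    for record in records:
        partial[record["product"]] += record["qty"]
    return partial

def reduce_partials(left, right):
    """Reduce step: coalesce two partial results into one."""
    return left + right

# Run the query on every shard, then merge the segmented results.
partials = map(map_shard, shards)
totals = reduce(reduce_partials, partials, Counter())
```

Because the reduce function here is associative, partial results could be merged in any grouping, which is what allows the work to be spread across many agents.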
Map-reduce is a general algorithm, and is prevalent in functional programming languages – including F#
– which support the notion of map and reduce functions. MapReduce (without the hyphen) is the
patented software framework from Google that the company applies in the realm of managing large
datasets over clusters or other distributed topologies. Hadoop is the top-level Apache project which
implements map-reduce as a generalized highly parallel, divide-and-conquer batch job task manager.
Google MapReduce/BigTable and Apache Hadoop/HBase have their fingerprints all over most NoSQL
databases. For example, Apache CouchDB, one of the document store databases already discussed, is,
according to its Web site on apache.org, “queried and indexed in a MapReduce fashion.” Some would
argue that CouchDB’s map and reduce steps differ conceptually from those in MapReduce itself.
Nonetheless, the overarching map-reduce approach is the inspiration for the design of many NoSQL
products.
As effective as these mechanisms can be, they also introduce extra work for the database developer.
That’s because instead of providing a declarative language over distributed storage that could then be
implemented using map-reduce functionality under the covers, the architecture’s designers focused
primarily on the raw processing approach and never added a language abstraction. In the world of line-
of-business applications, the declarative power of SQL provides productivity that most organizations
count on. Map-reduce based systems, by and large, cannot provide that productivity.
A summary of the various NoSQL database subcategories, and the suitability of each to different scenarios
and requirements, including map-reduce, is presented in table form in Figure 5.
Figure 5: This chart shows the applicability of different NoSQL database types to different needs
or scenarios. Notice that wide column stores are more special-purpose than the other
NoSQL subcategories, which are applicable in a variety of scenarios.
NoSQL Database Consistency
Many NoSQL databases use an “eventual consistency” model for database updates and schema changes.
This means that changes made at one replica will be transmitted asynchronously to the others. Domain
Name Servers on the Internet refresh themselves on this model, and that is exactly why DNS propagation
delay can allow some Internet users to navigate successfully to a new or updated domain name, while for
other users the name may not resolve correctly. Eventually, all users’ DNS servers are updated and the
anomaly disappears.
The sacrifice of propagation delay is acceptable when the alternative (a coordinated atomic update across
all DNS servers globally) is considered. The eventual consistency model allows updates to occur and DNS
server availability to be maintained, all for the price of a temporary, tolerable, well-understood anomaly in
the data.
Likewise, in the NoSQL context, eventual consistency makes possible discrepancies in data state between
replicas, and thus between users and locations, for a temporary period. As with DNS servers, such
concessions to consistency are made in the name of high availability and will eventually resolve.
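The temporary discrepancy between replicas can be pictured with a small toy model. This is only a sketch under simplifying assumptions: the `Cluster` and `Replica` classes are hypothetical, propagation is triggered by an explicit call rather than running in the background, and conflicting concurrent writes are not handled at all.

```python
from collections import deque

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class Cluster:
    """Toy model: a write lands on one replica and is queued for the others."""
    def __init__(self, names):
        self.replicas = {n: Replica(n) for n in names}
        self.pending = deque()  # (target replica, key, value) not yet delivered

    def write(self, at, key, value):
        self.replicas[at].data[key] = value
        for name in self.replicas:
            if name != at:
                self.pending.append((name, key, value))

    def read(self, at, key):
        return self.replicas[at].data.get(key)

    def propagate(self):
        """Deliver all queued updates; afterwards the replicas agree."""
        while self.pending:
            name, key, value = self.pending.popleft()
            self.replicas[name].data[key] = value

cluster = Cluster(["us", "eu"])
cluster.write("us", "example.com", "10.0.0.2")
stale = cluster.read("eu", "example.com")  # update has not arrived yet
cluster.propagate()
fresh = cluster.read("eu", "example.com")  # replicas have converged
```

Between the write and the call to `propagate`, a reader at the second replica sees the old state, which is the same anomaly as the DNS propagation delay described above.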
Not all NoSQL databases use eventual consistency. Some are fully transactional. Others use an optimistic
concurrency model. Some databases, like Apache Cassandra and Apache HBase, not only replicate over
time, but commit their initial writes to disk over a certain latency period as well. In other words, these
databases perform buffered writes by writing to memory initially (and to a log), rather than tables on disk.
This is done in order to batch up the writes, rather than have them execute one at a time, since batching
reduces the aggregate I/O time required. It is completely different from the update behavior of an RDBMS.
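A minimal sketch of such a buffered write path follows. It is a toy model, not the implementation used by Cassandra or HBase: the class name, the `flush_threshold` parameter and the use of plain Python containers for the log and the on-disk table are all assumptions made for illustration.

```python
class BufferedStore:
    """Toy write path: append to a log, buffer in memory, flush in batches."""
    def __init__(self, flush_threshold=3):
        self.log = []          # stands in for an append-only log
        self.buffer = {}       # in-memory buffer of recent writes
        self.disk_table = {}   # stands in for the table on disk
        self.flush_threshold = flush_threshold
        self.flushes = 0

    def put(self, key, value):
        self.log.append((key, value))   # cheap sequential record of the write
        self.buffer[key] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the whole buffer to the table in one batch."""
        self.disk_table.update(self.buffer)
        self.buffer.clear()
        self.flushes += 1

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        return self.disk_table.get(key)

store = BufferedStore(flush_threshold=3)
for i in range(7):
    store.put(f"k{i}", i)
```

Seven writes produce only two batched table updates here, while the most recent write is still served from memory.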
The liberal consistency regimes of many NoSQL databases are quite appropriate in certain scenarios. It’s
important to remember that the transactional model is still the correct one in many others, including most
line-of-business applications. The supremacy of one model in certain circumstances does not render
established models obsolete in a variety, or even a majority, of others.
Consistency is not the only sacrifice made in the name of performance and high availability. For some
NoSQL databases, declarative query power is sacrificed as well. For example, “views” in CouchDB, rather
than being stored queries, are actually JavaScript programs that return data. They are somewhat akin to
stored procedures in the relational world, but even that analogy falters, as CouchDB views must iterate
through data imperatively rather than use the set-oriented constructs found in SQL.
The result is that individual query patterns must be optimized through code that anticipates them, rather
than through optimizing logic that encounters them. As with the consistency sacrifice, in some situations,
this may be perfectly acceptable. As we have discussed, many public Web applications perform a variety
of very simple queries and a small number of complex ones, all of which can be explicitly coded. But,
again, that’s not usually the case with LOB apps.
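To make the contrast with set-oriented SQL concrete, here is a Python analogy of a view that must be coded in advance. CouchDB views are written in JavaScript, so this is not CouchDB's API; the documents and the names `orders_by_customer` and `build_view` are hypothetical.

```python
# Hypothetical documents; note that their fields differ (schema-free).
DOCS = [
    {"_id": "a", "type": "order", "customer": "ann", "total": 30},
    {"_id": "b", "type": "note", "text": "call back"},
    {"_id": "c", "type": "order", "customer": "bob", "total": 12},
    {"_id": "d", "type": "order", "customer": "ann", "total": 8},
]

def orders_by_customer(doc):
    """View function: decides, per document, which (key, value) pairs to emit."""
    if doc.get("type") == "order":
        yield doc["customer"], doc["total"]

def build_view(docs, view_fn):
    """Visit every document one at a time and collect the emitted pairs."""
    rows = []
    for doc in docs:
        rows.extend(view_fn(doc))
    return sorted(rows)

view = build_view(DOCS, orders_by_customer)
```

A different question, such as orders above a given total, would need another hand-written function, where a SQL user would only change a WHERE clause.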
Logical Models, Physical Models, and the Ubiquity of Key-Value Pairs
The subcategory distinctions we’ve covered here are not only soft, but are logical model distinctions that
may or may not translate to the underlying physical models. For example, Cassandra, a Wide Column
Store, essentially imposes a logical “super column” hierarchy over key-value pairs. Key-Value Stores
underlie most other subcategories, either in terms of technique (such as how CouchDB’s documents are
actually key-value structures, in an overt fashion) or in implementation (such as how edges and nodes in a
graph database can be stored as key-value pairs as well, but behind the scenes).
Document Stores, Wide Column Stores, and Graph Databases are in some senses akin to domain-specific
languages (DSLs) in the programming world. While most NoSQL databases utilize key-value constructs,
distributed architectures and sharding, and allow for schema-free databases, the various NoSQL
subcategories provide different data interfaces, each of which works best in a subset of scenarios.
NoSQL Indexing
Despite the DSL analogy above, the common key-value substrate of most NoSQL databases does not
render the subcategories mere trivial abstractions. The wide spectrum of indexing features in the
various NoSQL databases makes this clear. Some NoSQL databases index on little else than the keys used
for rows/entities/documents and/or partitions. Others go a bit beyond this. For example, CouchDB
indexes documents only on their IDs and sequence (version) numbers, but it also creates indexes on
views. The AllegroGraph Graph Database, meanwhile, indexes everything (id, subject, predicate, object
and graph), automatically.
Some Key-Value and Wide Column Stores support so-called “secondary” indexes – a generic term for an
index built on the value of a property/column that is not the key. But secondary indexes are relatively
new features in some databases and still a bit immature. For example, Cassandra added secondary
indexes in version 0.7, which was just released on January 9, 2011. These secondary indexes are
essentially hash indexes only; support for bitmapped indexes, with which range criteria could be satisfied,
is in the works for a future release.
In the absence of secondary index support, some developers implement them on their own. The common
approach is to create a second table containing the values of the “indexed” column and their
corresponding row keys from the main table. This requirement is somewhat emblematic of NoSQL
databases in general: developers may need to implement on their own what could long be taken for
granted in an RDBMS. Again, in some situations, the tradeoff is deemed reasonable given the
performance and availability requirements, but the price should not be understated.
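The do-it-yourself approach can be sketched as follows. This is a toy in-memory model, not any product's API: the `Table` class and its method names are hypothetical. It shows the second "table" mapping column values to row keys, and the bookkeeping the developer takes on to keep it in step with the main table.

```python
class Table:
    """Toy key-value table with a hand-maintained index on one column."""
    def __init__(self, indexed_column):
        self.rows = {}    # main table: row key -> row
        self.index = {}   # "index table": column value -> set of row keys
        self.indexed_column = indexed_column

    def put(self, row_key, row):
        old = self.rows.get(row_key)
        if old is not None and self.indexed_column in old:
            self._unindex(row_key, old[self.indexed_column])
        self.rows[row_key] = row
        if self.indexed_column in row:
            self.index.setdefault(row[self.indexed_column], set()).add(row_key)

    def _unindex(self, row_key, value):
        """Remove a stale index entry when a row is overwritten."""
        keys = self.index.get(value, set())
        keys.discard(row_key)
        if not keys:
            self.index.pop(value, None)

    def find_by_indexed(self, value):
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]

users = Table(indexed_column="city")
users.put("u1", {"name": "Ann", "city": "Oslo"})
users.put("u2", {"name": "Bob", "city": "Rome"})
users.put("u3", {"name": "Cy", "city": "Oslo"})
users.put("u2", {"name": "Bob", "city": "Oslo"})  # update moves u2 in the index
```

In a distributed store the two writes would not be atomic, so the index can briefly disagree with the main table; that is part of the price mentioned above.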
NoSQL options on the Windows Azure Platform
As we discussed in the paper’s introduction, a proper evaluation of NoSQL involves deconstructing and
deciding which features or characteristics are compelling. Next, you need to decide if those same features
or characteristics are available from technologies you already use. With that in mind, what follows is an
overview of certain Windows Azure and SQL Azure technologies (plus a few Microsoft on-premise
products and features) and which aspect of NoSQL technology each one implements. As you will see,
elements of NoSQL computing can pop up in some unexpected places.
Azure Table Storage
Azure Storage is probably the most compelling place to start on our tour of NoSQL in Azure. That’s
because Azure Table Storage is in fact a NoSQL database. Of the various categories of NoSQL database
discussed in the last section, Azure Table Storage fits most snugly with Key-Value Stores. Azure Storage
key-value pairs are called Properties; they belong to Entities which, in turn, are organized into so-called
Tables. Azure Table Storage features optimistic concurrency and, as with other NoSQL databases, is
schema-free, so the properties of each entity in a table may differ.
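The combination of schema-free entities and optimistic concurrency can be pictured with a small model. This sketch does not use the Azure client library; `EntityTable`, `ConflictError` and the integer version stamp are hypothetical stand-ins for whatever change marker the service actually compares.

```python
class ConflictError(Exception):
    pass

class EntityTable:
    """Toy table of schema-free entities with version-checked updates."""
    def __init__(self):
        self._entities = {}  # key -> (version, properties)

    def insert(self, key, properties):
        self._entities[key] = (1, dict(properties))

    def read(self, key):
        version, properties = self._entities[key]
        return version, dict(properties)

    def update(self, key, properties, expected_version):
        current_version, _ = self._entities[key]
        if current_version != expected_version:
            raise ConflictError(f"{key} changed since it was read")
        self._entities[key] = (current_version + 1, dict(properties))

table = EntityTable()
table.insert("cust-1", {"Name": "Ann"})
table.insert("cust-2", {"Name": "Bob", "Phone": "555-0100"})  # different properties

v_a, props_a = table.read("cust-1")   # client A reads version 1
v_b, props_b = table.read("cust-1")   # client B reads version 1
props_a["City"] = "Oslo"
table.update("cust-1", props_a, v_a)  # A succeeds; version becomes 2
props_b["City"] = "Rome"
try:
    table.update("cust-1", props_b, v_b)  # B's copy is stale
    conflict = False
except ConflictError:
    conflict = True
```

No locks are held between read and update; the second writer simply learns that its copy is out of date and must re-read and retry.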
Azure Table Storage does not support secondary indexes, and it’s not intended for use as a mainstream
database, especially since SQL Azure is available to handle relational database duties. But Azure Table
Storage is inexpensive ($0.15/GB/month and $0.01/10,000 transactions), easily programmed (via a .NET
client library, a LINQ client and a RESTful API), and scales over multiple servers, as needed, automatically.
Since Azure Table Storage is a bona fide NoSQL database, we could stop there. But it’s important to
realize that other Azure technologies allow for the implementation of NoSQL approaches. These options
are less about full-on NoSQL and more about cherry picking various NoSQL features when that is all that
is actually desired. Let’s continue by looking at those options.
SQL Azure XML Columns
We’ll declare here and now: using XML columns in SQL Azure data storage constitutes NoSQL database
storage. There are a number of reasons why this is the case. First, consider that an XML payload bears
much resemblance to a Document Store NoSQL database. Not only are XML documents just that (i.e.
documents) but they store a collection of elements and values, with those XML elements equivalent to
key-value pairs in Document Stores[3].
The schema of an XML document can be changed at will (provided there’s no XSD schema in place – and
the Schema Collections feature of SQL Server that supports XSD is not even implemented in SQL Azure at
this time) and a collection of XML documents may or may not follow a given schema consistently. Again,
each of these qualities is common to SQL Azure XML columns and Document Stores.
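As a small illustration of the resemblance, the sketch below treats two XML payloads with different elements as rows of the same hypothetical table and reads each one as a set of key-value pairs. The row data and the helper name `as_pairs` are invented for the example, and Python's standard-library parser stands in for the database's XML handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows of (id, XML column value); the two payloads share no fixed schema.
ROWS = [
    (1, "<product><name>Lamp</name><watts>40</watts></product>"),
    (2, "<product><name>Desk</name><width>120</width><color>oak</color></product>"),
]

def as_pairs(xml_text):
    """Read the child elements of a flat XML document as key-value pairs."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

catalog = {row_id: as_pairs(payload) for row_id, payload in ROWS}
```

Both rows load without any schema change, even though only `name` is common to them.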
If that weren’t enough to convince you, then consider that the developer version of Azure Storage (i.e. the
emulator that runs on the local PC to use during development) is actually implemented using XML
columns in SQL Server Express Edition. That means all Azure developers have a full XML-data-as-NoSQL
proof-of-concept running on their development PCs.
This is more than coincidence; it’s about motivation: XML columns were added to SQL Server (and other
major relational database products) to accommodate databases with dynamic schema needs for certain
tables. Prior to XML in the database, the only way to accommodate changing schemas was to build out
“vertical” tables, whose column values were stored as rows in attribute value tables (as key-value pairs, in
fact).
[3] This analogy works best if we think of XML documents as a non-hierarchical storage mechanism. If we
think of them as hierarchical (i.e. through the use of XML attributes or child elements) then an analogy
with Wide Column Stores becomes more appropriate.
So if we consider one of the major value propositions of NoSQL, namely flexibility around changing
schemas, we see that very scenario is the inspiration for the XML column feature in SQL Server (and now
in SQL Azure). Using XML for NoSQL computing needs is not a kluge, but rather a sensible alignment of
interests.
It is important to note, however, that unlike on-premise editions of SQL Server, SQL Azure does not
support indexes on XML columns. As long as your tables contain a scalar primary key column, then you’ll
have the option of a key-based index, though you will lack the equivalent of a secondary index.
SQL Azure Federation
NoSQL focuses quite heavily on the notion of horizontal scaling and “sharding.” Sharding (i.e. horizontal
partitioning) of databases accommodates the vast demand that many public Web products may
experience. Using map-reduce-style technology is a common NoSQL product solution for managing the
shards.
SQL Azure Federation, announced at the 2010 Professional Developer Conference (PDC), is a forthcoming
feature of SQL Azure which will allow individual SQL Azure databases to function as individual “shards” in
a larger virtual database. This feature provides a supportable approach to dealing with SQL Azure’s
current 50GB size limit on individual databases and enhances query performance while at the same time
retaining the RDBMS features that most LOB developers need.
SQL Azure Federation “Members” are the counterparts to NoSQL Shards. Shards are “federated” (hence
the name of the feature) and this is achieved through the creation of a so-called Federation Key. The key
is present in any table that will be distributed and each shard is defined in such a way that it is responsible
for storing rows whose federation keys are in a specific range of values[4]. If the distribution of values
changes over time, individual shards which become too large can be split into multiple ones. A significant
advantage of this splitting feature is that it takes place online, under load, without affecting database
availability or consistency. Once again, Azure lets us cherry-pick a NoSQL feature, without forcing us to
forfeit RDBMS underpinnings.
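Range-based placement and splitting can be sketched in a few lines. This is a toy model and not SQL Azure Federation syntax: the `Federation` class, its method names and the assumption of non-negative integer keys are all hypothetical, and the online, under-load nature of a real split is not modeled.

```python
import bisect

class Federation:
    """Toy range-partitioned store: member i holds keys in [lows[i], lows[i+1])."""
    def __init__(self):
        self.lows = [0]       # sorted lower bounds, one per member
        self.members = [{}]   # one dict per member, standing in for a database

    def member_index(self, key):
        """Locate the member responsible for a federation key value."""
        return bisect.bisect_right(self.lows, key) - 1

    def put(self, key, row):
        self.members[self.member_index(key)][key] = row

    def get(self, key):
        return self.members[self.member_index(key)].get(key)

    def split(self, at):
        """Split the member that contains `at` into two at that boundary."""
        if at in self.lows:
            return  # already a boundary; nothing to split
        i = self.member_index(at)
        old = self.members[i]
        self.members[i] = {k: v for k, v in old.items() if k < at}
        self.lows.insert(i + 1, at)
        self.members.insert(i + 1, {k: v for k, v in old.items() if k >= at})

fed = Federation()
for customer_id in (5, 120, 480, 900):
    fed.put(customer_id, {"id": customer_id})
fed.split(500)
fed.split(100)
```

Callers keep using `get` and `put` with a key value before and after the splits; they never name a member directly.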
This first version of SQL Azure Federation will not have support for so-called fan-out queries. So it will not
have a map-reduce-style facility for taking a query that spans multiple members, splitting it automatically
into separate queries and merging the results of each into a single result set. But SQL Azure Federation
will have mapping functions, whereby a needed shard can be located by a specific Federation Key value
and need not be addressed by its physical database name. This makes programming the query
distribution simpler and it also provides the foundation for a full map-reduce-style fan-out query
capability that could appear in a future release.[5]
[4] In this way, a Federation Key is similar to an Azure Table Storage Partition Key.
OData
OData is Microsoft’s generalized XML data serialization format, based on the ATOM feed standard, and
RESTful API used to query, create and update data in the repositories it wraps. OData debuted as the
transmission format and API for data exposed by what is now called WCF Data Services (originally known
as project “Astoria,” then as ADO.NET Data Services). Typically, Astoria services act as RESTful wrappers
around Entity Framework data models. But with the generalization of the data format and REST
implementation, OData is now used by Microsoft and others to expose a variety of data sources. On-
premise Microsoft products and technologies that support OData interfaces include SQL Server Reporting
Services in SQL Server 2008 R2, SharePoint 2010 lists and Dynamics CRM 2011.
In the Azure world, both Azure Table Storage and SQL Azure support OData interfaces to their respective
tables. Azure Storage does so natively, while SQL Azure exposes its OData interface via a pre-release tool
(SQL Azure OData Service) available, at the time of this writing, from SQL Azure Labs. By logging into the tool
and enabling OData access with a single checkbox (either for anonymous access or access by specific
named users), the OData interface is made available immediately; there is no coding required to enable it.
What’s more, SQL Azure provides this RESTful interface while maintaining its conventional Tabular Data
Stream (TDS) interface. As such, SQL Azure provides developer simplicity while retaining its native
interface, and the performance necessary for heavy LOB workloads.
Windows Azure Marketplace DataMarket leverages OData as its native format for publishing the free and
subscription-based data feeds that comprise the service. This makes the OData format itself especially
valuable, and arguably more so than more generic XML data serialization formats, as it is at once an API
tool and a channel to commercial or public distribution of data.
What the Support Means
In practical terms, this broad support for OData on Azure means that most of its data-focused services can
be programmed via REST from most any development platform. The commands use intuitive URL
patterns and open HTTP verb conventions to provide a full data platform for key-value structured storage
(Azure Table Storage), relational data (SQL Azure) and de-normalized, processed data (DataMarket).
OData can return results not only in ATOM/XML format, but in JSON format too. This aligns it
closely with the APIs of numerous NoSQL databases.
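The URL patterns in question can be illustrated by composing a query address by hand. The sketch below only builds a string and sends nothing; the service root `https://example.invalid/service.svc` and the `Customers` entity set are hypothetical, while `$filter`, `$top` and `$format` are standard OData query options.

```python
from urllib.parse import quote, urlencode

def odata_query_url(service_root, entity_set, filter_expr=None, top=None, fmt=None):
    """Compose an OData query URL from a service root and optional query options."""
    options = {}
    if filter_expr is not None:
        options["$filter"] = filter_expr
    if top is not None:
        options["$top"] = str(top)
    if fmt is not None:
        options["$format"] = fmt
    url = f"{service_root.rstrip('/')}/{quote(entity_set)}"
    if options:
        # Keep "$" readable in option names; percent-encode spaces and quotes.
        url += "?" + urlencode(options, safe="$", quote_via=quote)
    return url

url = odata_query_url(
    "https://example.invalid/service.svc",  # hypothetical endpoint
    "Customers",
    filter_expr="City eq 'Oslo'",
    top=10,
    fmt="json",
)
print(url)
```

An HTTP GET against an address of this shape is all a client needs to issue, which is why the interface is reachable from most any development platform.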
Many NoSQL databases tout their support for REST, and the corresponding ease of use and low barrier to
entry this provides. Arguably many NoSQL proponents are drawn to these platforms because of their
simple RESTful interfaces. Given that Azure provides this same ease of use throughout the platform, we
can see once again that Azure addresses specific needs catered to by NoSQL platforms. In fact, Azure
provides for this need, and then goes beyond it: given Microsoft’s PowerPivot self-service BI tool, and its
ability to consume and analyze OData-formatted feeds using Azure’s RESTful services, Azure provides
self-service BI to customers and not just APIs to developers. This presents a very clear business case that
various NoSQL databases may be hard-pressed to counter.
[5] Even in advance of such support, considering that map-reduce jobs must themselves be explicitly coded
or scripted in many NoSQL databases, the notion of writing an Azure Federation fan-out query through
code seems a reasonable task by comparison.
Running NoSQL Database Products using Azure Worker Roles, VM
Roles and Azure Drive
If the desire or specific need is present to run a particular NoSQL database product, Worker and Virtual
Machine Roles make it possible to accommodate this setup on Azure, provided the NoSQL product has a
Windows Server-compatible version (and most do). The VM role allows customers to build their own
machine image, upload it as a virtual hard drive (VHD) file to their Azure accounts, and then spin up
instances of that image. Any properly licensed software can be installed in that machine image, including
various free NoSQL products. Likewise, a Worker Role can accommodate such customization, but any
products added to the baseline image must be xcopy-deployable or silently installed during the Worker
Role's startup task or its code's RoleEntryPoint.OnStart method.
There is one complication though: since Worker and even VM role instances may be recycled at any point,
local hard drive storage within the instance may at any time revert back to its baseline image state. So
unless the data in the instance is static and can itself be included with a VM Role image or placed on a
Worker Role image in a scripted manner at startup, data storage becomes an issue.
Luckily, the Windows Azure Drive offering provides a solution. Azure Drive allows a separate VHD file,
hosted in Azure Blob Storage, to be mounted as a mapped drive, within the Worker/VM Role instance,
through a simple .NET API. This means that a Worker/VM Role instance could have a NoSQL database
product installed on it, configured to read and write data to a mapped drive, and as long as the drive were
mounted before the NoSQL product initialized, all would be well. Scaling this to multiple Role instances
gets tricky, since a given VHD can be used as a read/write volume by only one instance at a time, but
there are ways to do it.
Is this solution optimal? Probably not. But it is workable and still runs within the context of the Azure
managed platform from which you can avail yourself of the elasticity and other traits and features of the
Azure fabric’s management. For Microsoft customers who already have a substantial investment in SQL
Server and/or .NET, this is no trivial benefit. And readers who find compelling the argument that
NoSQL features and benefits can be had from existing Azure data products like Azure Storage, SQL Azure
and their OData interfaces, will likely find the need to run dedicated NoSQL products an edge case. With
that in mind, the Azure Worker Role/VM Role/Azure Drive option appears quite feasible.
On-Premise Technologies
Before we move on, three non-cloud technologies from Microsoft bear special mention, as they provide
their own implementations of the non-tabular data, fan-out query and map-reduce job execution
technology discussed in this paper.
SQL Server 2008/2008R2 “Beyond Relational” Features
With the release of SQL Server 2008, a number of features were added to the product under the moniker
“beyond relational.” There is an array of features in this category. The two features most often identified
there are the so-called spatial features that allow for efficient storage and processing of geo-spatial
information, such as latitude/longitude coordinates, polygons, points and lines. But “Beyond Relational”
goes beyond geospatial, and includes a set of features that one could classify as NoSQL-like in nature.
For example, the Sparse Columns feature effectively allows for loosely-schematized tables. Although all
possible columns do in fact need to be declared as part of a table’s definition, the values for columns
declared as sparse can be null, without introducing any storage overhead on a per-row basis (there is
some overhead at the table-level, however). So while the full schema of sparse columns is stored, the
physical content of each row in the table may differ, and drastically so, if necessary. Special filtered
indexes and filtered statistics can be used to maintain good performance in tables that use sparse
columns. Filestream columns allow Binary Large Object (BLOB) data to be stored in the server’s file
system rather than in the database per se. Hierarchies and the HierarchyID column type allow for the
representation of hierarchical data and provide explicit support for referencing and testing data in terms
of ancestors and descendants.
The XML data type is a beyond-relational feature as well and, as we have discussed, it is supported by SQL
Azure; spatial features and the HierarchyID column type are supported by SQL Azure as well. However,
Sparse Columns and Filestream features are not supported by SQL Azure at present. My take on this is
that the symmetry between SQL Server and SQL Azure will continue to increase and, as such, the
remaining Beyond Relational features will eventually be available in the cloud. When that happens,
developers who are attracted to specific facets of NoSQL databases will find SQL Azure even more
accommodating of their needs.
SQL Server Parallel Data Warehouse Edition
SQL Server Parallel Data Warehouse Edition (SQL PDW), which was born of the acquisition of DATAllegro
by Microsoft in 2008, is Microsoft’s maiden offering in the Massively Parallel Processing (MPP) database
space. The product allows horizontal scaling of SQL Server by providing an interface over a number of
instances of the product, each of which participates in a striped distribution of large data warehouse
databases. To the database client, the entire array of SQL Server instances appears as a unified whole, and
the queries sent to that single entity are appropriately split and dispatched by PDW to the appropriate
individual agents, with each constituent query being executed in parallel (hence the term MPP).
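The single-entry-point behavior can be pictured with a short sketch in which one function accepts the query, runs a constituent query against every slice in parallel, and combines the answers. It is an illustration only and says nothing about how PDW is built: the data, the function names and the use of a thread pool in place of separate SQL Server instances are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical distribution: each list is the slice of a fact table held by one instance.
INSTANCES = [
    [("2010", 100), ("2011", 40)],
    [("2010", 60)],
    [("2011", 25), ("2011", 35)],
]

def local_query(rows, year):
    """The constituent query that one instance runs over its own slice."""
    return sum(amount for y, amount in rows if y == year)

def total_for_year(year):
    """Coordinator role: dispatch in parallel, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=len(INSTANCES)) as pool:
        partials = list(pool.map(lambda rows: local_query(rows, year), INSTANCES))
    return sum(partials)

print(total_for_year("2010"), total_for_year("2011"))
```

The caller of `total_for_year` sees one logical database; the splitting and merging stay behind that call.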
MPP shares qualities with both the sharding and map-reduce approaches to database management.
PDW provides more value than a raw MPP or map-reduce software implementation though. It is sold as
an appliance, so compute, network and storage hardware are purchased together with the software.
PDW provides more evidence that if you seek specific capabilities of NoSQL, you may
find that the relational products you use today, or products from the same family, deliver those
capabilities to you, without the disruption that would come from migration to a new database platform.
Microsoft Research Dryad
Dryad is a Microsoft Research (MSR) project that implements a map-reduce style execution engine. Dryad
jobs consist of a series of programs that are connected by channels. The programs represent vertices, and
the channels represent edges. Together, these vertices and edges form a graph, and any such graph[6], as
long as it is acyclic, can be executed by Dryad.
Like MapReduce or Hadoop, Dryad is an execution engine that manages jobs, processes input files and
produces output files. Dryad manages the execution of a graph’s vertices/programs across various nodes
in a compute cluster. Nodes may be physical machines, or cores within a machine. MSR explains that
Dryad subsumes map-reduce and also provides such infrastructural services as fault tolerance, re-
execution, scheduling, and accounting.
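The vertex-and-channel model can be sketched as a tiny job runner. This is not Dryad's API: `run_job` and the sample job are hypothetical, everything runs sequentially in one process, and fault tolerance, scheduling and accounting are left out. The sketch assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter

def run_job(vertices, edges, source_input):
    """Run a job graph.

    vertices: name -> function taking a list of inputs and returning one output
    edges:    (producer, consumer) pairs, i.e. the channels between programs
    """
    predecessors = {name: [] for name in vertices}
    for producer, consumer in edges:
        predecessors[consumer].append(producer)
    outputs = {}
    # static_order() yields each vertex after its producers; a cyclic graph raises CycleError.
    for name in TopologicalSorter(predecessors).static_order():
        inputs = [outputs[p] for p in predecessors[name]] or [source_input]
        outputs[name] = vertices[name](inputs)
    return outputs

job_vertices = {
    "split": lambda ins: ins[0].split(),
    "count": lambda ins: len(ins[0]),
    "longest": lambda ins: max(ins[0], key=len),
    "report": lambda ins: f"{ins[0]} words, longest is {ins[1]!r}",
}
job_edges = [
    ("split", "count"),
    ("split", "longest"),
    ("count", "report"),
    ("longest", "report"),
]
results = run_job(job_vertices, job_edges, "map reduce is a general algorithm")
```

A plain map-reduce job is the special case of such a graph with one layer of map vertices feeding one layer of reduce vertices.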
Dryad is not a database, but it can coordinate the operations of multiple database servers. In fact,
Microsoft AdCenter uses Dryad to run multiple instances of SQL Server Integration Services (and SQL
Server RDBMS instances) for log processing.
Dryad is now available as a technology preview within the Windows HPC Server 2008 R2 high-
performance computing line. Furthermore, according to Microsoft Research, Dryad eventually will be
integrated with Microsoft SQL Server and Windows Azure. Dryad implements an execution model with
great affinity to the map-reduce approach so closely associated with NoSQL databases. It is therefore
crucial to the discussion of NoSQL computing in the Microsoft technology universe.
An enumeration of all the cloud and on-premise products and technologies discussed in this section is
presented in Figure 6.
[6] Do not confuse Dryad’s graphs with those of Graph Databases. Though the vocabulary is quite similar,
the contexts are rather different.
Figure 6: These lists summarize the cloud and on-premise technologies from Microsoft which deliver genuine NoSQL
technology (e.g. Azure Table Storage) and/or features that NoSQL databases offer and which resonate with NoSQL
developers (like OData’s HTTP/REST APIs). We also enumerate the option of running open source NoSQL database
products in Azure compute instances, using Worker and VM Roles.
NoSQL Upsides, Downsides
We’ve already alluded to many of the relative pros and cons of dedicated NoSQL products and various
Azure technologies which, at the very least, nip away at the NoSQL feature list and deliver certain of their
advantages on an a la carte basis. Allusions are one thing, but it’s probably best that we work to
enumerate NoSQL’s upsides and downsides in a formal manner. By doing so, readers will be able to
evaluate their NoSQL needs in a no-nonsense fashion and then determine, given the Azure platform
capabilities, whether those needs necessitate use of dedicated NoSQL products.
Upsides
Lightweight, low-friction
Probably the most touted attribute of NoSQL database systems is their ease of provisioning, deployment
and integration into application code. Download, install, run a browser-based UI, create a new database,
and away you go. Since the products are open source, the licensing worries are reduced. Since there are
no schemas to declare with many NoSQL products, the database is ready as soon as you create it. And
since many NoSQL APIs are HTTP- and REST-based, and, for a number of NoSQL databases, a multitude
of client libraries for various programming environments are available, you can start coding quickly too.
Minimalist tool requirements
A number of NoSQL databases have browser-based UIs. After the product is installed, simply point your
browser at the server’s host name (or localhost, if you’re browsing on the server), a specific port and a
given virtual directory, and you may get a fully-functional UI in the browser for managing your databases,
and querying them too.
Sharding & Replication
Most NoSQL databases support the notion of sharding, which we have already discussed in the section on
SQL Azure Federation, above. Unlike SQL Azure though, the sharding facilities in most NoSQL databases
do support fan-out queries transparently. It seems reasonable that fan-out query capabilities will come to
SQL Azure in the future, but they’re not there now.
Many NoSQL databases also have simple replication facilities built in. In the relational world, replication
can be useful in branch office scenarios, but for the Web-centric focus of most NoSQL databases, it is
likely that geographic content distribution is more important. In other words, NoSQL database instances
can be created in various geographic regions, and then be configured for continuous replication such that
users can work against a database to which minimal network hops are required, with replication assuring
that each regional server gets data changes from the others.
Replication is also a disaster recovery tool, as the failure of a single replica can be addressed by the
swapping in of another. This is very important in both sharded and single-server implementations: in the
latter, the unitary server becomes a single point of failure; in the former, every single shard becomes a
point of failure as well. For this reason, sharding and replication are often used together.
Web Developer-Friendliness
Many Document Store databases use JavaScript Object Notation (JSON) as the internal storage format
and JavaScript as an internal scripting language. Therefore, writing an AJAX application against a
database in one of these products becomes much easier, as the objects in the application’s JavaScript
code can be directly written to, or read from, the database. This makes client-side (browser script-based)
data access code quite feasible and simple.
Add to this the REST APIs used by most Document Store products, and the jQuery REST libraries available
to Web developers, and it becomes clear that the suitability of NoSQL products to JavaScript/jQuery-
based applications is high, with a reasonably low learning curve for many Web developers.
For certain NoSQL products, especially Document Stores, it seems almost a core design principle that the
databases function as an extension of JavaScript’s implementation of object orientation. While it would
probably be a stretch to call these NoSQL products object databases[7], that is a useful way to consider the
intent with which they are built, with respect to JavaScript developers and their code.
Cross-Platform, Cross-Device Operation
Most NoSQL database products run on multiple OSes and thus on multiple devices. Specifically, most of
them run on Windows clients and servers, as well as on Linux. Running on Linux allows certain of these
products to run on Apple Mac OS, iOS and the Android operating system on phones and tablets[8].
For cloud computing though, the cloud servers are the host, and the only device compatibility that
becomes important is on the client side. And given the number of OData interfaces supported by Azure,
client compatibility with Microsoft’s cloud platform is quite high indeed.
Downsides
Having enumerated several facets of NoSQL databases that work out elegantly and advantageously, it’s
important to point out some of the liabilities of NoSQL products as well, especially with regard to
productivity and suitability to line-of-business application development.
Optimizations Have a Price
Usually in computing, an advantageous optimization for certain activities and patterns leads to less
functionality or flexibility in others. And with certain NoSQL databases, that is definitely the case.
Consider CouchDB, and its ability to read and write data very quickly, which in turn helps it facilitate the
Web scale capability which draws so many of its users to it.
On the write side, CouchDB can process things so quickly because the operation of writing to disk is in
fact deferred. Writes are buffered, which makes for better responsiveness, but leads to inconsistency in
the physical database in the short term and risk of data loss in the event of a crash or other outage before
the cache is committed to disk[9].
[7] Recall that we had already drawn a parallel between Graph Databases and Object Databases. Here we do
so for Document Stores. As before, the distinctions between NoSQL categories are not cut and dried.
[8] At time of writing, CouchDB for Android is available as a developer alpha release.
[9] The lost data is recoverable from database log files. But the restore operation can prove inefficient.
On the read side, CouchDB cannot be queried in an ad hoc fashion at all. Instead, the database designer
must author a “view” containing JavaScript code that traverses CouchDB databases and returns a specific
result set. This requirement, of course, makes CouchDB less than suitable for ad hoc query activities, or
even for applications where the standard querying needs are in flux. The good news is that for
applications where the querying needs are well-known and limited, CouchDB can work well, and the
overhead of a query optimizer need not impose itself. But for applications where requirements may shift
over time, capabilities are much more limited than with relational databases. This has some irony to it,
given the importance of schema flexibility (and thus accommodation of changing requirements) in NoSQL
databases overall.
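To make the idea of a predefined view concrete, here is a minimal sketch in Python. Real CouchDB views are written in JavaScript and stored in the database; the functions below (`orders_by_customer`, `build_view`, `query_view`) and the sample documents are hypothetical stand-ins that only imitate the pattern of a designer-authored map function.

```python
docs = [
    {"_id": "1", "type": "order", "customer": "alice", "total": 30},
    {"_id": "2", "type": "order", "customer": "bob", "total": 12},
    {"_id": "3", "type": "note", "text": "call bob"},
    {"_id": "4", "type": "order", "customer": "alice", "total": 5},
]


def orders_by_customer(doc):
    """Map function: yields (key, value) pairs for the documents it cares about."""
    if doc.get("type") == "order":
        yield doc["customer"], doc["total"]


def build_view(map_fn, documents):
    """Run the map function over every document and keep the rows sorted by key."""
    rows = [row for doc in documents for row in map_fn(doc)]
    return sorted(rows)


def query_view(view, key):
    """The only question this view can answer: which rows have this key?"""
    return [value for k, value in view if k == key]


view = build_view(orders_by_customer, docs)
alice_totals = query_view(view, "alice")
```

The view answers "orders for a given customer" efficiently, but a new question, such as "orders over a certain total", would require authoring and building another view rather than simply issuing a different query.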
Requirement to Query using a Procedural Language
A corollary to the above point on development of static views for querying is the procedural method by
which the code itself must traverse the database in order to produce its results. Instead of using the set-
based paradigm in SQL, NoSQL databases often must be traversed on a row-by-row (or document-by-
document, or entity-by-entity) basis. Each row/document/entity must be evaluated individually, and
declarative SQL operations like joins, which filter data more implicitly, are not available. This forces a
client-like data access model to be employed at the server, which could in turn impair scalability
more than facilitate it [10].
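The contrast between the two styles can be illustrated with a small Python sketch. It uses the standard library's `sqlite3` module as a stand-in for a relational engine; the table names, column names and sample rows are invented for the example.

```python
import sqlite3

customers = [(1, "alice"), (2, "bob")]
orders = [(10, 1, 30.0), (11, 2, 12.0), (12, 1, 5.0)]

# Set-based: declare the result; the engine decides how to join and aggregate.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
db.executemany("INSERT INTO customers VALUES (?, ?)", customers)
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
declarative = dict(db.execute(
    "SELECT c.name, SUM(o.total) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"
))

# Procedural: visit each record and do the join and the summing by hand.
names = {customer_id: name for customer_id, name in customers}
procedural = {}
for _order_id, customer_id, total in orders:
    name = names[customer_id]
    procedural[name] = procedural.get(name, 0.0) + total
```

Both approaches produce the same totals per customer. The difference is who owns the traversal logic: in the first case the query optimizer, in the second the application developer.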
Of course, that statement really commingles two separate senses of the word “scalability.” For many Web
applications, scalability involves the elimination of latency in rather simple operations, such as pulling up
an individual note, writing out a status message, bringing up account settings for a specific customer, and
so forth. Another kind of scaling involves things such as efficient keyword searches over gigantic bodies
of data, limiting the value of specific fields to a certain range or aggregating numeric field values over a
large subset of data; this sense of scale is very important as well and procedural traversals do not often
enhance it.
So perhaps it is unfair to say that, generally, procedural, row-wise data evaluation impairs scalability, since
notions of scale differ between classes of applications. But this assertion must hold true in the converse
as well, making it inaccurate to say, in a sweeping fashion, that NoSQL databases are more “scalable” or
“Web scale” than relational databases. The reality is that different applications have different needs,
different burdens and different points of stress (or failure). Scalability really is measured by the degree to
which these needs are met, burdens lifted and stresses reduced as the volume of data and/or user activity
grows, whether linearly or exponentially.
The best database for the job is just that: the best database for the job at hand. For some applications,
relational databases are not the optimal vehicle for storage and retrieval. For many others, NoSQL
databases would be quite inappropriate. So the most important question in evaluating options amongst
NoSQL databases, as well as evaluating the option of using them at all, hinges on the type of application
being written, the type of queries that must be expected and handled with relative ease, and the regularity
vs. variability of the data’s structure. That a certain type of database appears clumsy in certain situations
does not by itself render that type of database inappropriate if that situation is merely an edge case.
[10] SQL Server and SQL Azure provide this same data access option through cursors, but SQL Server
developers use cursors very sparingly to avoid the downside.
Necessity to Scale Manually
For various Web applications that are public facing, and whose data may be document-, user- or
message-oriented, NoSQL databases can work quite well. Their ability to stripe, replicate, cluster and
provide geographically distributed points of presence may form the perfect approach for the problem
space of these applications. The ad hoc, semi-federated nature of NoSQL clusters and replicas makes for
low-friction provisioning and helps assure that growth spurts in services usage and membership are non-
disruptive.
That said, there is still work involved, both in terms of resource monitoring and provisioning, that must be
done in order to meet these very demands. Meanwhile, a Platform as a Service cloud like Windows Azure,
with a data platform like SQL Azure to match, facilitates a more automated approach to both the
monitoring and provisioning which must be performed to make certain a site or application grows non-
disruptively. New Windows Azure Web and Worker roles can be spun up through clicks in the Azure
portal’s management interface, and they can be deactivated just as easily.
As a result, elasticity is achieved more laboriously with hosted NoSQL database applications. Replicas for
SQL Azure databases are created implicitly and the “cutover” from one replica to another is implicit as
well. The ramifications of this for NoSQL include extra effort and greater opportunity for error, which may
have a very real and measurable economic impact in labor costs and/or opportunity costs, as well as
greater risk exposure, to the companies building sites or providing services that use NoSQL databases [11].
Primitive Tooling
NoSQL databases are, in many cases, easier to get up and running than are relational databases. There’s
less up-front formality involved in terms of planning and design and, as a result, there’s a shorter distance
between concept and implementation. That’s exactly the kind of agility that growing companies and their
sites may need. There’s also far less complexity in tooling around these databases: simple,
self-explanatory browser-based management interfaces, straightforward REST programming interfaces and
conceptually simple key-value paradigms abound.
But tooling has its value, and that value tends to increase over time, when the imperative of raw
implementation has passed and need for smooth maintenance and troubleshooting becomes more
pronounced (and economically impactful). The design, diagnostic and operational monitoring capabilities
of SQL Server’s tools are significant, and have evolved over the roughly 20-year existence of the product.
These tools, including SQL Server Management Studio and its execution plan window, aid greatly in
preventing problems, and in solving them quickly when they do arise. NoSQL databases’ more minimalist
tooling approach leads to more manual and time-consuming management and troubleshooting than is
the case with SQL Azure (which is compatible with SQL Server’s tools), and may also make the process
more error prone. The cost impact of this can be significant.
[11] Some Web enterprises have large, dedicated technology staffs in place, who can handle this burden
well. But many corporate business units, and even IT departments, are not in that position.
Lack of ACID Transactional Capabilities in Some Products
Many NoSQL databases provide neither ACID guarantees nor support for large-scoped transactions. As
discussed previously, some products provide “eventual consistency” while others treat each database
operation as its own isolated transaction. This may be appropriate if the application need only provide
that level of reliability. For example, if social media status messages occasionally fail to post, users may
find it perfectly acceptable to discover the failure (by noticing the message never appears in a feed or
stream) and re-post the message. Furthermore, the occurrence of transactions that span more than a
single database operation may not be significant in certain apps. Note-taking applications must update
notes one at a time; blog posting is a simple operation; social networks may need to register a new
follower for a given user, and that’s a discrete operation. Unlike a financial system which may need to
execute a debit and credit as an atomic operation, many Web applications interact with data in a more
granular, minimalist way.
But for most corporate business applications, ACID guarantees are imperative. Debits and credits must
execute in an all-or-nothing fashion; ecommerce orders cannot be lost as customers will not be content to
recreate them from scratch. So, once again, the context of an application/service/site in large part
determines what defines standards of reliability and what determines whether certain advanced features
of a database are overkill or absolute necessities.
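The all-or-nothing behavior described above can be sketched with Python's built-in `sqlite3` module, used here only as a convenient relational engine. The `accounts` table, the `transfer` helper and the balances are hypothetical; the `CHECK` constraint is an assumption added so that an overdraft makes the second statement fail.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
db.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("checking", 100), ("savings", 0)]
)
db.commit()


def transfer(conn, source, target, amount):
    """Apply the credit and the debit together; return True only if both succeed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(db, "checking", "savings", 60)        # both updates commit
rejected = transfer(db, "checking", "savings", 60)  # debit would overdraw
balances = dict(db.execute("SELECT name, balance FROM accounts"))
```

In the second call the credit is applied first and the debit then violates the constraint, so the whole transaction is rolled back and the credit disappears with it. A store that treats each operation as its own isolated transaction would leave the credit in place.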
Conclusion: Relational’s Continued Indispensability in Line-of-Business
In this paper, we’ve investigated NoSQL’s general tenets. We have discussed each of its four major
subcategories: Key-Value Stores, Document Stores, Wide Column Stores and Graph Databases. We’ve
also reviewed the distributed nature of NoSQL databases, including the partitioning and replication
schemes many of them use. We have looked at NoSQL’s concurrency models, its programming models
and have explored the concepts around loosely schematized data. We reviewed MapReduce and
BigTable, and saw that they established a legacy that has influenced most, if not all, NoSQL products.
We also looked at Microsoft’s Azure cloud stack, including Windows Azure Table Storage, which is itself a
bona fide NoSQL database; various facets of SQL Azure; and support for OData in both Windows Azure
and SQL Azure. In doing so, we have seen how the Azure platform supports a full-on NoSQL approach as
well as the ability to implement various NoSQL features on an “a la carte” basis. Furthermore, we looked
at how Windows Azure Worker Roles and VM Roles support the installation and use of non-Microsoft
NoSQL databases, when and if nothing else will do. We digressed, slightly, to review the NoSQL qualities
of SQL Server’s “Beyond Relational” features and SQL Server Parallel Data Warehouse Edition; we briefly
discussed Dryad, Microsoft Research’s project providing map-reduce capabilities, and more.
We saw how NoSQL databases are suitable for data management that is light-duty but large-scale, and
how they work well for content management requirements of many stripes. We also saw, again and
again, that relational databases are best for line-of-business applications. The database consistency,
query optimization and set-based declarative query capability that relational databases have provided for
decades are still required by most LOB applications; this has not changed.
In business, data in a specific domain tends to be very regular and consistent in structure. For example,
most equities trades have the same fields, as do the counterparties involved in the trades. Most sales
invoices and line items in those invoices have consistent structure as well. When such regularity exists –
which is in fact quite often – relational databases work perfectly. Granted, they may need to be
appropriately scaled and tuned, but the overarching point is that the relational scheme is best in these
scenarios.
To understand the line-of-business versus structured data distinction, it may be helpful to consider a
hypothetical large, online bookseller. This reseller likely keeps its catalog data in a NoSQL database. It
may do likewise with its Web content, reviews and perhaps even its reading lists. But in all likelihood, its
customer billing system, its inventory and supply chain systems, its publisher online inquiry systems and
its shipping application all use relational databases. We don’t know this for a fact about any one
bookseller, but the assumptions are nonetheless based on good rules of thumb for when and where each
type of database is best utilized.
The regular, consistent data scenario is the most common one in most corporate settings. Granted, for
any number of outward-, consumer-facing Web applications, which are essentially content- and
relationship-driven, NoSQL structured stores have a welcoming home.
So you must ask yourself: do I have irregularly schematized data, such that I need to use a NoSQL,
structured storage approach to storing and retrieving it? Try not to be led to a conclusion by fear (or
even guilt) over the issue of inflexibility. Just because schema-less databases let you store irregular data
doesn’t mean you’ll need that, and just because relational databases require you to go through steps that
can be disruptive in order to modify a table’s schema, doesn’t mean you’re somehow foolhardy for going
that route.
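As a rough illustration of the two routes, the following Python sketch adds a new attribute to a relational table and to a set of schema-less documents. It uses the standard library's `sqlite3` module; the `books` table and its columns are invented for the example. Adding a nullable column is among the cheapest schema changes, so this understates how disruptive other changes, or other engines, can be.

```python
import sqlite3

# Relational: the new attribute needs an explicit schema change first.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE books (isbn TEXT PRIMARY KEY, title TEXT NOT NULL)")
db.execute("INSERT INTO books VALUES ('111', 'First Book')")
db.execute("ALTER TABLE books ADD COLUMN page_count INTEGER")
db.execute("INSERT INTO books VALUES ('222', 'Second Book', 320)")
rows = db.execute(
    "SELECT isbn, title, page_count FROM books ORDER BY isbn"
).fetchall()

# Schema-less: documents simply differ, and every reader must cope with that.
documents = [
    {"isbn": "111", "title": "First Book"},
    {"isbn": "222", "title": "Second Book", "page_count": 320},
]
page_counts = [doc.get("page_count") for doc in documents]
```

In the relational case the change is a deliberate step, after which every row has the same shape. In the schema-less case no step is needed, but the irregularity moves into the application code that reads the data.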
Consider a household analogy: if, as you build a house, you run wiring in conduit, external to your walls,
and surface-mount your fixtures, you’ll always be able to upgrade your wiring, or repair a wiring segment
gone bad. But if you know that the electrical, and maybe cable TV and computer network wiring to be
installed will suit your purposes for the long term, then it makes perfect sense to run your wiring in-wall.
You can always open the walls again if need be, and if you’re reasonably certain that you won’t need to,
then running the wiring internally is the right decision. It will look better to most people, make it easier to
push furniture against the wall and will, arguably, be somewhat safer. In general, your home will have a
more finished look to it. If one day your needs change and you need to open the walls again, that will not
necessarily mean you made a bad decision.
People should not let a relatively insignificant chance of disruption keep them from enjoying the
benefits of an otherwise sound choice. By the same token, customers should not let
the notion that their database schema may someday change force them into a decision of going with a
non-relational, loosely-schematized database.
As we have said, some applications by their nature manage data that is variable in structure, and NoSQL
databases may work very well for those applications. But if your app uses highly structured data – and
most line-of-business apps do – then why forgo the compatibility, data consistency, query optimization,
maturity, broad support and professional talent pool that a major relational database offers? You should
give that up only if the benefits of doing so outweigh the costs, and each such benefit should be
evaluated on a sober survey of likelihoods and risks.
But what if the “wires” in your “house” are changing a lot? What if you’ve got an app that manages a lot
of data that is ever-changing in structure and much of it functions as content on your Web site? Do you
need Cassandra or MongoDB or Neo4j on a hosted Linux server? Probably not. Azure tools like Azure
Table Storage, SQL Azure XML columns and OData may be viable options for your structured storage or
key-value retrieval needs. And if not, then running xcopy-deployable or silently-installable NoSQL
databases in Azure Worker Roles and Azure Drive, or running full blown NoSQL installations using Azure
VM Roles, may well work for you.
Hopefully this paper has made the choices clearer and your evaluation a more straightforward and
less “loaded” prospect. The Azure cloud provides for a spectrum of choice, rather than a single,
compulsory methodology. This provides flexibility and protection in a cost-effective, elastic computing
environment. And that’s really what “Web scale” should be all about.