Introduction
We are entering an era of big data and cloud computing.
The combination, termed “cloud analytics,” holds
enormous promise for improved productivity, cost
savings, and enhanced mission performance. The Big
Data Research and Development Initiative, launched
by the White House Office of Science and Technology
Policy (OSTP) in March 2012, underscores a growing
recognition that big data analytics can help solve some
of the nation’s most complex problems. Developed by
OSTP in concert with several federal departments and
agencies, the big data initiative provides funding and
guidance aimed at improving our ability to collect, store,
preserve, manage, analyze, and share huge quantities
of data, with the ultimate goal of harnessing big data
technologies to accelerate the pace of discovery in
science and engineering, strengthen national security,
and transform teaching and learning.1
Despite the evident benefits of cloud analytics, many
federal leaders hesitate to adopt a cloud-based
services model because of worries about both costs
and security. How will my organization pay for these
new capabilities? And will our data be secure in the
cloud? How do we secure data in the cloud while still
meeting our information sharing obligations? These
are legitimate questions, particularly given today’s
constrained fiscal environment and government’s
strict privacy and security requirements. Booz Allen
Hamilton’s viewpoint, “Developing a Business Case
for Cloud-based Services,” shows how agencies can
address cost concerns through a combination of cost
savings and productivity gains that more than justify
their cloud investments.2
The current viewpoint examines how an innovation
in cloud data storage and management known as a
“data lake” is opening new avenues for agencies to
meet their security and compliance requirements in a
cloud environment. The data lake approach enables
agencies to embed security controls within each
individual piece of data to reinforce existing layers
of security and dramatically reduce risk. Government
agencies — including military and intelligence
agencies — are using this proven security approach to
secure data and fully capitalize on the promise of big
data and the cloud.
The Cloud Analytics Imperative
To understand the power of cloud analytics, it helps
to see the progression from basic data analytics
performed in most organizations today to cloud
analytics (Exhibit 1). As a system is built out along the
continuum to cloud analytics, the size and scale of
data the system can process increases, along with its
analytic capabilities. The combination of large datasets
and powerful analytics creates a platform — cloud
analytics — for enormous leaps forward in problem
solving, decisionmaking, and overall performance.
Tapping the Full Value of Big Data and the Cloud
Enabling Cloud Analytics with Data-Level Security
Exhibit 1 | Progression to Cloud Analytics
Source: Booz Allen Hamilton
1 “Obama Administration Unveils ‘Big Data’ Initiative: Announces $200 Million in New R&D Investments.” http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf.
2 For more information about Booz Allen’s Cloud Cost Model, see our viewpoint, “Developing a Business Case for Cloud-based Services,” available at http://www.boozallen.com/insights/insight-detail-spec/concepts-in-the-cloud.
Numerous factors are driving federal agencies to
adopt cloud analytics. The Office of Management and
Budget (OMB) mandated a rapid move to embrace
“infrastructure as a service” in its “Federal Cloud
Computing Strategy,” issued in February 2011. The
cloud-first strategy called for agencies to begin by
moving at least three services to the cloud within 18
months, so they could begin harnessing the anticipated
savings and efficiencies. For example, cloud computing
facilitates federal efforts to consolidate data centers,
improve server utilization, and reduce the energy
footprint and management costs associated with data
centers. Agencies can also reduce costs and improve
IT performance with cloud-based services that enable
rapid provisioning, efficient use of resources, and greater
agility in adopting new technologies and solutions.
Another key driver is the desire to achieve cost
efficiencies by consolidating stove-pipes of
data — basically assessing legacy systems to identify
integration opportunities, consolidating interfaces,
and so on. For example, an agency that maintains 15
separate data systems would look to consolidate them
down to just one, with an eye to reducing overall IT “cost
of ownership.” However, with that consolidation comes
a host of security concerns.
Security is also a key component in the White House’s
“Digital Government Strategy,” which calls for agencies
to make better use of digital technologies, including
analytics for data-driven decisionmaking. Finally, the
White House’s “Big Data Research and Development
Initiative” seeks to exploit the fast-growing volume of
federal data using cloud-based services and emerging
analytics tools. Cloud analytics offers a wealth of
potential insights and benefits in medicine and
healthcare, military operations, intelligence analysis,
fraud detection, border protection, anti-terrorism, and
other critical government missions. Together, cloud
computing and data analytics provide a foundation for
productivity gains and enhanced mission performance
too compelling to ignore. The question is: How can
agencies realize these benefits while also ensuring
security and compliance?
Embedding Data-Level Security in the Cloud
Many organizations today rely on techniques and
approaches for storing and accessing data that
were created before the advent of the cloud and big
data. These legacy approaches typically store data
in “siloed” servers that house different types of data
based on a variety of characteristics, such as their
source, function, and security restrictions, or whether
they are batch, streaming, structured, or unstructured.
Security approaches for protecting data “at rest” have
naturally focused on protecting the individual silos that
store the data. Unfortunately, these approaches for
storing and securing data create significant challenges
for cloud analytics. The cloud’s value stems from its
ability to bring together vast amounts of data from
multiple sources and in multiple combinations for
analysis — and to do so quickly and efficiently. Rigid,
regimented silos make the data difficult to access
and nearly impossible to mix and use all at once,
reducing the effectiveness of the analytical tools.
Organizations can build bridges between silos to enable
sharing and analysis, but this approach becomes
increasingly cumbersome and costly as more and
more bridges are required to facilitate sharing among
multiple combinations of databases. In addition, it
becomes more difficult to determine who is accessing
the data, what they do with it, and why they need it
across all their systems because there is no record
of data provenance, data lineage, or data access.
Combining data from databases that have different
levels of security is especially problematic, often
requiring designation of the mixed data (and resulting
analysis) with high levels of security restrictions.
Another complicating factor for many organizations is
that some of the more effective methods for protecting
data — such as using polymorphous techniques, mixing
bogus data with real data, changing where the data
resides, and disaggregating data — become difficult
to implement as the datasets become larger and
larger. These techniques do not scale easily with the
data. Ultimately, conventional approaches for securing
data become impossible to sustain in a growing cloud
environment, and the full potential of cloud analytics
remains unfulfilled.
The new, complex cloud environment requires
organizations to re-imagine how they store, manage,
and secure data to facilitate the free flow and mixing
of different types of data. An innovative approach
called the data lake has proven extremely effective in
addressing the challenges of managing and securing
large, diverse datasets in the cloud. Rather than
storing data in siloed servers, this approach ingests
all data — structured, unstructured, streaming, batch,
etc. — into a common storage pool: the data lake. As
data enters the data lake, each piece is tagged with
security information — security metadata — that embeds
security within the data. The metadata tags can control
(or prescribe) security parameters such as who can
access the data; when they can access the data;
what networks and devices can access the data; and
the regulations, standards, and legal restrictions that
apply. Security resides within and moves with the data,
whether the data is in motion or at rest. As a result,
organizations can confidently mix multiple datasets and
provide analysts with fast and efficient access to the
data, knowing the security tags will remain permanently
attached to the data.
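The ingest-and-tag flow described above can be sketched in a few lines of Python. The field names and tag vocabulary here are illustrative assumptions, not the schema of any particular data lake product:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TaggedRecord:
    """A piece of data with security metadata permanently attached."""
    payload: Any
    tags: dict = field(default_factory=dict)

def ingest(payload, source, classification, allowed_networks, regulations):
    """Wrap incoming data with security metadata as it enters the lake.

    The tag keys (who may see it, where it may be read, which rules
    apply) mirror the parameters described in the text; a real system
    would derive their values from organizational policy.
    """
    return TaggedRecord(
        payload=payload,
        tags={
            "source": source,
            "classification": classification,
            "allowed_networks": allowed_networks,  # e.g. ["secure-intranet"]
            "regulations": regulations,            # e.g. ["HIPAA", "FISMA"]
        },
    )

data_lake = []
data_lake.append(
    ingest({"report": "ops summary"}, source="field-unit",
           classification="secret", allowed_networks=["siprnet"],
           regulations=["FISMA"])
)
print(data_lake[0].tags["classification"])  # secret
```

Because the tags travel inside the record itself, any later copy, query, or analytic result carries the same security metadata with it.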
Before examining how security metadata is attached
to the data, it is important to understand the types of
security controls needed in a cloud environment. Within
the cloud, data is typically shared among multiple
users, devices, networks, platforms, and applications;
consequently, effective cloud security encompasses
three essential activities: identity management,
configuration management, and compliance. Identity
management is critical to ensure that the right
people — and only those people — have access to
the different types of data. For most government
and commercial organizations, the requirements
for multilevel identity management complicate this
task because they give some employees access to
some but not all types of information, such as
top-secret intelligence reports or proprietary financial
data. Cloud-based data is also shared across many
different types of platforms, applications, and devices,
which further complicates the security task, because
employees might be authorized to access some data
only from specific types of devices (e.g., a secure
computer located within a government building) or
only on authorized networks (e.g., a secure intranet).
Consequently, secure cloud-based systems require
effective configuration management to manage data
access for many combinations of approved networks,
platforms, and devices, while also taking into account
user identities and authorizations. Finally, organizations
require security controls to ensure they comply
with relevant regulations and standards as data is
accessed, shared, and analyzed. For example, federal
agencies must comply with a host of security standards
and authorizations, such as the Federal Information
Security Management Act (FISMA), National Institute of
Standards and Technology (NIST) security standards
and guidelines, Health Insurance Portability and
Accountability Act (HIPAA) privacy requirements, and the
Federal Risk and Authorization Management Program
(FedRAMP) for accreditation of cloud products
and services.
The data lake enables organizations to address
these security requirements efficiently and effectively
through the security tags attached to the data as it
flows into and out of the data lake. In carrying out
this security function, the data lake acts as though it
were a massive spreadsheet with an infinite number
of columns and rows, and each cell within the
spreadsheet contains a unique piece of data, with a
defined set of security conditions or restrictions. As
each piece of data enters the lake and is tagged, it is
assigned to its cell, along with its particular security
parameters. For example, a piece of data could be
tagged with information describing who can use the
data, as well as with information describing the types
of approved devices, networks, platforms, or locations.
The tags could also describe the types of compliance
regulations and standards that apply. And the tags
could contain the dimension of time, thus helping
organizations maintain the integrity of the data and
have a record of changes over time. Similarly, the tags
could allow certain people access to all historical data
while limiting others to just the most recent data; or the
tags could embed an expiration date on the data. Many
data elements will have multiple security descriptors;
there are no limits to the number or combinations
assigned. Every piece of data is tagged with security
metadata describing the applicable security restrictions
and conditions of its use.
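The enforcement side of this "spreadsheet of tagged cells" can be pictured as a check that compares a request against every security descriptor on a record. This is a simplified sketch; the descriptor names and the all-descriptors-must-pass rule are assumptions made for illustration:

```python
from datetime import date

def may_access(tags, user_roles, device, network, today):
    """Grant access only if every security descriptor on the record is satisfied."""
    if tags.get("expires") and today > tags["expires"]:
        return False                           # tag-embedded expiration date
    if not set(user_roles) & set(tags["allowed_roles"]):
        return False                           # who may use the data
    if device not in tags["allowed_devices"]:
        return False                           # approved device types
    if network not in tags["allowed_networks"]:
        return False                           # approved networks
    return True

tags = {
    "allowed_roles": ["intel-analyst"],
    "allowed_devices": ["gov-workstation"],
    "allowed_networks": ["secure-intranet"],
    "expires": date(2030, 1, 1),
}
print(may_access(tags, ["intel-analyst"], "gov-workstation",
                 "secure-intranet", date(2024, 1, 1)))   # True
print(may_access(tags, ["contractor"], "gov-workstation",
                 "secure-intranet", date(2024, 1, 1)))   # False
```

Adding a new descriptor (say, a physical-location restriction) means adding one more check, which is why the number of tag combinations can grow without changing the enforcement model.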
Also noteworthy, organizations can code the tags
to recognize and work with security controls in the
other layers of the architecture — that is, with the
infrastructure, platform, application, and software
layers. In this way, data-level security complements
and reinforces the identity management, configuration
management, and compliance controls already in place
(or later implemented) while also facilitating the free
flow of data that gives cloud computing and analytics
their power.3
For example, the data lake approach
uses an identity management system that can handle
Attribute-Based Access Control (ABAC); a public key
infrastructure (PKI) to protect the communications
between the servers and to bind the tags to the
data elements; and a process for developing the
security controls to apply to each data element.
These technology elements are usually combined
with an organization’s existing security policies and
are then applied as analytics on top of the data once
it is ingested. In addition, unlike many conventional
security techniques, data tagging can easily scale with
an organization’s expanding infrastructure, datasets,
devices, and user population.
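The idea of binding tags to data elements, mentioned above in connection with PKI, can be sketched with a keyed hash. The text describes a PKI-based binding; an HMAC with a shared key is used here only to keep the sketch self-contained, and the key and field names are invented:

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key"  # stand-in; a real system would use PKI signatures

def bind(payload: bytes, tags: dict) -> dict:
    """Bind security tags to a data element so tampering is detectable."""
    canonical = payload + json.dumps(tags, sort_keys=True).encode()
    return {"payload": payload, "tags": tags,
            "mac": hmac.new(SECRET_KEY, canonical, hashlib.sha256).hexdigest()}

def verify(element: dict) -> bool:
    """Recompute the binding; any change to payload or tags breaks it."""
    canonical = element["payload"] + json.dumps(element["tags"],
                                                sort_keys=True).encode()
    expected = hmac.new(SECRET_KEY, canonical, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, element["mac"])

elem = bind(b"ops summary", {"classification": "secret"})
print(verify(elem))                        # True
elem["tags"]["classification"] = "public"  # attempted downgrade
print(verify(elem))                        # False
```

The point of the binding is that the security metadata cannot be silently stripped or weakened as data moves between servers.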
Implementing Data-Level Security
The data-level security made possible by the data
lake approach can be used within a variety of cloud
frameworks. A number of federal agencies have
recently implemented it with great success using the
Cloud Analytics Reference Architecture, a breakthrough
approach for storing, managing, securing, and analyzing
data in the cloud.4
Developed by Booz Allen Hamilton
in collaboration with its US government partners, the
Cloud Analytics Reference Architecture automatically
tags each piece of data with security metadata as
the data enters the data lake. Organizations can
use a variety of commercial off-the-shelf (COTS)
or government off-the-shelf (GOTS) tools, including
open-source tools, to tag the data. The tagging
technology — basically a preprocessor with the ability
to add metadata to data streams — has not proven
difficult to implement. However, resolving the policy
and legal issues surrounding the sharing and mixing
of data can be problematic. The complex process to
decide which policies and laws apply to which pieces
of data requires a determined effort by the relevant
stakeholders and decisionmakers. Each organization
is different and so will apply the rules, standards,
laws, and policies in accordance with its culture and
mission. However, once these decisions are made
and the appropriate mechanisms are put in place,
the security metadata can be attached automatically
based on the agreed-upon, preconfigured rules
addressing the relevant aspects of security, including
identity management, configuration management,
and compliance.
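Once the policy decisions are made, the "agreed-upon, preconfigured rules" can be expressed as a simple rule table consulted at ingest. The predicates and tag values below are invented for this sketch; in practice they would encode the policies settled by the stakeholders:

```python
# Each rule pairs a predicate over the incoming record with the tags to
# attach. The first matching rule wins; the last rule is a default.
RULES = [
    (lambda rec: rec.get("source") == "intel-traffic",
     {"classification": "secret", "regulations": ["FISMA"]}),
    (lambda rec: "patient_id" in rec,
     {"regulations": ["HIPAA"], "allowed_roles": ["medical-analyst"]}),
    (lambda rec: True,
     {"classification": "unclassified"}),
]

def auto_tag(record: dict) -> dict:
    """Attach security tags from the first matching preconfigured rule."""
    for predicate, tags in RULES:
        if predicate(record):
            return {**record, "_tags": tags}

print(auto_tag({"patient_id": 42})["_tags"]["regulations"])   # ['HIPAA']
```

Keeping the rules in one table, separate from the ingest code, is what lets each organization apply its own laws and policies without changing the tagging machinery.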
JIEDDO Bolsters Cloud Analytics
with Data-Level Security
A government organization that is successfully
implementing data-level security within the Cloud
Analytics Reference Architecture is the Joint Improvised
Explosive Device Defeat Organization (JIEDDO).
Established in 2006, JIEDDO seeks to improve threat
intelligence-gathering, acquire counter-IED technologies
and solutions, and develop counter-IED training for
US forces. To identify and eliminate threats, JIEDDO
analysts constantly comb through hundreds of different
data sources, such as message traffic from the
intelligence community, operations summaries from
3
In addition to applying metadata security tags to their data, organizations can also encrypt
selected pieces of data to further control access and risk. As with other security controls that
organizations put in place, the decision to encrypt data should be determined by an assess-
ment of the overall benefits relative to the costs and risks of encrypting the information.
4
For an overview of the Cloud Analytics Reference Architecture, see the Appendix.
on-the-ground deployed units, RSS feeds, news reports,
websites, and other open sources. The diverse sets of
data enter JIEDDO in every kind of format. Combining
all of JIEDDO’s information so that analysts could
conduct a single search was difficult and sometimes
impossible before JIEDDO adopted the Cloud Analytics
Reference Architecture and data-security tagging.
Typically, analysts were forced to query separate
databases using processes and tools that were
specific to each database, which meant the analysts
needed to master each database and format.
After receiving the results, analysts would then
manually combine the results to find the answers
they were seeking. The process, although valuable,
could be cumbersome and time consuming, even
for those with experience and expertise in using
the databases.
In contrast, the Cloud Analytics Reference Architecture
allows analysts to run a single query of all JIEDDO’s
data because the data is stored together in the data
lake. When looking for patterns and trends, such as
what types of IEDs certain groups are using or where
the danger spots are located, analysts can tap every
available source. Analysts can also ask any type of
question regarding information in the data lake; in
contrast, the types of questions that analysts can ask
using conventional databases are often limited by how
the data is formatted. In addition, one of the benefits
of security tagging is that it creates hierarchies of
access control to identify who can and cannot see
the data and the analytical results. This is extremely
important for JIEDDO, because it supports the US
military and international security assistance forces.
Security tagging enables analysts and commanding
officers to more readily share information with foreign
allies because the metadata protects the data.
Previously, without such tagging, valuable information
and analyses often defaulted to the highest level of
security, thus limiting their usefulness because the
information and analyses could not be widely shared.
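The contrast drawn here can be made concrete. Without per-item tags, a mixed analysis inherits the highest classification present; with tags, a releasable subset can be computed item by item. The level names and facts below are hypothetical:

```python
LEVELS = {"unclassified": 0, "secret": 1, "top-secret": 2}

analysis_inputs = [
    {"fact": "route A has IED activity", "classification": "secret"},
    {"fact": "weather delays convoys",   "classification": "unclassified"},
    {"fact": "source details",           "classification": "top-secret"},
]

# Without per-item tags, the mixed result defaults to the highest level:
mixed_level = max((r["classification"] for r in analysis_inputs),
                  key=LEVELS.get)
print(mixed_level)   # top-secret

# With tags, the subset releasable to a partner cleared to "secret"
# is computed item by item instead of withholding everything:
shareable = [r["fact"] for r in analysis_inputs
             if LEVELS[r["classification"]] <= LEVELS["secret"]]
print(shareable)   # ['route A has IED activity', 'weather delays convoys']
```

This is the mechanism behind the sharing benefit described above: the metadata, not the container, determines what each recipient may see.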
Data tagging and the Cloud Analytics Reference
Architecture are enabling JIEDDO to more effectively
carry out its mission responsibilities to analyze
intelligence, attack terrorist networks, and protect US
and coalition forces from IEDs.
Conclusion
Federal chief information officers and IT managers
overwhelmingly cite security as their chief concern
when moving to cloud computing. Many fear a loss of
control over their data. Data-level security within a data
lake addresses their concerns by providing security that
is fine-grained and expressive. It is expressive in that
organizations can tag their data with a limitless number
of security and business rules; and it is fine-grained in
that organizations can affix those rules with rigorous,
detailed precision to specify approved user identities,
devices, physical locations, networks, and applications,
applicable privacy and security regulations, and other
security parameters to each piece of data. Data
tagging also reinforces existing layers of security
embedded at the infrastructure, platform, application,
and network levels. And the metadata tags embed
each piece of data with security throughout its lifecycle,
from data generation to data elimination when the hard
drive and data are destroyed.
Together, the data lake and data-level security
represent an entirely new approach that gives both
government and business organizations a powerful tool
to solve their most complex problems. By re-imagining
data security in the cloud, organizations can unlock
the full value of cloud analytics to address scientific,
social, and economic challenges in ways that were
unimaginable a decade ago.
Appendix:
Cloud Analytics Reference Architecture
The Cloud Analytics Reference Architecture, as shown
in Exhibit 2, is built on a cloud computing and network
infrastructure that ingests all data — structured,
unstructured, streaming, batch, etc. — into a
common storage pool called a data lake. Storing
data in the data lake has many advantages over
conventional techniques. It is stored on commodity
hardware and can scale rapidly in performance and
storage. This gives the data lake the flexibility to
expand to accommodate the natural growth of an
organization’s data, as well as additional data from
multiple outside sources. Thus, unlike conventional
approaches, it enables organizations to pursue new
analytical approaches with few changes, if any, to the
underlying infrastructure. It also precludes the need
for building bridges between data silos, because all
of the information is already stored together. Perhaps
most important, the data lake treats structured and
unstructured data equally. There is no “second-class”
data based on how easy it is to use. Given that an
estimated 80 percent of the data created today is
unstructured, organizations must have the ability to
use this data. Overall, the data lake makes all of the
data easy to access and opens the door to the more
efficient and effective use of big data analytical tools.
The Cloud Analytics Reference Architecture also
allows computers to take over much of the work,
freeing people to focus on analysis and insight. As
data flows into the data lake, it is automatically
tagged and indexed for analytics and services.
Unlike in conventional approaches, the data is not
pre-summarized or pre-categorized as structured or
unstructured, or separated by location (all data is
stored in the data lake); instead, it is tagged for
indexing, sorting, identification, and security across
multiple dimensions. The data lake smoothly accepts
all types of data, including unstructured data, through
this automated tagging process. When organizations are
ready to apply analytic tools to the data, pre-analytics
filters help sort the data and prepare it for deeper
Exhibit 2 | Primary Elements of the Cloud Analytics Reference Architecture
Source: Booz Allen Hamilton
[Exhibit 2 depicts four stacked layers: Human Insights and Actions (enabled by customizable interfaces and visualizations of the data); Analytics and Services (your tools for analysis, modeling, testing, and simulations); Data Management (the single, secure repository for all of your valuable data); and Infrastructure (the technology platform for storing and managing your data). Supporting elements include data sources, metadata tagging, the data lake, streaming, views and indexes, services (SOA), analytics and discovery, and a visualization, reporting, dashboards, and query interface.]
analysis, using the tags to locate and pull out the
relevant information from the data lake. Pre-analytical
tools are also used in the conventional approach, but
they are typically part of a rigid structure that must be
reassembled as inquiries change. In contrast, the pre-
analytics in the Cloud Analytics Reference Architecture
are designed for use with the data lake, and so are
both flexible and reusable.
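A pre-analytics filter of the kind described, one that uses the tags to locate and pull relevant records from the lake, might look like the following sketch. The tag names ("topic", "region") are illustrative assumptions; a real filter would also honor the security descriptors before releasing results:

```python
def pre_filter(lake, **wanted_tags):
    """Pull records from the lake whose tags match the analyst's criteria."""
    return [rec for rec in lake
            if all(rec["tags"].get(k) == v for k, v in wanted_tags.items())]

lake = [
    {"data": "report 1", "tags": {"topic": "ied", "region": "east"}},
    {"data": "report 2", "tags": {"topic": "ied", "region": "west"}},
    {"data": "report 3", "tags": {"topic": "logistics", "region": "east"}},
]
print([r["data"] for r in pre_filter(lake, topic="ied")])
# ['report 1', 'report 2']
```

Because the filter works purely from tags, the same function serves any line of inquiry; nothing has to be reassembled when the question changes.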
The Cloud Analytics Reference Architecture opens
up the enormous potential of big data analytics in
multiple ways. For example, it removes the constraints
created by data silos. Rather than having to move from
database to database to pull out specific information,
users can access all of the data at once, including
data from outside sources, expanding exponentially the
spectrum of analysis. This approach also expands the
range of questions that can be asked of data through
multiple analytic tools and processes, including:
• Ad hoc queries. Unlike conventional approaches,
where analytics are part of the narrow, custom-
built structure, in the Cloud Analytics Reference
Architecture, analysts are free to pursue ad hoc
queries employing any line of inquiry, including
improvised follow-up questions that can yield
particularly valuable results.
• Machine learning. Analytics can search for patterns
examining all of the available data at once without
needing to hypothesize in advance what patterns
might exist.
• Alerting. An analytic alert notifying an organization
that something unexpected has occurred — such
as an anomaly in a pattern — can signal important
changes and trends in cyber threats, enemy
activities, health and disease status, consumer
behavior, market activity, and other areas.
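The alerting idea in the last bullet can be sketched with a deliberately naive baseline-deviation check; real deployments would use far richer analytics, and the data here is invented:

```python
from statistics import mean, stdev

def alert(history, latest, threshold=3.0):
    """Flag a new observation that deviates sharply from its baseline.

    A z-score threshold is a simple stand-in for the pattern
    analytics the architecture would actually run.
    """
    mu, sigma = mean(history), stdev(history)
    return abs(latest - mu) > threshold * sigma

daily_counts = [12, 9, 11, 10, 13, 11, 10, 12]   # e.g. events observed per day
print(alert(daily_counts, 11))   # False: within the normal pattern
print(alert(daily_counts, 40))   # True: an unexpected spike worth an alert
```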
The Cloud Analytics Reference Architecture also
supports interfaces and visualization dashboards to
contextualize and package the insights, patterns, and
other results for decisionmakers. Although the Cloud
Analytics Reference Architecture opens a wide aperture
to data, it incorporates visualization and interaction
tools that present the analyses in clear formats tailored
to the specific issues and decisions at hand, enabling
insight and confident action by decisionmakers.
A number of defense, civilian, and intelligence agencies
are already using the Cloud Analytics Reference
Architecture to generate valuable insights and achieve
mission goals previously unattainable in conventional
cloud environments. For example, the US military
is using the Cloud Analytics Reference Architecture
to search for patterns in war zone intelligence data,
mapping out convoy routes least likely to encounter
IEDs. The Centers for Medicare and Medicaid Services
(CMS) are using this approach to combat fraud by
analyzing mountains of data, which enables CMS to
assess doctors and others who bill Medicare on their
risk of committing fraud. And intelligence agencies are
using this new cloud architecture to apply aggressive
indexing techniques and on-demand analytics across
the agencies’ massive and increasing volume of
both structured and unstructured data. Booz Allen
itself is also adopting the Cloud Analytics Reference
Architecture to maximize its cloud analytics capabilities,
both for the firm and its clients.
Many organizations today have an urgent need to
make sense of data from diverse sources, including
those that have previously been inaccessible
or extremely difficult to use, such as streams
of unstructured data from social networks or
remote sensors. The Cloud Analytics Reference
Architecture enables analysts and decisionmakers
to see new connections within all of this data to
uncover previously hidden trends and relationships.
Organizations can extract real business and mission
value from their data to address pressing challenges
and requirements, while improving operational
effectiveness and overall performance.
About Booz Allen Hamilton
Booz Allen Hamilton has been at the forefront of
strategy and technology consulting for nearly a century.
Today, Booz Allen Hamilton is a leading provider of
management and technology consulting services to
the US and international governments in defense,
intelligence, and civil sectors, and to major corporations,
institutions, and not-for-profit organizations. In the
commercial sector, the firm focuses on leveraging its
existing expertise for clients in the financial services,
healthcare, and energy markets, and to international
clients in the Middle East. Booz Allen Hamilton offers
clients deep functional knowledge spanning strategy and
organization, engineering and operations, technology,
and analytics—which it combines with specialized
expertise in clients’ mission and domain areas to help
solve their toughest problems.
The firm’s management consulting heritage is the
basis for its unique collaborative culture and operating
model, enabling Booz Allen Hamilton to anticipate
needs and opportunities, rapidly deploy talent and
resources, and deliver enduring results. By combining
a consultant’s problem-solving orientation with deep
technical knowledge and strong execution, Booz
Allen Hamilton helps clients achieve success in their
most critical missions—as evidenced by the firm’s
many client relationships that span decades. Booz
Allen Hamilton helps shape thinking and prepare for
future developments in areas of national importance,
including cybersecurity, homeland security, healthcare,
and information technology.
Booz Allen is headquartered in McLean, Virginia,
employs approximately 25,000 people, and had
revenue of $5.86 billion for the 12 months ended
March 31, 2012. For over a decade, Booz Allen’s high
standing as a business and an employer has been
recognized by dozens of organizations and publications,
including Fortune, Working Mother, G.I. Jobs, and
DiversityInc. More information is available at
www.boozallen.com. (NYSE: BAH)
Contacts
Jason Escaravage
Principal
escaravage_jason@bah.com
703-902-5635
Peter Guerra
Senior Associate
guerra_peter@bah.com
301-497-6754