Dealing with Big Data: Planning for and Surviving the Petabyte Age
Cognizant 20-20 Insights
Making Sense of Big Data in the Petabyte Age
cognizant 20-20 insights | june 2011

Executive Summary

The concept of “big data”1 is gaining attention across industries and around the globe, thanks to the growth of social media (Twitter, Facebook, blogs, etc.) and the explosion of rich content from other information sources (activity logs from the Web, proximity and wireless sources, etc.). The desire to create actionable insights from ever-increasing volumes of unstructured and structured data sets is forcing enterprises to rethink their approaches to big data, particularly as traditional approaches have proved difficult — if even possible — to apply to such data sets.

While data volume proliferates, the knowledge it creates has not kept pace. For example, the sheer complexity of how to store and index large data stores, as well as of the information models required to access them, has made it difficult for organizations to convert captured data into insight.

The media appears obsessed with how today’s leading companies are dealing with big data, a phenomenon known as living in the “Petabyte Age.” However, coverage often focuses on the technology aspects of big data, leaving concerns such as usability largely untouched.

For years, the accelerating data deluge has also received significant attention from industry pundits and researchers. What is new is the threshold that has been crossed as the onslaught continues to accelerate in terms of volume, complexity and formats.2

A traditional approach to handling big data is to replace SQL with tools like MapReduce.3 However, taming the sheer volume of a routinely analyzed data set does not solve the more pressing issue: people have difficulty focusing across the massive number of tables, files, Web sites and data marts that are all candidates for analysis. It is not just about data warehouses anymore.

Usability is a factor that will overshadow the technical characteristics of big data analysis for at least the next five years. This paper focuses specifically on the roadmap organizations must create and follow to survive the Petabyte Age.

Big Data = Big Challenges

The Petabyte Age is creating a multitude of challenges for organizations. The accelerating deluge of data is problematic for all, for within the massive array of data sources — including data warehouses, data marts, portals, Web sites, social media, files and more — is the information required to make the smartest strategic business decisions. Many enterprises face the dilemma that the systems and processes devised specifically to integrate all this information lack the responsiveness required to place the information into a neatly organized warehouse in time to be used at the current speed of business. The heightened use of Excel and other desktop tools to integrate needed information in the most expedient way possible only adds to the complexity of the problem enterprises face.

There are a number of factors at the heart of big data that make analysis difficult. For example, the timing of data, the complexity of data, the complexity of the synthesized enterprise warehouse and the identification of the most appropriate data pose equal if not larger challenges than dealing with large data sets themselves.

The increased complexity of the data available for generating insights is a direct consequence of the following:

• The highly communicative and integrated global economy. Enterprises across industries are increasingly seeking more granular insight into the market forces that ultimately shape their success and failure. Data generated by the “always-on economy” is the impetus, in many cases, for the keen interest in implementing so-called insight facilitation appliances on mobile devices — smartphones, iPads, Android and other tablets — throughout the enterprise.

• The enlightened consumer. Given the explosion in social media and smart devices, many consumers have more information at their fingertips than ever before (often more so than businesses) and are becoming increasingly sophisticated in how they gather and apply such information.

• The global regulatory climate. New regulatory mandates — covering financial transactions, corporate health, food-borne illnesses, energy usage and availability, and other issues — require businesses to store increasingly greater volumes of information for much longer time frames.

As a result of these factors, enterprises worldwide have been rapidly increasing the amount of data housed for analysis to compete in the Petabyte Age. Many have responded by embracing new technologies to help manage the sheer volume of data. However, these new toys also introduce data usability issues that will not be solved by new technology but rather will require some rethinking of the business consequences of big data. Among these challenges:

• Big data is not only tabular; it also includes documents, e-mails, pictures, videos, sound bites, social media extracts, logs and other forms of information that are difficult to fit into the nicely organized world of traditional database tables (rows and columns).

• Companies that tackle big data as a technology-only initiative will solve only a single dimension of the big data mandate.

• There are sheer volumetric issues, such as billions of rows of data, that need to be solved. While tried-and-true technologies (partitioning) and newer technologies (MapReduce, etc.) permit organizations to segment data into more manageable chunks, such an approach does not deal with the issue that rarely used information is clogging the pathway to necessary information. Traditional lifecycle management technologies will alleviate many of the volumetric issues, but they will do little to solve the non-technical issues associated with volumetrics.

Tabulating the Information Lifecycle

Types of information retained the longest: Source Files, 25%; Customer Records, 19%; Organization Records, 18%; Government Records, 11%; Database Archive, 6%; Financial Records, 5%; Email, 4%; Development Records, 4%.

How much information is retained: less than 1 TB, 21%; more than 1 TB, 25%; more than 5 TB, 18%; more than 10 TB, 14%; more than 25 TB, 4%; more than 100 TB, 7%; more than 500 TB, 7%; more than 1 PB, 4%.

Source: The 100 Year Archive Task Force, SNIA Data Management Forum
Figure 1
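The MapReduce approach referenced above can be illustrated with a minimal, single-process sketch in Python. This is not a distributed implementation; real frameworks (Hadoop and its kin) run the same map, shuffle and reduce phases across a grid of machines, and the log records and word-count task below are invented purely for illustration:

```python
from collections import defaultdict
from itertools import chain

# Illustrative only: a single-process sketch of the MapReduce pattern.

def map_phase(record):
    # Emit (key, value) pairs; here, count words in a log line.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as a framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine the grouped values for one key into a single result.
    return (key, sum(values))

records = ["error disk full", "warning disk slow", "error disk full"]
pairs = chain.from_iterable(map_phase(r) for r in records)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The point of the pattern is that map_phase and reduce_phase each touch only one record or one key at a time, which is what allows the work to be segmented into more manageable chunks across many machines.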
As a result of mergers and acquisitions, global sourcing, data types and other issues, the sheer number of tables and objects available for access has mushroomed. This increase in the volume of objects has made access schemes for big data overly complex, and it has made finding needed data akin to finding a needle in a haystack.

• The information lifecycle management4 considerations of data have not received the attention they deserve. Information lifecycle management should not be limited to partitioning schemes and the physical location of data. (Much attention is being given to cloud and virtualized storage, which presumes a process has been devised for rationalizing that always-on information made available in the cloud is worthy of this heightened availability.) Information lifecycle management is the process that traditionally stratifies the physical layout of data for technical performance. In the Petabyte Age, where the amount of information available for analysis is increasing at an accelerating rate, the information lifecycle management process should also ensure a heightened focus on the information that matters. This stratification should categorize information into the following groups:

> Information directly related to the creation, extraction or capture of value.

> Supporting information that could be referred to when devising a strategy to create, extract or capture value.

> Information required for the operations of the enterprise but not necessarily related to the creation, extraction or capture of value.

> Information required for regulatory activities but not necessarily related to the creation, extraction or capture of value.

> Historical supporting information.

> Historical information that was once aligned with value, regulatory or other purposes but is now kept because it might be useful at some future date.

• Much of the information made available for deriving insight is a complex weave of multiple versions of the truth, data organized for different purposes, data of different vintages and similar data obtained from different sources. Many organizational stakeholders describe this as a “black box of information” whose lineage they cannot see. This adds delay to the use of insight gained from such information, because the information must first be validated before it is used for anything out of the ordinary. Much of the data available for analysis results from the conventional wisdom that winning organizations are “data pack rats” and that information that arrives on their doorsteps tends to stay as a permanent artifact of the business. Interestingly, according to the 100-Year Archive Task Force,5 source files are the most frequently identified type of data retained as part of the “100-Year Archive.”

• A sizable amount of operational information is not housed in official systems administered by enterprise IT groups but instead is stored on desktops throughout the organization. Many of these local data stores were created with good intentions; people responsible for business functions had no other way of gaining control over the information they needed. As a result, these desktop-based systems have created a different form of big data — a weave of Microsoft Access, Excel and other desktop tool-based sources that are just as critical in running enterprises. The contents of these individually small to medium-size sources of information collectively add up to a sizable number of sources. One large enterprise was found to have close to 1,000 operational metrics managed in desktop applications (i.e., Excel) — which is not an uncommon situation. These sources never make it to the production environment and downstream data warehouses.

Many Sources of Data
(Diagram: an informal information network links the data warehouse, portal, partner data, taxonomy dimensions, operational systems, and personal spreadsheets and local data stores through shared organization keys.)
Figure 2

• Much of the big data housed in organizations results from regulatory requirements that necessitate storing large amounts of historical data. This large volume of historical data hides the data required for insight. While it is important to retain this information, housing it with the same priority as information used for deriving insight is both expensive and unnecessary.

The Case for Horizontal Partitioning

Horizontal partitioning is the process of segmenting data in a way that prioritizes the information required for value extraction, origination and capture.6 This partitioning should tier information along the traditional dimensions of a business information model, enhancing the focus on information that supports the extraction, origination and capture of value.

The Roadmap to Managing Big Data

Big data will be solved through a combination of enhancements to the people, process and technology strengths of an enterprise.

People-based initiatives that will impact big data programs employed at companies include the following:

• Managing the information lifecycle employed at the organization. For good reason (glaring privacy and security concerns, among them), organizations have placed significant focus on information governance. The mandate for determining which data deserves focus should be part of the overall governance charter.

• Ensuring a sufficient skill set to tackle the issues introduced as a consequence of big data.

• Developing a series of metrics to manage the effectiveness of the big data program. These include:

> Average, minimum and maximum time required to turn data into insight.

> Average, minimum and maximum time required to integrate new information sources.

> Average, minimum and maximum time required to integrate existing information sources.

> Time required for the management process.

> Percentage of people associated with the program participating in the management process.

> Value achieved from the program.

Converting Big Data Into Value
(Diagram: acquired, created and heard data is refined through inference and learned knowledge into relevant, actionable and trustworthy insight about capabilities, customers, markets, channels, risks, investors, value chain disruptions and expected regulatory outcomes; the resulting action and innovation extract, originate and capture value across the transaction and value streams.)
Figure 3
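The stratification and horizontal-partitioning ideas described above can be sketched in code. This is a toy illustration only: the tier names follow the groups listed in the text, while the catalog entries, their attributes and the tier-to-storage mapping are all hypothetical:

```python
# A sketch of horizontal partitioning: stratify data sources by their
# alignment with value creation, then route each tier to storage of
# matching priority. Catalog entries and storage classes are invented.

TIER_STORAGE = {
    "value": "high-performance warehouse",
    "value-supporting": "standard warehouse",
    "operations": "standard warehouse",
    "regulatory": "low-cost archive",
    "historical-supporting": "low-cost archive",
    "historical-unaligned": "candidate for retirement",
}

def classify(source):
    # Promote or demote a source based on how it is actually used.
    if source["drives_value"]:
        return "value"
    if source["supports_value_strategy"]:
        return "value-supporting"
    if source["operational"]:
        return "operations"
    if source["regulatory"]:
        return "regulatory"
    return ("historical-supporting" if source["referenced_recently"]
            else "historical-unaligned")

catalog = [
    {"name": "orders", "drives_value": True, "supports_value_strategy": True,
     "operational": True, "regulatory": False, "referenced_recently": True},
    {"name": "fy2003_ledger", "drives_value": False,
     "supports_value_strategy": False, "operational": False,
     "regulatory": True, "referenced_recently": False},
    {"name": "legacy_clickstream", "drives_value": False,
     "supports_value_strategy": False, "operational": False,
     "regulatory": False, "referenced_recently": False},
]

placement = {s["name"]: TIER_STORAGE[classify(s)] for s in catalog}
```

The design point is that the classification criteria come from the business information model (value, operations, regulation), not from technical characteristics such as row counts or access frequency alone.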
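The program metrics called for above (average, minimum and maximum time to turn data into insight, and so on) are simple to compute once the underlying events are logged consistently. A minimal sketch with invented durations:

```python
from statistics import mean

# Hypothetical elapsed times (in days) from a data request to a
# published insight, one entry per completed analysis.
days_to_insight = [3, 10, 4, 21, 6, 2, 14]

insight_metrics = {
    "average_days": round(mean(days_to_insight), 1),
    "minimum_days": min(days_to_insight),
    "maximum_days": max(days_to_insight),
}
```

The same shape applies to the other metrics in the list (time to integrate new or existing sources); the hard part is capturing the start and end events consistently so the trend can be managed over time.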
According to McKinsey,7 the activities of people steering big data will include:

• Ensuring that big data is more accessible and timely.

• Measuring the value achieved by unlocking big data.

• Embracing experimentation through data to expose variability and raise performance.

• Enabling the customization of populations through segmentation.

• Facilitating the use of automated algorithms to replace and support human decision-making, thereby improving decisions, minimizing risks and unearthing valuable insights that would otherwise remain hidden.

• Facilitating innovation programs that use new business models, products and services.

Process-based initiatives that will impact big data programs are best enabled as augmentations of a company’s governance activities. These augmentations include:

• Ensuring sufficient focus on information that will drive value within the organization. These processes are best employed as linkages between corporate strategy and information lifecycle management programs.

> It is important to note that information lifecycle management is defined in many organizations as a program to manage hierarchies and business continuity. For the purposes of this paper, this definition is extended to include the promotion and demotion of data items in alignment with the organization’s business information model (how information is used to support enterprise strategies and tactics).

> The process used to govern big data and its information lifecycle should continually refine and prioritize the benefits, constraints, priorities and risks associated with the timely publication and use of relevant, focused, actionable and trustworthy information published under big data initiatives.

• Ensuring that the metrics that drive proper adoption and use of big data are developed. These should cover the following topics:

> Governing big data.

> Big data lifecycle.

> Big data use and adoption.

> Big data publication metrics.

Technology-based initiatives that will impact big data programs employed at companies include:

• Ensuring that the specialized skills required to administer and use the fruits of the big data initiative are present. These include the databases, the ontologies used to navigate big data and the MapReduce concepts that augment or fully replace SQL access techniques.

• Ensuring that the tools introduced to navigate big data are usable by the intended audience without upsetting the self-service paradigms that have slowly gained traction during the past several years.

• Ensuring that the architecture and the supporting network, technology and software infrastructures are capable of supporting big data.

Big Data Getting Bigger
(Chart: the size of the largest data warehouse in the Winter Top Ten Survey, in TB, 1998 through 2012, actual and projected; growth at a 173% CAGR far outpaces the Moore’s Law growth rate. Reference points: eBay, 6.5 PB, 2009; Google, 1 PB of new data every 3 days, 2009.)
Source: Winter Corp.
Figure 4

It is safe to state that if history is any predictor of the future, the sheer volume of data that organizations will need to deal with will outstrip our collective imaginations of how much data will be available for generating insights. Only eight years ago, a 300 to 400 terabyte data warehouse was considered an outlier. Today, multi-petabyte warehouses are easily found. Failure to take action to manage the usability of information pouring into the enterprise is (and will be) a competitive disadvantage (see Figure 4).

Recommendations

Big data is a reality in most enterprises. However, companies that tackle big data as merely a technology imperative will solve a less important dimension of their big data challenges. Big data is much more than an extension of the technologies used in the partitioning strategies employed at enterprises.

Companies have proved that they are pack rats. They need to house large amounts of history for advanced analytics, and regulatory pressures influence them to just store everything. The reduced cost of storage has allowed companies to turn their data warehousing environments into data dumps, which has added both complexity to the models (making it difficult for knowledge workers to navigate the data needed for analysis) and an analytic plaque that makes finding the required information for analysis akin to finding a needle in a haystack.

In many organizations, mandated information lifecycle initiatives too often focus on the optimal partitioning and archiving of the enterprise’s data (i.e., vertical partitioning). Largely a technology focus, this thread of the information lifecycle overlooks data that is no longer aligned with the strategies, tactics and intentions of the organization. The scope and breadth of information housed by enterprises in the Petabyte Age mandate that data be stratified according to its usefulness in organizational value creation (i.e., horizontal partitioning). In today’s organizations, the only cross-functional bodies with the ability to perform this horizontal partitioning are the virtual organizations employed to govern the enterprise information asset.

Storage Definitions
Terabyte: 200,000 photos or MP3 songs will fit on a single 1 terabyte hard drive.
Petabyte: Will fit on 16 Backblaze storage pods racked in two data center cabinets.
Exabyte: Will fit in 2,000 cabinets and fill a four-story data center that takes up a city block.
Zettabyte: Will fit in 1,000 data centers, or about 20% of Manhattan, New York.
Yottabyte: Will fit in a million data centers, covering the states of Delaware and Rhode Island.
Figure 5

It was only a few years ago that a data warehouse requiring a terabyte of storage was the exception. As we embrace the Petabyte Age, companies are entering an era in which they will need to be capable of handling and analyzing much larger populations of information. Regardless of the processes put in place, ever-increasing volumes of structured and unstructured data will only proliferate, challenging companies to quickly and effectively convert raw data into insight in ways that stand the test of time.
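The tiers in the Storage Definitions sidebar (Figure 5) can be restated numerically. A small sketch, assuming decimal (base-1000) units and taking the figure’s estimate of 200,000 photos or songs per terabyte at face value:

```python
# Decimal (SI) storage units: each tier is 1,000x the previous one.
TB = 10**12
PB = 10**15
EB = 10**18

# 200,000 photos/MP3s per terabyte implies an average item size
# of roughly 5 MB.
avg_item_bytes = TB // 200_000

# A petabyte therefore holds about 200 million such items.
items_per_pb = PB // avg_item_bytes
```

Scaling each tier by another factor of 1,000 gives the exabyte, zettabyte and yottabyte figures in the sidebar.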
Footnotes

1 Big data refers to data sets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics and visualization.

2 “The Toxic Terabyte” (IBM Research Labs, July 2006) provides a thorough analysis of how companies had to get their houses in order to deal with a terabyte of data. If this authoritative work were rewritten today, it would be called “The Problematic Petabyte” and, in five years, most probably “The Exhaustive Exabyte.”

3 MapReduce is a Google-inspired framework specifically devised for processing large amounts of data across a grid of computing power.

4 Information lifecycle management is a process used to improve the usefulness of data by moving lesser-used data into segments. It is most commonly concerned with moving data from always-needed partitions to rarely needed partitions and, finally, into archives.

5 SNIA Data Management Forum’s 100 Year Archive Task Force, 2007.

6 Horizontal partitioning is a term created by the author. It applies generally accepted techniques for gaining performance by segmenting data into partitions (vertical partitioning) to the segmentation of groups of data by the likelihood of their achieving organizational value.

7 “Big Data: The Next Frontier for Innovation, Competition and Productivity,” McKinsey & Company, May 2011.