Reducing the Total Cost of Ownership of Big Data
White Paper
Abstract
In this white paper, Impetus shares best practices and
strategies that will enable businesses to lower the total
cost of ownership of Big Data solutions. It discusses challenges related to
the cost of Big Data solutions and looks at the technological options
available to address these concerns.
Impetus Technologies Inc.
www.impetus.com
Contents
Introduction
Using Commodity Hardware for Big Data
Using Open Source and Cloud Computing
The Cost Components of a Big Data Warehouse
Lowering the Total Cost of Ownership
Reducing the Cost of Storage
What Technologies, Where?
Big Data Scenarios in OLTP
Big Data Scenarios in OLAP
Analytics with Hadoop
Choosing the Right Technologies
Opting for Faster MapReduce/Hadoop
NoSQL Database Solutions
New Era Relational Databases
Impetus Solutions and Recommendations
Conclusion
Introduction
As the power of Big Data solutions continues to grow, so too does the cost of
collecting, managing, and storing data. According to IDC/EMC estimates, the
total value of the computers, networks, and storage facilities driving the digital
universe now stands at a whopping USD 6 trillion! Furthermore, that figure is
expected to grow significantly over the next few years. In fact, some estimates
suggest that the digital universe doubles in size every 18 months.
Yet, how much of that information is actually useful? An overload of information
can actually increase the cost of storage, reduce productivity, and essentially
ensure much of the collected data will go to waste. Despite access to this rich
pool of data, many businesses continue to extract information of little value. It
is estimated that businesses spend an extra USD 650 billion to gather and store
data that they never put to use.
Clearly, much more can be done to unearth business intelligence and actionable
insights from Big Data. The question is, what is the best way to do that both
intelligently and cost-effectively? In this white paper, Impetus examines some of
the pros and cons of several Big Data solutions on the market, and offers
practical advice based on years of experience.
Using Commodity Hardware for Big Data
There are many advantages to using commodity hardware. It is readily
available and accessible, and its biggest advantage is that businesses can
build systems themselves, opening up many avenues for innovation.
The cost of building reliable storage from commodity hardware is about USD 1
per gigabyte—a great deal and a very good start. However, keep in mind that this
figure covers only the cost of storage and does not include other costs
associated with managing, monitoring, and hosting data.
Using Open Source and Cloud Computing
Using free, open source software to store, manage, and analyze Big Data comes
with a number of benefits. By now, everyone has heard of Hadoop and its ability
to tackle large volumes of data, while still providing significant savings.
Using cloud computing for Big Data also has its advantages. Cloud computing
allows users to rent resources on demand for data storage and analytics;
Amazon Web Services and Microsoft's Windows Azure platform are two examples.
Businesses can select the offering from these portfolios that best fits their
needs and requirements.
The downside to using cloud computing, however, is storage. While storage is
available over the cloud, it can be very costly.
The Cost Components of a Big Data Warehouse
Many businesses today are turning to Big Data Warehouses as a means of
storage. Before making this decision, it is important to understand the costs
these storage facilities can generate.
Entry Cost
The first expense is entry cost—the cost incurred to identify the right Big Data
solution.
Cost of Migrating Data
Once a Big Data solution has been chosen, the next expense is the cost of
moving data to the new system. Data migration can be especially expensive for
businesses that require ETL processes, which may call for the purchase of
costly specialized tools.
Other Costs
A number of other factors can potentially inflate the cost of Big Data solutions.
For example, every solution requires tooling that keeps the system easy to
manage as it scales and when components fail. Thus,
performance analytics and data management may represent additional major
expenses to a Big Data plan.
Ongoing maintenance is also essential, and accounts for another cost. As the
volume of data increases and changes are made, Big Data warehouses will
always require monitoring and tuning.
Taken together, these factors—performance analytics, data management, data
maintenance—can dramatically increase the cost of a Big Data solution.
Lowering the Total Cost of Ownership
Based on years of experience in the field, Impetus has identified a number of
best practices to help businesses reduce the total cost of ownership of Big Data
solutions. This section discusses potential cost savings in hardware and
software, with these two main suggestions in mind:
For hardware, Impetus suggests looking at the cost savings available in
storage and computation.
For software, Impetus suggests a number of solutions that enable the
processing of more data, more quickly, and for less money.
Reducing the Cost of Storage
Impetus advises businesses to compress data in order to cut storage costs.
Compressed data requires less storage space, and less storage space means less
spending.
Some of the solutions available on the market claim they can compress data to
1/40th of its previous size. When looking at these solutions, however, be careful
to ensure that the read throughput of the data is not compromised when it is
decompressed.
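To make the idea concrete, here is a minimal sketch of how compression might be switched on for a Hadoop MapReduce job. It assumes Hadoop's standard Java API, and the codec choices shown (Snappy for intermediate map output, Gzip for final output) are just one plausible trade-off between CPU cost and compression ratio, not a universal recommendation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedJobSetup {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output: Snappy is CPU-cheap and cuts
        // the disk and network I/O paid during the shuffle.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output-job");
        // Compress the final job output: Gzip trades more CPU for a better
        // compression ratio, shrinking the data kept on disk long-term.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
    }
}
```

The read-throughput caveat above applies here as well: a heavier codec shrinks storage but makes every subsequent scan of that data pay a decompression cost.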
Additionally, with Big Data analytics, businesses may opt to focus on a specific
subset of data, rather than analyzing all of the data that has accumulated over time.
Another option would be to look into systems designed to store data and
information based on principles very similar to information lifecycle
management (ILM).
With all this talk about Big Data, it is easy to forget about small data. Often, it is
easier to gain business insight using smaller sets of data. Thus, Impetus does not
recommend using Big Data solutions for the storage and retrieval of small
amounts of data, as the relative latency of queries will be higher.
What Technologies, Where?
One key to reducing the total cost of ownership is to understand the available
technologies and how they can be used.
With the advent of Big Data, many commercial and specialized hardware
appliances have come to the market. These solutions offer rich features such
as fault tolerance, easy capacity scaling, and specialized management tools.
Alternatively, the commodity hardware available today can be harnessed for
Big Data use cases by leveraging open source stacks and solutions.
Latency is also a critical factor, but the systems with the lowest latency are also
likely to be the most expensive. There is, of course, a niche market that focuses
on latency as a business problem.
For cloud-based Big Data solutions, the first question is whether moving to the
cloud is the only solution given data storage requirements. Moving to a cloud-
based solution can be quite expensive, especially if the data is not already on
the cloud. Businesses will also need to upload all of the data needed for
processing, which adds significantly to the cost.
With this thorough understanding of the technologies available to tackle Big
Data, Impetus will now discuss how these technologies can be used. These
technologies can be broadly divided into two categories—online analytical
processing (OLAP) and online transaction processing (OLTP).
Big Data Scenarios in OLTP
When generating or working with large sets of data in an OLTP scenario, cost-
effective NoSQL solutions are ideal. When working with a typical data
warehouse that requires analytical processing, however, Impetus recommends
using MapReduce or MPP-based systems.
Big Data Scenarios in OLAP
Big Data online analytical processing (OLAP) can be divided into three different
scenarios:
Big Input Small Output. This is the most common scenario. It is often
used to draw conclusions and prepare graphs or charts, or in cases
where the top n elements in a data set need to be identified (a sketch
of this pattern follows this list).
Small Input Big Output. This scenario occurs when the input data set is
small and the resulting output is big, as typically happens in
predictive analysis, where n outcomes are possible. It also applies
where correlation-coefficient matrices must be populated from a given
set of inputs: the inputs may be small, but the results can turn out to
be very large.
Big Input and Big Output. The third scenario occurs in ETL processes.
Here, the magnitude of output data is similar to that of input data.
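To illustrate the first scenario, here is a hedged sketch of a top-N reducer using Hadoop's Java API. The counts-per-key schema and the cutoff N are hypothetical, and the job is assumed to run with a single reducer so the entire input collapses into at most N output rows:

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Big input, small output: a very large key/count input collapses into N rows.
public class TopNReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private static final int N = 10; // illustrative cutoff
    // Counts mapped to keys; in this simplified sketch, two keys with
    // identical counts overwrite each other.
    private final TreeMap<Long, String> top = new TreeMap<>();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        top.put(sum, key.toString());
        if (top.size() > N) {
            top.remove(top.firstKey()); // evict the current smallest count
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Emit the survivors once, highest count first, after all input is seen.
        for (Map.Entry<Long, String> e : top.descendingMap().entrySet()) {
            ctx.write(new Text(e.getValue()), new LongWritable(e.getKey()));
        }
    }
}
```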
In the real world, whenever businesses summarize or condense data with
respect to parameters such as data volume, latency, or cost, the volume of
data decreases. In such scenarios, small data solutions such as MPP data
stores, traditional relational databases, and newer NoSQL databases, which
offer the lowest latency, are recommended. Note, however, that when moving
from a small data solution to a Big Data solution, latency increases while
the corresponding cost per gigabyte decreases.
It is well known that Hadoop systems are cost effective. That said, in the
case of small data solutions, where latency is the key factor, customized and
tailored solutions that enable quicker data retrieval will provide the best
results. The primary drawback of these solutions is that their deployment
cost increases the storage cost per gigabyte.
Massively parallel processing (MPP), on the other hand, offers a number of
significant benefits. MPP-data store solutions provide relational stores while
simultaneously accommodating larger sizes of data.
Often, it is best to deploy a combination of these systems to address
business needs.
Analytics with Hadoop
Indirect Analytics Over Hadoop
In this approach, Hadoop is used to clean and transform the data into a
structured form, and the structured data is then loaded into an RDBMS. This
gives the end user the parallel-processing flexibility of Hadoop and a SQL
interface at the summarized-data level, and it is relatively inexpensive when
compared with other options.
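As an illustrative sketch of the load step, the snippet below reads the summarized output of a Hadoop job from HDFS and batch-inserts it into a relational table over JDBC. The file path, connection URL, and table schema are all hypothetical, and a suitable JDBC driver is assumed to be on the classpath:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsToRdbmsLoader {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical output file of an earlier Hadoop cleaning/summarizing job.
        Path summary = new Path("/warehouse/daily_summary/part-r-00000");

        try (Connection db = DriverManager.getConnection(
                     "jdbc:mysql://dbhost:3306/analytics", "user", "password");
             PreparedStatement insert = db.prepareStatement(
                     "INSERT INTO daily_summary (metric, total) VALUES (?, ?)");
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(fs.open(summary)))) {
            String line;
            while ((line = in.readLine()) != null) {
                // MapReduce text output separates key and value with a tab.
                String[] fields = line.split("\t");
                insert.setString(1, fields[0]);
                insert.setLong(2, Long.parseLong(fields[1]));
                insert.addBatch();
            }
            insert.executeBatch(); // one round trip for the whole load
        }
    }
}
```

In practice, a transfer tool such as Apache Sqoop can automate this kind of HDFS-to-RDBMS movement.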
Direct Analytics Over Hadoop
Applying analytics directly over a Hadoop system, without moving the data
into an RDBMS, can be an effective way to analyze data in the Hadoop
Distributed File System (HDFS).
This approach enables both batch and asynchronous analytics of data in the
Hadoop system. This is a very cost-effective approach because it does not
require the management of data sources other than existing Hadoop systems. It
also allows flexibility to scale to any level with summarized data.
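One common way to run such analytics in place is a SQL-on-Hadoop engine such as Apache Hive, named here purely as an illustration; the host, table, and query below are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DirectHadoopAnalytics {
    public static void main(String[] args) throws Exception {
        // HiveServer2 executes the query as batch jobs over data that
        // stays in HDFS; nothing is copied into an external RDBMS.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hadoop-edge:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(amount) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```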
Analytics Over Hadoop with MPP Data Warehouse
Today, a number of options available on the market allow for the integration of
MPP-based data warehouses and Hadoop. These options are worth considering
for large volumes of data.
The primary disadvantage to these approaches, however, is the potential cost
involved. Most MPP-based data warehouses are expensive. Some also require
high-end servers for deployment, which only add to the expense.
Choosing the Right Technologies
To choose the right technology stack, businesses need to weigh the following
three factors against the business use cases they intend to implement:
Cost: The first factor is the cost per terabyte of storage. The next
consideration is the cost related to business continuity and vendor lock-in.
Also, understand how the current system is likely to change with
strategic decisions, and whether these changes would require a different
vendor.
Latency: The next factor to consider is latency requirements. Do any
use cases take the throughput of the system into account? For smaller
data sets, when system response times are critical, MPP-based or
relational database systems are the better choice.
Dollar-per-terabyte: For businesses driven by the dollar-per-terabyte
factor, Impetus advises an MPP-based solution. This option provides a
middle ground between the Hadoop and NoSQL-based solutions, and
can allow storage of large amounts of data without compromising
speed.
For businesses with varying requirements, whose data and related strategies
change frequently, Impetus does not recommend working with a vendor lock-in
model.
Opting for Faster MapReduce/Hadoop
For business requirements driven by cost or business continuity, opt for
Hadoop. Hadoop enables storage of all of the data, though with a relatively
high degree of latency. A few vendors offer faster Hadoop implementations or other
parallel processing frameworks. These solutions usually extend standard
Hadoop APIs and offer enhanced system performance, as well as better support
for the production environment.
NoSQL Database Solutions
OLTP scenarios mean that faster reads and writes are required. The vendors in
this market offer a variety of different solutions with different underlying
implementations, each suited to a different business use case:
HBase and Cassandra are recommended for banking and financial
businesses. For random, real-time read/write access to large, table-like
data, use HBase (a sketch follows this list). For faster writes, look to Cassandra.
MongoDB and CouchDB are recommended when the primary
requirement is the querying of transactional data and defining indexes.
There are also other databases—graph databases like Neo4j for
instance—that make Big-Data-heavy social media analytics problems
simpler.
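The sketch below illustrates the random, real-time read/write pattern that HBase serves well, using the standard HBase Java client; the table and column names are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AccountStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table accounts = conn.getTable(TableName.valueOf("accounts"))) {
            // Real-time write, keyed by account number.
            Put put = new Put(Bytes.toBytes("acct-1001"));
            put.addColumn(Bytes.toBytes("balance"), Bytes.toBytes("current"),
                          Bytes.toBytes("2500.00"));
            accounts.put(put);

            // Random read of a single row; no scan over the table.
            Result row = accounts.get(new Get(Bytes.toBytes("acct-1001")));
            byte[] balance = row.getValue(Bytes.toBytes("balance"),
                                          Bytes.toBytes("current"));
            System.out.println(Bytes.toString(balance));
        }
    }
}
```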
New Era Relational Databases
The latest relational databases (RDBMSs) have been specifically designed with
these OLTP scenarios in mind, and have taken major steps toward addressing
latency issues. Many businesses have been using SQL successfully for the last
several years, and most business users still consider SQL to be the best tool to
query structured data.
Other solutions include emerging sets of technologies and new versions of
existing RDBMS engines that are all very adept at handling large volumes of
structured data.
Therefore, for handling large volumes of structured data, look to new era RDBMS
solutions like MySQL Cluster, GridSQL, or later versions of Microsoft SQL Server.
Impetus Solutions and Recommendations
One way to reduce the cost of data migration is to use MapReduce for ETL,
rather than costly ETL tools.
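As a hedged sketch of this idea, here is a minimal map-only Hadoop job that performs a simple cleaning and normalization pass in place of a specialized ETL tool; the comma-separated record layout and the validation rules are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EtlJob {
    // Map-only: each record is cleaned independently, so no reduce phase
    // (and no shuffle cost) is needed.
    public static class CleanMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");
            if (f.length != 3 || f[1].isEmpty()) {
                return; // drop malformed rows
            }
            String cleaned = f[0].trim() + "," + f[1].toLowerCase() + "," + f[2].trim();
            ctx.write(new Text(cleaned), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mapreduce-etl");
        job.setJarByClass(EtlJob.class);
        job.setMapperClass(CleanMapper.class);
        job.setNumReduceTasks(0); // map-only ETL
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the cleaning logic lives in ordinary mapper code, it scales with the cluster instead of with the license cost of an ETL product.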
Management and provisioning tools are available with commercial Big Data
solutions for easy management of systems. Impetus offers Ankush, a vendor-
neutral tool for cluster management, which can be used to automatically
provision multiple Hadoop clusters.
For ongoing maintenance, Impetus’ mantra for success is, “automate, automate,
automate!” Any task that needs to be carried out more than once should be
automated. This also holds true for monitoring and tuning.
When dealing with changing capacity, continue to add hardware or look for
alternative methods to speed things up. Using graphics processing units for
general purpose computing can also help.
Impetus also recommends RainStor or similar solutions that help compress
data and reduce the cost of the hardware required for data storage.