This document summarizes Aginity's "Big Data" research lab, launched in March 2009. The lab set out to build a 10 terabyte, massively parallel processing (MPP), always-on data warehouse using under $10,000 in commodity hardware and database software costing $15,000 per terabyte. The experiment tests how close we can come, for $10,000 to $20,000, to the $2 million data warehouse of 5 years ago. The lab hosts several large databases and tests complex queries, on-the-fly analytics without pre-aggregation, and MapReduce capabilities on a SUSE Linux platform.
2. Background
Google changed everything….
What makes Google great isn’t the user interface,
or the word processor, or even Gmail, although these are great tools.
What makes Google great is its massive database of searches and indexes to content, which
allows it to understand what you are searching for even better than you do yourself.
Google is a database company. They process more data every day than almost any other company
in the world. And unlike other big data companies, most of Google’s data is unstructured.
To pull this off, Google invented a new class of database that could perform analytics on-the-fly,
“in-database,” on largely unstructured data using large clusters of off-the-shelf computers.
From this work was launched a new class of data warehouse that we believe will change the
world.
3. What Was Our Goal?
We wanted to see what could be built using the framework invented by Google for
under $10,000 in hardware cost and $15,000 per terabyte for the data warehouse
software.
Our goal was to build a 10 terabyte MPP always-on data warehouse using
desktop-class commodity hardware, an open source operating system, and the
leading MPP database software on the planet.
This is a technology sandbox in which we are seeing how close we can get to a $2
million data warehouse of 5 years ago for $10,000 to $20,000.
Obviously, this is not a production-class system, but it is a good illustration of the
power of the latest software-only “Big Data” systems and Aginity’s mastery of
those systems.
4. What Is An MPP Data Warehouse?
MPP, or Massively Parallel Processing, is a class of architectures aimed specifically at addressing
the processing requirements of very large databases. MPP architecture has been accepted as the
only way to go at the high end of the data warehousing world.
Degrees of Massively Parallel Processing
John O'Brien
InfoManagement Direct, February 26, 2009
5. What Is MapReduce?
MapReduce, invented by Google, is a programming model and an associated implementation for processing and
generating large data sets.
The core ideas of MapReduce are:
• MapReduce isn’t about data management, at least not primarily. It’s about parallelism.
• In principle, any alphanumeric data at all can be stuffed into tables. But in high-dimensional scenarios, those tables are
super-sparse. That’s when MapReduce can offer big advantages by bypassing relational databases. Examples of such
scenarios are found in CRM and relationship analytics.
• MapReduce offers dramatic performance gains in analytic application areas that still need great performance speed-up.
• On its own, MapReduce can do a lot of important work in data manipulation and analysis. Integrating it with SQL should
just increase its applicability and power.
• At its core, most data analysis is really pretty simple – it boils down to arithmetic, Boolean logic, sorting, and not a lot
else. MapReduce can handle a significant fraction of that.
• MapReduce isn’t needed for tabular data management. That’s been efficiently parallelized in other ways. But if you want
to build non-tabular structures such as text indexes or graphs, MapReduce turns out to be a big help.
DBMS2
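The last point above — that MapReduce helps build non-tabular structures such as text indexes — can be sketched in a few lines. The following is a minimal, single-process illustration of the programming model, assuming nothing about any vendor's implementation; the function names are ours, and a real system would distribute the map tasks, the shuffle, and the reduce tasks across many machines.

```python
# A MapReduce-style inverted text index, sketched in plain Python.
# Single-process illustration of the model only, not a distributed system.
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit (word, doc_id) pairs for every word in a document."""
    for word in text.lower().split():
        yield word, doc_id

def reduce_fn(word, doc_ids):
    """Reduce: merge all doc ids for one word into a sorted posting list."""
    return word, sorted(set(doc_ids))

def map_reduce(documents, map_fn, reduce_fn):
    intermediate = defaultdict(list)            # "shuffle": group values by key
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

docs = {1: "big data warehouse", 2: "big clusters of commodity machines"}
index = map_reduce(docs, map_fn, reduce_fn)
print(index["big"])   # -> [1, 2]
```

Because each word's posting list is built independently, the reduce step parallelizes naturally — which is exactly the point of the first bullet above.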
6. What are we testing?
• Very large 5 TB database with 2 TB fact table
• Ability to do “on-the-fly” analytics without creating cubes or any form of pre-aggregation at sub-second speed.
• Very large complex queries that span nodes
• The benefits of using the MapReduce indexing model
• In-Database Analytics
• Fault tolerance at scale? What happens if I unplug one of the nodes during a complex process?
7. How much MPP power can $5,682.10 buy in 2009?
At least 10 terabytes. We constructed a 9-box server farm using off-the-shelf components. Our
Chief Architect, Ted Westerheide, personally oversaw the construction of a 10 terabyte enterprise-wide
“data production” system about 10 years ago. The cost at that time? $2.2 million. Here’s the
story of how we built similar capabilities for our lab for $5,682.10 U.S.
[Photos: the system then, our lab, and real-world blade servers]
9. The Databases We Are Testing
Think of these as “The Big Three”. All matter to us and all are in our lab. Databases such as the
ones we work with cost about $15,000 per terabyte per year to operate.
12. MapReduce
MapReduce: Simplified Data Processing on Large Clusters
Google Research
MapReduce is a programming model and an associated implementation for processing and generating large data
sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs,
and a reduce function that merges all intermediate values associated with the same intermediate key. Many real
world tasks are expressible in this model, as shown in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity
machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's
execution across a set of machines, handling machine failures, and managing the required inter-machine
communication. This allows programmers without any experience with parallel and distributed systems to easily
utilize the resources of a large distributed system.
Our [Google’s] implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable:
a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find
the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand
MapReduce jobs are executed on Google's clusters every day….
Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose
computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to
compute various kinds of derived data…continued in paper.
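The paper's canonical example is word count: the map function emits a count per word in each input partition, and the reduce function merges counts that share the same key. The sketch below mimics the run-time system's parallel map phase with a thread pool; this is an assumption-laden, single-machine stand-in for what Google runs across thousands of machines, and the function names and sample data are ours.

```python
# Word count, the canonical MapReduce example, with the map phase run in
# parallel over input partitions. A real run-time would schedule these tasks
# on separate machines; a thread pool stands in for that here.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_words(chunk):
    """Map task: count the words in one input partition."""
    return Counter(chunk.lower().split())

def reduce_counts(partials):
    """Reduce: merge per-partition counts by key (the word)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

partitions = ["big data big clusters", "commodity machines process big data"]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_words, partitions))   # map phase, in parallel
counts = reduce_counts(partials)                        # reduce phase
print(counts["big"])  # -> 3
```

As the excerpt says, the programmer writes only the two small functions; partitioning, scheduling, and failure handling belong to the run-time system, not the application.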
13. In-Database Analytics
In-Database Analytics: A Passing Lane for Complex Analysis
Seth Grimes
Intelligent Enterprise, December 15, 2008
What once took one company three to four weeks now takes four to eight hours thanks to in-database
computation. Here's what Netezza, Teradata, Greenplum and Aster Data Systems are doing to make it
happen.
A next-generation computational approach is earning front-line operational relevance for data warehouses,
long a resource appropriate solely for back-office, strategic data analyses. Emerging in-database analytics
exploits the programmability and parallel-processing capabilities of database engines from vendors Teradata,
Netezza, Greenplum, and Aster Data Systems. The programmability lets application developers move
calculations into the data warehouse, avoiding data movement that slows response time. Coupled with
performance and scalability advances that stem from database platforms with parallelized, shared-nothing
(MPP) architectures, database-embedded calculations respond to growing demand for high-throughput,
operational analytics for needs such as fraud detection, credit scoring, and risk management.
Data-warehouse appliance vendor Netezza released its in-database analytics capabilities last May, and in
September the company announced five partner-developed applications that rely on in-database
computations to accelerate analytics. "Netezza's [on-stream programmability] enabled us to create
applications that were not possible before," says Netezza partner Arun Gollapudi, CEO of Systech Solutions.
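The core idea — move the calculation to the data instead of moving the data to the calculation — can be shown with SQLite from the Python standard library. SQLite is only a stand-in here for the MPP engines the article names (Netezza, Teradata, Greenplum, Aster Data); the table, data, and `risk_score` function are illustrative assumptions.

```python
# Toy illustration of in-database analytics: the engine aggregates and scores
# rows in place, returning only results instead of shipping every row out.
# sqlite3 stands in for a warehouse engine; data and names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (account INTEGER, amount REAL)")
conn.executemany("INSERT INTO txns VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 500.0)])

# In-database aggregation: one summed row per account comes back.
in_db = conn.execute(
    "SELECT account, SUM(amount) FROM txns GROUP BY account"
).fetchall()

# "Programmability": register a custom calculation so it runs inside the
# engine, next to the data, rather than in the application tier.
conn.create_function("risk_score", 1, lambda amt: min(amt / 1000.0, 1.0))
scored = conn.execute("SELECT account, risk_score(amount) FROM txns").fetchall()

# The slow alternative the article warns against: pull all raw rows into the
# application and compute there. Data movement dominates at warehouse scale.
rows = conn.execute("SELECT account, amount FROM txns").fetchall()
```

On three rows the difference is invisible; on a multi-terabyte fact table, avoiding the round trip of raw rows is where the "three to four weeks down to four to eight hours" claim comes from.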
14. Massively Parallel Processing (MPP)
Degrees of Massively Parallel Processing
John O'Brien
InfoManagement Direct, February 26, 2009
The concept of linear growth is obsolete. In the closing decades of the 20th century, we got used to the rapid
pace of change, but the shape of that change was still one of incremental growth. Now we’re contending with
a breakneck speed of change and exponential growth almost everywhere we look, especially with the
information we generate. As documented in “Richard Winter’s Top Ten” report from 2005, the very largest
databases in the world are literally dwarfed by today’s databases.
The fact that the entire Library of Congress’s holdings comprised 20 terabytes of data was breathtaking.
Today, some telecommunications, energy and financial companies can generate that much data in a month.
Even midsized organizations are coping with data sets that will soon outgrow the Library of Congress.
MPP is a class of architectures aimed specifically at addressing the processing requirements of very large
databases. MPP architecture has been accepted as the only way to go at the high end of the data
warehousing world. If it’s so well-suited to the very large data warehouses, why hasn’t everyone adopted it?
The answer lies in its previous complexity. Engineering an MPP system is difficult and remains the purview of
organizations and specialized vendors that have a deep layer of dedicated R&D resources. These specialized
vendors are bringing solutions to the market that shield the user from the complexity of implementing their
own MPP systems. These solutions take a variety of forms, such as custom-built deployments,
software/hardware configurations and all-in-one appliances.
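The shared-nothing idea underneath all of these MPP offerings can be sketched briefly: rows are hash-distributed across nodes by key, each node aggregates only the data it owns, and a coordinator merges the partial results. The node count and data below are illustrative assumptions, and the "nodes" are just Python dictionaries rather than separate machines.

```python
# A shared-nothing MPP sketch: hash-partition rows by key so each "node"
# exclusively owns a subset of keys, aggregate node-locally, then merge.
from collections import defaultdict

NODES = 4  # illustrative; a real cluster would be separate machines

def partition_and_aggregate(rows, n_nodes):
    """Distribute rows by hash of key; each node sums its own keys only."""
    nodes = [defaultdict(float) for _ in range(n_nodes)]
    for key, value in rows:
        nodes[hash(key) % n_nodes][key] += value   # node-local aggregation
    return nodes

rows = [("east", 10.0), ("west", 5.0), ("east", 7.0), ("north", 2.0)]
nodes = partition_and_aggregate(rows, NODES)

# Because a key never spans two nodes, the coordinator simply merges the
# partial dictionaries; no cross-node re-aggregation is needed.
result = {}
for node in nodes:
    result.update(node)
print(result["east"])  # -> 17.0
```

Getting the partitioning, scheduling, and failure handling right across real machines is the "previous complexity" the article describes — and it is exactly what the specialized vendors now shield users from.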