Big Data Systems: Past, Present &
(possibly) Future
Dr. Milind Bhandarkar
Chief Scientist, Machine Learning Platforms,
Greenplum, a Division of EMC²
The data management industry has matured over the last three decades, built primarily on Relational Database Management System (RDBMS) technology. Even today, RDBMSs power a majority of the backend systems for online digital media, financial services, insurance, healthcare, transportation, and telecommunications companies. As the volume, variety, and velocity of data collected and analyzed in enterprises has grown severalfold, organizations have begun to run up against the architectural limitations of traditional RDBMSs. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of "Big Data".
In this article, we trace the origin and history of this new class of systems built to handle "Big Data". We survey current popular big data systems, exemplified by Hadoop, and discuss some current and future use cases of these systems.
What is Big Data?
There is no universally accepted definition of Big Data yet, and while most of the attention in the press is devoted to the "bigness" of Big Data, volume is only one factor in the requirements of modern data processing platforms. Industry analyst firm Gartner[1] defines Big Data as:

"Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making."
A recent IDC study, sponsored by EMC²[2], predicts that the "digital universe", the data generated in digital form by humankind, will double every two years, reaching 40,000 exabytes (40 × 10²¹ bytes) by 2020. A major driving factor behind this data growth is ubiquitous connectivity, via the rapidly growing reach of mobile devices constantly connected to networks. What is even more remarkable is that only a small portion of this digital universe is "visible", that is, the data (videos, pictures, documents, status updates, tweets) created and consumed by consumers. A vast amount of data will be created not "by" human users but "about" them, and it will be stored, managed, and analyzed by enterprises such as Internet service providers and cloud service providers of all varieties (Infrastructure-as-a-Service, Platform-as-a-Service, and Software-as-a-Service).
Origins of Big Data Infrastructure
We already notice this rapid growth of data generation in the online world around us. Facebook grew from one million users in 2004 to more than one billion in 2012, a thousand-fold increase in less than eight years. More than 60% of these users access Facebook from mobile phones today. The value generated by a social network is proportional to the number of contacts between its users, rather than the number of users. Metcalfe's Law[3] holds that the number of possible contacts among N users grows as N(N−1)/2, i.e., quadratically, while its refined variants estimate the realized value as proportional to N log N. Either way, the growth of contacts, and therefore of the interactions within a social network that generate data, is non-linear with respect to the number of users. As the world gets more connected, one can expect the number of interactions to grow, resulting in even more accelerated data growth.
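To make this non-linearity concrete, the short Python sketch below (illustrative only; the user counts are arbitrary) compares linear growth in users against the quadratic count of possible contacts and the N log N estimate of network value.

```python
import math

# Illustrative sketch: as the user base grows linearly, the number of
# possible pairwise contacts (Metcalfe's N*(N-1)/2) and the refined
# N*log(N) estimate of network value grow much faster.
for n in (10**6, 10**7, 10**8, 10**9):
    pairs = n * (n - 1) // 2    # possible contacts among n users
    value = n * math.log(n)     # N log N variant of network value
    print(f"{n:>13,} users -> {pairs:.2e} possible pairs, "
          f"{value:.2e} (N log N)")
```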
Since the popularity of the Internet was one of the main reasons for the growth of communication and connectivity in the world, Big Data platforms first emerged in the Internet industry. Google, founded in 1998 with the goal of organizing all the information in the world, became the dominant content discovery platform on the World Wide Web, trumping human-powered and semi-automated approaches such as web portals and directories. The challenges Google faced in crawling the web, and in storing, indexing, ranking, and serving billions of web pages, could not be solved economically with existing data management systems. The amount of publicly available web content in Google's search index exploded from 26 million pages in 1998 to more than 1 trillion in less than a decade[4]. In addition, this content was "multi-structured", consisting of natural-language text, images, video, geo-spatial data, and even renderings of structured data. In order to answer search queries rapidly, with information ranked by relevance as well as timeliness, Google had to develop its infrastructure from scratch. In 2003 and 2004, Google published details of a part of this infrastructure, in particular the Google File System (GFS)[5] and the MapReduce programming framework[6]. These two publications became the blueprint for Apache Hadoop, an open source framework that has become a de facto standard for the big data platforms deployed today.
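To illustrate the MapReduce model those papers describe, here is a minimal word-count sketch, the canonical MapReduce example, written as a mapper and a reducer in the style used with Hadoop Streaming. This is a sketch of the paradigm under assumed file names (mapper.py, reducer.py), not Google's implementation.

```python
#!/usr/bin/env python
# mapper.py -- the "map" phase: emit (word, 1) for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- the "reduce" phase: sum the counts for each word.
# The framework delivers mapper output sorted by key, so all
# occurrences of a word arrive adjacent to one another.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The pair can be tested locally with a shell pipeline such as `cat input.txt | python mapper.py | sort | python reducer.py`, where `sort` stands in for the framework's shuffle phase; at scale, the same scripts run across a cluster under Hadoop Streaming.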
Apache Hadoop
The GFS and MapReduce papers motivated Doug Cutting, creator of the open-source search engine library Apache Lucene, to re-architect Nutch, the web crawling and content system built on Lucene, to incorporate a distributed file system and a MapReduce programming framework for the tasks of crawling, storing, ranking, and indexing web pages, so that they could be served as search results by Lucene. These developments were noticed by engineers and executives at Yahoo, which was then struggling to scale its search backend infrastructure. Yahoo adopted Apache Hadoop in January 2006, and made significant contributions towards making it a scalable and stable platform. Today, Yahoo has the largest footprint of Apache Hadoop, running more than 45,000 servers that manage more than 370 petabytes of data with Hadoop[7]. Being an open source system, licensed under the liberal Apache Software License and governed by the Apache Software Foundation, meant that Hadoop could be freely downloaded, deployed in any organization, and modified and used without any hard requirement to contribute the changes back to open source. The scalability and flexibility of Apache Hadoop prompted growing Internet companies such as Facebook, Twitter, and LinkedIn to adopt it for their data infrastructure, and contribute
modifications and usability enhancements
back to the Apache Hadoop community.
As a result, the Hadoop ecosystem grew
rapidly over the years.
Today, there are more than 20 components in the Hadoop ecosystem developed as open source projects under the Apache Software Foundation, along with several hundred proprietary and other open source components. Some of the popular components in the Hadoop ecosystem, apart from the Hadoop Distributed File System (HDFS) and MapReduce, include Hive, a SQL-like language that translates to MapReduce; Pig, an imperative data flow language that generates MapReduce jobs to execute the data flow; and HBase, a NoSQL key-value store that uses HDFS as its persistence layer. HBase is based on a paper describing another Google infrastructure component, Bigtable, published in 2006[8].
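As a concrete illustration of HBase's key-value model, the sketch below writes and reads back a row from Python using the third-party happybase client, which talks to HBase through its Thrift gateway. The host, table, and column names here are hypothetical, and the table is assumed to already exist with an `info` column family.

```python
import happybase  # third-party Python client for HBase's Thrift API

# Hypothetical host and table names, for illustration only.
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("user_profiles")

# HBase rows are keyed byte strings; values live in column families.
table.put(b"user#1001", {b"info:name": b"Asha", b"info:city": b"Pune"})

row = table.row(b"user#1001")  # fetch the row back by key
print(row[b"info:name"])       # b'Asha'
```

Because HBase stores rows sorted by key, the choice of row key largely determines scan performance and is the main schema design decision.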
While Hadoop today has become the de facto platform for analyzing Big Data, challenges remain in making it accessible and easy to use, and thus a first-class citizen of the data infrastructure managed by IT professionals. The MapReduce programming paradigm is not particularly easy for data analysts to use, and commonly used business intelligence tools do not yet interoperate with the interfaces Hadoop provides. To overcome these challenges, a number of data warehousing vendors, such as Teradata, Oracle, IBM, and EMC²/Greenplum, offer connectivity with Hadoop platforms. There are even efforts towards unifying SQL-based OLAP platforms, such as Greenplum, with Hadoop[9]. A number of Hadoop distributions have emerged over the years, improving the manageability of Hadoop infrastructure; these include Cloudera, Hortonworks, MapR, EMC²/Greenplum, IBM BigInsights, and Microsoft HDInsight. In addition, there is an increasing number of Big Data appliances: hardware platforms integrated with Hadoop distributions, from vendors including Oracle, Teradata, and EMC²/Greenplum.
Hadoop Adoption & Use Cases
Over the years, Hadoop and other big data technologies have become popular in non-Internet organizations as well, as they too struggle to handle the data deluge. Infrastructure in organizations across industries such as retail, insurance, healthcare, finance, and manufacturing has been almost fully digitized. Until recently, the data these organizations collected was stored in archival systems, mostly for regulatory compliance. However, there is a growing realization across these organizations that this data can be used to gain competitive advantage, increase process efficiencies, and improve customer experience. In a recent study conducted by Tata Consultancy Services (TCS)[10], over 50% of the organizations surveyed were using Big Data technologies, and many of them predicted gains of more than 25% in return on investment (ROI), mostly from increased revenue. The flexibility of these Big Data systems in combining structured datasets (51%) with semi-structured datasets (49%) was cited as enabling advanced analytics capabilities. In addition, while most organizations use data available internally (70%), the availability of external data, such as from Twitter and other social media, allows them to perform better customer behavior analysis.
The three Vs of data, volume, velocity, and variety, along with the need to develop agile, data-driven applications, imply that the humans analyzing data, detecting patterns, and making sense of it need a rich toolset at hand. Traditional data exploration, visualization, business intelligence, and reporting tools are being adapted to co-exist with these new Big Data technologies. Advances in machine learning algorithms and methods, as well as abundant processing power, have brought deep and predictive analytics within reach of the average IT department. Open source languages for statistical analysis and modeling, such as the popular R language[11] and newcomers such as Julia, as well as emerging machine learning frameworks, such as scikit-learn in Python[12], Apache Mahout for Hadoop[13], and the in-database deep analytics library MADlib[14], have attracted the attention of developers and users building machine-learning-powered applications on large and diverse datasets.
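As a small illustration of how routine such predictive analytics have become, the sketch below trains and evaluates a classifier with scikit-learn; the data is synthetic, standing in for features extracted from the large, diverse datasets discussed above.

```python
# A minimal predictive-modeling sketch with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 1,000 samples with 10 features each.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a logistic regression model and evaluate it on held-out data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```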
These new platforms, languages, and frameworks have challenged several predominant practices in enterprises. Traditional data governance practices, including access control, provenance, retention, backup, mirroring, disaster recovery, security, and privacy, are struggling to cope with organizations' ability to store and process massive amounts of diverse data. Over the next few years, one should expect best practices for data governance, and the associated technologies, to emerge and become commonplace.
Industrial Internet: The Next Frontier
While most Big Data use cases today analyze customer behavior, buying patterns, likes and dislikes expressed in social media, clickstreams, and location information from mobile devices, machine-generated data could be the next frontier for Big Data systems. In addition, cheap sensor technology and short-range wireless connectivity have created the possibility of real-time monitoring and historical pattern analysis of traditionally analog information sources. For example, a modern Ford automobile captures thousands of signals from more than 70 sensors, generating more than 25 gigabytes of data every hour, processed by 70 on-board computers[15]. While most of this data is transient and needs to be acted upon in real time, recognizing patterns within the data to improve the safety and usability of the automobile implies aggregating and analyzing it offline.
Indeed, the massive amount of data captured by sensors in machinery, and the possibility of storing and analyzing this data to make intelligent design and operational decisions, has created a new opportunity, now known by a new moniker: the Industrial Internet[16]. If, by analyzing this data to aid better decision making, we could reduce system inefficiencies in the healthcare industry by a mere 1%, it could yield savings of USD 63 billion over the next 15 years. If advanced analytics on the large amounts of oil and gas exploration data produced just a 1% reduction in capital expenditure, it could save more than USD 90 billion over the same period. The key element proposed for the Industrial Internet is intelligent connected machines: advanced sensors for data capture, controls for automation, and software applications powered by deep physics-based analytics and predictive algorithms for analyzing large amounts of sensor and telemetry data.
Indeed, following the industrial revolution and the Internet revolution, we are witnessing a third revolution, that of the Industrial Internet, powered by Big Data.
References
[1] IT Glossary, Gartner Inc., http://www.gartner.com/it-glossary/big-data/
[2] The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, http://www.emc.com/leadership/digital-universe/iview/index.htm, December 2012.
[3] Metcalfe's Law Recurses Down the Long Tail of Social Networks, http://vcmike.wordpress.com/2006/08/18/metcalfe-social-networks/, August 2006.
[4] We knew the web was big, http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html, July 2008.
[5] The Google File System, http://research.google.com/archive/gfs.html, October 2003.
[6] MapReduce: Simplified Data Processing on Large Clusters, http://research.google.com/archive/mapreduce.html, December 2004.
[7] The History of Hadoop: From 4 Nodes to the Future of Data, http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/, March 2013.
[8] Bigtable: A Distributed Storage System for Structured Data, http://research.google.com/archive/bigtable.html, November 2006.
[9] HAWQ: The New Benchmark for SQL on Hadoop, http://www.greenplum.com/blog/dive-in/hawq-the-new-benchmark-for-sql-on-hadoop, February 2013.
[10] The Emerging Returns on Big Data, http://www.tcs.com/big-data-study/Pages/default.aspx, March 2013.
[11] The R Project for Statistical Computing, http://www.r-project.org/.
[12] Scikit-learn: machine learning in Python, http://scikit-learn.org/stable/.
[13] Apache Mahout, http://mahout.apache.org/.
[14] MADlib, http://madlib.net/.
[15] Sensing the Future: Ford Issues Predictions for the Next Wave of Automotive Electronics Innovation, http://media.ford.com/article_display.cfm?article_id=37541, December 2012.
[16] Industrial Internet: Pushing the Boundaries of Minds and Machines, http://www.ge.com/docs/chapters/Industrial_Internet.pdf, November 2012.
About the Author
Dr. Milind Bhandarkar was a founding member of the team at Yahoo that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system, and has been contributing to and working with Hadoop since version 0.1. He started the Yahoo Grid Solutions team, focused on training, consulting, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms have been his area of focus for over 20 years, and were the topic of his PhD dissertation at the University of Illinois at Urbana-Champaign. He has worked at the Centre for Development of Advanced Computing (C-DAC), the National Center for Supercomputing Applications (NCSA), the Center for Simulation of Advanced Rockets, Siebel Systems, PathScale Inc. (acquired by QLogic), Yahoo, and LinkedIn. Currently, he is the Chief Scientist at Greenplum, a division of EMC².