PREDICTIVE ANALYTICS ALONE IS NOT THE ANSWER
By Ron Bodkin and Rick Farnell
What is the Hype About Analytics?

In the last year, the world has seen a significant trend emerge — a much greater capacity to perform analytics as a result of greater data storage and processing power. This trend was underscored in February 2011 in a classic human vs. computer challenge on the TV show Jeopardy!. Watson, an analytic supercomputer built using Hadoop, took on two of the greatest human champions and won.

Since the adoption of the Internet in the late 1990s, there has been an exponential increase in the amount of data being produced, much of it less structured than traditional database data, and a surge of data integration, correlating a wider array of data than ever before assembled. Today's Big Data compute clusters (whether hosted in the cloud or in traditional data centers) are capable of processing massive data sets. Each year brings new milestones in the capacity for processing data, and in the volumes of data that can be usefully integrated. Our connected world is made up of machines, sensors, and humans, all producing data onto the connected network. Where is this data going? It is being consumed, stored, and analyzed by tomorrow's leading organizations.

Think Big Analytics was founded in 2010 to help organizations leverage the power of advanced analytics, making it easier to assemble the right technologies and reduce the time to value gained by applying these techniques. Let's look at some of the new opportunities that we are seeing at our customer deployments.

The opportunity to re-compute analyses. It's impressive to analyze data just once, comprehensively and thoroughly, to yield accurate results. But what happens when you later need to analyze the data from a different perspective, or to add data you didn't have at the time of the original analysis? This capability to re-compute an analysis is becoming more powerful as companies invest in data science capabilities. The very role of a data scientist is a new one, having emerged from this explosion in analytic capabilities. Data science is common in genetics research or in the financial trading industry, but data scientists are beginning to work in retail, advertising, manufacturing, and other domains that have traditionally not included scientists or quants. Algorithm development is now optimized not just for text search relevance but also for advertising scenarios, recommendations, complex trading products, understanding sentiment on social networks, determining security risks across multi-channel outlets, and many more areas.
Flexible data. Flexible data is the ability to compute what you want over a data set. Today there are many articles written on "Big Data," but this trend is not just about the size of data. One organization's Big Data is another company's sample set. It's about flexible data — using data to solve a business problem, often in a way that wasn't anticipated. What happens when your business purpose changes? Can you get access to the original raw data before it was processed? This concept is core to the flexible data principle. Storing data in its raw format and holding on to it for future analysis was possible a couple of years ago, but it was painfully expensive and time consuming. The most common practice at the time was to pre-compute specific summaries in a data warehouse to answer questions that were anticipated in advance, often at great investment of time and money. Thanks to Facebook, Quantcast, and other web properties that invested in building the open source Hadoop distributed storage and processing system, the rest of the world now has the ability to store, access, and process raw data for a relatively small price with unprecedented scalability and flexibility.
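To make the flexible data principle concrete, here is a minimal sketch in Python; the event records and field names are invented for illustration. The pre-computed summary can only answer the question that was planned in advance, while the retained raw events can answer a question nobody anticipated:

    # Hypothetical raw event log, kept in its original form.
    raw_events = [
        {"user": "u1", "page": "/home",     "country": "US", "ms": 110},
        {"user": "u2", "page": "/checkout", "country": "DE", "ms": 480},
        {"user": "u1", "page": "/checkout", "country": "US", "ms": 530},
    ]

    # The warehouse-era approach: a summary pre-computed for one
    # anticipated question (page view counts).
    views_per_page = {}
    for e in raw_events:
        views_per_page[e["page"]] = views_per_page.get(e["page"], 0) + 1

    # The flexible data approach: because the raw events were stored,
    # a brand new question (latency by country) can be computed later.
    from collections import defaultdict
    total_ms = defaultdict(int)
    count = defaultdict(int)
    for e in raw_events:
        total_ms[e["country"]] += e["ms"]
        count[e["country"]] += 1
    avg_ms_by_country = {c: total_ms[c] / count[c] for c in count}

    print(views_per_page)      # {'/home': 1, '/checkout': 2}
    print(avg_ms_by_country)   # {'US': 320.0, 'DE': 480.0}

The summary is frozen at design time; the raw events are not.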
Not a static dashboard any more. Organizations have traditionally invested in analytics to populate static dashboards that reflect metrics deemed to be important at one level of the organization or another. While there are many approaches to building these dashboards, the common theme has been that they are backward looking. What if you wanted to predict what is going to happen in the future? What if your goal was to predict this future in a timely fashion and with high accuracy? What if you wanted to listen to your data and make course corrections to influence these predictions? Leading organizations are designing and developing data science capabilities that can predict their business activities with surprising accuracy. In order to accomplish this, successful companies are making investments in continuous analysis, course correction, flexible data integration, and A/B testing of their algorithms.
Individual level data mining. Traditional analyses are based on aggregating data about the people or events that underlie the data. By contrast, individual level data mining allows investigation down to the details of a single occurrence or a single event, to allow building a deeper understanding of new phenomena. It can also be used to develop algorithms that more effectively predict activity. For example, if you see an increase in communications activity in a certain area, it's important to be able to drill down to see detailed records of actual events that underlie the activity, as well as being able to re-summarize within a small context to get more details about what is happening (e.g., to spot unusual patterns like anomalous levels of communication from the given area to another area, or unusual call durations).
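As an illustration, the following Python sketch drills from an aggregate spike down to the individual records beneath it; the call records, fields, and threshold are all invented:

    import statistics

    # Hypothetical call detail records underlying an activity spike.
    calls = [
        {"src_area": "415", "dst_area": "212", "duration_s": 40},
        {"src_area": "415", "dst_area": "212", "duration_s": 3600},
        {"src_area": "415", "dst_area": "702", "duration_s": 55},
    ]

    # Aggregate view: call counts per source area (where a spike shows up).
    per_area = {}
    for c in calls:
        per_area[c["src_area"]] = per_area.get(c["src_area"], 0) + 1

    # Drill-down: re-summarize within the small context of one area pair
    # and inspect the individual records for unusual durations.
    pair = [c for c in calls if (c["src_area"], c["dst_area"]) == ("415", "212")]
    typical = statistics.median(c["duration_s"] for c in calls)
    unusual = [c for c in pair if c["duration_s"] > 10 * typical]
    print(per_area)   # {'415': 3}
    print(unusual)    # the single anomalous hour-long call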
Predictive models and feedback loop. Predictive models without the ability to monitor actual results are not very useful. Predictive models without the ability to respond based on their predictions are worse. A new cadre of "agile" organizations is building integrated predictive modeling, active listening capabilities, advanced dashboards, and flexible business response procedures. These organizations are more successful because they are able to better harness the explosion of available data and begin to take measures to shape the activity and actions of their clients, partners, and influencers.
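A minimal sketch of such a loop, with hypothetical predictions and outcomes, looks like this in Python. The first half monitors accuracy; the second half turns predictions into a response:

    # Hypothetical model output: probability that each order converts.
    predictions = {"order-1": 0.90, "order-2": 0.20, "order-3": 0.75}
    # What actually happened (1 = converted, 0 = did not).
    actuals = {"order-1": 1, "order-2": 0, "order-3": 0}

    # Monitor: compare the mean predicted rate with the realized rate.
    mean_predicted = sum(predictions.values()) / len(predictions)
    realized = sum(actuals.values()) / len(actuals)
    print("predicted %.2f vs. realized %.2f" % (mean_predicted, realized))

    # Respond: confident predictions that missed get an action, not a report.
    for order, p in predictions.items():
        if p >= 0.7 and actuals[order] == 0:
            print("review %s: model was confident but wrong" % order)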
Predictive analytics examples. Predictive analytics is the practice of applying machine learning principles to drive operational decision-making. For example:

• In online advertising, firms like Quantcast build lookalikes that are predictive models for which ad impressions are likely to convert, and establish a value for real-time bidding on ad exchanges.

• For search, predictive analytics allows optimized results matched to individual interests.

• Brands can anticipate behaviors and shape the sentiment of customers.

• In IT asset management, organizations can predict problems before systems fail, and proactively schedule repairs.

• In banking and insurance, firms can provide risk models for their field organizations to help with decision-making that matches the overall risk preferences of the firm.

• In manufacturing, organizations can predict factory machine failures and have parts pre-ordered and waiting for service installation to minimize production downtime.

• In retail, companies can better predict a customer's preferences, offering unique, personalized recommendations and cross-sales opportunities matched to each individual customer. Notably, Netflix and Amazon have made significant advances in their recommendation engines using these technologies.

• In healthcare, organizations can offer predictive care programs matched to a unique individual based on patterns that exist in data for similar patients.

• Financial trading firms can use predictive analytics and algorithms to model trends in the market to optimize trading decisions.

• All organizations can use predictive analytics to arm themselves with models that can determine fraud and security risks, even detecting the smallest of variances.

• Customer retention can be improved greatly by modeling churn and offering discounted services and products to those customers that are at a higher risk of leaving.1 (A toy scoring sketch follows below.)

The real power of these solutions becomes apparent when the business is able to make changes in real time based on these core predictive capabilities.
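To illustrate the retention example, here is the promised toy sketch of churn scoring in Python. The features and weights are invented rather than fitted; a real model would be trained on historical churn outcomes:

    import math

    # Invented coefficients for a toy logistic model of churn risk.
    weights = {"support_tickets": 0.8, "months_inactive": 0.6, "tenure_years": -0.5}
    bias = -1.0

    def churn_risk(features):
        # Logistic function maps the weighted sum to a probability.
        z = bias + sum(weights[k] * v for k, v in features.items())
        return 1.0 / (1.0 + math.exp(-z))

    customers = {
        "c1": {"support_tickets": 4, "months_inactive": 2, "tenure_years": 1},
        "c2": {"support_tickets": 0, "months_inactive": 0, "tenure_years": 6},
    }
    for cid, f in customers.items():
        risk = churn_risk(f)
        if risk > 0.5:
            print("%s: churn risk %.2f, offer a discounted renewal" % (cid, risk))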
How Can a Company Develop an Advanced Analytics Capability?

The Changing Dynamics of Computing

One of the foundational elements of the new analytics is the ability to apply a scalable amount of computing capacity to problems. With the continued progression of Moore's Law and related increases in computing power, commodity hardware is tremendously powerful nowadays, allowing the application of copious quantities of network bandwidth, storage, CPU, and RAM to distributed computing problems. Notably, some aspects of computing are increasing much faster than others.2,3,4

RESOURCE                            ANNUAL GROWTH RATE
Network Bandwidth in Data Center    60%
Disk Storage Density                60%
CPU Performance                     60%
Disk Transfer Rate                  40%
Random Disk Operations              16%
The increasing density of disk has allowed storage of unprecedented quantities of data, which is one of the key enablers of this trend. Moreover, network bandwidth has grown to the point where servers can now stream reads from their disk at wire speed. When you combine this with the fact that disk transfer rate is lagging storage density, processor performance, and network bandwidth, scaling out becomes vital to allow having enough spindles to sustain high performance data computing.

The rapid increase in performance and rapid decrease in cost of Solid State Drives (SSDs) are combining to transform applications that require low latency random reads of data. In particular, for low latency analytic queries, SSDs can allow much faster analysis and investigation, and support handling larger data sets.
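The arithmetic behind "enough spindles" is worth making explicit. This Python sketch uses illustrative figures (a 2 TB drive streaming at roughly 100 MB/s; real hardware varies):

    # Time to scan one full drive at its sequential transfer rate.
    disk_mb = 2.0e6           # 2 TB expressed in MB
    mb_per_s = 100.0          # sequential transfer rate
    one_disk_hours = disk_mb / mb_per_s / 3600
    print("full scan on 1 spindle: %.1f hours" % one_disk_hours)  # ~5.6 hours

    # Striping the same data across many drives divides the scan time,
    # which is why scale-out matters as density outgrows transfer rate.
    for spindles in (10, 100, 1000):
        minutes = one_disk_hours * 60 / spindles
        print("%4d spindles: %.1f minutes" % (spindles, minutes))

At a fixed transfer rate, a single drive needs hours to read its own contents, so the only way to scan a large data set quickly is to stripe it across many spindles and read them in parallel.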
Another important element of this is the advent of reusable platforms that can be used across many applications and analyses. When Google first introduced a MapReduce computing cluster, there was rapid adoption of the technique, showing both the power of this kind of analytics and the importance of having a reusable system that can be shared across applications. Users of Hadoop and other scale-out clusters have had the same experience in adopting these new techniques.
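The pattern itself is compact. The sketch below shows the map and reduce logic for counting events per URL in Python, in the style a Hadoop Streaming job would wire to standard input and output; the one-line-per-event input format is invented for the example, and the driver at the bottom simply runs the same logic locally:

    from itertools import groupby

    def mapper(lines):
        # Map phase: emit one (url, 1) pair per event line ("user<TAB>url").
        for line in lines:
            user, url = line.rstrip("\n").split("\t")
            yield url, 1

    def reducer(pairs):
        # Reduce phase: pairs arrive sorted by key, so consecutive
        # entries for the same URL can be summed with groupby.
        for url, group in groupby(pairs, key=lambda kv: kv[0]):
            yield url, sum(n for _, n in group)

    if __name__ == "__main__":
        lines = ["u1\t/home", "u2\t/buy", "u1\t/buy"]
        for url, n in reducer(sorted(mapper(lines))):
            print(url, n)   # /buy 2, /home 1

The value of the reusable platform is that the surrounding machinery (splitting the input, sorting between phases, recovering from machine failures) is shared by every job; only the two small functions change from application to application.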
Reference Architecture

The patterns for how data storage and processing are organized for advanced analytics are similar even across different domains. There are three important arenas needed for this data processing:

Event processing: There's typically a need to respond to incoming interactions within milliseconds, e.g., to flag possible fraud, to bid on an auction, to respond to a routing request, or to make a recommendation. Typically these responses involve a fast response based on a model that was previously scored in a cluster. In large volume applications, this response often involves horizontally scaling out a database for reading and writing state, which has been the raison d'être for NoSQL databases. For some applications, there's a need for more advanced correlation among events, which has led to the development of Complex Event Processing systems.
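A sketch of this division of labor, with invented profile data and thresholds, shows how little work the online path has to do:

    # Profiles scored earlier in the batch cluster and pushed to a
    # low latency store; here an ordinary dict stands in for that store.
    precomputed = {
        "cookie-42": {"fraud": 0.03, "bid_value": 1.20},
        "cookie-77": {"fraud": 0.91, "bid_value": 0.10},
    }
    DEFAULT = {"fraud": 0.5, "bid_value": 0.0}

    def handle_event(cookie_id):
        # Millisecond path: one lookup plus a cheap rule, no modeling.
        profile = precomputed.get(cookie_id, DEFAULT)
        if profile["fraud"] > 0.8:
            return "flag for review"
        return "bid %.2f" % profile["bid_value"]

    print(handle_event("cookie-42"))   # bid 1.20
    print(handle_event("cookie-77"))   # flag for review

All of the expensive modeling happened earlier in the cluster; the online path is a lookup and a comparison.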
Batch processing: To respond effectively in near real-time, it's important to apply analytics in advance, by crunching large amounts of data. This is where scale-out clusters, such as those built on Hadoop MapReduce, really shine. Immediately, this includes the production cycle, which involves updating profiles for items (cookies, placements, content, places, devices, etc.) that can in turn be pushed out for real-time event response and analytics. However, the cluster is also used for a science cycle, which is a process of investigation and improvement that's used to improve the production cycle — typically new approaches are simulated in the cluster and, when they appear promising, they are A/B tested.
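As a sketch of that A/B step, with invented traffic counts, a two-proportion z-test is one simple way to decide whether a challenger from the science cycle earns its way into production:

    import math

    # Invented counts: equal traffic to production and to a challenger.
    n = 10000
    conversions = {"production": 200, "challenger": 260}
    p_prod = conversions["production"] / n
    p_chal = conversions["challenger"] / n

    # Two-proportion z-test on the conversion rates.
    pooled = (conversions["production"] + conversions["challenger"]) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_chal - p_prod) / se

    print("production %.2f%%, challenger %.2f%%, z = %.2f"
          % (100 * p_prod, 100 * p_chal, z))
    if z > 1.96:   # roughly the 95% confidence threshold
        print("promote the challenger into the production cycle")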
Fast analytics: Both data scientists and business analysts need access to summarized calculations of common values to explore and visualize data, and to make decisions. Some of these values need to be available quickly to facilitate faster iterations and quick decision making (e.g., for reporting and common decision support needs). This kind of analytic information is typically pre-computed in a cluster in batch, and then exported to a low latency database (whether relational or NoSQL).
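A small Python sketch, with invented click data, shows the shape of the hand-off from batch aggregation to a low latency serving store:

    from collections import defaultdict

    # Hypothetical raw events: (day, country, clicks).
    raw_events = [
        ("2011-03-01", "US", 5),
        ("2011-03-01", "DE", 7),
        ("2011-03-02", "US", 2),
    ]

    # Batch step (on the cluster): pre-aggregate by day and country.
    summary = defaultdict(int)
    for day, country, clicks in raw_events:
        summary[(day, country)] += clicks

    # Export step: flatten to rows keyed for millisecond lookup; these
    # would be loaded into the relational or NoSQL serving database.
    rows = [{"key": "%s:%s" % k, "clicks": v} for k, v in sorted(summary.items())]
    for row in rows:
        print(row)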
A Hadoop cluster becomes a hub of information, both from within an organization and leveraging important data from outside, allowing distillation of information from data. Naturally, data integration is central to making these architectures work, and there are many important technologies and patterns to support integration at the scale and performance required. Managing distributed data is often a challenge – large data sets are slow to move across a WAN, and keeping consistent copies of information among clusters and data centers poses its own challenges. Pooling rich information in one place also makes it important to have effective data security. In regulated industries, there is increased investment in protecting data and providing access controls. Hadoop security now supports client authentication and file-level authorization. Additional security can be provided by encrypting fields at rest and in transit, and with physical separation of data.
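As a sketch of field-level protection, the following Python example pseudonymizes a sensitive field with a keyed one-way hash before the record is stored. This is deliberately simplified: it uses HMAC from the standard library rather than true encryption, and a real deployment would use a vetted encryption library together with proper key management:

    import hashlib
    import hmac

    # Hypothetical key; in practice it lives in a key management system,
    # not in the code and not on the cluster.
    SECRET_KEY = b"example-key-kept-off-cluster"

    def protect(value):
        # Keyed one-way hash: stable enough for joins, but not reversible.
        return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

    record = {"account": "12345678", "amount": "250.00", "country": "US"}
    SENSITIVE_FIELDS = {"account"}
    stored = {k: (protect(v) if k in SENSITIVE_FIELDS else v)
              for k, v in record.items()}
    print(stored)   # the raw account number never reaches disk in the clear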
These new patterns of computing are driving tremendous innovation. There has been considerable investment in open source technologies such as Hadoop, HBase, MongoDB, Membase, Oozie, Flume, Pig, Hive, and R. As the market has expanded, commercial vendors have expanded the investment, building products like Cloudera Enterprise, MapR Technologies, Datameer, IBM Big Sheets, and Karmasphere. It's important to have a good breadth of understanding of the technologies when assembling a solution.

Advanced analytics is a fast moving arena, and as such it is highly desirable to build capability iteratively, with a focus on getting results to business decision makers quickly. This allows the organization to learn and adjust the approach, as well as to get quick feedback on analytic techniques and technologies that are working. Naturally, it also allows for a reduced time to value — getting real results from better analysis, and driving a virtuous cycle of improved data that can be used for future experiments.

In summary, advanced analytics have arrived and are having a significant impact across a wide variety of domains. The unprecedented ability to store and analyze data is allowing for a new class of applications, and bringing more data to bear on decisions than ever before possible.

1 Seven Reasons You Need Predictive Analytics Today, Prediction Impact, 2010.
2 Rules of Thumb in Data Engineering: http://www.slidefinder.net/r/rules_thumb_data/engineering/1062757
3 E.g., 100 megabit Ethernet was first available in 1995, and 100 gigabit Ethernet was first available in 2010, representing a CAGR of 58.5%.
4 http://www.merriam-webster.com/dictionary/moore's%20law
Ron Bodkin is Founder and CEO of Think Big Analytics, which helps customers leverage new data processing
technologies like Hadoop, NoSQL databases, and R for statistical analysis. Previously Ron was the VP of Engineering for
Quantcast. Each day Quantcast ingests 10 billion events and produces more than a petabyte of data using Hadoop. The
Quantcast MapReduce stack handles production data processing, ad hoc analysis, data mining and machine learning.
Prior to that, Ron was a founder of enterprise consulting companies C-bridge Internet Solutions and New Aspects.
Rick Farnell is President and Co-Founder of Think Big Analytics, and has over 15 years of global consulting and
management experience. Rick has held key positions at several successful technology companies including Sun
Microsystems, SeeBeyond, eXcelon and C-bridge Internet Solutions, where he helped grow the firm to employ over 800
consultants, leading to a successful IPO in 1999. Rick is Founder of Rapid Formation which helps incubate, fund, and
scale startup technology companies.