To keep things simple, we typically define Big
Data using four Vs; namely,
volume, variety, velocity, and veracity. We
added the veracity characteristic
recently in response to the quality and source
issues our clients began facing
with their Big Data initiatives. Some analysts
include other V-based descriptors,
such as variability and visibility, but we’ll leave
those out of this discussion.
Volume is the obvious Big Data trait. At the
start of this chapter we rhymed
off all kinds of voluminous statistics that do two
things: go out of date the
moment they are quoted and grow bigger! We
can all relate to the cost of
home storage; we can remember geeking out
and bragging to our friends
about our new 1TB drive we bought for $500;
it’s now about $60; in a couple
of years, a consumer version will fit on your
fingernail.
The thing about Big Data and data volumes is
that the language has
changed. Aggregation that used to be measured
in petabytes (PB) is now
referenced by a term that sounds as if it’s from a
Star Wars movie: zettabytes
(ZB). A zettabyte is a trillion gigabytes (GB), or
a billion terabytes!
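For reference, a sketch of the arithmetic behind those figures, using decimal units:

$$1\,\text{ZB} = 10^{21}\,\text{bytes} = 10^{12}\,\text{GB}\ (\text{a trillion gigabytes}) = 10^{9}\,\text{TB}\ (\text{a billion terabytes}).$$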
Since we’ve already given you some great
examples of the volume of data
in the previous section, we’ll keep this section
short and conclude by referencing
the world’s aggregate digital data growth rate.
In 2009, the world had
about 0.8ZB of data; in 2010, we crossed the
1ZB marker, and at the end of
2011 that number was estimated to be 1.8ZB
(we think 80 percent is quite the
significant growth rate). Six or seven years from
now, the number is estimated
(and note that any future estimates in this book
are out of date the
moment we saved the draft, and on the low side
for that matter) to be around
35ZB, equivalent to about four trillion 8GB
iPods! That number is astonishing
considering it’s a low-sided estimate. Just as
astounding are the challenges
and opportunities that are associated with this
amount of data.
The variety characteristic of Big Data is really
about trying to capture all of the
data that pertains to our decision-making
process. Making sense out of
unstructured data, such as opinion and intent
musings on Facebook, or analyzing
images, isn’t something that comes naturally for
computers. However, this
kind of data complements the data that we use
to drive decisions today. Most
of the data out there is semistructured or
unstructured. (To clarify, all data has
some structure; when we refer to unstructured
data, we are referring to the subcomponents
that don’t have structure, such as the freeform
text in a comments
field or the image in an auto-dated picture.)
Consider a customer call center; imagine being
able to detect the change in
tone of a frustrated client who raises his voice to
say, “This is the third outage
I’ve had in one week!” A Big Data solution
would not only identify the terms
“third” and “outage” as negative polarity
trending to consumer vulnerability,
but also the tonal change as another indicator that a customer churn event
is likely to happen. All of this insight can be
gleaned from unstructured
data. Now combine this unstructured data with
the customer’s record data
and transaction history (the structured data with
which we’re familiar), and
you’ve got a very personalized model of this
consumer: his value, how brittle
he’s become as your customer, and much more.
(You could start this usage
pattern by attempting to analyze recorded calls
not in real time, and mature
the solution over time to one that analyzes the
spoken word in real time.)
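The following is a minimal Python sketch of that idea, not IBM product code: flag negative-polarity phrases and a raised-voice marker in a call transcript, then blend that signal with the customer's structured record into a churn-risk score. The phrase list, weights, and field names are illustrative assumptions.

```python
# Toy churn-risk scoring: combine unstructured call-transcript signals with
# structured customer data. All phrases, weights, and fields are illustrative.

NEGATIVE_PHRASES = ["third outage", "outage", "cancel my account", "fed up"]

def transcript_signal(transcript: str, raised_voice: bool) -> float:
    """Score the unstructured side: negative phrases plus tonal change."""
    text = transcript.lower()
    score = sum(0.2 for phrase in NEGATIVE_PHRASES if phrase in text)
    if raised_voice:            # tonal change detected by the speech pipeline
        score += 0.3
    return min(score, 1.0)

def churn_risk(transcript: str, raised_voice: bool, customer: dict) -> float:
    """Blend unstructured and structured signals into a 0..1 risk estimate."""
    unstructured = transcript_signal(transcript, raised_voice)
    structured = 0.0
    if customer.get("outages_this_month", 0) >= 3:
        structured += 0.3
    if customer.get("tenure_years", 0) < 1:
        structured += 0.2
    return min(unstructured + structured, 1.0)

if __name__ == "__main__":
    caller = {"outages_this_month": 3, "tenure_years": 0.5, "value_tier": "gold"}
    risk = churn_risk("This is the third outage I've had in one week!", True, caller)
    print(f"churn risk: {risk:.2f}")
```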
An IBM business partner, TerraEchos, has
developed one of the most
sophisticated sound classification systems in the
world. This system is used
for real-time perimeter security control; a
thousand sensors are buried underground
to collect and classify detected sounds so that
appropriate action can
be taken (dispatch personnel, dispatch aerial
surveillance, and so on) depending
on the classification. Consider the problem of
securing the perimeter of
a nuclear reactor that’s surrounded by parkland.
The TerraEchos system can
near-instantaneously differentiate the whisper of
the wind from a human
voice, or the sound of a human footstep from the
sound of a running deer.
In fact, if a tree were to fall in one of its
protected forests, TerraEchos can affirm
that it makes a sound even if no one is around to
hear it. Sound classification
is a great example of the variety characteristic of
Big Data.
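As a sketch of the dispatch logic described above (not TerraEchos code), a classifier label can be mapped to a response; the labels and actions here are purely illustrative:

```python
# Hypothetical classification-to-action dispatch for perimeter sensors.
# Real systems classify raw audio features; labels and actions are illustrative.

ACTIONS = {
    "human_voice": "dispatch security personnel",
    "human_footsteps": "dispatch security personnel",
    "vehicle": "dispatch aerial surveillance",
    "wind": "ignore",
    "running_deer": "ignore",
    "falling_tree": "log event",
}

def respond(label: str) -> str:
    return ACTIONS.get(label, "escalate for human review")

print(respond("human_footsteps"))   # dispatch security personnel
print(respond("unknown_rumble"))    # escalate for human review
```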
One of our favorite but least understood
characteristics of Big Data is velocity.
We define velocity as the rate at which data
arrives at the enterprise and is
processed or well understood. In fact, we
challenge our clients to ask themselves,
once data arrives at their enterprise’s doorstep:
“How long does it
take you to do something about it or know it has
even arrived?” Think about it for a moment. The
opportunity cost clock on your data
starts ticking the moment the data hits the wire.
As organizations, we’re taking
far too long to spot trends or pick up valuable
insights. It doesn’t matter
what industry you’re in; being able to more
swiftly understand and respond
to data signals puts you in a position of power.
Whether you’re trying to
understand the health of a traffic system, the
health of a patient, or the health
of a loan portfolio, reacting faster gives you an
advantage. Velocity is perhaps
one of the most overlooked areas in the Big
Data craze, and one in
which we believe that IBM is unequalled in the
capabilities and sophistication
that it provides.
In the Big Data craze that has taken the
marketplace by storm, everyone
is fixated on at-rest analytics, using optimized
engines such as the Netezza
technology behind the IBM PureData System
for Analytics or Hadoop to
perform analysis that was never before possible,
at least not at such a large
scale. Although this is vitally important, we
must nevertheless ask: “How
do you analyze data in motion?” This capability
has the potential to provide
businesses with the highest level of
differentiation, yet it seems to be somewhat
overlooked. IBM InfoSphere Streams (Streams), part of the IBM Big Data
platform, provides a real-time streaming data analytics engine. Streams is a
platform that provides fast, flexible, and
scalable processing of continuous
streams of time-sequenced data packets. We’ll
delve into the details and
capabilities of Streams in Part III, “Analytics for
Big Data in Motion.”
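Streams applications are written against its own runtime and language, so the following is only a generic Python illustration of the underlying idea, continuous processing of time-sequenced tuples over a sliding window, not the Streams API:

```python
# Generic sliding-window stream processing in plain Python (illustration only).
from collections import deque
import time

class SlidingWindowAverage:
    """Keep the last `window_seconds` of (timestamp, value) tuples and
    report a rolling average as each new tuple arrives."""
    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.buffer = deque()          # (timestamp, value)
        self.total = 0.0

    def process(self, timestamp: float, value: float) -> float:
        self.buffer.append((timestamp, value))
        self.total += value
        # Evict tuples that fell out of the time window.
        while self.buffer and timestamp - self.buffer[0][0] > self.window:
            _, old = self.buffer.popleft()
            self.total -= old
        return self.total / len(self.buffer)

if __name__ == "__main__":
    op = SlidingWindowAverage(window_seconds=5.0)
    now = time.time()
    for i, reading in enumerate([10, 12, 11, 50, 13]):
        avg = op.process(now + i, reading)
        print(f"t+{i}s reading={reading} rolling_avg={avg:.1f}")
```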
You might be thinking that velocity can be
handled by Complex Event
Processing (CEP) systems, and although they
might seem applicable on the
surface, in the Big Data world, they fall very
short. Stream processing enables
advanced analysis across diverse data types with
very high messaging data
rates and very low latency (μs to s). For
example, one financial services sector
(FSS) client analyzes and correlates over five
million market messages/
second to execute algorithmic option trades with
an average latency of 30
microseconds. Another client analyzes over
500,000 Internet protocol detail
records (IPDRs) per second, more than 6 billion
IPDRs per day, on more than
4PB of data per year, to understand the trending
and current-state health of their
network. Consider an enterprise network
security problem. In this domain,
threats come in microseconds, so you need technology that can respond and
keep pace. However, you also need something that can capture lots of data
quickly and analyze it to identify emerging signatures and patterns in the
network packets as they flow across the network
infrastructure.
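A hedged sketch of the "emerging signature" idea follows: count packet signatures over a short interval and flag any signature whose rate spikes relative to a historical baseline. The signature tuple and thresholds are assumptions for illustration.

```python
# Toy emerging-pattern detector for a packet stream: flag signatures whose
# frequency in the current interval far exceeds their historical baseline.
from collections import Counter

def detect_emerging(packets, baseline: Counter, spike_factor: float = 5.0, min_count: int = 10):
    """packets: iterable of (src_ip, dst_port, flags) tuples observed this interval."""
    current = Counter(packets)
    emerging = []
    for signature, count in current.items():
        expected = baseline.get(signature, 1)      # avoid division by zero
        if count >= min_count and count / expected >= spike_factor:
            emerging.append((signature, count))
    return emerging

if __name__ == "__main__":
    baseline = Counter({("10.0.0.5", 443, "SYN"): 100})
    interval = [("10.0.0.5", 443, "SYN")] * 120 + [("198.51.100.7", 22, "SYN")] * 40
    for sig, count in detect_emerging(interval, baseline):
        print("possible emerging pattern:", sig, "count:", count)
```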
Finally, from a governance perspective,
consider the added benefit of a Big
Data analytics velocity engine: If you have a
powerful analytics engine that
can apply very complex analytics to data as it
flows across the wire, and you
can glean insight from that data without having
to store it, you might not
have to subject this data to retention policies,
and that can result in huge savings
for your IT department.
Today’s CEP solutions are targeted at tens of thousands of
messages/second at best, with seconds-to-minutes latency. Moreover, the
analytics are mostly rules-based and applicable
only to traditional data
types (as opposed to the TerraEchos example
earlier). Don’t get us wrong;
CEP has its place, but it has fundamentally
different design points. CEP is a
non-programmer-oriented solution for the
application of simple rules to
discrete, “complex” events.
Note that not a lot of people are talking about
Big Data velocity, because
there aren’t a lot of vendors that can do it, let
alone integrate at-rest technologies
with velocity to deliver economies of scale for
an enterprise’s current
investment. Take a moment to consider the
competitive advantage that your
company would have with an in-motion, at-rest
Big Data analytics platform,
by looking at Figure 1-1 (the IBM Big Data
platform is covered in detail in
Chapter 3).
You can see how Big Data streams into the
enterprise; note the point at
which the opportunity cost clock starts ticking
on the left. The more time
that passes, the less the potential competitive
advantage you have, and the
less return on data (ROD) you’re going to
experience. We feel this ROD metric will come to dominate the future
IT landscape in a Big Data world: we’re used to talking about return on
investment (ROI), which covers the entire solution investment;
however, in a Big Data world,
ROD is a finer granularization that helps fuel
future Big Data investments.
Traditionally, we’ve used at-rest solutions
(traditional data warehouses,
Hadoop, graph stores, and so on). The T box on
the right in Figure 1-1
represents the analytics that you discover and
harvest at rest (in this case,
it’s text-based sentiment analysis).
Unfortunately, this is where many
vendors’ Big Data talk ends. The truth is that
many vendors can’t help you
build the analytics; they can only help you execute them. This is a key
differentiator that you’ll find in the IBM Big
Data platform. Imagine being
able to seamlessly move the analytic artifacts
that you harvest at rest and
apply that insight to the data as it happens in
motion (the T box by the
lightning bolt on the left). This changes the
game. It makes the analytic
model adaptive, a living and breathing entity
that gets smarter day by day
and applies learned intelligence to the data as it
hits your organization’s
doorstep. This model is cyclical, and we often
refer to this as adaptive
analytics because of the real-time and closed-
loop mechanism of this
architecture.
The ability to have seamless analytics for both
at-rest and in-motion data
moves you from the forecast model that’s so
tightly aligned with traditional
warehousing (on the right) and energizes the
business with a nowcast model.
The whole point is getting the insight you learn
at rest to the frontier of the
business so it can be optimized and understood
as it happens. Ironically, the more times the enterprise goes through this
adaptive analytics cycle, the smarter it gets.
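A minimal sketch of that closed loop, under the assumption of a trivial model harvested from data at rest and then applied record by record in motion, with periodic re-harvesting as new records land at rest (this is an illustration, not the IBM platform's API):

```python
# Closed-loop "adaptive analytics" sketch: harvest a model from data at rest,
# score data in motion, then fold newly landed data back in and re-harvest.

def train_at_rest(history):
    """At-rest harvesting: learn a simple threshold from historical records.
    Here the 'model' is just the mean of a numeric feature plus a margin."""
    values = [r["value"] for r in history]
    return {"threshold": sum(values) / len(values) * 1.5}

def score_in_motion(record, model):
    """In-motion scoring: apply the harvested model to each record as it arrives."""
    return record["value"] > model["threshold"]

if __name__ == "__main__":
    at_rest = [{"value": v} for v in (10, 12, 11, 9, 13)]
    model = train_at_rest(at_rest)

    stream = [{"value": 12}, {"value": 30}, {"value": 11}]
    for i, record in enumerate(stream, 1):
        flagged = score_in_motion(record, model)
        print(f"record {i}: value={record['value']} flagged={flagged}")
        at_rest.append(record)             # new data lands at rest...
        if i % 2 == 0:                     # ...and the model is re-harvested periodically
            model = train_at_rest(at_rest)
```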
Veracity is a term that’s being used more and
more to describe Big Data; it
refers to the quality or trustworthiness of the
data. Tools that help handle Big
Data’s veracity transform the data into
trustworthy insights and discard
noise.
Collectively, a Big Data platform gives
businesses the opportunity to analyze
all of the data (whole population analytics), and
to gain a better understanding
of their business, their customers, the
marketplace, and so on. This
opportunity leads to the Big Data conundrum:
although the economics of
deletion have caused a massive spike in the data
that’s available to an organization,
the percentage of the data that an enterprise can
understand is on
the decline. A further complication is that the
data that the enterprise is trying
to understand is saturated with both useful
signals and lots of noise (data
that can’t be trusted, or isn’t useful to the
business problem at hand), as
shown in Figure 1-2.
We all have firsthand experience with this;
Twitter is full of examples of
spambots and directed tweets, which are untrustworthy data. The 2012 presidential
election in Mexico became a Twitter veracity example, with fake accounts that
polluted political discussion, introduced derogatory hashtags, and more. Spam is nothing new to
folks in IT, but you
need to be aware that in the Big Data world,
there is also Big Spam potential,
and you need a way to sift through it and figure
out what data can and
can’t be trusted. Of course, there are words that
need to be understood in
context, jargon, and more (we cover this in
Chapter 8).
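As a toy illustration of sifting signal from this kind of noise, the sketch below applies a few hypothetical spam heuristics (account age, duplicated text, hashtag stuffing) to a tweet stream; real veracity pipelines are far more involved.

```python
# Toy veracity filter for a tweet stream: discard likely-spam messages using a
# few hypothetical heuristics. Thresholds and fields are illustrative only.
from collections import Counter

def likely_spam(tweet: dict, seen_texts: Counter) -> bool:
    text = tweet["text"]
    if tweet.get("account_age_days", 0) < 2 and "http" in text:
        return True                       # brand-new account pushing links
    if seen_texts[text] >= 3:
        return True                       # same text repeated by many accounts
    if text.count("#") > 5:
        return True                       # hashtag stuffing
    return False

def sift(tweets):
    seen = Counter()
    kept = []
    for t in tweets:
        if not likely_spam(t, seen):
            kept.append(t)
        seen[t["text"]] += 1
    return kept

if __name__ == "__main__":
    stream = [
        {"text": "My phone died again, I need a new one", "account_age_days": 900},
        {"text": "WIN A FREE PHONE http://spam.example", "account_age_days": 1},
    ]
    for t in sift(stream):
        print("kept:", t["text"])
```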
As previously noted, embedded within all of this
noise are useful signals:
the person who professes a profound disdain for
her current smartphone
manufacturer and starts a soliloquy about the
need for a new one is expressing
monetizable intent. Big Data is so vast that
quality issues are a reality, and
veracity is what we generally use to refer to this
problem domain. The fact
that one in three business leaders don’t trust the
information that they use to
make decisions is a strong indicator that a good Big Data platform must address the veracity challenge.

BIG DATA – Nathan Marz
1.5 Desired Properties of a Big Data
System
The properties you should strive for in Big Data systems are as much about
complexity as they are about scalability. Not only must a Big Data system perform
well and be resource-efficient, it must be easy to reason about as well. Let's go
over each property one by one. You don't need to memorize these properties, as we
will revisit them as we use first principles to show how to achieve these properties.

1.5.1 Robust and fault-tolerant
Building systems that "do the right thing" is
difficult in the face of the challenges
of distributed systems. Systems need to behave
correctly in the face of machines
going down randomly, the complex semantics
of consistency in distributed
databases, duplicated data, concurrency, and
more. These challenges make it
difficult just to reason about what a system is
doing. Part of making a Big Data
system robust is avoiding these complexities so
that you can easily reason about
the system.
Additionally, it is imperative for systems to be
"human fault-tolerant." This is
an oft-overlooked property of systems that we
are not going to ignore. In a
production system, it's inevitable that someone
is going to make a mistake
sometime, like by deploying incorrect code that
corrupts values in a database. You
will learn how to bake immutability and
recomputation into the core of your
systems to make your systems innately resilient
to human error. Immutability and
recomputation will be described in depth in
Chapters 2 through 5.
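A minimal sketch of the immutability-plus-recomputation idea: raw facts are only ever appended to the master dataset, and any view is derived by recomputing a pure function over all facts. A buggy deploy can corrupt a derived view but never the source data, so the fix is to correct the code and recompute. The fact schema below is an illustrative assumption.

```python
# Immutability + recomputation sketch: facts are append-only; views are
# recomputed from scratch as a function of all facts.

master_dataset = []                       # append-only log of immutable facts

def record_fact(fact: dict) -> None:
    master_dataset.append(fact)           # never update or delete in place

def compute_balances(facts) -> dict:
    """Derived view: account balances = function(all facts)."""
    balances = {}
    for f in facts:
        balances[f["account"]] = balances.get(f["account"], 0) + f["amount"]
    return balances

if __name__ == "__main__":
    record_fact({"account": "alice", "amount": 100})
    record_fact({"account": "alice", "amount": -30})
    record_fact({"account": "bob", "amount": 50})
    print(compute_balances(master_dataset))   # {'alice': 70, 'bob': 50}
```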
1.5.2 Low latency reads and updates
The vast majority of applications require reads
to be satisfied with very low
latency, typically between a few milliseconds and a few hundred milliseconds. On
the other hand, the update latency requirements
vary a great deal between
applications. Some applications require updates
to propagate immediately, while in
other applications a latency of a few hours is
fine. Regardless, you will need to be
able to achieve low latency updates when you
need them in your Big Data systems.
More importantly, you need to be able to
achieve low latency reads and updates
without compromising the robustness of the
system. You will learn how to achieve
low latency updates in the discussion of the
"speed layer" in Chapter 7.
1.5.3 Scalable
Scalability is the ability to maintain
performance in the face of increasing data
and/or load by adding resources to the system.
The Lambda Architecture is
horizontally scalable across all layers of the
system stack: scaling is accomplished
by adding more machines.
1.5.4 General
A general system can support a wide range of
applications. Indeed, this book
wouldn't be very useful if it didn't generalize to
a wide range of applications! The
Lambda Architecture generalizes to applications
as diverse as financial
management systems, social media analytics,
scientific applications, and social
networking.
1.5.5 Extensible
You don't want to have to reinvent the wheel
each time you want to add a related
feature or make a change to how your system
works. Extensible systems allow
functionality to be added with a minimal
development cost.
Oftentimes a new feature or change to an
existing feature requires a migration
of old data into a new format. Part of a system
being extensible is making it easy to
do large-scale migrations. Being able to do big
migrations quickly and easily is
core to the approach you will learn.
1.5.6 Allows ad hoc queries
Being able to do ad hoc queries on your data is
extremely important. Nearly every
large dataset has unanticipated value within it.
Being able to mine a dataset
arbitrarily gives opportunities for business
optimization and new applications.
Ultimately, you can't discover interesting things
to do with your data unless you
can ask arbitrary questions of it. You will learn
how to do ad hoc queries in
Chapters 4 and 5 when we discuss batch
processing.
1.5.7 Minimal maintenance
Maintenance is the work required to keep a
system running smoothly. This
includes anticipating when to add machines to
scale, keeping processes up and
running, and debugging anything that goes
wrong in production.
An important part of minimizing maintenance is
choosing components that
have as small an implementation complexity as possible. That is, you want to rely
on components that have simple mechanisms
underlying them. In particular,
distributed databases tend to have very
complicated internals. The more complex a
system, the more likely something will go
wrong and the more you need to
understand about the system to debug and tune
it.
You combat implementation complexity by
relying on simple algorithms and
simple components. A trick employed in the
Lambda Architecture is to push
complexity out of the core components and into
pieces of the system whose
outputs are discardable after a few hours. The
most complex components used, like
read/write distributed databases, are in this layer
where outputs are eventually
discardable. We will discuss this technique in
depth when we discuss the "speed
layer" in Chapter 7.
1.5.8 Debuggable
A Big Data system must provide the information necessary to debug the system
when things go wrong. The key is to be able to trace, for each value in the system,
exactly what caused it to have that value.

Achieving all these properties together in one system seems like a daunting
challenge. But by starting from first principles, these properties naturally emerge
from the resulting system design. Let's now take a look at the Lambda Architecture,
which derives from first principles and satisfies all of these properties.
Computing arbitrary functions on an arbitrary
dataset in realtime is a daunting
problem. There is no single tool that provides a
complete solution. Instead, you
have to use a variety of tools and techniques to
build a complete Big Data system.
The Lambda Architecture solves the problem of
computing arbitrary functions
on arbitrary data in realtime by decomposing the
problem into three layers: the batch layer, the serving layer, and the speed layer.
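The canonical formulation is: batch view = function(all data), realtime view = function(realtime view, new data), query = function(batch view, realtime view). The toy Python sketch below illustrates merging the two views at query time; the page-view example and function names are assumptions for illustration.

```python
# Toy Lambda Architecture query merge: the batch layer precomputes a view over
# the (slightly stale) master dataset, the speed layer maintains a realtime view
# over data that arrived since the last batch run, and queries merge the two.

def batch_view(master_dataset):
    """batch view = function(all data): page-view counts, recomputed from scratch."""
    counts = {}
    for event in master_dataset:
        counts[event["url"]] = counts.get(event["url"], 0) + 1
    return counts

def update_realtime_view(realtime, event):
    """realtime view = function(realtime view, new data): incremental update."""
    realtime[event["url"]] = realtime.get(event["url"], 0) + 1

def query(url, batch, realtime):
    """query = function(batch view, realtime view)."""
    return batch.get(url, 0) + realtime.get(url, 0)

if __name__ == "__main__":
    master = [{"url": "/home"}, {"url": "/home"}, {"url": "/pricing"}]
    batch = batch_view(master)            # produced by the last batch run

    realtime = {}                         # covers events since that run
    update_realtime_view(realtime, {"url": "/home"})

    print(query("/home", batch, realtime))     # 3
```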
