1. Why Real-Time Analytics?
Explore the technology that enables real-time analytics and streaming data processing, and
how it differs from the world of Hadoop and batch analytics.
The Chimp Way:
Using the right tool for each job
At Infochimps, we abide by the philosophy that you should use
the right tool for each job. Why lock in to one set of technologies
or techniques? Depending on what you are trying to accomplish
- the questions you want to ask of your data, or the applications
and visualizations you build on top of that data - different
technologies are best suited for each unique task. You should have all
the best tools at your fingertips for each task. Infochimps excels at
systems and technology integration -- we can take your existing
tools, add powerful new ones from our kit, and glue them together
into a unified whole.
We also strongly embrace open source technologies as part of
a complete data solution. Not only do you benefit from the active
participation of the open source community -- you aren’t limited
to a proprietary vendor’s finite feature set and integration
connectors. We use Hadoop, Elasticsearch, Flume, Ironfan, and
Wukong, among other world-class open source tools that work
flexibly with each other and the rest of the tools in your enterprise.
© 2012 Infochimps, Inc. All rights reserved. 1
2. The Hadoop & NoSQL conundrum
Hadoop is a powerful framework for Big Data analytics. It simplifies the analysis of massive sets of
data by distributing the computation load across many processes and machines. Hadoop embraces
a map/reduce framework, which means analytics are performed as batch processes. Depending on
the quantity of data and the complexity of the computation, running a set of Hadoop jobs could take
anywhere from a few minutes to many days. Batch analytics tool sets like Hadoop are great for doing
one-off reports, a recurring schedule of periodic runs, or setting up dedicated data exploration
environments. However, waiting hours for the analysis you need means you aren’t able to get real-time
answers from your data. Hadoop analysis ends up being a rear-view mirror instead of a pulse on the
moment.
NoSQL databases are extremely powerful, but come with certain challenges of their own
At Infochimps we use Hadoop to run map/reduce jobs against scalable, NoSQL data stores like
HBase, Cassandra, or Elasticsearch. These databases are extremely good at enabling fast queries
against many terabytes of data, but each makes certain tradeoffs to enable this ability. One major
tradeoff, common across all three of these examples, is the inability to do SQL-like joins -- the ability
to combine data from one database table with data from another table.
The usual way we work around this tradeoff is to practice denormalization. Imagine we’re asking a
question such as “Find all posts that contain the phrase ‘Coca-Cola’ from all authors based in
Spokane, Washington”. In a traditional relational (SQL) database, a table of “posts” would join against
a table of “authors” using a shared key like an author’s ID number. In NoSQL databases,
denormalization consists of inserting a copy of the author into each row of their posts. Rather than joining the
posts table with the authors table during the query a la SQL, all the authors’ data is already contained
within the posts table before the query.
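The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: the sample authors, posts, and field names are hypothetical, and plain dictionaries stand in for real database tables.

```python
# Hypothetical sample data; the field names are illustrative assumptions.
authors = {
    1: {"name": "Ann", "city": "Spokane", "state": "WA"},
    2: {"name": "Bob", "city": "Austin", "state": "TX"},
}
posts = [
    {"id": 10, "author_id": 1, "text": "Nothing beats an ice-cold Coca-Cola"},
    {"id": 11, "author_id": 2, "text": "Coca-Cola with lunch today"},
    {"id": 12, "author_id": 1, "text": "Rainy day in Spokane"},
]


def query_with_join(posts, authors, phrase, city):
    """Relational style: look up each post's author at query time."""
    return [
        p["id"]
        for p in posts
        if phrase in p["text"] and authors[p["author_id"]]["city"] == city
    ]


def denormalize(posts, authors):
    """Copy the author record into every post row, ahead of any query."""
    return [{**p, "author": dict(authors[p["author_id"]])} for p in posts]


def query_denormalized(rows, phrase, city):
    """NoSQL style: each row already carries everything the query needs."""
    return [
        r["id"]
        for r in rows
        if phrase in r["text"] and r["author"]["city"] == city
    ]


rows = denormalize(posts, authors)
print(query_with_join(posts, authors, "Coca-Cola", "Spokane"))
print(query_denormalized(rows, "Coca-Cola", "Spokane"))
```

Both queries return the same post. The denormalized version simply never needs to consult the authors table while the query runs, at the cost of storing the author data once per post.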
The question then becomes: when should the denormalization of our NoSQL database occur? One
option is to use Hadoop to “backfill” denormalized data from normalized tables before running these
kinds of queries. This approach is perfectly workable, but it suffers from the same “rear-view mirror”
problem as Hadoop-based batch analytics -- we still cannot perform complex queries of real-time
data. What if we could write denormalized data on the fly: write each incoming Twitter post into
a row in the posts table, and augment that row with information on the author in real-time? This would
keep all data denormalized at all times, always ready for downstream applications to run complex
queries and generate rich, real-time business insights. Real-time analytics and stream processing
make this possible.
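As a rough sketch of what writing denormalized data on the fly might look like, the Python below augments each post as it arrives. The names `augment_stream`, `author_lookup`, `live_feed`, and `posts_table` are hypothetical stand-ins for a real feed, author store, and database table; they are not part of any Infochimps tool.

```python
def augment_stream(incoming_posts, author_lookup):
    """Yield each post with its author's record attached, one at a time."""
    for post in incoming_posts:
        # Unknown authors get an empty record rather than stopping the stream.
        author = author_lookup.get(post["author_id"], {})
        yield {**post, "author": dict(author)}


# Hypothetical stand-ins for a live feed, an author store, and the posts table.
author_lookup = {1: {"name": "Ann", "city": "Spokane"}}
live_feed = iter([
    {"id": 10, "author_id": 1, "text": "Nothing beats an ice-cold Coca-Cola"},
    {"id": 11, "author_id": 7, "text": "First post!"},
])
posts_table = []

for row in augment_stream(live_feed, author_lookup):
    posts_table.append(row)  # the row is query-ready the moment it lands
```

Because the generator handles one post at a time, no post waits for a later batch job before it can be queried.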
3. Real-time + Big Data = Stream Processing
In situations where you need to make well-informed, real-time decisions, good data isn’t enough. It
must be timely and actionable. As a mutual fund operator, you can’t wait hours to analyze whether or
not it’s the right moment to sell 200,000 stock shares. As CMO, you can’t wait days to see if there is a
PR crisis occurring around your brand. The time window for data analysis is shrinking, and you need
a different set of tools to get these on-the-fly answers.
Batch Versus Streaming
Consider two hypothetical sandwich makers. Each company makes great sandwiches, but chooses to
deliver them to their customers either in batches or in near real-time.
The Batch Sub Shop can provide large quantities of sandwiches by leveraging many people to
accomplish the overall project. Similarly, batch analytics can leverage multiple machines to accomplish
a set of analytics jobs. By adding more resources, we can increase the speed with which the tasks
are accomplished, but at a higher cost.
Contrast that with the Streaming Sub Shop, which doesn’t deliver a huge set of sandwiches all at
once, but does quickly create sandwiches on the fly. The process aims to get a sandwich in the
customer’s hand as soon as possible. Real-time analytics works the same way by processing data the
moment it is collected. If the data is coming in too quickly, we can flexibly increase the resources that
support our real-time workflow. Is the toasting process the bottleneck of our production line? We can
easily add a couple of additional toasters.
As you can imagine, the ideal sandwich company probably combines both the ability to cater large
orders ahead of time and an in-store, made-to-order business. Likewise, your organization can leverage
both batch analytics and real-time analytics depending on your business needs. Batch analytics is
the most efficient way to process a large quantity of data when timing is not critical. Real-time
analytics and stream processing are the answer when the timeliness of your insights is important, when you
need to scalably process a very large influx of live data, or when NoSQL databases cannot answer the
questions you are asking.
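To make the contrast concrete, here is a minimal Python sketch that computes the same average both ways. The function names and the sample prices are hypothetical, and a real system would distribute this work across machines rather than run it in one process.

```python
def batch_average(values):
    """Batch style: wait for the full data set, then compute once."""
    values = list(values)
    return sum(values) / len(values)


def streaming_average(values):
    """Streaming style: yield an up-to-date average after every value."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


prices = [10.0, 12.0, 11.0, 13.0]  # hypothetical stock prices

print(batch_average(prices))          # one answer, after all data is in
running = list(streaming_average(prices))
print(running)                        # an answer after every new price
```

The final streaming value matches the batch result. The difference is that the streaming version has a usable answer after every data point instead of only at the end.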
5. How Does Real-Time Analytics Work?
1. Collect real-time data. Real-time data is being generated all the time. If you are a mutual fund
operator, it’s real-time stock price data. If you are a CMO, it’s real-time social media posts and
Google search results. Typically this data is live streaming data. That means the moment the stock
price changes, we can grab that data point - like a faucet of running water. We collect live data by
“hooking a hose up” to the faucet stream to capture that information in real-time. A lot of different
vocabulary exists to describe these “hoses” including calling them scrapers, collectors, agents,
and listeners.
2. Process the data as it flows in. The key to real-time analytics is that we cannot wait until later to
do things to our data; we must analyze it instantly. Stream processing (also known as streaming
data processing) is the term used for doing things to data instantly as it’s collected. Actions that
you can perform in real-time include splitting data, merging it, doing calculations, connecting it with
outside data sources, forking data to multiple destinations, and more.
3. Reports and dashboards access processed data. Now that data has been processed, it is
reliably delivered to the databases that power your reports, dashboards, and ad-hoc queries. Just
seconds after the data was collected, it is now visible in your charts and tables. Since real-time
analytics and stream processing are flexible frameworks, you can utilize whatever tools you prefer,
whether that’s Tableau, Pentaho, GoodData, a custom application, or something else. Integration
is Infochimps’ forte.
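The three steps above can be sketched as a chain of Python generators. This is an illustration under stated assumptions, not a description of how any particular product implements them: the `collect`, `process`, and `deliver` functions, the `XYZ` ticker, and the list-based destinations are all hypothetical.

```python
def collect(source):
    """Step 1: a 'hose' that hands over events as the source produces them."""
    for event in source:
        yield event


def process(events):
    """Step 2: transform each event in flight (here: add the price change)."""
    previous = None
    for event in events:
        change = None if previous is None else round(event["price"] - previous, 2)
        previous = event["price"]
        yield {**event, "change": change}


def deliver(events, *destinations):
    """Step 3: fork every processed event to all destinations."""
    for event in events:
        for destination in destinations:
            destination.append(event)


# Hypothetical stock ticks and two list-based stand-ins for databases.
ticks = [{"symbol": "XYZ", "price": 10.0}, {"symbol": "XYZ", "price": 10.5}]
dashboard_db, archive = [], []

deliver(process(collect(ticks)), dashboard_db, archive)
```

Each event passes through all three steps before the next one is read, so a dashboard reading from `dashboard_db` sees processed data as soon as it arrives.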
6. What Can You Do With Stream Processing?
Augment
• Enhance your sales leads - IP addresses of visitors to your website are augmented by the
“company name” associated with that visitor if they are coming from an enterprise. Email
addresses get linked to Twitter and Facebook handles to help your sales team leverage
social selling.
• Real-time social media analytics - tweets that mention the brands you are tracking are
augmented with a sentiment score (how positive or negative the comment was) and an influencer
score (such as Klout). Know instantly if positive news breaks or a PR crisis arises. Instantly
gain insight into how influential people are and on what topics.
Process and Transform
• On-the-fly analytics reporting - Reformat a tweet on the fly to fit into an agency’s data model so
that the data is visible in our reporting application immediately upon landing in the database.
• SQL-like data queries - Implement a denormalization policy to allow for doing complex
JOIN-like queries in real-time in downstream analytics applications.
• Stock price algorithms - Implement your stock analyzer algorithm mid-stream. Instantly after
an updated stock price is received, the data is processed through the algorithm, and placed in
your reporting database.
Calculate
• Usage monitoring - Track the number of social media posts mentioning your client company’s
brand. See at any given moment how much a brand is buzzing, and even set up tiered pricing
based on how many social posts you are collecting on a client’s behalf.
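As one possible illustration of usage monitoring, the Python sketch below keeps a running count of brand mentions and maps each count to a pricing tier. The `MentionMonitor` class, the brand names, and the tier thresholds are all hypothetical, and the substring matching is deliberately simplistic.

```python
from collections import Counter

# Hypothetical pricing tiers: (minimum posts collected, tier name).
TIERS = [(0, "basic"), (3, "standard"), (5, "premium")]


def tier_for(count):
    """Return the highest tier whose threshold the count has reached."""
    name = TIERS[0][1]
    for minimum, tier_name in TIERS:
        if count >= minimum:
            name = tier_name
    return name


class MentionMonitor:
    """Keep a running per-brand count of posts that mention the brand."""

    def __init__(self, brands):
        self.brands = list(brands)
        self.counts = Counter()

    def observe(self, text):
        """Update counts for one incoming post (at most once per brand)."""
        lowered = text.lower()
        for brand in self.brands:
            if brand.lower() in lowered:
                self.counts[brand] += 1

    def tier(self, brand):
        return tier_for(self.counts[brand])


monitor = MentionMonitor(["Acme", "Globex"])
for post in [
    "Loving my new Acme anvil",
    "acme support was great",
    "Globex is hiring",
    "ACME ACME ACME",
]:
    monitor.observe(post)

print(monitor.counts["Acme"], monitor.tier("Acme"))
print(monitor.counts["Globex"], monitor.tier("Globex"))
```

Because counts are updated per post, the current buzz level and billing tier for any brand can be read at any moment without waiting for a batch run.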
7. Real-time analytics with the Infochimps Platform
Apache Flume
While initially built for log collection and routing, Flume has evolved to confidently serve the roles of
general data transport and streaming data processing. Flume does more than reliably deliver data from a
source to a destination. With the right optimizations, a single Flume system can ingest many
terabytes of data per day, from thousands of data sources. As data flows in, you can do things to that
data, such as add additional data, do calculations, run algorithms, split data, merge data, etc. In
Flume lingo, these actions are powered by scripts called decorators, which perform the stream
processing required for real-time analytics.
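The general idea of a decorator, a small processing step wrapped around a destination, can be sketched in Python as follows. This does not use or reproduce Flume's actual API: `list_sink`, `add_field`, and `chain` are hypothetical helpers written only to show the pattern.

```python
def list_sink(storage):
    """A sink: the final destination for an event (here, a plain list)."""
    def sink(event):
        storage.append(event)
    return sink


def add_field(key, compute):
    """Build a decorator that adds a computed field, then passes the event on."""
    def decorator(next_sink):
        def decorated(event):
            next_sink({**event, key: compute(event)})
        return decorated
    return decorator


def chain(sink, *decorators):
    """Wrap the sink so that decorators run in the order they are listed."""
    for decorator in reversed(decorators):
        sink = decorator(sink)
    return sink


delivered = []
pipeline = chain(
    list_sink(delivered),
    add_field("length", lambda e: len(e["body"])),
    add_field("is_long", lambda e: e["length"] > 10),
)

pipeline({"body": "hello"})
pipeline({"body": "hello, stream processing"})
print(delivered)
```

Each decorator does one small job and hands the event to the next step, so new processing can be added or removed without touching the source or the destination.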
Infochimps Data Delivery Service
Infochimps uses Apache Flume for the Data Delivery Service (DDS), our reliable data transport and
real-time analytics engine for the Infochimps Platform. Infochimps DDS adds important
enhancements to the Flume open-source tool including:
• Seamless integrations with your existing environment
and data sources
• Optimizations for highly scalable data collection and
distributed ETL (extract, transform, load)
• Tool set for rapid development of decorators which
perform the stream processing
• Flexible delivery framework to send data to any type
and quantity of databases or file systems
• Rapid solution development and deployment, along with
our expert Big Data methodology and best practices
Infochimps has extensive experience implementing the DDS, both for clients and for our internal data
flows including massive Twitter scrapes, the Foursquare firehose, customer purchase data, product
pricing data, and much more.
Single-purpose ETL solutions are rapidly being replaced with multi-node, multi-purpose data
integration platforms -- the universal glue that connects systems together and makes Big Data analytics
feasible. Today, companies are taking advantage of Amazon Web Services for a few processes,
on-premises or outsourced data centers for others, NoSQL databases, relational databases, cloud storage
-- the list goes on. Data Delivery Service is compatible with all of those environments, making your
data transport needs an implementation detail, not an analytics bottleneck.
8. About Infochimps
Our mission is to make the world’s data more accessible.
Infochimps helps companies understand their data. We provide
tools and services that connect their internal data, leverage the
power of cloud computing and new technologies such as Hadoop,
and provide a wealth of external datasets, which organizations
can connect to their own data.
Contact Us
Infochimps, Inc.
1214 W 6th St. Suite 202
Austin, TX 78703
1-855-DATA-FUN (1-855-328-2386)
www.infochimps.com
info@infochimps.com
Twitter: @infochimps
Get a free Big Data consultation
Let’s talk Big Data in the enterprise!
Get a free conference with the leading big data experts regarding your enterprise big data
project. Meet with leading data scientists Flip Kromer and/or Dhruv Bansal to talk shop
about your project objectives, design, infrastructure, tools, etc. Find out how other
companies are solving similar problems. Learn best practices and get recommendations -- free.