2. Index
Introduction
Evolving BI and Analytics for Big Data
Impacts to Traditional BI Databases
Challenges
MongoDB with Hadoop
Case Studies
Current Scenario
3. Introduction
Analytics falls along a spectrum. On one end of the spectrum sit batch analytical applications, which are
used for complex, long-running analyses. They tend to have slower response times (up to minutes, hours, or
days) and lower requirements for availability. Examples of batch analytics include Hadoop-based workloads.
On the other end of the spectrum sit real-time analytical applications, which provide lighter-weight
analytics very quickly. Latency is low (sub-second) and availability requirements are high (e.g., 99.99%).
MongoDB is typically used for real-time analytics.
Business Intelligence (BI) and analytics provide an essential set of technologies and processes
that organizations have relied upon over many years to guide strategic business decisions.
4. Introduction
1. Predictable Frequency. Data is extracted from source systems at regular intervals,
typically measured in days, months and quarters.
2. Static Sources. Data is sourced from controlled, internal systems supporting established
and well-defined back-office processes.
3. Fixed Models. Data structures are known and modeled in advance of analysis. This
enables the development of a single schema to accommodate data from all of the source
systems, but adds significant time to the upfront design.
4. Defined Queries. Questions to be asked of the data (i.e., the reporting queries) are
pre-defined. If not all of the query requirements are known upfront, or requirements
change, then the schema has to be modified to accommodate changes.
5. Slow-Changing Requirements. Rigorous change control is enforced before the
introduction of new data sources or reporting requirements.
6. Limited Users. The consumers of BI reports are typically business managers and senior
executives.
5. Evolving BI and Analytics for Big Data
Higher Uptime Requirements
The immediacy of real-time analytics accessed
from multiple fixed and mobile devices places
additional demands on the continuous availability
of BI systems.
Batch-based systems can often tolerate a certain
level of downtime, for example for scheduled
maintenance. Online systems on the other hand
need to maintain operations during both failures
and planned upgrades.
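To put the availability target in concrete terms, the sketch below converts an availability percentage (such as the 99.99% cited in the introduction) into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration.

```python
# Minutes in a year, assuming a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the minutes per year a system may be down at this availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99% the budget is under an hour per year, which leaves little room for scheduled maintenance windows of the kind batch systems tolerate.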
The Need for Speed & Scale
Time to value is everything. For example, having
access to real-time customer sentiment or
logistics tracking is of little benefit unless the data
can be analyzed and reported in real time. As a
consequence, the frequency of data acquisition,
integration and analysis must increase from days
to minutes or less, placing significant operational
overhead on BI systems.
Agile Analytics and Reporting
With such a diversity of new data sources,
business analysts cannot know all of the
questions they need to ask in advance.
Therefore an essential requirement is that
the data can be stored before knowing how
it will be processed and queried.
The Changing Face of Data
Data generated by such workloads as social,
mobile, sensor and logging is much more
complex and variably structured than
traditional transaction data from back-office
systems such as ERP, CRM, PoS (Point of Sale)
and Accounts Receivable.
Taking BI to the Cloud
The drive to embrace cloud computing to
reduce costs and improve agility means BI
components that have traditionally relied on
databases deployed on monolithic, scale-up
systems have to be re-designed for the
elastic scale-out, service-oriented
architectures of cloud.
6. Impacts to Traditional BI Databases
The relational databases underpinning many of today’s traditional BI platforms are not well suited to the requirements of big
data:
• Semi-structured and unstructured data typical in mobile, social and sensor-driven applications cannot be efficiently
represented as rows and columns in a relational database table
• Rapid evolution of database schema to support new data sources and rapidly changing data structures is
difficult in relational databases, which rely on costly ALTER TABLE operations to add or modify table attributes
• Performance overhead of JOINs and transaction semantics prevents relational databases from keeping pace with the
ingestion of high-velocity data sources
• Quickly growing data volumes require scaling databases out across commodity hardware, rather than the scale-up
approach typical of most relational databases
Relational databases’ inability to
handle the speed, size and diversity
of rapidly changing data generated
by modern applications is already
driving the enterprise adoption of
NoSQL and Big Data technologies in
both operational and analytical
roles.
7. The purpose
• Flume with Hadoop for batch processing, which keeps the data relevant time-wise; it can also serve
near-real-time use because the data is fresh, arriving only a few minutes to as little as a second late.
• A Flume engine, used server-side to make decisions about the current state of
affairs.
• Decisions are made from whatever data is received about a customer’s current
condition, without the full history in the user profile, which would enable a much more
informed decision.
• State-of-the-art auto-updating charts and report creation with a dashboard UI.
Increase the scalability and performance of organizations through a
real-time analysis platform focused on storing, processing and
analyzing exponentially growing data with big data
technologies.
8. Challenges
1. Getting data metrics to the right people
Often, social media is treated like the ugly stepchild within the marketing department and real-time
social media analytics are either absent or ignored.
2. Visualization
Visualizing real-time social media analytics is another key element involved in developing insights
that matter.
Simply displaying values graphically helps in making the kinds of fast interpretations necessary for
making decisions with real-time data, but adding more complex algorithms and using models
provides deeper insights, especially when visualized.
3. Unstructured data is challenging
Unlike the survey data firms are used to dealing with, most social media data (IBM estimates 80%) is
unstructured, meaning it consists of words rather than numbers. Text analytics also lags seriously behind
numeric analysis.
4. Increasing signal to noise
Social media data is inherently noisy. Reducing noise to even detect signal is challenging — especially
in real time. Sure, with enough time, new analytics tools can ferret out the few meaningful
9. Top 10 Priorities
1. Enable new fast-paced business practices
2. Don’t expect the new stuff to replace the old stuff
3. Do not assume that all the data needs to be in real time, all the time
4. Correlate real-time data with data from other sources and latencies
5. Start with a proof of value with measurable outcomes
6. As a safe starter project, accelerate successful latent processes into near real time
7. Think about operationalizing analytics
8. Think about the skills you need
9. Examine application business rules to ensure they are ready for real-time data flows
10. Evaluate technology platforms and expertise for availability and reliability
10. Challenges
Real-Time Analytics is Hard
Can’t Stay Ahead. You need to account for
many types of data, including unstructured
and semi-structured data. And new sources
present themselves unpredictably.
Relational databases aren’t capable of
handling this, which leaves you hamstrung.
Can’t Scale. You need to analyze terabytes
or petabytes of data. You need sub-second
response times. That’s a lot more than a
single server can handle. Relational
databases weren’t designed for this.
Batch. Batch processes are the right
approach for some jobs. But in many cases,
you need to analyze rapidly changing,
multi-structured data in real time. You
don’t have the luxury of lengthy ETL
processes to cleanse data for later.
MongoDB Makes it Easy
Do the Impossible. MongoDB can incorporate any
kind of data – any structure, any format, any
source – no matter how often it changes. Your
analytical engines can be comprehensive and real-
time.
Scale Big. MongoDB is built to scale out on
commodity hardware, in your data center or in the
cloud. And without complex hardware or extra
software. This shouldn’t be hard, and with
MongoDB, it isn’t.
Real Time. MongoDB can analyze data of any
structure directly within the database, giving you
results in real time, and without expensive data
warehouse loads.
11. Why Other Databases Fall Short and MongoDB
Most databases make you choose between a flexible data
model, low latency at scale, and powerful access. But
increasingly you need all three at the same time.
Rigid Schemas. You should be able to analyze unstructured, semi-structured, and
polymorphic data. And it should be easy to add new data. But this data doesn’t
belong in relational rows and columns. Plus, relational schemas are hard to
change incrementally, especially without impacting performance or taking the
database offline.
Scaling Problems. Relational databases were designed for single-server
configurations, not for horizontal scale-out. They were meant to serve 100s of ops
per second, not 100,000s of ops per second. Even with a lot of engineering hours,
custom sharding layers, and caches, scaling an RDBMS is hard at best and
impossible at worst.
Takes Too Long. Analyzing data in real time requires a break from the familiar
ETL and data warehouse approach. You don’t have time for lengthy load
schedules, or to build new query models. You need to run aggregation queries
against variably structured data. And you should be able to do so in place, in real
time.
Organizations are using MongoDB for analytics because it
lets them store any kind of data, analyze it in real time,
and change the schema as they go.
New Data. MongoDB’s document model enables you to store and process data
of any structure: events, time series data, geospatial coordinates, text and
binary data, and anything else. You can adapt the structure of a document’s
schema just by adding new fields, making it simple to bring in new data as it
becomes available.
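As a rough illustration of adapting a document's structure by adding fields, the sketch below uses plain Python dictionaries as stand-ins for documents. It does not call any MongoDB driver; the sample events and the helper names `field_names` and `with_field` are hypothetical.

```python
# Plain dicts stand in for documents in one collection (illustrative data only).
events = [
    {"type": "checkin", "user": "u1", "venue": "cafe"},
    {"type": "sensor", "device": "d7", "temp_c": 21.5},
]

# A new source arrives carrying a field that no earlier document has.
# Nothing already stored needs to be rewritten to accept it.
events.append(
    {"type": "sensor", "device": "d9", "temp_c": 19.0,
     "location": {"lat": 41.88, "lon": -87.63}}
)


def field_names(docs):
    """Collect every top-level field name that appears in any document."""
    names = set()
    for doc in docs:
        names.update(doc)
    return names


def with_field(docs, field):
    """Return only the documents that contain the given top-level field."""
    return [doc for doc in docs if field in doc]


if __name__ == "__main__":
    print(sorted(field_names(events)))
    print(with_field(events, "location"))
```

Documents written before the new field existed remain valid, and queries simply skip those that lack it.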
Horizontal Scalability. MongoDB’s automatic sharding distributes data across
fleets of commodity servers, with complete application transparency. With
multiple options for scaling – including range-based, hash-based and location-
aware sharding – MongoDB can support thousands of nodes, petabytes of
data, and hundreds of thousands of ops per second without requiring you to
build custom partitioning and caching layers.
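The difference between range-based and hash-based partitioning can be sketched in a few lines. This is a simplified illustration of the two routing ideas, not MongoDB's actual sharding implementation; the function names, the split points and the choice of SHA-256 are all assumptions. Location-aware sharding is not covered here.

```python
import bisect
import hashlib
from typing import Sequence


def hash_shard(key: str, num_shards: int) -> int:
    """Route a key by hashing it, which spreads neighbouring keys across shards."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(key: str, split_points: Sequence[str]) -> int:
    """Route a key by sorted split points, which keeps neighbouring keys together.

    With split points ["h", "p"], keys below "h" go to shard 0, keys from "h"
    up to (but not including) "p" go to shard 1, and the rest go to shard 2.
    """
    return bisect.bisect_right(split_points, key)


if __name__ == "__main__":
    split_points = ["h", "p"]  # hypothetical boundaries for three shards
    for user in ("alice", "mongo", "zed"):
        print(user,
              "range shard:", range_shard(user, split_points),
              "hash shard:", hash_shard(user, 3))
```

Range routing suits queries that scan a contiguous span of keys, while hash routing trades that locality for a more even write distribution.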
Powerful Analytics, In Place, In Real Time. With rich index and query
support – including secondary, geospatial and text search indexes – as well as
the aggregation framework and native MapReduce, MongoDB can run complex
ad-hoc analytics and reporting in place.
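To give a feel for an in-place aggregation over variably structured data, the sketch below shows a small pipeline in the aggregation framework's `$match` / `$group` form, followed by a pure-Python equivalent that runs without a database. The sample documents, field names and the helper `count_clicks_by_platform` are hypothetical; against a live deployment the pipeline would be passed to a driver call such as `collection.aggregate(PIPELINE)`.

```python
from collections import Counter

# Pipeline as it might be written for the aggregation framework:
# keep only click events, then count them per platform.
PIPELINE = [
    {"$match": {"type": "click"}},
    {"$group": {"_id": "$platform", "total": {"$sum": 1}}},
]

# Illustrative documents with differing fields.
DOCS = [
    {"type": "click", "platform": "mobile"},
    {"type": "click", "platform": "desktop"},
    {"type": "share", "platform": "mobile", "network": "twitter"},
    {"type": "click", "platform": "mobile", "referrer": "search"},
]


def count_clicks_by_platform(docs):
    """Pure-Python equivalent of PIPELINE: filter clicks, then count per platform."""
    return Counter(
        doc.get("platform") for doc in docs if doc.get("type") == "click"
    )


if __name__ == "__main__":
    print(dict(count_clicks_by_platform(DOCS)))
```

The extra fields on some documents (`network`, `referrer`) do not interfere with the grouping, which is the point of aggregating over variably structured data.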
12. MongoDB with Hadoop
The following table provides examples of customers using MongoDB together with Hadoop to power big
data applications.

Customer | MongoDB | Hadoop
eBay | User data and metadata management for product catalog | User analysis for personalized search & recommendations
Orbitz | Management of hotel data and pricing | Hotel segmentation to support building search facets
Pearson | Student identity and access control. Content management of course materials | Student analytics to create adaptive learning programs
Foursquare | User data, check-ins, reviews, venue content management | User analysis, segmentation and personalization
Tier 1 Investment Bank | Tick data, quants analysis, reference data distribution | Risk modeling, security and fraud detection
Industrial Machinery Manufacturer | Storage and real-time analytics of sensor data collected from connected vehicles | Preventive maintenance programs for fleet optimization. In-field monitoring of vehicle components for design enhancements
SFR | Customer service applications accessed via online portals and call centers | Analysis of customer usage, devices & pricing to optimize plans

Whether improving customer service, supporting cross-sell and upsell, enhancing business efficiency or
reducing risk, MongoDB and Hadoop provide the foundation to operationalize big data.
13. Future Trends in Real-Time Data, BI, and Analytics
Data types handled in real time today. Numerous TDWI surveys have shown that structured
data (which includes relational data) is by far the most common class of data types handled for BI and
analytic purposes, as well as many operational and transactional ones. It’s no surprise that
structured data bubbled to the top of Figure 16. Other data types and sources commonly
handled in real time today include application logs (33%), event data (26%), semi-structured
data (26%), and hierarchical and raw data (24% each).
Data types to be handled in real time within three years. Looking ahead, a number of data
types are poised for greater real-time usage. Some are in limited use today but will
experience aggressive adoption within three years, namely social media data (38%), Web logs
and clickstreams (34%), and unstructured data (34%). Others are handled in real time today
and will become even more so, namely event (36%), semi-structured (33%), structured (31%),
and hierarchical (30%) data.
15. MongoDB Integration with BI and Analytics Tools
To make online big data actionable through dashboards, reports,
visualizations and integration with other data sources, it must be
accessible to established BI and analytics tools. MongoDB offers integration
with more of the leading BI tools than any other NoSQL or online big data
technology, including:
Actuate, Alteryx, Informatica,
Jaspersoft, Logi Analytics, MicroStrategy,
Pentaho, QlikTech, SAP Lumira
16. WindyGrid’s
One person, one laptop, and MongoDB’s technology jumpstarted a project that, with
other people joining in, went from prototype to one of the nation’s pioneering projects
to analyze and act on municipal data in real time. In just four months.
WindyGrid put Chicago on the path of revolutionizing how it operates not by replacing
the administrative systems already in place, but by using MongoDB to bring that data
together into a new application. With MongoDB’s flexible data model, WindyGrid doesn’t
have to go back and redo the schema for each new piece of data. Instead, it can evolve
schemas in real time, which is crucial as WindyGrid expands and adds predictive
analytics, growing by millions of pieces of structured and unstructured data each day.
17. Crittercism Is a Mobile Pioneer
Crittercism doesn’t just monitor apps or gather information. Using MongoDB’s powerful built-in
query functions, it analyzes avalanches of unstructured and non-uniform data in real time. It
recognizes patterns, identifies trends, and diagnoses problems. That means that Crittercism’s
customers immediately understand the root cause of problems and the impact they’re having on
business. So they know how to prioritize and correct the problems they’re facing and improve
performance.
The kind of real time analysis that Crittercism provides customers would also be impossible
with traditional databases. Crittercism is using MongoDB’s powerful query functions to
analyze the broad variety of data it collects, in real time, within the database. A more
traditional data warehouse approach, with ETLs and long loading times, can’t match this
type of speed.
At the same time, MongoDB lets Crittercism efficiently handle the tons of data it’s
collecting. During the past two years, the number of requests that Crittercism gathers and
analyzes has jumped from 700 to 45,000 per second. Relational databases have a hard time
scaling to meet these kinds of demands, typically requiring expensive add-on software, or
additional layers of proprietary code, to keep up. With MongoDB, horizontal scalability
across multiple data centers is a native function.
18. McAfee - Global Cybersecurity
McAfee Global Threat Intelligence (GTI) analyzes cyberthreats from all angles, identifying threat relationships, such as malware used in
network intrusions, websites hosting malware, botnet associations, and more. Threat information is
extremely time sensitive; knowing about a threat from weeks ago is useless.
To provide up-to-date, comprehensive threat information, McAfee needs to quickly process terabytes of
different data types (such as IP address or domain) into meaningful relationships,
for example: Is this website good or bad? What other sites have been interacting with it? The success of the cloud-based system also
depends on a bidirectional data flow: GTI gathers data from millions of client sensors and provides real-time intelligence
back to these end products, at a rate of 100 billion queries per month.
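For a sense of scale, the monthly query figure above can be converted into an average per-second rate. The sketch assumes a 30-day month and a perfectly even load, neither of which is stated in the source, so the result is only an order-of-magnitude guide; peak rates would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(total_per_month: float, days_in_month: int = 30) -> float:
    """Average events per second for a monthly total, assuming an even load."""
    return total_per_month / (days_in_month * SECONDS_PER_DAY)


if __name__ == "__main__":
    queries_per_month = 100e9  # 100 billion queries per month, from the text
    rate = average_rate_per_second(queries_per_month)
    print(f"Average load: about {rate:,.0f} queries per second")
```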
McAfee was unable to address these needs and effectively scale out to millions of records with its existing solutions. For example,
the HBase/Hadoop setup made it difficult to run interesting, complex queries, and the team experienced bugs with the Java garbage
collector running out of memory. Another issue was with sharding and syncing:
Lucene was able to index in interesting ways, but required too much customization.
The team compensated for all the rebuilding and redeploying of Katta shards with “the usual scripting duct tape,” but what they really
needed was a solution that could seamlessly handle the sharding and updating on its own.
McAfee selected MongoDB, which had excellent documentation and a growing community that was “on fire.”
19. Power Journalism
BuzzFeed, the social news and entertainment company, relies on MongoDB to analyze all performance data
for its content across the social web. A core part of BuzzFeed’s publishing platform, MongoDB exposes
metrics to editors and writers in real time, to help them understand how its content is performing and to
optimize for the social web. The company has been using MongoDB since 2010. Here’s why.
1. Analytics provide more insight, more quickly. BuzzFeed relies on MongoDB for its strategic analytics platform. With apps and
dashboards built on MongoDB, BuzzFeed can pinpoint when content is viewed and how it is shared. With this approach, BuzzFeed is able to quickly
gain insight into how its content performs, nimbly optimize the user experience for posts that are performing best, and
deliver critical feedback to its writers and editors.
2. BuzzFeed is data-driven. At BuzzFeed, data drives decision-making and powers the company. MongoDB enables BuzzFeed to
effectively analyze, track and expose a range of metrics to writers and employees. This includes: the number of clicks; how
often and where posts are being shared; which views on different social media properties lead to the most shares; and how
views differ across mobile and desktop.
3. Successful web journalism demands scale. BuzzFeed processes large volumes of data, and this is increasing each year as the site’s
traffic continues to grow. Originally built on a relational data store, BuzzFeed decided to use MongoDB, a more scalable solution, to
collect and track the data it needs, with richer functionality than a standard key-value store.
4. Editors gain an edge with access to data in minutes. Fast, easy access to data is critical to helping editors determine what
content will be most shareable in the social media world. With MongoDB, BuzzFeed is able to expose performance data shortly after
publication, enabling editors to respond quickly by tweaking headlines and determining the best way to promote content.
5. Setting the infrastructure for new applications. As BuzzFeed continues its efforts to leverage stats and optimization, MongoDB will
feature prominently in the new infrastructure. MongoDB makes it easy to build apps quickly, a requirement as BuzzFeed rolls out
additional products.