Users can run queries via MicroStrategy’s visual interface without needing to write unfamiliar HiveQL or MapReduce scripts. In essence, any user without Hadoop programming skills can ask questions of vast volumes of structured and unstructured data to gain valuable business insights.
2. Hadoop
Hadoop is a free, Java-based programming framework that
supports the processing of large data sets in a distributed
computing environment.
It makes it possible to run applications on systems with
thousands of nodes involving thousands of terabytes.
Its distributed file system facilitates rapid data transfer rates
among nodes and allows the system to continue operating
uninterrupted in case of a node failure.
This approach lowers the risk of catastrophic system failure,
even if a significant number of nodes become inoperative.
3. Why Hadoop?
• Scalability
Scales simply by adding nodes.
Local processing to avoid network bottlenecks.
• Flexibility
All kinds of data (blobs, documents, records, etc.).
In all forms (structured, semi-structured, unstructured).
Store anything and later analyze what you need.
• Efficiency
Cost efficiency (under $1k/TB) on commodity hardware.
Unified storage, metadata, and security (no duplication or
synchronization).
4. Core parts of Hadoop
Hadoop Distributed File System(HDFS)
It is the primary storage system used by Hadoop applications.
HDFS is a distributed file system that provides high-performance access
to data across Hadoop clusters. Like other Hadoop-related technologies,
HDFS has become a key tool for managing pools of big data and
supporting big data analytics applications.
When HDFS takes in data, it breaks the information down into separate
pieces and distributes them to different nodes in a cluster, allowing
for parallel processing. The file system also copies each piece of data
multiple times and distributes the copies to individual nodes, placing at least
one copy on a different server rack than the others. As a result, the data on
nodes that crash can be found elsewhere within a cluster, which allows
processing to continue while the failure is resolved.
HDFS is built to support applications with large data sets, including
individual files that reach into the terabytes. It uses a master/slave
architecture, with each cluster consisting of a single NameNode that
manages file system operations and supporting DataNodes that manage data
storage on individual compute nodes.
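The splitting-and-replication behavior described above can be illustrated with a toy Python sketch. The block size, node names, and rack names below are made up for demonstration; real HDFS uses 128 MB blocks and a far more sophisticated, topology-aware placement policy.

```python
# Toy illustration (not the real HDFS code): split data into fixed-size
# blocks, then place 3 replicas so at least one lands on a different rack.
BLOCK_SIZE = 4  # bytes; real HDFS defaults to 128 MB

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break the input into separate fixed-size pieces, as HDFS does on ingest."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_id, nodes_by_rack, replication=3):
    """Pick nodes for a block: two on one rack, one on a different rack."""
    racks = sorted(nodes_by_rack)
    primary = racks[block_id % len(racks)]
    other = racks[(block_id + 1) % len(racks)]
    placement = nodes_by_rack[primary][:2] + nodes_by_rack[other][:1]
    return placement[:replication]

blocks = split_into_blocks(b"hello hdfs!")
cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
for i, block in enumerate(blocks):
    print(i, block, place_replicas(i, cluster))
```

Because one replica always sits on a second rack, losing an entire rack still leaves a readable copy, which is the property the paragraph above relies on.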
5. MapReduce
A MapReduce program is composed of a Map() procedure that performs
filtering and sorting (such as sorting students by first name into queues, one
queue for each name) and a Reduce() procedure that performs a summary
operation (such as counting the number of students in each queue, yielding
name frequencies).
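The Map()/Reduce() pair from the students example can be sketched in plain Python. The grouping step below stands in for Hadoop's shuffle/sort phase, which in a real cluster happens across machines.

```python
from collections import defaultdict

# Map emits (name, 1) pairs; the framework groups pairs by key;
# Reduce counts each queue, yielding name frequencies.
def map_fn(record):
    yield (record["first_name"], 1)

def reduce_fn(key, values):
    return (key, sum(values))

students = [{"first_name": n} for n in ["Ana", "Bo", "Ana", "Cy", "Ana", "Bo"]]

# Shuffle/sort phase: group mapped pairs by key (done by Hadoop itself).
groups = defaultdict(list)
for record in students:
    for key, value in map_fn(record):
        groups[key].append(value)

name_frequencies = dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))
print(name_frequencies)  # {'Ana': 3, 'Bo': 2, 'Cy': 1}
```

Since each map call touches one record and each reduce call touches one key, both phases parallelize naturally across nodes.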
The "MapReduce System" (also called "infrastructure" or "framework")
orchestrates by marshalling the distributed servers, running the various tasks
in parallel, managing all communications and data transfers between the
various parts of the system, and providing for redundancy and fault tolerance.
HDFS and MapReduce are robust: servers in a Hadoop cluster can fail without aborting the computation. HDFS ensures data is replicated redundantly across the cluster, and on completing a calculation, a node writes its results back into HDFS.
6. MicroStrategy Integration
Cloudera and MicroStrategy have collaborated to develop a powerful and
easy-to-use BI framework for Apache Hadoop by creating a connection
between MicroStrategy 9 and CDH. This connection is established via an
Open Database Connectivity (ODBC) driver for Apache Hive and is available
as the Cloudera Connector for MicroStrategy.
The connector allows business users to perform sophisticated point and click
analytics on data stored in Hadoop directly from MicroStrategy applications –
just as they do on data stored in data warehouses, data marts and operational
databases. MicroStrategy has developed Very Large Database Drivers
(VLDB) specifically for Cloudera that generate optimized queries for
Cloudera's Distribution including Apache Hadoop.
7. The Cloudera Connector for MicroStrategy enables your enterprise users to
access Hadoop data through the Business Intelligence application
MicroStrategy 9.3.1. The driver achieves this by translating Open Database
Connectivity (ODBC) calls from MicroStrategy into SQL and passing the
SQL queries to the underlying Impala or Hive engines.
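As a rough sketch of the ODBC route described above, the helper below builds a DSN-less connection string. The driver name and key names (HOST, PORT, AuthMech) are assumptions to be checked against the Cloudera driver's install guide, and the actual pyodbc call is shown only in a comment since it requires the driver to be installed.

```python
def build_conn_str(host, port=21050,
                   driver="Cloudera ODBC Driver for Impala"):
    # Key/value pairs follow the generic ODBC connection-string syntax;
    # the specific keys used here are assumptions, not verified values.
    parts = {"DRIVER": "{" + driver + "}", "HOST": host,
             "PORT": port, "AuthMech": 0}
    return ";".join(f"{k}={v}" for k, v in parts.items())

conn_str = build_conn_str("impalad.example.com")
print(conn_str)
# With pyodbc and the Cloudera driver installed, connecting would look like:
#   import pyodbc
#   conn = pyodbc.connect(conn_str, autocommit=True)
```

MicroStrategy itself issues these ODBC calls internally; the sketch is only meant to show the shape of the connection a BI tool negotiates with the driver.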
Together, MicroStrategy and Cloudera offer a connector that empowers organizations
to extract and deliver valuable insights from massive volumes of structured
and unstructured data. By providing sophisticated yet familiar reporting and
analysis tools on top of Apache Hadoop, business users can quickly and
easily unlock the potential of their data to make better business decisions.
8. What is Impala?
Interactive SQL
Typically 100x faster than Hive.
Responses in sub-seconds.
Nearly ANSI-92 standard SQL queries, compatible with Hive SQL.
Compatible SQL interfaces for existing Hadoop/CDH applications.
Based on industry-standard SQL.
Runs natively on Hadoop/HBase storage and metadata
Flexibility, scale, and cost advantages of Hadoop.
No duplication/synchronization of data and metadata.
Local processing to avoid network bottlenecks.
Separate runtime from MapReduce
MapReduce is designed for, and great at, batch processing.
Impala is purpose-built for low-latency SQL queries on Hadoop.
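A minimal query sketch using the impyla client's DB-API interface is shown below. The host and table name are hypothetical, and the logic is wrapped in a function so nothing attempts to connect unless a cluster is actually reachable.

```python
def total_enrollments(host="impalad.example.com", port=21050):
    """Run one low-latency aggregate against Impala (hypothetical table)."""
    # impyla's DB-API module; requires `pip install impyla` and a
    # reachable impalad on the given host/port.
    from impala.dbapi import connect
    conn = connect(host=host, port=port)
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM enrollments")
        (count,) = cur.fetchone()
        return count
    finally:
        conn.close()
```

The same statement submitted through Hive would be compiled into MapReduce jobs; Impala executes it directly, which is where the latency difference above comes from.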
9. Benefits of Impala
More and faster value from “Big Data”
BI tools were impractical on Hadoop before Impala.
Move from tens of Hadoop users per cluster to hundreds of SQL users.
No delays from data migration.
Flexibility
Query across existing data.
Select best-fit file formats.
Run multiple frameworks on the same data at the same time.
Cost Advantages
Reduce data movement, duplicate storage, and duplicate compute.
10% to 1% of the cost of an analytic DBMS.
Full-Fidelity Analysis
No loss from aggregations or fixed schemas.
10. Project
Integrating Hadoop and Impala with MicroStrategy reporting
capabilities, we developed Healthcare Management software.
We used data stored in HDFS, with Impala as the native MPP query
engine integrated into Hadoop via the connector.
Based on our requirements, we built Intelligent Cubes and
exported them directly to MicroStrategy.
Using its data-visualization capabilities, we are able to display
visually appealing dashboards and insightful reports.
We developed three dashboards demonstrating various ways of
visualizing Healthcare Management data.
12. The Key Performance Indicator section displays the total number of
issuers, employees, employers, brokers, and enrollments.
It also displays aggregated calculations of employee income,
premium per month, and percentage.
The Service Area section displays state-wise US totals using an
image layout widget.
The Enrollment section displays a heatmap of the total enrollment
count for each US state.
The Employee Segmentation section displays a grid/graph of the
number of employees per segment.
14. In the Ticketing dashboard, the Overall Ticket Workload section
displays the total count of support persons, open tickets, average
response days, and backlog percentage.
The Open Tickets section uses a waterfall widget describing total
open counts per issuer type.
It contains a heatmap of average closure time by issuer type.
It contains gauge widgets of closure time in days by year, quarter,
month, and week.
It also displays microcharts of current-status counts by issuer
type; in the microcharts we used sparkline and bar modes to analyze
the data in different ways.
16. This is an interactive dashboard.
The Key Performance Indicator section displays total service area
and enrollment counts per issuer name.
Using issuer name as a selector, it targets the enrollment heat
map, which displays total enrollments for each state.
The same selector also targets the US map image layout widget,
which displays the total service area count for each state.
18. Here we took raw real-time NASDAQ and NYSE stock data for
analysis as per our requirements.
In the above screenshot there are four selectors: Sector, Industry,
Symbol, and Year.
Industry is filtered by the Sector selector, and Symbol is filtered
by both Sector and Industry.
All four selectors filter the data shown in the panel below, which
displays stock volatility by year, quarter, month, and week.
The panel provides grid and graph views, limited to 50 rows at a
time, as shown in the screenshot below.
19. Conclusion
Users can run queries via MicroStrategy’s visual interface
without needing to write unfamiliar HiveQL or MapReduce
scripts. In essence, any user without Hadoop programming
skills can ask questions of vast volumes of structured and
unstructured data to gain valuable business insights.
The stack is very fast, scalable, cost-effective, and resilient to
failure.
Hadoop is inefficient at handling small files and lacks
transparent compression, since HDFS is not optimized for
random reads over small files.
It suits batch-oriented architectures, not real-time data
access.
Because it follows a shared-nothing architecture, tasks requiring
global synchronization or shared mutable data do not fit well.