3. Big Data
We understand Big Data as the result of the following changes
that are taking place in the data managed by organizations:
The increased Volume of the data available in companies
From Terabytes (10^3 GB) to Petabytes (10^6 GB)
The significant increase in the Variety or heterogeneity of data
sources available
Structured, Semi structured and Unstructured data must be processed
Increased Velocity of generation and distribution of data sources
The above are the main questions to determine if we have a Big
Data scenario
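The three questions above can be sketched as a small checklist. This is only an illustration: the class `DataProfile`, the function `three_v_flags` and both default limits are hypothetical and are not taken from the slides; real limits depend on each organization.

```python
from dataclasses import dataclass

GB_PER_TB = 10**3  # 1 Terabyte = 10^3 Gigabytes
GB_PER_PB = 10**6  # 1 Petabyte = 10^6 Gigabytes


@dataclass
class DataProfile:
    volume_gb: float          # total data volume, in Gigabytes
    source_formats: set       # e.g. {"structured", "unstructured"}
    events_per_second: float  # rate at which new data arrives


def three_v_flags(profile, volume_gb_limit=GB_PER_TB, velocity_limit=1_000):
    """Return which of the three Vs reach the (hypothetical) limits."""
    return {
        "volume": profile.volume_gb >= volume_gb_limit,
        "variety": len(profile.source_formats) > 1,
        "velocity": profile.events_per_second >= velocity_limit,
    }


# Made-up profile: 2.5 Petabytes, two kinds of sources, slow arrival rate.
profile = DataProfile(
    volume_gb=2.5 * GB_PER_PB,
    source_formats={"structured", "unstructured"},
    events_per_second=50,
)
flags = three_v_flags(profile)
print(flags)
```

A profile that raises one or more flags is a candidate Big Data scenario; how many flags are required is a judgment call that this sketch does not settle.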
Big Data
4. Big Data technologies
Traditional Business Intelligence (BI) tools and processes have
been overtaken by the nature of Big Data
This situation has led to the rise and development of a wide
range of technologies for Big Data management
Most current Big Data technologies are Open Source
Know-How: A major problem
Which technologies to use in each Big Data scenario?
How to combine them to be successful and monetize Big Data
management?
6. Classification of Big Data technologies
Big Data technologies fall into 3 groups: Apache Hadoop, NoSQL
databases and Extended RDBMS
7. Classification of Big Data technologies
Apache Hadoop:
A framework that allows for the distributed processing of Big
Data
Commodity cluster computing: It is designed to scale up
from single servers to thousands of machines
More general approach than the other Big Data
technologies:
Simple programming models for supporting a wide range of
applications: MapReduce, Tez, Hive, Pig, Spark...
Applications: Ingestion, Processing (Batch & Real Time), ETL,
SQL, Machine Learning, NoSQL, Reporting, OLAP…
8. Classification of Big Data technologies
Apache Hadoop in its most basic form consists of:
HDFS: A distributed file system
YARN: A framework for job scheduling and cluster resource
management
MapReduce: A YARN-based system for parallel processing of
large data sets
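To make the MapReduce programming model concrete, here is a minimal word count sketch in plain Python. It runs in a single process and does not use the Hadoop API; the function names `map_phase`, `shuffle_phase` and `reduce_phase` are illustrative only. In a real cluster the map and reduce steps would run in parallel on different nodes, with the framework handling the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data big clusters", "data lakes"]
counts = word_count(lines)
print(counts)
```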
9. Classification of Big Data technologies
NoSQL databases
Storage and querying, especially of semi-structured data
Usually they implement distributed storage and processing
Aimed at replacing operational databases in Big Data scenarios:
Less general approach than Hadoop
Some form of support for transaction management
Optimized for random reads and writes
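The idea of semi-structured records accessed by key can be sketched with an in-memory stand-in. `TinyDocumentStore` is a hypothetical class written for this illustration, not the API of any real NoSQL product, and it leaves out distribution and transactions entirely. The sample records are made up.

```python
class TinyDocumentStore:
    """In-memory stand-in for a document store (illustration only)."""

    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        # Random write: store or overwrite a single document by key.
        self._docs[key] = dict(document)

    def get(self, key):
        # Random read: fetch a single document by key (None if absent).
        return self._docs.get(key)

    def find(self, field, value):
        # Query on a field that only some documents may have.
        return sorted(k for k, d in self._docs.items() if d.get(field) == value)


store = TinyDocumentStore()
store.put("u1", {"name": "Ana", "city": "Madrid"})
store.put("u2", {"name": "Luis", "city": "Madrid", "tags": ["premium"]})
store.put("u3", {"name": "Eva"})  # no "city" field: the schema is not fixed

madrid_users = store.find("city", "Madrid")
print(store.get("u2"), madrid_users)
```

Each document carries its own set of fields, which is what makes the data semi-structured; reads and writes touch one key at a time rather than scanning a whole table.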
10. Classification of Big Data technologies
Extended RDBMS
Add features to traditional databases for storing and processing
huge volumes of relational information (mainly structured data)
Including libraries of advanced analytical functions and supporting User
Defined Functions (UDF)
Usually they allow for distributed storage or processing
Some of them implement columnar storage: Optimized for analytical
workload (sums, counts, averages, maximums,…)
One important subtype is MPP (Massively Parallel Processing)
databases
HP Vertica, Pivotal Greenplum
Well suited for OLAP applications
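Why columnar storage suits analytical workloads can be sketched in a few lines. The data below is made up, and the layout is a plain Python dictionary of lists rather than the storage format of any real database; the point is only that an aggregate reads one column and ignores the rest.

```python
# Row-oriented layout: one record per row, all columns together.
rows = [
    {"order_id": 1, "region": "north", "amount": 120.0},
    {"order_id": 2, "region": "south", "amount": 80.0},
    {"order_id": 3, "region": "north", "amount": 200.0},
]

# Column-oriented layout: one list per column.
columns = {
    "order_id": [r["order_id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}


def summarize(values):
    """Typical analytical aggregates over a single column."""
    return {
        "sum": sum(values),
        "count": len(values),
        "avg": sum(values) / len(values),
        "max": max(values),
    }


# Only the "amount" column is scanned; the other columns are never read.
stats = summarize(columns["amount"])
print(stats)
```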
11. Classification of Big Data technologies
An alternative classification: based on their role in a Big Data
architecture
Ingestion, Storage, Processing, Orchestration, Analysis, Visualization
12. We provide the best technology for each application
1. Enterprise Data Warehouse Extension:
Big Data scenarios in which we would like to implement low-latency
analytics such as OLAP, dashboards, reporting…
13. We provide the best technology for each application
2. Website clickstream analysis:
14. We provide the best technology for each application
2. Website clickstream analysis – Visualization Technologies
Apache Zeppelin
http://zeppelin-project.org/demo.html
15. We provide the best technology for each application
3. Real Time analytics
Processing of data streams instead of static data sets, as in batch
processing
[Architecture diagram: Apache HTTP Servers 1…N; Syslog Source; Kafka Channel; Avro Sink, HDFS Sink, HBase Sink and other sinks; Real Time Processing; Persistence; Visualizations for analysis]
16. We provide the best technology for each application
3. Real Time analytics – Processing Technologies
Criterion | [Storm] | Interceptor | Trident API | [name missing]
Processing latency | 0.05 to 0.5 s | 0.05 to 0.5 s | 0.5 to 30 s | 0.5 to 30 s
Aggregations and windowing averages | Yes, but not fault-tolerant | Not supported | Yes, fault-tolerant | Yes, fault-tolerant
Record-level enrichment and alerts | Yes | Yes | Yes | Yes
Persistence of transient data | Yes, but poor performance | Yes, high performance with HDFS, HBase… | Yes, high performance with HDFS, HBase… | Yes, high performance with HDFS, HBase…
High-level functions | No. It requires a lot of code | Yes. Very simple, configuration-based tool | Yes. Joins, aggregations… Easier programming than Storm | Yes, a lot of function libraries. Easier programming than Storm and Trident
Reliability | Duplicates and data loss | More reliable than Storm and Trident | More reliable than Storm | More reliable than Storm and Trident
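The "aggregations and windowing averages" criterion in the comparison above can be illustrated with a tumbling window average. This is a plain Python sketch over a finite list of events, not code for any of the compared technologies, and it has no fault tolerance. The function name `tumbling_window_averages` and the sample events are assumptions made for the example.

```python
from collections import defaultdict


def tumbling_window_averages(events, window_seconds):
    """Average the values of (timestamp, value) events per fixed-size window.

    Each event falls into exactly one window, identified by the start
    time of that window.
    """
    totals = defaultdict(lambda: [0.0, 0])  # window start -> [sum, count]
    for timestamp, value in events:
        window_start = (timestamp // window_seconds) * window_seconds
        totals[window_start][0] += value
        totals[window_start][1] += 1
    return {start: s / n for start, (s, n) in sorted(totals.items())}


# Made-up stream of (timestamp in seconds, response time in ms) events.
events = [(0, 100), (3, 200), (9, 300), (10, 50), (14, 150), (21, 400)]
averages = tumbling_window_averages(events, window_seconds=10)
print(averages)
```

A streaming engine would keep the per-window totals as state while events arrive and would also have to recover that state after a failure, which is where the fault-tolerance differences between the technologies matter.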
17. We provide the best technology for each application
3. Real Time analytics – Visualization Technologies
JavaScript chart libraries (D3, Highcharts…) using socket connections
19. We provide the best technology for each application
3. Real Time analytics – A StrateBI case study
Wikipedia updates – Demo StrateBI
http://bigdata.stratebi.com/
20. We provide the best technology for each application
3. Real Time analytics – More Technologies
Apache Hue + Solr
[Architecture diagram: Apache HTTP Servers 1…N; Syslog Source; Kafka Channel; Solr Sink; Solr (Real Time Indexing); Hue (Visualizations for analysis)]
22. We provide the best technology for each application
4. Fraud detection system:
23. Hadoop Distributions
Installing and maintaining Hadoop tools separately may
become a serious issue
Hadoop Distributions: Software packages that include the basic
Hadoop components, along with other common and useful tools
of the current Hadoop stack
In some cases distributions add improvements or even non-Open
Source tools (e.g. Cloudera Manager)
Main benefits
Packages or installers: Easy to install Hadoop on different operating
systems such as Ubuntu, CentOS, Debian, Windows Server ...
Easy patch management
24. Hadoop distributions recommended by StrateBI
Hortonworks HDP: http://hortonworks.com/
The only 100% Open Source Hadoop Distribution
Only includes the latest stable versions of Hadoop stack tools
25. Hadoop distributions recommended by StrateBI
Cloudera: http://www.cloudera.com
Express (free) and Enterprise (commercial) versions
They include tool improvements that have not yet been
incorporated into Apache open source projects
Cloudera Manager: A proprietary tool for Hadoop cluster
management and monitoring
A quite good and very reliable tool
In its free version it does not support some features that Apache
Ambari does support for cluster management in Hortonworks
Users and roles definition, LDAP integration, management of
some Hadoop services (Impala, Spark, etc ...), hot updates of
cluster tools...
26. Pentaho & Big Data
The Pentaho Business Intelligence suite has added improved
support for Big Data management, processing and visualization
Pentaho Data Integration
Visual and powerful ETL design and execution tool
Pentaho Reporting Designer
For creating static and parametrized reports
Pentaho Metadata Editor
To define metadata for Ad-Hoc reporting applications (e.g. STReport)
Pentaho BI Server
For developing and sharing reports, dashboards (e.g. STDashboard) and
OLAP Analysis (e.g. STPivot)
28. Pentaho & Big Data
Pentaho Data Integration 6.X
Full integration with the most common Hadoop distributions
Cloudera 5.X, Hortonworks 2.X, MapR
Functionalities
ETL in-cluster execution: Pentaho automatically generates and launches
MapReduce code in the cluster
Reading, processing and writing data and files from and to HDFS
Process orchestration: MapReduce, Pig, Sqoop, Spark, Oozie
JDBC connection with Apache Hive and Apache Impala
PDI also has support for NoSQL databases
HBase, MongoDB, Cassandra (up to version 2.1)
31. Some Big Data success stories:
Democratic Party presidential campaigns (Barack Obama)
Data integration from surveys, social networks, member databases…
High accuracy in forecasting results per geographic area (> 99%)
Better management of campaign events, advertising placement ...
They won presidential elections in 2008 and 2012
Amazon recommendation system
32. Some Big Data success stories:
Banks and insurance companies such as Morgan Stanley and ING
Direct have adopted Big Data:
Fraud detection, risk analysis in loans and insurance, customer churn
prevention, ...
The UPS package delivery company invests $1 million a year in
Big Data
It uses the data generated by the sensors installed in its vehicles to optimize
routes, fuel consumption, maintenance, CO2 emissions ...
UPS saves 50 million dollars in gasoline a year through its management of
Big Data
33. Some Big Data success stories:
T-Mobile USA uses Big Data to reduce churn rate
By integrating data from billing, calls and social networks
All raw data is being stored in a Hadoop Data Lake
Generates a 360-degree view of each customer, used to tackle
customer dissatisfaction
“Tribal” customer model
Identifying people who have high influence on others due to their large
social network. If such a customer switches telecom provider, it could
cause a domino effect
Customer Lifetime Value is calculated for each of these customers
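The slides do not describe how influential customers are actually identified. As a rough stand-in, the sketch below counts the distinct contacts of each customer in a list of call records and keeps those that reach a threshold. The function names, the threshold `min_contacts` and the call records are all made up for the illustration.

```python
from collections import defaultdict


def contact_counts(calls):
    """Count distinct contacts per customer from (caller, callee) records."""
    contacts = defaultdict(set)
    for caller, callee in calls:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return {customer: len(peers) for customer, peers in contacts.items()}


def top_influencers(calls, min_contacts):
    """Customers whose number of distinct contacts reaches min_contacts."""
    counts = contact_counts(calls)
    return sorted(c for c, n in counts.items() if n >= min_contacts)


# Made-up call records between anonymous customer ids.
calls = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("e", "a")]
influencers = top_influencers(calls, min_contacts=3)
print(influencers)
```

Counting contacts is the simplest possible measure of influence; a production model would likely weigh call frequency, duration and social network data as well.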
34. Some Big Data success stories:
T-Mobile USA uses Big Data to reduce churn rate
Churn expectancy of a customer is based on different analyses
Billing analysis: Where, for how long and with whom a user calls or texts.
Calls going to a different provider could indicate that the customer's
social network is switching
Dropped call analysis: For example, proactively detecting whether the user
has limited coverage in the geographical area where they usually move, in
order to offer solutions such as a new phone or a femtocell to extend
coverage in indoor locations
Sentiment analysis: Social network data combined with other data
collected from customers, such as surveys or previous complaints
As a result, T-Mobile cut churn rates by 50% in just one
quarter
35. StrateBI & Big Data success stories:
StrateBI has successfully applied the previously discussed Big Data
technologies:
Big Data analysis for decision making in agriculture
Real time data generated by sensors installed in farms is ingested and
integrated with weather data sources, in order to generate alerts and
obtain predictions
Social Network analysis
Technological surveillance for a security company
Detection and prevention of attacks or dangerous scenarios, by
analyzing data from social networks combined with customer data
Detecting trends in social networking for business digital content
management
Intelligent content publishing
36. Real time analysis of Big Data for decision making in agriculture
39. Why StrateBI for Big Data projects?
Recognized Big Data specialists in Spain (Hadoop, Spark, Hive,
Flume, Hortonworks, Cloudera, Cassandra, HP Vertica…)
Backed by our projects and training performed with companies
such as Boeing, Telefónica Educación Digital (TED), Gobierno de
España, Schibsted Group, Prosegur, INCIBE (National Institute of
Cybersecurity)…
Spanish leaders of Open Source BI (Pentaho, Talend,
Mondrian, Ctools, Saiku…)
StrateBI has taken hundreds of Pentaho-based Business
Intelligence systems into production for large companies such as
BBVA, Telefónica, Globalia, Prosegur, ALD, Gobiernos de La
Rioja, Extremadura, Baleares, Eroski, Equifax, Unilever, Amnistía
Internacional, Caixa De Enginyers, Schibsted, etc…
About Us
NoSQL:
Databases for storing and querying data, mainly semi-structured
Support for transactions and optimized for random reads and writes. Operational applications