5. VARIETY:
Most data is unstructured.
[Figure: data variety quadrant — internal vs. external, structured vs. unstructured]
Internal, structured: BI + data connections — partner data, reference data, CRM, ERP, production, finance, HR, procurement, machine sensor data, etc.
Internal, unstructured: documents, email, contact center calls, presentations, security images, medical scans — accessed via search and ECM tools
External, structured: traditional BI
External, unstructured: social media content, channel content — accessed via social media monitoring tools
Business Intelligence & Variety
In Business Intelligence (BI) systems, data is mostly internal and structured. With the rise of social media content, digitalization, and global supply chains, requirements shift to support a broadening variety of structuredness.
Business Intelligence is the set of techniques and tools required for the transformation of raw data into meaningful and useful information for business analysis purposes.
7. Several platforms embrace existing database technologies in order to optimize analytical applications on large data volumes.

Massively parallel processing (MPP): row-based databases designed to scale out on a cluster of commodity servers; also known as "shared-nothing" architecture. Vendors/products: Teradata Active Data Warehouse, Greenplum (EMC), Microsoft Parallel Data Warehouse, Aster Data (Teradata), Kognitio.

Columnar databases: DBMS that store data in columns, not rows; support high data compression and analytical query performance. Vendors/products: Sybase IQ (SAP), ParAccel, Infobright, Vertica (HP), 1010data.

Analytical appliances: pre-configured hardware-software systems. Vendors/products: Netezza (IBM), Teradata Appliances, Oracle Exadata, Greenplum Data Computing Appliance (EMC).

In-memory databases: systems that load data into memory to execute complex queries. Vendors/products: SAP HANA, Cognos TM1 (IBM), QlikView, Membase.

Distributed file-based systems: systems designed for storing, manipulating, and querying large volumes of unstructured and semi-structured data. Vendors/products: Hadoop (Apache, Cloudera, MapR, IBM, Hortonworks), Apache Hive, Apache Pig.

Analytical services (cloud): analytical platforms delivered as hosted or public-cloud-based services. Vendors/products: 1010data, Kognitio.

Nonrelational (NoSQL): nonrelational databases optimized for querying unstructured and structured data. Vendors/products: MongoDB, Apache Cassandra, Apache HBase.

Complex event processing (CEP): systems optimized for calculation and correlation of large volumes of discrete events and application of conditions. Vendors/products: IBM, Tibco, StreamBase, Sybase (Aleri), Informatica.

Source: Wayne Eckerson, "Big Data Analytics: Profiling the Use of Analytical Platforms in User Organizations"
Existing Database Technology
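The columnar entry above notes high compression and analytical query performance. A minimal, illustrative sketch of why (a toy in-memory layout with made-up data, not a real DBMS): storing each attribute contiguously lets an aggregate touch only the column it needs, and repeated values in a column compress well with run-length encoding.

```python
# Toy comparison of row vs. column layout (illustrative data, not a real DBMS).
rows = [
    ("2013-01-01", "DE", 10),
    ("2013-01-01", "DE", 20),
    ("2013-01-02", "US", 30),
]

# Row store: each record kept together -- good for OLTP point lookups.
# Column store: each attribute kept together -- good for scans and aggregates.
columns = {
    "date":    [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "amount":  [r[2] for r in rows],
}

def run_length_encode(values):
    """Columns with many repeated values compress well with RLE."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

# An aggregate query only has to scan the single column it needs:
total = sum(columns["amount"])                      # like SELECT SUM(amount)
compressed = run_length_encode(columns["country"])  # [['DE', 2], ['US', 1]]
```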
8. History
• Google published a paper describing MapReduce, an algorithm for processing large amounts of data
• Doug Cutting, who worked at Yahoo, read that paper and initiated Hadoop
• "Hadoop" was the name of his son's yellow toy elephant
• Hadoop became an Apache top-level project, supported by, among others, Facebook, IBM & Yahoo
Facts
• Open source project
• Written in Java
• Optimized to handle:
• Massive amounts of data through parallelism
• Using inexpensive commodity hardware
• A variety of data (structured, unstructured, semi-structured)
• Great performance (on large data volumes)
• Reliability provided through replication
• Not for OLTP, not for OLAP, good for Big Data (1)
(1)
OLTP: Online Transaction Processing (CRM, ERP)
OLAP: Online Analytical Processing (data mining, complex queries over multidimensional data)
What is Hadoop?
9. Hadoop consists mainly of two components:
Hadoop Distributed Filesystem (HDFS): the Hadoop core stores data on several nodes in the cluster, with the goal of providing greater bandwidth across the cluster as well as higher reliability.
Hadoop MapReduce: a computational paradigm called Map/Reduce, which takes an application and divides it into multiple fragments of work, each of which can be executed on any node in the cluster.
http://mohamednabeel.blogspot.de/2011/03/starting-sub-sandwitch-business.html
[Figure: File1.txt is split into blocks A, B, and C; each block is replicated across data nodes 1-4 of the cluster]
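The block-replication idea can be sketched in a few lines. This is a simplified model, not HDFS internals: the block size, replication factor, and round-robin placement below are illustrative choices (HDFS uses 64/128 MB blocks and rack-aware placement).

```python
# Sketch of HDFS-style block replication (simplified model, not real HDFS).
BLOCK_SIZE = 8          # bytes; tiny for the example (HDFS uses 64/128 MB)
REPLICATION = 3         # default HDFS replication factor
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """A file is stored as a sequence of fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes=NODES, replication=REPLICATION):
    """Toy round-robin placement: each block lands on `replication` distinct nodes."""
    return {b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
            for b in range(num_blocks)}

blocks = split_into_blocks(b"File1.txt content spread over the cluster")
placement = place_replicas(len(blocks))
# Losing one node still leaves two copies of every block (higher reliability),
# and clients can read different blocks from different nodes (greater bandwidth).
```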
[Figure: MAP → SORT → REDUCE example with shapes — Map: give every shape the value of 1; Sort: sort the shapes; Reduce: for each shape type, count the values]
Hadoop Core
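The Map → Sort → Reduce steps from the figure can be sketched as three plain functions. This is an in-process illustration of the paradigm (using words instead of shapes); real Hadoop distributes each phase across the cluster.

```python
# Minimal in-process sketch of the Map -> Sort -> Reduce steps from the figure.
from itertools import groupby
from operator import itemgetter

def map_phase(items):
    # "Give every shape the value of 1": emit a (key, 1) pair per item
    return [(item, 1) for item in items]

def sort_phase(pairs):
    # "Sort the shapes": bring equal keys together (the shuffle/sort step)
    return sorted(pairs, key=itemgetter(0))

def reduce_phase(sorted_pairs):
    # "For each shape type, count the values": sum the 1s per key
    return {key: sum(v for _, v in group)
            for key, group in groupby(sorted_pairs, key=itemgetter(0))}

counts = reduce_phase(sort_phase(map_phase(
    ["circle", "square", "circle", "triangle", "circle", "square"])))
# counts == {"circle": 3, "square": 2, "triangle": 1}
```

In Hadoop, the map and reduce functions run on the nodes that hold the data blocks, so computation moves to the data rather than the other way around.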
10. Data Warehouse Appliances
▪ Expensive dedicated HW
▪ Built for performance
▪ Designed for high volumes (e.g. tens of TB)
▪ High availability
▪ Initially developed using Relational Database Systems like
Oracle, IBM DB2
▪ Designed for modeled and structured data
▪ Business As Usual ways to design, build and deliver
▪ Teradata, Exadata, Netezza, HANA, ... are examples
Hadoop Infrastructure
▪ Uses commodity PCs
▪ Built for extreme scalability
▪ Designed for extreme volumes (tens of PB and more)
▪ Very high availability
▪ Initially developed for web ranking
▪ Hadoop = Data is distributed over many machines
▪ MapReduce = Computing is distributed and executed
where data is (grid solution)
Data Warehouse Appliances vs. Hadoop
"Classical" Data Warehouse Appliances (DWH) differ from a Hadoop infrastructure in their technical basis and in how they are used. This does not mean that DWH appliances are now irrelevant; rather, a combination of both is the basis for being future-ready.
11. Data import/export (Flume, Sqoop)
Libraries, algorithms (Mahout, LZO compression)
Tools – monitoring, user experience (Hue, Ambari, White Elephant)
Data stores (HBase, HCatalog)
Workflow management, job scheduling (Oozie, Cascading)
Data querying (Hive, Pig, Impala, Drill)
Cluster provisioning & management (Whirr)
… many more
The Hadoop ecosystem uses several tools to solve individual tasks. For example, Sqoop and Flume import and export data into and out of Hadoop, while Hive serves as a data querying tool. Most of these tools are bundled into distributions such as Cloudera, Pivotal, or Hortonworks to reduce the management overhead for customers.
Hadoop Provides Rich Ecosystems For Tasks
Is the term "Big Data" just about "big"?
Big Data is often called the "new black gold", with a lot of undiscovered insights.
Big Data is about 3 "V's":
Volume: a massive amount of data to handle
Velocity: the speed at which data comes into the system
Variety: the variety of structuredness increases
In traditional Business Intelligence (BI) systems, data is mostly internal and structured. With the rise of social media content, digitalization, and global supply chains, requirements shift to support the broadening variety of structuredness.
Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes.
Big Data analytics platforms can be classified into four major categories:
1) Analytical databases
2) Analytical appliances
3) Analytical services
4) File-based analytical systems
The focus of these slides is on 4) file-based analytical systems.
"Classical" Data Warehouse appliances differ from a Hadoop infrastructure in their technical basis and in how they are used.
But that does not mean DWH appliances are not needed any more:
a combination of both is the basis for being future-ready.
The Hadoop ecosystem uses several tools to solve individual tasks. For example, Sqoop and Flume import and export data into and out of Hadoop, while Hive serves as a data querying tool.
Most of these tools are bundled into distributions such as Cloudera, Pivotal, or Hortonworks to reduce the management overhead for customers.
Hadoop can be integrated into an SAP HANA system to extend the power of in-memory computing and the flexibility of SAP HANA with easy-to-use and cost-efficient storage.
1) Data analytics – mining data held in Hadoop for business intelligence and analytics
2) Flexible data store – using Hadoop as a flexible store of data captured from multiple sources, including SAP and non-SAP software, enterprise software, and externally sourced data
3) Simple database – using Hadoop as a simple database for storing and retrieving data in very large data sets
4) Processing engine – using the computation engine in Hadoop to execute business logic or some other process