Big Data and You: An Introduction

Big Data and You
May 2015 Edition
Objectives
This document is designed to introduce Big Data and Analytics. Rather than a deep-dive technical paper or a detailed product portfolio, it is a friendly educational presentation, easily and quickly read, for specialists, architects, PMs and managers (*). It has one simple goal (though producing it was a complex and time-consuming exercise): if you read this paper, you learn something, and then you want more details to become an expert. Yes, you can Big Data.
Table of Contents
1. Introduction
2. Definition
3. BI principles
4. Chronology
5. Hadoop I
6. Hadoop II
7. Hadoop Ecosystem
8. BI vs Big Data
9. Hadoop patterns
10. Hadoop Market
11. BD&A vendors
12. Competition
13. In Memory
14. Streams
15. BigInsights
16. Architecture
17. Positioning
18. Why Power?
19. Contacts
20. New!
Introduction
2012 was the year of the Big Data marketing buzz, 2013 the year of Big Data technical enablement, and 2014 the year of Big Data projects. Now European customers are deploying Big Data (and still analytics) projects on a massive scale. It is time to become an expert, to guide our customers, talk with the Big Data ecosystem and fill the Big Data skills gap.
(*) This paper does not claim to be exhaustive on the subject of Big Data, nor is it intended to recommend a precise architecture to architects, to give performance and technical details to specialists, or to serve as a marketing campaign. It assumes little or no prior knowledge of Big Data.
Author: Christophe.menichetti@fr.ibm.com

# 1
IBM Montpellier Client Center Christophe.menichetti@fr.ibm.com
What is Data Analysis? Why Analyse Data?
Data analysis is a process of inspecting, cleaning, transforming and modelling data with the goal of discovering useful information, suggesting conclusions and supporting decision-making.
Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science and social science domains, such as:
• Business Intelligence / Analytics
• Data mining / predictive tools
• Big Data
• Data integration / data visualisation
• And so on …
IT technologies and computer science keep evolving. Yesterday, when IBM, Honeywell, Sperry, ICL, Xerox, Digital and Olivetti were the IT leaders, CPU and memory were the key differentiators. Today, when IBM, Google, SAP and Oracle are the IT leaders, the ultimate differentiator is the ability to make more informed choices with confidence, and to anticipate and shape business outcomes.
As company and industry leaders, you absolutely need deeper insight from your information to beat your competitors:
• Which customers are thinking of leaving?
• Which transactions are fraudulent?
• Which life-threatening conditions can be detected in time to intervene?
Let’s make it simpler: an example
Analytics = transforming data into (sexy) information to make (intelligent) decisions
Weather forecast: you have to decide which boots to take to go to Paris. You are no expert in meteorology (temperature, pressure, cyclones = RAW data), but you can decide based on the weather map (report/analysis).
!message : Data is the new oil, requiring mining, refining and delivering
Big data and You

# 2
Definition
What is Business Intelligence?
Business analytics (BA) refers to the skills, technologies and practices for continuous, iterative exploration and investigation of past business performance to gain insight and drive business planning. Business analytics focuses on developing new insights and understanding of business performance based on data and statistical methods.
In contrast, business intelligence (BI) traditionally focuses on using a consistent set of metrics to both measure past performance and guide business planning, which is also based on data and statistical methods.
What is Big Data?
Big Data is a broad term for data sets so large or complex that they are difficult (or too expensive) to process using traditional data processing applications. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization and information privacy.
!message : Big Data creates new opportunities to extend Analytics for higher value
Beyond Volume, Variety and Velocity, two more Vs:
4th V: Value
5th V: Veracity
For more information/technical details, feel free to contact us

OLTP versus OLAP
# 3
BI reference Architecture
Reporting solutions
Display data in either a synthesized or a detailed view that is easy for the end user to understand (data mining: discovering interesting or useful patterns and relationships in large volumes of data, analyzing the past to predict the future).
Data warehouse
Central database in which data are stored and can be restructured to answer business needs.
ETL (Extract, Transform, Load)
Unifies data from heterogeneous data sources (extracting the useful data) and consolidates them into a single destination database (cleansing and modifying the data according to the desired output).
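To make the ETL step concrete, here is a minimal sketch in Python, using only the standard library. Everything in it is an assumption made for illustration: the two sources (`CRM_CSV`, `WEB_JSON`), the `customers` table and the cleansing rules are hypothetical, and an in-memory SQLite database stands in for the data warehouse. Real ETL tools do far more, but the three stages are the same.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical, heterogeneous sources: a CSV export and a JSON feed.
CRM_CSV = "customer,country\n alice ,fr\nBob,DE\n"
WEB_JSON = '[{"name": "carol", "country": "fr"}, {"name": "", "country": "de"}]'

def extract():
    """Extract: read both sources into one common (name, country) shape."""
    crm_rows = [(r["customer"], r["country"]) for r in csv.DictReader(io.StringIO(CRM_CSV))]
    web_rows = [(r["name"], r["country"]) for r in json.loads(WEB_JSON)]
    return crm_rows + web_rows

def transform(rows):
    """Transform: cleanse names, normalise country codes, drop empty records."""
    cleaned = []
    for name, country in rows:
        name = name.strip().title()
        if name:
            cleaned.append((name, country.strip().upper()))
    return cleaned

def load(rows, connection):
    """Load: consolidate the cleaned rows into a single destination table."""
    connection.execute("CREATE TABLE customers (name TEXT, country TEXT)")
    connection.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    connection.commit()

warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(transform(extract()), warehouse)
loaded = warehouse.execute("SELECT name, country FROM customers ORDER BY name").fetchall()
print(loaded)
```

The record with an empty name is dropped during cleansing, which is the kind of decision that makes the data warehouse side of a BI project so costly.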
Good to know!
People very often associate BI with the reporting/data mining tool, because it is the “visible” part of the iceberg. But this is a misnomer: BI refers to the full set of tools, including reporting, data warehouse and ETL. For your information, ~70% of the costs and effort in BI projects go into the data warehouse, the most important (but hidden) part of the “iceberg”.
Star Schema
Optimized for SQL read requests. The fact table (the metrics of the reports) sits in the middle, surrounded by dimension tables (the axes of analysis) = On Line Analytical Processing (OLAP)
3NF Schema
Third normal form, optimized for flexibility and storage space savings = On Line Transactional Processing (OLTP)
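A star schema is easier to picture with a tiny example. The sketch below uses Python's built-in sqlite3 module. The table names (`fact_sales`, `dim_product`, `dim_date`) and the sample figures are hypothetical, chosen only to show a fact table joined to its dimension tables in a typical OLAP-style read.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'boots'), (2, 'umbrellas');
    INSERT INTO dim_date    VALUES (10, 2014), (11, 2015);
    INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 11, 150.0),
                                   (2, 10, 40.0),  (2, 11, 60.0);
""")

def sales_by_category_and_year(connection):
    """OLAP-style read: aggregate the fact table along two dimensions."""
    query = """
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d    ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """
    return connection.execute(query).fetchall()

report = sales_by_category_and_year(conn)
for row in report:
    print(row)
```

The metric (`amount`) lives only in the fact table, while each dimension table supplies one axis of the report. A 3NF design would instead split the data further to avoid redundancy, which suits transactional writes better than analytical reads.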
How does Analytics work ? What does OLAP mean ?
!message : BI/Analytics is the way to transform raw data into decision/information

First steps: the 1950s
IBM Journal article "A Business Intelligence System" (Hans Peter Luhn, 1958)
Birth of the term “business intelligence”
First tools for automatic methods, providing alert services (for scientists)
1970
First MIS solutions (Management Information System)
Static, inflexible
No analysis features
1980
First EIS software (Executive Information System)
More sophisticated than MIS: simulations, reports, forecasts
1990
BI concepts are officially formalized by Howard Dresner, Gartner Group analyst
Birth of Business Performance Management (BPM / EPM)
2005 – 2010
Strong consolidation of the BI market through major IT acquisitions
Oracle acquired Siebel (reporting, $6B), Hyperion (EPM, $4B), Sunopsis (ETL, $1B)
SAP acquired Business Objects (reporting, $7B), Sybase (DW, $6B), Fuzi (ETL)
IBM bought Cognos (reporting, $5B), Netezza (DW, $2B), Ascential (ETL, $1B)
-
Yahoo and Google faced severe performance issues with DW architectures: the data analysis approach needed rethinking, which led to the birth of Hadoop
2012 and beyond
Birth of Big Data
# 4
A little bit of history ?
!message : Analytics has evolved from business initiative to business imperative

Why Hadoop ?
1. Performance issue. Consider that over the past decade:
- CPU speed has increased 8 to 10 times
- DRAM speed has increased 7 to 9 times
- Network speed has increased 100 times
- Bus speed has increased 8 to 10 times
- Hard disk drive speed has increased ONLY 1.2 times
2. NoSQL: Not Only SQL
A mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases.
Motivations for this approach include simplicity of design, horizontal scaling, finer control over availability and, most importantly, COST.
!message : Hadoop meets the need for new scalable architectures, providing business efficiency and flexibility beyond the existing relational data model
# 5

How does it work ?
Apache Hadoop is an open-source software framework, written in Java, for distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware.
The core of Apache Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part, MapReduce. Hadoop splits files into large blocks and distributes the blocks among the nodes in the cluster. To process the data, Hadoop MapReduce transfers code (specifically JAR files) to the nodes that hold the required data, and the nodes then process it in parallel.
This approach takes advantage of data locality, so the data can be processed faster and more efficiently than in a more conventional supercomputer architecture, which relies on a parallel file system where computation and data are connected via high-speed networking.
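The MapReduce idea can be illustrated without a cluster. The sketch below is plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`) are our own, and the two strings in `blocks` stand in for file blocks that, on a real cluster, would sit on different nodes and be mapped in parallel where they are stored.

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(blocks):
    mapped = []
    # Here the blocks are mapped one after another in a single process;
    # Hadoop would run each map task on the node holding the block.
    for block in blocks:
        mapped.extend(map_phase(block))
    return reduce_phase(shuffle(mapped))

blocks = ["big data and you", "big data big insight"]
counts = word_count(blocks)
print(counts)
```

Because each block is mapped independently, adding nodes adds map capacity, which is where the horizontal scalability comes from.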
Would you like to sound like an expert?
HDFS default replication: 3x. HDFS default block size: 128 MB. HDFS sits on top of a native Linux filesystem (ext3, ext4). Slave nodes run the HDFS data node and the MapReduce task tracker. Master nodes run the HDFS name node and the MapReduce job tracker. Despite its name, the secondary name node is a checkpointing helper for the name node, not a high-availability standby.
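A quick piece of arithmetic shows what the default replication factor and block size mean in practice. This is a rough sketch: the function name `hdfs_footprint` and the 1000 MB file are assumptions for illustration, and the calculation ignores metadata and any compression.

```python
import math

BLOCK_SIZE_MB = 128   # HDFS default block size quoted above
REPLICATION = 3       # HDFS default replication factor quoted above

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, number of block replicas, raw MB stored)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies its real size, so raw storage is size x replication.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

Under these assumptions a 1000 MB file is split into 8 blocks, stored as 24 block replicas spread over the data nodes, and consumes about 3000 MB of raw disk.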
!message : Volume and Variety challenges have led to the creation of new ways to store and process data: HDFS and MapReduce
# 6

YARN, “the Hadoop 2”, decouples resource management and scheduling from MapReduce, enabling Hadoop to support more varied processing approaches and applications (interactive SQL, real-time streaming, batch processing)
# 7
Flume was created to let you flow data from a source into your Hadoop® environment.
ZooKeeper provides a centralized infrastructure and services that enable synchronization across a cluster. It maintains common objects needed in large cluster environments, such as configuration information and a hierarchical naming space.
HBase is a column-oriented database management system that runs on top of HDFS. It is well suited to sparse data sets, which are common in many big data use cases.
Hive™ was developed at Facebook. It lets SQL developers write Hive Query Language (HQL) statements that are similar to standard SQL statements.
Oozie simplifies workflow and coordination between jobs. It lets users define actions and the dependencies between them.
Pig, initially developed at Yahoo!, lets people focus on analyzing large data sets and spend less time writing mapper and reducer programs.
Sqoop is a connectivity tool for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop.
Mahout takes the most popular data mining algorithms for clustering, regression and statistical modeling and implements them using the MapReduce model.
Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters.
!message : The HDFS file system is not restricted to MapReduce jobs. It can be used
for other applications, many of which are under development at Apache

# 8
Different Approaches
Don’t get us wrong: there is no bad or good approach, and no magical approach. There are different approaches for different needs and results.
With the BI approach, business users determine what question to ask (business hypothesis) and the IT team structures the data (specific, selected data loaded into a data warehouse) to answer that question.
With the Big Data approach, IT delivers a platform (holding all the data) that enables creative discovery, and business users explore what questions could be asked.
Different Architectures
BI architecture: the application server and the database server are separate, with the network in the middle, so data has to go through the network.
Big Data architecture: the analysis program runs where the data is, so functions (not data) go through the network. This is highly scalable and flexible by design.
Different Objectives
Hadoop is one of the multiple facets of Big Data. This facet is designed to run huge (Volume) “read” batch jobs, at very low cost, on unstructured data (Variety).
!message : Do not compare apples and oranges: you (still) need both

# 9
Technical Hadoop Patterns
Big Data Exploration: find, visualize and understand all big data to improve decision making
Enhanced 360° View of the Customer: extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
Operations Analysis: analyze a variety of machine data for improved business results
Data Warehouse Augmentation: integrate big data and data warehouse capabilities to increase operational efficiency
Security/Intelligence Extension: lower risk, detect fraud and monitor cyber security in real time
Big Data Business Use Cases
Keep in Mind
The term Big Data is a bit of a misnomer. Big Data does not only refer to huge volumes of data or to Hadoop: many other patterns use streams or in-memory solutions.
!message : Big Data Analytics is applied across all industries, with different use cases

# 10
Hadoop has been most rapidly adopted by the government, banking, finance, IT and ITES, and insurance sectors.
Geographical analysis of the market suggests that North America is the leading revenue-generating market and will remain so until 2020.
Hadoop hardware-based solution providers have been the highest receivers of venture capital funding. Recent times have witnessed a steep demand for real-time, operational analytics.
!message : In the 1990s, new high-performance hardware was the differentiator for companies competing in the market. Nowadays Big Data is the key competitive differentiator.
Sources: Hortonworks study (2014), Wikibon figures (2013)

# 11
IBM Montpellier Client Center
The market for Big Data & Analytics solutions has exploded
The race is hot and complex:
• Every vendor is jumping in
• Alternatives from everywhere
• Startups proliferate
• Partnerships
No other vendor has what IBM has:
– Software / Hardware
– Services / Research
– Cloud, Mobile, Social
Yet just having ‘everything’ does not make for a market leader
Based primarily on the 2012 Wikibon report/forecast: http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017
!message : The race is hot, every vendor is jumping in, alternatives come from everywhere and startups proliferate: how do we differentiate in such a crowded market?

# 12
4 major distributions of Hadoop have spawned ecosystems of partners developing data management and analytic solutions for Big Data
!message : IBM is a global Big Data and Analytics leader, with the industry’s most comprehensive, enterprise-class solutions and the broadest portfolio

# 13
In-Memory: good timing for an old idea
Largely driven by the big data phenomenon, in-memory computing is a powerful, transformative IT trend that meets high-performance analytics expectations and data visualization needs. An in-memory solution should not be confused with a conventional DBMS that stores data in disk blocks cached in memory.
“In-memory” database technology has been around for over a decade. Traditionally it was used in a limited number of operational application workloads (FSS trading, telco billing, HPC, embedded devices), but 2011 was an inflection point, with increased focus and ‘push’ by SAP.
With an in-memory database, all information is initially loaded into memory. This eliminates the need for optimized databases, indexes, aggregates and the design of cubes and star schemas.
Column Based Technology
The arrival of column-centric databases, which store similar information together, allowed data to be stored more efficiently, with greater compression and faster read access, reducing the amount of memory needed to perform a query and increasing processing speed. That is why column-based technology is very often associated with in-memory technology.
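Why does storing similar information together help? The sketch below is a toy illustration in plain Python, not how any particular column store is implemented: the sample `rows`, the `columns` layout and the `run_length_encode` helper are all assumptions. It shows that an aggregate only needs to touch one column, and that a column of repeated values compresses well.

```python
from itertools import groupby

# Row layout: every record carries all of its attributes together.
rows = [
    ("FR", "boots", 100),
    ("FR", "boots", 150),
    ("FR", "umbrellas", 40),
    ("DE", "umbrellas", 60),
]

# Column layout: one list per attribute, so similar values sit side by side.
columns = {
    "country": [r[0] for r in rows],
    "product": [r[1] for r in rows],
    "amount":  [r[2] for r in rows],
}

def run_length_encode(values):
    """Compress consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Repeated values in a column compress well.
encoded_country = run_length_encode(columns["country"])

# An aggregate reads only the column it needs, not whole records.
total_amount = sum(columns["amount"])

print(encoded_country, total_amount)
```

Less data to hold and less data to scan is what makes the column layout a natural partner for keeping everything in memory.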
Volume: as users and data increase, the RAM needed also increases = hardware costs
Velocity: real-time analytics, operational analytics
!message : Big Data analytics can benefit from these very large in-memory systems for velocity (since memory has become cheaper)

# 14
• Deal with terabytes of data each second
• Work with application, sensor and internet data, video/audio
• Deliver insight in microseconds to analytical applications
• Support complex scenarios using C++ or Java code
Streams is tailor-made for companies that need to process huge volumes of data from non-traditional sources and need results very, very quickly, integrated with existing analytics investments.
• Stream computing is a different paradigm. The traditional way (left of the figure) is to access data using queries that pull it from a data storage device such as a data warehouse or database, which is still valid for many requirements.
• The new stream computing paradigm brings the data to the query: data is pushed, or flows, through the analytics. This is required for many new use cases in big data.
• Here is a little more on how Streams works and what you can do with it.
• Each square in the figure represents an operator. The data passes through each operator as an input stream; the operator performs some action on it and emits an output stream.
• You can fuse data from multiple streams, modify it, annotate it, perform an analytics operation on it, or classify it.
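The operator idea can be sketched with Python generators. To be clear, this is not the IBM Streams programming interface: the operator names (`sensor_source`, `annotate`, `fuse`, `classify`), the sensor readings and the 30.0 threshold are all hypothetical. The point is only that data flows through a chain of operators, each consuming an input stream and producing an output stream, instead of being stored first and queried later.

```python
def sensor_source(readings):
    """Source operator: push readings downstream one at a time."""
    for reading in readings:
        yield reading

def annotate(stream, unit):
    """Annotate operator: turn raw tuples into events carrying a unit."""
    for sensor_id, value in stream:
        yield {"sensor": sensor_id, "value": value, "unit": unit}

def fuse(*streams):
    """Fuse operator: interleave several streams, round-robin, until all end."""
    iterators = [iter(s) for s in streams]
    while iterators:
        for it in list(iterators):
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)

def classify(stream, threshold):
    """Classify operator: flag events whose value exceeds the threshold."""
    for event in stream:
        event["alert"] = event["value"] > threshold
        yield event

# Two hypothetical temperature sensors, fused and then classified.
first = annotate(sensor_source([("t1", 19.5), ("t1", 31.0)]), "C")
second = annotate(sensor_source([("t2", 22.0)]), "C")
results = list(classify(fuse(first, second), threshold=30.0))

for event in results:
    print(event)
```

Nothing is computed until an event is pulled through the chain, and no event is written to storage along the way, which is the contrast with the pull-from-warehouse model described above.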
!message : Velocity challenges have led to the creation of a new data computing paradigm and solution: streaming, which delivers effective real-time results within microseconds

# 15
Hadoop is an open-source implementation and, although it is very well maintained and does the “job”, relying on it implies a risk for companies. As with Linux, major IT companies provide Hadoop distributions.
IBM took Hadoop and ruggedized it for enterprises, adding enterprise features such as performance, resilience and IBM experience (BigSheets, BigSQL, GPFS…) while remaining 100% faithful to the open standards. We call it BigInsights; it runs on x86, Power Systems and mainframe (Linux).
2 editions: Basic Edition (100% open source, free) and Enterprise Edition
BigSheets: a big data visualization capability that enables end users to collect, explore and uncover actionable insights through a commonly understood spreadsheet experience (drag and drop, clicks, without any Java or Hadoop skills).
Adaptive MapReduce: an already proven product from Platform Computing (HPC acquisition) that rewrites the MapReduce paradigm in C++ (no garbage collection, faster memory management), allowing:
• Optimized shuffle and map sort
• Separate resource management and job scheduling
• Shared memory across JVMs, eliminating data movement
BigSQL: SQL on Hadoop is challenging (wide variety of data, MR is batch oriented). BigSQL provides native, fully compliant SQL access to data stored in BigInsights, real JDBC/ODBC drivers, and optimization based on a Massively Parallel Processing (MPP) architecture, drawing on DB2 experience.
Spectrum Scale (GPFS FPO, File Placement Optimizer): a scalable, high-performance and highly reliable product with 20+ years of experience, which has many advantages over HDFS:
• POSIX compliant
• No single point of failure
• Multi-tenant
• HA/DR solutions
IBM BigInsights for Apache Hadoop v4 has just been released, based on the ODP initiative
Version 3.0 – Enterprise Edition
!message : IBM Hadoop strategy : better analytics tooling that is easier to use +
commitment to Hadoop open source (ODP initiative)

# 16
How are leading companies transforming their data and analytics environment
to take advantage of Big Data and provide faster, better insights at reduced
costs within their existing Enterprise Data Warehouses ?
!message : The foundational schematic to bring analytics to all stages in the data
lifecycle can be overlaid with specific products that provide the functions

Systems of Record: structured data from operational systems
Systems of Engagement: data that “connects” companies with their customers, partners and employees
Systems of Insight: diverse data types that combine structured and unstructured data for business insight
In Memory, Hadoop, EDW, Appliance
!message : Transformational benefits and business outcomes come from integrating new data sources with traditional corporate data to find new insights
# 17

# 18
Important to keep in mind
Big Data (BigInsights, Cognos, SPSS, …) can run on IBM System z. Customers can take advantage of co-locating business data and OLAP data, managing high-speed transactions and complex queries for real-time operational analytics on a single integrated platform, and benefit from the performance, resiliency and quality of service of the IBM mainframe for critical business, as many banking and insurance customers do.
!message : The infrastructure is a foundational piece of IBM’s perspective on delivering capabilities and offerings for BD&A
Hadoop is Linux, Linux is Power. Hadoop is cheap, Power is cheap.
Hadoop ecosystem: PowerLinux market acceptance
Power advantages for Big Data
• Linux on Power runs the same commands as Linux on x86, and versions are released on the same date
• Linux on Power accounts for 17.6% of the TOP500 most powerful Linux systems (with 5 in the top 10)
• POWER8 increases its performance, reliability and availability lead over Intel and offers an alternative to Intel
• The OpenPOWER Foundation brings rapid innovation to the Power platform for open Linux
• Little-endian support makes porting Linux on x86 applications even easier
• The POWER8 design point is big data (more threads, more cache, more bandwidth, CAPI …)
• The Intel design point is multiple markets (smartphones, tablets, desktop PCs, servers …)

# 20
IBM Big Data Resources / WW Competency Centers / Big Data Analytics Links
Web sites
ibm.com/Hadoop
Information Management Acceleration Zone
PowerLinux Big Data
IBM communities
IBM Systems Big Data and Analytics
BDSC practitioner wiki
IBM Analytics Global
Big Data & Analytics Client References
IBM developerWorks
https://www.ibm.com/developerworks/analytics/
Please, please
Help us improve this document: if you have any comments or ideas, please feel free to send an email
http://bigdatauniversity.com/
http://wikibon.org/wiki/v/Category:Big_Data
http://en.wikipedia.org/wiki/Apache_Hadoop
http://www.slideshare.net/search/slideshow?searchfrom=header&q=big+data
[INFO] Based on 3 years of experience with big data projects, and after many weeks of intensive work compiling several presentations given to customers or at conferences and synthesizing concepts, this educational paper aims to clarify some of the concepts and solutions around Big Data, in order to better understand the related challenges and opportunities. There may be (many) typing errors, mistakes, misleading words or missing concepts, so please be kind.

# 21
NEW AND/OR HOT !!! OPEN DATA PLATFORM
What is Open Data Platform (ODP)?
> It is an open-source, non-profit entity, focused on and committed to evolving the current state of the platform and delivering a Foundation-certified, packaged and tested Reference Distribution
Why Open Data Platform (ODP)?
> The current ecosystem is challenged and slowed by fragmented and duplicated efforts. The ODP Core will take the guesswork out of the process and accelerate many use cases by running on a common platform, freeing up enterprises and ecosystem vendors to focus on building business-driven applications.
Where to position ODP vs Apache?
> ODP supports the Apache (ASF) mission
> ASF provides a governance model around individual projects without looking at the ecosystem
> ODP aims to provide a vendor-led, consistent packaging model for core Apache components as an ecosystem
Why is IBM involved in ODP?
> Strong history of leadership in open source & standards: IBM has always been a believer in standardizing the interfaces to components of IT and application infrastructure (SQL, Eclipse, OpenPower …)
> Supports our commitment to open source currency in all future releases
> Accelerates IBM innovation within Hadoop & surrounding applications
> Expecting adoption of the Hortonworks and Pivotal distributions on PowerLinux
!message : ODP is clearly a major and strategic choice in the open community to accelerate Hadoop adoption and grow the BigInsights and PowerLinux ecosystem / ISVs

# 22
!message : IBM’s fundamental cloud strategy: a complete cloud offering, balancing control and simplicity.
NEW AND/OR HOT !!! Big Data/Analytics and Cloud
Figure labels: Customer Data Center (On-Premises) and Cloud Data Center (Off-Premises); SIMPLICITY and CONTROL. Offerings shown: PureData for Analytics, DB2 BLU, InfoSphere BigInsights, Cloudant, dashDB, SoftLayer.
Cloudant
Distributed NoSQL “data layer”, powering web, mobile and IoT since 2009
Available as a fully managed DBaaS, managed by you on-premises, or hybrid
Transactional JSON “document” database
Spreads data across data centers & devices
Ideal for apps that require:
> Massive, elastic scalability
> High availability
> Geo-location services
> Full-text search
> Occasionally connected users
dashDB
Data warehouse and analytics as a service on the cloud
dashDB keeps data warehouse infrastructure out of your way, allowing you to benefit from:
• Next-generation in-memory
• Columnar
• SIMD hardware acceleration
• Actionable compression
• Support for OLAP SQL extensions
• Connections to common 3rd-party BI tools

# 23
!message : Spark is positioned as a fast and general engine for Big Data. It generalizes the MapReduce model and could replace MapReduce
NEW AND/OR HOT !!! SPARK
Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based
MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a
cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.
Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3.
Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file
system can be used instead; in this scenario, Spark is running on a single machine with one executor per CPU core.
Spark had over 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among Big Data open source projects

# 24
!message : From an application point of view, the data lake challenge is to be a single, unified data repository, queryable like a black box
NEW AND /OR HOT !!! DATA LAKE ARCHITECTURE
IDC in late 2014 stated “By 2017 unified data platform architecture will become the foundation of BDA strategy. The unification will occur across information management, analysis, and search technology.”
• A data reservoir is a data lake that provides data to an organization for a variety of analytics processing, including:
  – Discovery and exploration of data
  – Simple ad hoc analytics
  – Complex analysis for business decisions
  – Reporting
  – Real-time analytics
• It is possible to deploy analytics into the data reservoir to generate additional insight from the data loaded into it.
• A data reservoir manages shared repositories of information for analytical purposes.
• Each data reservoir repository is optimized for a particular type of processing:
  – Real-time analytics, deep analytics (such as data mining), exploratory analytics, OLAP, reporting, …
Example – Creating a logical warehouse
Information virtualization hides the complexities of where the
data is located. Here different repositories are being used to
host different workloads, but this complexity is hidden by the
information virtualization layer.
