On Which side of the Cloud are you ?
An Introduction to Big Data
Denis Rothman
Copyright 2014 Denis Rothman
Big Data - Introduction
□ This course is not meant to make Big
Data experts out of you in a few
hours but is designed to help you
grasp the main concepts.
□ We’ll be discussing Apache Hadoop,
MapReduce, Mongodb, Pig and
several other names and concepts
that will be familiar to you by the end
of the course !
Copyright 2014 Denis Rothman
Big Data - Introduction
□ We’re going to talk about Apache « Hadoop » and « MapReduce » because the following companies use this technology, or at least parent or derived versions of it : Google, Yahoo!, Facebook, Amazon, IBM, eBay and many more key players on the market.
Copyright 2014 Denis Rothman
Big Data - Introduction
□ All the figures, software and brands
mentioned in this document are simple
examples. All of this is going to expand and
change through the years !
□ The main goal here is for you to grasp
enough concepts to be able to create Big
Data architectures with today’s but also
tomorrow’s technology and ideas !
□ So focus on the concepts and the way you
can solve problems with Big Data
technology.
Copyright 2014 Denis Rothman
Big Data – What is big data ?
Learn more : http://en.wikipedia.org/wiki/Big_data
Let’s say that starting at around 10TB for a dataset (a collection of data) we’re talking Big Data, and starting at one petabyte we really need the technology ! The world has jumped from talking petabytes to exabytes in a year; we’ll probably soon be talking zettabytes.
1 EB = 1 000 000 000 000 000 000 B = 10^18 bytes = 1000 petabytes = 1 million terabytes = 1 billion gigabytes.
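As a quick sanity check, here is the same conversion in a few lines of Python (a small sketch added for this write-up, not part of the original deck):

```python
EB = 10**18  # bytes in one exabyte (decimal SI units)

print(EB // 10**15)  # 1000 -> a thousand petabytes
print(EB // 10**12)  # 1000000 -> a million terabytes
print(EB // 10**9)   # 1000000000 -> a billion gigabytes
```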
Copyright 2014 Denis Rothman
Big Data – What is big data ?
For the Universe, the galaxies are our small representative volumes, and there are something like 10^11 to 10^12 stars in our Galaxy (the Milky Way).
• The number of bits on a terabyte-capacity hard disk (1000 GB) is typically about 10^13.
To find a comparison for the amount of data we now store, we have to go down to atom-level quantities in our universe !
Copyright 2014 Denis Rothman
Big Data – Can you represent the
Volume ?
Learn more : http://www.seagate.com/about/newsroom/press-releases/Terascale-Enterprise-HDD-pr-master/
Tell us how and where you would store a 1PB dataset for a given company without Big Data technology.
How many average-size 4 TB hard disks would it take simply to store the data ?
High-Capacity— highest capacity HDD (4TB) available in a 3.5-
inch enterprise-class SATA(Serial Advanced Technology Attachment)
HDD enabling scalable, high-capacity storage in 24×7
environments.
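A rough back-of-the-envelope answer, sketched in Python (the threefold replication factor is an assumption borrowed from the HDFS replication discussion later in this deck):

```python
# How many 4 TB disks does it take simply to store a 1 PB dataset?
dataset_tb = 1000   # 1 PB expressed in terabytes (decimal units)
disk_tb = 4         # capacity of one enterprise SATA disk

disks = dataset_tb // disk_tb
print(disks)        # 250 disks for a single copy of the data
print(disks * 3)    # 750 disks if the data is replicated 3 times
```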
Copyright 2014 Denis Rothman
Big Data – Can you represent a fast
way to access(Velocity) 1PB of data
with Big Data technology?
Let’s say we’re talking about the data related to all bank accounts of the BNP over the past 5 years that had a balance of more than 1000 $ at a given time and that need to be accessed for a financial analysis.
How would you do it now,
without Big Data
Technology ?
Copyright 2014 Denis Rothman
Big Data – Can you represent access to additional documents in a great Variety of data ?
Now we need to retrieve other documents to analyse these BNP accounts : text documents (signed contracts, for example).
How would you do it now,
without Big Data
Technology ?
Copyright 2014 Denis Rothman
Big Data – Do you think you can manage
10PB without Big Data ?
If we now try to solve the 3 V problem with a 10PB dataset to
manage, how could we do it even with Oracle Big Files ?
A bigfile tablespace contains only one datafile or tempfile, which can contain up to approximately 4 billion ( 2^32 ) blocks. The maximum size of the single datafile or tempfile is 128 terabytes (TB) for a tablespace with 32 K blocks and 32 TB for a tablespace with 8 K blocks.
(Table: maximum number of blocks per bigfile tablespace.)
Learn more : http://docs.oracle.com/cd/B28359_01/server.111/b28320/limits002.htm#i287915
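Even taking Oracle at its maximum, the arithmetic is sobering. A minimal sketch, assuming 32 K blocks and the 128 TB ceiling quoted above:

```python
import math

# How many 128 TB bigfile tablespaces would a 10 PB dataset need?
dataset_tb = 10_000    # 10 PB in terabytes
bigfile_max_tb = 128   # maximum bigfile tablespace size with 32 K blocks

print(math.ceil(dataset_tb / bigfile_max_tb))  # 79 tablespaces, before indexes or growth
```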
Copyright 2014 Denis Rothman
Big Data – Volume, Velocity, Variety that is
beyond non Big Data solutions
We’ve seen the limits of non-Big Data technology.
How would you solve the problem ?
Even if you already know
how Big Data works, do
you think it will solve the
increasing size and
variety of datasets ?
How will it help with
sensors ?
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
There are several solutions on the market. Let’s use Apache Hadoop as a way to understand how Big Data storage works to solve the 3V problem.
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ There are many ways
to try to understand a
subject. This part of
the course is designed
for you to see that
the core ideas of
Apache Hadoop are
simple !
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ First of all, what does « Hadoop » mean ? It means nothing !
□ Doug Cutting just named it after his son’s toy elephant. So that’s one mystery solved.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ The first thing
we need to do is
understand
cluster
architectures.
□ Cluster architectures are spreading at a wild speed as a framework for the analysis of big data.
New Exabytes of data appear
each…week…
Learn more : http://www.ovh.com/fr/serveurs_dedies/big-data/
Big Data – Apache Hadoop
□ Cluster architectures are the best choice because they offer Cloud-class performance : extensible, flexible and cost-efficient.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ So what ? So what’s the difference between a traditional enterprise architecture and a cloud-cluster architecture ?
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ A traditional
architecture is
built on
server technology
that is expensive
and thus has to be
used as much as
possible.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ A traditional
architecture is
also built on
storage capacity
of different sizes
and types : SSD
to SATA.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ A traditional
architecture is
finally built on
storage area
networks
(SAN) to
connect a set
of servers to a
set of storage
units
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ The great strength of a traditional architecture is that the servers and storage units can be managed (size, number) separately, with the SAN (Storage Area Network) connecting them.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ The big drawback of traditional
architecture is that it must be
extremely reliable and any failure
must be dealt with very quickly.
□ This brings the price up.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ Traditional architectures were
designed for intensive applications
focusing on one part of the data. The
servers process the information and
then the results are transferred to
storage.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ So in essence a traditional architecture is designed for a specific need (intense computing, a standard data warehouse…). Fine.
□ How would you now solve a problem involving a tremendous weekly increase in data (PB), not knowing what you’re looking for in advance : sorting by order, by timestamp, or retrieving certain values ?
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ Even a few years ago Google was facing an increase of data of 20PB…per day.
□ For a special operation, let’s say user mail history (number and size of mails over a five-year period), we need to parse the entire dataset, not just a subset.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ Why sort that data ?
□ To make searching, merging and
analyzing easier.
□ So how can you sort n x 20PB of
data?
□ With cluster architecture !
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
Let’s now study 3 basic benchmarks of cluster computing :
- PennySort
- MinuteSort
- GraySort
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ Sorting being a major function of Big
Data, it’s important to have
benchmark references.
Learn more : http://sortbenchmark.org/
GraySort
Metric: Sort rate (TBs / minute) achieved while sorting a very large
amount of data (currently 100 TB minimum).
PennySort
Metric: Amount of data that can be sorted for a penny's worth of system
time.
Originally defined in AlphaSort paper.
MinuteSort
Metric: Amount of data that can be sorted in 60.00 seconds or less.
Originally defined in AlphaSort paper.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
Learn more : http://sortbenchmark.org/
GraySort: 2013, 1.42 TB/min
Hadoop, 102.5 TB in 4,328 seconds
2100 nodes × (2 × 2.3 GHz hexcore Xeon E5-2630, 64 GB memory, 12 × 3 TB disks)
Thomas Graves, Yahoo! Inc.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
PennySort: 2011, 286 GB
psort, 2.7 GHz AMD Sempron, 4 GB RAM, 5 × 320 GB 7200 RPM Samsung SpinPoint F4 HD332GJ, Linux
Paolo Bertasi, Federica Bogo, Marco Bressan and Enoch Peserico, Univ. Padova, Italy
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
MinuteSort: 2012, 1,401 GB
Flat Datacenter Storage, 256 heterogeneous nodes, 1033 disks
Johnson Apacible, Rich Draves, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, Ed Nightingale, Reuben Olinksy, Yutaka Suzue, Microsoft Research
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ Getting down to a cluster.
A cluster breaks down into its basic component : a NODE.
A node is made up of cores, memory and disks, and nodes can be assembled in the thousands, the tens of thousands, the hundreds of thousands.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ The NODES
are then
grouped in
RACKS
□ The RACKS
are then
grouped into
CLUSTERS
The CLUSTERS ARE CONNECTED TO A NETWORK WITH A CISCO
SWITCH, for example
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ The first property of a cluster is to be MODULAR and SCALABLE (it handles a growing number of elements).
□ This means that it’s cheap to just add more and more nodes at the best price, and they don’t need to be that reliable, as we will see further on.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ The second property of a cluster is DATA LOCALITY. This means you’re not going through a sequence but directly to the physical location. No more bottlenecks...
□ This leads to PARALLELIZATION, which means you access several locations simultaneously.
Learn more : http://en.wikipedia.org/wiki/Locality_of_reference
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ With data locality and parallelization
MASSIVE PARALLEL PROCESSING
becomes a reality.
□ The main function, sorting, can now
be done within each node on a subset
of data.
□ Please bear in mind that these nodes
are cheaper than traditional
architectures.
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ This is just an example that goes back to 2011 but makes the point.
A typical SSD drive system would process data at about $1.2 a gigabyte at 30K IOPS, and a SATA system at about $0.05 but only at 250 IOPS.
IOPS: input/output operations per second.
Let’s take a simple cluster…
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ In a simple cluster, 30 000 IOPS are delivered in parallel by around 120 nodes (around 250 IOPS each) at the same time, BUT at the per-IOPS price of SATA.
□ We’re talking about cheaper and more expendable equipment.
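The aggregate arithmetic behind that claim, as a minimal sketch using the figures from the slide above:

```python
nodes = 120
iops_per_sata_node = 250   # cheap SATA-class throughput per node

print(nodes * iops_per_sata_node)  # 30000 IOPS in aggregate, at SATA prices
```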
Copyright 2014 Denis Rothman
Big Data – Map Reduce
□ This means that in a cluster architecture failures will be more frequent with cheaper equipment.
Copyright 2014 Denis Rothman
□ Failures with cheaper equipment ? Who cares ? Don’t get ripped off purchasing expensive “reliable” hardware; buy expendable material to be cost-efficient.
We just need a way to detect failures and respond quickly to deal with this complexity.
We’ll need to replicate the data up to three times in three different data locations.
Let’s see how to solve these problems with
Apache Hadoop.
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
Hadoop is about clusters built with commodity hardware, not high-end hardware :
• widely available
• interchangeable
• plug and play
• breaks down more often
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
□ Before we go on, what’s the purpose of all this ? WHY ?
It all started with Google, which had to index pages every day and quickly reached huge amounts of data. Hadoop reaches back into the Google File System (GFS) and Google MapReduce. In the early days, Yahoo! and Apache got involved in the process.
Around 2004, Google started publishing all this…
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
□ Let’s take Facebook. You know all the information that’s in there for you. But with over 1 000 000 000 users + the 450 000 000 WhatsApp users, we’re talking about a massive chunk of the world population increasing the size of Facebook every day. We’re talking increasing data in exabytes in this case. How are you going to run a search over that one dataset spread over hundreds of thousands of nodes ?
With Apache Hadoop !
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
Apache Hadoop was designed for DISTRIBUTED DATA OVER THE CLUSTERS
Apache Hadoop was designed with the concept of DATA LOCALITY
Hadoop Distributed File
System (HDFS)
Hadoop Map Reduce
Copyright 2014 Denis Rothman
□ HDFS has 3 main functions : split, scatter and replicate.
Big Data – Apache Hadoop
1. SPLITTING. In Hadoop each FILE BLOCK has the SAME size (64 MB, for example) in a STORAGE BLOCK.
2. SCATTERING. These FILE BLOCKS are generally on different datanodes.
3. REPLICATION. There are multiple copies of these blocks in different locations.
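A toy sketch of these three functions in Python. The 64 MB block size matches the slide; the node names and the simple round-robin placement are illustrative assumptions, not Hadoop’s actual placement policy:

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # fixed-size 64 MB blocks
REPLICATION = 3                 # three copies of every block
NODES = [f"datanode{i}" for i in range(1, 10)]  # hypothetical cluster

def split(data: bytes):
    """SPLITTING: cut a file into equal fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def scatter_and_replicate(blocks):
    """SCATTERING + REPLICATION: place each copy of a block on a different node."""
    nodes = itertools.cycle(NODES)
    return {block_id: [next(nodes) for _ in range(REPLICATION)]
            for block_id, _ in enumerate(blocks)}
```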
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
Architecture
Learn more : http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Blocks
One main node.
Generally 3 copies in the replication process, so nodes can fail !
Copyright 2014 Denis Rothman
□ The NameNode is the centerpiece
of an HDFS file system. It keeps
the directory tree of all files in the
file system, and tracks where
across the cluster the file data is
kept. It does not store the data of
these files itself.
□ Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives (addresses).
Big Data – Apache Hadoop
Learn more : http://wiki.apache.org/hadoop/NameNode
Works fine for
failures on
commodity
equipment !
Copyright 2014 Denis Rothman
□ So what happens when the NameNode fails ?
□ Hadoop has copies of the data and
as long as the same IP address is
reassigned, a new NameNode will be
designated and that’s it !
Big Data – Apache Hadoop
Learn more : http://wiki.apache.org/hadoop/NameNode
Copyright 2014 Denis Rothman
Once HDFS is set up, MAP REDUCE is there to retrieve information in a simple way.
First a MAPPER is used, then the information is REDUCED.
Let’s see how this happens.
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
The MAPPER function relies on the fact that the data is EVENLY DISTRIBUTED. This means that Massive Parallel Processing is possible.
The MAPPER uses the LOCALITY (hence « MAP ») features of HADOOP to optimize its search.
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
□ If the file blocks were not of equal size, the processing time would be driven by the largest block.
□ But since in Hadoop the file blocks have the same size, processing is tremendously enhanced for MPP.
□ A little caveat could be unequal Internet connections, but most organizations have solved this and there are replications everywhere…
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
Suppose you need to analyse the number of times the phrase « Happy New Year » appears in a Google search at midnight on December 31st in each timezone.
Let’s say we’re concentrating on France only and that the nodes containing this data are Nodes 1, 2, 3 (at their addresses).
Copyright 2014 Denis Rothman
□ Now we run a <key,value> pair with the mapping functions. The key here is « Happy New Year » and the value will be the number of times it appears.
□ In Node 1: <Happy New Year, 1000000>, Node 2: <Happy New Year, 4000000>, Node 3: <Happy New Year, 2000000>
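A minimal Python sketch of what each node’s mapper emits. The per-node counts are the made-up figures from the slide; the record format is a placeholder:

```python
def map_phrase(records, phrase="Happy New Year"):
    """Each node runs this over its local block and emits one <key, value> pair."""
    count = sum(1 for record in records if phrase in record)
    return (phrase, count)

# Hypothetical per-node outputs matching the slide:
node_outputs = [("Happy New Year", 1_000_000),   # Node 1
                ("Happy New Year", 4_000_000),   # Node 2
                ("Happy New Year", 2_000_000)]   # Node 3
```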
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
Big Data – Apache Hadoop
□ Let’s get a look and feel of Hadoop
command line functions, among
others.
□ https://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
Copyright 2014 Denis Rothman
Big Data – Map Reduce
□ In Node 1: <Happy New Year, 1000000>, Node 2: <Happy New Year, 4000000>, Node 3: <Happy New Year, 2000000>. The data is sent to a reduce node to run the REDUCE function, which will give the following output:
<Happy New Year, 1000000, 4000000, 2000000>, to be summed up, for example, to <Happy New Year, 7000000>.
Mapping and reducing are thus 2 simple but powerful functions.
If various keys are sent, they are SORTED through a shuffling process.
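A self-contained sketch of that shuffle-and-reduce step in Python (the sum is just one possible reduce function, as the slide says):

```python
from collections import defaultdict

# The three mapper outputs from the slides above
node_outputs = [("Happy New Year", 1_000_000),
                ("Happy New Year", 4_000_000),
                ("Happy New Year", 2_000_000)]

def shuffle(pairs):
    """SHUFFLE: sort the pairs and group all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return grouped

def reduce_sum(grouped):
    """REDUCE: here, simply sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

print(reduce_sum(shuffle(node_outputs)))   # {'Happy New Year': 7000000}
```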
Copyright 2014 Denis Rothman
Big Data – Map Reduce
□ The Mapper functions and Reduce functions are TASKS, and together they form a JOB.
□ MapReduce’s framework has a JOB TRACKER that schedules the tasks.
□ The JOB TRACKER will reroute tasks if a node fails; it organizes the activities.
□ Just like HDFS has a name node, MapReduce has a special node assigned to the JOB TRACKER.
Copyright 2014 Denis Rothman
Big Data – Map Reduce
□ Now the programmer provides MapReduce with a list of file blocks and the map and reduce tasks.
□ The output is a set of keys and values.
□ All of this can be done in a tremendous MPP run.
□ By 2015, it’s estimated that 50% of all data will be processed with Hadoop…
Copyright 2014 Denis Rothman
Big Data – High level software
□ Now the programmer provides MapReduce with a list of file blocks and the map and reduce tasks.
□ The output is a set of keys and values.
□ All of this can be done in a tremendous MPP run.
□ By 2015, it’s estimated that 50% of all data will be processed with Hadoop-type technology…
Copyright 2014 Denis Rothman
Getting Started with Hadoop
MapReduce
Now let’s get Hadoop
MapReduce into the equation
Learn more: http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html
http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Pre-requisites
Let’s get a look and feel of MapReduce functions :
http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Example%3A+WordCount+v1.0
Just bear in mind that you’re looking at developing <key,value> sets, both mapping them and reducing them.
Copyright 2014 Denis Rothman
MapReduce
More look and feel approaches :
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Mapper.html
Copyright 2014 Denis Rothman
Apache Hadoop MapReduce
Architecture
□ Let’s take five here and see what we’ve got so far. Ok, we have Hadoop and MapReduce.
□ Let’s see how this fits together and
how we can access data at a higher
level.
□ We’re going to take a look at how
Google explains this…
Copyright 2014 Denis Rothman
Apache Hadoop MapReduce
Architecture
Google explains the concept with a physical-retrieval analogy :
1. Standard software query : 1 person
2. MapReduce : several persons
Let’s work on this physical file system.
Learn more : https://cloud.google.com/developers/articles/apache-hadoop-hive-and-pig-on-google-compute-engine#appendix-b
Copyright 2014 Denis Rothman
Getting Started with PIG
All the tools are there,
just use them !
You’re going to have to choose a
platform or just rent one as
explained further in the
document.
Copyright 2014 Denis Rothman
PIG
Let’s have some fun
with high level
programming !
« Pig is a high-level platform for
creating MapReduce programs used
with Hadoop. »
Learn more : http://en.wikipedia.org/wiki/Pig_(programming_tool)
What does a pig do ? It « grunts ».
You can use Grunt to run Pig, you can use Pig to run Python code, you can use Pig for the MapReduce framework.
Just stop thinking in « categories », be creative and have fun !
Copyright 2014 Denis Rothman
PIG
Learn more : http://en.wikipedia.org/wiki/Pig_(programming_tool)
http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions
Let’s have a look at
some of the PIG
functions to get the feel
of it.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-pig-udf.html
Copyright 2014 Denis Rothman
What if I don’t want to use Pig ?
There are a lot of languages you can use that
integrate the Hadoop & MapReduce framework !
Java : http://www.javacodegeeks.com/2013/08/writing-a-hadoop-mapreduce-task-in-java.html
PHP : http://stackoverflow.com/questions/10978975/need-a-map-reduce-function-in-mongo-using-php
C++ : http://cxwangyi.blogspot.fr/2010/01/writing-hadoop-programs-using-c.html
Python : https://developers.google.com/appengine/docs/python/dataprocessing/helloworld
Copyright 2014 Denis Rothman
Big Data or Standard Databases ?
□ File Systems or
databases ?
□ So now what ?
SQL solutions ?
No SQL solutions ?
□ Both ?
Let’s take a few minutes and find some examples in which one philosophy or another is best for a company.
SQL ?
No SQL ?
Copyright 2014 Denis Rothman
Big Data – NOSQL
Learn more : http://en.wikipedia.org/wiki/NoSQL
□ First let’s get rid of a simple and old concept : SQL.
□ When you want to explore exabytes of data, SQL is useless.
□ « The term NoSQL (Not Only SQL) was used in 1998 to name a lightweight, open source database that did not expose the standard SQL interface. Strozzi suggests that, as the current NoSQL movement "departs from the relational model altogether", it should therefore have been called more appropriately 'NoREL'. »
□ In some cases the volume of data and its nature (documents, texts) can’t be accessed through SQL.
Copyright 2014 Denis Rothman
Big Data – NOSQL
□ « Some notable implementations of NoSQL
are Facebook's Cassandra database,
Google's BigTable and Amazon's SimpleDB
and Dynamo. »
□ Let’s approach NOSQL with one of its core concepts. In an RDBMS (relational database management system), several users can’t modify exactly the same record at the same time. The system is based on read-write-relational functions.
Copyright 2014 Denis Rothman
Big Data – NOSQL
In an RDBMS, the last user that writes to exactly the same record will override previous writes. Of course you can append a record per user, but then you have multiple records for the same data index.
So generally you lock the record while it’s in use, or use a LIFO (Last In, First Out) approach.
Copyright 2014 Denis Rothman
Big Data – NOSQL
Learn more : http://www.techopedia.com/definition/27689/nosql-database
The fundamental difference in NOSQL is that the relations don’t matter anymore, so unique keys don’t matter either.
You’re not worried about read and write rules, relations, inner joins, size constraints, time constraints.
Copyright 2014 Denis Rothman
Big Data – NOSQL
Learn more : http://en.wikipedia.org/wiki/NoSQL
With NOSQL you can
scatter your data
everywhere, on
various servers at
the same time
and write multiple
records with
multiple
simultaneous
users with millions
of same type
entries !
Copyright 2014 Denis Rothman
Big Data – SQL, Data Warehouse
and perspective
Let’s make the NOSQL concepts clear :
- Hive is a language that is SQL-related and used with Big Data.
- Pig is a NoSQL language.
- You can use both in a project !
http://gcn.com/blogs/reality-check/2014/01/hadoop-vs-data-warehousing.aspx
A traditional data warehouse feeds data into a relational database. What about a Hadoop data warehouse ? Why not ?
Perspective : Stop thinking of a data flow from a client
to server, start thinking about a universe of scattered
data ! Think from the point of view of the crowd not
the individual. Stop thinking about a single solution,
just use everything you can to reach your goal !
Copyright 2014 Denis Rothman
MongoDB
Learn more : http://www.mongodb.org/
Whereas Apache Hadoop is based on HDFS, MongoDB is a NOSQL document database.
- Document-Oriented Storage with JSON-style documents
- Index support
- Querying
- Map/Reduce
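A minimal sketch of that document model using Python’s pymongo driver. The server address, the database and collection names, and the sample document are all placeholders:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder server
db = client["bank"]                                # hypothetical database

# Document-oriented storage: insert a JSON-style document, no fixed schema
db.accounts.insert_one({"holder": "A. Person", "balance": 1500, "bank": "BNP"})

# Index support and querying
db.accounts.create_index("balance")
for doc in db.accounts.find({"balance": {"$gt": 1000}}):
    print(doc)
```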
Copyright 2014 Denis Rothman
MongoDB
http://docs.mongodb.org/manual/core/map-reduce/
Let’s get the feel of MongoDB and MapReduce functions.
And again, stop thinking in categories : « Oh, I’m into relational databases and this is a non-relational database. What do I have to choose ? »
You don’t have to choose !
At one point Facebook, and this might still be true, gathered data in MySQL, sent it out to Hadoop and then retrieved it with MapReduce : mapping it, shuffling it, reducing it and making sense of it back…in MySQL for its users !!!
Copyright 2014 Denis Rothman
Purchasing and managing your « Hadoop-MapReduce-MongoDB-Pig » Architecture
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
□ First you need to set up or choose a type of physical Cloud architecture.
□ You need to make a financial and technical decision.
□ If your company is not big enough to build its own cluster, then you need to choose among cloud offers.
Copyright 2014 Denis Rothman
Getting Started with Hadoop
Learn more : http://www.ovh.com/fr/serveurs_dedies/big-data/
Copyright 2014 Denis Rothman
Getting Started with Hadoop
□ Just a concept to bear in mind, but you don’t have to do it on your own, as explained previously. Cloud services provide this.
□ "You have 10 machines connected in a LAN and I need to create a Name Node on one system and Data Nodes on the remaining 9 machines.
□ For example you have machines (1..10), where machine1 is the Server and machines (2..10) are slaves [Data Nodes]; so do I need to install Hadoop on all 10 machines ?
□ You need Hadoop installed on every node, and each node should have the services started as appropriate for its role. Also the configuration files, present on each node, have to coherently describe the topology of the cluster, including location/name/port for various commonly used resources (e.g. namenode). Doing this manually, from scratch, is error prone, especially if you never did this before and you don't know exactly what you're trying to do. It would also be good to decide on a specific distribution of Hadoop (HortonWorks, Cloudera, HDInsight, Intel, etc.) »
Copyright 2014 Denis Rothman
Getting Started with Hadoop
Do you have an Amazon account ?
What do you know about what’s beyond your account ?
Does Amazon have Big Data Technology ?
How far does Amazon go in this field ?
Let’s see…
Copyright 2014 Denis Rothman
Getting Started with Hadoop
Learn more : http://aws.amazon.com/big-data/
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html
Copyright 2014 Denis Rothman
Getting Started with your Big Data Architecture
Let’s have a look at a real Big Data account and resource management interface.
http://aws.amazon.com/s3/pricing/
https://console.aws.amazon.com/console/home?region=eu-west-1#
https://console.aws.amazon.com/elasticmapreduce/vnext/home?region=eu-west-1#getting-started:
Copyright 2014 Denis Rothman
Big Data – Ebay
□ eBay has a nice way of summing it up before we get down to analyzing.
http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/#.UxncJbV5Gx4
Copyright 2014 Denis Rothman
Analyst
The analysts are here.
Let’s find out what they do and what you could do in the future !
Copyright 2014 Denis Rothman
Big Data – Analyst
First you need to forget about
consumption(sales, marketing) and all the
clichés you hear around you.
Why ? Because the first step is to set highly
creative goals, then to map, reduce and
transform them into useful data. Useful
data can be for medical research, police
departments, astronomy and many other
areas.
Copyright 2014 Denis Rothman
Big Data – Analyst
At Planilog, we created a powerful Advanced
Planning System that deals with the 3 Vs
(Volume, Velocity and Variety). Our APS
can optimize any field of data.
Without going into the details of our APS program, the following slides are going to provide you with tools to begin analyzing.
Of course, you can analyze anything any way you want. This is just a guideline that helped us solve hundreds of problems.
Copyright 2014 Denis Rothman
Big Data – Analyst
Planilog’s first conceptual approach starts with Cognitive Science and Linguistics.
Human activity can be broken down into two great categories :
passive and active.
Copyright 2014 Denis Rothman
Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Let’s take some passive activities using
just one or two senses. You can easily
guess the others after.
Eyes :
• Watching (movies, events, any other)
• Reading
• Listening to music
Copyright 2014 Denis Rothman
Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Let’s take some active activities using
some senses. You can easily guess
the others after.
- Writing documents, chats, mails
• Talking over the phone
• Combining video and sound : Skype
Copyright 2014 Denis Rothman
Big Data – Analyst
Now that you have an idea of active and passive activities, let’s see what they can apply to and what we can get out of them :
Thought process -> analyzing how someone thinks (« sentiment analysis »)
Feeling -> sentiment analysis
Body -> movement analysis (GPS, for example).
Copyright 2014 Denis Rothman
Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Finally there are only two ways to measure passive/active activities applied to the thinking-feeling-body process.
It just boils down to this :
qualitative properties and quantitative analysis.
Once we know what we’re analysing and how much, we can pretty much make a model of the whole universe !
We could sum it up with brackets :
<property or key, quantity>, or if we simplify :
<key, value>
Sounds familiar ?
See the power ? See why you need to analyse what you’re going to do before you analyse the data ?
Copyright 2014 Denis Rothman
Analyst
The Hadoop tools available don’t need to be isolated in terms of concepts but simply must be interoperable.
You can Sqoop data from a relational database and collect event data with Flume. You have HDFS (a distributed file system) that you can access in a non-relational way with Pig, or even use in a data warehouse with Hive. With MapReduce you can run parallel computation. If you need more resources, you can use Whirr to deploy more clusters and ZooKeeper to configure, manage and coordinate all of this !
So there is no relational/non-relational opposition; there is no « standard » approach. There is simply a goal to attain with the best means possible.
Copyright 2014 Denis Rothman
Big Data – Privacy
Everything you touch is
stored, replicated,
mapped, reduced and
processed.
Just focus on the legal
aspect not on ethics.
Do you think all of this is
legal ?
What’s legal ? In which
country ? Where ? How ?
Can this be prevented ?
Copyright 2014 Denis Rothman
Big Data – On which side of the
Cloud are you ?
Now let’s forget about
the legal aspect.
How do you feel about
Clouds and Big Data ?
Do you feel threatened
?
Do you think it’s the
end of your freedom ?
Copyright 2014 Denis Rothman
Big Data – On which side of the
Cloud are you ?
Now if you feel it’s progress with some drawbacks, you’re ready to be a Big Data analyst !
Do you agree with this or not ?
Progress
Copyright 2014 Denis Rothman
Analyst : for those who agreed on
using Big Data ! ☺, the others can
leave ☺
Let’s sum it up before we begin analyzing real
projects and cases.
Conceptually, if you use an active/passive matrix applied to the thought-feeling-physical body, you can understand a great number of models.
With Pig -> MapReduce -> Hadoop, and maybe MongoDB added or not, you’re going to map, reduce and transform DATA into useful INFORMATION for decision-making processes. You’re exploring time and space.
Copyright 2014 Denis Rothman
Analyst : Can you imagine the data
to be mapped and retrieved in
various fields ?
You need to
think
differently.
Forget
everything
you
learned and
be open to
new, very
new ideas.
Let’s hit the
road now !
Copyright 2014 Denis Rothman
Oh, you think this is
theory for the future ?
□ Ok, well, you can stop laughing. Let’s have a look at sentiment analysis tools :
http://blog.mashape.com/post/48757031167/list-of-20-sentiment-analysis-apis
How do you feel about that ? Remember, if you Tweet about this page, it will be analyzed, so be careful of what you’re thinking and writing !
https://www.mashape.com/
How is your mind shaped ?
How many applications are there out there ?
Think as a global data analyst and not an individual. Express your
thoughts.
Sentiment Analysis
□ https://www.mashape.com/aylien/text-analysis#!endpoint-Sentiment-Analysis
With this technology nothing is a secret anymore.
You can « hack », sorry, « analyse » ☺ anything in the world with unlimited technology, space and processing power.
Let’s do a sentiment analysis right now…
Copyright 2014 Denis Rothman
Let’s carry out a little experiment
What would you think about sentiment analysis if you were Tweeting your impression ? Let’s analyze the audience :
Key <positive, value>
Key <negative, value>
More difficult. Explain why.
Key <objective, value>
Key <subjective, value>
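A toy tally of such an audience analysis in Python. The tweet labels here are invented for illustration; in practice they would come from one of the sentiment analysis APIs listed earlier:

```python
from collections import Counter

# Hypothetical sentiment labels returned by an analysis API, one per tweet
labels = ["positive", "negative", "positive", "subjective",
          "objective", "positive", "negative", "subjective"]

# Build the <key, value> pairs the slide describes
for key, value in Counter(labels).items():
    print(f"<{key}, {value}>")
```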
Copyright 2014 Denis Rothman
Big Data – Life saving
Try to find some ideas to save lives when there is a fire, or to protect women from violence, or any other idea that comes to your mind.
Think social networks, think drones, swarms of robots; think from the point of view of the swarm command, like in SC2, but to help people, not for the comfort of an individual.
Copyright 2014 Denis Rothman
Big Data – Life saving
https://www.cmu.edu/silicon-valley/news-events/news/2011/stamberger-interviewed.html
Big Data – Insurance
How can you optimize the price of the
premiums in real time worldwide with
Hadoop Mapreduce ?
Start with a major disaster and see how
you’re going to pay and forecast future
disasters.
Hadoop can be used for predictive functions.
Copyright 2014 Denis Rothman
Big Data – Insurance also needs
human resources.
How can you optimize
part time jobs in a
huge quantitative
environment in which
you have 100 000
employees to manage
?
http://www.optimaldecisionsllc.com/Welcome.html
Big Data – Amazon
Think of the passive-active matrix
and the related activities (thought,
feeling, physical) and tell me how
you would use Big Data.
How could you find a way to get
sentiment analysis out of the
reader ?
Copyright 2014 Denis Rothman
Big Data – Amazon-Kindle
What <key, value> pairs would you be looking for ?
Copyright 2014 Denis Rothman
Big Data – Twitter
Try to find a great many positive applications for Twitter.
We’ve seen the APIs.
Do you have ideas ?
Life Saving
Science and research
Other ?
We’re not going to talk about
the negative ones. You need
to think of how to go forward,
not slow down !
Copyright 2014 Denis Rothman
Big Data – Design Facebook
□ Describe the data
that Facebook
collects.
□ How can it legally
access 50% more
data that it didn’t
gather in the first
place….?
Copyright 2014 Denis Rothman
Big Data – Design Facebook
□ WhatsApp ! 450 million new users ! How would you Map, Shuffle and Reduce this data to fit into your Facebook strategy ?
Advertising is a cliché. What else can you do ?
Do you know what Stealth Marketing is ?
http://en.wikipedia.org/wiki/Undercover_marketing
How can you analyze and detect it automatically if you were a government consumer protection agency ? Why wouldn’t governments map, shuffle and reduce illegal behaviours ?
Big Data – Design Sony Smartband
http://www.expansys.fr/sony-smartband-swr10-with-2-black-wristbands-sl-257855/?utm_source=google&utm_medium=shopping&utm_campaign=base&mkwid=svHjLhmZB&kword=adwords_productfeed&gclid=CI36mJSLgb0CFWjpwgod1wMANA
Smartwatches
□ Samsung has one too that measures
your heartbeat.
□ http://venturebeat.com/2013/09/01/this-is-samsungs-galaxy-gear-smartwatch-a-blocky-health-tracker-with-a-camera/
They want you to think about what you can do with the watch
while they’re thinking of what to do with the global data they’re
gathering as well. What could you analyse with Big Data
tools ?
Copyright 2014 Denis Rothman
□ http://online.wsj.com/news/articles/SB10001424127887324178904578340071261396666
Big Data is taking over !
Forget about the technical aspect.
Just bear in mind that « huge » doesn’t mean impossible anymore. There is simply no limit to the data that can be processed with Big Data !
Copyright 2014 Denis Rothman
HR
□ http://online.wsj.com/news/articles/SB10001424127887324178904578340071261396666
http://blog.mashape.com/post/48757031167/list-of-20-sentiment-analysis-apis
How can you use Big Data to recruit somebody ? Would you automatically analyse personal data on social networks if you could ? With a sentiment analysis program, for example ?
Copyright 2014 Denis Rothman
Trucks : Tracking and Sensors
□ http://online.wsj.com/news/articles/SB10001424127887324178904578340071261396666
Sensors, robots, drones. Surveillance to optimize !
Give some of your ideas…
Copyright 2014 Denis Rothman
Big Data – Design Sony Smartband
Now you can pick up the pulse rate and all activity. How
can you relate this to all the other data on a group of
people and not just yourself ? Think of a concert and
sentiment analysis, for example.
http://www.sonymobile.com/us/products/accessories/smartwatch/#tabs
https://play.google.com/store/apps/details?id=com.sonyericsson.extras.smartwatch&hl=fr
Copyright 2014 Denis Rothman
Governments and government
agencies
What can a government collect that corporations can’t ?
Can governments reach the level of
private corporations to protect
you ?
How can it be done ?
With what budget ?
Governments and government
agencies
Phone companies gather data.
Google, Microsoft and others gather
data
In fact everybody gathers data !
So could you gather as much data as
the government, in the end ?
Why or why not ?
Big Data – Can you trust yourself
to drive your car ?
□ Google, like all the others, has you focus on your individual needs.
□ In the meantime, your personal data
has gone global.
□ Think global and tell me how you
would analyze the data.
In this case we’re dealing with Big Data streaming, like in online gaming. So when you’re parsing the data, NoSQL or SQL is not the issue; getting the right information straight away is the vital goal !
Copyright 2014 Denis Rothman
Big Data – Can you trust a human
to drive your car ?
A Google Car gathers on average about a gigabyte per second, which could add up to over 80 TB a day (86 400 seconds × 1 GB ≈ 86 TB). And you ?
□ http://www.isn.ethz.ch/Digital-Library/Articles/Detail/?lng=en&id=173004
Google explains how it collects all types of data.
http://googlepolicyeurope.blogspot.fr/2010/04/data-collected-by-google-cars.html
Where is all the data going : Big Data ?
http://www.hostdime.com/blog/google-self-driving-car-news/
Now take a step back and imagine all
of the data gathered and accessed by
a single group of analysts…
…And now go out, imagine and conquer the world of Big Data !
Big Data – A New Data Paradigm : no limits
You can ask your questions now or
contact me at
Denis.Rothman76@gmail.com
Copyright 2014 Denis Rothman

Weitere ähnliche Inhalte

Was ist angesagt?

BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
Mahantesh Angadi
 
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
Simplilearn
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Skillspeed
 

Was ist angesagt? (20)

Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
 
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop1012014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
 
Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2
 
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
 
10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Hadoop Online training by Keylabs
Hadoop Online training by KeylabsHadoop Online training by Keylabs
Hadoop Online training by Keylabs
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big Data
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi...
 
First Step for Big Data with Apache Hadoop
First Step for Big Data with Apache HadoopFirst Step for Big Data with Apache Hadoop
First Step for Big Data with Apache Hadoop
 
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopBig Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop Workshop
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013
 
Hadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopHadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of Hadoop
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 

Ähnlich wie Big data-denis-rothman

Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
yhadoop
 

Ähnlich wie Big data-denis-rothman (20)

Big data PPT
Big data PPT Big data PPT
Big data PPT
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant Resource
 
Apache Big Data Europa- How to make money with your own data
Apache Big Data Europa- How to make money with your own dataApache Big Data Europa- How to make money with your own data
Apache Big Data Europa- How to make money with your own data
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
Big Data with IOT approach and trends with case study
Big Data with IOT approach and trends with case studyBig Data with IOT approach and trends with case study
Big Data with IOT approach and trends with case study
 
00 hadoop welcome_transcript
00 hadoop welcome_transcript00 hadoop welcome_transcript
00 hadoop welcome_transcript
 

Kürzlich hochgeladen

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 

Kürzlich hochgeladen (20)

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Big data-denis-rothman

  • 9. How would you do it now, without Big Data Technology ? Copyright 2014 Denis Rothman
  • 10. Big Data – Do you think you can manage 10PB without Big Data ? If we now try to solve the 3V problem with a 10PB dataset to manage, how could we do it even with Oracle bigfile tablespaces ? A bigfile tablespace contains only one datafile or tempfile, which can contain up to approximately 4 billion (2^32) blocks. The maximum size of the single datafile or tempfile is 128 terabytes (TB) for a tablespace with 32 K blocks and 32 TB for a tablespace with 8 K blocks. Learn more : http://docs.oracle.com/cd/B28359_01/server.111/b28320/limits002.htm#i287915 Copyright 2014 Denis Rothman
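To get a feel for the numbers, here is a quick back-of-the-envelope calculation (a minimal Python sketch using decimal units ; the 128 TB and 32 TB ceilings are the Oracle limits quoted above) :

```python
import math

# Back-of-the-envelope only: how many Oracle bigfile datafiles would a
# 10 PB dataset need at the documented maximum datafile sizes?
PB = 1000 ** 5   # 1 petabyte in bytes (decimal units)
TB = 1000 ** 4   # 1 terabyte in bytes

dataset = 10 * PB
print(math.ceil(dataset / (128 * TB)))  # 79 maximum-size datafiles (32 K blocks)
print(math.ceil(dataset / (32 * TB)))   # 313 maximum-size datafiles (8 K blocks)
```

Even at the 32 K block size, a single 10PB dataset means juggling dozens of maximum-size datafiles, before we even talk about velocity and variety.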
  • 11. Big Data – Volume, Velocity, Variety that is beyond non Big Data solutions We've seen the limits of non Big Data technology. How would you solve the problem ? Even if you already know how Big Data works, do you think it will solve the increasing size and variety of datasets ? How will it help with sensors ? Copyright 2014 Denis Rothman
  • 12. Big Data – Apache Hadoop There are several solutions on the market. Let's use Apache Hadoop as a way to understand how Big Data storage works to solve the 3V problem. Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop Copyright 2014 Denis Rothman
  • 13. Big Data – Apache Hadoop □ There are many ways to try to understand a subject. This part of the course is designed for you to see that the core ideas of Apache Hadoop are simple ! Copyright 2014 Denis Rothman
  • 14. Big Data – Apache Hadoop □ First of all, what does « Hadoop » mean ? It means nothing ! □ Doug Cutting just named it after his son’s toy elephant. So that’s one mystery solved. Copyright 2014 Denis Rothman
  • 15. Big Data – Apache Hadoop □ The first thing we need to do is understand cluster architectures. □ Cluster architectures are spreading at a wild speed as a framework for the analysis of big data. New exabytes of data appear each…week… Learn more : http://www.ovh.com/fr/serveurs_dedies/big-data/
  • 16. Big Data – Apache Hadoop □ Cluster architectures are the best choice because they offer Cloud-style performance : extensible, flexible and cost-efficient. Copyright 2014 Denis Rothman
  • 17. Big Data – Apache Hadoop □ So what ? So what’s the difference between a traditional enterprise architecture and a cloud-cluster architecture ? Copyright 2014 Denis Rothman
  • 18. Big Data – Apache Hadoop □ A traditional architecture is built on server technology that is expensive and thus has to be used as much as possible. Copyright 2014 Denis Rothman
  • 19. Big Data – Apache Hadoop □ A traditional architecture is also built on storage units of different sizes and types : SSD to SATA. Copyright 2014 Denis Rothman
  • 20. Big Data – Apache Hadoop □ A traditional architecture is finally built on storage area networks (SAN) to connect a set of servers to a set of storage units. Copyright 2014 Denis Rothman
  • 21. Big Data – Apache Hadoop □ The big advantage of traditional architecture is that the servers and storage units can be managed (size, number) separately, with the SAN (Storage Area Network) connecting them. Copyright 2014 Denis Rothman
  • 22. Big Data – Apache Hadoop □ The big drawback of traditional architecture is that it must be extremely reliable and any failure must be dealt with very quickly. □ This brings the price up. Copyright 2014 Denis Rothman
  • 23. Big Data – Apache Hadoop □ Traditional architectures were designed for intensive applications focusing on one part of the data. The servers process the information and then the results are transferred to storage. Copyright 2014 Denis Rothman
  • 24. Big Data – Apache Hadoop □ So in essence a traditional architecture is designed for a specific need (intense computing, a standard data warehouse). Fine. □ How would you now solve a problem involving a tremendous weekly increase in data (PB) ? Not knowing what you’re looking for in advance : sorting by order, by timestamp or retrieving certain values. Copyright 2014 Denis Rothman
  • 25. Big Data – Apache Hadoop □ Even a few years ago, Google was facing an increase of 20PB of data…per day. □ For a special operation, let’s say user mail history (number and size of mails over a five year period), we need to parse the entire dataset, not just a subset. Copyright 2014 Denis Rothman
  • 26. Big Data – Apache Hadoop □ Why sort that data ? □ To make searching, merging and analyzing easier. □ So how can you sort n x 20PB of data? □ With cluster architecture ! Copyright 2014 Denis Rothman
  • 27. Big Data – Apache Hadoop Let’s now study 3 basic benchmarks of cluster computing : - PennySort - MinuteSort - GraySort Copyright 2014 Denis Rothman
  • 28. Big Data – Apache Hadoop □ Sorting being a major function of Big Data, it’s important to have benchmark references. Learn more : http://sortbenchmark.org/ GraySort Metric: Sort rate (TBs / minute) achieved while sorting a very large amount of data (currently 100 TB minimum). PennySort Metric: Amount of data that can be sorted for a penny's worth of system time. Originally defined in AlphaSort paper. MinuteSort Metric: Amount of data that can be sorted in 60.00 seconds or less. Originally defined in AlphaSort paper. Copyright 2014 Denis Rothman
  • 29. Big Data – Apache Hadoop Learn more : http://sortbenchmark.org/ GraySort : 2013, 1.42 TB/min with Hadoop, 102.5 TB in 4,328 seconds, on 2100 nodes x (2 x 2.3 GHz hexcore Xeon E5-2630, 64 GB memory, 12 x 3 TB disks). Thomas Graves, Yahoo! Inc. Copyright 2014 Denis Rothman
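A quick sanity check of the quoted sort rate (a minimal Python sketch, nothing more) :

```python
# Rough verification of the 2013 GraySort figure quoted above.
sorted_tb = 102.5      # terabytes sorted
seconds = 4328         # elapsed time in seconds

rate = sorted_tb / (seconds / 60)
print(round(rate, 2))  # ~1.42 TB/min, matching the published record
```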
  • 30. Big Data – Apache Hadoop PennySort : 2011, 286 GB with psort, on a 2.7 GHz AMD Sempron, 4 GB RAM, 5 x 320 GB 7200 RPM Samsung SpinPoint F4 HD332GJ, Linux. Paolo Bertasi, Federica Bogo, Marco Bressan and Enoch Peserico, Univ. Padova, Italy Copyright 2014 Denis Rothman
  • 31. Big Data – Apache Hadoop MinuteSort : 2012, 1,401 GB with Flat Datacenter Storage, on 256 heterogeneous nodes, 1033 disks. Johnson Apacible, Rich Draves, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, Ed Nightingale, Reuben Olinksy, Yutaka Suzue, Microsoft Research Copyright 2014 Denis Rothman
  • 32. Big Data – Apache Hadoop □ Getting down to a cluster. A cluster breaks down to its basic component : a NODE. A node is made up of cores, memory and disks, and nodes can be assembled in the thousands, the tens of thousands, the hundreds of thousands. Copyright 2014 Denis Rothman
  • 33. Big Data – Apache Hadoop □ The NODES are then grouped in RACKS □ The RACKS are then grouped into CLUSTERS □ The CLUSTERS are then connected to a network, with a Cisco switch for example Copyright 2014 Denis Rothman
  • 34. Big Data – Apache Hadoop □ The first property of a cluster is to be MODULAR and SCALABLE (it handles a growing number of elements) □ This means it’s cheap to just add more and more nodes at the best price, and the hardware doesn’t need to be that reliable, as we will see further on. Copyright 2014 Denis Rothman
  • 35. Big Data – Apache Hadoop □ The second property of a cluster is DATA LOCALITY. This means you’re not going through a sequence but directly to the physical location. No more bottlenecks... □ This leads to PARALLELIZATION, which means you access several locations simultaneously. Learn more : http://en.wikipedia.org/wiki/Locality_of_reference Copyright 2014 Denis Rothman
  • 36. Big Data – Apache Hadoop □ With data locality and parallelization MASSIVE PARALLEL PROCESSING becomes a reality. □ The main function, sorting, can now be done within each node on a subset of data. □ Please bear in mind that these nodes are cheaper than traditional architectures. Copyright 2014 Denis Rothman
  • 37. Big Data – Apache Hadoop □ This is just an example that goes back to 2011 but makes the point. A typical SSD drive system would process data at about $1.2 a gigabyte at 30K IOPS, and a SATA drive at about $0.05 but only at 250 IOPS (input/output operations per second). Let’s take a simple cluster… Copyright 2014 Denis Rothman
  • 38. Big Data – Apache Hadoop □ In a simple cluster, 30 000 IOPS are delivered in parallel with around 120 nodes (around 250 IOPS each) working at the same time, BUT for the IOPS price of SATA. □ We’re talking about cheaper and more expendable equipment. Copyright 2014 Denis Rothman
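The arithmetic behind this slide, as a small Python sketch (the prices and IOPS figures are the 2011 ballpark numbers quoted above, not current ones) :

```python
# Sketch of the cost trade-off described above (2011 ballpark figures).
sata_iops_per_node = 250
ssd_system_iops = 30_000

print(ssd_system_iops / sata_iops_per_node)  # 120 commodity SATA nodes
                                             # match one 30K-IOPS SSD system
print(1.2 / 0.05)  # but SSD storage costs ~24x more per gigabyte
```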
  • 39. Big Data – Map Reduce □ This means that in a cluster architecture, failures will be more frequent with cheaper equipment. Copyright 2014 Denis Rothman
  • 40. □ Failures with cheaper equipment ? Who cares ? Don’t get ripped off purchasing expensive, highly reliable hardware ; buy expendable material to be cost-efficient. We just need a way to detect failures and respond quickly to deal with this complexity. We’ll need to replicate the data up to three times in three different data locations. Let’s see how to solve these problems with Apache Hadoop. Big Data – Apache Hadoop Copyright 2014 Denis Rothman
  • 41. Hadoop is about clusters built with commodity hardware, not high quality hardware : • widely available • interchangeable • plug and play • breaks down more often Big Data – Apache Hadoop Copyright 2014 Denis Rothman
  • 42. □ Before we go on, what’s the purpose of all this ? WHY ? It all started with Google, which had to index pages every day and quickly reached huge amounts of data. Hadoop reaches back into the Google File System (GFS) and Google MapReduce. In the early days, Yahoo! and Apache got involved in the process. Around 2004, Google started publishing all this… Big Data – Apache Hadoop Copyright 2014 Denis Rothman
  • 43. □ Let’s take Facebook. You know all the information that’s in there for you. But with over 1 000 000 000 users plus the 450 000 000 WhatsApp users, we’re talking about a massive chunk of the world population increasing the size of Facebook every day. We’re talking increasing data in exabytes in this case. How are you going to run a search over that one dataset spread over hundreds of thousands of nodes ? With Apache Hadoop ! Big Data – Apache Hadoop Copyright 2014 Denis Rothman
  • 44. Big Data – Apache Hadoop Apache Hadoop was designed for DISTRIBUTED DATA OVER THE CLUSTERS Apache Hadoop was designed with the concept of DATA LOCALITY Hadoop Distributed File System (HDFS) Hadoop Map Reduce Copyright 2014 Denis Rothman
  • 45. □ HDFS has 3 main functions : split, scatter and replicate. Big Data – Apache Hadoop 1. SPLITTING : in Hadoop each FILE BLOCK has the SAME size (64 MB for example) in a STORAGE BLOCK 2. SCATTERING : these FILE BLOCKS are generally on different datanodes 3. REPLICATION : there are multiple copies of these blocks in different locations. Copyright 2014 Denis Rothman
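To make the three functions concrete, here is a minimal sketch of the bookkeeping (illustrative Python only, not real HDFS code ; the 10 TB file is a made-up example) :

```python
import math

# A file is SPLIT into fixed-size blocks, SCATTERED across datanodes,
# and REPLICATED (classically 3 copies).
BLOCK_SIZE = 64 * 1024 * 1024    # 64 MB, the classic HDFS block size
REPLICATION = 3

file_size = 10 * 1024 ** 4       # a hypothetical 10 TB file
blocks = math.ceil(file_size / BLOCK_SIZE)
print(blocks)                    # 163,840 logical blocks
print(blocks * REPLICATION)      # 491,520 physical block copies cluster-wide
```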
  • 46. Big Data – Apache Hadoop Architecture Learn more : http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Blocks One main node ; generally 3 copies in the replication process, so nodes can fail ! Copyright 2014 Denis Rothman
  • 47. □ The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. □ Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives (addresses). Big Data – Apache Hadoop Learn more : http://wiki.apache.org/hadoop/NameNode Works fine for failures on commodity equipment ! Copyright 2014 Denis Rothman
  • 48. □ So what happens when the NameNode fails ? □ Hadoop has copies of the data, and as long as the same IP address is reassigned, a new NameNode will be designated and that’s it ! Big Data – Apache Hadoop Learn more : http://wiki.apache.org/hadoop/NameNode Copyright 2014 Denis Rothman
  • 49. Once HDFS is set up, MAP REDUCE is there to retrieve information in a simple way. First a MAPPER is used, then the information is REDUCED. Let’s see how this happens. Big Data – Apache Hadoop Copyright 2014 Denis Rothman
  • 50. The MAPPER function relies on the fact that the data is EVENLY DISTRIBUTED. This means that Massive Parallel Processing is possible. The MAPPER uses the LOCALITY (hence « MAP ») features of HADOOP to optimize its search. Big Data – Apache Hadoop Copyright 2014 Denis Rothman
  • 51. □ If the file blocks were not of equal size, the processing time would be that of the largest file. □ But since in Hadoop the file blocks have the same size, processing is tremendously enhanced for MPP. □ A little caveat could be unequal internet connections, but most organizations have solved this and there are replications everywhere… Big Data – Apache Hadoop Copyright 2014 Denis Rothman
  • 52. Big Data – Apache Hadoop Suppose you need to analyse the number of times the phrase « Happy New Year » appears in Google searches at midnight on December 31st in each timezone. Let’s say we’re concentrating on France only and that the nodes containing this data are Nodes 1, 2, 3 (at their addresses) Copyright 2014 Denis Rothman
  • 53. □ Now we run <key,value> pairs with the mapping functions. The key here is « Happy New Year » and the value will be the number of times it appears. □ In Node 1 : <Happy New Year, 1000000>, Node 2 : <Happy New Year, 4000000>, Node 3 : <Happy New Year, 2000000> Big Data – Apache Hadoop Copyright 2014 Denis Rothman
  • 54. Big Data – Apache Hadoop □ Let’s get a look and feel of Hadoop command line functions, among others. □ https://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html Copyright 2014 Denis Rothman
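As a quick look and feel, a few of the FsShell commands documented at that link can be driven from a script ; this is only a sketch, assuming a configured Hadoop client on the PATH, and the paths are hypothetical :

```python
import subprocess

def hdfs(*args):
    """Run a 'hadoop fs' shell command and return its output."""
    return subprocess.run(["hadoop", "fs", *args],
                          capture_output=True, text=True).stdout

print(hdfs("-ls", "/user/demo"))                # list an HDFS directory
hdfs("-put", "accounts.txt", "/user/demo")      # copy a local file into HDFS
print(hdfs("-cat", "/user/demo/accounts.txt"))  # read it back
```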
  • 55. Big Data – Map Reduce □ In Node 1 : <Happy New Year, 1000000>, Node 2 : <Happy New Year, 4000000>, Node 3 : <Happy New Year, 2000000>, the data is sent to a reduce node to run the REDUCE function, which will give the following output : <Happy New Year, 1000000, 4000000, 2000000>, to be summed up for example to <Happy New Year, 7000000>. Mapping and reducing are thus 2 simple but powerful functions. If various keys are sent, they are SORTED through a shuffling process. Copyright 2014 Denis Rothman
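The whole map -> shuffle -> reduce flow of this example can be simulated in a few lines (a conceptual Python sketch only ; a real job runs distributed on the cluster, not in one process) :

```python
from collections import defaultdict

# What each node's MAPPER emitted for the example above.
node_outputs = [
    [("Happy New Year", 1000000)],   # Node 1
    [("Happy New Year", 4000000)],   # Node 2
    [("Happy New Year", 2000000)],   # Node 3
]

# SHUFFLE : group all values by key across nodes.
shuffled = defaultdict(list)
for output in node_outputs:
    for key, value in output:
        shuffled[key].append(value)

# REDUCE : sum each key's values.
for key, values in shuffled.items():
    print(key, sum(values))          # Happy New Year 7000000
```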
  • 56. Big Data – Map Reduce □ The Mapper functions and Reduce functions are TASKS, and together they form a JOB. □ MapReduce’s framework has a JOB TRACKER that schedules the tasks. □ A JOB TRACKER will reroute tasks if a node fails ; it organizes the activities. □ Just like HDFS has a name node, Map Reduce has a special node assigned to the JOB TRACKER. Copyright 2014 Denis Rothman
  • 57. Big Data – Map Reduce □ Now the programmer will provide MapReduce with a list of file blocks, the map and reduce jobs. □ The output is a set of keys and values. □ All of this can be done in a tremendous MPP run. □ By 2015, it’s estimated that 50% of all data will be processed with Hadoop-type technology… Copyright 2014 Denis Rothman
  • 59. Getting Started with Hadoop MapReduce Now let’s get Hadoop MapReduce into the equation Learn more : http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Pre-requisites Let’s get a look and feel of MapReduce functions : http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Example%3A+WordCount+v1.0 Just bear in mind that you’re looking at developing <key,value> sets, both mapping them and reducing them. Copyright 2014 Denis Rothman
  • 60. MapReduce More look and feel approaches : http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Mapper.html Copyright 2014 Denis Rothman
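The official WordCount v1.0 example linked above is written in Java ; as a sketch, the same <key,value> logic can be expressed in Python and run through Hadoop Streaming (assuming the streaming jar is available on your cluster) :

```python
#!/usr/bin/env python
# wordcount.py - a sketch of WordCount as two Hadoop Streaming steps in one
# file. Streaming scripts read stdin and write "key<TAB>value" lines.
import sys

def do_map():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")              # emit a <word, 1> pair per word

def do_reduce():
    current, total = None, 0
    for line in sys.stdin:                   # reducer input arrives sorted by key
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")     # flush the finished key
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    do_map() if sys.argv[1:] == ["map"] else do_reduce()
```

You would then launch it with something like hadoop jar hadoop-streaming.jar -input … -output … -mapper "wordcount.py map" -reducer "wordcount.py reduce" (the exact jar name and options depend on your distribution).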
  • 61. Apache Hadoop MapReduce Architecture □ Let’s take five here and see what we’ve got so far. OK, we have Hadoop and MapReduce. □ Let’s see how this fits together and how we can access data at a higher level. □ We’re going to take a look at how Google explains this… Copyright 2014 Denis Rothman
  • 62. Apache Hadoop MapReduce Architecture Google explains this concept with a physical retrieval analogy : 1. Standard software query : 1 person 2. MapReduce : several persons Let’s work on this physical file system Learn more : https://cloud.google.com/developers/articles/apache-hadoop-hive-and-pig-on-google-compute-engine#appendix-b Copyright 2014 Denis Rothman
  • 63. Getting Started with PIG All the tools are there, just use them ! You’re going to have to choose a platform or just rent one as explained further in the document. Copyright 2014 Denis Rothman
  • 64. PIG Let’s have some fun with high level programming ! « Pig is a high-level platform for creating MapReduce programs used with Hadoop. » Learn more : http://en.wikipedia.org/wiki/Pig_(programming_tool) What does a pig do ? It « grunts ». You can use Grunt to run Pig, you can use Pig to run Python code, you can use Pig for the MapReduce framework. Just stop thinking « categories », be creative and have fun ! Copyright 2014 Denis Rothman
  • 65. PIG Learn more : http://en.wikipedia.org/wiki/Pig_(programming_tool) http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions Let’s have a look at some of the PIG functions to get the feel of it. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-pig-udf.html Copyright 2014 Denis Rothman
  • 66. What if I don’t want to use Pig ? There are a lot of languages you can use that integrate the Hadoop & MapReduce framework ! Java : http://www.javacodegeeks.com/2013/08/writing-a-hadoop-mapreduce-task-in-java.html PHP : http://stackoverflow.com/questions/10978975/need-a-map-reduce-function-in-mongo-using-php C++ : http://cxwangyi.blogspot.fr/2010/01/writing-hadoop-programs-using-c.html Python : https://developers.google.com/appengine/docs/python/dataprocessing/helloworld Copyright 2014 Denis Rothman
  • 67. Big Data or Standard Databases ? □ File systems or databases ? □ So now what ? SQL solutions ? NoSQL solutions ? □ Both ? Let’s take a few minutes and find some examples in which one philosophy or another is best for a company. SQL ? NoSQL ? Copyright 2014 Denis Rothman
  • 68. Big Data – NOSQL Learn more : http://en.wikipedia.org/wiki/NoSQL □ First let’s get rid of a simple and old concept : SQL □ When you want to explore exabytes of data, SQL is useless. □ « The term NoSQL (Not Only SQL) was used in 1998 to name a lightweight, open source database that did not expose the standard SQL interface. Strozzi suggests that, as the current NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL'". » □ In some cases the volume of data and its nature (documents, texts) can’t be accessed through SQL Copyright 2014 Denis Rothman
  • 69. Big Data – NOSQL □ « Some notable implementations of NoSQL are Facebook's Cassandra database, Google's BigTable and Amazon's SimpleDB and Dynamo. » □ Let’s approach NOSQL with one of its core concepts. In an RDBMS (relational database management system), several users can’t modify exactly the same record at the same time. The system is based on read-write-relational functions. Copyright 2014 Denis Rothman
  • 70. Big Data – NOSQL In an RDBMS, the last user that writes to exactly the same record will override previous records. Of course you can append a record per user, but then you have multiple records for the same data index. So generally you lock the record while it’s in use, or use a LIFO (Last In First Out). Copyright 2014 Denis Rothman
  • 71. Big Data – NOSQL Learn more : http://www.techopedia.com/definition/27689/nosql-database The fundamental difference in NOSQL is that the relations don’t matter anymore, so unique keys don’t matter either. You’re not worried about read and write rules, relations, inner joins, size constraints, time constraints. Copyright 2014 Denis Rothman
  • 72. Big Data – NOSQL Learn more : http://en.wikipedia.org/wiki/NoSQL With NOSQL you can scatter your data everywhere, on various servers at the same time and write multiple records with multiple simultaneous users with millions of same type entries ! Copyright 2014 Denis Rothman
  • 73. Big Data – SQL, Data Warehouse and perspective Let’s make NOSQL concepts clear : - Hive is a language that is SQL-related and used with Big Data - Pig is a NoSQL language - You can use both in a project ! http://gcn.com/blogs/reality-check/2014/01/hadoop-vs-data-warehousing.aspx A traditional data warehouse feeds data into a relational database. What about a Hadoop data warehouse ? Why not ? Perspective : stop thinking of a data flow from a client to a server, start thinking about a universe of scattered data ! Think from the point of view of the crowd, not the individual. Stop thinking about a single solution, just use everything you can to reach your goal ! Copyright 2014 Denis Rothman
  • 74. MongoDB Learn more : http://www.mongodb.org/ Whereas Apache Hadoop is based on HDFS, MongoDB is a NOSQL document database. - Document-oriented storage with JSON-style documents - Index support - Querying - Map/Reduce Copyright 2014 Denis Rothman
  • 75. MongoDB http://docs.mongodb.org/manual/core/map-reduce/ Let’s get the feel of MongoDB and MapReduce functions. So now, stop thinking : « Oh, I’m into relational databases and this is a non-relational database. What do I have to choose ? » You don’t have to choose ! At one point Facebook, and this might still be true, gathered data in MySQL, sent it out to Hadoop and then retrieved it with MapReduce : mapping it, shuffling it, reducing it and making sense of it back… in MySQL for its users !!! Copyright 2014 Denis Rothman
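As a look and feel in Python : MongoDB’s native mapReduce (linked above) takes JavaScript functions, but the same group-and-sum idea can be sketched with pymongo and the aggregation pipeline (the server address, collection name and fields below are hypothetical) :

```python
from pymongo import MongoClient

# A minimal sketch: documents look like {"phrase": ..., "count": ...}.
client = MongoClient("mongodb://localhost:27017")
searches = client.demo.searches

pipeline = [
    {"$group": {"_id": "$phrase",                   # the "map" key
                "total": {"$sum": "$count"}}},      # the "reduce" step
    {"$sort": {"total": -1}},
]
for doc in searches.aggregate(pipeline):
    print(doc["_id"], doc["total"])
```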
  • 76. Purchasing and managing your « Hadoop-MapReduce-MongoDB, PIG » architecture Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop □ First you need to set up or choose a type of physical Cloud architecture. □ You need to make a financial and technical decision. □ If your company is not big enough to build its own cluster, then you need to choose cloud offers. Copyright 2014 Denis Rothman
  • 77. Getting Started with Hadoop Learn more : http://www.ovh.com/fr/serveurs_dedies/big-data/ Copyright 2014 Denis Rothman
  • 78. Getting Started with Hadoop □ Just a concept to bear in mind, but you don’t have to do it on your own as explained previously. Cloud services provide this. □ « You have 10 machines connected in a LAN and I need to create a Name Node on one system and Data Nodes on the remaining 9 machines. For example you have (1..10) machines, where machine 1 is the server and machines (2..10) are slaves [Data Nodes], so do I need to install Hadoop on all 10 machines ? □ You need Hadoop installed on every node and each node should have the services started as appropriate for its role. Also the configuration files, present on each node, have to coherently describe the topology of the cluster, including location/name/port for various commonly used resources (e.g. namenode). Doing this manually, from scratch, is error prone, specially if you never did this before and you don't know exactly what you're trying to do. Also would be good to decide on a specific distribution of Hadoop (HortonWorks, Cloudera, HDInsight, Intel, etc) » Copyright 2014 Denis Rothman
  • 79. Getting Started with Hadoop Do you have an Amazon account ? What do you know about what’s beyond your account ? Does Amazon have Big Data Technology ? How far does Amazon go in this field ? Let’s see… Copyright 2014 Denis Rothman
  • 80. Getting Started with Hadoop Learn more : http://aws.amazon.com/big-data/ http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html Copyright 2014 Denis Rothman
  • 81. Getting Started with your Big Data Architecture Let’s have a look at a real Big Data account and resource management interface. http://aws.amazon.com/s3/pricing/ https://console.aws.amazon.com/console/home?region=eu-west-1# https://console.aws.amazon.com/elasticmapreduce/vnext/home?region=eu-west-1#getting-started: Copyright 2014 Denis Rothman
  • 82. Big Data – Ebay □ EBay has a nice way of summing it up before we get down to analyzing. http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/#.UxncJbV5Gx4 Copyright 2014 Denis Rothman
  • 83. Analyst The analysts are here. Let’s find out what they do and what you could do in the future ! Copyright 2014 Denis Rothman
  • 84. Big Data – Analyst First you need to forget about consumption (sales, marketing) and all the clichés you hear around you. Why ? Because the first step is to set highly creative goals, then to map, reduce and transform them into useful data. Useful data can be for medical research, police departments, astronomy and many other areas. Copyright 2014 Denis Rothman
  • 85. Big Data – Analyst At Planilog, we created a powerful Advanced Planning System that deals with the 3 Vs (Volume, Velocity and Variety). Our APS can optimize any field of data. Without going into the detail of our APS program, the following slides are going to provide you with tools to begin analyzing. Of course, you can analyze anything any way you want. This is just a guideline we used that helped us solve hundreds of problems. Copyright 2014 Denis Rothman
  • 86. Big Data – Analyst Planilog’s first conceptual approach starts with cognitive science and linguistics. Human activity can be broken down into two great categories : passive and active. Copyright 2014 Denis Rothman
  • 87. Analyst Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop Let’s take some passive activities using just one or two senses. You can easily guess the others after. Eyes : • Watching (movies, events, any other) • Reading • Listening to music Copyright 2014 Denis Rothman
  • 88. Analyst Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop Let’s take some active activities using some senses. You can easily guess the others after. • Writing documents, chats, mails • Talking over the phone • Combining video and sound : Skype Copyright 2014 Denis Rothman
  • 89. Big Data – Analyst Now that you have an idea of active and passive activities, let’s see what they can apply to and what we can get out of them : Thought process -> analyzing how someone thinks (« sentiment analysis ») Feeling -> sentiment analysis Body -> movement analysis (GPS, for example). Copyright 2014 Denis Rothman
  • 90. Analyst Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop Finally, there are only two ways to measure passive/active activities applied to the thinking-feeling-body process. It just boils down to this : qualitative properties and quantitative analysis. Once we know what we’re analysing and how much, we can pretty much make a model of the whole universe ! We could sum it up with brackets : <property or key, quantity> or, if we simplify : <key, value>. Sounds familiar ? See the power ? See why you need to analyse what you’re going to do before you analyse the data. Copyright 2014 Denis Rothman
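A toy illustration of this <key, value> view of activities (all categories and figures are invented for the example) :

```python
from collections import Counter

# Observations as <key, value> pairs: a qualitative property (the activity)
# and a quantity (here, made-up hours observed).
observations = [
    ("watching", 42), ("reading", 17), ("writing", 8),
    ("watching", 13), ("talking", 25),
]

activity_totals = Counter()
for key, value in observations:
    activity_totals[key] += value       # quantify each qualitative property

print(activity_totals.most_common())    # e.g. [('watching', 55), ...]
```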
  • 91. Analyst The Hadoop tools available don’t need to be isolated in terms of concepts but simply must be interoperable. You can Sqoop data from a relational database and collect event data with Flume. You have HDFS (distributed file system) that you can access in a non-relational way with Pig, or even use in a Data Warehouse with Hive. With MapReduce you can run parallel computation. If you need more resources, you can use Whirr to deploy more clusters and ZooKeeper to configure, manage and coordinate all of this ! So there is no relational / non-relational opposition, there is no « standard » approach. There is simply a goal to attain with the best means possible. Copyright 2014 Denis Rothman
  • 92. Big Data – Privacy Everything you touch is stored, replicated, mapped, reduced and processed. Just focus on the legal aspect not on ethics. Do you think all of this is legal ? What’s legal ? In which country ? Where ? How ? Can this be prevented ? Copyright 2014 Denis Rothman
  • 93. Big Data – On which side of the Cloud are you ? Now let’s forget about the legal aspect. How do you feel about Clouds and Big Data ? Do you feel threatened ? Do you think it’s the end of your freedom ? Copyright 2014 Denis Rothman
  • 94. Big Data – On which side of the Cloud are you ? Now if you feel it’s progress with some drawbacks, you’re ready to be a Big Data analyst ! Do you agree with this or not ? Progress Copyright 2014 Denis Rothman
  • 95. Analyst : for those who agreed on using Big Data ! ☺, the others can leave ☺ Let’s sum it up before we begin analyzing real projects and cases. Conceptually, if you use an active/passive matrix applied to thought-feeling-physical body, you can understand a great number of models. With Pig -> MapReduce -> Hadoop, and maybe MongoDB added or not, you’re going to map, reduce and transform DATA into useful INFORMATION for decision making processes. You’re exploring time and space. Copyright 2014 Denis Rothman
  • 96. Analyst : Can you imagine the data to be mapped and retrieved in various fields ? You need to think differently. Forget everything you learned and be open to new, very new ideas. Let’s hit the road now ! Copyright 2014 Denis Rothman
  • 97. Oh, you think this is theory for the future ? □ OK, well you can stop laughing. Let’s have a look at sentiment analysis tools : http://blog.mashape.com/post/48757031167/list-of-20-sentiment-analysis-apis How do you feel about that ? Remember, if you Tweet about this page, it will be analyzed, so be careful of what you’re thinking and writing ! https://www.mashape.com/ How is your mind shaped ? How many applications are there out there ? Think as a global data analyst and not an individual. Express your thoughts.
  • 98. Sentiment Analysis □ https://www.mashape.com/aylien/text-analysis#!endpoint-Sentiment-Analysis With this technology nothing is a secret anymore. You can « hack », sorry « analyse » ☺, the world with unlimited technology, space and processing power. Let’s do a sentiment analysis right now… Copyright 2014 Denis Rothman
  • 99. Let’s carry out a little experiment What would you think about sentiment analysis if you were Tweeting your impressions ? Let’s analyze the audience : Key <positive, value> Key <negative, value> More difficult. Explain why. Key <objective, value> Key <subjective, value> Copyright 2014 Denis Rothman
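A sketch of how you might tally the audience (the tweets are invented, and classify() is a hypothetical stand-in for a call to one of the sentiment APIs listed earlier) :

```python
from collections import Counter

def classify(text):
    """Hypothetical stand-in for a real sentiment-analysis API call."""
    return "positive" if "great" in text.lower() else "negative"

tweets = [
    "Sentiment analysis is great for research",
    "I don't like being profiled",
    "Great tool, scary implications",
]

# Build the <key, value> pairs: <positive, count>, <negative, count>.
audience = Counter(classify(t) for t in tweets)
print(audience)   # Counter({'positive': 2, 'negative': 1})
```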
  • 100. Big Data – Life saving Try to find some ideas to save lives when there is a fire, or to protect women from violence, or any other idea that comes to your mind. Think social networks, think drones, swarms of robots ; think from the point of view of the swarm command, like in SC2, but to help people, not for the comfort of an individual. Copyright 2014 Denis Rothman
  • 101. Big Data – Life saving https://www.cmu.edu/silicon-valley/news-events/news/2011/stamberger-interviewed.html
  • 102. Big Data – Insurance How can you optimize the price of the premiums in real time worldwide with Hadoop MapReduce ? Start with a major disaster and see how you’re going to pay and forecast future disasters. Hadoop can be used for predictive functions. Copyright 2014 Denis Rothman
  • 103. Big Data – Insurance also needs human resources. How can you optimize part time jobs in a huge quantitative environment in which you have 100 000 employees to manage ? http://www.optimaldecisionsllc.com/Welcome.html
  • 104. Big Data – Amazon Think of the passive-active matrix and the related activities (thought, feeling, physical) and tell me how you would use Big Data. How could you find a way to get sentiment analysis out of the reader ? Copyright 2014 Denis Rothman
  • 105. Big Data – Amazon-Kindle What <key, value> pairs would you be looking for ? Copyright 2014 Denis Rothman
  • 106. Big Data – Twitter Try to find a great many positive applications for Twitter. We’ve seen the APIs. Do you have ideas ? Life saving Science and research Other ? We’re not going to talk about the negative ones. You need to think of how to go forward, not slow down ! Copyright 2014 Denis Rothman
  • 107. Big Data – Design Facebook □ Describe the data that Facebook collects. □ How can it legally access 50% more data that it didn’t gather in the first place….? Copyright 2014 Denis Rothman
  • 108. Big Data – Design Facebook □ WhatsApp ! 450 million new users ! How would you map, shuffle and reduce this data to fit into your Facebook strategy ? Advertising is a cliché. What else can you do ? Do you know what stealth marketing is ? http://en.wikipedia.org/wiki/Undercover_marketing How can you analyze and detect it automatically if you were a government consumer protection agency ? Why wouldn’t governments map, shuffle, and reduce illegal behaviours ?
  • 109. Big Data – Design Sony Smartband http://www.expansys.fr/sony-smartband-swr10-with-2-black-wristbands-sl-257855/?utm_source=google&utm_medium=shopping&utm_campaign=base&mkwid=svHjLhmZB&kword=adwords_productfeed&gclid=CI36mJSLgb0CFWjpwgod1wMANA
  • 110. Smartwatches □ Samsung has one too that measures your heartbeat. □ http://venturebeat.com/2013/09/01/this-is-samsungs-galaxy-gear-smartwatch-a-blocky-health-tracker-with-a-camera/ They want you to think about what you can do with the watch while they’re thinking of what to do with the global data they’re gathering as well. What could you analyse with Big Data tools ? Copyright 2014 Denis Rothman
  • 111. □ http://online.wsj.com/news/articles/SB10001424127887324178904578340071261396666 Big Data is taking over ! Forget about the technical aspect. Just bear in mind that « huge » doesn’t mean impossible anymore. There is simply no limit to the data that can be processed with Big Data ! Copyright 2014 Denis Rothman
  • 112. HR □ http://online.wsj.com/news/articles/SB10001424127887324178904578340071261396666 http://blog.mashape.com/post/48757031167/list-of-20-sentiment-analysis-apis How can you use Big Data to recruit somebody ? Would you automatically analyse personal data on social networks if you could ? With a sentiment analysis program, for example ? Copyright 2014 Denis Rothman
  • 113. Trucks : Tracking and Sensors □ http://online.wsj.com/news/articles/SB10001424127887324178904578340071261396666 Sensors, robots, drones. Surveillance to optimize ! Give some of your ideas… Copyright 2014 Denis Rothman
  • 114. Big Data – Design Sony Smartband Now you can pick up the pulse rate and all activity. How can you relate this to all the other data on a group of people and not just yourself ? Think of a concert and sentiment analysis, for example. http://www.sonymobile.com/us/products/accessories/smartwatch/#tabs https://play.google.com/store/apps/details?id=com.sonyericsson.extras.smartwatch&hl=fr Copyright 2014 Denis Rothman
  • 115. Governments and government agencies What can a government collect that corporations can’t ? Can governments reach the level of private corporations to protect you ? How can it be done ? With what budget ?
  • 116. Governments and government agencies Phone companies gather data. Google, Microsoft and others gather data In fact everybody gathers data ! So could you gather as much data as the government, in the end ? Why or why not ?
  • 117. Big Data – Can you trust yourself to drive your car ? □ Google, like all the others, has you focus on your individual need. □ In the meantime, your personal data has gone global. □ Think global and tell me how you would analyze the data. In this case we’re dealing with Big Data streaming, like in online gaming. So when you’re parsing the data, NoSQL or SQL is not the issue ; getting the right information straight is the vital goal ! Copyright 2014 Denis Rothman
  • 118. Big Data – Can you trust a human to drive your car ? A Google Car gathers on average about a gigabyte of data per second, which could add up to over 80 TB a day. And you ? □ http://www.isn.ethz.ch/Digital-Library/Articles/Detail/?lng=en&id=173004 Google explains how it collects all types of data : http://googlepolicyeurope.blogspot.fr/2010/04/data-collected-by-google-cars.html Where is all the data going : Big Data ? http://www.hostdime.com/blog/google-self-driving-car-news/
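A quick sanity check of that figure (simple arithmetic in Python ; the 1 GB/s rate is the slide’s own rough average) :

```python
# At roughly 1 GB/s of sensor data, a full day of driving adds up to
# tens of terabytes, consistent with "over 80 TB a day".
GB_PER_SECOND = 1
seconds_per_day = 24 * 60 * 60

tb_per_day = GB_PER_SECOND * seconds_per_day / 1000
print(tb_per_day)   # 86.4 TB per full day
```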
  • 119. Now take a step back and imagine all of the data gathered and accessed by a single group of analysts… …And now go out, imagine and conquer the world of Big Data !
  • 120. Big Data – A New Data Paradigm : no limits You can ask your questions now or contact me at Denis.Rothman76@gmail.com Copyright 2014 Denis Rothman