1. On Which side of the Cloud are you ?
An Introduction to Big Data
Denis Rothman
Copyright 2014 Denis Rothman
2. Big Data - Introduction
□ This course is not meant to make Big
Data experts out of you in a few
hours but is designed to help you
grasp the main concepts.
□ We’ll be discussing Apache Hadoop, MapReduce, MongoDB, Pig and several other names and concepts that will be familiar to you by the end of the course !
3. Big Data - Introduction
□ We’re going to talk about Apache « Hadoop » and « MapReduce » because the following companies use this technology, at least in parent or derived versions : Google, Yahoo!, Facebook, Amazon, IBM, eBay and many more key players on the market.
4. Big Data - Introduction
□ All the figures, software and brands
mentioned in this document are simple
examples. All of this is going to expand and
change through the years !
□ The main goal here is for you to grasp
enough concepts to be able to create Big
Data architectures with today’s but also
tomorrow’s technology and ideas !
□ So focus on the concepts and the way you
can solve problems with Big Data
technology.
5. Big Data – What is big data ?
Learn more : http://en.wikipedia.org/wiki/Big_data
Let’s say that starting at around 10 TB for a dataset (a collection of data) we’re talking Big Data, and starting at one petabyte we really need the technology ! The world has jumped from talking petabytes to exabytes in a year; we’ll probably be talking zettabytes soon.
1 EB = 1,000,000,000,000,000,000 B = 10^18 bytes = 1,000 petabytes = 1 million terabytes = 1 billion gigabytes.
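These decimal-unit conversions can be double-checked with a few lines of Python (just a sketch of the arithmetic above):

```python
# SI (decimal) storage units in bytes, as used on this slide
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18

# 1 EB is a thousand petabytes, a million terabytes, a billion gigabytes
assert EB == 1000 * PB == 10**6 * TB == 10**9 * GB
print(EB // TB)  # 1000000 terabytes in one exabyte
```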
6. Big Data – What is big data ?
For the Universe, the galaxies
are our small representative
volumes, and there are
something like 10^11 to
10^12 stars in our Galaxy
(The Milky Way)
• The number of bits on a terabyte-capacity computer hard disk (1,000 GB) is typically about 10^13.
To compare the amount of data we now store, we have to go down to atom-level quantities in our universe !
7. Big Data – Can you represent the
Volume ?
Learn more : http://www.seagate.com/about/newsroom/press-releases/Terascale-Enterprise-HDD-pr-master/
Tell us how and where you would store a 1 PB dataset for a given company without Big Data technology.
How many average-size 4 TB hard disks would it take simply to store the data ?
High-Capacity— highest capacity HDD (4TB) available in a 3.5-
inch enterprise-class SATA(Serial Advanced Technology Attachment)
HDD enabling scalable, high-capacity storage in 24×7
environments.
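As a back-of-the-envelope sketch of the question above (raw capacity only, ignoring RAID, formatting overhead and replication):

```python
import math

PB = 10**15          # the 1 PB dataset, in bytes (decimal units)
DISK = 4 * 10**12    # one average 4 TB hard disk

print(math.ceil(PB / DISK))  # 250 disks just to hold the raw data
```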
8. Big Data – Can you represent a fast
way to access(Velocity) 1PB of data
with Big Data technology?
Let’s say we’re talking about the data related to all bank accounts of the BNP over the past 5 years that had a balance of more than $1,000 at a given time and that need to be accessed for a financial analysis.
How would you do it now,
without Big Data
Technology ?
9. Big Data – Can you imagine how to access additional documents in a great Variety of data ?
Now we need to retrieve other documents to analyse these BNP accounts : text documents (signed contracts, for example).
How would you do it now,
without Big Data
Technology ?
10. Big Data – Do you think you can manage
10PB without Big Data ?
If we now try to solve the 3 V problem with a 10PB dataset to
manage, how could we do it even with Oracle Big Files ?
A bigfile tablespace contains only
one datafile or tempfile, which
can contain up to approximately
4 billion ( 232 ) blocks. The
maximum size of the single
datafile or tempfile is 128
terabytes (TB) for a tablespace
with 32 K blocks and 32 TB for
a tablespace with 8 K blocks.
Bigfile Tablespaces
Learn more : http://docs.oracle.com/cd/B28359_01/server.111/b28320/limits002.htm#i2879
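The 128 TB and 32 TB limits quoted from the Oracle documentation follow directly from 2^32 blocks times the block size; a quick check (using binary terabytes):

```python
MAX_BLOCKS = 2**32  # blocks per bigfile datafile or tempfile

# maximum datafile size = block count x block size
for block_kb in (32, 8):
    size_tb = MAX_BLOCKS * block_kb * 1024 // 2**40
    print(f"{block_kb}K blocks -> {size_tb} TB")  # 128 TB and 32 TB
```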
11. Big Data – Volume, Velocity, Variety that is
beyond non Big Data solutions
We’ve seen the limits of non-Big Data technology.
How would you solve the
problem ?
Even if you already know
how Big Data works, do
you think it will solve the
increasing size and
variety of datasets ?
How will it help with
sensors ?
12. Big Data – Apache Hadoop
There are several solutions on the market. Let’s use Apache Hadoop as a way to understand how Big Data storage works to solve the 3V problem.
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
13. Big Data – Apache Hadoop
□ There are many ways
to try to understand a
subject. This part of
the course is designed
for you to see that
the core ideas of
Apache Hadoop are
simple !
14. Big Data – Apache Hadoop
□ First of all, what does « Hadoop » mean ? It means nothing !
□ Doug Cutting simply named it after his son’s toy elephant. So that’s one mystery solved.
15. Big Data – Apache Hadoop
□ The first thing
we need to do is
understand
cluster
architectures.
□ Cluster architectures are spreading at a wild speed as a framework for the analysis of big data.
New Exabytes of data appear
each…week…
Learn more : http://www.ovh.com/fr/serveurs_dedies/big-data/
16. Big Data – Apache Hadoop
□ Cluster architectures are the best choice because they have Cloud-grade performance : extensible, flexible and cost-efficient.
17. Big Data – Apache Hadoop
□ So what ? So what’s the difference between a traditional enterprise architecture and a cloud-cluster architecture ?
18. Big Data – Apache Hadoop
□ A traditional
architecture is
built on
server technology
that is expensive
and thus has to be
used as much as
possible.
19. Big Data – Apache Hadoop
□ A traditional
architecture is
also built on
storage capacity
of different sizes
and types : SSD
to SATA.
20. Big Data – Apache Hadoop
□ A traditional
architecture is
finally built on
storage area
networks
(SAN) to connect a set of servers to a set of storage units.
21. Big Data – Apache Hadoop
□ The big quality of traditional
architecture is that the servers and
storage units can be managed (size,
number) separately with SAN
(Storage Area Network) connecting
them.
22. Big Data – Apache Hadoop
□ The big drawback of traditional
architecture is that it must be
extremely reliable and any failure
must be dealt with very quickly.
□ This brings the price up.
23. Big Data – Apache Hadoop
□ Traditional architectures were
designed for intensive applications
focusing on one part of the data. The
servers process the information and
then the results are transferred to
storage.
24. Big Data – Apache Hadoop
□ So in essence a traditional architecture is designed for a specific need (intense computing, a standard data warehouse). Fine.
□ How would you now solve a problem involving a tremendous weekly increase in data (PB), not knowing what you’re looking for in advance : sorting by order, by timestamp or retrieving certain values ?
25. Big Data – Apache Hadoop
□ Even a few years ago, Google was facing an increase in data of 20 PB… per day.
□ For a special operation, let’s say user mail history (number and size of mails over a five-year period), we need to parse the entire dataset, not just a subset.
26. Big Data – Apache Hadoop
□ Why sort that data ?
□ To make searching, merging and
analyzing easier.
□ So how can you sort n x 20PB of
data?
□ With cluster architecture !
27. Big Data – Apache Hadoop
Let’s now study 3 basic benchmarks of cluster computing :
- Pennysort
- Minutesort
- Graysort
28. Big Data – Apache Hadoop
□ Sorting being a major function of Big
Data, it’s important to have
benchmark references.
Learn more : http://sortbenchmark.org/
GraySort
Metric: Sort rate (TBs / minute) achieved while sorting a very large
amount of data (currently 100 TB minimum).
PennySort
Metric: Amount of data that can be sorted for a penny's worth of system
time.
Originally defined in AlphaSort paper.
MinuteSort
Metric: Amount of data that can be sorted in 60.00 seconds or less.
Originally defined in AlphaSort paper.
29. Big Data – Apache Hadoop
Learn more : http://sortbenchmark.org/
2013, 1.42 TB/min
Hadoop
102.5 TB in 4,328 seconds
2100 nodes × (2 × 2.3 GHz hex-core Xeon E5-2630, 64 GB memory, 12 × 3 TB disks)
Thomas Graves
Yahoo! Inc.
Gray
30. Big Data – Apache Hadoop
2011, 286 GB
psort
2.7 Ghz AMD Sempron, 4 GB RAM,
5x320 GB 7200 RPM Samsung SpinPoint F4
HD332GJ, Linux
Paolo Bertasi, Federica Bogo, Marco Bressan
and Enoch Peserico
Univ. Padova, Italy
Penny
31. Big Data – Apache Hadoop
2012, 1,401 GB
Flat Datacenter Storage
256 heterogeneous nodes, 1033 disks
Johnson Apacible, Rich Draves, Jeremy Elson,
Jinliang Fan, Owen Hofmann, Jon Howell, Ed
Nightingale, Reuben Olinksy, Yutaka Suzue
Microsoft Research
Minute
32. Big Data – Apache Hadoop
□ Getting down to a cluster.
A cluster breaks down to its basic component : a NODE.
A node is made up of cores, memory and disks; nodes can be assembled by the thousands, the tens of thousands, the hundreds of thousands.
33. Big Data – Apache Hadoop
□ The NODES
are then
grouped in
RACKS
□ The RACKS
are then
grouped into
CLUSTERS
The CLUSTERS ARE CONNECTED TO A NETWORK WITH A CISCO
SWITCH, for example
34. Big Data – Apache Hadoop
□ The first property of a cluster is to be MODULAR and SCALABLE (it handles a growing number of elements).
□ This means that it’s cheap to just add more and more nodes at the best price, and they don’t need to be that reliable, as we will see further on.
35. Big Data – Apache Hadoop
□ The second property of a cluster is DATA LOCALITY. This means you’re not going through a sequence but directly to the physical location. No more bottlenecks...
□ This leads to PARALLELIZATION, which means you access several locations simultaneously.
Learn more : http://en.wikipedia.org/wiki/Locality_of_reference
36. Big Data – Apache Hadoop
□ With data locality and parallelization
MASSIVE PARALLEL PROCESSING
becomes a reality.
□ The main function, sorting, can now
be done within each node on a subset
of data.
□ Please bear in mind that these nodes
are cheaper than traditional
architectures.
37. Big Data – Apache Hadoop
□ This is just an example that goes back to 2011 but makes the point.
A typical SSD drive system would process data at about $1.20 a gigabyte at 30K IOPS (input/output operations per second), and a SATA system at about $0.05 a gigabyte but only at 250 IOPS.
Let’s take a simple cluster…
38. Big Data – Apache Hadoop
□ In a simple cluster, 30,000 IOPS are delivered in parallel with around 120 nodes (around 250 IOPS each) at the same time, BUT at the IOPS price of SATA.
□ We’re talking about cheaper and
more expendable equipment.
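The node count here is simple division (the figures are the slide’s 2011 ballpark numbers, not current ones):

```python
TARGET_IOPS = 30_000  # what the single SSD system delivers
SATA_IOPS = 250       # per commodity SATA node

print(TARGET_IOPS // SATA_IOPS)  # 120 SATA nodes match the SSD throughput
```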
39. Big Data – Map Reduce
□ This means that in a cluster architecture, failures will be more frequent with cheaper equipment.
40. □ Failures with cheaper equipment ? Who cares ? Don’t get ripped off purchasing expensive, highly reliable hardware; buy expendable material to be cost-efficient.
We just need to find a way to detect and respond to failures quickly to deal with this complexity.
We’ll need to replicate the data up to three times in three different data locations.
Let’s see how to solve these problems with
Apache Hadoop.
Big Data – Apache Hadoop
41. Hadoop is about clusters built with commodity hardware, not high-quality hardware :
• widely available
• interchangeable
• plug and play
• breaks down more often
Big Data – Apache Hadoop
42. □ Before we go on, what’s the purpose of all this ? WHY ?
It all started with Google, which had to index pages every day and quickly reached huge amounts of data. Hadoop reaches back to the Google File System (GFS) and Google MapReduce. In the early days, Yahoo! and Apache got involved in the process.
Around 2004, Google started publishing all this…
Big Data – Apache Hadoop
43. □ Let’s take Facebook. You know all the information that’s in there for you. But with over 1,000,000,000 users plus the 450,000,000 WhatsApp users, we’re talking about a massive chunk of the world population increasing the size of Facebook every day. We’re talking increasing data in exabytes in this case. How are you going to run a search over that one dataset spread over hundreds of thousands of nodes ?
With Apache Hadoop !
Big Data – Apache Hadoop
44. Big Data – Apache Hadoop
Apache Hadoop was designed for DISTRIBUTED DATA OVER THE CLUSTERS
Apache Hadoop was designed with the concept of DATA LOCALITY
Hadoop Distributed File
System (HDFS)
Hadoop Map Reduce
45. □ HDFS has 3 main functions : split,
scatter and replicate.
Big Data – Apache Hadoop
1. SPLITTING : In Hadoop each FILE BLOCK has the SAME size (64 MB, for example) in a STORAGE BLOCK.
2. SCATTERING : These FILE BLOCKS are generally on different DataNodes.
3. REPLICATION : There are multiple copies of these blocks in different locations.
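A toy Python simulation of these three steps, using hypothetical node names (real HDFS does all this server-side; here 64 bytes stands in for the 64 MB block size):

```python
import itertools

BLOCK_SIZE = 64   # toy scale: 64 bytes stands in for HDFS's 64 MB blocks
DATANODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
REPLICATION = 3   # Hadoop's usual default

def split(data: bytes):
    """1. SPLITTING: cut the file into equal-size blocks (last may be partial)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def scatter_and_replicate(blocks):
    """2-3. SCATTERING + REPLICATION: 3 copies of each block, rotating nodes."""
    nodes = itertools.cycle(DATANODES)
    return {i: [next(nodes) for _ in range(REPLICATION)] for i in range(len(blocks))}

blocks = split(b"x" * 200)              # a 200-byte "file"
print(len(blocks))                       # 4 blocks
print(scatter_and_replicate(blocks)[0])  # ['node1', 'node2', 'node3']
```

Losing any single node still leaves two copies of every block, which is why commodity failures are tolerable.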
46. Big Data – Apache Hadoop
Architecture
Learn more : http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Blocks
One main node.
Generally 3 copies in the replication process, so nodes can fail !
47. □ The NameNode is the centerpiece
of an HDFS file system. It keeps
the directory tree of all files in the
file system, and tracks where
across the cluster the file data is
kept. It does not store the data of
these files itself.
□ Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives (addresses).
Big Data – Apache Hadoop
Learn more : http://wiki.apache.org/hadoop/NameNode
Works fine for
failures on
commodity
equipment !
48. □ So what happens when the NameNode fails ?
□ Hadoop has copies of the data, and as long as the same IP address is reassigned, a new NameNode will be designated and that’s it !
Big Data – Apache Hadoop
Learn more : http://wiki.apache.org/hadoop/NameNode
49. Once HDFS is set up, MAP REDUCE is there to retrieve information in a simple way.
First a MAPPER is used, then the information is REDUCED.
Let’s see how this happens.
Big Data – Apache Hadoop
50. The MAPPER function relies on the fact that the data is EVENLY DISTRIBUTED. This means that Massive Parallel Processing is possible.
The MAPPER uses the LOCALITY (hence « MAP ») features of HADOOP to optimize its search.
Big Data – Apache Hadoop
51. □ If the file blocks were not of equal size, the processing time would be driven by the largest file block.
□ But since in Hadoop the file blocks have the same size, processing is tremendously enhanced for MPP.
□ A little caveat could be unequal internet connections, but most organizations have solved this and there are replications everywhere…
Big Data – Apache Hadoop
52. Big Data – Apache Hadoop
Suppose you need to analyse the number of times the phrase « Happy New Year » appears in Google searches at midnight on December 31st in each timezone.
Let’s say we’re concentrating on France only and that the nodes containing this data are Nodes 1, 2, 3 (at their addresses).
53. □ Now we run a <key,value> pair with the mapping functions. The key here is « Happy New Year » and the value will be the number of times it appears.
□ In Node 1 : <Happy New Year, 1000000>, Node 2 : <Happy New Year, 4000000>, Node 3 : <Happy New Year, 2000000>
Big Data – Apache Hadoop
54. Big Data – Apache Hadoop
□ Let’s get a look and feel of Hadoop
command line functions, among
others.
□ https://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
55. Big Data – Map Reduce
□ In Node 1 : <Happy New Year, 1000000>, Node 2 : <Happy New Year, 4000000>, Node 3 : <Happy New Year, 2000000>, the data is sent to a reduce node to run the REDUCE function, which will give the following output :
<Happy New Year, 1000000, 4000000, 2000000>, to be summed up, for example, to <Happy New Year, 7000000>
Mapping and reducing are thus 2 simple but powerful
functions.
If various keys are sent, they are SORTED through a
shuffling process.
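The whole map-shuffle-reduce flow of the last few slides can be sketched in a few lines of Python (the node counts are the slide’s made-up figures):

```python
from collections import defaultdict

# MAP: each node emits a <key, value> pair for its local data
mapped = [("Happy New Year", 1_000_000),   # Node 1
          ("Happy New Year", 4_000_000),   # Node 2
          ("Happy New Year", 2_000_000)]   # Node 3

# SHUFFLE: sort and group the values by key before reducing
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# REDUCE: collapse each key's value list, here by summing
reduced = {key: sum(values) for key, values in groups.items()}
print(reduced)  # {'Happy New Year': 7000000}
```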
56. Big Data – Map Reduce
□ The Mapper functions and Reduce functions
are TASKS and together they form a JOB.
□ Map Reduce’s framework has a JOB
TRACKER that schedules the tasks.
□ A JOB TRACKER will reroute tasks if a node
fails, it organizes the activities.
□ Just like HDFS has a name node, Map
Reduce has a special node assigned to the
JOB TRACKER.
57. Big Data – Map Reduce
□ Now the programmer will provide MapReduce with a list of file blocks and the map and reduce jobs.
□ The output is a set of keys and values.
□ All of this can be done in a tremendous MPP
run.
□ By 2015, it’s estimated that 50% of all data
will be processed with Hadoop…
59. Getting Started with Hadoop
MapReduce
Now let’s get Hadoop
MapReduce into the equation
Learn more: http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html
http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Pre-requisites
Let’s get a look and feel of MapReduce functions :
http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Example%3A+WordCount+v1.0
Just bear in mind that you’re looking at developing <key,value> sets, both mapping them and reducing them.
60. MapReduce
More look and feel approaches :
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Mapper.html
61. Apache Hadoop MapReduce
Architecture
□ Let’s take five here and see what we’ve got so far. OK, we have Hadoop and MapReduce.
□ Let’s see how this fits together and
how we can access data at a higher
level.
□ We’re going to take a look at how
Google explains this…
62. Apache Hadoop MapReduce
Architecture
Google explains this concept with physical retrieval :
1. Standard software query : 1 person
2. MapReduce : several persons
Let’s work on this physical file system.
Learn more : https://cloud.google.com/developers/articles/apache-hadoop-hive-and-pig-on-google-compute-engine#appendix-b
63. Getting Started with PIG
All the tools are there,
just use them !
You’re going to have to choose a
platform or just rent one as
explained further in the
document.
64. PIG
Let’s have some fun
with high level
programming !
« Pig is a high-level platform for
creating MapReduce programs used
with Hadoop. »
Learn more : http://en.wikipedia.org/wiki/Pig_(programming_tool)
What does a pig do ? It « grunts ».
You can use Grunt to run Pig, you can use Pig to run Python code, you can use Pig with the MapReduce framework.
Just stop thinking in « categories », be creative and have fun !
65. PIG
Learn more : http://en.wikipedia.org/wiki/Pig_(programming_tool)
http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions
Let’s have a look at
some of the PIG
functions to get the feel
of it.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-pig-udf.html
66. What if I don’t want to use Pig ?
There are a lot of languages you can use that
integrate the Hadoop & MapReduce framework !
Java : http://www.javacodegeeks.com/2013/08/writing-a-hadoop-mapreduce-task-in-java.html
PHP : http://stackoverflow.com/questions/10978975/need-a-map-reduce-function-in-mongo-using-php
C++ : http://cxwangyi.blogspot.fr/2010/01/writing-hadoop-programs-using-c.html
Python : https://developers.google.com/appengine/docs/python/dataprocessing/helloworld
67. Big Data or Standard Databases ?
□ File Systems or
databases ?
□ So now what ?
SQL solutions ?
No SQL solutions ?
□ Both ?
Let’s take a few minutes and find some examples in which one philosophy or the other is best for a company.
SQL ?
No SQL ?
68. Big Data – NOSQL
Learn more : http://en.wikipedia.org/wiki/NoSQL
□ First let’s get rid of a simple and old concept : SQL.
□ When you want to explore exabytes of data, SQL is useless.
□ « The term NoSQL (Not Only SQL) was used in 1998 to name a lightweight, open source database that did not expose the standard SQL interface. Strozzi suggests that, as the current NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL'". »
□ In some cases the volume of data and its nature (documents, texts) can’t be accessed through SQL.
69. Big Data – NOSQL
□ « Some notable implementations of NoSQL
are Facebook's Cassandra database,
Google's BigTable and Amazon's SimpleDB
and Dynamo. »
□ Let’s approach NOSQL with one of its core concepts. In an RDBMS (relational database management system), several users can’t modify exactly the same record at the same time. The system is based on read-write relational functions.
70. Big Data – NOSQL
In an RDBMS, the last user that writes to exactly the same record will override previous writes. Of course you can append a record per user, but then you have multiple records for the same data index.
So generally you lock the record while it’s in use, or use a LIFO (Last In, First Out) approach.
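The overwrite problem is easy to demonstrate in plain Python (a dict stands in for the shared record, a list for the append-per-user alternative):

```python
# Last-write-wins: two users update the same record; the second overwrites
record = {}
record["account_42"] = {"balance": 100, "user": "alice"}
record["account_42"] = {"balance": 250, "user": "bob"}  # alice's write is lost
print(record["account_42"]["user"])  # bob

# Appending instead keeps every write, but one data index now has many records
log = [("account_42", {"balance": 100, "user": "alice"}),
       ("account_42", {"balance": 250, "user": "bob"})]
print(len(log))  # 2 entries for the same data index
```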
71. Big Data – NOSQL
Learn more : http://www.techopedia.com/definition/27689/nosql-database
The fundamental difference in NOSQL is that the relations don’t matter anymore, so unique keys don’t matter either.
You’re not worried about read and write rules, relations, inner joins, size constraints, time constraints.
72. Big Data – NOSQL
Learn more : http://en.wikipedia.org/wiki/NoSQL
With NOSQL you can
scatter your data
everywhere, on
various servers at
the same time
and write multiple
records with
multiple
simultaneous
users with millions
of same type
entries !
73. Big Data – SQL, Data Warehouse
and perspective
Let’s make NOSQL concepts clear :
- Hive is a language that is SQL-related and used with Big Data
- Pig is a NoSQL language
- You can use both in a project !
http://gcn.com/blogs/reality-check/2014/01/hadoop-vs-data-warehousing.aspx
A traditional data warehouse feeds data into a relational database. What about a Hadoop data warehouse ? Why not ?
Perspective : Stop thinking of a data flow from a client
to server, start thinking about a universe of scattered
data ! Think from the point of view of the crowd not
the individual. Stop thinking about a single solution,
just use everything you can to reach your goal !
74. MongoDB
Learn more : http://www.mongodb.org/
Whereas Apache Hadoop is based on HDFS, MongoDB is a NOSQL document database :
- Document-oriented storage with JSON-style documents
- Index support
- Querying
- Map/Reduce
75. MongoDB
http://docs.mongodb.org/manual/core/map-reduce/
Let’s get the feel of MongoDB and MapReduce functions.
So, keep an open mind. « Oh, I’m into relational databases and this is a non-relational database. What do I have to choose ? » You don’t have to choose !
At one point Facebook, and this might still be true, gathered data in MySQL, sent it out to Hadoop and then retrieved it with MapReduce : mapping it, shuffling it, reducing it and making sense of it back… in MySQL for its users !!!
76. Purchasing and managing your « Hadoop-MapReduce-MongoDB-PIG » Architecture
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
□ First you need to set up or choose a type of physical Cloud architecture.
□ You need to make a financial and technical decision.
□ If your company is not big enough to build its own cluster, then you need to choose among cloud offers.
77. Getting Started with Hadoop
Learn more : http://www.ovh.com/fr/serveurs_dedies/big-data/
78. Getting Started with Hadoop
□ Just a concept to bear in mind but you don’t have to do it on your
own as explained previously. Cloud services provide this.
□ "You have 10 machines connected in LAN and i need to create
Name Node in one system and Data Nodes in remaining 9
machines .
□ For example you have ( 1.. 10 ) machines , where machine1
is Server and from machine(2..9) are slaves[Data Nodes] so
do i need to install Hadoop on all 10 machines ?
□ You need Hadoop installed in every node and each node should
have the services started as for appropriate for its role. Also the
configuration files, present on each node, have to coherently
describe the topology of the cluster, including location/name/port
for various common used resources (eg. namenode). Doing this
manually, from scratch, is error prone, specially if you never did
this before and you don't know exactly what you're trying to do.
Also would be good to decide on a specific distribution of Hadoop
(HortonWorks, Cloudera, HDInsight, Intel, etc) »
79. Getting Started with Hadoop
Do you have an Amazon account ?
What do you know about what’s beyond your account ?
Does Amazon have Big Data Technology ?
How far does Amazon go in this field ?
Let’s see…
80. Getting Started with Hadoop
Learn more : http://aws.amazon.com/big-data/
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html
81. Getting Started with your Big Data Architecture
Let’s have a look at a real Big Data account
and resource management interface.
http://aws.amazon.com/s3/pricing/
https://console.aws.amazon.com/console/home?region=eu-west-1#
https://console.aws.amazon.com/elasticmapreduce/vnext/home?region=eu-west-1#getting-started:
82. Big Data – Ebay
□ EBay has a nice way of summing it up
before we get down to analyzing.
http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/#.UxncJbV5Gx4
83. Analyst
The analysts are here.
Let’s find out what they do and what you could do in the future !
84. Big Data – Analyst
First you need to forget about consumption (sales, marketing) and all the clichés you hear around you.
Why ? Because the first step is to set highly creative goals, then to map, reduce and transform them into useful data. Useful data can be for medical research, police departments, astronomy and many other areas.
85. Big Data – Analyst
At Planilog, we created a powerful Advanced
Planning System that deals with the 3 Vs
(Volume, Velocity and Variety). Our APS
can optimize any field of data.
Without going into the detail of our APS program, the following slides are going to provide you with tools to begin analyzing. Of course, you can analyze anything any way you want. This is just a guideline we used that helped us solve hundreds of problems.
86. Big Data – Analyst
Planilog’s first conceptual approach starts with Cognitive Science and Linguistics.
Human activity can be broken down into two great categories :
passive and active.
87. Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Let’s take some passive activities using just one or two senses. You can easily guess the others after.
Eyes (and ears) :
• Watching (movies, events, any other)
• Reading
• Listening to music
88. Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Let’s take some active activities using some senses. You can easily guess the others after.
• Writing documents, chats, mails
• Talking over the phone
• Combining video and sound : Skype
89. Big Data – Analyst
Now that you have an idea of active and passive activities, let’s see what they can apply to and what we can get out of them :
Thought process -> analyzing how someone thinks (« sentiment analysis »)
Feeling -> sentiment analysis
Body -> movement analysis (GPS, for example).
90. Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Finally, there are only two ways to measure passive/active activities applied to the thinking-feeling-body process.
It just boils down to this :
qualitative properties and quantitative analysis.
Once we know what we’re analysing and how much, we can pretty much make a model of the whole universe !
We could sum it up with brackets :
<property or key, quantity> or, if we simplify :
<key, value>
Sounds familiar ?
See the power ? See why you need to analyse what you’re going to do before you analyse the data ?
91. Analyst
The Hadoop tools available don’t need to be isolated in terms of concepts but simply must be interoperable.
You can Sqoop data from a relational database and collect event data with Flume. You have HDFS (a distributed file system) that you can access in a non-relational way with Pig, or even use in a Data Warehouse with Hive. With MapReduce you can run parallel computation. If you need more resources, you can use Whirr to deploy more clusters and ZooKeeper to configure, manage and coordinate all of this !
So there is no relational/non-relational opposition, there is no « standard » approach. There is simply a goal to attain with the best means possible.
92. Big Data – Privacy
Everything you touch is
stored, replicated,
mapped, reduced and
processed.
Just focus on the legal
aspect not on ethics.
Do you think all of this is
legal ?
What’s legal ? In which
country ? Where ? How ?
Can this be prevented ?
93. Big Data – On which side of the
Cloud are you ?
Now let’s forget about
the legal aspect.
How do you feel about
Clouds and Big Data ?
Do you feel threatened
?
Do you think it’s the
end of your freedom ?
94. Big Data – On which side of the
Cloud are you ?
Now if you feel it’s progress with some drawbacks, you’re ready to be a Big Data analyst !
Do you agree with this or
not ?
Progress
95. Analyst : for those who agreed on
using Big Data ! ☺, the others can
leave ☺
Let’s sum it up before we begin analyzing real projects and cases.
Conceptually, if you use an active/passive matrix applied to thought-feeling-physical body, you can understand a great number of models.
With Pig -> MapReduce -> Hadoop, and maybe MongoDB added or not, you’re going to map, reduce and transform DATA into useful INFORMATION for decision-making processes. You’re exploring time and space.
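The map -> shuffle -> reduce pipeline described above can be sketched in a few lines of plain Python. This is a minimal conceptual sketch, not Hadoop itself : the page-view log lines and the counting task are hypothetical examples.

```python
from collections import defaultdict

# Hypothetical raw data: page-view log lines of the form "country page".
log_lines = ["FR home", "US home", "FR products", "US home", "FR home"]

def map_phase(lines):
    """Map: turn each record into a <key, value> pair."""
    for line in lines:
        country, page = line.split()
        yield (country, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each group into a single result (information)."""
    return {key: sum(values) for key, values in groups.items()}

views_by_country = reduce_phase(shuffle_phase(map_phase(log_lines)))
print(views_by_country)  # {'FR': 3, 'US': 2}
```

On a real cluster, the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network, but the logic is exactly this : DATA in, INFORMATION out.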
96. Analyst : Can you imagine the data
to be mapped and retrieved in
various fields ?
You need to think differently.
Forget everything you learned and be open to new, very new ideas.
Let's hit the road now !
97. Oh, you think this is
theory for the future ?
□ Ok, well you can stop laughing. Let's have a
look at sentiment analysis tools :
http://blog.mashape.com/post/48757031167/list-of-20-sentiment-analysis-apis
How do you feel about that ? Remember, if you Tweet about this
page, it will be analyzed, so be careful of what you're thinking
and writing !
https://www.mashape.com/
How is your mind shaped ?
How many applications are out there ?
Think as a global data analyst and not as an individual. Express your
thoughts.
99. Let’s carry out a little experiment
What would you think about sentiment
analysis if you were tweeting your
impressions ? Let's analyze the
audience :
Key <positive, value>
Key <negative, value>
More difficult. Explain why :
Key <objective, value>
Key <subjective, value>
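The <key, value> pairs above can be produced by a map step over the tweets themselves. Here is a minimal sketch, assuming a tiny hand-made sentiment lexicon ; real systems use large weighted lexicons or trained models, and the tweets and word lists below are hypothetical.

```python
# Hypothetical tiny sentiment lexicon; real systems use large,
# weighted lexicons or trained models.
POSITIVE = {"great", "love", "useful"}
NEGATIVE = {"boring", "hate", "useless"}

def map_tweet(tweet):
    """Map phase: emit one <sentiment, 1> pair per matching word."""
    for word in tweet.lower().split():
        word = word.strip(".,!?")
        if word in POSITIVE:
            yield ("positive", 1)
        elif word in NEGATIVE:
            yield ("negative", 1)

def reduce_counts(tweets):
    """Shuffle + reduce: sum the 1s under each sentiment key."""
    counts = {"positive": 0, "negative": 0}
    for tweet in tweets:
        for key, value in map_tweet(tweet):
            counts[key] += value
    return counts

tweets = ["I love this course, great content !", "So boring, I hate slides"]
print(reduce_counts(tweets))  # {'positive': 2, 'negative': 2}
```

This also shows why the <objective, value> / <subjective, value> pairs are more difficult : a word list can spot sentiment words, but deciding whether a whole tweet states a fact or an opinion needs context, not a lookup.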
100. Big Data – Life saving
Try to find some ideas to save lives when there is a fire, or
to protect women from violence, or any other idea that
comes to your mind.
Think social networks, think drones, think swarms of robots; think from
the point of view of the swarm command, as in SC2, but to help
people, not for the comfort of an individual.
101. Big Data – Life saving
https://www.cmu.edu/silicon-valley/news-events/news/2011/stamberger-interviewed.html
102. Big Data – Insurance
How can you optimize the price of
premiums in real time worldwide with
Hadoop MapReduce ?
Start with a major disaster and see how
you’re going to pay and forecast future
disasters.
Hadoop can be used for predictive functions.
103. Big Data – Insurance also needs
human resources.
How can you optimize
part-time jobs in a
huge quantitative
environment in which
you have 100,000
employees to manage ?
http://www.optimaldecisionsllc.com/Welcome.html
104. Big Data – Amazon
Think of the passive-active matrix
and the related activities (thought,
feeling, physical) and tell me how
you would use Big Data.
How could you find a way to get
sentiment analysis out of the
reader ?
105. Big Data – Amazon-Kindle
What <key, value> pairs would you be looking for ?
106. Big Data – Twitter
Try to find a great many
positive applications for
Twitter.
We've seen the APIs.
Do you have ideas ?
Life Saving
Science and research
Other ?
We’re not going to talk about
the negative ones. You need
to think of how to go forward,
not slow down !
107. Big Data – Design Facebook
□ Describe the data
that Facebook
collects.
□ How can it legally
access 50% more
data that it didn't
gather in the first
place… ?
108. Big Data – Design Facebook
□ WhatsApp ! 450 million new users !
How would you Map, Shuffle and
Reduce this data to fit into your
Facebook strategy ?
Advertising is a cliché.
What else can you do ?
Do you know what Stealth
Marketing is ?
http://en.wikipedia.org/wiki/Undercover_marketing
How could you analyze and detect it
automatically if you were a
government consumer protection
agency ? Why wouldn't
governments map, shuffle and
reduce illegal behaviours ?
109. Big Data – Design Sony Smartband
http://www.expansys.fr/sony-smartband-swr10-with-2-black-wristbands-sl-257855/?utm_source=google&utm_medium=shopping&utm_campaign=base&mkwid=svHjLhmZB&kword=adwords_productfeed&gclid=CI36mJSLgb0CFWjpwgod1wMANA
110. Smartwatches
□ Samsung has one too that measures
your heartbeat.
□ http://venturebeat.com/2013/09/01/this-is-samsungs-galaxy-gear-smartwatch-a-blocky-health-tracker-with-a-camera/
They want you to think about what you can do with the watch
while they’re thinking of what to do with the global data they’re
gathering as well. What could you analyse with Big Data
tools ?
113. Trucks : Tracking and Sensors
□ http://online.wsj.com/news/articles/SB10001424127887324178904578340071261396666
Sensors, robots, drones. Surveillance to optimize !
Give some of your ideas…
114. Big Data – Design Sony Smartband
Now you can pick up the pulse rate and all activity. How
can you relate this to all the other data on a group of
people and not just yourself ? Think of a concert and
sentiment analysis, for example.
http://www.sonymobile.com/us/products/accessories/smartwatch/#tabs
https://play.google.com/store/apps/details?id=com.sonyericsson.extras.smartwatch&hl=fr
115. Governments and government
agencies
What can a government collect
that corporations can't ?
Can governments reach the level of
private corporations to protect
you ?
How can it be done ?
With what budget ?
116. Governments and government
agencies
Phone companies gather data.
Google, Microsoft and others gather data.
In fact, everybody gathers data !
So could you gather as much data as
the government, in the end ?
Why or why not ?
117. Big Data – Can you trust yourself
to drive your car ?
□ Google, like all the others, has you focus
on your individual needs.
□ In the meantime, your personal data
has gone global.
□ Think global and tell me how you
would analyze the data.
In this case we're dealing with streaming Big Data, as in
online gaming. So when you're parsing the data, NoSQL versus SQL
is not the issue; getting the right information immediately is the vital
goal !
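Streaming data like this can be parsed on the fly with a generator pipeline that keeps only the actionable information instead of storing every raw event. This is a minimal Python sketch ; the sensor names, thresholds and sample events are all hypothetical.

```python
# Hypothetical telemetry stream: (sensor, value) events arriving in order,
# like the readings a self-driving car produces continuously.
def telemetry_stream():
    events = [("speed", 50), ("lidar", 0.9), ("speed", 130), ("lidar", 0.2)]
    for event in events:
        yield event

def alerts(stream, speed_limit=110, min_clearance=0.5):
    """Parse the stream on the fly and keep only actionable information."""
    for sensor, value in stream:
        if sensor == "speed" and value > speed_limit:
            yield ("overspeed", value)
        elif sensor == "lidar" and value < min_clearance:
            yield ("obstacle", value)

print(list(alerts(telemetry_stream())))
# [('overspeed', 130), ('obstacle', 0.2)]
```

The point of the design is that nothing is buffered : each event is inspected once and only the alerts survive, which is what matters when a gigabyte per second is flowing past.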
118. Big Data – Can you trust a human
to drive your car ?
A Google Car gathers on average about a
gigabyte per second, which could add up to
over 80 TB a day. And you ?
□ http://www.isn.ethz.ch/Digital-Library/Articles/Detail/?lng=en&id=173004
Google explains how it collects all types of data :
http://googlepolicyeurope.blogspot.fr/2010/04/data-collected-by-google-cars.html
Where is all the data going : Big Data ?
http://www.hostdime.com/blog/google-self-driving-car-news/
119. Now take a step back and imagine all
of the data gathered and accessed by
a single group of analysts…
…And now go out, imagine and conquer the world of Big Data !
120. Big Data – A New Data Paradigm :
no limits
You can ask your questions now or
contact me at
Denis.Rothman76@gmail.com