Codemotion Rome 2015 - Big Data is undoubtedly one of the "hottest" topics in today's technology landscape. To date, about 5 exabytes of data have been produced worldwide. They are a potential source of "intelligence" that recent technologies make it possible to exploit in fields ranging from medicine to sociology to marketing. Through a virtual trip into space, the talk introduces the concepts, techniques and tools needed to start exploiting the potential of Big Data in everyday work.
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
1. Hands On Big Data:
Getting Started With
NoSQL And Hadoop
Mario Cartia
mario@big-data.ninja
4. Big Data Facts
• Google processes about 20 PB (petabytes, 10^15
bytes) of data each day
• About 5 EB (exabytes, 10^18 bytes) of data
in the world, 90% of it generated over the last 2
years
• Wearable computing and IoT…
6. Big Data: 3V Model
• Big Data is not only about volume
– Volume
>= Petabytes, not Gigabytes
– Variety
Structured and unstructured data
– Velocity
Real-time or near real-time
10. Big Data Success Stories
Amazon.com, a pioneer of targeted
advertising, became a big data user when Greg
Linden, one of its software engineers, realized
the potential of book reviewing from the
average results of their in-house review project.
When Amazon compared the results of the
computer sales against the in-house reviews,
the results were much better for the
data-derived material, and revolutionized
e-commerce
11. Big Data Success Stories
Google Flu Trends is a web service
operated by Google. It provides
estimates of influenza activity for more
than 25 countries. By aggregating Google
search queries, it attempts to make
accurate predictions about flu activity
In the 2009 flu pandemic Google Flu
Trends tracked information about flu in
the United States. In February 2010, the
CDC identified influenza cases spiking in
the mid-Atlantic region of the United
States. However, Google’s data of search
queries about flu symptoms was able to
show that same spike two weeks prior to
the CDC report being released
12. Big Data Success Stories
reCAPTCHA is a user-dialogue system originally
developed by Luis von Ahn, Ben Maurer, Colin
McMillen, David Abraham and Manuel Blum at
Carnegie Mellon University's main Pittsburgh
campus, and acquired by Google in September
2009
The reCAPTCHA service supplies subscribing
websites with images of words that optical
character recognition (OCR) software has been
unable to read. The subscribing websites present
these images for humans to decipher as
CAPTCHA words, as part of their normal
validation procedures. They then return the results
to the reCAPTCHA service, which sends the
results to the digitization projects
Secondary data usage
13. Big Data Techniques
• Statistics
• Data Warehouse
• Data Visualization
• Data Mining
• Prediction
• Machine Learning
• Advanced Analytics
• Correlation Analysis
• Business Intelligence
14. The Traditional Approach
ETL: Extract, Transform, Load
• Extracts data from outside sources
• Transforms it to fit operational needs,
which can include quality levels
• Loads it into the end target (database,
operational data store, data mart or data
warehouse)
Does it fit “big data” needs?
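The three ETL steps above can be sketched in a few lines of plain Java. This is only an illustration: the sample rows, the `extract`/`transform`/`load` method names and the in-memory map standing in for a data warehouse are all assumptions, not part of the talk.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class EtlSketch {

    // Extract: pull raw records from an outside source (hypothetical hard-coded CSV rows).
    static List<String> extract() {
        return List.of("alice, 30", "BOB,41", "carol,not-a-number");
    }

    // Transform: normalize names and drop rows that fail a basic quality check.
    static Map<String, Integer> transform(List<String> rawRows) {
        Map<String, Integer> clean = new LinkedHashMap<>();
        for (String row : rawRows) {
            String[] fields = row.split(",");
            if (fields.length != 2) {
                continue; // malformed row
            }
            try {
                String name = fields[0].trim().toLowerCase(Locale.ROOT);
                clean.put(name, Integer.parseInt(fields[1].trim()));
            } catch (NumberFormatException e) {
                // quality level not met: the age is not a number, so the row is skipped
            }
        }
        return clean;
    }

    // Load: write the cleaned rows into the end target (here just a map).
    static void load(Map<String, Integer> rows, Map<String, Integer> target) {
        target.putAll(rows);
    }

    public static void main(String[] args) {
        Map<String, Integer> warehouse = new LinkedHashMap<>();
        load(transform(extract()), warehouse);
        System.out.println(warehouse); // prints {alice=30, bob=41}
    }
}
```

Note that every row passes through a single process and lands in a single target, which is the part that becomes hard once the volume reaches petabytes.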
16. Hadoop Basics
Apache Hadoop is an open-source
software framework for distributed
storage and distributed processing
of Big Data on clusters of
commodity hardware
17. Hadoop Basics
Hadoop was created by Doug
Cutting and Mike Cafarella in 2005.
Cutting, who was working at
Yahoo! at the time, named it after
his son's toy elephant
23. From RDBMS to NoSQL
A NoSQL (often interpreted as Not
Only SQL) database provides a
mechanism for storage and
retrieval of data that is modeled in
means other than the tabular
relations used in relational
databases
24. From RDBMS to NoSQL
Motivations for this approach include
simplicity of design, horizontal scaling
and finer control over availability. The
data structure (e.g. key-value, graph, or
document) differs from the RDBMS,
and therefore some operations are
faster in NoSQL and some in RDBMS
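To make the difference in data structure concrete, here is a toy document-style key-value store in plain Java. It is a sketch only: `DocumentStoreSketch`, its methods and the sample records are hypothetical and do not correspond to any real NoSQL product's API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DocumentStoreSketch {

    // Each key maps to a free-form document; there is no shared table schema.
    private final Map<String, Map<String, Object>> documents = new LinkedHashMap<>();

    void put(String key, Map<String, Object> document) {
        documents.put(key, document);
    }

    Map<String, Object> get(String key) {
        return documents.get(key);
    }

    public static void main(String[] args) {
        DocumentStoreSketch store = new DocumentStoreSketch();
        // Two documents with different fields can live side by side.
        store.put("user:1", Map.of("name", "Ada", "languages", List.of("Java", "Scala")));
        store.put("user:2", Map.of("name", "Linus", "age", 45));

        System.out.println(store.get("user:1").get("languages"));         // prints [Java, Scala]
        System.out.println(store.get("user:2").containsKey("languages")); // prints false
    }
}
```

Lookup by key is a single map access, while a query such as "all users older than 40" would have to scan every document, which is one example of an operation that a relational database with an index handles better.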
34. MapReduce Model
• MapReduce is a programming model, and an
associated implementation, for processing and
generating large data sets with a parallel,
distributed algorithm on a cluster
• The model is inspired by the map and reduce
functions commonly used in functional
programming, although their purpose in the
MapReduce framework is not the same as in their
original forms
36. MapReduce Overview
• Map step: Each worker node applies the map()
function to the local data, and writes the output to
temporary storage. A master node ensures that,
when redundant copies of the input data exist, only one copy is
processed
• Shuffle step: Worker nodes redistribute data based
on the output keys (produced by the map()
function), such that all data belonging to one key is
located on the same worker node
• Reduce step: Worker nodes now process each
group of output data, per key, in parallel
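The three steps above can be simulated in a single JVM. The sketch below is an assumption-laden toy, not Hadoop code: "workers" are just lists, and the shuffle picks a reducer from the key's hash code, which is one simple way to guarantee that all data for a key ends up in the same place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShuffleSketch {

    // Map step: one worker turns its local lines into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> mapStep(List<String> localLines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : localLines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(Map.entry(word, 1));
                }
            }
        }
        return out;
    }

    // Shuffle step: route each pair to a reducer chosen from its key alone,
    // so all pairs sharing a key land on the same reducer.
    static List<Map<String, List<Integer>>> shuffleStep(
            List<List<Map.Entry<String, Integer>>> mapOutputs, int reducers) {
        List<Map<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                int target = Math.floorMod(pair.getKey().hashCode(), reducers);
                partitions.get(target)
                        .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        return partitions;
    }

    // Reduce step: a reducer sums the values of every key it owns.
    static Map<String, Integer> reduceStep(Map<String, List<Integer>> partition) {
        Map<String, Integer> totals = new TreeMap<>();
        partition.forEach((word, ones) ->
                totals.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    public static void main(String[] args) {
        // Two simulated worker nodes, each holding its own local data.
        List<List<String>> workers = List.of(List.of("big data big"), List.of("data data"));
        List<List<Map.Entry<String, Integer>>> mapped = new ArrayList<>();
        for (List<String> local : workers) {
            mapped.add(mapStep(local));
        }
        List<Map<String, List<Integer>>> partitions = shuffleStep(mapped, 2);
        for (int r = 0; r < partitions.size(); r++) {
            System.out.println("reducer " + r + ": " + reduceStep(partitions.get(r)));
        }
    }
}
```

Which reducer owns which word depends on the hash codes, but each word always appears under exactly one reducer, with the total count across both workers.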
38. Map Reduce: A really simple
introduction
Dear <Your Name>,
As you know, we are building the blogging platform
blogger2.com, and I need some statistics. I need to find out,
across all blogs ever written on blogger.com, how many times
one-character words occur (like 'a', 'I'), how many times
two-character words occur (like 'be', 'is'), and so on, up to
how many times ten-character words occur.
I know it's a really big job. So I will assign all 50,000
employees working in our company to work with you on this for
a week. I am going on vacation for a week, and it's really
important that I have this when I return. Good luck.
Regards,
The CEO
(src: http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/)
39. Map Reduce: A really simple
introduction
The next day, you stand with a mic on the dais
before 50,000 people and proclaim: for a week, you will all
be divided into groups:
• The Mappers (tens of thousands of people will be
in this group)
• The Grouper (assume just one person for now)
• The Reducers (around 10 of them)
• The Master (that’s you)
40. Map Reduce: A really simple
introduction
• Each mapper will get a set of 50 blog URLs and a really
big sheet of paper. Each one of you needs to go to
each of those URLs and, for each word in those blogs,
write one line on the paper. The format of that line
should be the number of characters in the word, then
a comma, and then the actual word
• For example, if you find the word “a”, you write “1,a”
on a new line of your paper, since the word “a” has
only 1 character. If you find the word “hello”, you
write “5,hello” on a new line
41. Map Reduce: A really simple
introduction
Each of you takes 4 days. After 4 days, your sheet might
look like this
• “1,a”
• “5,hello”
• “2,if”
• ... and a million more lines
At the end of the 4th day, each one of you will give
your completely filled sheet to the Grouper
42. Map Reduce: A really simple
introduction
• Grouper: I will give you 10 sheets of paper. The first sheet will be marked
1, the second sheet will be marked 2, and so on, up to 10
• You collect the output from the mappers and, for each line on
a mapper’s sheet, if it starts with “1,” you write the word on
sheet 1; if it starts with “2,” you write it on sheet 2
• For example, if the first line of a mapper’s sheet says
“1,a”, you write “a” on sheet 1. If it says “2,if”, you
write “if” on sheet 2. If it says “5,hello”, you write “hello”
on sheet 5
43. Map Reduce: A really simple
introduction
So at the end of your work, the 10 sheets you have might look like
this
• Sheet 1: a, a, a, I, I, i, a, i, i, i … millions more
• Sheet 2: if, of, it, of, of, if, at, im, is, is, of, of … millions more
• Sheet 3: the, the, and, for, met, bet, the, the, and … millions
more
• ..
• Sheet 10: ……
Once you are done, you distribute each sheet to one reducer. For
example, sheet 1 goes to reducer 1, sheet 2 goes to reducer 2, and
so on.
44. Map Reduce: A really simple
introduction
• Reducers: each one of you gets one sheet from the grouper. For each
sheet, you count the number of words written on it and write it
in big bold letters on the back side of the paper.
• For example, if you are reducer 2, you get sheet 2 from the grouper,
which looks like this:
“Sheet 2: if, of, it, of, of, if, at, im,
is, is, of, of …”
• You count the number of words on that sheet. Say the number
of words is 28838380044: you write it on the back side of the
paper in big bold letters and give it to the Master
45. Map Reduce: A really simple
introduction
You essentially did MapReduce. The greatest advantages
of your approach were these:
• The mappers can work independently
• The reducers can work independently
• The grouper can work really fast because he didn’t
have to count any words; all he had to do
was look at the first number and put the word on the
appropriate sheet
The process can be easily applied to other kinds of
problems
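The CEO's task from the story fits in a few lines of single-machine Java, which can help map the roles onto code. This is a sketch under assumptions: the blog texts are hypothetical strings rather than fetched URLs, words are split on non-word characters, and words longer than ten characters are ignored as in the letter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class WordLengthCount {

    // Mapper: for each word, write one "length,word" line.
    static Stream<Map.Entry<Integer, String>> mapper(String blogText) {
        return Arrays.stream(blogText.split("\\W+"))
                .filter(w -> !w.isEmpty() && w.length() <= 10)
                .map(w -> Map.entry(w.length(), w));
    }

    // Grouper and Reducers: groupingBy puts each word on the sheet for its
    // length, and counting() plays the reducer that counts the words per sheet.
    static Map<Integer, Long> countByLength(List<String> blogs) {
        return blogs.stream()
                .flatMap(WordLengthCount::mapper)
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> blogs = List.of("I say hello", "a big hello if it is");
        System.out.println(countByLength(blogs)); // prints {1=2, 2=3, 3=2, 5=2}
    }
}
```

Swapping `blogs.stream()` for `blogs.parallelStream()` would let several threads act as mappers at once, since no mapper depends on another mapper's output.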
46. Map Reduce: formal definition
The Map and Reduce functions of
MapReduce are both defined with respect
to data structured in (key, value) pairs.
Map takes one pair of data with a type in
one data domain, and returns a list of pairs
in a different domain:
• Map(k1, v1) → list(k2, v2)
47. Map Reduce: formal definition
The Map function is applied in parallel to every
pair in the input dataset
This produces a list of pairs for each call
After that, the MapReduce framework collects
all pairs with the same key from all lists and
groups them together, creating one group for
each key
48. Map Reduce: formal definition
The Reduce function is then applied in parallel to
each group, which in turn produces a collection of
values in the same domain:
• Reduce(k2, list (v2)) → list(v3)
Each Reduce call typically produces either one value
v3 or an empty return, though one call is allowed to
return more than one value. The returns of all calls
are collected as the desired result list
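The two signatures can be written down as Java generics, which makes the type domains explicit. The following is a minimal in-memory sketch, not Hadoop's API: `MiniMapReduce` and `run` are hypothetical names, and the `Comparable` bound on `K2` exists only so the result prints in a stable order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

final class MiniMapReduce {

    // map:    Map(k1, v1)          -> list(k2, v2)
    // reduce: Reduce(k2, list(v2)) -> list(v3)
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            Map<K1, V1> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V3>> reduce) {
        // Apply map to every input pair and group the emitted pairs by key.
        Map<K2, List<V2>> groups = new TreeMap<>();
        input.forEach((k1, v1) -> {
            for (Map.Entry<K2, V2> pair : map.apply(k1, v1)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        });
        // Apply reduce once per group and collect the returned lists.
        Map<K2, List<V3>> result = new TreeMap<>();
        groups.forEach((k2, values) -> result.put(k2, reduce.apply(k2, values)));
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of("doc1", "to be or not", "doc2", "to do");
        Map<String, List<Integer>> counts = run(
                docs,
                (String name, String text) -> {
                    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                    for (String word : text.split(" ")) {
                        pairs.add(Map.entry(word, 1));
                    }
                    return pairs;
                },
                (String word, List<Integer> ones) -> List.of(ones.size()));
        System.out.println(counts); // prints {be=[1], do=[1], not=[1], or=[1], to=[2]}
    }
}
```

Here (k1, v1) is (document name, text), (k2, v2) is (word, 1) and v3 is the count, so the input and output domains really do differ, as the definition says.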
49. MapReduce job example
package org.myorg;
import java.io.IOException;
…
public class WordCount {
    // Mapper: emits (word, 1) for every token of each input line
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }
50. MapReduce job example
    // Reducer: sums all the counts collected for one word
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
51. MapReduce job example
    // Driver: args[0] is the input path, args[1] the output path
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        // Summing is associative and commutative, so the reducer can also act as combiner
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
52. Machine Learning
Machine learning is a scientific discipline
that deals with the construction and study
of algorithms that can learn from data.
Such algorithms operate by building a
model based on inputs and using that to
make predictions or decisions, rather
than following only explicitly
programmed instructions
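As a very small illustration of "building a model based on inputs and using that to make predictions", the sketch below fits a straight line to a handful of points with ordinary least squares and then predicts an unseen value. The class name, method names and data points are assumptions made up for this example.

```java
final class LinearModelSketch {

    // Learn slope and intercept from (x, y) examples with ordinary least squares.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double covariance = 0;
        double varianceX = 0;
        for (int i = 0; i < n; i++) {
            covariance += (x[i] - meanX) * (y[i] - meanY);
            varianceX += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = covariance / varianceX;
        return new double[] {slope, meanY - slope * meanX};
    }

    // Use the learned model on an input it has not seen before.
    static double predict(double[] model, double x) {
        return model[0] * x + model[1];
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9}; // generated by y = 2x + 1
        double[] model = fit(x, y);
        System.out.println(predict(model, 5)); // prints 11.0
    }
}
```

Nothing in the program states the rule y = 2x + 1; the slope and intercept are recovered from the examples alone.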
53. Machine Learning
Machine learning can be
considered a subfield of computer
science and statistics. It has strong
ties to artificial intelligence and
optimization, which deliver
methods, theory and application
domains to the field
54. Machine Learning
Example applications include
spam filtering, optical character
recognition (OCR), search engines
and computer vision. Machine
learning is sometimes conflated
with data mining
57. Machine Learning Tools
Apache Mahout is a project of the
Apache Software Foundation to produce
free implementations of distributed or
otherwise scalable machine learning
algorithms focused primarily in the areas
of collaborative filtering, clustering and
classification
59. Data Visualization
Studies show the brain
processes images 60,000x
faster than text. Visualization
is the final step in your big data
analytics workflow: a visual
representation of
the insights gained from
your analysis