Big Data and NoSQL
beyond the new buzz
Definition
Methodology
Architecture
Use cases
Definition
"Big Data technologies are a new generation of technologies and architectures, designed to extract economic value from very large volumes of heterogeneous data by capturing, exploring and/or analyzing them in record time"
4
Contents
•  Overview
•  Use cases
•  Architecture
•  Case study
•  Conclusion
•  References
•  Appendices
5
Methodology
Big Data / Methodology
Putting a Big Data approach in place always involves three steps:
•  Collect and store the data: the infrastructure part.
•  Analyze, correlate and aggregate the data: the analytics part.
•  Exploit and display the Big Data analysis: how do we put the data and the analyses to work?
Architecture
Big Data architecture
[Diagram: COLLECT THE DATA (audio, video, images; docs, text, XML; web logs, clicks, social, graphs, RSS; sensors, spatial, GPS; others) → STORAGE & ORGANIZATION (column-oriented NoSQL database, distributed file system, SQL database) → EXTRACTION (MapReduce, SQL) → ANALYZE & VISUALIZE (analytics, business intelligence) → BUSINESS]
Big Data technology
[Diagram: the same pipeline with concrete technologies: column stores (HBase, Big Table, Cassandra, Voldemort, …), distributed file systems (HDFS, GFS, S3, …) and relational databases (Oracle, DB2, MySQL, …) queried via SQL]
Open-source Big Data architecture for the enterprise
Big Data full solution
Use cases
BigQuery use case
Hadoop use cases
Ò  Facebook uses Hadoop to store copies of internal log and
dimension data sources and as a source for reporting/analytics
and machine learning. There are two clusters, a 1100-machine
cluster with 8800 cores and about 12 PB raw storage and a 300-
machine cluster with 2400 cores and about 3 PB raw storage.
Ò  Yahoo! deploys more than 100,000 CPUs in > 40,000
computers running Hadoop. The biggest cluster has 4500 nodes
(2*4cpu boxes w 4*1TB disk & 16GB RAM). This is used to
support research for Ad Systems and Web Search and to do
scaling tests to support development of Hadoop on larger
clusters
Ò  eBay uses a 532 nodes cluster (8 * 532 cores, 5.3PB),
Java MapReduce, Pig, Hive and HBase
Ò  Twitter uses Hadoop to store and process tweets, log files, and
other data generated across Twitter. They use Cloudera’s CDH2
distribution of Hadoop. They use both Scala and Java to access
Hadoop’s MapReduce APIs as well as Pig, Avro, Hive, and
Cassandra.
Gartner talk
"By 2015, 4.4 million IT jobs will be created worldwide to support Big Data, including 1.9 million in the United States," said Peter Sondergaard, senior vice president and global head of research at Gartner.
Wanted
"Sophisticated Statistical Analysis"
$100,000 to $500,000
HBase
Overview
Architecture & API
Use cases
Case study
Security
Hadoop ecosystem
Overview
“HBase is the Hadoop database. Think of it as
a distributed scalable Big Data store”
http://hbase.apache.org/
19
Overview
“Project's goal is the hosting of very large
tables – billions of rows X millions of columns
– atop clusters of commodity hardware”
http://hbase.apache.org/
20
Overview
“HBase is an open-source, distributed,
versioned, column-oriented store modeled
after Google's BigTable”
http://hbase.apache.org/
21
The Google trilogy
• The Google File System
http://research.google.com/archive/gfs.html
• MapReduce: Simplified Data Processing on Large Clusters
http://research.google.com/archive/mapreduce.html
• Bigtable: A Distributed Storage System for Structured Data
http://research.google.com/archive/bigtable.html
22
Operating systems / Platform
23
Installation / startup / shutdown
$ mkdir bigdata
$ cd bigdata
$ wget http://apache.claz.org/hbase/hbase-0.92.1/hbase-0.92.1.tar.gz
…
$ tar xvfz hbase-0.92.1.tar.gz
…
$ export HBASE_HOME=`pwd`/hbase-0.92.1
$ $HBASE_HOME/bin/start-hbase.sh
…
$ $HBASE_HOME/bin/stop-hbase.sh
…
24
HBase Shell / example session
$ $HBASE_HOME/bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported
commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.1, r1298924, Fri Mar 9 16:58:34 UTC 2012
hbase(main):001:0> list
TABLE
0 row(s) in 0.5510 seconds
hbase(main):002:0> create 'mytable', 'cf'
0 row(s) in 1.1730 seconds
hbase(main):003:0> list
TABLE
mytable
1 row(s) in 0.0080 seconds
hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello
HBase'
0 row(s) in 0.2070 seconds
hbase(main):005:0> get 'mytable', 'first'
COLUMN CELL
cf:message timestamp=1323483954406, value=hello HBase
1 row(s) in 0.0250 seconds
25
HBase Shell / Commands
•  General
   status, version
•  DDL
   alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, is_disabled, is_enabled, list, show_filters
•  DML
   count, delete, deleteall, get, get_counter, incr, put, scan, truncate
•  Tools
   assign, balance_switch, balancer, close_region, compact, flush, hlog_roll, major_compact, move, split, unassign, zk_dump
•  Replication
   add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication, stop_replication
•  Security
   grant, revoke, user_permission
26
Architecture & API
Logical architecture
28
Table Design
29
Example table
30
Ò Coordonnées des données
[rowkey, column family, column qualifier, timestamp] → value
[fr.wikipedia.org/wiki/NoSQL, links, count, 1234567890] → 24
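The four-part coordinate above can be sketched as a plain Java map of maps: a toy, in-memory model of HBase's logical layout, not the HBase client API. The class name and sample values are illustrative only.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of HBase's logical data layout:
// [rowkey, column family, column qualifier, timestamp] -> value.
public class CellCoordinates {
    // (row/family:qualifier) -> versions, newest timestamp first
    private final TreeMap<String, NavigableMap<Long, String>> cells = new TreeMap<>();

    private static String key(String row, String family, String qualifier) {
        return row + "/" + family + ":" + qualifier;
    }

    public void put(String row, String family, String qualifier,
                    long timestamp, String value) {
        cells.computeIfAbsent(key(row, family, qualifier),
                k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
             .put(timestamp, value);
    }

    // A get without an explicit timestamp resolves to the newest version.
    public String get(String row, String family, String qualifier) {
        NavigableMap<Long, String> versions = cells.get(key(row, family, qualifier));
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        CellCoordinates table = new CellCoordinates();
        table.put("fr.wikipedia.org/wiki/NoSQL", "links", "count", 1234567890L, "23");
        table.put("fr.wikipedia.org/wiki/NoSQL", "links", "count", 1234567999L, "24");
        // The newest version wins:
        System.out.println(table.get("fr.wikipedia.org/wiki/NoSQL", "links", "count"));
    }
}
```

This mirrors why a cell is uniquely addressed only once the timestamp is included; without it, HBase returns the most recent version.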
HBase Java API
• HBaseAdmin, HTableDescriptor, HColumnDescriptor
HTableDescriptor desc = new HTableDescriptor("TableName");
HColumnDescriptor cf = new HColumnDescriptor("Family".getBytes());
desc.addFamily(cf);
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
admin.createTable(desc);
31
HBase Java API
Ò HTablePool & HTableInterface
HTablePool pool = new HTablePool();
HTableInterface table = pool.getTable("TableName");
32
HBase Java API
Ò Put
byte[] cellValue = …
Put p = new Put("RowKey".getBytes());
p.add("Family".getBytes(),
"Qualifier".getBytes(),
cellValue);
table.put(p);
33
Write path
34
HBase Java API
Ò Get
Get g = new Get("RowKey".getBytes());
g.addColumn("Family".getBytes(), "Qualifier".getBytes());
Result r = table.get(g);
Ò Result
byte[] cellValue = r.getValue("Family".getBytes(),
"Qualifier".getBytes());
35
HBase Java API
Ò Scan
Scan scan = new Scan();
scan.setStartRow("StartRow".getBytes());
scan.setStopRow("StopRow".getBytes());
scan.addColumn("Family".getBytes(),
"Qualifier".getBytes());
ResultScanner rs = table.getScanner(scan);
for (Result r : rs) {
// …
}
36
Read path
37
Contents
•  Overview
•  Use cases
•  Architecture
•  Case study
•  Conclusion
•  References
•  Appendices
38
Use cases
Use cases
39
Use cases
40
You?
Case study
Search engine
42
The Wikipedia.fr table
43
Creating the wikipedia table
HTableDescriptor desc = new HTableDescriptor("wikipedia");
HColumnDescriptor contentFamily = new HColumnDescriptor("content".getBytes());
contentFamily.setMaxVersions(1);
desc.addFamily(contentFamily);
HColumnDescriptor linksFamily = new HColumnDescriptor("links".getBytes());
linksFamily.setMaxVersions(1);
desc.addFamily(linksFamily);
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
admin.createTable(desc);
44
HBase API: data insertion by the crawlers
Put p = new Put(Util.toBytes(url));
p.add("content".getBytes(), "text".getBytes(),
      htmlParseData.getText().getBytes());
List<WebURL> links = htmlParseData.getOutgoingUrls();
int count = 0;
for (WebURL outGoingURL : links) {
    p.add("links".getBytes(), Bytes.toBytes(count++),
          outGoingURL.getURL().getBytes());
}
p.add("links".getBytes(), "count".getBytes(),
      Bytes.toBytes(count));
try {
    table.put(p);
} catch (IOException e) {
    e.printStackTrace();
}
45
Inverted index table
46
MapReduce
47
MapReduce inverted index – algorithm
method map(url, text)
    for all word ∈ text do
        emit(word, url)

method reduce(word, urls)
    count ← 0
    for all url ∈ urls do
        put(word, "links", count, url)
        count ← count + 1
    put(word, "links", "count", count)
48
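Before wiring this into Hadoop and HBase, the algorithm can be checked as plain, self-contained Java over an in-memory corpus. The class name and the sample pages are hypothetical; no cluster is required.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Self-contained sketch of the inverted-index map/reduce above:
// map emits (word, url) pairs; reduce groups the urls per word.
public class InvertedIndexSketch {

    // map(url, text): emit (word, url) for every word in the page text
    static List<Map.Entry<String, String>> map(String url, String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, url));
            }
        }
        return pairs;
    }

    // reduce: group all emitted urls under their word (the "links" cells)
    static Map<String, List<String>> reduce(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> p : pairs) {
            index.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return index;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        emitted.addAll(map("url1", "NoSQL stores scale"));
        emitted.addAll(map("url2", "NoSQL stores data"));
        Map<String, List<String>> index = reduce(emitted);
        System.out.println(index.get("nosql"));  // prints [url1, url2]
    }
}
```

The per-word url count that the real reducer writes into the "count" qualifier is simply the size of each list here.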
MapReduce inverted index – Configuration
public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "Reverse Index Job");
    job.setJarByClass(InvertedIndex.class);
    Scan scan = new Scan();
    scan.setCaching(500);
    scan.addColumn("content".getBytes(), "text".getBytes());
    TableMapReduceUtil.initTableMapperJob("wikipedia", scan,
        Map.class, Text.class, Text.class, job);
    TableMapReduceUtil.initTableReducerJob("wikipedia_ri",
        Reduce.class, job);
    job.setNumReduceTasks(1);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
49
MapReduce inverted index – Map
public static class Map extends TableMapper<Text, Text> {
    private static final Pattern PATTERN =
        Pattern.compile(SENTENCE_SPLITTER_REGEX);
    private Text key = new Text();
    private Text value = new Text();

    @Override
    protected void map(ImmutableBytesWritable rowkey, Result result,
                       Context context) {
        byte[] b = result.getValue("content".getBytes(), "text".getBytes());
        String text = Bytes.toString(b);
        String[] words = PATTERN.split(text);
        value.set(result.getRow());
        for (String word : words) {
            key.set(word.getBytes());
            try {
                context.write(key, value);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
50
MapReduce inverted index – Reduce
public static class Reduce extends
        TableReducer<Text, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text rowkey, Iterable<Text> values,
                          Context context) {
        Put p = new Put(rowkey.getBytes());
        int count = 0;
        for (Text link : values) {
            p.add("links".getBytes(),
                  Bytes.toBytes(count++), link.getBytes());
        }
        p.add("links".getBytes(), "count".getBytes(),
              Bytes.toBytes(count));
        try {
            context.write(new ImmutableBytesWritable(rowkey.getBytes()), p);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
51
Question
How do you sort search results by order of importance?
52
Link counting
53
Weighted counting
54
Recursive definition
55
A bit of linear algebra
56
PageRank
57
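The slides above (link counting, weighted counting, recursive definition, linear algebra) build up to PageRank, but are image-only here. As a minimal sketch, the power-iteration view of PageRank can be written over a hypothetical three-page graph; the damping factor 0.85 is the conventional choice, not something taken from the deck.

```java
import java.util.Arrays;

// Minimal PageRank by power iteration:
// rank(p) = (1-d)/N + d * sum over q linking to p of rank(q)/outdegree(q)
public class PageRankSketch {

    // links[i] = the pages that page i links to (every page has outlinks here)
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);  // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - damping) / n);  // the "random jump" share
            for (int q = 0; q < n; q++) {
                for (int p : links[q]) {
                    // page q splits its rank evenly across its outlinks
                    next[p] += damping * rank[q] / links[q].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // hypothetical web: page 0 -> 1,2 ; page 1 -> 2 ; page 2 -> 0
        int[][] links = { {1, 2}, {2}, {0} };
        double[] rank = pageRank(links, 0.85, 50);
        System.out.println(Arrays.toString(rank));
        // page 2 collects links from both 0 and 1, so it ranks highest
    }
}
```

This is the "recursive definition" of the earlier slides made concrete: iterating the linear map until the ranks stop changing yields the dominant eigenvector of the link matrix.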
Search engine
58
Digression 1
59
Google@Stanford, Larry Page and Sergey Brin – 1998
• 2-proc Pentium II 300 MHz, 512 MB, five 9 GB drives
• 2-proc Pentium II 300 MHz, 512 MB, four 9 GB drives
• 4-proc PPC 604 333 MHz, 512 MB, eight 9 GB drives
• 2-proc UltraSparc II 200 MHz, 256 MB, three 9 GB drives, six 4 GB drives
• Disk expansion, eight 9 GB drives
• Disk expansion, ten 9 GB drives
Total: CPU 2933 MHz (across 10 CPUs), RAM 1792 MB, HD 366 GB
Digression 2
60
Google 1999 – First production server
A formula without math :-)
Data + science + insight = value
61
Security
Security
• Kerberos
• Access Control List
63
Contents
•  Overview
•  Use cases
•  Architecture
•  Case study
•  Conclusion
•  References
•  Appendices
64
Hadoop ecosystem
Hadoop ecosystem
• Hadoop
• Pig
• Hive
• Mahout
• Whirr
65
Thank you

Weitere ähnliche Inhalte

Was ist angesagt?

field_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahofield_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentaho
Martin Ferguson
 
CouchDB on Rails - FrozenRails 2010
CouchDB on Rails - FrozenRails 2010CouchDB on Rails - FrozenRails 2010
CouchDB on Rails - FrozenRails 2010
Jonathan Weiss
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
Nam Nham
 

Was ist angesagt? (20)

Hadoop introduction seminar presentation
Hadoop introduction seminar presentationHadoop introduction seminar presentation
Hadoop introduction seminar presentation
 
Hive hcatalog
Hive hcatalogHive hcatalog
Hive hcatalog
 
MongoDB
MongoDBMongoDB
MongoDB
 
Writing A Foreign Data Wrapper
Writing A Foreign Data WrapperWriting A Foreign Data Wrapper
Writing A Foreign Data Wrapper
 
field_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahofield_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentaho
 
Mythbusting: Understanding How We Measure the Performance of MongoDB
Mythbusting: Understanding How We Measure the Performance of MongoDBMythbusting: Understanding How We Measure the Performance of MongoDB
Mythbusting: Understanding How We Measure the Performance of MongoDB
 
Advance MySQL Docstore Features
Advance MySQL Docstore FeaturesAdvance MySQL Docstore Features
Advance MySQL Docstore Features
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
CouchDB on Rails - FrozenRails 2010
CouchDB on Rails - FrozenRails 2010CouchDB on Rails - FrozenRails 2010
CouchDB on Rails - FrozenRails 2010
 
CouchDB on Rails
CouchDB on RailsCouchDB on Rails
CouchDB on Rails
 
CouchDB on Rails - RailsWayCon 2010
CouchDB on Rails - RailsWayCon 2010CouchDB on Rails - RailsWayCon 2010
CouchDB on Rails - RailsWayCon 2010
 
PyRate for fun and research
PyRate for fun and researchPyRate for fun and research
PyRate for fun and research
 
An introduction to CouchDB
An introduction to CouchDBAn introduction to CouchDB
An introduction to CouchDB
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
introduction to Mongodb
introduction to Mongodbintroduction to Mongodb
introduction to Mongodb
 
MongoD Essentials
MongoD EssentialsMongoD Essentials
MongoD Essentials
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
 
Intro To Couch Db
Intro To Couch DbIntro To Couch Db
Intro To Couch Db
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
 

Ähnlich wie Valtech - Big Data & NoSQL : au-delà du nouveau buzz

Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL Polybase
Henk van der Valk
 
Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your Data
DataWorks Summit
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 

Ähnlich wie Valtech - Big Data & NoSQL : au-delà du nouveau buzz (20)

Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRA
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.comHypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.com
 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL Polybase
 
Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your Data
 
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
 
Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIOUnlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Json to hive_schema_generator
Json to hive_schema_generatorJson to hive_schema_generator
Json to hive_schema_generator
 
HBase Secondary Indexing
HBase Secondary Indexing HBase Secondary Indexing
HBase Secondary Indexing
 
Working with Hive Analytics
Working with Hive AnalyticsWorking with Hive Analytics
Working with Hive Analytics
 
Big data talking stories in Healthcare
Big data talking stories in Healthcare Big data talking stories in Healthcare
Big data talking stories in Healthcare
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Exploring sql server 2016 bi
Exploring sql server 2016 biExploring sql server 2016 bi
Exploring sql server 2016 bi
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
DAC4B 2015 - Polybase
DAC4B 2015 - PolybaseDAC4B 2015 - Polybase
DAC4B 2015 - Polybase
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 

Mehr von Valtech

Mehr von Valtech (20)

Valtech - Réalité virtuelle : analyses, perspectives, démonstrations
Valtech - Réalité virtuelle : analyses, perspectives, démonstrationsValtech - Réalité virtuelle : analyses, perspectives, démonstrations
Valtech - Réalité virtuelle : analyses, perspectives, démonstrations
 
CES 2016 - Décryptage et revue des tendances
CES 2016 - Décryptage et revue des tendancesCES 2016 - Décryptage et revue des tendances
CES 2016 - Décryptage et revue des tendances
 
Stéphane Roche - Agilité en milieu multiculturel
Stéphane Roche - Agilité en milieu multiculturelStéphane Roche - Agilité en milieu multiculturel
Stéphane Roche - Agilité en milieu multiculturel
 
Valtech - Internet of Things & Big Data : un mariage de raison
Valtech - Internet of Things & Big Data : un mariage de raisonValtech - Internet of Things & Big Data : un mariage de raison
Valtech - Internet of Things & Big Data : un mariage de raison
 
Tendances digitales et créatives // Cannes Lions 2015
Tendances digitales et créatives // Cannes Lions 2015Tendances digitales et créatives // Cannes Lions 2015
Tendances digitales et créatives // Cannes Lions 2015
 
Valtech - Du BI au Big Data, une révolution dans l’entreprise
Valtech - Du BI au Big Data, une révolution dans l’entrepriseValtech - Du BI au Big Data, une révolution dans l’entreprise
Valtech - Du BI au Big Data, une révolution dans l’entreprise
 
Valtech / Adobe - Résultats du Baromètre Marketing Digital 2015
Valtech / Adobe - Résultats du Baromètre Marketing Digital 2015Valtech / Adobe - Résultats du Baromètre Marketing Digital 2015
Valtech / Adobe - Résultats du Baromètre Marketing Digital 2015
 
Valtech - Architecture Agile des SI
Valtech - Architecture Agile des SIValtech - Architecture Agile des SI
Valtech - Architecture Agile des SI
 
Valtech - Big Data en action
Valtech - Big Data en actionValtech - Big Data en action
Valtech - Big Data en action
 
Tendances mobiles et digitales du MWC 2015
Tendances mobiles et digitales du MWC 2015Tendances mobiles et digitales du MWC 2015
Tendances mobiles et digitales du MWC 2015
 
CES 2015 : Décryptage et tendances / Objets connectés
CES 2015 : Décryptage et tendances / Objets connectésCES 2015 : Décryptage et tendances / Objets connectés
CES 2015 : Décryptage et tendances / Objets connectés
 
Valtech - Big Data en action
Valtech - Big Data en actionValtech - Big Data en action
Valtech - Big Data en action
 
Valtech - Economie Collaborative
Valtech - Economie CollaborativeValtech - Economie Collaborative
Valtech - Economie Collaborative
 
Valtech - Adobe - Résultats du Baromètre Digital Marketing 2014
Valtech - Adobe - Résultats du Baromètre Digital Marketing 2014Valtech - Adobe - Résultats du Baromètre Digital Marketing 2014
Valtech - Adobe - Résultats du Baromètre Digital Marketing 2014
 
[Veille thématique et décryptage] Cannes Lions 2014
[Veille thématique et décryptage] Cannes Lions 2014[Veille thématique et décryptage] Cannes Lions 2014
[Veille thématique et décryptage] Cannes Lions 2014
 
Valtech - Usages et technologie SaaS
Valtech - Usages et technologie SaaSValtech - Usages et technologie SaaS
Valtech - Usages et technologie SaaS
 
[ Revue Innovations ] Valtech - Mobile World Congress
[ Revue Innovations ] Valtech - Mobile World Congress[ Revue Innovations ] Valtech - Mobile World Congress
[ Revue Innovations ] Valtech - Mobile World Congress
 
Valtech - Digitalisation du Point de Vente - Toulouse - Février 2014
Valtech - Digitalisation du Point de Vente - Toulouse - Février 2014Valtech - Digitalisation du Point de Vente - Toulouse - Février 2014
Valtech - Digitalisation du Point de Vente - Toulouse - Février 2014
 
[ Veille de tendances ] Valtech : Objets connectés
[ Veille de tendances ] Valtech : Objets connectés[ Veille de tendances ] Valtech : Objets connectés
[ Veille de tendances ] Valtech : Objets connectés
 
Valtech - Sharepoint et le cloud Azure
Valtech - Sharepoint et le cloud AzureValtech - Sharepoint et le cloud Azure
Valtech - Sharepoint et le cloud Azure
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Kürzlich hochgeladen (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Valtech - Big Data & NoSQL : au-delà du nouveau buzz

  • 1. Big Data et NoSQL au-delà du nouveau buzz
  • 4. Définition «Les technologies Big Data correspondent à une nouvelle génération de technologies et d’architectures, conçues pour retirer une valeur économique de gigantesques volumes de données hétéroclites, en les capturant, en les explorant et/ou en les analysant en un temps record » 4
  • 5. Sommaire Ò  Présentation Ò  Cas d’utilisation Ò  Architecture Ò  Cas Pratique Ò  Conclusion Ò  Références Ò  Annexes 5 Méthodologie
  • 6. Big Data / Méthodologie La mise en place d’une démarche Big Data est toujours composée de trois étapes : Ò  Collecter, stocker les données : Partie infrastructure. Ò  Analyser, corréler, agréger les données : Partie analytique. Ò  Exploiter, afficher l’analyse Big Data : Comment exploiter les données et les analyses ?
  • 8. Architecture Big Data Audio, Vidéo, Image Docs, Texte, XML Web logs, Clicks, Social, Graphs, RSS, Capteurs, Graphs, RSS, Spatial, GPS Autres Base de données Orientée colonne NoSQL Distributed File System Map Reduce Base de données SQL SQL Analytiques , Business Intelligent COLLECTER LESDONNEES STOCKAGE&ORGANISATIONEXTRACTION ANALYSER & VISUALISER BUSINESS
  • 9. Technologie Big Data Audio, Vidéo, Image Docs, Texte, XML Web logs, Clicks, Social, Graphs, RSS, Capteurs, Graphs, RSS, Spatial, GPS Autres HBase, Big Table, Cassandra, Voldemort, … HDFS, GFS, S3, … Oracle, DB2, MySQL, … SQL COLLECTER LESDONNEES STOCKAGE&ORGANISATIONEXTRACTION ANALYSER & VISUALISER BUSINESS
  • 10. Architecture Big Data Open Source pour l’entreprise
  • 11. Big Data full solution
  • 14. Cas d’utilisation Hadoop Ò  Facebook uses Hadoop to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning. There are two clusters, a 1100-machine cluster with 8800 cores and about 12 PB raw storage and a 300- machine cluster with 2400 cores and about 3 PB raw storage. Ò  Yahoo! deploys more than 100,000 CPUs in > 40,000 computers running Hadoop. The biggest cluster has 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM). This is used to support research for Ad Systems and Web Search and to do scaling tests to support development of Hadoop on larger clusters Ò  eBay uses a 532 nodes cluster (8 * 532 cores, 5.3PB), Java MapReduce, Pig, Hive and HBase Ò  Twitter uses Hadoop to store and process tweets, log files, and other data generated across Twitter. They use Cloudera’s CDH2 distribution of Hadoop. They use both Scala and Java to access Hadoop’s MapReduce APIs as well as Pig, Avro, Hive, and Cassandra.
  • 15. Gartner talk « D'ici 2015, 4,4 millions d'emplois informatiques seront créés dans le monde pour soutenir le Big Data, dont 1,9 millions aux Etat- Unis », a déclaré Peter Sondergaard, senior vice-président et responsable mondial de la recherche chez Gartner. Wanted « Sophisticated Statistical Analysis » 100 000 to 500 000 $
  • 16. HBase
  • 17. Présentation Architecture & API Cas d’utilisations Etude de cas Sécurité Ecosystème Hadoop
  • 19. Présentation “HBase is the Hadoop database. Think of it as a distributed scalable Big Data store” http://hbase.apache.org/ 19
  • 20. Présentation “Project's goal is the hosting of very large tables – billions of rows X millions of columns – atop clusters of commodity hardware” http://hbase.apache.org/ 20
  • 21. Présentation “HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's BigTable” http://hbase.apache.org/ 21
  • 22. La trilogie Google Ò The Google File System http://research.google.com/archive/gfs.html Ò MapReduce: Simplified Data Processing on Large Cluster http://research.google.com/archive/mapreduce.html Ò Bigtable: A Distributed Storage System for Structured Data http://research.google.com/archive/bigtable.html 22
  • 24. Installation / démarrage / arrêt $ mkdir bigdata $ cd bigdata $ wget http://apache.claz.org/hbase/ hbase-0.92.1/hbase-0.92.1.tar.gz … $ tar xvfz hbase-0.92.1.tar.gz … $ export HBASE_HOME=`pwd`/hbase-0.92.1 $ $HBASE_HOME/bin/start-hbase.sh … $ $HBASE_HOME/bin/stop-hbase.sh … 24
  • 25. HBase Shell / exemple de session $ $HBASE_HOME/hbase shell HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell Version 0.92.1, r1298924, Fri Mar 9 16:58:34 UTC 2012 hbase(main):001:0> list TABLE 0 row(s) in 0.5510 seconds hbase(main):002:0> create 'mytable', 'cf' 0 row(s) in 1.1730 seconds hbase(main):003:0> list TABLE mytable 1 row(s) in 0.0080 seconds hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase' 0 row(s) in 0.2070 seconds hbase(main):005:0> get 'mytable', 'first' COLUMN CELL cf:message timestamp=1323483954406, value=hello HBase 1 row(s) in 0.0250 seconds 25
  • 26. HBase Shell / commands
    - General: status, version
    - DDL: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, is_disabled, is_enabled, list, show_filters
    - DML: count, delete, deleteall, get, get_counter, incr, put, scan, truncate
    - Tools: assign, balance_switch, balancer, close_region, compact, flush, hlog_roll, major_compact, move, split, unassign, zk_dump
    - Replication: add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication, stop_replication
    - Security: grant, revoke, user_permission
  • 30. Sample table
    - Data coordinates: [rowkey, column family, column qualifier, timestamp] → value
    - Example: [fr.wikipedia.org/wiki/NoSQL, links, count, 1234567890] → 24
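A toy in-memory model can make the four-part coordinate concrete. The sketch below is illustrative only (class and method names are assumptions, not the HBase implementation): cells live in nested sorted maps keyed by row, family, qualifier, and timestamp, and a read returns the newest version, as a default HBase Get would.

```java
import java.util.*;

// Toy model of HBase cell coordinates:
// [rowkey, family, qualifier, timestamp] -> value
public class CellStoreSketch {
    // row -> family -> qualifier -> timestamp (newest first) -> value
    private final Map<String, Map<String, Map<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    public void put(String row, String family, String qualifier, long ts, String value) {
        rows.computeIfAbsent(row, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .computeIfAbsent(qualifier, k -> new TreeMap<>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Latest version wins, like a default Get (no explicit timestamp)
    public String get(String row, String family, String qualifier) {
        NavigableMap<Long, String> versions = rows
                .getOrDefault(row, Map.of())
                .getOrDefault(family, Map.of())
                .get(qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        CellStoreSketch store = new CellStoreSketch();
        store.put("fr.wikipedia.org/wiki/NoSQL", "links", "count", 1234567890L, "24");
        // prints 24
        System.out.println(store.get("fr.wikipedia.org/wiki/NoSQL", "links", "count"));
    }
}
```

Writing a newer timestamp to the same coordinate shadows the old value on reads while keeping both versions stored, which mirrors HBase's versioning model.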
  • 31. HBase Java API
    - HBaseAdmin, HTableDescriptor, HColumnDescriptor
    HTableDescriptor desc = new HTableDescriptor("TableName");
    HColumnDescriptor cf = new HColumnDescriptor("Family".getBytes());
    desc.addFamily(cf);
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);
  • 32. HBase Java API
    - HTablePool & HTableInterface
    HTablePool pool = new HTablePool();
    HTableInterface table = pool.getTable("TableName");
  • 33. HBase Java API
    - Put
    byte[] cellValue = …
    Put p = new Put("RowKey".getBytes());
    p.add("Family".getBytes(), "Qualifier".getBytes(), cellValue);
    table.put(p);
  • 35. HBase Java API
    - Get
    Get g = new Get("RowKey".getBytes());
    g.addColumn("Family".getBytes(), "Qualifier".getBytes());
    Result r = table.get(g);
    - Result
    byte[] cellValue = r.getValue("Family".getBytes(), "Qualifier".getBytes());
  • 36. HBase Java API
    - Scan
    Scan scan = new Scan();
    scan.setStartRow("StartRow".getBytes());
    scan.setStopRow("StopRow".getBytes());
    scan.addColumn("Family".getBytes(), "Qualifier".getBytes());
    ResultScanner rs = table.getScanner(scan);
    for (Result r : rs) {
      // …
    }
  • 38. Outline
    - Presentation
    - Use cases
    - Architecture
    - Practical case
    - Conclusion
    - References
    - Appendices
    Use cases
  • 44. Creating the wikipedia table
    HTableDescriptor desc = new HTableDescriptor("wikipedia");
    HColumnDescriptor contentFamily = new HColumnDescriptor("content".getBytes());
    contentFamily.setMaxVersions(1);
    desc.addFamily(contentFamily);
    HColumnDescriptor linksFamily = new HColumnDescriptor("links".getBytes());
    linksFamily.setMaxVersions(1);
    desc.addFamily(linksFamily);
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);
  • 45. HBase API: data insertion by the crawlers
    Put p = new Put(Util.toBytes(url));
    p.add("content".getBytes(), "text".getBytes(), htmlParseData.getText().getBytes());
    List<WebURL> links = htmlParseData.getOutgoingUrls();
    int count = 0;
    for (WebURL outGoingURL : links) {
      p.add("links".getBytes(), Bytes.toBytes(count++), outGoingURL.getURL().getBytes());
    }
    p.add("links".getBytes(), "count".getBytes(), Bytes.toBytes(count));
    try {
      table.put(p);
    } catch (IOException e) {
      e.printStackTrace();
    }
  • 46. Inverted index table
  • 48. MapReduce inverted index: algorithm
    method map(url, text)
      for all word ∈ text do
        emit(word, url)
    method reduce(word, urls)
      count ← 0
      for all url ∈ urls do
        put(word, "links", count, url)
        count ← count + 1
      put(word, "links", "count", count)
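Before wiring this into Hadoop, the algorithm can be checked in plain Java. A minimal single-process sketch under stated assumptions (class and method names are illustrative; the real job on the following slides uses TableMapper/TableReducer against HBase): the "map" step emits one (word, url) pair per word, and the grouping a shuffle would do is simulated with an in-memory map.

```java
import java.util.*;

// Single-process sketch of the inverted-index algorithm above.
public class InvertedIndexSketch {

    // pages: url -> page text. Returns: word -> list of urls containing it.
    public static Map<String, List<String>> build(Map<String, String> pages) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            String url = page.getKey();
            // "map": emit (word, url) for every word of the page text;
            // the index map plays the role of the shuffle/"reduce" grouping
            for (String word : page.getValue().toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(url);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("url1", "HBase is the Hadoop database");
        pages.put("url2", "Hadoop ecosystem");
        // prints [url1, url2]
        System.out.println(build(pages).get("hadoop"));
    }
}
```

The per-word url list corresponds to the "links" cells written by the reduce step on slide 51, with the list size playing the role of the "count" cell.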
  • 49. MapReduce inverted index: configuration
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "Reverse Index Job");
      job.setJarByClass(InvertedIndex.class);
      Scan scan = new Scan();
      scan.setCaching(500);
      scan.addColumn("content".getBytes(), "text".getBytes());
      TableMapReduceUtil.initTableMapperJob("wikipedia", scan, Map.class, Text.class, Text.class, job);
      TableMapReduceUtil.initTableReducerJob("wikipedia_ri", Reduce.class, job);
      job.setNumReduceTasks(1);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  • 50. MapReduce inverted index: Map
    public static class Map extends TableMapper<Text, Text> {
      private static final Pattern PATTERN = Pattern.compile(SENTENCE_SPLITTER_REGEX);
      private Text key = new Text();
      private Text value = new Text();

      @Override
      protected void map(ImmutableBytesWritable rowkey, Result result, Context context) {
        byte[] b = result.getValue("content".getBytes(), "text".getBytes());
        String text = Bytes.toString(b);
        String[] words = PATTERN.split(text);
        value.set(result.getRow());
        for (String word : words) {
          key.set(word.getBytes());
          try {
            context.write(key, value);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      }
    }
  • 51. MapReduce inverted index: Reduce
    public static class Reduce extends TableReducer<Text, Text, ImmutableBytesWritable> {
      @Override
      protected void reduce(Text rowkey, Iterable<Text> values, Context context) {
        Put p = new Put(rowkey.getBytes());
        int count = 0;
        for (Text link : values) {
          p.add("links".getBytes(), Bytes.toBytes(count++), link.getBytes());
        }
        p.add("links".getBytes(), "count".getBytes(), Bytes.toBytes(count));
        try {
          context.write(new ImmutableBytesWritable(rowkey.getBytes()), p);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
  • 52. Question: how do you sort search results by order of importance?
  • 56. A bit of linear algebra
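The linear algebra the slide alludes to is presumably PageRank: the importance vector is the dominant eigenvector of the link matrix, computed by power iteration. A minimal sketch under stated assumptions (the 0.85 damping factor and all names are illustrative, not from the slides):

```java
import java.util.Arrays;

// Power-iteration PageRank over a small link graph.
public class PageRankSketch {

    // links[i] lists the page indices that page i points to
    public static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                  // start from the uniform vector
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);  // random-jump term
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // dangling page: spread its rank evenly over all pages
                    for (int j = 0; j < n; j++) next[j] += damping * rank[i] / n;
                } else {
                    for (int j : links[i]) next[j] += damping * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // 0 -> 1 -> 2 -> 0: a symmetric cycle, so every rank stays at 1/3
        int[][] links = { {1}, {2}, {0} };
        double[] r = pageRank(links, 0.85, 50);
        System.out.printf("%.3f %.3f %.3f%n", r[0], r[1], r[2]);
    }
}
```

Each iteration is one multiplication of the rank vector by the (damped) column-stochastic link matrix, which is exactly the eigenvector computation the slide's title hints at.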
  • 59. Digression 1: Google@Stanford, Larry Page and Sergey Brin, 1998
    - 2-proc Pentium II 300 MHz, 512 MB, five 9 GB drives
    - 2-proc Pentium II 300 MHz, 512 MB, four 9 GB drives
    - 4-proc PPC 604 333 MHz, 512 MB, eight 9 GB drives
    - 2-proc UltraSparc II 200 MHz, 256 MB, three 9 GB drives, six 4 GB drives
    - Disk expansion, eight 9 GB drives
    - Disk expansion, ten 9 GB drives
    Totals: CPU 2933 MHz (across 10 CPUs), RAM 1792 MB, HD 366 GB
  • 60. Digression 2: Google, 1999. First production server
  • 61. A formula with no math :-) Data + science + insight = value
  • 64. Outline
    - Presentation
    - Use cases
    - Architecture
    - Practical case
    - Conclusion
    - References
    - Appendices
    Hadoop ecosystem