Big Data and NoSQL
beyond the new buzz
Definition
Methodology
Architecture
Use cases
Definition
"Big Data technologies are a new generation of technologies and architectures, designed to extract economic value from very large volumes of heterogeneous data by capturing, exploring and/or analyzing them in record time"
4
Contents
•  Overview
•  Use cases
•  Architecture
•  Case study
•  Conclusion
•  References
•  Appendices
5
Methodology
Big Data / Methodology
Putting a Big Data approach in place always involves three steps:
•  Collect and store the data: the infrastructure part.
•  Analyze, correlate and aggregate the data: the analytics part.
•  Exploit and display the Big Data analysis: how do we put the data and the analyses to work?
Architecture
Big Data architecture
[Diagram: COLLECT THE DATA (audio, video, images; docs, text, XML; web logs, clicks, social, graphs, RSS; sensors, spatial, GPS; others) → STORAGE & ORGANIZATION (column-oriented NoSQL database, distributed file system, SQL database) → EXTRACTION (MapReduce, SQL) → ANALYZE & VISUALIZE (analytics, business intelligence) → BUSINESS]
Big Data technology
[Diagram: the same pipeline with concrete technologies: column stores (HBase, Big Table, Cassandra, Voldemort, …), distributed file systems (HDFS, GFS, S3, …) and relational databases (Oracle, DB2, MySQL, …) queried via SQL]
Open-source Big Data architecture for the enterprise
Big Data full solution
Use cases
BigQuery use case
Hadoop use cases
Ò  Facebook uses Hadoop to store copies of internal log and
dimension data sources and as a source for reporting/analytics
and machine learning. There are two clusters, a 1100-machine
cluster with 8800 cores and about 12 PB raw storage and a 300-
machine cluster with 2400 cores and about 3 PB raw storage.
Ò  Yahoo! deploys more than 100,000 CPUs in > 40,000
computers running Hadoop. The biggest cluster has 4500 nodes
(2*4cpu boxes w 4*1TB disk & 16GB RAM). This is used to
support research for Ad Systems and Web Search and to do
scaling tests to support development of Hadoop on larger
clusters
Ò  eBay uses a 532 nodes cluster (8 * 532 cores, 5.3PB),
Java MapReduce, Pig, Hive and HBase
Ò  Twitter uses Hadoop to store and process tweets, log files, and
other data generated across Twitter. They use Cloudera’s CDH2
distribution of Hadoop. They use both Scala and Java to access
Hadoop’s MapReduce APIs as well as Pig, Avro, Hive, and
Cassandra.
Gartner talk
"By 2015, 4.4 million IT jobs will be created worldwide to support Big Data, including 1.9 million in the United States," said Peter Sondergaard, senior vice president and global head of research at Gartner.
Wanted
"Sophisticated Statistical Analysis"
$100,000 to $500,000
HBase
Overview
Architecture & API
Use cases
Case study
Security
Hadoop ecosystem
Overview
“HBase is the Hadoop database. Think of it as
a distributed scalable Big Data store”
http://hbase.apache.org/
19
Overview
“Project's goal is the hosting of very large
tables – billions of rows X millions of columns
– atop clusters of commodity hardware”
http://hbase.apache.org/
20
Overview
“HBase is an open-source, distributed,
versioned, column-oriented store modeled
after Google's BigTable”
http://hbase.apache.org/
21
The Google trilogy
• The Google File System
http://research.google.com/archive/gfs.html
• MapReduce: Simplified Data Processing on Large Clusters
http://research.google.com/archive/mapreduce.html
• Bigtable: A Distributed Storage System for Structured Data
http://research.google.com/archive/bigtable.html
22
Operating systems / Platform
23
Installation / startup / shutdown
$ mkdir bigdata
$ cd bigdata
$ wget http://apache.claz.org/hbase/hbase-0.92.1/hbase-0.92.1.tar.gz
…
$ tar xvfz hbase-0.92.1.tar.gz
…
$ export HBASE_HOME=`pwd`/hbase-0.92.1
$ $HBASE_HOME/bin/start-hbase.sh
…
$ $HBASE_HOME/bin/stop-hbase.sh
…
24
HBase Shell / example session
$ $HBASE_HOME/bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported
commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.1, r1298924, Fri Mar 9 16:58:34 UTC 2012
hbase(main):001:0> list
TABLE
0 row(s) in 0.5510 seconds
hbase(main):002:0> create 'mytable', 'cf'
0 row(s) in 1.1730 seconds
hbase(main):003:0> list
TABLE
mytable
1 row(s) in 0.0080 seconds
hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello
HBase'
0 row(s) in 0.2070 seconds
hbase(main):005:0> get 'mytable', 'first'
COLUMN CELL
cf:message timestamp=1323483954406, value=hello HBase
1 row(s) in 0.0250 seconds
25
HBase Shell / Commands
•  General
   status, version
•  DDL
   alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, is_disabled, is_enabled, list, show_filters
•  DML
   count, delete, deleteall, get, get_counter, incr, put, scan, truncate
•  Tools
   assign, balance_switch, balancer, close_region, compact, flush, hlog_roll, major_compact, move, split, unassign, zk_dump
•  Replication
   add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication, stop_replication
•  Security
   grant, revoke, user_permission
26
Architecture & API
Logical architecture
28
Table Design
29
Example table
30
Ò Coordonnées des données
[rowkey, column family, column qualifier, timestamp] → value
[fr.wikipedia.org/wiki/NoSQL, links, count, 1234567890] → 24
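The four-part coordinate above can be sketched as a plain Java map of maps: a toy, in-memory model of HBase's logical layout, not the HBase client API. The class name and sample values are illustrative only.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of HBase's logical data layout:
// [rowkey, column family, column qualifier, timestamp] -> value.
public class CellCoordinates {
    // (row/family:qualifier) -> versions, newest timestamp first
    private final TreeMap<String, NavigableMap<Long, String>> cells = new TreeMap<>();

    private static String key(String row, String family, String qualifier) {
        return row + "/" + family + ":" + qualifier;
    }

    public void put(String row, String family, String qualifier,
                    long timestamp, String value) {
        cells.computeIfAbsent(key(row, family, qualifier),
                k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
             .put(timestamp, value);
    }

    // A get without an explicit timestamp resolves to the newest version.
    public String get(String row, String family, String qualifier) {
        NavigableMap<Long, String> versions = cells.get(key(row, family, qualifier));
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        CellCoordinates table = new CellCoordinates();
        table.put("fr.wikipedia.org/wiki/NoSQL", "links", "count", 1234567890L, "23");
        table.put("fr.wikipedia.org/wiki/NoSQL", "links", "count", 1234567999L, "24");
        // The newest version wins:
        System.out.println(table.get("fr.wikipedia.org/wiki/NoSQL", "links", "count"));
    }
}
```

This mirrors why a cell is uniquely addressed only once the timestamp is included; without it, HBase returns the most recent version.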
HBase Java API
• HBaseAdmin, HTableDescriptor, HColumnDescriptor
HTableDescriptor desc = new HTableDescriptor("TableName");
HColumnDescriptor cf = new HColumnDescriptor("Family".getBytes());
desc.addFamily(cf);
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
admin.createTable(desc);
31
HBase Java API
Ò HTablePool & HTableInterface
HTablePool pool = new HTablePool();
HTableInterface table = pool.getTable("TableName");
32
HBase Java API
Ò Put
byte[] cellValue = …
Put p = new Put("RowKey".getBytes());
p.add("Family".getBytes(),
"Qualifier".getBytes(),
cellValue);
table.put(p);
33
Write path
34
HBase Java API
Ò Get
Get g = new Get("RowKey".getBytes());
g.addColumn("Family".getBytes(), "Qualifier".getBytes());
Result r = table.get(g);
Ò Result
byte[] cellValue = r.getValue("Family".getBytes(),
"Qualifier".getBytes());
35
HBase Java API
Ò Scan
Scan scan = new Scan();
scan.setStartRow("StartRow".getBytes());
scan.setStopRow("StopRow".getBytes());
scan.addColumn("Family".getBytes(),
"Qualifier".getBytes());
ResultScanner rs = table.getScanner(scan);
for (Result r : rs) {
// …
}
36
Read path
37
Contents
•  Overview
•  Use cases
•  Architecture
•  Case study
•  Conclusion
•  References
•  Appendices
38
Use cases
Use cases
39
Use cases
40
You?
Case study
Search engine
42
The Wikipedia.fr table
43
Creating the wikipedia table
HTableDescriptor desc = new HTableDescriptor("wikipedia");
HColumnDescriptor contentFamily = new HColumnDescriptor("content".getBytes());
contentFamily.setMaxVersions(1);
desc.addFamily(contentFamily);
HColumnDescriptor linksFamily = new HColumnDescriptor("links".getBytes());
linksFamily.setMaxVersions(1);
desc.addFamily(linksFamily);
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
admin.createTable(desc);
44
HBase API: data insertion by the crawlers
Put p = new Put(Util.toBytes(url));
p.add("content".getBytes(), "text".getBytes(),
      htmlParseData.getText().getBytes());
List<WebURL> links = htmlParseData.getOutgoingUrls();
int count = 0;
for (WebURL outGoingURL : links) {
    p.add("links".getBytes(), Bytes.toBytes(count++),
          outGoingURL.getURL().getBytes());
}
p.add("links".getBytes(), "count".getBytes(),
      Bytes.toBytes(count));
try {
    table.put(p);
} catch (IOException e) {
    e.printStackTrace();
}
45
Inverted index table
46
MapReduce
47
MapReduce inverted index – algorithm
method map(url, text)
    for all word ∈ text do
        emit(word, url)

method reduce(word, urls)
    count ← 0
    for all url ∈ urls do
        put(word, "links", count, url)
        count ← count + 1
    put(word, "links", "count", count)
48
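Before wiring this into Hadoop and HBase, the algorithm can be checked as plain, self-contained Java over an in-memory corpus. The class name and the sample pages are hypothetical; no cluster is required.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Self-contained sketch of the inverted-index map/reduce above:
// map emits (word, url) pairs; reduce groups the urls per word.
public class InvertedIndexSketch {

    // map(url, text): emit (word, url) for every word in the page text
    static List<Map.Entry<String, String>> map(String url, String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, url));
            }
        }
        return pairs;
    }

    // reduce: group all emitted urls under their word (the "links" cells)
    static Map<String, List<String>> reduce(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> p : pairs) {
            index.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return index;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        emitted.addAll(map("url1", "NoSQL stores scale"));
        emitted.addAll(map("url2", "NoSQL stores data"));
        Map<String, List<String>> index = reduce(emitted);
        System.out.println(index.get("nosql"));  // prints [url1, url2]
    }
}
```

The per-word url count that the real reducer writes into the "count" qualifier is simply the size of each list here.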
MapReduce inverted index – Configuration
public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "Reverse Index Job");
    job.setJarByClass(InvertedIndex.class);
    Scan scan = new Scan();
    scan.setCaching(500);
    scan.addColumn("content".getBytes(), "text".getBytes());
    TableMapReduceUtil.initTableMapperJob("wikipedia", scan,
        Map.class, Text.class, Text.class, job);
    TableMapReduceUtil.initTableReducerJob("wikipedia_ri",
        Reduce.class, job);
    job.setNumReduceTasks(1);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
49
MapReduce inverted index – Map
public static class Map extends TableMapper<Text, Text> {
    private static final Pattern PATTERN =
        Pattern.compile(SENTENCE_SPLITTER_REGEX);
    private Text key = new Text();
    private Text value = new Text();

    @Override
    protected void map(ImmutableBytesWritable rowkey, Result result,
                       Context context) {
        byte[] b = result.getValue("content".getBytes(), "text".getBytes());
        String text = Bytes.toString(b);
        String[] words = PATTERN.split(text);
        value.set(result.getRow());
        for (String word : words) {
            key.set(word.getBytes());
            try {
                context.write(key, value);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
50
MapReduce inverted index – Reduce
public static class Reduce extends
        TableReducer<Text, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text rowkey, Iterable<Text> values,
                          Context context) {
        Put p = new Put(rowkey.getBytes());
        int count = 0;
        for (Text link : values) {
            p.add("links".getBytes(),
                  Bytes.toBytes(count++), link.getBytes());
        }
        p.add("links".getBytes(), "count".getBytes(),
              Bytes.toBytes(count));
        try {
            context.write(new ImmutableBytesWritable(rowkey.getBytes()), p);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
51
Question
How do you sort search results by order of importance?
52
Link counting
53
Weighted counting
54
Recursive definition
55
A bit of linear algebra
56
PageRank
57
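The slides above (link counting, weighted counting, recursive definition, linear algebra) build up to PageRank, but are image-only here. As a minimal sketch, the power-iteration view of PageRank can be written over a hypothetical three-page graph; the damping factor 0.85 is the conventional choice, not something taken from the deck.

```java
import java.util.Arrays;

// Minimal PageRank by power iteration:
// rank(p) = (1-d)/N + d * sum over q linking to p of rank(q)/outdegree(q)
public class PageRankSketch {

    // links[i] = the pages that page i links to (every page has outlinks here)
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);  // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - damping) / n);  // the "random jump" share
            for (int q = 0; q < n; q++) {
                for (int p : links[q]) {
                    // page q splits its rank evenly across its outlinks
                    next[p] += damping * rank[q] / links[q].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // hypothetical web: page 0 -> 1,2 ; page 1 -> 2 ; page 2 -> 0
        int[][] links = { {1, 2}, {2}, {0} };
        double[] rank = pageRank(links, 0.85, 50);
        System.out.println(Arrays.toString(rank));
        // page 2 collects links from both 0 and 1, so it ranks highest
    }
}
```

This is the "recursive definition" of the earlier slides made concrete: iterating the linear map until the ranks stop changing yields the dominant eigenvector of the link matrix.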
Search engine
58
Digression 1
59
Google@Stanford, Larry Page and Sergey Brin – 1998
• 2-proc Pentium II 300 MHz, 512 MB, five 9 GB drives
• 2-proc Pentium II 300 MHz, 512 MB, four 9 GB drives
• 4-proc PPC 604 333 MHz, 512 MB, eight 9 GB drives
• 2-proc UltraSparc II 200 MHz, 256 MB, three 9 GB drives, six 4 GB drives
• Disk expansion, eight 9 GB drives
• Disk expansion, ten 9 GB drives
Total: CPU 2933 MHz (across 10 CPUs), RAM 1792 MB, HD 366 GB
Digression 2
60
Google 1999 – First production server
A formula without math :-)
Data + science + insight = value
61
Security
Security
• Kerberos
• Access Control List
63
Contents
•  Overview
•  Use cases
•  Architecture
•  Case study
•  Conclusion
•  References
•  Appendices
64
Hadoop ecosystem
Hadoop ecosystem
• Hadoop
• Pig
• Hive
• Mahout
• Whirr
65
Thank you

Weitere ähnliche Inhalte

Was ist angesagt?

field_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahofield_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentaho
Martin Ferguson
 
CouchDB on Rails - FrozenRails 2010
CouchDB on Rails - FrozenRails 2010CouchDB on Rails - FrozenRails 2010
CouchDB on Rails - FrozenRails 2010
Jonathan Weiss
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
Nam Nham
 

Was ist angesagt? (20)

Hadoop introduction seminar presentation
Hadoop introduction seminar presentationHadoop introduction seminar presentation
Hadoop introduction seminar presentation
 
Hive hcatalog
Hive hcatalogHive hcatalog
Hive hcatalog
 
MongoDB
MongoDBMongoDB
MongoDB
 
Writing A Foreign Data Wrapper
Writing A Foreign Data WrapperWriting A Foreign Data Wrapper
Writing A Foreign Data Wrapper
 
field_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahofield_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentaho
 
Mythbusting: Understanding How We Measure the Performance of MongoDB
Mythbusting: Understanding How We Measure the Performance of MongoDBMythbusting: Understanding How We Measure the Performance of MongoDB
Mythbusting: Understanding How We Measure the Performance of MongoDB
 
Advance MySQL Docstore Features
Advance MySQL Docstore FeaturesAdvance MySQL Docstore Features
Advance MySQL Docstore Features
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
CouchDB on Rails - FrozenRails 2010
CouchDB on Rails - FrozenRails 2010CouchDB on Rails - FrozenRails 2010
CouchDB on Rails - FrozenRails 2010
 
CouchDB on Rails
CouchDB on RailsCouchDB on Rails
CouchDB on Rails
 
CouchDB on Rails - RailsWayCon 2010
CouchDB on Rails - RailsWayCon 2010CouchDB on Rails - RailsWayCon 2010
CouchDB on Rails - RailsWayCon 2010
 
PyRate for fun and research
PyRate for fun and researchPyRate for fun and research
PyRate for fun and research
 
An introduction to CouchDB
An introduction to CouchDBAn introduction to CouchDB
An introduction to CouchDB
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
introduction to Mongodb
introduction to Mongodbintroduction to Mongodb
introduction to Mongodb
 
MongoD Essentials
MongoD EssentialsMongoD Essentials
MongoD Essentials
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
 
Intro To Couch Db
Intro To Couch DbIntro To Couch Db
Intro To Couch Db
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
 

Ähnlich wie Valtech - Big Data & NoSQL : au-delà du nouveau buzz

Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL Polybase
Henk van der Valk
 
Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your Data
DataWorks Summit
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 

Ähnlich wie Valtech - Big Data & NoSQL : au-delà du nouveau buzz (20)

Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRA
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.comHypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.com
 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL Polybase
 
Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your Data
 
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
 
Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIOUnlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Json to hive_schema_generator
Json to hive_schema_generatorJson to hive_schema_generator
Json to hive_schema_generator
 
HBase Secondary Indexing
HBase Secondary Indexing HBase Secondary Indexing
HBase Secondary Indexing
 
Working with Hive Analytics
Working with Hive AnalyticsWorking with Hive Analytics
Working with Hive Analytics
 
Big data talking stories in Healthcare
Big data talking stories in Healthcare Big data talking stories in Healthcare
Big data talking stories in Healthcare
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Exploring sql server 2016 bi
Exploring sql server 2016 biExploring sql server 2016 bi
Exploring sql server 2016 bi
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
DAC4B 2015 - Polybase
DAC4B 2015 - PolybaseDAC4B 2015 - Polybase
DAC4B 2015 - Polybase
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 

Mehr von Valtech

Mehr von Valtech (20)

Valtech - Réalité virtuelle : analyses, perspectives, démonstrations
Valtech - Réalité virtuelle : analyses, perspectives, démonstrationsValtech - Réalité virtuelle : analyses, perspectives, démonstrations
Valtech - Réalité virtuelle : analyses, perspectives, démonstrations
 
CES 2016 - Décryptage et revue des tendances
CES 2016 - Décryptage et revue des tendancesCES 2016 - Décryptage et revue des tendances
CES 2016 - Décryptage et revue des tendances
 
Stéphane Roche - Agilité en milieu multiculturel
Stéphane Roche - Agilité en milieu multiculturelStéphane Roche - Agilité en milieu multiculturel
Stéphane Roche - Agilité en milieu multiculturel
 
Valtech - Internet of Things & Big Data : un mariage de raison
Valtech - Internet of Things & Big Data : un mariage de raisonValtech - Internet of Things & Big Data : un mariage de raison
Valtech - Internet of Things & Big Data : un mariage de raison
 
Tendances digitales et créatives // Cannes Lions 2015
Tendances digitales et créatives // Cannes Lions 2015Tendances digitales et créatives // Cannes Lions 2015
Tendances digitales et créatives // Cannes Lions 2015
 
Valtech - Du BI au Big Data, une révolution dans l’entreprise
Valtech - Du BI au Big Data, une révolution dans l’entrepriseValtech - Du BI au Big Data, une révolution dans l’entreprise
Valtech - Du BI au Big Data, une révolution dans l’entreprise
 
Valtech / Adobe - Résultats du Baromètre Marketing Digital 2015
Valtech / Adobe - Résultats du Baromètre Marketing Digital 2015Valtech / Adobe - Résultats du Baromètre Marketing Digital 2015
Valtech / Adobe - Résultats du Baromètre Marketing Digital 2015
 
Valtech - Architecture Agile des SI
Valtech - Architecture Agile des SIValtech - Architecture Agile des SI
Valtech - Architecture Agile des SI
 
Valtech - Big Data en action
Valtech - Big Data en actionValtech - Big Data en action
Valtech - Big Data en action
 
Tendances mobiles et digitales du MWC 2015
Tendances mobiles et digitales du MWC 2015Tendances mobiles et digitales du MWC 2015
Tendances mobiles et digitales du MWC 2015
 
CES 2015 : Décryptage et tendances / Objets connectés
CES 2015 : Décryptage et tendances / Objets connectésCES 2015 : Décryptage et tendances / Objets connectés
CES 2015 : Décryptage et tendances / Objets connectés
 
Valtech - Big Data en action
Valtech - Big Data en actionValtech - Big Data en action
Valtech - Big Data en action
 
Valtech - Economie Collaborative
Valtech - Economie CollaborativeValtech - Economie Collaborative
Valtech - Economie Collaborative
 
Valtech - Adobe - Résultats du Baromètre Digital Marketing 2014
Valtech - Adobe - Résultats du Baromètre Digital Marketing 2014Valtech - Adobe - Résultats du Baromètre Digital Marketing 2014
Valtech - Adobe - Résultats du Baromètre Digital Marketing 2014
 
[Veille thématique et décryptage] Cannes Lions 2014
[Veille thématique et décryptage] Cannes Lions 2014[Veille thématique et décryptage] Cannes Lions 2014
[Veille thématique et décryptage] Cannes Lions 2014
 
Valtech - Usages et technologie SaaS
Valtech - Usages et technologie SaaSValtech - Usages et technologie SaaS
Valtech - Usages et technologie SaaS
 
[ Revue Innovations ] Valtech - Mobile World Congress
[ Revue Innovations ] Valtech - Mobile World Congress[ Revue Innovations ] Valtech - Mobile World Congress
[ Revue Innovations ] Valtech - Mobile World Congress
 
Valtech - Digitalisation du Point de Vente - Toulouse - Février 2014
Valtech - Digitalisation du Point de Vente - Toulouse - Février 2014Valtech - Digitalisation du Point de Vente - Toulouse - Février 2014
Valtech - Digitalisation du Point de Vente - Toulouse - Février 2014
 
[ Veille de tendances ] Valtech : Objets connectés
[ Veille de tendances ] Valtech : Objets connectés[ Veille de tendances ] Valtech : Objets connectés
[ Veille de tendances ] Valtech : Objets connectés
 
Valtech - Sharepoint et le cloud Azure
Valtech - Sharepoint et le cloud AzureValtech - Sharepoint et le cloud Azure
Valtech - Sharepoint et le cloud Azure
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Kürzlich hochgeladen (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Valtech - Big Data & NoSQL : au-delà du nouveau buzz

  • 1. Big Data et NoSQL au-delà du nouveau buzz
  • 4. Définition «Les technologies Big Data correspondent à une nouvelle génération de technologies et d’architectures, conçues pour retirer une valeur économique de gigantesques volumes de données hétéroclites, en les capturant, en les explorant et/ou en les analysant en un temps record » 4
  • 5. Sommaire Ò  Présentation Ò  Cas d’utilisation Ò  Architecture Ò  Cas Pratique Ò  Conclusion Ò  Références Ò  Annexes 5 Méthodologie
  • 6. Big Data / Méthodologie La mise en place d’une démarche Big Data est toujours composée de trois étapes : Ò  Collecter, stocker les données : Partie infrastructure. Ò  Analyser, corréler, agréger les données : Partie analytique. Ò  Exploiter, afficher l’analyse Big Data : Comment exploiter les données et les analyses ?
  • 8. Architecture Big Data Audio, Vidéo, Image Docs, Texte, XML Web logs, Clicks, Social, Graphs, RSS, Capteurs, Graphs, RSS, Spatial, GPS Autres Base de données Orientée colonne NoSQL Distributed File System Map Reduce Base de données SQL SQL Analytiques , Business Intelligent COLLECTER LESDONNEES STOCKAGE&ORGANISATIONEXTRACTION ANALYSER & VISUALISER BUSINESS
  • 9. Technologie Big Data Audio, Vidéo, Image Docs, Texte, XML Web logs, Clicks, Social, Graphs, RSS, Capteurs, Graphs, RSS, Spatial, GPS Autres HBase, Big Table, Cassandra, Voldemort, … HDFS, GFS, S3, … Oracle, DB2, MySQL, … SQL COLLECTER LESDONNEES STOCKAGE&ORGANISATIONEXTRACTION ANALYSER & VISUALISER BUSINESS
  • 10. Architecture Big Data Open Source pour l’entreprise
  • 11. Big Data full solution
  • 14. Cas d’utilisation Hadoop Ò  Facebook uses Hadoop to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning. There are two clusters, a 1100-machine cluster with 8800 cores and about 12 PB raw storage and a 300- machine cluster with 2400 cores and about 3 PB raw storage. Ò  Yahoo! deploys more than 100,000 CPUs in > 40,000 computers running Hadoop. The biggest cluster has 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM). This is used to support research for Ad Systems and Web Search and to do scaling tests to support development of Hadoop on larger clusters Ò  eBay uses a 532 nodes cluster (8 * 532 cores, 5.3PB), Java MapReduce, Pig, Hive and HBase Ò  Twitter uses Hadoop to store and process tweets, log files, and other data generated across Twitter. They use Cloudera’s CDH2 distribution of Hadoop. They use both Scala and Java to access Hadoop’s MapReduce APIs as well as Pig, Avro, Hive, and Cassandra.
  • 15. Gartner talk « D'ici 2015, 4,4 millions d'emplois informatiques seront créés dans le monde pour soutenir le Big Data, dont 1,9 millions aux Etat- Unis », a déclaré Peter Sondergaard, senior vice-président et responsable mondial de la recherche chez Gartner. Wanted « Sophisticated Statistical Analysis » 100 000 to 500 000 $
  • 16. HBase
  • 17. Présentation Architecture & API Cas d’utilisations Etude de cas Sécurité Ecosystème Hadoop
  • 19. Présentation “HBase is the Hadoop database. Think of it as a distributed scalable Big Data store” http://hbase.apache.org/ 19
  • 20. Présentation “Project's goal is the hosting of very large tables – billions of rows X millions of columns – atop clusters of commodity hardware” http://hbase.apache.org/ 20
  • 21. Présentation “HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's BigTable” http://hbase.apache.org/ 21
  • 22. La trilogie Google Ò The Google File System http://research.google.com/archive/gfs.html Ò MapReduce: Simplified Data Processing on Large Cluster http://research.google.com/archive/mapreduce.html Ò Bigtable: A Distributed Storage System for Structured Data http://research.google.com/archive/bigtable.html 22
  • 24. Installation / démarrage / arrêt $ mkdir bigdata $ cd bigdata $ wget http://apache.claz.org/hbase/ hbase-0.92.1/hbase-0.92.1.tar.gz … $ tar xvfz hbase-0.92.1.tar.gz … $ export HBASE_HOME=`pwd`/hbase-0.92.1 $ $HBASE_HOME/bin/start-hbase.sh … $ $HBASE_HOME/bin/stop-hbase.sh … 24
  • 25. HBase Shell / exemple de session $ $HBASE_HOME/hbase shell HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell Version 0.92.1, r1298924, Fri Mar 9 16:58:34 UTC 2012 hbase(main):001:0> list TABLE 0 row(s) in 0.5510 seconds hbase(main):002:0> create 'mytable', 'cf' 0 row(s) in 1.1730 seconds hbase(main):003:0> list TABLE mytable 1 row(s) in 0.0080 seconds hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase' 0 row(s) in 0.2070 seconds hbase(main):005:0> get 'mytable', 'first' COLUMN CELL cf:message timestamp=1323483954406, value=hello HBase 1 row(s) in 0.0250 seconds 25
  • 26. HBase Shell / commands
    - General: status, version
    - DDL: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, is_disabled, is_enabled, list, show_filters
    - DML: count, delete, deleteall, get, get_counter, incr, put, scan, truncate
    - Tools: assign, balance_switch, balancer, close_region, compact, flush, hlog_roll, major_compact, move, split, unassign, zk_dump
    - Replication: add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication, stop_replication
    - Security: grant, revoke, user_permission
  • 30. Sample table
    - Data coordinates: [rowkey, column family, column qualifier, timestamp] → value
    - Example: [fr.wikipedia.org/wiki/NoSQL, links, count, 1234567890] → 24
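A toy in-memory model can make the four-part coordinate concrete. The sketch below is illustrative only (class and method names are assumptions, not the HBase implementation): cells live in nested sorted maps keyed by row, family, qualifier, and timestamp, and a read returns the newest version, as a default HBase Get would.

```java
import java.util.*;

// Toy model of HBase cell coordinates:
// [rowkey, family, qualifier, timestamp] -> value
public class CellStoreSketch {
    // row -> family -> qualifier -> timestamp (newest first) -> value
    private final Map<String, Map<String, Map<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    public void put(String row, String family, String qualifier, long ts, String value) {
        rows.computeIfAbsent(row, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .computeIfAbsent(qualifier, k -> new TreeMap<>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Latest version wins, like a default Get (no explicit timestamp)
    public String get(String row, String family, String qualifier) {
        NavigableMap<Long, String> versions = rows
                .getOrDefault(row, Map.of())
                .getOrDefault(family, Map.of())
                .get(qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        CellStoreSketch store = new CellStoreSketch();
        store.put("fr.wikipedia.org/wiki/NoSQL", "links", "count", 1234567890L, "24");
        // prints 24
        System.out.println(store.get("fr.wikipedia.org/wiki/NoSQL", "links", "count"));
    }
}
```

Writing a newer timestamp to the same coordinate shadows the old value on reads while keeping both versions stored, which mirrors HBase's versioning model.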
  • 31. HBase Java API
    - HBaseAdmin, HTableDescriptor, HColumnDescriptor
    HTableDescriptor desc = new HTableDescriptor("TableName");
    HColumnDescriptor cf = new HColumnDescriptor("Family".getBytes());
    desc.addFamily(cf);
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);
  • 32. HBase Java API
    - HTablePool & HTableInterface
    HTablePool pool = new HTablePool();
    HTableInterface table = pool.getTable("TableName");
  • 33. HBase Java API
    - Put
    byte[] cellValue = …
    Put p = new Put("RowKey".getBytes());
    p.add("Family".getBytes(), "Qualifier".getBytes(), cellValue);
    table.put(p);
  • 35. HBase Java API
    - Get
    Get g = new Get("RowKey".getBytes());
    g.addColumn("Family".getBytes(), "Qualifier".getBytes());
    Result r = table.get(g);
    - Result
    byte[] cellValue = r.getValue("Family".getBytes(), "Qualifier".getBytes());
  • 36. HBase Java API
    - Scan
    Scan scan = new Scan();
    scan.setStartRow("StartRow".getBytes());
    scan.setStopRow("StopRow".getBytes());
    scan.addColumn("Family".getBytes(), "Qualifier".getBytes());
    ResultScanner rs = table.getScanner(scan);
    for (Result r : rs) {
      // …
    }
  • 38. Outline
    - Presentation
    - Use cases
    - Architecture
    - Practical case
    - Conclusion
    - References
    - Appendices
    Use cases
  • 44. Creating the wikipedia table
    HTableDescriptor desc = new HTableDescriptor("wikipedia");
    HColumnDescriptor contentFamily = new HColumnDescriptor("content".getBytes());
    contentFamily.setMaxVersions(1);
    desc.addFamily(contentFamily);
    HColumnDescriptor linksFamily = new HColumnDescriptor("links".getBytes());
    linksFamily.setMaxVersions(1);
    desc.addFamily(linksFamily);
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);
  • 45. HBase API: data insertion by the crawlers
    Put p = new Put(Util.toBytes(url));
    p.add("content".getBytes(), "text".getBytes(), htmlParseData.getText().getBytes());
    List<WebURL> links = htmlParseData.getOutgoingUrls();
    int count = 0;
    for (WebURL outGoingURL : links) {
      p.add("links".getBytes(), Bytes.toBytes(count++), outGoingURL.getURL().getBytes());
    }
    p.add("links".getBytes(), "count".getBytes(), Bytes.toBytes(count));
    try {
      table.put(p);
    } catch (IOException e) {
      e.printStackTrace();
    }
  • 46. Inverted index table
  • 48. MapReduce inverted index: algorithm
    method map(url, text)
      for all word ∈ text do
        emit(word, url)
    method reduce(word, urls)
      count ← 0
      for all url ∈ urls do
        put(word, "links", count, url)
        count ← count + 1
      put(word, "links", "count", count)
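Before wiring this into Hadoop, the algorithm can be checked in plain Java. A minimal single-process sketch under stated assumptions (class and method names are illustrative; the real job on the following slides uses TableMapper/TableReducer against HBase): the "map" step emits one (word, url) pair per word, and the grouping a shuffle would do is simulated with an in-memory map.

```java
import java.util.*;

// Single-process sketch of the inverted-index algorithm above.
public class InvertedIndexSketch {

    // pages: url -> page text. Returns: word -> list of urls containing it.
    public static Map<String, List<String>> build(Map<String, String> pages) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            String url = page.getKey();
            // "map": emit (word, url) for every word of the page text;
            // the index map plays the role of the shuffle/"reduce" grouping
            for (String word : page.getValue().toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(url);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("url1", "HBase is the Hadoop database");
        pages.put("url2", "Hadoop ecosystem");
        // prints [url1, url2]
        System.out.println(build(pages).get("hadoop"));
    }
}
```

The per-word url list corresponds to the "links" cells written by the reduce step on slide 51, with the list size playing the role of the "count" cell.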
  • 49. MapReduce inverted index: configuration
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "Reverse Index Job");
      job.setJarByClass(InvertedIndex.class);
      Scan scan = new Scan();
      scan.setCaching(500);
      scan.addColumn("content".getBytes(), "text".getBytes());
      TableMapReduceUtil.initTableMapperJob("wikipedia", scan, Map.class, Text.class, Text.class, job);
      TableMapReduceUtil.initTableReducerJob("wikipedia_ri", Reduce.class, job);
      job.setNumReduceTasks(1);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  • 50. MapReduce inverted index: Map
    public static class Map extends TableMapper<Text, Text> {
      private static final Pattern PATTERN = Pattern.compile(SENTENCE_SPLITTER_REGEX);
      private Text key = new Text();
      private Text value = new Text();

      @Override
      protected void map(ImmutableBytesWritable rowkey, Result result, Context context) {
        byte[] b = result.getValue("content".getBytes(), "text".getBytes());
        String text = Bytes.toString(b);
        String[] words = PATTERN.split(text);
        value.set(result.getRow());
        for (String word : words) {
          key.set(word.getBytes());
          try {
            context.write(key, value);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      }
    }
  • 51. MapReduce inverted index: Reduce
    public static class Reduce extends TableReducer<Text, Text, ImmutableBytesWritable> {
      @Override
      protected void reduce(Text rowkey, Iterable<Text> values, Context context) {
        Put p = new Put(rowkey.getBytes());
        int count = 0;
        for (Text link : values) {
          p.add("links".getBytes(), Bytes.toBytes(count++), link.getBytes());
        }
        p.add("links".getBytes(), "count".getBytes(), Bytes.toBytes(count));
        try {
          context.write(new ImmutableBytesWritable(rowkey.getBytes()), p);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
  • 52. Question: how do you sort search results by order of importance?
  • 56. A bit of linear algebra
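The linear algebra the slide alludes to is presumably PageRank: the importance vector is the dominant eigenvector of the link matrix, computed by power iteration. A minimal sketch under stated assumptions (the 0.85 damping factor and all names are illustrative, not from the slides):

```java
import java.util.Arrays;

// Power-iteration PageRank over a small link graph.
public class PageRankSketch {

    // links[i] lists the page indices that page i points to
    public static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                  // start from the uniform vector
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);  // random-jump term
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // dangling page: spread its rank evenly over all pages
                    for (int j = 0; j < n; j++) next[j] += damping * rank[i] / n;
                } else {
                    for (int j : links[i]) next[j] += damping * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // 0 -> 1 -> 2 -> 0: a symmetric cycle, so every rank stays at 1/3
        int[][] links = { {1}, {2}, {0} };
        double[] r = pageRank(links, 0.85, 50);
        System.out.printf("%.3f %.3f %.3f%n", r[0], r[1], r[2]);
    }
}
```

Each iteration is one multiplication of the rank vector by the (damped) column-stochastic link matrix, which is exactly the eigenvector computation the slide's title hints at.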
  • 59. Digression 1: Google@Stanford, Larry Page and Sergey Brin, 1998
    - 2-proc Pentium II 300 MHz, 512 MB, five 9 GB drives
    - 2-proc Pentium II 300 MHz, 512 MB, four 9 GB drives
    - 4-proc PPC 604 333 MHz, 512 MB, eight 9 GB drives
    - 2-proc UltraSparc II 200 MHz, 256 MB, three 9 GB drives, six 4 GB drives
    - Disk expansion, eight 9 GB drives
    - Disk expansion, ten 9 GB drives
    Totals: CPU 2933 MHz (across 10 CPUs), RAM 1792 MB, HD 366 GB
  • 60. Digression 2: Google, 1999. First production server
  • 61. A formula with no math :-) Data + science + insight = value
  • 64. Outline
    - Presentation
    - Use cases
    - Architecture
    - Practical case
    - Conclusion
    - References
    - Appendices
    Hadoop ecosystem