7. Main Distributions
Distribution  Licence                               Business Model                       Support
Apache        Apache 2.0                            Foundation                           community only
HortonWorks   Apache 2.0, HortonWorks (add-on)      PS + Training + support              community + Professional
Cloudera      Apache 2.0, Closed Source (not core)  PS + Licensing + Training + support  community + Professional
MapR          Apache 2.0, Closed Source (FS)        PS + Licensing + support             community + Professional
WanDisco      Apache 2.0, Closed Source (DConE)     PS + Licensing + Training + support  community + Professional
PS: Professional Services
olivier.varene@orange.com
8. Big Name Distributions
Paid & Closed Source
• IBM InfoSphere BigInsights
• GreenPlum (EMC)
• Intel Distribution for Hadoop
• …
14. Back in Time
3 years ago
• PageRank computation on billions of nodes and tens of billions of edges
• regularly failed (hardware ...)
• 4 to 8 weeks of computation
• unscalable
• failure rate around 80%
• One person full time to supervise
28. Before map()
[Diagram] Data on HDFS → Slicing / Partitioning → Block of Data → (Kin,Vin) → Map() → (Kout,Vout), repeated for each block of data.
JobTracker calculates locality for job assignment and input split data.
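The flow on this slide can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: `split_into_blocks`, `run_map_tasks` and `word_map` are hypothetical names, blocks are cut by line count rather than by byte size as HDFS does, and locality-aware scheduling by the JobTracker is not modelled. Each record is a (Kin, Vin) pair where Kin is the byte offset of the line and Vin the line itself, as with Hadoop's default text input.

```python
from typing import Callable, Iterable, List, Tuple

Record = Tuple[int, str]


def split_into_blocks(lines: List[str], lines_per_block: int) -> List[List[Record]]:
    """Slice the input into blocks of (Kin, Vin) records.

    Kin is the byte offset of the line in the whole input, Vin is the line.
    Assumption: blocks are cut every `lines_per_block` lines for simplicity.
    """
    blocks: List[List[Record]] = []
    current: List[Record] = []
    offset = 0
    for line in lines:
        current.append((offset, line))
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline
        if len(current) == lines_per_block:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def run_map_tasks(blocks: List[List[Record]],
                  map_fn: Callable[[int, str], Iterable[tuple]]) -> List[List[tuple]]:
    """Run one map task per block; each task emits (Kout, Vout) pairs."""
    return [[pair for kin, vin in block for pair in map_fn(kin, vin)]
            for block in blocks]


def word_map(kin: int, vin: str):
    """Example map(): emit (word, 1) for every word of the line."""
    for word in vin.split():
        yield (word, 1)


blocks = split_into_blocks(["a b", "b c", "c"], lines_per_block=2)
outputs = run_map_tasks(blocks, word_map)
```

Here `blocks` holds two blocks (two lines, then one line) and `outputs` holds one list of (Kout, Vout) pairs per block.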
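29. Java (API)
Mapper
// Kin, Vin, Kout, Vout are placeholders for your key/value types
class YourMapper extends Mapper<Kin, Vin, Kout, Vout> {
    // optional: protected void setup(Context context)
    // optional: protected void cleanup(Context context)
    @Override
    protected void map(Kin key, Vin value, Context context)
            throws IOException, InterruptedException {
        // ... your program ...
    }
}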
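To make the call order of the three methods concrete, here is a small Python emulation of the mapper lifecycle: setup once, map once per input record, cleanup once. It is a sketch only; the `Context`, `Mapper` and `LineLengthMapper` classes below are hypothetical stand-ins written for this illustration and are not the Hadoop API.

```python
class Context:
    """Hypothetical stand-in for the job context: collects emitted pairs."""

    def __init__(self):
        self.pairs = []

    def write(self, key, value):
        self.pairs.append((key, value))


class Mapper:
    """Emulates the lifecycle: setup(), then map() per record, then cleanup()."""

    def setup(self, context):
        pass  # optional hook, called once before the first record

    def map(self, key, value, context):
        raise NotImplementedError

    def cleanup(self, context):
        pass  # optional hook, called once after the last record

    def run(self, records):
        context = Context()
        self.setup(context)
        for key, value in records:
            self.map(key, value, context)
        self.cleanup(context)
        return context.pairs


class LineLengthMapper(Mapper):
    """Example: emit (line length, 1) per record and a record count at the end."""

    def setup(self, context):
        self.seen = 0

    def map(self, key, value, context):
        self.seen += 1
        context.write(len(value), 1)

    def cleanup(self, context):
        context.write("records", self.seen)


pairs = LineLengthMapper().run([(0, "abc"), (4, "de")])
```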
30. Before reduce()
[Diagram] Map() → (Kout,Vout) → RAM sorting → disk write → temporary intermediate files (sorted in each file)
→ OPTIONAL: Combine(), run 1 or more times → (Kout,Vout) → temporary intermediate files
→ Partition(): key namespace partitioning into parts → JobTracker distribution to reducers
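The stages named on this slide (sorting, optional Combine(), Partition()) can be illustrated with a short Python sketch. It follows the order of the diagram labels as a simplification and keeps everything in memory instead of spilling files to disk. The function names are hypothetical, the combiner assumes values that can be summed, and `zlib.crc32` is used as a stable stand-in for the key hash, which is an assumption of this example rather than what Hadoop uses.

```python
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def combine(sorted_pairs):
    """Optional combiner: pre-aggregate values of identical keys (here, a sum)."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(value for _, value in group))


def partition(key, num_reducers):
    """Map a key to a reducer index; crc32 is a stand-in hash (assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Sort map output, combine it, then split the key namespace into parts."""
    spilled = sorted(map_output, key=itemgetter(0))  # "RAM sorting"
    combined = list(combine(spilled))                # "Combine()"
    parts = defaultdict(list)
    for key, value in combined:                      # "Partition()"
        parts[partition(key, num_reducers)].append((key, value))
    return dict(parts)


parts = shuffle([("b", 1), ("a", 1), ("b", 1), ("c", 1)], num_reducers=2)
```

Every key ends up in exactly one part, and each part stays sorted by key, which is what lets a reducer process its keys in order.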
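31. Java (API)
Reducer
// Kin, Vin, Kout, Vout are placeholders for your key/value types
class YourReducer extends Reducer<Kin, Vin, Kout, Vout> {
    // optional: protected void setup(Context context)
    // optional: protected void cleanup(Context context)
    @Override
    protected void reduce(Kin key, Iterable<Vin> values, Context context)
            throws IOException, InterruptedException {
        // ... your program ...
    }
}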
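The key point of reduce() is that it receives one key together with all the values emitted for that key. A minimal Python sketch of that grouping, with hypothetical helper names `run_reduce` and `sum_reduce` and an in-memory sort standing in for the shuffle:

```python
from itertools import groupby
from operator import itemgetter


def run_reduce(pairs, reduce_fn):
    """Group (Kin, Vin) pairs by key and call reduce_fn once per key."""
    out = []
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = (value for _, value in group)  # all values for this key
        out.extend(reduce_fn(key, values))
    return out


def sum_reduce(key, values):
    """Example reduce(): emit the sum of all values seen for the key."""
    yield (key, sum(values))


result = run_reduce([("b", 2), ("a", 1), ("b", 1)], sum_reduce)
```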
33. Streaming
… | <mapper> | … | <reducer> | …
• STDIN
• STDOUT
• Text as input and output by default
• '\t' (tab) as default separator
• Use your language: perl, python, shell, ruby, …
• (interpreter needed on all nodes)
hadoop jar $streamingJar -input <inputDir> -output <outputDir> -mapper <mapProg> -reducer <reduceProg> -file <files>
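As an illustration of the streaming contract (text lines in on STDIN, tab-separated key/value lines out on STDOUT), here is a word-count mapper and reducer sketched in Python. The function names and the `map` command-line argument are assumptions of this example; the reducer relies on its input arriving sorted by key, which the framework provides between the two steps.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts of consecutive lines sharing the same key.

    Assumes the input lines are sorted by key.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(argv, stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for a script: 'map' selects the mapper, anything else the reducer."""
    step = mapper if argv[1:] == ["map"] else reducer
    for out in step(stdin):
        stdout.write(out + "\n")


# Local dry run: sorted() plays the role of the framework's sort between steps.
mapped = sorted(mapper(["a b a", "b"]))
reduced = list(reducer(mapped))
```

A script built this way would call `main(sys.argv)` and be passed as `<mapProg>` and `<reduceProg>` in the command above.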
34. Pipes - C++
… | <mapper> | … | <reducer> | …
• Socket communication
• Bytes as input and output
• C++ API
class MyMap: public HadoopPipes::Mapper { … }
class MyReducer: public HadoopPipes::Reducer { … }
hadoop fs -put <binFile> <toHDFS…>
hadoop pipes -input <inputDir> -output <outputDir> -program <path/binfile> [-conf <confFile>]
35. Too difficult
Fortunately, there are tools that can generate code for you or let you run SQL queries!
[Diagram labels] Tools, Algo / Libs
36. PIG
Scripting Language:
• Simple
• Parallel execution
• Data oriented
• Extensible via UDF
• Automatic performance enhancement via compiler

set job.name calculateGraphDegres
%default nbpigreducers 10
set default_parallel $nbpigreducers

-- out-degrees
A = load '$degout' using PigStorage() as (url:chararray,out_deg:int);
-- keep entries where out_deg > 1
A2 = filter A by (out_deg > 1);
B = order A2 by out_deg DESC;
store B into '$degoutOrdered';

-- distribution of out-degrees
C = foreach A generate out_deg,1 as deg_occ;
D = group C by out_deg;
E = foreach D generate FLATTEN(group) as out_deg,SUM(C.deg_occ) as deg_occ;
F = order E by out_deg ASC;
store F into '$degoutDistrib';
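For readers new to Pig, the two parts of this script can be mirrored in plain Python on a small in-memory list. This is only a sketch of the logic (filter, order, group and count); the function names and sample rows are made up for illustration, and none of Pig's parallel execution is represented.

```python
from collections import Counter


def order_by_out_degree(rows):
    """Keep rows with out_deg > 1 and sort them by out_deg, descending."""
    kept = [(url, out_deg) for url, out_deg in rows if out_deg > 1]
    return sorted(kept, key=lambda row: row[1], reverse=True)


def out_degree_distribution(rows):
    """Count how many urls have each out_deg, sorted by out_deg ascending."""
    counts = Counter(out_deg for _, out_deg in rows)
    return sorted(counts.items())


# Hypothetical sample data: (url, out_deg)
rows = [("u1", 3), ("u2", 1), ("u3", 3), ("u4", 2)]
ordered = order_by_out_degree(rows)
distribution = out_degree_distribution(rows)
```

As in the script, the distribution is computed on all rows, not only on the filtered ones.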
37. Hive
Querying Language:
• HiveQL (SQL-like)
• ETL Tool
• HDFS, HBase, Thrift …
• MapReduce interface (with streaming to python …)
• Extensible via UDF

CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "|"
LOCATION 'b-file/input/';

CREATE EXTERNAL TABLE b_packet_out (protocol string, cnt int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION 'b-file/output/1/';

INSERT INTO TABLE b_packet_out
select count(*) as overall,
  sum(if(protocol rlike '^ip:tcp',1,0)) as tcp,
  sum(if(protocol rlike '^ip:udp',1,0)) as udp,
  sum(if(protocol rlike '^ip:icmp',1,0)) as icmp
from b_packet;
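The aggregation in this query (a total count plus one conditional count per protocol prefix) can be sketched in Python as follows. The parsing mirrors the "|"-delimited input table; `parse_line`, `protocol_counts` and the sample lines are hypothetical, and the patterns are treated as anchored prefixes such as '^ip:tcp'.

```python
import re


def parse_line(line):
    """Split a '|'-delimited record into (timestamp, packet_length, protocol)."""
    timestamp, packet_length, protocol = line.rstrip("\n").split("|")
    return timestamp, int(packet_length), protocol


def protocol_counts(packets):
    """Count all packets and those whose protocol starts with ip:tcp/udp/icmp."""
    counts = {"overall": 0, "tcp": 0, "udp": 0, "icmp": 0}
    for _timestamp, _length, protocol in packets:
        counts["overall"] += 1
        for name in ("tcp", "udp", "icmp"):
            if re.match(rf"^ip:{name}", protocol):
                counts[name] += 1
    return counts


# Hypothetical sample records
lines = ["1|60|ip:tcp:http", "2|80|ip:udp:dns", "3|40|arp"]
counts = protocol_counts(parse_line(line) for line in lines)
```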
44. In Production
• Since 2010
• Growth driven by the needs of internal projects
• Recycling servers (€€ savings)
• We learned as we went:
  * tar -> cdh3 -> cdh4 …
  * optimizations
  * run processes …
45. Production « PFS »
• Shared among different teams
• xx nodes on COTS
• xxx TBytes
• > xxx jobs per day
• Monitoring: Xymon
• Graphing via NetStat (SNMP / RRD: x’s oids/second)
• Automatic configuration
46. Architecture
[Diagram] Components shown: App Services, HIVE Server, HIVE, PIG, Web Service, Oozie, Real Time Query Engine, R, Flume, Sqoop, HCatalog, Cassandra, HBase, MapReduce, Mahout, Khiops, ZooKeeper, Cascading, HDFS.
Legend: some components are marked "in POC".
47. Benefits
• Infrastructure cost
• Development cost
• Robustness
• Scalability
• New development areas (Graph Mining, Logs statistics …)
Figures shown: -70% LOC, -50% dev time, -75% run cost
48. A few of our use cases
49. Scoring - Search Engine
Graph algorithms for http://www.lemoteur.fr/
xRank
xxx TB in RAM
xx TB compressed
xx billion nodes
> xxx billion edges
53. Benefits & Drawbacks
Benefits: Scalable, Stable, RUN Cost, Development Cost, Performance, Very fast evolution, New Dev areas
Drawbacks: Learning curve, Debug, Algorithms, Complex, Very fast evolution
54. Future
• Enhance security and robustness
• Create Services & Functional Catalog
• Continue building our expertise: Fast Data, Cascading, MR2, …
• A thousand-node cluster!
• Help other teams move to production
CONTACT US: olivier.varene@orange.com
56. My Thanks to
• Apache http://www.apache.org/
• http://hadooper.blogspot.com/
• Cloudera http://www.cloudera.com/
• HortonWorks http://www.hortonworks.com/
• And all the community!