15. Problem statement
Better sensors → better images → more information → more data
More data → more expensive storage, data transport, and more computation
More computation → expensive supercomputers
→ Parallel processing
Saturday 9 February 13
16. Objectives
• Fast enough
• Affordable
• Scalable
File system + software framework
17. Research questions
• How can large satellite images be stored in an HDFS file system so that they can be processed efficiently in parallel?
• Which algorithms can be used with this storage technique and MapReduce?
19. Literature study
• Interesting projects
• HDFS
• MapReduce
• Implementations
• Distributions
• Current literature
20. Interesting projects
• NASA (12)
• Center for Climate Simulation
• Square Kilometer Array: 700 TB/sec
• Open Cloud Consortium (13)
• Project Matsu: Elastic Clouds for Disaster Relief
• ATLAS: Large Hadron Collider (14)
• 20 PB/year
21. HDFS
• Distributed file system
• Based on the Google File System (1)
• Large blocks (128 MiB)
• Commodity hardware
• Failure is the norm
• Read & append
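The 128 MiB block size above determines how a large satellite image is cut up before it is distributed over the cluster. A minimal sketch of that bookkeeping in Python (the helper function and its output format are illustrative, not part of HDFS):

```python
# Sketch: splitting a file into fixed-size blocks, as HDFS does.
# The 128 MiB block size matches the slide; everything else is illustrative.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (block_id, offset, length) tuples covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), offset, length))
        offset += length
    return blocks

# A 300 MiB satellite image yields three blocks: 128 + 128 + 44 MiB.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                      # 3
print(blocks[-1][2] // (1024 * 1024))   # 44
```

Only the last block is smaller than the block size; every other block is exactly 128 MiB, which is what makes parallel per-block processing straightforward.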
22. HDFS
A DFS usually accounts for transparent file replication and fault tolerance, and enables data locality for processing tasks. A DFS does this by subdividing files into blocks and distributing these blocks within a cluster of computers. Figure 2 shows the distribution and replication of a file (left) subdivided into three blocks.
[Figure 2: File blocks, distribution and replication in a distributed file system]
23. HDFS
[Figure 4: Block assembly for data retrieval from the distributed file system]
24. HDFS
Figure 3 demonstrates how the file system handles node failure by automated recovery. HDFS further uses checksums to verify block integrity. As long as there is an accessible copy of a block, it can automatically re-replicate to return to the target replication rate.
[Figure 3: Automatic repair in case of cluster node failure by additional replication]
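The automatic repair of Figure 3 can be sketched as a small simulation (the replication factor of 3 and the node names are illustrative; real HDFS also takes rack placement into account):

```python
# Sketch: HDFS-style re-replication after a node failure.
# Replication factor and cluster layout are illustrative.
REPLICATION = 3

def rereplicate(placement, failed_node, nodes):
    """placement maps block id -> set of nodes holding a copy.
    Drop the failed node, then copy under-replicated blocks onto
    surviving nodes until each block is back at REPLICATION copies."""
    alive = [n for n in nodes if n != failed_node]
    for block, holders in placement.items():
        holders.discard(failed_node)
        if not holders:
            raise RuntimeError(f"block {block} lost: no copies remain")
        for node in alive:
            if len(holders) >= REPLICATION:
                break
            holders.add(node)  # set add: no-op if node already holds it
    return placement

nodes = ["n1", "n2", "n3", "n4"]
placement = {1: {"n1", "n2", "n3"}, 2: {"n2", "n3", "n4"}}
rereplicate(placement, "n2", nodes)
print(sorted(placement[1]))  # ['n1', 'n3', 'n4']
```

As long as at least one copy of each block survives, the cluster converges back to the target replication rate; only the simultaneous loss of all replicas of a block is unrecoverable.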
25. HDFS - Overview
• Scalable
• Fast reads/writes
• Robust
• A factor of 10 cheaper (2)
28. MapReduce - Overview
• Based on Google MapReduce (3)
• Data locality
• Key/value pairs
• Very fast
• A different way of thinking
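The key/value model above can be illustrated with the classic word-count example in plain, single-process Python (the function names are our own, not the Hadoop API):

```python
# Sketch: the MapReduce key/value flow, single-process.
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs: one (word, 1) per word."""
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    """Combine all values that share one key."""
    return (key, sum(values))

records = ["big data big images", "big clusters"]

# Shuffle: group every emitted value by its key.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

result = dict(reduce_phase(k, v) for k, v in groups.items())
print(result)  # {'big': 3, 'data': 1, 'images': 1, 'clusters': 1}
```

The "different way of thinking" is exactly this: the mapper never sees the whole dataset and the reducer never sees the raw input, so both phases parallelize trivially over blocks and keys.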
29. Implementations
            Hadoop   Stratosphere   HPCC
Support       +           -           +
Extensions    +           -           ?
Community    +++         +/-          -
Target       ANY         EDU          BI
• Apache Software Foundation
• Others: outdated, commercial, little support (4-6)
30. Distributions
• Hortonworks (7)
• MapR (8)
• Cloudera: Cloudera Manager (9)
• Web interface
• 1-click install (yeah right...)
• Interesting licensing model
31. General
• Mostly text processing
• For small images (10)
• Little detail
• Commercial (11)
33. Planning
literatuur
fase 1
fase 2
fase 3
fase 4
01 01 15 20
/09 /02 / 03 /0
5
verslag
stage vandaag inleveren
masterproef
34. Phase 1 - Done
Master (192.168.10.245): JT + NN
Workstations Sven (192.168.10.248), Bruno (192.168.10.246), Tim (192.168.10.247), Patrick (192.168.10.249): TT + DN each
All nodes run RedHat 6.2 (master as a virtual machine, the others as workstations)
JT = Job Tracker, NN = Name Node, TT = Task Tracker, DN = Data Node
35. Phase 2
• Simple algorithm
• Rotate an image
• Standard I/O
• HDFS
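The phase 2 rotation step can be sketched in plain Python on a small pixel grid, leaving out the HDFS I/O plumbing (the grid values are illustrative):

```python
def rotate90(image):
    """Rotate a row-major pixel grid 90 degrees clockwise.
    Each inner list is one image row."""
    # The first output row is the first input column, read bottom-up.
    return [list(row) for row in zip(*image[::-1])]

image = [
    [1, 2],
    [3, 4],
    [5, 6],
]
print(rotate90(image))  # [[5, 3, 1], [6, 4, 2]]
```

Rotation is a good first test precisely because each output pixel depends on exactly one input pixel, so it stresses the storage and I/O path rather than the algorithm.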
36. Phase 3
• More complexity: MapReduce
• Spatial: convolution mask, ROI
• Temporal/spectral: multiple images
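A convolution mask as planned for phase 3 can be sketched in plain Python (the mask and the pixel patch are illustrative; a real job would run this per image tile inside a map task):

```python
def convolve(image, mask):
    """Apply a square convolution mask to a 2D pixel grid.
    Borders are skipped, so the output shrinks by the mask
    radius on each side."""
    k = len(mask) // 2          # mask radius
    h, w = len(image), len(image[0])
    out = []
    for y in range(k, h - k):
        row = []
        for x in range(k, w - k):
            acc = 0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    acc += image[y + dy][x + dx] * mask[dy + k][dx + k]
            row.append(acc)
        out.append(row)
    return out

# 3x3 sum filter on a uniform 3x3 patch: 9 pixels of value 2 -> 18.
ones = [[1] * 3 for _ in range(3)]
patch = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
print(convolve(patch, ones))  # [[18]]
```

The mask radius is also where the data-locality question of phase 4 shows up: pixels near a block boundary need neighbours stored in an adjacent block, so tiles must overlap by the mask radius.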
37. Phase 4
• Performance as a function of pixel distance
38. Planning
[Gantt chart: literature study, phases 1-4; timeline ticks at 01/09, 01/02, 15/03, 20/05; markers: internship report, today, master's thesis hand-in]
39. The End
• Lots of data
• A different way of thinking
• Many possibilities
• RLZ or a new Big Data elective? ;)
• MapReduce + OpenCL?
• Many challenges
• Many questions
40. References
(1) Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003), ‘The Google File System’
(2) Krishnan, S., Baru, C. and Crosby, C. (2010), ‘Evaluation of MapReduce for gridding LIDAR data’
(3) Dean, J. and Ghemawat, S. (2004), ‘MapReduce: Simplified data processing on large clusters’
(4) http://hadoop.apache.org/
(5) Warneke, D. and Kao, O. (2009), ‘Nephele: Efficient parallel data processing in the cloud’, http://www.stratosphere.eu
(6) http://hpccsystems.com/
(7) http://hortonworks.com/
(8) http://mapr.com/
(9) http://cloudera.com/
(10) Sweeney, C. (2011), ‘HIPI: Hadoop Image Processing Interface for image-based MapReduce’
(11) Guinan, O. (2011), ‘Indexing the Earth - large scale satellite image processing using Hadoop’, http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-indexing-the-earth-large-scale-satellite-image-processing-using-hadoop.htmt
(12) Duffy, D. Q. (2013), ‘Untangling the computing landscape for NASA climate simulations’. URL: http://www.nas.nasa.gov/SC12/demos/demo20.html
(13) http://www.slideshare.net/rgrossman/project-matsu-elastic-clouds-for-disaster-relief
(14) Lassnig, M., Garonne, V., Dimitrov, G. and Canali, L. (2012), ‘ATLAS data management accounting with Hadoop Pig and HBase’