Too much Data!
                             Sven Meys




Saturday 9 February 13
Topic

    On-demand Information Extraction from Remote Sensing Images with MapReduce

Contents

    • Context
    • Literature study
    • Planning

Context

    • VITO
    • Remote Sensing
    • Problem statement
    • Research questions

[Chart: 700; €103 million; shares of 16% and 84%; legend: Government, Private]

[Organisation chart. Themes: Energy; Industrial Innovation; Quality of Environment.
 Units: Energy Technology; Transition Energy & Environment; Environmental Analysis & Technology;
 Material Technology; Separation & Conversion Technology; Remote Sensing;
 Environmental Modelling; Environmental Health]

Remote Sensing




 1 km² per pixel
 0.5 billion pixels
 1.2 GB

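As a rough sanity check on these figures, here is a back-of-the-envelope sketch. It assumes the slide describes a single global image at 1 km² per pixel, and it takes the Earth's surface as roughly 510 million km², a value that is not on the slide.

```python
# Back-of-the-envelope check of the slide's figures (assumptions are marked).
EARTH_SURFACE_KM2 = 510e6   # assumption: approximate total surface of the Earth
PIXEL_AREA_KM2 = 1.0        # slide, read as 1 km² per pixel
SLIDE_PIXELS = 0.5e9        # slide: 0.5 billion pixels
IMAGE_SIZE_BYTES = 1.2e9    # slide: 1.2 GB (decimal gigabytes assumed)

# Number of pixels needed to cover the whole globe at this resolution.
pixels = EARTH_SURFACE_KM2 / PIXEL_AREA_KM2

# Average storage per pixel implied by the slide's own numbers.
bytes_per_pixel = IMAGE_SIZE_BYTES / SLIDE_PIXELS

print(f"{pixels / 1e9:.2f} billion pixels")
print(f"{bytes_per_pixel:.1f} bytes per pixel")
```

Under these assumptions the pixel count comes out close to the slide's 0.5 billion, and the file size corresponds to a little over two bytes per pixel.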
RS Applications

[Request form]
    Time series: 01-01-2001 to 01-01-2012
    Algorithm:   NDVI
    Output:      Mean
    [SUBMIT]

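The request on this form (mean NDVI over a time series) could be computed along the following lines. This is a minimal sketch, not the presenter's implementation: it uses the standard definition NDVI = (NIR - Red) / (NIR + Red), and the function names and the tiny sample data are made up for illustration.

```python
def ndvi(nir, red):
    """NDVI for one pixel: (NIR - Red) / (NIR + Red); 0.0 when both bands are 0."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def mean_ndvi(series):
    """Per-pixel mean NDVI over a time series of (nir_band, red_band) images.

    Each band is a flat list of reflectance values, one entry per pixel.
    """
    n_pixels = len(series[0][0])
    sums = [0.0] * n_pixels
    for nir_band, red_band in series:
        for i, (nir, red) in enumerate(zip(nir_band, red_band)):
            sums[i] += ndvi(nir, red)
    return [s / len(series) for s in sums]


# Two made-up "images" of three pixels each (hypothetical values).
series = [
    ([0.8, 0.5, 0.2], [0.2, 0.5, 0.2]),
    ([0.6, 0.5, 0.0], [0.2, 0.5, 0.0]),
]
result = mean_ndvi(series)
print(result)
```

Because each pixel's mean depends only on that pixel's values over time, the work splits naturally across pixels or tiles.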
Problem statement

    Better sensors    ->  Better images, more information
    More data         ->  More expensive storage, data transport
    More computation  ->  Expensive supercomputers, parallel processing

Goals

    • Fast enough
    • Affordable
    • Scalable

    File system + software framework

Research questions

    • How can large satellite images be stored in HDFS so that they can be
      processed efficiently in parallel?
    • Which algorithms can be used with this storage technique and MapReduce?

Literature study

    • Interesting projects
    • HDFS
    • MapReduce
    • Implementations
    • Distributions
    • Current literature

Interesting projects

    • NASA (12)
         • Center for Climate Simulation
         • Square Kilometer Array: 700 TB/sec
    • Open Cloud Consortium (13)
         • Project Matsu: Elastic Clouds for Disaster Relief
    • Large Hadron Collider (14)
         • 20 PB/year

HDFS

    • Distributed file system
    • Based on the Google File System (1)
    • Large blocks (128 MiB)
    • Commodity hardware
    • Failure is the norm
    • Read & append (1)

    [Diagram: blocks 1, 2, ..., n]

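To give a feel for what 128 MiB blocks mean in practice, the sketch below splits a file of a given size into block-sized byte ranges. It only illustrates the arithmetic and does not call any HDFS API. The 1.2 GB file size is borrowed from the remote sensing slide and is assumed to be decimal gigabytes; the function name is hypothetical.

```python
BLOCK_SIZE = 128 * 1024**2  # 128 MiB, the block size named on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block; the last block may be shorter."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


# Assumption: one 1.2 GB image, taken as 1,200,000,000 bytes.
blocks = split_into_blocks(1_200_000_000)
print(len(blocks), "blocks; last block:", blocks[-1])
```

An image of this size spans several blocks, so a pixel neighbourhood can straddle a block boundary, which is part of what the first research question has to deal with.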
HDFS

    [Cropped excerpt] "A DFS usually accounts for transparent file replication and fault to…
    …bles data locality for processing tasks. A DFS does this by subdividin…
    …ese blocks within a cluster of computers. Figure 2 shows the distrib…
    of a file (left) subdivided into three blocks."

    [Figure 2: File blocks, distribution and replication in a distributed file system]

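The idea in Figure 2 (a file cut into three blocks, each stored on several nodes) can be sketched as follows. This is a deliberately simplified round-robin placement, not HDFS's actual placement policy, which also takes rack topology into account. The node names and the function are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified illustration only: real HDFS placement is rack-aware.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block in range(1, num_blocks + 1):
        start = (block - 1) % len(nodes)
        placement[block] = [
            nodes[(start + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical four-node cluster holding a three-block file.
placement = place_blocks(3, ["node-a", "node-b", "node-c", "node-d"])
for block, holders in placement.items():
    print(f"block {block}: {holders}")
```

Every block ends up on three different nodes, so a processing task can usually be scheduled on a node that already holds the data it needs.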
HDFS

    [Figure 4: Block assembly for data retrieval from the distributed file system]

HDFS

    [Cropped excerpt] "…rates how the file system handles node-failure by automated recov…
    HDFS further uses checksums to verify block integrity. As long as th…
    …ccessible copy of a block, it can automatically re-replicate to return…
    …tion rate."

    [Figure 3: Automatic repair in case of cluster node failure by additional replication]

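The two mechanisms on this slide, checksum verification and re-replication after a node failure, can be illustrated with a small simulation. This is a sketch under stated assumptions: `zlib.crc32` stands in for whatever checksum the file system uses, and the placement map, node names and `re_replicate` function are hypothetical and not part of any Hadoop API.

```python
import zlib


def checksum(data: bytes) -> int:
    """CRC32 of a block's bytes (stand-in for the file system's checksum)."""
    return zlib.crc32(data)


def re_replicate(placement, failed, live_nodes, replication=3):
    """Drop the failed node and copy under-replicated blocks to other live nodes."""
    repaired = {}
    for block, holders in placement.items():
        survivors = [n for n in holders if n != failed]
        if not survivors:
            raise RuntimeError(f"block {block} has no accessible copy left")
        candidates = [n for n in live_nodes if n not in survivors]
        needed = replication - len(survivors)
        repaired[block] = survivors + candidates[:needed]
    return repaired


# Integrity check: a stored checksum exposes a corrupted copy.
stored = checksum(b"block data")
intact_ok = checksum(b"block data") == stored
corrupt_ok = checksum(b"block dat4") == stored

# Node "c" fails; every block it held is copied to another live node.
before = {1: ["a", "b", "c"], 2: ["b", "c", "d"], 3: ["c", "d", "a"]}
after = re_replicate(before, failed="c", live_nodes=["a", "b", "d", "e"])
print(intact_ok, corrupt_ok, after)
```

As long as one valid copy of a block survives, the replication level can be restored without user intervention.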
HDFS - Overview

    • Scalable
    • Fast reads and writes
    • Robust
    • A factor of 10 cheaper (2)

MapReduce




MapReduce - WordCount
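
The slide's diagram is not preserved in this transcript. The classic WordCount example can be sketched in plain Python as below: an in-memory imitation of the map, shuffle and reduce steps rather than Hadoop code, with made-up function names and input lines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


lines = ["too much data", "too much computation"]  # made-up input
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(word, values) for word, values in shuffle(pairs).items())
print(counts)
```

Each map call sees only one line and each reduce call only one key, which is what lets the framework run them in parallel on different nodes.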




MapReduce - Overview

    • Based on Google MapReduce (3)
    • Data locality
    • Key/value pairs
    • Very fast
    • A different way of thinking

Implementations

                  Hadoop   Stratosphere   HPCC
    Support         +           -          +
    Extensions      +           -          ?
    Community      +++         +/-         -
    Target         ANY         EDU         BI

    • Apache Software Foundation
    • Others: outdated, commercial, little support (4-6)

Distributions

    • Hortonworks (7)
    • MapR (8)
    • Cloudera: Cloudera Manager (9)
         • Web interface
         • 1-click install (yeah right...)
         • Interesting licensing model

General

    • Mostly text processing
    • For small images (10)
    • Little detail
    • Commercial (11)

Planning

    [Gantt chart]
    Tasks:            literature, phase 1, phase 2, phase 3, phase 4
    Timeline dates:   01/09, 01/02, 15/03, 20/05
    Timeline markers: internship, today, submit report, master's thesis

Phase 1 - Done

    [Cluster diagram]

    Node      IP address        Roles
    Master    192.168.10.245    JT, NN
    Bruno     192.168.10.246    TT, DN
    Tim       192.168.10.247    TT, DN
    Sven      192.168.10.248    TT, DN
    Patrick   192.168.10.249    TT, DN

    JT = Job Tracker     NN = Name Node
    TT = Task Tracker    DN = Data Node
    Icons: RedHat 6.2 workstation, RedHat 6.2 virtual machine

Phase 2

    • Simple algorithm
    • Rotate image
    • Standard IO
    • HDFS

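The kind of simple algorithm meant here might look like the sketch below, which rotates a small in-memory image by 90° clockwise. The angle, the function name and the nested-list image representation are assumptions; the slide only says "rotate image".

```python
def rotate_clockwise(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    # Reversing the row order and then transposing yields a clockwise rotation.
    return [list(row) for row in zip(*image[::-1])]


image = [
    [1, 2, 3],
    [4, 5, 6],
]
rotated = rotate_clockwise(image)
print(rotated)
```

Note that a rotation moves every pixel to a different position, so on a tiled image each output tile draws on a different input tile.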
Phase 3

    • More complexity: MapReduce
    • Spatial: convolution mask, ROI
    • Temporal/spectral: multiple images

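A convolution mask is a useful test case because each output pixel depends on its neighbours. The sketch below applies a square mask to the interior of a small image. It is an illustration only: the function name and sample data are made up, border pixels are simply left unchanged, and the mask is not flipped, which makes no difference for symmetric masks such as the mean filter used here.

```python
def apply_mask(image, mask):
    """Apply a square, odd-sized mask to interior pixels; borders stay unchanged."""
    size = len(mask)
    k = size // 2
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(k, height - k):
        for x in range(k, width - k):
            # Weighted sum over the neighbourhood centred on (y, x).
            out[y][x] = sum(
                mask[j][i] * image[y + j - k][x + i - k]
                for j in range(size)
                for i in range(size)
            )
    return out


mean_mask = [[1 / 9] * 3 for _ in range(3)]   # 3x3 mean filter
image = [
    [1, 1, 1],
    [1, 10, 1],
    [1, 1, 1],
]
smoothed = apply_mask(image, mean_mask)
print(smoothed[1][1])
```

Since a pixel at the edge of a tile needs values from the neighbouring tile, spatial operations are harder to distribute than per-pixel ones.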
Phase 4

    • Performance as a function of pixel distance

The End

    • Lots of data
    • Thinking differently
    • Many possibilities
         • RLZ or a new elective course on Big Data? ;)
         • MapReduce + OpenCL?
    • Many challenges
    • Many questions

References
  (1) Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003), ‘The Google File System’
  (2) Krishnan, S., Baru, C. and Crosby, C. (2010), ‘Evaluation of MapReduce for gridding LIDAR data’
  (3) Dean, J. and Ghemawat, S. (2004), ‘MapReduce: simplified data processing on large clusters’
  (4) http://hadoop.apache.org/
  (5) Warneke, D. and Kao, O. (2009), ‘Nephele: Efficient parallel data processing in the cloud’, http://www.stratosphere.eu
  (6) http://hpccsystems.com/
  (7) http://hortonworks.com/
  (8) http://mapr.com/
  (9) http://cloudera.com/
  (10) Sweeney, C. (2011), ‘HIPI: Hadoop image processing interface for image-based MapReduce’
  (11) Guinan, O. (2011), ‘Indexing the earth - large scale satellite image processing using Hadoop’, http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-indexing-the-earth-large-scale-satellite-image-processing-using-hadoop.htmt
  (12) Q. Duffy, D. (2013), ‘Untangling the computing landscape for NASA climate simulations’. URL: http://www.nas.nasa.gov/SC12/demos/demo20.html
  (13) http://www.slideshare.net/rgrossman/project-matsu-elastic-clouds-for-disaster-relief
  (14) Lassnig, M., Garonne, V., Dimitrov, G. and Canali, L. (2012), ‘ATLAS data management accounting with Hadoop, Pig and HBase’