MAKING BIG DATA, SMALL
Using distributed systems for processing, analysing and managing
huge data sets


    Marcin Jedyk
    Software Professional’s Network, Cheshire Datasystems Ltd
WARM-UP QUESTIONS
 How many of you have heard about Big Data before?
 How many about NoSQL?
 Hadoop?
AGENDA
 Intro – motivation, goal and ‘not about…’
 What is Big Data?
 NoSQL and systems classification
 Hadoop & HDFS
 MapReduce & live demo
 HBase
AGENDA
 Pig
 Building Hadoop cluster
 Conclusions
 Q&A
MOTIVATION
 Data is everywhere – why not analyse it?
 With Hadoop and NoSQL systems, building distributed systems is easier than before
 Relying on software & cheap hardware rather than expensive hardware works better!
MOTIVATION
GOAL
 To explain basic ideas behind Big Data
 To present different approaches towards BD
 To show that Big Data systems are easy to build
 To show you where to start with such systems
WHAT IS IT NOT ABOUT?
 Not a detailed lecture on a single system
 Not about advanced techniques in Big Data
 Not only about technology – but also about its application
WHAT IS BIG DATA?
    Data characterised by 3 Vs:
      Volume
      Variety
      Velocity
    The interesting ones: variety & velocity
WHAT IS BIG DATA
 Data of high velocity: cannot store it? Process it on the fly!
 Data of high variety: doesn't fit into a relational schema? Don't use a schema, use NoSQL!
 Data which is impractical to process on a single server
NO-SQL
 Goes hand in hand with Big Data
 NoSQL – an umbrella term for non-relational databases and data stores
 It's not always possible to replace an RDBMS with NoSQL! (the opposite is also true)
NO-SQL
    NoSQL DBs are built around different principles
      Key-value stores: e.g. Redis, Riak
      Document stores: e.g. MongoDB – each record is a document; each entry has its own metadata (JSON-like, BSON)
      Table stores: e.g. HBase – data persisted in multiple columns (even millions), billions of rows and multiple versions of records
HADOOP
 Existed before the 'Big Data' buzzword emerged
 A simple idea – MapReduce
 A primary purpose – to crunch tera- and petabytes of data
 HDFS as the underlying distributed file system
HADOOP – ARCHITECTURE BY EXAMPLE
 Imagine you need to process 1TB of logs
 What would you need?
 A server!
HADOOP – ARCHITECTURE BY EXAMPLE
 But 1TB is quite a lot of data… we want results quicker!
 OK, what about a distributed environment?
HADOOP – ARCHITECTURE BY EXAMPLE
   So what about that Hadoop stuff?
     Each node can: store data & process it (DataNode
      & TaskTracker)
HADOOP – ARCHITECTURE BY EXAMPLE
   How about allocating jobs to slaves? We need a
    JobTracker!
HADOOP – ARCHITECTURE BY EXAMPLE
 What about HDFS – how are data blocks assembled into files?
 The NameNode does it.
HADOOP – ARCHITECTURE BY EXAMPLE
 NameNode – manages HDFS metadata, doesn't deal with file data directly (see the HDFS client sketch below)
 JobTracker – schedules, allocates and monitors job execution on slaves – TaskTrackers
 TaskTracker – runs MapReduce operations
 DataNode – stores HDFS blocks – default replication level for each block: 3
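
To make the NameNode/DataNode split concrete, here is a minimal sketch of reading a file through the HDFS Java API. The path is illustrative and the cluster address is assumed to come from the usual Hadoop configuration files; opening a file only asks the NameNode for metadata (block locations), while the bytes themselves are streamed from DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // NameNode address comes from the configuration

    // open() contacts the NameNode for block locations; reads then stream from DataNodes
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/logs/access.log"))))) {  // illustrative path
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}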
HADOOP – LIMITATIONS
 DataNodes & TaskTrackers are fault tolerant
 NameNode & JobTracker are NOT! (workarounds exist for this problem)
 HDFS deals nicely with large files; it doesn't do well with billions of small files
MAP_REDUCE
 MapReduce – a parallelisation approach
 Two main stages:
      Map – do an actual bit of work, e.g. extract info
      Reduce – summarise, aggregate or filter outputs from the Map operation
    For each job, multiple Map and Reduce operations – each may run on a different node = parallelism
MAP_REDUCE – AN EXAMPLE
 Let's process 1TB of raw logs and extract traffic by host.
 After submitting a job, the JobTracker allocates tasks to slaves – the input is possibly divided into 64MB blocks = 16384 Map operations!
 Map – analyse logs and return results as a set of <key,value> pairs
 Reduce – merge the output of the Map operations (see the Java sketch below)
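
In code, the Map and Reduce stages for this job are small. Below is a minimal sketch using the Hadoop Java API, assuming log lines shaped like the mocked extract shown a little further on (IP, separator, bandwidth); the class names are illustrative, and the job driver (input/output paths, submission) is sketched after the worked example.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TrafficByHost {

  // Map: for each log line ("IP – bandwidth") emit <IP, bandwidth>
  public static class TrafficMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Split on whitespace and take the first and last tokens,
      // so the exact separator character does not matter.
      String[] parts = line.toString().trim().split("\\s+");
      if (parts.length >= 2) {
        String ip = parts[0];
        long bytes = Long.parseLong(parts[parts.length - 1]);
        ctx.write(new Text(ip), new LongWritable(bytes));
      }
    }
  }

  // Reduce: sum all bandwidth values emitted for the same IP
  public static class TrafficReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text ip, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable v : values) {
        total += v.get();
      }
      ctx.write(ip, new LongWritable(total));
    }
  }
}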
MAP_REDUCE – AN EXAMPLE
   Take a look at a mocked log extract:
[IP – bandwidth]
10.0.0.1 – 1234
10.0.0.1 – 900
10.0.0.2 – 1230
10.0.0.3 – 999
MAP_REDUCE – AN EXAMPLE
 It's important to define the key – in this case, the IP. One Map operation returns:
<10.0.0.1;2134>
<10.0.0.2;1230>
<10.0.0.3;999>
 Now, assume another Map operation returned:
<10.0.0.1;1500>
<10.0.0.3;1000>
<10.0.0.4;500>
MAP_REDUCE – AN EXAMPLE
Now, Reduce will merge those results:
<10.0.0.1;3634>
<10.0.0.2;1230>
<10.0.0.3;1999>
<10.0.0.4;500>
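
For completeness, a driver that wires the mapper and reducer sketched above into a runnable job might look like this; the job name and the input/output paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TrafficByHostDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "traffic by host");
    job.setJarByClass(TrafficByHostDriver.class);

    job.setMapperClass(TrafficByHost.TrafficMapper.class);
    // The reducer can also be reused as a combiner, since summing is associative
    job.setCombinerClass(TrafficByHost.TrafficReducer.class);
    job.setReducerClass(TrafficByHost.TrafficReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    FileInputFormat.addInputPath(job, new Path("/logs"));        // illustrative input dir
    FileOutputFormat.setOutputPath(job, new Path("/traffic"));   // must not exist yet

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}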
MAP_REDUCE
 Selecting a key is important
 It's possible to define a composite key, e.g. IP+date (see the sketch below)
 For more complex tasks, it's possible to chain MapReduce jobs
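
A composite key can be as simple as concatenating the fields into one Text key. Below is a minimal variant of the mapper above (same imports, intended to sit alongside it), assuming each log line also carries a date; the field layout and the '|' separator are purely illustrative.

  // Assumes lines like "10.0.0.1 2012-06-12 1234" (IP, date, bandwidth)
  public static class TrafficByDayMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().trim().split("\\s+");
      if (parts.length >= 3) {
        // Composite key "IP|date": Reduce now sums traffic per host per day
        ctx.write(new Text(parts[0] + "|" + parts[1]),
                  new LongWritable(Long.parseLong(parts[parts.length - 1])));
      }
    }
  }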
HBASE
 Another layer on top of Hadoop/HDFS
 A distributed data store
 Not a replacement for RDBMS!
 Can be used with MapReduce
 Good for unstructured data – no need to worry about the exact schema in advance
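
To give a feel for how HBase is used from code, here is a minimal sketch with the classic HBase Java client API (HTable/Put/Get, as found in the HBase versions contemporary with this talk). The table name "traffic", the column family "d" and the row key are illustrative, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseTrafficExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
    HTable table = new HTable(conf, "traffic");         // illustrative, pre-created table

    // Write: row key = IP, column family "d", qualifier "bytes"
    Put put = new Put(Bytes.toBytes("10.0.0.1"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("bytes"), Bytes.toBytes(3634L));
    table.put(put);

    // Read the value back
    Get get = new Get(Bytes.toBytes("10.0.0.1"));
    Result result = table.get(get);
    long total = Bytes.toLong(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("bytes")));
    System.out.println("10.0.0.1 -> " + total);

    table.close();
  }
}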
PIG – HBASE ENHANCEMENT
 HBase – missing a proper query language
 Pig – makes life easier for HBase users
 Translates queries into MapReduce jobs
 When working with Pig or HBase, forget what you know about SQL – it makes your life easier
BUILDING HADOOP CLUSTER
 Post-production servers are OK
 Don’t take ‘cheap hardware’ too literally
 Good connection between nodes is a must!
 >=1Gbps between nodes
 >=10Gbps between racks
 1 disk per CPU core
 More RAM, more caching!
FINAL CONCLUSIONS
 Hadoop and NoSQL-like DBs/data stores scale very well
 Hadoop is ideal for crunching huge data sets
 Does very well in production environments
 The cluster of slaves is fault tolerant; NameNode and JobTracker are not!
EXTERNAL RESOURCES
 Trending Topic – built on Wikipedia access logs:
  http://goo.gl/BWWO1
 Building web crawler with Hadoop:
  http://goo.gl/xPTlJ
 Analysing adverse drug events:
  http://goo.gl/HFXAx
 Moving average for large data sets:
  http://goo.gl/O4oml
EXTERNAL RESOURCES – USEFUL LINKS
http://www.slideshare.net/fullscreen/jpatanooga/la-hug-dec-2011-recommendation-talk/1
https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://hstack.org/hbase-performance-testing/
http://www.theregister.co.uk/2012/06/12/hortonworks_data_platform_one/
http://wiki.apache.org/hadoop/MachineScaling
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
http://www.cloudera.com/resource-types/video/
http://hstack.org/why-were-using-hbase-part-2/
QUESTIONS?
