
Overview of Hadoop and HDFS


  1. Introduction, Background to Hadoop and HDFS. Brendan Tierney (www.oralytics.com, t: @brendantierney, e: brendan.tierney@oralytics.com)
  2. What is Big Data? •  O'Reilly Radar definition: Big Data is when the size of the data itself becomes part of the problem •  EMC/IDC definition: Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery and/or analysis •  McKinsey definition: Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse •  References: http://www.oreilly.com/data/free/big-data-now-2012.csp http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation http://csrc.nist.gov/groups/SMA/forum/documents/june2012presentations/fcsm_june2012_cooper_mell.pdf
  3. Big Data •  Some companies continue to generate very large amounts of data: •  Facebook ~ 6 billion messages per day •  eBay ~ 2 billion page views a day, ~ 9 Petabytes of storage •  Satellite images by Skybox Imaging ~ 1 Terabyte per day •  These numbers are probably out of date before I finish writing this slide •  Important: this applies to some companies, not all. Hadoop is part of their data management architecture; it will not replace existing DBs etc.
  4. Basic idea •  The basic idea behind the phrase Big Data is that everything we do increasingly leaves a digital trace (data) which we can use and analyse •  Big Data therefore refers to our ability to make use of ever-increasing volumes of data •  Traditional data storage methods can be a challenge! Why?
  5. Big Data
  6. 2013
  7. 2014. Where is Predictive Analytics?
  8. 2015
  9. Hadoop •  Existing tools were not designed to handle such large amounts of data •  "The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing." (http://hadoop.apache.org) •  Process Big Data on clusters of commodity hardware •  Vibrant open-source community •  Many products and tools reside on top of Hadoop
  10. Who is using Hadoop in Ireland? •  Big websites •  Big telcos •  Big banks •  Big financial •  CERN •  Big ….
  11. Access Speeds? •  1990: typical drive ~1370 MB, transfer speed ~4.4 MB/s: read the whole drive in about 5 minutes •  2010: typical drive ~1 TB, transfer speed ~100 MB/s: read the whole drive in about 2.5 hours •  Hadoop: 100 drives working at the same time can read 1 TB of data in about 2 minutes
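As a rough sanity check on the arithmetic above, here is a minimal Java sketch using only the figures quoted on the slide (with 1 TB approximated as 1,000,000 MB); it recomputes the three read times and prints roughly 5 minutes, 2.8 hours and under 2 minutes respectively.

```java
// Back-of-the-envelope check of the drive read times quoted on the slide.
public class DriveReadTimes {
    public static void main(String[] args) {
        double mb1990 = 1370, rate1990 = 4.4;        // ~1370 MB drive at ~4.4 MB/s
        double mb2010 = 1_000_000, rate2010 = 100;   // ~1 TB drive at ~100 MB/s
        int drives = 100;                            // Hadoop: 100 drives reading in parallel

        System.out.printf("1990 drive : %.1f minutes%n", mb1990 / rate1990 / 60);
        System.out.printf("2010 drive : %.1f hours%n", mb2010 / rate2010 / 3600);
        System.out.printf("100 drives : %.1f minutes%n", (mb2010 / drives) / rate2010 / 60);
    }
}
```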
  12. Scaling issue: $ $ $ $ ?
  13. Scaling issue •  Scale-Up (scaling vertically) •  Add additional resources to an existing node (CPU, RAM) •  It is harder and more expensive to scale up ("It Depends" needs to be applied) •  Moore's Law can't keep up with data growth •  New units must be purchased if the required resources cannot be added •  Scale-Out •  Add more nodes/machines to an existing distributed application •  The software layer is designed for node additions and removals •  Hadoop takes this approach: a set of nodes is bonded together as a single distributed system •  Very easy to scale down as well
  14. Hadoop Principles •  Scale-Out rather than Scale-Up •  Bring code to data rather than data to code •  Deal with failures - they are common •  Abstract complexity of distributed and concurrent applications •  Self-managing •  Automatic parallel processing
  15. Big Data - Example Applications •  Not all of these are using Hadoop or require Hadoop!
  16. Hadoop Cluster •  A set of "cheap" commodity hardware •  Networked together •  Resides in the same location •  A set of servers in a set of racks in a data center •  "Cheap" commodity server hardware •  No need for super-computers; use unreliable commodity hardware •  Not desktops •  Yes, you can build a Hadoop cluster using Raspberry Pis
  17. Abstracting Complexity •  Distributed computing is HARD WORK •  Hadoop abstracts many complexities in distributed and concurrent applications •  Defines a small number of components •  Provides simple, well-defined interfaces for interactions between these components •  Frees the developer from worrying about system-level challenges •  race conditions, data starvation •  processing pipelines, data partitioning, code distribution, etc. •  Allows developers to focus on application development and business logic
  18. Hadoop vs RDBMS •  Always keep the phrase "It Depends" in mind when discussing Big Data •  Hadoop != RDBMS •  Hadoop will not replace the RDBMS •  Hadoop is part of your data management architecture •  and only if it is needed!
  19. RDBMS vs Hadoop
                           RDBMS                      Hadoop
      Data size            Gigabytes                  Petabytes
      Access               Interactive & batch        Batch
      Updates              Read & write many times    Write once, read many times
      Integrity            High                       Low
      Scaling              Non-linear                 Linear
      Data representation  Structured                 Unstructured, semi-structured
  20. Current trends for Hadoop
  21. Current trends for Hadoop
  22. Current trends for Hadoop
  23. Current trends for Hadoop
  24. (no text on this slide)
  25. Current trends for Hadoop
  26. Working together •  Hadoop and RDBMS frequently complement each other within an architecture •  For example, a website that •  has a small number of users •  produces a large amount of audit logs •  the user data fits comfortably in the RDBMS, while the bulky audit logs are a better fit for Hadoop
  27. Hadoop Ecosystem
  28. Hadoop Ecosystem
  29. Hadoop Ecosystem
  30. Hadoop Distributions •  Large number of independent products (Apache projects) •  It can be challenging to get all/some of these to work together •  We will be working with Hadoop, installing and using some of these products •  Hadoop distributions aim to resolve version incompatibilities •  A distribution vendor will •  integration-test a set of Hadoop products •  package Hadoop products in various installation formats •  Linux packages, tarballs, etc. •  Distributions may provide additional scripts to run Hadoop •  Some vendors may choose to backport features and bug fixes made by Apache •  Typically vendors will employ Hadoop committers, so the bugs they find will make it into Apache's repository
  31. Hadoop Distributions •  Cloudera Distribution for Hadoop (CDH) •  Check out the pre-built VM with most of Cloudera's products (Hadoop, etc.) •  http://www.cloudera.com/downloads/quickstart_vms/5-8.html •  MapR Distribution •  Check out the MapR Sandbox VM •  https://www.mapr.com/products/mapr-sandbox-hadoop •  Hortonworks Data Platform (HDP) •  Check out the Hortonworks Sandbox VM •  http://hortonworks.com/products/sandbox/ •  Oracle Big Data Appliance •  Check out a pre-built VM with Hadoop, Oracle and lots of other tools all installed and configured for you to use •  http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
  32. Hadoop - the "move-code-to-data" approach •  Data is distributed among the nodes as it is initially stored in the system •  Data is replicated multiple times across the system for increased reliability and availability •  The master allocates work to the nodes •  Computation happens on the nodes where the data is stored - data locality •  Nodes work in parallel, each on its own part of the overall dataset •  Nodes are independent and self-sufficient - a shared-nothing architecture •  If a node fails, the master detects the failure and re-assigns its work to other nodes •  If a failed node restarts, it is automatically added back into the system and assigned new tasks
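To make the "bring code to data" idea concrete, here is the classic word-count job written against the Hadoop MapReduce Java API: the small job jar is shipped to the cluster, and the map tasks are scheduled on the nodes that already hold the input blocks. This is a minimal sketch, not code from the presentation; the input and output paths are placeholders supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map tasks run on the nodes that already hold the input blocks (data locality).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // emit (word, 1) for each token
            }
        }
    }

    // Reduce tasks aggregate the partial counts produced by all the mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);        // only this small jar is shipped to the nodes
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would be submitted with something like `hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output` (paths hypothetical).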
  33. HDFS •  A distributed file system modelled on the Google File System (GFS) [http://research.google.com/archive/gfs.html] •  Data is split into blocks, typically 64MB or 128MB in size, spread across many nodes •  Works better on large files, i.e. files >= 1 HDFS block in size •  Each block is replicated to a number of nodes (typically 3) •  ensures reliability and availability •  Files in HDFS are write-once - no random writes to files allowed •  HDFS is optimised for large streaming reads of files - no random access to files allowed •  see Hive later on for more DBMS-type access to HDFS files....
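A minimal sketch of writing a file into HDFS with the Java FileSystem API, making the block size and replication factor explicit. The cluster address and file path are hypothetical; in practice the defaults come from the cluster configuration (dfs.blocksize, dfs.replication).

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");    // assumed NameNode address

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(
                     new Path("/user/demo/example.txt"),      // hypothetical HDFS path
                     true,                                     // overwrite if it exists
                     4096,                                     // io buffer size in bytes
                     (short) 3,                                // replication factor
                     128L * 1024 * 1024)) {                    // 128 MB block size
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```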
  34. HDFS is good for •  Storing large files •  Terabytes, Petabytes, etc. •  Millions rather than billions of files •  100MB or more per file •  Streaming data •  Unstructured data (really a mix of data with varying structure) •  Write-once, read-many-times access patterns •  Schema on read (RDBMS = schema on write) •  Huge time saving at data write time •  BUT!!! •  Optimised for streaming reads rather than random reads •  "Cheap" commodity hardware •  No need for super-computers; use less reliable commodity hardware
  35. HDFS is not so good at •  Low-latency reads •  High throughput rather than low latency for small chunks of data •  HBase and other DBs can address this issue (?) •  Large numbers of small files •  Better for millions of large files instead of billions of small files •  Block size of 128MB or 256MB •  For example, each file can be 100MB or more •  Multiple writers •  Single writer per file •  Writes only at the end of a file; no support for writes at arbitrary offsets •  Time needed for replication
  36. HDFS •  Two types of nodes in an HDFS cluster •  NameNode - the master node •  DataNodes - slave or worker nodes •  The NameNode manages the file system •  keeps track of the metadata - which blocks make up a file (using two files: the namespace image and the edit log) •  knows on which DataNodes the blocks are stored •  The DataNodes do the work •  store the blocks •  retrieve blocks when requested to (by the client or the NameNode) •  poll and report back to the NameNode periodically with the list of blocks that they are storing
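To see the NameNode's metadata in action, the sketch below asks for the block layout of a hypothetical file: the FileSystem client queries the NameNode, which returns each block's offset, length and the DataNodes holding its replicas.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            FileStatus status = fs.getFileStatus(new Path("/user/demo/example.txt"));
            // The NameNode answers with one BlockLocation per block of the file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }
}
```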
  37. HDFS •  When a client application wants to read a file... •  it communicates with the NameNode to determine which blocks make up the file, and on which DataNodes the blocks reside •  it then communicates directly with the DataNodes •  The NameNode is the single point of failure of a Hadoop system •  back it up periodically to remote NFS (set up as part of the Hadoop configuration) •  use a Secondary NameNode •  not the same as the NameNode •  periodically merges the namespace image with the edit log and maintains a copy
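That read path is what the FileSystem API does under the hood: opening a file fetches the block locations from the NameNode, and the returned stream then pulls the bytes directly from the DataNodes. A minimal sketch, again with a hypothetical file path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration());
             // open() asks the NameNode for block locations; reads go straight to the DataNodes
             FSDataInputStream in = fs.open(new Path("/user/demo/example.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```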
  38. HDFS Architecture [diagram from Hadoop in Practice, Alex Holmes]
  39. Files and Blocks •  Files are split into blocks (the single unit of storage) •  Managed by the NameNode, stored by the DataNodes •  Transparent to the user •  Replicated across machines at load time •  The same block is stored on multiple machines •  Good for fault-tolerance and access •  Can lead to inconsistent reads •  Default replication is 3 •  Have you ever experienced inconsistent reads?
  40. HDFS File Writes
  41. HDFS File Reads
  42. Who is using Hadoop in Ireland? •  List of Cloudera customers in Ireland •  Citi •  Allianz •  Deutsche Bank •  Ulster Bank •  Dun & Bradstreet •  Ryanair •  BT •  Vodafone •  Novartis •  Airbnb •  Dell •  Intel •  Rockwell Automation •  Revenue •  Adecco •  Experian •  M&S
  43. Discuss: "Hadoop is not FREE :-)" vs "Hadoop is not FREE :-("
  44. Something to think about
