
7. EMC Isilon HDFS: Enterprise Storage for Hadoop


Published in: Technology, Business

  1. EMC Isilon HDFS – Enterprise Storage for Hadoop
     Featuring EMC Isilon Scale-Out NAS Storage
     Shai Harmelin, EMC System Engineer – Isilon Specialist
     May 21, 2013
     © Copyright 2011 EMC Corporation. All rights reserved.
  2. Today's Agenda
     • EMC Isilon Background
     • HDFS Architectural Challenges
     • Isilon HDFS Benefits
     • Performance Comparison
     • Customer Case Study
     • Q+A
  3. EMC Isilon: Setting the standard for scale-out NAS
     • Founded in 2000; leader in scale-out NAS (Gartner 2010)
     • Broad adoption across many markets
       – High Performance Computing (HPC): Life Sciences, Oil & Gas, Electronic Design Automation, Media & Entertainment, Financial Services
       – Enterprise IT: Archive, Home Directories, File Shares, Virtualization, Business Analytics
     • Acquired by EMC in 2011 for $2.5B
     • Over 3,500 global customers
     • Isilon OneFS: seventh-generation, industry-proven, innovative scale-out operating environment
     • 2012: EMC Isilon is the industry's first scale-out NAS system with native HDFS support
  4. Isilon Growing Momentum: 3,500+ customers
  5. Why Hadoop is Important to EMC Isilon Customers
     • Pragmatic approach to analytics on a very large scale
       – Opens up new ways of gaining insights and identifying opportunities for businesses
     • Designed to address the rise of unstructured data
       – Enterprise data to grow by 650% over the next 5 years
       – More than 80% of this growth will be unstructured data
     • Hadoop is only ONE component of the Enterprise Big Data Analytics PIPELINE
  6. Isilon Scale-Out NAS Architecture
     [Diagram labels: Servers (Client/Application Layer); Ethernet Layer; protocols CIFS, NFS, FTP, HTTP, HDFS for Hadoop; OneFS Operating Environment (single FS/volume); Intra-cluster Communication Layer]
  7. Isilon Core Innovation: OneFS scale-out operating system
     [Diagram labels: Single File System; Simplicity; Leadership Efficiency; High Performance; Easy Growth; Automated Tiering; Linear Scalability]
  8. Largest and Most Scalable File System
     • 500X more scalable than traditional storage systems
     • OneFS™ can scale from 18 TB to over 20,000 TB in a single file system
  9. AutoBalance
     Automated data balancing across nodes reduces costs, complexity and risks for scaling storage.
     "Using Software to do Work Unfit for Humans"
     • AutoBalance migrates content to new storage nodes while the system is online and in production
     • Requires NO manual intervention, NO reconfiguration, NO server or client mount point or application changes
     • Eliminates "hot spots"
     [Diagram: nodes labelled EMPTY and FULL before balancing, BALANCED afterwards]
  11. Isilon, Scale-Out NAS for Big Data
     Single File System, Single Volume Simplicity for Active, Persistent, and Archive Data
     • Load balancing
     • Seamless failover
     • Performance zones
     • Quota management
     • Thin provisioning
     • High speed replication
     • Disaster recovery
     • Business continuance
     • Instant recovery
     • Data protection
     • File immutability
     • Protection from deletion/change
     • Automated storage tiering
     [Diagram labels: Client/Application Layer (Virtualized Servers, Clients); Network; WAN/LAN; Primary & Nearline Storage (S-series, X-series, NL-series); Local/Remote Archive (NL-series); Backup Accelerator]
  12. Easiest Storage System to Manage
     Single Level of Management: manage an 18 TB to 10 PB single file system from one intuitive console.
     "Isilon has made some very bold claims with respect to its clustered storage products - not least the idea of genuinely revolutionizing the ease and speed with which mass storage - over 500 Terabytes - can be added and managed thereafter. We have conducted rigorous testing and unanimously agree with their assertions. This stuff is almost frighteningly simple to use."
     Steve Broadhead, Founder, Broadband-Testing Laboratories
  13. HDFS Overview
  14. Core Hadoop Components
     • NameNode
     • Secondary NameNode
     • DataNode / Task Tracker
     • Job Tracker
  15. Compute Components
     • Job Tracker
       – Manages all the jobs submitted to the cluster
       – Tracks and reports the status of jobs and tasks
       – Provides job queuing functionality
       – Communicates with the NameNode and tries to align Task Trackers to DataNodes
     • Task Tracker
       – The compute workhorse
       – Serves read/write requests from the clients
       – Executes Map/Reduce tasks
       – Typically performs I/O against local or remote DataNodes
  16. File System Components
     • NameNode
       – Manages the file system namespace
       – Stores all the metadata in RAM, which limits file system size
       – Filenames, owners, group, access info
       – Knows associated blocks
       – Manages block replication across DataNodes
     • Secondary NameNode
       – Manages the edit log and check-pointing of NameNode metadata
       – Does not provide NameNode hot failover
       – CDH4 has a solution for this, but it is not in full-scale production in most environments
     • DataNode
       – Stores blocks of files on top of the native host OS file system (e.g. EXT3, XFS, ZFS)
       – The same block is stored on multiple DataNodes for redundancy
       – Has no "awareness" of data blocks living elsewhere (only the NameNode does)
  17. Enterprise Challenges of Hadoop (Hadoop DAS Environment)
     1. Dedicated Storage Infrastructure – one-off for Hadoop only
     2. Single Point of Failure – NameNode
     3. Lacking Enterprise Data Protection – no snapshots, replication, backup
     4. Poor Storage Efficiency – 3X mirroring
     5. Fixed Scalability – rigid compute-to-storage ratio
     6. Manual Import/Export – no protocol interoperability support
  18. Enterprise Challenges of Hadoop (Hadoop DAS Environment), continued
     Repeats the six challenges from the previous slide.
     [Diagram: a single NameNode and DataNodes holding blocks labelled 1x, 2x and 3x]
  19. Isilon HDFS Support
     • Isilon supports the HDFS interfaces for the NameNode and DataNode to host both metadata and data
     • The underlying file system is OneFS
     • As simple as pointing the Hadoop nodes to the DNS name of the Isilon cluster!
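Pointing Hadoop at a cluster by DNS name is usually done through the default file system URI in `core-site.xml`. The sketch below renders such a file in Python. The hostname `isilon.example.com` and port 8020 are placeholders assumed for illustration and do not come from the slides. `fs.defaultFS` is the Hadoop 2.x property name; Hadoop 1.x uses `fs.default.name`.

```python
import xml.etree.ElementTree as ET


def render_core_site(cluster_dns_name: str, port: int = 8020, hadoop2: bool = True) -> str:
    """Return a minimal core-site.xml body that sets the default file system URI.

    cluster_dns_name and port are illustrative placeholders, not values from the deck.
    """
    # Hadoop 2.x reads fs.defaultFS; Hadoop 1.x reads the older fs.default.name.
    property_name = "fs.defaultFS" if hadoop2 else "fs.default.name"

    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = property_name
    ET.SubElement(prop, "value").text = f"hdfs://{cluster_dns_name}:{port}"
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical DNS name for the storage cluster.
    print(render_core_site("isilon.example.com"))
```

The compute nodes then resolve every `hdfs://` path against that single name, so no NameNode host has to be configured on the Hadoop side.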
  20. HDFS is a protocol!
     • Each Isilon node now "speaks" the HDFS NameNode and DataNode protocol
     • We eliminate the need to run these services on the Hadoop compute cluster
     • Every Isilon node acts as both a NameNode and a DataNode (isi_hdfs_d)
     • Data is laid out within OneFS exactly the same as for NFS, SMB, etc.
     • Data is protected just like any other data in the Isilon file system: no mirroring, only parity = 80% utilization
     • All Isilon enterprise features are applied to Hadoop data: Snapshots, Replication, SmartCache, SmartLock, etc.
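To put the 80% utilization figure next to the 3X mirroring mentioned earlier in the deck, here is a small worked calculation in Python. The 100 TB data set is an arbitrary example value; the functions only do arithmetic and do not model the OneFS protection scheme.

```python
def raw_needed_with_replication(usable_tb: float, copies: int = 3) -> float:
    """Raw capacity needed when every block is stored `copies` times."""
    return usable_tb * copies


def raw_needed_with_utilization(usable_tb: float, utilization: float = 0.80) -> float:
    """Raw capacity needed when `utilization` of raw space holds user data."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return usable_tb / utilization


if __name__ == "__main__":
    data_tb = 100.0  # example data set size, chosen for illustration
    mirrored = raw_needed_with_replication(data_tb)
    parity = raw_needed_with_utilization(data_tb)
    print(f"3X mirroring: {mirrored:.0f} TB raw ({data_tb / mirrored:.0%} utilization)")
    print(f"Parity at 80%: {parity:.0f} TB raw")
```

Under these assumptions, 3X mirroring needs 300 TB of raw disk for 100 TB of data (about 33% utilization), while 80% utilization needs about 125 TB.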
  21. HDFS Writes on Isilon
     • The Job Tracker asks the Isilon NameNode (isi_hdfs_d): "tell me where to place /path/file"
     • OneFS isi_hdfs_d hands the Job Tracker a list of 3 "DataNode" addresses for each block (aligned to the block size defined on the Hadoop cluster)
     • The Job Tracker assigns a Task Tracker to communicate with the DataNode (isi_hdfs_d) to write each data block (an abstraction in our case)
     • When complete, isi_hdfs_d responds by saying the block is replicated (a lie), because data is striped like any other file, written over any protocol
     • HDFS files are laid out on the Isilon file system (IFS) similarly to any other protocol (NFS, CIFS, FTP)
     • A file can be written over NFS (nfsd) or CIFS (lwiod) and accessed over HDFS (isi_hdfs_d)
  22. HDFS Reads on Isilon
     • The Job Tracker asks the Isilon NameNode (isi_hdfs_d): "tell me where /path/file lives"
     • isi_hdfs_d responds with a list of block addresses (3 DataNode IPs per block). Note that the block size in this case is configurable on Isilon (default 64 MB)
     • The Job Tracker assigns Task Trackers to read each block (first address out of 3 for each)
     • Tasks within each Task Tracker ask the NameNode (again) for block locations, then initiate I/O transactions to read the data over the network
     • The concept of locality is eliminated, except for rack awareness
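The shape of the location response described above (a list of blocks, each with three DataNode addresses) can be sketched with a toy model in Python. This is not how isi_hdfs_d chooses addresses. The round-robin selection, the function name `block_locations`, and the IP addresses are all assumptions made for illustration.

```python
from typing import List

DEFAULT_BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide


def block_locations(file_size: int, node_ips: List[str],
                    block_size: int = DEFAULT_BLOCK_SIZE,
                    addresses_per_block: int = 3) -> List[List[str]]:
    """Return one list of node addresses per logical block of the file.

    Addresses are picked round-robin purely for illustration; the real
    selection policy is not described in the slides.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    if len(node_ips) < addresses_per_block:
        raise ValueError("need at least as many nodes as addresses per block")

    n_blocks = -(-file_size // block_size)  # ceiling division
    return [
        [node_ips[(block + k) % len(node_ips)] for k in range(addresses_per_block)]
        for block in range(n_blocks)
    ]


if __name__ == "__main__":
    nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]  # hypothetical node IPs
    for index, addresses in enumerate(block_locations(200 * 1024 * 1024, nodes)):
        # A reader would contact the first address listed for each block.
        print(f"block {index}: read from {addresses[0]}, alternatives {addresses[1:]}")
```

A 200 MB file at the 64 MB default yields four logical blocks. Because every node can serve any block, the three addresses act as alternative access paths rather than as the locations of three physical copies.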
  23. Isilon HDFS Settings
  24. How EMC Isilon Addresses the Hadoop Challenge
     Hadoop DAS challenge, followed by the Isilon answer:
     1. Dedicated Storage Infrastructure (one-off for Hadoop only) -> Scale-Out Storage Platform (multiple applications & workflows)
     2. Single Point of Failure (NameNode) -> No Single Point of Failure (distributed NameNode)
     3. Lacking Enterprise Data Protection (no snapshots, replication, backup) -> End-to-End Data Protection (SnapshotIQ, SyncIQ, NDMP backup)
     4. Poor Storage Efficiency (3X mirroring) -> Industry-Leading Storage Efficiency (>80% storage utilization)
     5. Fixed Scalability (rigid compute-to-storage ratio) -> Independent Scalability (add compute & storage separately)
     6. Manual Import/Export (no protocol support) -> Multi-Protocol (industry standard protocols: NFS, CIFS, FTP, HTTP, HDFS)
  25. Distributed (Clustered) Name Node When Using Isilon
     • MTTDL (mean time to data loss) = 5,000 years
     • Metadata is stored across systems the same way as standard file metadata
     • Built-in clustered redundancy across many nodes
     • Clustering the NameNode on Isilon allows for the failure protection level Isilon already provides
     [Diagram labels: Name Node; Clustered NameNode]
  26. Fixed Scaling / Independent Scaling
     • Hadoop
       – Storage-to-compute ratio is fixed
       – Scaling compute means scaling capacity
       – Difficult to provide QoS
       – Compute upgrade is a forklift
     • Isilon
       – Scale compute independent of storage
       – Achieve optimal performance balance even as workloads evolve
       – No data migrations, ever!
       – Add new performance as hardware evolves
     [Diagram labels: storage; compute; desired performance/capacity]
  27. Protocol Support
     • Before
       – HDFS is not visible to Windows, Unix, Linux, Apple, or any other file system natively
       – Big Data is only used for Big Data
     • After
       – Inherent multi-protocol support in Isilon allows ubiquitous access to all file systems including Hadoop
       – Big Data is actual data!
  28. Time-to-Results
     • Data Copy Analysis: existing primary storage -> data center network -> "Hadoop on a Stick"
       – Have you ever copied 100 TB from primary storage to a Hadoop system?
       – Approximately how long does it take to copy 100 TB from one place to another over a 10 Gb/s link? More than 24 hours
     • In-Place Analysis: existing primary storage <- data center network <- Hadoop processing nodes
       – Reading relevant data for analysis
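The copy-time estimate is simple arithmetic, sketched here in Python. It assumes decimal terabytes and reads the slide's link speed as 10 gigabits per second (10 GbE). The `efficiency` parameter is an assumption standing in for protocol and storage overhead, and 0.9 is an arbitrary example value.

```python
def copy_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb terabytes over a link of link_gbps gigabits/s.

    efficiency is the fraction of line rate actually achieved (assumed value).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = data_tb * 1e12 * 8                       # decimal TB -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    print(f"line rate:      {copy_hours(100, 10):.1f} h")
    print(f"90% efficiency: {copy_hours(100, 10, 0.9):.1f} h")
```

At full line rate the copy takes about 22 hours. With the assumed 90% efficiency it takes about 25 hours, which is consistent with the slide's figure of more than 24 hours.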
  29. Snapshot/Version Control
     • Before
       – Traditional HDFS does not have replication
       – No snapshotting of data
       – Loss of version control
       – Not designed for mission-critical use
     • After
       – Full SnapshotIQ™ integration identifies changes
       – Multi-threaded, multi-node scale-out replication
       – Improved RPO/RTO for business continuity
       – Geo-replicated Hadoop!
  30. Hadoop Distributions Support on Isilon HDFS
     • Available now in 7.0.1.5
     • Multiple hdfs:// namespaces
       – hdfs://DAS + hdfs://isilon
       – Potential for archive/tiering
       – Hadoop cluster version mixing
     • Distributions:
       – Cloudera CDH4.x
       – Hortonworks HDP-2
       – PivotalHD 1.0 (aka GPHD 2.0)
       – Apache 0.23 / Apache 2.0
     [Diagram labels: HDFS v2; HDFS v1]
  31. Performance
  32. Test Used HiBench
     • Developed by Intel and open sourced
       – Collection of standard Hadoop jobs
       – Our tests focused on TeraSort and TestDFSIO
     • All results are normalized as throughput per node to allow comparison of differing configurations
     • TestDFSIO tests were uncompressed, which shows actual I/O efficiency
       – Compressed gives much higher performance, but is not actual I/O
  33. GPHD-Isilon is Highly Competitive
  34. TeraSort Performance is Comparable Between Configurations
  35. I/O Performance Scales As Isilon Nodes Are Added
  36. For Typical Workloads, 1.5 Compute Nodes Per Isilon X400 Node is Good
     (Four Isilon X400 nodes tested)
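The 1.5:1 rule of thumb translates into a one-line sizing calculation, shown here in Python. Rounding up to a whole compute node is an assumption, and the ratio only reflects the tested configuration quoted on the slide, not a general guarantee.

```python
import math

COMPUTE_PER_X400_NODE = 1.5  # ratio quoted on the slide for typical workloads


def compute_nodes_for(isilon_nodes: int, ratio: float = COMPUTE_PER_X400_NODE) -> int:
    """Suggested number of Hadoop compute nodes for a given Isilon node count."""
    if isilon_nodes < 0 or ratio <= 0:
        raise ValueError("isilon_nodes must be >= 0 and ratio > 0")
    # Round up so the suggestion never falls below the ratio (an assumption).
    return math.ceil(isilon_nodes * ratio)


if __name__ == "__main__":
    for n in (4, 5, 8):
        print(f"{n} Isilon nodes -> {compute_nodes_for(n)} compute nodes")
```

For the four-node configuration that was tested, this suggests six compute nodes.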
  37. Return Path
     http://www.emc.com/collateral/customer-profiles/h11528-return-path-cp.pdf
     Company background:
     • Return Path is the worldwide leader in email intelligence, serving Internet service providers (ISPs), businesses, and individuals.
     • The company's email intelligence solutions process and analyze massive volumes of data to maximize email performance, ensure email delivery, and protect users from spam and other abuse.
     • Developed Hadoop-based email intelligence solutions combined with NAS-based data access
     Challenges:
     • Limited performance and capacity to support intensive Hadoop analytics
     • NFS and Hadoop environments struggled to handle unique data sets comprised of hundreds of millions of small email files and large analytics files, which hindered analytics and delivery of customer solutions
     • 25 different DAS and NAS storage systems lacked performance and capacity
     • Storage projected to increase from 150 TB to 2 PB over the next 5 years
  38. Return Path: Isilon Solution and Benefits
     Solution:
     • Isilon X400 scale-out NAS, approx. 200 TB capacity
     • SmartConnect, SmartQuotas, InsightIQ software suite
     • NFS and HDFS data access protocols
     Results:
     • Return Path now has a single repository for all its Big Data, accessible to email analysts, product development teams and external customers.
     • Isilon delivers real-time data to Return Path's end-user applications while providing seamless integration with Hadoop for back-end data analytics
     • Reduces shared storage data center footprint by 30 percent
     • Shortens weekly administration time by more than 35 percent
     • Improves availability and reliability for Hadoop analytics
     • Savings of $350,000 from lower power, cooling, and maintenance
  39. Return Path: Customer Quotes
     "To have all this data being generated by our email intelligence products, but no way to access it directly by Hadoop, was a major hindrance,"
     "Isilon serves NFS data across multiple product suites and makes it easily accessible to our Hadoop analytics team. That's a significant business enabler, allowing Return Path to develop customer solutions much faster."
     "Isilon InsightIQ software has been invaluable, providing visibility into our infrastructure and managing our space efficiently as we grow."
     Diz Carter, VP Infrastructure Operations
  40. Questions?
  41. Thank You!
