HDFS ARCHITECTURE
How HDFS is evolving to meet new needs
✛  Aaron T. Myers
    ✛  Hadoop PMC Member / Committer at ASF
    ✛  Software Engineer at Cloudera
    ✛  Primarily work on HDFS and Hadoop Security




✛  HDFS architecture circa 2010
    ✛  New requirements for HDFS
       >  Random read patterns
       >  Higher scalability
       >  Higher availability
    ✛  HDFS evolutions to address requirements
       >  Read pipeline performance improvements
       >  Federated namespaces
       >  Highly available Name Node



HDFS ARCHITECTURE: 2010
✛  Each cluster has…
       >  A single Name Node
           ∗  Stores file system metadata
           ∗  Stores “Block ID” -> Data Node mapping
       >  Many Data Nodes
           ∗  Store actual file data
       >  Clients of HDFS…
           ∗  Communicate with Name Node to browse file system, get
              block locations for files
           ∗  Communicate directly with Data Nodes to read/write files
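Below is a minimal sketch of that client flow in Java, assuming the standard Hadoop FileSystem API (the NameNode URI and file path are made up for illustration):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connecting resolves the Name Node, which serves all metadata
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        // open() asks the Name Node which Data Nodes hold each block...
        try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
          byte[] buf = new byte[4096];
          int n;
          // ...and read() then streams bytes directly from those Data Nodes
          while ((n = in.read(buf)) > 0) {
            System.out.write(buf, 0, n);
          }
        }
      }
    }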




✛  Want to support larger clusters
       >  ~4,000 node limit with 2010 architecture
       >  New nodes beefier than old nodes
          ∗  2009: 8 cores, 16GB RAM, 4x1TB disks
          ∗  2012: 16 cores, 48GB RAM, 12x3TB disks

    ✛  Want to increase availability
       >  With rise of HBase, HDFS now serving live traffic
       >  Downtime means immediate user-facing impact
    ✛  Want to improve random read performance
       >  HBase usually does small, random reads, not bulk


✛  Single Name Node
       >  If Name Node goes offline, cluster is unavailable
       >  Name Node must fit all FS metadata in memory
    ✛  Inefficiencies in read pipeline
       >  Designed for large, streaming reads
       >  Not small, random reads (like HBase use case)




✛  Fine for offline, batch-oriented applications
    ✛  If cluster goes offline, external customers don’t
      notice
    ✛  Can always use separate clusters for different
      groups
    ✛  HBase didn’t exist when Hadoop was first created
       >  MapReduce was the only client application




HDFS PERFORMANCE IMPROVEMENTS
HDFS CPU Improvements: Checksumming

•  HDFS checksums every piece of data in/out
•  Significant CPU overhead
   •  Measured by putting ~1GB in HDFS and cat-ing the file in a loop
   •  0.20.2: ~30-50% of CPU time is CRC32 computation!
•  Optimizations:
   •  Switch to “bulk” API: verify/compute 64KB at a time
      instead of 512 bytes (better instruction cache locality,
      amortize JNI overhead)
   •  Switch to CRC32C polynomial, SSE4.2, highly tuned
      assembly (~8 bytes per cycle with instruction level
      parallelism!)
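The bulk shape can be sketched in plain Java using the JDK’s java.util.zip.CRC32C (Java 9+); note that Hadoop’s real fast path is native code behind JNI, and the class below is purely illustrative:

    import java.util.zip.CRC32C;

    public class BulkCrcSketch {
      static final int CHUNK = 512;       // bytes covered by one checksum
      static final int BULK = 64 * 1024;  // bytes handed to one bulk call

      // One call computes a CRC32C per 512-byte chunk over a whole 64KB
      // buffer, amortizing per-call overhead across 128 chunks.
      static void bulkChunkedCrc(byte[] buf, int[] out) {
        CRC32C crc = new CRC32C();
        for (int chunk = 0; chunk * CHUNK < buf.length; chunk++) {
          int off = chunk * CHUNK;
          crc.reset();
          crc.update(buf, off, Math.min(CHUNK, buf.length - off));
          out[chunk] = (int) crc.getValue();
        }
      }

      public static void main(String[] args) {
        byte[] buf = new byte[BULK];
        int[] sums = new int[BULK / CHUNK];
        bulkChunkedCrc(buf, sums);  // one pass, 128 checksums out
        System.out.println("first chunk crc32c = " + Integer.toHexString(sums[0]));
      }
    }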


Checksum improvements (lower is better)

[Bar chart, CDH3u0 vs. optimized, normalized to CDH3u0 = 100%: random-read
latency drops from 1360µs to 760µs; random-read CPU usage and
sequential-read CPU usage show similar reductions.]

Post-optimization: only 16% overhead vs. un-checksummed access
Maintain ~800MB/sec from a single thread reading OS cache
HDFS Random access

•  0.20.2:
    •  Each individual read operation reconnects to the
       DataNode
    •  Significant TCP handshake overhead, thread creation,
       etc.
•  2.0.0:
    •  Clients cache open sockets to each DataNode (like
       HTTP keep-alive)
    •  Local readers can bypass the DN in some
       circumstances to read data directly (sketched below)
    •  Rewritten BlockReader eliminates a data copy
    •  Eliminated lock contention in the DataNode’s
       FSDataset class
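A hedged sketch of enabling the local-read bypass from the client side: dfs.client.read.shortcircuit is the Hadoop 2.x config key, but the exact mechanism (and the domain-socket path below, which must match the DataNode’s configuration) varied across 2.x releases:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ShortCircuitSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn"); // assumption
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        // Blocks stored on this same machine can now be read directly from
        // local files, skipping the DataNode's data-transfer protocol.
        fs.close();
      }
    }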

Random-read micro benchmark (higher is better)

Speed (MB/sec)          0.20.2    Trunk (no native)    Trunk (native)
4 threads, 1 file          106                  253               299
16 threads, 1 file         247                  488               635
8 threads, 2 files         187                  477               633

TestParallelRead benchmark, modified to 100% random read proportion.
Quad-core Core i7 Q820 @ 1.73GHz
Random-read macro benchmark (HBase YCSB)

[Chart: reads/sec over time; CDH4 sustains substantially higher read
throughput than CDH3u1.]
HDFS FEDERATION ARCHITECTURE
✛  Instead of one Name Node per cluster, several
   >  Before: Only one Name Node, many Data Nodes
   >  Now: A handful of Name Nodes, many Data Nodes
✛  Distribute file system metadata between the
  NNs
✛  Each Name Node operates independently
   >  Potentially overlapping ranges of block IDs
   >  Introduce a new concept: block pool ID
   >  Each Name Node manages a single block pool
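As a rough sketch, a two-NameNode federated cluster is declared like this (key names follow the Hadoop 2.x federation docs; the nameservice IDs and hostnames are made up):

    import org.apache.hadoop.conf.Configuration;

    public class FederationConfSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "ns1,ns2"); // two independent Name Nodes
        conf.set("dfs.namenode.rpc-address.ns1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "nn2.example.com:8020");
        // Data Nodes register with every nameservice and store a separate
        // block pool for each one.
      }
    }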
HDFS Architecture: Federation
✛  Improve scalability to 6,000+ Data Nodes
    >  Bumping into single Data Node scalability now
 ✛  Allow for better isolation
    >  Could locate HBase dirs on dedicated Name Node
    >  Could locate /user dirs on dedicated Name Node
 ✛  Clients still see a unified view of the FS namespace
    >  Use ViewFS – a client-side mount table configuration
       (sketched below)


     Note: Federation != Increased Availability
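A hedged sketch of such a ViewFS mount table, stitching two federated namespaces into one client-side view (mount points and hosts are assumptions; key names follow the Hadoop 2.x ViewFS docs):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ViewFsSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "viewfs:///");
        conf.set("fs.viewfs.mounttable.default.link./user",
                 "hdfs://nn1.example.com:8020/user");
        conf.set("fs.viewfs.mounttable.default.link./hbase",
                 "hdfs://nn2.example.com:8020/hbase");
        FileSystem fs = FileSystem.get(conf); // one logical namespace
        fs.close();
      }
    }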

HDFS HIGH AVAILABILITY ARCHITECTURE
Current HDFS Availability & Data Integrity

•  Simple design, storage fault tolerance
   •  Storage: rely on the OS’s file system rather
      than raw disk
   •  Storage fault tolerance: multiple replicas,
      active monitoring
•  Single NameNode master
   •  Persistent state: multiple copies + checkpoints
   •  Restart on failure




Current HDFS Availability & Data Integrity

•  How well did it work?

•  Lost 19 out of 329 million blocks on 10 clusters with 20K
  nodes in 2009
   •  Seven 9’s of reliability, and that bug was fixed in 0.20


•  18-month study: 22 failures on 25 clusters, i.e. 0.58 failures
  per cluster per year
   •  Only 8 would have benefited from HA failover! (0.23
     failures per cluster per year)



So why build an HA NameNode?

•  Most cluster downtime in practice is planned
  downtime
   •  Cluster restart for a NN configuration change (e.g.
      new JVM configs, new HDFS configs)
   •  Cluster restart for a NN hardware upgrade/repair
   •  Cluster restart for a NN software upgrade (e.g. new
      Hadoop, new kernel, new JVM)
•  Planned downtimes cause the vast majority of
  outages!

•  Manual failover solves all of the above!
   •  Failover to NN2, fix NN1, fail back to NN1, zero
      downtime
Approach and Terminology
•  Initial goal: Active-Standby with Hot
  Failover

•  Terminology
   •  Active NN: actively serves read/write
      operations from clients
   •  Standby NN: waits, becomes active when
      Active dies or is unhealthy
   •  Hot failover: standby able to take over
      instantly

HDFS Architecture: High Availability

•  Single NN configuration; no failover
•  Active and Standby with manual failover
   •  Addresses downtime during upgrades – main
      cause of unavailability
•  Active and Standby with automatic
  failover
   •  Addresses downtime during unplanned outages
       (kernel panics, bad memory, double PDU failure,
       etc)
    •  See HDFS-1623 for detailed use cases
•  With Federation each namespace volume has an
   active-standby NameNode pair

HDFS Architecture: High Availability

•  Failover controller outside NN
•  Parallel Block reports to Active and
   Standby
•  NNs share namespace state via a shared
   edit log
   •  NAS or Journal Nodes
   •  Like RDBMS “log shipping replication”
•  Client failover
   •  Smart clients (e.g. client-side configuration, or
      ZooKeeper for coordination); see the config sketch below
   •  IP failover in the future
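A hedged sketch of the smart-client configuration for one HA pair (key names follow the Hadoop 2.x HA docs; nameservice and host names are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HaClientConfSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        conf.set("fs.defaultFS", "hdfs://mycluster");
        FileSystem fs = FileSystem.get(conf); // retries against nn2 if nn1 is down
        fs.close();
      }
    }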
HDFS ARCHITECTURE: WHAT’S NEXT
✛  Increase scalability of single Data Node
   >  Currently the most-noticed scalability limit
✛  Support for point-in-time snapshots
   >  To better support DR, backups
✛  Completely separate block / namespace layers
   >  Increase scalability even further, new use cases
✛  Fully distributed NN metadata
   >  No pre-determined “special nodes” in the system