SlideShare ist ein Scribd-Unternehmen logo
1 von 17
Downloaden Sie, um offline zu lesen
A High Availability story for HDFS
AvatarNode
Dhruba Borthakur & Dmytro Molkov
dhruba@apache.org & dms@facebook.com
Presented at The Hadoop User Group Meeting,
Sept 29, 2010
How infrequently does the NameNode (NN) stop?
  Hadoop Software Bugs
–  Two directories in fs.name.dir, but when a write to first
directory failed, the NN ignored the second one (once)
–  Upgrade from 0.17 to 0.18 caused data corruption
(once)
  Configuration errors
–  Fsimage partition ran out of space (once)
–  Network Load Anomalies (about 10 times)
  Maintenance:
–  Deploy new patches (once every month)
What does the SecondaryNameNode do?
  Periodically merges Transaction logs
  Requires the same amount of memory as NN
  Why is it separate from NN?
–  Avoids fine-grain locking of NN data structures
–  Avoids implementing copy-on-write for NN data
structures
  Renamed as CheckpointNode (CN) in 0.21 release.
Shortcomings of the SecondaryNameNode?
  Does not have a copy of the latest transaction log
  Periodic and is not continuous
–  Configured to run every hour
  If the NN dies, the SecondaryNameNode does not take
over the responsibilities of the NN
BackupNode (BN)
  NN streams transaction log to BackupNode
  BackupNode applies log to in-memory and disk image
  BN always commit to disk before success to NN
  If BN restarts, it has to catch up with NN
  Available in HDFS 0.21 release
Limitations of BackupNode (BN)
  Maximum of one BackupNode per NN
–  Support only two-machine failure
  NN does not forward block reports to BackupNode
  Time to restart from 12 GB image, 70M files + 100 M
blocks
–  3 – 5 minutes to read the image from disk
–  20 min to process block reports
–  BN will still take 25+ minutes to failover!
Overlapping Clusters for HA
  “Always available for write” model
  Two logical clusters each with their own NN
  Each physical machine runs two instances of DataNode
  Two DataNode instances share the same physical storage
device
  Application has logic to failover writes from one HDFS
cluster to another
  More details at http://hadoopblog.blogspot.com/2009/06/hdfs-
scribe-integration.html
HDFS+Zookeeper
  HDFS can store transaction logs in Zookeeper/Bookeeper
–  http://issues.apache.org/jira/browse/HDFS-234
  Transaction log need not be stored in NFS filer
  A new NN will still have to process block reports
–  Not good for HA yet, because NN failover will take 30 minutes
Our use case for High Availability
  Failover should occur in less than a minute
  Failovers are needed only for new software upgrades
Challenges
  DataNodes send block location
information to only one
NameNode
  NameNode needs block locations
in memory to serve clients
  The in-memory metadata for 100
million files could be 60 GB,
huge!
DataNodes
Primary
NameNode
Client
Block location
message “yes, I
have blockid 123”
Client retrieves
block location from
NameNode
Introduction for AvatarNode
  Active-Standby Pair
–  Coordinated via zookeeper
–  Failover in few seconds
–  Wrapper over NameNode
  Active AvatarNode
–  Writes transaction log to filer
  Standby AvatarNode
–  Reads transactions from filer
–  Latest metadata in memory
http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html
NFS
Filer
Active
AvatarNode
(NameNode)
Client
Standby
AvatarNode
(NameNode)
Block
location
messages
Client retrieves
block location from
Primary or Standby
write
transaction
read
transaction
Block
location
messages
DataNodes
ZooKeeper integration for Clients
  DistributedAvatarFileSystem:
–  Connects to ZooKeeper to figure out who the Primary node is.
There is a znode in ZooKeeper that has the current address of the
primary. Clients read it on creation and during failover.
–  Is aware of the failover state and pauses until it is over. If the
znode is empty the cluster is failing over to the new Primary, just
wait for that to finish.
–  Handles failures in calls to the NameNode. If the call failed with
network exception – checks for the failover in progress and retries
the call after the new Primary is up.
Four steps to failover
  Wipe ZooKeeper entry. Clients will know the failover is in
progress. (0 seconds)
  Stop the primary namenode. Last bits of data will be
flushed to Transaction Log and it will die. (Seconds)
  Switch Standby to Primary. It will consume the rest of the
Transaction log and get out of safemode ready to serve
traffic. (Seconds)
  Update the entry in ZooKeeper. All the clients waiting for
failover will pick up the new connection. (0 seconds)
  After: Start the first node in the Standby Mode (Takes a
while, but the cluster is up and running)
Why add ZooKeeper to the mix
  Provides a clean way to execute failovers in the application
layer.
  A centralized control of all the clients. Gives us the ability
to Pause clients until the failover is done.
  A good stepping stone for future improvements needed to
perform automatic failover:
–  Nodes voting on who will be the primary
–  DataNodes knowing who has the authority to delete blocks
ZooKeeper is an option
  Can be implemented using IP failover based on the existing
infrastructure.
  The clients will not know if the failover is done and it is
safe to make the call again.
  IP failover works well in tightly coupled system (both nodes
in one rack) so the Single Point of Failure is still there (rack
switch)
  There is no need to run a dedicated ZooKeeper cluster.
It is not all about failover
  The Standby node has a lot of CPU that is not used
  The Standby node has a full (but delayed a bit) picture of
Blocks and the Namesystem.
  Send all reads that can deal with stale data to Standby.
  Have pluggable services run as a part of Standby node and
use the metadata of the filesystem directly from the
shared memory instead of querying the namenode all the
time.
  My Hadoop Blog:
–  http://hadoopblog.blogspot.com/
–  http://www.facebook.com/hadoopfs

Weitere ähnliche Inhalte

Was ist angesagt?

Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoSadayuki Furuhashi
 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_presoHadoop User Group
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsBen Laird
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...Data Con LA
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesPhil Peace
 
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at HuaweiHBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at HuaweiMichael Stack
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...Michael Stack
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Spark Summit
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQHakka Labs
 
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLHBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLCloudera, Inc.
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInAmy W. Tang
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 

Was ist angesagt? (20)

Mumak
MumakMumak
Mumak
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_preso
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at HuaweiHBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
 
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLHBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 

Ă„hnlich wie Hdfs high availability

NetApp against ransomware
NetApp against ransomwareNetApp against ransomware
NetApp against ransomwareDamien Berezenko
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 
Practice and challenges from building IaaS
Practice and challenges from building IaaSPractice and challenges from building IaaS
Practice and challenges from building IaaSShawn Zhu
 
Big data with hadoop Setup on Ubuntu 12.04
Big data with hadoop Setup on Ubuntu 12.04Big data with hadoop Setup on Ubuntu 12.04
Big data with hadoop Setup on Ubuntu 12.04Mandakini Kumari
 
Zookeeper Introduce
Zookeeper IntroduceZookeeper Introduce
Zookeeper Introducejhao niu
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High AvailabilityCloudera, Inc.
 
Apache Hadoop- Hadoop Basics.pptx
Apache Hadoop- Hadoop Basics.pptxApache Hadoop- Hadoop Basics.pptx
Apache Hadoop- Hadoop Basics.pptxMiraj Godha
 
MySQL 5.7 clustering: The developer perspective
MySQL 5.7 clustering: The developer perspectiveMySQL 5.7 clustering: The developer perspective
MySQL 5.7 clustering: The developer perspectiveUlf Wendel
 
Network Testing ques
Network Testing quesNetwork Testing ques
Network Testing quesPragya Rastogi
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hivesrikanthhadoop
 
Pilot Hadoop Towards 2500 Nodes and Cluster Redundancy
Pilot Hadoop Towards 2500 Nodes and Cluster RedundancyPilot Hadoop Towards 2500 Nodes and Cluster Redundancy
Pilot Hadoop Towards 2500 Nodes and Cluster RedundancyStuart Pook
 
LOAD BALANCING OF APPLICATIONS USING XEN HYPERVISOR
LOAD BALANCING OF APPLICATIONS  USING XEN HYPERVISORLOAD BALANCING OF APPLICATIONS  USING XEN HYPERVISOR
LOAD BALANCING OF APPLICATIONS USING XEN HYPERVISORVanika Kapoor
 
Interview questions
Interview questionsInterview questions
Interview questionsxavier john
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Chris Nauroth
 
Hadoop admin
Hadoop adminHadoop admin
Hadoop adminBalaji Rajan
 
Testing Delphix: easy data virtualization
Testing Delphix: easy data virtualizationTesting Delphix: easy data virtualization
Testing Delphix: easy data virtualizationFranck Pachot
 

Ă„hnlich wie Hdfs high availability (20)

.ppt
.ppt.ppt
.ppt
 
Hadoop
HadoopHadoop
Hadoop
 
NetApp against ransomware
NetApp against ransomwareNetApp against ransomware
NetApp against ransomware
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Practice and challenges from building IaaS
Practice and challenges from building IaaSPractice and challenges from building IaaS
Practice and challenges from building IaaS
 
Big data with hadoop Setup on Ubuntu 12.04
Big data with hadoop Setup on Ubuntu 12.04Big data with hadoop Setup on Ubuntu 12.04
Big data with hadoop Setup on Ubuntu 12.04
 
Zookeeper Introduce
Zookeeper IntroduceZookeeper Introduce
Zookeeper Introduce
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
 
Apache Hadoop- Hadoop Basics.pptx
Apache Hadoop- Hadoop Basics.pptxApache Hadoop- Hadoop Basics.pptx
Apache Hadoop- Hadoop Basics.pptx
 
Unit 1
Unit 1Unit 1
Unit 1
 
MySQL 5.7 clustering: The developer perspective
MySQL 5.7 clustering: The developer perspectiveMySQL 5.7 clustering: The developer perspective
MySQL 5.7 clustering: The developer perspective
 
Network Testing ques
Network Testing quesNetwork Testing ques
Network Testing ques
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
 
Pilot Hadoop Towards 2500 Nodes and Cluster Redundancy
Pilot Hadoop Towards 2500 Nodes and Cluster RedundancyPilot Hadoop Towards 2500 Nodes and Cluster Redundancy
Pilot Hadoop Towards 2500 Nodes and Cluster Redundancy
 
LOAD BALANCING OF APPLICATIONS USING XEN HYPERVISOR
LOAD BALANCING OF APPLICATIONS  USING XEN HYPERVISORLOAD BALANCING OF APPLICATIONS  USING XEN HYPERVISOR
LOAD BALANCING OF APPLICATIONS USING XEN HYPERVISOR
 
Interview questions
Interview questionsInterview questions
Interview questions
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
Hadoop admin
Hadoop adminHadoop admin
Hadoop admin
 
Hdfs questions answers
Hdfs questions answersHdfs questions answers
Hdfs questions answers
 
Testing Delphix: easy data virtualization
Testing Delphix: easy data virtualizationTesting Delphix: easy data virtualization
Testing Delphix: easy data virtualization
 

Mehr von Hadoop User Group

Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentationHadoop User Group
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsHadoop User Group
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availabilityHadoop User Group
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practicesHadoop User Group
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21Hadoop User Group
 
1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21Hadoop User Group
 
1 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit20101 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit2010Hadoop User Group
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduceHadoop User Group
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop User Group
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupHadoop User Group
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security PreviewHadoop User Group
 
Flightcaster Presentation Hadoop
Flightcaster  Presentation  HadoopFlightcaster  Presentation  Hadoop
Flightcaster Presentation HadoopHadoop User Group
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security PreviewHadoop User Group
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security PreviewHadoop User Group
 
Hadoop Release Plan Feb17
Hadoop Release Plan Feb17Hadoop Release Plan Feb17
Hadoop Release Plan Feb17Hadoop User Group
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709Hadoop User Group
 
Ordered Record Collection
Ordered Record CollectionOrdered Record Collection
Ordered Record CollectionHadoop User Group
 

Mehr von Hadoop User Group (20)

Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-tools
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Pig at Linkedin
Pig at LinkedinPig at Linkedin
Pig at Linkedin
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practices
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21
 
1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21
 
3 avro hug-2010-07-21
3 avro hug-2010-07-213 avro hug-2010-07-21
3 avro hug-2010-07-21
 
1 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit20101 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit2010
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User Group
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Flightcaster Presentation Hadoop
Flightcaster  Presentation  HadoopFlightcaster  Presentation  Hadoop
Flightcaster Presentation Hadoop
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Hadoop Release Plan Feb17
Hadoop Release Plan Feb17Hadoop Release Plan Feb17
Hadoop Release Plan Feb17
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709
 
Ordered Record Collection
Ordered Record CollectionOrdered Record Collection
Ordered Record Collection
 

KĂĽrzlich hochgeladen

Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 

KĂĽrzlich hochgeladen (20)

Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 

Hdfs high availability

  • 1. A High Availability story for HDFS AvatarNode Dhruba Borthakur & Dmytro Molkov dhruba@apache.org & dms@facebook.com Presented at The Hadoop User Group Meeting, Sept 29, 2010
  • 2. How infrequently does the NameNode (NN) stop?   Hadoop Software Bugs –  Two directories in fs.name.dir, but when a write to first directory failed, the NN ignored the second one (once) –  Upgrade from 0.17 to 0.18 caused data corruption (once)   Configuration errors –  Fsimage partition ran out of space (once) –  Network Load Anomalies (about 10 times)   Maintenance: –  Deploy new patches (once every month)
  • 3. What does the SecondaryNameNode do?   Periodically merges Transaction logs   Requires the same amount of memory as NN   Why is it separate from NN? –  Avoids fine-grain locking of NN data structures –  Avoids implementing copy-on-write for NN data structures   Renamed as CheckpointNode (CN) in 0.21 release.
  • 4. Shortcomings of the SecondaryNameNode?   Does not have a copy of the latest transaction log   Periodic and is not continuous –  Configured to run every hour   If the NN dies, the SecondaryNameNode does not take over the responsibilities of the NN
  • 5. BackupNode (BN)   NN streams transaction log to BackupNode   BackupNode applies log to in-memory and disk image   BN always commit to disk before success to NN   If BN restarts, it has to catch up with NN   Available in HDFS 0.21 release
  • 6. Limitations of BackupNode (BN)   Maximum of one BackupNode per NN –  Support only two-machine failure   NN does not forward block reports to BackupNode   Time to restart from 12 GB image, 70M files + 100 M blocks –  3 – 5 minutes to read the image from disk –  20 min to process block reports –  BN will still take 25+ minutes to failover!
  • 7. Overlapping Clusters for HA   “Always available for write” model   Two logical clusters each with their own NN   Each physical machine runs two instances of DataNode   Two DataNode instances share the same physical storage device   Application has logic to failover writes from one HDFS cluster to another   More details at http://hadoopblog.blogspot.com/2009/06/hdfs- scribe-integration.html
  • 8. HDFS+Zookeeper   HDFS can store transaction logs in Zookeeper/Bookeeper –  http://issues.apache.org/jira/browse/HDFS-234   Transaction log need not be stored in NFS filer   A new NN will still have to process block reports –  Not good for HA yet, because NN failover will take 30 minutes
  • 9. Our use case for High Availability   Failover should occur in less than a minute   Failovers are needed only for new software upgrades
  • 10. Challenges   DataNodes send block location information to only one NameNode   NameNode needs block locations in memory to serve clients   The in-memory metadata for 100 million files could be 60 GB, huge! DataNodes Primary NameNode Client Block location message “yes, I have blockid 123” Client retrieves block location from NameNode
  • 11. Introduction for AvatarNode   Active-Standby Pair –  Coordinated via zookeeper –  Failover in few seconds –  Wrapper over NameNode   Active AvatarNode –  Writes transaction log to filer   Standby AvatarNode –  Reads transactions from filer –  Latest metadata in memory http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html NFS Filer Active AvatarNode (NameNode) Client Standby AvatarNode (NameNode) Block location messages Client retrieves block location from Primary or Standby write transaction read transaction Block location messages DataNodes
  • 12. ZooKeeper integration for Clients   DistributedAvatarFileSystem: –  Connects to ZooKeeper to figure out who the Primary node is. There is a znode in ZooKeeper that has the current address of the primary. Clients read it on creation and during failover. –  Is aware of the failover state and pauses until it is over. If the znode is empty the cluster is failing over to the new Primary, just wait for that to finish. –  Handles failures in calls to the NameNode. If the call failed with network exception – checks for the failover in progress and retries the call after the new Primary is up.
  • 13. Four steps to failover   Wipe ZooKeeper entry. Clients will know the failover is in progress. (0 seconds)   Stop the primary namenode. Last bits of data will be flushed to Transaction Log and it will die. (Seconds)   Switch Standby to Primary. It will consume the rest of the Transaction log and get out of safemode ready to serve traffic. (Seconds)   Update the entry in ZooKeeper. All the clients waiting for failover will pick up the new connection. (0 seconds)   After: Start the first node in the Standby Mode (Takes a while, but the cluster is up and running)
  • 14. Why add ZooKeeper to the mix   Provides a clean way to execute failovers in the application layer.   A centralized control of all the clients. Gives us the ability to Pause clients until the failover is done.   A good stepping stone for future improvements needed to perform automatic failover: –  Nodes voting on who will be the primary –  DataNodes knowing who has the authority to delete blocks
  • 15. ZooKeeper is an option   Can be implemented using IP failover based on the existing infrastructure.   The clients will not know if the failover is done and it is safe to make the call again.   IP failover works well in tightly coupled system (both nodes in one rack) so the Single Point of Failure is still there (rack switch)   There is no need to run a dedicated ZooKeeper cluster.
  • 16. It is not all about failover   The Standby node has a lot of CPU that is not used   The Standby node has a full (but delayed a bit) picture of Blocks and the Namesystem.   Send all reads that can deal with stale data to Standby.   Have pluggable services run as a part of Standby node and use the metadata of the filesystem directly from the shared memory instead of querying the namenode all the time.
  • 17.   My Hadoop Blog: –  http://hadoopblog.blogspot.com/ –  http://www.facebook.com/hadoopfs