Improving HBase reliability at Pinterest with Geo-replication and Efficient Backup
August 17, 2018
Chenji Pan
Lianghong Xu
Storage & Caching, Pinterest
Content
01 HBase in Pinterest
02 Multicell
03 Backup
01 HBase in Pinterest
HBase in Pinterest
• Used HBase for online services since 2013
• Backs data abstraction layers like Zen, UMS
• ~50 HBase 1.2 clusters
• Internal repo with ZSTD, CCSMAP, bucket cache, timestamp support, etc.
02 Multicell
Why Multicell?
[User growth chart, 2011–2016]
Architecture
[Diagram: a Global Load Balancer routes traffic to the nearest cell (US-East or US-West); each cell runs a Local Load Balancer, Data Service, Cache, and DB; the DBs replicate across cells]
Master-Master
[Diagram: both cells accept write requests (key: val); each Data Service updates its local DB and invalidates/updates its local cache; the DBs replicate bidirectionally]
Master-Slave Write
[Diagram: US-West is the master cell; a write request (key: val) arriving at the US-East Data Service is forwarded to the US-West Data Service, which updates the master DB; US-East sets a remote marker in the Remote Marker Pool, and the marker is cleaned once replication reaches the local DB]
Master-Slave Read
[Diagram: a read request (key) at the US-East Data Service first checks the Remote Marker Pool; if a marker is set, it reads from the remote master DB in US-West, otherwise it reads from the local DB]
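To make the marker flow concrete, here is a minimal sketch of the slave-cell write and read paths. Assumptions (this is not Pinterest's actual code): the marker pool is Memcached accessed through pymemcache, local_db and remote_master are hypothetical client objects exposing get/put, and the TTL value and key format are illustrative.

```python
# Minimal sketch of the master-slave write/read path with a remote marker pool.
from pymemcache.client.base import Client

marker_pool = Client(("marker-pool.local", 11211))   # hypothetical Memcached endpoint
MARKER_TTL = 600  # seconds; a safety net -- markers are also cleaned when replication arrives

def write(key, val, local_db, remote_master):
    """In the slave cell: forward the write to the master cell, then mark the key
    as 'fresher remotely' until replication catches up locally."""
    remote_master.put(key, val)                        # forward to the master cell
    marker_pool.set(f"marker:{key}", b"1", expire=MARKER_TTL)

def read(key, local_db, remote_master):
    """Read locally unless a remote marker says the local copy may be stale."""
    if marker_pool.get(f"marker:{key}"):
        return remote_master.get(key)                  # cross-cell read for the latest data
    return local_db.get(key)                           # common case: local read
```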
Cache Invalidation Service
[Diagram: a write (key: val) in US-West updates the US-West DB and cache and is replicated to the US-East DB; the change is also published to Kafka, and the Cache Invalidation Service consumes the event and invalidates the stale entry in the US-East cache]
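A minimal sketch of such a consumer follows, assuming (hypothetically) that each Kafka event is JSON carrying the row key and the cache keys to clear; the topic name, event schema, and mapping logic are stand-ins for Pinterest's internal ones, and the marker key format matches the earlier write-path sketch.

```python
# Minimal sketch of a cache-invalidation consumer using kafka-python and pymemcache.
import json
from kafka import KafkaConsumer
from pymemcache.client.base import Client

cache = Client(("memcached.local", 11211))            # hypothetical cache endpoint
marker_pool = Client(("marker-pool.local", 11211))    # hypothetical marker pool

consumer = KafkaConsumer(
    "db-changes",                                      # hypothetical topic name
    bootstrap_servers=["kafka.local:9092"],
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    event = msg.value
    # The mapping from a DB change to cache entries is application-specific; here we
    # assume the producer embedded the cache keys (e.g. via MySQL comment or HBase
    # annotation) in the event itself.
    for cache_key in event.get("cache_keys", []):
        cache.delete(cache_key)                        # drop the stale entry
    row_key = event.get("row_key")
    if row_key:
        marker_pool.delete("marker:" + row_key)        # replication has arrived locally
```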
DB: Mysql, HBase
Mysql: Maxwell, Mysql Comment
Cache Invalidation Service (revisited)
[Same diagram as above: replicated writes are published to Kafka and consumed by the Cache Invalidation Service to invalidate stale cache entries]
Mysql and HBase
DB    | Kafka publisher         | Comment mechanism
Mysql | Maxwell                 | Mysql Comment
HBase | HBase replication proxy | HBase Annotations
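For illustration only, this is one way a data service could piggyback invalidation metadata on a MySQL write via a SQL comment so that a binlog tailer such as Maxwell can surface it downstream. The comment format is hypothetical, and whether the original statement (and thus the comment) is visible downstream depends on the binlog format and tailer configuration.

```python
# Illustrative sketch: embedding cache-invalidation metadata in a MySQL comment.
import json

def annotated_update(cursor, table, row_id, value, cache_keys):
    meta = json.dumps({"cache_keys": cache_keys})
    # The /* ... */ comment travels with the statement; table name is interpolated
    # for brevity -- use a vetted identifier in real code.
    sql = f"UPDATE {table} SET value = %s /* invalidation:{meta} */ WHERE id = %s"
    cursor.execute(sql, (value, row_id))
```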
HBase Replication Proxy
• Exposes the HBase replicate API
• Customized Kafka topic
• One event per mutation
• Multiple HBase clusters can share one HRP
[Diagram: HBase Cluster A and HBase Cluster B replicate to the HRP, which publishes the mutations to Kafka]
HBase Annotations
• Part of Mutate
• Written to the WAL, but not to the Memstore
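The sketch below shows the kind of per-mutation event an HRP-like proxy could publish, with annotation fields carried alongside the mutation. The JSON field names, topic, and values are hypothetical; the talk only specifies that each event corresponds to one mutation and includes the rowkey, table, operation, changes, timestamp, and annotations.

```python
# Illustrative per-mutation event, published with kafka-python.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka.local:9092"],
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

event = {
    "table": "users",                                  # hypothetical table
    "row_key": "user:12345",
    "op": "PUT",                                       # or DELETE
    "cells": [{"cf": "d", "qualifier": "name", "value": "pin"}],
    "timestamp_ms": int(time.time() * 1000),
    # Annotations: client-supplied metadata written to the WAL (not the Memstore),
    # e.g. the cache keys the invalidation service should clear.
    "annotations": {"cache_keys": ["user:12345:profile"]},
}
producer.send("hbase-changes", value=event)            # hypothetical topic
producer.flush()
```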
HBase Timestamp
• Avoid race conditions when updating the cache
[Diagram: the HBase cluster returns the mutation timestamp (TS) to the data service, which compares it with the cached TS and updates the cache only if newer]
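A minimal sketch of the timestamp guard, assuming (as the talk describes) that the modified HBase protocol returns the mutation timestamp for each write. The cache layout here is an assumption, and the check is not atomic; a production version would need something like memcached CAS.

```python
# Timestamp-guarded cache update: a write only overwrites the cache if it is newer.
import json
from pymemcache.client.base import Client

cache = Client(("memcached.local", 11211))             # hypothetical cache endpoint

def update_cache_if_newer(key, value, mutation_ts):
    cached = cache.get(key)
    if cached is not None:
        cached_ts, _ = json.loads(cached)
        if cached_ts >= mutation_ts:
            return False            # a newer write already populated the cache
    cache.set(key, json.dumps([mutation_ts, value]))   # store (timestamp, value)
    return True
```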
Replication Topology Issue
[Diagram, built up over three slides: US-East and US-West each run a master and a standby HBase cluster (M_E/S_E and M_W/S_W), connected by cross-cell replication links]
Replication Topology Issue (ZooKeeper proxy)
[Diagram: each cell adds a ZK proxy; the local master cluster updates the proxy with its region server list, and the remote cluster gets the remote master's region server list from that proxy]
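A rough sketch of the ZK-proxy idea using kazoo. Assumptions: region servers are registered under the standard /hbase/rs znode of the master cluster's ZooKeeper, and the proxy znode path is arbitrary; failover triggers, watches, and ephemeral-node details are glossed over.

```python
# Mirror the current master cluster's region server list into a ZK proxy that
# remote replication peers point at.
from kazoo.client import KazooClient

def mirror_master_regionservers(master_zk_hosts, proxy_zk_hosts):
    master_zk = KazooClient(hosts=master_zk_hosts)
    proxy_zk = KazooClient(hosts=proxy_zk_hosts)
    master_zk.start()
    proxy_zk.start()
    try:
        servers = master_zk.get_children("/hbase/rs")  # region servers of the master cluster
        proxy_zk.ensure_path("/hbase-proxy/rs")        # hypothetical proxy znode layout
        # Re-publish the list so remote peers always see the active master cluster.
        for existing in proxy_zk.get_children("/hbase-proxy/rs"):
            proxy_zk.delete("/hbase-proxy/rs/" + existing)
        for server in servers:
            proxy_zk.create("/hbase-proxy/rs/" + server)
    finally:
        master_zk.stop()
        proxy_zk.stop()
```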
Improving Backup Efficiency
01 HBase backup at Pinterest
02 Simplifying the backup pipeline
03 Offline deduplication
HBase Backup at Pinterest
HBase serves highly critical data
• Requires very high availability
• 10s of clusters with 10s of PB of data
• All of it needs to be backed up to S3
Daily backup to S3 for disaster recovery
• Snapshot + WAL for point-in-time recovery
• Weekly/monthly backups maintained according to retention policy
• Also used for offline data analysis
Legacy Backup Problem
Two-step backup pipeline
• HBase -> HDFS backup cluster
• HDFS -> S3
Problems with the HDFS backup cluster
• Infra cost as data volume increases
• Operational pain on failure
HBase 0.94 does not support direct S3 export
[Diagram: HBase clusters export snapshots and WALs to an HDFS backup cluster, which then uploads to AWS S3]
Upgrade Backup Pipeline
[Diagram: old pipeline (HBase 0.94) — HBase clusters export snapshots and WALs to an HDFS backup cluster, which uploads to AWS S3; new pipeline (HBase 1.2) — HBase clusters export snapshots and WALs directly to AWS S3, with PinDedup performing offline deduplication]
Challenge and Approach
Directly export HBase backups to S3
• Table export done using a variant of distcp
• Use the S3A client with the fast upload option
Direct S3 upload is very CPU intensive
• Large HFiles are broken down into smaller chunks
• Each chunk needs to be hashed and signed before upload
Minimize impact on prod HBase clusters
• Constrain the max number of threads and YARN containers per host
• Max CPU overhead during backup < 30%
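A sketch of what the direct export step could look like. The talk says the export uses a variant of distcp; here the stock ExportSnapshot tool stands in for it, invoked with the S3A fast-upload option and a capped mapper count to limit CPU impact on the production cluster. Flag and property names are from HBase 1.x / Hadoop 2.x and should be verified against your versions; the limits are placeholders.

```python
# Shell out to HBase's ExportSnapshot with constrained parallelism.
import subprocess

def export_snapshot_to_s3(snapshot, bucket, prefix, mappers=8, bandwidth_mb=50):
    cmd = [
        "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-Dfs.s3a.fast.upload=true",                   # S3A fast upload option
        "-snapshot", snapshot,
        "-copy-to", f"s3a://{bucket}/{prefix}",
        "-mappers", str(mappers),                      # cap concurrent YARN containers
        "-bandwidth", str(bandwidth_mb),               # MB/s per mapper
    ]
    subprocess.run(cmd, check=True)
```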
Offline HFile Deduplication
HBase backups contain many duplicates
Observation: large HFiles rarely change
• They account for most storage usage
• They are only merged during major compaction
• For read-heavy clusters, much redundancy across backup cycles
PinDedup: offline S3 deduplication tool
• Asynchronously checks for duplicate S3 files
• Replaces old files with references to new ones
[Diagram: Day 1 backup of region rs1 holds F1 (10GB), F2 (500MB), F3 (30MB); Day 2 holds F1 (10GB, unchanged), F4 (400MB), F5 (80MB) — the largest file is usually unchanged]
PinDedup Approach
[Diagram: Day 1 backup at Source s3://bucket/dir/dt=dt1 (rs1/F1 10GB, rs1/F2 500MB, rs1/F3 30MB) vs. Day 2 backup at Target s3://bucket/dir/dt=dt2 (rs1/F1 10GB, rs1/F4 400MB, rs1/F5 80MB); rs1/F1 has the same file name and checksum on both days]
Dedup candidates
• Only checks HFiles in the same region across adjacent dates
• Declares duplicates when both the filename and md5sum match
• No need for a large on-disk dedup index; very fast lookup
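The following is a minimal sketch of this per-region, name-plus-checksum duplicate detection between two backup dates on S3. PinDedup itself is internal to Pinterest, so the path layout, the reference-object format, and the use of single-part ETags as MD5 stand-ins are assumptions here.

```python
# PinDedup-style duplicate detection between two adjacent backup dates on S3.
import boto3

s3 = boto3.client("s3")

def list_files(bucket, prefix):
    """Map 'region/filename' -> (etag, size) under one backup-date prefix."""
    files = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            rel = obj["Key"][len(prefix):].lstrip("/")
            files[rel] = (obj["ETag"], obj["Size"])
    return files

def dedup(bucket, old_prefix, new_prefix):
    old_files = list_files(bucket, old_prefix)
    new_files = list_files(bucket, new_prefix)
    for rel, sig in old_files.items():
        # Duplicate if the same region/filename exists on the new date with the same
        # checksum (an ETag equals the MD5 only for single-part uploads).
        if new_files.get(rel) == sig:
            target = f"{new_prefix.rstrip('/')}/{rel}"
            ref = f"REF {target}\n".encode()           # hypothetical reference format
            old_key = f"{old_prefix.rstrip('/')}/{rel}"
            # Replace the old full copy with a tiny pointer to the newer file.
            s3.put_object(Bucket=bucket, Key=old_key, Body=ref)
```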
Design Choices
• File encoding
• File- vs. chunk-level deduplication
• Online vs. offline deduplication
File- vs. Chunk-level Dedup
More fine-grained duplicate detection? -> chunk-level dedup
Only marginal benefits
• Rabin fingerprint chunking, 4K average chunk size
• Increased implementation complexity
• During compaction, merged changes are spread across the entire file
Lessons
• File-level dedup is good enough
• Make major compaction less aggressive to keep the largest files unchanged
[Diagram: before compaction, a large HFile and a small HFile; after compaction, they are merged into a single new file]
Online vs. Offline Dedup
Online dedup [diagram: PinDedup fetches file checksums from S3, compares locally, and only non-duplicate files are uploaded by the HBase cluster]
• Reduces data transfer to S3
Offline dedup [diagram: the HBase cluster uploads all files to S3; PinDedup deduplicates them asynchronously]
• More control over when dedup occurs
• Isolates backup and dedup failures
File encoding
Dedup the old or the new file?
Intuition: keep the old file, dedup the new one
• Pros: one-step decoding
• Cons: dangling file pointers when old files are deleted; e.g., when F1 is garbage collected, F2' and F3' become inaccessible
Design choice: keep the new file, dedup the old one
• No overhead accessing the latest copy (most use cases)
• Avoids the dangling pointer problem
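Under this encoding, reading back an older, deduplicated file means following forward references until a full copy is reached. A sketch, reusing the hypothetical "REF <key>" format from the earlier dedup sketch:

```python
# Resolve a possibly-deduplicated backup file by following forward references.
import boto3

s3 = boto3.client("s3")

def fetch_backup_file(bucket, key, max_hops=30):
    for _ in range(max_hops):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        if not body.startswith(b"REF "):
            return body                                # reached the newest full copy
        key = body[len(b"REF "):].strip().decode()     # follow the forward reference
    raise RuntimeError("reference chain too long: " + key)
```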
Results
• Significantly reduced infra cost
• Reduced backup end-to-end time by 50%
• 3-137X compression of S3 storage usage
• Lower operational overhead
Thanks
Editor's Notes

  1. Hello everyone, my name is Chenji and this is Lianghong. We're from Pinterest's storage and caching team. Today we're going to present our work over the past year, which mainly focuses on multicell and efficient backup for HBase.
  2. First, I'll go through how HBase is used at Pinterest. Then I'll talk about our multicell work for HBase. Since Pinterest uses AWS, you can think of a cell as a region or a data center in AWS. After that, Lianghong will present the HBase backup efficiency work.
  3. We started using HBase for online services in 2013. It is used as the backend storage engine for data abstraction layers like Zen and UMS. Zen is like Facebook's TAO and deals with graph-based data, and UMS is our key-value data abstraction service. Currently we have around 50 HBase clusters running version 1.2. Our internal build is based on 1.2 but supports more features like ZSTD, CCSMAP, and off-heap bucket cache. CCSMAP is a GC-friendly skip-list map published by Alibaba. We also changed the HBase protocol to return the timestamp for any mutation operation.
  4. So, multicell. Why multicell? In the past few years Pinterest has been investing in internationalization, and more than half of our active users are now outside the US. To provide a more reliable, lower-latency service, we decided to explore a multicell solution for our infrastructure.
  5. Here is the basic architecture of our stack in a multicell environment. We have a global load balancer managed by our traffic team, which forwards traffic to the nearest cell. Each cell has a similar, mirrored stack containing a local load balancer, frontend and backend services, data services, cache, and a database. The data service calls the DB or the cache to read or write data, and the source-of-truth database replicates data to the remote DB. No cross-cell traffic is allowed except at the data service and DB layers.
  6. We provide two patterns for different consistency levels at the table level, so you can mark your use case as master-master or master-slave. Master-master, sometimes called "active-active", means that both cells can take write traffic. You can see from the diagram that, in each cell, a write request is executed against the local DB and the changes are synced by a bidirectional replication flow. This pattern is mainly used for cases that do not have strong consistency requirements and where data conflicts are unlikely, which is true for most of our clients' use cases.
  7. For cases that need stronger consistency and must avoid conflicts, for example competing for a primary key such as email or username at sign-up, the other pattern we provide is master-slave. Here, if west is the master cell, only the DB on the west side can take write traffic. If a write request is sent to the east data service, it forwards the write to its remote peer, and the remote peer updates the master DB. One-way replication syncs the data between the DBs in the two cells. Here we introduce another concept, the remote marker: the data service on the east side sets a remote marker after getting a response from the remote peer. A remote marker means that the related data is out of date in the local DB and reads need to go to the remote cell for the latest data. The marker is cleaned as soon as the data has been replicated to the local DB.
  8. For a read request, the data service checks the remote marker and, depending on the result, decides whether to forward the read traffic or read the data from the local DB. We set up a Memcached cluster, large enough and with replicas, to be used as the remote marker pool. The remote marker is set with a TTL, so it is cleaned after expiration. But it is pretty hard to tune a static TTL to an appropriate value: a longer TTL leads to more cross-cell traffic, which means higher latency and cost, while a shorter TTL may cause markers to be cleaned before replication has arrived. So we introduced a new system to clean the marker as soon as the replication arrives.
  9. That is the cache invalidation service. Besides cleaning remote markers, cache invalidation also deals with another consistency issue we met in our multicell environment: the one between cache and DB. In both the master-master and master-slave patterns, there are always cases where the write traffic is handled in the remote cell, and since our cache invalidation logic lives in the data service, we would not be able to clean the out-of-date cache entry if the write happened in another cell. To tackle this, we built a database change system with Kafka and the cache invalidation service. All database changes are published to Kafka, and the cache invalidation service consumes the events and infers the out-of-date cache entries based on customized mapping logic and the Kafka event. So once the local DB has the latest data via the replication flow, the out-of-date entries are also invalidated from the cache. At the same time, remote markers are cleaned by the cache invalidation service.
  10. So far we have talked about our multicell architecture in general but have not touched on how it integrates with different kinds of databases. In the following parts, I'll go through how it works with MySQL and HBase, which are the two major databases we use for online services.
  11. Actually, when we designed the architecture, it was mainly based on MySQL. MySQL is very friendly to this multicell solution because Facebook has explored a similar idea with MySQL, and we could simply adopt open-source projects like Maxwell and MySQL features like comments.
  12. Here, the database needs to publish its changes as Kafka events with some customized information for the consumer. Maxwell is an open-source project published by Zendesk that reads the binlog and writes row updates to a queue system like Kafka. MySQL comments are a feature that allows clients to add customized info to the SQL query, which becomes part of the binlog entry. So in the architecture described above, the cache invalidation service consumes the database change event and uses the customized info in the event to infer the cache entries.
  13. To fit HBase into our multicell architecture, we developed counterparts to Maxwell and MySQL comments, which we call the HBase replication proxy and HBase annotations. The HBase replication proxy publishes HBase changes to Kafka, and HBase annotations allow clients to add customized info to an HBase mutate request.
  14. The HBase replication proxy works as a fake HBase cluster, but instead of writing data to the WAL and Memstore, the service publishes the replication request to Kafka. The proxy exposes the HBase replicate API, and multiple HBase clusters can share the same proxy as long as we set up the replication peers. The HBase replication proxy supports customized Kafka topics, and each Kafka event corresponds to a mutation in HBase.
  15. We also changed the HBase protocol to support HBase annotations, which allow customized info in an HBase mutate request. The annotation is part of the Mutate and works as a map of byte arrays. Like a MySQL comment, it is only written to the WAL, not the Memstore. Here is an example HBase Kafka event: you can see the rowkey, table, operation, delta changes, and timestamp. The fields in the red circle come from annotations. The HBase replication proxy converts the annotations into part of the Kafka event.
  16. Another thing we did specifically for HBase is the timestamp. At Pinterest, services backed by HBase usually have a higher write rate, and it is likely that race conditions happen when the data service tries to update the cache. We modified the HBase protocol so that HBase returns the timestamp for a mutate request, and we update the cache based on that timestamp.
  17. The last issue we met in the multicell environment is the replication topology. We sometimes need to do operations on one HBase cluster, and doing a global failover by rerouting traffic at the load balancer is too expensive. So in each cell we keep two HBase clusters. Suppose we set up 4 replication links like this:
  18. If one cluster is down, replication still works. But if each cell has one cluster in trouble, the replication queue is blocked.
  19. To make sure the replication queue can survive two clusters being in trouble, we would have to set up 4-choose-2, which is 6 replication links. But this is pretty heavy, since each request would be replicated to the same cluster up to 3 times, wasting a lot of hardware resources and cross-cell traffic.
  20. To solve the issue, we set up a ZooKeeper proxy in each cell. In each cell, whichever of the two clusters is the master registers its region server set with the ZooKeeper proxy. Whenever we fail over, the ZK proxy is updated with the new master's server list. For inter-cell replication, the local master cluster enables a replication peer pointing at the remote ZK proxy. We're still testing this solution and may have more results in the future. Next, Lianghong will talk about how we improved our HBase backup process at Pinterest.
  21. Thanks, CJ. Multicell lets Pinterest's infrastructure tolerate the failure of an entire cell. In addition to that, at Pinterest we use backups to enhance the availability of our critical data. While backup is a common practice in industry, in this talk I'll present how our backup pipeline has evolved over the years and how we were able to dramatically improve HBase backup efficiency.
  22. As CJ mentioned, HBase is used by both online and offline services and serves highly critical data. We have 10s of clusters containing 10s of petabytes of data, and all of it needs to be backed up to S3 on a daily basis. We do a combination of full and incremental backups; specifically, we back up daily snapshots as well as write-ahead logs for point-in-time recovery. For write-heavy clusters the WAL size can be large, but in our case the majority of backup data is taken up by the full daily backups. For garbage collection, we maintain weekly and monthly backups and discard sufficiently old ones. These backups are important in that they not only provide a disaster recovery mechanism but also allow offline jobs to analyze the HBase dumps.
  23. Before we dive in, I want to note that for historical reasons Pinterest used HBase version 0.94 until recently, when we did a version upgrade. When we first built the backup pipeline, there were no existing tools to directly export HBase snapshots to S3; the only supported method was to export snapshots to an HDFS cluster. As a result, our original backup pipeline consisted of two steps: exporting HBase table snapshots and write-ahead logs (WALs) to a dedicated backup HDFS cluster, then uploading the data from the backup cluster to S3. However, as the amount of data (on the order of PBs) grew over time, the storage cost of S3 and of the backup cluster kept increasing. It also incurred high operational overhead for us, since whenever the HDFS cluster was in trouble, our backup pipeline was broken.
  24. Recently we completed an HBase upgrade from version 0.94 to 1.2. Along with numerous bug fixes and performance improvements, the new version of HBase comes with native support for directly exporting table snapshots to S3. Taking this opportunity, we optimized our backup pipeline by removing the HDFS cluster from the backup path. In addition, we created a tool called PinDedup, which asynchronously deduplicates redundant snapshot files to reduce our S3 footprint. We will talk about it later.
  25. One major challenge we encountered in the migration was minimizing its impact on production HBase clusters, since they serve online requests. Table export is done using a MapReduce job similar to distcp. To increase the upload throughput, we use the S3A client with the fast upload option. During the experiments, we observed that direct S3 upload tends to be very CPU-intensive, especially for transferring large files such as HFiles. This happens because a large file is broken down into multiple chunks, each of which needs to be hashed and signed before being uploaded. If we use more threads than the number of cores on the machine, the regionserver performing the upload becomes saturated and could crash. To mitigate this problem, we constrain the maximum number of concurrent threads and YARN containers per host, so that the maximum CPU overhead caused by backup stays under 30 percent.
  26. The idea of deduplicating HBase snapshots is inspired by the observation that large HFiles often remain unchanged across backup cycles. While incremental updates are merged by minor compactions, the large HFiles that account for most storage usage are only merged during a major compaction. As a result, adjacent backup dates usually contain many duplicate large HFiles, especially for read-heavy HBase clusters. As you can see from the graph on the right, the largest file F1 remains the same in the Day 1 and Day 2 backups, even though the smaller files may have changed due to minor compactions. Based on this observation, we designed and implemented a simple file-level deduplication tool called PinDedup. It asynchronously checks for duplicate S3 files across adjacent backup cycles and replaces older files with references.
  27. Let me briefly explain how PinDedup works. It's simple yet very effective at removing duplicate backup files. It takes two inputs: the S3 locations of the backup data for two adjacent dates. It traverses the directory hierarchy, determines the set of HFiles for each region, and compares the HFiles of each region across the two backup dates. In this example, say region rs1 has three files, F1, F2, and F3, when the first backup occurs. On the next day, F2 and F3 have changed, probably due to minor compactions, resulting in two different files F4 and F5. However, if major compaction didn't occur, the largest file F1 remains the same. As a result, simply by identifying the largest duplicate file, we were able to reclaim a lot of space. PinDedup declares two files identical when their names and hashes match. It is very simple, since the comparison is done on a per-region basis; there is no need for an on-disk dedup index, and duplicate detection is very fast.
  28. Despite the simplicity of PinDedup, there were several key design choices we had to make. We will mainly talk about three: file- vs. chunk-level deduplication, online vs. offline deduplication, and how we encode deduplicated files.
  29. Whole-file deduplication has given us a good compression ratio. We tried to take a step further and see how much more compression we could get. The hypothesis was that a more fine-grained dedup technique, such as variable-size chunk-level dedup, should save more space. We actually implemented chunk-level dedup in PinDedup: it computes Rabin fingerprints with a 4K average chunk size and indexes the chunk hashes. The result turned out to be a bit surprising: chunk-level dedup only brought marginal benefits, and we ended up not using it in production. We looked into this and found the reason: during compaction, although the changes to be merged can be small, they are spread all over the compacted file. This changes the content of most chunks, making chunk-level dedup ineffective. To conclude, we learned two lessons in this process: first, file-level dedup is good enough for HBase backups; second, we tune major compaction to be less aggressive and triggered only when necessary, so that the largest files stay unmodified across backup cycles.
  30. Another design choice is online vs. offline deduplication. The graph on the left shows the process of online dedup: PinDedup fetches file checksums from S3, does a local comparison, and only transfers non-duplicate files to S3. This could potentially reduce the S3 transfer time. The alternative is offline dedup, shown on the right, where all backup files are first transferred to S3 and deduplication is done asynchronously. While online dedup seems more efficient, we eventually chose offline deduplication because it lets us control when deduplication occurs. Since client teams often use the latest snapshots for offline analysis, we can delay the deduplication until the analysis jobs are finished. Doing so also separates the backup and dedup pipelines, so that a dedup failure doesn't cause backup jobs to fail, and it's easier for us to identify problems.
  31. After identifying two duplicate files, one important question is whether to replace the older or the newer file with a reference. We chose the former, because the latest files are much more likely to be accessed. Let me argue why we made this choice. Suppose we replaced the newer files with references, which we call a "backward dedup chain". This is actually the more intuitive way to encode files, since you don't rewrite old data, and it has the nice property that accessing a deduplicated file needs only one decoding step to recover the file. However, it causes a dangling-pointer problem when old files are deleted: e.g., when F1 is deleted due to the retention policy, both F2 and F3 become unrecoverable. So we chose the other approach. The key idea is to keep the latest file unchanged, since it is the most likely to be accessed: there is no decoding overhead to read the latest copy, and it avoids the dangling pointer problem. The tradeoff is that recovering an older file may require multiple decoding steps.
  32. By upgrading the backup pipeline, we were able to reduce the e2e backup time by half. We obtained up to 2 orders of magnitude compression by use of deduplication. These two combined led to significantly reduced infra cost and lower operational overhead.