HBase at Xiaomi
Liang Xie / Honghua Feng
{xieliang, fenghonghua}@xiaomi.com
About Us
Liang Xie
Honghua Feng
Outline
• Introduction
• Latency practice
• Some patches we contributed
• Some ongoing patches
• Q&A
About Xiaomi
• Mobile internet company founded in 2010
• Sold 18.7 million phones in 2013
• Over $5 billion revenue in 2013
• Sold 11 million phones in Q1 2014
Hardware
Software
Internet Services
About Our HBase Team
• Founded in October 2012
• 5 members
    • Liang Xie
    • Shaohui Liu
    • Jianwei Cui
    • Liangliang He
    • Honghua Feng
• Resolved 130+ JIRAs so far
Our Clusters and Scenarios
• 15 clusters : 9 online / 2 processing / 4 test
• Scenarios
    • MiCloud
    • MiPush
    • MiTalk
    • Perf Counter
Our Latency Pain Points
• Java GC
• Stable page write in the OS layer
• Slow buffered IO (FS journal IO)
• Read/write IO contention
HBase GC Practice
• Bucket cache in off-heap mode
• Xmn / SurvivorRatio / MaxTenuringThreshold
• PretenureSizeThreshold & replication source size
• GC concurrent thread number
GC time per day : [2500, 3000]s -> [300, 600]s !!!
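For concreteness, this is the shape of the RegionServer JVM options being discussed; a hypothetical hbase-env.sh sketch with illustrative values, not Xiaomi's actual settings (the deck gives no numbers):

    # Illustrative values only -- the deck does not give the exact numbers.
    # -Xmn sizes the young generation; SurvivorRatio and MaxTenuringThreshold
    # control how quickly objects tenure; PretenureSizeThreshold pushes large
    # allocations (e.g. replication source buffers) straight into the old gen;
    # ConcGCThreads is the "GC concurrent thread number" above.
    export HBASE_REGIONSERVER_OPTS="-Xms16g -Xmx16g -Xmn2g \
      -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=3 \
      -XX:PretenureSizeThreshold=2m \
      -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=8"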
Write Latency Spikes
HBase client put
->HRegion.batchMutate
->HLog.sync
->SequenceFileLogWriter.sync
->DFSOutputStream.flushOrSync
->DFSOutputStream.waitForAckedSeqno <- Stuck here often!
===================================================
DataNode pipeline write, in BlockReceiver.receivePacket() :
->receiveNextPacket
->mirrorPacketTo(mirrorOut) // write the packet to the mirror
->out.write/flush // write data to the local disk <- buffered IO
Added instrumentation (HDFS-6110) showed the stalled write() was the culprit; strace results confirmed it.
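One way to reproduce the strace observation is to time write(2) syscalls in the DataNode process; a sketch, not necessarily the exact command the team used:

    # -f follows threads, -T prints the time spent inside each syscall,
    # -e trace=write restricts output to write(2); stalls show up as
    # writes taking hundreds of milliseconds.
    strace -f -T -e trace=write -p <datanode-pid>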
Root Cause of Write Latency Spikes
• write() is expected to be fast
• But it is sometimes blocked by page write-back ("stable page write")!
Stable Page Write Issue Workaround
Downgrade or upgrade the kernel:
2.6.32-279 (RHEL 6.3) -> 2.6.32-220 (RHEL 6.2)
or
2.6.32-279 (RHEL 6.3) -> 2.6.32-358 (RHEL 6.4)
Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!
Root Cause of Write Latency Spikes (cont.)
A typical stalled write(2), blocked in the ext4/jbd2 journal path:
...
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]
XFS on a recent kernel can relieve the journal IO blocking issue and is friendlier to metadata-heavy scenarios like HBase + HDFS.
Write Latency Spikes Testing
8 YCSB threads; 20 million rows written, each 3*200 bytes; 3 DataNodes; kernel 3.12.17
Counted the stalled write() calls that cost > 100ms
The largest write() latency on ext4 : ~600ms !
Hedged Read (HDFS-5776)
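HDFS-5776 lets the DFSClient fire a second, "hedged" read against a different replica when the first read is slow, taking whichever returns first. A minimal client-side configuration sketch (these two property names come with the feature; the values here are illustrative):

    <!-- hbase-site.xml / hdfs-site.xml on the client side -->
    <property>
      <name>dfs.client.hedged.read.threadpool.size</name>
      <value>20</value>   <!-- a value > 0 enables hedged reads -->
    </property>
    <property>
      <name>dfs.client.hedged.read.threshold.millis</name>
      <value>50</value>   <!-- how long to wait before hedging to another replica -->
    </property>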
Other Meaningful Latency Work
• Long first "put" issue (HBASE-10010)
• Token invalid (HDFS-5637)
• Retry/timeout settings in DFSClient
• Reduce write traffic? (HLog compression)
• HDFS IO priority (HADOOP-10410)
Wish List
• Real-time HDFS, esp. priority related
• GC-friendly core data structures
• More off-heap; Shenandoah GC
• TCP/disk IO characteristic analysis
Need more eyes on the OS
Stay tuned…
Some Patches Xiaomi Contributed
• New write thread model (HBASE-8755)
• Reverse scan (HBASE-4811)
• Per table/CF replication (HBASE-8751)
• Block index key optimization (HBASE-7845)
1. New Write Thread Model
Old model (diagram): 256 WriteHandler threads, each one appending to the local buffer, writing to HDFS, and syncing to HDFS by itself.
Problem : WriteHandler does everything, severe lock race!
New Write Thread Model
New model (thread counts from the diagram): the work is split across dedicated threads:
    256 WriteHandlers : append to the local buffer only
    1 AsyncWriter : write to HDFS
    AsyncSyncers (4 in the diagram) : sync to HDFS
    1 AsyncNotifier : notify writers
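A minimal, self-contained Java sketch of the idea (a toy model, not the actual HBASE-8755 code): handlers only append to a shared buffer and wait; a single writer thread drains the buffer to the log, a syncer thread makes it durable and advances the synced sequence number, and waiting handlers are notified (here the syncer also plays the AsyncNotifier role for brevity):

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicLong;

    /** Toy model of the HBASE-8755 write pipeline shape. */
    public class WalPipeline {
        private final BlockingQueue<Long> buffer = new LinkedBlockingQueue<>();   // "local buffer"
        private final BlockingQueue<Long> written = new LinkedBlockingQueue<>();  // writer -> syncer
        private final AtomicLong seqGen = new AtomicLong();
        private volatile long syncedSeq = 0;
        private final Object syncedCond = new Object();

        /** Called by the many WriteHandler threads: append an edit, wait until synced. */
        public void append() throws InterruptedException {
            long seq = seqGen.incrementAndGet();
            buffer.put(seq);                                // handlers only touch the buffer
            synchronized (syncedCond) {
                while (syncedSeq < seq) syncedCond.wait();  // woken once the edit is durable
            }
        }

        public void start() {
            Thread asyncWriter = new Thread(() -> {   // AsyncWriter: drain buffer, "write to HDFS"
                try {
                    while (true) {
                        long seq = buffer.take();
                        // the real code batches edits here and calls out.write(...)
                        written.put(seq);
                    }
                } catch (InterruptedException ignored) { }
            });
            Thread asyncSyncer = new Thread(() -> {   // AsyncSyncer: "sync to HDFS"
                try {
                    while (true) {
                        long seq = written.take();
                        // the real code calls sync()/hflush() here; 8755 runs several syncers
                        syncedSeq = Math.max(syncedSeq, seq);
                        synchronized (syncedCond) { syncedCond.notifyAll(); } // AsyncNotifier's job
                    }
                } catch (InterruptedException ignored) { }
            });
            asyncWriter.setDaemon(true); asyncSyncer.setDaemon(true);
            asyncWriter.start(); asyncSyncer.start();
        }

        public static void main(String[] args) throws Exception {
            WalPipeline wal = new WalPipeline();
            wal.start();
            wal.append();                        // stand-in for one handler's put
            System.out.println("edit is durable");
        }
    }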
New Write Thread Model
• Low load : no improvement
• Heavy load : huge improvement (3.5x)
2. Reverse Scan
Three sorted sources (the diagram's columns, e.g. HFile/memstore scanners):
    Scanner 1: Row2 kv2, Row3 kv1, Row3 kv3, Row4 kv2, Row4 kv5, Row5 kv2
    Scanner 2: Row1 kv2, Row3 kv2, Row3 kv4, Row4 kv4, Row4 kv6, Row5 kv3
    Scanner 3: Row1 kv1, Row2 kv1, Row2 kv3, Row4 kv1, Row4 kv3, Row6 kv1
1. All scanners seek to 'previous' rows (SeekBefore)
2. Figure out the next row : the max 'previous' row
3. All scanners seek to the first KV of the next row (SeekTo)
Performance : 70% of forward scan
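A toy Java sketch of steps 1 and 2 over in-memory sorted lists standing in for the scanners (not the HBASE-4811 code; it scans the lists linearly where the real patch seeks):

    import java.util.*;

    /** Toy reverse scan over several sorted (row, kv) lists. */
    public class ReverseScanSketch {
        /** One backward step: given the row just emitted, find the next (smaller) row. */
        static String nextRowBefore(List<List<String>> scanners, String currentRow) {
            String next = null;
            for (List<String> kvs : scanners) {
                // Step 1 (SeekBefore): last row in this scanner strictly before currentRow.
                for (String kv : kvs) {
                    String row = kv.split(" ")[0];
                    if (row.compareTo(currentRow) < 0 && (next == null || row.compareTo(next) > 0)) {
                        next = row;   // Step 2: next row = max of the 'previous' rows
                    }
                }
            }
            return next;  // Step 3 would SeekTo the first KV of this row in every scanner
        }

        public static void main(String[] args) {
            List<List<String>> scanners = Arrays.asList(
                Arrays.asList("Row2 kv2", "Row3 kv1", "Row4 kv2"),
                Arrays.asList("Row1 kv2", "Row3 kv2", "Row4 kv4"),
                Arrays.asList("Row1 kv1", "Row2 kv1", "Row6 kv1"));
            for (String row = nextRowBefore(scanners, "Row9"); row != null;
                 row = nextRowBefore(scanners, row)) {
                System.out.println(row);   // Row6, Row4, Row3, Row2, Row1
            }
        }
    }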
3. Per Table/CF Replication
Setup (diagram): a source cluster with T1 : cfA, cfB and T2 : cfX, cfY replicates to PeerA (a full backup) and to PeerB, which only needs T2:cfX.
• PeerB creates T2 only : replication can't work!
• PeerB creates T1 & T2 : all data (T1:cfA,cfB; T2:cfX,cfY) is replicated!
Need a way to specify which data to replicate!
Per Table/CF Replication
• add_peer 'PeerA', 'PeerA_ZK'
• add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'
With the new syntax (diagram): the source still replicates everything (T1:cfA,cfB; T2:cfX,cfY) to PeerA, but only T2:cfX to PeerB.
4. Block Index Key Optimization
Two adjacent blocks (diagram): the last key of Block 1 is k1 : "ab", the first key of Block 2 is k2 : "ah, hello world".
Before : 'Block 2' block index key = "ah, hello world/…"
Now : 'Block 2' block index key = "ac/…" (a fake key with k1 < key <= k2)
• Reduces block index size
• Saves seeking the previous block if the search key is in ['ac', 'ah, hello world']
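The fake key is essentially a shortest separator between the two boundary keys, similar in spirit to LevelDB's FindShortestSeparator; a byte-wise Java sketch (a hypothetical helper, not the HBASE-7845 code):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class FakeIndexKey {
        /** Shortest byte array s with prev < s <= next (lexicographic order). */
        static byte[] shortestSeparator(byte[] prev, byte[] next) {
            int common = 0;
            while (common < prev.length && common < next.length && prev[common] == next[common]) {
                common++;
            }
            // Keep the shared prefix plus one byte of prev, bumped by one.
            if (common < prev.length && (prev[common] & 0xff) + 1 < (next[common] & 0xff)) {
                byte[] sep = Arrays.copyOf(prev, common + 1);
                sep[common]++;
                return sep;
            }
            return next;  // fall back to the full next key
        }

        public static void main(String[] args) {
            byte[] sep = shortestSeparator("ab".getBytes(StandardCharsets.UTF_8),
                                           "ah, hello world".getBytes(StandardCharsets.UTF_8));
            System.out.println(new String(sep, StandardCharsets.UTF_8));  // prints "ac"
        }
    }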
Some Ongoing Patches
• Cross-table cross-row transaction (HBASE-10999)
• HLog compactor (HBASE-9873)
• Adjusted delete semantic (HBASE-8721)
• Coordinated compaction (HBASE-9528)
• Quorum master (HBASE-10296)
1. Cross-Row Transaction : Themis
http://github.com/xiaomi/themis
• Google Percolator : "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
• Two-phase commit : strong cross-table/cross-row consistency
• Global timestamp server : globally, strictly incremental timestamps
• No touching of HBase internals : based on the HBase client and coprocessors
• Throughput : read 90%, write 23% of raw HBase (the same downgrade as Google Percolator)
• More details : HBASE-10999
2. HLog Compactor
Setup (diagram): the memstores of Region 1 … Region x flush to HFiles, while HLogs 1, 2, 3 stay active.
Region x : few writes, but they scatter across many HLogs, pinning all of those HLogs.
PeriodicMemstoreFlusher : flush old memstores forcefully
• 'flushCheckInterval' / 'flushPerChanges' : hard to configure
• Results in 'tiny' HFiles
• HBASE-10499 : a problematic region can't be flushed at all!
HLog Compactor
• Compact : HLog 1,2,3,4 -> HLog x (keep only entries still in some memstore)
• Archive : HLog 1,2,3,4
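A toy sketch of the rewrite loop as described in the editor's notes below (illustrative code, not the HBASE-9873 patch):

    import java.util.*;

    /** Toy HLog compactor: copy entries that are still only in a memstore
     *  into a new HLog, so all the old HLogs can be archived. */
    public class HLogCompactorSketch {
        static class Entry {
            final String region; final long seqId;
            Entry(String region, long seqId) { this.region = region; this.seqId = seqId; }
        }

        /** flushedUpTo: per region, the highest seqId already flushed to HFiles. */
        static List<Entry> compact(List<List<Entry>> activeHLogs, Map<String, Long> flushedUpTo) {
            List<Entry> newHLog = new ArrayList<>();
            for (List<Entry> hlog : activeHLogs) {
                for (Entry e : hlog) {
                    long flushed = flushedUpTo.getOrDefault(e.region, 0L);
                    if (e.seqId > flushed) {
                        newHLog.add(e);   // still only in the memstore: keep it
                    }                     // else: already in an HFile, drop it
                }
            }
            // All old HLogs can now be archived without forcing any memstore flush.
            return newHLog;
        }
    }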
3. Adjusted Delete Semantic
Scenario 1:
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Write kvA at t0 again
4. Read kvA
Result : kvA can't be read out
Scenario 2:
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result : kvA can be read out
Fix : "a delete can't mask KVs with a larger mvcc (put later)"
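Scenario 1 as a client-side sketch against the 0.94/0.98-era HBase API (table and family names are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeleteSemanticRepro {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            byte[] row = Bytes.toBytes("r"), fam = Bytes.toBytes("f"),
                   qual = Bytes.toBytes("q"), val = Bytes.toBytes("v");
            long t0 = 1000L;
            HTable table = new HTable(conf, "t");
            HBaseAdmin admin = new HBaseAdmin(conf);

            Put put = new Put(row);
            put.add(fam, qual, t0, val);              // 1. write kvA at t0
            table.put(put);

            Delete del = new Delete(row);
            del.deleteColumn(fam, qual, t0);          // 2. delete kvA at t0...
            table.delete(del);
            admin.flush("t");                         //    ...and flush to an HFile
                                                      //    (flush is async; a real repro would wait)

            table.put(put);                           // 3. write kvA at t0 again

            Result r = table.get(new Get(row));       // 4. read kvA
            // prints "masked!" under the pre-HBASE-8721 semantics
            System.out.println(r.isEmpty() ? "masked!" : "visible");
            table.close(); admin.close();
        }
    }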
4. Coordinated Compaction
Diagram: every RS compacts against the shared HDFS whenever it likes -> compact storm!
• Compaction uses a global resource (HDFS), while whether to compact is decided locally!
Coordinated Compaction
Diagram: each RS asks the Master "Can I?" before compacting; the Master answers OK or NO based on global load.
• Compaction is scheduled by the master : no compact storms any longer
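A toy master-side admission check using a semaphore over concurrent compaction slots (illustrative only; the actual HBASE-9528 protocol is RPC-based and load-aware):

    import java.util.concurrent.Semaphore;

    /** Toy compaction coordinator: the master grants at most N concurrent compactions. */
    public class CompactionCoordinator {
        private final Semaphore slots;

        public CompactionCoordinator(int maxConcurrentCompactions) {
            this.slots = new Semaphore(maxConcurrentCompactions);
        }

        /** An RS asks "Can I?"; true = OK, false = NO, ask again later. */
        public boolean requestCompaction() {
            return slots.tryAcquire();
        }

        /** The RS reports the compaction finished, freeing a slot. */
        public void compactionFinished() {
            slots.release();
        }
    }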
5. Quorum Master
Today (diagram): an active and a standby master, with regionservers and a 3-node ZooKeeper ensemble holding the shared info/states; on failover the new active master reads the info/states back from ZooKeeper.
• While the active master serves, the standby master stays 'really' idle
• When the standby master becomes active, it needs to rebuild the in-memory status
Quorum Master
Proposed (diagram): three master instances replicate the in-memory status among themselves; regionservers talk to whichever is active.
• Better master failover perf : no phase to rebuild the in-memory status
• No external (ZooKeeper) dependency
• No potential consistency issue
• Simpler deployment
• Better restart perf for a BIG cluster (10K+ regions)
Acknowledgement
Hangjun Ye, Zesheng Wu, Peng Zhang
Xing Yong, Hao Huang, Hailei Li
Shaohui Liu, Jianwei Cui, Liangliang He
Dihao Chen
Thank You!
xieliang@xiaomi.com
fenghonghua@xiaomi.com
Editor's Notes

1. This is the throughput comparison against a single regionserver: when the write load is low there is almost no improvement, but as the write load gets heavier and heavier the improvement is pretty amazing, 3.5x at most. Actually, when the write load is very low the new model has a small downgrade (about 10%); Michael Stack has fixed this downgrade in another patch. Thanks, Stack!
2. The second one is reverse scan. Before explaining how reverse scan works, I want to point out an important fact which helps in understanding this patch: the granularity of a scan is the row, not the key-value. All key-values of a row are read out in order from HFiles or the memstore, assembled together as a result row in the RegionServer's memory, and then returned to the client. This work is the same for both forward and reverse scan. So the difficulty of reverse scan is, when the current row is done, to figure out which row is next, jump to that row, and start to scan. Let's see how we do it. Since there are two extra seek operations compared to forward scan, there is a 30% performance downgrade compared to forward scan, almost the same as in LevelDB. Finally, thanks to Chunhui very much for porting our patch to trunk!
3. This is the third patch: per table/CF replication. Suppose we have a source cluster with two tables and four column families, all replicable. For data safety we deployed a peer cluster for backup, and the source cluster replicates all the data to this backup cluster; that's just what we want, and the replication works pretty well. Then, for some reason such as data analysis or experimental purposes, we deployed another peer cluster whose program just needs data from cfX of table T2. What kind of replication do we expect? Ideally only data from cfX of T2 is replicated... but replication can't work! So we have to create all tables and column families in PeerB, and all the data will be replicated. That's really bad, both in terms of bandwidth between the source and PeerB and in terms of PeerB's resource usage.
4. Then we implemented this feature; it allows specifying which data will be replicated to a peer cluster. For PeerA, the add_peer command is the same as before since PeerA wants to replicate all the data. But for PeerB, add_peer takes an additional argument specifying which tables or column families to replicate. The implementation change is quite straightforward: in the source cluster, when parsing the log entries, the replication source thread ignores all other entries and only replicates entries from cfX of table T2.
5. This is the fourth patch: block index key optimization. It reduces the overall block index size. Suppose two contiguous blocks: the last key-value's row of Block 1 is "ab", and the first key-value's row of Block 2 is "ah, hello world". Before our patch the block index key of Block 2 is "ah, hello world" (the first key-value of Block 2); after our patch the block index key is "ac" (a fake key: the minimal key-value that is larger than the last key-value of Block 1 and less than or equal to the first key-value of Block 2, with the shortest row length). The new block index key is much shorter than the old one.
6. Now let's continue with some work items we are currently working on.
7. The second one is the HLog compactor. Its target is to keep as few HLogs as possible, so we can say its final target is to improve regionserver failover performance: the fewer HLog files there are to split, the better failover performance is. We know a regionserver typically serves many regions, and the write patterns of these regions can be quite different, so the flush frequency and timing of these regions can also be very different. Consider a region x whose memstore contains quite few entries: no flush is triggered for a long time, and all its entries scatter across many HLogs. Though all other entries in those HLogs have been flushed to HFiles, the HLogs still can't be archived since they contain entries from region x. We do have a background flusher thread to flush old memstores forcefully, but it has some obvious drawbacks: first, it's hard to configure good-enough flushCheckInterval and flushPerChanges values; second, forceful flushes result in tiny HFiles; last, as in JIRA HBASE-10499, some problematic regions just can't be flushed at all by this background flusher thread!
8. Our patch works like this: we introduce another background thread, the HLog compactor. When the HLog size is too large compared to the memstore size (which means we flushed enough, but did not archive enough), we trigger the HLog compactor. It reads entries from all active HLog files; if an entry is still in some region's memstore, it writes it to a new HLog file; if it is not in any memstore (which means it has been flushed to some HFile), it is ignored. After the compaction, we can archive all the old HLog files without flushing any memstore. We have finished this feature and are testing it in our test cluster; we'll share the patch after the test.
9. Let's consider two scenarios. In the first scenario we write kvA at timestamp t0, then delete it and flush, and then write it again; finally we try to read it. The result is that we can't read it out, since both writes are masked by the delete. The second scenario is the same as the first except that before writing kvA for the second time we trigger a major compact. This time kvA can be read out, since the delete is collected by the major compact. This is inconsistent: major compact is transparent to the client, but the read results differ depending on whether a major compact occurred or not. The root cause is that the delete can even mask a key-value put later than it. The fix is simple: since mvcc represents the order in which all writes (including puts and deletes) enter HBase, we use it as an additional delete criterion to prevent a delete from masking a later put. There has been some heated discussion on this patch; personally I still insist that it deserves further thinking and discussion.
10. The fourth item is coordinated compaction. We talk about compact storms from time to time; let's check how one happens. When a regionserver wants to compact, it just triggers the compaction, which reads from HDFS and writes back to HDFS, and a regionserver can trigger a new compaction no matter how overloaded the whole system is. So we can see the problem: what compaction eventually uses is a global HDFS, but whether to trigger a compaction is a local decision by each regionserver.
11. What we propose is using the master as a coordinator for compaction scheduling. It works this way: when a regionserver wants a compaction, it asks the master. If the master says yes, it can trigger the compaction; if the master thinks the system is loaded, it will reject all later compaction requests until the system is no longer loaded.
12. The last item is the quorum master. This is a master redesign, and there has been some discussion on it already. I noticed that Jimmy Xiang from Cloudera and Mikhail from WANdisco have put some effort into it. It's great! The current master design has two problems. First, some system-wide metadata and status are only maintained in the active master; for master failover this metadata and status are stored in ZooKeeper as well, and during failover the new active master needs to read from ZooKeeper to rebuild the in-memory state. Second, ZooKeeper is used as the communication channel between the master and regionservers for the state machine of region-assignment tasks, but ZooKeeper's asynchronous notification mechanism is just not suitable for state-machine logic; it is also the root cause of many tricky bugs found so far.
13. We propose this new design: instead of storing in-memory status in ZooKeeper, we replicate it among all master instances using a consensus protocol such as Raft or Paxos. This way, when the active master fails, a new active master is elected via the consensus protocol among all alive standby masters, and the new active master serves immediately without reading from elsewhere. The quorum master has some advantages: better master failover performance; better restart performance for a big cluster, since the communication between the master and ZooKeeper is the bottleneck when a big number of region-assignment tasks happen concurrently; no external dependency on ZooKeeper; no potential consistency issues any longer; simpler deployment.