1. The document is a HighLoad++ talk about a distributed, scalable, and highly reliable data storage system for virtual machines and other uses.
2. It proposes a deliberately simplified design that focuses on the needed core capabilities, such as data replication and recovery, to achieve both low cost and high performance.
3. The design splits data into 64 MB chunks that are replicated across multiple servers, and uses metadata servers to track chunk locations and versions so that stale data is never served despite failures.
A distributed, scalable, and highly reliable data storage system for virtual machines (Kirill Korotaev)
1. A distributed, scalable, and highly reliable data storage system for virtual machines (and not only)
HighLoad++, October 22-23, 2012
Kirill Korotaev
VP of Advanced Research, Parallels
4. SAN storage
• Redundancy
• High availability
• Monitoring
• Self healing
• Reliability
• Fibre Channel
• Performance
5. SAN vs. local storage
(Diagram: local storage ~$100, SAN ~$100,000.)
How to get the best of both worlds?
6. Idea of maximum simplification
Focusing only on the needed capabilities is the key to success
7. From required capability to consistency
Consistency: "some well defined and understandable order" of operations.
Required for real file systems (NTFS, ext3/4, …) because they rely on these properties of real disks.
(Diagram: writes A and B go through the journal before the metadata, relying on ordered writes.)
• Immediate/Strict consistency
• Sequential consistency (all observers see the same order)
• Eventual consistency (no time guarantees, most object storages)
• Weak consistency (only synchronization primitives define order)
8. We also want to grow on demand…
• Growing on demand means being able to allocate more than is locally available (> one HDD, or when there is no free local space)
• This requires splitting data into pieces
• The probability that any of N computers fails increases with N
• So redundancy/replication and automatic recovery are a MUST
• But recovery can be done in parallel (~1 TB in 10 minutes)
• Good news: MTTF ~ 1/T², where T is the recovery time
• So let's split all objects into chunks (64 MB)
• Replicate chunks, not objects
• Spread chunks over the cluster to reduce the recovery time T (placement sketch below)
• The cluster topology needs to be taken into account
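Below is a minimal sketch of the chunking and placement ideas from this slide: a byte offset maps to a 64 MB chunk, and replicas are spread across distinct failure domains (racks here) so that one domain never holds all copies. The structures, the rack rule, and the function names (chunk_index, pick_replicas) are illustrative assumptions, not the actual Parallels implementation.

/* Sketch: mapping file offsets to 64 MB chunks and spreading replicas.
 * Hypothetical names; not the real Parallels Cloud Storage code. */
#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE (64ULL * 1024 * 1024)   /* 64 MB chunks */
#define REPLICAS   3

struct chunk_server { int id; int rack; };

/* A file offset maps to exactly one chunk index. */
static uint64_t chunk_index(uint64_t offset) { return offset / CHUNK_SIZE; }

/* Pick REPLICAS servers for a chunk, preferring distinct racks so that a
 * single rack (or DC) failure does not take out all copies at once. */
static int pick_replicas(const struct chunk_server *cs, int n,
                         uint64_t chunk, int out[REPLICAS])
{
    int chosen = 0;
    for (int i = 0; i < n && chosen < REPLICAS; i++) {
        int idx = (int)((chunk + i) % n);
        int same_rack = 0;
        for (int j = 0; j < chosen; j++)
            if (cs[out[j]].rack == cs[idx].rack) same_rack = 1;
        if (!same_rack)
            out[chosen++] = idx;
    }
    /* Fall back to any servers if the cluster has fewer racks than replicas. */
    for (int i = 0; chosen < REPLICAS && i < n; i++) {
        int idx = (int)((chunk + i) % n), dup = 0;
        for (int j = 0; j < chosen; j++) if (out[j] == idx) dup = 1;
        if (!dup) out[chosen++] = idx;
    }
    return chosen;
}

int main(void)
{
    struct chunk_server cluster[] = {
        {0, 0}, {1, 0}, {2, 1}, {3, 1}, {4, 2}, {5, 2}
    };
    uint64_t offset = 200ULL * 1024 * 1024;   /* byte 200 MB of some file */
    uint64_t chunk = chunk_index(offset);     /* -> chunk 3 */
    int replicas[REPLICAS];
    int n = pick_replicas(cluster, 6, chunk, replicas);
    printf("offset %llu -> chunk %llu, %d replicas chosen\n",
           (unsigned long long)offset, (unsigned long long)chunk, n);
    return 0;
}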
9. More ideas
Simplifications:
• No need for POSIX FS
• Optimize for big objects only
• Can assume infrequent/slow metadata updates
Major assumption:
• Shit happens: servers die, even a whole DC can crash
10. Why is consistency not easy? Replication…
Simplest case: the object is fully stored on a single node.
Next step: the object is replicated, i.e. multiple instances are present.
(Timeline diagram: Server1 and Server2 move from v1 to v2, but Server3 still holds v1 when the DC crashes. Stale data can be read!)
11. Why is consistency not easy? Data striping…
Splitting data into chunks leads to issues similar to replication:
File = c1v1 + c2v1
(Timeline diagram: two writes update c1 and c2. Servers 1-3 hold replicas of c1, servers 4-6 hold replicas of c2; the DC crashes mid-update. The actual state is c1v2 + c2v2, but any combination civj can be found on some server.)
Note: c1v1 and/or c2v1 should never be read!
12. So why is consistency not easy?
Data versioning is crucial: versions must be attached to the data and heavily checked on every operation!
Problems with versions:
1. Tracking versions requires transaction support and synchronous operations. SLOW!
2. Versions can't be stored only next to the data: a node cannot decide on its own whether it is up to date or not.
Solution:
1. Update the version only when some server fails to complete a write, and leave it constant on successful data modifications (sketch below).
2. To solve #2, store versions on metadata servers (MDS), an authority that tracks the full cluster state. It must be reliable and HA.
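A minimal sketch of the version rule above, assuming an MDS-side record per chunk: the version stays constant while every replica acknowledges the write, and is incremented only when some replica fails, so that its copy becomes detectably stale. The struct and function names are hypothetical.

/* Sketch of the MDS-side version rule: bump the version only on a failed
 * replica write, so lagging copies can never be served to readers. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_REPLICAS 3

struct chunk_meta {
    unsigned version;                       /* authoritative version at MDS */
    unsigned replica_version[MAX_REPLICAS]; /* last version each replica confirmed */
};

/* Called when the client reports the outcome of a replicated write. */
static void on_write_complete(struct chunk_meta *c, const bool ok[MAX_REPLICAS])
{
    bool all_ok = true;
    for (int i = 0; i < MAX_REPLICAS; i++)
        if (!ok[i]) all_ok = false;

    if (all_ok)
        return;   /* fast path: version unchanged, no MDS transaction needed */

    /* Some replica missed the write: bump the version so its copy becomes
     * detectably stale. Only the replicas that succeeded are current. */
    c->version++;
    for (int i = 0; i < MAX_REPLICAS; i++)
        if (ok[i]) c->replica_version[i] = c->version;
}

/* A replica may serve reads only if it is at the authoritative version. */
static bool replica_readable(const struct chunk_meta *c, int replica)
{
    return c->replica_version[replica] == c->version;
}

int main(void)
{
    struct chunk_meta c = { .version = 1, .replica_version = {1, 1, 1} };
    bool outcome[MAX_REPLICAS] = { true, true, false };  /* replica 2 failed */
    on_write_complete(&c, outcome);
    printf("version=%u, replica 2 readable=%d\n", c.version, replica_readable(&c, 2));
    return 0;
}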
14. Overall design
Meta Data Server (MDS):
• Metadata database
• Chunk and version tracker
• HA
Chunk Server (CS):
• Stores data (chunks)
• Chunk management
• READ/WRITE
Clients talk to the MDS and to many chunk servers (CS … CS) over Ethernet (read-path sketch below).
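The following sketch shows one plausible client read path over this MDS/CS split: the client asks the MDS for the chunk location and the authoritative version, reads from a chunk server, and rejects the data if the server's version does not match. All function names (mds_lookup, cs_read, client_read) are placeholders, not a real API.

/* Sketch of a client read that refuses stale data. Stubs stand in for RPCs. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct chunk_location { int cs_id; unsigned version; };

/* Ask the MDS where an up-to-date replica of (file, chunk) lives. */
static struct chunk_location mds_lookup(int file_id, uint64_t chunk_idx)
{
    (void)file_id; (void)chunk_idx;
    return (struct chunk_location){ .cs_id = 4, .version = 7 };  /* stubbed */
}

/* Read from a chunk server; the CS also returns the version it holds. */
static int cs_read(int cs_id, uint64_t chunk_idx, uint64_t off, void *buf,
                   size_t len, unsigned *version_out)
{
    (void)cs_id; (void)chunk_idx; (void)off;
    memset(buf, 0, len);          /* stub: pretend we read zeros */
    *version_out = 7;
    return 0;
}

static int client_read(int file_id, uint64_t chunk_idx, uint64_t off,
                       void *buf, size_t len)
{
    struct chunk_location loc = mds_lookup(file_id, chunk_idx);
    unsigned v;
    if (cs_read(loc.cs_id, chunk_idx, off, buf, len, &v) != 0)
        return -1;
    if (v != loc.version)
        return -1;   /* stale replica: retry via MDS rather than return old data */
    return 0;
}

int main(void)
{
    char buf[4096];
    printf("read rc=%d\n", client_read(42, 3, 0, buf, sizeof buf));
    return 0;
}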
15. Magic (MetaData) Server HA
• Single point of failure => needs HA…
• Ouch, but we don't want a replicated DB such as MySQL or Oracle… it's a nightmare to manage and does not scale.
• A database adds noticeable latency per chunk access.
We have to create our own replicated DB for the MDS.
16. Meta Data Server database
Ideas from GoogleFS:
• ~128 bytes of metadata per chunk
• 64 MB of RAM describes ~32 TB of chunks
Design:
• The full state is kept in memory; committed deltas go to a local journal (sketch below)
• The journal stores modification records; it keeps growing, so it needs compacting
• To compact in the background, memory snapshotting is needed
• The journal and state can be synced to outdated nodes
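A minimal sketch of the "full state in RAM plus journal of deltas" idea, with compaction reduced to "dump the state, start a new journal". In the real design compaction runs in the background off a memory snapshot; the struct layout, file names, and function names here are illustrative assumptions.

/* Sketch: in-memory MDS state, an append-only delta journal, and compaction. */
#include <stdio.h>
#include <stdlib.h>

struct delta { unsigned chunk_id; unsigned new_version; };

struct mds_state {
    unsigned *versions;     /* full state: version per chunk, kept in RAM */
    unsigned nchunks;
    FILE *journal;          /* append-only log of deltas */
};

/* Every metadata change is applied to RAM and appended to the journal. */
static int mds_commit(struct mds_state *s, struct delta d)
{
    if (d.chunk_id >= s->nchunks) return -1;
    s->versions[d.chunk_id] = d.new_version;
    if (fwrite(&d, sizeof d, 1, s->journal) != 1) return -1;
    return fflush(s->journal);     /* real code would also fdatasync() here */
}

/* Compaction: dump the in-memory state as a snapshot, then start a fresh
 * (empty) journal so the log does not grow without bound. */
static int mds_compact(struct mds_state *s, const char *snap_path,
                       const char *journal_path)
{
    FILE *snap = fopen(snap_path, "wb");
    if (!snap) return -1;
    fwrite(s->versions, sizeof *s->versions, s->nchunks, snap);
    fclose(snap);

    fclose(s->journal);
    s->journal = fopen(journal_path, "wb");   /* truncate: new empty journal */
    return s->journal ? 0 : -1;
}

int main(void)
{
    struct mds_state s = { calloc(1024, sizeof(unsigned)), 1024,
                           fopen("mds.journal", "wb") };
    if (!s.versions || !s.journal) return 1;
    mds_commit(&s, (struct delta){ .chunk_id = 7, .new_version = 3 });
    mds_compact(&s, "mds.snapshot", "mds.journal");
    printf("chunk 7 version = %u\n", s.versions[7]);
    return 0;
}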
17. Meta Data Server database compacting
(Diagram: the full in-memory state is periodically dumped to a snapshot; each snapshot starts a new, empty local journal.)
18. PAXOS algorithm
The Part-Time Parliament, Leslie Lamport:
• The Island of Paxos and its parliamentary government
• Legislators are traders; they travel a lot and are often absent from the chamber
• They use unreliable messengers for notifications and communication
• A decree is adopted using ballots and needs a majority of votes (acceptor sketch below)
• Each legislator keeps their own journal of records
• Consistency of decrees is required
• Adding/removing legislators must be possible
• Progress is needed
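To make the "ballots and majority" idea concrete, here is a textbook single-decree Paxos acceptor sketch; a value is chosen once a majority of acceptors accept it for the same ballot. This is the generic algorithm, not the MDS replication code itself.

/* Minimal single-decree Paxos acceptor (one node's side of the protocol). */
#include <stdbool.h>
#include <stdio.h>

struct acceptor {
    long promised;      /* highest ballot we promised not to undercut */
    long accepted;      /* ballot of the value we accepted, -1 if none */
    int  value;         /* the accepted value (a "decree") */
};

/* Phase 1: prepare(ballot). Promise to ignore smaller ballots and report
 * any previously accepted value, which the proposer must then keep. */
static bool prepare(struct acceptor *a, long ballot,
                    long *prev_ballot, int *prev_value)
{
    if (ballot <= a->promised) return false;
    a->promised = ballot;
    *prev_ballot = a->accepted;
    *prev_value = a->value;
    return true;
}

/* Phase 2: accept(ballot, value). Accept unless we promised a higher ballot. */
static bool accept(struct acceptor *a, long ballot, int value)
{
    if (ballot < a->promised) return false;
    a->promised = ballot;
    a->accepted = ballot;
    a->value = value;
    return true;
}

int main(void)
{
    struct acceptor acc = { .promised = -1, .accepted = -1, .value = 0 };
    long pb; int pv;
    if (prepare(&acc, 1, &pb, &pv))      /* proposer wins phase 1 */
        accept(&acc, 1, 42);             /* decree 42 is accepted here */
    printf("accepted ballot=%ld value=%d\n", acc.accepted, acc.value);
    return 0;
}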
20. Chunk server performance tricks
Write requests are split into 64 KB pieces and chained: a 256 KB request streams through the chunk-server replica chain (CS -> CS -> CS) piece by piece instead of being sent whole to each server (sketch below).
(Timeline diagram compares completion time for: 1 server, single 256 KB write; 3 servers, split/chained write; 3 servers, dumb replication.)
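A sketch of the splitting idea, with assumed helper names: in the real chained design the client sends each 64 KB piece only to the head of the chain and every chunk server forwards it to the next; here the fan-out is flattened into a loop just to show the piece ordering.

/* Sketch: split a large write into 64 KB pieces so each piece can enter the
 * replication chain while the next one is still being transferred. */
#include <stddef.h>
#include <stdio.h>

#define PIECE (64 * 1024)

/* Placeholder for the real RPC: send one piece to chunk server `cs`. */
static void send_piece(int cs, const char *data, size_t len, size_t offset)
{
    printf("CS%d <- %zu bytes @ offset %zu\n", cs, len, offset);
    (void)data;
}

/* Client side: split the request into 64 KB pieces; transfer of piece k to
 * replica 2 and 3 can overlap with piece k+1 going to replica 1. */
static void chained_write(const int chain[], int nchain,
                          const char *buf, size_t len, size_t offset)
{
    for (size_t done = 0; done < len; done += PIECE) {
        size_t piece = (len - done < PIECE) ? len - done : PIECE;
        /* In the real system only the head of the chain is contacted and
         * each CS forwards to the next; the loop only shows the ordering. */
        for (int i = 0; i < nchain; i++)
            send_piece(chain[i], buf + done, piece, offset + done);
    }
}

int main(void)
{
    static char buf[256 * 1024];                 /* a 256 KB write request */
    int chain[] = { 1, 2, 3 };                   /* three replicas in a chain */
    chained_write(chain, 3, buf, sizeof buf, 0); /* 4 pieces x 3 servers */
    return 0;
}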
21. SSD caching on clients
• SSD caching boosts performance by more than 10x
• Cache a block only on its second read access, to avoid excessive caching (admission sketch below)
• Detect sequential access and avoid caching it
• Cache index: {fileID, offset}, an 8-way hash with LRU; each entry keeps a blockid and an access time
• 3 parts: filter (RAM), cache, boot cache (1/8, 100MB / 2min)
• Sequential access detection relies on access time
• A write simply invalidates cached blocks
• Avoid interlaced reads: remote block, cached, remote, cached
• Writes (cache population) affect read performance (user I/O)
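A minimal sketch of the admission rules above: cache only on the second read, skip sequential streams, and invalidate on write. The structures are illustrative; the real cache also has the RAM filter, boot cache, and 8-way LRU index mentioned on the slide.

/* Sketch: SSD cache admission for one block, keyed by {fileID, offset}. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct block_key { uint64_t file_id, offset; };

struct cache_entry {
    struct block_key key;
    unsigned reads_seen;    /* RAM-only filter: how many reads so far */
    bool on_ssd;            /* block actually written to the SSD cache */
    uint64_t last_access;   /* used by sequential-access detection */
};

/* Decide what to do with a read; returns true if the block should now be
 * copied to the SSD cache. */
static bool on_read(struct cache_entry *e, uint64_t now, bool sequential)
{
    e->last_access = now;
    if (sequential)                 /* streaming read: do not pollute the cache */
        return false;
    e->reads_seen++;
    if (e->reads_seen >= 2 && !e->on_ssd) {   /* cache on 2nd access only */
        e->on_ssd = true;
        return true;
    }
    return false;
}

/* A write simply invalidates whatever the SSD holds for this block. */
static void on_write(struct cache_entry *e)
{
    e->on_ssd = false;
    e->reads_seen = 0;
}

int main(void)
{
    struct cache_entry e = { .key = { 42, 0 } };
    printf("1st read cached? %d\n", on_read(&e, 100, false));  /* 0 */
    printf("2nd read cached? %d\n", on_read(&e, 200, false));  /* 1 */
    on_write(&e);
    printf("after write, on_ssd=%d\n", e.on_ssd);               /* 0 */
    return 0;
}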
22. SSD journaling at CS
• Boosts write performance; commits asynchronously to the rotational drives
• Allows implementing data checksumming and scrubbing
• Reason: the major difference from object storages is that performance and latency are very noticeable and important; latency is not hidden by a WAN.
24. Final result
Storage system:
• With a file system interface, scalable to petabytes
• Suitable for running VMs and containers, object storage, shared hosting, and other needs
• SAN-like performance over Ethernet networks:
• 13K IOPS on a 14-machine cluster (4 SATA HDDs each)
• 600K IOPS with SSD caching
• 1 TB drive recovery takes 10 minutes!
25. Some experience to share
• Asynchronous non-blocking design
• QA: unit tests and stress testing are a must
• QA: how to emulate power off? SIGSTOP + SIGCONT/SIGKILL (sketch below).
• It hangs connections and avoids RSTs, as if the host had disappeared
• Drives/RAIDs/SSDs lie about FLUSHes.
• SSD write performance depends on data. Beware compression inside.
• Checksum everything (4 GB/sec) and validate the HW with memtest86
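A sketch of the SIGSTOP/SIGKILL trick: freezing the storage daemon keeps its sockets open but silent, so peers see hung connections rather than resets, much like a machine that lost power. The PID argument and the 30-second wait are arbitrary example values.

/* Sketch: emulate "power off" of a node by freezing its storage daemon. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid-of-storage-daemon>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    /* "Power off": the process stops answering but keeps its sockets,
     * so connections hang instead of being reset. */
    if (kill(pid, SIGSTOP) != 0) { perror("SIGSTOP"); return 1; }
    sleep(30);                       /* let the cluster detect the failure */

    /* Either resume the node ... */
    kill(pid, SIGCONT);
    /* ... or finish the "crash" for real: kill(pid, SIGKILL); */
    return 0;
}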
26. Some experience to share
• All queues should be limited and controlled; as in TCP, congestion control is required (both for the memory limit and for latency control, i.e. the I/O queue length)
• One client can fire N x 4K random requests, which is effectively equivalent to 1 MB requests; congestion should be calculated correctly (taking the quality of the queue into account)
• In addition to queues, limit FDs, connections, etc.
• Linux sync() / syncfs() is not usable
• Linux fdatasync() is 3.5-4x faster than fsync(), but should not be used when the file size has changed (sketch below)
• Replication should be done in pairs
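A sketch of the fdatasync()/fsync() rule above for a chunk write: if the write stays within the current file size, the cheaper fdatasync() is used; if it extends the file, fsync() is used, following the slide's advice. The function and file names are illustrative.

/* Sketch: choose fdatasync() vs fsync() depending on whether the write
 * changed the file size. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int durable_write(int fd, const void *buf, size_t len, off_t offset)
{
    struct stat st;
    if (fstat(fd, &st) != 0) return -1;

    ssize_t n = pwrite(fd, buf, len, offset);
    if (n != (ssize_t)len) return -1;

    if (offset + (off_t)len <= st.st_size)
        return fdatasync(fd);   /* size unchanged: cheaper data-only flush */
    else
        return fsync(fd);       /* file grew: per the advice above, use fsync */
}

int main(void)
{
    int fd = open("chunk.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    const char data[4096] = "payload";
    printf("durable_write rc=%d\n", durable_write(fd, data, sizeof data, 0));
    close(fd);
    return 0;
}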
27. Some experience to share
• Cluster identity: unique ID and name
• Cluster resolving: DNS, manual and zeroconf
• Interesting OSDI'12 Microsoft Research storage: uniform network topology, 2 GB/sec per client, sorting world record of 1.4 TB in ~60 sec.
30. Object storage
(Diagram: multiple images, each an ext4 filesystem containing objects.)
31. Thank You
Kirill Korotaev dev@parallels.com
Try Parallels Cloud Server 6.0 beta2 at
http://www.parallels.com/product/pcs
32. Parallels is an international market leader in its segments
• Cloud service delivery automation: 10 million SMBs in 125 countries
• PC and Mac virtualization: 3 million PCs worldwide, 81% of the US retail market
Working at Parallels ranges from OS kernel programming to building web interfaces and mobile applications:
• Interesting projects
• A world-class development process
• Working side by side with legends of the IT industry
• Career growth and stock options
Join the best! Job@parallels.com
Editor's notes
Let's start with the prehistory of how our logic and vision evolved and why the idea developed this way historically…
Slide beginning: it was about a single computer and a single disk. The whole story below is about disks and storage. What does local storage mean? Can't grow on demand (beyond what is locally available), can't quickly migrate data, can't access data from different nodes, a single point of failure, manual HW support.
This approach changes a lot! Automation, scalability
Cheap/slow vs. manageable/fast/reliable/scalable
Mention object storages… Talk about FLUSH/BARRIER; it is not present in this picture.
- S3 and Scality have only put/get operations, since they write the version together with the data and do an fsync. After that the new object version is discoverable/accessible and the old one can be removed. GoogleFS does a similar version increment on close, i.e. on a crash an inconsistent state may be detected. NOTE: in reality the version must be updated only when the client issues a SYNC (not after every write). But it is still slow to do it this way.
Sounds like the gamble of the century. But think about the requirements: low read latency, full state replication.