Nightmare with ceph : Recovery from ceph cluster total failure

Recovery from total failure of ceph cluster.

Published in: Software

  1. Nightmare with Ceph • Andrew YJ Kong, kakao
  2. By the way • I like Red Hat
  3. New Year's greeting • A plain error message • We never expected this to last for 3 weeks!
  4. A failure is always just part of the iceberg
  5. In the abyss: CEPH ERROR
  6. What?
  7. In the abyss: CEPH ERROR
  8. Gotta fix this
  9. What is the real problem? • Before you ask this, – you need to know the Ceph configuration. • Our Ceph configuration: – How many mon servers? 3 – How many OSDs? 12 – How many replicas? 2
  10. OK, what is the real problem? • 4 (out of 12) OSDs were gone – → We restarted the OSDs – Ceph started backfilling. – In the end we had 386 incomplete placement groups (PGs). • That was the start of the nightmare. – Nothing worked! • Adjusting the PG count • Scrubbing • Killing incomplete PGs • …
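The numbers on slides 9 and 10 (12 OSDs, 2 replicas, 4 OSDs lost) allow a rough back-of-the-envelope estimate of how exposed the cluster was. The sketch below assumes every PG picks its OSDs uniformly at random, which ignores any CRUSH rules the real cluster used, so treat the result as an illustration rather than a reconstruction of the incident. The function name is hypothetical.

```python
from math import comb


def fraction_losing_all_replicas(total_osds: int, failed_osds: int, replicas: int) -> float:
    """Chance that every replica of one PG sits on a failed OSD.

    Assumption: each PG chooses its OSDs uniformly at random
    (no CRUSH rules, no failure domains).
    """
    return comb(failed_osds, replicas) / comb(total_osds, replicas)


# 12 OSDs and 4 failures are taken from the slides; the replica counts vary.
for replicas in (2, 3, 4):
    p = fraction_losing_all_replicas(12, 4, replicas)
    print(f"replicas={replicas}: {p:.1%} of PGs would lose every copy")
```

Under this assumption about 9.1% of PGs lose both copies with 2 replicas, against about 1.8% with 3 and 0.2% with 4.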
  11. Pray to God Google • By definition, "incomplete" means: Ceph detects that a placement group is missing a necessary period of history from its log. If you see this state, report a bug, and try to start any failed OSDs that may contain the needed information.
  12. Ceph internals • By the way, what is Ceph? • The thing is, OpenStack does not use Ceph at all
  13. OpenStack uses RBD (RADOS Block Device) • [Diagram] RADOS (Reliable Autonomic Distributed Object Store) • LIBRADOS (client library; bindings for C, C++, Java, Python) • RGW (RADOS Gateway) • RBD (RADOS Block Device) • CephFS • OpenStack: Keystone API, Swift API, Cinder API, Glance API, Nova API, QEMU
  15. RBD and PG and OSD
      [root@seal123 ~]# rbd -p images ls
      08909734-66fa-48e3-ab5e-2e2b8bb3a58c
      [root@seal123 ~]# rbd -p images info 08909734-66fa-48e3-ab5e-2e2b8bb3a58c
      rbd image '08909734-66fa-48e3-ab5e-2e2b8bb3a58c':
              size 810 MB in 102 objects
              order 23 (8192 kB objects)
              block_name_prefix: rbd_data.5a57484353d0cd
              format: 2
              features: layering
      RBD objects:
      rbd_data.5a57484353d0cd.0000000000000000
      rbd_data.5a57484353d0cd.0000000000000001
      rbd_data.5a57484353d0cd.00000000000000C1
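The `rbd info` output on slide 15 shows how an image is cut into RADOS objects: the order gives the object size (2^23 bytes = 8192 kB), and each object is named with the block name prefix plus a 16-digit hexadecimal object number. A minimal sketch of that arithmetic follows; the helper names are made up for illustration, and the prefix and sizes are the ones from the slide.

```python
def object_count(size_bytes: int, order: int) -> int:
    """Number of objects needed to hold an image of size_bytes."""
    object_size = 1 << order          # order 23 -> 8 MiB objects
    return -(-size_bytes // object_size)  # ceiling division


def object_name(block_name_prefix: str, offset: int, order: int) -> str:
    """Name of the object that holds the byte at `offset` in the image."""
    object_no = offset >> order
    return f"{block_name_prefix}.{object_no:016x}"


prefix = "rbd_data.5a57484353d0cd"   # block_name_prefix from the slide
print(object_count(810 * 1024 * 1024, 23))
print(object_name(prefix, 0, 23))
print(object_name(prefix, 8 * 1024 * 1024, 23))
```

For the 810 MB image this gives the 102 objects reported by `rbd info`, and the first two names match the object names listed on the slide.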
  16. Consistent hashing algorithm • Systematically locate files based on an algorithm (e.g. Gluster). • [Diagram] Brick1 /data0 [a-c], Brick2 /data0 [d-f], …, Brick10 /data0 [x-z]; file "A" → algorithm → file location? Brick1, /data0 • In this algorithm, files are located alphabetically. • What if a file's name is more than 2 words? • What if files are only created with names starting with A and B?
  17. Consistent hashing algorithm • Use a hash to locate files. • [Diagram] Brick1 /data0 [01-10], Brick2 /data0 [11-20], …, Brick10 /data0 [91-100]; file "A" → algorithm (2-digit hashing) → file location? Brick1, /data0 • Hash the file name + path. • Hashes are fixed length. • Unique name. • Evenly distributed. • What if we add or remove a physical disk?
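The closing question on slide 17 (what happens when a disk is added or removed) can be made concrete with a small experiment. The sketch below is a simplification and not Gluster's or Ceph's actual scheme: it assumes a two-digit hash derived from MD5 and assigns a hash value to a brick by plain modulo instead of by range. All names are hypothetical.

```python
import hashlib


def two_digit_hash(path: str) -> int:
    """Map a file path to a value in 0..99 (stand-in for '2-digit hashing')."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def brick_for(hash_value: int, n_bricks: int) -> int:
    """Naive placement: hash value modulo the number of bricks."""
    return hash_value % n_bricks


# Count how many of the 100 possible hash values change brick
# when the cluster grows from 10 to 11 bricks.
moved = sum(brick_for(h, 10) != brick_for(h, 11) for h in range(100))
print(f"{moved} of 100 hash values map to a different brick")

print(brick_for(two_digit_hash("/data0/A"), 10))
```

With naive modulo placement, 90 of the 100 hash values move to a different brick after one brick is added, which is the kind of mass data movement the next slide's indirection is meant to avoid.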
  18. Consistent hashing algorithm • Make it robust. • [Diagram] Brick1 /data0 [01-10], Brick2 /data0 [11-20], …, Brick10 /data0 [91-99]; file "A" → algorithm → file location? Volume 1 • Provision a huge number of virtual disks (/logical volume 1, /logical volume 2, /logical volume 3, /logical volume 4). • Virtualization lets you deal flexibly with physical disks. • Gluster management functions: add, subtract, replicate, heal, … • Diagram labels: Robert Jenkins 32-bit mixing, placement group, CRUSH algorithm, OSDs
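The indirection on slide 18 can be sketched in two steps: an object name is hashed to a placement group, and each placement group is mapped to a set of OSDs. The code below is only an illustration of that idea. It uses CRC32 where Ceph uses the Robert Jenkins hash, and rendezvous (highest-random-weight) hashing as a stand-in for CRUSH; both substitutions, and all function names, are assumptions made for this example.

```python
import hashlib
import zlib
from typing import List


def pg_for_object(object_name: str, pg_num: int) -> int:
    """Step 1: object name -> placement group (CRC32 as a stand-in hash)."""
    return zlib.crc32(object_name.encode("utf-8")) % pg_num


def osds_for_pg(pg: int, osds: List[int], replicas: int) -> List[int]:
    """Step 2: placement group -> OSDs, via rendezvous hashing (not CRUSH)."""
    def score(osd: int) -> str:
        return hashlib.sha1(f"{pg}:{osd}".encode("utf-8")).hexdigest()
    return sorted(osds, key=score, reverse=True)[:replicas]


osds = list(range(12))   # 12 OSDs, as on slide 9
pg = pg_for_object("rbd_data.5a57484353d0cd.0000000000000000", pg_num=64)
print(pg, osds_for_pg(pg, osds, replicas=2))
```

Because objects only know their placement group, removing an OSD changes the mapping only for the placement groups that had a replica on it; the others keep their OSD set.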
  19. Ceph PGs
  20. Hashing-based location • Cluster membership management – Mon membership – Placement group – OSD membership • If something happens to membership – CREATE event – OSD is implemented as an FSM (finite state machine)
  21. Cluster management (e.g. Sheepdog)
  22. Ceph: 'Paxos' made simple • Tries to guarantee the order of the event history. • Agreement-based 2-phase commit (prepare, write)
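The two phases named on slide 22 can be illustrated with a textbook single-decree Paxos round. This is a generic in-memory sketch, not the Ceph monitors' implementation; the class, the function, and the example values are invented for illustration, and the three acceptors simply mirror the three mon servers from slide 9.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                           # highest proposal number promised
    accepted: Optional[Tuple[int, str]] = None   # (number, value) last accepted

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run both phases; return the chosen value, or None without a majority."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [previous for ok, previous in replies if ok]
    if len(promises) < majority:
        return None
    # If some acceptor already accepted a value, adopt the highest-numbered one.
    earlier = [p for p in promises if p is not None]
    if earlier:
        value = max(earlier)[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


mons = [Acceptor() for _ in range(3)]
print(propose(mons, 1, "map epoch 10"))
print(propose(mons, 2, "map epoch 11"))
```

The second proposal returns the value chosen by the first one: once a majority has accepted a value, later proposers have to adopt it, which is how agreement on a single history is preserved.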
  23. Finite state machine: Ceph OSD
  24. Last one: log • [Diagram] MONs • OSD1, OSD2, OSD3 • Epoch: new map [PG/OSD] • History log DB • Request previous log • Request previous log
  25. Ceph log
      $ ceph pg 4.0 query
      {"recovery_state": [
        { "name": "Started/Primary/Active",
          "enter_time": "2012-09-26 13:35:57.631197",
          "might_have_unfound": [],
          "scrub": { "scrub_epoch_start": "259",
                     "scrub_active": 0,
                     "scrub_block_writes": 0,
                     "finalizing_scrub": 0,
                     "scrub_waiting_on": 0,
                     "scrub_waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2012-09-26 13:35:56.625867"}]}
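The `ceph pg 4.0 query` output on slide 25 is JSON, so the recovery history of a PG can be read programmatically. The sketch below parses a shortened copy of the slide's output with Python's `json` module; the function name is hypothetical, and only the fields shown on the slide are assumed to exist.

```python
import json

# Shortened copy of the output shown on the slide (scrub details omitted).
PG_QUERY_OUTPUT = """
{"recovery_state": [
  {"name": "Started/Primary/Active",
   "enter_time": "2012-09-26 13:35:57.631197",
   "might_have_unfound": []},
  {"name": "Started",
   "enter_time": "2012-09-26 13:35:56.625867"}]}
"""


def recovery_states(query_output: str):
    """Return (state name, enter time) pairs, in the order Ceph reports them."""
    doc = json.loads(query_output)
    return [(s["name"], s["enter_time"]) for s in doc.get("recovery_state", [])]


for name, entered in recovery_states(PG_QUERY_OUTPUT):
    print(f"{entered}  {name}")
```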
  26. By the way • So, what about the incomplete PGs? – In 2 weeks, we did not find a way to fix the incomplete PGs. – We wrote code that gathers every RADOS object file from the OSD directories and assembles them into one 'raw'-type image.
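The recovery approach on slide 26 (gathering the RADOS object files and assembling a raw image) is not shown in the deck, so the following is only a sketch of how such a tool could look, not the authors' code. It assumes the object files have already been copied into one directory, that each file name contains the image id followed by a dot and the 16-digit hexadecimal object number, and that the image order and size are known from `rbd info`. The function and parameter names are hypothetical.

```python
import os
import re


def assemble_raw_image(object_dir: str, image_id: str, order: int,
                       size_bytes: int, out_path: str) -> int:
    """Write every recovered object at offset (object number << order).

    Objects that are missing stay as zero-filled holes in the output.
    Returns the number of object files written.
    """
    # Assumed naming: "...<image_id>.<16 hex digits>..." somewhere in the file name.
    pattern = re.compile(re.escape(image_id) + r"\.([0-9a-fA-F]{16})")
    written = 0
    with open(out_path, "wb") as out:
        for name in sorted(os.listdir(object_dir)):
            match = pattern.search(name)
            if match is None:
                continue  # not an object of this image
            object_no = int(match.group(1), 16)
            out.seek(object_no << order)
            with open(os.path.join(object_dir, name), "rb") as src:
                out.write(src.read())
            written += 1
        out.truncate(size_bytes)  # pad (or cut) to the image size
    return written
```

A call such as `assemble_raw_image("/tmp/objects", "5a57484353d0cd", 23, 810 * 1024 * 1024, "image.raw")` would use the id, order and size from slide 15; the directory and output path are placeholders.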
  27. Lessons learned • To make a robust file system: – Use more than 3 replicas. – Take snapshots (RBD or pool) as often as you can. • This will clear incomplete PGs. – Create as many PGs as possible so that failures stay localized. – Take a close look at the error messages. – Do not fret. Refilling will make the peace.
  28. Special thanks to • Google, with the keywords "ceph", "pg incomplete" • Al.l • Issac.lim • Charlie.choe • www.facebook.com 엔지니어를 위한 정치