Brents IT Blog

Random thoughts by an IT GOAT


Ceph Recovery

So, say you end up deploying Ceph on old hardware that starts having issues, and you even end up using a JBOD setup instead of RAID.  Then you decide you want to maximize your space, and later you have a disk failure and a host failure.  You may end up where I ended up: down PGs, incomplete PGs, or inconsistent PGs.

Ceph is highly reliable, but hardware isn't always.  When you configure Ceph for JBOD on old hardware, you may be asking for it, so be prepared to remove and add OSDs more often than you would with a RAID setup, and be prepared for that old hardware to take out an OSD either at the drive level or at the whole host.  Please don't set min-size on a pool to 1 (meaning Ceph will accept writes as long as a single copy of a PG is available), and don't set your pool size to 2 copies while min-size is 1.  These are all bad ideas.  This may even have you asking, "why would anyone do this?"  The simple answer is that sometimes you are asked to do it just to get things online again.  Given the time it takes to rebalance a 100+ TB cluster, getting the system back online quickly may result in these settings being changed.  You may end up with multiple host failures, run out of space, and have to adjust these settings to get back online as quickly as possible.  I am in no way saying ANYONE should use these settings; they are just bad.  But if you used them, say temporarily to get out of a crisis, you may have ended up in a deeper crisis when another OSD (essentially a spinner drive in JBOD, in my case) goes offline.  So here are some steps to get you back online.  I caution that these options really suck and are a pain to go through, but given the hours it took me to piece together the information, I thought I would share it.
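
If you are not sure whether a cluster has already been left in that state, the replication settings are easy to check per pool ( <poolname> being whatever your pool is called ):

ceph osd pool get <poolname> size
ceph osd pool get <poolname> min_size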

Side Note:  If you know a better way, please comment.  I am good, but I wouldn't call myself an expert ( kind of a jack of all IT trades, master of none ).

So you have down PGs?  The first thing to do is run "ceph osd set noout" so the cluster doesn't mark OSDs out and start rebalancing while you are working on this.  Next, try restarting all the OSDs ( one by one ).  If you have a lot of OSDs and only a few downed PGs, you can do a "ceph pg <insertpgidhere> query |more" (e.g.: ceph pg 11.4ga query |more ) to get a list of the acting OSDs.  You do have to scroll a few pages to get to the part that shows the last acting OSDs.  Once you have that list, restart each one, one by one.  I usually run a "ceph -s" after each, wait until the output is back to normal (or what it was before I restarted the OSD ), then go on to the next one.  This is a good read:  http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
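
Just as a sketch of what that loop looks like ( the PG id 11.4a and OSD 7 below are made-up examples, and jq is optional, but since the query output is JSON it can pull the acting and up sets straight out; use whatever your distro uses to restart the OSD service, on systemd that is the ceph-osd@<id> unit ):

ceph osd set noout
ceph pg 11.4a query | jq '.acting, .up'
systemctl restart ceph-osd@7
ceph -s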

Usually, assuming you have at least one OSD with that PG on it online, the PG will change from down to up, or it will go incomplete.  If you sustained data loss, or Ceph doesn't know whether the surviving OSD has all the data for that PG, then you will end up with an incomplete status.  That is where the next part comes in....  this part is not fun and you don't want to be here.
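
Before digging into individual OSDs, it helps to have a clean list of exactly which PGs are down or incomplete and which OSDs they map to.  Something along these lines does it ( the grep is just to cut down the noise ):

ceph health detail | grep -E 'down|incomplete'
ceph pg dump_stuck inactive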

So let's stop here:  you need to figure out what your course of action is, and that will depend on where the data is (or isn't).  If most of the acting OSDs are online ( say minus one or two OSDs that had the PG and were last acting ), you can check each drive to see what's there.  The easiest way to do that is to log on to each OSD host and go to the OSD's PG folder ( usually /var/lib/ceph/osd/ceph-xxx/current/<pg.id>_head, where xxx is the OSD number and <pg.id> is the PG id ).  If you see a file that looks like __head_000000BDF__b there and no directories or other files ( that are not of the same type ), then that OSD's PG folder is useless to you.  You must then check the rest of the acting OSDs ( I should state that the primary at this point is the one offline, which is why Ceph is not sure whether the PG is healthy, and which is why it's bad to have a min-size of less than 2 ).  If you do find an OSD with data in the PG folder, then you can at least use that OSD to get that PG back from incomplete.  If you have multiple OSDs with data and it's not exactly the same, then you have to compare each one to the others to see which most likely has the most current data.  You can also bring data back from a failed disk if you really need to go that far, assuming the disk is at least readable or you have recovered that data in a different way.  For our purposes, we are going to assume the disk failed ( meaning the primary OSD cannot be returned to operation, because we have a JBOD setup, which is one OSD per disk ).  Once you have determined the winner ( whichever OSD has the most current data, accepting that you will lose any remaining data, mainly because you need the PG online ), you can proceed to the next step.
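
For what it is worth, this is roughly what that check looks like on each host ( OSD 12 and PG 11.4a are made-up example values; comparing the object counts and sizes across hosts is usually enough to pick a winner ):

ls /var/lib/ceph/osd/ceph-12/current/11.4a_head/ | head
find /var/lib/ceph/osd/ceph-12/current/11.4a_head/ -type f | wc -l
du -sh /var/lib/ceph/osd/ceph-12/current/11.4a_head/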

The next step is to install ceph-objectstore-tool on the OSD host you need to repair.  I install it on all my OSD host nodes due to prior issues.  I am on Ubuntu 16 right now, and the last time I tried to install it, I had to do the following (in order):

wget http://download.ceph.com/debian-luminous/pool/main/c/ceph/ceph-test_12.2.2-1trusty_amd64.deb
apt update
apt install jq socat xmlstarlet
dpkg -i ceph-test*.deb

This installs ceph-objectstore-tool, which you can then run under that same name.
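
A quick sanity check that the tool actually landed:

which ceph-objectstore-tool
ceph-objectstore-tool --help | head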

Now, we need to make sure "ceph osd set noout" is still set for the cluster; we don't want any unnecessary data migration yet.  We then stop the winning OSD with the PG data on it.  Next we flush the journal for the OSD, then we run the mark-complete ceph-objectstore-tool command so that the OSD tells the cluster that the PG it has is complete and it is OK to replicate the data from that folder out to the rest of the copies.  Finally, I am not sure if this is always needed ( you could also run the objectstore tool command as the data owner if need be ), but I had to set the permissions back to the ceph user before the OSD would start.

stop ceph-osd id=X
ceph-osd -i X --flush-journal
ceph-objectstore-tool --op mark-complete --pgid XX.XXX --data-path /var/lib/ceph/osd/ceph-X --journal-path /var/lib/ceph/osd/ceph-X/journal
chown ceph:ceph /var/lib/ceph/osd/ceph-X/current/omap/*
start ceph-osd id=X

NOTE:  X is a fill-in-the-blank placeholder.  In the first and second commands, X is the OSD number.  In the third command, XX.XXX is the PG id, and the remaining Xs are the OSD number again.  In the chown command, X is the OSD number, and in the final command, X is the OSD number, same as the first.
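
The stop/start lines above are the upstart style I had in my notes.  If your OSD nodes are on systemd ( stock Ubuntu 16 is ), the equivalent should be the per-OSD unit, so something like:

systemctl stop ceph-osd@X
systemctl start ceph-osd@X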

If all is well at this point, and you have no more incomplete PGs to work through, you can unset noout on the cluster ( ceph osd unset noout ).  You will then see Ceph change the PG from incomplete to active, copy the data around the cluster, and go back to healthy.  Otherwise, rinse and repeat for multiple incomplete PGs, and make sure to leave noout set until you are done.
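
To keep an eye on the recovery and backfill while that happens, the usual suspects are enough:

ceph -s
ceph -w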

If there is no data in any of the PG folders on any of the online OSDs, and you are OK with losing the data just to get the PG back online, you can just pick a healthy OSD and run the same commands.  Obviously, it will just sync the empty folder around, but more importantly, it will mark the PG complete and bring it back online.
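
To make the placeholders concrete, here is the same sequence with made-up values filled in ( OSD 12 holding the copy you are keeping, PG 11.4a being the incomplete one ):

stop ceph-osd id=12
ceph-osd -i 12 --flush-journal
ceph-objectstore-tool --op mark-complete --pgid 11.4a --data-path /var/lib/ceph/osd/ceph-12 --journal-path /var/lib/ceph/osd/ceph-12/journal
chown ceph:ceph /var/lib/ceph/osd/ceph-12/current/omap/*
start ceph-osd id=12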

You may ask, why go through all this?  Well, our reason was that having PGs offline caused our Rados Gateways to randomly fail, because writes were getting blocked by down or incomplete PGs.  Ceph won't write to a PG until it knows it's healthy (makes sense).  I really wish the Rados Gateways would throw an error back to the client, though.  Failing to the point where you have to restart the server or the process on the gateway is annoying.  Yes, my stuff shouldn't be offline in the first place, because Ceph is really redundant and I have taken all of the safety valves off.  To me it's standard error handling, though: simply tell the Rados Gateway the cluster is not available for that write and have it throw the error to the client ( yes, the gateway is also a client ).

Obviously, you can simply avoid all of this by having min-size set to 2 or higher ( or leaving it at the default ).  Based on the mailing lists, though, I know I am not the only one doing stupid things with Ceph.  So for all my brothers and sisters forced to do this kind of stuff, here you go!
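
Once the cluster is healthy again, put the guard rails back on.  The pool name is a placeholder, and the values are just the usual 3 copies with a minimum of 2:

ceph osd pool set <poolname> size 3
ceph osd pool set <poolname> min_size 2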


I was looking for an article I found that goes through pulling data from a failed OSD by mounting it and using ceph-objectstore-tool, but I can't find it now.  Basically, you mount the drive back in and export the PG data to a file.  You then take the destination OSD offline and import the data there.  They use the following commands:

ceph-objectstore-tool --op export --pgid XX.XXX --data-path /var/lib/ceph/osd/ceph-X --journal-path /var/lib/ceph/osd/ceph-X/journal --file XX.XXX.export

then

ceph-objectstore-tool --op import --pgid XX.XXX --data-path /var/lib/ceph/osd/ceph-X --journal-path /var/lib/ceph/osd/ceph-X/journal --file XX.XXX.export
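
From what I remember of that process, the scaffolding around those two commands looks roughly like this.  Everything below is made up for illustration: the dead disk mounted read-only at /mnt/old-osd ( assuming its journal lives on the same disk ), OSD 12 as the destination, and PG 11.4a as the one being moved.  Export from the mounted copy, stop the destination OSD, import into its data path, fix ownership, then start it back up.

mount -o ro /dev/sdX1 /mnt/old-osd
ceph-objectstore-tool --op export --pgid 11.4a --data-path /mnt/old-osd --journal-path /mnt/old-osd/journal --file 11.4a.export
stop ceph-osd id=12
ceph-objectstore-tool --op import --data-path /var/lib/ceph/osd/ceph-12 --journal-path /var/lib/ceph/osd/ceph-12/journal --file 11.4a.export
chown ceph:ceph /var/lib/ceph/osd/ceph-12/current/omap/*
start ceph-osd id=12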


I should note that I tried the process they outlined, and the data I tried to pull in was not accepted by the destination OSD host.  It could have been because the data came from an OSD running a previous version.  I had upgraded Ceph, and it went unhealthy during the upgrade; I then continued on to the final upgrade point ( they were a few versions behind ).  When I realized there was no good copy of the data, I tried to export and import.  Lesson learned: stop and export/import the data on the same version.  There were many other factors involved that complicated it beyond what should have been done, so I can't take full responsibility for it.  It was just a bad situation, but they were OK with the data loss, so marking empty PGs complete was the only way to get the cluster healthy again.

This post is not meant to encourage bad behavior; stick to the recommended bits and you shouldn't end up here.  Also, this was done with Luminous on Ubuntu 16.  The last bit with the export was done with Ubuntu 14 and Hammer being upgraded in between, which is why I said to stop and fix first.  I should also note that I have a Firefly cluster running on Ubuntu 12.  The hope is to upgrade that one this year, but it has been so solid that no one wants to touch it ( 6 years online ).  Hopefully, they keep the repositories online for when I do upgrade it.  Perhaps it is still online because I spent so much time nurturing it when I first brought it up ( or at least I like to think that is the case ).