Post-Mortem: outage

The main service became unreliable on Tuesday, 2nd June 2020, at 21:50 CEST. We want to give you some insight into what happened.
Added by lrupp over 2 years ago


As the storage currently in use is going out of service, we started to move to a new storage via the pvmove command. The first 12TB were transferred without any problem and with no noticeable impact on production. After that, the old storage showed some (maybe longstanding, but unnoticed) problems on some drives, resulting in "unreadable sectors" failure messages in the upper filesystem levels. We managed to recover some data by restarting the pvmove with an offset (like pvmove /dev/vdd:4325375+131072 /dev/vde) over and over again - and finally triggered a bug in dm_mirror at kernel level, which is used by pvmove, and a bad block on a hard drive...


As a result, we needed to reboot to get the system working again. As we wanted to get all data transferred to the new storage device, this became a loop:

  1. starting pvmove with offset
  2. waiting for the old storage to run into hard drive timeouts and reset a drive
  3. watching pvmove/dm_mirror run into trouble
  4. seeing the by now well-known kernel oops
  5. rebooting the machine; start at 1
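
The offset-based restarts in this loop can be sketched as a small helper that only builds the pvmove command lines for consecutive extent ranges, in the same SOURCE:START+LENGTH form as the example above. This is an illustration, not the script we used: the function name, the chunking strategy and the end extent are assumptions, and how LVM counts the length part should be checked against lvm(8) before relying on it.

```python
def pvmove_commands(src, dst, start, end, chunk):
    """Yield pvmove command lines for extents start..end-1, chunk extents at a time.

    Hypothetical helper: it only formats strings and never runs pvmove.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    pos = start
    while pos < end:
        # The last range may be shorter than a full chunk.
        length = min(chunk, end - pos)
        yield f"pvmove {src}:{pos}+{length} {dst}"
        pos += length


# Example: three ranges starting at the offset quoted in the text.
# The end extent is made up for illustration.
for cmd in pvmove_commands("/dev/vdd", "/dev/vde",
                           4325375, 4325375 + 3 * 131072, 131072):
    print(cmd)
```

Generating the commands first and running them one at a time makes it easy to note the last range that completed before a reboot, and to resume from there.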

And as everyone knows: the last steps are always the hardest. As we approached the end of the transfer, the loop started to happen more often. Eventually too often for our liking - so we decided to switch over to our 2nd mirror in Provo, which normally holds all the data (21T) as well, but is often a bit outdated because of latency and bandwidth. But this mirror was running stable, so: better old content than no content.

So we finally switched the DNS entries at 23:00 CEST, pointing them to the mirror server in Provo.

Next morning, around 08:00 CEST, people notified us that the SSL certificate was not correct. Right: we forgot to renew the "Let's Encrypt" certificate on the Provo mirror so that it also contains the new DNS entries. This was a one-minute job, but an important one that we forgot after the long day before.
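
A check like the following could catch such a mismatch before a DNS switch. It is a minimal sketch, not something we had in place: it takes a certificate dictionary in the shape returned by Python's ssl.SSLSocket.getpeercert() and reports which hostnames are not covered by the DNS entries in its subjectAltName. The function names and the example.org hostnames are placeholders, and the wildcard handling is simplified to a single left-most label.

```python
def san_dns_names(cert):
    """Return the DNS names from a getpeercert()-style certificate dict."""
    return [value.lower()
            for kind, value in cert.get("subjectAltName", ())
            if kind == "DNS"]


def covers(pattern, hostname):
    """True if a certificate name (exact or '*.' wildcard) covers hostname."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        # Simplified rule: the wildcard stands for exactly one left-most label.
        head, sep, rest = hostname.partition(".")
        return bool(head) and sep == "." and rest == pattern[2:]
    return pattern == hostname


def missing_hostnames(cert, hostnames):
    """Return the hostnames that no DNS entry in the certificate covers."""
    names = san_dns_names(cert)
    return [h for h in hostnames if not any(covers(n, h) for n in names)]


# Placeholder data: a certificate that only knows the mirror's own name.
cert = {"subjectAltName": (("DNS", "mirror.example.org"),)}
print(missing_hostnames(cert, ["mirror.example.org", "download.example.org"]))
```

Any hostname in the returned list would need to be added to the certificate before DNS is pointed at that machine.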

Our openSUSE Kernel Guru Jeff Mahoney and our Bugfinder Rüdiger Oertel helped us with the main problem and provided debug information and new test kernels the whole time, which helped us to track down and finally eliminate the original problem. A big THANK YOU for this, Jeff and Rudi!

So finally, on the morning of Wednesday 3rd June 2020, around 10:00, we were able to finish the pvmove to the new storage. But: with all the problems, we decided to run an xfs_check/xfs_repair on the filesystem - and this takes some time on a 21TB storage. So we decided to leave the DNS pointing to Provo, but instead provide the redirector database there, to free up some bandwidth that is needed to run the office in Provo. Luckily, we still had the DB server, configs and other stuff ready to use there. So all we needed to do was to transfer a current database dump from Nuremberg to Provo, restore the dump and check the old backup setup. This was done in ~30 minutes and Provo was "the new" redirector.

After checking the xfs on the new storage, we finally declared the machine in Nuremberg production ready again around 12:00 CEST and switched the DNS back to the old system in Nuremberg with the new storage.

Lessons Learned

What Went Well

  • As always, our power users and admins are very fast and vocal about problems they see.
  • The close cooperation with our kernel guru and the live chat helped to identify and solve at least the kernel problem.
  • Having a full secondary mirror server at hand, running in another DC and even on another continent, is very helpful if you need to switch over.
  • Having the needed backups and setups ready before a problem occurs also helps to keep the downtime low.

What Went Wrong

  • the full secondary mirror server did not contain up-to-date data for all the 21TB of packages and files. This led to some (luckily small) confusion, as some repositories suddenly contained old data
  • our OBS was not directly affected by the outage, but could not push new packages to the secondary mirror directly. The available bandwidth did not allow us to keep everything in sync.
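
To get a feeling for why bandwidth limits how fresh a remote full mirror can be, here is a rough back-of-the-envelope sketch. The link speeds are purely hypothetical (this post gives no figure for the Provo connection), and the 21TB are treated as decimal terabytes, which is also an assumption.

```python
def transfer_hours(size_bytes, link_mbit_per_s, efficiency=1.0):
    """Estimate the hours needed to move size_bytes over a link.

    efficiency is the usable fraction of the nominal link speed (0 < e <= 1).
    """
    if link_mbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = size_bytes * 8
    seconds = bits / (link_mbit_per_s * 1_000_000 * efficiency)
    return seconds / 3600


FULL_MIRROR_BYTES = 21 * 10**12  # 21TB, assuming decimal terabytes

# Hypothetical link speeds in Mbit/s, only to show the order of magnitude.
for mbit in (100, 1000):
    hours = transfer_hours(FULL_MIRROR_BYTES, mbit)
    print(f"{mbit} Mbit/s: about {hours:.0f} hours for a full copy")
```

A regular sync only moves the changed files, so real runs are much shorter than a full copy, but the same arithmetic applies to whatever amount changes per day.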

Where We Got Lucky

  • having the experts together and having the ability for them to talk directly with each other solves problems way quicker than anything else
  • the setup we used during a power outage of the Nuremberg office 3 years ago was still up and running (and maintained) over all the years. This helped us to set up the backup system very quickly.

Action Items

Limited to the available bandwidth in Provo:

  • try to establish a sync between the databases in Provo and Nuremberg, which would give us a hot standby
  • evaluate possibilities to sync the Provo mirror more often


  • As the filesystem on the standard machine is now some years old, has been hot-resized multiple times and has now seen some problems (which could be repaired by xfs_repair, but nevertheless), we will try to copy the data over to a completely new xfs version 5 filesystem over the next few days
  • Try to get an additional full mirror closer to the one in Nuremberg, which does not have the bandwidth and latency problems - and establish this one as "hot-standby" or even a load-balanced system.


Added by bmwiedemann over 2 years ago

We also should simulate regular (monthly?) power outages to make sure recovery works smoothly.

Added by lrupp over 2 years ago

Hehe: at the moment, I see us having enough real outages... ;-)