openSUSE admin: Post-Mortem: outage (2 comments)

Added by lrupp over 3 years ago


As the current storage used on is running out of service, we started to move to a new storage via the pvmove command. The first 12TB were transferred without any problem and with no noticeable impact on production. After that, the old storage produced some (maybe longstanding, but unnoticed) problems on some drives, resulting in "unreadable sectors" failure messages in the upper filesystem levels. We managed to recover some data by restarting pvmove with an offset (like pvmove /dev/vdd:4325375+131072 /dev/vde) over and over again - and finally triggered a bug in dm_mirror at kernel level, which is used by pvmove, and a bad block on a hard drive...


As a result, we needed to reboot to get the system working again. As we wanted to get all data transferred to the new storage device, this became a loop:

  1. starting pvmove with an offset
  2. waiting for the old storage to run into hard drive timeouts and reset a drive
  3. watching pvmove/dm_mirror run into trouble
  4. seeing the by now familiar kernel oops
  5. rebooting the machine; start again at 1
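
The offset restarts in step 1 can be sketched as a generator of pvmove command lines that walks a range of physical extents in fixed-size chunks. This is only an illustration, not the script we used: the end extent and the chunk layout are made-up values, and the exact meaning of the START+LENGTH form should be checked against the pvmove man page before running anything like this.

```python
def pvmove_commands(src, dst, start, end, chunk):
    """Yield pvmove command lines covering extents from start up to (not including) end."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    pos = start
    while pos < end:
        length = min(chunk, end - pos)
        # PV:START+LENGTH mirrors the form quoted in the post; semantics are an assumption
        yield f"pvmove {src}:{pos}+{length} {dst}"
        pos += length


# Hypothetical range: three chunks of 131072 extents starting at the offset from the post
for cmd in pvmove_commands("/dev/vdd", "/dev/vde", 4325375, 4325375 + 3 * 131072, 131072):
    print(cmd)
```

Generating the commands up front makes it easy to note which chunk was running when the machine went down, and to continue with the next one after the reboot.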

And as everyone knows: the last steps are always the hardest. As we approached the end of the transfer, the loop started to happen more often. Eventually too often for our liking - so we decided to switch over to our 2nd mirror in Provo, which normally holds all the data (21T) as well, but is often a bit outdated because of latency and bandwidth. But this mirror was running stable, so better old content than no content.

So we finally switched the DNS entries for and at 23:00 CEST, pointing to the mirror server in Provo.

The next morning, around 08:00 CEST, people notified us that the SSL certificate for is not correct. Right: we forgot to renew the "Let's Encrypt" certificate on the Provo mirror so that it also contains the new DNS entries. This was a one-minute job, but an important one that we forgot after the long day before.

Our openSUSE Kernel Guru Jeff Mahoney and our Bugfinder Rüdiger Oertel helped us with the main problem and provided debug information and new test kernels the whole time, which helped us to track down and finally eliminate the original problem. A big THANK YOU for this, Jeff and Rudi!

So finally, on the morning of Wednesday 3rd June 2020, around 10:00, we were able to finish the pvmove to the new storage. But with all the problems, we decided to run an xfs_check/xfs_repair on the filesystem - and this takes some time on a 21TB storage. So we decided to leave the DNS pointing to Provo, but provide the redirector database there instead, to free up some bandwidth that is needed to run the office in Provo. Luckily, we still had the DB server, configs and other stuff ready to use there. So all we needed to do was to transfer a current database dump from Nuremberg to Provo, restore the dump and check the old backup setup. This was done in ~30min and Provo was "the new" redirector.

After checking the xfs on the new storage, we finally declared the machine in Nuremberg production ready again around 12:00 CEST and switched the DNS back to the old system in Nuremberg with the new storage.

Lessons Learned

What Went Well

  • As always, our power users and admins are very fast and vocal about problems they see.
  • The close cooperation with our kernel guru and the live chat helped to identify and solve at least the kernel problem.
  • Having a full secondary mirror server at hand, running in another DC and even on another continent, is very helpful if you need to switch over.
  • Having the needed backups and setups ready before a problem occurs also helps to keep the downtime low.

What Went Wrong

  • The full secondary mirror server did not contain up-to-date data for all the 21TB of packages and files. This led to some (luckily small) confusion, as some repositories suddenly contained old data.
  • Our OBS was not directly affected by the outage, but could not push new packages to the secondary mirror directly. The available bandwidth did not allow us to keep everything in sync.

Where We Got Lucky

  • Having the experts together, able to talk directly with each other, solves problems way quicker than anything else.
  • The setup we used during a power outage of the Nuremberg office 3 years ago was still up and running (and maintained) over all the years. This helped us to set up the backup system very quickly.

Action Items

Limited by the available bandwidth in Provo:

  • try to establish a sync between the databases in Provo and Nuremberg, which would allow us a hot-standby
  • evaluate possibilities to sync the Provo mirror more often


  • As the filesystem on the standard machine is now some years old, was hot-resized multiple times and has now seen some problems (which could be somehow repaired by xfs_repair, but nevertheless), we will try to copy the data over to a completely new xfs version 5 filesystem over the next few days
  • Try to get an additional full mirror closer to the one in Nuremberg, which does not have the bandwidth and latency problems - and establish this one as "hot-standby" or even a load-balanced system.

openSUSE admin: IP renumbering in Provo 2020-06-05

Added by lrupp almost 4 years ago

SUSE is getting a new ISP in Provo - and a new set of external IP addresses. This switch affects also some openSUSE servers that are currently running in the Provo datacenter. Mainly the Provo mirror server of, available via

All machines that are currently using an IPv4 address starting with 130.57.72.XX will get a new IPv4 address assigned in the network. Normally, this should go unnoticed, especially if you are using DNS.

Namely, the following four productive services are affected:

The migration will start next Friday, 2020-06-05, 09:00 MDT (click on the link to see the event in your timezone) - we hope to finish it within a few hours.
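
If you have hard-coded addresses somewhere, a quick check like the following can list what is affected. It is a minimal sketch: it assumes that "starting with 130.57.72.XX" means the network 130.57.72.0/24, and the host names and the second address are placeholders, not our real inventory.

```python
import ipaddress

# Assumption: the old range described above corresponds to this /24 network
OLD_NET = ipaddress.ip_network("130.57.72.0/24")


def affected_hosts(hosts):
    """Return the names of hosts whose configured address lies in the old range."""
    return sorted(
        name for name, addr in hosts.items()
        if ipaddress.ip_address(addr) in OLD_NET
    )


# Placeholder inventory for illustration only
inventory = {"mirror.example": "130.57.72.10", "other.example": "192.0.2.5"}
print(affected_hosts(inventory))
```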

openSUSE admin: Upgraded Redmine on (3 comments)

Added by tuanpembual almost 4 years ago

Hi openSUSE Community,

We have been using Redmine as a ticketing system for a very long time. The previous server had Redmine 2.4.5 from 2014 installed on an old SLE 11 SP4 server.

And finally we have successfully migrated to a newer Redmine version. We are currently running Redmine 3.4.12 on a brand new server with Leap 15.1.

This is a long awaited step in a long, long journey. Much time was spent fixing broken plugins, configuration and the database to match the new Redmine version. And we have a new theme to make it look fresh.

Thank you to all the people who helped this migration run smoothly.


openSUSE admin: Introducing debuginfod service for Tumbleweed

Added by lrupp about 4 years ago

We are happy to pre-announce a new service entering the openSUSE world:

debuginfod is an HTTP file server that serves debugging resources to debugger-like tools.

Instead of using the old way to install the needed debugging packages one by one as root like:

zypper install $package-debuginfo

the new debuginfod service lets you debug anywhere, anytime.

Right now the service serves only openSUSE Tumbleweed packages for the x86_64 architecture and runs in an experimental mode.

The simple solution to use the debuginfod for openSUSE Tumbleweed is:

gdb ...

For every lookup, the client sends a query to the debuginfod server and gets back the requested information, so you download only the debugging binaries you really need.
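
To give an idea of what such a query looks like: debuginfod clients address files by the build ID of the binary, using paths of the form /buildid/BUILDID/debuginfo or /buildid/BUILDID/executable. The sketch below only builds such a URL; the server name and the build ID are placeholders, not the real service address.

```python
def debuginfod_url(server, build_id, kind="debuginfo"):
    """Build the lookup URL a debuginfod client would request for a build ID."""
    if kind not in ("debuginfo", "executable"):
        raise ValueError(f"unsupported kind: {kind}")
    return f"{server.rstrip('/')}/buildid/{build_id}/{kind}"


# Placeholder server and build ID, for illustration only
print(debuginfod_url("https://debuginfod.example.org/", "0123456789abcdef"))
```

In practice gdb and similar tools do these lookups for you; the snippet is only meant to show that each request targets one specific binary.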

More information is available at the start page - feel free to contact the initiator marxin directly for more information or error reports.

openSUSE admin: Database monitoring

Added by lrupp about 4 years ago

While we have monitored the basic functionality of our MariaDB (running as a Galera cluster) and PostgreSQL databases for years, we lacked an easy overview of what is really happening within our databases in production. Peaks that slow down response times are especially hard to detect.

That's why we set up our own Grafana instance. The dashboard is public and allows everyone to have a look at:

  • The PostgreSQL cluster behind Around 230 average and up to 500 queries per second are not that bad...
  • The Galera cluster behind the wikis and other MariaDB driven applications like Matomo or Etherpad. One interesting detail here is - for example - the archiving job of Matomo, triggering some peaks every hour.
  • The Elasticsearch cluster behind the wiki search. Here we have a relatively high JVM memory footprint. Something to look at...

Both the Grafana dashboard and the databases drive big parts of the openSUSE infrastructure. And while everything is still up and running, we would love to hear from experts how we could improve. If you are an expert or know someone, feel free to contact us via Email or in our IRC channel.

openSUSE admin: Blocking spammers in

Added by lrupp about 4 years ago

As you may know, every single Email to is forwarded into our ticket system at As this Email is meanwhile widely known in the public Internet, we see a lot of Spam in our ticket system. So far, we mainly ignored that stuff and simply deleted the Email/Ticket.

But our ticket system was never really planned to be a ticket system: we run Redmine, which was originally intended as project management software. The ability to create issues (or tickets, as we call them) in the system by sending an Email was not really intended in the beginning. So the ability to detect and mark Spam Emails as such simply does not exist. Even worse: every Email results in a user that gets created automatically, to allow us to send an Email to this person as an answer to their ticket.

All of this is not really problematic: you learn to deal with it. But with over 14,000 "users" in the database (and over 17,000 real tickets), the system started to become slow. So we invested a bit of our time and looked into the user list. Good for us: most of the Spammers seem to have special days to submit their stuff. And even more interesting: they do it at the same time from multiple accounts!

So we ended up setting large blocks of users to "locked", which will not allow them to use the same Email account again to send their Spam to us - and on the other side this speeds up our database, as most queries only search for "active" users (which is the default). Maybe we can use the gathered Email addresses to feed a Spam filter - later, once we have one.
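
The idea of finding accounts that appeared "at the same time" can be sketched like this. It is not the query we ran against the Redmine database: the one-minute window, the minimum burst size and all addresses below are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def burst_accounts(users, trusted_domains, min_burst=3):
    """Return Emails of accounts created in the same minute as at least min_burst - 1 others."""
    by_minute = defaultdict(list)
    for email, created in users:
        domain = email.rsplit("@", 1)[-1].lower()
        if domain in trusted_domains:
            continue  # never lock accounts from domains we trust
        by_minute[created.replace(second=0, microsecond=0)].append(email)
    return sorted(
        email
        for group in by_minute.values()
        if len(group) >= min_burst
        for email in group
    )


# Made-up sample data
sample = [
    ("a@spam.test", datetime(2019, 11, 1, 3, 15, 2)),
    ("b@spam.test", datetime(2019, 11, 1, 3, 15, 40)),
    ("c@junk.test", datetime(2019, 11, 1, 3, 15, 59)),
    ("dev@trusted.test", datetime(2019, 11, 1, 3, 15, 10)),
    ("lone@mail.test", datetime(2019, 11, 2, 9, 0, 0)),
]
print(burst_accounts(sample, {"trusted.test"}))
```

A larger window or a higher burst threshold trades missed Spam accounts against falsely locked real users.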

As good and simple as this message is: there is a small chance that we blocked/locked some real user accounts in our Redmine instance with this simple workaround. We tried our best and already excluded a lot of domains we trust (like '') in the query. But we cannot guarantee that we did not block your account, as there are simply too many openSUSE users unknown to us. And we want to spend more time on fixing your tickets than on finding out if one of the 10,000 now locked accounts is a false positive.

If you are locked out of (and ONLY on this system/URL), please get in touch with us.

openSUSE admin: updated

Added by lrupp about 4 years ago

The information below might fall into the "unsung heroes of openSUSE" category - we think it is clearly worth mentioning and deserves some applause (not saying that every user should owe the author a beer at the next conference ;-).

  • You are searching for a nice font for the next document?
  • You want to install such a font directly via 1-click-install once you had a closer look?
  • You want to know more about rendering or language information or the character set for a font you want to install?

Just have a look at, which provides all this information for you and some more. Special thanks to Petr Gajdos, who has maintained the page and the package with the same name for years.

openSUSE admin: Etherpad updated (again)

Added by lrupp about 4 years ago

As you might have noticed on our status page, our etherpad instance at was updated to the latest version 3 days ago.

But this time, we did not only upgrade the package (which lives, btw, in our openSUSE:infrastructure project), we also migrated the underlying database.

As often, the initial deployment was done with a "just for testing" mindset by someone who afterward left his little project. And - also as often - this kind of deployment suddenly became a production service. This means - in turn - that our openSUSE heroes team suddenly gets tickets for services we originally neither set up nor maintained.

For etherpad, this means that we suddenly faced a "dirty.db" file of over 2GB in size, filling up the root-fs of the machine. Upstream even has a warning in their boot script, telling everyone that a dirty.db is NOT for production... :-/

The first try, using the script to reduce the size, did not finish after 2 days. So we decided to dump the data directly from the dirty.db into our Galera cluster. After changing the initially created table schema from MyISAM to InnoDB (Galera does not like MyISAM), the migration script took "only" 16 hours.

With this final migration, we hope to be prepared for the next update - and hope that this only takes minutes again.
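
For reference, one way to switch an existing MariaDB table to InnoDB is an ALTER TABLE ... ENGINE=InnoDB statement. The helper below just generates such statements for tables that are still MyISAM. The table names are hypothetical, and this is not necessarily how we changed the schema during the migration.

```python
def to_innodb_statements(tables):
    """Build one ALTER TABLE statement per (name, engine) pair that is still MyISAM."""
    return [
        f"ALTER TABLE `{name}` ENGINE=InnoDB;"
        for name, engine in tables
        if engine.lower() == "myisam"
    ]


# Hypothetical table list, e.g. taken from information_schema.TABLES
print(to_innodb_statements([("store", "MyISAM"), ("sessions", "InnoDB")]))
```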

openSUSE admin: IPv6 for machines in Provo

Added by lrupp about 4 years ago

After some back and forth, I'm happy to announce that more machines in the Provo data center use IPv6 in addition to their IPv4 address. Namely:

  • (main mirror for US/Pacific regions)

  • (fallback for

  • (fallback for

  • (new DNS server - not yet productive)

Sadly, neither the forums nor the WordPress instances are IPv6 enabled. But we are hoping for the best: this is something we would like to work on next year...

openSUSE admin: Root cause analysis of the OBS downtime 2019-12-14

Added by lrupp about 4 years ago

Around 16:00 CET on 2019-12-14, one of the Open Build Service (OBS) virtualization servers (which run some of the backend machines) stopped operating. Reason: a power failure in one of the UPS systems. Unlike the usual setup, this single server had both power supplies on the same UPS - resulting in a complete power loss, while all other servers were still powered via their redundant power supply.

In turn, the communication between the API and those backend machines stopped. The API queued up the incoming requests until it reached a state where it could not handle any more.

By moving the backends over to another virtualization server, the problem was temporarily fixed (since ~19:00) and the API started working through the backlog. The cabling on the problematic server has since been fixed and the machine is online again. So we are sure that this specific problem will not happen again in the future.
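
A cabling mistake like this can be caught with a simple check over an inventory of power feeds. The sketch below assumes such an inventory exists as a mapping from server name to the UPS behind each power supply; the names are invented and this is not a tool we actually run.

```python
def single_ups_servers(feeds):
    """Return servers whose power supplies do not span at least two different UPS systems."""
    return sorted(
        server for server, ups_list in feeds.items()
        if len(set(ups_list)) < 2
    )


# Invented inventory: one UPS name per power supply
feeds = {
    "virt1": ["ups-a", "ups-b"],
    "virt2": ["ups-a", "ups-a"],  # both supplies on the same UPS
}
print(single_ups_servers(feeds))
```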

