Root cause analysis of the OBS downtime 2019-12-14

Added by lrupp over 4 years ago

Around 16:00 CET on 2019-12-14, one of the Open Build Service (OBS) virtualization servers (which run some of the backend machines) stopped operating. Reason: a power failure in one of the UPS systems. Unlike all the others, this single server had both power supplies connected to the same UPS - resulting in a complete power loss, while all other servers were still powered via their redundant power supply.

In turn, the communication between the API and those backend machines stopped. The API queued up the incoming requests until it reached a state where it was not able to handle any more.

By moving the backends over to another virtualization server, the problem was temporarily fixed (since ~19:00) and the API started working through the backlog. The cabling on the problematic server has meanwhile been fixed and the machine is online again, so we are sure that this specific problem will not happen again.

Piwik -> Matomo

Added by lrupp over 4 years ago

You might know that Piwik was renamed to Matomo more than a year ago. While everything is still compatible and even the scripts and other (internal) data are still named piwik, the rename is affecting more and more areas. Upstream is working hard to finalize the rename - while trying not to break too much along the way. But even the file names will be renamed in some future version.

Time - for us - to do some maintenance and start following upstream with the rename. Luckily, our famous distribution already has matomo packages in the main repository (which currently still lack AppArmor profiles, but hey: we can and will help here). So the main things left to do are a database migration and the adjustment of all the small bits and pieces here and there where we still use the old name.

While the database migration has already happened silently, the other, "small" adjustments will take some time - especially as we need to find all the places that need to be adjusted and identify the contact persons who can make the final change. But we are on it - well before Matomo upstream makes the final switch. :-)

updated

Added by lrupp over 4 years ago

Our infrastructure status page uses Cachet under the hood. While the latest update brought a couple of bugfixes, it also deprecated the RSS and Atom feeds that could be used to integrate the information easily into other applications.

While we are somewhat sad to see such a feature go, we also have to admit that the developers' decision is not a bad one - the generation of those feeds had some problems (bugs) in the old Cachet versions. Instead of fixing them, the developers decided to move on and focus on other areas. So it's understandable that they cut something that is not in their focus, to save resources.

As an alternative, you might want to subscribe to status changes and incident updates via email or use the API that is included in the software for your own notification system. And who knows: maybe someone will provide us with an RSS feed generator that utilizes the API?
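
A feed generator along those lines could be fairly small. The following is a minimal sketch, not a finished tool: it assumes the incident list has already been fetched from the Cachet API and parsed from JSON, and that each entry carries `id`, `name`, `message` and `created_at` fields (please check this against your Cachet version). The function name, the permalink scheme and the `https://status.example.org` address are placeholders made up for illustration.

```python
import xml.etree.ElementTree as ET


def incidents_to_rss(incidents, base_url, title="Status incidents"):
    """Render a list of incident dicts as an RSS 2.0 document (string).

    `incidents` is assumed to be the already-parsed list of incidents
    from the status page API; the field names used here are assumptions.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = "Incident updates"

    for incident in incidents:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = incident.get("name", "")
        ET.SubElement(item, "description").text = incident.get("message", "")
        # Hypothetical permalink scheme; adjust to the real status page.
        link = f"{base_url}/incidents/{incident.get('id', '')}"
        ET.SubElement(item, "link").text = link
        ET.SubElement(item, "guid").text = link
        if incident.get("created_at"):
            # Copied through unchanged; strict RSS readers expect an
            # RFC 822 date here, so a conversion may be needed.
            ET.SubElement(item, "pubDate").text = incident["created_at"]

    return ET.tostring(rss, encoding="unicode")
```

Calling `incidents_to_rss(data, "https://status.example.org")` returns the feed as a string, which could then be written to a static file by a cron job.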

SSL cipher updates

Added by lrupp over 4 years ago

Sometimes it's a good idea to follow best practices. This is what we did by following the recommendations for "general-purpose servers with a variety of clients, recommended for almost all systems" from

With this, our services accept only TLS 1.2 connections and the latest elliptic curve ciphers. If your client or browser does not support these settings, it's definitely time for you to consider an update.
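
If you want to check whether your own client setup can still talk to such a server, a small script is enough. The following is a minimal sketch using only Python's standard `ssl` module; it builds a client context that refuses anything older than TLS 1.2 and reports which version was negotiated. The function names are made up for this example, and you have to supply the host name yourself.

```python
import socket
import ssl


def make_tls12_context():
    """Return a client context that refuses protocol versions below TLS 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def negotiated_version(host, port=443, timeout=5.0):
    """Connect to host:port and return the negotiated TLS version string."""
    context = make_tls12_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()
```

If the handshake fails with an `ssl.SSLError`, the client and the server could not agree on a protocol version or cipher.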

While we are looking forward to TLS 1.3 support, the openssl version on our systems (currently running Leap 15.1) does not support it - yet. Once there is an update, we'll let you know.

The first steps to a more modern infrastructure

Added by tampakrap almost 6 years ago

Happy SysAdmin day!


This is a small write-up of our ongoing effort to move our infrastructure to modern technologies like Kubernetes. It all started a bit before Hack Week 17 with the "microservices and serverless for the infrastructure" project. As the project description mentions, the trigger behind it is that our infrastructure is getting bigger and more complicated, so the need to migrate to a better solution is also increasing. Docker containers and Kubernetes (for container orchestration) seemed like the proper ones, so after reading tutorials and docs, it was time to get our hands dirty!

Installing and experimenting with Kubernetes / CaaSP

We started by installing the SUSE CaaSP product on the internal Heroes VLAN. It provides an additional admin node which sets up the Kubernetes cluster. The cluster is not that big for now: it consists of the admin node, three kube-master nodes behind a load balancer and four kube-minions (workers). The product was at version 2 when we started, but version 3 has since become available, and that is the one we're using right now. It worked flawlessly, and we were even able to install the Kubernetes dashboard on top of it, which was our first Kubernetes-hosted webapp.

Since the containers inside Kubernetes are on their own internal network, we also needed a load balancer to expose the services to our VPN. Thus, we experimented with Ingress for load balancing the applications deployed inside Kubernetes, also successfully. A lot of experiments around deployments, scaling and permissions took place afterwards to familiarize us with the new concepts, which of course ended up in us destroying our cluster multiple times. We were surprised, though, to see the self-healing mechanisms taking over.

Although the experiments have only involved static pages so far, they still allowed us to learn a lot about Docker itself, e.g. how to create our own images and deploy them to our cluster. It's also worth mentioning the amazing kctl tool: just take a look at its README to realize how much more useful it is compared to the official kubectl.

Time to move to the next layer.

Installing and experimenting with Cloud Foundry / CAP

The next step was to install yet another SUSE product, this time the Cloud Application Platform, which offers a Platform as a Service solution based on the software named Cloud Foundry. Here we hit the first blocker, though: CAP requires a working Kubernetes storageclass, which means that we needed a persistent storage backend. A good solution would be a distributed filesystem like Ceph, but due to the time limitations of Hack Week we decided to go with a simpler solution for now, and the simplest was an NFS server. The CAP installation was smooth from that point on, and we managed to log in to our Cloud Foundry installation via the command line tool, as well as via the Stratos web UI. A wildcard domain *.cf.mydomain.tld was also needed here.

The idea was quite straightforward here: go to your git repository, and type a simple command like: cf push -m 256M -k 512M myapp. This would deploy a new app directly to Cloud Foundry, giving it 256MB of RAM and 512MB of disk space. As a bonus, it created a domain immediately! So the benefits here were quite obvious: no need to build our own container image with the app and set up a mechanism to deploy it, and no need for the manual step of setting up a DNS entry. The Ingress LB for Kubernetes that was mentioned before is also obsolete now, as Cloud Foundry handles this as well. The command cf scale also gives us the ability to scale up/down (increase/decrease memory/disk) or scale in/out (increase/decrease the number of instances).
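
For scripted deployments, the two commands can be wrapped in a few lines. The following is only a sketch: the helper names are made up, `myapp` is a placeholder, and it assumes the `cf` command line tool is installed and already logged in. It uses the `-m` (memory) and `-k` (disk) options from the example above, plus the `-i` (instances) option of `cf scale`.

```python
import subprocess


def cf_push_args(app, memory="256M", disk="512M"):
    """Build the argument list for deploying `app` with the given limits."""
    return ["cf", "push", app, "-m", memory, "-k", disk]


def cf_scale_args(app, instances=None, memory=None, disk=None):
    """Build the argument list for scaling `app`.

    instances -> scale in/out, memory/disk -> scale up/down.
    Options left at None are not passed to cf at all.
    """
    args = ["cf", "scale", app]
    if instances is not None:
        args += ["-i", str(instances)]
    if memory is not None:
        args += ["-m", memory]
    if disk is not None:
        args += ["-k", disk]
    return args


def run(args):
    """Run a cf command; raises CalledProcessError if cf reports a failure."""
    return subprocess.run(args, check=True)
```

For example, `run(cf_scale_args("myapp", instances=3))` would ask Cloud Foundry for three instances of the app, while `run(cf_push_args("myapp"))` corresponds to the push command shown above.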

Time for stress testing

Hack Week 17 was over, so in the following days we deployed a few static apps (and one dynamic app that also needed a memcached backend), giving them the absolute minimum of disk/RAM (around 10MB of RAM and 10MB to 256MB of disk, depending on the webapp). We triggered a number of bots that started requesting the webpages in a loop, and the results were really impressive: even with such minimal resources and only one instance running, CPU usage rose to a maximum of only 15%!
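
Such a request loop does not need much. The following is a minimal sketch of a bot of that kind, not the tooling we actually used: it requests one URL repeatedly from a few threads and counts successes and failures. All names and default values are made up for illustration, and it should only be pointed at services you are allowed to test.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout=10.0):
    """Fetch url once and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_load(url, workers=4, requests_per_worker=25, fetch=fetch_status):
    """Request url repeatedly from several threads.

    Returns (successes, failures, elapsed_seconds). `fetch` can be
    replaced by any callable taking a URL, e.g. for a dry run.
    """
    def worker(_worker_id):
        ok = failed = 0
        for _ in range(requests_per_worker):
            try:
                if fetch(url) == 200:
                    ok += 1
                else:
                    failed += 1
            except OSError:
                # urllib's URLError/HTTPError and timeouts are OSError subclasses.
                failed += 1
        return ok, failed

    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(worker, range(workers)))
    elapsed = time.monotonic() - started
    return (sum(r[0] for r in results), sum(r[1] for r in results), elapsed)
```

Watching the CPU and memory usage of the app while something like `run_load(url, workers=8, requests_per_worker=500)` is running gives a rough first impression of how much headroom an instance has.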

As a second step, we put some static webapps of low importance in the cluster and made them public. We're not going to reveal which ones yet though, feel free to guess :) We plan to monitor the resource usage and activity for a few days, and if everything is fine, to put some more important webapps in as well.

Future tasks

There are a lot of future tasks that need to be resolved before we fully hit production. First of all, as mentioned, the NFS storageclass needs to be replaced with a proper distributed filesystem solution. The SUSE Enterprise Storage product is a good candidate for it. Furthermore, we'd need to integrate our LDAP server with both CaaSP and CAP accounts. Last but not least, we are very close to getting dynamic webpages that need a relational database working.

The overall progress is tracked in a Trello board, and of course the internal heroes documentation has more info about the setup. Volunteers are always welcome, feel free to contact us in case you'd like to jump on board.

Thanks to everybody who helped with setting up the cluster, and to the SUSE CaaSP and CAP teams for replying to our tons of questions. Special thanks go to Dimitris Karakasilis and Panagiotis Georgiadis for joining me before, during and after Hack Week 17 and still being around, turning this from a simple idea into a production-ready project.

On behalf of the openSUSE Heroes team,
Theo Chatzimichos

Extended maintenance this Thursday, 2017-12-07

Added by Anonymous over 6 years ago

UPDATE: all systems are back online.

During the maintenance window this Thursday, 2017-12-07, we will not only do the regular maintenance on all machines: this time we will migrate the machine hosting to a completely new system running openSUSE Leap 42.3. Together with this switch, we will bring the new PostgreSQL database cluster into production, which has also been running on openSUSE Leap 42.3 for a while now. As some of the old configurations and services will be changed during that time as well (for example: switching from lighttpd to nginx with TLS 1.2 and HTTP/2 support for our "last resort mirror"), we will use this week for some extended testing to make the migration as smooth and quick as possible. But as always: bad things can happen, so we would like to inform you in advance that there might be some longer downtimes during the switch. If you need to upgrade or update your machines this Thursday morning, please check for a mirror server on our mirror page.

The maintenance of will not last for longer than 30 minutes: we will fire up a second machine that will (in a first version) act as failover (using keepalived) in case the main machine is under maintenance or has a bigger issue. As we want to test the failover, please expect small hiccups during those 30 minutes. After that, we hope that this service will also be highly available, like a couple of other services we set up during the last weeks.

New Galera cluster running in production

Added by Anonymous over 6 years ago

As we reported in one of our last news items, we set up a new Galera cluster for all our applications that make use of MySQL. This cluster should allow us to do maintenance on one of the cluster nodes at any time - and should also distribute the workload between the nodes, via the HAProxy in front.

One problem, which affected us for example in case of, is the MyISAM tables: Galera is not really ready (yet?) to sync the content of such tables (even if you can enable the "experimental" feature, if you don't care much about your data). As a result, we not only need to have a look at and migrate each and every single MyISAM table - worse still, we also need to have a look at the code of the application to identify problematic SQL statements (like DELAYED inserts for example) - and patch it where needed.
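
Both checks can be scripted. The following is a minimal sketch under a few assumptions: that the migration target is InnoDB, that the list of (schema, table) rows comes from running the query below with whatever database client you use, and that the application code is available as plain text. All names are made up for illustration, and a real conversion may need more care than a bare ALTER TABLE.

```python
import re

# Query listing MyISAM tables outside the system schemas.
FIND_MYISAM_SQL = """
SELECT table_schema, table_name
FROM information_schema.tables
WHERE engine = 'MyISAM'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
"""


def alter_statements(rows):
    """Turn (schema, table) rows into ALTER TABLE ... ENGINE=InnoDB statements.

    The statements are only generated, not executed, so they can be
    reviewed before anything is changed.
    """
    return [
        f"ALTER TABLE `{schema}`.`{table}` ENGINE=InnoDB;"
        for schema, table in rows
    ]


# Matches INSERT DELAYED and REPLACE DELAYED, in any letter case.
DELAYED_RE = re.compile(r"\b(?:INSERT|REPLACE)\s+DELAYED\b", re.IGNORECASE)


def find_delayed_statements(source_text):
    """Return (line_number, line) pairs that contain a DELAYED statement."""
    return [
        (number, line.strip())
        for number, line in enumerate(source_text.splitlines(), start=1)
        if DELAYED_RE.search(line)
    ]
```

Running `find_delayed_statements` over each source file of an application gives a first list of places to review; statements that are assembled from several strings at runtime will not be caught this way.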

But the good news for the two databases mentioned: so far everything (still) seems to work. Other applications will follow one by one (as some, like connect, need adaptations).

New Galera Cluster up and running

Added by Anonymous over 6 years ago

One major step towards a reliable infrastructure was done last week: we implemented a new Galera Cluster, which should provide a highly available environment for all services that rely on MySQL/MariaDB. Instead of simply migrating the old Master-Master setup, we decided to implement something new - giving us not only the ability to grow, but also the chance to show how reliable an openSUSE-driven infrastructure is (Note: the new cluster is of course using Leap 42.3 as its base).

During the next days, we will fine-tune the database setup and migrate the production workload from the old to the new cluster. Some of the things we learned during that migration might end up in some articles on - so stay tuned! :-)

