Cumulative update
Hello!
It's been a long time since a news update here. Many things have changed, and there is lots of exciting news which has been partially communicated in emails and meetings, but not in much detail. With this post, I try to work through some of the news backlog that has gathered over a good half of a year.
Data center move¶
With SUSE moving one of their data centers (the one which hosted the majority of the openSUSE infrastructure) to Prague towards the end of 2023, we got the opportunity not only to move our systems along, but also to introduce fundamental changes which would otherwise have been difficult to implement in the existing environment - both for SUSE and for openSUSE.
SUSE graciously provided us with not only brand new hardware, but also autonomy over what we do with it. Whereas the physical infrastructure in the old location was entirely operated by SUSE, with openSUSE merely having SSH access to the virtual machines, the new setup allows the openSUSE Heroes to fully manage their hypervisors and networking stack. This freedom allows the team to implement new ideas and react to issues with virtual machines without relying on and waiting for SUSE internal support.
Maintenance¶
Of course, lots of freedom comes with lots of responsibility. That makes it even more important for our infrastructure to be stable and easy to maintain - after all, the Heroes team is made up of volunteers, few of whom can be motivated to spend their free time on boring maintenance tasks. Hence we now enforce automatic security updates on all machines using os-update and rebootmgr, with added logic to reduce outages by staggering package updates and reboots across machines in high availability clusters, and to alert administrators about machines which cannot install updates on their own, or which require manual reboot intervention.
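To give a rough idea of how this is wired up, here is a minimal, hypothetical Salt sketch enabling both tools - it is not the actual state from our repository, and it assumes the packages ship systemd units named os-update.timer and rebootmgr.service:

```yaml
# Minimal sketch only - the real states live in the heroes-salt repository.
update_tools:
  pkg.installed:
    - pkgs:
      - os-update
      - rebootmgr

# Assumption: the packages ship systemd units with these names.
os-update.timer:
  service.running:
    - enable: True
    - require:
      - pkg: update_tools

rebootmgr.service:
  service.running:
    - enable: True
    - require:
      - pkg: update_tools
```

The staggering and alerting logic mentioned above sits on top of this and is not shown here.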
Automation¶
We have been using Salt as an automation and infrastructure as code solution for a long time already, but have now increased the Salt coverage by a large margin. Whereas previously only a comparatively small amount of machine configuration was actively maintained using Salt, now the whole base system (network configuration, repositories, kernel settings, authentication, monitoring, logging, certificates, email, automatic updates) is covered, with a steadily increasing number of the various machine specific services following (not surprisingly, the large number of services under the opensuse.org umbrella comes with a large number of applications requiring configuration).
Using the infrastructure as code approach has multiple benefits, some of which are:
- configuration changes are tracked and documented in a Git history
- configuration is unified across all machines
- central deployment of changes is possible

To fully use those benefits, it is essential to have broad coverage, and to reduce how often one needs to manually visit a machine for changes.
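As a small, purely illustrative example of what "covering the base system" looks like in practice, a kernel setting managed through a Salt state might be expressed like this (the value is made up and does not reflect our actual configuration):

```yaml
# Hypothetical example of a base system setting managed as code.
vm.swappiness:
  sysctl.present:
    - value: 10
```

Applied through Salt, such a change is reviewed in Git, rolled out centrally, and identical on every machine it targets.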
We encourage you to check our Git repositories if you are curious (the salt.git repository is now automatically mirrored to code.opensuse.org and GitHub!):
https://github.com/openSUSE/heroes-salt or https://code.opensuse.org/heroes/salt/tree/production
https://github.com/openSUSE/salt-formulas or https://code.opensuse.org/heroes/salt-formulas/tree/main
Monitoring¶
Monitoring already had an important role in our infrastructure before, but it was time to expand it further. The setup, which had its center in our old data center and was based on Nagios and Icinga, served us very well - but after working with it for some time, I gathered the following concerns:
- the server configuration was not managed by Salt, and there was no existing community-maintained Salt formula for it
- lots of checks were executed on machines through a central agent running as root
- single failure domain
- new applications often required custom check scripts to cover them
The data center move broke the existing setup, which finally answered my question about whether to refactor or whether to build something new.
After evaluating multiple solutions, including Nagios again, Munin and Zabbix, the choice eventually fell on a stack with Prometheus at its heart. Not only do I have existing experience with Prometheus, but I also valued:
- configuration of all server and client components happens through environment variables or YAML files - perfect for integration into a code-based infrastructure without lots of custom templating
- existing Salt formulas
- each monitored service is equipped with a dedicated metrics exporter - for one, this keeps other services actively monitored if a single exporter breaks, but more importantly it allows for deep privilege separation, as most monitoring checks can be performed with only a small set of permissions (see the scrape sketch after this list)
- many applications, especially ones providing web services (of which we run plenty), have built-in support for serving Prometheus-compatible metrics - those allow for extremely easy integration with no need to install or write custom monitoring checks or "translation" tools
- plenty of existing community-supported exporters for applications which do not provide built-in Prometheus metrics - their configuration is usually standardized, making RPM packaging and deployment easy
- easy to add custom checks to collect metrics not covered by any of the above
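To illustrate how little glue this needs, here is a hypothetical scrape configuration fragment pulling metrics from two node_exporter instances - the host names are invented for this example and do not correspond to our real machines:

```yaml
# Illustrative prometheus.yml fragment - host names are hypothetical.
scrape_configs:
  - job_name: node
    scrape_interval: 60s
    static_configs:
      - targets:
          - 'example1.infra.opensuse.org:9100'  # node_exporter default port
          - 'example2.infra.opensuse.org:9100'
```

Adding a new target boils down to appending a line like the above - which, in a Salt-managed setup, is just another small commit.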
The new setup already gathers a large amount of additional metrics from various services which were not monitored before. For most of them we have implemented alerting to notice odd behavior early, and added visualization dashboards with graphs - of course because they are pretty, but also to make spotting anomalies at a glance easier.
The alerting rules, which are made of PromQL expressions, can be found in the previously linked repositories under salt/files/prometheus/alerts/. Currently, we route alerts based on their severity to either IRC (#opensuse-admin-alerts - still a bit experimental), email, or dashboard receivers.
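For readers unfamiliar with the format, here is a generic example of such a rule - it is not taken from our repository, but shows how a PromQL expression and a severity label (which the routing described above acts on) fit together:

```yaml
# Generic example of a Prometheus alerting rule - not one of our actual rules.
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical   # the severity label drives the receiver routing
        annotations:
          summary: 'Instance {{ $labels.instance }} has been unreachable for 5 minutes'
```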
The alerting dashboard (using Karma) - currently only reachable if connected to the openSUSE Heroes VPN:
https://alerts.infra.opensuse.org/
The visualization dashboards in Grafana - accessible to everyone:
https://monitor.opensuse.org/grafana/
Our monitoring posture is on a good track, but there is still a lot of work to do to cover more services, fine-tune alerts, and update the internal documentation and response procedures.
Security and internal authentication¶
Previously we used a FreeIPA-based LDAP setup for authentication of Heroes to internal services and machines.
We did not use the majority of the features FreeIPA offered, and no one wanted to maintain a machine not running openSUSE anymore - with Kanidm experience already in the team, the decision about what to migrate to was easy.
In the process, we replaced the sssd installations on all machines with kanidm-unixd, and globally enforced public key based authentication for SSH. Not only is the setup simpler and well maintained, we also found improvements with credential caching, allowing logins even during network issues on a host.
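Purely as an illustration of the SSH part (this is a sketch, not the actual state from our repository), enforcing key-only logins through Salt could look roughly like this, assuming an OpenSSH version that reads drop-in files from /etc/ssh/sshd_config.d/:

```yaml
# Hypothetical sketch - enforce public key authentication via an sshd drop-in.
/etc/ssh/sshd_config.d/40-key-only.conf:
  file.managed:
    - contents: |
        PasswordAuthentication no
        KbdInteractiveAuthentication no

sshd:
  service.running:
    - enable: True
    - watch:
      - file: /etc/ssh/sshd_config.d/40-key-only.conf
```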
VPN¶
The gateway to all of our internal openSUSE infrastructure is an OpenVPN setup. We improved its security by switching to the current best practices regarding compression.
Closing words¶
Of course, there have been various other changes, some of which were too small to cover here, and some of which were simply forgotten about. You will find out about them either in future news posts or by inspecting our commit and ticket histories.
Have fun!