openSUSE admin: Thank you, SUSE QE

Added by lrupp 14 days ago

Some here might not know it, but some teams from the 'SUSE Quality Engineering Linux Systems Group' use the Redmine installation here at https://progress.opensuse.org/ to track the results of the test automation for openSUSE products. Especially openQA feature requests are tracked and coordinated here.

As the plain Redmine installation does not provide all the features we want, we have been using the "Redmine Agile plugin" from RedmineUP for a while now. Luckily, the free version of the plugin already provided nearly 90% of the requested additional features, so everybody was happy and we could run this service without problems. But today, we got some money to buy the PRO version of the plugin - which we happily did :-)

There is another plugin, named Checklist, for which we also got the go-ahead to order the PRO version. Both plugins are now up and running on our instance here - and all projects can make use of the additional features.

We would like to thank SUSE QE for their sponsoring. And we would also like to thank RedmineUP for providing these (and more) plugins to the community as free and PRO versions. We are happy to be able to donate something back for your work on these plugins. Keep up the good work!

openSUSE admin: Thank you, SonarSource (1 comment)

Added by lrupp 2 months ago

There are times when keeping your system up to date does not protect you against vulnerabilities. For those times, you want to have your servers and applications hardened as well as possible - including good AppArmor profiles. But even then, something bad can easily happen - and it's very good to see that others are paying attention. Especially if these others are professionals who look out for you, even if you did not ask them directly.

Tuesday, 2021-08-31, was such a day for our openSUSE infrastructure status page: SonarSource reported a pre-auth remote code execution vulnerability at the https://status.opensuse.org/api/v1/incidents endpoint to us.

SonarSource, driven by studying and understanding real-world vulnerabilities, tries to help the open-source community secure their projects. They disclosed vulnerabilities in the open-source status page software Cachet and informed us directly that the version we were running was vulnerable to CVE-2021-39165. It turned out that the Cachet upstream project is now effectively dead: it fell out of support by its original maintainers a while ago. It slipped into this unsupported state unnoticed by us - and potentially by many others as well. A fate that many dead open-source projects sadly share.

Thankfully, the openSUSE Security team (well known as the first contact for security issues) as well as Christian (one of our glorious openSUSE heroes) reacted quickly and professionally:

  • SonarSource informed our Security team 2021-08-31, 15:39
  • Our Security team opened a ticket for us just two hours later, at 2021-08-31, 17:08
  • Already one hour later, at 2021-08-31 18:29, Christian deployed a first hot-fix on our instances (Note: the original admin of the systems was on vacation)
  • 2021-08-31 at 23:35, Christian already provided a collection of suspicious requests to the affected URL
  • Meanwhile, there was a fix provided in a forked Github repository, which was applied to our installations one day later, 2021-09-01. This made our installations secure again (cross-checked by our Security Team and SonarSource). A response time of one day, even if the original upstream of a project is not available any longer - and the original admin of a system is on vacation! :-)
  • ...and we started a long analysis of the "what" and "when"...
  • In the end, we identified 6 requests from one suspicious IP which we couldn't assign to someone we know. So we decided to distrust our installations. There might have been a successful attack, even if we could not find any further evidence on the installed system (maybe thanks to the AppArmor profile?) or in the database. BUT: an attacker could have extracted user account data.
  • The user accounts of the Cachet application are only used to inform our users about infrastructure incidents. An attacker might have been able to log in and report fake incidents - or send out emails to those who subscribed to incident reports or updates. Something we don't like to see. Luckily, these accounts are in no way connected to the normal user accounts. They existed on these systems for exactly one purpose: informing our users.
  • As a result, we informed all users of the status.opensuse.org instances that they should change their password on a new system, set up from scratch. This new system is now deployed and in production, while the image of the old system is still available for further investigation.

Big kudos to Thomas Chauchefoin (SonarSource), Gianluca Gabrielli and Marcus Meissner (openSUSE Security Team) and Christian Boltz (openSUSE Heroes) for all their work, their good cooperation and quick reactions!

openSUSE admin: Upgrading to the next PostgreSQL version (1 comment)

Added by lrupp 8 months ago

Time passes by so quickly: we installed our PostgreSQL cluster around 2008. At least, this was the time of the first public MirrorBrain release 2.2, which was the reason to run a PostgreSQL installation for openSUSE. But MirrorBrain (and therefore the PostgreSQL cluster behind it) is way older. So maybe it's fair to say that MirrorBrain started with openSUSE in 2005...?

Anyway: if you maintain a database for such a long time, you don't want to lose data. Downtimes are also not a good idea, but that's why we have a cluster, right?

While the MirrorBrain database is currently still the biggest one (>105GB in size, with ~120 million entries alone in the table listing the files on the mirrors), our newer services like Matrix, Mailman3, Gitlab, Pagure, lnt or Weblate are not that small any more either. Altogether, they currently use 142GB.

We have already upgraded our database multiple times (starting with version 7.x in the past). But this time, we decided to try a major jump from PostgreSQL 11 to 13, without any step in between.

So how do we handle an upgrade of a PostgreSQL database? In general, we just follow the upstream documentation, only adjusting the values to our local setup:

Local setup details

  • Our configuration files are stored in /etc/postgresql/ - and symlinked into the current data directory. This not only makes it easier to include them in a general backup; we also set their file ownership to root:postgres - editable only by root and readable only by the postgres group (file permissions: 0640).
  • Below the generic data directory for PostgreSQL on openSUSE (/var/lib/pgsql), we have "data" directories for each version: data11 for the currently used PostgreSQL 11 version.
  • A symlink /var/lib/pgsql/data points to the currently active database directory (data11 in the beginning).
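
The symlink flip that happens later during the upgrade can be sketched in a scratch directory (a hypothetical demo; the real path is /var/lib/pgsql):

```shell
# Demonstrate the data -> dataNN symlink convention in a throw-away directory.
tmp=$(mktemp -d)
mkdir "$tmp/data11" "$tmp/data13"
ln -s data11 "$tmp/data"       # 'data' points at the active cluster
ln -sfn data13 "$tmp/data"     # flip after the upgrade (-n replaces the link itself,
                               # instead of creating a link inside data11)
active=$(readlink "$tmp/data")
echo "$active"                 # data13
rm -rf "$tmp"
```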

Step-by-Step

Preparation

First, let us set up some shell variables that we will use throughout the steps. As we need these variables multiple times as user 'root' and user 'postgres' later, let's place them into a file that we can refer to (source) later:

    cat > /tmp/postgresql_update << EOL
    export FROM_VERSION=11
    export TO_VERSION=13
    export DATA_BASEDIR="/var/lib/pgsql/"
    export BIN_BASEDIR="/usr/lib/postgresql"
    EOL

Note: you can get DATA_BASEDIR from the currently running PostgreSQL instance with: ps aufx | grep '^postgres.* -D'

Don’t forget to source the file with the variables in the steps below.

Install new RPMs

Install the new binaries in parallel to the old ones (find out which ones you need via rpm or zypper):

    source /tmp/postgresql_update
    zypper in $(rpmqpack | grep "^postgresql${FROM_VERSION}" | sed -e "s|${FROM_VERSION}|${TO_VERSION}|g")

Initialize the new version

Now change into the database directory and create a new sub-directory for the migration:

    su - postgres
    source /tmp/postgresql_update
    cd ${DATA_BASEDIR}
    install -d -m 0700 -o postgres -g postgres data${TO_VERSION}
    cd ${DATA_BASEDIR}/data${TO_VERSION}
    ${BIN_BASEDIR}${TO_VERSION}/bin/initdb .

For the exact parameters of the initdb call, you can search the shell history for the last run of initdb. But we go with the standard setup above.

You should end up in a completely independent, fresh and clean PostgreSQL data directory.

Now back up the new config files and create symlinks to the current ones. It's recommended to diff the old and the new config files and to keep a close eye on the logs during the first starts. Worst case: the new server won't start with the old settings at all - but this will show up in the log files.

    su - postgres
    source /tmp/postgresql_update
    cd ${DATA_BASEDIR}/data${TO_VERSION}

    for i in pg_hba.conf pg_ident.conf postgresql.conf postgresql.auto.conf ; do
        old $i                    # 'old' (an openSUSE helper) backs up $i as $i-$(date +"%Y%m%d")
        ln -s /etc/postgresql/$i .
        # diff $i $i-$(date +"%Y%m%d")
    done

Downtime ahead: do the migration

Next step is to finally do the job - this includes a downtime of the database!

    rcpostgresql stop

    su - postgres
    source /tmp/postgresql_update
    pg_upgrade --link             \
     --old-bindir="${BIN_BASEDIR}${FROM_VERSION}/bin"     \
     --new-bindir="${BIN_BASEDIR}${TO_VERSION}/bin"       \
     --old-datadir="${DATA_BASEDIR}/data${FROM_VERSION}/" \
     --new-datadir="${DATA_BASEDIR}/data${TO_VERSION}/"

The --link option is very important if you want to have a short downtime:

    --link                    link instead of copying files to new cluster

In our case, the operation above took ~20 minutes.

Hopefully you end up with something like:

    [...]
    Upgrade Complete
    ----------------
    Optimizer statistics are not transferred by pg_upgrade so,
    once you start the new server, consider running:
        ./analyze_new_cluster.sh

    Running this script will delete the old cluster's data files:
        ./delete_old_cluster.sh

Switch to the new PostgreSQL version

Switch to the new database directory. In our case, we prefer a symlink, which points to the right directory:

    source /tmp/postgresql_update
    cd ${DATA_BASEDIR}
    ln -sfn data${TO_VERSION} data

Note the -n option: without it, ln would follow the existing "data" symlink and create the new link inside the old data directory.

As an alternative, you can switch the database directory by editing the configuration in /etc/sysconfig/postgresql:

    source /tmp/postgresql_update
    echo "POSTGRES_DATADIR='${DATA_BASEDIR}/data${TO_VERSION}'" >> /etc/sysconfig/postgresql

(You may prefer to edit the file directly instead and set the correct value right there.) The path in the file should match the ${DATA_BASEDIR} variable.

Start the new server

    systemctl start postgresql

Cleanup

pg_upgrade created some scripts in the folder where it was started. Either execute these scripts directly (as the postgres user) or use the following commands:

    sudo -i -u postgres
    source /tmp/postgresql_update
    ${BIN_BASEDIR}${TO_VERSION}/bin/vacuumdb \
     --all \
     --analyze-in-stages
    ${BIN_BASEDIR}${TO_VERSION}/bin/reindexdb \
     --all \
     --concurrently

Please note that the two commands above affect the performance of your server. While you execute them, your database might become less responsive (or even unresponsive). So you might want to use a maintenance window for them. On the other hand, your new database server will perform much better once you have executed these commands. So don't wait too long.

Check everything

Now is a perfect time to check monitoring, database access from the applications and so on. After that, you can remove the old database directory, uninstall the old binaries and remove /tmp/postgresql_update.
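
The cleanup could look roughly like this - a sketch only, with the destructive commands left commented out for review:

```shell
FROM_VERSION=11                 # normally: source /tmp/postgresql_update
DATA_BASEDIR="/var/lib/pgsql/"

# The old cluster directory that can go away
# (pg_upgrade also generated ./delete_old_cluster.sh for exactly this):
echo "old data dir: ${DATA_BASEDIR}data${FROM_VERSION}"

# Uninstall the old binaries, e.g.:
#   zypper rm $(rpmqpack | grep "^postgresql${FROM_VERSION}")
# ...and finally drop the helper file with the shell variables:
#   rm /tmp/postgresql_update
```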

But in general, you can mark this migration as finished.

openSUSE admin: Playing along with NFTables (1 comment)

Added by lrupp 9 months ago

By default, openSUSE Leap 15.x is using the firewalld firewall implementation (and the firewalld backend is using iptables under the hood).

But for a while now, openSUSE has also shipped nftables support - although neither YaST nor any other special tooling is currently configured to support it directly. And we have some machines in our infrastructure that are neither straightforward desktop machines nor idle most of the time. So let's see how good we are at trying out and testing new things, and use one of our central administrative machines: the VPN gateway, which gives all openSUSE heroes access to the internal world of the openSUSE infrastructure.

This machine is already a bit special:

  • The "external" interface holds the connection to the internet
  • The "private" interface is inside the openSUSE heroes private network
  • We run openVPN with tun devices (one for udp and one for tcp) to allow the openSUSE heroes to connect via a personal certificate + their user credentials
  • In addition, we run wireguard to connect the private networks in Provo and Nuremberg (at our Sponsors) together
  • And before we forget: our VPN gateway is not only a VPN gateway - it is also used as the gateway to the internet for all internal machines, allowing only 'pre-known' traffic destinations

All this makes the firewall setup a little bit more complicated.

BTW: giving your interfaces explicit names like "external" or "private", as in our example, is a huge benefit when you work with services or firewalls. Just have a look at /etc/udev/rules.d/70-persistent-net.rules once your devices are up and rename them according to your needs (you can also use YaST for this). But remember to also check/rename the interfaces in /etc/sysconfig/network/ifcfg-* to use the same names before rebooting your machine. Otherwise you end up with a non-working network setup.
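
Such a rename rule could look like the following sketch (the MAC address is made up for illustration - match it against your real interface):

```
# /etc/udev/rules.d/70-persistent-net.rules
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="52:54:00:12:34:56", NAME="external"
```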

Let's have a short look at the area we are talking about:

[Diagram: openSUSE Heroes gateway]

As you may have noticed, none of the services on the community side is affected. There we have standard (iptables-based) firewalls and use proxies to forward user requests to the right server.

On the openSUSE Heroes side, we exchanged the old SuSEfirewall2-based setup with a new one based on nftables.

There are a couple of reasons that influenced us in switching over to nftables:

  • the old SuSEfirewall2 worked, but generated a huge iptables list on our machine in question
  • using ipsets or variables with SuSEfirewall2 was doable, but not an easy task
  • we ran into some problems with NAT and Masquerading using firewalld as frontend
  • Salt is another interesting field:
    • Salt'ing SuSEfirewall2 by deploying some files on a machine is always possible, but not really straightforward
    • there is no Salt module for SuSEfirewall2 (and there will probably never be one)
    • there are Salt modules for firewalld and nftables, both on nearly the same level
  • nftables has been integrated in the kernel for a while and should replace all the *tables modules in the long term. So why not jump directly to it, as we (as admins) do not use GUI tools like YaST or firewalld-gui anyway?

So what are the major advantages?

  1. Sets are part of the core functionality. You can have sets of ports, interface names, and address ranges. No more ipset. No more multiport.

         ip daddr { 1.1.1.1, 1.0.0.1 } tcp dport { dns, https } oifname { "external", "wg_vpn1" } accept;

     This means you can have very compact firewall rulesets that cover a lot of cases with a few rules.
  2. No more extra rules for logging. Only turn on counter where you need it.

         counter log prefix "[nftables] forward reject " reject

  3. You can cover IPv4 and IPv6 with a single ruleset when using table inet, but you can have per-IP-protocol tables as well. And sometimes you even need them, e.g. for postrouting.
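
To illustrate that last point, a hypothetical per-family NAT table for a gateway like ours (interface name as in our setup; this fragment is not from our production config) could look like:

```
table ip nat {
    chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;
        # rewrite internal source addresses when traffic leaves via "external"
        oifname "external" masquerade
    }
}
```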

Starting from scratch

A very basic /etc/nftables.conf would look something like this:

    #!/usr/sbin/nft -f

    flush ruleset

    # This matches IPv4 and IPv6
    table inet filter {
        # chain names are up to you.
        # what part of the traffic they cover
        # depends on the type line.
        chain input {
            type filter hook input priority 0; policy accept;
        }
        chain forward {
            type filter hook forward priority 0; policy accept;
        }
        chain output {
            type filter hook output priority 0; policy accept;
        }
    }

But so far we did not block or allow any traffic - actually, we let everything in and out, because all chains have the policy accept.

    #!/usr/sbin/nft -f

    flush ruleset

    table inet filter {
        chain base_checks {
            ## another set, this time for connection tracking states.
            # allow established/related connections
            ct state {established, related} accept;

            # early drop of invalid connections
            ct state invalid drop;
        }

        chain input {
            type filter hook input priority 0; policy drop;

            # allow from loopback
            iif "lo" accept;

            jump base_checks;

            # allow icmp and igmp
            ip6 nexthdr icmpv6 icmpv6 type { echo-request, echo-reply, packet-too-big, time-exceeded, parameter-problem, destination-unreachable, mld-listener-query, mld-listener-report, mld-listener-reduction, nd-router-solicit, nd-router-advert, nd-neighbor-solicit, nd-neighbor-advert, ind-neighbor-solicit, ind-neighbor-advert, mld2-listener-report } accept;
            ip protocol icmp icmp type { echo-request, echo-reply, destination-unreachable, router-solicitation, router-advertisement, time-exceeded, parameter-problem } accept;
            ip protocol igmp accept;

            # for testing: reject with logging
            counter log prefix "[nftables] input reject " reject;
        }
        chain forward {
            type filter hook forward priority 0; policy accept;
        }
        chain output {
            type filter hook output priority 0; policy accept;
        }
    }

You can activate the configuration with nft --file nftables.conf - but be careful doing this on a remote machine: a wrong rule can lock you out. It is also a good habit to run nft --check --file nftables.conf before actually loading the file, to catch syntax errors.

So what did we change?

  1. Most importantly, we changed the policy of the input chain to drop and added a reject rule at the end. So nothing gets in right now.
  2. We allow all traffic on the loopback interface.
  3. The base_checks chain handles all packets related to established connections. This makes sure that incoming packets for outgoing connections get through.
  4. We allowed important ICMP/IGMP packets. Again, this is using a set and the type names, not some cryptic numbers. YAY for readability.

Now if someone tries to open an SSH connection to our machine, we will see:

    [nftables] input reject IN=enp1s0 OUT= MAC=52:54:00:4c:51:6c:52:54:00:73:a1:57:08:00 SRC=172.16.16.2 DST=172.16.16.30 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=22652 DF PROTO=TCP SPT=55574 DPT=22 WINDOW=64240 RES=0x00 SYN URGP=0

and nft list ruleset will show us:

    counter packets 1 bytes 60 log prefix "[nftables] input reject " reject

So we are secure now. Though maybe allowing SSH back in would be nice - you know, just in case.
We have two options now. Option 1 is to insert the following line before our reject rule:

    tcp dport 22 accept;

But did we already mention that we have sets and that they are great? Especially great if we need the same list of ports/IP ranges/interface names in multiple places.

There are two ways to define sets:

    define wanted_tcp_ports {
      22,
    }

Yes, the trailing comma is OK - and it makes adding elements to the list easier, so we use it all the time.
This will change our rule above to:

    tcp dport $wanted_tcp_ports accept;

If we load the config file and run nft list ruleset, we will see:

    tcp dport { 22 } accept

But there is actually a slightly better way to do this:

    set wanted_tcp_ports {
        type inet_service; flags interval;
        elements = {
           ssh
        }
    }

That way our firewall rule becomes:

    tcp dport @wanted_tcp_ports accept;

And if we dump our firewall with nft list ruleset afterwards, it will still show up as @wanted_tcp_ports, not with the variable replaced by its value.
While this is great already, the second syntax has one more advantage:

    $ nft add element inet filter wanted_tcp_ports \{ 443 \}

Now our wanted_tcp_ports set allows ports 22 and 443.
This is of course even more useful with IP addresses.

    set fail2ban_hosts {
        type ipv4_addr; flags interval;
        elements = {
           192.168.0.0/24
        }       
    }

Let us append some elements to that set, too.

    $ nft add element inet filter fail2ban_hosts \{ 192.168.254.255, 192.168.253.0/24 \}
    $ nft list ruleset

... and we get ...

        set fail2ban_hosts {
                type ipv4_addr
                flags interval
                elements = { 192.168.0.0/24, 192.168.253.0/24,
                             192.168.254.255 }
        }

Now we could change fail2ban to append elements to this set instead of creating a new rule for each machine it wants to block. Fewer rules. Faster processing.

But by reloading the firewall, we dropped port 443 from the port list again. Oops.
Though, if you are happy with the rules as they are, you can just run:

    $ nft list ruleset > nftables.conf

When you use sets instead of variables, all your firewall rules will still look nice after such a dump.

Our complete firewall now looks like this:

    table inet filter {
            set wanted_tcp_ports {
                    type inet_service
                    flags interval
                    elements = { 22, 443 }
            }

            set fail2ban_hosts {
                    type ipv4_addr
                    flags interval
                    elements = { 192.168.0.0/24, 192.168.253.0/24,
                                 192.168.254.255 }
            }

            chain base_checks {
                    ct state { established, related } accept
                    ct state invalid drop
            }

            chain input {
                    type filter hook input priority filter; policy drop;
                    iif "lo" accept
                    jump base_checks
                    ip6 nexthdr ipv6-icmp icmpv6 type { destination-unreachable, packet-too-big, time-exceeded, parameter-problem, echo-request, echo-reply, mld-listener-query, mld-listener-report, mld-listener-done, nd-router-solicit, nd-router-advert, nd-neighbor-solicit, nd-neighbor-advert, ind-neighbor-solicit, ind-neighbor-advert, mld2-listener-report } accept
                    ip protocol icmp icmp type { echo-reply, destination-unreachable, echo-request, router-advertisement, router-solicitation, time-exceeded, parameter-problem } accept
                    ip protocol igmp accept
                    tcp dport @wanted_tcp_ports accept
                    counter packets 12 bytes 828 log prefix "[nftables] input reject " reject
            }

            chain forward {
                    type filter hook forward priority filter; policy accept;
            }

            chain output {
                    type filter hook output priority filter; policy accept;
            }
    }

For more details, see the nftables wiki.

openSUSE admin: IPv6 support for machines in US region

Added by lrupp 10 months ago

Today we reached a new milestone: all openSUSE services around the world now support IPv6 natively. As of today, the last set of machines in Provo is equipped with IPv6 addresses. IPv6 had been missing for those machines since the renumbering (which was needed because of the carve-out of SUSE from Microfocus). Thanks to one of our providers, who has now reserved and routed a whole /48 IPv6 network for us.

With this, we can also run all our DNS servers with IPv6 - and not only do they have IPv6 addresses themselves, all our external DNS entries for the opensuse.org domain should now contain IPv4 and IPv6 addresses as well. Don't worry, you did not miss much: dual stack (IPv4 and IPv6) has been the case for all services in Germany for a long, long time already - and we even had it for the machines in the US for a long time, before SUSE switched providers. But this finally brings us to the same level in all locations!

openSUSE admin: Playing with Etherpad-lite

Added by lrupp 10 months ago

When updating to the latest etherpad-lite version 1.8.7 (which includes quite a few bug fixes), we also revisited the currently installed plugins and, as usual, updated them to their latest versions as well.

To get some impression of the usage of our Etherpad instance, we also enabled the ether-o-meter plugin, which is now producing some nice live graphs and statistics here: https://etherpad.opensuse.org/metrics

We also enabled some additional styles and hope this makes using Etherpad even more fun and productive for you. If you want some more modules enabled (or just want to say "hello"), feel free to contact us! Either via an email to admin@opensuse.org or by reaching out to us in the #opensuse-admin channel on irc.opensuse.org.

openSUSE admin: Upgraded matomo

Added by lrupp 10 months ago

As usual, we keep our infrastructure up to date. While this is easy for the base system ('zypper patch', you know? ;-), most of the applications need special handling. Normally, we package them as well in our OBS repositories. This often means that we need to maintain them on our own, but at least packaging them allows us to track them easily and integrates them into the normal workflow. All we have to do to keep them updated on the production machines is a 'zypper up' (which updates all packages with a higher version and/or release number - while a 'zypper patch' only updates packages that come with an official patchinfo).

Upgrading Matomo from version 3.14.1 to 4.1.1 was not that easy: simply replacing the files in the package was not enough. Upstream changed so much in the database structure that the standard calls in the post-installation script (which normally update the database during a package update) were just not enough. As this is (hopefully) a one-time effort, we ran some steps manually from the command line, which took ~20 hours. After that, our DB was updated, cleaned up and ready to go again.

Summary: being an openSUSE hero means not only being an infrastructure guru with automation and scripting skills. Often enough, you need some packaging expertise as well - and sometimes even that is not enough!

openSUSE admin: Short network timeouts expected during the weekend

Added by lrupp over 1 year ago

The SUSE office in Nuremberg will get disconnected from any former Microfocus network over the upcoming weekend (2020-06-20/21). Most of the changes should go unnoticed. But SUSE-IT needs to replace some hardware and they informed us that there might be short outages or curious network timeouts during this time. Especially around Sunday, 2020-06-21, in the afternoon.

We will keep you updated via our status page if we become aware of any longer outages.

openSUSE admin: Post-Mortem: download.opensuse.org outage (2 comments)

Added by lrupp over 1 year ago

Summary

As the current storage used by download.opensuse.org is being taken out of service, we started to move to a new storage via the pvmove command. The first 12TB were transferred without any problem and with no noticeable impact on production. After that, the old storage produced some (maybe longstanding, but unnoticed) problems on some drives, resulting in "unreadable sectors" failure messages at the upper filesystem levels. We managed to recover some data by restarting pvmove with some offset (like pvmove /dev/vdd:4325375+131072 /dev/vde) over and over again - and finally triggered a kernel-level bug in dm_mirror (which is used by pvmove), in combination with a bad block on a hard drive...

Details

As a result, we needed to reboot download.opensuse.org to get the system back to work. As we wanted to get all data transferred to the new storage device, this became a loop:

  1. starting pvmove with offset
  2. waiting for the old storage to run in hard drive timeouts and resetting a drive
  3. looking at the pvmove/dm_mirror running into trouble
  4. seeing the meanwhile known kernel oops
  5. rebooting the machine; start at 1
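
The loop above can be sketched as follows (the extent offset and chunk size are taken from the pvmove example in the summary; treat this as an illustration, not a ready-made recovery script):

```shell
# pvmove accepts a START+LENGTH range in physical extents,
# so each restart can skip past the region that triggered the oops:
DEV_OLD=/dev/vdd
DEV_NEW=/dev/vde
START=4325375
CHUNK=131072

echo "pvmove ${DEV_OLD}:${START}+${CHUNK} ${DEV_NEW}"   # run, wait for the trouble...
START=$((START + CHUNK))                                # ...reboot, continue behind the bad spot
echo "pvmove ${DEV_OLD}:${START}+${CHUNK} ${DEV_NEW}"
```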

And as everyone knows: the last steps are always the hardest. While reaching the end of the transfer, the loop started to happen more often. Finally too often for our liking - so we decided to switch over to our second mirror in Provo, which normally holds all the data (21T) as well, but is often a bit outdated because of latency and bandwidth. This mirror was running stable, though - and better old content than no content.

So we finally switched the DNS entries for download.opensuse.org and downloadcontent.opensuse.org at 23:00 CEST, pointing to the mirror server in Provo.

Next morning, around 08:00 CEST, people notified us that the SSL certificate for download.opensuse.org was not correct. Right: we had forgotten to reissue the "Let's Encrypt" certificate on the Provo mirror so that it also covers the new DNS names. This was a one-minute job, but an important one that slipped through after the long day before.

Our openSUSE kernel guru Jeff Mahoney and our bugfinder Rüdiger Oertel helped us with the main problem and provided debug information and new test kernels the whole time, which helped us to track down and finally eliminate the original problem. A big THANK YOU for this, Jeff and Rudi!

So finally, on the morning of Wednesday, 3 June 2020, around 10:00, we were able to finish the pvmove to the new storage. But with all the problems, we decided to run an xfs_check/xfs_repair on the filesystem - and this takes some time on a 21TB storage. So we decided to leave the DNS pointing to Provo, but instead provide the redirector database there, to free up some bandwidth that is needed to run the office in Provo. Luckily, we still had the DB server, configs and other pieces ready to use there. So all we needed to do was transfer a current database dump from Nuremberg to Provo, restore the dump and check the old backup setup. This was done in ~30 minutes and Provo was "the new download.opensuse.org" redirector.

After checking the xfs on the new storage, we finally declared the machine in Nuremberg production ready again around 12:00 CEST and switched the DNS back to the old system in Nuremberg with the new storage.

Lessons Learned

What Went Well

  • As always our power users and admins are very fast and vocal about problems they see.
  • The close cooperation with our kernel guru and the live chat helped to identify and solve at least the kernel problem.
  • Having a full secondary mirror server at hand, running in another DC on another continent, is very helpful if you need to switch over.
  • Having the needed backups and setups ready before a problem occurs also helps to keep the downtime low.

What Went Wrong

  • the full secondary mirror server did not contain up-to-date data for all the 21TB of packages and files. This led to some (luckily small) confusion, as some repositories suddenly contained old data
  • our OBS was not directly affected by the outage, but could not push new packages to the secondary mirror directly. The available bandwidth did not allow keeping everything in sync.

Where We Got Lucky

  • having the experts together and having the ability for them to talk directly with each other solves problems way quicker than anything else
  • the setup we used during a power outage of the Nuremberg office 3 years ago was still up and running (and maintained) all these years. This helped us to set up the backup system very quickly.

Action Items

Limited to the available bandwidth in Provo:

  • try to establish a sync between the databases in Provo and Nuremberg, which would give us a hot standby
  • evaluate possibilities to sync the Provo mirror more often

General:

  • As the filesystem on the standard download.opensuse.org machine is now some years old, has been hot-resized multiple times and has now seen some problems (which could somehow be repaired by xfs_repair, but nevertheless), we will try to copy the data over to a completely new XFS version 5 filesystem during the next days
  • Try to get an additional full mirror closer to the one in Nuremberg, which does not have the bandwidth and latency problems - and establish this one as "hot-standby" or even a load-balanced system.

openSUSE admin: IP renumbering in Provo 2020-06-05

Added by lrupp over 1 year ago

SUSE is getting a new ISP in Provo - and a new set of external IP addresses. This switch also affects some openSUSE servers that are currently running in the Provo datacenter, mainly the Provo mirror server of download.opensuse.org, available via http://provo-mirror.opensuse.org/.

All machines that are currently using an IPv4 address starting with 130.57.72.XX will get a new IPv4 address assigned in the 91.193.113.64/27 network. Normally, this should go unnoticed, especially if you are using DNS.

Namely, the following four productive services are affected:

The migration will start next Friday, 2020-06-05, at 09:00 MDT - we hope to finish it within a few hours.
