All backend servers now run on one of three KVM virtualization hosts with the following hardware:
* 48 Cores
* 512 GB RAM
* 20 Gb Ethernet (incl. FCoE)
The eight virtual servers running on this hardware use the resources very well, and we can still keep everything running on just two of the virtualization hosts while the third one is being serviced. We hope that this new setup improves the availability of the openSUSE Build Service and reduces the overall downtime for you.
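The "two out of three hosts" claim is a simple capacity check, which can be sketched roughly as follows. This is only an illustration: it assumes the listed 48 cores and 512 GB RAM apply to each host, the per-VM sizes are invented, and it compares totals only rather than checking how the VMs would actually be placed on the hosts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cores: int
    ram_gb: int


def survives_host_loss(host: Resources, host_count: int,
                       vms: list[Resources], hosts_down: int = 1) -> bool:
    """Return True if the summed VM demand fits on the remaining hosts.

    This is an aggregate check only; it ignores per-host placement.
    """
    remaining = host_count - hosts_down
    if remaining <= 0:
        return False
    need_cores = sum(vm.cores for vm in vms)
    need_ram = sum(vm.ram_gb for vm in vms)
    return (need_cores <= remaining * host.cores
            and need_ram <= remaining * host.ram_gb)


# Assumption: every host has the listed hardware.
HOST = Resources(cores=48, ram_gb=512)
# Hypothetical sizing: eight identical VMs (the real sizes are not published here).
VMS = [Resources(cores=8, ram_gb=96)] * 8

print(survives_host_loss(HOST, 3, VMS))                # one host in service
print(survives_host_loss(HOST, 3, VMS, hosts_down=2))  # two hosts down
```

With these made-up numbers the VMs fit on two hosts but not on one, which is the kind of headroom described above.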
At the moment, we are fixing the last small issues (such as long live-migration times and the synchronization of the configuration between the machines).
The migration of the openSUSE Mailing Lists finished successfully. If you encounter any issues, please mail us at admin at opensuse dot org.
On Tuesday 2015-06-09, from 09:00 to 11:00 UTC, the machine that hosts the openSUSE Mailing Lists will be offline. During that time, sending mail to or receiving mail from the openSUSE mailing lists, and viewing their archives, will not be possible. All mail sent during the downtime will be delayed.
The reason is that the old machine runs an old distribution and is running out of resources. We will migrate the service to a new virtual machine, which will also integrate it into a new configuration management infrastructure.
We’ll send a follow-up announcement with the final status as soon as we finish the migration.
It's that time of the year again. The monitoring was prodding us with "Your certificate will expire soon". While we fiddled with the tools to create the new CSR, we wondered "Can we go 4K?". 4K is hip right now. 4K video. 4K TVs. So why not a 4K certificate? After a quick check with the security team about the extra CPU load and a look at our monitoring, the answer was:
"Yes we can"
So with this refresh, the certificate for all the SSL-enabled services in Nuremberg is 4096 bits long. The setup has been running with the new certificate for a few days now and so far we have not noticed any problems. The next stop will be upgrading the SSL endpoints to a newer distribution so we get TLS 1.2 and our A grade back. Stay tuned for more news on this front.
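The kind of "certificate will expire soon" check that started all this can be sketched with the Python standard library. This is a minimal illustration, not our actual monitoring: the function names, the 30-day threshold and the dates are made up. The `notAfter` string uses the format that `ssl.SSLSocket.getpeercert()` returns.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and the certificate's notAfter timestamp."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True once the certificate is within `warn_days` of expiring."""
    return days_until_expiry(not_after, now) <= warn_days


# Hypothetical values, for illustration only.
NOT_AFTER = "Jun 30 12:00:00 2015 GMT"
NOW = datetime(2015, 6, 10, 12, 0, 0, tzinfo=timezone.utc)

print(days_until_expiry(NOT_AFTER, NOW))
print(needs_renewal(NOT_AFTER, NOW))
```

Passing `now` explicitly keeps the check easy to test; a real probe would use the current time and the certificate fetched from the live endpoint.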
Thanks to our monitoring, we became aware of a "nearly full" source server partition for build.opensuse.org in time. (The submitted sources inside the openSUSE Build Service currently take up more than 7.4 TB, and that is already after deduplication!)
So we started to move to a new storage array using the "pvmove" command, but for currently unknown reasons this stopped in the middle of the transfer :-(
As a result of the attempts to stop the running process, the database (also used for software.opensuse.org) and some files got damaged. So we spent most of the weekend restoring files from the last backup and finishing the migration.
Now everything has been moved to the new storage and is up and running again. But this incident reminded us that nobody ever needs a backup - just a restore... ;-)
You may know that openSUSE also provides many parts of its infrastructure via IPv6. Something that started as a "proof of concept" in 2011 has since turned into a reliable and problem-free service. But in the beginning, we got a "temporary" IPv6 range that now needs to be used elsewhere: so it's time to move to a "final" IPv6 range that should last for the next few years.
Our ISP has already provided us with the new range of IPv6 addresses, and next week (week 20 of 2014) we will start to:
1. add the new addresses to the running hosts
2. change the DNS entries to point to the new addresses
3. run the old and new IPv6 addresses in parallel for a few days
4. remove the old addresses
For end users, this switch should be "invisible" - but we will of course run some tests beforehand and listen carefully on irc.freenode.net#opensuse-admin and email@example.com in case someone encounters a problem.
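To follow the four steps per host, one could classify each host by the addresses it currently carries. The sketch below is purely illustrative: the two prefixes come from the IPv6 documentation range (2001:db8::/32), not from our real old or new ranges, and the function name and phase labels are invented.

```python
import ipaddress

# Hypothetical prefixes from the documentation range, not the real ones.
OLD_RANGE = ipaddress.ip_network("2001:db8:1::/48")
NEW_RANGE = ipaddress.ip_network("2001:db8:2::/48")


def migration_phase(addresses: list[str]) -> str:
    """Classify a host by which IPv6 ranges its addresses belong to."""
    addrs = [ipaddress.ip_address(a) for a in addresses]
    has_old = any(a in OLD_RANGE for a in addrs)
    has_new = any(a in NEW_RANGE for a in addrs)
    if has_old and has_new:
        return "parallel"      # steps 1-3: both ranges active
    if has_new:
        return "migrated"      # step 4 done: old addresses removed
    if has_old:
        return "not started"
    return "unknown"


print(migration_phase(["2001:db8:1::10"]))
print(migration_phase(["2001:db8:1::10", "2001:db8:2::10"]))
print(migration_phase(["2001:db8:2::10"]))
```

A host should only move on to step 4 once it reports "parallel" and the DNS entries have pointed to the new addresses for a few days.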
You might have seen the announcement on news.opensuse.org already: one of the main storage systems had some problems last week and we are still suffering from the effects (one virtual array is still missing).
But there was also another, smaller issue: the internal storage on one of the backend servers also reported problems. The machine is a bit older and uses an internal RAID controller to provide ten 1 TB disks to the running SLES system. As the RAID controller is also "very old" (~ 6-7 years), each of the 1 TB disks is exported as a "single RAID" - the controller is simply not able to handle a RAID of more than 1 TB in size. On top of that, a software RAID 6 runs over all ten disks.
Now our monitoring notified us that the RAID was degraded: one of the ten disks had died (a naughty little beggar who claims "btrfs is the reason" ;-). So far so good. But guess how frustrated an admin can be when he tries to identify the broken disk and there is absolutely NO LIGHT or other indicator on the disk cages? So we guessed the "right" disk - and - heya: chose the wrong one. Happily, with RAID 6 you can lose two hard disks without a problem. So we re-inserted the disk, waited for the RAID to finish the rebuild, and tried again... But sadly the RAID controller then started to break: right after inserting the disk, the controller lost nearly all disks, resulting in an array with a lot of "spares". A reboot solved the problem - for ~10 minutes...
So after 60 minutes of fighting against old hardware, we decided to go with another solution: using an old, dedicated FC storage. Luckily the old server came back successfully after we inserted the extra FC card, and the RAID controller even allowed us to mount the degraded RAID in read-only mode to copy over the last bits and bytes.
After 3 hours of "mdadm /dev/md2 --add /dev/sdx1; mdadm --stop /dev/md2; mdadm --assemble --scan --force; mdadm ..." loops, we can report that the backend for the staging projects is back without any data loss...
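For those wondering why pulling the wrong disk was survivable: RAID 6 spends the capacity of two disks on parity, so an array keeps working with up to two members missing. A small sketch of that arithmetic (the function names and state labels are ours, not mdadm output):

```python
def raid6_usable_tb(disks: int, disk_tb: float) -> float:
    """Usable capacity of a RAID 6 array: two disks' worth goes to parity."""
    if disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return (disks - 2) * disk_tb


def raid6_state(disks: int, missing: int) -> str:
    """Rough array state for a given number of missing member disks."""
    if not 0 <= missing <= disks:
        raise ValueError("missing must be between 0 and the number of disks")
    if missing == 0:
        return "clean"
    if missing <= 2:
        return "degraded"   # still readable and writable, reduced redundancy
    return "failed"         # more than two members gone: data is not recoverable


print(raid6_usable_tb(10, 1.0))  # ten 1 TB disks
print(raid6_state(10, 1))        # the disk that died
print(raid6_state(10, 2))        # ... plus the wrongly pulled one
print(raid6_state(10, 3))
```

With one dead disk and one wrongly pulled disk the array was at exactly the limit, which is why the controller losing further disks was the real problem.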
Today was a bit of a cleanup day - so we took the freedom to spend 1.5 hours manually checking all feeds aggregated on http://planet.opensuse.org/ and disabled 52 of them. The good news: there are still 327 active feeds left.
You hopefully know already that the source code of planet.opensuse.org is available on GitHub?
If not - and especially if you wonder exactly which RSS feeds are now disabled - please visit https://github.com/openSUSE/planet.opensuse.org/blob/master/planetsuse/feeds. This is the authoritative source for the parser. So if you find some mistakes (might happen - we are all humans) or want to get your feed aggregated, please "fork us on GitHub" or send us patches.
The Event Calendar Plugin on http://news.opensuse.org/ has been found to break the site. The sites were recently moved to new servers and a new database server that runs MySQL 5.5.x, whereas the old server ran MySQL 5.1.x. After the move, the site threw an SQL syntax error, which turned out to be caused by the installed Event Calendar Plugin.
This plugin has been deactivated and the site is now functioning properly. We will need to find either a replacement plugin or an updated version of this one.
As Ticket #2302 explains, we are currently facing an issue with the scripts that re-create the hotstuff-XXXg modules on stage.opensuse.org and rsync.opensuse.org: all directories created by this script are empty.
The developer of this script is currently on vacation, so we might need to wait until (at least) Monday, 2014-04-14, before this gets fixed.
As a result, we have stopped the rsync server on rsync.opensuse.org and stage.opensuse.org for now, so that people do not delete 30 - 640 GB of content when they run their rsync commands with the recommended "--delete --delete-after" option.
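A simple guard against this kind of accident is to refuse to publish a module tree that contains empty directories. The sketch below is a hypothetical pre-publish check, not the script mentioned above; the function names and the idea of running it before starting the rsync server are assumptions.

```python
import os


def empty_subdirs(root: str) -> list[str]:
    """Names of direct subdirectories of `root` that contain nothing at all."""
    empties = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir() and not os.listdir(entry.path):
                empties.append(entry.name)
    return sorted(empties)


def safe_to_publish(root: str) -> bool:
    """True only if no module directory under `root` is empty.

    An empty module combined with "rsync --delete" on the mirror side
    would remove the mirror's copy of that module.
    """
    return not empty_subdirs(root)
```

Called with the directory that holds the generated modules, `safe_to_publish()` returning `False` would be the signal to keep the rsync server stopped until the modules are rebuilt.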