You may know that openSUSE provides many parts of its infrastructure via IPv6 as well. What started as a "proof of concept" in 2011 has since turned into a reliable and problem-free service. But in the beginning we got a "temporary" IPv6 range that now needs to be used elsewhere: so it's time to move to a "final" IPv6 range that should last for the years to come.
Our ISP has already provided us with the new range of IPv6 addresses, and we will start next week (week 20 of 2014) to:
- add the new addresses to the running hosts
- change the DNS entries to point to the new addresses
- run the old and new IPv6 addresses in parallel for a few days
- remove the old addresses (a rough sketch of these steps follows below)
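On each host, the move basically boils down to something like the following. This is only a minimal sketch: the addresses use the 2001:db8::/32 documentation prefix and the interface and host names are made up, not our real configuration.

    # add the new address alongside the old one on a host
    ip -6 addr add 2001:db8:f00::25/64 dev eth0
    # update the AAAA record in DNS, e.g.
    #   download.opensuse.org.  3600  IN  AAAA  2001:db8:f00::25
    # after a few days of parallel operation, drop the old address
    ip -6 addr del 2001:db8:dead::25/64 dev eth0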
For end users, this switch should be "invisible" - but we will of course run some tests beforehand and listen carefully on irc.freenode.net#opensuse-admin and admin@opensuse.org in case someone encounters any problems.
You might have seen the announcement on news.opensuse.org already: one of the main storage systems had problems last week and we are still suffering from the effects (one virtual array is still missing).
But there was also another, smaller issue: the internal storage of one of the backend servers also reported problems. The machine is a bit older and uses an internal RAID controller to provide ten 1 TB disks to the running SLES system. As the RAID controller is also "very old" (~6-7 years), each of the 1 TB disks is exported as a "single RAID" - the controller is simply not able to handle an array larger than 1 TB. On top of that, a software RAID 6 runs across all ten disks. Now our monitoring notified us that the RAID was degraded: one of the ten disks had died (a naughty little beggar that claims "btrfs is the reason" ;-). So far, so good. But guess how frustrated an admin can be when he tries to identify the broken disk and there is absolutely NO LIGHT or other indicator at the disk cages. So we guessed the "right" disk - and, heya, pulled the wrong one. Happily, with RAID 6 you can lose two hard disks without a problem. So we re-inserted the disk and waited for the RAID to finish the rebuild... But sadly the RAID controller then started to break: right after inserting the disk, the controller lost nearly all disks, leaving an array with a lot of "spares". A reboot solved the problem - for about 10 minutes...
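For readers who have not run into this: a degraded md RAID and the hunt for the right disk typically look roughly like this. The device names below are examples only, not the actual layout of that server.

    # the software RAID 6 shows up as degraded in /proc/mdstat, e.g. [UUUU_UUUUU]
    cat /proc/mdstat
    # which component device dropped out of the array
    mdadm --detail /dev/md2
    # without failure LEDs, the disk serial number is one way to find the right bay
    smartctl -i /dev/sdf | grep -i serial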
So after 60 minutes of fighting against old hardware, we decided to go with another solution: using an old, dedicated FC storage. Luckily the old server came back successfully after inserting the extra FC card, and even the RAID controller at least allowed us to mount the degraded RAID in read-only mode to copy over the last bits and bytes.
After 3 hours of "mdadm /dev/md2 --add /dev/sdx1; mdadm --stop /dev/md2; mdadm --assemble --scan --force; mdadm ..." loops, we can report that the backend for the staging projects is back without any data loss...
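Spelled out, the recovery loop quoted above looks roughly like this - a sketch with example device names, not a literal transcript of the session:

    # re-add the replaced disk to the degraded array
    mdadm /dev/md2 --add /dev/sdx1
    # watch the rebuild progress
    cat /proc/mdstat
    # if the controller drops disks again: stop and force a re-assembly, then retry
    mdadm --stop /dev/md2
    mdadm --assemble --scan --force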
Today was a bit of a cleanup day - so we took the liberty of spending 1.5 hours manually checking all feeds aggregated on http://planet.opensuse.org/ and disabled 52 of them. The good news: there are still 327 active feeds left.
You hopefully already know that the source code of planet.opensuse.org is available on GitHub?
If not - and especially if you wonder exactly which RSS feeds are now disabled - please visit https://github.com/openSUSE/planet.opensuse.org/blob/master/planetsuse/feeds. This is the authoritative source for the parser. So if you find some mistakes (might happen - we are all human) or want to get your feed aggregated, please "fork us on GitHub" or send us patches.
The Event Calendar plugin on http://news.opensuse.org/ has been found to break the site. The sites were recently moved to new servers and a new database server running MySQL 5.5.x, whereas the old server only ran MySQL 5.1.x. After the move, the site was throwing an SQL syntax error, which turned out to be caused by the installed Event Calendar plugin.
This plugin has been deactivated and the site is now functioning properly. We will either need to find a replacement plugin or an update for the existing one.
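As a side note: if WP-CLI happens to be available on the new server, a plugin can also be toggled from the command line without touching the dashboard. The plugin slug below is only a guess, not necessarily the exact name used on news.opensuse.org.

    # list active plugins and deactivate the suspected one
    wp plugin list --status=active
    wp plugin deactivate event-calendar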
As Ticket #2302 explains, we are currently facing an issue with the scripts that re-create the hotstuff-XXXg modules on stage.opensuse.org and rsync.opensuse.org: the script currently produces only empty directories.
The developer of this script is currently on vacation, so we might need to wait until (at least) Monday, 2014-04-14, before this gets fixed.
As a result, we have stopped the rsync server on rsync.opensuse.org and stage.opensuse.org for now, to avoid people removing 30-640 GB of content if they run their rsync commands with the recommended "--delete --delete-after" options.
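To illustrate why this is dangerous: a typical mirror sync against one of the (currently empty) modules would look like the following, and with the delete options it would happily wipe the local copy. The XXX in the module name stands for the module size, and the target path is just an example.

    # syncing an empty source with --delete removes everything below the target path
    rsync -av --delete --delete-after rsync.opensuse.org::hotstuff-XXXg/ /srv/mirror/hotstuff/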
The database cluster behind http://download.opensuse.org/ has a new home: two machines with 24 CPU cores and 128 GB RAM.
But the cluster is not running directly on this hardware - instead we have two virtual machines (KVM) that currently use just 32 GB RAM and 10 CPUs each. This has some interesting benefits:
- Rebooting a virtual machine takes less than one minute. The old, bare-metal systems took up to 10 minutes for a reboot. After that, the slave database needs ~5-10 minutes to resync with the master. So in the past, a complete "kernel update round" took ~30 minutes for these two hosts alone.
- If we need to reboot the underlying systems (the hosts), all virtual guests can be live-migrated to the other host without any downtime (see the sketch after this list).
- We have enough free space for the migration of the web servers in front of the database (the ones every user reaches when visiting http://download.opensuse.org/ ) - so this will be the next step.
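For the curious, such a live migration with libvirt/KVM boils down to a single command. This is a minimal sketch; the guest name and target host below are made up, not our actual machine names.

    # move the running guest to the second host without interrupting it
    virsh migrate --live --persistent db-slave qemu+ssh://host2.example.org/system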
The new machine behind https://openqa.opensuse.org/ has 128 GB RAM and a 3.6 TB RAID system, allowing us to run all tests in tmpfs (RAM) and to store the test results for a very long time.
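Running the tests in RAM basically means the working pool lives on a tmpfs mount, roughly like this. Mount point and size are assumptions for illustration, not the exact openQA setup.

    # keep the test working directories in RAM
    mount -t tmpfs -o size=96G tmpfs /var/lib/openqa/pool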
As always, we used the chance to move the web frontend behind our HAProxy, which allows us to provide you immediately with a "service down" page and (hopefully more important) to scale out to many more backends once the number of users trying to see the latest results grows beyond what a single machine can handle.
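The idea looks roughly like the following haproxy.cfg fragment - a sketch with made-up addresses and paths, not our production configuration.

    frontend openqa_front
        bind :80
        default_backend openqa_workers

    backend openqa_workers
        # serve a static "service down" page if no backend is available
        errorfile 503 /etc/haproxy/errors/503-maintenance.http
        server openqa1 192.0.2.11:80 check
        # scaling out later is just a matter of adding more lines:
        # server openqa2 192.0.2.12:80 check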
The funny part with RAID 5 is: you lose one hard disk in size, but you can also only lose one hard disk in your RAID array. As always, Mr. Murphy knows about this fact and kills two hard disks at once...
So now, after replacing two of the six hard disks in the system, we are back online (after a complete re-install and re-sync, of course). As the PERC5i RAID controller in this 7-year-old machine only supports RAID levels 0, 1, 10 and 5, we run with RAID 5 again, but this time we also defined one of the disks as a hot spare - losing another TB for our data...
As you might notice by reading https://en.opensuse.org/Lifetime, openSUSE currently supports only 12.3 and 13.1. Yes, this means that:
- 11.4
- 12.1 and even
- 12.2
have already reached their end of life and are therefore not supported any more.
As admins are born to be lazy, we still had some old repositories (namely 11.4, 12.1 and 12.2) and their files on http://download.opensuse.org/ - which might have helped users with their old installations.
But as the openSUSE Build Service is gaining more and more popularity, the used disk space is becoming more and more of a problem. So today we removed the old ISOs and repositories of 11.4 and 12.1 from download.opensuse.org - and would like to point you to one of the nice mirror admins who still provide the outdated files.
The problem was fixed by
- promoting our hot-standby database to be the new master
- replacing the half broken SSD with a new one
- re-syncing from the current master (a rough sketch of the database steps follows below)
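Assuming a standard PostgreSQL streaming-replication setup, those two database steps roughly correspond to the commands below. Host names and the data directory are placeholders, not our real paths.

    # on the hot standby: promote it to be the new master
    pg_ctl -D /var/lib/pgsql/data promote
    # later, on the repaired machine: re-sync a fresh copy from the current master
    pg_basebackup -h new-master.example.org -D /var/lib/pgsql/data -X stream -P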
...and while we were at it, we took the chance to let PostgreSQL do some cleanup via the "vacuum" and "reindex" commands: the database shrank from ~30 GB to ~10 GB, but this took nearly 2 hours!
During the whole time, download.opensuse.org was still reachable - just a bit slow for about an hour during the vacuuming (sorry for that). Now everything is back to normal, but we will try to figure out how autovacuum can be tuned so that this database cleanup happens automatically.
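For reference, the manual cleanup and the kind of autovacuum knobs we are looking at are sketched below. The database name is a placeholder and the values are purely illustrative, not our final settings.

    # the manual cleanup, run via psql (database name is a placeholder)
    psql -d mirrordb -c 'VACUUM;'
    psql -d mirrordb -c 'REINDEX DATABASE mirrordb;'
    # autovacuum tuning candidates in postgresql.conf (illustrative values only):
    #   autovacuum_vacuum_scale_factor = 0.05
    #   autovacuum_naptime = 1min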