In the old setup, only the MF IT team could make DNS changes for us in their DNS appliance, so every change had to go through a ticket.
Now we have a FreeIPA instance for the openSUSE cluster to manage the DNS zone. We use FreeIPA for more than just DNS, but that is a topic for another article.
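With FreeIPA in place, a DNS change is now a single command instead of a ticket. A minimal sketch using FreeIPA's standard `ipa dnsrecord-*` commands (the zone, host name, and address below are invented for illustration):

```shell
# Authenticate against FreeIPA first (a Kerberos ticket is required)
kinit admin

# Add an A record to a zone managed by FreeIPA
# (zone, host name, and address are examples only)
ipa dnsrecord-add example.opensuse.org newhost --a-rec=192.0.2.10

# List the records of the zone to verify the change
ipa dnsrecord-find example.opensuse.org
```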
After we fixed all the technical problems 3 weeks ago (you can read about it here), we finally got the approval for the change from upper management.
Their change control team has now agreed as well, and we finally completed the change.
So let's all welcome home the opensuse.org zone!
The new status page of the openSUSE infrastructure team provides updates on how the systems of the openSUSE community are doing. If there are interruptions to service, we will post a note here.
The idea behind the new status page is to inform our users via a central point about outages or service interruptions, so there should be no need to check other resources when there is an outage or problem. By using the open source status page system Cachet, we benefit from the work of another great community, while trying to contribute back through marketing and by pushing our changes upstream.
Cachet offers, among others, the following wonderful features:
* Email subscriptions: users can subscribe with their email address to be informed about incidents by personal email
* RSS and Atom feeds: just add the feed links (provided at the bottom of the page) to your favorite newsreader to receive incident updates
* Nice overview of the provided services: as the openSUSE community provides so many services, the status page might be a good starting point for everyone to get an overview
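Beyond the web interface, Cachet also exposes a small JSON API, so the component status can be consumed by scripts as well. A hedged sketch - the hostname is a placeholder for wherever the status page ends up living, while `/api/v1/components` is Cachet's documented endpoint:

```shell
# Query the list of components and their current status from a Cachet
# instance (replace status.example.org with the real status page host)
curl -s https://status.example.org/api/v1/components | python3 -m json.tool
```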
At the moment, the page is not fully operational: we might change some settings and/or add more features. But it should already be good enough to convey the idea behind it.
As always, if you are experiencing any issues with the openSUSE infrastructure, don't hesitate to get in touch with us at firstname.lastname@example.org or via irc.opensuse.org/#opensuse-admin and we'll get back to you as soon as we can.
What a start to the new year: the server running rsync.opensuse.org died with two broken hard disks on 2016-01-10.
As the hardware is located in the data center of our sponsor IP Exchange, we apologize for the delay it will take to fix the problem: we need not only the correct replacement hard drives, but also a field worker at the location who has the appropriate permissions and skills.
During the downtime (and perhaps as a good tip for afterward, too), please check http://mirrors.opensuse.org/ for the mirror closest to your location that also offers rsync.
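Once you have picked a mirror, switching to it is a one-line change in your rsync invocation. A sketch with an invented mirror hostname and module path - check the mirror's own listing for the real names:

```shell
# List the rsync modules a mirror offers
rsync rsync://mirror.example.org/

# Sync a distribution tree from that mirror
# (module and path are examples; consult the mirror for the actual names)
rsync -avP rsync://mirror.example.org/opensuse/distribution/ ./opensuse-distribution/
```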
All backend servers now run on one of three KVM virtualization hosts, each host providing:
* 48 Cores
* 512 GB RAM
* 20 Gb Ethernet (incl. FCoE)
The eight virtual servers running on this hardware make good use of the resources, while we still retain the ability to run everything on just two of the virtualization hosts for continuous operation during servicing. We hope this new setup improves the availability of the openSUSE Build Service and reduces the overall downtime for you.
At the moment, we are trying to fix the last small issues (like long live migration times or synchronization of the configuration between the machines).
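The live migrations mentioned above are plain libvirt operations. A sketch of how a guest can be moved between two of the hosts while it keeps running - host and guest names are invented for the example, and shared storage between the hosts is assumed:

```shell
# Live-migrate the guest "backend1" from the current host to virt2,
# keeping it running during the move (shared storage assumed)
virsh migrate --live --persistent backend1 qemu+ssh://virt2.example.org/system

# Verify where the guest is running afterwards
virsh -c qemu+ssh://virt2.example.org/system list
```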
The migration of the openSUSE Mailing Lists has been finished successfully. If you encounter any issues, please let us know by mail on admin at opensuse dot org.
On Tuesday, 2015-06-09, from 09:00 to 11:00 UTC, the machine that hosts the openSUSE Mailing Lists will be offline. During that time, sending or receiving mails to the openSUSE mailing lists, or viewing their archives, will not be possible. All mails sent during the downtime will be delayed.
The reason is that the old machine runs an old distribution and is running out of resources. We will migrate the service to a new virtual machine that will be integrated into a new configuration management infrastructure.
We’ll send a followup announcement with the final status as soon as we finish the migration.
It's that time of the year again. The monitoring was prodding us with "Your certificate will expire soon". While fiddling with the tools to create the new CSR, we wondered: "Can we go 4K?". 4K is hip right now. 4K video. 4K TVs. So why not a 4K certificate? A quick check with the security team about the extra CPU load, a look at our monitoring, and the answer was:
"Yes we can"
So with this refresh, the certificate for all SSL-enabled services in Nuremberg is 4096 bits long. The setup has been running with the new certificate for a few days already, and so far we have not noticed any problems. The next stop will be upgrading the SSL endpoints to a newer distribution so that we get TLS 1.2 and our A grade back. Stay tuned for more news on this front.
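For reference, generating such a 4096-bit key and CSR with OpenSSL is a one-liner (the CN below is only an example, not one of our real hostnames):

```shell
# Generate a new 4096-bit RSA key and a matching CSR in one step,
# without a passphrase on the key (-nodes) and without prompts (-subj)
openssl req -new -newkey rsa:4096 -nodes \
    -keyout server.key -out server.csr \
    -subj "/CN=www.example.org"

# Inspect the CSR to confirm the key size
openssl req -in server.csr -noout -text | grep "Public-Key"
```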
Thanks to our monitoring, we became aware of a nearly full source server partition for build.opensuse.org in time. (The submitted sources inside the openSUSE Build Service currently amount to more than 7.4 TB, and that is after deduplication!)
So we started moving to a new storage array using the "pvmove" command, but for reasons currently unknown, it stopped in the middle of the transfer :-(
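For context, pvmove relocates allocated extents from one physical volume to another while the logical volumes on top stay online. A sketch of such a migration with invented device and volume group names:

```shell
# Add the new storage array as a physical volume and extend the VG
# (device and VG names are examples only)
pvcreate /dev/mapper/new_array
vgextend vg_sources /dev/mapper/new_array

# Move all extents off the old device onto the new one, online
pvmove /dev/mapper/old_array /dev/mapper/new_array

# Once the old device is empty, drop it from the volume group
vgreduce vg_sources /dev/mapper/old_array
```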
As a result of the attempts to stop the running process, the database (also used for software.opensuse.org) and some files got damaged. So we spent most of the weekend restoring files from the last backup and finishing the migration.
Now everything has been moved to the new storage and is up and running again. But this one reminded us that there is never a need for a backup - just for a restore... ;-)
You may know that openSUSE also provides many parts of its infrastructure via IPv6. Something that started as a "proof of concept" in 2011 has turned into a reliable and problem-free service since then. But in the beginning we got a "temporary" IPv6 range that now needs to be used elsewhere: so it's time to move to a "final" IPv6 range that should last for the coming years.
Our ISP has already provided us with the new range of IPv6 addresses, and we will start next week (week 20 of 2014) to:
1. add the new addresses to the running hosts
2. change the DNS entries to point to the new addresses
3. run the old and new IPv6 addresses in parallel for a few days
4. remove the old addresses
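On each host, the steps above boil down to standard iproute2 operations plus a DNS change. A sketch using addresses from the 2001:db8::/32 documentation prefix rather than our real ranges, and an interface name chosen for illustration:

```shell
# Step 1: add the new IPv6 address alongside the old one
ip -6 addr add 2001:db8:1::10/64 dev eth0

# Step 2: change the AAAA record in DNS to the new address
# (done through whatever interface manages the zone, so no command shown)

# Step 3: both addresses answer in parallel for a few days; verify with
ping -6 -c 3 2001:db8:1::10

# Step 4: finally remove the old address
ip -6 addr del 2001:db8:2::10/64 dev eth0
```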
For end users, this switch should be "invisible" - but we will of course run some tests beforehand and listen carefully on irc.freenode.net#opensuse-admin and email@example.com in case someone encounters any problems.
You might have seen the announcement on news.opensuse.org already: one of the main storage systems had some problems last week, and we are still suffering from the effects (one virtual array is still missing).
But there was also another, smaller issue: the internal storage on one of the backend servers reported problems as well. The machine is a bit older and uses an internal RAID controller to provide ten 1 TB disks to the running SLES system. As the RAID controller is also "very old" (~6-7 years), each of the 1 TB disks is exported as a "single RAID" - the controller is simply not able to handle an array larger than 1 TB. On top of that, a software RAID 6 runs over all ten disks.

Now our monitoring notified us that the RAID was degraded: one of the ten disks had died (a naughty little beggar who claims "btrfs is the reason" ;-). So far, so good. But guess how frustrated an admin can be when he tries to identify the broken disk and there is absolutely NO LIGHT or other indicator at the disk cages? So we guessed the "right" disk - and, heya, chose the wrong one. Happily, with RAID 6 you can lose two hard disks without a problem. So we re-inserted the disk and waited for the RAID to finish its rebuild.

But sadly the RAID controller then started to break: right after inserting the disk, the controller lost nearly all disks, resulting in an array with a lot of "spares". A reboot solved the problem - for about 10 minutes...
So after 60 minutes of fighting against old hardware, we decided to go with another solution: using an old, dedicated FC storage. Luckily, the old server came back successfully after inserting the extra FC card, and the RAID controller at least allowed us to mount the degraded RAID in read-only mode to copy over the last bits and bytes.
After 3 hours of "mdadm /dev/md2 --add /dev/sdx1; mdadm --stop /dev/md2; mdadm --assemble --scan --force; mdadm ..." loops, we can report that the backend for the staging projects is back without any data loss...
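Untangled, that recovery loop corresponds to the following mdadm steps (device names as in the quote; /dev/sdx1 stands in for the re-added disk):

```shell
# Re-add the replaced disk to the degraded RAID 6 array
mdadm /dev/md2 --add /dev/sdx1

# If the controller drops the disks again, stop the array...
mdadm --stop /dev/md2

# ...and force-reassemble it from the member disks that are still visible
mdadm --assemble --scan --force

# Watch the rebuild progress
cat /proc/mdstat
```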