Root cause analysis of the OBS downtime 2019-12-14

Sometimes a single cable can make a huge difference. Especially if it is plugged into the wrong port...
Added by lrupp over 1 year ago

Around 16:00 CET on 2019-12-14, one of the Open Build Service (OBS) virtualization servers (which run some of the backend machines) stopped operating. Reason: a power failure in one of the UPS systems. Unlike all the others, this single server had both of its power supplies connected to the same UPS, which resulted in a complete power loss, while all other servers stayed powered via their redundant power supply.

As a result, the communication between the API and those backend machines stopped. The API kept queuing incoming requests until it reached a state where it could not handle any more.

By moving the backends over to another virtualization server, the problem was temporarily worked around (from about 19:00 on), and the API started working through the backlog. The cabling on the affected server has since been fixed and the machine is online again. So we are sure that this specific problem will not happen again in the future.