communication #151441
closed20231122 postmortem
0%
Description
From https://etherpad.opensuse.org/p/20231122-postmortem:
What/Problem: various openSUSE services were disfunctional (wiki, etherpad, VPN)
When: 2023-11-22 00:5x to 13:00 UTC
Why: We identified a number of contributing issues:
autoupdates run at night when nobody is around to fix breakages
https://bugzilla.opensuse.org/show_bug.cgi?id=1217403 virtlockd stopped VMs after auto-update
galera DB disintegrated when all its 3 VMs were restarted due to the virtlockd issue
- galera had a missing sst_user, causing cluster recovery to fail
- ulimit restrictions caused recovery debugging to fail
- existing data directories on consumer nodes caused recovery to fail
freeipa had no network after the machine restarted due to the virtlockd issue - fixed
- freeipa had no slapd running after reboot
without freeipa, no openVPN-access was possible
The VM host that provides remote-access without Heroes-openVPN did not have autostart enabled for VMs - fixed
OpenVPN gateway had missing ip-forward config applied from salt - fixed https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/1063
The network is tightly firewalled, so no SSH access from places other than the jump host or the VPN was possible - third fallback entrypoint still needs to be set up