action #162332
2024-06-15 osd not accessible size:M (closed)
Description
Observation¶
Disks are failing to mount, preventing OSD from booting.
Acceptance criteria¶
- AC1: OSD is fully reachable again
- AC2: OSD is able to at least reboot cleanly
Suggestions¶
- DONE File an SD ticket https://sd.suse.com/servicedesk/customer/portal/1/SD-159799
- Identify the root cause
- Identify and create follow-up tasks
- Ensure that OSD can at least reboot cleanly once
Out of scope¶
- HTTP code 502 alert -> #160877
Updated by okurz 6 months ago
- Copied from action #161309: osd not accessible, 502 Bad Gateway added
Updated by jbaier_cz 6 months ago
From the very little info we have, the symptoms indeed look very similar to the last problem in #161309. Only this time the salt pipelines look all right (at least there is a status after the salt run on osd). Could it be that this time the manual actions done as part of #162320#note-3 did something unexpected?
Updated by okurz 6 months ago
- Related to action #162320: multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry added
Updated by jbaier_cz 6 months ago · Edited
From the logs, the first problematic lines are:
Jun 16 00:35:10 openqa systemd[1]: Stopped target Local File Systems.
Jun 16 00:35:10 openqa systemd[1]: Stopping Local File Systems...
Jun 16 00:35:10 openqa systemd[1]: var-lib-openqa-archive.automount: Path /var/lib/openqa/archive is already a mount point, refusing start.
Jun 16 00:35:10 openqa systemd[1]: Failed to set up automount var-lib-openqa-archive.automount.
Jun 16 00:35:10 openqa systemd[1]: Dependency failed for Local File Systems.
Jun 16 00:35:10 openqa systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
...
Jun 16 00:35:26 openqa systemd[1]: Stopped target System Initialization.
Jun 16 00:35:26 openqa systemd[1]: Stopping Security Auditing Service...
Jun 16 00:35:26 openqa systemd[1]: Started Emergency Shell.
Jun 16 00:35:26 openqa systemd[1]: Reached target Emergency Mode.
Updated by okurz 6 months ago · Edited
- Status changed from Blocked to In Progress
The machine is back up with the help from gschlotter, thank you for that. OSD could not boot up cleanly by itself and had problems with filesystems, so we will need to check reboot stability during a timeframe when people able to recover are also available. We can continue from here as long as the machine is reachable and will also check if the system reboots fine. After I could log in over ssh I made other members of the tools team aware and we could join a root-owned screen session on OSD and continue the recovery.
Recovery¶
gschlotter had previously disabled the problematic partitions to allow the system to come up. As a side effect, openQA services started on an incomplete mount of /var/lib/openqa/share and created empty directories which got in the way of mounting, so we called
umount --lazy --force /var/lib/openqa
umount --lazy --force /var/lib/openqa/share
rm -rf /var/lib/openqa/{share,archive}
mount -a
systemctl start default.target
systemctl is-system-running
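To confirm that such a recovery actually stuck, a small verification sketch (an assumption on my side, not the literal commands from the session; the mount points are the ones from above):
findmnt /var/lib/openqa          # should show the real backing device again
findmnt /var/lib/openqa/share    # should be a mount point again, not a leftover empty directory
systemctl --failed               # no units should remain in failed state
systemctl is-system-running      # ideally reports "running" instead of "degraded"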
What happened initially¶
We found in the system journal that the system triggered filesystem checks on reboot, as expected, but at least some of them failed. Same as in the recovery of https://sd.suse.com/servicedesk/customer/portal/1/SD-158390, xfs_repair seems to have run into an OOM condition preventing further automatic boot of the system. Possibly this is a regression in xfs_repair, or maybe 32GB of RAM is not enough for the default behaviour of xfs_repair with multi-TB storage volumes. We have no clear indication what triggered the initial reboot which caused the problems; it could have been externally induced by the hypervisor.
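Not verified on OSD yet, but as a hedged pointer for the follow-up: xfs_repair can be told to cap its memory usage, which might be enough to avoid the OOM, e.g.
xfs_repair -m 8192 /dev/vdb    # limit xfs_repair to roughly 8 GiB of RAM; the device name is just an example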
Tasks¶
- Again the root password was not known by IT members so I set it explicitly and created https://gitlab.suse.de/openqa/password/-/merge_requests/15 -> #162353
- OSD was found unbootable, possibly due to filesystem problems on the big volumes like vdb,vdc,vdd,vde. We should treat those mount points as non-critical for boot but make critical services like openQA depend on them so that openQA only starts up with a consistent set of mount points, see the sketch after this list. If possible still trigger the automatic filesystem checks but make them non-critical for making the system reachable over network -> #162356
- / is still treated as "ext3". Consider moving to "ext4" -> #162359
- Check the filesystems from the running OS without rebooting to see if there are errors. If there are, gracefully and with prior announcement shut down the openQA services, fix the problems from the running OS, and only trigger reboots after ensuring the filesystems are clean -> #162362
- Research the xfs_repair OOM issues and try to come up with a way so that the system-triggered filesystem checks don't fail on OOM -> #162365
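A rough sketch for the mount point handling from #162356 (device name, filesystem type, unit name and drop-in path are assumptions, not the current OSD configuration): mark the big volumes as non-critical for boot in /etc/fstab and let the openQA services declare an explicit dependency on their mount points.
# /etc/fstab: "nofail" keeps a failing volume from blocking local-fs.target at boot (example line)
/dev/vdb  /var/lib/openqa  xfs  defaults,nofail  0 0
# systemd drop-in, e.g. /etc/systemd/system/openqa-webui.service.d/mounts.conf,
# so the service only starts once the mount points are actually there
[Unit]
RequiresMountsFor=/var/lib/openqa /var/lib/openqa/share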
Updated by szarate 6 months ago
What happened initially¶
We found in the system journal that the system triggered filesystem checks on reboot, as expected, but at least some of them failed. Same as in the recovery of https://sd.suse.com/servicedesk/customer/portal/1/SD-158390, xfs_repair seems to have run into an OOM condition preventing further automatic boot of the system. Possibly this is a regression in xfs_repair, or maybe 32GB of RAM is not enough for the default behaviour of xfs_repair with multi-TB storage volumes. We have no clear indication what triggered the initial reboot which caused the problems; it could have been externally induced by the hypervisor.
is there a bug report of this particular problem? (oom condition on xfs)
Updated by okurz 6 months ago
szarate wrote in #note-9:
is there a bug report of this particular problem? (oom condition on xfs)
Not aware of any so far. That's why I noted down the task that we should do:
Research the xfs_repair OOM issues and try to come up with a way so that the system-triggered filesystem checks don't fail on OOM
Updated by okurz 6 months ago
- Copied to action #162353: Ensure consistent known root password on all OSD webUI+workers size:S added
Updated by okurz 6 months ago
Always this NFS. I now ran on OSD
systemctl stop nfs-server && systemctl start nfs-server
to ask the kernel-internal NFS server to reset the kthreads handling the connections, and workers can list /var/lib/openqa/share again. At least the worker instance on sapworker1 is now "idle", not "offline". Also verified on openqa-piworker that this works again.
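For the record, a hedged sketch of what I mean by checking on the worker side (host and instance number are just examples):
ls /var/lib/openqa/share            # should return promptly again instead of hanging on the stale NFS mount
systemctl status openqa-worker@1    # unit active, instance shows up as "idle" instead of "offline" in the webUI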
Updated by okurz 6 months ago
- Copied to action #162380: 2024-06-15 osd not accessible - causing false alerts for other hosts size:S added
Updated by openqa_review 6 months ago
- Due date set to 2024-07-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 6 months ago
Rebooted OSD while Steven Mallindine from IT was standing ready. The machine is back but needed manual intervention, so we are not yet back to where we were some weeks ago. Thanks a lot to Steven for collaborating quickly with me over Google Meet, shared screen and GNU screen. The problem is that during boot automated filesystem checks still start for all XFS partitions in parallel, causing OOM conditions.
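Before the next reboot it might help to double-check where those parallel checks come from, along these lines (a sketch, assuming they are driven by the fs_passno column in /etc/fstab):
awk '$1 !~ /^#/ && $6 > 0 {print $1, $2, $6}' /etc/fstab    # mounts with a non-zero fsck pass number
systemctl list-units --all 'systemd-fsck@*'                  # fsck units systemd instantiated on this boot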
Updated by okurz 6 months ago
Again problems with machines accessing NFS after the reboot, and again
systemctl stop nfs-server; systemctl start nfs-server
fixed it. Likely due to the problems with the mount point, i.e. /var/lib/openqa/share wasn't available at boot time when nfs-server had already started. Retriggered the affected jobs with:
failed_since="2024-06-19 14:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo162332" openqa-advanced-retrigger-jobs
Updated by okurz 6 months ago
- Status changed from Workable to Resolved
In the meantime we could verify that automated reboots work after we disabled the automated filesystem checks, which prevents the OOM condition, and we did not run into that problem again. We can continue in the reported follow-up tickets for improvements.