action #162332 (closed)

2024-06-15 osd not accessible size:M

Added by okurz 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

Disks are failing to mount, preventing OSD from booting.

Acceptance criteria

  • AC1: OSD is fully reachable again
  • AC2: OSD is able to at least cleanly reboot

Suggestions

Out of scope


Related issues (4: 1 open, 3 closed)

  • Related to openQA Project - action #162320: multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry (Resolved, okurz, 2024-06-15)
  • Copied from openQA Infrastructure - action #161309: osd not accessible, 502 Bad Gateway (Resolved, jbaier_cz, 2024-05-31)
  • Copied to openQA Infrastructure - action #162353: Ensure consistent known root password on all OSD webUI+workers size:S (Resolved, nicksinger, 2024-06-17)
  • Copied to openQA Infrastructure - action #162380: 2024-06-15 osd not accessible - causing false alerts for other hosts size:S (Workable, 2024-06-17)

Actions #1

Updated by okurz 5 months ago

  • Copied from action #161309: osd not accessible, 502 Bad Gateway added
Actions #2

Updated by okurz 5 months ago

  • Status changed from In Progress to Blocked
  • Priority changed from Urgent to High
Actions #3

Updated by jbaier_cz 5 months ago

From the very little info we have, the symptoms indeed look very similar to the last problem in #161309. Only this time the salt pipelines look all right (at least there is a status after the salt run on osd). Could it be that this time the manual actions done as part of #162320#note-3 did something unexpected?

Actions #4

Updated by okurz 5 months ago

Yes, I guess so. I applied a high state and AFAIR that went through. I did not trigger a reboot but as it was Sunday 00:30, maybe that's when an automatic reboot was triggered?

Actions #5

Updated by okurz 5 months ago

  • Related to action #162320: multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry added
Actions #6

Updated by jbaier_cz 5 months ago · Edited

From the logs, the first problematic lines are:

Jun 16 00:35:10 openqa systemd[1]: Stopped target Local File Systems.
Jun 16 00:35:10 openqa systemd[1]: Stopping Local File Systems...
Jun 16 00:35:10 openqa systemd[1]: var-lib-openqa-archive.automount: Path /var/lib/openqa/archive is already a mount point, refusing start.
Jun 16 00:35:10 openqa systemd[1]: Failed to set up automount var-lib-openqa-archive.automount.
Jun 16 00:35:10 openqa systemd[1]: Dependency failed for Local File Systems.
Jun 16 00:35:10 openqa systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
...
Jun 16 00:35:26 openqa systemd[1]: Stopped target System Initialization.
Jun 16 00:35:26 openqa systemd[1]: Stopping Security Auditing Service...
Jun 16 00:35:26 openqa systemd[1]: Started Emergency Shell.
Jun 16 00:35:26 openqa systemd[1]: Reached target Emergency Mode.

Actions #7

Updated by okurz 5 months ago · Edited

  • Status changed from Blocked to In Progress

The machine is back up with help from gschlotter, thank you for that. OSD could not boot up cleanly by itself and had problems with its filesystems, so we will need to check reboot stability during a timeframe when people able to recover it are also available.

We can continue from here as long as the machine is reachable and will also check if the system reboots fine. After I could log in over ssh I made other members of the tools team aware and we could join a root-owned screen session on OSD to continue the recovery.

Recovery

gschlotter had previously disabled the problematic partitions to allow the system to come up. This caused openQA services to start on an incomplete mount of /var/lib/openqa/share and create empty directories which got in the way of mounting, so we called:

# detach the stale, incomplete mounts even while they are still in use
umount --lazy --force /var/lib/openqa
umount --lazy --force /var/lib/openqa/share
# remove the empty directories the openQA services created in place of the mounts
rm -rf /var/lib/openqa/{share,archive}
# remount everything from /etc/fstab and return to the normal system target
mount -a
systemctl start default.target
# check the overall system state
systemctl is-system-running

What happened initially

We found in the system journal that the system triggered filesystem checks on reboot, as expected, but those checks failed for at least some of the filesystems. Same as in the recovery of https://sd.suse.com/servicedesk/customer/portal/1/SD-158390, xfs_repair seems to have run into an OOM condition, preventing further automatic boot of the system. Possibly this is a regression of xfs_repair, or maybe 32GB of RAM is not enough for the default behaviour of xfs_repair on multi-TB storage volumes. We have no clear indication what triggered the initial reboot which caused the problems; it could have been externally induced by the hypervisor.
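For the manual repair itself one mitigation could be to cap the memory xfs_repair may use; the device name below is only a placeholder for whichever volume is affected:

# report-only pass first, this does not modify the filesystem
xfs_repair -n /dev/vdd
# limit xfs_repair to roughly 8 GiB (-m takes megabytes) so that repairing a
# multi-TB volume cannot exhaust the 32 GB of RAM of the host
xfs_repair -m 8192 /dev/vdd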

Tasks

  1. Again the root password was not known by IT members, so I set it explicitly and created https://gitlab.suse.de/openqa/password/-/merge_requests/15 -> #162353
  2. OSD was found unbootable, possibly due to filesystem problems on the big volumes like vdb, vdc, vdd, vde. We should treat those mount points as non-critical for boot but make critical services like openQA depend on them, so that openQA only starts up with a consistent set of mount points. If possible, still trigger the automatic filesystem checks, but make them non-critical for making the system reachable over the network (see the sketch after this list) -> #162356
  3. / is still treated as "ext3". Consider moving to "ext4" -> #162359
  4. Check the filesystems from the running OS without rebooting to see if there are errors. If there are, gracefully shut down the openQA services with prior announcement, fix the problems from the running OS, and only trigger reboots after ensuring the filesystems are clean -> #162362
  5. Research the xfs_repair OOM issues and try to come up with a way so that the system-triggered filesystem checks don't fail on OOM -> #162365
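A minimal sketch of what task 2 could look like, assuming the archive volume is /dev/vdd and the webUI unit is openqa-webui.service (device, filesystem layout and unit names are assumptions, the real setup on OSD may differ):

# /etc/fstab: "nofail" makes the mount non-critical for reaching local-fs.target,
# so a broken data volume no longer drops the host into emergency mode
/dev/vdd  /var/lib/openqa/archive  xfs  defaults,nofail  0  0

# /etc/systemd/system/openqa-webui.service.d/mounts.conf
# (created via "systemctl edit openqa-webui.service"):
# only start openQA once its data mount points are actually mounted
[Unit]
RequiresMountsFor=/var/lib/openqa/share /var/lib/openqa/archive

After editing, systemctl daemon-reload applies the drop-in.
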
Actions #8

Updated by okurz 5 months ago

  • Parent task set to #162350
Actions #9

Updated by szarate 5 months ago

What happened initially

We found in the system journal that the system triggered filesystem checks on reboot, as expected, but those failed for at least some of them. Same as in the recovery of https://sd.suse.com/servicedesk/customer/portal/1/SD-158390 XFS repair seems to have run into an OOM condition preventing further automatic boot of the system. Possibly this is a regression of xfs_repair or maybe 32GB of RAM is not big enough for the default behaviour of xfs_repair with multi-TB storage volumes. We have no clear indication what triggered the initial reboot which caused the problems. Could be externally induced by the hypervisor.

is there a bug report of this particular problem? (oom condition on xfs)

Actions #10

Updated by okurz 5 months ago

szarate wrote in #note-9:

What happened initially

We found in the system journal that the system triggered filesystem checks on reboot, as expected, but those failed for at least some of them. Same as in the recovery of https://sd.suse.com/servicedesk/customer/portal/1/SD-158390 XFS repair seems to have run into an OOM condition preventing further automatic boot of the system. Possibly this is a regression of xfs_repair or maybe 32GB of RAM is not big enough for the default behaviour of xfs_repair with multi-TB storage volumes. We have no clear indication what triggered the initial reboot which caused the problems. Could be externally induced by the hypervisor.

is there a bug report of this particular problem? (oom condition on xfs)

Not aware of any so far. That's why I noted down a task that we should do:

Research for the xfs OOM issues and try to come up with a way so that the system triggered filesystem checks don't fail on OOM

Actions #11

Updated by okurz 5 months ago

As various hosts had problems with the mount points from OSD being unavailable for so long, I called:

salt -C 'G@roles:worker' cmd.run 'systemctl --failed | grep -q "0.*units listed" || reboot'

which rebooted some machines.
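For reference, the same targeting can also be used read-only first to see which workers actually have failed units:

salt -C 'G@roles:worker' cmd.run 'systemctl --failed --no-legend'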

Actions #12

Updated by okurz 5 months ago

  • Copied to action #162353: Ensure consistent known root password on all OSD webUI+workers size:S added
Actions #13

Updated by okurz 5 months ago

Created new dedicated tickets for identified tasks

Actions #14

Updated by dheidler 5 months ago

The NFS export of openqa.suse.de:/var/lib/openqa/share doesn't seem to work - the piworker hangs when trying to access it.

Actions #15

Updated by okurz 5 months ago

Always this NFS. On OSD I now ran systemctl stop nfs-server && systemctl start nfs-server to make the in-kernel NFS server reset the kthreads handling the connections (or something to that effect), and workers can list /var/lib/openqa/share again. At least the worker instance on sapworker1 is now "idle", not "offline". Also verified on openqa-piworker that this works again.
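For future occurrences, a quick way to confirm the export is served again (assuming the standard NFS client utilities are available on the workers):

# on OSD: show what the NFS server currently exports
exportfs -v
# on a worker: query the exports offered by OSD and check that the share is browsable
showmount -e openqa.suse.de
ls /var/lib/openqa/share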

Actions #16

Updated by okurz 5 months ago

  • Copied to action #162380: 2024-06-15 osd not accessible - causing false alerts for other hosts size:S added
Actions #17

Updated by openqa_review 5 months ago

  • Due date set to 2024-07-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #18

Updated by okurz 5 months ago

  • Subject changed from 2024-06-15 osd not accessible to 2024-06-15 osd not accessible size:M
  • Description updated (diff)
Actions #19

Updated by okurz 5 months ago

Rebooted OSD while Steven Mallindine from IT was standing by. The machine is back but needed manual intervention, so we are not yet back to where we were some weeks ago. Thanks a lot to Steven for collaborating quickly with me over Google Meet, a shared screen and GNU screen. The problem is that during boot automated filesystem checks still start for all XFS partitions in parallel, causing OOM conditions.
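One way to take these checks out of the boot path, sketched here under the assumption that the XFS data volumes are plain /etc/fstab entries (device and mount point are placeholders):

# a fs_passno of 0 (last fstab column) tells systemd-fsck not to check this
# filesystem during boot
/dev/vdd  /var/lib/openqa/archive  xfs  defaults,nofail  0  0
# as a one-off emergency measure all boot-time checks can also be skipped via
# the kernel command line: fsck.mode=skip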

Actions #20

Updated by okurz 5 months ago

Again problems with machines accessing NFS after the reboot, and again systemctl stop nfs-server; systemctl start nfs-server fixed it. Likely due to the problems with the mount point, perhaps because /var/lib/openqa/share wasn't available yet at boot time when nfs-server already started? A possible ordering fix is sketched below the retrigger command.

failed_since="2024-06-19 14:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo162332" openqa-advanced-retrigger-jobs
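If the suspicion about nfs-server starting before /var/lib/openqa/share was mounted is right, a drop-in could make that ordering explicit; a sketch, assuming the share is a local mount known to systemd:

# /etc/systemd/system/nfs-server.service.d/override.conf
# (created via "systemctl edit nfs-server.service")
[Unit]
RequiresMountsFor=/var/lib/openqa/share
# afterwards reload the unit definitions:
systemctl daemon-reload
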
Actions #21

Updated by okurz 5 months ago

  • Due date deleted (2024-07-02)
  • Status changed from In Progress to Blocked
Actions #22

Updated by livdywan 5 months ago

  • Status changed from Blocked to Workable

okurz wrote in #note-21:

#162362

The blocker was resolved.

Actions #23

Updated by livdywan 5 months ago

okurz wrote in #note-2:

https://sd.suse.com/servicedesk/customer/portal/1/SD-159799

The SD ticket was resolved 🎉

Actions #24

Updated by okurz 5 months ago

  • Status changed from Workable to Resolved

In the meantime we could verify that automated reboots work after we disabled the automated filesystem checks that were leading to the OOM condition, and we did not run into that problem again. We can continue in the reported follow-up tickets for improvements.
