action #162332: 2024-06-15 osd not accessible size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

action #162332

closed

2024-06-15 osd not accessible size:M

Added by okurz 9 months ago. Updated 8 months ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

openQA Project (public) - Ready

Start date:

Due date:

% Done:

Estimated time:

Description

Observation¶

Disks are failing to mount and preventing osd from booting.

Acceptance criteria¶

AC1: OSD is fully reachable again
AC2: OSD is able to at least clearly reboot

Suggestions¶

DONE File an SD ticket https://sd.suse.com/servicedesk/customer/portal/1/SD-159799
Identify the root cause
Identify and create follow-up tasks
Ensure that OSD can at least clearly reboot once

Out of scope¶

HTTP code 502 alert -> #160877

Related issues 4 (0 open — 4 closed)

Actions

Copy link

Updated by okurz 9 months ago

Copied from action #161309: osd not accessible, 502 Bad Gateway added

Actions

Copy link

Updated by okurz 9 months ago

Status changed from In Progress to Blocked
Priority changed from Urgent to High

https://sd.suse.com/servicedesk/customer/portal/1/SD-159799

Actions

Copy link

Updated by jbaier_cz 9 months ago

From the very little info we get, the symptoms indeed look very similar to the last problem in #161309. Only this time the salt pipelines looks all right (at least there is a status after salt run on osd). Could that be that this time the manual actions done as part of #162320#note-3 did something unexpected?

Actions

Copy link

Updated by okurz 9 months ago

Yes, I guess so. I applied a high state and AFAIR that went through. I did not trigger a reboot but as it was Sunday 00:30, maybe that's when an automatic reboot was triggered?

Actions

Copy link

Updated by okurz 9 months ago

Related to action #162320: multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry added

Actions

Copy link

Updated by jbaier_cz 9 months ago · Edited

From the logs, the first problematic lines are:

Jun 16 00:35:10 openqa systemd[1]: Stopped target Local File Systems.
Jun 16 00:35:10 openqa systemd[1]: Stopping Local File Systems...
Jun 16 00:35:10 openqa systemd[1]: var-lib-openqa-archive.automount: Path /var/lib/openqa/archive is already a mount point, refusing start.
Jun 16 00:35:10 openqa systemd[1]: Failed to set up automount var-lib-openqa-archive.automount.
Jun 16 00:35:10 openqa systemd[1]: Dependency failed for Local File Systems.
Jun 16 00:35:10 openqa systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
...
Jun 16 00:35:26 openqa systemd[1]: Stopped target System Initialization.
Jun 16 00:35:26 openqa systemd[1]: Stopping Security Auditing Service...
Jun 16 00:35:26 openqa systemd[1]: Started Emergency Shell.
Jun 16 00:35:26 openqa systemd[1]: Reached target Emergency Mode.

Actions

Copy link

Updated by okurz 9 months ago · Edited

Status changed from Blocked to In Progress

machine is back up with the help from gschlotter. OSD could not boot up cleanly by itself and had problems with filesystems. We will need to check reboot stability during a timeframe when people able to recover are also available.

machine is back up with the help from gschlotter. Thank you for that. We can continue from here as long as the machine is reachable. We will also check if the system reboots fine. After I could login over ssh I made other members of the tools team aware and we could join a root owned screen session on OSD and continue the recovery.

Recovery¶

gschlotter had previously disabled problematic partitions to allow the system to come up. But what this caused is that openQA services would start on an incomplete mount of /var/lib/openqa/share and create empty directories which are getting in the way of mounting so we called

umount --lazy --force /var/lib/openqa
umount --lazy --force /var/lib/openqa/share
rm -rf /var/lib/openqa/{share,archive}
mount -a
systemctl start default.target
systemctl is-system-running

What happened initially¶

We found in the system journal that the system triggered filesystem checks on reboot, as expected, but those failed for at least some of them. Same as in the recovery of https://sd.suse.com/servicedesk/customer/portal/1/SD-158390 XFS repair seems to have run into an OOM condition preventing further automatic boot of the system. Possibly this is a regression of xfs_repair or maybe 32GB of RAM is not big enough for the default behaviour of xfs_repair with multi-TB storage volumes. We have no clear indication what triggered the initial reboot which caused the problems. Could be externally induced by the hypervisor.

Tasks¶

Again the root password was not known by IT members so I set it explicitly and created https://gitlab.suse.de/openqa/password/-/merge_requests/15 -> #162353
OSD was encountered as unbootable, possibly due to filesystem problems in the big volumes like vdb,vdc,vdd,vde . We should treat those mount points as non-critical for boot but make critical services like openQA depend on them so that openQA would only start up with a consistent set of mount points. If possible still trigger the automatic filesystem checks but make them non-critical for making the system reachable over network -> #162356
/ is still treated as "ext3". Consider moving to "ext4" -> #162359
Check the filesystems from the running OS without rebooting to check if there are errors. If there are then gracefully and announced shut down the openQA services and fix the problems from the running OS and only after ensuring cleanness trigger reboots -> #162362
Research for the xfs OOM issues and try to come up with a way so that the system triggered filesystem checks don't fail on OOM -> #162365

Actions

Copy link

Updated by okurz 9 months ago

Parent task set to #162350

Actions

Copy link

Updated by szarate 9 months ago

What happened initially¶

We found in the system journal that the system triggered filesystem checks on reboot, as expected, but those failed for at least some of them. Same as in the recovery of https://sd.suse.com/servicedesk/customer/portal/1/SD-158390 XFS repair seems to have run into an OOM condition preventing further automatic boot of the system. Possibly this is a regression of xfs_repair or maybe 32GB of RAM is not big enough for the default behaviour of xfs_repair with multi-TB storage volumes. We have no clear indication what triggered the initial reboot which caused the problems. Could be externally induced by the hypervisor.

is there a bug report of this particular problem? (oom condition on xfs)

Actions

Copy link

#10

Updated by okurz 9 months ago

szarate wrote in #note-9:

What happened initially¶

We found in the system journal that the system triggered filesystem checks on reboot, as expected, but those failed for at least some of them. Same as in the recovery of https://sd.suse.com/servicedesk/customer/portal/1/SD-158390 XFS repair seems to have run into an OOM condition preventing further automatic boot of the system. Possibly this is a regression of xfs_repair or maybe 32GB of RAM is not big enough for the default behaviour of xfs_repair with multi-TB storage volumes. We have no clear indication what triggered the initial reboot which caused the problems. Could be externally induced by the hypervisor.

is there a bug report of this particular problem? (oom condition on xfs)

Not aware so far. That's why I noted down a task that we should do

Research for the xfs OOM issues and try to come up with a way so that the system triggered filesystem checks don't fail on OOM

Actions

Copy link

#11

Updated by okurz 9 months ago

As various hosts had problems with the mount points from OSD being not available for so long I called

salt -C 'G@roles:worker' cmd.run 'systemctl --failed | grep -q "0.*units listed" || reboot'

which rebooted some machines.

Actions

Copy link

#12

Updated by okurz 9 months ago

Copied to action #162353: Ensure consistent known root password on all OSD webUI+workers size:S added

Actions

Copy link

#13

Updated by okurz 9 months ago

Created new dedicated tickets for identified tasks

Actions

Copy link

#14

Updated by dheidler 9 months ago

The NFS export of openqa.suse.de:/var/lib/openqa/share doesn't seem to work - the piworker hangs when trying to access it.

Actions

Copy link

#15

Updated by okurz 9 months ago

Always this NFS. I now did on OSD systemctl stop nfs-server && systemctl start nfs-server to ask the kernel internal NFS server to reset the kthreads handling the connections or something and workers can list /var/lib/openqa/share again. And at least worker instance on sapworker1 is now "idle", not "offline". Also verified on openqa-piworker that this works again.

Actions

Copy link

#16

Updated by okurz 9 months ago

Copied to action #162380: 2024-06-15 osd not accessible - causing false alerts for other hosts size:S added

Actions

Copy link

#17

Updated by openqa_review 9 months ago

Due date set to 2024-07-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#18

Updated by okurz 9 months ago

Subject changed from 2024-06-15 osd not accessible to 2024-06-15 osd not accessible size:M
Description updated (diff)

Actions

Copy link

#19

Updated by okurz 9 months ago

Rebooted OSD while Steven Mallindine from IT was standing ready. back but needed manual intervention so we are not back to where we were some weeks ago. Thanks a lot to Steven to collaborate quickly with me over google meet, shared screen and GNU screen. The problem is that still during boot automated filesystem checks start for XFS partitions and all in parallel causing OOM conditions.

Actions

Copy link

#20

Updated by okurz 9 months ago

Again problems with machines accessing NFS after the reboot. And again systemctl stop nfs-server; systemctl start nfs-server fixed it. Likely due to the problems with the mount point. Likely because /var/lib/openqa/share wasn't available at boot time when nfs-server already started?

failed_since="2024-06-19 14:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo162332" openqa-advanced-retrigger-jobs

Actions

Copy link

#21

Updated by okurz 9 months ago

Due date deleted (~~2024-07-02~~)
Status changed from In Progress to Blocked

#162362

Actions

Copy link

#22

Updated by livdywan 8 months ago

Status changed from Blocked to Workable

okurz wrote in #note-21:

#162362

The blocker was resolved.

Actions

Copy link

#23

Updated by livdywan 8 months ago

okurz wrote in #note-2:

https://sd.suse.com/servicedesk/customer/portal/1/SD-159799

The SD ticket was resolved 🎉

Actions

Copy link

#24

Updated by okurz 8 months ago

Status changed from Workable to Resolved

In the meantime we could verify that automated reboots work after we disabled the automated file checks preventing the OOM condition. And we did not have that problem per se. We can continue in the reported follow-up tickets for improvements.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #162332

2024-06-15 osd not accessible size:M

Observation¶

Acceptance criteria¶

Suggestions¶

Out of scope¶

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by jbaier_cz 9 months ago

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by jbaier_cz 9 months ago · Edited

Updated by okurz 9 months ago · Edited

Recovery¶

What happened initially¶

Tasks¶

Updated by okurz 9 months ago

Updated by szarate 9 months ago

What happened initially¶

Updated by okurz 9 months ago

What happened initially¶

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by dheidler 9 months ago

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by openqa_review 9 months ago

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by okurz 9 months ago

Updated by livdywan 8 months ago

Updated by livdywan 8 months ago

Updated by okurz 8 months ago