action #163097
Share mount not working on openqaworker-arm-1 and other workers size:M
Status: closed
Description
Observation
Failed systemd services (osd):
2024-07-02 07:15:00 openqaworker-arm-1 var-lib-openqa-share.automount 1
This is happening more often since 2024-06-14 07:49:00, see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1716816941347&to=1720002135377.
Acceptance criteria
- AC1: var-lib-openqa-share.automount is consistently not causing alerts
- AC2: /var/lib/openqa/share NFS mount on workers is consistently working
Suggestions
- ssh seems fine
- ping seems fine
- Investigate what is or was actually failing here
- Three points that you could follow, independent of each other:
  - Implement a custom systemd restart unit and a custom systemd check+monitoring unit, and blocklist the .automount units
  - Research how a systemd automount unit, which is not a service, could be restarted on failure: for this there is the open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590
  - Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As we haven't found anything via web search, create a bug on bugzilla.suse.com plus ask experts in the SUSE-internal chat as well as the external upstream chat
Rollback steps
- Remove the silence "alertname=Failed systemd services alert (except openqa.suse.de)" from https://monitor.qa.suse.de/alerting/silences
Updated by okurz 5 months ago
- Related to action #162590: NFS mounts are stuck on OSD workers if partitions on OSD fail to come up properly on boot size:S added
Updated by mkittler 5 months ago
- Subject changed from Share mount not working on openqaworker-arm-1 to Share mount not working on openqaworker-arm-1 and other workers
The journal looks like this:
sudo journalctl -fu var-lib-openqa-share.automount
…
-- Boot 2e0ffe940b3e4639ae27151d93e4f9ef --
Jul 02 02:41:33 openqaworker-arm-1 systemd[1]: Set up automount var-lib-openqa-share.automount.
Jul 02 02:53:32 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 03:01:08 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Failed with result 'unmounted'.
Judging by the logs this is a recurring problem. Unless someone restarted the unit manually, the problem fixed itself at 07:16:00 on Jul 02 (which is not visible in the journal, but on Grafana).
The only relevant log lines in dmesg are:
[ 783.324257] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 783.532483] Key type dns_resolver registered
[ 783.958133] NFS: Registering the id_resolver key type
When looking at the last 30 days in Grafana it becomes very apparent that the var-lib-openqa-share.automount unit failed on various hosts on various occasions. So this is really not a new problem and not specific to arm-1.
Considering there apparently is a retry going on and the problem always fixes itself (or was manual intervention ever required?) I suggest we exclude this unit from the systemd services alert.
Updated by mkittler 5 months ago
This is how an exclusion could be done (not tested yet): https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1218
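For illustration, such an exclusion at the telegraf level could look roughly like the following. This is only a hand-written sketch assuming the failed-services data comes from telegraf's systemd_units input; the actual MR renders the config via Salt/jinja, so the real change may look different:
# hypothetical telegraf config excerpt: drop metrics for the automount
# unit so it no longer feeds the failed systemd services alert
[[inputs.systemd_units]]
  [inputs.systemd_units.tagdrop]
    name = ["var-lib-openqa-share.automount"]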
Updated by okurz 5 months ago
- Related to action #131309: [alert] NFS mount can fail due to hostname resolution error size:M added
Updated by okurz 5 months ago
mkittler wrote in #note-3:
The journal looks like this:
sudo journalctl -fu var-lib-openqa-share.automount
…
-- Boot 2e0ffe940b3e4639ae27151d93e4f9ef --
Jul 02 02:41:33 openqaworker-arm-1 systemd[1]: Set up automount var-lib-openqa-share.automount.
Jul 02 02:53:32 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 03:01:08 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Failed with result 'unmounted'.
Judging by the logs this is a recurring problem. Unless someone restarted the unit manually on Jul 02 07:16:00 the problem fixed itself (which is not visible in the journal but on Grafana).
The only relevant log lines in dmesg are:
[ 783.324257] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 783.532483] Key type dns_resolver registered
[ 783.958133] NFS: Registering the id_resolver key type
When looking at the last 30 days in Grafana it becomes very apparent that the var-lib-openqa-share.automount unit failed on various hosts on various occasions. So this is really not a new problem and not specific to arm-1.
Correct. I saw that problem over the past weeks but failed to find corresponding tickets where I would have commented about it.
Considering there apparently is a retry going on and the problem always fixes itself (or was manual intervention ever required?)
Yes, AFAIR I needed to apply manual intervention.
I suggest we exclude this unit from the systemd services alert.
Hm, I am not convinced about that. Maybe we can find a pattern under which conditions this failed on the various hosts? And as a retry is already included, why do we end up with failed systemd units then?
By the way, I see this related to https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/925: if we managed to no longer rely on the NFS share from OSD on the workers, we wouldn't need the mount at all anymore, preventing multiple problems.
Updated by mkittler 5 months ago · Edited
The latest version of my MR for ignoring the unit should work now; I tested it via:
# render the telegraf config locally to verify the exclusion logic
# (run from a salt-states-openqa checkout with salt-pillars-openqa next to it)
sudo bash -c "salt-call --out=json \
  --pillar-root=../salt-pillars-openqa --local slsutil.renderer \
  '$PWD/monitoring/telegraf/telegraf-common.conf' \
  default_renderer=jinja host=foo \
  | jq -r '.local'"
I invoked systemctl reset-failed on worker40 because the unit was indeed failing there but the NFS mount was actually functional.
I had another look on arm-1 where the NFS mount was actually not functional. So I started var-lib-openqa-share.automount there again, which worked immediately.
I thought it might make sense to also ensure var-lib-openqa-share.automount is started on worker40 as well, but starting it failed with "var-lib-openqa-share.automount: Path /var/lib/openqa/share is already a mount point, refusing start.". So maybe the mount on worker40 was restored by other means than the automount unit. I'll keep it as-is.
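As a side note (not from this ticket): findmnt is a quick standard way to tell whether such a path is currently served by the autofs trap or by the actual NFS mount:
# FSTYPE "autofs" means only the automount trap is in place,
# "nfs4"/"nfs" means the real share is currently mounted
findmnt /var/lib/openqa/share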
Considering this means the NFS mount may not be functional after all, I'm not so sure anymore whether we should ignore it when this unit fails. However, having to deal with this alert/problem manually so often is also not ideal.
Useful documentation:
Updated by openqa_review 5 months ago
- Due date set to 2024-07-17
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 5 months ago
- Related to action #94949: Failed systemd services alert for openqaworker3 var-lib-openqa-share.automount added
Updated by okurz 5 months ago
- Related to action #93964: salt-states CI pipeline deploy step fails on some workers with "Unable to unmount /var/lib/openqa/share: umount.nfs: /var/lib/openqa/share: device is busy." added
Updated by mkittler 5 months ago
This is happening more often since 2024-06-14 07:49:00, see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1716816941347&to=1720002135377.
Updated by okurz 5 months ago · Edited
Three points that you could follow, independent of each other:
- Implement a custom systemd restart unit and a custom systemd check+monitoring unit, and blocklist the .automount units
- Research how a systemd automount unit, which is not a service, could be restarted on failure: for this I found the open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590
- Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As we haven't found anything via web search, I suggest creating a bug on bugzilla.suse.com plus asking experts in the SUSE-internal chat as well as the external upstream chat
Updated by mkittler 5 months ago · Edited
I looked into those points in reverse order, from 3 to 1.
3. This search isn't really revealing. It leads to https://github.com/systemd/systemd/blob/main/src/core/automount.c and of course a few forum posts and issues without clear resolution.
2. This is the open issue https://github.com/systemd/systemd/issues/16811. It mentions a workaround similar to our own idea in point 1. There's also the closed issue https://github.com/systemd/systemd/issues/4468 about mount units themselves which doesn't read very promising either.
1. I'll go for that option using the code from https://github.com/systemd/systemd/issues/16811#issuecomment-728662590.
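A minimal sketch of what that workaround could look like; the unit names match what shows up in the journal later in this ticket, but the exact commands and the delay are assumptions and the merged MR may differ:
# /etc/systemd/system/automount-restarter@.service
[Unit]
Description=Restarts the automount unit %i

[Service]
Type=oneshot
# give the kernel a moment to finish tearing down the old autofs mount (assumed delay)
ExecStart=/usr/bin/sleep 10
ExecStart=/usr/bin/systemctl start %i.automount

# plus a drop-in for the automount unit, e.g.
# /etc/systemd/system/var-lib-openqa-share.automount.d/restarter.conf
[Unit]
OnFailure=automount-restarter@var-lib-openqa-share.service
After deploying such files a systemctl daemon-reload is needed, which matches the observation further down in this ticket.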
Updated by mkittler 5 months ago
- Status changed from Feedback to Resolved
The MR was merged and I invoked systemctl daemon-reload on all workers¹. I tested it on worker35 and worker40 and it works. (The automount unit is restarted and active again after entering a failed state provoked via sudo umount /var/lib/openqa/share.) With that I'm considering this ticket resolved.
¹ Apparently this is not done by Salt automatically; at least I got the following before:
martchus@worker35:~> sudo systemctl status var-lib-openqa-share.automount
Warning: The unit file, source configuration file or drop-ins of var-lib-openqa-share.automount changed on disk. Run 'systemctl daemon-reload' to reload units.
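For the record, the manual test described above boils down to something like this (the exact invocations may have differed):
# provoke the failure the restarter is supposed to handle
sudo umount /var/lib/openqa/share
# the restarter waits a few seconds before re-establishing the automount
sleep 15
# confirm the automount unit is active again and check the journal
systemctl is-active var-lib-openqa-share.automount
sudo journalctl -u var-lib-openqa-share.automount -n 10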
Updated by okurz 4 months ago
https://stats.openqa-monitor.qa.suse.de/alerting/silences still shows an active silence which is actually still needed as alerts are still firing, and https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1 shows multiple "var-lib-openqa-share.automount" failures. Please look into those.
Updated by mkittler 4 months ago
The var-lib-openqa-share.automount units were never in a failed state for very long, so I don't think they would have caused any alert on their own. That's also why everything is fine again now without us having to do anything.
In the journal on an example worker it also looks like the workaround for restarting the automount unit works:
martchus@worker35:~> sudo journalctl -fu var-lib-openqa-share.automount
Aug 14 09:36:54 worker35 systemd[1]: var-lib-openqa-share.automount: Triggering OnFailure= dependencies.
Aug 14 09:37:04 worker35 systemd[1]: Set up automount var-lib-openqa-share.automount.
Aug 14 10:36:53 worker35 systemd[1]: var-lib-openqa-share.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
Aug 14 10:36:53 worker35 systemd[1]: var-lib-openqa-share.automount: Failed with result 'unmounted'.
Aug 14 10:36:53 worker35 systemd[1]: var-lib-openqa-share.automount: Triggering OnFailure= dependencies.
Aug 14 10:37:03 worker35 systemd[1]: Set up automount var-lib-openqa-share.automount.
Aug 14 10:55:57 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81580 (worker)
Aug 14 11:13:26 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81637 (worker)
Aug 14 12:02:42 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81587 (worker)
Aug 14 13:06:00 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81625 (worker)
martchus@worker35:~> sudo journalctl -fu automount-restarter@var-lib-openqa-share.service
Aug 14 07:50:06 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.
Aug 14 08:36:53 worker35 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Aug 14 08:37:03 worker35 systemd[1]: automount-restarter@var-lib-openqa-share.service: Deactivated successfully.
Aug 14 08:37:03 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.
Aug 14 09:36:54 worker35 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Aug 14 09:37:04 worker35 systemd[1]: automount-restarter@var-lib-openqa-share.service: Deactivated successfully.
Aug 14 09:37:04 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.
Aug 14 10:36:53 worker35 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Aug 14 10:37:03 worker35 systemd[1]: automount-restarter@var-lib-openqa-share.service: Deactivated successfully.
Aug 14 10:37:03 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.
Of course, from the timestamps one can see that it takes about ten seconds to restart the automount unit.
So I would resolve this ticket again. I don't think we can ensure the automount unit never goes into the failed state. We could of course ignore it completely in our monitoring if you think that's better than it cluttering our table of failing systemd services.
Updated by mkittler 4 months ago
MR for excluding: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1252