action #163097
Share mount not working on openqaworker-arm-1 and other workers size:M (closed)
Added by livdywan 5 months ago. Updated 4 months ago.
Description
Observation
Failed systemd services (osd):
2024-07-02 07:15:00 openqaworker-arm-1 var-lib-openqa-share.automount 1
This is happening more often since 2024-06-14 07:49:00, see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1716816941347&to=1720002135377.
Acceptance criteria
- AC1: var-lib-openqa-share.automount is consistently not causing alerts
- AC2: /var/lib/openqa/share NFS mount on workers is consistently working
Suggestions
- ssh seems fine
- ping seems fine
- Investigate what is or was actually failing here
- Three points that you could follow, independent of each other:
  - Implement a custom systemd restart unit and a custom systemd check+monitoring unit, and blocklist the .automount units
  - Research how a systemd automount unit, which is not a service, could be restarted on failure: there is an open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590
  - Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As web searches haven't turned up anything, create a bug on bugzilla.suse.com and ask experts in SUSE internal chat as well as external upstream chat
Rollback steps
- Remove silence alertname=Failed systemd services alert (except openqa.suse.de) from https://monitor.qa.suse.de/alerting/silences
Updated by okurz 5 months ago
- Related to action #162590: NFS mounts are stuck on OSD workers if partitions on OSD fail to come up properly on boot size:S added
Updated by mkittler 5 months ago
- Subject changed from Share mount not working on openqaworker-arm-1 to Share mount not working on openqaworker-arm-1 and other workers
The journal looks like this:
sudo journalctl -fu var-lib-openqa-share.automount
…
-- Boot 2e0ffe940b3e4639ae27151d93e4f9ef --
Jul 02 02:41:33 openqaworker-arm-1 systemd[1]: Set up automount var-lib-openqa-share.automount.
Jul 02 02:53:32 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 03:01:08 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Failed with result 'unmounted'.
Judging by the logs this is a recurring problem. Unless someone restarted the unit manually, the problem fixed itself around Jul 02 07:16:00 (which is not visible in the journal, but is on Grafana).
The only relevant log lines in dmesg are:
[ 783.324257] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 783.532483] Key type dns_resolver registered
[ 783.958133] NFS: Registering the id_resolver key type
When looking at the last 30 days in Grafana it becomes very apparent that the var-lib-openqa-share.automount unit failed on various hosts on various occasions. So this is really not a new problem and not specific to arm-1.
Considering there apparently is a retry going on and the problem always fixes itself (or was manual intervention ever required?), I suggest we exclude this unit from the systemd services alert.
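To quantify how often this recurs on a given host, something like the following could be used (an illustrative sketch; the 30-day window is arbitrary):
# count how often the unit ended up in the 'unmounted' failure state recently
sudo journalctl -u var-lib-openqa-share.automount --since -30d \
  | grep -c "Failed with result 'unmounted'"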
Updated by mkittler 5 months ago
This is how such an exclusion could be done (not tested yet): https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1218
Updated by okurz 5 months ago
- Related to action #131309: [alert] NFS mount can fail due to hostname resolution error size:M added
Updated by okurz 5 months ago
mkittler wrote in #note-3:
The journal looks like this:
sudo journalctl -fu var-lib-openqa-share.automount
…
-- Boot 2e0ffe940b3e4639ae27151d93e4f9ef --
Jul 02 02:41:33 openqaworker-arm-1 systemd[1]: Set up automount var-lib-openqa-share.automount.
Jul 02 02:53:32 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 03:01:08 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Failed with result 'unmounted'.
Judging by the logs this is a recurring problem. Unless someone restarted the unit manually on Jul 02 07:16:00 the problem fixed itself (which is not visible in the journal but on Grafana).
The only relevant log lines in dmesg are:
[ 783.324257] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 783.532483] Key type dns_resolver registered
[ 783.958133] NFS: Registering the id_resolver key type
When looking at the last 30 days in Grafana it becomes very apparent that the var-lib-openqa-share.automount unit failed on various hosts on various occasions. So this is really not a new problem and not specific to arm-1.
Correct. I saw that problem over the past weeks but failed to find corresponding ticket reports where I would have commented about it.
Considering there apparently is a retry going on and the problem always fixes itself (or was manual intervention ever required?)
Yes, AFAIR I needed to apply manual intervention.
I suggest we exclude this unit from the systemd services alert.
Hm, I am not convinced about that. Maybe we can find a pattern for the conditions under which this failed on the various hosts? And as there is already a retry included, why do we end up with failed systemd units at all?
By the way, I see this as related to https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/925: if we managed to not rely on the NFS share from OSD on the workers, we wouldn't need the mount at all anymore, preventing multiple problems.
Updated by mkittler 5 months ago · Edited
The latest version of my MR for ignoring the unit should work now; I tested it by rendering the Telegraf config locally from a salt-states-openqa checkout:
sudo bash -c "salt-call --out=json \\
--pillar-root=../salt-pillars-openqa --local slsutil.renderer \\
'$PWD/monitoring/telegraf/telegraf-common.conf' \\
default_renderer=jinja host=foo \\
| jq -r '.local'"
I invoked systemctl reset-failed on worker40 because the unit was indeed failing there but the NFS mount was actually functional.
I had another look on arm-1 and there the NFS mount was actually not functional. So I started var-lib-openqa-share.automount there again, which worked immediately.
I thought it might make sense to also ensure var-lib-openqa-share.automount is started on worker40, but starting it failed with "var-lib-openqa-share.automount: Path /var/lib/openqa/share is already a mount point, refusing start.". So maybe the mount on worker40 was restored by other means than the automount unit. I'll keep it as-is.
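For reference, a quick way to check whether the share is actually usable on a worker could look like this (untested sketch; the factory subdirectory is assumed to exist on the share):
findmnt /var/lib/openqa/share                 # shows what is mounted there (autofs and/or nfs)
ls /var/lib/openqa/share/factory > /dev/null \
  && echo "share readable"                    # accessing the path triggers the automount and proves the NFS share is usable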
Considering this means the NFS mount is sometimes not functional after all, I'm not so sure anymore whether we should ignore it when this unit fails. However, having to deal with this alert/problem manually so often is not ideal either.
Useful documentation:
Updated by openqa_review 5 months ago
- Due date set to 2024-07-17
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 5 months ago
- Related to action #94949: Failed systemd services alert for openqaworker3 var-lib-openqa-share.automount added
Updated by okurz 5 months ago
- Related to action #93964: salt-states CI pipeline deploy step fails on some workers with "Unable to unmount /var/lib/openqa/share: umount.nfs: /var/lib/openqa/share: device is busy." added
Updated by mkittler 5 months ago
This is happening more often since 2024-06-14 07:49:00, see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1716816941347&to=1720002135377.
Updated by okurz 5 months ago · Edited
Three points that you could follow, independent of each other:
- Implement a custom systemd restart unit and a custom systemd check+monitoring unit, and blocklist the .automount units
- Research how a systemd automount unit, which is not a service, could be restarted on failure: for this I found an open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590
- Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As we haven't found anything in web searches, I suggest creating a bug on bugzilla.suse.com plus asking experts in SUSE internal chat as well as external upstream chat
Updated by mkittler 5 months ago · Edited
I looked into those points in reverse order, from 3 to 1.
3. This search isn't really revealing. It leads to https://github.com/systemd/systemd/blob/main/src/core/automount.c and of course a few forum posts and issues without clear resolution.
2. This is an open issue: https://github.com/systemd/systemd/issues/16811. It mentions a workaround similar to our own idea in point 1. There's also the closed issue https://github.com/systemd/systemd/issues/4468 about mount units themselves, which doesn't read very promising either.
1. I'll go for that option using the code from https://github.com/systemd/systemd/issues/16811#issuecomment-728662590.
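For reference, that workaround boils down to an OnFailure= hook plus a small restarter template unit. A minimal sketch adapted to our unit name (the exact file names and contents in our salt states may differ; the 10-second delay is an assumption based on the journal timestamps further below):
# template service that restarts a given automount unit after a short delay
sudo tee /etc/systemd/system/automount-restarter@.service <<'EOF'
[Unit]
Description=Restarts the automount unit %i

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'sleep 10 && systemctl restart %i.automount'
EOF

# drop-in wiring the restarter into the automount unit via OnFailure=
sudo mkdir -p /etc/systemd/system/var-lib-openqa-share.automount.d
sudo tee /etc/systemd/system/var-lib-openqa-share.automount.d/on-failure.conf <<'EOF'
[Unit]
OnFailure=automount-restarter@var-lib-openqa-share.service
EOF

sudo systemctl daemon-reload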
Updated by mkittler 5 months ago
- Status changed from Feedback to Resolved
The MR was merged and I invoked systemctl daemon-reload on all workers¹. I tested it on worker35 and worker40 and it works. (The automount unit is restarted and active again after entering a failed state provoked via sudo umount /var/lib/openqa/share.) With that I'm considering this ticket resolved.
¹ Apparently this is not done by Salt automatically; at least I got the following before:
martchus@worker35:~> sudo systemctl status var-lib-openqa-share.automount
Warning: The unit file, source configuration file or drop-ins of var-lib-openqa-share.automount changed on disk. Run 'systemctl daemon-reload' to reload units.
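A manual test along those lines could look like this on a single worker (a sketch; the 15-second wait assumes the restarter's delay seen in the journal timestamps below):
sudo umount /var/lib/openqa/share                    # provoke the 'unmounted' failure state
sleep 15                                             # give the OnFailure= restarter time to run
systemctl is-active var-lib-openqa-share.automount   # expected to print: active
ls /var/lib/openqa/share > /dev/null                 # accessing the path triggers the automount again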
Updated by okurz 4 months ago
https://stats.openqa-monitor.qa.suse.de/alerting/silences still shows an active silence which is apparently still needed, as alerts are still firing and https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1 shows multiple "var-lib-openqa-share.automount" entries. Please look into those.
Updated by mkittler 4 months ago
The var-lib-openqa-share.automount units were never in a failed state for very long, so I don't think they would have caused any alert on their own. That's also why everything is fine again now without us having to do anything.
In the journal on an example worker it also looks like the workaround for restarting the automount unit works:
martchus@worker35:~> sudo journalctl -fu var-lib-openqa-share.automount
Aug 14 09:36:54 worker35 systemd[1]: var-lib-openqa-share.automount: Triggering OnFailure= dependencies.
Aug 14 09:37:04 worker35 systemd[1]: Set up automount var-lib-openqa-share.automount.
Aug 14 10:36:53 worker35 systemd[1]: var-lib-openqa-share.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
Aug 14 10:36:53 worker35 systemd[1]: var-lib-openqa-share.automount: Failed with result 'unmounted'.
Aug 14 10:36:53 worker35 systemd[1]: var-lib-openqa-share.automount: Triggering OnFailure= dependencies.
Aug 14 10:37:03 worker35 systemd[1]: Set up automount var-lib-openqa-share.automount.
Aug 14 10:55:57 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81580 (worker)
Aug 14 11:13:26 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81637 (worker)
Aug 14 12:02:42 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81587 (worker)
Aug 14 13:06:00 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81625 (worker)
martchus@worker35:~> sudo journalctl -fu automount-restarter@var-lib-openqa-share.service
Aug 14 07:50:06 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.
Aug 14 08:36:53 worker35 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Aug 14 08:37:03 worker35 systemd[1]: automount-restarter@var-lib-openqa-share.service: Deactivated successfully.
Aug 14 08:37:03 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.
Aug 14 09:36:54 worker35 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Aug 14 09:37:04 worker35 systemd[1]: automount-restarter@var-lib-openqa-share.service: Deactivated successfully.
Aug 14 09:37:04 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.
Aug 14 10:36:53 worker35 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Aug 14 10:37:03 worker35 systemd[1]: automount-restarter@var-lib-openqa-share.service: Deactivated successfully.
Aug 14 10:37:03 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.
Of course, from the timestamps one can see that it takes a few seconds to restart the automount unit.
So I would resolve this ticket again. I don't think we can ensure the automount unit never goes into the failed state. We could of course ignore it completely in our monitoring if you think that's better than it cluttering our table of failing systemd services.
Updated by mkittler 4 months ago
MR for excluding: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1252