action #163097

closed

Share mount not working on openqaworker-arm-1 and other workers size:M

Added by livdywan 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Failed systemd services (osd):

2024-07-02 07:15:00 openqaworker-arm-1 var-lib-openqa-share.automount 1

This is happening more often since 2024-06-14 07:49:00, see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1716816941347&to=1720002135377.

Acceptance criteria

  • AC1: var-lib-openqa-share.automount is consistently not causing alerts
  • AC2: /var/lib/openqa/share NFS mount on workers is consistently working

Suggestions

  • ssh seems fine
    • ping seems fine
  • Investigate what is or was actually failing here. Three points that you could follow, independent of each other:
    1. Implement a custom systemd restart unit and custom systemd check+monitoring unit and blocklist the .automount units
    2. Research how a systemd automount unit, which is not a service, could be restarted on failure: for this I found an open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590
    3. Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As we haven't found anything via web search, I suggest creating a bug on bugzilla.suse.com plus asking experts in SUSE-internal chat as well as external upstream chat

Rollback steps


Related issues 4 (0 open, 4 closed)

Related to openQA Infrastructure (public) - action #162590: NFS mounts are stuck on OSD workers if partitions on OSD fail to come up properly on boot size:S (Resolved, okurz, 2024-06-17)

Related to openQA Infrastructure (public) - action #131309: [alert] NFS mount can fail due to hostname resolution error size:M (Resolved, nicksinger, 2023-06-19 - 2023-08-11)

Related to openQA Infrastructure (public) - action #94949: Failed systemd services alert for openqaworker3 var-lib-openqa-share.automount (Resolved, okurz, 2021-06-30)

Related to openQA Infrastructure (public) - action #93964: salt-states CI pipeline deploy step fails on some workers with "Unable to unmount /var/lib/openqa/share: umount.nfs: /var/lib/openqa/share: device is busy." (Resolved, okurz, 2021-06-14 - 2021-07-27)
Actions #1

Updated by okurz 5 months ago

  • Related to action #162590: NFS mounts are stuck on OSD workers if partitions on OSD fail to come up properly on boot size:S added
Actions #2

Updated by mkittler 5 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #3

Updated by mkittler 5 months ago

  • Subject changed from Share mount not working on openqaworker-arm-1 to Share mount not working on openqaworker-arm-1 and other workers

The journal looks like this:

sudo journalctl -fu var-lib-openqa-share.automount
…
-- Boot 2e0ffe940b3e4639ae27151d93e4f9ef --
Jul 02 02:41:33 openqaworker-arm-1 systemd[1]: Set up automount var-lib-openqa-share.automount.
Jul 02 02:53:32 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 03:01:08 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Failed with result 'unmounted'.

Judging by the logs this is a recurring problem. Unless someone restarted the unit manually, the problem fixed itself on Jul 02 07:16:00 (which is not visible in the journal, but on Grafana).

The only relevant log lines in dmesg are:

[  783.324257] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  783.532483] Key type dns_resolver registered
[  783.958133] NFS: Registering the id_resolver key type

When looking at the last 30 days in Grafana it becomes very apparent that the var-lib-openqa-share.automount unit failed on various hosts on various occasions. So this is really not a new problem and not specific to arm-1.

Considering there apparently is a retry going on and the problem always fixes itself (or was manual intervention ever required?), I suggest we exclude this unit from the systemd services alert.

Actions #4

Updated by mkittler 5 months ago

This is how an exclusion could be done (not tested yet): https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1218
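
The MR content isn't quoted here, but as a rough, untested illustration of one way such an exclusion could look, assuming the failed-systemd-services panel is fed by telegraf's systemd_units input (which tags each metric with the unit name):

# illustrative telegraf snippet only; the real change lives in the
# Jinja-templated telegraf-common.conf in salt-states-openqa and may differ
[[inputs.systemd_units]]
  # tagdrop filters out metrics whose "name" tag matches, so failures of
  # this unit never reach the alert
  [inputs.systemd_units.tagdrop]
    name = ["var-lib-openqa-share.automount"]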

Actions #5

Updated by okurz 5 months ago

  • Related to action #131309: [alert] NFS mount can fail due to hostname resolution error size:M added
Actions #6

Updated by okurz 5 months ago

mkittler wrote in #note-3:

> The journal looks like this:
>
> sudo journalctl -fu var-lib-openqa-share.automount
> …
> -- Boot 2e0ffe940b3e4639ae27151d93e4f9ef --
> Jul 02 02:41:33 openqaworker-arm-1 systemd[1]: Set up automount var-lib-openqa-share.automount.
> Jul 02 02:53:32 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
> Jul 02 03:01:08 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
> Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
> Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Failed with result 'unmounted'.
>
> Judging by the logs this is a recurring problem. Unless someone restarted the unit manually, the problem fixed itself on Jul 02 07:16:00 (which is not visible in the journal, but on Grafana).
>
> The only relevant log lines in dmesg are:
>
> [  783.324257] RPC: Registered tcp NFSv4.1 backchannel transport module.
> [  783.532483] Key type dns_resolver registered
> [  783.958133] NFS: Registering the id_resolver key type
>
> When looking at the last 30 days in Grafana it becomes very apparent that the var-lib-openqa-share.automount unit failed on various hosts on various occasions. So this is really not a new problem and not specific to arm-1.

Correct. I saw that problem in the past weeks but failed to find corresponding reports in tickets, where I would have commented about it.

> Considering there apparently is a retry going on and the problem always fixes itself (or was manual intervention ever required?)

Yes, AFAIR I needed to apply manual intervention.

> I suggest we exclude this unit from the systemd services alert.

Hm, I am not convinced about that. Maybe we can find a pattern under which conditions this failed on the various hosts? And as there is already a retry included, why do we end up with failed systemd units then?

By the way, I see this related to https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/925: if we managed to no longer rely on the NFS share from OSD on the workers, we wouldn't need the mount at all anymore, preventing multiple problems.

Actions #7

Updated by livdywan 5 months ago

Also worker40:

2024-07-02 14:58:00 worker40 var-lib-openqa-share.automount 1
Actions #8

Updated by mkittler 5 months ago · Edited

The latest version of my MR for ignoring the unit should work now; I tested it via:

sudo bash -c "salt-call --out=json \\
    --pillar-root=../salt-pillars-openqa --local slsutil.renderer \\
    '$PWD/monitoring/telegraf/telegraf-common.conf' \\
    default_renderer=jinja host=foo \\
  | jq -r '.local'"
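
(This renders the Jinja-templated telegraf config locally via Salt's slsutil.renderer and extracts the rendered text with jq, so the exclusion logic can be verified without deploying it to a worker.)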

I invoked systemctl reset-failed on worker40 because the unit was indeed failing there but the NFS mount was actually functional.

I had another look at arm-1 and there the NFS mount was actually not functional. So I started var-lib-openqa-share.automount there again, which worked immediately.

I thought it might make sense to also ensure var-lib-openqa-share.automount is started on worker40, but starting it failed with "var-lib-openqa-share.automount: Path /var/lib/openqa/share is already a mount point, refusing start." So maybe the mount on worker40 was restored by other means than the automount unit. I'll keep it as-is.
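
In shell terms the recovery steps described above amount to roughly the following (findmnt is added here merely as an illustrative way to check for an existing mount):

# worker40: the unit had failed but the mount still worked, so only clear the state
sudo systemctl reset-failed var-lib-openqa-share.automount
# arm-1: the mount was broken, so re-arm the automount unit
sudo systemctl start var-lib-openqa-share.automount
# starting refuses if the path is already a mount point, so check first
findmnt /var/lib/openqa/share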

Considering this means the NFS mount may actually not be functional when the unit fails, I'm not so sure anymore whether we should ignore failures of this unit. However, having to deal with this alert/problem so often manually is also not ideal.


Useful documentation:

Actions #9

Updated by openqa_review 5 months ago

  • Due date set to 2024-07-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by okurz 5 months ago

  • Description updated (diff)

Added a silence as the alert triggered again

2024-07-03 10:12:00 worker-arm1 var-lib-openqa-share.automount
2024-07-02 15:39:00 worker40 var-lib-openqa-share.automount
Actions #11

Updated by okurz 5 months ago

  • Related to action #94949: Failed systemd services alert for openqaworker3 var-lib-openqa-share.automount added
Actions #12

Updated by okurz 5 months ago

  • Related to action #93964: salt-states CI pipeline deploy step fails on some workers with "Unable to unmount /var/lib/openqa/share: umount.nfs: /var/lib/openqa/share: device is busy." added
Actions #14

Updated by okurz 5 months ago · Edited

Three points that you could follow, independent of each other:

  1. Implement a custom systemd restart unit and custom systemd check+monitoring unit and blocklist the .automount units
  2. Research how a systemd automount unit, which is not a service, could be restarted on failure: for this I found an open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590
  3. Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As we haven't found anything via web search, I suggest creating a bug on bugzilla.suse.com plus asking experts in SUSE-internal chat as well as external upstream chat
Actions #15

Updated by okurz 5 months ago

  • Subject changed from Share mount not working on openqaworker-arm-1 and other workers to Share mount not working on openqaworker-arm-1 and other workers size:M
  • Description updated (diff)
Actions #16

Updated by mkittler 5 months ago · Edited

I looked into those points in order from 3 to 1.

3. This search isn't really revealing. It leads to https://github.com/systemd/systemd/blob/main/src/core/automount.c and of course a few forum posts and issues without a clear resolution.
2. This is an open issue: https://github.com/systemd/systemd/issues/16811. It mentions a workaround similar to our own idea in point 1. There's also the closed issue https://github.com/systemd/systemd/issues/4468 about mount units themselves, which doesn't read very promising either.
1. I'll go for that option using the code from https://github.com/systemd/systemd/issues/16811#issuecomment-728662590 (a rough sketch follows below).
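
For reference, a minimal sketch of that workaround adapted to this mount. All file and unit names here are illustrative assumptions rather than quotes from the actual MR, although the journal in #note-22 shows a restarter unit of the same shape and a roughly 10-second gap before the unit is re-armed:

# /etc/systemd/system/automount-restarter@.service (illustrative name)
[Unit]
Description=Restarts the automount unit %i

[Service]
Type=oneshot
# wait briefly, clear the failed state, then re-arm the automount unit;
# %i is the escaped mount path, e.g. var-lib-openqa-share
ExecStart=/bin/sh -c 'sleep 10; systemctl reset-failed %i.automount; systemctl start %i.automount'

# drop-in for the automount unit, e.g.
# /etc/systemd/system/var-lib-openqa-share.automount.d/restarter.conf
[Unit]
# %N is the unit name without its type suffix, so a failure triggers
# automount-restarter@var-lib-openqa-share.service
OnFailure=automount-restarter@%N.service

After deploying such unit files or drop-ins, systemctl daemon-reload is needed, matching what was observed in #note-18 below.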

Actions #17

Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback
Actions #18

Updated by mkittler 5 months ago

  • Status changed from Feedback to Resolved

The MR was merged and I invoked systemctl daemon-reload on all workers¹. I tested it on worker35 and worker40 and it works. (The automount unit is restarted and active again after entering a failed state provoked via sudo umount /var/lib/openqa/share.) With that I'm considering this ticket resolved.


¹ Apparently this is not done by Salt automatically; at least this is what I got before:

martchus@worker35:~> sudo systemctl status var-lib-openqa-share.automount
Warning: The unit file, source configuration file or drop-ins of var-lib-openqa-share.automount changed on disk. Run 'systemctl daemon-reload' to reload units.
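
The check described above can be reproduced on a worker roughly like this; the 15-second wait is an assumption based on the restarter delay visible in #note-22:

sudo umount /var/lib/openqa/share                    # provoke the 'unmounted' failure
sleep 15                                             # give the OnFailure restarter time to run
systemctl is-active var-lib-openqa-share.automount   # expected output: active
ls /var/lib/openqa/share                             # accessing the path re-triggers the automount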
Actions #19

Updated by okurz 4 months ago

  • Status changed from Resolved to Feedback
Actions #20

Updated by okurz 4 months ago

  • Due date deleted (2024-07-17)
Actions #21

Updated by okurz 4 months ago

https://stats.openqa-monitor.qa.suse.de/alerting/silences still shows an active silence which is actually still needed, as alerts are still firing, and https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1 is showing multiple "var-lib-openqa-share.automount" entries. Please look into those.

Actions #22

Updated by mkittler 4 months ago

The var-lib-openqa-share.automount units were never in a failed state for very long. So I don't think they would have caused any alert on their own. That's also why everything is fine again now without us having to do anything.

In the journal on an example worker it also looks like the workaround for restarting the automount unit works:

martchus@worker35:~> sudo journalctl -fu var-lib-openqa-share.automount
Aug 14 09:36:54 worker35 systemd[1]: var-lib-openqa-share.automount: Triggering OnFailure= dependencies.
Aug 14 09:37:04 worker35 systemd[1]: Set up automount var-lib-openqa-share.automount.
Aug 14 10:36:53 worker35 systemd[1]: var-lib-openqa-share.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
Aug 14 10:36:53 worker35 systemd[1]: var-lib-openqa-share.automount: Failed with result 'unmounted'.
Aug 14 10:36:53 worker35 systemd[1]: var-lib-openqa-share.automount: Triggering OnFailure= dependencies.
Aug 14 10:37:03 worker35 systemd[1]: Set up automount var-lib-openqa-share.automount.
Aug 14 10:55:57 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81580 (worker)
Aug 14 11:13:26 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81637 (worker)
Aug 14 12:02:42 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81587 (worker)
Aug 14 13:06:00 worker35 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 81625 (worker)
martchus@worker35:~> sudo journalctl -fu automount-restarter@var-lib-openqa-share.service
Aug 14 07:50:06 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.
Aug 14 08:36:53 worker35 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Aug 14 08:37:03 worker35 systemd[1]: automount-restarter@var-lib-openqa-share.service: Deactivated successfully.
Aug 14 08:37:03 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.
Aug 14 09:36:54 worker35 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Aug 14 09:37:04 worker35 systemd[1]: automount-restarter@var-lib-openqa-share.service: Deactivated successfully.
Aug 14 09:37:04 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.
Aug 14 10:36:53 worker35 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Aug 14 10:37:03 worker35 systemd[1]: automount-restarter@var-lib-openqa-share.service: Deactivated successfully.
Aug 14 10:37:03 worker35 systemd[1]: Finished Restarts the automount unit var-lib-openqa-share.

Of course, from the timestamps one can see that it takes about ten seconds to restart the automount unit.

So I would resolve this ticket again. I don't think we can ensure the automount unit never goes into the failed state. We could of course ignore it completely in our monitoring if you think that's better than it cluttering our table of failing systemd services.

Actions #23

Updated by mkittler 4 months ago

We decided to ignore these units in our alerting. @okurz wants to give it a try, so I'm not doing it myself and am keeping the ticket in Feedback.

Actions #25

Updated by mkittler 4 months ago

  • Status changed from Feedback to Resolved

The change was merged 2 days ago and in Grafana I see nothing anymore if I select the time range of the last two days. I guess that's good enough.
