action #163097


Share mount not working on openqaworker-arm-1 and other workers size:M

Added by livdywan 15 days ago. Updated 12 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date: 2024-07-17
% Done: 0%
Estimated time:
Tags:

Description

Observation

Failed systemd services (osd):

2024-07-02 07:15:00 openqaworker-arm-1 var-lib-openqa-share.automount 1

This is happening more often since 2024-06-14 07:49:00, see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1716816941347&to=1720002135377.

Acceptance criteria

  • AC1: var-lib-openqa-share.automount is consistently not causing alerts
  • AC2: /var/lib/openqa/share NFS mount on workers is consistently working

Suggestions

  • ssh seems fine
    • ping seems fine
  • Investigate what is or was actually failing here
    Three points that you could follow, independent of each other:

    1. Implement a custom systemd restart unit and a custom systemd check+monitoring unit, and blocklist the .automount units

    2. Research how a systemd automount unit, which is not a service, could be restarted on failure: for this there is an open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590

    3. Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As web searches haven't turned up anything, create a bug on bugzilla.suse.com and ask experts in the SUSE-internal chat as well as the external upstream chat
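For the restart part of the first point, `OnFailure=` is a generic `[Unit]` option that also applies to automount units, so it can be hooked up via a drop-in. A minimal sketch (the helper unit name restart-share-automount.service is invented here; the actual code in the linked GitHub comment may differ in details):

```shell
#!/bin/sh
# Sketch: restart var-lib-openqa-share.automount whenever it fails.
# For illustration the unit files are written to ./demo-units; on a real
# worker they would go to /etc/systemd/system, followed by a daemon-reload.
set -eu
unitdir=./demo-units
mkdir -p "$unitdir/var-lib-openqa-share.automount.d"

# Drop-in: hook a helper service into the automount's failure path.
cat > "$unitdir/var-lib-openqa-share.automount.d/10-restart-on-failure.conf" <<'EOF'
[Unit]
OnFailure=restart-share-automount.service
EOF

# Helper service (name invented for this sketch) that restarts the automount.
cat > "$unitdir/restart-share-automount.service" <<'EOF'
[Unit]
Description=Restart the openQA share automount after a failure

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl restart var-lib-openqa-share.automount
EOF

echo "drop-in and helper service written to $unitdir"
```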

Rollback steps


Related issues: 4 (0 open, 4 closed)

Related to openQA Infrastructure - action #162590: NFS mounts are stuck on OSD workers if partitions on OSD fail to come up properly on boot size:S (Resolved, okurz, 2024-06-17)

Related to openQA Infrastructure - action #131309: [alert] NFS mount can fail due to hostname resolution error size:M (Resolved, nicksinger, 2023-06-19 to 2023-08-11)

Related to openQA Infrastructure - action #94949: Failed systemd services alert for openqaworker3 var-lib-openqa-share.automount (Resolved, okurz, 2021-06-30)

Related to openQA Infrastructure - action #93964: salt-states CI pipeline deploy step fails on some workers with "Unable to unmount /var/lib/openqa/share: umount.nfs: /var/lib/openqa/share: device is busy." (Resolved, okurz, 2021-06-14 to 2021-07-27)

Actions #1

Updated by okurz 15 days ago

  • Related to action #162590: NFS mounts are stuck on OSD workers if partitions on OSD fail to come up properly on boot size:S added
Actions #2

Updated by mkittler 15 days ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #3

Updated by mkittler 15 days ago

  • Subject changed from Share mount not working on openqaworker-arm-1 to Share mount not working on openqaworker-arm-1 and other workers

The journal looks like this:

sudo journalctl -fu var-lib-openqa-share.automount
…
-- Boot 2e0ffe940b3e4639ae27151d93e4f9ef --
Jul 02 02:41:33 openqaworker-arm-1 systemd[1]: Set up automount var-lib-openqa-share.automount.
Jul 02 02:53:32 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 03:01:08 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Failed with result 'unmounted'.

Judging by the logs this is a recurring problem. Unless someone restarted the unit manually, the problem fixed itself on Jul 02 07:16:00 (which is not visible in the journal, but on Grafana).

The only relevant log lines in dmesg are:

[  783.324257] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  783.532483] Key type dns_resolver registered
[  783.958133] NFS: Registering the id_resolver key type

When looking at the last 30 days in Grafana it becomes very apparent that the var-lib-openqa-share.automount unit failed on various hosts on various occasions. So this is really not a new problem and not specific to arm-1.

Considering there apparently is a retry going on and the problem always fixes itself (or was manual intervention ever required?) I suggest we exclude this unit from the systemd services alert.

Actions #4

Updated by mkittler 15 days ago

This is how an exclusion could be done (not tested yet): https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1218

Actions #5

Updated by okurz 15 days ago

  • Related to action #131309: [alert] NFS mount can fail due to hostname resolution error size:M added
Actions #6

Updated by okurz 15 days ago

mkittler wrote in #note-3:

> The journal looks like this:
>
> sudo journalctl -fu var-lib-openqa-share.automount
> …
> -- Boot 2e0ffe940b3e4639ae27151d93e4f9ef --
> Jul 02 02:41:33 openqaworker-arm-1 systemd[1]: Set up automount var-lib-openqa-share.automount.
> Jul 02 02:53:32 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
> Jul 02 03:01:08 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got automount request for /var/lib/openqa/share, triggered by 2590 (worker)
> Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?
> Jul 02 04:48:38 openqaworker-arm-1 systemd[1]: var-lib-openqa-share.automount: Failed with result 'unmounted'.
>
> Judging by the logs this is a recurring problem. Unless someone restarted the unit manually, the problem fixed itself on Jul 02 07:16:00 (which is not visible in the journal, but on Grafana).
>
> The only relevant log lines in dmesg are:
>
> [  783.324257] RPC: Registered tcp NFSv4.1 backchannel transport module.
> [  783.532483] Key type dns_resolver registered
> [  783.958133] NFS: Registering the id_resolver key type
>
> When looking at the last 30 days in Grafana it becomes very apparent that the var-lib-openqa-share.automount unit failed on various hosts on various occasions. So this is really not a new problem and not specific to arm-1.

Correct. I saw that problem over the past weeks but failed to find corresponding tickets where I would have commented about it.

> Considering there apparently is a retry going on and the problem always fixes itself (or was manual intervention ever required?)

Yes, AFAIR manual intervention was needed.

> I suggest we exclude this unit from the systemd services alert.

Hm, I am not convinced about that. Maybe we can find a pattern under which conditions this failed on the various hosts? And as there is a retry already included, why do we end up with failed systemd units then?

By the way, I see this as related to https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/925: if we managed to not rely on the NFS share from OSD on the workers anymore, we wouldn't need the mount at all, preventing multiple problems.

Actions #7

Updated by livdywan 15 days ago

Also worker40:

2024-07-02 14:58:00 worker40 var-lib-openqa-share.automount 1
Actions #8

Updated by mkittler 15 days ago · Edited

The latest version of my MR for ignoring the unit should work now; I tested it via:

sudo bash -c "salt-call --out=json \\
    --pillar-root=../salt-pillars-openqa --local slsutil.renderer \\
    '$PWD/monitoring/telegraf/telegraf-common.conf' \\
    default_renderer=jinja host=foo \\
  | jq -r '.local'"

I invoked systemctl reset-failed on worker40 because the unit was indeed failing there but the NFS mount was actually functional.

I had another look at arm-1 and there the NFS mount was actually not functional. So I started var-lib-openqa-share.automount again, which worked immediately.

I thought it might make sense to also ensure var-lib-openqa-share.automount is started on worker40, but starting it failed with "var-lib-openqa-share.automount: Path /var/lib/openqa/share is already a mount point, refusing start." So maybe the mount on worker40 was restored by other means than the automount unit. I'll keep it as-is.

Considering this means that the NFS mount is sometimes not functional after all, I'm not so sure anymore whether we should ignore failures of this unit. However, having to deal with this alert/problem manually so often is not ideal either.
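The mismatch observed here (a failed unit with a working mount on worker40, and a broken mount on arm-1) suggests checking the unit state and the actual mount health independently. A minimal triage sketch (the follow-up commands in the comments reflect the manual steps taken in this ticket, not existing tooling):

```shell
#!/bin/sh
# Triage sketch: the systemd unit state and the actual mount health can
# disagree, so check both before deciding what (if anything) to do.
share=/var/lib/openqa/share
unit=var-lib-openqa-share.automount

# 'systemctl is-active' prints the state even when exiting non-zero.
unit_state=$(systemctl is-active "$unit" 2>/dev/null) || true
[ -n "$unit_state" ] || unit_state=unknown

if mountpoint -q "$share" 2>/dev/null; then
    mount_state=mounted
else
    mount_state=not-mounted
fi

echo "unit=$unit_state mount=$mount_state"
# Possible follow-ups (manual, as done in this ticket):
#   unit failed, mount ok      -> systemctl reset-failed (cosmetic failure)
#   unit failed, mount missing -> systemctl start on the automount unit
```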


Useful documentation:

Actions #9

Updated by openqa_review 14 days ago

  • Due date set to 2024-07-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by okurz 14 days ago

  • Description updated (diff)

Added a silence as the alert triggered again

2024-07-03 10:12:00 worker-arm1 var-lib-openqa-share.automount
2024-07-02 15:39:00 worker40 var-lib-openqa-share.automount
Actions #11

Updated by okurz 14 days ago

  • Related to action #94949: Failed systemd services alert for openqaworker3 var-lib-openqa-share.automount added
Actions #12

Updated by okurz 14 days ago

  • Related to action #93964: salt-states CI pipeline deploy step fails on some workers with "Unable to unmount /var/lib/openqa/share: umount.nfs: /var/lib/openqa/share: device is busy." added
Actions #14

Updated by okurz 14 days ago · Edited

Three points that you could follow, independent of each other:

  1. Implement a custom systemd restart unit and a custom systemd check+monitoring unit, and blocklist the .automount units
  2. Research how a systemd automount unit, which is not a service, could be restarted on failure: for this I found an open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590
  3. Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As we haven't found anything in web searches, I suggest creating a bug on bugzilla.suse.com plus asking experts in the SUSE-internal chat as well as the external upstream chat
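The check+monitoring half of point 1 could be a periodic oneshot service whose failure is visible to the existing failed-services monitoring, compensating for blocklisting the .automount unit itself. A sketch under assumptions: the unit names check-share-mount.* are invented, and for illustration the files are written to ./demo-units rather than /etc/systemd/system.

```shell
#!/bin/sh
# Sketch for the check+monitoring part: a oneshot service plus timer that
# periodically verifies the share is actually mounted.
set -eu
unitdir=./demo-units
mkdir -p "$unitdir"

cat > "$unitdir/check-share-mount.service" <<'EOF'
[Unit]
Description=Check that /var/lib/openqa/share is mounted

[Service]
Type=oneshot
# mountpoint exits non-zero when the path is not a mount point,
# which puts this unit into a failed state the monitoring can see
ExecStart=/usr/bin/mountpoint -q /var/lib/openqa/share
EOF

cat > "$unitdir/check-share-mount.timer" <<'EOF'
[Timer]
OnCalendar=*:0/5

[Install]
WantedBy=timers.target
EOF

echo "check service and timer written to $unitdir"
```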
Actions #15

Updated by okurz 14 days ago

  • Subject changed from Share mount not working on openqaworker-arm-1 and other workers to Share mount not working on openqaworker-arm-1 and other workers size:M
  • Description updated (diff)
Actions #16

Updated by mkittler 12 days ago · Edited

I looked into those points in reverse order, from 3 to 1.

3. This search isn't really revealing. It leads to https://github.com/systemd/systemd/blob/main/src/core/automount.c and of course a few forum posts and issues without a clear resolution.
2. This is an open issue: https://github.com/systemd/systemd/issues/16811. It mentions a workaround similar to our own idea in point 1. There's also the closed issue https://github.com/systemd/systemd/issues/4468 about mount units themselves, which doesn't read very promising either.
1. I'll go for that option, using the code from https://github.com/systemd/systemd/issues/16811#issuecomment-728662590.

Actions #17

Updated by mkittler 12 days ago

  • Status changed from In Progress to Feedback
Actions #18

Updated by mkittler 12 days ago

  • Status changed from Feedback to Resolved

The MR was merged and I invoked systemctl daemon-reload on all workers¹. I tested it on worker35 and worker40 and it works. (The automount unit is restarted and active again after entering a failed state provoked via sudo umount /var/lib/openqa/share.) With that I'm considering this ticket resolved.
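The verification described above could be scripted roughly as follows (a sketch: the 10-second wait is an assumption about how quickly the restart handling reacts, and the script does nothing on hosts that don't have the unit):

```shell
#!/bin/sh
# Sketch of the verification from this comment: provoke the 'unmounted'
# failure by unmounting the share, then check whether the new restart
# handling brings the automount unit back to an active state.
set -eu
unit=var-lib-openqa-share.automount
share=/var/lib/openqa/share

if systemctl cat "$unit" >/dev/null 2>&1; then
    sudo umount "$share"      # provoke the failure seen in the journal
    sleep 10                  # assumed to be enough for the restart handling
    state=$(systemctl is-active "$unit" || true)
else
    state=skipped             # not a worker host with this unit; do nothing
fi
echo "state=${state:-unknown}"
```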


¹ Apparently this is not done by Salt automatically; at least this is what I got before:

martchus@worker35:~> sudo systemctl status var-lib-openqa-share.automount
Warning: The unit file, source configuration file or drop-ins of var-lib-openqa-share.automount changed on disk. Run 'systemctl daemon-reload' to reload units.