action #163097

closed

Share mount not working on openqaworker-arm-1 and other workers size:M

Added by livdywan 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Failed systemd services (osd):

2024-07-02 07:15:00 openqaworker-arm-1 var-lib-openqa-share.automount 1

This has been happening more often since 2024-06-14 07:49:00, see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1716816941347&to=1720002135377.

Acceptance criteria

  • AC1: var-lib-openqa-share.automount is consistently not causing alerts
  • AC2: /var/lib/openqa/share NFS mount on workers is consistently working

Suggestions

  • ssh seems fine
  • ping seems fine
  • Investigate what is or was actually failing here.
    Three points that you could follow, independent of each other:
    • Implement a custom systemd restart unit and a custom systemd check+monitoring unit, and blocklist the .automount units
    • Research how a systemd automount unit, which is not a service, could be restarted on failure. For this I found an open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590
    • Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As we haven't found anything in a web search, I suggest creating a bug on bugzilla.suse.com and asking experts in SUSE internal chat as well as in external upstream chat
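
As an illustration of what the custom check+monitoring unit suggested above could test, here is a minimal Python sketch. It is not taken from this ticket or from any existing salt state. It assumes that a healthy systemd automount point shows up in /proc/mounts as an autofs entry on /var/lib/openqa/share, with an nfs or nfs4 entry on top once the share has been accessed. The names parse_mounts, share_state and the three state strings are hypothetical.

```python
from typing import Iterable, List, NamedTuple

# Mount point named in AC2 of this ticket.
SHARE = "/var/lib/openqa/share"


class MountEntry(NamedTuple):
    source: str
    target: str
    fstype: str


def parse_mounts(lines: Iterable[str]) -> List[MountEntry]:
    """Parse lines in /proc/mounts format: source target fstype options ..."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        entries.append(MountEntry(fields[0], fields[1], fields[2]))
    return entries


def share_state(entries: Iterable[MountEntry], target: str = SHARE) -> str:
    """Classify the share (state names are hypothetical, for illustration).

    "mounted"        : an nfs/nfs4 filesystem is mounted on the target
    "automount-idle" : only the autofs trigger is present, NFS not mounted yet
    "missing"        : neither is present, so the automount point is gone
    """
    fstypes = {e.fstype for e in entries if e.target == target}
    if any(t.startswith("nfs") for t in fstypes):
        return "mounted"
    if "autofs" in fstypes:
        return "automount-idle"
    return "missing"
```

On a worker this could be called as share_state(parse_mounts(open("/proc/mounts"))). A result of "missing" would presumably correspond to the situation the error message describes, where the automount point has been unmounted; that mapping is an assumption. The sketch only inspects the mount table, so it would not detect an NFS mount that is present but hanging.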

Rollback steps


Related issues 4 (0 open, 4 closed)

  • Related to openQA Infrastructure (public) - action #162590: NFS mounts are stuck on OSD workers if partitions on OSD fail to come up properly on boot size:S (Resolved, okurz, 2024-06-17)
  • Related to openQA Infrastructure (public) - action #131309: [alert] NFS mount can fail due to hostname resolution error size:M (Resolved, nicksinger, 2023-06-19, 2023-08-11)
  • Related to openQA Infrastructure (public) - action #94949: Failed systemd services alert for openqaworker3 var-lib-openqa-share.automount (Resolved, okurz, 2021-06-30)
  • Related to openQA Infrastructure (public) - action #93964: salt-states CI pipeline deploy step fails on some workers with "Unable to unmount /var/lib/openqa/share: umount.nfs: /var/lib/openqa/share: device is busy." (Resolved, okurz, 2021-06-14, 2021-07-27)
