action #163097
closed
Share mount not working on openqaworker-arm-1 and other workers size:M
0%
Description
Observation
Failed systemd services (osd):
2024-07-02 07:15:00 openqaworker-arm-1 var-lib-openqa-share.automount 1
This is happening more often since 2024-06-14 07:49:00, see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1716816941347&to=1720002135377.
Acceptance criteria
- AC1: var-lib-openqa-share.automount is consistently not causing alerts
- AC2: /var/lib/openqa/share NFS mount on workers is consistently working
Suggestions
- ssh seems fine
- ping seems fine
- Investigate what is or was actually failing here
- Three points that you could follow, independent of each other:
  - Implement a custom systemd restart unit and a custom systemd check+monitoring unit, and blocklist the .automount units
  - Research how a systemd automount unit, which is not a service, could be restarted on failure. For this I found an open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590
  - Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As we haven't found anything in a web search, I suggest filing a bug on bugzilla.suse.com and asking experts in SUSE internal chat as well as in external upstream chat
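The "custom check plus restart" idea from the suggestions could be sketched roughly as below. This is a minimal, hypothetical sketch and not the workaround from the linked GitHub comment: the function names `run_systemctl` and `restart_if_failed` are made up for illustration, and it assumes that `systemctl is-failed` and `systemctl restart` are sufficient to recover the automount unit, which has not been verified on the affected workers.

```python
import subprocess
from typing import Callable, Sequence

# Unit name taken from the observation in this ticket.
UNIT = "var-lib-openqa-share.automount"


def run_systemctl(args: Sequence[str]) -> int:
    """Run systemctl with the given arguments and return its exit code."""
    return subprocess.run(["systemctl", *args], check=False).returncode


def restart_if_failed(
    unit: str = UNIT,
    runner: Callable[[Sequence[str]], int] = run_systemctl,
) -> bool:
    """Restart `unit` if it is in the failed state.

    Returns True if a restart was attempted, False otherwise.
    The `runner` parameter is a hypothetical hook so the logic can be
    exercised without a real systemd.
    """
    # `systemctl is-failed` exits with 0 when the unit is in the failed state.
    if runner(["is-failed", "--quiet", unit]) != 0:
        return False
    # Assumption: a plain restart is enough to bring the automount back.
    runner(["restart", unit])
    return True
```

Calling `restart_if_failed()` from a service triggered by a systemd timer would be one way to run the check periodically. Whether that is preferable to the approach discussed in the upstream feature request is open.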
Rollback steps
- Remove silence alertname=Failed systemd services alert (except openqa.suse.de) from https://monitor.qa.suse.de/alerting/silences