action #163097
Updated by okurz 14 days ago
## Observation
[Failed systemd services (osd)](https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1):
2024-07-02 07:15:00 openqaworker-arm-1 var-lib-openqa-share.automount 1
This is happening more often since 2024-06-14 07:49:00, see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1716816941347&to=1720002135377.
## Acceptance criteria
* **AC1:** var-lib-openqa-share.automount is consistently not causing alerts
* **AC2:** /var/lib/openqa/share NFS mount on workers is consistently working
## Suggestions
* ssh seems fine
* ping seems fine
* Investigate what is or was actually failing here
* Three independent approaches you could follow:
  1. Implement a custom systemd restart unit and a custom systemd check+monitoring unit, and blocklist the .automount units
  2. Research how a systemd automount unit, which is not a service, could be restarted on failure. For this I found an open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590
  3. Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As we haven't found anything in a web search, I suggest creating a bug on bugzilla.suse.com and asking experts in SUSE internal chat as well as in external upstream chat
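
For the custom check and restart idea, a minimal sketch of what such a check could run, assuming Python is available on the workers. The function names `is_failed` and `restart_if_failed` are hypothetical helpers, not existing tooling; only the unit name comes from this ticket. It relies on `systemctl is-failed` exiting with 0 when the unit is in the failed state.

```python
import subprocess

# Unit name taken from this ticket; the helpers below are hypothetical.
AUTOMOUNT_UNIT = "var-lib-openqa-share.automount"


def is_failed(unit, run=subprocess.run):
    """Return True if systemd reports the unit as failed."""
    result = run(["systemctl", "is-failed", "--quiet", unit], check=False)
    # `systemctl is-failed` exits with 0 when the unit is in the failed state.
    return result.returncode == 0


def restart_if_failed(unit=AUTOMOUNT_UNIT, run=subprocess.run):
    """Restart the unit if it has failed; return True if a restart was issued."""
    if not is_failed(unit, run=run):
        return False
    # Raises subprocess.CalledProcessError if the restart itself fails,
    # so a calling service would end up failed and still be visible to monitoring.
    run(["systemctl", "restart", unit], check=True)
    return True
```

Something like `restart_if_failed()` could be called from a oneshot service triggered by a timer. The `run` parameter exists only so the logic can be exercised without touching systemd. Whether a plain restart is enough to recover from the autofs pipe error is not confirmed here.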
## Rollback steps
* Remove silence `alertname=Failed systemd services alert (except openqa.suse.de)` from https://monitor.qa.suse.de/alerting/silences