action #163097

Updated by okurz 14 days ago

## Observation 

 [Failed systemd services (osd)](https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1): 

 

     2024-07-02 07:15:00 openqaworker-arm-1 var-lib-openqa-share.automount 1 

 This has been happening more often since 2024-06-14 07:49:00, see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1716816941347&to=1720002135377. 


 ## Acceptance criteria 
 * **AC1:** var-lib-openqa-share.automount is consistently not causing alerts 
 * **AC2:** /var/lib/openqa/share NFS mount on workers is consistently working 

 ## Suggestions 
 * ssh seems fine 
 * ping seems fine 
 * Investigate what is or was actually failing here 
 * Three points that you could follow, independent of each other: 
     * Implement a custom systemd restart unit and a custom systemd check+monitoring unit, and blocklist the .automount units 
     * Research how a systemd automount unit, which is not a service, could be restarted on failure. For this I found an open feature request https://github.com/systemd/systemd/issues/16811 with a workaround in https://github.com/systemd/systemd/issues/16811#issuecomment-728662590 
     * Research the error "Got hangup/error on autofs pipe from kernel. Likely our automount point has been unmounted by someone or something else?". As we haven't found anything in a web search, I suggest creating a bug on bugzilla.suse.com and asking experts in SUSE internal chat as well as in external upstream chat 
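
 The "custom systemd restart unit" idea could be backed by a small helper along these lines. This is only a sketch under assumptions: the function names `run_systemctl` and `restart_if_failed` are hypothetical, it assumes the script is run as root by some periodic trigger (for example a timer unit that is not shown here), and it assumes that restarting the automount unit is enough to recover the mount, which the investigation has not yet confirmed.

```python
import subprocess
from typing import Callable, Sequence

# Unit name taken from the alert in this ticket.
UNIT = "var-lib-openqa-share.automount"

# A runner takes systemctl arguments and returns the exit code.
Runner = Callable[[Sequence[str]], int]


def run_systemctl(args: Sequence[str]) -> int:
    """Run systemctl with the given arguments and return its exit code."""
    return subprocess.run(["systemctl", *args], check=False).returncode


def restart_if_failed(unit: str = UNIT, run: Runner = run_systemctl) -> bool:
    """Restart `unit` if systemd reports it as failed.

    Returns True if a restart was attempted, False otherwise.
    """
    # `systemctl is-failed` exits with 0 when the unit is in the failed state.
    if run(["is-failed", "--quiet", unit]) != 0:
        return False
    # Clear the failed state first, then start the unit again.
    run(["reset-failed", unit])
    run(["restart", unit])
    return True


if __name__ == "__main__":
    restart_if_failed()
```

 The runner is injectable so that the logic can be exercised without touching systemd. Note that an automatic restart would hide the underlying failure from the alert, so it would complement the investigation of the root cause rather than replace it.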


 

 ## Rollback steps 
 * Remove silence `alertname=Failed systemd services alert (except openqa.suse.de)` from https://monitor.qa.suse.de/alerting/silences
