action #179032
Updated by nicksinger 2 months ago
## Observation

After VMs on qamaster got recovered, we received an alert about failing services on netboot: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?viewPanel=panel-6&orgId=1&from=2025-03-17T10:31:21.141Z&to=2025-03-17T12:46:04.738Z&timezone=UTC

This was about `srv-tftpboot-mnt-openqa.mount` failing since 2025-03-16 3:30:

```
netboot:~ # systemctl --failed
  UNIT                          LOAD   ACTIVE SUB    DESCRIPTION
● srv-tftpboot-mnt-openqa.mount loaded failed failed /srv/tftpboot/mnt/openqa

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
netboot:~ # systemctl status srv-tftpboot-mnt-openqa.mount
× srv-tftpboot-mnt-openqa.mount - /srv/tftpboot/mnt/openqa
     Loaded: loaded (/etc/fstab; generated)
     Active: failed (Result: timeout) since Sun 2025-03-16 03:33:26 UTC; 1 day 8h ago
      Where: /srv/tftpboot/mnt/openqa
       What: openqa.suse.de:/factory
       Docs: man:fstab(5)
             man:systemd-fstab-generator(8)
        CPU: 16ms

Mar 16 03:31:56 netboot systemd[1]: Mounting /srv/tftpboot/mnt/openqa...
Mar 16 03:33:26 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Mounting timed out. Terminating.
Mar 16 03:33:26 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Mount process exited, code=killed, status=15/TERM
Mar 16 03:33:26 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Failed with result 'timeout'.
Mar 16 03:33:26 netboot systemd[1]: Failed to mount /srv/tftpboot/mnt/openqa.
netboot:~ # uptime
 12:31:27 up 1 day  8:59,  1 user,  load average: 0.00, 0.00, 0.00
```

## Acceptance criteria

* **AC1:** netboot.qe.prg2.suse.org can reboot without any services failing afterwards

## Suggestions

* We already have some logic for workers; maybe it can be generalized for hosts like this one too - see https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/nfs_share.sls#L7-26
* See the history of https://gitlab.suse.de/openqa/salt-states-openqa/-/commits/master/openqa/nfs_share.sls for why a "simple retry" might not be enough
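
For checking AC1 after a reboot, something along these lines could turn the `systemctl --failed` output into a pass/fail result. This is only a sketch: the function names `parse_failed_units` and `failed_units` are hypothetical and not part of salt-states-openqa, and it assumes the check runs on the host itself with `systemctl` available.

```python
import subprocess

# Status markers systemctl may print in front of a unit name when --plain is not used.
BULLETS = {"●", "×", "*"}


def parse_failed_units(output: str) -> list[str]:
    """Return the unit names from `systemctl --failed --no-legend` style output."""
    units = []
    for line in output.splitlines():
        fields = line.split()
        # Drop a leading status marker such as "●" if present.
        if fields and fields[0] in BULLETS:
            fields = fields[1:]
        if fields:
            units.append(fields[0])
    return units


def failed_units() -> list[str]:
    """Ask systemd for failed units on the local host (hypothetical helper)."""
    result = subprocess.run(
        ["systemctl", "--failed", "--plain", "--no-legend"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_failed_units(result.stdout)
```

Calling `failed_units()` on netboot after a reboot and expecting an empty list would correspond to AC1. For the state shown in the observation it would return `["srv-tftpboot-mnt-openqa.mount"]`.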