action #179032
openmachine netboot.qe.prg2.suse.org can randomly fail "srv-tftpboot-mnt-openqa.mount"-unit
Status: New
Priority: Low
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2025-03-17
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
After the VMs on qamaster were recovered, we received an alert about failing services on netboot: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?viewPanel=panel-6&orgId=1&from=2025-03-17T10:31:21.141Z&to=2025-03-17T12:46:04.738Z&timezone=UTC
The alert concerned srv-tftpboot-mnt-openqa.mount, which has been failing since 2025-03-16 03:30:
netboot:~ # systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● srv-tftpboot-mnt-openqa.mount loaded failed failed /srv/tftpboot/mnt/openqa
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
netboot:~ # systemctl status srv-tftpboot-mnt-openqa.mount
× srv-tftpboot-mnt-openqa.mount - /srv/tftpboot/mnt/openqa
Loaded: loaded (/etc/fstab; generated)
Active: failed (Result: timeout) since Sun 2025-03-16 03:33:26 UTC; 1 day 8h ago
Where: /srv/tftpboot/mnt/openqa
What: openqa.suse.de:/factory
Docs: man:fstab(5)
man:systemd-fstab-generator(8)
CPU: 16ms
Mar 16 03:31:56 netboot systemd[1]: Mounting /srv/tftpboot/mnt/openqa...
Mar 16 03:33:26 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Mounting timed out. Terminating.
Mar 16 03:33:26 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Mount process exited, code=killed, status=15/TERM
Mar 16 03:33:26 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Failed with result 'timeout'.
Mar 16 03:33:26 netboot systemd[1]: Failed to mount /srv/tftpboot/mnt/openqa.
netboot:~ # uptime
12:31:27 up 1 day 8:59, 1 user, load average: 0.00, 0.00, 0.00
Acceptance criteria
- AC1: netboot.qe.prg2.suse.org can reboot without any services failing afterwards
Suggestions
- We already have some logic for this on workers; maybe it can be generalized to hosts like this one too. See https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/nfs_share.sls#L7-26
- See the history of https://gitlab.suse.de/openqa/salt-states-openqa/-/commits/master/openqa/nfs_share.sls for why a simple retry might not be enough
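For reference, the "simple retry" mentioned above could look roughly like the sketch below. This is only an illustration and not what nfs_share.sls actually does; the helper name retry_with_backoff and its arguments (maximum attempts, initial delay in seconds) are made up for this example. As the linked history suggests, a retry like this may not be sufficient on its own.

```shell
#!/bin/sh
# Hypothetical helper (not taken from salt-states-openqa):
# run a command until it succeeds, doubling the wait between attempts.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        # Give up once the attempt budget is used up.
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

On netboot this could be called as `retry_with_backoff 5 10 systemctl start srv-tftpboot-mnt-openqa.mount`, where the attempt count and delay are arbitrary example values.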
Updated by nicksinger 2 months ago
- Tags set to infra, salt, reactive work
- Description updated (diff)
- Category set to Regressions/Crashes
Updated by nicksinger 2 months ago
- Related to action #178972: [s390x][s390zl13][tools] nfs mount to openqa.suse.de is missing size:S added