action #179032

Updated by nicksinger 2 months ago

## Observation

 After the VMs on qamaster were recovered, we received an alert about failing services on netboot: https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?viewPanel=panel-6&orgId=1&from=2025-03-17T10:31:21.141Z&to=2025-03-17T12:46:04.738Z&timezone=UTC 
 The alert was for `srv-tftpboot-mnt-openqa.mount`, which timed out while mounting and has been in a failed state since 2025-03-16 03:33 UTC: 

 ``` 
 netboot:~ # systemctl --failed 
   UNIT                            LOAD     ACTIVE SUB      DESCRIPTION 
 ● srv-tftpboot-mnt-openqa.mount loaded failed failed /srv/tftpboot/mnt/openqa 

 LOAD     = Reflects whether the unit definition was properly loaded. 
 ACTIVE = The high-level unit activation state, i.e. generalization of SUB. 
 SUB      = The low-level unit activation state, values depend on unit type. 
 1 loaded units listed. 
 netboot:~ # systemctl status srv-tftpboot-mnt-openqa.mount 
 × srv-tftpboot-mnt-openqa.mount - /srv/tftpboot/mnt/openqa 
      Loaded: loaded (/etc/fstab; generated) 
      Active: failed (Result: timeout) since Sun 2025-03-16 03:33:26 UTC; 1 day 8h ago 
       Where: /srv/tftpboot/mnt/openqa 
        What: openqa.suse.de:/factory 
        Docs: man:fstab(5) 
              man:systemd-fstab-generator(8) 
         CPU: 16ms 

 Mar 16 03:31:56 netboot systemd[1]: Mounting /srv/tftpboot/mnt/openqa... 
 Mar 16 03:33:26 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Mounting timed out. Terminating. 
 Mar 16 03:33:26 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Mount process exited, code=killed, status=15/TERM 
 Mar 16 03:33:26 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Failed with result 'timeout'. 
 Mar 16 03:33:26 netboot systemd[1]: Failed to mount /srv/tftpboot/mnt/openqa. 
 netboot:~ # uptime 
  12:31:27    up 1 day    8:59,    1 user,    load average: 0.00, 0.00, 0.00 
 ``` 

 ## Acceptance criteria 
 * **AC1:** netboot.qe.prg2.suse.org can reboot without any services failing afterwards 
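
 One possible way to check AC1 after a reboot is sketched below. It assumes that "no services failing" can be read as "`systemctl` lists no units in the failed state". The function names `failed_units` and `check_no_failed_units` are hypothetical helpers, not part of any existing tooling.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not existing tooling.

# Print the names of all failed units, one per line.
# --no-legend drops the header and footer, --plain drops the bullet marker.
failed_units() {
    systemctl list-units --state=failed --no-legend --plain | awk '{print $1}'
}

# Read unit names on stdin; succeed only if there are none.
check_no_failed_units() {
    units=$(cat)
    if [ -n "$units" ]; then
        printf 'failed units:\n%s\n' "$units" >&2
        return 1
    fi
    echo "no failed units"
}
```

 After rebooting netboot, running `failed_units | check_no_failed_units` would return a non-zero exit status and list the offending units if something like `srv-tftpboot-mnt-openqa.mount` failed again.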

 ## Suggestions 

 * We already have some logic for this on workers; maybe it can be generalized to hosts like this one. See https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/nfs_share.sls#L7-26 
 * See the history of https://gitlab.suse.de/openqa/salt-states-openqa/-/commits/master/openqa/nfs_share.sls for why a simple retry might not be enough
