action #135491


fozzie and quinn unable to access PXE server or iPXE server (TFTP open timeout)

Added by Julie_CAO 9 months ago. Updated 9 months ago.


fozzie and quinn are in NUE1. They failed to access the generic static iPXE menu (the same one as with O3). I changed their DHCP config to point to the kernel QA team's baremetal-support services, but they still failed to reach this server.

"TFTP open timeout" is reported:

Could you please take a look?


atftpd on qanet is apparently stuck


  • Try to identify stuck server processes, restart, lazy unmount NFS shares and such

Rollback steps

  • Unsilence the alert alertname=Packet loss between worker hosts and other hosts


TFTP_error.png (50.4 KB) Julie_CAO, 2023-09-11 09:30
Actions #1

Updated by livdywan 9 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
  • Target version set to Ready

We have not had a chance to estimate this properly, but let's show what is keeping team members busy and reassess later if it turns out to be more than a short task.

Actions #2

Updated by okurz 9 months ago

  • Tags set to infra
Actions #3

Updated by okurz 9 months ago

I tried a manual download of pxelinux.0 over TFTP, which yields "Transfer timed out." I am running a system update of qanet first.
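For anyone who wants to repeat the manual TFTP check without a dedicated client, here is a minimal sketch that sends a single read request as defined in RFC 1350 and reports whether anything comes back. The function names, the 5 second timeout and the port default are my own assumptions and are not taken from the qanet setup; the host name passed in is whatever the TFTP server is reachable as.

```python
import socket
import struct

# Opcodes as defined in RFC 1350
OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL."""
    return (
        struct.pack("!H", OP_RRQ)
        + filename.encode("ascii") + b"\0"
        + mode.encode("ascii") + b"\0"
    )


def describe_reply(packet: bytes) -> str:
    """Summarise the first packet a TFTP server sent back."""
    if len(packet) < 4:
        return "malformed reply"
    opcode, number = struct.unpack("!HH", packet[:4])
    if opcode == OP_DATA:
        return f"ok: data block {number}, {len(packet) - 4} bytes"
    if opcode == OP_ERROR:
        message = packet[4:].split(b"\0", 1)[0].decode("ascii", "replace")
        return f"error {number}: {message}"
    return f"unexpected opcode {opcode}"


def probe_tftp(host: str, filename: str, port: int = 69, timeout: float = 5.0) -> str:
    """Send one read request and report whether the server answers in time.

    Hypothetical helper: only the first reply is inspected, the file is
    not downloaded completely.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (host, port))
        try:
            packet, _addr = sock.recvfrom(4 + 512)
        except socket.timeout:
            return "timeout: no reply to read request"
    return describe_reply(packet)
```

Calling probe_tftp with the server's host name and "pxelinux.0" should return a "timeout" string while the daemon is stuck and an "ok: data block 1" string once it serves files again.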

# w
 14:35:14 up 60 days, 22:44,  0 users,  load average: 105.33, 103.77, 102.52

There are also many atftpd processes in D state, but I see no problem with the NFS mount point /mnt/openqa. Meanwhile the update finished. In htop I found a lot of atftpd processes in D state, but surprisingly I could not see those processes in either top or ps output. Anyway, I killed one atftpd process in htop, which made all of them vanish, and the load of the system decreased. However, even after restarting atftpd I could not get a file, so I triggered a reboot of qanet.
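To cross-check what htop shows, the D state entries can also be read straight from /proc. The following is a minimal sketch with hypothetical function names. It walks the per-thread stat files, because htop lists threads by default while plain top and ps list only processes; whether that explains the discrepancy seen here is not confirmed.

```python
from pathlib import Path


def parse_stat(text: str):
    """Return (pid, comm, state) from the contents of a /proc/.../stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ")",
    # so split at the last ")" instead of splitting on whitespace.
    head, _, tail = text.rpartition(")")
    pid_text, _, comm = head.partition("(")
    return int(pid_text), comm, tail.split()[0]


def blocked_tasks(proc_root: str = "/proc"):
    """Yield (pid, tid, comm) for every thread in uninterruptible sleep (D)."""
    for task_stat in Path(proc_root).glob("[0-9]*/task/[0-9]*/stat"):
        try:
            tid, comm, state = parse_stat(task_stat.read_text())
        except (OSError, ValueError, IndexError):
            continue  # the task exited while we were scanning
        if state == "D":
            # Path layout is <proc_root>/<pid>/task/<tid>/stat
            pid = int(task_stat.parts[-4])
            yield pid, tid, comm
```

Looping over blocked_tasks() and printing the tuples gives one line per stuck thread. On Linux, tasks in D state count towards the load average, which fits the load of over 100 in the w output above.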

EDIT: And now qanet does not come up again.

Actions #4

Updated by okurz 9 months ago

  • Assignee changed from nicksinger to okurz
Actions #5

Updated by okurz 9 months ago

  • Description updated (diff)
Actions #6

Updated by okurz 9 months ago

(Oliver Kurz) Is anyone near NUE1 Maxtorhof who can help to recover qanet, which is not coming up after reboot? It is likely stuck on an NFS share or something similar.
(Oliver Kurz) @Marius Kittler maybe?

EDIT: With the help of mkittler, qanet could be brought up again, and after boot the system happily responded to TFTP download requests again.

Actions #7

Updated by okurz 9 months ago

  • Priority changed from Normal to High
Actions #8

Updated by Julie_CAO 9 months ago

  • Status changed from In Progress to Resolved

Verified that these two machines work with the new iPXE over the kernel QA team's baremetal-service.

Many thanks to @okurz for the quick fix! Marking the ticket as resolved.

