action #135491
closed
fozzie and quinn unable to access PXE server or iPXE server (TFTP open timeout)
Description
Observation
fozzie and quinn are in NUE1. They failed to access the generic static iPXE menu (the same one as with O3). I changed their DHCP config to the kernel QA team's baremetal-support services (https://gitlab.suse.de/qa-sle/qanet-configs/-/merge_requests/75), but they still fail to reach this server.
"TFTP open timeout" is reported.
Could you please take a look?
Problem
atftpd on qanet is apparently stuck
Suggestions
- Try to identify stuck server processes, restart, lazy unmount NFS shares and such
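One way to look for stuck server processes is to scan `/proc` for tasks in uninterruptible sleep (state `D`), which on Linux also count toward the load average. The following is only a minimal sketch, not what was actually run on qanet; the function names `parse_stat` and `uninterruptible_processes` are made up for illustration, and it assumes a Linux host with `/proc` mounted.

```python
import os


def parse_stat(stat_line):
    """Parse the start of a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return a list of (pid, comm) for processes currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited while we were reading, or odd content
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling `uninterruptible_processes()` as root on the affected host would list candidates such as stuck atftpd instances. The sketch only inspects top-level PIDs; individual threads live under `/proc/<pid>/task/` and are not covered here.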
Rollback steps
- Unsilence alert
alertname=Packet loss between worker hosts and other hosts alert
Updated by livdywan over 1 year ago
- Status changed from New to In Progress
- Assignee set to nicksinger
- Target version set to Ready
We've not had a chance to estimate it properly, but let's rather show what is keeping team members busy and re-assess later if it turns out to be more than a short task.
Updated by okurz over 1 year ago
I tried a manual download of pxelinux.0 over TFTP, which yields "Transfer timed out.". I am running a system update of qanet first.
# w
14:35:14 up 60 days, 22:44, 0 users, load average: 105.33, 103.77, 102.52
and many atftpd processes in D state. But I see no problem with the NFS mount point /mnt/openqa. Meanwhile the update finished. In htop I found a lot of atftpd processes in D state, but surprisingly I could not see those processes in either top or ps output. Anyway, I killed one atftpd process in htop, which made all of them vanish, and the load of the system decreased. However, even after restarting atftpd I could not get a file. I triggered a reboot of qanet.
EDIT: And now qanet does not come up again.
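For reference, the manual TFTP check mentioned above could be scripted roughly as follows. This is a hedged sketch built on the TFTP read request format from RFC 1350, not the command that was actually used; `build_rrq` and `probe_tftp` are illustrative names, and the default port 69 and the 5 second timeout are assumptions.

```python
import socket
import struct


def build_rrq(filename, mode="octet"):
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL."""
    return (struct.pack("!H", 1)
            + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def probe_tftp(host, filename="pxelinux.0", port=69, timeout=5.0):
    """Send a single read request and describe the first reply, if any."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (host, port))
        try:
            # A DATA packet carries a 4 byte header plus up to 512 bytes.
            reply, _addr = sock.recvfrom(4 + 512)
        except socket.timeout:
            return "timeout"
    if len(reply) < 4:
        return "malformed reply"
    opcode = struct.unpack("!H", reply[:2])[0]
    if opcode == 3:
        return "data"  # server started the transfer
    if opcode == 5:
        message = reply[4:].split(b"\0", 1)[0]
        return "error: " + message.decode("ascii", "replace")
    return "unexpected opcode %d" % opcode
```

For example, `probe_tftp("qanet.qa.suse.de")` would return `"timeout"` while the server does not answer and `"data"` once it serves the file again. The probe never acknowledges the first block, so the server will simply give up on that transfer after its own retries; this only checks reachability and does not download the file.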
Updated by okurz over 1 year ago
https://suse.slack.com/archives/C02CANHLANP/p1694437437320729
(Oliver Kurz) anyone near to NUE1 Maxtorhof that can help to recover qanet.qa.suse.de not coming up after reboot? Likely stuck on NFS share or something
(Oliver Kurz) @Marius Kittler maybe?
EDIT: With the help of mkittler qanet.qa.suse.de could be brought up again and after boot the system could happily respond to TFTP download requests again.
Updated by Julie_CAO over 1 year ago
- Status changed from In Progress to Resolved
Verified that these two machines work with the new iPXE over the kernel QA team's baremetal-service:
https://openqa.suse.de/tests/12088748
https://openqa.suse.de/tests/12088801
Many thanks to @okurz for the quick fix! Marking the ticket as resolved.