action #135491

closed

fozzie and quinn unable to access PXE server or iPXE server (TFTP open timeout)

Added by Julie_CAO over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Start date:
2023-09-11
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

fozzie and quinn are in NUE1. They failed to access the generic static iPXE menu (the same one as with O3). I changed their DHCP config to point to the kernel QA team's baremetal-support service (https://gitlab.suse.de/qa-sle/qanet-configs/-/merge_requests/75), but they still failed to reach this server.

"TFTP open timeout" is reported:
(screenshot: TFTP_error.png)

Could you please take a look?

Problem

atftpd on qanet was apparently stuck

Suggestions

  • Try to identify stuck server processes, restart them, lazily unmount NFS shares, and so on

Rollback steps

  • Unsilence the alert with alertname="Packet loss between worker hosts and other hosts alert"

Files

TFTP_error.png (50.4 KB) Julie_CAO, 2023-09-11 09:30
Actions #1

Updated by livdywan over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
  • Target version set to Ready

We have not had a chance to estimate this properly, but let's rather show what is keeping team members busy and re-assess later if it turns out to be more than a short task.

Actions #2

Updated by okurz over 1 year ago

  • Tags set to infra
Actions #3

Updated by okurz over 1 year ago

I tried a manual download of pxelinux.0 over TFTP, which yields "Transfer timed out.". I am running a system update of qanet first.
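The manual download check can be approximated without a TFTP client. The following is a minimal sketch, assuming plain RFC 1350 TFTP on UDP port 69 with no option negotiation; the function names build_rrq, classify_reply and probe_tftp are hypothetical and not part of any tool used in this ticket. It sends a single read request and reports whether the first reply is data, an error, or nothing at all.

```python
import socket
import struct

# TFTP opcodes from RFC 1350
OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a read request: opcode 1, then NUL-terminated filename and mode."""
    return (struct.pack("!H", OP_RRQ)
            + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")


def classify_reply(packet: bytes) -> str:
    """Classify the first reply packet as 'data', 'error: <msg>' or 'unexpected'."""
    if len(packet) < 4:
        return "unexpected"
    opcode = struct.unpack("!H", packet[:2])[0]
    if opcode == OP_DATA:
        return "data"
    if opcode == OP_ERROR:
        # Layout: 2 bytes opcode, 2 bytes error code, NUL-terminated message
        msg = packet[4:].split(b"\x00", 1)[0].decode("ascii", "replace")
        return f"error: {msg}"
    return "unexpected"


def probe_tftp(host: str, filename: str = "pxelinux.0",
               port: int = 69, timeout: float = 5.0) -> str:
    """Send one read request and return 'timeout' if no reply arrives in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (host, port))
        try:
            packet, _addr = sock.recvfrom(4 + 512)
        except socket.timeout:
            return "timeout"
    return classify_reply(packet)
```

Calling probe_tftp("qanet.qa.suse.de") would be expected to return "timeout" while the server is in the state described here, and "data" once the server answers again. The sketch does not acknowledge the data block, so the server will retransmit a few times before giving up on the transfer.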

# w
 14:35:14 up 60 days, 22:44,  0 users,  load average: 105.33, 103.77, 102.52

and many atftpd processes in D state. But I see no problem with the NFS mount point /mnt/openqa. Meanwhile the update finished. In htop I found a lot of atftpd processes in D state, but surprisingly I could not see those processes in either top or ps output. Anyway, I killed one atftpd process in htop, which made all of them vanish, and the load of the system decreased. However, even after restarting atftpd I could not get a file. I then triggered a reboot of qanet.
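To look for tasks in D state (uninterruptible sleep) independently of htop, top or ps, the Linux /proc filesystem can be read directly. The following is a minimal sketch, assuming a Linux host; the function names parse_stat_line and find_d_state are hypothetical. It walks /proc/<pid>/task/<tid>/stat so that individual threads are included as well. Whether threads explain the difference between the htop and ps output seen here is only a guess and was not verified in this ticket.

```python
import os


def parse_stat_line(line: str) -> tuple[int, str, str]:
    """Split a /proc stat line into (id, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' instead of on whitespace.
    """
    head, _, tail = line.rpartition(")")
    id_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(id_str), comm, state


def find_d_state(name: str = "atftpd", proc: str = "/proc") -> list[tuple[int, int]]:
    """Return (pid, tid) pairs of tasks named `name` that are in state D."""
    hits = []
    for pid in filter(str.isdigit, os.listdir(proc)):
        task_dir = os.path.join(proc, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            stat_path = os.path.join(task_dir, tid, "stat")
            try:
                with open(stat_path, errors="replace") as fh:
                    _, comm, state = parse_stat_line(fh.read())
            except OSError:
                continue  # task exited while we were scanning
            if comm == name and state == "D":
                hits.append((int(pid), int(tid)))
    return hits
```

Running find_d_state() on the affected host would list the process and thread IDs of every atftpd task currently blocked in uninterruptible sleep; an empty list means none were found at the time of the scan.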

EDIT: And now qanet does not come up again.

Actions #4

Updated by okurz over 1 year ago

  • Assignee changed from nicksinger to okurz
Actions #5

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #6

Updated by okurz over 1 year ago

https://suse.slack.com/archives/C02CANHLANP/p1694437437320729

(Oliver Kurz) anyone near to NUE1 Maxtorhof that can help to recover qanet.qa.suse.de not coming up after reboot? Likely stuck on NFS share or something
(Oliver Kurz) @Marius Kittler maybe?

EDIT: With the help of mkittler, qanet.qa.suse.de could be brought up again, and after boot the system happily responds to TFTP download requests again.

Actions #7

Updated by okurz over 1 year ago

  • Priority changed from Normal to High
Actions #8

Updated by Julie_CAO over 1 year ago

  • Status changed from In Progress to Resolved

Verified that these two machines work with the new iPXE over the kernel QA team's baremetal-service:
https://openqa.suse.de/tests/12088748
https://openqa.suse.de/tests/12088801

Many thanks to @okurz for the quick fix! Marking the ticket as resolved.
