action #127256

closed

missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M

Added by MMoese over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2023-04-05
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Randomly, the baremetal machines in NUE-FC-B 2 (https://racktables.nue.suse.com/index.php?page=rack&rack_id=19178) don't receive nameservers from DHCP. They receive an IP address, a default route, and even DNS search domains, but /etc/resolv.conf does not contain any nameserver entries.
It seems to affect at least all machines in this rack; I'm not sure about others. Restarting wicked manually usually resolves the issue.
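
The symptom described above can be checked mechanically. A minimal sketch in Python, assuming the standard resolv.conf syntax; the function name and the sample addresses (taken from the 192.0.2.0/24 documentation range) are illustrative and do not come from the affected machines:

```python
def nameservers(resolv_conf_text: str) -> list[str]:
    """Return the addresses from all 'nameserver' lines of resolv.conf content."""
    servers = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Hypothetical file content for the broken case: search domains but no nameserver.
broken = "search example.org\n"
# Hypothetical file content for the working case.
working = "search example.org\nnameserver 192.0.2.1\nnameserver 192.0.2.2\n"

print(nameservers(broken))
print(nameservers(working))
```

On a live machine the input could be read with `open("/etc/resolv.conf").read()`; an empty result corresponds to the failure described here.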


Files

dns-in-ack.pcap (3.57 KB), pcervinka, 2023-04-26 12:49
no-dns-in-ack.pcap (3.51 KB), pcervinka, 2023-04-26 12:49

Related issues 3 (0 open, 3 closed)

Related to openQA Project (public) - action #126188: [openQA][infra][worker][sut] openQA infra performance fluctuates to the level that that leads to tangible test run failure size:M | Resolved | mkittler | 2023-03-20

Related to openQA Infrastructure (public) - action #125744: [tools][alert][FIRING:1] (Failed systemd services alert (except openqa.suse.de) QDG8aXAVz) due to openqa-piworker.qa.suse.de unable to reach openqa.suse.de | Resolved | dheidler | 2023-03-10

Blocks openQA Infrastructure (public) - action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M | Resolved | mkittler | 2023-01-02 | 2023-05-12

Actions #1

Updated by mkittler over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

We've been observing the problem of DNS not working on scooter as well, see #126188. It is in another rack but in the same server room. I suppose the problem mentioned in the last paragraph of #126188#note-23 applies here as well. So we don't have access to that DHCP server. Likely the best we can do is to create an Eng-Infra ticket describing the problem. (There's already https://sd.suse.com/servicedesk/customer/portal/1/SD-113959 for us getting access in general, but until then we should likely create a ticket for the immediate problem.)

Actions #2

Updated by mkittler over 1 year ago

  • Related to action #126188: [openQA][infra][worker][sut] openQA infra performance fluctuates to the level that that leads to tangible test run failure size:M added
Actions #3

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback
Actions #4

Updated by mkittler over 1 year ago

Maybe even #122983#note-37 is related.

Actions #5

Updated by okurz over 1 year ago

  • Tags set to infra
  • Target version set to Ready
Actions #6

Updated by mkittler over 1 year ago

  • Blocks action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M added
Actions #7

Updated by mkittler over 1 year ago

  • Tags deleted (infra)
  • Target version deleted (Ready)

It looks like openqaworker1 is affected as well. Since it only happens randomly, the main host has a valid DNS server configured. However, when running tests, one eventually runs into it inside a VM.

Actions #8

Updated by okurz over 1 year ago

  • Tags set to infra
  • Target version set to Ready
Actions #9

Updated by livdywan over 1 year ago

  • Subject changed from missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 to missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M
  • Status changed from Feedback to Blocked

Updated by pcervinka over 1 year ago

This issue is really a test blocker. I investigated on the test server itself, capturing traffic with tcpdump during a wicked restart.

I have two files:

  • dns-in-ack.pcap - with DNS
  • no-dns-in-ack.pcap - missing DNS

You can load a file into Wireshark and use the filter dhcp.option.type == 6 to find the option with the DNS servers. They are missing in no-dns-in-ack.pcap.
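
The same check can be sketched without Wireshark. The following Python assumes the raw DHCP options field has already been extracted from an ACK packet (the bytes that follow the magic cookie 99.130.83.99); extracting it from the pcap files is not shown. The function name and the sample bytes are hypothetical and are not taken from the attached captures. Option code 6 carries the DNS server list as consecutive 4-byte IPv4 addresses (RFC 2132).

```python
import ipaddress

OPT_PAD = 0    # single-byte padding option, no length field
OPT_DNS = 6    # Domain Name Server option
OPT_END = 255  # marks the end of the options field


def dhcp_dns_servers(options: bytes) -> list[str]:
    """Walk the code/length/value options and return the addresses in option 6."""
    servers = []
    i = 0
    while i < len(options):
        code = options[i]
        if code == OPT_END:
            break
        if code == OPT_PAD:
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option header
        length = options[i + 1]
        data = options[i + 2 : i + 2 + length]
        if code == OPT_DNS:
            # Only complete 4-byte groups are decoded.
            for j in range(0, len(data) - len(data) % 4, 4):
                servers.append(str(ipaddress.IPv4Address(data[j : j + 4])))
        i += 2 + length
    return servers


# Hypothetical ACK options: message type (53) = 5 (ACK), two DNS servers, end.
with_dns = bytes([53, 1, 5, 6, 8, 192, 0, 2, 1, 192, 0, 2, 2, 255])
# Hypothetical ACK options without option 6.
without_dns = bytes([53, 1, 5, 255])

print(dhcp_dns_servers(with_dns))
print(dhcp_dns_servers(without_dns))
```

An empty list corresponds to the no-dns-in-ack.pcap case, where the filter above finds nothing.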

Actions #12

Updated by pcervinka over 1 year ago

Sometimes even a wicked restart does not help (https://openqa.suse.de/tests/10986078#step/add_repositories/10) and it needs to be done more than once.
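
As a stopgap, the repeated restart could be expressed as a bounded retry loop. This is only a sketch of such a workaround, not something the tests are known to use; the function and parameter names are hypothetical, and the check and restart steps are passed in as callables so the loop itself makes no assumption about the system:

```python
import time
from typing import Callable


def restart_until_ok(
    has_dns: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int = 5,
    delay_s: float = 0.0,
) -> int:
    """Call restart() until has_dns() is true; return the number of restarts used.

    Raises RuntimeError if DNS is still missing after max_restarts restarts.
    """
    restarts = 0
    while not has_dns():
        if restarts >= max_restarts:
            raise RuntimeError(f"no nameserver after {restarts} restarts")
        restart()
        restarts += 1
        time.sleep(delay_s)  # give the DHCP exchange time to complete
    return restarts


# Simulated run: DNS only shows up after the second restart.
state = {"restarts": 0}

def fake_restart() -> None:
    state["restarts"] += 1

print(restart_until_ok(lambda: state["restarts"] >= 2, fake_restart))
```

In practice, `restart` might wrap `subprocess.run(["systemctl", "restart", "wicked"], check=True)` and `has_dns` might inspect /etc/resolv.conf; both are assumptions about the setup rather than details from this ticket.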

Actions #13

Updated by okurz over 1 year ago

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3456 was merged and is deployed to both our DHCP servers, walter1.qe.nue2.suse.org and walter2.qe.nue2.suse.org. We assume this fixes the problem.

Actions #14

Updated by mkittler over 1 year ago

  • Status changed from Blocked to Feedback

So no longer blocked. It would be nice if you could confirm that it works now.

Actions #15

Updated by MMoese over 1 year ago

So far, I did not see it happen again, but I'll re-trigger some more jobs to verify.

Actions #16

Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

So let's assume the problem is gone as we haven't heard more.

Actions #17

Updated by MMoese over 1 year ago

It looks like the problem is gone, yes.

Actions #18

Updated by dheidler over 1 year ago

  • Related to action #125744: [tools][alert][FIRING:1] (Failed systemd services alert (except openqa.suse.de) QDG8aXAVz) due to openqa-piworker.qa.suse.de unable to reach openqa.suse.de added