action #120339

QEMU DNS fails to resolve openqa.suse.de via IP address

Added by MDoucha 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2022-11-11
Due date:
% Done: 0%
Estimated time:
Description

Observation

LTP test host started failing today. The QEMU DNS service running at 10.0.2.3 correctly resolves hostnames to IP addresses but reverse lookup fails. Old tests which passed up until yesterday are now also failing upon restart so this appears to be a QEMU configuration issue. The physical worker machine can resolve IP addresses without issue.

This issue is confirmed on worker3, worker5, worker8 and worker13. Other workers may be affected as well. PPC64LE QEMU workers do not seem to be affected, though.

Rollback steps

  • DONE Revert removal of faulty DNS
sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'sudo sed -i "s/\(NETCONFIG_DNS_POLICY=\)\"\"/\1\"auto\"/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)\"10.160.0.1 10.100.2.10\"/\1\"\"/" /etc/sysconfig/network/config && sudo netconfig update -f'
openqa-host-fail.png (27.9 KB) openqa-host-fail.png manual test on worker5:7 MDoucha, 2022-11-11 12:39
screenshot_20221116_165649.png (49.7 KB) screenshot_20221116_165649.png mkittler, 2022-11-16 15:57
screenshot_20221116_172722.png (62.3 KB) screenshot_20221116_172722.png mkittler, 2022-11-16 16:27
screenshot_20221116_174604.png (45.8 KB) screenshot_20221116_174604.png mkittler, 2022-11-16 16:46

History

#1 Updated by okurz 3 months ago

  • Priority changed from Normal to Urgent
  • Target version set to Ready

As ppc64le workers do not seem to be affected I assume this is due to #119443

#2 Updated by okurz 3 months ago

  • Parent task set to #116623

#3 Updated by nicksinger 3 months ago

  • Assignee set to nicksinger

#5 Updated by msmeissn 3 months ago

this blocks important maintenance updates, please work / fix this ASAP

#6 Updated by cdywan 3 months ago

  • Priority changed from Urgent to Immediate

msmeissn wrote:

this blocks important maintenance updates, please work / fix this ASAP

Seems this is Immediate then

#7 Updated by mkittler 3 months ago

Since the ticket description wasn't clear about what was tested on the physical machine I tried the following:

martchus@worker3:~> host worker2.oqa.suse.de
worker2.oqa.suse.de has address 10.137.10.2
worker2.oqa.suse.de has IPv6 address 2a07:de40:a203:12:2e60:cff:fe73:2ac
martchus@worker3:~> host 10.137.10.2
2.10.137.10.in-addr.arpa domain name pointer worker2.oqa.suse.de.
martchus@worker3:~> host worker5.oqa.suse.de
worker5.oqa.suse.de has address 10.137.10.5
worker5.oqa.suse.de has IPv6 address 2a07:de40:a203:12:56ab:3aff:fe24:358d
martchus@worker3:~> host 10.137.10.5
5.10.137.10.in-addr.arpa domain name pointer worker5.oqa.suse.de.

So it works and might therefore indeed be a QEMU problem.


We haven't had a QEMU or libslirp0 update recently. We also haven't changed the way we invoke QEMU - at least I'm not aware of any change in os-autoinst.

#8 Updated by mkittler 3 months ago

I've restarted the test: https://openqa.suse.de/tests/9975697

It was scheduled on openqaworker14 where I could reproduce the issue by connecting to the VM manually:

#9 Updated by mkittler 3 months ago

openqaworker14 cannot even do the reverse lookup itself. I've scheduled https://openqa.suse.de/tests/9975736 to run explicitly on worker8 to re-conduct the test there.

#10 Updated by MDoucha 3 months ago

mkittler wrote:

I've restarted the test: https://openqa.suse.de/tests/9975697

It was scheduled on openqaworker14 where I could reproduce the issue by connecting to the VM manually

Note that the error (NXDOMAIN) is different from the original failure (SERVFAIL). The LTP test itself actually passed there:
https://openqa.suse.de/tests/9975697#step/host/8
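
To keep the two failure modes apart when re-checking, the output of `host` can be classified with a few lines of shell. This is only a sketch: the function name `classify_host_output` is made up for illustration, and it assumes the usual `host` wording ("not found: 2(SERVFAIL)" / "not found: 3(NXDOMAIN)").

```shell
#!/bin/sh
# classify_host_output (hypothetical helper): read `host` output on stdin
# and print one of: SERVFAIL, NXDOMAIN, ok, unknown.
classify_host_output() {
    out=$(cat)
    case "$out" in
        *"(SERVFAIL)"*) echo SERVFAIL ;;   # the server could not answer
        *"(NXDOMAIN)"*) echo NXDOMAIN ;;   # the name does not exist
        *"domain name pointer"*|*"has address"*|*"has IPv6 address"*) echo ok ;;
        *) echo unknown ;;
    esac
}

# Possible use inside the VM, against the QEMU DNS service from the description:
#   host 10.160.0.207 10.0.2.3 | classify_host_output
```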

#11 Updated by mkittler 3 months ago

I've checked in a QEMU-VM on worker8 and reverse DNS generally works but not for the OSD domain:

It doesn't work for IPv4 either, so this is not an IPv6-only issue.

On the worker directly this particular request works:

martchus@worker8:~> host openqa.suse.de
openqa.suse.de has address 10.160.0.207
openqa.suse.de has IPv6 address 2620:113:80c0:8080:10:160:0:207
martchus@worker8:~> host 2620:113:80c0:8080:10:160:0:207
7.0.2.0.0.0.0.0.0.6.1.0.0.1.0.0.0.8.0.8.0.c.0.8.3.1.1.0.0.2.6.2.ip6.arpa domain name pointer openqa.suse.de.

#12 Updated by mkittler 3 months ago

It seems to depend on the DNS server:

When the other DNS servers are commented out in /etc/resolv.conf on the worker host¹, the VM can reverse-resolve the OSD IP normally using the "QEMU provided" default DNS server. So it really is just the DNS server.


¹

#nameserver 10.160.2.88
nameserver 10.160.0.1
#nameserver 10.100.2.10

(But 10.100.2.10 works as well. Only 10.160.2.88 seems bad.)
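
Instead of commenting servers in and out by hand, the active entries could be read from the file and each one queried in turn. A minimal sketch, assuming a standard resolv.conf layout; `active_nameservers` is a hypothetical helper name, and the demo runs on a scratch file rather than the real /etc/resolv.conf.

```shell
#!/bin/sh
# active_nameservers FILE (hypothetical helper): print the address of every
# uncommented "nameserver" line in a resolv.conf-style file, one per line.
active_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "$1"
}

# Demo on a scratch file that mirrors the footnote above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
#nameserver 10.160.2.88
nameserver 10.160.0.1
#nameserver 10.100.2.10
EOF
active_nameservers "$tmp"    # prints 10.160.0.1
rm -f "$tmp"

# On a worker, each active server could then be tried for the reverse lookup:
#   for ns in $(active_nameservers /etc/resolv.conf); do
#       host 10.160.0.207 "$ns"
#   done
```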

#13 Updated by mkittler 3 months ago

So it would presumably help to either fix 10.160.2.88 or to remove that nameserver from /etc/resolv.conf. Note that the "bad" nameserver is present in /etc/resolv.conf on OSD itself (old "zone") and on workers in the new security zone.

#14 Updated by kraih 3 months ago

Since this is urgent, #help-it-ama on Slack has also been notified: https://suse.slack.com/archives/C029APBKLGK/p1668618504701409

#15 Updated by tinita 3 months ago

Another thing we realized is that the reverse lookup for workers is OK; only the reverse lookup for openqa.suse.de gets no result from 10.160.2.88:

% host openqa.suse.de 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases: 

openqa.suse.de has address 10.160.0.207
openqa.suse.de has IPv6 address 2620:113:80c0:8080:10:160:0:207
% host 10.160.0.207 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases: 

Host 207.0.160.10.in-addr.arpa not found: 2(SERVFAIL)

% host worker11.oqa.suse.de 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases: 

worker11.oqa.suse.de has address 10.137.10.11
worker11.oqa.suse.de has IPv6 address 2a07:de40:a203:12:ec4:7aff:fe7a:7896
% host 10.137.10.11 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases: 

11.10.137.10.in-addr.arpa domain name pointer worker11.oqa.suse.de.
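
For reference, the names queried in the reverse lookups above are just the IPv4 octets in reverse order under in-addr.arpa. A small sketch that builds them (the function name `ptr_name` is made up for illustration; IPv4 only):

```shell
#!/bin/sh
# ptr_name IPV4 (hypothetical helper): print the in-addr.arpa name that a
# reverse lookup for the given dotted-quad address queries.
ptr_name() {
    echo "$1" | awk -F. 'NF == 4 { print $4 "." $3 "." $2 "." $1 ".in-addr.arpa" }'
}

ptr_name 10.160.0.207   # prints 207.0.160.10.in-addr.arpa
ptr_name 10.137.10.11   # prints 11.10.137.10.in-addr.arpa

# The PTR record could then be requested explicitly from one server, e.g.:
#   host -t PTR "$(ptr_name 10.160.0.207)" 10.160.2.88
```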

#16 Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee changed from nicksinger to okurz
  • Priority changed from Immediate to Urgent

I also reported this in https://sd.suse.com/servicedesk/customer/portal/1/SD-104548 now.

And I patched workers with

sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'sed -i "s/\(NETCONFIG_DNS_POLICY=\)\"auto\"/\1\"\"/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)\"\"/\1\"10.160.0.1 10.100.2.10\"/" /etc/sysconfig/network/config && sed -i "/nameserver 10.160.2.88/d" /etc/resolv.conf'

Added the corresponding rollback steps to the description. Unfortunately I can't use auto-review, as the error output is not shown in autoinst-log.txt but only in serial_terminal.txt.
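
The two sed expressions (patch and rollback) can be tried on sample text before touching /etc/sysconfig/network/config. A sketch only: the wrapper functions `apply_patch` and `rollback_patch` are hypothetical, the sample assumes the file held the `"auto"` policy and an empty static server list beforehand, and the quoting is simplified because no salt `cmd.run` wrapper is involved.

```shell
#!/bin/sh
# apply_patch (hypothetical): same substitutions as the worker patch,
# reading config text on stdin and writing the result to stdout.
apply_patch() {
    sed 's/\(NETCONFIG_DNS_POLICY=\)"auto"/\1""/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)""/\1"10.160.0.1 10.100.2.10"/'
}

# rollback_patch (hypothetical): same substitutions as the rollback step.
rollback_patch() {
    sed 's/\(NETCONFIG_DNS_POLICY=\)""/\1"auto"/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)"10.160.0.1 10.100.2.10"/\1""/'
}

# Assumed state of the two relevant lines before the patch.
sample='NETCONFIG_DNS_POLICY="auto"
NETCONFIG_DNS_STATIC_SERVERS=""'

echo "--- patched"
printf '%s\n' "$sample" | apply_patch
echo "--- patched, then rolled back"
printf '%s\n' "$sample" | apply_patch | rollback_patch
```

Note that the rollback sets the policy to "auto" whenever it is empty, so it only restores the original state on hosts that had "auto" before the patch.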

#17 Updated by okurz 3 months ago

  • Status changed from In Progress to Blocked

openqa-label-all -vvv --module host --label https://progress.opensuse.org/issues/120339 should work:

DEBUG:/usr/bin/openqa-label-all:args: Namespace(build=None, dry_run=False, groupid=None, label='https://progress.opensuse.org/issues/120339', module='host', no_restart=False, openqa_host='https://openqa.suse.de', result='failed', verbose=4)
DEBUG:/usr/bin/openqa-label-all:Retrieving comments on job 9927713
DEBUG:/usr/bin/openqa-label-all:Comments: 
DEBUG:/usr/bin/openqa-label-all:9927713 has label: False
…

Now it looks like other issues are still blocking, e.g. https://openqa.suse.de/tests/9938116#step/execve06_postun/7 . Don't know how to handle that.

blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-104548

#18 Updated by cdywan 3 months ago

  • Status changed from Blocked to Feedback

okurz wrote:

Now it looks like other issues are still blocking, e.g. https://openqa.suse.de/tests/9938116#step/execve06_postun/7 . Don't know how to handle that.

blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-104548

Errors in the zone file [..] Works now.

#19 Updated by mkittler 3 months ago

Looks like reverse DNS via 10.160.2.88 works now (for hosts where it previously didn't).

I'm also not sure about the remaining test failures in the test okurz has mentioned. They seem to be unrelated to this DNS issue.

#20 Updated by nicksinger 3 months ago

Indeed resolved looking at manual dig output and also the latest test runs: https://openqa.suse.de/tests/9977119#step/host/6 - @MDoucha can you confirm this fixed your issues?

#21 Updated by okurz 3 months ago

  • Description updated (diff)

Executed the rollback steps

#22 Updated by okurz 3 months ago

  • Due date set to 2022-11-30
  • Priority changed from Urgent to High

#23 Updated by okurz 3 months ago

  • Due date deleted (2022-11-30)
  • Status changed from Feedback to Resolved

#24 Updated by MDoucha 2 months ago

nicksinger wrote:

Indeed resolved looking at manual dig output and also the latest test runs: https://openqa.suse.de/tests/9977119#step/host/6 - @MDoucha can you confirm this fixed your issues?

I haven't seen any new DNS failures in our tests this week, so I can confirm this issue is fixed.

The execve06_postun failures are expected, they're caused by intentionally rolling back a livepatch which fixes bsc#1204381.
