action #120339

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones

QEMU DNS fails to resolve openqa.suse.de via IP address

Added by MDoucha over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-11-11
Due date:
% Done: 0%
Estimated time:

Description

Observation

The LTP test "host" started failing today. The QEMU DNS service running at 10.0.2.3 correctly resolves hostnames to IP addresses, but reverse lookup fails. Old tests which passed up until yesterday now also fail when restarted, so this appears to be a QEMU configuration issue. The physical worker machine can resolve IP addresses without issue.

This issue is confirmed on worker3, worker5, worker8 and worker13. Other workers may be affected as well. PPC64LE QEMU workers do not seem to be affected, though.

Rollback steps

  • DONE Revert removal of faulty DNS
sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'sudo sed -i "s/\(NETCONFIG_DNS_POLICY=\)\"\"/\1\"auto\"/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)\"10.160.0.1 10.100.2.10\"/\1\"\"/" /etc/sysconfig/network/config && sudo netconfig update -f'

Files

openqa-host-fail.png (27.9 KB) manual test on worker5:7 MDoucha, 2022-11-11 12:39
screenshot_20221116_165649.png (49.7 KB) mkittler, 2022-11-16 15:57
screenshot_20221116_172722.png (62.3 KB) mkittler, 2022-11-16 16:27
screenshot_20221116_174604.png (45.8 KB) mkittler, 2022-11-16 16:46

Related issues 1 (0 open, 1 closed)

Related to QA - action #119443: Conduct the migration of SUSE openQA systems from Nbg SRV1 to new security zones size:M, Resolved, okurz, 2022-11-17

Actions #1

Updated by okurz over 1 year ago

  • Priority changed from Normal to Urgent
  • Target version set to Ready

As ppc64le workers do not seem to be affected I assume this is due to #119443

Actions #2

Updated by okurz over 1 year ago

  • Parent task set to #116623
Actions #3

Updated by nicksinger over 1 year ago

  • Assignee set to nicksinger
Actions #4

Updated by okurz over 1 year ago

  • Related to action #119443: Conduct the migration of SUSE openQA systems from Nbg SRV1 to new security zones size:M added
Actions #5

Updated by msmeissn over 1 year ago

this blocks important maintenance updates, please work / fix this ASAP

Actions #6

Updated by livdywan over 1 year ago

  • Priority changed from Urgent to Immediate

msmeissn wrote:

this blocks important maintenance updates, please work / fix this ASAP

Seems this is Immediate then

Actions #7

Updated by mkittler over 1 year ago

Since the ticket description wasn't clear about what was tested on the physical machine I tried the following:

martchus@worker3:~> host worker2.oqa.suse.de
worker2.oqa.suse.de has address 10.137.10.2
worker2.oqa.suse.de has IPv6 address 2a07:de40:a203:12:2e60:cff:fe73:2ac
martchus@worker3:~> host 10.137.10.2
2.10.137.10.in-addr.arpa domain name pointer worker2.oqa.suse.de.
martchus@worker3:~> host worker5.oqa.suse.de
worker5.oqa.suse.de has address 10.137.10.5
worker5.oqa.suse.de has IPv6 address 2a07:de40:a203:12:56ab:3aff:fe24:358d
martchus@worker3:~> host 10.137.10.5
5.10.137.10.in-addr.arpa domain name pointer worker5.oqa.suse.de.

So it works and might therefore indeed be a QEMU problem.


We haven't had a QEMU or libslirp0 update recently. We also haven't changed the way we invoke QEMU - at least I'm not aware of any change in os-autoinst.

Actions #8

Updated by mkittler over 1 year ago

I've restarted the test: https://openqa.suse.de/tests/9975697

It was scheduled on openqaworker14 where I could reproduce the issue by connecting to the VM manually:

Actions #9

Updated by mkittler over 1 year ago

openqaworker14 cannot even do the reverse lookup itself. I've scheduled https://openqa.suse.de/tests/9975736 to run explicitly on worker8 to re-conduct the test there.

Actions #10

Updated by MDoucha over 1 year ago

mkittler wrote:

I've restarted the test: https://openqa.suse.de/tests/9975697

It was scheduled on openqaworker14 where I could reproduce the issue by connecting to the VM manually

Note that the error (NXDOMAIN) is different from the original failure (SERVFAIL). The LTP test itself actually passed there:
https://openqa.suse.de/tests/9975697#step/host/8

Actions #11

Updated by mkittler over 1 year ago

I've checked in a QEMU-VM on worker8 and reverse DNS generally works but not for the OSD domain:

It doesn't work for IPv4 either, so this is not an IPv6-only issue.

On the worker directly this particular request works:

martchus@worker8:~> host openqa.suse.de
openqa.suse.de has address 10.160.0.207
openqa.suse.de has IPv6 address 2620:113:80c0:8080:10:160:0:207
martchus@worker8:~> host 2620:113:80c0:8080:10:160:0:207
7.0.2.0.0.0.0.0.0.6.1.0.0.1.0.0.0.8.0.8.0.c.0.8.3.1.1.0.0.2.6.2.ip6.arpa domain name pointer openqa.suse.de.
Actions #12

Updated by mkittler over 1 year ago

It seems to depend on the DNS server:

When I comment out the other DNS servers in /etc/resolv.conf on the worker host¹, the VM can reverse-resolve the OSD IP normally using the "QEMU provided" default DNS server. So it is really just the DNS server.


¹

#nameserver 10.160.2.88
nameserver 10.160.0.1
#nameserver 10.100.2.10

(But 10.100.2.10 works as well. Only 10.160.2.88 seems bad.)
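The per-nameserver comparison above could be scripted roughly as follows. This is only a sketch: check_reverse is a made-up helper name, it assumes the host command (bind-utils) is installed, and it relies on host exiting non-zero when a lookup fails.

```shell
# Hypothetical helper: ask each given nameserver for the reverse (PTR)
# record of one IP address and print one result line per nameserver.
# Usage: check_reverse IP NAMESERVER...
check_reverse() (
    ip=$1
    shift
    for ns in "$@"; do
        # "host IP SERVER" sends the query to SERVER instead of the
        # resolvers configured in /etc/resolv.conf.
        if host "$ip" "$ns" >/dev/null 2>&1; then
            echo "$ns ok"
        else
            echo "$ns FAILED"
        fi
    done
)
```

For example, check_reverse 10.160.0.207 10.160.2.88 10.160.0.1 10.100.2.10 would test the three nameservers from /etc/resolv.conf against the OSD address.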

Actions #13

Updated by mkittler over 1 year ago

So it would supposedly help to either fix 10.160.2.88 or to remove that nameserver from /etc/resolv.conf. Note that the "bad" nameserver is present in /etc/resolv.conf both on OSD itself (old "zone") and on workers in the new security zone.

Actions #14

Updated by kraih over 1 year ago

Since this is urgent, #help-it-ama on Slack has also been notified: https://suse.slack.com/archives/C029APBKLGK/p1668618504701409

Actions #15

Updated by tinita over 1 year ago

Another thing that we realized is that the reverse lookup for workers is OK; only the reverse lookup for openqa.suse.de fails on 10.160.2.88:

% host openqa.suse.de 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases: 

openqa.suse.de has address 10.160.0.207
openqa.suse.de has IPv6 address 2620:113:80c0:8080:10:160:0:207
% host 10.160.0.207 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases: 

Host 207.0.160.10.in-addr.arpa not found: 2(SERVFAIL)

% host worker11.oqa.suse.de 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases: 

worker11.oqa.suse.de has address 10.137.10.11
worker11.oqa.suse.de has IPv6 address 2a07:de40:a203:12:ec4:7aff:fe7a:7896
% host 10.137.10.11 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases: 

11.10.137.10.in-addr.arpa domain name pointer worker11.oqa.suse.de.
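
For reference, the name queried in an IPv4 reverse lookup is the address with its octets reversed under in-addr.arpa, as visible in the output above. A small sketch of that mapping; ptr_name is a hypothetical helper, not part of any existing tool, and it does no input validation:

```shell
# Hypothetical helper: print the in-addr.arpa name for a dotted-quad
# IPv4 address. The body runs in a subshell so that the IFS and
# globbing changes do not leak into the caller.
ptr_name() (
    set -f          # no filename expansion on the unquoted $1 below
    IFS=.
    set -- $1       # split the address into its four octets
    printf '%s.%s.%s.%s.in-addr.arpa\n' "$4" "$3" "$2" "$1"
)

ptr_name 10.160.0.207   # prints 207.0.160.10.in-addr.arpa
```

The result can then be queried explicitly as a PTR record against a specific nameserver, e.g. with dig or host -t PTR.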
Actions #16

Updated by okurz over 1 year ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee changed from nicksinger to okurz
  • Priority changed from Immediate to Urgent

I also reported this in https://sd.suse.com/servicedesk/customer/portal/1/SD-104548 now.

And I patched workers with

sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'sed -i "s/\(NETCONFIG_DNS_POLICY=\)\"auto\"/\1\"\"/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)\"\"/\1\"10.160.0.1 10.100.2.10\"/" /etc/sysconfig/network/config && sed -i "/nameserver 10.160.2.88/d" /etc/resolv.conf'

Added corresponding rollback steps to the description. Unfortunately I can't use auto-review as the error output is not shown in autoinst-log.txt but only in serial_terminal.txt.
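
To preview what the patch and the rollback substitutions do without touching a worker, the same sed expressions can be applied to a copy of the config and printed to stdout. This is a minimal sketch: dns_patch and dns_rollback are hypothetical helper names, and unlike the salt commands they neither edit the file in place nor touch /etc/resolv.conf or run netconfig.

```shell
# Hypothetical helpers: print the patched or rolled-back version of a
# netconfig file to stdout; the input file is left unchanged.

# Switch from the "auto" DNS policy to the two static servers.
dns_patch() {
    sed 's/\(NETCONFIG_DNS_POLICY=\)"auto"/\1""/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)""/\1"10.160.0.1 10.100.2.10"/' "$1"
}

# Inverse substitution: back to the "auto" policy, no static servers.
dns_rollback() {
    sed 's/\(NETCONFIG_DNS_POLICY=\)""/\1"auto"/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)"10.160.0.1 10.100.2.10"/\1""/' "$1"
}
```

For example, dns_patch /etc/sysconfig/network/config | diff /etc/sysconfig/network/config - shows the lines the patch would change.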

Actions #17

Updated by okurz over 1 year ago

  • Status changed from In Progress to Blocked

openqa-label-all -vvv --module host --label https://progress.opensuse.org/issues/120339 should work:

DEBUG:/usr/bin/openqa-label-all:args: Namespace(build=None, dry_run=False, groupid=None, label='https://progress.opensuse.org/issues/120339', module='host', no_restart=False, openqa_host='https://openqa.suse.de', result='failed', verbose=4)
DEBUG:/usr/bin/openqa-label-all:Retrieving comments on job 9927713
DEBUG:/usr/bin/openqa-label-all:Comments: 
DEBUG:/usr/bin/openqa-label-all:9927713 has label: False
…

Now it looks like other issues are still blocking, e.g. https://openqa.suse.de/tests/9938116#step/execve06_postun/7 . Don't know how to handle that.

blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-104548

Actions #18

Updated by livdywan over 1 year ago

  • Status changed from Blocked to Feedback

okurz wrote:

Now it looks like other issues are still blocking, e.g. https://openqa.suse.de/tests/9938116#step/execve06_postun/7 . Don't know how to handle that.

blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-104548

Errors in the zone file [..] Works now.

Actions #19

Updated by mkittler over 1 year ago

Looks like reverse DNS via 10.160.2.88 works now (for hosts where it previously didn't).

I'm also not sure about the remaining test failures in the test @okurz has mentioned. They seem to be unrelated to this DNS issue.

Actions #20

Updated by nicksinger over 1 year ago

Indeed resolved looking at manual dig output and also the latest test runs: https://openqa.suse.de/tests/9977119#step/host/6 - @MDoucha can you confirm this fixed your issues?

Actions #21

Updated by okurz over 1 year ago

  • Description updated (diff)

Executed the rollback steps

Actions #22

Updated by okurz over 1 year ago

  • Due date set to 2022-11-30
  • Priority changed from Urgent to High
Actions #23

Updated by okurz over 1 year ago

  • Due date deleted (2022-11-30)
  • Status changed from Feedback to Resolved
Actions #24

Updated by MDoucha over 1 year ago

nicksinger wrote:

Indeed resolved looking at manual dig output and also the latest test runs: https://openqa.suse.de/tests/9977119#step/host/6 - @MDoucha can you confirm this fixed your issues?

I haven't seen any new DNS failures in our tests this week, so I can confirm this issue is fixed.

The execve06_postun failures are expected, they're caused by intentionally rolling back a livepatch which fixes bsc#1204381.
