action #120339
QA (public) - coordination #121720 (closed): [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA (public) - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones
QEMU DNS fails to resolve openqa.suse.de via IP address
Added by MDoucha about 2 years ago. Updated about 2 years ago.
Description
Observation
LTP test host started failing today. The QEMU DNS service running at 10.0.2.3 correctly resolves hostnames to IP addresses but reverse lookup fails. Old tests which passed up until yesterday are now also failing upon restart so this appears to be a QEMU configuration issue. The physical worker machine can resolve IP addresses without issue.
This issue is confirmed on worker3, worker5, worker8 and worker13. Other workers may be affected as well. PPC64LE QEMU workers do not seem to be affected, though.
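To reproduce the observation by hand, a minimal sketch along the following lines could be run inside the QEMU guest (or on the worker). The helper names `ipv4_ptr_name` and `check_forward_reverse` are made up for illustration, and it assumes that `host` (bind-utils) and `awk` are available on the system under test.

```shell
# Build the PTR query name for an IPv4 address (hypothetical helper).
ipv4_ptr_name() {
    # Split the dotted quad and print the octets in reverse order.
    echo "$1" | awk -F. '{ printf "%s.%s.%s.%s.in-addr.arpa\n", $4, $3, $2, $1 }'
}

# Compare forward and reverse resolution for one host name.
# The optional second argument is the DNS server to ask,
# e.g. 10.0.2.3 for the QEMU user-mode DNS inside a guest.
check_forward_reverse() {
    name=$1
    server=${2:-}
    # Take the first IPv4 address reported for the name.
    addr=$(host -t A "$name" $server | awk '/has address/ { print $4; exit }')
    [ -n "$addr" ] || { echo "forward lookup failed for $name"; return 1; }
    echo "forward: $name -> $addr"
    echo "reverse query name: $(ipv4_ptr_name "$addr")"
    # The exit status of this call tells whether the reverse lookup worked.
    host "$addr" $server
}
```

For example, `check_forward_reverse openqa.suse.de 10.0.2.3` inside a guest would show whether only the reverse direction fails.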
Rollback steps
- DONE Revert removal of faulty DNS
sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'sudo sed -i "s/\(NETCONFIG_DNS_POLICY=\)\"\"/\1\"auto\"/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)\"10.160.0.1 10.100.2.10\"/\1\"\"/" /etc/sysconfig/network/config && sudo netconfig update -f'
Files
openqa-host-fail.png (27.9 KB) | manual test on worker5:7 | MDoucha, 2022-11-11 12:39
screenshot_20221116_165649.png (49.7 KB) | mkittler, 2022-11-16 15:57
screenshot_20221116_172722.png (62.3 KB) | mkittler, 2022-11-16 16:27
screenshot_20221116_174604.png (45.8 KB) | mkittler, 2022-11-16 16:46
Updated by okurz about 2 years ago
- Priority changed from Normal to Urgent
- Target version set to Ready
As ppc64le workers do not seem to be affected I assume this is due to #119443
Updated by okurz about 2 years ago
- Related to action #119443: Conduct the migration of SUSE openQA systems from Nbg SRV1 to new security zones size:M added
Updated by msmeissn about 2 years ago
this blocks important maintenance updates, please work / fix this ASAP
Updated by livdywan about 2 years ago
- Priority changed from Urgent to Immediate
msmeissn wrote:
this blocks important maintenance updates, please work / fix this ASAP
Seems this is Immediate then
Updated by mkittler about 2 years ago
Since the ticket description wasn't clear about what was tested on the physical machine I tried the following:
martchus@worker3:~> host worker2.oqa.suse.de
worker2.oqa.suse.de has address 10.137.10.2
worker2.oqa.suse.de has IPv6 address 2a07:de40:a203:12:2e60:cff:fe73:2ac
martchus@worker3:~> host 10.137.10.2
2.10.137.10.in-addr.arpa domain name pointer worker2.oqa.suse.de.
martchus@worker3:~> host worker5.oqa.suse.de
worker5.oqa.suse.de has address 10.137.10.5
worker5.oqa.suse.de has IPv6 address 2a07:de40:a203:12:56ab:3aff:fe24:358d
martchus@worker3:~> host 10.137.10.5
5.10.137.10.in-addr.arpa domain name pointer worker5.oqa.suse.de.
So it works and might therefore indeed be a QEMU problem.
We haven't had a QEMU or libslirp0 update recently. We also haven't changed the way we invoke QEMU - at least I'm not aware of any change in os-autoinst.
Updated by mkittler about 2 years ago
I've restarted the test: https://openqa.suse.de/tests/9975697
It was scheduled on openqaworker14 where I could reproduce the issue by connecting to the VM manually:
Updated by mkittler about 2 years ago
openqaworker14 cannot even do the reverse lookup itself. I've scheduled https://openqa.suse.de/tests/9975736 to run explicitly on worker8 to re-conduct the test there.
Updated by MDoucha about 2 years ago
mkittler wrote:
I've restarted the test: https://openqa.suse.de/tests/9975697
It was scheduled on openqaworker14 where I could reproduce the issue by connecting to the VM manually
Note that the error (NXDOMAIN) is different from the original failure (SERVFAIL). The LTP test itself actually passed there:
https://openqa.suse.de/tests/9975697#step/host/8
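Since the two error codes point at different causes (SERVFAIL: the server could not answer; NXDOMAIN: the server answered that no record exists), it may help to tell them apart when comparing workers. A small sketch, assuming the usual wording of `host` output; `classify_host_output` is a hypothetical helper name:

```shell
# Read `host` output on stdin and print one of:
# ok, SERVFAIL, NXDOMAIN, unknown
classify_host_output() {
    out=$(cat)
    case $out in
        *"domain name pointer"*) echo ok ;;
        *SERVFAIL*) echo SERVFAIL ;;
        *NXDOMAIN*) echo NXDOMAIN ;;
        *) echo unknown ;;
    esac
}
```

Usage would be something like `host 10.160.0.207 2>&1 | classify_host_output` on each machine in question.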
Updated by mkittler about 2 years ago
I've checked in a QEMU-VM on worker8 and reverse DNS generally works but not for the OSD domain:
It doesn't work for IPv4 either, so this is not an IPv6-only issue.
On the worker directly this particular request works:
martchus@worker8:~> host openqa.suse.de
openqa.suse.de has address 10.160.0.207
openqa.suse.de has IPv6 address 2620:113:80c0:8080:10:160:0:207
martchus@worker8:~> host 2620:113:80c0:8080:10:160:0:207
7.0.2.0.0.0.0.0.0.6.1.0.0.1.0.0.0.8.0.8.0.c.0.8.3.1.1.0.0.2.6.2.ip6.arpa domain name pointer openqa.suse.de.
Updated by mkittler about 2 years ago
It seems to depend on the DNS server:
When commenting out the other DNS servers in /etc/resolv.conf on the worker host¹, the VM can reverse-resolve the OSD IP normally using the "QEMU provided" default DNS server. So it is really just the DNS server.
¹
#nameserver 10.160.2.88
nameserver 10.160.0.1
#nameserver 10.100.2.10
(But 10.100.2.10 works as well. Only 10.160.2.88 seems bad.)
Updated by mkittler about 2 years ago
So it would supposedly help to either fix 10.160.2.88 or to remove that nameserver from /etc/resolv.conf. Note that the "bad" nameserver is present in /etc/resolv.conf on OSD itself (old "zone") and workers in the new security zone.
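To find out which of the configured nameservers is the bad one without editing /etc/resolv.conf by hand, each active entry can be queried directly. This is only a sketch: the function names are made up, and it assumes that `host` exits non-zero when the lookup fails.

```shell
# Print the active (uncommented) nameserver entries of a resolv.conf file.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Ask every configured nameserver for the PTR record of one address.
probe_reverse() {
    addr=$1
    conf=${2:-/etc/resolv.conf}
    for ns in $(list_nameservers "$conf"); do
        if host "$addr" "$ns" >/dev/null 2>&1; then
            echo "$ns: ok"
        else
            echo "$ns: FAILED"
        fi
    done
}
```

Running `probe_reverse 10.160.0.207` on a worker should then print one line per nameserver.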
Updated by kraih about 2 years ago
Since this is urgent, #help-it-ama on Slack has also been notified: https://suse.slack.com/archives/C029APBKLGK/p1668618504701409
Updated by tinita about 2 years ago
Another thing that we realized is that the lookup for workers is ok, just the reverse lookup for openqa.suse.de doesn't get a result on 10.160.2.88
% host openqa.suse.de 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases:
openqa.suse.de has address 10.160.0.207
openqa.suse.de has IPv6 address 2620:113:80c0:8080:10:160:0:207
% host 10.160.0.207 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases:
Host 207.0.160.10.in-addr.arpa not found: 2(SERVFAIL)
% host worker11.oqa.suse.de 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases:
worker11.oqa.suse.de has address 10.137.10.11
worker11.oqa.suse.de has IPv6 address 2a07:de40:a203:12:ec4:7aff:fe7a:7896
% host 10.137.10.11 10.160.2.88
Using domain server:
Name: 10.160.2.88
Address: 10.160.2.88#53
Aliases:
11.10.137.10.in-addr.arpa domain name pointer worker11.oqa.suse.de.
Updated by okurz about 2 years ago
- Description updated (diff)
- Status changed from New to In Progress
- Assignee changed from nicksinger to okurz
- Priority changed from Immediate to Urgent
I also reported this in https://sd.suse.com/servicedesk/customer/portal/1/SD-104548 now.
And I patched workers with
sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'sed -i "s/\(NETCONFIG_DNS_POLICY=\)\"auto\"/\1\"\"/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)\"\"/\1\"10.160.0.1 10.100.2.10\"/" /etc/sysconfig/network/config && sed -i "/nameserver 10.160.2.88/d" /etc/resolv.conf'
Added the corresponding rollback steps to the description. Unfortunately I can't use auto-review as the error output is not shown in autoinst-log.txt but only in serial_terminal.txt
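The two sed expressions (patch here, rollback in the description) can be checked against a scratch file before touching any worker. The sketch below assumes GNU sed (`-i`) and that the real /etc/sysconfig/network/config contains the two variables in exactly the quoted form shown. It does not cover the nameserver line removed from /etc/resolv.conf, which the rollback regenerates with `netconfig update -f`. The function names are hypothetical.

```shell
# Same substitutions as the salt command above, applied to a given file.
patch_dns() {
    sed -i 's/\(NETCONFIG_DNS_POLICY=\)"auto"/\1""/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)""/\1"10.160.0.1 10.100.2.10"/' "$1"
}

# Same substitutions as the rollback step in the description.
rollback_dns() {
    sed -i 's/\(NETCONFIG_DNS_POLICY=\)""/\1"auto"/;s/\(NETCONFIG_DNS_STATIC_SERVERS=\)"10.160.0.1 10.100.2.10"/\1""/' "$1"
}

# Apply patch and rollback to a scratch file and compare with the original.
roundtrip_check() {
    scratch=$(mktemp)
    printf '%s\n' 'NETCONFIG_DNS_POLICY="auto"' 'NETCONFIG_DNS_STATIC_SERVERS=""' > "$scratch"
    cp "$scratch" "$scratch.orig"
    patch_dns "$scratch"
    grep -q 'NETCONFIG_DNS_STATIC_SERVERS="10.160.0.1 10.100.2.10"' "$scratch" \
        || { echo "patch not applied"; rm -f "$scratch" "$scratch.orig"; return 1; }
    rollback_dns "$scratch"
    if cmp -s "$scratch" "$scratch.orig"; then
        echo "round trip ok"
    else
        echo "round trip differs"
    fi
    rm -f "$scratch" "$scratch.orig"
}
```

`roundtrip_check` only works on temporary files, so it is safe to run anywhere.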
Updated by okurz about 2 years ago
- Status changed from In Progress to Blocked
openqa-label-all -vvv --module host --label https://progress.opensuse.org/issues/120339
should work:
DEBUG:/usr/bin/openqa-label-all:args: Namespace(build=None, dry_run=False, groupid=None, label='https://progress.opensuse.org/issues/120339', module='host', no_restart=False, openqa_host='https://openqa.suse.de', result='failed', verbose=4)
DEBUG:/usr/bin/openqa-label-all:Retrieving comments on job 9927713
DEBUG:/usr/bin/openqa-label-all:Comments:
DEBUG:/usr/bin/openqa-label-all:9927713 has label: False
…
Now it looks like other issues are still blocking, e.g. https://openqa.suse.de/tests/9938116#step/execve06_postun/7 . Don't know how to handle that.
blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-104548
Updated by livdywan about 2 years ago
- Status changed from Blocked to Feedback
okurz wrote:
Now it looks like other issues are still blocking, e.g. https://openqa.suse.de/tests/9938116#step/execve06_postun/7 . Don't know how to handle that.
blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-104548
Errors in the zone file [..] Works now.
Updated by mkittler about 2 years ago
Looks like reverse DNS via 10.160.2.88 works now (for hosts where it previously didn't).
I'm also not sure about the remaining test failures in the test @okurz has mentioned. They seem to be unrelated to this DNS issue.
Updated by nicksinger about 2 years ago
Indeed resolved looking at manual dig output and also the latest test runs: https://openqa.suse.de/tests/9977119#step/host/6 - @MDoucha can you confirm this fixed your issues?
Updated by okurz about 2 years ago
- Due date set to 2022-11-30
- Priority changed from Urgent to High
Updated by okurz about 2 years ago
- Due date deleted (2022-11-30)
- Status changed from Feedback to Resolved
Updated by MDoucha about 2 years ago
nicksinger wrote:
Indeed resolved looking at manual dig output and also the latest test runs: https://openqa.suse.de/tests/9977119#step/host/6 - @MDoucha can you confirm this fixed your issues?
I haven't seen any new DNS failures in our tests this week, I can confirm this issue is fixed.
The execve06_postun failures are expected, they're caused by intentionally rolling back a livepatch which fixes bsc#1204381.