action #166511

closed

[tools] Could not resolve host: workerX.mshome.net. seems something wrong with domain "mshome.net"

Added by rfan1 3 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
2024-09-09
Due date:
% Done:

0%

Estimated time:
Tags:

Description

For jobs running on s390x, I can see that the job setting below has changed:

Passed job 2 days ago: "WORKER_HOSTNAME" : "worker33.oqa.prg2.suse.org", => https://openqa.suse.de/tests/15367493
Failed job now: "WORKER_HOSTNAME" : "worker33.mshome.net", => https://openqa.suse.de/tests/15374870

Was there a recent DNS configuration change? Can you please help fix it?

Observation

openQA test in scenario sle-15-SP6-Server-DVD-Updates-s390x-mau-filesystem@s390x-kvm fails in
prepare_test_data

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml. Run filesystem tests against aggregated test repo

Reproducible

Fails since (at least) Build 20240908-1

Expected result

Last good: 20240906-1 (or more recent)

Rollback steps

  • DONE ssh osd "sudo salt-key -y -a worker33.oqa.prg2.suse.org && sudo salt 'worker33*' state.apply"

Further details

Always latest result in this scenario: latest

Workaround

  • On the affected host call wicked ifup all and confirm with grep ^search /etc/resolv.conf that mshome.net is not the first entry
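
The check from the workaround can be scripted. This is a minimal sketch, assuming a POSIX shell and awk; the helper names first_search_domain and search_order_ok are made up for illustration and are not part of wicked or openQA.

```shell
# Print the first domain on the "search" line of a resolv.conf-style file.
# Defaults to /etc/resolv.conf when no file is given.
first_search_domain() {
    awk '$1 == "search" { print $2; exit }' "${1:-/etc/resolv.conf}"
}

# Succeed (exit status 0) only if mshome.net is NOT the first search domain.
search_order_ok() {
    [ "$(first_search_domain "$1")" != "mshome.net" ]
}
```

On an affected host this could be combined with the workaround as: search_order_ok || sudo wicked ifup all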

Related issues 1 (1 open, 0 closed)

Copied to openQA Infrastructure (public) - coordination #166571: [epic] Separate testing machines from production machines (again) (New, 2024-09-09)

Actions #1

Updated by okurz 3 months ago

  • Tags set to infra
  • Project changed from openQA Project (public) to openQA Infrastructure (public)
  • Category set to Regressions/Crashes
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #2

Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz

From

sudo salt \* cmd.run "grep -v '^#' /etc/resolv.conf"
s390zl13.oqa.prg2.suse.org:
    search prg2.suse.org oqa.prg2.suse.org oqa.suse.de suse.de
    nameserver 10.144.53.53
    nameserver 10.144.53.54
ada.qe.prg2.suse.org:
    search qe.prg2.suse.org oqa.prg2.suse.org prg2.suse.org arch.prg2.suse.org suse.de suse.cz suse.asia prv.suse.net
    nameserver 10.144.53.53
    nameserver 10.144.53.54
backup-qam.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
s390zl12.oqa.prg2.suse.org:
    search prg2.suse.org mshome.net oqa.prg2.suse.org oqa.suse.de suse.de
    nameserver 10.144.53.53
    nameserver fe80::584f:67d6:5d82:4874%vlan2114
    nameserver 10.144.53.54
osiris-1.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
storage.qe.prg2.suse.org:
    search qe.prg2.suse.org oqa.prg2.suse.org prg2.suse.org arch.prg2.suse.org suse.de suse.cz suse.asia prv.suse.net
    nameserver 10.144.53.53
    nameserver 10.144.53.54
openqa.suse.de:
    search suse.de arch.suse.de nue.suse.com openvpn.suse.de suse.cz qa.suse.de
    nameserver 2a07:de40:b205:7:10:144:53:53
    nameserver 10.144.53.53
    nameserver 2a07:de40:b205:7:10:144:53:54
openqaworker18.qa.suse.cz:
    search qa.suse.cz suse.cz suse.de qa.suse.de qam.suse.de
    nameserver 10.100.96.1
    nameserver 10.100.96.2
openqaworker17.qa.suse.cz:
    search qa.suse.cz suse.cz suse.de qa.suse.de qam.suse.de
    nameserver 10.100.96.1
    nameserver 10.100.96.2
openqaworker16.qa.suse.cz:
    search qa.suse.cz suse.cz suse.de qa.suse.de qam.suse.de
    nameserver 10.100.96.1
    nameserver 10.100.96.2
qesapworker-prg6.qa.suse.cz:
    search qa.suse.cz suse.cz suse.de qa.suse.de qam.suse.de
    nameserver 10.100.96.1
    nameserver 10.100.96.2
qesapworker-prg4.qa.suse.cz:
    search qa.suse.cz suse.cz suse.de qa.suse.de qam.suse.de
    nameserver 10.100.96.1
    nameserver 10.100.96.2
sapworker2.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
qesapworker-prg7.qa.suse.cz:
    search qa.suse.cz suse.cz suse.de qa.suse.de qam.suse.de
    nameserver 10.100.96.1
    nameserver 10.100.96.2
worker33.oqa.prg2.suse.org:
    search mshome.net oqa.prg2.suse.org oqa.suse.de suse.de
    nameserver fe80::584f:67d6:5d82:4874%eth0
    nameserver 10.144.53.53
    nameserver 10.144.53.54
openqaw5-xen.qe.prg2.suse.org:
    search qe.prg2.suse.org oqa.prg2.suse.org prg2.suse.org arch.prg2.suse.org suse.de suse.cz suse.asia prv.suse.net
    nameserver 10.144.53.53
    nameserver 10.144.53.54
worker29.oqa.prg2.suse.org:
    search mshome.net oqa.prg2.suse.org oqa.suse.de suse.de
    nameserver fe80::584f:67d6:5d82:4874%eth0
    nameserver 10.144.53.53
    nameserver 10.144.53.54
worker40.oqa.prg2.suse.org:
    search mshome.net oqa.prg2.suse.org oqa.suse.de suse.de
    nameserver fe80::584f:67d6:5d82:4874%eth0
    nameserver 10.144.53.53
    nameserver 10.144.53.54
worker32.oqa.prg2.suse.org:
    search oqa.prg2.suse.org oqa.suse.de suse.de mshome.net
    nameserver 10.144.53.53
    nameserver 10.144.53.54
    nameserver fe80::584f:67d6:5d82:4874%eth0
worker30.oqa.prg2.suse.org:
    search oqa.prg2.suse.org oqa.suse.de suse.de mshome.net
    nameserver 10.144.53.53
    nameserver 10.144.53.54
    nameserver fe80::584f:67d6:5d82:4874%eth0
worker34.oqa.prg2.suse.org:
    search oqa.prg2.suse.org oqa.suse.de suse.de mshome.net
    nameserver 10.144.53.53
    nameserver 10.144.53.54
    nameserver fe80::584f:67d6:5d82:4874%eth0
sapworker3.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
sapworker1.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
worker-arm2.oqa.prg2.suse.org:
    search oqa.prg2.suse.org oqa.suse.de suse.de mshome.net
    nameserver 10.144.53.53
    nameserver 10.144.53.54
    nameserver fe80::584f:67d6:5d82:4874%eth0
worker-arm1.oqa.prg2.suse.org:
    search oqa.prg2.suse.org oqa.suse.de suse.de mshome.net
    nameserver 10.144.53.53
    nameserver 10.144.53.54
    nameserver fe80::584f:67d6:5d82:4874%eth0
openqaworker1.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz
    nameserver 10.168.0.1
    nameserver 10.168.0.2
tumblesle.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
baremetal-support.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
backup-vm.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
schort-server.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
monitor.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
worker35.oqa.prg2.suse.org:
    search oqa.prg2.suse.org oqa.suse.de suse.de mshome.net
    nameserver 10.144.53.53
    nameserver 10.144.53.54
    nameserver fe80::584f:67d6:5d82:4874%eth0
openqaworker14.qa.suse.cz:
    search qa.suse.cz suse.cz suse.de qa.suse.de qam.suse.de
    nameserver 10.100.96.1
    nameserver 10.100.96.2
jenkins.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
unreal6.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
qamaster.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
imagetester.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
petrol.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
openqa-piworker.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
mania.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
diesel.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2
grenache-1.oqa.prg2.suse.org:
    search mshome.net oqa.prg2.suse.org oqa.suse.de suse.de
    nameserver fe80::584f:67d6:5d82:4874%eth0
    nameserver 10.144.53.53
    nameserver 10.144.53.54
openqaworker-arm-1.qe.nue2.suse.org:
    search qe.nue2.suse.org nue2.suse.org suse.de arch.suse.de nue.suse.com suse.cz suse.asia prv.suse.net
    nameserver 10.168.0.1
    nameserver 10.168.0.2

I see that multiple machines have a search list starting with "mshome.net", among them grenache-1, worker29, worker33 and worker40, while others have it later in the list or not at all. I triggered a reboot of worker29, worker40 and grenache-1 but took w33 out of production without rebooting it, to keep it in this state for further investigation.
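
To avoid reading the salt output by eye, the position of mshome.net in each search list can be extracted with awk. This is only a sketch under the assumption that the output has the shape shown above (an unindented "minion:" line followed by indented resolv.conf lines); the function name search_domain_positions is hypothetical.

```shell
# Read `salt ... cmd.run` output on stdin. For each minion whose search list
# contains the given domain (default: mshome.net), print
# "<minion> <position>", where position 1 means the first search entry.
search_domain_positions() {
    awk -v domain="${1:-mshome.net}" '
        # Unindented line ending in ":" is a minion header; remember the name.
        /^[^ \t].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # Indented "search ..." line: report where the domain appears.
        $1 == "search" {
            for (i = 2; i <= NF; i++)
                if ($i == domain) print host, i - 1
        }
    '
}
```

Possible usage: sudo salt \* cmd.run "grep -v '^#' /etc/resolv.conf" | search_domain_positions
Hosts reported with position 1 are the ones where unqualified names resolve against mshome.net first.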

Actions #3

Updated by nicksinger 3 months ago

btw, the first occurrence I can spot on w33 is:

worker33:/var/log # journalctl -x | grep mshome.net
Sep 07 05:16:50 worker33 worker[124732]:  - worker address (WORKER_HOSTNAME): worker33.mshome.net
Actions #4

Updated by okurz 3 months ago

I could identify the underlying issue. With tcpdump -i eth0 -vvv -s 0 -l -n port 547 and a call to wicked --debug all ifup eth0 I found

13:45:49.046177 IP6 (flowlabel 0xda96c, hlim 128, next-header UDP (17) payload length: 84) fe80::584f:67d6:5d82:4874.547 > fe80::7ec2:55ff:fe24:de2a.546: [udp sum ok] dhcp6 reply (xid=c5b933 (client-ID hwaddr/time type 1 time 744113586 7cc25524de2a) (DNS-search-list mshome.net.) (DNS-server fe80::584f:67d6:5d82:4874) (server-ID hwaddr/time type 1 time 778797996 3cecefff16ab))

The "server-ID" mentions "3cecefff16ab", which is the MAC address of bare-metal4.oqa.prg2.suse.org https://racktables.nue.suse.com/index.php?page=object&tab=default&hl_port_id=171158&object_id=23398 which, despite the name, is actually used as a Hyper-V server by the virt squad.
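
For looking up such an address in an inventory it helps to have it in the usual colon-separated form. A small sketch, assuming the trailing 12 hex digits of the "hwaddr/time type 1" server-ID are the Ethernet address as interpreted above; the helper name hex_to_mac is made up for illustration.

```shell
# Turn 12 hex digits, as printed by tcpdump at the end of the server-ID,
# into colon-separated MAC notation.
hex_to_mac() {
    printf '%s\n' "$1" | sed 's/../&:/g; s/:$//'
}
```

For the reply above, hex_to_mac 3cecefff16ab prints 3c:ec:ef:ff:16:ab.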

In https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/865#note_647493 I already mentioned problems we should foresee with this setup, and https://sd.suse.com/servicedesk/customer/portal/1/SD-162636 to use a better name for the server is still pending. The machine was set up as part of #164009 and I assume there is a rogue DHCPv6 server running on the host falsely answering requests. I will create another ticket for the virt squad to disable that service.

Actions #6

Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from In Progress to Blocked
  • Priority changed from Urgent to High

Reported #166553. I brought w33 back into production. The same problem can reappear anytime depending on which DHCP server answers first. The mitigation is to re-request the network config, e.g. wicked ifup all. Added "workaround" section to the ticket.

Actions #7

Updated by okurz 3 months ago

  • Copied to coordination #166571: [epic] Separate testing machines from production machines (again) added
Actions #8

Updated by livdywan 3 months ago

okurz wrote in #note-6:

Reported #166553. I brought w33 back into production. The same problem can reappear anytime depending on which DHCP server answers first. The mitigation is to re-request the network config, e.g. wicked ifup all. Added "workaround" section to the ticket.

The blocker is in Feedback now.

Actions #9

Updated by okurz 3 months ago

  • Status changed from Blocked to Resolved

I checked the current state by refreshing /var/lib/wicked/lease-eth0-dhcp-ipv6.xml and /etc/resolv.conf with a call to wicked ifup all on worker33 and verified that /etc/resolv.conf no longer mentions mshome.net. Other hosts that have not been rebooted for multiple days still have the entry, but not as the primary search domain, so we are good.
