action #88449

closed

[aarch64] Parallel jobs fail due to DNS problem

Added by ggardet_arm about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Bugs in existing tests
Target version: -
Start date: 2021-02-04
Due date:
% Done: 100%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario opensuse-Tumbleweed-DVD-aarch64-remote_ssh_controller@aarch64 fails in
await_install

Parallel jobs fail due to DNS problem on aarch64:

Ethernet connection by itself seems ok since logs are uploaded properly.

Test suite description

Maintainer: jrivera Install remote server (parallel job) with ssh.

Reproducible

Fails since (at least) Build 20210122

Expected result

Last good: 20210122 (or more recent)

Further details

Always latest result in this scenario: latest

Actions #1

Updated by favogt about 3 years ago

https://openqa.opensuse.org/tests/1615864/file/serial0.txt repeatedly shows "timed out resolving 'download.opensuse.org/A/IN': 192.168.112.100#53".
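To gauge how often the resolver timed out in a saved copy of such a serial log, a count along these lines may help. This is only a sketch: count_dns_timeouts is a hypothetical helper name, and the match string is simply the message quoted above.

```shell
# count_dns_timeouts is a hypothetical helper, not part of openQA.
# It counts the lines in a log file that contain the resolver timeout message.
count_dns_timeouts() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are none, so mask that to keep "0" a normal result.
    grep -c 'timed out resolving' "$1" || true
}

# Usage (assuming the log was downloaded as serial0.txt):
#   count_dns_timeouts serial0.txt
```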
I couldn't find any abnormalities on o3, but on aarch64 there are some soft lockups shortly after multi-machine tests started today and also some yesterday:

Feb 04 03:36:22 openqa-aarch64 ovs-vsctl[5439]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap68 tag=1 vlan_mode=dot1q-tunnel
Feb 04 03:36:33 openqa-aarch64 kernel: watchdog: BUG: soft lockup - CPU#49 stuck for 22s! [qemu-system-aar:4385]
Feb 04 03:36:33 openqa-aarch64 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc nfs_ssc fscache af_packet tun nfnetlink_cttimeout iscsi_ibft iscsi_boot_sysfs rfkill openvswitch nsh nf_conncount >
Feb 04 03:36:33 openqa-aarch64 kernel: CPU: 49 PID: 4385 Comm: qemu-system-aar Not tainted 5.10.7-2.gc9a364d-default #1 openSUSE Tumbleweed (unreleased)
Feb 04 03:36:33 openqa-aarch64 kernel: Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.50 06/01/2018
Feb 04 03:36:33 openqa-aarch64 kernel: pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
Feb 04 03:36:33 openqa-aarch64 kernel: pc : invalidate_icache_range+0x5c/0xb0
Feb 04 03:36:33 openqa-aarch64 kernel: lr : kvm_handle_guest_abort+0x998/0xa8c
Feb 04 03:36:33 openqa-aarch64 kernel: sp : ffff80002566ba60
Feb 04 03:36:33 openqa-aarch64 kernel: x29: ffff80002566ba60 x28: ffff0090b63b8000 
Feb 04 03:36:33 openqa-aarch64 kernel: x27: 00000000000000f3 x26: 0000000009e40000 
Feb 04 03:36:33 openqa-aarch64 kernel: x25: ffff0090b6060000 x24: 000000000000000c 
Feb 04 03:36:33 openqa-aarch64 kernel: x23: 0000000000080000 x22: 000000008200000d 
Feb 04 03:36:33 openqa-aarch64 kernel: x21: 0000000000000007 x20: 0000000000000000 
Feb 04 03:36:33 openqa-aarch64 kernel: x19: 0000000080000000 x18: 0000000000000000 
Feb 04 03:36:33 openqa-aarch64 kernel: x17: 0000000000000000 x16: 0000000000000000 
Feb 04 03:36:33 openqa-aarch64 kernel: x15: 0000000000000000 x14: 0000000000000000 
Feb 04 03:36:33 openqa-aarch64 kernel: x13: 0000000000000000 x12: 0000000000000040 
Feb 04 03:36:33 openqa-aarch64 kernel: x11: ffff0090b6011800 x10: ffff0010407ee21a 
Feb 04 03:36:33 openqa-aarch64 kernel: x9 : ffff0090a4666f80 x8 : 0000000000000003 
Feb 04 03:36:33 openqa-aarch64 kernel: x7 : ffff0090b6011800 x6 : ffff80001222e368 
Feb 04 03:36:33 openqa-aarch64 kernel: x5 : ffff0090b6011c10 x4 : 0000000000000000 
Feb 04 03:36:33 openqa-aarch64 kernel: x3 : ffff009e6afc2c80 x2 : 0000000000000040 
Feb 04 03:36:33 openqa-aarch64 kernel: x1 : ffff009e80000000 x0 : ffff009e40000000 
Feb 04 03:36:33 openqa-aarch64 kernel: Call trace:
Feb 04 03:36:33 openqa-aarch64 kernel:  invalidate_icache_range+0x5c/0xb0
Feb 04 03:36:33 openqa-aarch64 kernel:  handle_exit+0x78/0x200
Feb 04 03:36:33 openqa-aarch64 kernel:  kvm_arch_vcpu_ioctl_run+0x1f4/0x8c0
Feb 04 03:36:33 openqa-aarch64 kernel:  kvm_vcpu_ioctl+0x248/0x5f0
Feb 04 03:36:33 openqa-aarch64 kernel:  __arm64_sys_ioctl+0xb4/0x100

This also seems to have triggered some issue with eth0:

Feb 04 03:36:41 openqa-aarch64 kernel: NETDEV WATCHDOG: eth0 (hns-nic): transmit queue 10 timed out
Feb 04 03:36:41 openqa-aarch64 kernel: WARNING: CPU: 1 PID: 17 at net/sched/sch_generic.c:442 dev_watchdog+0x384/0x38c
Feb 04 03:36:41 openqa-aarch64 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc nfs_ssc fscache af_packet tun nfnetlink_cttimeout iscsi_ibft iscsi_boot_sysfs rfkill openvswitch nsh nf_conncount >
Feb 04 03:36:41 openqa-aarch64 kernel: CPU: 1 PID: 17 Comm: ksoftirqd/1 Tainted: G             L    5.10.7-2.gc9a364d-default #1 openSUSE Tumbleweed (unreleased)
Feb 04 03:36:41 openqa-aarch64 kernel: Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.50 06/01/2018
Feb 04 03:36:41 openqa-aarch64 kernel: pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
Feb 04 03:36:41 openqa-aarch64 kernel: pc : dev_watchdog+0x384/0x38c
Feb 04 03:36:41 openqa-aarch64 kernel: lr : dev_watchdog+0x384/0x38c
Feb 04 03:36:41 openqa-aarch64 kernel: sp : ffff8000134ebbe0
Feb 04 03:36:41 openqa-aarch64 kernel: x29: ffff8000134ebbe0 x28: 0000000000000000 
Feb 04 03:36:41 openqa-aarch64 kernel: x27: ffff80001190e000 x26: 0000000000000001 
Feb 04 03:36:41 openqa-aarch64 kernel: x25: 0000000000000140 x24: 00000000ffffffff 
Feb 04 03:36:41 openqa-aarch64 kernel: x23: 0000000000000001 x22: ffff001046d42480 
Feb 04 03:36:41 openqa-aarch64 kernel: x21: ffff800011e27000 x20: ffff001046d42000 
Feb 04 03:36:41 openqa-aarch64 kernel: x19: 000000000000000a x18: 00000000fffffffd 
Feb 04 03:36:41 openqa-aarch64 kernel: x17: 0000000000000000 x16: 0000000000000000 
Feb 04 03:36:41 openqa-aarch64 kernel: x15: 0000000000000020 x14: ffffffffffffffff 
Feb 04 03:36:41 openqa-aarch64 kernel: x13: ffff8000121880d0 x12: ffff800012187d22 
Feb 04 03:36:41 openqa-aarch64 kernel: x11: ffff001040400248 x10: ffff009feb48f7c0 
Feb 04 03:36:41 openqa-aarch64 kernel: x9 : ffff800010123efc x8 : ffff009faa500000 
Feb 04 03:36:41 openqa-aarch64 kernel: x7 : ffff009feb48f7c0 x6 : 0000000000000000 
Feb 04 03:36:41 openqa-aarch64 kernel: x5 : ffff001ffbbf5a48 x4 : 0000000000000000 
Feb 04 03:36:41 openqa-aarch64 kernel: x3 : 0000000000000027 x2 : 0000000000000000 
Feb 04 03:36:41 openqa-aarch64 kernel: x1 : 0000000000000000 x0 : ffff00104621dc40 
Feb 04 03:36:41 openqa-aarch64 kernel: Call trace:
Feb 04 03:36:41 openqa-aarch64 kernel:  dev_watchdog+0x384/0x38c
Feb 04 03:36:41 openqa-aarch64 kernel:  call_timer_fn+0x3c/0x184
Feb 04 03:36:41 openqa-aarch64 kernel:  __run_timers.part.0+0x31c/0x380
Feb 04 03:36:41 openqa-aarch64 kernel:  run_timer_softirq+0x48/0x80
Feb 04 03:36:41 openqa-aarch64 kernel:  __do_softirq+0x128/0x37c
Feb 04 03:36:41 openqa-aarch64 kernel:  run_ksoftirqd+0x6c/0x94
Feb 04 03:36:41 openqa-aarch64 kernel:  smpboot_thread_fn+0x15c/0x1a0
Feb 04 03:36:41 openqa-aarch64 kernel:  kthread+0x130/0x134
Feb 04 03:36:41 openqa-aarch64 kernel:  ret_from_fork+0x10/0x18
Feb 04 03:36:41 openqa-aarch64 kernel: ---[ end trace cd05227355c06071 ]---
Feb 04 03:36:41 openqa-aarch64 kernel: hns-nic HISI00C2:00 eth0: watchdog_timo changed to 1000.
Actions #2

Updated by ggardet_arm about 3 years ago

Pings to 8.8.8.8 or to the DNS which should be used (192.168.112.100) are just hanging.

Actions #3

Updated by favogt about 3 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

ggardet_arm wrote:

Pings to 8.8.8.8 or to the DNS which should be used (192.168.112.100) are just hanging.

I initially ruled that out because I assumed that at least some access to the outside network had to work for the tests to get that far, but apparently not.
Network access between the SUTs worked, so this made me check IP forwarding (net.ipv4.ip_forward) in sysctl, which was indeed 0.
This is usually set by firewalld, and indeed, firewalld.service was not running. It failed to start because iptables-restore (and some others) were not installed, but that did not put the service into the failed state.
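The forwarding check described above could be scripted roughly as follows. This is a sketch, not what was actually run on the worker: forwarding_enabled is a hypothetical helper, and its optional file argument exists only so the check can be pointed at something other than the standard /proc path.

```shell
# forwarding_enabled is a hypothetical helper.
# It succeeds only if the given file (default: the kernel's IPv4 forwarding
# switch under /proc) contains exactly "1".
forwarding_enabled() {
    f="${1:-/proc/sys/net/ipv4/ip_forward}"
    [ "$(cat "$f" 2>/dev/null)" = "1" ]
}

# Usage:
#   forwarding_enabled && echo "forwarding on" || echo "forwarding off"
# The firewalld unit can be checked separately, for example with:
#   systemctl is-active firewalld.service
```

Because the unit did not end up in the failed state here, looking only for failed units would not have revealed the problem; checking whether the unit is active is the more telling test.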

The iptables binaries were missing because of an incomplete /etc/alternatives/ directory.
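One way to spot this kind of breakage is to look for symlinks whose target no longer exists. The sketch below assumes the affected binaries are symlinks that point into /etc/alternatives, as update-alternatives normally arranges. list_dangling_links is a hypothetical helper, and /usr/sbin in the usage line is an assumption about where the links live.

```shell
# list_dangling_links is a hypothetical helper.
# It prints every symlink directly inside the given directory whose target
# does not exist.
list_dangling_links() {
    for p in "$1"/*; do
        # -L: the entry is a symlink; ! -e: following it leads nowhere
        if [ -L "$p" ] && [ ! -e "$p" ]; then
            printf '%s\n' "$p"
        fi
    done
}

# Usage (directory is an assumption):
#   list_dangling_links /usr/sbin | grep iptables
```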

This was caused by my manual fix/workaround for the "mount parameters for /etc too long" issue, where I synced all of /etc into the snapshot, but apparently /etc at that time didn't have all (or any?) overlays mounted. Thus /etc was effectively in a pretty ancient state, probably even from before the upgrade to 15.2.

To sync the old overlays properly I did:

transactional-update shell
mount /var
mkdir /var/lib/overlay/work-etc-tmp
# Mount options from /.snapshots/292/snapshot/etc/fstab, the snapshot before I coalesced the overlays
mount -t overlay overlay /mnt -o defaults,upperdir=/var/lib/overlay/292/etc,...
# Dry run first, then the real sync
rsync -v --archive --inplace --xattrs --filter='-x security.selinux' --acls --delete --dry-run /mnt/ /etc/
rsync -v --archive --inplace --xattrs --filter='-x security.selinux' --acls --delete /mnt/ /etc/
umount /mnt
umount /var
exit
# I was lazy here, but kernel/initrd didn't change
kexec --initrd /boot/initrd --reuse-cmdline /boot/Image
# The rsync also overwrote t-u's fstab changes
cp /.snapshots/303/snapshot/etc/fstab /etc/fstab

In hindsight, I should've synced into /.snapshots/303/snapshot/etc instead, which would've avoided the need to copy /etc/fstab again and preserved changes between snapshots 292 and 303, but I didn't see anything other than the stuff which is apparently modified on each boot by various services...

firewalld is up again and net.ipv4.ip_forward is now set to 1 as well. To confirm, I restarted a previously failing test; let's see: https://openqa.opensuse.org/tests/1618070

Actions #4

Updated by ggardet_arm about 3 years ago

  • Assignee set to favogt