action #120651
Updated by mkittler about 2 years ago
## Observation

Recently I found several new failures that share the same symptom and the same cause, for example:

* [test 9968639 worker2:18](https://openqa.suse.de/tests/9968639#step/host_upgrade_step2_run/6): wait_serial timed out after waiting for more than 10 hours.
* [test 9962611 worker2:18](https://openqa.suse.de/tests/9962611#step/host_upgrade_step2_run/6): wait_serial timed out after waiting for more than 10 hours.

These two, [9973067 worker2:18](https://openqa.suse.de/tests/9973067#) and [9975435 worker2:19](https://openqa.suse.de/tests/9975435#), are about to fail as well for the same reason.

However, all guests were in fact successfully installed by the failed module:

~~~
fozzie-1:~ # virsh list --all
 Id    Name                           State
--------------------------------------------
 0     Domain-0                       running
 -     sles-12-sp5-64-fv-def-net      shut off
 -     sles-12-sp5-64-pv-def-net      shut off

fozzie-1:~ #
fozzie-1:~ # virsh start sles-12-sp5-64-fv-def-net
Domain sles-12-sp5-64-fv-def-net started

fozzie-1:~ # virsh start sles-12-sp5-64-pv-def-net
Domain sles-12-sp5-64-pv-def-net started

fozzie-1:~ #
fozzie-1:~ # virsh list --all
 Id    Name                           State
-------------------------------------------
 0     Domain-0                       running
 22    sles-12-sp5-64-fv-def-net      running
 23    sles-12-sp5-64-pv-def-net      running

fozzie-1:~ #
--- 192.168.123.10 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 4088ms
pipe 4
fozzie-1:~ # ping -c5 192.168.123.10
PING 192.168.123.10 (192.168.123.10) 56(84) bytes of data.
64 bytes from 192.168.123.10: icmp_seq=1 ttl=64 time=2315 ms
64 bytes from 192.168.123.10: icmp_seq=2 ttl=64 time=1294 ms
64 bytes from 192.168.123.10: icmp_seq=3 ttl=64 time=270 ms
64 bytes from 192.168.123.10: icmp_seq=4 ttl=64 time=0.314 ms
64 bytes from 192.168.123.10: icmp_seq=5 ttl=64 time=0.285 ms

--- 192.168.123.10 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4060ms
rtt min/avg/max/mdev = 0.285/776.139/2315.106/905.173 ms, pipe 3
fozzie-1:~ # ping -c5 192.168.123.11
PING 192.168.123.11 (192.168.123.11) 56(84) bytes of data.
64 bytes from 192.168.123.11: icmp_seq=1 ttl=64 time=0.352 ms
64 bytes from 192.168.123.11: icmp_seq=2 ttl=64 time=0.239 ms
64 bytes from 192.168.123.11: icmp_seq=3 ttl=64 time=0.217 ms
64 bytes from 192.168.123.11: icmp_seq=4 ttl=64 time=0.213 ms
64 bytes from 192.168.123.11: icmp_seq=5 ttl=64 time=0.227 ms

--- 192.168.123.11 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4083ms
rtt min/avg/max/mdev = 0.213/0.249/0.352/0.054 ms
fozzie-1:~ # ssh 192.168.123.10
Warning: Permanently added '192.168.123.10' (ECDSA) to the list of known hosts.
Password:
linux:~ #
linux:~ # cat /etc/os-release
.bash_history .cache/ .dbus/ .gnupg/ .kbd/ bin/ inst-sys/
linux:~ # cat /etc/os-release
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"
linux:~ # exit
logout
Connection to 192.168.123.10 closed.
fozzie-1:~ # ssh 192.168.123.11
Warning: Permanently added '192.168.123.11' (ECDSA) to the list of known hosts.
Password:
linux:~ # cat /etc/os-release
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"
~~~

The command below, whose completion wait_serial was waiting for, had already returned before wait_serial timed out and the test run failed:

    (rm /var/log/qa/old* /var/log/qa/ctcs2/* -r;/usr/share/qa/tools/test-VH-Upgrade-std-xen-sles12sp5-sles15sp5-run 02; echo CMD_FINISHED-339853) 2>&1 | tee -a /dev/sshserial\n

This can also be seen clearly in the screenshot below:

![](prj2_step2_wait_serial.png)

These tests have been failing since Build40.1. Sometimes they pass this module and go further, for example [this one](https://openqa.suse.de/tests/9959802#step/reboot_and_wait_up_upgrade/1), but they fail most of the time.

Fortunately, some other tests that use and run the same test module in the same way pass completely, for example [passed 1 worker2:20](https://openqa.suse.de/tests/9976710) and [passed 2 worker2:19](https://openqa.suse.de/tests/9968680).

No such problem was spotted earlier, so something has probably gone wrong with the openQA infrastructure recently.

## Steps to reproduce

* Use the reproducer mentioned in #120651#note-45
* Trigger a test run with openQA for the tests prj2_host_upgrade_sles12sp5_to_developing_xen or prj2_host_upgrade_sles12sp5_to_developing_xen on a worker slot on openqaworker2 (note that the slots have been moved to grenache-1 by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/465, so presumably at least one slot needs to be moved back to openqaworker2)
* Wait for the results and check whether they fail with a wait_serial timeout at step host_upgrade_step2_run

## Impact

This affects the overall efficiency of virtualization test runs and prevents test results from being generated in a timely manner, because we have to re-run the tests many times and investigate the issue at the same time.
## Problem

This looks like a problem with the openQA infrastructure, which could involve the network connection, worker2 performance or the os-autoinst engine.

## Suggestion

1. Look into the openQA infrastructure connection and network performance
2. Look into the health status of worker2
3. Look into the os-autoinst engine

## Workaround

None
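
As an aid for the investigation (not a workaround), here is a minimal shell sketch of the marker pattern used by the command in the Observation section: the payload runs, a `CMD_FINISHED-…` marker is echoed, and everything is appended to a serial sink that a reader polls until the marker shows up or a timeout expires. It is only an illustration, not the os-autoinst implementation of wait_serial. The temporary file standing in for `/dev/sshserial`, the `wait_for_marker` helper and the 5-second timeout are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical stand-in for the serial device; the real test writes to /dev/sshserial.
serial_log=$(mktemp)
marker="CMD_FINISHED-339853"

# Same shape as the command in the ticket: run a payload, then emit the marker,
# and append all output to the "serial" sink.
(echo "payload output"; echo "$marker") 2>&1 | tee -a "$serial_log" >/dev/null

# wait_for_marker FILE MARKER TIMEOUT_SECONDS (hypothetical helper)
# Polls FILE once per second until MARKER appears or the timeout expires.
# Returns 0 if the marker was seen, 1 on timeout.
wait_for_marker() {
    file=$1
    pattern=$2
    timeout=$3
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if grep -qF -- "$pattern" "$file"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

if wait_for_marker "$serial_log" "$marker" 5; then
    result="marker seen"
else
    result="timed out"
fi
echo "$result"
rm -f "$serial_log"
```

If the marker is written by the command but never reaches the place the reader is polling, the reader times out even though the command itself finished. That matches the symptom described above, where the command had returned on the host while wait_serial kept waiting.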