Project

General

Profile

action #120651

Updated by livdywan almost 2 years ago

 
 ## Observation 
 Recently, I found some new failures share the same symptom and failed due to the same reason, for example: 
 [test 9968639 worker2:18](https://openqa.suse.de/tests/9968639#step/host_upgrade_step2_run/6) wait_serial timed out after waiting for more than 10 hours. 
 [test 9962611 worker2:18](https://openqa.suse.de/tests/9962611#step/host_upgrade_step2_run/6) wait_serial timed out after waiting for more than 10 hours. 
 These two [9973067 worker2:18](https://openqa.suse.de/tests/9973067#) and [9975435 worker2:19](https://openqa.suse.de/tests/9975435#) are about to fail as well due to the same reason. 

 But, actually, all guests were successfully installed by the failed module: 
 ~~~ 
 fozzie-1:~ # virsh list --all 
  Id     Name                          State 
 -------------------------------------------- 
  0      Domain-0                      running 
  -      sles-12-sp5-64-fv-def-net     shut off 
  -      sles-12-sp5-64-pv-def-net     shut off 

 fozzie-1:~ #  
 fozzie-1:~ # virsh start sles-12-sp5-64-fv-def-net 
 Domain sles-12-sp5-64-fv-def-net started 

 fozzie-1:~ # virsh start sles-12-sp5-64-pv-def-net 
 Domain sles-12-sp5-64-pv-def-net started 

 fozzie-1:~ #  
 fozzie-1:~ # virsh list --all 
  Id     Name                          State 
 ------------------------------------------- 
  0      Domain-0                      running 
  22     sles-12-sp5-64-fv-def-net     running 
  23     sles-12-sp5-64-pv-def-net     running 

 fozzie-1:~ #  

 --- 192.168.123.10 ping statistics --- 
 5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 4088ms 
 pipe 4 
 fozzie-1:~ # ping -c5 192.168.123.10 
 PING 192.168.123.10 (192.168.123.10) 56(84) bytes of data. 
 64 bytes from 192.168.123.10: icmp_seq=1 ttl=64 time=2315 ms 
 64 bytes from 192.168.123.10: icmp_seq=2 ttl=64 time=1294 ms 
 64 bytes from 192.168.123.10: icmp_seq=3 ttl=64 time=270 ms 
 64 bytes from 192.168.123.10: icmp_seq=4 ttl=64 time=0.314 ms 
 64 bytes from 192.168.123.10: icmp_seq=5 ttl=64 time=0.285 ms 

 --- 192.168.123.10 ping statistics --- 
 5 packets transmitted, 5 received, 0% packet loss, time 4060ms 
 rtt min/avg/max/mdev = 0.285/776.139/2315.106/905.173 ms, pipe 3 
 fozzie-1:~ # ping -c5 192.168.123.11 
 PING 192.168.123.11 (192.168.123.11) 56(84) bytes of data. 
 64 bytes from 192.168.123.11: icmp_seq=1 ttl=64 time=0.352 ms 
 64 bytes from 192.168.123.11: icmp_seq=2 ttl=64 time=0.239 ms 
 64 bytes from 192.168.123.11: icmp_seq=3 ttl=64 time=0.217 ms 
 64 bytes from 192.168.123.11: icmp_seq=4 ttl=64 time=0.213 ms 
 64 bytes from 192.168.123.11: icmp_seq=5 ttl=64 time=0.227 ms 

 --- 192.168.123.11 ping statistics --- 
 5 packets transmitted, 5 received, 0% packet loss, time 4083ms 
 rtt min/avg/max/mdev = 0.213/0.249/0.352/0.054 ms 

 fozzie-1:~ # ssh 192.168.123.10 
 Warning: Permanently added '192.168.123.10' (ECDSA) to the list of known hosts. 
 Password:  
 linux:~ #  
 linux:~ # cat /etc/os-release  
 .bash_history    .cache/          .dbus/           .gnupg/          .kbd/            bin/             inst-sys/       
 linux:~ # cat /etc/os-release  
 NAME="SLES" 
 VERSION="12-SP5" 
 VERSION_ID="12.5" 
 PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5" 
 ID="sles" 
 ANSI_COLOR="0;32" 
 CPE_NAME="cpe:/o:suse:sles:12:sp5" 
 linux:~ # exit 
 logout 
 Connection to 192.168.123.10 closed. 
 fozzie-1:~ # ssh 192.168.123.11 
 Warning: Permanently added '192.168.123.11' (ECDSA) to the list of known hosts. 
 Password:  
 linux:~ # cat /etc/os-release  
 NAME="SLES" 
 VERSION="12-SP5" 
 VERSION_ID="12.5" 
 PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5" 
 ID="sles" 
 ANSI_COLOR="0;32" 
 CPE_NAME="cpe:/o:suse:sles:12:sp5" 
 ~~~ 

 And the entered cmd as below (wait_serial had been waiting for its return) already returned before wait_serial timed out, namely failed test run: 
 (rm /var/log/qa/old* /var/log/qa/ctcs2/* -r;/usr/share/qa/tools/test-VH-Upgrade-std-xen-sles12sp5-sles15sp5-run 02; echo CMD_FINISHED-339853) 2>&1 | tee -a /dev/sshserial\n 

 This can also been seen clearly from screenshot below: 
 ![](prj2_step2_wait_serial.png) 

 These tests has been keeping failing to pass since Build40.1. Sometimes they can pass this module and go further, for example, [this one](https://openqa.suse.de/tests/9959802#step/reboot_and_wait_up_upgrade/1). But they failed most times. Fortunately some other tests, which use and run the same test module in the same way, can pass completely, for example, [passed 1 worker2:20](https://openqa.suse.de/tests/9976710) and [passed 2 worker2:19](https://openqa.suse.de/tests/9968680). There is no such problem being spotted in earlier days. 

 Probably there is something wrong with openQA infra recently. 

 ## Steps to reproduce 
 * Use reproducer mentioned in #120651#note-45 
 * Trigger test run with openQA for tests prj2_host_upgrade_sles12sp5_to_developing_xen or prj2_host_upgrade_sles12sp5_to_developing_xen on a worker slot on openqaworker2 (note that slots have been moved to grenache-1 by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/465 so supposedly at least one slot needed to be moved back to openqaworker2) 
     * Wait for results of whether they fail at wait_serial time-out at step host_upgrade_step2_run 

 ## Impact 
 This affects overall virtualization test run efficiency and prevents test results from being generated in a timely manner, because we have to re-run many times and, at the same time, investigate the issue. 

 ## Problem 
 Looks like problem with openQA infra, including networking connection, worker2 performance and os-autoinst engine. 

 ## Acceprance criteria 
 * **AC1:** No more test timeouts 

 ## Suggestion 
 1. Look into openQA infra connection and network performance 
 2. Look into worker2 healthy status 
 3. Look into os-autoinst engine 

 ## Workaround 
 None 

Back