Project

General

Profile

Actions

action #120651

open

[openQA][infra][ipmi][worker][api] The expected pattern CMD_FINISHED-xxxxx returned but did not show up in serial log (wait_serial timed out) size:M

Added by waynechen55 over 1 year ago. Updated 8 months ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
-
Target version:
Start date:
2022-11-17
Due date:
% Done:

0%

Estimated time:

Description

Observation

Recently, I found some new failures share the same symptom and failed due to the same reason, for example:
test 9968639 worker2:18 wait_serial timed out after waiting for more than 10 hours.
test 9962611 worker2:18 wait_serial timed out after waiting for more than 10 hours.
These two 9973067 worker2:18 and 9975435 worker2:19 are about to fail as well due to the same reason.

But, actually, all guests were successfully installed by the failed module:

fozzie-1:~ # virsh list --all
 Id   Name                        State
--------------------------------------------
 0    Domain-0                    running
 -    sles-12-sp5-64-fv-def-net   shut off
 -    sles-12-sp5-64-pv-def-net   shut off

fozzie-1:~ # 
fozzie-1:~ # virsh start sles-12-sp5-64-fv-def-net
Domain sles-12-sp5-64-fv-def-net started

fozzie-1:~ # virsh start sles-12-sp5-64-pv-def-net
Domain sles-12-sp5-64-pv-def-net started

fozzie-1:~ # 
fozzie-1:~ # virsh list --all
 Id   Name                        State
-------------------------------------------
 0    Domain-0                    running
 22   sles-12-sp5-64-fv-def-net   running
 23   sles-12-sp5-64-pv-def-net   running

fozzie-1:~ # 

--- 192.168.123.10 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 4088ms
pipe 4
fozzie-1:~ # ping -c5 192.168.123.10
PING 192.168.123.10 (192.168.123.10) 56(84) bytes of data.
64 bytes from 192.168.123.10: icmp_seq=1 ttl=64 time=2315 ms
64 bytes from 192.168.123.10: icmp_seq=2 ttl=64 time=1294 ms
64 bytes from 192.168.123.10: icmp_seq=3 ttl=64 time=270 ms
64 bytes from 192.168.123.10: icmp_seq=4 ttl=64 time=0.314 ms
64 bytes from 192.168.123.10: icmp_seq=5 ttl=64 time=0.285 ms

--- 192.168.123.10 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4060ms
rtt min/avg/max/mdev = 0.285/776.139/2315.106/905.173 ms, pipe 3
fozzie-1:~ # ping -c5 192.168.123.11
PING 192.168.123.11 (192.168.123.11) 56(84) bytes of data.
64 bytes from 192.168.123.11: icmp_seq=1 ttl=64 time=0.352 ms
64 bytes from 192.168.123.11: icmp_seq=2 ttl=64 time=0.239 ms
64 bytes from 192.168.123.11: icmp_seq=3 ttl=64 time=0.217 ms
64 bytes from 192.168.123.11: icmp_seq=4 ttl=64 time=0.213 ms
64 bytes from 192.168.123.11: icmp_seq=5 ttl=64 time=0.227 ms

--- 192.168.123.11 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4083ms
rtt min/avg/max/mdev = 0.213/0.249/0.352/0.054 ms

fozzie-1:~ # ssh 192.168.123.10
Warning: Permanently added '192.168.123.10' (ECDSA) to the list of known hosts.
Password: 
linux:~ # 
linux:~ # cat /etc/os-release 
.bash_history  .cache/        .dbus/         .gnupg/        .kbd/          bin/           inst-sys/      
linux:~ # cat /etc/os-release 
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"
linux:~ # exit
logout
Connection to 192.168.123.10 closed.
fozzie-1:~ # ssh 192.168.123.11
Warning: Permanently added '192.168.123.11' (ECDSA) to the list of known hosts.
Password: 
linux:~ # cat /etc/os-release 
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"

And the entered cmd as below (wait_serial had been waiting for its return) already returned before wait_serial timed out, namely failed test run:
(rm /var/log/qa/old* /var/log/qa/ctcs2/* -r;/usr/share/qa/tools/test-VH-Upgrade-std-xen-sles12sp5-sles15sp5-run 02; echo CMD_FINISHED-339853) 2>&1 | tee -a /dev/sshserial\n

This can also been seen clearly from screenshot below:

These tests has been keeping failing to pass since Build40.1. Sometimes they can pass this module and go further, for example, this one. But they failed most times. Fortunately some other tests, which use and run the same test module in the same way, can pass completely, for example, passed 1 worker2:20 and passed 2 worker2:19. There is no such problem being spotted in earlier days.

Probably there is something wrong with openQA infra recently.

Steps to reproduce

  • Use reproducer mentioned in #120651#note-45
  • Trigger test run with openQA for tests prj2_host_upgrade_sles12sp5_to_developing_xen or prj2_host_upgrade_sles12sp5_to_developing_xen on a worker slot on openqaworker2 (note that slots have been moved to grenache-1 by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/465 so supposedly at least one slot needed to be moved back to openqaworker2)
    • Wait for results of whether they fail at wait_serial time-out at step host_upgrade_step2_run

Impact

This affects overall virtualization test run efficiency and prevents test results from being generated in a timely manner, because we have to re-run many times and, at the same time, investigate the issue.

Problem

Looks like problem with openQA infra, including networking connection, worker2 performance and os-autoinst engine.

Acceprance criteria

  • AC1: No more test timeouts

Suggestion

  1. Look into openQA infra connection and network performance
  2. Look into worker2 healthy status
  3. Look into os-autoinst engine

Workaround

None


Files

prj2_step2_wait_serial.png (113 KB) prj2_step2_wait_serial.png waynechen55, 2022-11-17 03:13
dev_sshserial_feed_01.png (96.7 KB) dev_sshserial_feed_01.png waynechen55, 2022-11-21 01:45
dev_sshserial_feed_02.png (63.4 KB) dev_sshserial_feed_02.png waynechen55, 2022-11-21 01:45

Related issues 3 (0 open3 closed)

Related to QA - action #119446: Conduct the migration of SUSE openQA+QA systems from Nbg SRV2 to new security zonesResolvedokurz2022-09-15

Actions
Related to openQA Infrastructure - action #122302: Support SD-105827 "PowerPC often fails to boot from network with 'error: time out opening'"Resolvednicksinger2022-12-212023-02-02

Actions
Related to openQA Infrastructure - action #120270: Conduct the migration of SUSE openQA systems IPMI from Nbg SRV1 to new security zones size:MResolvedmkittler

Actions
Actions

Also available in: Atom PDF