action #120651: [openQA][infra][ipmi][worker][api] The expected pattern CMD_FINISHED-xxxxx returned but did not show up in serial log (wait_serial timed out) size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

action #120651

open

[openQA][infra][ipmi][worker][api] The expected pattern CMD_FINISHED-xxxxx returned but did not show up in serial log (wait_serial timed out) size:M

Added by waynechen55 over 2 years ago. Updated over 1 year ago.

Status:

New

Priority:

Low

Assignee:

Category:

Target version:

QA (public) - future

Start date:

2022-11-17

Due date:

% Done:

Estimated time:

Tags:

infra, reactive work

Description

Observation¶

Recently, I found some new failures share the same symptom and failed due to the same reason, for example:
test 9968639 worker2:18 wait_serial timed out after waiting for more than 10 hours.
test 9962611 worker2:18 wait_serial timed out after waiting for more than 10 hours.
These two 9973067 worker2:18 and 9975435 worker2:19 are about to fail as well due to the same reason.

But, actually, all guests were successfully installed by the failed module:

fozzie-1:~ # virsh list --all
 Id   Name                        State
--------------------------------------------
 0    Domain-0                    running
 -    sles-12-sp5-64-fv-def-net   shut off
 -    sles-12-sp5-64-pv-def-net   shut off

fozzie-1:~ # 
fozzie-1:~ # virsh start sles-12-sp5-64-fv-def-net
Domain sles-12-sp5-64-fv-def-net started

fozzie-1:~ # virsh start sles-12-sp5-64-pv-def-net
Domain sles-12-sp5-64-pv-def-net started

fozzie-1:~ # 
fozzie-1:~ # virsh list --all
 Id   Name                        State
-------------------------------------------
 0    Domain-0                    running
 22   sles-12-sp5-64-fv-def-net   running
 23   sles-12-sp5-64-pv-def-net   running

fozzie-1:~ # 

--- 192.168.123.10 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 4088ms
pipe 4
fozzie-1:~ # ping -c5 192.168.123.10
PING 192.168.123.10 (192.168.123.10) 56(84) bytes of data.
64 bytes from 192.168.123.10: icmp_seq=1 ttl=64 time=2315 ms
64 bytes from 192.168.123.10: icmp_seq=2 ttl=64 time=1294 ms
64 bytes from 192.168.123.10: icmp_seq=3 ttl=64 time=270 ms
64 bytes from 192.168.123.10: icmp_seq=4 ttl=64 time=0.314 ms
64 bytes from 192.168.123.10: icmp_seq=5 ttl=64 time=0.285 ms

--- 192.168.123.10 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4060ms
rtt min/avg/max/mdev = 0.285/776.139/2315.106/905.173 ms, pipe 3
fozzie-1:~ # ping -c5 192.168.123.11
PING 192.168.123.11 (192.168.123.11) 56(84) bytes of data.
64 bytes from 192.168.123.11: icmp_seq=1 ttl=64 time=0.352 ms
64 bytes from 192.168.123.11: icmp_seq=2 ttl=64 time=0.239 ms
64 bytes from 192.168.123.11: icmp_seq=3 ttl=64 time=0.217 ms
64 bytes from 192.168.123.11: icmp_seq=4 ttl=64 time=0.213 ms
64 bytes from 192.168.123.11: icmp_seq=5 ttl=64 time=0.227 ms

--- 192.168.123.11 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4083ms
rtt min/avg/max/mdev = 0.213/0.249/0.352/0.054 ms

fozzie-1:~ # ssh 192.168.123.10
Warning: Permanently added '192.168.123.10' (ECDSA) to the list of known hosts.
Password: 
linux:~ # 
linux:~ # cat /etc/os-release 
.bash_history  .cache/        .dbus/         .gnupg/        .kbd/          bin/           inst-sys/      
linux:~ # cat /etc/os-release 
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"
linux:~ # exit
logout
Connection to 192.168.123.10 closed.
fozzie-1:~ # ssh 192.168.123.11
Warning: Permanently added '192.168.123.11' (ECDSA) to the list of known hosts.
Password: 
linux:~ # cat /etc/os-release 
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"

And the entered cmd as below (wait_serial had been waiting for its return) already returned before wait_serial timed out, namely failed test run:
(rm /var/log/qa/old* /var/log/qa/ctcs2/* -r;/usr/share/qa/tools/test-VH-Upgrade-std-xen-sles12sp5-sles15sp5-run 02; echo CMD_FINISHED-339853) 2>&1 | tee -a /dev/sshserial\n

This can also been seen clearly from screenshot below:

These tests has been keeping failing to pass since Build40.1. Sometimes they can pass this module and go further, for example, this one. But they failed most times. Fortunately some other tests, which use and run the same test module in the same way, can pass completely, for example, passed 1 worker2:20 and passed 2 worker2:19. There is no such problem being spotted in earlier days.

Probably there is something wrong with openQA infra recently.

Steps to reproduce¶

Use reproducer mentioned in #120651#note-45
Trigger test run with openQA for tests prj2_host_upgrade_sles12sp5_to_developing_xen or prj2_host_upgrade_sles12sp5_to_developing_xen on a worker slot on openqaworker2 (note that slots have been moved to grenache-1 by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/465 so supposedly at least one slot needed to be moved back to openqaworker2)
- Wait for results of whether they fail at wait_serial time-out at step host_upgrade_step2_run

Impact¶

This affects overall virtualization test run efficiency and prevents test results from being generated in a timely manner, because we have to re-run many times and, at the same time, investigate the issue.

Problem¶

Looks like problem with openQA infra, including networking connection, worker2 performance and os-autoinst engine.

Acceprance criteria¶

AC1: No more test timeouts

Suggestion¶

Look into openQA infra connection and network performance
Look into worker2 healthy status
Look into os-autoinst engine

Workaround¶

None

Files

Download all files

prj2_step2_wait_serial.png (113 KB) prj2_step2_wait_serial.png		waynechen55, 2022-11-17 03:13
dev_sshserial_feed_01.png (96.7 KB) dev_sshserial_feed_01.png		waynechen55, 2022-11-21 01:45
dev_sshserial_feed_02.png (63.4 KB) dev_sshserial_feed_02.png		waynechen55, 2022-11-21 01:45

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #120651

[openQA][infra][ipmi][worker][api] The expected pattern CMD_FINISHED-xxxxx returned but did not show up in serial log (wait_serial timed out) size:M

Observation¶

Steps to reproduce¶

Impact¶

Problem¶

Acceprance criteria¶

Suggestion¶

Workaround¶

Updated by waynechen55 over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by okurz over 2 years ago

Updated by Julie_CAO over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by okurz over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by okurz over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by mkittler over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by okurz over 2 years ago

Updated by mkittler over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago

Updated by waynechen55 over 2 years ago

Firstly:¶

Quoting from #120651#note-15:¶

Quoting from #120651#note-16:¶

At the last:¶

Updated by waynechen55 over 2 years ago

Updated by livdywan over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by mkittler over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago

Updated by okurz over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by mkittler over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago

Updated by okurz over 2 years ago

Updated by mkittler over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by waynechen55 over 2 years ago

Updated by okurz over 2 years ago

Updated by livdywan about 2 years ago

Updated by mkittler about 2 years ago

Updated by waynechen55 about 2 years ago

Updated by okurz about 2 years ago

Updated by okurz about 2 years ago