action #127337

closed

Some s390x workers have been failing all jobs for the last 11 months

Added by syrianidou_sofia about 1 year ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2023-04-06
Due date:
% Done: 0%
Estimated time:
Tags:

Description

There are quite a few s390x failures on booting. After comparing test results, we observed that the workers
grenache-1:30
grenache-1:40
grenache-1:43
grenache-1:44

have been constantly failing for all jobs for the last 11 months. It seems that before this period the workers (besides grenache-1:44, which was already s390x) were used for ipm and ppc tests and have now been switched to serving s390x.

A very frequent error message is: Test died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 187.

Possibly related to: https://progress.opensuse.org/issues/108266

Note: we noticed this issue after creating a generic s390x-kvm machine to take advantage of the s390-kvm WORKER_CLASS. It seems that the other current s390x jobs, which are designated to use s390x-kvm-sle12/15, do not run on these workers. For more information, see: https://progress.opensuse.org/issues/126293


Related issues 7 (3 open, 4 closed)

Related to openQA Infrastructure - action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources (Resolved, mgrifalconi)
Related to openQA auto review - openqa-force-result #134807: [qe-core] test fails in bootloader_start - missing assets on s390 (New)
Related to openQA Infrastructure - action #108266: grenache: script_run() commands randomly time out since server room move (New, 2022-03-14)
Related to qe-yam - action #126293: Change WORKER_CLASS from 390-kvm-sle{12,15} to s390-kvm (Resolved, syrianidou_sofia, 2023-03-21)
Related to openQA Infrastructure - action #51836: Manage (parts) of s390 kvm instances (formerly s390p7 and s390p8) with salt (Resolved, okurz, 2019-05-22)
Related to openQA Tests - action #137387: [s390x][qe-core] worker imagetester can't reach SUT - turn job timeout into module failure (New, 2023-10-04)
Related to qe-yam - action #138185: Change WORKER_CLASS from s390-kvm-sle{12,15} to s390-kvm (Resolved, rainerkoenig, 2023-10-18)

Actions #1

Updated by syrianidou_sofia about 1 year ago

  • Project changed from 175 to openQA Infrastructure
Actions #2

Updated by syrianidou_sofia about 1 year ago

  • Tags set to infra
Actions #3

Updated by okurz about 1 year ago

  • Target version set to future
Actions #5

Updated by syrianidou_sofia about 1 year ago

  • Description updated (diff)
Actions #6

Updated by syrianidou_sofia about 1 year ago

  • Description updated (diff)
Actions #7

Updated by okurz 7 months ago

  • Related to action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources added
Actions #8

Updated by okurz 7 months ago

Actions #9

Updated by okurz 7 months ago

  • Related to action #108266: grenache: script_run() commands randomly time out since server room move added
Actions #10

Updated by okurz 7 months ago

  • Related to action #126293: Change WORKER_CLASS from 390-kvm-sle{12,15} to s390-kvm added
Actions #12

Updated by okurz 7 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from future to Ready

From SQL on OSD:

select count(j.id), wp.value
from jobs j
join workers w on w.id = j.assigned_worker_id
join worker_properties wp on wp.worker_id = w.id
where result = 'timeout_exceeded'
  and arch = 's390x'
  and t_finished >= '2023-10-05'
  and wp.key = 'WORKER_CLASS'
  and wp.value ~ 's390-kvm'
group by wp.value
order by count desc;

 count |                                    value                                     
-------+------------------------------------------------------------------------------
     8 | s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm110,nue,frankencampus,imagetester
     8 | s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm111,nue,frankencampus,imagetester
     8 | s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm112,nue,frankencampus,imagetester
     8 | s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm114,nue,frankencampus,imagetester
     8 | s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm115,nue,frankencampus,imagetester
     4 | s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm113,nue,frankencampus,imagetester
(6 rows)

s390zp17 seems to be misbehaving.
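
To check where the timeouts cluster, the rows of the query result can be grouped by the s390zpNN token inside each WORKER_CLASS value. This is only a minimal Python sketch: it assumes the result was copied in by hand as (count, value) pairs, and the helper name timeouts_per_host is hypothetical.

```python
import re
from collections import Counter

# (count, WORKER_CLASS) rows copied from the query result in this comment.
ROWS = [
    (8, "s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm110,nue,frankencampus,imagetester"),
    (8, "s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm111,nue,frankencampus,imagetester"),
    (8, "s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm112,nue,frankencampus,imagetester"),
    (8, "s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm114,nue,frankencampus,imagetester"),
    (8, "s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm115,nue,frankencampus,imagetester"),
    (4, "s390-kvm,s390-kvm-sle12-mm,s390zp17,s390kvm113,nue,frankencampus,imagetester"),
]

# Assumption: the hypervisor host appears as a class token such as "s390zp17".
HOST_RE = re.compile(r"^s390zp\d+$")


def timeouts_per_host(rows):
    """Sum the timeout counts per hypervisor host named in WORKER_CLASS."""
    totals = Counter()
    for count, worker_class in rows:
        for token in worker_class.split(","):
            if HOST_RE.match(token):
                totals[token] += count
    return dict(totals)


if __name__ == "__main__":
    print(timeouts_per_host(ROWS))
```

With the six rows above, every timeout is attributed to s390zp17 and none to s390zp18 or s390zp19.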

for i in s390zp17.suse.de s390zp18.suse.de s390zp19.suse.de; do echo "### $i" && ssh_nt root@$i "grep processors /proc/cpuinfo; grep MemTotal /proc/meminfo"; done
### s390zp17.suse.de
Warning: Permanently added 's390zp17.suse.de,10.161.159.122' (ECDSA) to the list of known hosts.
# processors    : 6
MemTotal:        8040836 kB
### s390zp18.suse.de
Warning: Permanently added 's390zp18.suse.de,10.161.159.123' (ECDSA) to the list of known hosts.
# processors    : 6
MemTotal:       32980380 kB
### s390zp19.suse.de
Warning: Permanently added 's390zp19.suse.de,10.161.159.124' (ECDSA) to the list of known hosts.
# processors    : 6
MemTotal:       32811420 kB

I think s390zp17 does not have enough memory: it has roughly 8 GB, while s390zp18 and s390zp19 each have roughly 32 GB.
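
The gap is easier to compare with the MemTotal values converted to GiB. A minimal Python sketch follows, assuming the grep output has been collected per host as text; the dictionary and the function name mem_total_gib are illustrative only.

```python
import re

# MemTotal lines as reported by the three hosts in the output above.
MEMINFO = {
    "s390zp17.suse.de": "MemTotal:        8040836 kB",
    "s390zp18.suse.de": "MemTotal:       32980380 kB",
    "s390zp19.suse.de": "MemTotal:       32811420 kB",
}


def mem_total_gib(line):
    """Convert a /proc/meminfo MemTotal line (value in kB) to GiB."""
    match = re.match(r"MemTotal:\s+(\d+)\s+kB", line)
    if match is None:
        raise ValueError(f"not a MemTotal line: {line!r}")
    return int(match.group(1)) / (1024 * 1024)


if __name__ == "__main__":
    for host, line in MEMINFO.items():
        print(f"{host}: {mem_total_gib(line):.1f} GiB")
```

By this calculation s390zp17 has about a quarter of the memory of each of the other two hosts.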

For that I created
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/642

On imagetester I ran systemctl disable --now openqa-worker-auto-restart@{12,16,17,18} but tests still take very long. This is what I find in logs such as the one from https://openqa.suse.de/tests/12392304:

[2023-10-05T20:07:14.427909Z] [debug] [pid:16837] tests/installation/bootloader_zkvm.pm:73 called bootloader_setup::zkvm_add_disk -> lib/bootloader_setup.pm:1105 called backend::console_proxy::__ANON__
[2023-10-05T20:07:14.428177Z] [debug] [pid:16837] <<< backend::console_proxy::__ANON__(wrapped_call={
    "args" => [
                "find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-containers.qcow2 | head -n1 | tr -d '\n'"
              ],
    "function" => "get_cmd_output",
    "wantarray" => "",
    "console" => "svirt"
  })
[2023-10-05T20:07:14.431906Z] [debug] [pid:16838] <<< backend::baseclass::run_ssh_cmd(cmd="find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-containers.qcow2 | head -n1 | tr -d '\n'", password="SECRET", username=undef, keep_open=1, hostname="s390zp17.suse.de", wantarray=1)
[2023-10-05T20:07:14.432571Z] [debug] [pid:16838] <<< backend::baseclass::run_ssh(cmd="find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-containers.qcow2 | head -n1 | tr -d '\n'", keep_open=1, wantarray=1, hostname="s390zp17.suse.de", username=undef, password="SECRET")
[2023-10-05T20:07:14.433236Z] [debug] [pid:16838] <<< backend::baseclass::new_ssh_connection(blocking=1, username=undef, hostname="s390zp17.suse.de", keep_open=1, wantarray=1, password="SECRET")
[2023-10-05T20:07:14.483656Z] [debug] [pid:16838] Use existing SSH connection (key:hostname=s390zp17.suse.de,username=root,port=22)
(stuck here very long)

On s390zp17 I find

s390zp17:~ # pgrep -a find
3316 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.3.0-Default-kvm-GM.qcow2
4827 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP2-s390x-containers.qcow2
6009 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.5.0-Default-GM.raw.xz
8241 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP1-s390x-mru-install-minimal-with-addons-Build20231002-1-Server-DVD-Updates-s390x-kvm.qcow2
9113 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP2-s390x-mru-install-minimal-with-addons-Build20231002-1-Server-DVD-Updates-s390x-kvm.qcow2
9932 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-minimal_installed_for_LTP.qcow2
10765 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.3.0-Default-kvm-GM-Updated.qcow2
11062 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-Installtest-HA.qcow2
11182 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-Installtest-HA.qcow2
11307 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-mru-install-minimal-with-addons-Build20231002-1-Server-DVD-Updates-s390x-kvm.qcow2
11443 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-Installtest-HA.qcow2
11497 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-Installtest-HA.qcow2
12048 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.2.0-Default-kvm-GM.qcow2
12571 find /var/lib/openqa/share/factory/hdd -name sle-15-SP4-s390x-5.14.21-150400.1360.1.g1354c58-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
12751 find /var/lib/openqa/share/factory/hdd -name sle-15-SP4-s390x-5.14.21-150400.1360.1.g1354c58-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
13005 find /var/lib/openqa/share/factory/hdd -name sle-15-SP5-s390x-5.14.21-150500.272.1.g353358d-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
13490 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP1-s390x-mru-install-minimal-with-addons-Build20231002-1-Server-DVD-Updates-s390x-kvm.qcow2
13803 find /var/lib/openqa/share/factory/hdd -name sle-15-SP5-s390x-5.14.21-150500.272.1.g353358d-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
14675 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.3.0-Default-kvm-GM.qcow2
14901 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.3.0-Default-kvm-GM.qcow2
15665 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP2-s390x-mru-install-minimal-with-addons-Build20231002-1-Server-DVD-Updates-s390x-kvm.qcow2
15710 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-minimal_installed_for_LTP.qcow2
16546 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP2-s390x-Installtest-HA.qcow2
16793 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-Installtest-HA.qcow2
16874 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-Installtest-HA.qcow2
17061 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP1-s390x-Installtest.qcow2
17258 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP3-s390x-Installtest-HA.qcow2
18069 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-Installtest-HA.qcow2
20108 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.4.0-Default-GM.raw.xz
20615 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-Installtest.qcow2
21063 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.5.0-Default-GM.raw.xz
23728 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-mru-install-minimal-with-addons-Build20231002-1-Server-DVD-Updates-s390x-kvm.qcow2
24204 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP2-s390x-mru-install-minimal-with-addons-Build20231002-1-Server-DVD-Updates-s390x-kvm.qcow2
24624 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP2-s390x-mru-install-minimal-with-addons-Build20231002-1-Server-DVD-Updates-s390x-kvm.qcow2
24998 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP2-s390x-mru-install-minimal-with-addons-Build20231002-1-Server-DVD-Updates-s390x-kvm.qcow2
25243 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-Installtest.qcow2
25581 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-containers.qcow2
26011 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-Installtest.qcow2
26194 find /var/lib/openqa/share/factory/hdd -name SLES-12-SP5-s390x-Installtest-HA.qcow2
26719 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.5.0-Default-GM.raw.xz
26747 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.4.0-Default-GM.raw.xz
27103 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-containers.qcow2
29731 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.2.0-Default-kvm-GM.qcow2
29788 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.1.0-Default-kvm-GM.qcow2
29945 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.1.0-Default-kvm-GM.qcow2
29973 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.1.0-Default-kvm-GM.qcow2
30087 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.2.0-Default-kvm-GM.qcow2
33636 find /var/lib/openqa/share/factory/hdd -name sle-12-SP5-s390x-20231003-1-Server-DVD-Updates@s390x-kvm-with-ltp.qcow2
35237 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
35470 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
35498 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
35655 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
36975 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-mru-install-desktop-with-addons_security-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
38261 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP3-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
38461 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP3-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
38489 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP3-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
41526 find /var/lib/openqa/share/factory/hdd -name SLE-Micro.s390x-5.4.0-Default-GM.raw.xz
42568 find /var/lib/openqa/share/factory/hdd -name sle-15-SP5-s390x-118.2-textmode@s390x-kvm-sle12-common_criteria.qcow2
42736 find /var/lib/openqa/share/factory/hdd -name sle-15-SP5-s390x-118.2-textmode@s390x-kvm-sle12-common_criteria.qcow2
42824 find /var/lib/openqa/share/factory/hdd -name sle-15-SP5-s390x-118.2-textmode@s390x-kvm-sle12-common_criteria.qcow2
42904 find /var/lib/openqa/share/factory/hdd -name sle-15-SP5-s390x-118.2-textmode@s390x-kvm-sle12-common_criteria.qcow2
44253 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP2-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
44860 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
44974 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
45717 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP3-s390x-containers.qcow2
46187 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP4-s390x-containers.qcow2
46785 find /var/lib/openqa/share/factory/hdd -name SLES-12-SP5-s390x-containers.qcow2
47005 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP1-s390x-containers.qcow2
47120 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP3-s390x-containers.qcow2
48376 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP3-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
48891 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
49323 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP3-s390x-containers.qcow2
49747 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-containers.qcow2
49775 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP1-s390x-containers.qcow2
50110 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP3-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
50404 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP3-s390x-mru-install-minimal-with-addons-Build20231003-1-Server-DVD-Updates-s390x-kvm.qcow2
51923 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP1-s390x-containers.qcow2
52309 find /var/lib/openqa/share/factory/hdd -name SLES-15-SP5-s390x-containers.qcow2
…

The list goes on, so those processes are effectively stuck. From ps auxf | grep -c '\<D\>' I see 315 processes in D state (uninterruptible sleep). w shows that the system has been up for 604 days, so it is not properly maintained, is out of date, and relies on openQA assets over NFS in a bad way.
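
One caveat on the count: grep -c '\<D\>' matches a standalone D anywhere on the line, so it can also count processes whose arguments contain something like -D. A more conservative check looks only at the state column. The Python sketch below assumes output in the shape produced by ps -eo pid=,stat=,args=; the helper names are hypothetical and the sample text is made up for illustration, not captured from s390zp17.

```python
import subprocess


def live_ps_output():
    """Return pid, state and command line for all processes (Linux procps ps)."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,stat=,args="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def d_state_processes(ps_output):
    """Return (pid, args) for processes whose state starts with D."""
    stuck = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pid, stat = fields[0], fields[1]
        args = fields[2] if len(fields) > 2 else ""
        # The state may carry modifiers such as "D+" or "Ds".
        if stat.startswith("D"):
            stuck.append((int(pid), args))
    return stuck


# Illustrative sample only; the sshd line would be counted by the grep above.
SAMPLE = """\
 3316 D    find /var/lib/openqa/share/factory/hdd -name example.qcow2
 4000 Ss   /usr/sbin/sshd -D
 4827 D+   find /var/lib/openqa/share/factory/hdd -name other.qcow2
"""

if __name__ == "__main__":
    for pid, args in d_state_processes(SAMPLE):
        print(pid, args)
```

On a live host, d_state_processes(live_ps_output()) would give the list to compare against the grep count.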

I called

umount --force --lazy /var/lib/openqa/share && mount -a

but this is only a quick fix, not a long-term solution. This brings us back to #51836

Actions #13

Updated by okurz 7 months ago

  • Related to action #51836: Manage (parts) of s390 kvm instances (formerly s390p7 and s390p8) with salt added
Actions #14

Updated by okurz 7 months ago

  • Related to action #137387: [s390x][qe-core] worker imagetester can't reach SUT - turn job timeout into module failure added
Actions #15

Updated by okurz 7 months ago

  • Status changed from In Progress to Resolved

https://openqa.suse.de/tests/12394907 is ok now. The rest is for #137387 and #51836

Actions #16

Updated by leli 6 months ago

  • Related to action #138185: Change WORKER_CLASS from s390-kvm-sle{12,15} to s390-kvm added