action #137384

closed

[tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out" size:M

Added by dzedro about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: Infrastructure
Start date: 2023-10-04
Due date:
% Done: 0%
Estimated time:
Difficulty:
Tags:

Description

Observation

I think there are two variants of the same issue, seen in:
  • bootloader_start
  • bootloader_zkvm

So far I have seen it failing only on imagetester.qe.nue2.suse.org.
The bootloader_start variant is much worse because the job does not fail but times out at MAX_JOB_TIME.

It fails with:
Reason: backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out

Reproducible

Fails since (at least) Build 20231003-1 (current job)

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
openqa-query-for-job-label poo#137384

Expected result

Last good: 20231002-1 (or more recent)

Acceptance Criteria

  • AC1: It is known whether s390zl14.suse.de is usable as a production worker

Suggestions

  • As the error states an obvious "connection timed out" to s390zl14.suse.de, check whether that machine is reachable at all, e.g. sudo salt \* cmd.run 'ping -c1 s390zl14.suse.de'
    • sudo salt --no-color --out txt '*' cmd.run 'ping -c1 s390zl14.suse.de' reports unreachable or 100% packet loss, so the problem is not specific to imagetester
  • Also check our monitoring, which includes a ping check to various important hosts, and see if s390zl14 is covered there. We did not see an alert
  • If the problem persists, consider disabling production use of the worker instance in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls , e.g. s/390-kvm/&-poo137384/
  • Maybe s390zl14 is not expected to be usable in general. Crosscheck the git history of https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and ask whoever introduced it or maintains the machine or the corresponding tests
  • If the machine is not supposed to be working, remove the corresponding openQA worker instance and ensure that someone takes care that s390zl14 is properly used outside the context of OSD, so that no hardware is uselessly wasting power and destroying our nice earth
  • If the machine is supposed to be working as an OSD worker target, create a corresponding Eng-Infra ticket
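
Two of the suggestions above can be sketched as small shell helpers: recognizing total packet loss in ping output, and applying the quoted substitution s/390-kvm/&-poo137384/ to a worker class. This is only a rough sketch under assumptions: the function names tag_worker_class and all_packets_lost are hypothetical, the sample WORKER_CLASS line is made up for illustration and is not taken from workerconf.sls, and the packet-loss check assumes the summary line format printed by iputils ping.

```shell
#!/bin/sh

# Hypothetical helper: read worker configuration lines on stdin and append the
# ticket suffix to the first "390-kvm" match on each line, using the sed
# expression quoted in the ticket ("&" stands for the matched text).
tag_worker_class() {
    sed 's/390-kvm/&-poo137384/'
}

# Hypothetical helper: read ping output on stdin and succeed (exit 0) only if
# the summary reports total packet loss. The leading space prevents a match
# inside a longer number. Assumes iputils-style summary lines.
all_packets_lost() {
    grep -q ' 100% packet loss'
}
```

For example, `printf 'WORKER_CLASS: s390-kvm-sle12\n' | tag_worker_class` prints `WORKER_CLASS: s390-kvm-poo137384-sle12`, so jobs requesting the original class would no longer match this worker instance. The expression as quoted has no g flag, so a line listing several 390-kvm classes would only have the first one renamed.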

Out of scope

  • test code improvement, see #137387
  • Looking into fixing the machine and improving the setup

Further details

Always latest result in this scenario: latest


Related issues 3 (2 open, 1 closed)

  • Related to openQA auto review - openqa-force-result #134807: [qe-core] test fails in bootloader_start - missing assets on s390 (New)
  • Related to openQA Infrastructure (public) - action #134912: Gradually phase out NUE1 based openQA workers size:M (Resolved, okurz)
  • Copied to openQA Tests (public) - action #137387: [s390x][qe-core] worker imagetester can't reach SUT - turn job timeout into module failure (New, 2023-10-04)
