action #137384

closed

[tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out" size:M

Added by dzedro about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: Infrastructure
Start date: 2023-10-04
Due date:
% Done: 0%
Estimated time:
Difficulty:
Tags:

Description

Observation

I think there are two variants of the same issue.
I have seen it failing only on imagetester.qe.nue2.suse.org.
The bootloader_start variant is much worse because it does not fail but times out at MAX_JOB_TIME.
bootloader_start
bootloader_zkvm

fails with
Reason: backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out
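
To tell a network-level problem from a test problem, one can probe the SUT directly from the worker. The following is a minimal sketch, assuming the backend connects over SSH on TCP port 22 (the error message suggests an SSH login as root, but the port is not stated in this ticket). The helper name can_connect is made up for illustration.

```python
import socket


def can_connect(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False
```

Calling can_connect("s390zl14.suse.de") on the worker would be expected to return False for as long as the connection times out. A True result would instead point towards the test or backend configuration.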

Reproducible

Fails since (at least) Build 20231003-1 (current job)

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label, e.g. by calling
openqa-query-for-job-label poo#137384

Expected result

Last good: 20231002-1 (or more recent)

Acceptance Criteria

  • AC1: It is known if s390zl14.suse.de is usable as a production worker

Suggestions

  • As the error states an obvious "connection timed out" to s390zl14.suse.de, check whether that machine is generally reachable, e.g. sudo salt \* cmd.run 'ping -c1 s390zl14.suse.de'
    • sudo salt --no-color --out txt '*' cmd.run 'ping -c1 s390zl14.suse.de' says unreachable or 100% packet loss, so the problem is not specific to imagetester
  • Also check our monitoring, which includes a ping check to various important hosts, and see if s390zl14 is there. We did not see an alert.
  • If the problem persists, consider disabling the production use of the worker instance in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls, e.g. s/390-kvm/&-poo137384/
  • Maybe s390zl14 is not expected to be usable in general. Crosscheck the git history of https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and ask those who introduced it or who maintain the machine or the corresponding tests.
  • Maybe the machine is not supposed to be working. In that case remove the corresponding openQA worker instance and ensure that someone takes care that s390zl14 is properly used outside the context of OSD, so that no hardware is uselessly wasting power and destroying our nice earth.
  • If the machine is supposed to be working as an OSD worker target, create a corresponding Eng-Infra ticket.
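
The substitution s/390-kvm/&-poo137384/ suggested above relies on sed's "&", which stands for the whole matched text. The worker class is therefore extended with the ticket reference rather than replaced, presumably so that scheduled jobs no longer match this instance while the change stays easy to trace and revert. Here is a small sketch of the same rewrite, assuming a hypothetical config line, since the actual content of workerconf.sls is not shown in this ticket. The function name tag_worker_class is made up for illustration.

```python
import re


def tag_worker_class(config_text: str, ticket: str = "poo137384") -> str:
    """Mimic sed 's/390-kvm/&-<ticket>/': extend the first match on each line."""
    pattern = re.compile(r"390-kvm")
    # \g<0> is Python's spelling of sed's "&", i.e. the whole matched text.
    # count=1 mirrors sed without the "g" flag (first match per line only).
    return "\n".join(
        pattern.sub(rf"\g<0>-{ticket}", line, count=1)
        for line in config_text.split("\n")
    )


# Hypothetical line, not copied from the real workerconf.sls.
sample = "WORKER_CLASS: s390-kvm"
print(tag_worker_class(sample))  # WORKER_CLASS: s390-kvm-poo137384
```

Reverting is the inverse edit: removing the "-poo137384" suffix restores the original class name.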

Out of scope

  • test code improvement, see #137387
  • Looking into fixing the machine and improving the setup

Further details

Always latest result in this scenario: latest


Related issues 3 (2 open, 1 closed)

Related to openQA auto review - openqa-force-result #134807: [qe-core] test fails in bootloader_start - missing assets on s390 (New)
Related to openQA Infrastructure (public) - action #134912: Gradually phase out NUE1 based openQA workers size:M (Resolved, okurz)
Copied to openQA Tests (public) - action #137387: [s390x][qe-core] worker imagetester can't reach SUT - turn job timeout into module failure (New, 2023-10-04)

Actions #1

Updated by dzedro about 1 year ago

Actions #2

Updated by dzedro about 1 year ago

  • Subject changed from [tools][s390x] worker can't reach SUT on imagetester to [tools][s390x] worker imagetester can't reach SUT
Actions #3

Updated by okurz about 1 year ago

  • Tags set to infra
  • Category changed from Bugs in existing tests to Infrastructure
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #4

Updated by okurz about 1 year ago

  • Copied to action #137387: [s390x][qe-core] worker imagetester can't reach SUT - turn job timeout into module failure added
Actions #5

Updated by okurz about 1 year ago

  • Related to action #134912: Gradually phase out NUE1 based openQA workers size:M added
Actions #6

Updated by okurz about 1 year ago

  • Subject changed from [tools][s390x] worker imagetester can't reach SUT to [tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out"
  • Description updated (diff)
  • Assignee set to okurz
Actions #7

Updated by okurz about 1 year ago

  • Subject changed from [tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out" to [tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out" size:M
  • Status changed from New to In Progress
Actions #8

Updated by okurz about 1 year ago

  • Due date set to 2023-10-18
  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/629 is merged and effective. To handle the failures I called

export host=openqa.suse.de; ~/local/os-autoinst/scripts/openqa-monitor-investigation-candidates | ~/local/os-autoinst/scripts/openqa-label-known-issues-multi

openqa-query-for-job-label poo#137384 yields

12371628|2023-10-04 09:13:30|done|failed|python_3.11_on_SLES_12-SP5|backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out|imagetester
12368632|2023-10-04 04:51:49|done|timeout_exceeded|docker_tests|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369978|2023-10-04 04:23:18|done|failed|docker_tests|backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out|imagetester
12369985|2023-10-04 03:20:58|done|timeout_exceeded|mau-extratests-phub|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369893|2023-10-04 02:20:03|done|timeout_exceeded|mau-extratests2|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369891|2023-10-04 01:20:06|done|timeout_exceeded|mau-extratests-phub|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369892|2023-10-04 01:20:06|done|timeout_exceeded|mau-extratests1|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369890|2023-10-04 01:20:05|done|timeout_exceeded|mau-extratests-dracut|timeout: test execution exceeded MAX_JOB_TIME|imagetester
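
The listing above can be summarized per result to see how the two variants are distributed. This is a minimal sketch, assuming the pipe-separated columns mean job id, timestamp, state, result, test name, reason and worker host. Those field names are my reading of the output, not taken from the script's documentation.

```python
from collections import Counter
from typing import NamedTuple

# Rows copied from the query output in this ticket.
SAMPLE = """\
12371628|2023-10-04 09:13:30|done|failed|python_3.11_on_SLES_12-SP5|backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out|imagetester
12368632|2023-10-04 04:51:49|done|timeout_exceeded|docker_tests|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369978|2023-10-04 04:23:18|done|failed|docker_tests|backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out|imagetester
12369985|2023-10-04 03:20:58|done|timeout_exceeded|mau-extratests-phub|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369893|2023-10-04 02:20:03|done|timeout_exceeded|mau-extratests2|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369891|2023-10-04 01:20:06|done|timeout_exceeded|mau-extratests-phub|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369892|2023-10-04 01:20:06|done|timeout_exceeded|mau-extratests1|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369890|2023-10-04 01:20:05|done|timeout_exceeded|mau-extratests-dracut|timeout: test execution exceeded MAX_JOB_TIME|imagetester
"""


class JobRow(NamedTuple):
    # Field names are assumptions inferred from the values in each column.
    job_id: int
    timestamp: str
    state: str
    result: str
    test: str
    reason: str
    worker: str


def parse_rows(output: str) -> list[JobRow]:
    """Parse pipe-separated lines into JobRow tuples, skipping blank lines."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.strip().split("|")
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
        rows.append(JobRow(int(fields[0]), *fields[1:]))
    return rows


rows = parse_rows(SAMPLE)
print(Counter(row.result for row in rows))
```

For the eight rows above this counts six jobs with timeout_exceeded and two with failed, all on imagetester, which matches the two variants described in the observation.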
Actions #9

Updated by okurz about 1 year ago

  • Due date deleted (2023-10-18)
  • Status changed from Feedback to Resolved