action #137384
closed[tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out" size:M
Description
Observation
I think there are two variants of the same issue.
I have seen it failing only on imagetester.qe.nue2.suse.org.
bootloader_start is much worse because it does not fail but times out at MAX_JOB_TIME
bootloader_start
bootloader_zkvm
fails with
Reason: backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out
Reproducible
Fails since (at least) Build 20231003-1 (current job)
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label, e.g. by calling
openqa-query-for-job-label poo#137384
Expected result
Last good: 20231002-1 (or more recent)
Acceptance Criteria
- AC1: It is known whether s390zl14.suse.de is usable as a production worker
Suggestions
- As the error states an obvious "connection timed out" to s390zl14.suse.de, check whether that machine is generally reachable, e.g.
sudo salt \* cmd.run 'ping -c1 s390zl14.suse.de'
sudo salt --no-color --out txt '*' cmd.run 'ping -c1 s390zl14.suse.de'
says unreachable or 100% packet loss so it's not specific to imagetester
- Also check our monitoring, which includes a ping check to various important hosts, and see if s390zl14 is there. We did not see an alert
- If the problem persists consider disabling the production use of the worker instance in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls , e.g.
s/390-kvm/&-poo137384/
- Maybe s390zl14 is not expected to be usable in general. Crosscheck the git history of https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and ask those who introduced it or who maintain the machine or the corresponding tests
- According to 6d0f181b60681956847458369f92f636a7449e4b "Remove s390zl14 as required external host from monitoring" and the discussion following it, it seems expected that this mainframe is not usable for us
- https://racktables.nue.suse.com/?page=search&last_page=index&last_tab=default&q=s390zl14 says "nothing found" so ask around what that machine should be
- Maybe the machine is not supposed to be working. In that case remove the corresponding openQA worker instance and ensure that someone takes care that s390zl14 is properly used outside the context of OSD, so that no hardware is uselessly wasting power and destroying our nice earth
- If the machine is supposed to be working as OSD worker target, then create a corresponding Eng-Infra ticket
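To illustrate the substitution suggested above for workerconf.sls, here is a minimal sketch of what s/390-kvm/&-poo137384/ does to a worker class. The value "s390-kvm-sle12" and the helper name rename_worker_class are assumptions for illustration only; the real entries in workerconf.sls may look different.

```shell
# Hypothetical helper: appends the ticket reference to any worker class
# containing "390-kvm" so that production jobs no longer match it.
rename_worker_class() {
    # '&' in the replacement stands for the text matched by the pattern,
    # so "-poo137384" is inserted right after "390-kvm".
    sed 's/390-kvm/&-poo137384/'
}

# Assumed example value, not taken from the real workerconf.sls
echo 'WORKER_CLASS: s390-kvm-sle12' | rename_worker_class
# prints: WORKER_CLASS: s390-kvm-poo137384-sle12
```

Renaming the class instead of deleting the entry keeps the instance configured and leaves a pointer to this ticket, so reverting is a matter of dropping the suffix again.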
Out of scope
- test code improvement, see #137387
- Looking into fixing the machine and improving the setup
Further details
Always latest result in this scenario: latest
Updated by dzedro about 1 year ago
- Related to openqa-force-result #134807: [qe-core] test fails in bootloader_start - missing assets on s390 added
Updated by dzedro about 1 year ago
- Subject changed from [tools][s390x] worker can't reach SUT on imagetester to [tools][s390x] worker imagetester can't reach SUT
Updated by okurz about 1 year ago
- Tags set to infra
- Category changed from Bugs in existing tests to Infrastructure
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by okurz about 1 year ago
- Copied to action #137387: [s390x][qe-core] worker imagetester can't reach SUT - turn job timeout into module failure added
Updated by okurz about 1 year ago
- Related to action #134912: Gradually phase out NUE1 based openQA workers size:M added
Updated by okurz about 1 year ago
- Subject changed from [tools][s390x] worker imagetester can't reach SUT to [tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out"
- Description updated (diff)
- Assignee set to okurz
Updated by okurz about 1 year ago
- Subject changed from [tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out" to [tools][s390x] worker imagetester can't reach SUT auto_review:"backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out" size:M
- Status changed from New to In Progress
Updated by okurz about 1 year ago
- Due date set to 2023-10-18
- Status changed from In Progress to Feedback
- Priority changed from Urgent to High
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/629 merged and effective. Called
export host=openqa.suse.de; ~/local/os-autoinst/scripts/openqa-monitor-investigation-candidates | ~/local/os-autoinst/scripts/openqa-label-known-issues-multi
to handle failures.
openqa-query-for-job-label poo#137384
yields
12371628|2023-10-04 09:13:30|done|failed|python_3.11_on_SLES_12-SP5|backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out|imagetester
12368632|2023-10-04 04:51:49|done|timeout_exceeded|docker_tests|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369978|2023-10-04 04:23:18|done|failed|docker_tests|backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out|imagetester
12369985|2023-10-04 03:20:58|done|timeout_exceeded|mau-extratests-phub|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369893|2023-10-04 02:20:03|done|timeout_exceeded|mau-extratests2|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369891|2023-10-04 01:20:06|done|timeout_exceeded|mau-extratests-phub|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369892|2023-10-04 01:20:06|done|timeout_exceeded|mau-extratests1|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369890|2023-10-04 01:20:05|done|timeout_exceeded|mau-extratests-dracut|timeout: test execution exceeded MAX_JOB_TIME|imagetester
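The two variants from the description are visible in this output. As a rough sketch, the rows can be counted per result with awk, assuming the fourth '|'-separated field is the job result (this is inferred from the rows above, not from the script's documentation); summarize_results is a hypothetical helper name.

```shell
# Hypothetical helper: counts rows per job result, assuming field 4
# of the '|'-separated output holds the result.
summarize_results() {
    awk -F'|' '{ count[$4]++ } END { for (r in count) print count[r], r }' | sort -k2
}

summarize_results <<'EOF'
12371628|2023-10-04 09:13:30|done|failed|python_3.11_on_SLES_12-SP5|backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out|imagetester
12368632|2023-10-04 04:51:49|done|timeout_exceeded|docker_tests|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369978|2023-10-04 04:23:18|done|failed|docker_tests|backend done: Error connecting to <root@s390zl14.suse.de>: Connection timed out|imagetester
12369985|2023-10-04 03:20:58|done|timeout_exceeded|mau-extratests-phub|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369893|2023-10-04 02:20:03|done|timeout_exceeded|mau-extratests2|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369891|2023-10-04 01:20:06|done|timeout_exceeded|mau-extratests-phub|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369892|2023-10-04 01:20:06|done|timeout_exceeded|mau-extratests1|timeout: test execution exceeded MAX_JOB_TIME|imagetester
12369890|2023-10-04 01:20:05|done|timeout_exceeded|mau-extratests-dracut|timeout: test execution exceeded MAX_JOB_TIME|imagetester
EOF
# prints:
# 2 failed
# 6 timeout_exceeded
```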
Updated by okurz about 1 year ago
- Due date deleted (2023-10-18)
- Status changed from Feedback to Resolved
https://openqa.suse.de/tests/12384723 looks good