action #175698
[tools][multi-machine tests] Timeout_exceeded on multiple workers including arm1, arm2 and mania
Status: closed
Description
Please see https://suse.slack.com/archives/C02CANHLANP/p1737105984558279 for more detailed information.
I can see some multi-machine jobs failing there because one of the jobs in the cluster failed to start before the timeout was exceeded. Based on openQA test results, the issue started to show up with build 53.1.
Affected jobs can be seen at:
https://openqa.suse.de/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&distri=sle&version=15-SP7&build=55.1&groupid=262
and
https://openqa.suse.de/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&distri=sle&version=15-SP7&build=55.6&groupid=262
Observation
openQA test in scenario sle-15-SP7-Online-aarch64-wicked_basic_sut@aarch64 fails in
locks_init
Test suite description
Basic wicked checks. Maintainer: asmorodskyi@suse.de, jalausuch@suse.com, cfamullaconrad@suse.de
Reproducible
Fails since (at least) Build 55.1 (current job)
Expected result
Last good: 53.1 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by jbaier_cz 14 days ago
- Status changed from In Progress to Blocked
- Priority changed from Urgent to Normal
From https://openqa.suse.de/tests/16460705/logfile?filename=autoinst-log.txt:
[2025-01-15T17:49:17.848165Z] [debug] [pid:6790] tests/wicked/locks_init.pm:28 called lockapi::mutex_wait
[2025-01-15T17:49:17.848467Z] [debug] [pid:6790] <<< testapi::record_info(title="Paused", output="Wait for wicked_barriers_created (on parent job)", result="ok")
[2025-01-15T17:49:17.849652Z] [debug] [pid:6790] mutex lock 'wicked_barriers_created'
[2025-01-15T17:49:17.878949Z] [debug] [pid:6790] mutex lock 'wicked_barriers_created' unavailable, sleeping 5 seconds
When looking at the dependencies of that job, namely https://openqa.suse.de/tests/16460702, we can see that this job was cancelled (so it never reached the point where the mutex is created, and hence the other job timed out).
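For context, the synchronization here follows the usual openQA lockapi pattern for multi-machine tests: the parent job creates its barriers and then a mutex, while the SUT job blocks in mutex_wait until that mutex exists. A minimal sketch of that pattern (the mutex name is taken from the log above; everything else is illustrative, not the actual wicked test code):

use lockapi;

# Parent job (the wicked "ref" side, simplified): create the barriers the
# children will later wait on, then create the mutex that signals "setup done".
barrier_create('wicked_example_barrier', 2);    # barrier name is hypothetical
mutex_create('wicked_barriers_created');        # unblocks the waiting children

# Child job (the SUT side, as in tests/wicked/locks_init.pm, simplified):
# block, polling every few seconds, until the parent has created the mutex.
# If the parent job is cancelled, the mutex is never created, mutex_wait
# never returns, and the child eventually ends up as timeout_exceeded,
# which matches the log excerpt above.
mutex_wait('wicked_barriers_created');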
I believe this is an issue we already know from #174583; see the second example:
{"blocked_by_id":null,"id":16374878,"result":"skipped","state":"cancelled"}
{"blocked_by_id":null,"id":16374879,"result":"timeout_exceeded","state":"done"}
Looking at the last job, it has already passed, so I believe the issue is actually about proper cancellation / correct restarts of all the dependent jobs.
I propose to block on #174583 where we might improve this behavior.
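For reference, state/result pairs like the ones quoted above can be read back from the openQA job API. A minimal sketch, assuming read access to the GET /api/v1/jobs/<id> route (the job IDs are the ones quoted above; the rest is illustrative):

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::UserAgent;

# Query state and result of the two jobs from the second example above.
my $ua = Mojo::UserAgent->new;
for my $id (16374878, 16374879) {
    my $job = $ua->get("https://openqa.suse.de/api/v1/jobs/$id")->result->json->{job};
    printf "%d: blocked_by_id=%s state=%s result=%s\n",
        $id, $job->{blocked_by_id} // 'null', $job->{state}, $job->{result};
}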
Updated by jbaier_cz 14 days ago
- Related to action #174583: openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence? size:S added
Updated by jbaier_cz 11 days ago
Related PR: https://github.com/os-autoinst/openQA/pull/6130
Updated by okurz 3 days ago
- Tags changed from infra, osd, timeout, arm1, arm2, mania, reactive work to osd, timeout, arm1, arm2, mania, reactive work
- Status changed from New to Resolved
- Assignee set to mkittler
This seems to be an upstream openQA issue in the scheduler, possibly fixed by https://github.com/os-autoinst/openQA/pull/6130. The history of jobs in this scenario, e.g. https://openqa.suse.de/tests/latest?arch=aarch64&distri=sle&flavor=Online&machine=aarch64&test=wicked_basic_sut&version=15-SP7#next_previous, looks passed and stable. Assuming fixed.