action #175698

closed

[tools][multi-machine tests] Timeout_exceeded on multiple workers including arm1, arm2 and mania

Added by rfan1 14 days ago. Updated 3 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Bugs in existing tests
Start date: 2025-01-17
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Please see https://suse.slack.com/archives/C02CANHLANP/p1737105984558279 for more details.

Some multi-machine jobs are failing there because one of the jobs in the cluster failed to start before the timeout. Based on the openQA test results, the issue has been showing up since build 53.1.

Affected jobs can be seen at:

https://openqa.suse.de/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&distri=sle&version=15-SP7&build=55.1&groupid=262
and
https://openqa.suse.de/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&distri=sle&version=15-SP7&build=55.6&groupid=262

Observation

openQA test in scenario sle-15-SP7-Online-aarch64-wicked_basic_sut@aarch64 fails in
locks_init

Test suite description

Basic wicked checks. Maintainer: asmorodskyi@suse.de, jalausuch@suse.com, cfamullaconrad@suse.de

Reproducible

Fails since (at least) Build 55.1 (current job)

Expected result

Last good: 53.1 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open, 1 closed)

Related to openQA Project (public) - action #174583: openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence? size:S (Resolved, mkittler, 2024-12-19)

Actions #1

Updated by okurz 14 days ago

  • Tags set to infra, osd, timeout, arm1, arm2, mania, reactive work
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #2

Updated by jbaier_cz 14 days ago

  • Status changed from New to In Progress
  • Assignee set to jbaier_cz
Actions #3

Updated by jbaier_cz 14 days ago

  • Status changed from In Progress to Blocked
  • Priority changed from Urgent to Normal

From https://openqa.suse.de/tests/16460705/logfile?filename=autoinst-log.txt:

[2025-01-15T17:49:17.848165Z] [debug] [pid:6790] tests/wicked/locks_init.pm:28 called lockapi::mutex_wait
[2025-01-15T17:49:17.848467Z] [debug] [pid:6790] <<< testapi::record_info(title="Paused", output="Wait for wicked_barriers_created (on parent job)", result="ok")
[2025-01-15T17:49:17.849652Z] [debug] [pid:6790] mutex lock 'wicked_barriers_created'
[2025-01-15T17:49:17.878949Z] [debug] [pid:6790] mutex lock 'wicked_barriers_created' unavailable, sleeping 5 seconds

Looking at the dependencies of that job, namely https://openqa.suse.de/tests/16460702, we can see that this job was cancelled, so it never reached the mutex and hence the other job timed out.
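To illustrate why the waiting job can only end in timeout_exceeded, here is a minimal sketch of a polling wait without its own deadline. It is a simplified model, not the actual lockapi implementation; `try_lock`, `max_polls` and the helper names are hypothetical, and the 5-second interval is taken from the log above.

```python
import itertools


def mutex_wait(name, try_lock, poll_interval=5, sleep=lambda seconds: None, max_polls=None):
    """Poll until try_lock(name) succeeds and return the number of failed polls.

    Hypothetical model of a polling wait, not the real lockapi code.
    With max_polls=None the loop has no deadline of its own, so only an
    outer job timeout would ever stop it. The default sleep is a no-op so
    the sketch runs instantly; pass time.sleep for real delays.
    """
    polls = itertools.count() if max_polls is None else range(max_polls)
    for attempt in polls:
        if try_lock(name):
            return attempt
        # Corresponds to "mutex lock ... unavailable, sleeping 5 seconds".
        sleep(poll_interval)
    raise TimeoutError(f"mutex {name!r} still unavailable after {max_polls} polls")


# Healthy case: the parent creates the mutex by the third poll.
polls_seen = []


def parent_creates_mutex_late(name):
    polls_seen.append(name)
    return len(polls_seen) >= 3


print(mutex_wait("wicked_barriers_created", parent_creates_mutex_late))  # prints 2

# Cancelled parent: the mutex never appears. max_polls stands in for the
# job-level timeout so that the example terminates.
try:
    mutex_wait("wicked_barriers_created", lambda name: False, max_polls=4)
except TimeoutError as err:
    print(err)
```

In this model nothing inside the wait notices that the parent is gone, which matches the observation that the child job only ends once the job timeout fires.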

I believe this is an issue we already know from #174583 (the second example there):

{"blocked_by_id":null,"id":16374878,"result":"skipped","state":"cancelled"}
{"blocked_by_id":null,"id":16374879,"result":"timeout_exceeded","state":"done"}

Looking at the last job, it has already passed, so I believe the issue is actually about proper cancellation and correct restarts of the whole cluster of dependent jobs.
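The pattern described here (a cancelled job next to a dependent job that ran into timeout_exceeded) could be spotted mechanically. The following is only a sketch: the job records are the two from the example above, while the `parents` mapping and the function name `find_starved_jobs` are assumptions for illustration. The records themselves do not carry the dependency edge, and this is not an openQA API.

```python
def find_starved_jobs(jobs, parents):
    """Return ids of jobs that timed out while one of their parents was cancelled.

    jobs:    list of dicts with at least "id", "result" and "state"
    parents: hypothetical mapping of job id -> list of parent job ids
    """
    by_id = {job["id"]: job for job in jobs}
    starved = []
    for job in jobs:
        if job["result"] != "timeout_exceeded":
            continue
        for parent_id in parents.get(job["id"], []):
            parent = by_id.get(parent_id)
            if parent is not None and parent["state"] == "cancelled":
                starved.append(job["id"])
                break
    return starved


jobs = [
    {"blocked_by_id": None, "id": 16374878, "result": "skipped", "state": "cancelled"},
    {"blocked_by_id": None, "id": 16374879, "result": "timeout_exceeded", "state": "done"},
]
# Assumption: 16374879 depends on 16374878; the records do not state this edge.
parents = {16374879: [16374878]}

print(find_starved_jobs(jobs, parents))  # prints [16374879]
```

A job reported this way did not fail on its own; it waited for a partner that was never going to run, so restarting the whole cluster is the relevant fix rather than retrying the single job.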

I propose to block on #174583 where we might improve this behavior.

Actions #4

Updated by jbaier_cz 14 days ago

  • Related to action #174583: openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence? size:S added
Actions #7

Updated by okurz 7 days ago

  • Status changed from Blocked to New
Actions #8

Updated by jbaier_cz 7 days ago

  • Assignee deleted (jbaier_cz)
Actions #9

Updated by okurz 3 days ago

  • Tags changed from infra, osd, timeout, arm1, arm2, mania, reactive work to osd, timeout, arm1, arm2, mania, reactive work
  • Status changed from New to Resolved
  • Assignee set to mkittler

This seems to be an upstream openQA issue in the scheduler, possibly fixed by https://github.com/os-autoinst/openQA/pull/6130. The job history, e.g. https://openqa.suse.de/tests/latest?arch=aarch64&distri=sle&flavor=Online&machine=aarch64&test=wicked_basic_sut&version=15-SP7#next_previous, looks passed and stable. Assuming fixed.
