action #175698
[tools][multi-machine tests] Timeout_exceeded on multiple workers including arm1, arm2 and mania
Status: closed
Description
Please see https://suse.slack.com/archives/C02CANHLANP/p1737105984558279 for more detailed information.
I can see some multi-machine jobs failing there because one of the jobs in the cluster failed to start before the timeout was exceeded. Based on openQA test results, the issue started to show up with build 53.1.
Affected jobs can be seen at:
https://openqa.suse.de/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&distri=sle&version=15-SP7&build=55.1&groupid=262
and
https://openqa.suse.de/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&distri=sle&version=15-SP7&build=55.6&groupid=262
Observation
openQA test in scenario sle-15-SP7-Online-aarch64-wicked_basic_sut@aarch64 fails in
locks_init
Test suite description
Basic wicked checks. Maintainer: asmorodskyi@suse.de, jalausuch@suse.com, cfamullaconrad@suse.de
Reproducible
Fails since (at least) Build 55.1 (current job)
Expected result
Last good: 53.1 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by jbaier_cz 14 days ago
- Status changed from In Progress to Blocked
- Priority changed from Urgent to Normal
From https://openqa.suse.de/tests/16460705/logfile?filename=autoinst-log.txt:
[2025-01-15T17:49:17.848165Z] [debug] [pid:6790] tests/wicked/locks_init.pm:28 called lockapi::mutex_wait
[2025-01-15T17:49:17.848467Z] [debug] [pid:6790] <<< testapi::record_info(title="Paused", output="Wait for wicked_barriers_created (on parent job)", result="ok")
[2025-01-15T17:49:17.849652Z] [debug] [pid:6790] mutex lock 'wicked_barriers_created'
[2025-01-15T17:49:17.878949Z] [debug] [pid:6790] mutex lock 'wicked_barriers_created' unavailable, sleeping 5 seconds
When looking at the dependencies of that job, namely https://openqa.suse.de/tests/16460702, we can see that this job was cancelled (so it never reached the point where the mutex is created, and hence the other job timed out).
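For context, the synchronization here follows the usual openQA lockapi pattern for multi-machine tests: the parent job creates its barriers and then a mutex, while the SUT job blocks in mutex_wait until that mutex exists. A minimal sketch of that pattern (the mutex name is taken from the log above; everything else is illustrative, not the actual wicked test code):

use lockapi;

# Parent job (the wicked "ref" side, simplified): create the barriers the
# children will later wait on, then create the mutex that signals "setup done".
barrier_create('wicked_example_barrier', 2);    # barrier name is hypothetical
mutex_create('wicked_barriers_created');        # unblocks the waiting children

# Child job (the SUT side, as in tests/wicked/locks_init.pm, simplified):
# block, polling every few seconds, until the parent has created the mutex.
# If the parent job is cancelled, the mutex is never created, mutex_wait
# never returns, and the child eventually ends up as timeout_exceeded,
# which matches the log excerpt above.
mutex_wait('wicked_barriers_created');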
I believe this is an issue we already know from #174583; see the second example:
{"blocked_by_id":null,"id":16374878,"result":"skipped","state":"cancelled"}
{"blocked_by_id":null,"id":16374879,"result":"timeout_exceeded","state":"done"}
Looking at the last job, it has already passed, so I believe the issue is actually about proper cancellation / correct restarts of all the dependent jobs.
I propose to block on #174583 where we might improve this behavior.
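For reference, state/result pairs like the ones quoted above can be read back from the openQA job API. A minimal sketch, assuming read access to the GET /api/v1/jobs/<id> route (the job IDs are the ones quoted above; the rest is illustrative):

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::UserAgent;

# Query state and result of the two jobs from the second example above.
my $ua = Mojo::UserAgent->new;
for my $id (16374878, 16374879) {
    my $job = $ua->get("https://openqa.suse.de/api/v1/jobs/$id")->result->json->{job};
    printf "%d: blocked_by_id=%s state=%s result=%s\n",
        $id, $job->{blocked_by_id} // 'null', $job->{state}, $job->{result};
}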
Updated by jbaier_cz 14 days ago
- Related to action #174583: openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence? size:S added
Updated by jbaier_cz 11 days ago
Related PR: https://github.com/os-autoinst/openQA/pull/6130
Updated by okurz 3 days ago
- Tags changed from infra, osd, timeout, arm1, arm2, mania, reactive work to osd, timeout, arm1, arm2, mania, reactive work
- Status changed from New to Resolved
- Assignee set to mkittler
This seems to be an upstream openQA issue in the scheduler, possibly fixed by https://github.com/os-autoinst/openQA/pull/6130. The history of jobs in this scenario, e.g. https://openqa.suse.de/tests/latest?arch=aarch64&distri=sle&flavor=Online&machine=aarch64&test=wicked_basic_sut&version=15-SP7#next_previous, looks passed and stable. Assuming fixed.