action #156052: [alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #156052

closed

[alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S

Added by livdywan over 1 year ago. Updated over 1 year ago.

Status:

Resolved

Priority:

High

Assignee:

mkittler

Category:

Target version:

Ready

Start date:

2024-02-26

Due date:

2024-03-13

% Done:

Estimated time:

Tags:

alert, reactive work

Description

Observation¶

https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2315561

2 jobs have been created:
 - http://openqa.suse.de/tests/13603796
 - http://openqa.suse.de/tests/13603797
{"blocked_by_id":null,"id":13603796,"result":"none","state":"scheduled"}
Job state of job ID 13603796: scheduled, waiting …
{"blocked_by_id":null,"id":13603796,"result":"none","state":"running"}
Job state of job ID 13603796: running, waiting …

Suggestions¶

Investigate what is causing the pipeline to fail
- The pipeline fails.
- The two created jobs failed.
- There is a lot of log messages mentioning "waiting" which is not shown to be successful or unsuccessful.

Rollback steps¶

Active pipelines on https://gitlab.suse.de/openqa/scripts-ci/-/pipeline_schedules again

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by jbaier_cz over 1 year ago

Actually, the output stated the failures:

{"blocked_by_id":null,"id":13603796,"result":"parallel_failed","state":"done"}
{"blocked_by_id":null,"id":13603797,"result":"failed","state":"done"}

So this should be probably more about the test failure: https://openqa.suse.de/tests/13603797

Actions

Copy link

Updated by jbaier_cz over 1 year ago

Related to action #156067: [alert] test fails in setup_multimachine added

Actions

Copy link

Updated by mkittler over 1 year ago

Status changed from New to In Progress
Assignee set to mkittler

Actions

Copy link

Updated by mkittler over 1 year ago

There is a lot of log messages mentioning "waiting" which is not shown to be successful or unsuccessful.

That's wrong. The log mentioned in the ticket description shows failed and parallel_failed:

Job state of job ID 13603796: running, waiting …
{"blocked_by_id":null,"id":13603796,"result":"parallel_failed","state":"done"}
{"blocked_by_id":null,"id":13603797,"result":"failed","state":"done"}
+ rm -f /tmp/tmp.wT5SJe6WBi
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 1

This test just failed due to the problem in setup_multimachine that I've already fixed in #156067.

There were still three failures earlier today (after my fixed for #156067 had been merged). I'll have a look into them.

Actions

Copy link

Updated by mkittler over 1 year ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Category deleted (~~Regressions/Crashes~~ )

In one case openQA was not available:

++ openqa-cli api --host http://openqa.suse.de jobs version=15-SP5 scope=relevant arch=x86_64 flavor=Server-DVD-Updates test=ovs-client latest=1
++ jq -r '.jobs | map(select(.result == "passed")) | max_by(.settings.BUILD) .settings.HDD_1'
+ openqa-cli schedule --monitor --host http://openqa.suse.de --param-file SCENARIO_DEFINITIONS_YAML=/tmp/tmp.G1mjzb8tmA DISTRI=sle VERSION=15-SP5 FLAVOR=Server-DVD-Updates ARCH=x86_64 BUILD=2024-02-27T05:11+00:00 _GROUP_ID=0 HDD_1=SLES-15-SP5-x86_64-mru-install-minimal-with-addons-Build20240226-1-Server-DVD-Updates-64bit.qcow2
503 Service Unavailable

That's not the first time the CI fails that way. It would probably best to implement a retry in openqa-cli.

A synchronization-related failure in https://openqa.suse.de/tests/13619795 (ping_client, e.g. https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ping_client&version=15-SP5):

# Test died: mydie at /usr/lib/os-autoinst/bmwqemu.pm line 266.
	bmwqemu::mydie("acquiring barrier 'MM_SETUP_PING_CHECK_DONE': lock owner alre"...) called at /usr/lib/os-autoinst/lockapi.pm line 36
	lockapi::_try_lock("barrier", "MM_SETUP_PING_CHECK_DONE", HASH(0x55a111d71548)) called at /usr/lib/os-autoinst/lockapi.pm line 119
	lockapi::_wait_action("MM_SETUP_PING_CHECK_DONE", undef, 0) called at /usr/lib/os-autoinst/lockapi.pm line 140
	lockapi::barrier_wait("MM_SETUP_PING_CHECK_DONE") called at sle/tests/network/setup_multimachine.pm line 60
	setup_multimachine::run(setup_multimachine=HASH(0x55a11539df28)) called at /usr/lib/os-autoinst/basetest.pm line 352
	eval {...} called at /usr/lib/os-autoinst/basetest.pm line 346
	basetest::runtest(setup_multimachine=HASH(0x55a11539df28)) called at /usr/lib/os-autoinst/autotest.pm line 415

The ping_server job passed. Probably it had already passed when ping_client tried to acquire the barrier which therefore failed. I guess the synchronization must be improved in the test code (maybe a regression?).

This kind of failure happened twice so it is generally reproducible.

Actions

Copy link

Updated by mkittler over 1 year ago · Edited

Passing job:

[2024-02-27T12:14:40.544405+01:00] [debug] [pid:118334] >>> testapi::wait_serial: (?^u:mvtMU-\d+-): ok
[2024-02-27T12:14:40.544500+01:00] [debug] [pid:118334] barrier wait 'MM_SETUP_DONE'
[2024-02-27T12:14:40.544595+01:00] [debug] [pid:118334] tests/network/setup_multimachine.pm:56 called lockapi::barrier_wait
[2024-02-27T12:14:40.544688+01:00] [debug] [pid:118334] <<< testapi::record_info(title="Paused", output="Wait for MM_SETUP_DONE (on parent job)", result="ok")
[2024-02-27T12:14:40.574018+01:00] [debug] [pid:118334] barrier 'MM_SETUP_DONE' not released, sleeping 5 seconds
[2024-02-27T12:14:45.593869+01:00] [debug] [pid:118334] tests/network/setup_multimachine.pm:56 called lockapi::barrier_wait
[2024-02-27T12:14:45.594169+01:00] [debug] [pid:118334] <<< testapi::record_info(title="Paused 0m5s", output="Wait for MM_SETUP_DONE (on parent job)", result="ok")
[2024-02-27T12:14:45.594608+01:00] [debug] [pid:118334] barrier wait 'MM_SETUP_PING_CHECK_DONE'
[2024-02-27T12:14:45.594777+01:00] [debug] [pid:118334] tests/network/setup_multimachine.pm:60 called lockapi::barrier_wait
[2024-02-27T12:14:45.594967+01:00] [debug] [pid:118334] <<< testapi::record_info(title="Paused", output="Wait for MM_SETUP_PING_CHECK_DONE (on parent job)", result="ok")

Failing job:

[2024-02-27T12:14:45.527797+01:00] [debug] [pid:67399] >>> testapi::wait_serial: (?^u:xn1dA-\d+-): ok
[2024-02-27T12:14:45.527866+01:00] [debug] [pid:67399] barrier wait 'MM_SETUP_PING_CHECK_DONE'
[2024-02-27T12:14:45.527933+01:00] [debug] [pid:67399] tests/network/setup_multimachine.pm:60 called lockapi::barrier_wait
[2024-02-27T12:14:45.528007+01:00] [debug] [pid:67399] <<< testapi::record_info(title="Paused", output="Wait for MM_SETUP_PING_CHECK_DONE (on parent job)", result="ok")
[2024-02-27T12:14:45.546329+01:00] [debug] [pid:67399] barrier 'MM_SETUP_PING_CHECK_DONE' not released, sleeping 5 seconds
[2024-02-27T12:14:50.565092+01:00] [debug] [pid:67399] tests/network/setup_multimachine.pm:60 called lockapi::barrier_wait
[2024-02-27T12:14:50.565280+01:00] [debug] [pid:67399] <<< bmwqemu::mydie(cause_of_death="acquiring barrier 'MM_SETUP_PING_CHECK_DONE': lock owner already finished")
[2024-02-27T12:14:50.626107+01:00] [info] [pid:67399] ::: basetest::runtest: # Test died: mydie at /usr/lib/os-autoinst/bmwqemu.pm line 266.
  	bmwqemu::mydie("acquiring barrier 'MM_SETUP_PING_CHECK_DONE': lock owner alre"...) called at /usr/lib/os-autoinst/lockapi.pm line 36
  	lockapi::_try_lock("barrier", "MM_SETUP_PING_CHECK_DONE", HASH(0x55f07c0a6990)) called at /usr/lib/os-autoinst/lockapi.pm line 119
  	lockapi::_wait_action("MM_SETUP_PING_CHECK_DONE", undef, 0) called at /usr/lib/os-autoinst/lockapi.pm line 140
  	lockapi::barrier_wait("MM_SETUP_PING_CHECK_DONE") called at sle/tests/network/setup_multimachine.pm line 60

So the following happened:

The client reached the barrier first (14:45.546329).
It then slept 5 seconds to wait for the server.
Meanwhile, the server reached the barrier (14:45.594608). Because the client had reached the barrier before it could just continue and finish. There's no wait_for_children in the test code that would prevent it from finishing.
Then the client checked the barrier again (after the 5 seconds of sleeping) and ran into the error lock owner already finished which is expected considering the previous point.

I think that the lock API behaves that way is very problematic. I suppose we usually workaround that by adding a wait_for_children at the end of the parent's test code. However, I'll have a look whether we can improve the openQA so this is not necessary.

EDIT: Improving the API would be possible but of course change the semantics (and thus basically break compatibility). So I went with adding wait_for_children which is also what our documentation already suggests: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18754

Actions

Copy link

Updated by okurz over 1 year ago

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18754 merged as well

Actions

Copy link

Updated by mkittler over 1 year ago

PR for adding a retry: https://gitlab.suse.de/openqa/scripts-ci/-/merge_requests/4

Actions

Copy link

Updated by okurz over 1 year ago

Related to action #155170: [openqa-in-openqa] [sporadic] test fails in test_running: parallel_failed size:M added

Actions

Copy link

#10

Updated by openqa_review over 1 year ago

Due date set to 2024-03-13

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#11

Updated by okurz over 1 year ago

https://gitlab.suse.de/openqa/scripts-ci/-/merge_requests/4 merged

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18754 caused problems, see https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18754#issuecomment-1968256104 , reverted with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18757 (merged)

Actions

Copy link

#12

Updated by mkittler over 1 year ago

Description updated (diff)

Actions

Copy link

#13

Updated by mkittler over 1 year ago

I created https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18758 and https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18759 to bring back my changes after they had to be reverted.

Actions

Copy link

#14

Updated by mkittler over 1 year ago

Status changed from In Progress to Feedback

Actions

Copy link

#15

Updated by livdywan over 1 year ago

Subject changed from [alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting to [alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S

Actions

Copy link

#16

Updated by mkittler over 1 year ago

Everything is merged now again. I'll keep having a look at https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ping_client&version=15-SP5 and https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=ping_client&version=Tumbleweed to see whether the change is effective.

Actions

Copy link

#17

Updated by mkittler over 1 year ago

Status changed from Feedback to Resolved

The new module is now on https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=ping_client&version=Tumbleweed. No new tests have been scheduled yet in the OSD scenario but I guess o3 suffices. So I enabled the pipelines again and triggered one run immediately: https://gitlab.suse.de/openqa/scripts-ci/-/pipelines/1020024 - It worked so I'm considering this ticket resolved.

Actions

Copy link

#18

Updated by okurz about 1 year ago

Copied to action #156625: [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M added

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #156052

[alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S

Observation¶

Suggestions¶

Rollback steps¶

Updated by jbaier_cz over 1 year ago

Updated by jbaier_cz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago · Edited

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by openqa_review over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by livdywan over 1 year ago

Updated by mkittler over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz about 1 year ago