action #156052
closed[alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S
0%
Description
Observation¶
https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2315561
2 jobs have been created:
- http://openqa.suse.de/tests/13603796
- http://openqa.suse.de/tests/13603797
{"blocked_by_id":null,"id":13603796,"result":"none","state":"scheduled"}
Job state of job ID 13603796: scheduled, waiting …
{"blocked_by_id":null,"id":13603796,"result":"none","state":"running"}
Job state of job ID 13603796: running, waiting …
Suggestions¶
- Investigate what is causing the pipeline to fail
- The pipeline fails.
- The two created jobs failed.
- There is a lot of log messages mentioning "waiting" which is not shown to be successful or unsuccessful.
Rollback steps¶
Active pipelines on https://gitlab.suse.de/openqa/scripts-ci/-/pipeline_schedules again
Updated by jbaier_cz 8 months ago
Actually, the output stated the failures:
{"blocked_by_id":null,"id":13603796,"result":"parallel_failed","state":"done"}
{"blocked_by_id":null,"id":13603797,"result":"failed","state":"done"}
So this should be probably more about the test failure: https://openqa.suse.de/tests/13603797
Updated by jbaier_cz 8 months ago
- Related to action #156067: [alert] test fails in setup_multimachine added
Updated by mkittler 8 months ago
There is a lot of log messages mentioning "waiting" which is not shown to be successful or unsuccessful.
That's wrong. The log mentioned in the ticket description shows failed
and parallel_failed
:
Job state of job ID 13603796: running, waiting …
{"blocked_by_id":null,"id":13603796,"result":"parallel_failed","state":"done"}
{"blocked_by_id":null,"id":13603797,"result":"failed","state":"done"}
+ rm -f /tmp/tmp.wT5SJe6WBi
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 1
This test just failed due to the problem in setup_multimachine
that I've already fixed in #156067.
There were still three failures earlier today (after my fixed for #156067 had been merged). I'll have a look into them.
Updated by mkittler 8 months ago
- Project changed from openQA Infrastructure to openQA Project
- Category deleted (
Regressions/Crashes)
In one case openQA was not available:
++ openqa-cli api --host http://openqa.suse.de jobs version=15-SP5 scope=relevant arch=x86_64 flavor=Server-DVD-Updates test=ovs-client latest=1
++ jq -r '.jobs | map(select(.result == "passed")) | max_by(.settings.BUILD) .settings.HDD_1'
+ openqa-cli schedule --monitor --host http://openqa.suse.de --param-file SCENARIO_DEFINITIONS_YAML=/tmp/tmp.G1mjzb8tmA DISTRI=sle VERSION=15-SP5 FLAVOR=Server-DVD-Updates ARCH=x86_64 BUILD=2024-02-27T05:11+00:00 _GROUP_ID=0 HDD_1=SLES-15-SP5-x86_64-mru-install-minimal-with-addons-Build20240226-1-Server-DVD-Updates-64bit.qcow2
503 Service Unavailable
That's not the first time the CI fails that way. It would probably best to implement a retry in openqa-cli
.
A synchronization-related failure in https://openqa.suse.de/tests/13619795 (ping_client
, e.g. https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ping_client&version=15-SP5):
# Test died: mydie at /usr/lib/os-autoinst/bmwqemu.pm line 266.
bmwqemu::mydie("acquiring barrier 'MM_SETUP_PING_CHECK_DONE': lock owner alre"...) called at /usr/lib/os-autoinst/lockapi.pm line 36
lockapi::_try_lock("barrier", "MM_SETUP_PING_CHECK_DONE", HASH(0x55a111d71548)) called at /usr/lib/os-autoinst/lockapi.pm line 119
lockapi::_wait_action("MM_SETUP_PING_CHECK_DONE", undef, 0) called at /usr/lib/os-autoinst/lockapi.pm line 140
lockapi::barrier_wait("MM_SETUP_PING_CHECK_DONE") called at sle/tests/network/setup_multimachine.pm line 60
setup_multimachine::run(setup_multimachine=HASH(0x55a11539df28)) called at /usr/lib/os-autoinst/basetest.pm line 352
eval {...} called at /usr/lib/os-autoinst/basetest.pm line 346
basetest::runtest(setup_multimachine=HASH(0x55a11539df28)) called at /usr/lib/os-autoinst/autotest.pm line 415
The ping_server
job passed. Probably it had already passed when ping_client
tried to acquire the barrier which therefore failed. I guess the synchronization must be improved in the test code (maybe a regression?).
This kind of failure happened twice so it is generally reproducible.
Updated by mkittler 8 months ago · Edited
Passing job:
[2024-02-27T12:14:40.544405+01:00] [debug] [pid:118334] >>> testapi::wait_serial: (?^u:mvtMU-\d+-): ok
[2024-02-27T12:14:40.544500+01:00] [debug] [pid:118334] barrier wait 'MM_SETUP_DONE'
[2024-02-27T12:14:40.544595+01:00] [debug] [pid:118334] tests/network/setup_multimachine.pm:56 called lockapi::barrier_wait
[2024-02-27T12:14:40.544688+01:00] [debug] [pid:118334] <<< testapi::record_info(title="Paused", output="Wait for MM_SETUP_DONE (on parent job)", result="ok")
[2024-02-27T12:14:40.574018+01:00] [debug] [pid:118334] barrier 'MM_SETUP_DONE' not released, sleeping 5 seconds
[2024-02-27T12:14:45.593869+01:00] [debug] [pid:118334] tests/network/setup_multimachine.pm:56 called lockapi::barrier_wait
[2024-02-27T12:14:45.594169+01:00] [debug] [pid:118334] <<< testapi::record_info(title="Paused 0m5s", output="Wait for MM_SETUP_DONE (on parent job)", result="ok")
[2024-02-27T12:14:45.594608+01:00] [debug] [pid:118334] barrier wait 'MM_SETUP_PING_CHECK_DONE'
[2024-02-27T12:14:45.594777+01:00] [debug] [pid:118334] tests/network/setup_multimachine.pm:60 called lockapi::barrier_wait
[2024-02-27T12:14:45.594967+01:00] [debug] [pid:118334] <<< testapi::record_info(title="Paused", output="Wait for MM_SETUP_PING_CHECK_DONE (on parent job)", result="ok")
Failing job:
[2024-02-27T12:14:45.527797+01:00] [debug] [pid:67399] >>> testapi::wait_serial: (?^u:xn1dA-\d+-): ok
[2024-02-27T12:14:45.527866+01:00] [debug] [pid:67399] barrier wait 'MM_SETUP_PING_CHECK_DONE'
[2024-02-27T12:14:45.527933+01:00] [debug] [pid:67399] tests/network/setup_multimachine.pm:60 called lockapi::barrier_wait
[2024-02-27T12:14:45.528007+01:00] [debug] [pid:67399] <<< testapi::record_info(title="Paused", output="Wait for MM_SETUP_PING_CHECK_DONE (on parent job)", result="ok")
[2024-02-27T12:14:45.546329+01:00] [debug] [pid:67399] barrier 'MM_SETUP_PING_CHECK_DONE' not released, sleeping 5 seconds
[2024-02-27T12:14:50.565092+01:00] [debug] [pid:67399] tests/network/setup_multimachine.pm:60 called lockapi::barrier_wait
[2024-02-27T12:14:50.565280+01:00] [debug] [pid:67399] <<< bmwqemu::mydie(cause_of_death="acquiring barrier 'MM_SETUP_PING_CHECK_DONE': lock owner already finished")
[2024-02-27T12:14:50.626107+01:00] [info] [pid:67399] ::: basetest::runtest: # Test died: mydie at /usr/lib/os-autoinst/bmwqemu.pm line 266.
bmwqemu::mydie("acquiring barrier 'MM_SETUP_PING_CHECK_DONE': lock owner alre"...) called at /usr/lib/os-autoinst/lockapi.pm line 36
lockapi::_try_lock("barrier", "MM_SETUP_PING_CHECK_DONE", HASH(0x55f07c0a6990)) called at /usr/lib/os-autoinst/lockapi.pm line 119
lockapi::_wait_action("MM_SETUP_PING_CHECK_DONE", undef, 0) called at /usr/lib/os-autoinst/lockapi.pm line 140
lockapi::barrier_wait("MM_SETUP_PING_CHECK_DONE") called at sle/tests/network/setup_multimachine.pm line 60
So the following happened:
- The client reached the barrier first (14:45.546329).
- It then slept 5 seconds to wait for the server.
- Meanwhile, the server reached the barrier (14:45.594608). Because the client had reached the barrier before it could just continue and finish. There's no
wait_for_children
in the test code that would prevent it from finishing. - Then the client checked the barrier again (after the 5 seconds of sleeping) and ran into the error
lock owner already finished
which is expected considering the previous point.
I think that the lock API behaves that way is very problematic. I suppose we usually workaround that by adding a wait_for_children
at the end of the parent's test code. However, I'll have a look whether we can improve the openQA so this is not necessary.
EDIT: Improving the API would be possible but of course change the semantics (and thus basically break compatibility). So I went with adding wait_for_children
which is also what our documentation already suggests: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18754
Updated by mkittler 8 months ago
PR for adding a retry: https://gitlab.suse.de/openqa/scripts-ci/-/merge_requests/4
Updated by okurz 8 months ago
- Related to action #155170: [openqa-in-openqa] [sporadic] test fails in test_running: parallel_failed size:M added
Updated by openqa_review 8 months ago
- Due date set to 2024-03-13
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 8 months ago
https://gitlab.suse.de/openqa/scripts-ci/-/merge_requests/4 merged
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18754 caused problems, see https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18754#issuecomment-1968256104 , reverted with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18757 (merged)
Updated by mkittler 8 months ago
I created https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18758 and https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18759 to bring back my changes after they had to be reverted.
Updated by mkittler 7 months ago
Everything is merged now again. I'll keep having a look at https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ping_client&version=15-SP5 and https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=ping_client&version=Tumbleweed to see whether the change is effective.
Updated by mkittler 7 months ago
- Status changed from Feedback to Resolved
The new module is now on https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=ping_client&version=Tumbleweed. No new tests have been scheduled yet in the OSD scenario but I guess o3 suffices. So I enabled the pipelines again and triggered one run immediately: https://gitlab.suse.de/openqa/scripts-ci/-/pipelines/1020024 - It worked so I'm considering this ticket resolved.
Updated by okurz 7 months ago
- Copied to action #156625: [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M added