Project

General

Profile

Actions

action #156052

closed

[alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S

Added by livdywan 8 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2024-02-26
Due date:
2024-03-13
% Done:

0%

Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2315561

2 jobs have been created:
 - http://openqa.suse.de/tests/13603796
 - http://openqa.suse.de/tests/13603797
{"blocked_by_id":null,"id":13603796,"result":"none","state":"scheduled"}
Job state of job ID 13603796: scheduled, waiting …
{"blocked_by_id":null,"id":13603796,"result":"none","state":"running"}
Job state of job ID 13603796: running, waiting …

Suggestions

  • Investigate what is causing the pipeline to fail
    • The pipeline fails.
    • The two created jobs failed.
    • There is a lot of log messages mentioning "waiting" which is not shown to be successful or unsuccessful.

Rollback steps

Active pipelines on https://gitlab.suse.de/openqa/scripts-ci/-/pipeline_schedules again


Related issues 3 (0 open3 closed)

Related to openQA Tests - action #156067: [alert] test fails in setup_multimachineResolvedmkittler2024-02-262024-03-12

Actions
Related to openQA Project - action #155170: [openqa-in-openqa] [sporadic] test fails in test_running: parallel_failed size:MResolvedybonatakis2024-02-082024-02-29

Actions
Copied to openQA Project - action #156625: [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:MResolvedtinita

Actions
Actions #1

Updated by jbaier_cz 8 months ago

Actually, the output stated the failures:

{"blocked_by_id":null,"id":13603796,"result":"parallel_failed","state":"done"}
{"blocked_by_id":null,"id":13603797,"result":"failed","state":"done"}

So this should be probably more about the test failure: https://openqa.suse.de/tests/13603797

Actions #2

Updated by jbaier_cz 8 months ago

  • Related to action #156067: [alert] test fails in setup_multimachine added
Actions #3

Updated by mkittler 8 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #4

Updated by mkittler 8 months ago

There is a lot of log messages mentioning "waiting" which is not shown to be successful or unsuccessful.

That's wrong. The log mentioned in the ticket description shows failed and parallel_failed:

Job state of job ID 13603796: running, waiting …
{"blocked_by_id":null,"id":13603796,"result":"parallel_failed","state":"done"}
{"blocked_by_id":null,"id":13603797,"result":"failed","state":"done"}
+ rm -f /tmp/tmp.wT5SJe6WBi
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 1

This test just failed due to the problem in setup_multimachine that I've already fixed in #156067.

There were still three failures earlier today (after my fixed for #156067 had been merged). I'll have a look into them.

Actions #5

Updated by mkittler 8 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Category deleted (Regressions/Crashes )

In one case openQA was not available:

++ openqa-cli api --host http://openqa.suse.de jobs version=15-SP5 scope=relevant arch=x86_64 flavor=Server-DVD-Updates test=ovs-client latest=1
++ jq -r '.jobs | map(select(.result == "passed")) | max_by(.settings.BUILD) .settings.HDD_1'
+ openqa-cli schedule --monitor --host http://openqa.suse.de --param-file SCENARIO_DEFINITIONS_YAML=/tmp/tmp.G1mjzb8tmA DISTRI=sle VERSION=15-SP5 FLAVOR=Server-DVD-Updates ARCH=x86_64 BUILD=2024-02-27T05:11+00:00 _GROUP_ID=0 HDD_1=SLES-15-SP5-x86_64-mru-install-minimal-with-addons-Build20240226-1-Server-DVD-Updates-64bit.qcow2
503 Service Unavailable

That's not the first time the CI fails that way. It would probably best to implement a retry in openqa-cli.


A synchronization-related failure in https://openqa.suse.de/tests/13619795 (ping_client, e.g. https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=ping_client&version=15-SP5):

# Test died: mydie at /usr/lib/os-autoinst/bmwqemu.pm line 266.
    bmwqemu::mydie("acquiring barrier 'MM_SETUP_PING_CHECK_DONE': lock owner alre"...) called at /usr/lib/os-autoinst/lockapi.pm line 36
    lockapi::_try_lock("barrier", "MM_SETUP_PING_CHECK_DONE", HASH(0x55a111d71548)) called at /usr/lib/os-autoinst/lockapi.pm line 119
    lockapi::_wait_action("MM_SETUP_PING_CHECK_DONE", undef, 0) called at /usr/lib/os-autoinst/lockapi.pm line 140
    lockapi::barrier_wait("MM_SETUP_PING_CHECK_DONE") called at sle/tests/network/setup_multimachine.pm line 60
    setup_multimachine::run(setup_multimachine=HASH(0x55a11539df28)) called at /usr/lib/os-autoinst/basetest.pm line 352
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 346
    basetest::runtest(setup_multimachine=HASH(0x55a11539df28)) called at /usr/lib/os-autoinst/autotest.pm line 415

The ping_server job passed. Probably it had already passed when ping_client tried to acquire the barrier which therefore failed. I guess the synchronization must be improved in the test code (maybe a regression?).

This kind of failure happened twice so it is generally reproducible.

Actions #6

Updated by mkittler 8 months ago · Edited

Passing job:

[2024-02-27T12:14:40.544405+01:00] [debug] [pid:118334] >>> testapi::wait_serial: (?^u:mvtMU-\d+-): ok
[2024-02-27T12:14:40.544500+01:00] [debug] [pid:118334] barrier wait 'MM_SETUP_DONE'
[2024-02-27T12:14:40.544595+01:00] [debug] [pid:118334] tests/network/setup_multimachine.pm:56 called lockapi::barrier_wait
[2024-02-27T12:14:40.544688+01:00] [debug] [pid:118334] <<< testapi::record_info(title="Paused", output="Wait for MM_SETUP_DONE (on parent job)", result="ok")
[2024-02-27T12:14:40.574018+01:00] [debug] [pid:118334] barrier 'MM_SETUP_DONE' not released, sleeping 5 seconds
[2024-02-27T12:14:45.593869+01:00] [debug] [pid:118334] tests/network/setup_multimachine.pm:56 called lockapi::barrier_wait
[2024-02-27T12:14:45.594169+01:00] [debug] [pid:118334] <<< testapi::record_info(title="Paused 0m5s", output="Wait for MM_SETUP_DONE (on parent job)", result="ok")
[2024-02-27T12:14:45.594608+01:00] [debug] [pid:118334] barrier wait 'MM_SETUP_PING_CHECK_DONE'
[2024-02-27T12:14:45.594777+01:00] [debug] [pid:118334] tests/network/setup_multimachine.pm:60 called lockapi::barrier_wait
[2024-02-27T12:14:45.594967+01:00] [debug] [pid:118334] <<< testapi::record_info(title="Paused", output="Wait for MM_SETUP_PING_CHECK_DONE (on parent job)", result="ok")

Failing job:

[2024-02-27T12:14:45.527797+01:00] [debug] [pid:67399] >>> testapi::wait_serial: (?^u:xn1dA-\d+-): ok
[2024-02-27T12:14:45.527866+01:00] [debug] [pid:67399] barrier wait 'MM_SETUP_PING_CHECK_DONE'
[2024-02-27T12:14:45.527933+01:00] [debug] [pid:67399] tests/network/setup_multimachine.pm:60 called lockapi::barrier_wait
[2024-02-27T12:14:45.528007+01:00] [debug] [pid:67399] <<< testapi::record_info(title="Paused", output="Wait for MM_SETUP_PING_CHECK_DONE (on parent job)", result="ok")
[2024-02-27T12:14:45.546329+01:00] [debug] [pid:67399] barrier 'MM_SETUP_PING_CHECK_DONE' not released, sleeping 5 seconds
[2024-02-27T12:14:50.565092+01:00] [debug] [pid:67399] tests/network/setup_multimachine.pm:60 called lockapi::barrier_wait
[2024-02-27T12:14:50.565280+01:00] [debug] [pid:67399] <<< bmwqemu::mydie(cause_of_death="acquiring barrier 'MM_SETUP_PING_CHECK_DONE': lock owner already finished")
[2024-02-27T12:14:50.626107+01:00] [info] [pid:67399] ::: basetest::runtest: # Test died: mydie at /usr/lib/os-autoinst/bmwqemu.pm line 266.
    bmwqemu::mydie("acquiring barrier 'MM_SETUP_PING_CHECK_DONE': lock owner alre"...) called at /usr/lib/os-autoinst/lockapi.pm line 36
    lockapi::_try_lock("barrier", "MM_SETUP_PING_CHECK_DONE", HASH(0x55f07c0a6990)) called at /usr/lib/os-autoinst/lockapi.pm line 119
    lockapi::_wait_action("MM_SETUP_PING_CHECK_DONE", undef, 0) called at /usr/lib/os-autoinst/lockapi.pm line 140
    lockapi::barrier_wait("MM_SETUP_PING_CHECK_DONE") called at sle/tests/network/setup_multimachine.pm line 60

So the following happened:

  1. The client reached the barrier first (14:45.546329).
  2. It then slept 5 seconds to wait for the server.
  3. Meanwhile, the server reached the barrier (14:45.594608). Because the client had reached the barrier before it could just continue and finish. There's no wait_for_children in the test code that would prevent it from finishing.
  4. Then the client checked the barrier again (after the 5 seconds of sleeping) and ran into the error lock owner already finished which is expected considering the previous point.

I think that the lock API behaves that way is very problematic. I suppose we usually workaround that by adding a wait_for_children at the end of the parent's test code. However, I'll have a look whether we can improve the openQA so this is not necessary.


EDIT: Improving the API would be possible but of course change the semantics (and thus basically break compatibility). So I went with adding wait_for_children which is also what our documentation already suggests: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18754

Actions #9

Updated by okurz 8 months ago

  • Related to action #155170: [openqa-in-openqa] [sporadic] test fails in test_running: parallel_failed size:M added
Actions #10

Updated by openqa_review 8 months ago

  • Due date set to 2024-03-13

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by mkittler 8 months ago

  • Description updated (diff)
Actions #14

Updated by mkittler 8 months ago

  • Status changed from In Progress to Feedback
Actions #15

Updated by livdywan 7 months ago

  • Subject changed from [alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting to [alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S
Actions #17

Updated by mkittler 7 months ago

  • Status changed from Feedback to Resolved

The new module is now on https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=ping_client&version=Tumbleweed. No new tests have been scheduled yet in the OSD scenario but I guess o3 suffices. So I enabled the pipelines again and triggered one run immediately: https://gitlab.suse.de/openqa/scripts-ci/-/pipelines/1020024 - It worked so I'm considering this ticket resolved.

Actions #18

Updated by okurz 7 months ago

  • Copied to action #156625: [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M added
Actions

Also available in: Atom PDF