action #175060
coordination #176337: [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)
coordination #176340: [epic] Stable qemu backend with no unexpected mistyping
[sporadic] [Workflow] Failed: os-autoinst/openQA on master / test (7dc9d82) size:M
Description
Observation
fullstack failed on master
https://app.circleci.com/pipelines/github/os-autoinst/openQA/15433/workflows/dd25c1f6-7e00-4342-b808-f9f1398fcc47/jobs/147885/parallel-runs/0/steps/0-111
[14:08:58] t/full-stack.t ......... 31/? # stopped waiting for '(?^u:Result: passed)', result turned out to be 'failed,'
# full result panel contents:
# 2 1
# Result: failed, finished less than a minute ago (ran for 02:06 minutes)
# Scheduled product: job has not been created by posting an ISO
# Assigned worker: 31dcf6c6b299:1
# Actions:
#
# Restart job
# Failed test 'Expected result not found'
# at /home/squamata/project/t/lib/OpenQA/Test/FullstackUtils.pm line 76.
# Failed test 'job 5 passed'
# at t/full-stack.t line 349.
# 0
# Looks like you failed 2 tests of 19.
# Failed test 'results of test 5'
# at t/full-stack.t line 387.
Acceptance Criteria
- AC1: full-stack tests pass reliably on CircleCI with no retries
Suggestions
- See full logs on https://output.circle-artifacts.com/output/job/bb84062f-82a9-49c3-815e-e875716cac82/artifacts/0/artifacts/full-stack.t
- Maybe CircleCI is being slow. We've seen that before.
- Is this reproducible? Re-run seemed to pass. Seems to happen ~monthly https://app.circleci.com/pipelines/github/os-autoinst/openQA?branch=master&status=none&status=failing&status=failed
- As mentioned in #175060#note-3 this happens because "sudo poweroff" cannot be typed - the message "udevadm settle - timeout of 5 seconds reached …" appears in the middle.
- Maybe we can handle that condition in test code and retry typing (see the sketch below this list)? This is probably problematic as the issue is not easily reproducible and the exact behavior might differ from case to case (e.g. console output might be interleaved in some cases).
- Check the prompt e.g. look for a lot of blank space
- Maybe extend logging from within the shutdown function to make it easier to check?
- Override the load detection to more easily reproduce the issue?
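A minimal sketch of that retry-typing idea, purely illustrative: the helper name, the needle name and the timeouts are assumptions and not existing os-autoinst API; only type_string/wait_still_screen/check_screen/send_key are real testapi calls.

use testapi;    # type_string, wait_still_screen, check_screen, send_key

# Hypothetical helper: type a command without submitting it, verify the echoed
# line via a needle, and clear + retype if console output (e.g. the "udevadm
# settle" message) garbled the input.
sub type_string_checked ($cmd, $retries = 3) {
    for (1 .. $retries) {
        type_string $cmd;             # type, but do not press return yet
        wait_still_screen 2;          # give the echo time to settle
        if (check_screen('cmd-typed-correctly', 5)) {    # assumed needle matching the clean prompt line
            send_key 'ret';
            return 1;
        }
        send_key 'ctrl-u';            # clear the mistyped line and retry
    }
    die "could not type '$cmd' reliably after $retries attempts";
}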
Rollback actions
Updated by tinita 3 months ago
- Subject changed from [Workflow] Failed: os-autoinst/openQA on master / test (7dc9d82) to [sporadic] [Workflow] Failed: os-autoinst/openQA on master / test (7dc9d82)
- Status changed from New to In Progress
- Assignee set to tinita
- Priority changed from Urgent to High
I would say it's not Urgent as it is sporadic, and I couldn't find an occurrence in the last 15 days. Earlier test runs are currently not available in CircleCI; the GUI hangs when I try to load more runs.
https://app.circleci.com/pipelines/github/os-autoinst/openQA?branch=master&status=none&status=failed
Updated by mkittler 3 months ago
It failed yesterday with a different symptom, though:
# Failed test 'no worker with class qemu_i386 showed up after 18.0321 seconds'
# at t/full-stack.t line 120.
# 0
# Failed test 'Unable to assign 1 jobs after 18.0321 seconds'
# at t/full-stack.t line 143.
# full result panel contents:
# State: scheduled, created 28 minutes ago
# Scheduled product: job has not been created by posting an ISO
# 50
# Actions:
#
# Cancel job
[17:25:04] t/full-stack.t ......... 17/?
# Failed test 'Expected result not found'
The case we discussed and this ticket is about: Typing issues in the VM, probably due to the VM being very slow, see #175060#note-3.
The case from yesterday: The worker didn't pick up the job in the first place (at least not in time).
The reason for both problems is probably the same, though. CircleCI is overloaded. In the case from yesterday we see the following log message showing that the worker didn't pick up the job because of the load threshold:
[info] [pid:336] Registered and connected via websockets with openQA host http://localhost:9526 and worker ID 1
[warn] [pid:336] The average load (73.51 38.51 27.36) is exceeding the configured threshold of 40. The worker will temporarily not accept new jobs until the load is lower again. - checking again for web UI 'http://localhost:9526' in 228.39 s
[debug] [pid:201] Updating seen of worker 1 from worker_status (broken)
not ok 15 - no worker with class qemu_i386 showed up after 18.0321 seconds
# free workers: []
not ok 16 - Unable to assign 1 jobs after 18.0321 seconds
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 0 of 3240)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 1 of 3240)
…
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 406 of 3240)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 407 of 3240)
[warn] [pid:336] The average load (45.93 40.02 30.30) is exceeding the configured threshold of 40. The worker will temporarily not accept new jobs until the load is lower again. - checking again for web UI 'http://localhost:9526' in 290.71 s
[debug] [pid:201] Updating seen of worker 1 from worker_status (broken)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 408 of 3240)
…
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 3238 of 3240)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 3239 of 3240)
ok 17 - No unexpected js warnings
not ok 18 - Expected result not found
not ok 19 - test 1 is running
Bail out! unable to run test 1
[info] [pid:336] Received signal TERM
[debug] [pid:336] Informing http://localhost:9526 that we are going offline
[info] [pid:201] Worker 1 websocket connection closed - 1006
We could disable the load threshold in the fullstack test but this would turn more cases like the one from yesterday into cases where we have typing issues within the VM.
We probably should actually lower the threshold in the fullstack test to avoid typing issues. Then we still have to think about what we do when running into the threshold. Not sure whether it would make sense to simply wait longer. The failure from yesterday ran into the timeout after 30 minutes. So there's maybe still room to extend timeouts¹ a little bit without running into the CircleCI context deadline but it will probably not help much.
Of course there could also be a regression on our side, e.g. a spinning loop in one of the processes that causes the high load. I suspect more that it is CircleCI, though.
¹ We should only extend the timeout of wait_for_result_panel when we actually see "… exceeding the configured threshold …" being logged. Not sure how to read the log messages of the worker in that function efficiently, though.
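A rough sketch of how that could look - hypothetical helper and variable names, not the current FullstackUtils code:

use Mojo::File qw(path);

# Hypothetical check: did the worker log the load-threshold warning?
sub worker_hit_load_threshold ($worker_log_file) {
    return -e $worker_log_file
      && path($worker_log_file)->slurp =~ /exceeding the configured threshold/;
}

# Inside the wait_for_result_panel polling loop the timeout could then be
# extended once, e.g.:
# $timeout += $extra_wait if !$already_extended++ && worker_hit_load_threshold($worker_log_file);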
Updated by okurz 3 months ago
mkittler wrote in #note-10:
[...] We probably should actually lower the threshold in the fullstack test to avoid typing issues. Then we still have to think about what we do when running into the threshold. Not sure whether it would make sense to simply wait longer. The failure from yesterday ran into the timeout after 30 minutes. So there's maybe still room to extend timeouts¹ a little bit without running into the CircleCI context deadline but it will probably not help much.
That actually sounds very good to me. It's great to see that the worker load detection helps here. I agree we should lower the load threshold and wait longer for the CircleCI environment to have less load. At best we can make it clear enough in the logs what's happening and also state what is going on in case the load did not settle within the timeout.
Updated by okurz 3 months ago
- as urgency mitigation apply retry, e.g. in level of the Makefile
- prevent VNC typing of "poweroff", use qemu poweroff ACPI or virtio_terminal or just force shutdown
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
Updated by gpuliti 3 months ago
I've marked t/full-stack.t as unstable in unstable_tests.txt in https://github.com/os-autoinst/openQA/pull/6107
Updated by gpuliti 3 months ago
- Status changed from In Progress to Workable
- Assignee deleted (gpuliti)
- Priority changed from Urgent to High
Mitigated with the PR above, https://github.com/os-autoinst/openQA/pull/6107, which is merged.
I dropped the priority as we discussed this morning.
Leave the next 3 points as workable. @okurz wrote in https://progress.opensuse.org/issues/175060#note-13:
- as urgency mitigation apply retry, e.g. in level of the Makefile
- prevent VNC typing of "poweroff", use qemu poweroff ACPI or virtio_terminal or just force shutdown
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
Updated by gpuliti 3 months ago
- Status changed from In Progress to Feedback
I've applied a retry mechanism at Makefile level to handle sporadic failures caused by CircleCI load (https://github.com/os-autoinst/openQA/pull/6115); looking at the history I think 2 retries are enough.
I've done further investigation to understand how to handle the shutdown properly and found that the shutdown is handled in full-stack:90, which sends the shutdown request; then full-stack:197 waits for the console showing that the system is paused due to a shutdown, and I guess this is where the typing might cause problems!?
If my changes are approved, we should add rollback actions to the task.
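For illustration, a Makefile-level retry could look roughly like the following sketch - the target name and the RETRIES variable are assumptions and the actual PR may implement it differently (recipe lines must be tab-indented):

RETRIES ?= 2

.PHONY: test-fullstack-retry
test-fullstack-retry:
	for i in $$(seq 0 $(RETRIES)); do \
		prove -l t/full-stack.t && exit 0; \
		echo "full-stack.t failed, retry $$i of $(RETRIES)"; \
	done; exit 1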
Updated by openqa_review 3 months ago
- Due date set to 2025-01-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by gpuliti 3 months ago
Working on solution for point 2: https://github.com/os-autoinst/os-autoinst/pull/2623
Rollback of mistaken merged pr: https://github.com/os-autoinst/openQA/pull/6124
Updated by gpuliti 2 months ago · Edited
I'm debugging the tests using the github actions pipeline.
It seems that the change to shutdown.pm broke a test:
3: # ------
3: # $VAR1 = {
3: # 'dents' => 0,
3: # 'details' => [
3: # {
3: # 'result' => 'fail',
3: # 'screenshot' => 'shutdown-2.png',
3: # 'frametime' => [
3: # '17.62',
3: # '17.67'
3: # ]
3: # },
3: # {
3: # 'text' => 'shutdown-3.txt',
3: # 'title' => 'Failed',
3: # 'result' => 'fail'
3: # }
3: # ],
3: # 'result' => 'fail',
3: # 'execution_time' => 61
3: # };
3: #
3: # ------
3:
3: # Failed test 'Result in testresults/result-shutdown.json is ok'
3: # at t/99-full-stack.t line 79.
3: # got: 'fail'
3: # expected: 'ok'
Updated by gpuliti 2 months ago
I've done 2 tests to prevent VNC typing of "poweroff": the first one uses ACPI and the second forces the shutdown.
Using power('acpi')
By changing shutdown.pm:9 to power('acpi') the test fails with:
3: # Failed test 'Result in testresults/result-shutdown.json is ok'
3: # at t/99-full-stack.t line 83.
3: # got: 'fail'
3: # expected: 'ok'
Looking at autoinst-log.txt it seems that the machine is not properly shutting down, causing a timeout in the testapi::check_shutdown method:
3: # [2025-01-27T14:16:47.274338Z] [debug] [pid:2821] <<< testapi::power(action="acpi")
3: # [2025-01-27T14:16:47.275427Z] [debug] [pid:2824] EVENT {"event":"POWERDOWN","timestamp":{"microseconds":275186,"seconds":1737987407}}
3: # [2025-01-27T14:16:47.275970Z] [debug] [pid:2821] tests/shutdown.pm:10 called testapi::assert_shutdown
3: # [2025-01-27T14:16:47.276076Z] [debug] [pid:2821] <<< testapi::check_shutdown(timeout=60)
3: # [2025-01-27T14:16:47.277158Z] [debug] [pid:2824] QEMU status is not 'shutdown', it is 'running'
3: # <a lot of QEMU status is not shutdown>
3: # [2025-01-27T14:17:47.409491Z] [debug] [pid:2824] QEMU status is not 'shutdown', it is 'running'
3: # [2025-01-27T14:17:48.454331Z] [info] [pid:2821] ::: basetest::runtest: # Test died: Machine didn't shut down! at /opt/t/data/tests/tests/shutdown.pm line 10.
3: #
3: # [2025-01-27T14:17:48.457336Z] [debug] [pid:2821] ||| finished shutdown tests (runtime: 61 s)
3: # [2025-01-27T14:17:48.457452Z] [debug] [pid:2821] ||| post fail hooks runtime: 0 s
3: # [2025-01-27T14:17:48.457864Z] [debug] [pid:2821] stopping overall test execution after a fatal test failure
I think the machine is not getting the ACPI shutdown signal, so maybe forcing it is the way to go.
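For reference, the experimented variant presumably boils down to something like the following sketch - the surrounding boilerplate is an assumption, only the power/assert_shutdown calls and the 60 s timeout are taken from the log above:

use base 'basetest';
use strict;
use warnings;
use testapi;

sub run {
    power('acpi');         # shutdown.pm:9 - request an ACPI powerdown instead of typing "poweroff"
    assert_shutdown(60);   # shutdown.pm:10 - dies if the VM is still running after 60 s
}

1;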
Forcing shutdown by using power('off')
By changing shutdown.pm:9 to force the shutdown, the test fails with 2 failures:
3: # Failed test 'test executed fine'
3: # at t/99-full-stack.t line 57.
3: # '
3: # <a lot of logs>
3: # 2825: EXIT 1
3: # '
3: # doesn't match '(?^u:\d*: EXIT 0)'
3:
3: # Failed test 'Result in testresults/result-shutdown.json is ok'
3: # at t/99-full-stack.t line 83.
3: # got: 'canceled'
3: # expected: 'ok'
3: # Looks like you failed 2 tests of 30.
About the failure at line 57: looking at the autoinst-log we get:
3: # [2025-01-27T19:05:33.766445Z] [warn] [pid:2825] !!! OpenQA::Isotovideo::Runner::_read_response: THERE IS NOTHING TO READ 17 5 4
3: # [2025-01-27T19:05:33.766692Z] [debug] [pid:2825] isotovideo handle_shutdown
3: # [2025-01-27T19:05:33.766852Z] [debug] [pid:2825] stopping command server 2832 because test execution ended
3: # [2025-01-27T19:05:33.766970Z] [debug] [pid:2825] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:15223/KPAoBalWrQ/broadcast
3: # [2025-01-27T19:05:35.258101Z] [debug] [pid:2825] commands process exited: 0
3: # [2025-01-27T19:05:35.358677Z] [debug] [pid:2825] done with command server
3: # [2025-01-27T19:05:35.358863Z] [debug] [pid:2825] stopping autotest process 2835
3: # [2025-01-27T19:05:35.359249Z] [debug] [pid:2835] autotest received signal TERM, saving results of current test before exiting
3: # [2025-01-27T19:05:35.365028Z] [debug] [pid:2825] [autotest] process exited: 1
3: # [2025-01-27T19:05:35.465514Z] [debug] [pid:2825] done with autotest process
3: # [2025-01-27T19:05:35.465673Z] [debug] [pid:2825] isotovideo failed
3: # [2025-01-27T19:05:35.466479Z] [debug] [pid:2825] stopping backend process 2838
3: # [2025-01-27T19:05:36.167979Z] [debug] [pid:2825] done with backend process
3: # 2825: EXIT 1
There is something wrong with the handling of the isotovideo shutdown.
Next steps
Since the first shutdown method is not shutting down the machine, I'm trying to find a solution for the second one.
Updated by gpuliti 2 months ago
I'm trying to trace the call chain to understand how the shutdown is being handled:
- everything starts at line 53 of 99-full-stack with the isotovideo run: perl $toplevel_dir/isotovideo --workdir $pool_dir -d
- inside the script, line 185 calls $runner->handle_commands; which is a call to runner.pm::handle_commands at line 114
- in that block another function with an interesting choice of name is called at line 146: $self->command_handler($command_handler), which calls something somewhere in CommandHandler, probably at line 81 via $self->can('_handle_command_' . $cmd), and I guess something here should run the shutdown:8 power('off') defined in qemu line 70 as sub power ($self, $args)
- all of this generates a log called testresults/result-shutdown.json
- in the end this log is parsed by 99-full-stack:74, which decodes the JSON and runs the actual test expectations
But I'm still missing a lot in order to understand why
3: # Failed test 'test executed fine'
3: # at t/99-full-stack.t line 57.
and
3: # Failed test 'Result in testresults/result-shutdown.json is ok'
3: # at t/99-full-stack.t line 83.
are failing.
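For context, the dispatch mentioned above follows a common Perl pattern; a simplified sketch (not the actual CommandHandler code - method and variable names other than the _handle_command_ prefix are assumptions):

# The command name is mapped to a method called _handle_command_<name> via
# can(); if such a method exists it is invoked with the command arguments.
sub process_command ($self, $cmd, $args) {
    my $handler = $self->can('_handle_command_' . $cmd)
      or die "isotovideo: unknown command '$cmd'";
    return $handler->($self, $args);
}

# a command 'status' would then end up in a method like
sub _handle_command_status ($self, $args) { ... }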
Updated by okurz 2 months ago
- Copied to action #176319: testapi power function call "off" needs to be handled gracefully by os-autoinst size:S added
Updated by gpuliti 2 months ago
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
During the Midweekly Unblock we tried to resolve the ticket and find a solution using power('off'), but it seems that the function itself is "not working properly", so we decided to open a new ticket that will solve the host-qmp-quit QMP call problem.
Blocked by #176319
Updated by okurz 2 months ago
- Copied to action #176334: Use non-VNC console in os-autoinst full-stack test to ensure stable command execution (prevent mistyping) added
Updated by okurz 2 months ago
- Copied to action #176346: [spike][timebox:10h] Clear error indication in the openQA full-stack.t if tests fail with mistyping or lost keys while system is under high load added
Updated by livdywan about 1 month ago
As this ticket has a higher priority than the blocker and exceeded our SLOs, I'm raising the priority of #176319 again.
Updated by okurz 30 days ago
- Status changed from Blocked to Workable
gpuliti wrote in #note-15:
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw
# Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
# at t/full-stack.t line 119.
# free workers: []
# Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
# at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago
which IIUC will not be fixed by #176319
Updated by livdywan 29 days ago
okurz wrote in #note-39:
gpuliti wrote in #note-15:
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw
# Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
# at t/full-stack.t line 119.
# free workers: []
# Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
# at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago
which IIUC will not be fixed by #176319
Right. It's both related to load but that wouldn't be affected by a more reliable shutdown.
Updated by okurz 29 days ago
- Priority changed from Normal to Urgent
Another failure 2025-03-03, 1530Z: https://app.circleci.com/pipelines/github/os-autoinst/openQA/16047/workflows/be893671-47af-4c11-a2a7-a69140b42c57/jobs/154592
Updated by gpuliti 29 days ago
Mitigation: https://github.com/os-autoinst/openQA/pull/6244
Updated by gpuliti 29 days ago
- Follows action #178249: load detection and job flagging under high load conditions in openQA job execution size:S added
Updated by tinita 28 days ago
https://github.com/os-autoinst/openQA/pull/6248 ci: Reduce timeout for wait_for_result_panel (merged)
Updated by openqa_review 28 days ago
- Due date set to 2025-03-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by gpuliti 27 days ago
- Status changed from In Progress to Feedback
okurz wrote in #note-39:
gpuliti wrote in #note-15:
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw
# Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
# at t/full-stack.t line 119.
# free workers: []
# Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
# at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago
which IIUC will not be fixed by #176319
I've thought about it for quite some time now and I don't understand what to do for the last point. Is it something that can be done by changing the fullstack test?
Updated by livdywan 26 days ago
gpuliti wrote in #note-52:
okurz wrote in #note-39:
gpuliti wrote in #note-15:
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw
# Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
# at t/full-stack.t line 119.
# free workers: []
# Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
# at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago
which IIUC will not be fixed by #176319
I've thought about for quite some times by now and I don't understand what to do for the last point. is something that can be done by changing the fullstack test?
You could add CRITICAL_LOAD_AVG_THRESHOLD to the worker settings used by the (developer mode) fullstack test.
Apparently it's not on open.qa - I'll prepare a PR for that
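For reference, on a regular worker that setting would go into workers.ini roughly as in the sketch below; the value is a placeholder, and the fullstack test generates its worker configuration itself, so the actual change would rather go into the test setup:

[global]
# the worker temporarily accepts no new jobs while the load average exceeds
# this value; 40 is the default according to the warning logged above
CRITICAL_LOAD_AVG_THRESHOLD = 10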
Updated by gpuliti 22 days ago
- Status changed from Feedback to In Progress
Added a PR to solve point 4: https://github.com/os-autoinst/openQA/pull/6283
Updated by okurz 19 days ago
- Status changed from Blocked to Workable
#176319 resolved. All included with https://github.com/os-autoinst/os-autoinst/pull/2676 which was merged. This will be used automatically within openQA soon, I guess. Then we can verify that the openQA test works stably.
Updated by gpuliti 19 days ago · Edited
- Status changed from Workable to Feedback
- as urgency mitigation apply retry, e.g. in level of the Makefile
- prevent VNC typing of "poweroff", use qemu poweroff ACPI or virtio_terminal or just force shutdown
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
Since #2676 has been closed by @dheidler I've opened a rollback PR, #6293, and am preparing to solve all the points in the list above.
Updated by gpuliti about 11 hours ago
- Status changed from Blocked to In Progress
Updated by okurz about 10 hours ago
The CI job already failed this morning in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16470/workflows/e3abc18d-d02a-4123-b96b-14ff526d86e8/jobs/159319 showing that the latest version of os-autoinst wasn't deployed. I wonder why you haven't seen that.
Updated by gpuliti about 9 hours ago
- Status changed from In Progress to Feedback
@okurz I can't notice everything :)
for the time being:
success: 17, failure: 8, total: 25
failure percentage: 32%
Moving to feedback, waiting for more runs from CircleCI job 159382.