action #175060
coordination #176337: [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)
coordination #176340: [epic] Stable qemu backend with no unexpected mistyping
[sporadic] [Workflow] Failed: os-autoinst/openQA on master / test (7dc9d82) size:M
Description
Observation
fullstack failed on master
https://app.circleci.com/pipelines/github/os-autoinst/openQA/15433/workflows/dd25c1f6-7e00-4342-b808-f9f1398fcc47/jobs/147885/parallel-runs/0/steps/0-111
[14:08:58] t/full-stack.t ......... 31/? # stopped waiting for '(?^u:Result: passed)', result turned out to be 'failed,'
# full result panel contents:
# 2 1
# Result: failed, finished less than a minute ago (ran for 02:06 minutes)
# Scheduled product: job has not been created by posting an ISO
# Assigned worker: 31dcf6c6b299:1
# Actions:
#
# Restart job
# Failed test 'Expected result not found'
# at /home/squamata/project/t/lib/OpenQA/Test/FullstackUtils.pm line 76.
# Failed test 'job 5 passed'
# at t/full-stack.t line 349.
# 0
# Looks like you failed 2 tests of 19.
# Failed test 'results of test 5'
# at t/full-stack.t line 387.
Acceptance Criteria
- AC1: full-stack tests pass reliably on CircleCI with no retries
Suggestions
- See full logs on https://output.circle-artifacts.com/output/job/bb84062f-82a9-49c3-815e-e875716cac82/artifacts/0/artifacts/full-stack.t
- Maybe CircleCI is being slow. We've seen that before.
- Is this reproducible? Re-run seemed to pass. Seems to happen ~monthly https://app.circleci.com/pipelines/github/os-autoinst/openQA?branch=master&status=none&status=failing&status=failed
- As mentioned in #175060#note-3 this happens because "sudo poweroff" cannot be typed - the message "udevadm settle - timeout of 5 seconds reached …" appears in the middle.
- Maybe we can handle that condition in test code and retry typing (see the sketch below this list)? This is probably problematic as the issue is not easily reproducible and the exact behavior might differ from case to case (e.g. console output might be interleaved in some cases).
- Check the prompt e.g. look for a lot of blank space
- Maybe extend logging from within the shutdown function to make it easier to check?
- Override the load detection to more easily reproduce the issue?
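A minimal sketch of that retry-typing idea, purely illustrative: the helper name, the needle name and the timeouts are assumptions and not existing os-autoinst API; only type_string/wait_still_screen/check_screen/send_key are real testapi calls.

use testapi;    # type_string, wait_still_screen, check_screen, send_key

# Hypothetical helper: type a command without submitting it, verify the echoed
# line via a needle, and clear + retype if console output (e.g. the "udevadm
# settle" message) garbled the input.
sub type_string_checked ($cmd, $retries = 3) {
    for (1 .. $retries) {
        type_string $cmd;             # type, but do not press return yet
        wait_still_screen 2;          # give the echo time to settle
        if (check_screen('cmd-typed-correctly', 5)) {    # assumed needle matching the clean prompt line
            send_key 'ret';
            return 1;
        }
        send_key 'ctrl-u';            # clear the mistyped line and retry
    }
    die "could not type '$cmd' reliably after $retries attempts";
}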
Rollback actions
Updated by tinita 3 months ago
- Subject changed from [Workflow] Failed: os-autoinst/openQA on master / test (7dc9d82) to [sporadic] [Workflow] Failed: os-autoinst/openQA on master / test (7dc9d82)
- Status changed from New to In Progress
- Assignee set to tinita
- Priority changed from Urgent to High
I would say it's not Urgent as it is sporadic, and I couldn't find an occurrence in the last 15 days. Earlier test runs are currently not available in CircleCI; the GUI hangs when I try to load more runs.
https://app.circleci.com/pipelines/github/os-autoinst/openQA?branch=master&status=none&status=failed
Updated by mkittler 3 months ago
It failed yesterday with a different symptom, though:
# Failed test 'no worker with class qemu_i386 showed up after 18.0321 seconds'
# at t/full-stack.t line 120.
# 0
# Failed test 'Unable to assign 1 jobs after 18.0321 seconds'
# at t/full-stack.t line 143.
# full result panel contents:
# State: scheduled, created 28 minutes ago
# Scheduled product: job has not been created by posting an ISO
# 50
# Actions:
#
# Cancel job
[17:25:04] t/full-stack.t ......... 17/?
# Failed test 'Expected result not found'
The case we discussed and this ticket is about: Typing issues in the VM, probably due to the VM being very slow, see #175060#note-3.
The case from yesterday: The worker didn't pick up the job in the first place (at least not in time).
The reason for both problems is probably the same, though. CircleCI is overloaded. In the case from yesterday we see the following log message showing that the worker didn't pick up the job because of the load threshold:
[info] [pid:336] Registered and connected via websockets with openQA host http://localhost:9526 and worker ID 1
[warn] [pid:336] The average load (73.51 38.51 27.36) is exceeding the configured threshold of 40. The worker will temporarily not accept new jobs until the load is lower again. - checking again for web UI 'http://localhost:9526' in 228.39 s
[debug] [pid:201] Updating seen of worker 1 from worker_status (broken)
not ok 15 - no worker with class qemu_i386 showed up after 18.0321 seconds
# free workers: []
not ok 16 - Unable to assign 1 jobs after 18.0321 seconds
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 0 of 3240)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 1 of 3240)
…
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 406 of 3240)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 407 of 3240)
[warn] [pid:336] The average load (45.93 40.02 30.30) is exceeding the configured threshold of 40. The worker will temporarily not accept new jobs until the load is lower again. - checking again for web UI 'http://localhost:9526' in 290.71 s
[debug] [pid:201] Updating seen of worker 1 from worker_status (broken)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 408 of 3240)
…
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 3238 of 3240)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 3239 of 3240)
ok 17 - No unexpected js warnings
not ok 18 - Expected result not found
not ok 19 - test 1 is running
Bail out! unable to run test 1
[info] [pid:336] Received signal TERM
[debug] [pid:336] Informing http://localhost:9526 that we are going offline
[info] [pid:201] Worker 1 websocket connection closed - 1006
We could disable the load threshold in the fullstack test but this would turn more cases like the one from yesterday into cases where we have typing issues within the VM.
We probably should actually lower the threshold in the fullstack test to avoid typing issues. Then we still have to think about what we do when running into the threshold. Not sure whether it would make sense to simply wait longer. The failure from yesterday ran into the timeout after 30 minutes. So there's maybe still room to extend timeouts¹ a little bit without running into the CircleCI context deadline but it will probably not help much.
Of course there could also be a regression on our side, e.g. a spinning loop in one of the processes that causes the high load. I suspect more that it is CircleCI, though.
¹ We should only extend the timeout of wait_for_result_panel when we actually see "… exceeding the configured threshold …" being logged. Not sure how to read the log messages of the worker in that function efficiently, though.
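A rough sketch of how that could look - hypothetical helper and variable names, not the current FullstackUtils code:

use Mojo::File qw(path);

# Hypothetical check: did the worker log the load-threshold warning?
sub worker_hit_load_threshold ($worker_log_file) {
    return -e $worker_log_file
      && path($worker_log_file)->slurp =~ /exceeding the configured threshold/;
}

# Inside the wait_for_result_panel polling loop the timeout could then be
# extended once, e.g.:
# $timeout += $extra_wait if !$already_extended++ && worker_hit_load_threshold($worker_log_file);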
Updated by okurz 3 months ago
mkittler wrote in #note-10:
[...] We probably should actually lower the threshold in the fullstack test to avoid typing issues. Then we still have to think about what we do when running into the threshold. Not sure whether it would make sense to simply wait longer. The failure from yesterday ran into the timeout after 30 minutes. So there's maybe still room to extend timeouts¹ a little bit without running into the CircleCI context deadline but it will probably not help much.
That actually sounds very good to me. It's great to see that the worker load detection helps here. I agree we should lower the load threshold and wait longer for the CircleCI environment to have less load. At best we can make it clear enough in the logs what's happening and also state what is going on in case the load did not settle within the timeout.
Updated by okurz 3 months ago
- as urgency mitigation apply retry, e.g. in level of the Makefile
- prevent VNC typing of "poweroff", use qemu poweroff ACPI or virtio_terminal or just force shutdown
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
Updated by gpuliti 3 months ago
I've marked t/full-stack.t as unstable in unstable_tests.txt in https://github.com/os-autoinst/openQA/pull/6107
Updated by gpuliti 3 months ago
- Status changed from In Progress to Workable
- Assignee deleted (gpuliti)
- Priority changed from Urgent to High
Mitigated with the PR above, https://github.com/os-autoinst/openQA/pull/6107, which is merged.
I dropped the priority as we discussed this morning.
Leave the next 3 points as workable. @okurz wrote in https://progress.opensuse.org/issues/175060#note-13:
- as urgency mitigation apply retry, e.g. in level of the Makefile
- prevent VNC typing of "poweroff", use qemu poweroff ACPI or virtio_terminal or just force shutdown
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
Updated by gpuliti 3 months ago
- Status changed from In Progress to Feedback
I've applied a retry mechanism at Makefile level to handle sporadic failures caused by CircleCI load (https://github.com/os-autoinst/openQA/pull/6115); looking at the history I think 2 retries are enough.
I've done further investigation to understand how to handle the shutdown properly and found that the shutdown is handled in full-stack:90, which sends the shutdown request; then full-stack:197 waits for the console showing that the system is paused due to a shutdown, and I guess this is where the typing might cause problems!?
If my changes are approved, we should add rollback actions to the task.
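For illustration, a Makefile-level retry could look roughly like the following sketch - the target name and the RETRIES variable are assumptions and the actual PR may implement it differently (recipe lines must be tab-indented):

RETRIES ?= 2

.PHONY: test-fullstack-retry
test-fullstack-retry:
	for i in $$(seq 0 $(RETRIES)); do \
		prove -l t/full-stack.t && exit 0; \
		echo "full-stack.t failed, retry $$i of $(RETRIES)"; \
	done; exit 1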
Updated by openqa_review 3 months ago
- Due date set to 2025-01-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by gpuliti 3 months ago
Working on solution for point 2: https://github.com/os-autoinst/os-autoinst/pull/2623
Rollback of mistaken merged pr: https://github.com/os-autoinst/openQA/pull/6124
Updated by gpuliti 2 months ago · Edited
I'm debugging the tests using the github actions pipeline.
It seems that the change to shutdown.pm broke a test:
3: # ------
3: # $VAR1 = {
3: # 'dents' => 0,
3: # 'details' => [
3: # {
3: # 'result' => 'fail',
3: # 'screenshot' => 'shutdown-2.png',
3: # 'frametime' => [
3: # '17.62',
3: # '17.67'
3: # ]
3: # },
3: # {
3: # 'text' => 'shutdown-3.txt',
3: # 'title' => 'Failed',
3: # 'result' => 'fail'
3: # }
3: # ],
3: # 'result' => 'fail',
3: # 'execution_time' => 61
3: # };
3: #
3: # ------
3:
3: # Failed test 'Result in testresults/result-shutdown.json is ok'
3: # at t/99-full-stack.t line 79.
3: # got: 'fail'
3: # expected: 'ok'
Updated by gpuliti 2 months ago
I've done 2 tests to prevent VNC typing of "poweroff": the first one uses ACPI and the second forces the shutdown.
Using power('acpi')
By changing shutdown.pm:9 to power('acpi') the test fails with:
3: # Failed test 'Result in testresults/result-shutdown.json is ok'
3: # at t/99-full-stack.t line 83.
3: # got: 'fail'
3: # expected: 'ok'
Looking at autoinst-log.txt it seems that the machine is not properly shutting down, causing a timeout in the testapi::check_shutdown method:
3: # [2025-01-27T14:16:47.274338Z] [debug] [pid:2821] <<< testapi::power(action="acpi")
3: # [2025-01-27T14:16:47.275427Z] [debug] [pid:2824] EVENT {"event":"POWERDOWN","timestamp":{"microseconds":275186,"seconds":1737987407}}
3: # [2025-01-27T14:16:47.275970Z] [debug] [pid:2821] tests/shutdown.pm:10 called testapi::assert_shutdown
3: # [2025-01-27T14:16:47.276076Z] [debug] [pid:2821] <<< testapi::check_shutdown(timeout=60)
3: # [2025-01-27T14:16:47.277158Z] [debug] [pid:2824] QEMU status is not 'shutdown', it is 'running'
3: # <a lot of QEMU status is not shutdown>
3: # [2025-01-27T14:17:47.409491Z] [debug] [pid:2824] QEMU status is not 'shutdown', it is 'running'
3: # [2025-01-27T14:17:48.454331Z] [info] [pid:2821] ::: basetest::runtest: # Test died: Machine didn't shut down! at /opt/t/data/tests/tests/shutdown.pm line 10.
3: #
3: # [2025-01-27T14:17:48.457336Z] [debug] [pid:2821] ||| finished shutdown tests (runtime: 61 s)
3: # [2025-01-27T14:17:48.457452Z] [debug] [pid:2821] ||| post fail hooks runtime: 0 s
3: # [2025-01-27T14:17:48.457864Z] [debug] [pid:2821] stopping overall test execution after a fatal test failure
I think the machine is not getting the ACPI shutdown signal, so maybe forcing it is the way to go.
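For reference, the experimented variant presumably boils down to something like the following sketch - the surrounding boilerplate is an assumption, only the power/assert_shutdown calls and the 60 s timeout are taken from the log above:

use base 'basetest';
use strict;
use warnings;
use testapi;

sub run {
    power('acpi');         # shutdown.pm:9 - request an ACPI powerdown instead of typing "poweroff"
    assert_shutdown(60);   # shutdown.pm:10 - dies if the VM is still running after 60 s
}

1;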
Forcing shutdown by using power('off')
By changing shutdown.pm:9 to force the shutdown, the test fails with 2 failures:
3: # Failed test 'test executed fine'
3: # at t/99-full-stack.t line 57.
3: # '
3: # <a lot of logs>
3: # 2825: EXIT 1
3: # '
3: # doesn't match '(?^u:\d*: EXIT 0)'
3:
3: # Failed test 'Result in testresults/result-shutdown.json is ok'
3: # at t/99-full-stack.t line 83.
3: # got: 'canceled'
3: # expected: 'ok'
3: # Looks like you failed 2 tests of 30.
About the failure at line 57: looking at the autoinst-log we get:
3: # [2025-01-27T19:05:33.766445Z] [warn] [pid:2825] !!! OpenQA::Isotovideo::Runner::_read_response: THERE IS NOTHING TO READ 17 5 4
3: # [2025-01-27T19:05:33.766692Z] [debug] [pid:2825] isotovideo handle_shutdown
3: # [2025-01-27T19:05:33.766852Z] [debug] [pid:2825] stopping command server 2832 because test execution ended
3: # [2025-01-27T19:05:33.766970Z] [debug] [pid:2825] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:15223/KPAoBalWrQ/broadcast
3: # [2025-01-27T19:05:35.258101Z] [debug] [pid:2825] commands process exited: 0
3: # [2025-01-27T19:05:35.358677Z] [debug] [pid:2825] done with command server
3: # [2025-01-27T19:05:35.358863Z] [debug] [pid:2825] stopping autotest process 2835
3: # [2025-01-27T19:05:35.359249Z] [debug] [pid:2835] autotest received signal TERM, saving results of current test before exiting
3: # [2025-01-27T19:05:35.365028Z] [debug] [pid:2825] [autotest] process exited: 1
3: # [2025-01-27T19:05:35.465514Z] [debug] [pid:2825] done with autotest process
3: # [2025-01-27T19:05:35.465673Z] [debug] [pid:2825] isotovideo failed
3: # [2025-01-27T19:05:35.466479Z] [debug] [pid:2825] stopping backend process 2838
3: # [2025-01-27T19:05:36.167979Z] [debug] [pid:2825] done with backend process
3: # 2825: EXIT 1
There is something wrong with the handling of the isotovideo shutdown.
Next steps
Since the first shutdown method is not shutting down the machine, I'm trying to find a solution for the second one.
Updated by gpuliti 2 months ago
I'm trying to trace the call chain to understand how the shutdown is being handled:
- everything starts at line 53 of 99-full-stack with the isotovideo run: perl $toplevel_dir/isotovideo --workdir $pool_dir -d
- inside the script, line 185 calls $runner->handle_commands; which is a call to runner.pm::handle_commands at line 114
- in that block another function with an interesting choice of name is called at line 146: $self->command_handler($command_handler), which calls something somewhere in CommandHandler, probably at line 81 via $self->can('_handle_command_' . $cmd), and I guess something here should run the shutdown:8 power('off') defined in qemu line 70 as sub power ($self, $args)
- all of this generates a log called testresults/result-shutdown.json
- in the end this log is parsed by 99-full-stack:74, which decodes the JSON and runs the actual test expectations
But I'm still missing a lot in order to understand why
3: # Failed test 'test executed fine'
3: # at t/99-full-stack.t line 57.
and
3: # Failed test 'Result in testresults/result-shutdown.json is ok'
3: # at t/99-full-stack.t line 83.
are failing.
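For context, the dispatch mentioned above follows a common Perl pattern; a simplified sketch (not the actual CommandHandler code - method and variable names other than the _handle_command_ prefix are assumptions):

# The command name is mapped to a method called _handle_command_<name> via
# can(); if such a method exists it is invoked with the command arguments.
sub process_command ($self, $cmd, $args) {
    my $handler = $self->can('_handle_command_' . $cmd)
      or die "isotovideo: unknown command '$cmd'";
    return $handler->($self, $args);
}

# a command 'status' would then end up in a method like
sub _handle_command_status ($self, $args) { ... }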
Updated by okurz 2 months ago
- Copied to action #176319: testapi power function call "off" needs to be handled gracefully by os-autoinst size:S added
Updated by gpuliti 2 months ago
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
During the Midweekly Unblock we tried to resolve the ticket and find a solution using power('off'), but it seems that the function itself is "not working properly", so we decided to open a new ticket that will solve the host-qmp-quit QMP call problem.
Blocked by #176319
Updated by okurz 2 months ago
- Copied to action #176334: Use non-VNC console in os-autoinst full-stack test to ensure stable command execution (prevent mistyping) added
Updated by okurz 2 months ago
- Copied to action #176346: [spike][timebox:10h] Clear error indication in the openQA full-stack.t if tests fail with mistyping or lost keys while system is under high load added
Updated by livdywan about 1 month ago
As this ticket has a higher priority than the blocker and exceeded our SLOs, I'm raising the priority of #176319 again.
Updated by okurz 30 days ago
- Status changed from Blocked to Workable
gpuliti wrote in #note-15:
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw
# Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
# at t/full-stack.t line 119.
# free workers: []
# Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
# at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago
which IIUC will not be fixed by #176319
Updated by livdywan 29 days ago
okurz wrote in #note-39:
gpuliti wrote in #note-15:
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw
# Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
# at t/full-stack.t line 119.
# free workers: []
# Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
# at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago
which IIUC will not be fixed by #176319
Right. It's both related to load but that wouldn't be affected by a more reliable shutdown.
Updated by okurz 29 days ago
- Priority changed from Normal to Urgent
Another failure 2025-03-03, 1530Z: https://app.circleci.com/pipelines/github/os-autoinst/openQA/16047/workflows/be893671-47af-4c11-a2a7-a69140b42c57/jobs/154592
Updated by gpuliti 29 days ago
Mitigation: https://github.com/os-autoinst/openQA/pull/6244
Updated by gpuliti 29 days ago
- Follows action #178249: load detection and job flagging under high load conditions in openQA job execution size:S added
Updated by tinita 28 days ago
https://github.com/os-autoinst/openQA/pull/6248 ci: Reduce timeout for wait_for_result_panel (merged)
Updated by openqa_review 28 days ago
- Due date set to 2025-03-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by gpuliti 27 days ago
- Status changed from In Progress to Feedback
okurz wrote in #note-39:
gpuliti wrote in #note-15:
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw
# Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
# at t/full-stack.t line 119.
# free workers: []
# Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
# at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago
which IIUC will not be fixed by #176319
I've thought about it for quite some time now and I don't understand what to do for the last point. Is it something that can be done by changing the fullstack test?
Updated by livdywan 26 days ago
gpuliti wrote in #note-52:
okurz wrote in #note-39:
gpuliti wrote in #note-15:
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw
# Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
# at t/full-stack.t line 119.
# free workers: []
# Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
# at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago
which IIUC will not be fixed by #176319
I've thought about for quite some times by now and I don't understand what to do for the last point. is something that can be done by changing the fullstack test?
You could add CRITICAL_LOAD_AVG_THRESHOLD to the worker settings used by the (developer mode) fullstack test.
Apparently it's not on open.qa - I'll prepare a PR for that
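For reference, on a regular worker that setting would go into workers.ini roughly as in the sketch below; the value is a placeholder, and the fullstack test generates its worker configuration itself, so the actual change would rather go into the test setup:

[global]
# the worker temporarily accepts no new jobs while the load average exceeds
# this value; 40 is the default according to the warning logged above
CRITICAL_LOAD_AVG_THRESHOLD = 10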
Updated by gpuliti 22 days ago
- Status changed from Feedback to In Progress
Added a PR to solve point 4: https://github.com/os-autoinst/openQA/pull/6283
Updated by okurz 19 days ago
- Status changed from Blocked to Workable
#176319 resolved. All included with https://github.com/os-autoinst/os-autoinst/pull/2676 which was merged. This will be used automatically within openQA soon, I guess. Then we can verify that the openQA test works stably.
Updated by gpuliti 19 days ago · Edited
- Status changed from Workable to Feedback
- as urgency mitigation apply retry, e.g. in level of the Makefile
- prevent VNC typing of "poweroff", use qemu poweroff ACPI or virtio_terminal or just force shutdown
- create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
- disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load
Since #2676 has been closed by @dheidler I've opened a rollback PR, #6293, and am preparing to solve all the points in the list above.
Updated by gpuliti about 11 hours ago
- Status changed from Blocked to In Progress
Updated by okurz about 10 hours ago
The CI job already failed this morning in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16470/workflows/e3abc18d-d02a-4123-b96b-14ff526d86e8/jobs/159319 showing that the latest version of os-autoinst wasn't deployed. I wonder why you haven't seen that.
Updated by gpuliti about 9 hours ago
- Status changed from In Progress to Feedback
@okurz I can't notice everything :)
for the time being:
success: 17, failure: 8, total: 25
failure percentage: 32%
Moving to feedback, waiting for more runs from CircleCI job 159382.