action #175060


coordination #176337: [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)

coordination #176340: [epic] Stable qemu backend with no unexpected mistyping

[sporadic] [Workflow] Failed: os-autoinst/openQA on master / test (7dc9d82) size:M

Added by okurz 3 months ago. Updated about 9 hours ago.

Status: Feedback
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

The full-stack test failed on master:
https://app.circleci.com/pipelines/github/os-autoinst/openQA/15433/workflows/dd25c1f6-7e00-4342-b808-f9f1398fcc47/jobs/147885/parallel-runs/0/steps/0-111

[14:08:58] t/full-stack.t ......... 31/?         # stopped waiting for '(?^u:Result: passed)', result turned out to be 'failed,'
        # full result panel contents:
        # 2  1 
        # Result: failed, finished less than a minute ago (ran for 02:06 minutes)
        # Scheduled product: job has not been created by posting an ISO
        # Assigned worker: 31dcf6c6b299:1
        # Actions:
        # 
        # Restart job

        #   Failed test 'Expected result not found'
        #   at /home/squamata/project/t/lib/OpenQA/Test/FullstackUtils.pm line 76.

        #   Failed test 'job 5 passed'
        #   at t/full-stack.t line 349.
        # 0
        # Looks like you failed 2 tests of 19.

    #   Failed test 'results of test 5'
    #   at t/full-stack.t line 387.

Acceptance Criteria

  • AC1: full-stack tests pass reliably on CircleCI with no retries

Suggestions

Rollback actions


Files

issue-175060-fullstack-sudo-pow.png (3.14 KB) - "sudo pow" needle - tinita, 2025-01-07 17:31

Related issues 4 (3 open, 1 closed)

Follows openQA Project (public) - action #178249: load detection and job flagging under high load conditions in openQA job execution size:S (Workable)

Copied to openQA Project (public) - action #176319: testapi power function call "off" needs to be handled gracefully by os-autoinst size:S (Feedback, dheidler, 2025-04-08)

Copied to openQA Project (public) - action #176334: Use non-VNC console in os-autoinst full-stack test to ensure stable command execution (prevent mistyping) (Rejected, okurz, 2025-01-07)

Copied to openQA Project (public) - action #176346: [spike][timebox:10h] Clear error indication in the openQA full-stack.t if tests fail with mistyping or lost keys while system is under high load (New, okurz)

Actions #1

Updated by okurz 3 months ago

  • Priority changed from High to Urgent
Actions #2

Updated by tinita 3 months ago

  • Subject changed from [Workflow] Failed: os-autoinst/openQA on master / test (7dc9d82) to [sporadic] [Workflow] Failed: os-autoinst/openQA on master / test (7dc9d82)
  • Status changed from New to In Progress
  • Assignee set to tinita
  • Priority changed from Urgent to High

I would say it's not Urgent as it is sporadic, and I couldn't find an occurrence in the last 15 days. Earlier test runs are currently not available in CircleCI; the GUI hangs when I try to load more runs.
https://app.circleci.com/pipelines/github/os-autoinst/openQA?branch=master&status=none&status=failed

Actions #3

Updated by tinita 3 months ago

I looked at the artifacts, specifically the images artifacts/full-stack.d/openqa/images/..., and compared them to a successful run.
This one looks suspicious; it says "sudo: pow: command not found".

Actions #4

Updated by tinita 3 months ago · Edited

Another difference:
failed "Unknown signal 85"
passed

Edit: I think I found this in the passed job as well, though. It's difficult to keep an overview because the files don't have specific names.

Actions #6

Updated by tinita 3 months ago

  • Status changed from In Progress to New
  • Assignee deleted (tinita)

No idea currently what could be wrong...

Actions #7

Updated by livdywan 3 months ago

  • Description updated (diff)
Actions #8

Updated by tinita 3 months ago

  • Description updated (diff)
Actions #10

Updated by mkittler 3 months ago

It failed yesterday with a different symptom, though:

#   Failed test 'no worker with class qemu_i386 showed up after 18.0321 seconds'
#   at t/full-stack.t line 120.
# 0

#   Failed test 'Unable to assign 1 jobs after 18.0321 seconds'
#   at t/full-stack.t line 143.
# full result panel contents:
# State: scheduled, created 28 minutes ago
# Scheduled product: job has not been created by posting an ISO
#  50 
# Actions:
# 
# Cancel job
[17:25:04] t/full-stack.t ......... 17/? 
#   Failed test 'Expected result not found'

The case we discussed and that this ticket is about: typing issues in the VM, probably due to the VM being very slow, see #175060#note-3.

The case from yesterday: The worker didn't pick up the job in the first place (at least not in time).

The reason for both problems is probably the same, though. CircleCI is overloaded. In the case from yesterday we see the following log message showing that the worker didn't pick up the job because of the load threshold:

[info] [pid:336] Registered and connected via websockets with openQA host http://localhost:9526 and worker ID 1
[warn] [pid:336] The average load (73.51 38.51 27.36) is exceeding the configured threshold of 40. The worker will temporarily not accept new jobs until the load is lower again. - checking again for web UI 'http://localhost:9526' in 228.39 s
[debug] [pid:201] Updating seen of worker 1 from worker_status (broken)
not ok 15 - no worker with class qemu_i386 showed up after 18.0321 seconds
# free workers: []
not ok 16 - Unable to assign 1 jobs after 18.0321 seconds
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 0 of 3240)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 1 of 3240)
…
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 406 of 3240)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 407 of 3240)
[warn] [pid:336] The average load (45.93 40.02 30.30) is exceeding the configured threshold of 40. The worker will temporarily not accept new jobs until the load is lower again. - checking again for web UI 'http://localhost:9526' in 290.71 s
[debug] [pid:201] Updating seen of worker 1 from worker_status (broken)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 408 of 3240)
…
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 3238 of 3240)
# Wait for AJAX successful: wait_for_result_panel: waiting for '(?^u:State: running)' (count 3239 of 3240)
ok 17 - No unexpected js warnings
not ok 18 - Expected result not found
not ok 19 - test 1 is running
Bail out!  unable to run test 1
[info] [pid:336] Received signal TERM
[debug] [pid:336] Informing http://localhost:9526 that we are going offline
[info] [pid:201] Worker 1 websocket connection closed - 1006

We could disable the load threshold in the fullstack test but this would turn more cases like the one from yesterday into cases where we have typing issues within the VM.

We probably should actually lower the threshold in the fullstack test to avoid typing issues. Then we still have to think about what we do when running into the threshold. Not sure whether it would make sense to simply wait longer. The failure from yesterday ran into the timeout after 30 minutes. So there's maybe still room to extend timeouts¹ a little bit without running into the CircleCI context deadline but it will probably not help much.

Of course there could also be a regression on our side, e.g. a spinning loop in one of the processes that causes the high load. I suspect more that it is CircleCI, though.


¹ We should only extend the timeout of wait_for_result_panel when we actually see "… exceeding the configured threshold …" being logged. Not sure how to read the log messages of the worker in that function efficiently, though.
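
A minimal sketch of that idea, assuming the worker log file is accessible from the test; the helper name and parameters are made up for illustration and are not the actual FullstackUtils API:

# Hypothetical helper for t/lib/OpenQA/Test/FullstackUtils.pm: pick a longer
# wait_for_result_panel timeout only when the worker already logged the load warning.
use Mojo::File 'path';

sub result_panel_timeout {
    my ($worker_log_file, $base_timeout, $extended_timeout) = @_;
    my $log = -e $worker_log_file ? path($worker_log_file)->slurp : '';
    # the warning quoted above: "The average load (...) is exceeding the configured threshold"
    return $log =~ /exceeding the configured threshold/ ? $extended_timeout : $base_timeout;
}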

Actions #11

Updated by okurz 3 months ago

mkittler wrote in #note-10:

[...] We probably should actually lower the threshold in the fullstack test to avoid typing issues. Then we still have to think about what we do when running into the threshold. Not sure whether it would make sense to simply wait longer. The failure from yesterday ran into the timeout after 30 minutes. So there's maybe still room to extend timeouts¹ a little bit without running into the CircleCI context deadline but it will probably not help much.

That actually sounds very good to me. It's great to see that the worker load detection helps here. I agree we should lower the load threshold and wait longer for the CircleCI environment to have less load. At best we can make it clear enough in the logs what's happening and also state what is going on in case the load does not settle within the timeout.

Actions #12

Updated by livdywan 3 months ago

  • Status changed from New to In Progress
  • Assignee set to gpuliti

Gabriele agreed to lead the conversation

Actions #13

Updated by okurz 3 months ago

  1. as urgency mitigation apply retry, e.g. at the level of the Makefile
  2. prevent VNC typing of "poweroff", use qemu poweroff ACPI or virtio_terminal or just force shutdown
  3. create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
  4. disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2 was done under even very high load
Actions #14

Updated by gpuliti 3 months ago

I've added t/full-stack.t to unstable_tests.txt in https://github.com/os-autoinst/openQA/pull/6107

Actions #15

Updated by gpuliti 3 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (gpuliti)
  • Priority changed from Urgent to High

Mitigated with the PR above, https://github.com/os-autoinst/openQA/pull/6107, which has been merged.

I dropped the priority as we discussed this morning.

Leaving the next 3 points as workable. @okurz wrote in https://progress.opensuse.org/issues/175060#note-13:

  1. as urgency mitigation apply retry, e.g. at the level of the Makefile
  2. prevent VNC typing of "poweroff", use qemu poweroff ACPI or virtio_terminal or just force shutdown
  3. create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
  4. disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2 was done under even very high load
Actions #16

Updated by gpuliti 3 months ago

  • Status changed from Workable to In Progress
  • Priority changed from High to Urgent
Actions #17

Updated by okurz 3 months ago

  • Assignee set to gpuliti
Actions #18

Updated by gpuliti 3 months ago

  • Status changed from In Progress to Feedback

I've applied a retry mechanism at Makefile level to handle sporadic failures caused by CircleCI load, https://github.com/os-autoinst/openQA/pull/6115, and looking at the history I think 2 retries are enough.

I've done further investigation to understand how to handle the shutdown properly. I found that the shutdown is handled in full-stack:90, which sends the shutdown request; then full-stack:197 waits for the console showing that the system is paused due to a shutdown, and I guess this is where the typing might cause problems!?

If my changes are approved, we should add rollback actions to the task.

Actions #19

Updated by livdywan 3 months ago

  • Status changed from Feedback to In Progress

If my changes are approved, we should add rollback actions to the task.

Please keep the ticket in progress unless you can mitigate and lower the priority.

Actions #20

Updated by openqa_review 3 months ago

  • Due date set to 2025-01-29

Setting due date based on mean cycle time of SUSE QE Tools

Actions #21

Updated by livdywan 3 months ago

  • Tags changed from reactive work, alert to reactive work, alert, infra
Actions #22

Updated by gpuliti 3 months ago

  • Priority changed from Urgent to High

Pull request merged, lowering the priority

Actions #23

Updated by gpuliti 3 months ago

  • Description updated (diff)
Actions #24

Updated by gpuliti 3 months ago

Working on solution for point 2: https://github.com/os-autoinst/os-autoinst/pull/2623
Rollback of mistaken merged pr: https://github.com/os-autoinst/openQA/pull/6124

Actions #25

Updated by okurz 2 months ago

  • Tags changed from reactive work, alert, infra to reactive work, alert

seems "infra" was added by mistake

Actions #26

Updated by gpuliti 2 months ago

I was finally able to run the test locally and it's working, but the pipeline is failing. I'm going to investigate the differences.

Actions #27

Updated by gpuliti 2 months ago

  • Description updated (diff)
Actions #28

Updated by gpuliti 2 months ago

  • Subject changed from [sporadic] [Workflow] Failed: os-autoinst/openQA on master / test (7dc9d82) to [sporadic] [Workflow] Failed: os-autoinst/openQA on master / test (7dc9d82) size:M
  • Description updated (diff)
Actions #29

Updated by gpuliti 2 months ago · Edited

I'm debugging the tests using the GitHub Actions pipeline.

It seems that the change to shutdown.pm broke a test:

3: # ------
3: # $VAR1 = {
3: #           'dents' => 0,
3: #           'details' => [
3: #                          {
3: #                            'result' => 'fail',
3: #                            'screenshot' => 'shutdown-2.png',
3: #                            'frametime' => [
3: #                                             '17.62',
3: #                                             '17.67'
3: #                                           ]
3: #                          },
3: #                          {
3: #                            'text' => 'shutdown-3.txt',
3: #                            'title' => 'Failed',
3: #                            'result' => 'fail'
3: #                          }
3: #                        ],
3: #           'result' => 'fail',
3: #           'execution_time' => 61
3: #         };
3: # 
3: # ------
3: 
3: #   Failed test 'Result in testresults/result-shutdown.json is ok'
3: #   at t/99-full-stack.t line 79.
3: #          got: 'fail'
3: #     expected: 'ok'
Actions #30

Updated by gpuliti 2 months ago

I've done 2 tests to prevent VNC typing of "poweroff": the first one uses ACPI and the second forces the shutdown.

Using power('acpi')

By changing shutdown.pm:9 to use power('acpi'), the test fails with:

3: #   Failed test 'Result in testresults/result-shutdown.json is ok'
3: #   at t/99-full-stack.t line 83.
3: #          got: 'fail'
3: #     expected: 'ok'

Looking at autoinst-log.txt, it seems that the machine is not properly shutting down, causing a timeout in the testapi::check_shutdown method:

3: # [2025-01-27T14:16:47.274338Z] [debug] [pid:2821] <<< testapi::power(action="acpi")
3: # [2025-01-27T14:16:47.275427Z] [debug] [pid:2824] EVENT {"event":"POWERDOWN","timestamp":{"microseconds":275186,"seconds":1737987407}}
3: # [2025-01-27T14:16:47.275970Z] [debug] [pid:2821] tests/shutdown.pm:10 called testapi::assert_shutdown
3: # [2025-01-27T14:16:47.276076Z] [debug] [pid:2821] <<< testapi::check_shutdown(timeout=60)
3: # [2025-01-27T14:16:47.277158Z] [debug] [pid:2824] QEMU status is not 'shutdown', it is 'running'
3: # <a lot of QEMU status is not shutdown>
3: # [2025-01-27T14:17:47.409491Z] [debug] [pid:2824] QEMU status is not 'shutdown', it is 'running'
3: # [2025-01-27T14:17:48.454331Z] [info] [pid:2821] ::: basetest::runtest: # Test died: Machine didn't shut down! at /opt/t/data/tests/tests/shutdown.pm line 10.
3: #   
3: # [2025-01-27T14:17:48.457336Z] [debug] [pid:2821] ||| finished shutdown tests (runtime: 61 s)
3: # [2025-01-27T14:17:48.457452Z] [debug] [pid:2821] ||| post fail hooks runtime: 0 s
3: # [2025-01-27T14:17:48.457864Z] [debug] [pid:2821] stopping overall test execution after a fatal test failure

I think the machine is not getting the ACPI shutdown signal, so maybe forcing it is the way to go.
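
For reference, this is roughly what the tested ACPI variant of tests/shutdown.pm looks like; a sketch reconstructed from the log above, not the exact test module in the os-autoinst repository:

use base 'basetest';
use strict;
use warnings;
use testapi;

sub run {
    # request an ACPI powerdown instead of typing "poweroff" on the VNC console
    power('acpi');
    # fails the module with "Machine didn't shut down!" if QEMU never reaches 'shutdown'
    assert_shutdown 60;
}

1;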

Forcing shutdown by using power('off')

By changing shutdown.pm:9 to force the shutdown, the test fails 2 tests:

3: #   Failed test 'test executed fine'
3: #   at t/99-full-stack.t line 57.
3: # '
3: # <a lot of logs>
3: # 2825: EXIT 1
3: # '
3: #     doesn't match '(?^u:\d*: EXIT 0)'
3: 
3: #   Failed test 'Result in testresults/result-shutdown.json is ok'
3: #   at t/99-full-stack.t line 83.
3: #          got: 'canceled'
3: #     expected: 'ok'
3: # Looks like you failed 2 tests of 30.

About the line 57 failure, by looking at the autoinst-log we get:

3: # [2025-01-27T19:05:33.766445Z] [warn] [pid:2825] !!! OpenQA::Isotovideo::Runner::_read_response: THERE IS NOTHING TO READ 17 5 4
3: # [2025-01-27T19:05:33.766692Z] [debug] [pid:2825] isotovideo handle_shutdown
3: # [2025-01-27T19:05:33.766852Z] [debug] [pid:2825] stopping command server 2832 because test execution ended
3: # [2025-01-27T19:05:33.766970Z] [debug] [pid:2825] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:15223/KPAoBalWrQ/broadcast
3: # [2025-01-27T19:05:35.258101Z] [debug] [pid:2825] commands process exited: 0
3: # [2025-01-27T19:05:35.358677Z] [debug] [pid:2825] done with command server
3: # [2025-01-27T19:05:35.358863Z] [debug] [pid:2825] stopping autotest process 2835
3: # [2025-01-27T19:05:35.359249Z] [debug] [pid:2835] autotest received signal TERM, saving results of current test before exiting
3: # [2025-01-27T19:05:35.365028Z] [debug] [pid:2825] [autotest] process exited: 1
3: # [2025-01-27T19:05:35.465514Z] [debug] [pid:2825] done with autotest process
3: # [2025-01-27T19:05:35.465673Z] [debug] [pid:2825] isotovideo failed
3: # [2025-01-27T19:05:35.466479Z] [debug] [pid:2825] stopping backend process 2838
3: # [2025-01-27T19:05:36.167979Z] [debug] [pid:2825] done with backend process
3: # 2825: EXIT 1

There is something wrong with the handling of the isotovideo shutdown.

Next steps

Since the first shutdown method is not shutting down the machine, I'm trying to find a solution for the second one.

Actions #31

Updated by gpuliti 2 months ago

I'm trying to work my way through the code to understand how the shutdown is being handled:

  • everything starts at line 53 of 99-full-stack with the isotovideo run: perl $toplevel_dir/isotovideo --workdir $pool_dir -d
  • inside the script, at line 185, there is $runner->handle_commands, which is a call to runner.pm::handle_commands at line 114
  • In that block there is another function with an interesting choice of name at line 146, $self->command_handler($command_handler), which calls something somewhere in CommandHandler, probably line 81: $self->can('_handle_command_' . $cmd). I guess something here should run the shutdown:8 power('off'), defined in qemu at line 70 as sub power ($self, $args) (see the dispatch sketch at the end of this comment)
  • all of this generates a log called testresults/result-shutdown.json
  • In the end this log is parsed by 99-full-stack:74, which decodes the JSON and runs the actual test expectations

But I'm still missing a lot in order to understand why

  3: #   Failed test 'test executed fine'
  3: #   at t/99-full-stack.t line 57.

and

  3: #   Failed test 'Result in testresults/result-shutdown.json is ok'
  3: #   at t/99-full-stack.t line 83.

are failing.
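
For context, a minimal sketch of the dispatch pattern mentioned in the list above; this is not the actual OpenQA::Isotovideo code, just an illustration of how a command name like "power" could end up in a _handle_command_power method:

sub process_command {
    my ($self, $cmd, $args) = @_;
    # map the command name to a handler method via can(), e.g. "power" -> _handle_command_power,
    # which would then forward to the backend's power() implementation
    my $handler = $self->can('_handle_command_' . $cmd)
        or die "isotovideo: unknown command '$cmd'";
    return $self->$handler($args);
}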

Actions #32

Updated by okurz 2 months ago

  • Copied to action #176319: testapi power function call "off" needs to be handled gracefully by os-autoinst size:S added
Actions #33

Updated by gpuliti 2 months ago

  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal

During the Midweekly Unblock we tried to resolve the ticket and find a solution using power('off'), but it seems that the function itself is "not working properly", so we decided to open a new ticket that will solve the host-qmp-quit QMP call problem.

Blocked by #176319

Actions #34

Updated by okurz 2 months ago

  • Copied to action #176334: Use non-VNC console in os-autoinst full-stack test to ensure stable command execution (prevent mistyping) added
Actions #35

Updated by okurz 2 months ago

  • Parent task set to #176340
Actions #36

Updated by okurz 2 months ago

  • Due date deleted (2025-01-29)
Actions #37

Updated by okurz 2 months ago

  • Copied to action #176346: [spike][timebox:10h] Clear error indication in the openQA full-stack.t if tests fail with mistyping or lost keys while system is under high load added
Actions #38

Updated by livdywan about 1 month ago

As this ticket has a higher priority than the blocker and exceeded our SLOs, I'm raising the priority of #176319 again.

Actions #39

Updated by okurz 30 days ago

  • Status changed from Blocked to Workable

gpuliti wrote in #note-15:

  1. create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
  2. disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load

I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw

#   Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
#   at t/full-stack.t line 119.
# free workers: []

#   Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
#   at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago

which IIUC will not be fixed by #176319

Actions #40

Updated by livdywan 29 days ago

okurz wrote in #note-39:

gpuliti wrote in #note-15:

  1. create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
  2. disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load

I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw

#   Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
#   at t/full-stack.t line 119.
# free workers: []

#   Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
#   at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago

which IIUC will not be fixed by #176319

Right. Both are related to load, but that wouldn't be affected by a more reliable shutdown.

Actions #41

Updated by okurz 29 days ago

  • Priority changed from Normal to Urgent
Actions #42

Updated by gpuliti 29 days ago

  • Status changed from Workable to In Progress
Actions #43

Updated by livdywan 29 days ago

  • Description updated (diff)
Actions #45

Updated by gpuliti 29 days ago

  • Subtask #178249 added
Actions #46

Updated by gpuliti 29 days ago

  • Subtask deleted (#178249)
Actions #47

Updated by gpuliti 29 days ago

  • Follows action #178249: load detection and job flagging under high load conditions in openQA job execution size:S added
Actions #48

Updated by gpuliti 29 days ago

Created new ticket #178249 for load detection.

Actions #49

Updated by tinita 28 days ago

https://github.com/os-autoinst/openQA/pull/6248 ci: Reduce timeout for wait_for_result_panel (merged)

Actions #50

Updated by openqa_review 28 days ago

  • Due date set to 2025-03-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #51

Updated by gpuliti 28 days ago

  • Description updated (diff)
Actions #52

Updated by gpuliti 27 days ago

  • Status changed from In Progress to Feedback

okurz wrote in #note-39:

gpuliti wrote in #note-15:

  1. create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
  2. disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load

I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw

#   Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
#   at t/full-stack.t line 119.
# free workers: []

#   Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
#   at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago

which IIUC will not be fixed by #176319

I've thought about it for quite some time by now and I don't understand what to do for the last point. Is it something that can be done by changing the fullstack test?

Actions #53

Updated by livdywan 26 days ago

gpuliti wrote in #note-52:

okurz wrote in #note-39:

gpuliti wrote in #note-15:

  1. create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
  2. disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2. was done under even very high load

I think both those points still need to be done, right? From today
in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16031/workflows/d37c7fa9-d434-45b2-9646-ac6e09a1a952/jobs/154391 I saw

#   Failed test 'no worker with class qemu_i386 showed up after 18.013961 seconds'
#   at t/full-stack.t line 119.
# free workers: []

#   Failed test 'Unable to assign 1 jobs after 18.013961 seconds'
#   at t/full-stack.t line 142.
# full result panel contents:
# State: scheduled, created 28 minutes ago

which IIUC will not be fixed by #176319

I've thought about it for quite some time by now and I don't understand what to do for the last point. Is it something that can be done by changing the fullstack test?

You could add CRITICAL_LOAD_AVG_THRESHOLD to the worker settings used by the (developer mode) fullstack test.

Apparently it's not on open.qa - I'll prepare a PR for that

Actions #54

Updated by livdywan 25 days ago

You could add CRITICAL_LOAD_AVG_THRESHOLD to the worker settings used by the (developer mode) fullstack test.

Apparently it's not on open.qa - I'll prepare a PR for that

https://github.com/os-autoinst/openQA/pull/6270
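
For reference, the setting would look roughly like this in the workers.ini used by the full-stack test; the section and the concrete value here are assumptions for illustration, the actual change is in the PR above:

[global]
# refuse new jobs while the load average is above this value; lower than the
# threshold of 40 seen in the logs above, so the test fails fast instead of mistyping
CRITICAL_LOAD_AVG_THRESHOLD = 10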

Actions #55

Updated by gpuliti 22 days ago

  • Status changed from Feedback to In Progress
Actions #56

Updated by gpuliti 19 days ago

  • Status changed from In Progress to Blocked

PR merged, waiting for #176319.

Actions #57

Updated by okurz 19 days ago

  • Status changed from Blocked to Workable

#176319 resolved. All included with https://github.com/os-autoinst/os-autoinst/pull/2676 which was merged. This will be used automatically within openQA soon, I guess. Then we can verify that the openQA test works stably.

Actions #58

Updated by gpuliti 19 days ago · Edited

  • Status changed from Workable to Feedback

@okurz wrote in #note-13:

  1. as urgency mitigation apply retry, e.g. at the level of the Makefile
  2. prevent VNC typing of "poweroff", use qemu poweroff ACPI or virtio_terminal or just force shutdown
  3. create ticket about doing load detection while job is running and failing or "flagging" jobs where load was high and result might be tainted
  4. disable the load threshold job waiting in the openQA full stack test as that test should really be able to run after point 2 was done under even very high load

Since #2676 has been closed by @dheidler, I've opened a rollback PR, #6293, and am preparing to solve all the points in the list above.

Actions #59

Updated by gpuliti 14 days ago

  • Status changed from Feedback to Blocked

See PR #6293 and the CircleCI pipeline. I've run more than 50 runs of the fullstack tests in CircleCI and ~45% of them resulted in failures. I've reopened #176319 and blocked this ticket once again.

Actions #60

Updated by livdywan 14 days ago

  • Due date deleted (2025-03-19)
Actions #62

Updated by okurz about 10 hours ago

The CI job already failed this morning in https://app.circleci.com/pipelines/github/os-autoinst/openQA/16470/workflows/e3abc18d-d02a-4123-b96b-14ff526d86e8/jobs/159319 showing that the latest version of os-autoinst wasn't deployed. I wonder why you haven't seen that.

Actions #63

Updated by gpuliti about 9 hours ago

  • Status changed from In Progress to Feedback

@okurz I can't notice everything :)

For the time being:

success: 17, failure: 8, total: 25
failure percentage: 32%

Moving to feedback, waiting for more runs from CircleCI job 159382.
