Project

General

Profile

action #105429

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #98952: [epic] t/full-stack.t sporadically fails "clickElement: element not interactable" and other errors

openQA's fullstack test fails in `shutdown` module

Added by mkittler 4 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2022-01-25
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

See https://github.com/os-autoinst/openQA/pull/4468#issuecomment-1019201379, the error is:

# [2022-01-22T08:07:01.494359Z] [debug] [pid:488] REST-API call: POST http://localhost:9526/api/v1/jobs/1/status
# [2022-01-22T08:07:01.528597Z] [info] [pid:831] Uploading video.ogv
# [2022-01-22T08:07:01.529126Z] [debug] [pid:831] Uploading artefact video.ogv
# [2022-01-22T08:07:01.566264Z] [info] [pid:831] Uploading vars.json
# [2022-01-22T08:07:01.566529Z] [debug] [pid:831] Uploading artefact vars.json
# [2022-01-22T08:07:01.594899Z] [info] [pid:831] Uploading autoinst-log.txt
# [2022-01-22T08:07:01.595178Z] [debug] [pid:831] Uploading artefact autoinst-log.txt
# [2022-01-22T08:07:01.621786Z] [info] [pid:831] Uploading worker-log.txt
# [2022-01-22T08:07:01.622061Z] [debug] [pid:831] Uploading artefact worker-log.txt
Bailout called.  Further testing stopped:  Job 1 produced the wrong results

#   Failed test 'no (unexpected) warnings (via END block)'
#   at /usr/lib/perl5/5.26.1/Test/Builder.pm line 135.
# Got the following unexpected warnings:
#   1: Use of uninitialized value in subroutine entry at t/full-stack.t line 213.

The Use of uninitialized value… is basically because the qcow image hasn't been uploaded. Those are the corresponding failing checks:

    not ok 8 - local upload feature used
    not ok 9 - image of hdd uploaded
    not ok 10 - can stat /tmp/IWUDY9BiMA/full-stack.d/openqa/share/factory/hdd/core-hdd.qcow2
    not ok 11 - exported image has correct permissions (420 -> 0644)

And that is simply because the test failed:

    not ok 2 - Expected result not found
    not ok 3 - job 1 passed
    # job info: {
    #   'assigned_worker_id' => 1,
    #   'blocked_by_id' => undef,
    #   'clone_id' => undef,
    #   'group_id' => undef,
    #   'id' => 1,
    #   'name' => 'tinycore-1-flavor-i386-Build1-core@coolone',
    #   'priority' => 50,
    #   'result' => 'failed',
    #   'settings' => {
    #     'ARCH' => 'i386',
    #     'BUILD' => '1',
    #     'DISTRI' => 'tinycore',
    #     'FLAVOR' => 'flavor',
    #     'INTEGRATION_TESTS' => '1',
    #     'ISO' => 'Core-7.2.iso',
    #     'MACHINE' => 'coolone',
    #     'NAME' => '00000001-tinycore-1-flavor-i386-Build1-core@coolone',
    #     'PAUSE_AT' => 'shutdown',
    #     'PUBLISH_HDD_1' => 'core-hdd.qcow2',
    #     'QEMU' => 'i386',
    #     'QEMU_NO_FDC_SET' => '1',
    #     'QEMU_NO_KVM' => '1',
    #     'QEMU_NO_TABLET' => '1',
    #     'TEST' => 'core',
    #     'UEFI_PFLASH_VARS' => '/usr/share/qemu/ovmf-x86_64.bin',
    #     'VERSION' => '1',
    #     'WORKER_CLASS' => 'qemu_i386'
    #   },
    #   'state' => 'done',
    #   't_finished' => '2022-01-22T08:07:02',
    #   't_started' => '2022-01-22T08:05:09',
    #   'test' => 'core'
    # }
    ok 4 - autoinst log file generated
    ok 5 - worker log file generated
    ok 6 - autoinst log uploaded
    ok 7 - worker log uploaded

It fails in the shutdown module:

[2022-01-22T08:05:26.126716Z] [debug] ||| finished boot tests (runtime: 12 s)
[2022-01-22T08:05:26.127797Z] [debug] ||| starting shutdown tests/shutdown.pm
[2022-01-22T08:05:26.128265Z] [debug] isotovideo: pausing test execution of shutdown because we're supposed to pause at this test module
[2022-01-22T08:05:26.128630Z] [debug] tests/shutdown.pm:8 called testapi::type_string
[2022-01-22T08:05:26.128802Z] [debug] <<< testapi::type_string(string="sudo su\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-22T08:05:26.129110Z] [debug] isotovideo: paused, so not passing backend_type_string to backend
[2022-01-22T08:05:28.330185Z] [debug] isotovideo: test execution will be resumed
[2022-01-22T08:05:28.330596Z] [debug] isotovideo: resuming, continue passing backend_type_string to backend
[2022-01-22T08:05:28.601320Z] [debug] tests/shutdown.pm:9 called testapi::type_string
[2022-01-22T08:05:28.601518Z] [debug] <<< testapi::type_string(string="poweroff\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-22T08:05:28.910444Z] [debug] tests/shutdown.pm:10 called testapi::assert_shutdown
[2022-01-22T08:05:28.910592Z] [debug] <<< testapi::check_shutdown(timeout=90)
[2022-01-22T08:05:28.911472Z] [debug] QEMU status is not 'shutdown', it is 'running'
[2022-01-22T08:05:29.912649Z] [debug] QEMU status is not 'shutdown', it is 'running'
[2022-01-22T08:05:30.913806Z] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[2]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":996204,"seconds":1642838729}}
[2022-01-22T08:05:30.914058Z] [debug] QEMU status is not 'shutdown', it is 'running'
…
[2022-01-22T08:06:58.035515Z] [debug] QEMU status is not 'shutdown', it is 'running'
[2022-01-22T08:06:59.036839Z] [debug] QEMU status is not 'shutdown', it is 'running'
[2022-01-22T08:07:00.097072Z] [info] ::: basetest::runtest: # Test died: Machine didn't shut down! at tinycore/tests/shutdown.pm line 10.

[2022-01-22T08:07:00.099824Z] [debug] ||| finished shutdown tests (runtime: 94 s)
[2022-01-22T08:07:00.099889Z] [debug] ||| post fail hooks runtime: 0 s
[2022-01-22T08:07:00.100157Z] [debug] stopping overall test execution after a fatal test failure

Likely the SUT isn't aware that it should shutdown because one of the type_string commands doesn't work as expected. That might have something to do with the fact that we pause/resume the test execution before the shutdown module but could also be a general typing issue.

Acceptance criteria

  • AC1: Retry for full stack test is removed again

Suggestion

  • Fix #104971 first
  • Have a look at the code (e.g. code for handling pausing/resuming in os-autoinst) for possible problems (e.g. lack of synchronization)
  • Find a way to reproduce the problem in a unit test
  • Workaround the issue by combining type_string commands in the test's shutdown module
  • Remove the retry again on Makefile

Related issues

Related to openQA Tests - action #105506: [sporadic][tools] openQA-in-openQA test sporadically fails in shutdownResolved2022-01-262022-02-18

Related to openQA Project - action #105515: Optionally keep pool dir content for debugging os-autoinst + openQA full stack testsNew2022-01-26

Related to openQA Project - action #104971: [sporadic] os-autoinst t/99-full-stack.t sporadically fails in "Result in testresults/result-reload_needles.json is ok" size:MResolved2022-01-172022-02-09

History

#1 Updated by mkittler 4 months ago

  • Parent task set to #98952

#2 Updated by mkittler 4 months ago

  • Assignee set to mkittler

#3 Updated by okurz 4 months ago

  • Category set to Concrete Bugs
  • Priority changed from Normal to Urgent
  • Target version set to Ready

#5 Updated by mkittler 4 months ago

  • Assignee deleted (mkittler)

That test failure might be related, indeed. I'm currently having already #104971 so I'm unassigning. (One ticket of that kind is enough. If nobody assigns here soon I'll take it back.)

#6 Updated by okurz 4 months ago

  • Related to action #105506: [sporadic][tools] openQA-in-openQA test sporadically fails in shutdown added

#7 Updated by mkittler 4 months ago

Notes from the meeting:

  1. The mentioned job (https://openqa.opensuse.org/tests/2157847#step/shutdown/2) fails likely because it has to switch from the graphical environment to the text console which can take a while. There's no good synchronization in the test (only wait_screen_change) so it isn't completely surprising that the machine might not be ready. However, our tests (the openQA fullstack test) does not have to switch consoles. Hence, it is likely a different issue.
  2. We could try system_powerdown instead of typing poweroff. This would also help us getting fullstack test coverage for that feature.
  3. Note that 2. would potentially still mask a problem in the command processing when the developer mode is used to pause/resume tests. So it might still make sense to follow my suggestions from the ticket description to have a look at it.
  4. One could try to reproduce the issue within os-autoinst's fullstack test (by setting the test variable INTEGRATION_TESTS to avoid being blocked by #104971).

#8 Updated by okurz 4 months ago

mkittler wrote:

Notes from the meeting:

  1. The mentioned job (https://openqa.opensuse.org/tests/2157847#step/shutdown/2) fails likely because it has to switch from the graphical environment to the text console which can take a while. There's no good synchronization in the test (only wait_screen_change) so it isn't completely surprising that the machine might not be ready. However, our tests (the openQA fullstack test) does not have to switch consoles. Hence, it is likely a different issue.
  2. We could try system_powerdown instead of typing poweroff. This would also help us getting fullstack test coverage for that feature.
  3. Note that 2. would potentially still mask a problem in the command processing when the developer mode is used to pause/resume tests. So it might still make sense to follow my suggestions from the ticket description to have a look at it.

I tried with a patch

--- a/t/data/tests/tests/shutdown.pm
+++ b/t/data/tests/tests/shutdown.pm
@@ -5,8 +5,7 @@ use Mojo::Base 'basetest', -signatures;
 use testapi;

 sub run {
-    type_string "sudo su\n";
-    type_string "poweroff\n";
+    power 'off';
     assert_shutdown(get_var('INTEGRATION_TESTS') ? 90 : undef);
 }

but this made the full stack test fail with

# [2022-01-26T14:06:27.097610+01:00] [debug] ||| starting shutdown tests/shutdown.pm
# [2022-01-26T14:06:27.099532+01:00] [debug] tests/shutdown.pm:8 called testapi::power
# [2022-01-26T14:06:27.099805+01:00] [debug] <<< testapi::power(action="off")
# [2022-01-26T14:06:27.101367+01:00] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[2]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":297928,"seconds":1643202242}}
# [2022-01-26T14:06:27.102545+01:00] [debug] tests/shutdown.pm:9 called testapi::assert_shutdown
# [2022-01-26T14:06:27.102821+01:00] [debug] <<< testapi::check_shutdown(timeout=60)
# [2022-01-26T14:06:27.104357+01:00] [debug] EVENT {"data":{"guest":false,"reason":"host-qmp-quit"},"event":"SHUTDOWN","timestamp":{"microseconds":101650,"seconds":1643202387}}
# dmesg: read kernel buffer failed: Operation not permitted
# [2022-01-26T14:06:27.297549+01:00] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
# [2022-01-26T14:06:27.299926+01:00] [debug] sending magic and exit
# Unexpected end of data 0
# [2022-01-26T14:06:27.311086+01:00] [debug] backend process exited: 0
# [2022-01-26T14:06:27.311343+01:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 11622 and exit status: 1
# [2022-01-26T14:06:27.331434+01:00] [debug] stopping command server 11591 because test execution ended
# [2022-01-26T14:06:27.331546+01:00] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:15223/xyHXbClMTX/broadcast
# [2022-01-26T14:06:27.917100+01:00] [debug] commands process exited: 0
# [2022-01-26T14:06:28.018790+01:00] [debug] done with command server
# Use of uninitialized value $serial in split at /home/okurz/local/os-autoinst/os-autoinst/t/../basetest.pm line 705.
# [2022-01-26T14:06:28.019081+01:00] [debug] stopping autotest process 11594
# [2022-01-26T14:06:28.019540+01:00] [debug] autotest received signal TERM, saving results of current test before exiting
# [2022-01-26T14:06:28.029389+01:00] [debug] [autotest] process exited: 1
# [2022-01-26T14:06:28.130022+01:00] [debug] done with autotest process
# [2022-01-26T14:06:28.130247+01:00] [debug] isotovideo failed
# [2022-01-26T14:06:28.131252+01:00] [debug] stopping backend process 11608
# [2022-01-26T14:06:28.131403+01:00] [debug] done with backend process
# 11586: EXIT 1
# '
#     doesn't match '(?^u:\d*: EXIT 0)'

#   Failed test 'Result in testresults/result-shutdown.json is ok'
#   at t/99-full-stack.t line 68.
#          got: 'canceled'
#     expected: 'ok'
# Looks like you failed 2 tests of 25.
t/99-full-stack.t .. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/25 subtests 

Test Summary Report
-------------------
t/99-full-stack.t (Wstat: 512 Tests: 25 Failed: 2)
  Failed tests:  1, 19
  Non-zero exit status: 2
Files=1, Tests=25, 155 wallclock secs ( 0.08 usr  0.02 sys + 42.57 cusr  4.93 csys = 47.60 CPU)
Result: FAIL

I don't understand right now what this means.

  1. One could try to reproduce the issue within os-autoinst's fullstack test (by setting the test variable INTEGRATION_TESTS to avoid being blocked by #104971).

I could so far not reproduce the issue with cd ~/local/os-autoinst/os-autoinst/t/data/tests && runs=100 count_fail_ratio isotovideo -d INTEGRATION_TESTS=1 CASEDIR=.) after 40 runs.

#9 Updated by okurz 4 months ago

  • Related to action #105515: Optionally keep pool dir content for debugging os-autoinst + openQA full stack tests added

#10 Updated by okurz 4 months ago

  • Related to action #104971: [sporadic] os-autoinst t/99-full-stack.t sporadically fails in "Result in testresults/result-reload_needles.json is ok" size:M added

#11 Updated by okurz 4 months ago

  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to mkittler
  • Priority changed from Urgent to Normal

#12 Updated by mkittler 3 months ago

  • Status changed from Blocked to In Progress

#13 Updated by mkittler 3 months ago

power 'off' is invoking QEMU's quit command and thus causing QEMU to terminate completely. Therefore the further queries for the VMs state aren't working anymore. Considering how the QEMU backend is implemented I suppose power 'off' is not generally useful. I tried power 'acpi' instead which is using system_powerdown. Annoyingly the Tinycore system doesn't react to it at all.

#14 Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback

This PR should fix the problem (see commit message for details): https://github.com/os-autoinst/os-autoinst/pull/1945

I'm actually quite sure that this will work so I won't investigate a potential problem with the command handling at this point (which I can still do if it turns out to be still broken).

I'll revert https://github.com/os-autoinst/openQA/pull/4474 when https://github.com/os-autoinst/os-autoinst/pull/1945 has been merged and the os-autoinst version in openQA's CI has been updated. I can also rebase/reopen https://github.com/os-autoinst/openQA/pull/4460 to do some further testing.

Note that I couldn't reproduce the problem locally so far. That's likely because TinyCore is just booting too fast on my system (and even in the CI the cases where it is slow enough seem to be rare).

#15 Updated by mkittler 3 months ago

I conducted over 50 runs in https://github.com/os-autoinst/openQA/pull/4496 (where I forced the new os-autoinst version) and couldn't reproduce the issue anymore. I found one occurrence of another failure (see epic comment). I'll consider the ticket resolved when the new os-autoinst version is updated in openQA's CI.

#16 Updated by okurz 3 months ago

Seems like we have regressions on other backends though, is the VNC implementation used for IPMI backend as well?

#17 Updated by okurz 3 months ago

  • Related to action #106017: [ipmi backend]VNC stalled, no update for 5.38 seconds added

#18 Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

The regression has been handled in the other ticket.

Note that none of the changes are in openQA's CI yet (see https://github.com/os-autoinst/openQA/blob/master/tools/ci/autoinst.sha). There are other problems with the fullstack test (see parent ticket) so I cannot remove the retry as conclusion of this ticket. Hence I'm resolving it now regardless as there's nothing more to do (regarding this particular problem).

#19 Updated by mkittler 3 months ago

  • Related to deleted (action #106017: [ipmi backend]VNC stalled, no update for 5.38 seconds)

#20 Updated by mkittler 3 months ago

By the way, the regression has nothing to do with changes I made for this ticket (https://github.com/os-autoinst/os-autoinst/pull/1945) so I'm removing the link.

Also available in: Atom PDF