Project

General

Profile

action #106912

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #98952: [epic] t/full-stack.t sporadically fails "clickElement: element not interactable" and other errors

Fullstack test can still fail due to `shutdown` module size:M

Added by mkittler 4 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2022-02-16
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

This is is a shortened version of the relevant log:

    # [2022-02-15T12:30:06.683729Z] [debug] ||| starting boot tests/boot.pm
    # [2022-02-15T12:30:06.685985Z] [debug] tests/boot.pm:9 called testapi::assert_screen
    # [2022-02-15T12:30:06.686361Z] [debug] <<< testapi::assert_screen(mustmatch="core", timeout=15, no_wait=1)
    # [2022-02-15T12:30:06.700419Z] [debug] no match: 14.9s, best candidate: boot-core (0.00)
…
    # [2022-02-15T12:30:07.125518Z] [debug] no change: 14.5s
    # [2022-02-15T12:30:07.134823Z] [debug] no match: 14.5s, best candidate: boot-core (0.00)
    # [2022-02-15T12:30:07.331434Z] [debug] >>> testapi::_handle_found_needle: found boot-core, similarity 0.97 @ 1/74
    # [2022-02-15T12:30:07.332261Z] [debug] tests/boot.pm:10 called testapi::send_key
    # [2022-02-15T12:30:07.332397Z] [debug] <<< testapi::send_key(key="ret", wait_screen_change=0, do_wait=0)
    # [2022-02-15T12:30:07.602261Z] [debug] tests/boot.pm:14 called testapi::assert_screen
    # [2022-02-15T12:30:07.602421Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=90)
    # [2022-02-15T12:30:08.241080Z] [debug] no match: 89.3s, best candidate: boot-on_prompt (0.00)
…
    # [2022-02-15T12:30:21.249634Z] [debug] no match: 76.3s, best candidate: boot-on_prompt (0.00)
    # [2022-02-15T12:30:22.370453Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
    # [2022-02-15T12:30:22.374020Z] [debug] ||| finished boot tests (runtime: 16 s)
    # [2022-02-15T12:30:22.375138Z] [debug] ||| starting assert_screen tests/assert_screen.pm
    # [2022-02-15T12:30:22.375777Z] [debug] tests/assert_screen.pm:9 called testapi::assert_screen
    # [2022-02-15T12:30:22.375889Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=60)
    # [2022-02-15T12:30:23.323411Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
    # [2022-02-15T12:30:23.323585Z] [debug] tests/assert_screen.pm:10 called testapi::assert_screen
    # [2022-02-15T12:30:23.323700Z] [debug] <<< testapi::assert_screen(mustmatch="on_prompt", timeout=30, no_wait=1)
    # [2022-02-15T12:30:23.397251Z] [debug] >>> testapi::_handle_found_needle: found boot-on_prompt, similarity 1.00 @ 2/64
    # [2022-02-15T12:30:23.398459Z] [debug] ||| finished assert_screen tests (runtime: 1 s)
    # [2022-02-15T12:30:23.399556Z] [debug] ||| starting shutdown tests/shutdown.pm
    # [2022-02-15T12:30:23.400299Z] [debug] tests/shutdown.pm:8 called testapi::type_string
    # [2022-02-15T12:30:23.400468Z] [debug] <<< testapi::type_string(string="sudo su\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
    # [2022-02-15T12:30:23.671645Z] [debug] tests/shutdown.pm:9 called testapi::type_string
    # [2022-02-15T12:30:23.671846Z] [debug] <<< testapi::type_string(string="poweroff\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
    # [2022-02-15T12:30:23.976613Z] [debug] tests/shutdown.pm:10 called testapi::assert_shutdown
    # [2022-02-15T12:30:23.976768Z] [debug] <<< testapi::check_shutdown(timeout=90)
    # [2022-02-15T12:30:23.977601Z] [debug] QEMU status is not 'shutdown', it is 'running'
    # [2022-02-15T12:30:24.978574Z] [debug] EVENT {"data":{"path":"\/machine\/peripheral-anon\/device[2]\/virtio-backend"},"event":"NIC_RX_FILTER_CHANGED","timestamp":{"microseconds":798568,"seconds":1644928224}}
    # [2022-02-15T12:30:24.978769Z] [debug] QEMU status is not 'shutdown', it is 'running'
…
    # [2022-02-15T12:31:54.110983Z] [debug] QEMU status is not 'shutdown', it is 'running'
    # [2022-02-15T12:31:55.174986Z] [info] ::: basetest::runtest: # Test died: Machine didn't shut down! at tinycore/tests/shutdown.pm line 10.
    #   
    # [2022-02-15T12:31:55.177594Z] [debug] ||| finished shutdown tests (runtime: 92 s)
    # [2022-02-15T12:31:55.177662Z] [debug] ||| post fail hooks runtime: 0 s

The test run was conducted as "job 7" as part of openQA's fullstack test (with caching enabled). I suppose it should also be reproducible in other "fullstack scenarios" where we expect the test to pass. Maybe os-autoinst's own fullstack test is not affected because it runs a few more test modules before shutdown. The change https://github.com/os-autoinst/os-autoinst/commit/a7103b16 was supposed to fix that and it definitely improved the situation. However, it seems in very rare cases¹ the test can still fail. (The fix was definitely used here as one can see in the logs that the assert_screen module is successfully executed.)

¹ I ran openQA's fullstack test about 75 times in CircleCI until I hit an occurrence.

Acceptance criteria

  • AC1: openQA fullstack test runs 100 times without failing

Suggestions

  • Not Ondrej's fault
  • Make it fail on purpose... but we don't know the exact cause, only the symptom
  • Reveal the video from the pool directory after reproducing locally

Out of scope

  • Reveal the video from the pool directory in CI
fullstack-shutdown-again.txt (42.5 KB) fullstack-shutdown-again.txt mkittler, 2022-02-16 11:59
full-stack-shutdown-again-detailed-log.txt (222 KB) full-stack-shutdown-again-detailed-log.txt mkittler, 2022-02-16 11:59
fullstack-shutdown-problem-video.ogv (128 KB) fullstack-shutdown-problem-video.ogv mkittler, 2022-02-22 11:58
mpv-shot0011.jpg (41.6 KB) mpv-shot0011.jpg mkittler, 2022-02-22 12:11
mpv-shot0013.jpg (53.7 KB) mpv-shot0013.jpg mkittler, 2022-02-22 12:11
mpv-shot0015.jpg (35.1 KB) mpv-shot0015.jpg mkittler, 2022-02-22 12:11
mpv-shot0019.jpg (72.4 KB) mpv-shot0019.jpg mkittler, 2022-02-22 12:11
mpv-shot0020.jpg (44.5 KB) mpv-shot0020.jpg mkittler, 2022-02-22 12:11
mpv-shot0021.jpg (21.8 KB) mpv-shot0021.jpg mkittler, 2022-02-22 12:11
mpv-shot0022.jpg (31 KB) mpv-shot0022.jpg mkittler, 2022-02-22 12:11
mpv-shot0023.jpg (31.6 KB) mpv-shot0023.jpg mkittler, 2022-02-22 12:11
mpv-shot0024.jpg (35.1 KB) mpv-shot0024.jpg mkittler, 2022-02-22 12:11
12844
12847
12850
12853
12856
12859
12862
12865
12868

Related issues

Related to openQA Project - action #107002: Expose fullstack test video from pool directory in CI size:MResolved2022-02-17

History

#1 Updated by okurz 4 months ago

  • Priority changed from Normal to High
  • Target version set to Ready

#2 Updated by okurz 4 months ago

  • Tracker changed from coordination to action

#3 Updated by cdywan 4 months ago

  • Subject changed from Fullstack test can still fail due to `shutdown` module to Fullstack test can still fail due to `shutdown` module size:M
  • Description updated (diff)
  • Status changed from New to Workable

#4 Updated by cdywan 4 months ago

  • Blocked by action #107002: Expose fullstack test video from pool directory in CI size:M added

#5 Updated by cdywan 4 months ago

  • Status changed from Workable to Blocked

#6 Updated by mkittler 4 months ago

12844
12847
12850
12853
12856
12859
12862
12865
12868

Thanks to cdywan's work on #107002 we now have a video. It shows that poweroff is typed before the prompt "within" sudo is ready and one can also shortly see some boot messages after reaching the prompt. The typing poweroff actually nevertheless appears within the prompt but I suppose the return key didn't make it and thus nothing happens.

I've been attaching the most important frames of the video (they're ordered despite some numbers being omitted).

So we see:

  1. Boot logging
  2. Prompt with keys for sudo and poweroff already being typed
  3. Prompt is gone again, further boot messages appear (and not it doesn't look like shutdown messages)
  4. Prompt is shown again but it is messed up by further boot messages
  5. Screen is cleared, just a single "_" appears
  6. We're back at a clean prompt
  7. The keystrokes for sudo and poweroff are "re-typed" or "replied"
  8. That's it, apparently the poweroff command is there but never submitted or has no effect

I'd say there are no typing issues on our side. It is also clear why waiting for the prompt doesn't do the trick. The prompt is already shown while the system is booting. Maybe we need to assert that the boot prompt is shown for at least e.g. 5 seconds to be sure the system is really ready. That would be really annoying. I'm afraid we cannot assert for the last boot message because we might not catch it at all. I'm also wondering why 7. happens.

#7 Updated by okurz 4 months ago

  • Status changed from Blocked to Feedback
  • Assignee set to okurz

cdywan please only "Blocked" with assignee so that we know who tracks the blocker. I set you as assignee now. (EDIT: took assignee myself now because I think it's not blocked anymore)

mkittler wrote:

Thanks to cdywan's work on #107002 we now have a video. It shows that poweroff is typed before the prompt "within" sudo is ready and one can also shortly see some boot messages after reaching the prompt. The typing poweroff actually nevertheless appears within the prompt but I suppose the return key didn't make it and thus nothing happens.
[…]
I'd say there are no typing issues on our side. It is also clear why waiting for the prompt doesn't do the trick. The prompt is already shown while the system is booting. Maybe we need to assert that the boot prompt is shown for at least e.g. 5 seconds to be sure the system is really ready. That would be really annoying. I'm afraid we cannot assert for the last boot message because we might not catch it at all.

Couple of different ideas to try

Trying that now myself: https://github.com/os-autoinst/os-autoinst/pull/1963

I'm also wondering why 7. happens.

I think this is normal behaviour. If you type while the next prompt isn't ready characters are echoed onto the (raw) console. But as soon as the prompt appears (and your PS command and stuff is finished processing) then the previously typed text is repeated from the buffer into the prompt line. I would not worry about that.

#8 Updated by okurz 4 months ago

I triggered the openQA full stack test manually both in os-autoinst https://github.com/os-autoinst/os-autoinst/runs/5323123216?check_suite_focus=true as well as openQA https://github.com/os-autoinst/openQA/runs/5323139061?check_suite_focus=true with the "fail if any retry fails". Locally I am running runs=200 count_fail_ratio prove -I. t/99-full-stack.t

Both CI runs failed due to updated dependency packages. So I triggered another dependency PR run from within https://app.circleci.com/pipelines/github/os-autoinst/openQA?branch=master&filter=all selecting a "nightly" workflow job. This triggered https://app.circleci.com/pipelines/github/os-autoinst/openQA/9058/workflows/2e92004c-e279-4c95-a130-6f857b6b3ee5/jobs/85580 which created https://github.com/os-autoinst/openQA/pull/4531

#9 Updated by okurz 4 months ago

  • Blocked by deleted (action #107002: Expose fullstack test video from pool directory in CI size:M)

#10 Updated by okurz 4 months ago

  • Related to action #107002: Expose fullstack test video from pool directory in CI size:M added

#11 Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved

https://github.com/os-autoinst/openQA/runs/5324376037?check_suite_focus=true#step:7:117 failed after 5 successful runs with

Rerun 5 of 15 …
Calling retry hook ./tools/delete-coverdb-folder
Deleting cover_db_fullstack/
DBIx::Class::Storage::DBI::_exec_txn_commit(): DBI Exception: DBD::Pg::db commit failed: PANIC:  could not write to file "pg_wal/xlogtemp.2980": No space left on device
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->txn_do (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
[20:18:19] t/full-stack.t .. 
No subtests run 
[20:18:19]

maybe saving all artifacts won't scale that good :) But the original problem did not appear here.

The very same consistently was reproduced in https://github.com/os-autoinst/os-autoinst/runs/5324381907?check_suite_focus=true#step:8:93 . I don't feel like triggering the above two 10 times more. But locally I have finished 90 runs successfully without any error so I declare this good enough :)https://github.com/os-autoinst/openQA/runs/5324376037?check_suite_focus=true#step:7:117 failed after 5 successful runs with

Rerun 5 of 15 …
Calling retry hook ./tools/delete-coverdb-folder
Deleting cover_db_fullstack/
DBIx::Class::Storage::DBI::_exec_txn_commit(): DBI Exception: DBD::Pg::db commit failed: PANIC:  could not write to file "pg_wal/xlogtemp.2980": No space left on device
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->txn_do (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
[20:18:19] t/full-stack.t .. 
No subtests run 
[20:18:19]

maybe saving all artifacts won't scale that good :) But the original problem did not appear here.

The very same consistently was reproduced in https://github.com/os-autoinst/os-autoinst/runs/5324381907?check_suite_focus=true#step:8:93 . I don't feel like triggering the above two 10 times more. But locally I have finished 130 runs successfully without any error so I declare this good enough :)

Also available in: Atom PDF