action #16320
closed
Random timeouts while waiting for serial output when using the virtio backend
Description
Observation
Tests time out while waiting for output from an LTP test: https://openqa.suse.de/tests/743383.
It appears that the command text is sent to the SUT, but no response is received. The serial log [1] for the above test shows that the last test ran and returned a result. However, nothing is read by the virtio console backend.
In this test: https://openqa.opensuse.org/tests/342884 [2], one call to wait_serial fails, the next succeeds, and then it fails again. The calls that pass do not use regular expressions to do the matching.
As a rough estimate this bug occurs in 1%-5% of tests.
Problem
- H1, QEMU is writing bytes to the log, but not the socket.
- H2, The virtio backend function read_until is not reading bytes from the socket correctly.
- H3, One or more of the read buffers in read_until are being dropped.
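To make H2 and H3 concrete, here is a minimal Python sketch of a read-until loop. It is not the os-autoinst implementation (which is written in Perl); the function signature, the socket pair and the marker text are all assumptions made for illustration. The point it shows is that every chunk read from the socket is appended to one buffer, so a match that straddles two reads is still found.

```python
import re
import select
import socket
import time


def read_until(sock, pattern, timeout=5.0, chunk_size=4096):
    """Read from sock until the bytes regex `pattern` matches.

    Hypothetical helper, not the real backend function. Returns
    (matched, text). Every chunk is appended to a single buffer, so
    no read buffer is dropped before the pattern is checked (H3).
    """
    regex = re.compile(pattern)
    buf = b""
    deadline = time.monotonic() + timeout
    while True:
        match = regex.search(buf)
        if match:
            return True, buf[: match.end()].decode(errors="replace")
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False, buf.decode(errors="replace")
        # Wait for data, but never longer than the time left.
        readable, _, _ = select.select([sock], [], [], remaining)
        if not readable:
            continue  # the deadline check above ends the loop
        chunk = sock.recv(chunk_size)
        if not chunk:  # peer closed the connection
            return False, buf.decode(errors="replace")
        buf += chunk


if __name__ == "__main__":
    # A socket pair stands in for the QEMU virtio console socket.
    sut, backend = socket.socketpair()
    sut.sendall(b"test result: mar")
    sut.sendall(b"ker-123\n")
    print(read_until(backend, rb"marker-\d+\n", chunk_size=4))
```

A small chunk_size forces the marker to be split across several reads, which is the situation where a dropped buffer would turn into a timeout.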
Suggestions
- A0, Inspect more test failures.
- A1, Run the virtio terminal unit tests repeatedly.
- A2, Modify the virtio test module to perform a stress test.
- A3, Investigate how QEMU passes the data.
I am currently waiting for a crash dump of the SUT to be attempted after a freeze.
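As an illustration of what a stress test along the lines of A2 could look like, the following Python sketch sends many commands to a fake SUT and counts how often the expected marker fails to arrive. Everything here is hypothetical: fake_sut, wait_for and stress are made-up names, and a local socket pair replaces the real virtio console, so it exercises only the reading logic and not QEMU.

```python
import random
import socket
import threading


def fake_sut(sock, rng):
    """Stand-in for the SUT: answer each line with '<line>-done',
    written in random-sized pieces to vary the read boundaries."""
    pending = b""
    while True:
        data = sock.recv(4096)
        if not data:  # host side closed
            return
        pending += data
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            reply = line + b"-done\n"
            i = 0
            while i < len(reply):
                step = rng.randint(1, 4)
                sock.sendall(reply[i : i + step])
                i += step


def wait_for(sock, marker, timeout):
    """Accumulate reads until marker is seen. The timeout applies
    to each read, not to the whole wait, to keep the sketch short."""
    buf = b""
    sock.settimeout(timeout)
    try:
        while marker not in buf:
            chunk = sock.recv(4096)
            if not chunk:
                return False
            buf += chunk
    except socket.timeout:
        return False
    return True


def stress(iterations=1000, seed=0, timeout=2.0):
    """Run the command/response round trip repeatedly and return
    the number of iterations that timed out."""
    host, sut = socket.socketpair()
    worker = threading.Thread(
        target=fake_sut, args=(sut, random.Random(seed)), daemon=True
    )
    worker.start()
    failures = 0
    for i in range(iterations):
        cmd = b"cmd-%d" % i
        host.sendall(cmd + b"\n")
        if not wait_for(host, cmd + b"-done\n", timeout):
            failures += 1
    host.close()  # lets fake_sut see EOF and exit
    worker.join(timeout=1)
    sut.close()
    return failures


if __name__ == "__main__":
    print("failures:", stress())
```

A run that reports zero failures only says the reader copes with fragmented writes; it does not rule out H1, which needs the real QEMU socket.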
Workaround
- W0, Retrigger the job manually.
- W1, Retrigger the job automatically after a timeout.
[1] The serial log is written by QEMU.
[2] There is no virtio serial log for this test, possibly O3 needs updating.
Updated by szarate almost 8 years ago
- Related to action #12350: [tools]version of os-autoinst on malbec+overdrive2 should be same as other workers (using salt) (was: looks like old version) added
Updated by szarate almost 8 years ago
Linking due to the mention of the need to upgrade
Updated by rpalethorpe almost 8 years ago
So far A1 and A2 have led me to find one bug, which only affects script_output but may be responsible for some of the failures. The pull request is here: https://github.com/os-autoinst/os-autoinst/pull/710
Despite calling script_run more than 100k times, no other failures have been observed locally. At a guess, my system is either not under the same kind of load or the problem is fixed on Tumbleweed. All the new failures on the main instance seem to be related to the script_output bug.
Thanks to whoever updated the openQA web UI; the virtio serial log file is now available on O3.
Updated by rpalethorpe almost 8 years ago
- Status changed from New to In Progress
Updated by rpalethorpe almost 8 years ago
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2440
I found/rediscovered another race condition and have submitted a fix to the test suite. This could also be fixed in os-autoinst by retaining the ring buffer between calls to wait_serial, but I like dropping trailing data to make the result snippets more concise. So for now I think just fixing it in the test will do.
Hopefully this is the last bug causing this issue.
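The trade-off between retaining the buffer and dropping trailing data can be shown with a small Python model. This is a sketch under assumptions, not the os-autoinst code: the SerialConsole class, its retain_tail flag and the pre-recorded chunks are invented for illustration, and running out of chunks stands in for a timeout.

```python
class SerialConsole:
    """Toy model of a console read through repeated wait_serial calls."""

    def __init__(self, chunks, retain_tail):
        self._chunks = list(chunks)  # pre-recorded reads (hypothetical)
        self._retain_tail = retain_tail
        self._carry = ""  # data left over from the previous call

    def wait_serial(self, needle):
        """Return the text up to and including needle, or None when
        the input runs out (standing in for a timeout)."""
        buf = self._carry
        self._carry = ""
        while True:
            pos = buf.find(needle)
            if pos >= 0:
                end = pos + len(needle)
                if self._retain_tail:
                    # Keep what arrived after the match for the next call.
                    self._carry = buf[end:]
                return buf[:end]
            if not self._chunks:
                return None
            buf += self._chunks.pop(0)


if __name__ == "__main__":
    # The start of the second result arrives in the same read as the first.
    reads = ["first-OK\nsecond", "-OK\n"]
    for retain in (False, True):
        console = SerialConsole(reads, retain_tail=retain)
        print(retain, console.wait_serial("first-OK"),
              repr(console.wait_serial("second-OK")))
```

When the tail is dropped, the second call never sees the beginning of its marker and fails, even though the SUT printed it. Retaining the tail avoids that at the cost of less concise result snippets; the alternative is to make the test wait until the output it is about to discard has fully arrived.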
Updated by rpalethorpe over 7 years ago
After https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2833 and https://github.com/os-autoinst/os-autoinst/pull/772, this no longer appears to happen to the LTP test runner. However, it still happens on very rare occasions with script_run, so we will have to wait for the prompt to appear there as well.
Updated by rpalethorpe over 7 years ago
So there is now a PR for script_run: https://github.com/os-autoinst/os-autoinst/pull/797
script_output may also need updating, but it is currently in the wrong place (testapi.pm instead of distribution.pm), so I will save that for later.
Updated by rpalethorpe over 7 years ago
- Status changed from In Progress to Resolved