action #16320

closed

Random timeouts while waiting for serial output when using the virtio backend

Added by rpalethorpe about 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version: -
Start date: 2017-01-30
Due date:
% Done: 0%
Estimated time:

Description

Observation

Tests time out while waiting for output from an LTP test: https://openqa.suse.de/tests/743383.

It appears that the command text is sent to the SUT, but no response is received. The serial log [1] for the above test shows that the last test ran and returned a result. However, nothing is read by the virtio console backend.

In this test: https://openqa.opensuse.org/tests/342884 [2], one call to wait_serial fails, but then the next succeeds and then it fails again. The calls which pass do not use regular expressions to do the matching.

As a rough estimate this bug occurs in 1%-5% of tests.

Problem

  • H1, QEMU is writing bytes to the log, but not the socket
  • H2, The virtio backend function read_until is not reading bytes from the socket correctly
  • H3, One or more of the read buffers in read_until are being dropped.
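To make H2 and H3 concrete, here is a minimal Python sketch of a read-until loop. It is not the os-autoinst code (which is Perl); the function signature, the `keep_carry` switch, the chunk size and the `TEST-RESULT-` marker are all assumptions made for illustration. It shows that if the bytes from earlier reads are dropped, a marker that happens to be split across two reads is never matched, even though every byte reached the socket.

```python
import re
import socket


def read_until(sock, pattern, keep_carry=True, chunk_size=8):
    """Read from sock until pattern matches or the peer closes.

    Returns the matched bytes, or None if the stream ended without a match.
    With keep_carry=False each chunk is matched on its own, which models
    hypothesis H3 (an earlier read buffer being dropped).
    """
    regex = re.compile(pattern)
    carry = b""
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:  # peer closed: nothing more will arrive
            return None
        window = carry + chunk if keep_carry else chunk
        match = regex.search(window)
        if match:
            return match.group(0)
        carry = window


def demo(keep_carry):
    """Send a hypothetical result marker through a socket pair and read it back."""
    sut, backend = socket.socketpair()
    sut.sendall(b"test output\nTEST-RESULT-0\n")
    sut.close()
    try:
        return read_until(backend, rb"TEST-RESULT-\d+", keep_carry)
    finally:
        backend.close()


if __name__ == "__main__":
    print("buffers kept:   ", demo(True))
    print("buffers dropped:", demo(False))
```

The small chunk size is deliberate: it forces the marker to straddle several reads, which is the situation where a dropped buffer would turn into a timeout.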

Suggestions

  • A0, Inspect more test failures.
  • A1, Run the virtio terminal unit tests repeatedly.
  • A2, Modify the virtio test module to perform a stress test.
  • A3, Investigate how QEMU passes the data.

I am currently waiting for a crash dump of the SUT to be attempted after a freeze.

Workaround

  • W0, Retrigger the job manually.
  • W1, Retrigger the job automatically after a timeout.

[1] The serial log is written by QEMU.
[2] There is no virtio serial log for this test, possibly because O3 needs updating.


Related issues 1 (0 open, 1 closed)

Related to openQA Tests - action #12350: [tools]version of os-autoinst on malbec+overdrive2 should be same as other workers (using salt) (was: looks like old version) - Resolved - RBrownSUSE - 2016-09-09

Actions #1

Updated by szarate about 7 years ago

  • Related to action #12350: [tools]version of os-autoinst on malbec+overdrive2 should be same as other workers (using salt) (was: looks like old version) added
Actions #2

Updated by szarate about 7 years ago

Linking due to the mention of the need to upgrade.

Actions #3

Updated by rpalethorpe about 7 years ago

So far, A1 and A2 have led me to one bug, which only affects script_output but may be responsible for some of the failures. The pull request is here: https://github.com/os-autoinst/os-autoinst/pull/710

Despite calling script_run >100k times no other failures have been observed locally. At a guess my system is either not under the same kind of load or the problem is fixed on Tumbleweed. All the new failures on the main instance seem to be related to the script_output bug.

Thanks to whoever updated the openQA web UI; the virtio serial log file is now available on O3.

Actions #4

Updated by rpalethorpe about 7 years ago

  • Status changed from New to In Progress
Actions #5

Updated by rpalethorpe about 7 years ago

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2440

I found/rediscovered another race condition and have submitted a fix to the test suite. This could also be fixed in os-autoinst by retaining the ring buffer between calls to wait_serial, but I like dropping trailing data to make the result snippets more concise. So for now I think just fixing it in the test will do.
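One plausible reading of the trade-off described above can be modelled in a few lines of Python. This is a toy model, not the os-autoinst implementation: the `SerialBuffer` class, its `retain_tail` flag and the `MARK-n` markers are hypothetical names chosen for the example. If the output for two commands arrives before the first wait_serial call returns, dropping the trailing data after the first match also drops the second marker, so the next call has nothing left to match.

```python
import re


class SerialBuffer:
    """Toy model of a serial reader that may or may not keep trailing data."""

    def __init__(self, retain_tail):
        self.retain_tail = retain_tail
        self.pending = ""  # text received but not yet consumed

    def feed(self, text):
        """Simulate output arriving from the SUT."""
        self.pending += text

    def wait_serial(self, pattern):
        """Return the text up to and including the match, or None (a 'timeout')."""
        match = re.search(pattern, self.pending)
        if match is None:
            return None
        result = self.pending[:match.end()]
        tail = self.pending[match.end():]
        # Dropping the tail keeps result snippets short, but loses early output.
        self.pending = tail if self.retain_tail else ""
        return result


def replay(retain_tail):
    """Both commands finish before the first wait; then wait for each marker."""
    buf = SerialBuffer(retain_tail)
    buf.feed("cmd1 done\nMARK-1\ncmd2 done\nMARK-2\n")
    return buf.wait_serial(r"MARK-1\n"), buf.wait_serial(r"MARK-2\n")


if __name__ == "__main__":
    print("tail retained:", replay(True))
    print("tail dropped: ", replay(False))
```

In this model, fixing the test instead means not letting the second command's output appear until the first marker has been consumed, which keeps the concise snippets.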

Hopefully this is the last bug causing this issue.

Actions #6

Updated by rpalethorpe almost 7 years ago

After https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2833 and https://github.com/os-autoinst/os-autoinst/pull/772, this no longer appears to happen to the LTP test runner. However, it still happens on very rare occasions to script_run, so we will have to wait for the prompt to appear there as well.

Actions #7

Updated by rpalethorpe almost 7 years ago

So there is now a PR for script_run: https://github.com/os-autoinst/os-autoinst/pull/797

script_output may also need updating, but it is currently in the wrong place (testapi.pm instead of distribution.pm), so I will save that for later.

Actions #8

Updated by rpalethorpe almost 7 years ago

  • Status changed from In Progress to Resolved