action #113282

closed

Many incompletes due to VNC error "backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.", especially on o3/aarch64 size:M

Added by mkittler almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2022-07-05
Due date: 2022-07-20
% Done: 0%
Estimated time:

Description

Observation

Especially on o3/aarch64, many jobs end up incomplete with "backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.". The first occurrence of the problem appears to be 2357381 | 2022-05-19 20:09:39 | incomplete | backend died: unexpected end of data …, and since then the log of incompletes on the worker openqa-aarch64 has been dominated by this error.

The likely culprit is https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 (and by extension https://github.com/os-autoinst/os-autoinst/commit/93ff454deae61e573a9cbf88f172304002fb83a4). In my tests/investigation with svirt jobs this change was an overall improvement. However, I can imagine that in certain cases it would be better to block longer on reads instead of giving up and possibly being unable to recover. I suppose the timeouts should be handled more sensibly. We should create a separate ticket for that problem.

Further details

Suggestion

  • Attempt to re-connect instead of just stopping the backend process.
  • As a workaround, increase VNC_TIMEOUT_LOCAL and VNC_TIMEOUT_REMOTE to a very high value on affected workers (and also set VNC_CONNECT_TIMEOUT_LOCAL/VNC_CONNECT_TIMEOUT_REMOTE explicitly so these stay the same). This way the old behavior (very long timeout on reads on the VNC socket) is restored. (If it doesn't help then https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 was not the culprit after all).
  • Maybe it makes sense to increase the default value of VNC_TIMEOUT_LOCAL because 10 seconds might not be much on a busy worker. (I suppose VNC_CONNECT_TIMEOUT_LOCAL should be kept at 10 seconds by default.)
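
To make the first suggestion more concrete, here is a minimal sketch of "re-connect instead of stopping" in Python. It is not the os-autoinst implementation (which is Perl, in consoles/VNC.pm). All names (read_exact, read_with_reconnect, UnexpectedEndOfData, the connect callable) are hypothetical and only illustrate the idea: a read that times out or hits end of data drops the socket and opens a new connection, and only after several failed attempts is the error passed on.

```python
import socket


class UnexpectedEndOfData(Exception):
    """Raised when the peer closes or stalls before all bytes arrived."""


def read_exact(sock, nbytes, timeout):
    """Read exactly nbytes, waiting at most `timeout` seconds per recv call."""
    sock.settimeout(timeout)
    chunks = []
    remaining = nbytes
    while remaining > 0:
        try:
            chunk = sock.recv(remaining)
        except socket.timeout as exc:
            raise UnexpectedEndOfData("read timed out") from exc
        if not chunk:  # empty bytes means the peer closed the connection
            raise UnexpectedEndOfData("connection closed")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_with_reconnect(connect, nbytes, timeout, max_attempts=3):
    """Read nbytes; on failure close the socket and connect again.

    `connect` is a hypothetical zero-argument callable returning a fresh,
    connected socket. Returns (socket, data) so the caller can keep using
    the connection that finally worked.
    """
    last_error = None
    for _ in range(max_attempts):
        sock = connect()
        try:
            return sock, read_exact(sock, nbytes, timeout)
        except UnexpectedEndOfData as exc:
            sock.close()
            last_error = exc
    raise UnexpectedEndOfData(
        f"giving up after {max_attempts} attempts"
    ) from last_error
```

In this sketch the timeout argument plays a role loosely comparable to VNC_TIMEOUT_LOCAL/VNC_TIMEOUT_REMOTE: a larger value makes a single read block longer, while max_attempts bounds how often a new connection is tried before the error becomes fatal.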

Related issues 1 (0 open, 1 closed)

Related to openQA Project - action #111004: Timeout of test API functions not enforced if backend gets stuck, e.g. on the VNC socket size:M | Resolved | mkittler | 2022-05-12 | 2022-05-28
