action #113282: Many incompletes due to VNC error "backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.", especially on o3/aarch64 size:M - openQA Project - openSUSE Project Management Tool

Actions

Copy link

action #113282

closed

Many incompletes due to VNC error "backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.", especially on o3/aarch64 size:M

Added by mkittler over 2 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2022-07-05

Due date:

2022-07-20

% Done:

Estimated time:

Description

Observation¶

Especially on o3/aarch64 many jobs incomplete with backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.. It looks like the first occurrence the problem 2357381 | 2022-05-19 20:09:39 | incomplete | backend died: unexpected end of data … and since then the log of incompletes on the worker openqa-aarch64 is significantly dominated by this error.

Likely the culprit is https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 (and by extension https://github.com/os-autoinst/os-autoinst/commit/93ff454deae61e573a9cbf88f172304002fb83a4). In my tests/investigation with svirt jobs this change was an overall improvement. However, I can imagine that in certain cases it would be better to rather block longer on reads instead of giving up and possibly not being able to recover. I suppose the timeouts should be handled more sensibly. We should create a separate ticket for that problem.

Further details¶

It is really just that exact error, so the line number is always 183 which corresponds to reading the VNC protocol version/handshake ($socket->read(my $protocol_version, 12) || die 'unexpected end of data';).
Examples: https://openqa.opensuse.org/tests/2357381/logfile?filename=autoinst-log.txt, https://openqa.opensuse.org/tests/2451699/logfile?filename=autoinst-log.txt, https://openqa.opensuse.org/tests/2451712/logfile?filename=autoinst-log.txt
- In none of the cases a re-connect was attempted. Maybe that would have helped.
- See select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqa-aarch64' and result = 'incomplete' and t_finished >= '2022-05-01T00:00:00' order by t_finished asc; for all incompletes on aarch64.
- The problem is much less apparent on other workers, e.g. the same query for openqaworker1 shows only a few affected jobs. Likely it happens more often on slow workers.
The original problem (the change https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 actually helps with) is https://progress.opensuse.org/issues/111004.

Suggestion¶

Attempt to re-connect instead of just stopping the backend process.
As a workaround, increase VNC_TIMEOUT_LOCAL and VNC_TIMEOUT_REMOTE to a very high value on affected workers (and also set VNC_CONNECT_TIMEOUT_LOCAL/VNC_CONNECT_TIMEOUT_REMOTE explicitly so these stay the same). This way the old behavior (very long timeout on reads on the VNC socket) is restored. (If it doesn't help then https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 was not the culprit after all).
Maybe it makes sense to increase the default timeout for VNC_TIMEOUT_LOCAL because 10 seconds might not be much on a busy worker. (I suppose VNC_CONNECT_TIMEOUT_LOCAL should be kept to be 10 seconds by default.)

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by okurz over 2 years ago

Related to action #111004: Timeout of test API functions not enforced if backend gets stuck, e.g. on the VNC socket size:M added

Actions

Copy link

Updated by okurz over 2 years ago

Category set to Regressions/Crashes
Target version set to Ready

Actions

Copy link

Updated by mkittler over 2 years ago

I've been editing /etc/workers.ini on aarch64 (see second suggestion) and will have a look whether it makes a difference.

Actions

Copy link

Updated by mkittler over 2 years ago

Status changed from New to In Progress
Assignee set to mkittler

Actions

Copy link

Updated by openqa_review over 2 years ago

Due date set to 2022-07-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

Updated by mkittler over 2 years ago

Status changed from In Progress to Feedback

So far it looks better, the last incomplete is 2451865 | 2022-07-05 14:36:22 | incomplete | backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.. I'll have a look again in the next days. If it helps I'll adjust the defaults on os-autoinst level because it would mean that 10 seconds is not a good default timeouts for reads on the local VNC connections.

Actions

Copy link

Updated by ggardet_arm over 2 years ago

mkittler wrote:

So far it looks better, the last incomplete is 2451865 | 2022-07-05 14:36:22 | incomplete | backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.. I'll have a look again in the next days. If it helps I'll adjust the defaults on os-autoinst level because it would mean that 10 seconds is not a good default timeouts for reads on the local VNC connections.

Let's wait to have a new Tumbleweed snapshot to have a real load on workers.

Actions

Copy link

Updated by okurz over 2 years ago

Subject changed from Many incompletes due to VNC error "backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.", especially on o3/aarch64 to Many incompletes due to VNC error "backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.", especially on o3/aarch64 size:M
Description updated (diff)

Actions

Copy link

Updated by mkittler over 2 years ago

Description updated (diff)

Yes, let's wait until the next snapshot. At least so far it looks still good.

Actions

Copy link

#10

Updated by mkittler over 2 years ago

We've got no further VNC errors:

openqa=# select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqa-aarch64' and result = 'incomplete' and reason like '%VNC%' and t_finished >= '2022-07-06T00:00:00' order by t_finished asc;
 id | t_finished | result | reason 
----+------------+--------+--------
(0 Zeilen)

There were 582 jobs running on the worker:

openqa=# select count(id) from jobs where (select host from workers where id = assigned_worker_id) = 'openqa-aarch64' and t_finished >= '2022-07-06T00:00:00';
 count 
-------
   582
(1 Zeile)

I'll check again next week but I've already prepared a PR to change the defaults: https://github.com/os-autoinst/os-autoinst/pull/2112

Actions

Copy link

#11

Updated by mkittler over 2 years ago

Status changed from Feedback to Resolved

The query still gives no results (now after 818 jobs were running). So I guess we're good again. The PR has been merged and deployed as well. So I removed the custom config on aarch64 and I think we can consider this resolved.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA » openQA Project

Tags

Custom queries

action #113282

Many incompletes due to VNC error "backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.", especially on o3/aarch64 size:M

Observation¶

Further details¶

Suggestion¶

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago

Updated by openqa_review over 2 years ago

Updated by mkittler over 2 years ago

Updated by ggardet_arm over 2 years ago

Updated by okurz over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago