Project

General

Profile

action #113282

Updated by mkittler almost 2 years ago

## Observation 
 Especially on o3/aarch64 many jobs incomplete with `backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.`. It looks like the first occurrence the problem `2357381 | 2022-05-19 20:09:39 | incomplete | backend died: unexpected end of data …` and since then the log of incompletes on the worker openqa-aarch64 is significantly dominated by this error. 

 *Likely* the culprit is https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 (and by extension https://github.com/os-autoinst/os-autoinst/commit/93ff454deae61e573a9cbf88f172304002fb83a4). In my tests/investigation with svirt jobs this change was an overall improvement. However, I can imagine that in certain cases it would be better to rather block longer on reads instead of giving up and possibly not being able to recover. I suppose the timeouts should be handled more sensibly. We should create a separate ticket for that problem. 

 ## Acceptance criteria 
 * **AC1:** Significant less jobs fail with reason "backend died: unexpected end of data" on o3 aarch64 
 * **AC2:** Still a low job fail ratio with reason "backend died: unexpected end of data" on o3+osd workers 

 ## Suggestion 
 * Attempt to re-connect instead of just stopping the backend process. 
 * As a workaround, increase `VNC_TIMEOUT_LOCAL` and `VNC_TIMEOUT_REMOTE` to a very high value on affected workers (and also set `VNC_CONNECT_TIMEOUT_LOCAL`/`VNC_CONNECT_TIMEOUT_REMOTE` explicitly so these stay the same). This way the old behavior (very long timeout on reads on the VNC socket) is restored. (If it doesn't help then https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 was not the culprit after all). 
 * Maybe it makes sense to increase the default timeout for `VNC_TIMEOUT_LOCAL` because 10 seconds might not be much on a busy worker. (I suppose `VNC_CONNECT_TIMEOUT_LOCAL` should be kept to be 10 seconds by default.) 

 ## Further details 
 * It is really just that exact error, so the line number is always 183 which corresponds to reading the VNC protocol version/handshake (`$socket->read(my $protocol_version, 12) || die 'unexpected end of data';`). 
 * Examples: https://openqa.opensuse.org/tests/2357381/logfile?filename=autoinst-log.txt, https://openqa.opensuse.org/tests/2451699/logfile?filename=autoinst-log.txt, https://openqa.opensuse.org/tests/2451712/logfile?filename=autoinst-log.txt 
     * In none of the cases a re-connect was attempted. Maybe that would have helped. 
     * See `select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqa-aarch64' and result = 'incomplete' and t_finished >= '2022-05-01T00:00:00' order by t_finished asc;` for all incompletes on aarch64. 
     * The problem is much less apparent on other workers, e.g. the same query for openqaworker1 shows only a few affected jobs. Likely it happens more often on slow workers. 
 * The original problem (the change https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 actually helps with) is https://progress.opensuse.org/issues/111004. 

 ## Suggestion 
 * Attempt to re-connect instead Out of just stopping the backend process. scope 
 * As Implementing a workaround, increase `VNC_TIMEOUT_LOCAL` and `VNC_TIMEOUT_REMOTE` to a very high value on affected workers (and also set `VNC_CONNECT_TIMEOUT_LOCAL`/`VNC_CONNECT_TIMEOUT_REMOTE` explicitly so these stay the same). This way the old behavior (very long timeout on reads on the VNC socket) is restored. (If it doesn't help then https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 was not the culprit reconnection. But if after all). 
 * Maybe it makes sense to increase the default timeout for `VNC_TIMEOUT_LOCAL` because 10 seconds might not be much on a busy worker. (I suppose `VNC_CONNECT_TIMEOUT_LOCAL` this ticket we run into problems again *then* we should be kept to be 10 seconds by default.) consider this.

Back