Project

General

Profile

action #113282

Updated by okurz over 1 year ago

## Observation 
 Especially on o3/aarch64 many jobs incomplete with `backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 183.`. It looks like the first occurrence the problem `2357381 | 2022-05-19 20:09:39 | incomplete | backend died: unexpected end of data …` and since then the log of incompletes on the worker openqa-aarch64 is significantly dominated by this error. 

 *Likely* the culprit is https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 (and by extension https://github.com/os-autoinst/os-autoinst/commit/93ff454deae61e573a9cbf88f172304002fb83a4). In my tests/investigation with svirt jobs this change was an overall improvement. However, I can imagine that in certain cases it would be better to rather block longer on reads instead of giving up and possibly not being able to recover. I suppose the timeouts should be handled more sensibly. We should create a separate ticket for that problem. 

 ## Acceptance criteria 
 * **AC1:** Significant less jobs fail with reason "backend died: unexpected end of data" on o3 aarch64 
 * **AC2:** Still a low job fail ratio with reason "backend died: unexpected end of data" on o3+osd workers 

 ## Suggestion 
 * Attempt to re-connect instead of just stopping the backend process. 
 * As a workaround, increase `VNC_TIMEOUT_LOCAL` and `VNC_TIMEOUT_REMOTE` to a very high value on affected workers (and also set `VNC_CONNECT_TIMEOUT_LOCAL`/`VNC_CONNECT_TIMEOUT_REMOTE` explicitly so these stay the same). This way the old behavior (very long timeout on reads on the VNC socket) is restored. (If it doesn't help then https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 was not the culprit after all). 
 * Maybe it makes sense to increase the default timeout for `VNC_TIMEOUT_LOCAL` because 10 seconds might not be much on a busy worker. (I suppose `VNC_CONNECT_TIMEOUT_LOCAL` should be kept to be 10 seconds by default.) 

 ## Further details 
 * It is really just that exact error, so the line number is always 183 which corresponds to reading the VNC protocol version/handshake (`$socket->read(my $protocol_version, 12) || die 'unexpected end of data';`). 
 * Examples: https://openqa.opensuse.org/tests/2357381/logfile?filename=autoinst-log.txt, https://openqa.opensuse.org/tests/2451699/logfile?filename=autoinst-log.txt, https://openqa.opensuse.org/tests/2451712/logfile?filename=autoinst-log.txt 
     * In none of the cases a re-connect was attempted. Maybe that would have helped. 
     * See `select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqa-aarch64' and result = 'incomplete' and t_finished >= '2022-05-01T00:00:00' order by t_finished asc;` for all incompletes on aarch64. 
     * The problem is much less apparent on other workers, e.g. the same query for openqaworker1 shows only a few affected jobs. Likely it happens more often on slow workers. 
 * The original problem (the change https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 actually helps with) is https://progress.opensuse.org/issues/111004. 

 ## Out Suggestion 
 * Attempt to re-connect instead of scope just stopping the backend process. 
 Implementing * As a workaround, increase `VNC_TIMEOUT_LOCAL` and `VNC_TIMEOUT_REMOTE` to a very high value on affected workers (and also set `VNC_CONNECT_TIMEOUT_LOCAL`/`VNC_CONNECT_TIMEOUT_REMOTE` explicitly so these stay the same). This way the old behavior (very long timeout on reads on the VNC reconnection. But if socket) is restored. (If it doesn't help then https://github.com/os-autoinst/os-autoinst/commit/d1adda78adc34c5ac02b5040a2bc0e97eaa83827 was not the culprit after this ticket we run into problems again *then* we all). 
 * Maybe it makes sense to increase the default timeout for `VNC_TIMEOUT_LOCAL` because 10 seconds might not be much on a busy worker. (I suppose `VNC_CONNECT_TIMEOUT_LOCAL` should consider this. be kept to be 10 seconds by default.)

Back