action #40793
closedPrevent isotovideo <=> command server IPC to occasionally lose messages
Description
Symptom analysis¶
33-developer-mode.t
is failing often on Travis with the following error message:
not ok 2 - web socket connection closed prematurely, was waiting for resume
# Failed test 'web socket connection closed prematurely, was waiting for resume'
# at /opt/testing_area/openqa/t/lib/OpenQA/Test/FullstackUtils.pm line 128.
On the developer console the following lines are indeed missing in the error case (compared to a good run and also what I would expect):
{"data":{"json_cmd_token":"GxmYxieq","resume_test_execution":null},"type":"info","what":"cmdsrvmsg"}
{"data":{"reload_needles":{"cmd":"backend_reload_needles","json_cmd_token":"mzZzjVLv"}},"type":"info","what":"cmdsrvmsg"}
{"data":{"check_screen":"on_prompt"},"type":"info","what":"cmdsrvmsg"}
In the autoinst-log.txt
- even in the error case - we get 3 times isotovideo: test execution will be resumed
and the test actually resumes. So the client definitely sends the message to resume correctly and the backend/isotovideo reacts accordingly. Only propagating this information back to the command server seems to fail occasionally. Hence there are also the following lines in the autoinst-log.txt
missing in the error case:
[2018-09-10T15:04:59.0788 CEST] [debug] BYTES {"resume_test_execution":null,"json_cmd_token":"IetuXGbf"}
[2018-09-10T15:04:59.0789 CEST] [debug] BYTES {"reload_needles":{"cmd":"backend_reload_needles","json_cmd_token":"qyQbPVBm"},"json_cmd_token":"BCZHrjDC"}
Those lines should be printed by the command server after propagating the commands to all clients.
Deduction¶
It seems that the IPC between isotovideo and the command server doesn't work reliably.
Tasks¶
- Improve log messages and error handling (eg.
BYTES ...
is not a meaningful context for the JSON document) - Find the conditions for "occasionally", try to reproduce locally (if possible)
- Fix the IPC