See #62420#note-17. Jobs can end up incomplete due to test contributor errors, e.g. invalid settings or simple syntax mistakes in test code, or due to unexpected backend crashes, which are more likely relevant to instance admins. We should separate both by the "incomplete reason".
- AC1: Test contributor errors, e.g. syntax mistakes in test code, produce a distinct incomplete reason
- AC2: Unexpected backend crashes still yield "died: …"
- Research how syntax or compilation errors in test code are treated by isotovideo
- If necessary, put the error message not only in autoinst-log.txt but also somewhere accessible to openQA, e.g. in one of the JSON files read by openQA
- In openQA, where we write "died: terminated prematurely, see log output for details", check whether there is an error string about a known test contributor error and output the corresponding reason, else fall back to "died: …"
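The last suggestion could look roughly like the following sketch. It is Python for illustration only (openQA itself is written in Perl). The function name `incomplete_reason`, the marker list `KNOWN_TEST_ERRORS` and the `tests died: ` prefix are assumptions made for this example, not existing openQA code.

```python
from typing import Optional

# Hypothetical markers of test contributor errors; the real strings would
# have to come from whatever isotovideo actually reports.
KNOWN_TEST_ERRORS = ("unable to load main.pm", "syntax error")

# Generic reason used when no known test contributor error is found.
FALLBACK = "died: terminated prematurely, see log output for details"


def incomplete_reason(error_text: Optional[str]) -> str:
    """Return a reason string for an incomplete job."""
    if error_text:
        lowered = error_text.lower()
        if any(marker in lowered for marker in KNOWN_TEST_ERRORS):
            # Known test contributor error: report it with its own prefix.
            return f"tests died: {error_text.strip()}"
    # Anything else keeps the existing generic reason.
    return FALLBACK
```

Keeping the generic fallback as the default means an unrecognized error string never gets attributed to test contributors by mistake.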
#6 Updated by mkittler about 1 year ago
PR for os-autoinst: https://github.com/os-autoinst/os-autoinst/pull/1409
#7 Updated by mkittler about 1 year ago
- Status changed from In Progress to Feedback
With both PRs merged we should now see `tests died: ...` and `backend died: ...`. There might still be just `died: ...` in case something else within os-autoinst failed.
Note that the PR only affects errors when loading `main.pm` and the test modules, so it is mainly about compilation errors. Unhandled exceptions during test execution were already distinguished before: the test result is set to failed and there's a failed step result with the exception message. I don't think it makes sense to duplicate that information into the "reason".
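The distinction between load-time and run-time failures can be illustrated with a small Python analogy. This is not how os-autoinst implements it (that code is Perl); `run_test_module` and the keys of the returned dictionary are hypothetical names chosen for the sketch.

```python
def run_test_module(source: str) -> dict:
    """Run test code given as a string and report how it ended."""
    try:
        # Loading step: a syntax error surfaces here, before anything runs.
        code = compile(source, "<test module>", "exec")
    except SyntaxError as err:
        # Job is incomplete and the reason points at the tests.
        return {"result": "incomplete", "reason": f"tests died: {err.msg}"}
    try:
        exec(code, {})
    except Exception as err:
        # Run-time failure: recorded as a failed step, no reason is set.
        return {"result": "failed", "reason": None, "step_error": str(err)}
    return {"result": "passed", "reason": None}
```

Only the first branch corresponds to what the PR changes; the second branch mirrors the behaviour that already existed.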
So when this works in production I consider both ACs done.
#8 Updated by mkittler about 1 year ago
It seems to work on OSD and o3 (see `select jobs.id, t_started, t_finished, workers.host as worker_host, workers.instance as worker_instance, reason from jobs join workers on assigned_worker_id=workers.id where result = 'incomplete' order by id desc limit 100;`).
The error `tests died: unable to load main.pm, check the log for the cause (e.g. syntax error)` can now be observed on both instances.
On OSD there are several instances of `backend died: No map for 'Ã\u0083' at /usr/lib/os-autoinst/consoles/VNC.pm line 741.`. Maybe we should also look into that issue, but it could also be caused by network issues or problems with the VNC server.
On o3 I've only seen several instances of `backend died: Migrate to file failed, it has been running for more than 240 at /usr/lib/os-autoinst/backend/qemu.pm line 258.`. I've created a PR to include the unit here. Not sure whether this error needs actual fixing.
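To get an overview of how incomplete reasons are distributed, the `reason` column of such a query result can be grouped by its prefix. The following is a minimal sketch, assuming the reasons are already available as a list of strings. The helper names `reason_category` and `tally` are made up for this example, and the sample list only reuses reason strings quoted in this ticket.

```python
from collections import Counter


def reason_category(reason):
    """Return the part before the first ': ', e.g. 'backend died'."""
    prefix, sep, _ = (reason or "").partition(": ")
    # Reasons without a 'prefix: message' shape (or NULL) go to 'other'.
    return prefix if sep else "other"


def tally(reasons):
    """Count reasons per category."""
    return Counter(reason_category(r) for r in reasons)


# Sample data taken from the reason strings mentioned in this ticket.
sample = [
    "tests died: unable to load main.pm, check the log for the cause (e.g. syntax error)",
    "backend died: Migrate to file failed, it has been running for more than 240 "
    "at /usr/lib/os-autoinst/backend/qemu.pm line 258.",
    "died: terminated prematurely, see log output for details",
    None,
]

for category, count in sorted(tally(sample).items()):
    print(f"{category}: {count}")
```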