action #115580
closed
Reason: abandoned: associated worker openqaworker3:13 re-connected but abandoned the job size:M
0%
Description
Observation
For details, please check the following job:
https://openqa.suse.de/tests/9363998#
After rerunning the job, the issue no longer occurs.
Suggestions
- Check if there are any logs on the worker, e.g. maybe systemd killed the service because it took too long
- The same job also sometimes finishes with reason: timeout exceeded
- Maybe the test is often taking too long so it can't finish in time
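The first suggestion could be approached roughly as follows. This is a minimal sketch, assuming the worker slot runs as a systemd unit named `openqa-worker@<slot>` (an assumption; the unit name may differ on a given host) and that the journal still covers the relevant time window. The helper names `parse_abandoned_reason` and `journalctl_command` are hypothetical. It extracts host and slot from the reason string and builds a `journalctl` call to run on that host:

```python
import re

# Matches the reason text seen in this ticket; other wordings are not handled.
REASON_RE = re.compile(r"associated worker (?P<host>[\w.-]+):(?P<slot>\d+) re-connected")


def parse_abandoned_reason(reason):
    """Return (host, slot) from an 'abandoned' reason string, or None."""
    match = REASON_RE.search(reason)
    if match is None:
        return None
    return match.group("host"), int(match.group("slot"))


def journalctl_command(slot, since, until):
    """Build a journalctl call for one worker slot.

    Assumption: the slot runs as systemd unit openqa-worker@<slot>.
    """
    return [
        "journalctl", "--no-pager",
        "-u", f"openqa-worker@{slot}",
        "--since", since,
        "--until", until,
    ]


reason = "abandoned: associated worker openqaworker3:13 re-connected but abandoned the job"
host, slot = parse_abandoned_reason(reason)
# START and END are placeholders; substitute the job's actual time window.
command = journalctl_command(slot, "START", "END")
print(host, " ".join(command))
```

The command is only built, not executed, so the output can be copied and run on the worker host itself.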
Updated by coolgw over 2 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
Updated by livdywan over 2 years ago
Was this job missing the logs from the start? I can only see the iso and the qcow2. They might've been deleted because you didn't add the ticket earlier, which would have marked the job as important (openQA keeps the assets of important jobs around longer).
I can find exactly one occurrence; other jobs seem to finish with softfailed or timeout, and I assume the former is what's considered good here. It would be helpful to confirm whether this is reproducible, and whether it happens on workers other than openqaworker3.
#96710 used to be an issue that was causing jobs to fail with the same reason, and might be worth considering here, although it's quite old at this point. We could still be seeing a new and completely unrelated issue.
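To check whether this is reproducible and whether it is limited to openqaworker3, the job history could be tallied by outcome. The following is a minimal sketch over hand-written sample records; the records, the `classify` helper, and the worker names other than openqaworker3 are hypothetical and not taken from real job data:

```python
from collections import Counter


def classify(result, reason):
    """Bucket a job outcome by its reason text, falling back to the result."""
    reason = reason or ""
    if reason.startswith("abandoned"):
        return "abandoned"
    if reason.startswith("timeout exceeded"):
        return "timeout"
    return result


# Hypothetical sample records (result, reason, worker); not real job data.
jobs = [
    ("softfailed", None, "openqaworker5"),
    ("incomplete",
     "abandoned: associated worker openqaworker3:13 re-connected but abandoned the job",
     "openqaworker3"),
    ("timeout_exceeded", "timeout exceeded", "openqaworker8"),
    ("softfailed", None, "openqaworker3"),
]

# Count how often each outcome occurs across the history.
by_outcome = Counter(classify(result, reason) for result, reason, _ in jobs)

# Collect the workers on which the "abandoned" outcome was seen.
abandoned_workers = {
    worker for result, reason, worker in jobs
    if classify(result, reason) == "abandoned"
}

print(by_outcome)
print(abandoned_workers)
```

If the set of workers with abandoned jobs stays at a single host over a longer history, that points towards a machine-specific cause rather than the test itself.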
Updated by coolgw over 2 years ago
The error happened on the weekend, so I didn't add the ticket in time.
I suppose this is a sporadic issue, since it did not happen again after cloning the job.
I will keep an eye on this issue.
Updated by livdywan over 2 years ago
- Subject changed from Reason: abandoned: associated worker openqaworker3:13 re-connected but abandoned the job to Reason: abandoned: associated worker openqaworker3:13 re-connected but abandoned the job size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 2 years ago
- Status changed from Workable to Closed
- Assignee set to mkittler
The error means that the worker did not exit normally. That can have various reasons. It could be a bug in the worker code, but it could also be the physical machine crashing, someone sending a SIGKILL (e.g. when stopping/restarting the systemd service and it ran into the timeout), a kernel panic, and so on. Without logs it is impossible to tell what happened. Unfortunately, it's in the nature of the problem that logs haven't been uploaded. Normally one can just have a look at the journal of the worker. In this case that is impossible, though, because the oldest message is from 07.09. I also haven't found any more recent occurrences when checking Next & Previous jobs.
The same job also sometimes finishes with reason: timeout exceeded
Ok, that's a different kind of error then. Without even a job URL it is impossible to investigate the underlying problem. (I checked out more recent jobs, e.g. https://openqa.suse.de/tests/9492203. However, it looks like the SUT or test code simply gets stuck. I don't think there's something to improve here from the openQA side. It is completely normal that the job eventually ends up exceeding the timeout in that case.)
I don't think we can do anything about it at this point.