action #115580

Reason: abandoned: associated worker openqaworker3:13 re-connected but abandoned the job size:M

Added by coolgw 3 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 2022-08-22
Due date:
% Done: 0%
Estimated time:

Description

Observation

For details, please check the following job:
https://openqa.suse.de/tests/9363998#
After rerunning the job, the issue did not occur again.

Suggestions

  • Check if there's any logs on the worker e.g. maybe systemd killed the service because it took too long
  • The same job also sometimes finishes with reason: timeout exceeded
  • Maybe the test is often taking too long so it can't finish in time
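To judge how often the scenario ends with each reason, the reason strings of its previous runs could be tallied. The following is a minimal sketch, assuming the reason strings have already been collected by some means (for example, copied from the job pages). The helper names `reason_category` and `tally_reasons` are hypothetical and not part of openQA.

```python
from collections import Counter


def reason_category(reason):
    """Return the part of a reason string before the first colon.

    "abandoned: associated worker ... abandoned the job" -> "abandoned"
    "timeout exceeded" -> "timeout exceeded"
    """
    if not reason:
        # Jobs that finished normally may carry no reason at all.
        return "(no reason)"
    return reason.split(":", 1)[0].strip()


def tally_reasons(reasons):
    """Count how many jobs fall into each reason category."""
    return Counter(reason_category(r) for r in reasons)
```

If "abandoned" shows up only once while "timeout exceeded" recurs, that would hint at two separate problems rather than one.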

History

#1 Updated by coolgw 3 months ago

  • Project changed from openQA Project to openQA Infrastructure

#2 Updated by cdywan 3 months ago

Were the logs missing from this job from the start? I can only see the ISO and the qcow2. They might have been deleted because the ticket was not added earlier; adding it would have marked the job as important, which makes openQA keep assets around longer.

I can find exactly one occurrence. Other jobs seem to finish with softfailed or timeout; I assume the former is what's considered good here. It would be helpful to confirm whether this is reproducible and whether it happens on workers other than openqaworker3.

#96710 used to be an issue that was causing jobs to fail with the same reason, and might be worth considering here, although it's quite old at this point. We could still be seeing a new and completely unrelated issue.

#3 Updated by coolgw 3 months ago

The error happened on the weekend, so I didn't add the ticket in time.
I suppose this is a sporadic issue, since it did not happen again after cloning the job.
I will keep an eye on this issue.

#4 Updated by tinita 3 months ago

  • Target version set to Ready

#5 Updated by cdywan 3 months ago

  • Subject changed from Reason: abandoned: associated worker openqaworker3:13 re-connected but abandoned the job to Reason: abandoned: associated worker openqaworker3:13 re-connected but abandoned the job size:M
  • Description updated (diff)
  • Status changed from New to Workable

#6 Updated by mkittler 3 months ago

  • Status changed from Workable to Closed
  • Assignee set to mkittler

The error means that the worker did not exit normally. That can have various reasons. It could be a bug in the worker code, but it could also be the physical machine crashing, someone sending a SIGKILL (e.g. when stopping/restarting the systemd service and it ran into the timeout), a kernel panic, etc. Without logs it is impossible to tell what happened. Unfortunately, it's in the nature of the problem that logs haven't been uploaded. Normally one can just have a look at the journal of the worker. In this case that is impossible, though, because the oldest message is from 07.09. I also haven't found any more recent occurrences when checking Next & Previous jobs.
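For future occurrences, where the journal still reaches back far enough, the lookup on the worker host could be sketched like this. This is only an illustration under assumptions: the unit name template `openqa-worker@{}.service` is assumed to match how the worker slots are run on the host, and the helper names `journal_command` and `read_journal` are hypothetical. The `journalctl` options used (`-u`, `--since`, `--until`, `--no-pager`) are standard.

```python
import subprocess


def journal_command(instance, since, until=None,
                    unit_template="openqa-worker@{}.service"):
    """Build a journalctl command for one worker slot.

    unit_template is an assumption; adjust it to the unit name
    actually used on the worker host.
    """
    cmd = ["journalctl", "--no-pager",
           "-u", unit_template.format(instance),
           "--since", since]
    if until:
        cmd += ["--until", until]
    return cmd


def read_journal(instance, since, until=None):
    """Run journalctl on the local host and return its output as text."""
    result = subprocess.run(journal_command(instance, since, until),
                            capture_output=True, text=True, check=False)
    return result.stdout
```

For slot 13, `read_journal(13, since, until)` would be called with a time window around the job's end. Empty output for that window means the journal has already rotated past it.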

The same job also sometimes finishes with reason: timeout exceeded

Ok, that's a different kind of error then. Without even a job URL it is impossible to investigate the underlying problem. (I checked out more recent jobs, e.g. https://openqa.suse.de/tests/9492203. However, it looks like the SUT or test code simply gets stuck. I don't think there's something to improve here from the openQA side. It is completely normal that the job eventually ends up exceeding the timeout in that case.)


I don't think we can do anything about it at this point.
