action #64520: Deal with jobs stuck in assigned state - openQA Project (public) - openSUSE Project Management Tool

action #64520

closed

Deal with jobs stuck in assigned state

Added by mkittler about 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Regressions/Crashes
Target version: Ready
Start date: 2020-03-16
Due date:
% Done:
Estimated time:
Tags: scheduling

Description

So far the job assignment is only revoked when a worker sends status updates claiming it is not working on a job (although a job is assigned). This means that jobs are stuck in the assigned state if a worker slot goes unexpectedly offline like in https://progress.opensuse.org/issues/64514.

It is also problematic that assigned jobs can be restarted. That creates a clone but the original job is then still stuck in the assigned state (e.g. https://openqa.opensuse.org/tests/1203580).

Suggestions

1. The stale job detection should also handle assigned jobs (implemented within https://github.com/os-autoinst/openQA/pull/3389/commits/585a0a3ef4f84a789a3260b4f4fe9cef8311b469) after a reasonable timeout. If those jobs are just set back to scheduled, we might run into https://progress.opensuse.org/issues/62984 when the worker tries to run the job after all. The already proposed PR https://github.com/os-autoinst/openQA/pull/3409 takes care of that.
2. Restarting assigned jobs should be improved. We could allow cancelling them, as is possible for scheduled jobs. However, if the worker then starts the job after all, we will run into https://progress.opensuse.org/issues/62984. So it makes more sense to keep the restart but ensure that the assigned job is set to cancelled (with the result user cancelled).
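
The two suggestions can be sketched with a small in-memory model. This is only an illustration in Python, not openQA's actual implementation: the `Job` class, both function names, the 120 second timeout and the spelling `user_cancelled` for the result are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value; the ticket only asks for "a reasonable timeout".
ASSIGNED_TIMEOUT_S = 120.0


@dataclass
class Job:
    id: int
    state: str = "scheduled"
    result: str = "none"
    assigned_worker_id: Optional[int] = None
    t_assigned: Optional[float] = None  # time of assignment in seconds
    clone_id: Optional[int] = None


def reschedule_stale_assigned(jobs, now, timeout=ASSIGNED_TIMEOUT_S):
    """Suggestion 1: set jobs back to scheduled if they stay assigned too long."""
    rescheduled = []
    for job in jobs:
        if job.state != "assigned" or job.t_assigned is None:
            continue
        if now - job.t_assigned > timeout:
            job.state = "scheduled"
            job.assigned_worker_id = None
            job.t_assigned = None
            rescheduled.append(job.id)
    return rescheduled


def restart_job(job, new_id):
    """Suggestion 2: restarting an assigned job also cancels the original."""
    clone = Job(id=new_id)
    job.clone_id = clone.id
    if job.state == "assigned":
        # Without this step the original would stay in "assigned" forever.
        job.state = "cancelled"
        job.result = "user_cancelled"
    return clone
```

Note that the sketch does not cover the case where the worker starts the job after it has been rescheduled or cancelled; that is the situation from https://progress.opensuse.org/issues/62984 mentioned above.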

Updated by mkittler about 5 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)

Updated by okurz about 5 years ago

Category set to Regressions/Crashes

Similar: https://openqa.suse.de/tests/3998087 on osd has been stuck in running for 24 days at 100% progress, with the reason already filled in. The job is "running" but has also been "finished" for 24 days. I did not change the job so that we can investigate it further.

Info from the job via the API:

$ openqa_client_osd --json-output jobs/3998087
{
   "job" : {
      "id" : 3998087,
      "state" : "running",
      "t_started" : "2020-03-16T19:54:27",
      "group" : "Functional",
      "reason" : "api failure: 400 response: Got status update for job 3998087 and worker 527 but there is not even a worker assigned to this job (job is scheduled)",
      "assets" : {
         "other" : [
            "SLE-15-SP2-Online-aarch64-Build160.1-Media1.iso.sha256"
         ]
      },
      "name" : "sle-15-SP2-Online-aarch64-Build160.1-default@aarch64",
      "test" : "default",
      "priority" : 50,
      "blocked_by_id" : null,
      "result" : "incomplete",
      "t_finished" : "2020-03-16T19:54:26",
      "group_id" : 110,
      "settings" : {
         "VERSION" : "15-SP2",
…
         "BACKEND" : "qemu"
      },
      "parents" : {
         "Chained" : [],
         "Directly chained" : [],
         "Parallel" : []
      },
      "clone_id" : 4001121,
      "assigned_worker_id" : 527,
      "children" : {
         "Directly chained" : [],
         "Parallel" : [],
         "Chained" : []
      }
   }
}
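
The contradiction in this record can be checked mechanically. Below is a minimal sketch in Python that flags a job whose state is not final although `t_finished`, a final `result` or a `reason` is already set. The field names are taken from the output above; the function name, the set of states treated as unfinished and the value `none` for "no result yet" are assumptions for illustration.

```python
# Assumption: these are the states in which a job has not finished yet.
PENDING_STATES = {"scheduled", "assigned", "setup", "running", "uploading"}


def find_inconsistencies(job):
    """Return the reasons why a job record contradicts its own state."""
    problems = []
    if job.get("state") not in PENDING_STATES:
        return problems
    if job.get("t_finished"):
        problems.append("t_finished is set")
    if job.get("result") not in (None, "none"):
        problems.append("result is already final")
    if job.get("reason"):
        problems.append("reason is set")
    return problems


# Fields copied from the output above (all other fields omitted)
job = {
    "id": 3998087,
    "state": "running",
    "t_started": "2020-03-16T19:54:27",
    "t_finished": "2020-03-16T19:54:26",
    "result": "incomplete",
}
print(find_inconsistencies(job))
```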

Updated by mkittler about 5 years ago

Of course this job was actually set to running. The worker even sent a reason, but somehow the web UI didn't set the state to done. That is very weird because I'm not aware of a way to set the reason without setting the job state to done at the same time. Likely there's yet another race condition somewhere.

Updated by okurz almost 5 years ago

Target version set to Ready

Updated by okurz over 4 years ago

Status changed from New to Blocked
Assignee set to okurz

Looks related to #69784, which we should probably wait for before re-evaluating.

Updated by mkittler over 4 years ago

Status changed from Blocked to In Progress
Assignee changed from okurz to mkittler

I re-evaluated this ticket myself:

I'll have to re-check my concern "If those jobs are just set back to scheduled…" in suggestion 1. because setting those jobs back to scheduled has now been implemented. ~~I suppose if the worker comes back after all it would simply run into API errors but those wouldn't be recorded (like in #62984) because the worker is no longer assigned.~~ Looks like the API error will be recorded and it would even interfere when another worker has already taken that job.
Suggestion 2. should still be implemented.
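
The interference described in the first point can be modelled as follows. This is a sketch in Python under stated assumptions and not the logic of the PR: the `Job` class, `handle_status_update`, `handle_set_done` and the returned messages are hypothetical. It shows one way to reject requests from a worker that is no longer assigned without touching the job, so that a job already taken by another worker is left alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    id: int
    state: str
    assigned_worker_id: Optional[int] = None
    result: str = "none"
    reason: Optional[str] = None


def _owns(job, worker_id):
    # A worker may only change a job that is currently assigned to it.
    return job.assigned_worker_id is not None and job.assigned_worker_id == worker_id


def handle_status_update(job, worker_id):
    """Accept a status update only from the assigned worker."""
    if not _owns(job, worker_id):
        return 400, f"job {job.id} is not assigned to worker {worker_id}"
    if job.state == "assigned":
        job.state = "running"
    return 200, "ok"


def handle_set_done(job, worker_id, result, reason=None):
    """Accept the final result only from the assigned worker."""
    if not _owns(job, worker_id):
        # The late worker's failure reason is not recorded on the job.
        return 400, f"job {job.id} is not assigned to worker {worker_id}"
    job.state = "done"
    job.result = result
    job.reason = reason
    return 200, "ok"
```

With a job that was rescheduled and then assigned to worker 2, a late request from worker 1 gets a 400 response and leaves the state, result and reason of the job unchanged.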

Updated by mkittler over 4 years ago

Description updated

PR https://github.com/os-autoinst/openQA/pull/3409 will solve the points from my previous comment.

Updated by mkittler over 4 years ago

Status changed from In Progress to Resolved

The PR has been merged. In order to test this locally with the full stack I modified the worker code to simulate a slow worker and started a 2nd worker at the right time. That's nothing I want to do in production so I'll mark the ticket as resolved.
