action #69784

closed

Workers not considered offline after ungraceful disconnect; stale job detection has no effect in that case

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Observation

When a worker disconnects ungracefully, it is supposed to be considered offline after WORKERS_CHECKER_THRESHOLD seconds. This mechanism does not work while there are still jobs that can be assigned to that worker: the scheduler may then attempt to assign jobs to it and thereby updates the worker's t_updated column, which is also used to track the worker's last activity. As a result, the worker keeps appearing online. This prevents the stale job detection from working for jobs whose worker never comes back.
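The effect can be illustrated with a toy model. This is only a sketch in Python, not openQA's actual code: the `Worker` class, the function names, the tick interval and the threshold value of 120 seconds are all assumptions made for illustration. Only `t_updated` and `WORKERS_CHECKER_THRESHOLD` are names from this ticket.

```python
from dataclasses import dataclass

# Illustrative value only; not necessarily what openQA uses.
WORKERS_CHECKER_THRESHOLD = 120  # seconds


@dataclass
class Worker:
    # Single timestamp that serves both as "row last modified"
    # and as "worker last seen", which is the root of the problem.
    t_updated: float


def is_offline(worker: Worker, now: float) -> bool:
    """Offline check based solely on t_updated."""
    return now - worker.t_updated > WORKERS_CHECKER_THRESHOLD


def scheduler_assign_attempt(worker: Worker, now: float) -> None:
    """Hypothetical stand-in for the scheduler writing to the worker row.

    Any such write bumps t_updated, even though the worker itself
    has not been in contact.
    """
    worker.t_updated = now


def simulate(scheduler_active: bool, duration: int = 600, tick: int = 30) -> bool:
    """Worker was last really seen at t=0 and then vanished.

    Returns whether it is considered offline at t=duration.
    """
    worker = Worker(t_updated=0.0)
    for now in range(tick, duration + 1, tick):
        if scheduler_active:
            scheduler_assign_attempt(worker, now)
    return is_offline(worker, duration)
```

With no assignable jobs (`simulate(scheduler_active=False)`) the vanished worker is detected as offline; with the scheduler repeatedly touching the row (`simulate(scheduler_active=True)`) it never is.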

This can currently be observed with openqaworker7 on OSD (which has been an o3 worker for months): see #64514#note-6

Impact

  • stale job detection is ineffective for the affected jobs
  • users get misleading information about the affected workers

Notes

This was an oversight when solving #27454 and #57017 and increasing WORKERS_CHECKER_THRESHOLD to make the stale job detection less aggressive.

Suggestions

The scheduler should preserve the t_updated column, or we should keep track of the worker's activity using an additional column. I don't think it would be a good idea to move back to the previous approach of tracking the worker activity only within the web socket server.
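A minimal sketch of the "additional column" variant, again as a Python toy model rather than openQA code. The column name `t_seen`, the `Worker` class, the function names and the threshold value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Illustrative value only; not necessarily what openQA uses.
WORKERS_CHECKER_THRESHOLD = 120  # seconds


@dataclass
class Worker:
    t_updated: float  # row last modified, by anyone
    t_seen: float     # hypothetical extra column: last real contact from the worker


def record_worker_contact(worker: Worker, now: float) -> None:
    """Called only when the worker itself has been in contact."""
    worker.t_seen = now
    worker.t_updated = now


def scheduler_assign_attempt(worker: Worker, now: float) -> None:
    """Scheduler writes to the row; t_seen is deliberately left alone."""
    worker.t_updated = now


def is_offline(worker: Worker, now: float) -> bool:
    """Offline check based on t_seen instead of t_updated."""
    return now - worker.t_seen > WORKERS_CHECKER_THRESHOLD
```

In this model, scheduler writes no longer mask a vanished worker, because the offline check ignores t_updated entirely. The alternative from the first half of the suggestion, having the scheduler preserve t_updated, would avoid the schema change but requires every scheduler write path to remember to do so.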


Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #64514: openqaworker7 is down and IPMI SOL very unstable (Resolved, okurz, 2020-03-16, 2020-03-19)
