action #69784

Workers not considered offline after ungraceful disconnect; stale job detection has no effect in that case

Added by okurz 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Concrete Bugs
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

When a worker does not disconnect gracefully, it is supposed to be considered offline after WORKERS_CHECKER_THRESHOLD seconds. This mechanism does not work while there are still jobs which can be assigned to that worker: the scheduler may attempt to assign jobs to the worker and thereby update its t_updated column, which is also used to track the worker's last activity. So the worker keeps appearing online. This prevents the stale job detection from working for jobs whose worker really never shows up again.
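The failure mode described above can be sketched as follows. This is a minimal, hypothetical model (names and threshold value are illustrative, not openQA's actual code): the offline check and the scheduler both operate on the same t_updated timestamp, so every assignment attempt resets the offline timer.

```python
import time

WORKERS_CHECKER_THRESHOLD = 120  # seconds; hypothetical value for illustration

class Worker:
    def __init__(self):
        # t_updated is a generic "row last written" timestamp, but it is
        # also (mis)used as "worker last seen" by the offline detection.
        self.t_updated = time.time()

    def is_offline(self, now):
        return now - self.t_updated > WORKERS_CHECKER_THRESHOLD

def scheduler_attempt_assignment(worker, now):
    # Any assignment attempt writes the worker row, bumping t_updated
    # and thereby resetting the offline timer as a side effect.
    worker.t_updated = now

w = Worker()
later = w.t_updated + 2 * WORKERS_CHECKER_THRESHOLD
scheduler_attempt_assignment(w, later)  # scheduler keeps trying to assign jobs
assert not w.is_offline(later)          # so the dead worker never appears offline
```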

This can currently be observed with openqaworker7 on OSD (which has been an o3 worker for months): see #64514#note-6

Impact

  • stale job detection ineffective for the affected jobs
  • misleading information presented to users about the affected workers

Notes

This was an oversight when solving #27454 and #57017 and increasing WORKERS_CHECKER_THRESHOLD to make the stale job detection less aggressive.

Suggestions

The scheduler should preserve the t_updated column, or we could keep track of the worker's activity using an additional column. I don't think it would be a good idea to move back to the previous approach of tracking worker activity only within the web socket server.
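The second suggestion, a dedicated activity column, can be sketched like this (a hypothetical model, not openQA's actual schema; the column name t_seen matches the one the test output later in this ticket refers to). Only the worker's own check-ins touch t_seen, so scheduler writes no longer mask a dead worker:

```python
import time

WORKERS_CHECKER_THRESHOLD = 120  # hypothetical value

class Worker:
    def __init__(self):
        now = time.time()
        self.t_updated = now  # generic row-update timestamp (also written by the scheduler)
        self.t_seen = now     # only written when the worker itself reports in

    def is_offline(self, now):
        # Offline detection consults only t_seen, so scheduler writes
        # to t_updated can no longer reset the offline timer.
        return now - self.t_seen > WORKERS_CHECKER_THRESHOLD

def scheduler_attempt_assignment(worker, now):
    worker.t_updated = now  # does not touch t_seen

w = Worker()
later = w.t_seen + 2 * WORKERS_CHECKER_THRESHOLD
scheduler_attempt_assignment(w, later)
assert w.is_offline(later)  # worker is now correctly considered offline
```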


Related issues

Copied from openQA Infrastructure - action #64514: openqaworker7 is down and IPMI SOL very unstable (Resolved, 2020-03-16 – 2020-03-19)

History

#1 Updated by okurz 5 months ago

  • Copied from action #64514: openqaworker7 is down and IPMI SOL very unstable added

#2 Updated by okurz 5 months ago

  • Target version set to future

#3 Updated by mkittler 5 months ago

  • Subject changed from openqaworker7 reports as online in OSD but has been an o3 worker since months to Workers not considered offline after ungraceful disconnect; stale job detection has no effect in that case
  • Description updated (diff)

#4 Updated by cdywan 5 months ago

Can you clarify whether this causes jobs to get rejected? Or otherwise interferes with scheduling? Or is it mostly a cosmetic issue?

#5 Updated by okurz 5 months ago

  • Description updated (diff)

added "impact" section

#6 Updated by okurz 5 months ago

  • Target version changed from future to Ready

Given that the stale job detection is also impacted and this is a regression, it is something we should consider for the team.

#7 Updated by mkittler 5 months ago

I've already written a unit test to reproduce the issue and make sense of it, so I'll likely pick it up soon.

#8 Updated by mkittler 4 months ago

  • Assignee set to mkittler

#9 Updated by cdywan 4 months ago

  • Status changed from New to Workable

#11 Updated by mkittler 4 months ago

The PRs have been merged.


Note that the stale job detection is still very patient with lost/crashed/… workers, so even with my PR workers which went offline unexpectedly still appear as online for 2 days. That also means jobs assigned to such workers will still be stuck in the assigned state for up to 2 days, as we've recently seen on OSD. There the max job time alert triggered because aarch64 jobs piled up after ARM worker 2 went offline unexpectedly. Many jobs were affected because one chained parent was stuck, so lots of chained children also ended up stuck (being blocked).

I've introduced this "long patience" of 2 days explicitly as one of the measures to fix #60866. But in the context of the current ticket and the recent alert on OSD it is of course questionable whether it makes sense to wait 2 days for a worker to come back online. Note that I originally only wanted to increase the threshold from 2 minutes to 15 minutes but coolo convinced me it was better to increase it to 2 days right away. However, now I think about 30 minutes would be a good value. If workers/jobs get stuck for half an hour it isn't that notable, and it leaves enough resilience for connection problems. I also suppose it is a big enough threshold to prevent overloading the stale job detection (which #60866 was about).

Another quick improvement that comes to mind after manually re-scheduling jobs on OSD today: we could also make the stale job detection re-schedule assigned jobs instead of marking them as incomplete. We already do the same optimization in the stale job detection that runs when the worker re-connects, and it would make sense to do it in both cases.
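The proposed improvement can be sketched as follows (a hypothetical model of the stale job handling, not openQA's actual code; the dict keys are illustrative). Jobs the worker never actually started simply go back to the queue, while jobs that already ran still become incomplete:

```python
def handle_stale_job(job):
    # Sketch of the proposed behaviour: a job that was only assigned,
    # never started, can be re-scheduled instead of marked incomplete,
    # so nobody has to retrigger it manually.
    if job["state"] == "assigned":
        job["state"] = "scheduled"
        job["assigned_worker_id"] = None
    else:
        # Jobs that already ran (partially) still end as incomplete.
        job["state"] = "done"
        job["result"] = "incomplete"
    return job

job = handle_stale_job({"state": "assigned", "assigned_worker_id": 7})
assert job["state"] == "scheduled" and job["assigned_worker_id"] is None
```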

#12 Updated by okurz 4 months ago

mkittler wrote:

[…]
I've introduced this "long patience" of 2 days explicitly as one of the measures to fix #60866. But in the context of the current ticket and the recent alert on OSD it is of course questionable whether it makes sense to wait 2 days for a worker to come back online. Note that I originally only wanted to increase the threshold from 2 minutes to 15 minutes but coolo convinced me it was better to increase it to 2 days right away. However, now I think about 30 minutes would be a good value. If workers/jobs get stuck for half an hour it isn't that notable, and it leaves enough resilience for connection problems. I also suppose it is a big enough threshold to prevent overloading the stale job detection (which #60866 was about).

Agreed then, still agree now :) But for any number like this, having an overridable config option leaves room for instance admins to tweak it based on their specific use cases. And I think instance-wide is a good scope because each instance has its own expectations about how long connection outages may last and such.

Another quick improvement that comes to mind after manually re-scheduling jobs on OSD today: we could also make the stale job detection re-schedule assigned jobs instead of marking them as incomplete. We already do the same optimization in the stale job detection that runs when the worker re-connects, and it would make sense to do it in both cases.

I agree that we should end up with re-scheduled jobs, because I cannot think of a reason why the test reviewer should need to make a decision about retriggering or not in these cases.

#13 Updated by mkittler 4 months ago

But for any number like this, having an overridable config option

Right, we should finally make that configurable. PR: https://github.com/os-autoinst/openQA/pull/3389
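A hedged sketch of what such an instance-wide option could look like in /etc/openqa/openqa.ini; the key name and section used here are hypothetical, the actual name is whatever the linked PR introduces:

```ini
[global]
# consider a worker offline (and its jobs stale) after this many seconds
# (30 minutes, as suggested above)
worker_timeout = 1800
```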

because I cannot think of a reason why the test reviewer should need to make a decision about retriggering or not in these cases.

These stale jobs have been automatically restarted so far, so the reviewer already does not need to make that decision. However, restarting big job clusters from within the scheduler can keep the scheduler busy and simply leads to more incomplete jobs than necessary.

#14 Updated by mkittler 4 months ago

The PR https://github.com/os-autoinst/openQA/pull/3389 should now cover all mentioned improvements.

#15 Updated by okurz 4 months ago

https://github.com/os-autoinst/openQA/pull/3389 got merged now. Given that you now want to detect stale jobs after 30m, I suggest actually waiting for the next OSD deployment and explicitly crashing e.g. openqaworker-arm-2 while it is running jobs (e.g. with ipmitool and power off), then seeing how the stale job detection fares without overloading the scheduler.

#16 Updated by mkittler 4 months ago

  • Status changed from In Progress to Resolved

Currently the arm machine with the highest number of running jobs has 5 jobs without big clusters. That's not enough to test for overloading. I've tested the change locally and with unit tests, and it should work. If it causes any scalability regressions we can still re-open this ticket or #60866, or create a new ticket. Since the worker timeout is now configurable it is also easy to change again if needed.

#17 Updated by okurz 4 months ago

  • Status changed from Resolved to Feedback
  • Priority changed from Normal to Urgent

Tests now fail flakily on t_seen, e.g. https://build.opensuse.org/public/build/devel:openQA/openSUSE_Factory/x86_64/openQA/_log shows

[  266s] ./t/04-scheduler.t .......................................... 
[  266s] [INFO] using database schema "tmp_sp9GOuV6b0QT3EI6"
[  266s] # Subtest: Authentication
[  266s]     ok 1 - GET /test
[  266s]     ok 2 - 404 Not Found
[  266s]     ok 3 - content is similar
[  266s]     ok 4 - GET /
[  266s]     ok 5 - 200 OK
[  266s]     ok 6 - exact match for JSON Pointer ""
[  266s] [DEBUG] No API key from client.
[  266s]     ok 7 - GET /
[  266s]     ok 8 - 403 Forbidden
[  266s]     ok 9 - exact match for JSON Pointer ""
[  266s]     1..9
[  266s] ok 1 - Authentication
[  266s] # Subtest: Exception
[  266s]     ok 1 - GET /whatever
[  266s]     ok 2 - 500 Internal Server Error
[  266s]     ok 3 - content is similar
[  266s]     ok 4 - GET /whatever
[  266s]     ok 5 - 404 Not Found
[  266s]     1..5
[  266s] ok 2 - Exception
[  266s] # Subtest: API
[  266s]     ok 1 - GET /api/wakeup
[  266s]     ok 2 - 200 OK
[  266s]     ok 3 - exact match for content
[  266s]     ok 4 - scheduler has been woken up
[  266s]     ok 5 - GET /api/wakeup
[  266s]     ok 6 - 200 OK
[  266s]     ok 7 - exact match for content
[  266s]     ok 8 - scheduler has been woken up again
[  266s]     1..8
[  266s] ok 3 - API
[  266s] ok 4 - assert database has no jobs to start with
[  266s] # Subtest: worker registration
[  266s]     ok 1 - new worker registered
[  266s]     ok 2 - id set
[  266s]     ok 3 - host set
[  266s]     ok 4 - instance set
[  266s]     ok 5 - re-registered worker got same id
[  266s]     1..5
[  266s] ok 5 - worker registration
[  266s] ok 6 - job_create
[  266s] ok 7
[  266s] # Subtest: calling again with same settings
[  266s]     ok 1 - calling again with same settings yields new job
[  266s]     1..1
[  266s] ok 8 - calling again with same settings
[  266s] ok 9 - job_get
[  266s] # Subtest: job listing
[  266s]     ok 1 - All list_jobs
[  266s]     ok 2 - All list_jobs with state scheduled
[  266s]     ok 3 - All list_jobs with state running
[  266s]     ok 4 - list_jobs with build
[  266s]     ok 5 - list_jobs with iso
[  266s]     ok 6 - list_jobs combining a setting (BUILD) and state
[  266s]     ok 7 - list_jobs combining two settings (ISO and BUILD)
[  266s]     ok 8 - list_jobs messing two settings up
[  266s]     ok 9 - jobs with specified IDs and states (array ref)
[  266s]     ok 10 - jobs with specified IDs (comma list)
[  266s]     1..10
[  266s] ok 10 - job listing
[  266s] # Subtest: job grab (WORKER_CLASS mismatch)
[  266s] [DEBUG] +=----------------=+
[  266s] [DEBUG] -> Scheduling new jobs.
[  266s] [DEBUG]     Free workers: 1/1
[  266s] [DEBUG]     Scheduled jobs: 2
[  266s] [DEBUG] Scheduler took 0.01342s to perform operations and allocated 0 jobs
[  266s]     ok 1 - job not grabbed due to default WORKER_CLASS
[  266s]     ok 2 - no workers/jobs allocated
[  266s]     not ok 3 - t_seen has not changed
[  266s]     1..3
[  266s] not ok 11 - job grab (WORKER_CLASS mismatch)
[  266s] # Subtest: job grab (failed to send job to worker)
[  266s] [DEBUG] +=----------------=+
[  266s] [DEBUG] -> Scheduling new jobs.
[  266s] [DEBUG]     Free workers: 1/1
[  266s] [DEBUG]     Scheduled jobs: 2
[  266s] [DEBUG] Need to schedule 1 parallel jobs for job 1 (with priority 40)
[  266s] [DEBUG] Assigned job '1' to worker ID '1'
[  266s] [DEBUG] Failed to send data to websocket, reason: fake error at ./t/04-scheduler.t line 44.
[  266s] [DEBUG] Failed sending job(s) '1' to worker '1'
[  266s] [DEBUG] Job 1 reset to state scheduled
[  266s] [DEBUG] Scheduler took 0.02683s to perform operations and allocated 0 jobs
[  266s]     ok 1 - no workers/jobs allocated
[  266s]     not ok 2 - t_seen has not changed
[  266s]     1..2
[  266s] not ok 12 - job grab (failed to send job to worker)
[  266s] # Subtest: job grab (successful assignment)
[  266s] [DEBUG] +=----------------=+
[  266s] [DEBUG] -> Scheduling new jobs.
[  266s] [DEBUG]     Free workers: 1/1
[  266s] [DEBUG]     Scheduled jobs: 2
[  266s] [DEBUG] Need to schedule 1 parallel jobs for job 1 (with priority 40)
[  266s] [DEBUG] Assigned job '1' to worker ID '1'
[  266s] [DEBUG] [Job#1] Prepare for being processed by worker 1
[  266s] [DEBUG] Sent job(s) '1' to worker '1'
[  266s] [DEBUG] Scheduler took 0.03119s to perform operations and allocated 1 jobs
[  266s] [DEBUG] Allocated: { job => 1, worker => 1 }
[  266s]     ok 1 - job token present
[  266s]     ok 2 - settings correct
[  266s]     ok 3 - job start timestamp not present as job is not started
[  266s]     ok 4 - number of running jobs
[  266s]     ok 5 - assigned worker set
[  266s]     ok 6 - worker assigned to job
[  266s]     ok 7 - job assigned to worker
[  266s]     ok 8 - job is in assigned state
[  266s]     not ok 9 - t_seen has not changed
[  266s]     1..9
[  266s] not ok 13 - job grab (successful assignment)
…

Maybe it's better to remove the additional "t_seen has not changed" test until we have a better solution.

#18 Updated by okurz 4 months ago

mkittler wrote:

Currently the arm machine with the highest number of running jobs has 5 jobs without big clusters. That's not enough to test an overloading.

What do you mean by that? I meant that you simply force power off the machine while jobs are running.

#19 Updated by mkittler 4 months ago

  • Priority changed from Urgent to Normal

You removed the unstable checks so I'm reducing the urgency: https://github.com/os-autoinst/openQA/pull/3408

What do you mean by that? I meant that you simply force power off the machine while jobs are running.

Of course, but only 5 jobs would be affected. That's not good enough to test overloading of the scheduler, so it is kind of useless.

#20 Updated by mkittler 4 months ago

  • Status changed from Feedback to Resolved
