Project

General

Profile

Actions

action #135380

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

A significant number of scheduled jobs with one or two running triggers an alert

Added by livdywan almost 1 year ago. Updated 11 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2023-09-07
Due date:
% Done:

0%

Estimated time:

Description

Motivation

#135122 was discovered due to user feedback rather than alert handling. We should ensure we either adjust existing alerts accordingly or add one that would be able to discover this issue early.

Acceptance criteria

  • AC1: A significant number of scheduled jobs with one or two running triggers an alert

Suggestions


Files

wakeup-scheduler.ods (16.1 KB) wakeup-scheduler.ods tinita, 2023-09-08 10:42

Related issues 2 (0 open2 closed)

Related to openQA Infrastructure - action #135632: "Mojo::File::spurt is deprecated in favor of Mojo::File::spew" breaking os-autoinst OBS build and osd-deployment size:MResolvedokurz2023-05-08

Actions
Copied to openQA Infrastructure - action #135578: Long job age and jobs not executed for long size:MResolvednicksinger

Actions
Actions #1

Updated by okurz almost 1 year ago

  • Target version set to Ready
Actions #2

Updated by okurz almost 1 year ago

  • Tracker changed from coordination to action
Actions #3

Updated by tinita almost 1 year ago

I mentioned it before, but I think I wasn't clear enough.

We have these warnings in the osd webui journal:

Sep 08 09:40:36 openqa openqa[2649]: [warn] [pid:2649] Unable to wakeup scheduler: Request timeout
Sep 08 09:43:13 openqa openqa[2649]: [warn] [pid:2649] Unable to wakeup scheduler: Request timeout
Sep 08 09:44:11 openqa openqa[22509]: [warn] [pid:22509] Unable to wakeup scheduler: Request timeout
Sep 08 09:44:25 openqa openqa[5400]: [warn] [pid:5400] Unable to wakeup scheduler: Request timeout
Sep 08 09:45:07 openqa openqa[25792]: [warn] [pid:25792] Unable to wakeup scheduler: Request timeout
Sep 08 09:46:34 openqa openqa[5400]: [warn] [pid:5400] Unable to wakeup scheduler: Request timeout
Sep 08 09:47:05 openqa openqa[7085]: [warn] [pid:7085] Unable to wakeup scheduler: Request timeout
Sep 08 09:51:33 openqa openqa[5378]: [warn] [pid:5378] Unable to wakeup scheduler: Request timeout
Sep 08 09:51:51 openqa openqa[24950]: [warn] [pid:24950] Unable to wakeup scheduler: Request timeout
Sep 08 09:52:06 openqa openqa[28278]: [warn] [pid:28278] Unable to wakeup scheduler: Request timeout
Sep 08 09:52:20 openqa openqa[12326]: [warn] [pid:12326] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:11 openqa openqa[2519]: [warn] [pid:2519] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[2521]: [warn] [pid:2521] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[5378]: [warn] [pid:5378] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[622]: [warn] [pid:622] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[11005]: [warn] [pid:11005] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[5400]: [warn] [pid:5400] Unable to wakeup scheduler: Request timeout
Sep 08 09:57:57 openqa openqa[2521]: [warn] [pid:2521] Unable to wakeup scheduler: Request timeout
Sep 08 10:15:08 openqa openqa[2521]: [warn] [pid:2521] Unable to wakeup scheduler: Request timeout

And this happens all the time.
It's a bit hard to see the time distribution, so I just put it into a spreadsheet with a count of that warning per hour, and I marked those time frames with a gray background where we saw no new jobs scheduled.

Although the counts doesn't always correlate with that, I think it shows that it's connected.

Note that on o3, we only see that warning like once per day.

So maybe rather than trying to figure out how to measure scheduled and running jobs, we should create an alert on such warnings instead.

Actions #4

Updated by livdywan almost 1 year ago

It's a bit hard to see the time distribution, so I just put it into a spreadsheet with a count of that warning per hour, and I marked those time frames with a gray background where we saw no new jobs scheduled.

Although the counts doesn't always correlate with that, I think it shows that it's connected.

Note that on o3, we only see that warning like once per day.

So maybe rather than trying to figure out how to measure scheduled and running jobs, we should create an alert on such warnings instead.

@osukup already prepared an alert about the number of jobs: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/973

Imho we should still have an alert examining the logs, though. Having up to 84 messages per hour is not something we should be ignoring.

Actions #5

Updated by livdywan 12 months ago

@osukup already prepared an alert about the number of jobs: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/973

Looks like it's effective. The alert was active between Friday and a moment ago.

Actions #7

Updated by okurz 12 months ago

  • Status changed from New to Feedback
  • Assignee set to okurz
Actions #8

Updated by okurz 12 months ago

  • Copied to action #135578: Long job age and jobs not executed for long size:M added
Actions #9

Updated by kraih 12 months ago

Actually after looking into this a bit i think the Unable to wakeup scheduler: Request timeout log messages are a red herring here. Because of how the scheduler is designed, using a timer that can prevent I/O events from getting handled, multiple wakeups in a short timeframe will trigger the reschedule often enough to make some of them exceed their request timeout. And this is actually fine. However, this is therefore no good indicator for performance issues in other areas of the code, as we've been assuming previously.

What we should be looking out for instead is actually how long the reschedule runs take. Any outliers here would be much more indicative of real problems.

Actions #10

Updated by okurz 12 months ago

  • Status changed from Feedback to New

ok, I will try to reduce the urgency of the "request timeout" log message into "debug" or "trace" level

Actions #11

Updated by okurz 12 months ago

  • Status changed from New to Feedback
Actions #12

Updated by okurz 12 months ago

  • Status changed from Feedback to Blocked
Actions #13

Updated by okurz 12 months ago

  • Related to action #135632: "Mojo::File::spurt is deprecated in favor of Mojo::File::spew" breaking os-autoinst OBS build and osd-deployment size:M added
Actions #14

Updated by okurz 11 months ago

  • Status changed from Blocked to Resolved

unblocked, https://github.com/os-autoinst/openQA/pull/5308 just merged. That was the last remaining task. The alert was already there in before.

Actions

Also available in: Atom PDF