action #135380

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

A significant number of scheduled jobs with one or two running triggers an alert

Added by livdywan 8 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2023-09-07
Due date:
% Done:

0%

Estimated time:

Description

Motivation

#135122 was discovered through user feedback rather than alert handling. We should either adjust existing alerts accordingly or add one that can catch this kind of issue early.

Acceptance criteria

  • AC1: A significant number of scheduled jobs with one or two running triggers an alert
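
A minimal sketch of how AC1 could be checked, assuming the instance's job list is readable via the public REST API without credentials; the /api/v1/jobs route and its "state" filter exist in openQA, but the instance URL, the thresholds and the plain print standing in for the alert action are illustrative assumptions:

# Sketch of a check for AC1; thresholds and alert delivery are illustrative.
import requests

OPENQA = "https://openqa.suse.de"   # assumption: the OSD instance
SCHEDULED_THRESHOLD = 1000          # "significant number" of scheduled jobs (illustrative)
RUNNING_FLOOR = 3                   # "one or two running" means fewer than this

def count_jobs(state):
    # /api/v1/jobs supports filtering by job state, e.g. "scheduled" or "running";
    # note the route may apply a default result limit, which a real check would handle
    resp = requests.get(f"{OPENQA}/api/v1/jobs", params={"state": state}, timeout=30)
    resp.raise_for_status()
    return len(resp.json().get("jobs", []))

scheduled = count_jobs("scheduled")
running = count_jobs("running")
if scheduled >= SCHEDULED_THRESHOLD and running < RUNNING_FLOOR:
    # a real deployment would raise a Grafana/email alert instead of printing
    print(f"ALERT: {scheduled} jobs scheduled but only {running} running")

The alert prepared in salt-states-openqa (see the merge request linked in the comments below) may be implemented differently, e.g. as a Grafana alert fed from the database; the sketch only spells out the condition named in AC1.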

Suggestions


Files

wakeup-scheduler.ods (16.1 KB), added by tinita, 2023-09-08 10:42

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #135632: "Mojo::File::spurt is deprecated in favor of Mojo::File::spew" breaking os-autoinst OBS build and osd-deployment size:M (Resolved, okurz, 2023-05-08)

Copied to openQA Infrastructure - action #135578: Long job age and jobs not executed for long size:M (Resolved, nicksinger)

Actions #1

Updated by okurz 8 months ago

  • Target version set to Ready
Actions #2

Updated by okurz 8 months ago

  • Tracker changed from coordination to action
Actions #3

Updated by tinita 8 months ago

I mentioned it before, but I think I wasn't clear enough.

We have these warnings in the OSD web UI journal:

Sep 08 09:40:36 openqa openqa[2649]: [warn] [pid:2649] Unable to wakeup scheduler: Request timeout
Sep 08 09:43:13 openqa openqa[2649]: [warn] [pid:2649] Unable to wakeup scheduler: Request timeout
Sep 08 09:44:11 openqa openqa[22509]: [warn] [pid:22509] Unable to wakeup scheduler: Request timeout
Sep 08 09:44:25 openqa openqa[5400]: [warn] [pid:5400] Unable to wakeup scheduler: Request timeout
Sep 08 09:45:07 openqa openqa[25792]: [warn] [pid:25792] Unable to wakeup scheduler: Request timeout
Sep 08 09:46:34 openqa openqa[5400]: [warn] [pid:5400] Unable to wakeup scheduler: Request timeout
Sep 08 09:47:05 openqa openqa[7085]: [warn] [pid:7085] Unable to wakeup scheduler: Request timeout
Sep 08 09:51:33 openqa openqa[5378]: [warn] [pid:5378] Unable to wakeup scheduler: Request timeout
Sep 08 09:51:51 openqa openqa[24950]: [warn] [pid:24950] Unable to wakeup scheduler: Request timeout
Sep 08 09:52:06 openqa openqa[28278]: [warn] [pid:28278] Unable to wakeup scheduler: Request timeout
Sep 08 09:52:20 openqa openqa[12326]: [warn] [pid:12326] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:11 openqa openqa[2519]: [warn] [pid:2519] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[2521]: [warn] [pid:2521] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[5378]: [warn] [pid:5378] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[622]: [warn] [pid:622] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[11005]: [warn] [pid:11005] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[5400]: [warn] [pid:5400] Unable to wakeup scheduler: Request timeout
Sep 08 09:57:57 openqa openqa[2521]: [warn] [pid:2521] Unable to wakeup scheduler: Request timeout
Sep 08 10:15:08 openqa openqa[2521]: [warn] [pid:2521] Unable to wakeup scheduler: Request timeout

And this happens all the time.
It's a bit hard to see the time distribution, so I put it into a spreadsheet with a count of that warning per hour, and marked the time frames where we saw no new jobs scheduled with a gray background.

Although the counts don't always correlate with that, I think it shows that they are connected.

Note that on o3, we only see that warning about once per day.

So maybe rather than trying to figure out how to measure scheduled and running jobs, we should create an alert on such warnings instead.
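
A minimal sketch of producing the same per-hour count directly from the journal, assuming the web UI runs as the "openqa-webui" unit and that journalctl supports -g/--grep (systemd built with PCRE2); the warning text is taken from the excerpt above:

# Count "Unable to wakeup scheduler" warnings per hour over the last day.
import subprocess
from collections import Counter

out = subprocess.run(
    ["journalctl", "-u", "openqa-webui", "--since=-24h",
     "-g", "Unable to wakeup scheduler", "-o", "short-iso"],
    capture_output=True, text=True,
).stdout

per_hour = Counter()
for line in out.splitlines():
    if "Unable to wakeup scheduler" not in line:
        continue                          # skip "-- No entries --" and similar
    # short-iso lines start with e.g. 2023-09-08T09:40:36+0200
    per_hour[line.split()[0][:13]] += 1   # bucket key: YYYY-MM-DDTHH

for hour, count in sorted(per_hour.items()):
    print(hour, count)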

Actions #4

Updated by livdywan 8 months ago

> It's a bit hard to see the time distribution, so I put it into a spreadsheet with a count of that warning per hour, and marked the time frames where we saw no new jobs scheduled with a gray background.
>
> Although the counts don't always correlate with that, I think it shows that they are connected.
>
> Note that on o3, we only see that warning about once per day.
>
> So maybe rather than trying to figure out how to measure scheduled and running jobs, we should create an alert on such warnings instead.

@osukup already prepared an alert about the number of jobs: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/973

Imho we should still have an alert examining the logs, though. Having up to 84 messages per hour is not something we should be ignoring.
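
A sketch of such a log-based check, again assuming the "openqa-webui" unit name; the threshold is illustrative (the peak mentioned above was 84 per hour) and a real deployment would raise the alert through the existing monitoring rather than via an exit code:

# Fail loudly if the warning rate in the last hour exceeds a threshold.
import subprocess
import sys

THRESHOLD = 20   # warnings per hour; illustrative, the observed peak was 84

out = subprocess.run(
    ["journalctl", "-u", "openqa-webui", "--since=-1h",
     "-g", "Unable to wakeup scheduler"],
    capture_output=True, text=True,
).stdout

count = sum("Unable to wakeup scheduler" in line for line in out.splitlines())
if count > THRESHOLD:
    print(f"ALERT: {count} scheduler wakeup timeouts in the last hour", file=sys.stderr)
    sys.exit(1)   # a monitoring wrapper (cron MAILTO, systemd OnFailure, ...) raises the alert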

Actions #5

Updated by livdywan 8 months ago

> @osukup already prepared an alert about the number of jobs: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/973

Looks like it's effective. The alert was active between Friday and a moment ago.

Actions #7

Updated by okurz 8 months ago

  • Status changed from New to Feedback
  • Assignee set to okurz
Actions #8

Updated by okurz 8 months ago

  • Copied to action #135578: Long job age and jobs not executed for long size:M added
Actions #9

Updated by kraih 8 months ago

Actually, after looking into this a bit I think the "Unable to wakeup scheduler: Request timeout" log messages are a red herring here. Because of how the scheduler is designed, using a timer that can prevent I/O events from getting handled, multiple wakeups in a short timeframe will trigger the reschedule often enough to make some of them exceed their request timeout. That is actually fine, but it also means these messages are not a good indicator of performance issues in other areas of the code, as we had previously been assuming.

What we should be looking out for instead is how long the reschedule runs take. Any outliers there would be much more indicative of real problems.
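
A toy model of that mechanism in Python (openQA's scheduler is Perl/Mojolicious, so none of the code below is openQA code): a synchronous reschedule tick blocks the event loop, so a wakeup request arriving during the tick hits its client-side timeout even though the scheduler is healthy.

import asyncio
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 9527    # arbitrary port for the toy "scheduler"

async def handle_wakeup(reader, writer):
    # In openQA this would trigger a reschedule; here we just acknowledge.
    try:
        writer.write(b"ok\n")
        await writer.drain()
    except ConnectionError:
        pass                       # the client has already given up
    finally:
        writer.close()

def wakeup_client():
    # Stands in for a web UI process poking the scheduler with a short timeout.
    time.sleep(1)                  # arrives while the blocking tick is running
    try:
        with socket.create_connection((HOST, PORT), timeout=1) as sock:
            sock.settimeout(1)
            sock.recv(16)          # waits for the "ok" reply
        print("wakeup acknowledged")
    except socket.timeout:
        print("[warn] Unable to wakeup scheduler: Request timeout (toy client)")

async def main():
    server = await asyncio.start_server(handle_wakeup, HOST, PORT)
    loop = asyncio.get_running_loop()
    # The periodic reschedule, modelled as a synchronous 3 s callback that
    # blocks the whole event loop while it runs.
    loop.call_later(0.5, time.sleep, 3)
    threading.Thread(target=wakeup_client, daemon=True).start()
    async with server:
        await asyncio.sleep(4)     # keep the loop alive long enough to observe the effect

asyncio.run(main())

The client gives up not because scheduling is broken but because the loop happened to be busy at that moment; timing the reschedule runs themselves, as suggested above, separates the two cases.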

Actions #10

Updated by okurz 8 months ago

  • Status changed from Feedback to New

OK, I will try to reduce the severity of the "request timeout" log message to "debug" or "trace" level.

Actions #11

Updated by okurz 8 months ago

  • Status changed from New to Feedback
Actions #12

Updated by okurz 8 months ago

  • Status changed from Feedback to Blocked
Actions #13

Updated by okurz 8 months ago

  • Related to action #135632: "Mojo::File::spurt is deprecated in favor of Mojo::File::spew" breaking os-autoinst OBS build and osd-deployment size:M added
Actions #14

Updated by okurz 7 months ago

  • Status changed from Blocked to Resolved

Unblocked, https://github.com/os-autoinst/openQA/pull/5308 was just merged. That was the last remaining task. The alert itself was already in place before.
