action #135380
closed
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert
A significant number of scheduled jobs with one or two running triggers an alert
Description
Motivation
#135122 was discovered through user feedback rather than through our alerting. We should ensure we either adjust existing alerts accordingly or add one that can detect this issue early.
Acceptance criteria
- AC1: A significant number of scheduled jobs with one or two running triggers an alert
Suggestions
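A possible starting point for AC1, only a sketch: poll the openQA job API and compare the number of scheduled jobs against the number of running ones. The host, the thresholds and the exact API usage below are assumptions rather than agreed values, and the actual alert may well be implemented on the monitoring side instead (see the salt-states merge request referenced in the comments).

```python
#!/usr/bin/env python3
"""Sketch for AC1: warn when many jobs are scheduled but almost none run.
Host and thresholds are placeholders, not agreed values."""

import sys

import requests  # third-party: `pip install requests`

OPENQA_URL = "https://openqa.suse.de"  # assumed target instance
MIN_SCHEDULED = 1000                   # "significant number" -- placeholder
MAX_RUNNING = 2                        # "one or two running"


def count_jobs(state):
    # /api/v1/jobs accepts a state filter; verify against the deployed API
    res = requests.get(f"{OPENQA_URL}/api/v1/jobs",
                       params={"state": state}, timeout=30)
    res.raise_for_status()
    return len(res.json().get("jobs", []))


scheduled = count_jobs("scheduled")
running = count_jobs("running")
if scheduled >= MIN_SCHEDULED and running <= MAX_RUNNING:
    print(f"ALERT: {scheduled} jobs scheduled but only {running} running")
    sys.exit(1)
print(f"OK: {scheduled} scheduled, {running} running")
```

If matching job-state data is already collected for existing dashboards, building the alert on that instead of a separate poller would avoid extra load on the web UI.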
Files
Updated by tinita over 1 year ago
- File wakeup-scheduler.ods added
I mentioned it before, but I think I wasn't clear enough.
We have these warnings in the osd webui journal:
Sep 08 09:40:36 openqa openqa[2649]: [warn] [pid:2649] Unable to wakeup scheduler: Request timeout
Sep 08 09:43:13 openqa openqa[2649]: [warn] [pid:2649] Unable to wakeup scheduler: Request timeout
Sep 08 09:44:11 openqa openqa[22509]: [warn] [pid:22509] Unable to wakeup scheduler: Request timeout
Sep 08 09:44:25 openqa openqa[5400]: [warn] [pid:5400] Unable to wakeup scheduler: Request timeout
Sep 08 09:45:07 openqa openqa[25792]: [warn] [pid:25792] Unable to wakeup scheduler: Request timeout
Sep 08 09:46:34 openqa openqa[5400]: [warn] [pid:5400] Unable to wakeup scheduler: Request timeout
Sep 08 09:47:05 openqa openqa[7085]: [warn] [pid:7085] Unable to wakeup scheduler: Request timeout
Sep 08 09:51:33 openqa openqa[5378]: [warn] [pid:5378] Unable to wakeup scheduler: Request timeout
Sep 08 09:51:51 openqa openqa[24950]: [warn] [pid:24950] Unable to wakeup scheduler: Request timeout
Sep 08 09:52:06 openqa openqa[28278]: [warn] [pid:28278] Unable to wakeup scheduler: Request timeout
Sep 08 09:52:20 openqa openqa[12326]: [warn] [pid:12326] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:11 openqa openqa[2519]: [warn] [pid:2519] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[2521]: [warn] [pid:2521] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[5378]: [warn] [pid:5378] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[622]: [warn] [pid:622] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[11005]: [warn] [pid:11005] Unable to wakeup scheduler: Request timeout
Sep 08 09:54:12 openqa openqa[5400]: [warn] [pid:5400] Unable to wakeup scheduler: Request timeout
Sep 08 09:57:57 openqa openqa[2521]: [warn] [pid:2521] Unable to wakeup scheduler: Request timeout
Sep 08 10:15:08 openqa openqa[2521]: [warn] [pid:2521] Unable to wakeup scheduler: Request timeout
And this happens all the time.
It's a bit hard to see the time distribution in the journal, so I put a count of that warning per hour into a spreadsheet and marked the time frames where we saw no new jobs scheduled with a gray background.
Although the counts don't always correlate with those time frames, I think it shows that the two are connected.
Note that on o3 we only see that warning about once per day.
So maybe rather than trying to figure out how to measure scheduled and running jobs, we should create an alert on such warnings instead.
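A minimal sketch of that per-hour count, fed for example with `journalctl -u openqa-webui --since today` (the unit name and invocation are assumptions, adjust as needed):

```python
#!/usr/bin/env python3
"""Count "Unable to wakeup scheduler" warnings per hour from journal output
piped in on stdin, roughly what the attached spreadsheet does by hand."""

import sys
from collections import Counter

per_hour = Counter()
for line in sys.stdin:
    if "Unable to wakeup scheduler" not in line:
        continue
    # default journal format: "Sep 08 09:40:36 host unit[pid]: message"
    month, day, clock = line.split()[:3]
    per_hour[f"{month} {day} {clock.split(':')[0]}:00"] += 1

# alphabetical sort is fine within one month; good enough for a quick look
for slot, count in sorted(per_hour.items()):
    print(f"{slot}  {count}")
```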
Updated by livdywan over 1 year ago
@osukup already prepared an alert about the number of jobs: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/973
Imho we should still have an alert examining the logs, though. Up to 84 such messages per hour is not something we should ignore.
Updated by livdywan over 1 year ago
Looks like the alert is effective. It was active between Friday and a moment ago.
Updated by tinita over 1 year ago
Updated by okurz over 1 year ago
- Status changed from New to Feedback
- Assignee set to okurz
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/980 to fix the wording
Updated by okurz over 1 year ago
- Copied to action #135578: Long job age and jobs not executed for long size:M added
Updated by kraih over 1 year ago
Actually, after looking into this a bit I think the "Unable to wakeup scheduler: Request timeout" log messages are a red herring here. Because of how the scheduler is designed, using a timer that can prevent I/O events from getting handled, multiple wakeups in a short timeframe will trigger the reschedule often enough that some of them exceed their request timeout. That is actually fine, but it also means these messages are not a good indicator for performance issues in other areas of the code, as we have been assuming.
What we should be looking at instead is how long the reschedule runs take. Any outliers there would be much more indicative of real problems.
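For illustration, a rough sketch of spotting such outliers, assuming the scheduler logs a line containing something like "Scheduler took <seconds>s" per reschedule run; the exact wording and the thresholds are assumptions and would need to be checked against the real openqa-scheduler journal:

```python
#!/usr/bin/env python3
"""Flag unusually long reschedule runs based on durations parsed from the
scheduler journal on stdin. Log pattern and thresholds are assumptions."""

import re
import statistics
import sys

DURATION_RE = re.compile(r"Scheduler took (\d+(?:\.\d+)?)s")  # assumed wording

durations = [float(m.group(1)) for line in sys.stdin
             if (m := DURATION_RE.search(line))]
if len(durations) < 2:
    sys.exit("not enough samples to judge outliers")

mean = statistics.mean(durations)
threshold = max(mean + 3 * statistics.stdev(durations), 30.0)  # 30s floor is arbitrary
outliers = [d for d in durations if d > threshold]
print(f"{len(durations)} runs, mean {mean:.2f}s, threshold {threshold:.2f}s, "
      f"{len(outliers)} outlier(s): {outliers}")
```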
Updated by okurz over 1 year ago
- Status changed from Feedback to New
OK, I will try to reduce the severity of the "request timeout" log message to "debug" or "trace" level
Updated by okurz over 1 year ago
- Status changed from New to Feedback
Updated by okurz about 1 year ago
- Status changed from Feedback to Blocked
Updated by okurz about 1 year ago
- Related to action #135632: "Mojo::File::spurt is deprecated in favor of Mojo::File::spew" breaking os-autoinst OBS build and osd-deployment size:M added
Updated by okurz about 1 year ago
- Status changed from Blocked to Resolved
Unblocked: https://github.com/os-autoinst/openQA/pull/5308 was just merged. That was the last remaining task. The alert was already in place before.