action #158125

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #158110: [epic] Prevent worker overload

typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M

Added by okurz 9 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In #158104 we observed typing issues due to mania being overloaded. mania was configured to run 30 openQA worker instances, and that was mostly fine, as proven in #139271-24. The recent overload was likely triggered by enabling video again as part of #157636. I already reduced the number of worker instances, but this has the drawback that the long test backlog again takes longer to finish. We should be more flexible in using the available resources. Here I suggest implementing a check in the worker to only pick up new jobs if the CPU load is below a configured threshold.
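
As a rough sketch of the idea (not the actual Perl worker code; the setting name, the default of 40 and the use of the 15-minute load average are taken from what was implemented later in this ticket):

    # Hypothetical shell sketch of the decision the worker should make before
    # accepting a new job: compare the 15-minute load average against a
    # configurable threshold.
    threshold=${CRITICAL_LOAD_AVG_THRESHOLD:-40}
    load15=$(awk '{print $3}' /proc/loadavg)
    if awk -v l="$load15" -v t="$threshold" 'BEGIN {exit !(l+0 < t+0)}'; then
        echo "load $load15 below $threshold: OK to pick up (or start) a new job"
    else
        echo "load $load15 at/above $threshold: do not pick up new jobs"
    fi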

Acceptance criteria

  • AC1: An openQA worker does not start an openQA job if the CPU load is higher than the configured threshold
  • AC2: By default the worker still picks up jobs if the load is not too high

Suggestions

Out of scope

  • Consider the existing Grafana monitoring for "broken workers" if we use that feature to declare workers as "broken" due to too high CPU load

Related issues 4 (2 open, 2 closed)

Copied from openQA Infrastructure (public) - action #158104: typing issue on ppc64 worker size:S (Resolved, okurz, 2024-03-27)

Copied to openQA Infrastructure (public) - action #158709: typing issue on ppc64 worker - with automatic CPU load based limiting in place let's increase the instances on mania again (New)

Copied to openQA Project (public) - action #158910: typing issue on ppc64 worker - reconsider number of worker instances in particular on ppc64le kvm tests size:M (Blocked, okurz)

Copied to openQA Project (public) - action #168244: reconsider load calculation for worker load limit especially for ppc size:S (Resolved, okurz)

Actions #1

Updated by okurz 9 months ago

  • Copied from action #158104: typing issue on ppc64 worker size:S added
Actions #2

Updated by okurz 9 months ago

  • Subject changed from typing issue on ppc64 worker - only pick up new jobs if CPU load is below configured threshold to typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold
  • Description updated (diff)
Actions #3

Updated by okurz 9 months ago

  • Project changed from openQA Infrastructure (public) to openQA Project (public)
  • Category changed from Feature requests to Feature requests
Actions #4

Updated by okurz 9 months ago

  • Subject changed from typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold to typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #6

Updated by mkittler 9 months ago

  • Status changed from In Progress to Feedback
Actions #7

Updated by mkittler 9 months ago

  • Status changed from Feedback to In Progress
Actions #8

Updated by mkittler 9 months ago

  • Status changed from In Progress to Feedback

The PR was merged. I configured a threshold of 40 on the o3 workers and created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/769 for the OSD workers.

Let's see whether we'll have worker slots showing up as broken with that. Maybe not, because running into too high a load is currently mitigated by the reduced number of worker slots and by avoiding extensive video encoding. Maybe we'll have to tweak the thresholds later when working on #157636 for them to be actually effective/helpful.
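
For reference, this corresponds to the following entry in /etc/openqa/workers.ini on each worker host (the exact section placement, shown here as [global], is an assumption):

    [global]
    # Worker slots show up as "broken" and do not accept new jobs while the
    # load average is at or above this value.
    CRITICAL_LOAD_AVG_THRESHOLD = 40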

Actions #9

Updated by okurz 9 months ago

  • Due date set to 2024-04-19

I don't think ppc64le machines should be treated any differently, as was done in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/769. I would use "40" as a default everywhere, at least within o3+osd, possibly already upstream in openQA.

Actions #10

Updated by mkittler 9 months ago · Edited

I added an MR for configuring 40 on OSD (https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1141 and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/772) and configured it on the o3 workers via:

    for i in $hosts; do echo $i && ssh root@$i "sed -i -e '/CACHELIMIT/a CRITICAL_LOAD_AVG_THRESHOLD = 40' /etc/openqa/workers.ini"; done

Actions #11

Updated by okurz 9 months ago

We found that most of our systems stay below the load value of 40, but I found worker-arm-1 reaching into the range of 65. During the timeframe when the load was above the threshold, 2024-04-07 20:10Z until 2024-04-07 22:45Z, we saw multiple test issues which could very well be related to too high a system load, among them mistyping, hanging keys, network communication problems and timeouts, e.g.:

https://github.com/os-autoinst/openQA/pull/5567 for a generic upstream default.

Actions #12

Updated by mkittler 9 months ago

The change is effective and we got two alerts about it today. One would be sufficient, so I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1144.

Actions #13

Updated by okurz 9 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1144 was merged, but can we consider something better than the state/term "broken" for states that are expected to be self-resolving? How about putting the worker into the state "Working"? It is not actually idle because it has picked up a job or is doing something and cannot currently pick up other jobs, but it is also not "broken" in the sense of needing help from an admin.

Actions #14

Updated by okurz 9 months ago

  • Copied to action #158709: typing issue on ppc64 worker - with automatic CPU load based limiting in place let's increase the instances on mania again added
Actions #15

Updated by mkittler 9 months ago · Edited

Reworking the worker's states is probably out of scope for this ticket. I am not sure whether "Working" would be the best idea; one would have to check in the code what other assumptions are associated with "Working". (In the worst case we need yet another state. Currently "Broken" is already abused even for graceful disconnects, which even causes occasional display bugs. So we should probably reconsider the design and perhaps add another explicit database column for the state. I just didn't do it back then because it was easier for the code to pass review without a database migration. EDIT: Looks like I've already investigated the problem with workers showing up as broken with the reason "graceful disconnect". The relevant ticket is #134924 and should probably be taken into account when changing the code handling the worker's state.)

Actions #16

Updated by mkittler 8 months ago · Edited

I am currently looking at the situation on petrol where the CPU load alert fired earlier today. Right now all slots are actively working on jobs and the CPU load exceeds the configured threshold of 40 by a lot (it is around 76). I've just started another slot to see whether the threshold is effective at all, and it is. The newly started slot did not pick up a job and is shown as broken in the web UI with the expected values for the load and the configured threshold. So the feature works; it is just that it might not be what we want. By only looking at the average load over the last 15 minutes we probably create the following situation:

  1. The worker has just been started and no jobs are running. So the load is very low.
  2. Then jobs are assigned on all worker slots because the load is very low.
  3. Then the load goes up (overshooting the threshold by a lot). This doesn't change anything because all slots are already running jobs (which we don't interrupt), so no new jobs would be started anyway.
  4. If just one job now finishes early, the new feature probably makes a difference: we'd get one less occupied slot because, due to the high load, no new job would be picked up immediately. That is probably a rare case.
  5. Probably the jobs finish around the same time. For a certain time window we'd still not pick up any new jobs because it takes a while for the avg. load over the last 15 minutes to drop below the threshold (especially because in 3. we overshot the threshold by a lot). This probably creates a big time window in which almost all slots become idle again.
  6. The cycle repeats at 1., and (apart from 4.) we haven't gained anything.

We could change the feature to look at the average load over the last minute. That wouldn't change much until 5., where the window would be smaller, so maybe we'd then start to pick up new jobs sooner instead of waiting quite long and then starting new jobs on all slots at the same time. So the cycle would be less likely to repeat. That's probably still not smart enough.

EDIT: When I started writing this comment we were at step 3. It looks like we are now at step 5: almost all worker slots are "broken" because the avg. load over the last 15 minutes is still over 40 (it is around 50). It looks like the remaining 2 jobs will have just enough time to finish before we are below 40 again. Then we will start over at step 1.
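
A quick way to watch this decay directly on the worker host (plain procfs, nothing openQA-specific):

    # Fields of /proc/loadavg: load1 load5 load15 running/total last_pid.
    # After the jobs finish, load1 drops quickly while load15 stays above the
    # threshold for a while, which is exactly the window described in step 5.
    watch -n 30 cat /proc/loadavg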

So before we do any out-of-scope tweaks to how the worker status is handled (#158125#note-15) we should rather think about how to make this work at all.

Actions #17

Updated by okurz 8 months ago

Well, that's why we have the gap of 40-60 between not picking up new jobs and alerting. As we have seen quite a few issues with openQA tests in the 50-70 load range, we should keep the load lower. I see two tasks:

  1. Try a lower load limit, e.g. 20, in particular on PowerNV machines, i.e. mania, diesel, petrol
  2. Investigate what increases the load so significantly. Is it the external video encoder?
Actions #18

Updated by okurz 8 months ago

  • Copied to action #158910: typing issue on ppc64 worker - reconsider number of worker instances in particular on ppc64le kvm tests size:M added
Actions #19

Updated by okurz 8 months ago · Edited

We discussed it on multiple occasions today and came up with the following ideas after observing that e.g. petrol reaches load values like a load1 of 126 (!):

  1. Introduce a random time back-off when restarting or looking for new jobs, e.g. [1m:15m], but we think that wouldn't help because initially too many jobs can be picked up at the same time while the system load is still low. https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker/WebUIConnection.pm#L349 already has a randomized back-off. We could tweak the OPENQA_WORKER_STATUS_MAX_INTERVAL variable (see the sketch after this list).
  2. As our documentation states, we should reserve 2 logical CPU cores per openQA worker instance. For petrol, running PowerNV with 16 effective cores (SMT disabled), this means that 8 instances do not leave any room for other processes, so we should reduce that number. Will do that in #158910
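
A hedged sketch of how idea 1 could be tried without code changes, assuming the worker reads these values from the environment (as the names suggest), interprets them as seconds, and runs as the usual openqa-worker@.service instances:

    # Hypothetical systemd drop-in widening the randomized back-off between
    # worker status updates; the unit name, the chosen values and their
    # interpretation as seconds are assumptions, not confirmed in this ticket.
    mkdir -p /etc/systemd/system/openqa-worker@.service.d
    printf '%s\n' '[Service]' \
        'Environment=OPENQA_WORKER_STATUS_MIN_INTERVAL=20' \
        'Environment=OPENQA_WORKER_STATUS_MAX_INTERVAL=300' \
        > /etc/systemd/system/openqa-worker@.service.d/status-interval.conf
    systemctl daemon-reload
    # then restart the openqa-worker@* instances for the change to take effect
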
Actions #20

Updated by mkittler 8 months ago · Edited

  1. Judging by the code, changing OPENQA_WORKER_STATUS_MIN_INTERVAL and OPENQA_WORKER_STATUS_MAX_INTERVAL would in fact randomize the delay between the checks while the worker is in the broken state. While it wouldn't help with 1., it would at least help with 5./6., although we'd probably still overshoot the threshold by quite a bit. Not sure whether that's worth it.
  2. Ok, I suppose reducing the number of worker slots would indeed make sense, also in light of #157636#note-7. TL;DR: we were probably even just using libtheora on petrol (and not an actually CPU-intensive codec like VP9).
Actions #21

Updated by okurz 8 months ago

  • Due date deleted (2024-04-19)
  • Status changed from Feedback to Resolved

I'd say we are good here. Both ACs are fulfilled, and while it's no silver bullet it might help :)

Actions #22

Updated by AdamWill 6 months ago

Uh. It seems weird to set a global default for this, as https://github.com/os-autoinst/openQA/commit/34ed70d80e5cbc90960fedeb7ac17006134049a5 did. Doesn't load average relate to CPU count? If the system has 1 CPU, a load average of 40 would be waaaaay too high. But if the system has 64 CPUs, a load average of 40 is totally fine.

Should this mechanism try to use a ratio of load average to CPU count (though determining that reliably can be tricky)? Or get rid of the universal default?
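
A per-host value along those lines could be derived from the CPU count when choosing CRITICAL_LOAD_AVG_THRESHOLD; the factor below is purely illustrative, not something proposed in this ticket:

    # Derive a host-specific threshold as a multiple of the logical CPU count,
    # e.g. "tolerate a load of up to 2.5 per logical CPU" (factor is made up).
    cpus=$(nproc)
    threshold=$(awk -v c="$cpus" 'BEGIN {printf "%d", c * 2.5}')
    echo "CRITICAL_LOAD_AVG_THRESHOLD = $threshold"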

Actions #23

Updated by okurz 6 months ago · Edited

The load is a system load, not a CPU load. So if your bottleneck is the CPU then the limit of 40 possibly isn't useful; otherwise I think it's better than nothing. Recently, IO was much more of a bottleneck for us. But if you see actual problems then we can consider removing the default again.

Actions #24

Updated by AdamWill 4 months ago

Well, this threshold definitely throttled our aarch64 worker host, which runs 35 concurrent instances. About a third of them were doing nothing and showing as 'broken' in the admin interface; the logs showed the load was too high. I bumped the threshold to 60 in the config and it's still hitting it, so now I'm trying 70.

In fairness, it's possible this is affecting test reliability; we do get flaky failures on that host, but I've never had time to see if they're related to system load or not (we've pretty much always had flaky failures on aarch64, whatever worker host HW we had at the time). I might try cutting it down to 20 workers for a bit and see if it makes a difference.

Actions #25

Updated by okurz 2 months ago

  • Copied to action #168244: reconsider load calculation for worker load limit especially for ppc size:S added