action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #158125

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #158110: [epic] Prevent worker overload

typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M

Added by okurz about 1 year ago. Updated 9 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Feature requests

Target version:

Ready

Start date:

Due date:

% Done:

Estimated time:

Description

Motivation¶

In #158104 we observed typing issues due to mania being overloaded. mania was configured to run 30 openQA worker instances and that was mostly fine as proven in #139271-24. The recent overload was likely triggered by enabling video again as part of #157636. I already reduced the number of worker instances. But this has the drawback that again the long test backlog takes longer to be finished. We should be more flexible in using available ressource. Here I suggest to implement a check in the worker to only pick up new jobs if CPU load is below a configured threshold.

Acceptance criteria¶

AC1: An openQA worker does not start an openQA job if the CPU load is higher than configured threshold
AC2: By default the worker still picks up jobs if the load is not too high

Suggestions¶

Possibly the worker code somewhere in https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker.pm#L472 can be extended to check the cpu load, e.g. "load15", in GNU/Linux, on the openQA worker and if it exceeds a (configurable) threshold then skip picking up any next job
Or the openQA worker then decides to not even advertise itself, i.e. not connect or disconnect from the webUI instance
Add a sensible disabled default value in https://github.com/os-autoinst/openQA/blob/master/etc/openqa/workers.ini with an explanation comment

Out of scope¶

Consider the existing grafana monitoring for "broken workers" if we use that feature of declaring as "broken" due to too high CPU load

Related issues 4 (2 open — 2 closed)

Actions

Copy link

Updated by okurz about 1 year ago

Copied from action #158104: typing issue on ppc64 worker size:S added

Actions

Copy link

Updated by okurz about 1 year ago

Subject changed from typing issue on ppc64 worker - only pick up new jobs if CPU load is below configured threshold to typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold
Description updated (diff)

Actions

Copy link

Updated by okurz about 1 year ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Category changed from Feature requests to Feature requests

Actions

Copy link

Updated by okurz about 1 year ago

Subject changed from typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold to typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

Updated by mkittler about 1 year ago

Status changed from Workable to In Progress
Assignee set to mkittler

Actions

Copy link

Updated by mkittler about 1 year ago

Status changed from In Progress to Feedback

PR: https://github.com/os-autoinst/openQA/pull/5565

Actions

Copy link

Updated by mkittler about 1 year ago

Status changed from Feedback to In Progress

Actions

Copy link

Updated by mkittler about 1 year ago

Status changed from In Progress to Feedback

The PR was merged and I configured a threshold of 40 on o3 workers and created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/769 for OSD workers.

Let's see whether we'll have worker slots showing up as broken with that. Maybe not because running into too high load is currently mitigated by reducing the number of worker slots and avoiding extensive video encoding. Maybe we'll have to tweak the thresholds later when working on #157636 to be actually effective/helpful.

Actions

Copy link

Updated by okurz about 1 year ago

Due date set to 2024-04-19

I don't think ppc64le machines should be treated any different like in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/769 . I would use "40" as a default everywhere, at least within o3+osd, possibly already upstream in openQA

Actions

Copy link

#10

Updated by mkittler about 1 year ago · Edited

I added a MR for configuring 40 on OSD (~~https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1141~~ https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/772) and configured it on o3 workers via for i in $hosts; do echo $i && ssh root@$i "sed -i -e '/CACHELIMIT/a CRITICAL_LOAD_AVG_THRESHOLD = 40' /etc/openqa/workers.ini " ; done.

Actions

Copy link

#11

Updated by okurz about 1 year ago

We found that most of our systems are below the load value of 40 but I have found worker-arm-1 which reaches into the range of 65. During that timeframe when the load was above, 2024-04-07 20:10Z until 2024-04-07 22:45Z, we found multiple test issues which could very well be related to a too high system load, among them mistyping, hanging keys, network communication problems and timeouts, e.g.:

https://github.com/os-autoinst/openQA/pull/5567 for a generic upstream default.

Actions

Copy link

#12

Updated by mkittler about 1 year ago

The change is effective and we got two alerts about it today. One would be sufficient so I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1144.

Actions

Copy link

#13

Updated by okurz about 1 year ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1144 merged but can we consider something better than the state/term "broken" for states that are expected to be self-resolving? How about having the worker in state "Working" because it's not actually idle but picked up a job or is doing something and can not currently pick up other jobs but it's also not "broken" as in needs help from an admin?

Actions

Copy link

#14

Updated by okurz about 1 year ago

Copied to action #158709: typing issue on ppc64 worker - with automatic CPU load based limiting in place let's increase the instances on mania again added

Actions

Copy link

#15

Updated by mkittler about 1 year ago · Edited

Reworking the worker's states is probably out of scope for this ticket. Not sure whether "Working" would be the best idea; one had to check the code what other assumptions are associated with "Working". (In the worst case we need yet another state. Currently "Broken" is already abused even for graceful disconnects even causing occasional displaying bugs. So we should probably reconsider the design and perhaps add another explicit database column for the state. I just didn't do it back then because it was easier for the code to pass review without a database migration. EDIT: Looks like I've already investigated the problem with workers showing up as broken with the reason "graceful disconnect". The relevant ticket is #134924 and should probably be taken into account when changing the code handling the worker's state.)

Actions

Copy link

#16

Updated by mkittler about 1 year ago · Edited

I am currently looking at the situation at petrol were the CPU load alert fired earlier today. Right now all slots are actively working on jobs and the CPU load exceeds the configured load of 40 by a lot (it is around 76). I've just started another slot to see whether the threshold is effective at all and it is. The newly started slot did not pick-up a job and is shown as broken in the web UI with the expected values for the load and configured threshold. So the feature works. It is just that it might not be what we want. By just looking at the average load in the last 15 minutes we probably create the following situation:

The worker has just been started and no jobs are running. So the load is very low.
Then jobs are assigned on all worker slots because the load is very low.
Then the load goes up (overshooting the threshold by a lot). This doesn't change anything because all slots are already running jobs (which we don't interrupt) so no new jobs wouldn't run anyway.
If now just one job finishes early then the new feature probably makes a difference. We'd get one less occupied slot because due to the high load no new job would be picked up immediately. That is probably a rare case.
Probably the jobs finish around the same time. In a certain time window we'd still not pick-up any new jobs because it takes a while for the avg. load over the last 15 minutes to go below the threshold (especially because in 3. we overshot the threshold by a lot). This probably creates a big time window for almost all slots becoming idle again.
The cycle repeats at 1. and (besides 4.) we haven't gained anything.

We could change the feature to look at the average load of the last minute. That wouldn't change much until 5. where the window would be smaller so maybe the we'd then started to pick-up new jobs sooner instead of waiting quite long and then start new jobs on all slots at the same time. So the cycle would less likely repeat. That's probably still not smart enough.

EDIT: When I started writing this comment we were at step 3. It looks like now we are at step 5; so almost all worker slots are "broken" because the avg. load over the last 15 minutes is still over 40 (it is around 50). It looks like the remaining 2 jobs will have just enough time to finish until we are below the 40 again. Then we will start over with step 1.

So before we do any out-of-scope tweaks on how the worker status is handled (#158125#note-15) we should rather think on how to make this work at all.

Actions

Copy link

#17

Updated by okurz about 1 year ago

Well, that's why we have the step of 40-60 between not picking up new jobs and alerting. As we have seen quite some issues with openQA tests in the 50-70 load range we should keep the load lower. I see two tasks:

Try with lower load limit, e.g. 20, in particular in PowerNV machines, i.e. mania, diesel, petrol
Investigate what increases the load so significantly. Is it the external video encoder?

Actions

Copy link

#18

Updated by okurz about 1 year ago

Copied to action #158910: typing issue on ppc64 worker - reconsider number of worker instances in particular on ppc64le kvm tests size:M added

Actions

Copy link

#19

Updated by okurz about 1 year ago · Edited

We discussed it on multiple occassions today and came up with the following ideas after observing that e.g. petrol comes into the range of load values like load1 of 126 (!):

Introduce a random time back-off when restarting or looking for new jobs, e.g. [1m:15m] but we think that wouldn't help because initially too many jobs can be picked up at the same time already when the system load is still low. https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker/WebUIConnection.pm#L349 already has a randomized back-off. We could tweak the OPENQA_WORKER_STATUS_MAX_INTERVAL variable
As our documentation states we should reserve 2 logical CPU cores per openQA worker instance. This means for petrol running PowerNV with 16 effective cores (SMP disabled) that 8 instances do not leave any room for other processes so we should reduce that. Will do that in #158910

Actions

Copy link

#20

Updated by mkittler about 1 year ago · Edited

Judging by the code changing OPENQA_WORKER_STATUS_MIN_INTERVAL and OPENQA_WORKER_STATUS_MAX_INTERVAL would in fact randomize the delay between the checks while the worker is in the broken state. While it wouldn't help with 1. it would at least help with 5./6. - although we'd probably still overshoot the threshold by quite a bit. Not sure whether that's worth it.
Ok, I suppose reducing the number of worker slots would make sense, indeed. Also in the light of #157636#note-7 - TLDR: We were probably even just using libtheora on petrol (and not an actually CPU intensive codec like VP9).

Actions

Copy link

#21

Updated by okurz about 1 year ago

Due date deleted (~~2024-04-19~~)
Status changed from Feedback to Resolved

I'd say we are good here. Both ACs are fulfilled and while it's no silver bullet it might help :)

Actions

Copy link

#22

Updated by AdamWill 11 months ago

Uh. It seems weird to set a global default for this, as https://github.com/os-autoinst/openQA/commit/34ed70d80e5cbc90960fedeb7ac17006134049a5 did. Doesn't load average relate to CPU count? If the system has 1 CPU, a load average of 40 would be waaaaay too high. But if the system has 64 CPUs, a load average of 40 is totally fine.

Should this mechanism try to use a ratio of load average to CPU count (though determining that reliably can be tricky)? Or get rid of the universal default?

Actions

Copy link

#23

Updated by okurz 11 months ago · Edited

The load is a system load, not CPU load. So if your bottleneck is CPU then possibly the limit of 40 would not be useful otherwise I think it's better than nothing. Recently IO was much more of a bottleneck for us. But if you see actual problems then we can consider removing the default again

Actions

Copy link

#24

Updated by AdamWill 9 months ago

well, this threshold definitely throttled our aarch64 worker host, which runs 35 concurrent instances. About 1/3rd of them were doing nothing and showing as 'broken' in the admin interface; the logs showed the load was too high. I bumped the threshold to 60 in config and it's still hitting it, so now I'm trying 70.

In fairness, it's possible this is affecting test reliability, we do get flaky failures on that host, but I've never had time to see if they're related to system load or not (we've pretty much always got flakey failures on aarch64, whatever worker host HW we had at the time). I might try cutting it down to 20 workers for a bit and see if it makes a difference.

Actions

Copy link

#25

Updated by okurz 7 months ago

Copied to action #168244: reconsider load calculation for worker load limit especially for ppc size:S added

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #158125

typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M

Motivation¶

Acceptance criteria¶

Suggestions¶

Out of scope¶

Updated by okurz about 1 year ago

Updated by okurz about 1 year ago

Updated by okurz about 1 year ago

Updated by okurz about 1 year ago

Updated by mkittler about 1 year ago

Updated by mkittler about 1 year ago

Updated by mkittler about 1 year ago

Updated by mkittler about 1 year ago

Updated by okurz about 1 year ago

Updated by mkittler about 1 year ago · Edited

Updated by okurz about 1 year ago

Updated by mkittler about 1 year ago

Updated by okurz about 1 year ago

Updated by okurz about 1 year ago

Updated by mkittler about 1 year ago · Edited

Updated by mkittler about 1 year ago · Edited

Updated by okurz about 1 year ago

Updated by okurz about 1 year ago

Updated by okurz about 1 year ago · Edited

Updated by mkittler about 1 year ago · Edited

Updated by okurz about 1 year ago

Updated by AdamWill 11 months ago

Updated by okurz 11 months ago · Edited

Updated by AdamWill 9 months ago

Updated by okurz 7 months ago