action #162521

open

coordination #162524: [epic] Optimized o3 infrastructure

Reconsider the global job limit on o3, try higher than 170

Added by okurz 28 days ago. Updated 28 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2024-06-19
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In #151807-10 the global job limit on o3 was set to 170. The previous limit wasn't mentioned, so I don't know what it was, but I assume it was much higher. 170 is rather low, considering that we have so many worker instances available. #151807 saw multiple changes, so I assume we can actually use a much higher job limit again.

Acceptance criteria

  • AC1: The global job limit on o3 is significantly higher than 170 or blocking improvement tasks are planned

Suggestions

  • Understand why 170 jobs was originally chosen as the limit
  • Carefully increase the job limit and monitor over at least 10 days
  • Try to find a hard upper limit and select a job limit below that with a sane buffer
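
The last two suggestions could be approached roughly as follows. This is a minimal sketch, assuming we note for each tried limit how long it was observed and whether load stayed acceptable. The `Trial` record, the function names, the 10% buffer and any sample values are hypothetical and not taken from o3 measurements or from openQA itself.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One observation period with a given global job limit (hypothetical model)."""
    job_limit: int   # value tried for the global job limit
    days: int        # how many days the limit was observed
    healthy: bool    # True if load stayed acceptable for the whole period


def highest_healthy_limit(trials, min_days=10):
    """Return the highest limit that stayed healthy for at least min_days
    and lies below every limit that caused problems, or None if there is none."""
    failing = [t.job_limit for t in trials if not t.healthy]
    ceiling = min(failing, default=float("inf"))
    candidates = [
        t.job_limit
        for t in trials
        if t.healthy and t.days >= min_days and t.job_limit < ceiling
    ]
    return max(candidates, default=None)


def recommended_limit(trials, buffer_percent=10, min_days=10):
    """Pick a limit below the highest healthy one, leaving a safety buffer.

    buffer_percent is an assumed value; it should be chosen from real load data.
    """
    upper = highest_healthy_limit(trials, min_days)
    if upper is None:
        return None
    # Integer arithmetic avoids floating-point rounding surprises.
    return upper * (100 - buffer_percent) // 100
```

Trials observed for fewer than `min_days` are ignored on purpose, so a short quiet period does not count as evidence that a higher limit is safe.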

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #151807: [alert] o3 zabbix: Problem: /var/lib/snapshot-changes: Disk space is critically low (used > 94%) size:M | Resolved | tinita | 2023-11-30 | 2023-12-16
Related to openQA Infrastructure - action #138545: Munin - minion hook failed - opensuse.org :: openqa.opensuse.org size:S | Resolved | tinita | 2023-11-28
Actions #1

Updated by okurz 28 days ago

  • Related to action #151807: [alert] o3 zabbix: Problem: /var/lib/snapshot-changes: Disk space is critically low (used > 94%) size:M added
Actions #2

Updated by okurz 28 days ago

  • Parent task set to #121732
Actions #3

Updated by okurz 28 days ago

  • Parent task changed from #121732 to #162524
Actions #4

Updated by tinita 28 days ago · Edited

https://progress.opensuse.org/issues/151807#note-10 says:

"Also I set back max_running_jobs to 170. We lowered it to make sure the load is not too high so the cleanup job can finally finish."

That means "I set it back up to 170 from a lower limit".
170 is the limit we settled on; higher limits lead to more load and problems.

Also see https://progress.opensuse.org/issues/138545#note-20

Actions #5

Updated by tinita 28 days ago

  • Related to action #138545: Munin - minion hook failed - opensuse.org :: openqa.opensuse.org size:S added