action #135329

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

s390x work demand exceeds available workers

Added by ph03nix 8 months ago. Updated 8 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2023-09-07
Due date:
% Done: 0%
Estimated time:

Description

We're running into load issues with our s390x test runs and are falling behind on our product delivery.

For example, https://openqa.suse.de/tests/12027610 blocks BCI container releases and has been in the scheduling queue for 18 hours, whereas those updates are expected to leave QA within hours.

We kindly ask for a solution to this problem in a timely manner. We are obliged to deliver certain container updates within 24h, and not fulfilling this requirement can have a severe impact on some of our BCI contracts.

This is urgent.


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources (Resolved, mgrifalconi)

Actions #1

Updated by ph03nix 8 months ago

I will decrease the priority for the test runs in question as a quick fix, but I think we really need more workers in the long run.

Actions #2

Updated by okurz 8 months ago

  • Target version set to Ready

If you can help us raise the concern with SUSE-IT Eng-Infra and get more CPU+RAM assigned to the OSD VM, we can increase the number of workers.

Actions #4

Updated by okurz 8 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz

Felix filed https://sd.suse.com/servicedesk/customer/portal/1/SD-131786 for it. I shared it with "OSD Admins". Everyone else should just track this ticket, #135329.

Actions #5

Updated by ph03nix 8 months ago

I filed https://sd.suse.com/servicedesk/customer/portal/1/SD-131786 for it. Anyone who needs access, just ping me in Slack.

Actions #6

Updated by ph03nix 8 months ago

Most urgency is resolved for us now. Thanks for looking into this!

From my POV this ticket can be closed, unless you peeps need to have it open for further work.

Actions #7

Updated by okurz 8 months ago

Thanks for your explicit response. I would like to keep this ticket open at least as long as the SD ticket is open. But is the original issue regarding s390x jobs then really resolved? If yes, what would you say was the impact of you manually tweaking the job scheduling priorities?

Actions #8

Updated by okurz 8 months ago

  • Parent task set to #135122

Actions #9

Updated by ph03nix 8 months ago

okurz wrote in #note-7:

Thanks for your explicit response. I would like to keep this ticket open at least as long as the SD ticket is open. But is the original issue regarding s390x jobs then really resolved? If yes, what would you say was the impact of you manually tweaking the job scheduling priorities?

I'm not observing s390x blocking any ongoing issues at the moment. However, we only notice this when things are already on fire.

So the urgency of the task is gone, but I could not say with confidence that the load issue with s390x is resolved. I do, however, see ppc64le taking longer than other architectures.

Actions #10

Updated by okurz 8 months ago

  • Status changed from Blocked to Resolved

Ok, thx. https://sd.suse.com/servicedesk/customer/portal/1/SD-131786 was resolved: the OSD VM now has more CPU and more RAM. In a related ticket I commented that we removed the job limit again for now, so we can follow up there and resolve this ticket here.

Actions #11

Updated by okurz 7 months ago

  • Related to action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources added