action #135329

closed

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

s390x work demand exceeds available workers

Added by ph03nix over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Start date: 2023-09-07
Due date:
% Done: 0%
Estimated time:

Description

We're running into load issues with our s390x test runs and are falling behind on our product delivery.

For example, https://openqa.suse.de/tests/12027610 blocks BCI container releases and has been in the scheduling queue for 18 hours. However, those updates are expected to leave QA within hours.

We kindly ask for a solution to this problem in a timely manner. We are obliged to deliver certain container updates within 24h, and not fulfilling this requirement can have a severe impact on some of our BCI contracts.

This is urgent.


Related issues 2 (0 open, 2 closed)

Related to Containers and images - action #135332: Ensure recent containers are released (Resolved, 2023-09-07)

Related to openQA Infrastructure (public) - action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources (Resolved, mgrifalconi)

Actions #1

Updated by ph03nix over 1 year ago

I will decrease the priority for the test runs in question as a quick fix, but I think we really need more workers in the long run.

Actions #2

Updated by okurz over 1 year ago

  • Target version set to Ready

If you can help us raise the concern with SUSE-IT Eng-Infra so that more CPU and RAM get assigned to the OSD VM, we can increase the number of workers.

Actions #3

Updated by ph03nix over 1 year ago

  • Related to action #135332: Ensure recent containers are released added
Actions #4

Updated by okurz over 1 year ago

  • Status changed from New to Blocked
  • Assignee set to okurz

Felix filed https://sd.suse.com/servicedesk/customer/portal/1/SD-131786 for it. I shared it with "OSD Admins". Everyone else should just track this ticket, #135329.

Actions #5

Updated by ph03nix over 1 year ago

I filed https://sd.suse.com/servicedesk/customer/portal/1/SD-131786 for it. Anyone who needs access, just ping me in Slack.

Actions #6

Updated by ph03nix over 1 year ago

Most of the urgency is resolved for us now. Thanks for looking into this!

From my POV this ticket can be closed, unless you peeps need to have it open for further work.

Actions #7

Updated by okurz over 1 year ago

Thanks for your explicit response. At least as long as the SD ticket is open, I would like to keep this ticket open. But is the original issue regarding s390x jobs really resolved then? If yes, what would you say was the impact of your manually tweaking the job scheduling priorities?

Actions #8

Updated by okurz over 1 year ago

  • Parent task set to #135122
Actions #9

Updated by ph03nix over 1 year ago

okurz wrote in #note-7:

Thanks for your explicit response. At least as long as the SD ticket is open, I would like to keep this ticket open. But is the original issue regarding s390x jobs really resolved then? If yes, what would you say was the impact of your manually tweaking the job scheduling priorities?

I'm not observing s390x blocking any ongoing issues at the moment; however, we only notice this when things are already on fire.

So, the urgency of the task is gone, but I could not say with confidence that the load issue with s390x is resolved. I do see, however, ppc64le taking longer than other architectures.

Actions #10

Updated by okurz over 1 year ago

  • Status changed from Blocked to Resolved

Ok, thx. https://sd.suse.com/servicedesk/customer/portal/1/SD-131786 was resolved; the OSD VM has more CPU and more RAM. In a related ticket I commented that we removed the job limit again for now, so we can follow up there and resolve this ticket here.

Actions #11

Updated by okurz about 1 year ago

  • Related to action #127523: [qe-core][s390x][kvm] Make use of generic "s390-kvm" class to prevent too long waiting for s390x worker ressources added