coordination #139010: [epic] Long OSD ppc64le job queue - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

coordination #139010

open

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

[epic] Long OSD ppc64le job queue

Added by okurz over 1 year ago. Updated 7 months ago.

Status:

Blocked

Priority:

Normal

Assignee:

okurz

Category:

Regressions/Crashes

Target version:

QA (public) - future

Start date:

2023-11-04

Due date:

% Done:

66%

Estimated time:

(Total: 0.00 h)

Tags:

ppc64le

Description

Motivation¶

Currently on OSD there is a longer job queue in particular for ppc64le. This seems to be due to multiple reasons:

Apparently multiple demanding product requirements come together including at least BCI and kernel live-patching
Multi-month shortage on PowerPC related testing due to datacenter migration, see #132140
KVM on PowerPC was repeatedly described as not supported anymore and deprecated and that we should not run this anymore hence there was less priority on providing more test capacity. Though for planning of migration we already prioritized setting up more free hardware reserved for manual testing instead to use for openQA tests.

I assume many teams learned that KVM on PowerPC is much more reliable for us than PowerVM for multiple reasons hence teams have worked with the temporarily higher capacity that we could provide at least for the time of the migration. By now about 60% of former kvm@powerpc capacity are available from FC Basement lab – merely a tertiary tier mitigation applied by qe tools team. The 100% which many might see as "reference" was a temporary situation preparing for datacenter migration. The actual reference should be about 1-2 years ago which is on par with the current provided capacity.

Ideas¶

Multiple improvement ideas, merely brainstorming with no commitment by anyone so far to conduct any of those:

Communicate to all stakeholders our monitoring dashboard https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now from which it should be clear what to expect.
~~Move prg2e target power8 machines still in nue1 to nue2 (this was based on wrong assumptions by okurz: There are no more PowerPC machines in nue1 so this leaves "3. Move nue3 power8 machines to nue2")~~
Move nue3 power8 machines to nue2 -> #139100
Try qemu on power9 from prg2 machines, also ask buildops team
Ask other teams for free ressources, also orthos
Decrease testing scope
Decrease test runtime
Decrease test failure rate, especially unreviewed, unlabeled failures
DONE Try to setup free power hardware in FC Basement, e.g. mania.qe.nue2.suse.org https://racktables.nue.suse.com/index.php?page=object&object_id=9588 -> #139271
Increase number of worker instances on existing kvm@powerpc machines and monitor for stability
Increase openQA instance job limit to give ppc64le jobs a better chance to run
Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs -> #139103
Put more effort into PowerVM, e.g. #71794, and use PowerVM more
Improve automatic decision so that products are marked as "acceptable" based on just a smaller critical subset of tests and give other not-critical tests a possibility to finish later so that longer job queue don't prevent releases.
New builds are triggered despite old tests have not even finished so multiple times tests "never finish" -> Improve automation to serialize building+testing+releasing
For BCI: Consider reducing polling interval when to trigger new tests so that not too many tests are scheduled

Subtasks 3 (1 open — 2 closed)

Related issues 4 (0 open — 4 closed)

Actions

Copy link

Updated by okurz over 1 year ago

Copied from action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M added

Actions

Copy link

Updated by okurz over 1 year ago

Project changed from 46 to openQA Project (public)
Category changed from Enhancement to existing tests to Support

Actions

Copy link

Updated by okurz over 1 year ago

Due date set to 2023-12-07
Status changed from In Progress to Feedback

I answered and linked to this ticket in

Awaiting feedback if any.

Actions

Copy link

Updated by ph03nix over 1 year ago

Related to action #138770: [BCI] Reduce coverage for ppc64le added

Actions

Copy link

Updated by MDoucha over 1 year ago

KVM on PowerPC was repeatedly described as not supported anymore and deprecated and that we should not run this anymore hence there was less priority on providing more test capacity. Though for planning of migration we already prioritized setting up more free hardware reserved for manual testing instead to use for openQA tests.

Migration of maintenance tests from KVM/QEMU backend to PowerVM is blocked by missing disk image support in the respective OpenQA backend implementation. We have requested the disk image support 3 years ago (#71794).

Actions

Copy link

Updated by okurz over 1 year ago

Related to action #138725: [BCI] Re-enable FIPS on ppc64le and s390x added

Actions

Copy link

Updated by okurz over 1 year ago

Description updated (diff)

Conducted meeting with runger, jlausuch, hrommel, pcervinka. Updating description with additions.

I will look into 3. and 12. myself.

For 3.
https://suse.slack.com/archives/C05UHQ49B7D/p1699015046191019

(Oliver Kurz) hi guys, how feasible would it be to move 1-2 machines from NUE3 "MB" to NUE2-FC_Basement?

For 12 I followed up in #139103

Actions

Copy link

#10

Updated by okurz over 1 year ago

Description updated (diff)

Actions

Copy link

#11

Updated by okurz over 1 year ago

Tracker changed from action to coordination
Subject changed from Long OSD ppc64le job queue to [epic] Long OSD ppc64le job queue
Status changed from Feedback to In Progress
Parent task set to #110833

Actions

Copy link

#12

Updated by okurz over 1 year ago

Subtask #139100 added

Actions

Copy link

#13

Updated by okurz over 1 year ago

Description updated (diff)

Actions

Copy link

#14

Updated by okurz over 1 year ago

Description updated (diff)

Actions

Copy link

#15

Updated by okurz over 1 year ago

Subtask #139103 added

Actions

Copy link

#16

Updated by okurz over 1 year ago

Description updated (diff)

Actions

Copy link

#17

Updated by okurz over 1 year ago

Status changed from In Progress to New
Assignee deleted (~~okurz~~)
Target version changed from Ready to future

Two subtasks defined, rest to be followed up with. Right now SUSE QE Tools does not plan to follow up with any other of the specified tasks except for the two explicit subtasks.

Actions

Copy link

#18

Updated by okurz over 1 year ago

Related to action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added

Actions

Copy link

#19

Updated by jlausuch over 1 year ago

Looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-30d&to=now&viewPanel=12
I would say that the situation is under control.
I only see a unusual peak here: https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1699855177271&to=1699940834622&viewPanel=12
probably due to some milestone (maybe 15-SP6), but the trend is going down now, so we should be fine.
Do you agree?

Actions

Copy link

#20

Updated by MDoucha over 1 year ago

jlausuch wrote in #note-19:

Looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-30d&to=now&viewPanel=12
I would say that the situation is under control.
I only see a unusual peak here: https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1699855177271&to=1699940834622&viewPanel=12
probably due to some milestone (maybe 15-SP6), but the trend is going down now, so we should be fine.
Do you agree?

We're still running on only 16 qemu_ppc64le worker slots. When the next batch of livepatches comes, OSD will be overloaded for 2 weeks again. So the question is: Do we get more worker slots by the end of next week, or should I reduce LTP coverage for PPC64LE kernel maintenance updates?

Actions

Copy link

#21

Updated by okurz over 1 year ago

I am in the process to add more worker slots, see #139271. Likely to help this week still

Actions

Copy link

#22

Updated by okurz over 1 year ago

Description updated (diff)

#139271 was resolved by bringing into production 30 more qemu_ppc64le instances from the machine mania.qe.nue2.suse.org which covers point 9.

Actions

Copy link

#23

Updated by okurz 9 months ago

Subtask #166802 added

Actions

Copy link

#24

Updated by okurz 9 months ago

Category changed from Support to Regressions/Crashes
Status changed from New to Blocked
Assignee set to okurz
Target version changed from future to Ready

Actions

Copy link

#25

Updated by okurz 7 months ago

Target version changed from Ready to future

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries