coordination #139010

open

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

[epic] Long OSD ppc64le job queue

Added by okurz about 1 year ago. Updated about 2 months ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version: QA (public, currently private due to #173521) - future
Start date: 2023-11-04
Due date:
% Done: 66%
Estimated time: (Total: 0.00 h)
Tags:

Description

Motivation

Currently on OSD there is a longer job queue in particular for ppc64le. This seems to be due to multiple reasons:

  1. Apparently multiple demanding product requirements come together including at least BCI and kernel live-patching
  2. Multi-month shortage on PowerPC related testing due to datacenter migration, see #132140
  3. KVM on PowerPC was repeatedly described as deprecated and no longer supported, with the advice that we should not run it anymore, hence providing more test capacity had lower priority. Nevertheless, when planning the migration we already prioritized setting up additional free hardware, previously reserved for manual testing, for use by openQA tests instead.

I assume many teams learned that KVM on PowerPC is much more reliable for us than PowerVM, for multiple reasons, hence teams have worked with the temporarily higher capacity that we could provide at least for the time of the migration. By now about 60% of the former kvm@powerpc capacity is available from the FC Basement lab, which is merely a tertiary-tier mitigation applied by the qe tools team. The 100% which many might see as the "reference" was a temporary situation in preparation for the datacenter migration. The actual reference should be the capacity of about 1-2 years ago, which is on par with the currently provided capacity.

Ideas

Multiple improvement ideas, merely brainstorming with no commitment by anyone so far to conduct any of those:

  1. Communicate to all stakeholders our monitoring dashboard https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now from which it should be clear what to expect.
  2. Move prg2e target power8 machines still in nue1 to nue2 (this was based on wrong assumptions by okurz: There are no more PowerPC machines in nue1 so this leaves "3. Move nue3 power8 machines to nue2")
  3. Move nue3 power8 machines to nue2 -> #139100
  4. Try qemu on power9 from prg2 machines, also ask buildops team
  5. Ask other teams for free resources, also orthos
  6. Decrease testing scope
  7. Decrease test runtime
  8. Decrease test failure rate, especially unreviewed, unlabeled failures
  9. DONE Try to set up free power hardware in FC Basement, e.g. mania.qe.nue2.suse.org https://racktables.nue.suse.com/index.php?page=object&object_id=9588 -> #139271
  10. Increase number of worker instances on existing kvm@powerpc machines and monitor for stability
  11. Increase openQA instance job limit to give ppc64le jobs a better chance to run
  12. Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs -> #139103
  13. Put more effort into PowerVM, e.g. #71794, and use PowerVM more
  14. Improve the automatic decision so that products are marked as "acceptable" based on just a smaller critical subset of tests, and give other non-critical tests a chance to finish later so that longer job queues don't prevent releases.
  15. New builds are triggered even though old tests have not finished yet, so tests repeatedly "never finish" -> Improve automation to serialize building+testing+releasing
  16. For BCI: Consider reducing the polling interval for triggering new tests so that not too many tests are scheduled
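
Idea 14 could be sketched roughly as follows. This is only an illustration of the decision rule, not existing openQA behavior: the `JobResult` type, its `critical` flag and the functions `is_acceptable` and `pending_non_critical` are hypothetical names, and treating "passed" and "softfailed" as acceptable results is an assumption.

```python
from dataclasses import dataclass

# Assumption: these results count as "good enough" for a release decision.
ACCEPTED_RESULTS = {"passed", "softfailed"}


@dataclass(frozen=True)
class JobResult:
    """Hypothetical, simplified view of one test job of a build."""
    name: str
    critical: bool  # hypothetical flag marking the release-critical subset
    state: str      # e.g. "scheduled", "running", "done"
    result: str     # e.g. "none", "passed", "softfailed", "failed"


def is_acceptable(jobs):
    """Return True if every critical job is done with an accepted result.

    Non-critical jobs are ignored, so a long queue of them does not
    block the decision. A build without any critical job is rejected.
    """
    critical = [j for j in jobs if j.critical]
    if not critical:
        return False
    return all(j.state == "done" and j.result in ACCEPTED_RESULTS
               for j in critical)


def pending_non_critical(jobs):
    """Names of non-critical jobs that may still finish later."""
    return [j.name for j in jobs if not j.critical and j.state != "done"]
```

With such a rule a build whose critical jobs have passed would be marked acceptable even while non-critical ppc64le jobs are still queued; those could then be reviewed after the fact.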

Subtasks 3 (1 open, 2 closed)

openQA Infrastructure (public) - action #139100: Long OSD ppc64le job queue - Move nue3 power8 machines to nue2 | Resolved | okurz | 2023-11-04
openQA Infrastructure (public) - action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M | Resolved | okurz | 2023-11-04
openQA Infrastructure (public) - action #166802: Recover worker37, worker38, worker39 size:S | Blocked | okurz

Related issues 4 (0 open, 4 closed)

Related to Containers and images - action #138770: [BCI] Reduce coverage for ppc64le | Resolved | ph03nix | 2023-10-31
Related to Containers and images - action #138725: [BCI] Re-enable FIPS on ppc64le and s390x | Resolved | pherranz | 2023-10-30
Related to openQA Infrastructure (public) - action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M | Resolved | okurz | 2023-09-20
Copied from openQA Tests (public) - action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M | Resolved | mkittler | 2023-09-20
