action #166802
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #139010: [epic] Long OSD ppc64le job queue
Recover worker37, worker38, worker39 size:S
Description
Motivation
After #139103 we should ensure that all remaining, currently offline machines in the PRG2 oQA infra are up and running again.
Acceptance criteria
- AC1: All w37-w39 run OSD production jobs
- AC2: non-x86_64, non-qemu jobs are still executed and not starved out by too many x86_64 jobs
Suggestions
- Take care to apply the workarounds from #157975-12 to prevent accidental distribution upgrades
- Read what was done in #139103 and bring all of w37-w39 back into production
- Monitor the impact on qemu_ppc64le job age as well as on other non-x86_64, non-qemu jobs; see the sketch below for a quick spot-check
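A minimal sketch for spot-checking the scheduled ppc64le queue over the openQA jobs API; it assumes openqa-cli and jq are available locally, and uses the documented state/arch filters of /api/v1/jobs:

# count currently scheduled ppc64le jobs on OSD, a rough starvation indicator
openqa-cli api --host https://openqa.suse.de jobs state=scheduled arch=ppc64le | jq '.jobs | length'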
Updated by okurz 3 months ago
- Copied from action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M added
Updated by okurz 3 months ago
- Related to action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org) added
Updated by okurz 3 months ago
Applying the same approach as in #139103-29 to downgrade the firewall package.
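A hedged sketch of what such a downgrade can look like with zypper; assuming firewalld is the affected package, and with the version as a placeholder since the actual one is referenced in #139103-29:

# install the known-good older version and lock it to prevent re-upgrade
sudo zypper install --oldpackage firewalld-<known-good-version>
sudo zypper addlock firewalld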
Updated by okurz 3 months ago · Edited
I brought back all of w37+w38+w39 but then encountered a problem that looks very similar to what we have seen in the past, where workers no longer pick up any jobs. Following https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production I took w38+w39 out of production again. See #139103-31 for details.
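For reference, a minimal sketch of the salt side of taking a host out of production, following the linked wiki section; it assumes the commands run on the salt master (OSD) and that deleting the minion keys is the relevant step:

# remove the minion keys so the hosts no longer receive salt states
sudo salt-key -y -d worker38.oqa.prg2.suse.org
sudo salt-key -y -d worker39.oqa.prg2.suse.org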
Updated by openqa_review 3 months ago
- Due date set to 2024-10-03
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 3 months ago
- Related to action #167057: Run more standard, qemu OSD openQA jobs in CC-compliant PRG2 and none in NUE2 size:S added
Updated by okurz 3 months ago
- Related to action #134924: Websocket server overloaded, affected worker slots shown as "broken" with graceful disconnect in workers table added
Updated by nicksinger 3 months ago
- Related to action #167081: test fails in support_server/setup on osd worker37 size:S added
Updated by okurz about 2 months ago
- Status changed from Blocked to In Progress
Updated by okurz about 2 months ago
- Related to action #157690: Simple global limit of registered/online workers size:M added
Updated by okurz about 2 months ago · Edited
- Status changed from In Progress to Blocked
I brought up w38 as well and ran:
for i in worker38 worker39; do openqa-clone-job --skip-chained-deps --repeat=60 --within-instance https://openqa.suse.de/tests/15639721 {TEST,BUILD}+=-poo166802-okurz _GROUP=0 WORKER_CLASS=$i; done
but I can already see on OSD in /var/log/openqa_scheduler:
[2024-10-12T13:52:26.746206Z] [warn] [pid:24409] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.
so #157690 is not effective. I reopened it and disabled and powered down both w37+w38 again.
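For completeness, a hedged sketch of the per-host part of "disabled and powered down", assuming the classic openqa-worker@N unit names; the slot range is a placeholder:

# stop and disable all worker slots on the machine, then power it off
sudo systemctl disable --now openqa-worker@{1..10}
sudo poweroff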
Updated by okurz about 1 month ago
- Related to action #168178: Limit connected online workers based on websocket+scheduler load size:M added