action #167057
coordination #167054 (closed): [epic] Run more workloads in CC-compliant PRG2 to be less affected by CC related network changes
Run more standard, qemu OSD openQA jobs in CC-compliant PRG2 and none in NUE2 size:S
0%
Description
Motivation
Non-compliant NUE2-based OSD workers might become problematic due to #165282. We cannot simply connect more PRG2 OSD workers because that overloads the webUI (see #166802), so we need to disable some worker slots in NUE2.
Acceptance criteria
- AC1: No standard, qemu OSD jobs are executed anymore in NUE2
- AC2: All jobs commonly scheduled on OSD are still executed
Suggestions
- Review all explicit and implicit standard qemu worker slots on NUE2-based workers within https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls and disable them with a note pointing to this ticket and the parent epic and/or saga for context
- Ensure that the corresponding worker classes are served by other workers, preferably in PRG2
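The second suggestion can be checked mechanically. The following is only a minimal sketch under assumptions: the inventory is a hypothetical in-memory dict, and the host names, locations and worker classes are illustrative and not taken from workerconf.sls. It reports worker classes that would no longer be served by any worker once all slots in one location are disabled.

```python
from typing import Dict, List, Set

# Hypothetical inventory: names, locations and classes are made up for
# illustration and do not reflect the real workerconf.sls content.
WORKERS: Dict[str, Dict[str, object]] = {
    "nue2-worker-a": {"location": "NUE2", "classes": ["qemu_x86_64", "tap"]},
    "prg2-worker-a": {"location": "PRG2", "classes": ["qemu_x86_64"]},
    "prg2-worker-b": {"location": "PRG2", "classes": ["qemu_x86_64", "qemu_aarch64"]},
}


def uncovered_classes(workers: Dict[str, Dict[str, object]],
                      location_to_disable: str) -> Set[str]:
    """Return classes served only by workers in the location to be disabled."""
    removed: Set[str] = set()    # classes served by workers we want to disable
    remaining: Set[str] = set()  # classes served by workers that stay online
    for conf in workers.values():
        classes: List[str] = list(conf["classes"])  # type: ignore[arg-type]
        if conf["location"] == location_to_disable:
            removed.update(classes)
        else:
            remaining.update(classes)
    return removed - remaining


if __name__ == "__main__":
    # With the illustrative data above only "tap" would lose all its workers.
    print(sorted(uncovered_classes(WORKERS, "NUE2")))
```

An empty result would suggest that AC2 is not put at risk by disabling the slots, at least as far as worker classes are concerned.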
Updated by okurz 3 months ago
- Related to action #166802: Recover worker37, worker38, worker39 size:S added
Updated by okurz 3 months ago
- Status changed from Feedback to In Progress
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/901 was merged and is effective. I then pulled two more commits out into https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/903 and merged that first, followed by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/902 on top. After that was deployed I took the following machines out of production, powered them off and updated racktables accordingly:
- openqaworker-arm-1 https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9886
- imagetester https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=742
- openqaworker1 https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=1260
For openqaworker-arm-1 I additionally muted the corresponding notification policy in https://monitor.qa.suse.de/alerting/routes with "All times" and mentioned that on racktables.
Updated by okurz 3 months ago
- Status changed from In Progress to Feedback
I monitored https://openqa.suse.de/tests?resultfilter=Failed&resultfilter=Incomplete and https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now&refresh=5m and found no related problems.
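The job list URL above repeats the `resultfilter` query parameter once per result. As a small sketch, assuming one wants to generate such links in a script (the helper name `failed_jobs_url` is hypothetical), the standard library can build the same query string:

```python
from typing import Sequence
from urllib.parse import urlencode


def failed_jobs_url(base: str = "https://openqa.suse.de/tests",
                    results: Sequence[str] = ("Failed", "Incomplete")) -> str:
    """Build a job list URL with one resultfilter parameter per result."""
    # A list of (key, value) pairs lets the same key appear several times.
    query = urlencode([("resultfilter", result) for result in results])
    return f"{base}?{query}"


if __name__ == "__main__":
    print(failed_jobs_url())
```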
Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/906 to handle sapworker1-3
Updated by okurz 3 months ago
The long-time alert for openqaworker-arm-1 was still complaining about "no data". I added a nested notification policy with no contact point and an all-time mute so that we no longer get notifications. However, the corresponding alert(s) still show up on https://monitor.qa.suse.de/alerting/list?search=health:nodata, but I would prefer not to delete the alert definitions completely.
Updated by okurz 3 months ago
- Due date deleted (2024-10-09)
- Status changed from Feedback to Resolved
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/906 was merged and deployed. I took sapworker2+3 out of production, powered them off, marked them accordingly in racktables and verified that the machines are actually off. https://openqa.suse.de/admin/workers currently has 947 worker instances connected.
Updated by okurz 2 months ago
- Copied to action #168177: Migrate critical VM based services needing access to CC-services to CC areas added