action #158170
closed
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #158167: [epic] Increase worker capacity
Increase resources for s390x kvm size:M
0%
Description
Motivation
https://suse.slack.com/archives/C02CANHLANP/p1711533706482229
(Oliver Kurz) @Matthias Griessmeier would you be interested in trying to acquire more s390x kvm testing resources? Looking into https://suse.slack.com/archives/C02CLB8TZP1/p1711532709502039 I found that s390x kvm openQA jobs have a significant backlog of scheduled jobs due to the limited number of available instances. We would be able to run more instances with more memory assigned to the hypervisor LPAR.
Acceptance criteria
- AC1: s390zl12+13 run more than 5 VMs each
- AC2: openQA jobs on s390zl12+13 still consistently pass and there are no related monitoring alerts
Suggestions
- s390zl12+13 have more resources
- There are already more VMs configured by mgriessmeier
- Bring https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951 forward, i.e. adjust DHCP pool addresses
- Then increase instances in workerconf.sls
- Verify while monitoring
Updated by okurz 9 months ago
- Related to action #153958: [alert] s390zl12: Memory usage alert Generic memory_usage_alert_s390zl12 generic added
Updated by okurz 9 months ago
Both s390zl12+13 will have double the original memory amount. Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/761 for re-enabling the previously disabled instances as part of #153958.
After that, waiting for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951 so that DHCP/DNS entries for more s390x kvm instances are prepared, then enabling more instances in workerconf.sls.
Updated by mgriessmeier 9 months ago
s390zl12 and s390zl13 have been upgraded and now have 160GB RAM each (double the previous amount) and 6.0 processors (previously 4.0).
I have prepared and reserved 20 more instances for future use with https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951
Updated by openqa_review 9 months ago
- Due date set to 2024-04-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 9 months ago
- Tracker changed from coordination to action
- Status changed from In Progress to Feedback
Right now the situation looks stable. s390zl12+13 are using more resources and are both back again with +2 instances. More is still pending on https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951
Updated by okurz 9 months ago
https://suse.slack.com/archives/C02CANHLANP/p1712228880986239
(Oliver Kurz) @Matthias Griessmeier will you follow up on https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4951 regarding the DHCP pool adjustment or do we need to take over?
Updated by nicksinger 8 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by nicksinger 8 months ago
- Status changed from In Progress to Feedback
IPs for the machines adjusted in https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4977. I will wait for the merge before bringing up the corresponding worker instances.
Updated by nicksinger 8 months ago
- Status changed from Feedback to Workable
Merged. Ready to be worked on again, e.g. by validating that the entries work and that the mentioned machines are ready to be used.
Updated by nicksinger 8 months ago
- Status changed from Workable to Feedback
Updated by nicksinger 8 months ago
- Status changed from Feedback to Workable
Merged. It would have been wise to add the new instances with a ticket suffix but now we're testing live. Let's review on Monday whether the new virsh instances perform as expected.
Updated by okurz 8 months ago · Edited
Doesn't go well. https://openqa.suse.de/admin/workers/3087 shows no successful jobs on the new instances. In particular https://openqa.suse.de/tests/14103625#step/bootloader_zkvm/44 states
# Test died: Error connecting to VNC server <s390kvm115.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host at /usr/lib/os-autoinst/testapi.pm line 1690.
I checked all the s390-kvm workers and saw consistent failures on s390kvm100…s390kvm119 but I have also seen s390kvm093 consistently failing https://openqa.suse.de/admin/workers/2650, not sure about that one.
I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/784 for mitigation, will merge and restart all related failures with
for i in worker33 worker35 worker40; do WORKER=$i host=openqa.suse.de failed_since=2024-04-19 result="result='failed'" comment="label:poo158170" ./openqa-advanced-retrigger-jobs; done
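The "No route to host" failure above can be checked outside of openQA with a plain TCP probe. The following is a minimal sketch, not part of the original investigation: the helper names `kvm_hostnames` and `tcp_reachable` are hypothetical, and the domain and port are taken from the error message quoted above. Note that a refused connection (host up, no VNC server listening) is also reported as unreachable here.

```python
import socket

# Assumption: domain and VNC port are copied from the quoted error message.
DOMAIN = "oqa.prg2.suse.org"
VNC_PORT = 5901


def kvm_hostnames(first, last, domain=DOMAIN):
    """Return the names s390kvm<first> .. s390kvm<last> (inclusive) as FQDNs."""
    return [f"s390kvm{n:03d}.{domain}" for n in range(first, last + 1)]


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", refused connections,
        # DNS resolution failures and timeouts.
        return False
```

Calling `tcp_reachable(name, VNC_PORT)` for each `name` in `kvm_hostnames(100, 119)` would list which of the new instances accept connections while a job is running on them.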
Updated by nicksinger 8 months ago
- Status changed from Workable to In Progress
DHCP configs did not properly apply after the merge because suttner1 apparently was "out of sync" with suttner2 - not sure what or who fixed that but we're good now: https://openqa.suse.de/tests/overview?build=nsinger_s390validation
The failing instances most likely hit the sexagesimal quirk, which I am trying to fix or work around now. After this is done we can merge https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/791 to finally bring them into production.
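The ticket does not spell out what the sexagesimal quirk is. Assuming it refers to YAML 1.1 treating unquoted, colon-separated digit groups as base-60 integers (which can affect values such as MAC addresses in YAML-based configuration), the sketch below shows which strings would be affected. The function names and the sample values are hypothetical; the pattern follows the YAML 1.1 integer type definition.

```python
import re

# Base-60 integer pattern from the YAML 1.1 "int" type definition.
SEXAGESIMAL_INT = re.compile(r"[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+")


def is_sexagesimal(scalar):
    """Return True if a YAML 1.1 parser would read this plain scalar as a base-60 int."""
    return SEXAGESIMAL_INT.fullmatch(scalar) is not None


def sexagesimal_value(scalar):
    """Return the integer a YAML 1.1 parser would produce for a matching scalar."""
    sign = -1 if scalar.startswith("-") else 1
    digits = scalar.lstrip("+-").replace("_", "")
    total = 0
    for part in digits.split(":"):
        total = total * 60 + int(part)
    return sign * total
```

Under this assumption, a hypothetical value like `52:54:00:12:34:56` matches the pattern and would be loaded as a number, while `52:54:00:ab:cd:ef` would stay a string. Quoting such values in the YAML source is the usual workaround.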
Updated by okurz 8 months ago
Feel welcome to block on https://jira.suse.com/browse/ENGINFRA-4030 "suttner1.oqa.prg2.suse.org+suttner2.oqa.prg2.suse.org times are both out of sync with NTP causing DHCP-failover to fail" any time and escalate to a line manager of your choice :)
Updated by nicksinger 8 months ago
- Status changed from In Progress to Feedback
well, we're good for now :) https://openqa.suse.de/tests/overview?build=nsinger_s390validation
Updated by nicksinger 8 months ago
- Status changed from Feedback to Resolved
I checked the instances. There are a lot of red container tests but they look like test issues. A few green jobs in between show that the workers do their job as expected.
Updated by nicksinger 8 months ago
- Status changed from Resolved to In Progress
worker36+37 are offline because of https://progress.opensuse.org/issues/157726 (and linked tickets), meaning we are missing 10 instances. Not sure how I missed them previously but we have to move them now. Doing this now.
Updated by nicksinger 8 months ago
- Status changed from In Progress to Resolved
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/804 moved the slots around and OSD now has 20 production instances (zl13 disabled due to https://progress.opensuse.org/issues/159066) which are capable of successfully completing jobs.
Updated by jbaier_cz 7 months ago
- Related to action #160598: [alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S added