action #93119
closed
[s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12
Description
For your information: the mainframe zEC12, which hosted our LPARs s390p7 (s390x-kvm-sle15), s390p8 (s390x-kvm-sle12) and s390p9 (former QAM), had to be shut down because we will get a new mainframe in the upcoming weeks.
This won't have much impact for now, as we have backup workers for s390x-kvm-sle12. For s390x-kvm-sle15 we don't have any, but as far as I'm aware the corresponding jobs have been modified to run on s390x-kvm-sle12. (If not, please do so, as s390-kvm-sle15 won't be functional for now.)
In the upcoming days I will set up more workers on our other mainframe, zEC13. For now, if you see failing test cases on worker_class 's390x-kvm-sle12', a simple restart should work, as the job will then be picked up by the zEC13 workers (s390zp18).
Additionally, it seems there are still tests running on worker_class 'zkvm', which refers to an old LPAR, s390pb; this was also shut down. So the second goal of this ticket is to get rid of this machine as well.
Tasks
- Create PR to disable workers running on LPAR s390p7 and s390p8: DONE by dzedro (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/318)
- Identify jobs running on worker_class 'zkvm' and move them to worker_class 's390x-kvm-sle12' to be scheduled on s390zp18
- Create PR to disable workers running on LPAR s390pb: DONE by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/320
- Setup new LPAR on zEC13 to mitigate the loss of the workers running on the old and shut down LPARs on zEC12: DONE by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/320
- Monitor jobs on s390-kvm-sle12 to spot potential setup issues on s390zp19: s390zp19 running fine for 2 weeks now
- Temporarily change s390-kvm-sle15 worker config to mirror s390-kvm-sle12
- Document new setup and remove all references to the old one
- Provide additional resources for manual testing
Rollback
- Remove workaround machine class settings from openqa.suse.de/admin/machines
Updated by okurz over 3 years ago
- Target version set to future
cool, thank you. For now you can track it as a "personal" task. Unless you can resolve it yourself please take care to assign it to a squad eventually. I suggest either [qe-core] or [tools].
Updated by geor over 3 years ago
- Description updated (diff)
- Target version deleted (future)
Updated by okurz over 3 years ago
- Target version set to future
@geor I assume you deleted the target version by mistake in an edit conflict, setting back to "Future".
On request by mgriessmeier creating a temporary worker configuration on grenache-1.qa. In /etc/openqa/workers.ini:
[66]
WORKER_CLASS=s390-kvm-sle12-poo93119-okurz,grenache-1
NETDEV=eth0
SUT_IP=10.161.145.90
VIRSH_HOSTNAME=s390zp19.suse.de
VIRSH_PASSWORD=nots3cr3t
VIRSH_GUEST=10.161.145.90
VIRSH_MAC=52:54:00:12:5c:d6
VIRSH_CMDLINE=ifcfg=dhcp
VIRSH_INSTANCE=1
and starting it with systemctl start openqa-worker@66. Scheduled a new job with
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/6107218 _GROUP=0 WORKER_CLASS=s390-kvm-sle12-poo93119-okurz
Created job #6111084: sle-15-SP2-Server-DVD-Updates-s390x-Build20210526-1-ltp_cve@s390x-kvm-sle12 -> https://openqa.suse.de/tests/6111084
EDIT: job is fine
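As a rough illustration of why the dedicated worker class keeps the cloned job on this one instance, here is a minimal Python sketch. It parses a trimmed copy of the section above and checks whether a job's requested class is offered by the worker. The helper names `worker_classes` and `can_run` are hypothetical, and the rule that a job runs only on a worker offering every class it requests is an assumption about the scheduler, not something this ticket confirms.

```python
import configparser

# Trimmed copy of the [66] section from the comment above.
WORKERS_INI = """
[66]
WORKER_CLASS=s390-kvm-sle12-poo93119-okurz,grenache-1
VIRSH_HOSTNAME=s390zp19.suse.de
VIRSH_INSTANCE=1
"""


def worker_classes(ini_text, instance):
    """Return the set of worker classes offered by one worker instance."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep upper-case keys as written
    parser.read_string(ini_text)
    return set(parser[instance]["WORKER_CLASS"].split(","))


def can_run(job_class, offered):
    """Assumed rule: every class the job requests must be offered."""
    return set(job_class.split(",")) <= offered


offered = worker_classes(WORKERS_INI, "66")
print(can_run("s390-kvm-sle12-poo93119-okurz", offered))
print(can_run("s390-kvm-sle15", offered))
```

Under that assumption, only jobs that explicitly request the temporary class would land on instance 66, which is why the clone command overrides WORKER_CLASS.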
Updated by mgriessmeier over 3 years ago
- Description updated (diff)
new LPAR installed, https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/320 created, waiting for merge
Updated by okurz over 3 years ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/320 was merged. Tests were picked up to run on the new instances. So far no problems observed.
Updated by okurz over 3 years ago
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry
As the new workers are in place, we can now use auto-review not only to label openQA jobs but also to retrigger them, selecting only jobs within the timeframe 2021-05-21 to 2021-05-25 so that the pattern does not match potential future problems.
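To illustrate how the date restriction narrows the match, here is a minimal Python sketch. It uses the pattern from the subject line verbatim with Python's `re` module; the two log excerpts are invented for illustration, the real log layout may differ, and auto-review itself may use a different regex engine.

```python
import re

# Pattern copied from the ticket subject; (?s) lets "." match newlines too.
PATTERN = re.compile(
    r"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host"
)

# Hypothetical log excerpts, one inside and one outside the timeframe.
inside = (
    "[2021-05-23T10:15:02.123 CEST] [debug] backend start\n"
    "Error connecting to <root@s390p8.suse.de>: No route to host\n"
)
outside = (
    "[2021-06-02T10:15:02.123 CEST] [debug] backend start\n"
    "Error connecting to <root@s390p8.suse.de>: No route to host\n"
)


def matches(log_text):
    """True if the log contains a timestamp in range followed by the error."""
    return PATTERN.search(log_text) is not None


print(matches(inside))
print(matches(outside))
```

The same error on a later date is ignored because no timestamp in the log satisfies `2021-05-2[1-5]T`.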
Updated by okurz over 3 years ago
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host"
Wait. If I understand the intended change correctly we will not have any machine "zkvm" anymore at all so we would need to change all test schedules, correct?
EDIT: For the sake of backward compatibility, how about applying the worker class "svirt" to some or all s390-kvm-sle12 instances?
Updated by geor over 3 years ago
okurz wrote:
Wait. If I understand the intended change correctly we will not have any machine "zkvm" anymore at all so we would need to change all test schedules, correct?
Well, for now we still have the zkvm Machine entry; it just points to the s390-kvm-sle12 worker class, so scheduling should not be affected. But maybe it should be adapted later to be consistent.
Updated by okurz over 3 years ago
yes, I saw that now as well. So it seems like someone updated the machine. Then what we would need is manual tinkering with API to reschedule all failed tests with the updated worker class, hm …
Updated by geor over 3 years ago
okurz wrote:
yes, I saw that now as well. So it seems like someone updated the machine. Then what we would need is manual tinkering with API to reschedule all failed tests with the updated worker class, hm …
I took the liberty of updating the machine entry. I am also in the process of rescheduling any zkvm s390 jobs to update their settings, plus any further tinkering that may be needed.
Updated by mgriessmeier over 3 years ago
yeah technically the Machines 'zkvm' and s390-kvm-sle15 could be deleted...
Updated by okurz over 3 years ago
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host" to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry:WORKER_CLASS=s390-kvm-sle12
@geor I think I have an idea how to extend auto-review to retrigger tests with changed settings. I hope you leave some old un-retriggered fails for me :D
Updated by okurz over 3 years ago
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry:WORKER_CLASS=s390-kvm-sle12 to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host"
OK, never mind. I will follow up with my approach elsewhere. For now you can go ahead your way and handle all failed jobs that you can find.
Updated by geor over 3 years ago
okurz wrote:
@geor I think I have an idea how to extend auto-review to retrigger tests with changed settings. I hope you leave some old un-retriggered fails for me :D
Sorry just saw it! But this sounds useful!
Updated by geor over 3 years ago
mgriessmeier wrote:
yeah technically the Machines 'zkvm' and s390-kvm-sle15 could be deleted...
I just started migrating all zkvm jobs to s390x-kvm-sle12, when all seems well I will delete the zkvm Machine entry.
Updated by geor over 3 years ago
geor wrote:
mgriessmeier wrote:
yeah technically the Machines 'zkvm' and s390-kvm-sle15 could be deleted...
I just started migrating all zkvm jobs to s390x-kvm-sle12, when all seems well I will delete the zkvm Machine entry.
Done, MRs can be found here, here and here
I'm keeping an eye on the newly scheduled s390 ex-zkvm jobs just to be sure, but all looks good for now.
I think next week I will follow up by replacing s390-kvm-sle15 as well, in all job groups that reference it.
Updated by mgriessmeier over 3 years ago
- Description updated (diff)
- Status changed from In Progress to Feedback
- Priority changed from High to Normal
Updated by okurz over 3 years ago
- Description updated (diff)
I have added the additional setting _MACHINE_COMMENT="As temporary workaround worker class set to s390x-kvm-sle12 instead of s390x-kvm-sle15, see https://progress.opensuse.org/issues/93119"
to the MACHINE config "s390x-kvm-sle15" on https://openqa.suse.de/admin/machines
We still need to update existing, scheduled jobs though. With
openqa-cli api --pretty --osd jobs state=scheduled machine=zkvm | jq '.jobs | .[] | select(.settings.WORKER_CLASS=="svirt") | .id'
we can identify currently scheduled jobs that request the worker class "svirt", which no longer has a worker available. I don't know how to update the settings of an existing job using openqa-cli though; I tried
openqa-cli api --pretty -X put --osd jobs/6287610 -d '{"settings[WORKER_CLASS]": "s390x-kvm-sle12"}'
but we can also do it over SQL. So I did:
for i in $(openqa-cli api --pretty --osd jobs state=scheduled machine=zkvm | jq '.jobs | .[] | select(.settings.WORKER_CLASS=="svirt") | .id'); do ssh osd "sudo -u geekotest psql --command=\"update job_settings set value = 's390-kvm-sle12' where job_id = $i and key = 'WORKER_CLASS';\" openqa"; done
and the same for WORKER_CLASS "s390x-kvm-sle15". Now to check if jobs are being picked up.
EDIT: Jobs were not picked up until mkittler restarted the openQA scheduler. Likely there is some in-memory caching going on, so the scheduler does not see manual updates in the database.
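The filter-then-update approach above can be sketched in Python against a stand-in database. This is only an illustration under assumptions: the JSON sample is invented and contains just the fields the jq filter reads (`.jobs[].id`, `.jobs[].settings.WORKER_CLASS`), the in-memory SQLite table copies only the three `job_settings` columns that appear in the SQL statement, and the helper names are hypothetical.

```python
import json
import sqlite3

# Hypothetical API response; real responses carry many more keys.
response = json.loads("""
{"jobs": [
  {"id": 101, "settings": {"WORKER_CLASS": "svirt", "MACHINE": "zkvm"}},
  {"id": 102, "settings": {"WORKER_CLASS": "s390-kvm-sle12", "MACHINE": "zkvm"}},
  {"id": 103, "settings": {"WORKER_CLASS": "svirt", "MACHINE": "zkvm"}}
]}
""")


def stale_job_ids(data, old_class):
    """Mirror the jq filter: ids of jobs whose WORKER_CLASS equals old_class."""
    return [job["id"] for job in data["jobs"]
            if job["settings"].get("WORKER_CLASS") == old_class]


def reassign(conn, job_ids, new_class):
    """Rewrite WORKER_CLASS for the given jobs; return the number of rows changed."""
    changed = 0
    for job_id in job_ids:
        cur = conn.execute(
            "UPDATE job_settings SET value = ? "
            "WHERE job_id = ? AND key = 'WORKER_CLASS'",
            (new_class, job_id),
        )
        changed += cur.rowcount
    conn.commit()
    return changed


# Stand-in for the real database: an in-memory table with the same columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_settings (job_id INTEGER, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO job_settings VALUES (?, ?, ?)",
    [(job["id"], k, v) for job in response["jobs"]
     for k, v in job["settings"].items()],
)

ids = stale_job_ids(response, "svirt")
changed = reassign(conn, ids, "s390-kvm-sle12")
remaining = conn.execute(
    "SELECT COUNT(*) FROM job_settings "
    "WHERE key = 'WORKER_CLASS' AND value = 'svirt'"
).fetchone()[0]
print(ids, changed, remaining)
```

Bound parameters avoid the nested shell quoting of the one-liner. The sketch says nothing about scheduler behaviour; on the real instance a scheduler restart was still needed before the jobs were picked up.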
Updated by okurz over 3 years ago
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host" to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-.*T.*Error connecting to <root@s390p.*.suse.de>: No route to host"
Updated by okurz over 3 years ago
- Related to action #94465: [tools] zkvm tests are scheduled by retriggering month old jobs even though we do not have any "svirt" workers anymore added
Updated by tinita over 3 years ago
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-.*T.*Error connecting to <root@s390p.*.suse.de>: No route to host" to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12
Temporarily disabled auto_review:
auto_review:"(?s)2021-.*T.*Error connecting to <root@s390p.*.suse.de>: No route to host"
Because the regex is slow and can take over 4 minutes.
We should improve the regex. Is there a log example with lines we have to match?
See also #96713
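One possible way to make the check cheaper, sketched here under assumptions, is to replace the two cross-line `.*` segments with plain substring searches and to keep a regex only for the single-line error message. The slowness is likely due to backtracking over those segments on large logs that do not contain the error, though the ticket does not confirm this. The tightened error pattern (escaped dots, host name limited to the text inside the angle brackets) and the sample logs are my own and not taken from the ticket.

```python
import re

# Original pattern from the disabled auto_review entry.
ORIGINAL = re.compile(
    r"(?s)2021-.*T.*Error connecting to <root@s390p.*.suse.de>: No route to host"
)

# Tightened single-line error pattern (an assumption, see lead-in).
ERROR = re.compile(
    r"Error connecting to <root@s390p[^>\n]*\.suse\.de>: No route to host"
)


def matches_fast(log_text):
    """Find '2021-', then a later 'T', then the error, each in one forward pass."""
    start = log_text.find("2021-")
    if start == -1:
        return False
    t_pos = log_text.find("T", start + len("2021-"))
    if t_pos == -1:
        return False
    return ERROR.search(log_text, t_pos + 1) is not None


# Hypothetical log excerpts.
log_hit = (
    "[2021-07-01T08:00:00] start\n"
    "Error connecting to <root@s390p7.suse.de>: No route to host\n"
)
log_miss = "[2021-07-01T08:00:00] start\nall good\n"
log_no_stamp = "Error connecting to <root@s390p7.suse.de>: No route to host\n"

for sample in (log_hit, log_miss, log_no_stamp):
    print(matches_fast(sample), ORIGINAL.search(sample) is not None)
```

On these samples both checks agree. A real log example would still be needed to confirm that the tightened error pattern matches what the backend actually prints.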
Updated by okurz over 3 years ago
I guess by now we should really not have any more jobs matching the original pattern unless people really try to retrigger multi-month old jobs
Updated by tinita over 3 years ago
The 8-day-old tests https://openqa.suse.de/tests/6795459 and https://openqa.suse.de/tests/6795460 still have WORKER_CLASS=s390x-kvm-sle15
Since there is no worker matching that, I cancelled the jobs.