action #93119
closed
[s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12
Added by mgriessmeier over 3 years ago.
Updated about 1 year ago.
Description
For your information, the Mainframe zEC12, which hosted our LPARs s390p7 (s390x-kvm-sle15), s390p8 (s390x-kvm-sle12) and s390p9 (former QAM), needed to be shut down, as we will get a new Mainframe in the upcoming weeks.
This won't have much impact for now, as we have backup workers for s390x-kvm-sle12. For s390x-kvm-sle15 we don't have any, but as far as I'm aware the corresponding jobs have been modified to run on s390x-kvm-sle12 (if not, please do so, as s390x-kvm-sle15 won't be functional as of now).
In the upcoming days I will set up more workers on our other mainframe, zEC13. For now, if you see failing test cases on worker_class 's390x-kvm-sle12', a simple restart should work, as the job will be triggered on the zEC13 workers (s390zp18).
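For reference, such a restart can also be triggered over the API; a minimal sketch (the job id is a placeholder):
openqa-cli api --osd -X post jobs/<JOB_ID>/restart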
Additionally, it seems there are still tests running on worker_class 'zkvm', which refers to an old LPAR s390pb that was also shut down. So the second goal of this ticket is to get rid of this machine as well.
Tasks
Rollback
- Remove workaround machine class settings from openqa.suse.de/admin/machines (the machine table can be inspected over the API, see the sketch below)
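A sketch for listing the relevant machine definitions over the API (the jq filter is illustrative and assumes the table API returns a "Machines" array):
openqa-cli api --osd machines | jq '.Machines[] | select(.name | test("s390")) | {name, settings}'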
- Description updated (diff)
- Target version set to future
cool, thank you. For now you can track it as a "personal" task. Unless you can resolve it yourself, please take care to assign it to a squad eventually. I suggest either [qe-core] or [tools].
- Description updated (diff)
- Target version deleted (future)
- Target version set to future
@geor I assume you deleted the target version by mistake in an edit conflict, setting back to "Future".
On request by mgriessmeier, creating a temporary worker configuration on grenache-1.qa in /etc/openqa/workers.ini:
[66]
# unique per-ticket worker class so that only manually scheduled jobs match, plus the usual grenache-1 class
WORKER_CLASS=s390-kvm-sle12-poo93119-okurz,grenache-1
NETDEV=eth0
SUT_IP=10.161.145.90
# svirt backend settings pointing to the LPAR hosting the KVM guest
VIRSH_HOSTNAME=s390zp19.suse.de
VIRSH_PASSWORD=nots3cr3t
VIRSH_GUEST=10.161.145.90
VIRSH_MAC=52:54:00:12:5c:d6
VIRSH_CMDLINE=ifcfg=dhcp
VIRSH_INSTANCE=1
and starting it with systemctl start openqa-worker@66. Scheduled a new job with:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/6107218 _GROUP=0 WORKER_CLASS=s390-kvm-sle12-poo93119-okurz
Created job #6111084: sle-15-SP2-Server-DVD-Updates-s390x-Build20210526-1-ltp_cve@s390x-kvm-sle12 -> https://openqa.suse.de/tests/6111084
EDIT: job is fine
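For reference, one way to check that the temporary worker instance registered with the web UI; a sketch (the jq filter is illustrative):
openqa-cli api --osd workers | jq '.workers[] | select(.host == "grenache-1") | {instance, status}'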
- Description updated (diff)
- Description updated (diff)
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry
As the new workers are in place, we can now use auto-review not only to label openQA jobs but to retrigger them as well, limiting the match to jobs within the timeframe 2021-05-21 to 2021-05-25 so that potential future problems are not matched as well.
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host"
Wait. If I understand the intended change correctly, we will not have any machine "zkvm" anymore at all, so we would need to change all test schedules, correct?
EDIT: for the sake of being backward-compatible, how about applying the worker class "svirt" to some or all s390-kvm-sle12 instances?
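A worker instance can carry multiple comma-separated classes in workers.ini, so something along these lines (a sketch of the idea, not an applied change) would keep old 'svirt' schedules matching:
WORKER_CLASS=s390x-kvm-sle12,svirt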
okurz wrote:
Wait. If I understand the intended change correctly, we will not have any machine "zkvm" anymore at all, so we would need to change all test schedules, correct?
Well, for now we still have the zkvm Machine entry; it just points to an s390-kvm-sle12 worker class, so it should not affect scheduling. But maybe it should be re-adapted later to be consistent.
yes, I saw that now as well. So it seems like someone updated the machine. Then what we would need is manual tinkering with the API to reschedule all failed tests with the updated worker class, hm …
okurz wrote:
yes, I saw that now as well. So it seems like someone updated the machine. Then what we would need is manual tinkering with the API to reschedule all failed tests with the updated worker class, hm …
I took the liberty of updating the machine entry; I am also in the process of rescheduling any zkvm s390 jobs to update their settings, plus further potential tinkering.
yeah technically the Machines 'zkvm' and s390-kvm-sle15 could be deleted...
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host" to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry:WORKER_CLASS=s390-kvm-sle12
@geor I think I have an idea how to extend auto-review to retrigger tests with changed settings. I hope you leave some old un-retriggered fails for me :D
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry:WORKER_CLASS=s390-kvm-sle12 to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host"
ok, never mind. I will follow up with my approach elsewhere. For now you can go ahead with your way of handling all failed jobs that you can find.
okurz wrote:
@geor I think I have an idea how to extend auto-review to retrigger tests with changed settings. I hope you leave some old un-retriggered fails for me :D
Sorry, just saw it! But this sounds useful!
mgriessmeier wrote:
yeah technically the Machines 'zkvm' and s390-kvm-sle15 could be deleted...
I just started migrating all zkvm jobs to s390x-kvm-sle12; when all seems well I will delete the zkvm Machine entry.
geor wrote:
mgriessmeier wrote:
yeah technically the Machines 'zkvm' and s390-kvm-sle15 could be deleted...
I just started migrating all zkvm jobs to s390x-kvm-sle12; when all seems well I will delete the zkvm Machine entry.
Done, the MRs can be found here, here and here.
I'm keeping an eye on the newly scheduled s390 ex-zkvm jobs just to be sure, but all looks good for now.
I think next week I will follow up with replacing s390-kvm-sle15 as well in all job groups that reference it.
- Description updated (diff)
- Status changed from In Progress to Feedback
- Priority changed from High to Normal
- Description updated (diff)
I have added the additional setting _MACHINE_COMMENT="As temporary workaround worker class set to s390x-kvm-sle12 instead of s390x-kvm-sle15, see https://progress.opensuse.org/issues/93119"
to the MACHINE config "s390x-kvm-sle15" on https://openqa.suse.de/admin/machines
We still need to update existing scheduled jobs though. With
openqa-cli api --pretty --osd jobs state=scheduled machine=zkvm | jq '.jobs | .[] | select(.settings.WORKER_CLASS=="svirt") | .id'
we can identify currently scheduled jobs targeting the worker class "svirt", for which no worker is available anymore. I don't know how to update the settings of an existing job using openqa-cli (I tried
openqa-cli api --pretty -X put --osd jobs/6287610 -d '{"settings[WORKER_CLASS]": "s390x-kvm-sle12"}'
), but we can also do it over SQL. So I did:
for i in $(openqa-cli api --pretty --osd jobs state=scheduled machine=zkvm | jq '.jobs | .[] | select(.settings.WORKER_CLASS=="svirt") | .id'); do ssh osd "sudo -u geekotest psql --command=\"update job_settings set value = 's390-kvm-sle12' where job_id = $i and key = 'WORKER_CLASS';\" openqa"; done
and the same for WORKER_CLASS "s390x-kvm-sle15". Now to check if jobs are being picked up.
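To verify, the identification query from above can simply be re-run; after the update it should print no job ids anymore:
openqa-cli api --pretty --osd jobs state=scheduled machine=zkvm | jq '.jobs | .[] | select(.settings.WORKER_CLASS=="svirt") | .id'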
EDIT: Jobs were not picked up until mkittler restarted the openQA scheduler. Likely there is some in-memory caching going on and the scheduler does not see manual updates in the database.
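The fix was a restart of the scheduler service on the host, presumably along the lines of (a sketch, the exact invocation is an assumption):
ssh osd "sudo systemctl restart openqa-scheduler"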
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host" to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-.*T.*Error connecting to <root@s390p.*.suse.de>: No route to host"
- Related to action #94465: [tools] zkvm tests are scheduled by retriggering month old jobs even though we do not have any "svirt" workers anymore added
- Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-.*T.*Error connecting to <root@s390p.*.suse.de>: No route to host" to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12
Temporarily disabled the auto_review pattern
auto_review:"(?s)2021-.*T.*Error connecting to <root@s390p.*.suse.de>: No route to host"
because the regex is slow and can take over 4 minutes.
We should improve the regex. Is there a log example with lines we have to match?
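One idea, as an untested sketch: drop the leading '(?s)2021-.*T.*' part, which forces the engine to scan across the whole log, and match only the error line itself, e.g.
auto_review:"Error connecting to <root@s390p[^>]*\.suse\.de>: No route to host"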
See also #96713
I guess by now we should really not have any more jobs matching the original pattern unless people really try to retrigger months-old jobs.
- Status changed from Feedback to Closed