action #93119

[s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-.*T.*Error connecting to <root@s390p.*.suse.de>: No route to host"

Added by mgriessmeier 2 months ago. Updated about 1 month ago.

Status: Feedback
Priority: Normal
Assignee:
Target version:
Start date: 2021-05-26
Due date:
% Done: 0%
Estimated time:

Description

For your information, the Mainframe zEC12, which hosted our LPARs s390p7 (s390x-kvm-sle15), s390p8 (s390x-kvm-sle12) and s390p9 (former QAM), needed to be shut down, as we will get a new Mainframe in the upcoming weeks.
This won't have much impact for now, as we have backup workers for s390x-kvm-sle12. For s390x-kvm-sle15 we don't have any, but as far as I'm aware the corresponding jobs have been modified to run on s390x-kvm-sle12. (If not, please do so, as s390-kvm-sle15 won't be functional for now.)

In the upcoming days I will set up more workers on our other mainframe zEC13. For now, if you see failing test cases on worker_class 's390x-kvm-sle12', a simple restart should work, as the job will then run on the zEC13 workers (s390zp18).

Additionally, it seems there are still tests running on worker_class 'zkvm', which refers to an old LPAR, s390pb, that was also shut down. So the second goal of this ticket is to get rid of this machine as well.

Tasks

Rollback

  • Remove workaround machine class settings from openqa.suse.de/admin/machines

Related issues

Related to openQA Tests - action #94465: [tools] zkvm tests are scheduled by retriggering month old jobs even though we do not have any "svirt" workers anymore | Resolved | 2021-06-22 | 2021-07-06

History

#1 Updated by mgriessmeier 2 months ago

  • Description updated (diff)

#2 Updated by okurz 2 months ago

  • Target version set to future

cool, thank you. For now you can track it as a "personal" task. Unless you can resolve it yourself please take care to assign it to a squad eventually. I suggest either [qe-core] or [tools].

#3 Updated by geor 2 months ago

  • Description updated (diff)
  • Target version deleted (future)

#4 Updated by okurz 2 months ago

  • Target version set to future

geor I assume you deleted the target version by mistake in an edit conflict, setting back to "Future".

On request by mgriessmeier creating a temporary worker configuration on grenache-1.qa. In /etc/openqa/workers.ini:

[66]
WORKER_CLASS=s390-kvm-sle12-poo93119-okurz,grenache-1
NETDEV=eth0
SUT_IP=10.161.145.90
VIRSH_HOSTNAME=s390zp19.suse.de
VIRSH_PASSWORD=nots3cr3t
VIRSH_GUEST=10.161.145.90
VIRSH_MAC=52:54:00:12:5c:d6
VIRSH_CMDLINE=ifcfg=dhcp
VIRSH_INSTANCE=1

and starting with systemctl start openqa-worker@66. Scheduled new job with

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/6107218 _GROUP=0 WORKER_CLASS=s390-kvm-sle12-poo93119-okurz

Created job #6111084: sle-15-SP2-Server-DVD-Updates-s390x-Build20210526-1-ltp_cve@s390x-kvm-sle12 -> https://openqa.suse.de/tests/6111084

EDIT: job is fine
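
As a side note on the temporary worker configuration above: before cloning a job against a custom class, it can help to confirm which classes a worker slot actually advertises. The following is only a minimal Python sketch, assuming the relevant part of workers.ini is plain INI text; the helper name `worker_classes` is hypothetical, the sample is a subset of the settings above, and openQA itself does not read the file this way.

```python
import configparser


def worker_classes(ini_text, slot):
    """Return the list of worker classes configured for one worker slot."""
    # interpolation=None: treat '%' in values literally.
    parser = configparser.ConfigParser(interpolation=None)
    # Keep upper-case keys such as WORKER_CLASS (default is lower-casing).
    parser.optionxform = str
    parser.read_string(ini_text)
    raw = parser.get(slot, "WORKER_CLASS", fallback="")
    return [name.strip() for name in raw.split(",") if name.strip()]


# Subset of the temporary section from this comment (password left out).
SAMPLE = """
[66]
WORKER_CLASS=s390-kvm-sle12-poo93119-okurz,grenache-1
VIRSH_HOSTNAME=s390zp19.suse.de
VIRSH_INSTANCE=1
"""

classes = worker_classes(SAMPLE, "66")
print(classes)  # ['s390-kvm-sle12-poo93119-okurz', 'grenache-1']
```

A cloned job with WORKER_CLASS=s390-kvm-sle12-poo93119-okurz would then only be eligible for slots whose list contains that class.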

#5 Updated by mgriessmeier 2 months ago

  • Description updated (diff)

new LPAR installed, https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/320 created, waiting for merge

#6 Updated by mgriessmeier 2 months ago

  • Description updated (diff)

#7 Updated by okurz 2 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/320 was merged. Tests were picked up to run on the new instances. So far no problems observed.

#8 Updated by okurz 2 months ago

  • Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry

As the new workers are in place, we can now use auto-review not only to label openQA jobs but also to retrigger them, selecting only jobs within the timeframe 2021-05-21 to 2021-05-25 so that we do not also match potential future problems.
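
To illustrate the effect of the date restriction, here is a minimal Python sketch that applies the pattern from the new subject to two log excerpts. The log lines and the helper name `is_known_issue` are made up for illustration, and the real auto-review tooling may apply the pattern differently.

```python
import re

# Pattern from the ticket subject, restricted to 2021-05-21 .. 2021-05-25.
# (?s) lets "." also match newlines, so the timestamp and the error
# message may sit on different lines of the log.
DATED = re.compile(
    r'(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host'
)


def is_known_issue(log_text):
    """Return True if the log excerpt matches the date-limited pattern."""
    return DATED.search(log_text) is not None


# Hypothetical log excerpts, not copied from a real job.
in_range = (
    "[2021-05-23T10:15:02.123 CEST] [debug] starting test\n"
    "Error connecting to <root@s390p8.suse.de>: No route to host\n"
)
out_of_range = (
    "[2021-06-02T08:00:00.000 CEST] [debug] starting test\n"
    "Error connecting to <root@s390p8.suse.de>: No route to host\n"
)

print(is_known_issue(in_range))      # True
print(is_known_issue(out_of_range))  # False
```

The same error on a later date is deliberately left unmatched, so a new connectivity problem would not be labelled or retriggered under this ticket.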

#9 Updated by okurz 2 months ago

  • Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host"

Wait. If I understand the intended change correctly we will not have any machine "zkvm" anymore at all so we would need to change all test schedules, correct?

EDIT: for the sake of being backward-compatible, how about applying the worker class "svirt" to some or all s390-kvm-sle12 instances?

#10 Updated by geor 2 months ago

okurz wrote:

Wait. If I understand the intended change correctly we will not have any machine "zkvm" anymore at all so we would need to change all test schedules, correct?

Well, for now we still have the zkvm Machine entry, it just points to an s390-kvm-sle12 worker class, so it should not affect scheduling. But maybe it should be re-adapted later to be consistent.

#11 Updated by okurz 2 months ago

yes, I saw that now as well. So it seems like someone updated the machine. Then what we would need is manual tinkering with API to reschedule all failed tests with the updated worker class, hm …

#12 Updated by geor 2 months ago

okurz wrote:

yes, I saw that now as well. So it seems like someone updated the machine. Then what we would need is manual tinkering with API to reschedule all failed tests with the updated worker class, hm …

I took the liberty of updating the machine entry. I am also in the process of rescheduling any zkvm s390 jobs to update their settings, and will do any further tinkering as needed.

#13 Updated by mgriessmeier 2 months ago

yeah technically the Machines 'zkvm' and s390-kvm-sle15 could be deleted...

#14 Updated by okurz 2 months ago

  • Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host" to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry:WORKER_CLASS=s390-kvm-sle12

geor I think I have an idea how to extend auto-review to retrigger tests with changed settings. I hope you leave some old un-retriggered fails for me :D

#15 Updated by okurz 2 months ago

  • Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host":retry:WORKER_CLASS=s390-kvm-sle12 to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host"

OK, never mind. I will follow up with my approach elsewhere. For now you can go ahead your way and handle all failed jobs that you can find.

#16 Updated by geor 2 months ago

okurz wrote:

geor I think I have an idea how to extend auto-review to retrigger tests with changed settings. I hope you leave some old un-retriggered fails for me :D

Sorry just saw it! But this sounds useful!

#17 Updated by geor 2 months ago

mgriessmeier wrote:

yeah technically the Machines 'zkvm' and s390-kvm-sle15 could be deleted...

I just started migrating all zkvm jobs to s390x-kvm-sle12. When all seems well I will delete the zkvm Machine entry.

#18 Updated by geor 2 months ago

geor wrote:

mgriessmeier wrote:

yeah technically the Machines 'zkvm' and s390-kvm-sle15 could be deleted...

I just started migrating all zkvm jobs to s390x-kvm-sle12. When all seems well I will delete the zkvm Machine entry.

Done, MRs can be found here, here and here

I'm keeping an eye on the newly scheduled s390 ex-zkvm jobs just to be sure, but all looks good for now.

I think next week I will follow up with replacing s390-kvm-sle15 as well, in all job groups that reference it.

#19 Updated by mgriessmeier about 1 month ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback
  • Priority changed from High to Normal

#20 Updated by okurz about 1 month ago

  • Description updated (diff)

I have added the additional setting _MACHINE_COMMENT="As temporary workaround worker class set to s390x-kvm-sle12 instead of s390x-kvm-sle15, see https://progress.opensuse.org/issues/93119" to the MACHINE config "s390x-kvm-sle15" on https://openqa.suse.de/admin/machines

We still need to update existing, scheduled jobs though. With openqa-cli api --pretty --osd jobs state=scheduled machine=zkvm | jq '.jobs | .[] | select(.settings.WORKER_CLASS=="svirt") | .id' we can identify currently scheduled jobs that are scheduled for the worker class "svirt", which currently does not have a worker available anymore. I don't know how to update the settings of an existing job using openqa-cli though (I tried openqa-cli api --pretty -X put --osd jobs/6287610 -d '{"settings[WORKER_CLASS]": "s390x-kvm-sle12"}'), but we can also do it via SQL. So I did:

for i in $(openqa-cli api --pretty --osd jobs state=scheduled machine=zkvm | jq '.jobs | .[] | select(.settings.WORKER_CLASS=="svirt") | .id'); do
    ssh osd "sudo -u geekotest psql --command=\"update job_settings set value = 's390-kvm-sle12' where job_id = $i and key = 'WORKER_CLASS';\" openqa"
done

and the same for WORKER_CLASS "s390x-kvm-sle15". Now to check if jobs are being picked up.

EDIT: Jobs were not picked up until mkittler restarted the openQA scheduler. Likely there is some memory caching going on and the scheduler does not see the manual updates in the database.
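
For reference, the selection and update logic from this comment can be sketched in Python without touching the server. This is only an illustration under assumptions: the payload shape (`jobs[].id`, `jobs[].settings.WORKER_CLASS`) is inferred from the jq filter above, the second job id in the sample is made up, and the helper names `scheduled_job_ids` and `update_statements` are hypothetical. The sketch only builds the statements, it does not run them.

```python
import json


def scheduled_job_ids(payload, worker_class):
    """Mirror the jq filter: ids of jobs whose settings.WORKER_CLASS matches."""
    return [
        job["id"]
        for job in payload.get("jobs", [])
        if job.get("settings", {}).get("WORKER_CLASS") == worker_class
    ]


def update_statements(job_ids, new_class):
    """Build (sql, params) pairs; placeholders avoid hand-quoting values."""
    sql = (
        "update job_settings set value = %s "
        "where job_id = %s and key = 'WORKER_CLASS';"
    )
    return [(sql, (new_class, job_id)) for job_id in job_ids]


# Hypothetical API response; only the fields used by the filter are included.
sample = json.loads("""
{"jobs": [
  {"id": 6287610, "settings": {"MACHINE": "zkvm", "WORKER_CLASS": "svirt"}},
  {"id": 6287611, "settings": {"MACHINE": "zkvm", "WORKER_CLASS": "s390-kvm-sle12"}}
]}
""")

ids = scheduled_job_ids(sample, "svirt")
print(ids)  # [6287610]
for sql, params in update_statements(ids, "s390-kvm-sle12"):
    print(sql, params)
```

Passing the values as parameters to a database driver, rather than interpolating them into the SQL string, avoids the nested shell quoting needed in the one-liner. As noted in the EDIT above, jobs changed directly in the database were only picked up after the scheduler was restarted.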

#21 Updated by okurz about 1 month ago

  • Subject changed from [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-05-2[1-5]T.*Error connecting to <root@s390p.*.suse.de>: No route to host" to [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 auto_review:"(?s)2021-.*T.*Error connecting to <root@s390p.*.suse.de>: No route to host"

#22 Updated by okurz about 1 month ago

  • Related to action #94465: [tools] zkvm tests are scheduled by retriggering month old jobs even though we do not have any "svirt" workers anymore added
