action #58945
closed
OpenQA worker service not restarted after OpenQA update
Added by MDoucha about 5 years ago.
Updated about 5 years ago.
Description
The openqa-worker service on some openqa.suse.de workers doesn't get restarted after an update. This may cause a version mismatch between the os-autoinst and openQA-common packages.
One example of this mismatch is the following three verification runs for https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8329:
openqaworker2: https://openqa.suse.de/tests/3541705 (openqa-worker service last restarted on 2019-10-30)
openqaworker6: https://openqa.suse.de/tests/3541697 (openqa-worker service last restarted on 2019-09-18)
openqaworker9: https://openqa.suse.de/tests/3544337 (openqa-worker service last restarted on 2019-09-18)
All three jobs ran the same test modules (see the autoinst log), but all tests after install_ltp were scheduled at runtime. Updating the test schedule at runtime requires patches merged into openQA on 2019-09-27, so openqaworker6 and openqaworker9 did not update the test schedule: they were still running openQA-common from mid-September, before the patches were merged.
For example,
ps -u _openqa-worker auxf
on openqaworker3 shows that the worker services were last restarted on Sept. 18, whereas the two cache services were restarted (correctly) on Oct. 30.
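The same staleness check can be done without reading process listings by comparing the service start time with the package install time. This is a minimal sketch, not something from the ticket: the function names is_stale and check_worker are hypothetical, the default unit and package names are assumptions taken from the description, and it relies on GNU date and a systemd version that supports `systemctl show --value`.

```shell
#!/bin/sh
# Pure comparison: succeeds when the service started before the package
# was installed, i.e. the running worker predates the update.
# $1: service start time, $2: package install time (both epoch seconds)
is_stale() {
    [ "$1" -lt "$2" ]
}

# Hypothetical wrapper; the defaults are assumptions, adjust as needed.
check_worker() {
    unit=${1:-openqa-worker@1}
    pkg=${2:-openQA-common}
    # ActiveEnterTimestamp is when the unit last entered the active state.
    started=$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit")" +%s) || return 2
    # INSTALLTIME is already reported in epoch seconds.
    installed=$(rpm -q --qf '%{INSTALLTIME}\n' "$pkg") || return 2
    if is_stale "$started" "$installed"; then
        echo "$unit is older than $pkg: restart needed"
    else
        echo "$unit is up to date"
    fi
}
```

Running check_worker on a worker host prints one line per call; it could also be passed to salt cmd.run in the same way as the commands in this ticket.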
- Status changed from New to In Progress
- Assignee set to okurz
- Target version set to Current Sprint
Hm, I wonder why openqa-worker@1 was restarted on powerqaworker-qam-1 yesterday at 07:46:51 CET. It sounds like it was done during deployment. My hypothesis for https://progress.opensuse.org/issues/58945 is that the restart works on all workers where openqa-worker.target is enabled. https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8329#issuecomment-548266424 mentions the workers where mdoucha wants the services to restart.
okurz@openqa:/srv/pillar> sudo salt -l error --state-output=changes \* cmd.run 'systemctl is-active openqa-worker.target ; systemctl status openqa-worker@1 | grep "Active.*since"'
openqaworker7.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:41:21 CEST; 1 months 12 days ago
QA-Power8-5-kvm.qa.suse.de:
active
Active: active (running) since Wed 2019-10-30 07:46:50 CET; 1 day 7h ago
powerqaworker-qam-1:
active
Active: active (running) since Wed 2019-10-30 07:46:51 CET; 1 day 7h ago
openqaworker3.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:40:42 CEST; 1 months 12 days ago
openqaworker2.suse.de:
active
Active: active (running) since Wed 2019-10-30 07:46:53 CET; 1 day 7h ago
openqa-monitor.qa.suse.de:
inactive
Unit openqa-worker@1.service could not be found.
openqa.suse.de:
inactive
Unit openqa-worker@1.service could not be found.
openqaworker13.suse.de:
inactive
Active: active (running) since Wed 2019-10-30 10:17:57 CET; 1 day 5h ago
malbec.arch.suse.de:
active
Active: active (running) since Wed 2019-10-30 07:46:52 CET; 1 day 7h ago
openqaworker-arm-2.suse.de:
inactive
Active: active (running) since Tue 2019-10-29 17:12:00 UTC; 1 day 21h ago
openqaworker5.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:42:12 CEST; 1 months 12 days ago
openqaworker9.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:41:36 CEST; 1 months 12 days ago
grenache-1.qa.suse.de:
inactive
Active: active (running) since Mon 2019-09-30 14:17:36 CEST; 1 months 0 days ago
openqaworker8.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:41:44 CEST; 1 months 12 days ago
QA-Power8-4-kvm.qa.suse.de:
active
Active: active (running) since Wed 2019-10-30 07:46:52 CET; 1 day 7h ago
openqaworker-arm-3.suse.de:
inactive
Active: active (running) since Tue 2019-10-22 07:07:59 CEST; 1 weeks 2 days ago
openqaworker-arm-1.suse.de:
inactive
Active: active (running) since Mon 2019-10-14 10:11:08 UTC; 2 weeks 3 days ago
openqaworker6.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:41:25 CEST; 1 months 12 days ago
ERROR: Minions returned with non-zero exit code
So on all workers – except openqaworker13 (I assume someone manually tinkered) – the worker services were restarted during deployment only where openqa-worker.target was active.
I am pretty sure the restart is triggered by the package install, because https://github.com/os-autoinst/openQA/blob/master/openQA.spec#L21 mentions the target but not the worker template.
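The comparison made from the salt output above (target state against the last start of openqa-worker@1) can be tabulated with a small filter. This is only a sketch: summarise_workers is a hypothetical helper, and it assumes each minion block consists of a "host:" line, a state line and an "Active:" line, as in the paste above. Minions without the unit are skipped.

```shell
#!/bin/sh
# Print "<host> <target state> <start date of openqa-worker@1>" for each
# minion block read from stdin.
summarise_workers() {
    awk '
        { sub(/^[ \t]+/, "") }                       # drop salt indentation
        /^[^ ]+:$/ {                                 # minion name line
            host = substr($0, 1, length($0) - 1)
            next
        }
        /^(active|inactive)$/ { state[host] = $0; next }
        /^Active:/ && match($0, /since [A-Za-z]+ [0-9-]+/) {
            split(substr($0, RSTART, RLENGTH), f, " ")
            print host, state[host], f[3]            # f[3] is the date
        }
    '
}
```

Usage, assuming the salt command from above is available: append `| summarise_workers | sort -k3` to it to list the workers ordered by start date.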
Fixed in https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/216, deployed manually because certificates in the standard container images on gitlab.suse.de currently do not work. Restarted all worker targets and checked the worker instances:
okurz@openqa:/srv/salt> sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'systemctl restart openqa-worker.target ; systemctl status openqa-worker@1 | grep "Active.*since"'
malbec.arch.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 24ms ago
powerqaworker-qam-1:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 47ms ago
QA-Power8-4-kvm.qa.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 56ms ago
openqaworker-arm-1.suse.de:
Active: deactivating (stop-sigterm) since Thu 2019-10-31 15:22:33 UTC; 53ms ago
openqaworker-arm-3.suse.de:
Active: deactivating (stop-sigterm) since Thu 2019-10-31 16:22:33 CET; 131ms ago
openqaworker-arm-2.suse.de:
Active: inactive (dead) since Thu 2019-10-31 15:22:33 UTC; 101ms ago
grenache-1.qa.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 2s ago
openqaworker2.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 15s ago
QA-Power8-5-kvm.qa.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 1min 1s ago
openqaworker7.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:49 CET; 1min 13s ago
openqaworker3.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:36 CET; 1min 27s ago
openqaworker13.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:34 CET; 1min 29s ago
openqaworker6.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:47 CET; 1min 16s ago
openqaworker5.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:36 CET; 1min 27s ago
openqaworker9.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:36 CET; 1min 27s ago
openqaworker8.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:35 CET; 1min 28s ago
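In the output above, three ARM workers were sampled while still stopping or already stopped, since the status was queried right after the restart. A filter like the following can list such minions for a follow-up check. It is a sketch only: not_running is a hypothetical helper and assumes the "host:" / "Active:" layout shown above.

```shell
#!/bin/sh
# Print every minion whose openqa-worker@1 is not "active (running)",
# together with the state systemd reported for it.
not_running() {
    awk '
        { sub(/^[ \t]+/, "") }                       # drop salt indentation
        /^[^ ]+:$/ {                                 # minion name line
            host = substr($0, 1, length($0) - 1)
            next
        }
        /^Active:/ && $0 !~ /^Active: active \(running\)/ {
            print host ": " $2, $3
        }
    '
}
```

An empty result after re-running the status command a little later would indicate that all worker instances came back up.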
- Status changed from In Progress to Resolved