action #58945
closed
OpenQA worker service not restarted after OpenQA update
Description
The openqa-worker service on some openqa.suse.de workers is not restarted after an update. This can cause a version mismatch between the os-autoinst and openQA-common packages.
One example of this mismatch is the following three verification runs for https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8329:
openqaworker2: https://openqa.suse.de/tests/3541705 (openqa-worker service last restarted on 2019-10-30)
openqaworker6: https://openqa.suse.de/tests/3541697 (openqa-worker service last restarted on 2019-09-18)
openqaworker9: https://openqa.suse.de/tests/3544337 (openqa-worker service last restarted on 2019-09-18)
All three jobs ran the same test modules (see the autoinst log), but all tests after install_ltp were scheduled at runtime. Updating the test schedule at runtime requires patches that were merged into openQA on 2019-09-27, so openqaworker6 and openqaworker9 did not update the test schedule: they were still running openQA-common from mid-September, before the patches were merged.
Updated by okurz almost 5 years ago
For example, ps -u _openqa-worker auxf on openqaworker3 shows me that the worker services were last restarted on Sept. 18, whereas the two cache services were restarted (correctly) on Oct. 30.
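The same comparison can be scripted instead of read off the ps output. The following is only a minimal sketch: the helper names is_stale and check_unit are hypothetical, and it assumes GNU date can parse the timestamp that systemctl prints and that the package is managed by rpm.

```shell
#!/bin/sh
# Succeeds (exit 0) when the service started before the package was installed,
# i.e. the running process predates the update.
is_stale() {
    service_start_epoch=$1
    package_install_epoch=$2
    [ "$service_start_epoch" -lt "$package_install_epoch" ]
}

# Hypothetical wrapper: compares the start time of a systemd unit with the
# install time of an rpm package. Assumes GNU date accepts the timestamp
# format reported by systemctl.
check_unit() {
    unit=$1
    package=$2
    started=$(systemctl show -p ActiveEnterTimestamp --value "$unit")
    start_epoch=$(date -d "$started" +%s)
    install_epoch=$(rpm -q --qf '%{INSTALLTIME}\n' "$package" | head -n1)
    if is_stale "$start_epoch" "$install_epoch"; then
        echo "$unit: stale (started before $package was installed)"
    else
        echo "$unit: up to date"
    fi
}
```

On a worker this could be called as check_unit openqa-worker@1 openQA-common, using the unit and package names from this ticket.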
Updated by okurz almost 5 years ago
- Status changed from New to In Progress
- Assignee set to okurz
- Target version set to Current Sprint
Hm, I wonder why openqa-worker@1 was restarted on powerqaworker-qam-1 yesterday at 07:46:51 CET. It sounds like it was done during deployment. My hypothesis for https://progress.opensuse.org/issues/58945 is that the restart works on all workers where openqa-worker.target is enabled. https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8329#issuecomment-548266424 mentions workers where mdoucha wants the services to restart.
okurz@openqa:/srv/pillar> sudo salt -l error --state-output=changes \* cmd.run 'systemctl is-active openqa-worker.target ; systemctl status openqa-worker@1 | grep "Active.*since"'
openqaworker7.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:41:21 CEST; 1 months 12 days ago
QA-Power8-5-kvm.qa.suse.de:
active
Active: active (running) since Wed 2019-10-30 07:46:50 CET; 1 day 7h ago
powerqaworker-qam-1:
active
Active: active (running) since Wed 2019-10-30 07:46:51 CET; 1 day 7h ago
openqaworker3.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:40:42 CEST; 1 months 12 days ago
openqaworker2.suse.de:
active
Active: active (running) since Wed 2019-10-30 07:46:53 CET; 1 day 7h ago
openqa-monitor.qa.suse.de:
inactive
Unit openqa-worker@1.service could not be found.
openqa.suse.de:
inactive
Unit openqa-worker@1.service could not be found.
openqaworker13.suse.de:
inactive
Active: active (running) since Wed 2019-10-30 10:17:57 CET; 1 day 5h ago
malbec.arch.suse.de:
active
Active: active (running) since Wed 2019-10-30 07:46:52 CET; 1 day 7h ago
openqaworker-arm-2.suse.de:
inactive
Active: active (running) since Tue 2019-10-29 17:12:00 UTC; 1 day 21h ago
openqaworker5.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:42:12 CEST; 1 months 12 days ago
openqaworker9.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:41:36 CEST; 1 months 12 days ago
grenache-1.qa.suse.de:
inactive
Active: active (running) since Mon 2019-09-30 14:17:36 CEST; 1 months 0 days ago
openqaworker8.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:41:44 CEST; 1 months 12 days ago
QA-Power8-4-kvm.qa.suse.de:
active
Active: active (running) since Wed 2019-10-30 07:46:52 CET; 1 day 7h ago
openqaworker-arm-3.suse.de:
inactive
Active: active (running) since Tue 2019-10-22 07:07:59 CEST; 1 weeks 2 days ago
openqaworker-arm-1.suse.de:
inactive
Active: active (running) since Mon 2019-10-14 10:11:08 UTC; 2 weeks 3 days ago
openqaworker6.suse.de:
inactive
Active: active (running) since Wed 2019-09-18 13:41:25 CEST; 1 months 12 days ago
ERROR: Minions returned with non-zero exit code
So on all workers except openqaworker13 (I assume someone tinkered with it manually), the worker services were restarted during deployment only where the target was enabled.
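To avoid scanning the output by eye, the salt output can be filtered for hosts where the target is inactive although openqa-worker@1 is running. This is a rough sketch, not part of the deployed tooling: list_untracked_workers is a hypothetical helper, and it assumes the three-lines-per-host output format shown above (host name ending in a colon, target state, then the "Active:" line).

```shell
#!/bin/sh
# Reads salt cmd.run output on stdin and prints every host whose
# openqa-worker.target is "inactive" while openqa-worker@1 is running.
list_untracked_workers() {
    awk '
        # Strip the indentation salt adds to command output.
        { sub(/^[ \t]+/, "") }
        # Status line of openqa-worker@1: report the host if the target was inactive.
        /^Active:/ {
            if (target == "inactive" && $0 ~ /active \(running\)/) print host
            next
        }
        # A line ending in ":" starts a new host section.
        /:$/ { host = substr($0, 1, length($0) - 1); target = ""; next }
        # Output of "systemctl is-active openqa-worker.target".
        $0 == "active" || $0 == "inactive" { target = $0; next }
    '
}
```

The salt command above could be piped into it, for example by appending | list_untracked_workers. Hosts without an openqa-worker@1 unit are skipped because they have no "Active:" line.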
I am pretty sure the restart is triggered by the package installation, because https://github.com/os-autoinst/openQA/blob/master/openQA.spec#L21 mentions the target but not the worker template.
Fixed in https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/216 and deployed manually, because certificates in the standard container images on gitlab.suse.de currently do not work. I restarted all worker targets and checked the worker instances:
okurz@openqa:/srv/salt> sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'systemctl restart openqa-worker.target ; systemctl status openqa-worker@1 | grep "Active.*since"'
malbec.arch.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 24ms ago
powerqaworker-qam-1:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 47ms ago
QA-Power8-4-kvm.qa.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 56ms ago
openqaworker-arm-1.suse.de:
Active: deactivating (stop-sigterm) since Thu 2019-10-31 15:22:33 UTC; 53ms ago
openqaworker-arm-3.suse.de:
Active: deactivating (stop-sigterm) since Thu 2019-10-31 16:22:33 CET; 131ms ago
openqaworker-arm-2.suse.de:
Active: inactive (dead) since Thu 2019-10-31 15:22:33 UTC; 101ms ago
grenache-1.qa.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 2s ago
openqaworker2.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 15s ago
QA-Power8-5-kvm.qa.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:33 CET; 1min 1s ago
openqaworker7.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:49 CET; 1min 13s ago
openqaworker3.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:36 CET; 1min 27s ago
openqaworker13.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:34 CET; 1min 29s ago
openqaworker6.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:47 CET; 1min 16s ago
openqaworker5.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:36 CET; 1min 27s ago
openqaworker9.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:36 CET; 1min 27s ago
openqaworker8.suse.de:
Active: active (running) since Thu 2019-10-31 16:22:35 CET; 1min 28s ago
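Because the status was queried immediately after the restart, some instances above still show "deactivating" or "inactive (dead)". A follow-up check could wait until the unit is active again. This is a minimal sketch under that assumption; wait_for_active is a hypothetical helper, and the default timeout of 60 seconds is arbitrary.

```shell
#!/bin/sh
# Polls "systemctl is-active" once per second until the unit is active
# or the timeout (in seconds, default 60) has elapsed.
# Returns 0 if the unit became active, 1 otherwise.
wait_for_active() {
    unit=$1
    timeout=${2:-60}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if systemctl is-active --quiet "$unit"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

For example, wait_for_active openqa-worker@1 120 || echo "openqa-worker@1 did not come back" could be run through salt cmd.run after the restart.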