#108845 was resolved but does not fix the issue: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=96&tab=alert still shows broken workers. Filtering for "Broken" on https://openqa.suse.de/admin/workers I can find powerqaworker-qam-1:1 and other instances on the same host. I logged in to the system over ssh and found that openqa-worker@{1..6}
are running although they should not be, see https://gitlab.suse.de/openqa/salt-states-openqa#remarks-about-the-systemd-units-used-to-start-workers . Who did that again?
I ran
systemctl mask --now openqa-worker@{1..6} && systemctl enable --now openqa-worker-auto-restart@{1..6}
and everything looks ok again, but I expect that sooner or later someone will try the same mistaken approach again. Eventually we should find a better solution that does not rely on multiple systemd services for the same purpose, e.g. handling the differences via configuration within a single service. Reported in #109734
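Until we have that better solution, a check like the following could flag hosts where plain openqa-worker@N units are active instead of the auto-restart variants. This is only a sketch: the unit names follow the salt-states-openqa convention above, and `check_units` is a hypothetical helper that parses `systemctl list-units --plain --no-legend 'openqa-worker*'` style output.

```shell
# Hypothetical helper: read "UNIT LOAD ACTIVE SUB DESCRIPTION" lines on
# stdin and print any plain openqa-worker@N unit that is active, i.e. a
# unit that should be masked in favour of openqa-worker-auto-restart@N.
check_units() {
  awk '$1 ~ /^openqa-worker@[0-9]+\.service$/ && $3 == "active" { print $1 }'
}

# Example input as it might look on an affected host:
check_units <<'EOF'
openqa-worker@5.service loaded active running openQA Worker #5
openqa-worker-auto-restart@1.service loaded active running openQA Worker #1
EOF
# prints: openqa-worker@5.service
```

On a real host one would pipe the actual systemctl output into the helper instead of the heredoc.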
I also found openqaworker2:5 broken due to the same problem. Someone started openqa-worker@5 although they should not have; this is visible in the shell history of the root user.
Based on the login times I suspect it was jpupava:
openqaworker2:/home/okurz # history | grep 'openqa-worker@5'
613 2022-04-07 11:45:39 systemctl status openqa-worker@5
618 2022-04-07 11:46:58 systemctl restart openqa-worker@5
620 2022-04-07 11:47:48 systemctl status openqa-worker@5
624 2022-04-07 11:49:09 systemctl restart openqa-worker@5
626 2022-04-07 11:49:12 systemctl restart openqa-worker@5
628 2022-04-07 11:49:18 systemctl stop openqa-worker@5
630 2022-04-07 11:49:26 systemctl status openqa-worker@5
631 2022-04-07 11:49:31 systemctl start openqa-worker@5
…
openqaworker2:/home/okurz # last | head -n 20
jpupava pts/6 10.100.12.155 Thu Apr 7 12:08 - 14:22 (02:13)
jpupava pts/6 10.100.12.155 Thu Apr 7 11:45 - 12:03 (00:17)
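The correlation above (timestamps from root's history vs. login windows from `last`) can be sketched as a small helper. The times below are taken from the output above; the function name `in_session` is hypothetical:

```shell
# Hypothetical helper: check whether a HH:MM timestamp (as printed by a
# HISTTIMEFORMAT-stamped history) falls inside a login window from `last`.
# Works on minutes since midnight; 10# forces base 10 so "08"/"09" are
# not parsed as invalid octal numbers.
in_session() {  # usage: in_session HH:MM LOGIN_HH:MM LOGOUT_HH:MM
  local t=$((10#${1%:*} * 60 + 10#${1#*:}))
  local from=$((10#${2%:*} * 60 + 10#${2#*:}))
  local to=$((10#${3%:*} * 60 + 10#${3#*:}))
  [ "$t" -ge "$from" ] && [ "$t" -le "$to" ]
}

# The suspicious `systemctl restart openqa-worker@5` ran at 11:46;
# jpupava's session from the `last` output spans 11:45 - 12:03:
if in_session 11:46 11:45 12:03; then
  echo "command ran during this session"
fi
```

This obviously only narrows down who was logged in at the time; it does not prove who typed the command.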
I will tell him over chat. Done in https://suse.slack.com/archives/C02CANHLANP/p1649499490207389
Now there should be no more broken workers. I monitored the alert and unpaused it. Problem solved, rollback steps completed.