action #164874
openosd-deployment failed due to openqaworker-arm-1 salt-minion yielding "no response" but the machine is up and reachable over ssh
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2902107#L9202 shows
openqaworker-arm-1.qe.nue2.suse.org:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20240802080801900813
preventing the osd-deployment from continuing. That kept reproducing until I ran systemctl restart salt-minion
on openqaworker-arm-1, which was otherwise fully reachable and operational.
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2902185 shows the successful retry.
journalctl -u salt-minion
shows
Aug 01 07:17:46 openqaworker-arm-1 [RPM][26364]: Transaction ID 66ab1a75 finished: 0
Aug 01 08:13:26 openqaworker-arm-1 sudo[33685]: root : PWD=/root ; USER=root ; COMMAND=/usr/bin/timeout -v --kill-after=1m 5m telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d/
Aug 01 18:19:24 openqaworker-arm-1 sudo[29545]: root : PWD=/root ; USER=root ; COMMAND=/usr/bin/timeout -v --kill-after=1m 5m telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d/
Aug 02 10:13:17 openqaworker-arm-1 systemd[1]: Stopping The Salt Minion...
so there is no log entry after 2024-08-01 18:19:24 until I ran systemctl restart salt-minion
. If we can reproduce this issue we should try attaching strace to the process.
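A possible way to do that the next time the hang shows up, as a sketch (assumes root and strace installed; the output path is an arbitrary choice):

```shell
# Sketch: attach strace to the salt-minion main process (and, via -f, its
# forked child) to see what the unresponsive process is blocked on.
trace_minion() {
  main_pid=$(systemctl show -p MainPID --value salt-minion)
  # -f follows forks, -yy decodes file descriptors, -tt timestamps syscalls
  strace -f -yy -tt -p "$main_pid" -o /tmp/salt-minion.strace
}
```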
Workaround
systemctl restart salt-minion
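If restarts keep being needed, the workaround could be guarded so the minion is only restarted when it really stops answering. A hypothetical helper, meant to run on the salt master; the 10s timeout and ssh-as-root access are assumptions:

```shell
# Hypothetical guard: restart the salt-minion service over ssh only when
# the given minion does not answer test.ping within the timeout.
restart_if_unresponsive() {
  minion=$1
  if ! salt --timeout=10 "$minion" test.ping >/dev/null 2>&1; then
    ssh "root@$minion" systemctl restart salt-minion
  fi
}
```

For example: restart_if_unresponsive openqaworker-arm-1.qe.nue2.suse.org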
Updated by livdywan 4 months ago
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2956062
openqaworker-arm-1.qe.nue2.suse.org:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20240816052647209352
ERROR: Minions returned with non-zero exit code
Updated by okurz 4 months ago
The osd deployment failed again today due to this. I logged into openqaworker-arm-1 and investigated. tail /var/log/salt/minion
at 2024-08-18 09:37:00Z shows
2024-08-18 03:40:00,724 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:05,732 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:10,740 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:10,744 [salt.minion :1145][ERROR ][2366] Error while bringing up minion for multi-master. Is master at openqa.suse.de responding?
2024-08-18 03:40:25,795 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:30,803 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:35,810 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:40,818 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:45,826 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:50,835 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
so there were timeouts during the time when OSD rebooted, but no messages afterwards. w
shows that the machine itself has been up for 8h. journalctl -b -u salt-minion
also shows the above messages plus RPM output whose origin I don't know. systemctl status salt-minion
says
● salt-minion.service - The Salt Minion
Loaded: loaded (/usr/lib/systemd/system/salt-minion.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2024-08-18 03:36:59 CEST; 8h ago
Main PID: 2163 (salt-minion)
Tasks: 5 (limit: 7372)
CGroup: /system.slice/salt-minion.service
├─ 2163 /usr/bin/python3 /usr/bin/salt-minion
└─ 2366 /usr/bin/python3 /usr/bin/salt-minion
Aug 18 07:17:11 openqaworker-arm-1 [RPM][331]: install openQA-client-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:12 openqaworker-arm-1 [RPM][331]: erase openQA-client-4.6.1723660485.b24c2671-lp155.7053.1.aarch64: success
Aug 18 07:17:12 openqaworker-arm-1 [RPM][331]: install openQA-client-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:12 openqaworker-arm-1 [RPM][331]: Transaction ID 66c183d6 finished: 0
Aug 18 07:17:12 openqaworker-arm-1 [RPM][335]: Transaction ID 66c183d8 started
Aug 18 07:17:12 openqaworker-arm-1 [RPM][335]: erase openQA-worker-4.6.1723660485.b24c2671-lp155.7053.1.aarch64: success
Aug 18 07:17:15 openqaworker-arm-1 [RPM][335]: install openQA-worker-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:17 openqaworker-arm-1 [RPM][335]: erase openQA-worker-4.6.1723660485.b24c2671-lp155.7053.1.aarch64: success
Aug 18 07:17:17 openqaworker-arm-1 [RPM][335]: install openQA-worker-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:17 openqaworker-arm-1 [RPM][335]: Transaction ID 66c183d8 finished: 0
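As an aside, the two PIDs in the CGroup listing above can also be queried directly instead of being read from the status output; a small sketch (assuming the unit has exactly one forked child):

```shell
# Sketch: query the unit's main PID and its forked child for stracing.
minion_pids() {
  main_pid=$(systemctl show -p MainPID --value salt-minion)
  child_pid=$(pgrep -P "$main_pid")
  echo "main=$main_pid child=$child_pid"
}
```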
strace -f -yy -p 2163
shows that the main salt process waits on the second one. strace -f -yy -p 2366
shows the process in a polling wait
strace: Process 2366 attached with 4 threads
[pid 13526] epoll_pwait(16<anon_inode:[eventpoll]>, <unfinished ...>
[pid 13525] epoll_pwait(14<anon_inode:[eventpoll]>, <unfinished ...>
[pid 2368] pselect6(0, NULL, NULL, NULL, {tv_sec=4, tv_nsec=785892938}, NULL <unfinished ...>
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 346, NULL, 8) = 0
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 998, NULL, 8) = 0
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 997, NULL, 8) = 0
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 998, NULL, 8) = 0
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 998, NULL, 8) = 0
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, <unfinished ...>
[pid 2368] <... pselect6 resumed>) = 0 (Timeout)
[pid 2368] getuid() = 0
[pid 2368] kill(2163, 0) = 0
…
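The kill(2163, 0) seen in the trace is the child process probing that its parent still exists: signal 0 delivers nothing and only performs the existence/permission check. The same probe from shell:

```shell
# Signal 0 performs only the existence check; no signal is delivered.
sleep 60 &                     # throwaway process to probe
probe=$!
kill -0 "$probe" && echo "process $probe is alive"
kill "$probe"                  # clean up the helper process
```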
I crosschecked on another machine, worker29, and there it looks the same, so nothing extraordinary. If I call sudo salt \* test.ping
both processes react, reading a lot and showing that in the strace output. Also all machines now show up in the output of test.ping
, so it seems the workaround of restarting the salt-minion is actually not necessary. I manually retriggered https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2958810 and other jobs in the pipeline, which succeeded.
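That check could be scripted for future occurrences; a sketch that keys on the "Minion did not return" marker seen in the job logs above (the 10s timeout is an assumption):

```shell
# Sketch: succeed only when no minion is reported as not returning.
ping_all_minions() {
  ! salt --timeout=10 '*' test.ping 2>&1 | grep 'Minion did not return'
}
```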