osd deployment today failed again due to this. I logged into openqaworker-arm-1 and investigated. tail /var/log/salt/minion
at 2024-08-18 09:37:00Z shows
2024-08-18 03:40:00,724 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:05,732 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:10,740 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:10,744 [salt.minion :1145][ERROR ][2366] Error while bringing up minion for multi-master. Is master at openqa.suse.de responding?
2024-08-18 03:40:25,795 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:30,803 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:35,810 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:40,818 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:45,826 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
2024-08-18 03:40:50,835 [salt.channel.client:293 ][ERROR ][2366] Failed to send msg SaltReqTimeoutError('Message timed out',)
so timeouts during the time when OSD rebooted but no message afterwards. w
shows that the machine itself is up since 8h. journalctl -b -u salt-minion
shows also the above messages plus RPM output where I don't know where they come from. systemctl status salt-minion
says
● salt-minion.service - The Salt Minion
Loaded: loaded (/usr/lib/systemd/system/salt-minion.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2024-08-18 03:36:59 CEST; 8h ago
Main PID: 2163 (salt-minion)
Tasks: 5 (limit: 7372)
CGroup: /system.slice/salt-minion.service
├─ 2163 /usr/bin/python3 /usr/bin/salt-minion
└─ 2366 /usr/bin/python3 /usr/bin/salt-minion
Aug 18 07:17:11 openqaworker-arm-1 [RPM][331]: install openQA-client-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:12 openqaworker-arm-1 [RPM][331]: erase openQA-client-4.6.1723660485.b24c2671-lp155.7053.1.aarch64: success
Aug 18 07:17:12 openqaworker-arm-1 [RPM][331]: install openQA-client-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:12 openqaworker-arm-1 [RPM][331]: Transaction ID 66c183d6 finished: 0
Aug 18 07:17:12 openqaworker-arm-1 [RPM][335]: Transaction ID 66c183d8 started
Aug 18 07:17:12 openqaworker-arm-1 [RPM][335]: erase openQA-worker-4.6.1723660485.b24c2671-lp155.7053.1.aarch64: success
Aug 18 07:17:15 openqaworker-arm-1 [RPM][335]: install openQA-worker-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:17 openqaworker-arm-1 [RPM][335]: erase openQA-worker-4.6.1723660485.b24c2671-lp155.7053.1.aarch64: success
Aug 18 07:17:17 openqaworker-arm-1 [RPM][335]: install openQA-worker-4.6.1723914098.c2bbc49b-lp155.7068.1.aarch64: success
Aug 18 07:17:17 openqaworker-arm-1 [RPM][335]: Transaction ID 66c183d8 finished: 0
strace -f -yy -p 2163
shows that the main salt process waits for the second. strace -f -yy -p 2366
shows the process in a polling wait
strace: Process 2366 attached with 4 threads
[pid 13526] epoll_pwait(16<anon_inode:[eventpoll]>, <unfinished ...>
[pid 13525] epoll_pwait(14<anon_inode:[eventpoll]>, <unfinished ...>
[pid 2368] pselect6(0, NULL, NULL, NULL, {tv_sec=4, tv_nsec=785892938}, NULL <unfinished ...>
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 346, NULL, 8) = 0
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 998, NULL, 8) = 0
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 997, NULL, 8) = 0
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 998, NULL, 8) = 0
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, [], 1023, 998, NULL, 8) = 0
[pid 2366] epoll_pwait(6<anon_inode:[eventpoll]>, <unfinished ...>
[pid 2368] <... pselect6 resumed>) = 0 (Timeout)
[pid 2368] getuid() = 0
[pid 2368] kill(2163, 0) = 0
…
I crosschecked on another machine, worker29, and there it looks like the same, so nothing extraordinary. If I call sudo salt \* test.ping
both processes react, reading a lot and showing that in the strace output. Also all machines now show up in the output from test.ping
so the workaround to restart the salt-minion is actually not necessary it seems. I retriggered https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2958810 manually and other jobs in the pipeline which succeeded