action #131249

Updated by okurz 11 months ago

## Observation 
 From https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1648467 and reproduced locally with `sudo salt --no-color -C 'G@roles:worker' test.ping`: 

 ``` 
 grenache-1.qa.suse.de: 
     Minion did not return. [No response] 
     The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command: 
    
     salt-run jobs.lookup_jid 20230622084232610255 
 worker5.oqa.suse.de: 
     Minion did not return. [No response] 
     The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command: 
    
     salt-run jobs.lookup_jid 20230622084232610255 
 worker2.oqa.suse.de: 
     Minion did not return. [No response] 
     The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command: 
    
     salt-run jobs.lookup_jid 20230622084232610255 
 ``` 
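
To compare failing hosts across runs, output like the above can be reduced to a plain list of minion IDs. This is a minimal sketch, assuming the default text output format shown in this ticket; the function name `unresponsive_minions` is hypothetical and not part of salt.

```shell
# List minion IDs whose entry in salt output reads "Minion did not return".
# Reads salt's default text output on stdin (hypothetical helper, not a salt command).
unresponsive_minions() {
    awk '
        # A line holding a single token that ends in ":" starts a new minion entry.
        /^[ \t]*[^ \t]+:[ \t]*$/ { id = $1; sub(/:$/, "", id); next }
        # Report the current minion once if it did not return.
        /Minion did not return/ && id != "" { print id; id = "" }
    '
}
```

Possible usage: `sudo salt --no-color -C 'G@roles:worker' test.ping | unresponsive_minions | sort`.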

 ## Acceptance criteria 
 * **AC1:** `salt \* test.ping` and `salt \* state.apply` succeed consistently for more than one day 
 * **AC2:** our salt states and pillar and osd deployment pipelines are green and stable again 
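
One way to check AC1 is to repeat the command at a fixed interval and stop at the first failure. The following is only a sketch: `check_repeatedly` is a hypothetical helper, and it assumes the `salt` CLI exits non-zero when a minion does not respond, which should be verified for the installed salt version.

```shell
# Run a command repeatedly and fail on the first non-zero exit code.
# Usage: check_repeatedly ROUNDS INTERVAL_SECONDS COMMAND [ARGS...]
check_repeatedly() {
    rounds=$1; interval=$2; shift 2
    i=1
    while [ "$i" -le "$rounds" ]; do
        if ! "$@" >/dev/null 2>&1; then
            echo "round $i of $rounds failed: $*" >&2
            return 1
        fi
        # Sleep between rounds, but not after the last one.
        if [ "$i" -lt "$rounds" ]; then sleep "$interval"; fi
        i=$((i + 1))
    done
    echo "all $rounds rounds succeeded"
}
```

For example, `check_repeatedly 26 3600 sudo salt '*' test.ping` would cover 25 hours with hourly checks.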

 ## Suggestions 
 * It seems we might have had this problem for a while, but never this severely. Now it looks like those machines can end up with "no response" again even after we trigger a reboot and restart salt-minion. Maybe we can revert some recent package updates? /var/log/zypp/history shows: 

 ``` 
 2023-06-22 03:01:12|install|python3-pyzmq|17.1.2-150000.3.5.2|x86_64||repo-sle-update|e2d9d07654cffc31e5199f40aa1ba9fee1e114c4ca5abd78f7fdc78b2e6cc21a| 
 ``` 

 * Debug the actual problem of the hanging salt-minion. Maybe we can try to trigger the problem more reliably rather than prevent it? 
 * Research upstream, apply workarounds, and potentially try upgrading to Leap 15.5 in case that fixes something 
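
To find candidates for a revert, the zypp history can be filtered for packages installed on the day the problem started. This is a sketch that assumes the pipe-separated layout of the entry quoted above (timestamp, action, name, version, ...); `installs_on` is a hypothetical helper name.

```shell
# Print "timestamp name version" for every package installed on a given day.
# Usage: installs_on YYYY-MM-DD [HISTORY_FILE]
installs_on() {
    awk -F'|' -v day="$1" '
        /^#/ { next }    # skip comment lines
        $2 == "install" && index($1, day) == 1 { print $1, $3, $4 }
    ' "${2:-/var/log/zypp/history}"
}
```

For example, `sudo sh -c '. ./helpers.sh; installs_on 2023-06-22'` (with the function saved in a hypothetical `helpers.sh`) would list that night's updates. A listed package could then be downgraded with something like `zypper install --oldpackage`, provided the older version is still available in a repository.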


 ## Rollback steps 
 * On worker2, worker3, worker5, grenache-1, openqaworker-arm-2 and openqaworker-arm-3: `sudo mv /etc/systemd/system/auto-update.$i{.disabled_poo131249,} && sudo systemctl enable --now auto-update.timer && sudo systemctl start auto-update`, then wait for the upgrade to complete and reboot
