action #180194
closed
coordination #161414: [epic] Improved salt based infrastructure management
Workers diesel and petrol are down which blocks the salt deployment pipeline size:S
Added by dheidler 15 days ago.
Updated 9 days ago.
Category:
Regressions/Crashes
- Category set to Regressions/Crashes
- Description updated (diff)
- Tags set to infra
- Priority changed from Normal to High
- Target version set to Ready
- Description updated (diff)
- Related to action #178576: Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S added
- Parent task set to #161414
- Tags changed from infra to infra, alert, reactive work
- Subject changed from Workers diesel and petrol are down which blocks the salt deployment pipeline to Workers diesel and petrol are down which blocks the salt deployment pipeline size:S
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to Feedback
- Priority changed from High to Low
- Assignee set to nicksinger
- Priority changed from Low to High
ERROR: Minions returned with non-zero exit code
.Retrying up to 1 more times after sleeping 3s …
......................................................................................................................................backup.qe.prg2.suse.org:
[...]
petrol.qe.nue2.suse.org:
Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
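To pick out the unresponsive workers from a log like the one above, the output can be scanned for the "Minion did not return" marker. This is only a minimal sketch: the function name `unresponsive_minions` is hypothetical, and the layout it assumes (a minion ID line ending in a colon, possibly prefixed by progress dots, followed by the result lines) is inferred from this excerpt alone.

```python
import re

# Marker text copied from the log excerpt; the surrounding layout is an assumption.
NOT_RETURNED = "Minion did not return"

# A minion ID line: optional progress dots, a hostname-like token, then a colon.
HEADER = re.compile(r"^\.*([A-Za-z0-9][\w.\-]*):$")


def unresponsive_minions(output: str) -> list[str]:
    """Return the minion IDs whose result block reports 'Minion did not return'."""
    missing = []
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        header = HEADER.match(stripped)
        if header:
            # Remember which minion the following result lines belong to.
            current = header.group(1)
        elif NOT_RETURNED in stripped and current is not None:
            missing.append(current)
    return missing
```

Fed the excerpt above, this would report only petrol, since backup.qe.prg2.suse.org has no "did not return" line in its block.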
- Status changed from Feedback to In Progress
Interesting. According to ipmitool, the machine was powered off again; I am not sure yet what powers off petrol. Diesel is fine, as are all other nue machines, so I think a power surge is unlikely. I am now checking how I could log why the machine turns itself off overnight.
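One way to record the power state periodically is to parse the output of `ipmitool chassis status`, sketched below. This is an illustration under stated assumptions, not what was actually run for this ticket: the function names and the BMC hostname argument are hypothetical, and the "Key : value" output layout is assumed.

```python
import subprocess


def parse_chassis_status(text: str) -> dict[str, str]:
    """Turn 'Key : value' lines into a dict; lines without a colon are skipped."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def read_chassis_status(bmc_host: str, user: str) -> dict[str, str]:
    """Query the BMC over the network (bmc_host and user are placeholders)."""
    # -E makes ipmitool read the password from the IPMI_PASSWORD environment variable.
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
         "chassis", "status"],
        capture_output=True, text=True, check=True,
    )
    return parse_chassis_status(result.stdout)
```

Logging fields such as "System Power" and "Last Power Event" from a timer could narrow down when the machine goes off; the BMC event log (`ipmitool sel elist`) may hold further hints.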
- Due date set to 2025-04-24
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Resolved
I checked the logs for potential hints but didn't find anything. The machine has now been stable for multiple days, so I assume it was indeed some strange power problem in the FC basement, as I really cannot find anything else. We might consider enabling automatic boot-up on power restoration if this happens again. Usually we don't want this, but for an important machine I think we could justify it.
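If automatic boot-up is ever enabled, one possible mechanism is the IPMI power restore policy. The ticket does not say which mechanism was meant (a BIOS setting would be another option), so this is only a sketch under that assumption. The helper name `restore_policy_command` and its arguments are hypothetical; it only builds the `ipmitool chassis policy` command line and does not run it.

```python
# Policies accepted by `ipmitool chassis policy`.
ALLOWED_POLICIES = {"always-on", "previous", "always-off"}


def restore_policy_command(bmc_host: str, user: str,
                           policy: str = "always-on") -> list[str]:
    """Build the argv for setting the power restore policy on a BMC."""
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"unsupported policy: {policy}")
    # -E makes ipmitool read the password from the IPMI_PASSWORD environment variable.
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
            "chassis", "policy", policy]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. Using "previous" instead of "always-on" would restore whatever state the machine was in before the power loss.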