action #180194
closed coordination #161414: [epic] Improved salt based infrastructure management
Workers diesel and petrol are down which blocks the salt deployment pipeline size:S
Description
Broken pipeline: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4112517
Possibly similar issues as #178576
Mitigations
- Take workers out of production if needed; see the corresponding wiki section https://progress.opensuse.org/projects/openqav3/wiki#Take-machines-out-of-salt-controlled-production
- Are the machines currently online again?
- See Slack https://suse.slack.com/archives/C02AJ1E568M/p1744103328838559
Suggestions
- Are the machines unusable, or otherwise (temporarily) unreachable?
- Try to recover those workers if possible
- If petrol and diesel are out, we'll only have mania for this type of workload
Updated by nicksinger 15 days ago
- Related to action #178576: Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S added
Updated by livdywan 15 days ago
- Tags changed from infra to infra, alert, reactive work
- Subject changed from Workers diesel and petrol are down which blocks the salt deployment pipeline to Workers diesel and petrol are down which blocks the salt deployment pipeline size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger 15 days ago
- Status changed from Workable to Feedback
- Priority changed from High to Low
I checked the machines and found them powered off. The only related mention I found was https://suse.slack.com/archives/C029APBKLGK/p1744102226359739?thread_ts=1744100245.623629&cid=C029APBKLGK
After powering the machines on again, everything looked fine and I re-enabled them in salt. I also checked the other salt-controlled machines in Nuremberg; all of them are powered on and show multiple days of uptime. I will keep this in Feedback for a few days to see if it happens again.
Updated by livdywan 14 days ago
- Priority changed from Low to High
ERROR: Minions returned with non-zero exit code
.Retrying up to 1 more times after sleeping 3s …
......................................................................................................................................
backup.qe.prg2.suse.org:
[...]
petrol.qe.nue2.suse.org:
Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
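To make output like the above easier to triage, the following is a minimal sketch (not part of the actual pipeline) that extracts the IDs of minions reported as "Minion did not return". The function name and the assumption that each minion ID appears on its own line ending in a colon, possibly preceded by progress dots, are illustrative only.

```python
import re

# Assumed format: a minion ID alone on a line, ending in ":", optionally
# preceded by the progress dots the pipeline prints while waiting.
MINION_LINE = re.compile(r"^\.*(?P<minion>[A-Za-z0-9][A-Za-z0-9._-]*):\s*$")
NOT_RETURNED = "Minion did not return"


def unresponsive_minions(output: str) -> list[str]:
    """Return the minion IDs that are followed by a 'Minion did not return' line."""
    found: list[str] = []
    current = None
    for line in output.splitlines():
        match = MINION_LINE.match(line)
        if match:
            # Remember the most recent minion header line.
            current = match.group("minion")
        elif NOT_RETURNED in line and current and current not in found:
            found.append(current)
    return found


if __name__ == "__main__":
    sample = "\n".join([
        "ERROR: Minions returned with non-zero exit code",
        "......backup.qe.prg2.suse.org:",
        "[...]",
        "petrol.qe.nue2.suse.org:",
        "    Minion did not return. [Not connected]",
    ])
    print(unresponsive_minions(sample))
```

Only the header immediately preceding the "did not return" message is reported, so minions with ordinary state output are not flagged.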
Updated by nicksinger 14 days ago
- Status changed from Feedback to In Progress
livdywan wrote in #note-10:
ERROR: Minions returned with non-zero exit code
.Retrying up to 1 more times after sleeping 3s …
......................................................................................................................................
backup.qe.prg2.suse.org:
[...]
petrol.qe.nue2.suse.org:
Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
Interesting. According to ipmitool, the machine was powered off again; I am not sure yet what powers off petrol. Diesel is fine, as are all other nue machines, so I think a power surge is unlikely. I am now checking how I could log why the machine turns itself off overnight.
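For reference, a check like this could be scripted roughly as follows. This is a sketch under assumptions, not the commands actually run: the BMC hostname and user are hypothetical placeholders, and the password is expected in the `IPMI_PASSWORD` environment variable (read by ipmitool's `-E` option). The BMC's system event log (`sel elist`) is one place that may record why a machine powered off, depending on the hardware.

```python
import subprocess


def ipmi_command(bmc_host: str, user: str, *args: str) -> list[str]:
    """Build an ipmitool call; -E takes the password from IPMI_PASSWORD."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E", *args]


def parse_power_status(output: str) -> bool:
    """Interpret 'Chassis Power is on' / 'Chassis Power is off'."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return True
    if text.endswith("is off"):
        return False
    raise ValueError(f"unexpected ipmitool output: {output!r}")


def run_ipmi(bmc_host: str, user: str, *args: str) -> str:
    """Run ipmitool and return its stdout; raises if the call fails."""
    result = subprocess.run(
        ipmi_command(bmc_host, user, *args),
        capture_output=True, text=True, check=True, timeout=30,
    )
    return result.stdout


def is_powered_on(bmc_host: str, user: str) -> bool:
    return parse_power_status(run_ipmi(bmc_host, user, "chassis", "power", "status"))


def event_log(bmc_host: str, user: str) -> str:
    """Fetch the BMC system event log, which may contain power events."""
    return run_ipmi(bmc_host, user, "sel", "elist")
```

With a hypothetical BMC address, `is_powered_on("petrol-bmc.example.org", "admin")` would return `False` for a machine in the state described above, and `event_log(...)` could be saved periodically to compare entries around the time of the shutdown.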
Updated by openqa_review 13 days ago
- Due date set to 2025-04-24
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 9 days ago
- Status changed from In Progress to Resolved
I checked the logs for potential hints but didn't find anything. The machine has now been stable for multiple days, so I assume it was indeed some strange power problem in the FC basement, as I really cannot find anything else. We might consider enabling automatic boot-up when power is restored if this happens again. Usually we don't want this, but for an important machine I think we could justify it.
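If automatic boot-up were enabled via IPMI rather than in the firmware setup, one possible approach is ipmitool's `chassis policy` subcommand. The sketch below only builds the command line; the BMC hostname and user are hypothetical, the password is assumed to be in `IPMI_PASSWORD`, and whether a given BMC honors the policy depends on the hardware.

```python
# Power restore policies accepted by "ipmitool chassis policy".
VALID_POLICIES = ("always-on", "previous", "always-off")


def power_policy_command(bmc_host: str, user: str, policy: str) -> list[str]:
    """Build the ipmitool call that sets the power restore policy."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown policy {policy!r}, expected one of {VALID_POLICIES}")
    return [
        "ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
        "chassis", "policy", policy,
    ]


if __name__ == "__main__":
    # Hypothetical BMC address; print the command instead of running it.
    print(" ".join(power_policy_command("petrol-bmc.example.org", "admin", "always-on")))
```

`always-on` powers the machine up whenever power returns, while `previous` restores the state it had before the outage, which may be the less surprising choice for machines that are sometimes deliberately kept off.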