action #180194

closed

coordination #161414: [epic] Improved salt based infrastructure management

Workers diesel and petrol are down which blocks the salt deployment pipeline size:S

Added by dheidler 15 days ago. Updated 9 days ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
2025-04-08
Due date:
2025-04-24
% Done:

0%

Estimated time:

Description

Broken pipeline: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/4112517

Possibly similar issues as #178576

Mitigations

Suggestions

  • Are the machines unusable, or otherwise (temporarily) unreachable?
  • Try to recover those workers if possible
  • If petrol and diesel are out, we'll only have mania for this type of workload

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #178576: Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S (Resolved, nicksinger, 2025-03-07)

Actions #1

Updated by dheidler 15 days ago

  • Category set to Regressions/Crashes
Actions #2

Updated by dheidler 15 days ago

  • Description updated (diff)
Actions #3

Updated by livdywan 15 days ago

  • Tags set to infra
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #4

Updated by dheidler 15 days ago

  • Description updated (diff)
Actions #5

Updated by nicksinger 15 days ago

  • Related to action #178576: Workers unresponsive in salt pipelines including openqa-piworker, sapworker1 and monitor size:S added
Actions #6

Updated by okurz 15 days ago

  • Parent task set to #161414
Actions #7

Updated by livdywan 15 days ago

  • Tags changed from infra to infra, alert, reactive work
  • Subject changed from Workers diesel and petrol are down which blocks the salt deployment pipeline to Workers diesel and petrol are down which blocks the salt deployment pipeline size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by nicksinger 15 days ago

  • Status changed from Workable to Feedback
  • Priority changed from High to Low

I checked the machines and found them powered off. The only related mention I found was https://suse.slack.com/archives/C029APBKLGK/p1744102226359739?thread_ts=1744100245.623629&cid=C029APBKLGK
After powering the machines on again, everything looked fine and I re-enabled them in salt. I also checked the other Nuremberg salt-controlled machines; all of them are powered on and show multiple days of uptime. I will keep this in Feedback for a few days to see if it happens again.

Actions #9

Updated by nicksinger 15 days ago

  • Assignee set to nicksinger
Actions #10

Updated by livdywan 14 days ago

  • Priority changed from Low to High
ERROR: Minions returned with non-zero exit code
.Retrying up to 1 more times after sleeping 3s …
......................................................................................................................................backup.qe.prg2.suse.org:
[...]
petrol.qe.nue2.suse.org:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
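
To pick the unreachable hosts out of a long pipeline log like the one above, a small parser can help. This is a minimal sketch, not part of the salt-states-openqa pipeline: the function name `unresponsive_minions` is hypothetical, and it assumes the log layout shown in this comment (a minion ID line ending in a colon, followed by an indented "Minion did not return" line).

```python
import re

# A minion ID line such as "petrol.qe.nue2.suse.org:"; leading progress dots
# from the pipeline output are tolerated and not treated as part of the ID.
MINION_LINE = re.compile(r"^\.*(?P<minion>[A-Za-z0-9][A-Za-z0-9_.-]*):\s*$")


def unresponsive_minions(salt_output: str) -> list[str]:
    """Return the minion IDs reported as 'Minion did not return'."""
    missing = []
    current = None
    for line in salt_output.splitlines():
        match = MINION_LINE.match(line)
        if match:
            # Remember the most recent minion ID; its result follows below it.
            current = match.group("minion")
        elif current and "Minion did not return" in line:
            missing.append(current)
            current = None
    return missing
```

Feeding the job log into this function yields the list of hosts to check by hand, for example over IPMI.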
Actions #11

Updated by nicksinger 14 days ago

  • Status changed from Feedback to In Progress

livdywan wrote in #note-10:

ERROR: Minions returned with non-zero exit code
.Retrying up to 1 more times after sleeping 3s …
......................................................................................................................................backup.qe.prg2.suse.org:
[...]
petrol.qe.nue2.suse.org:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code

Interesting. According to ipmitool, the machine was powered off again; I am not sure yet what powers off petrol. Diesel is fine, and so are all other nue machines, so I think a power surge is unlikely. I am now checking how I could log why the machine turns itself off overnight.
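
The power-state check mentioned above could be scripted roughly as follows. This is a sketch under assumptions: it relies on the usual `Key : value` layout of `ipmitool chassis status` output, and the helper names as well as the host, user and password-file parameters are placeholders, not values from this ticket.

```python
import subprocess


def parse_chassis_status(output: str) -> dict[str, str]:
    """Turn 'Key : value' lines of `ipmitool chassis status` into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_powered_off(output: str) -> bool:
    """True if the chassis status output reports 'System Power : off'."""
    return parse_chassis_status(output).get("System Power", "").lower() == "off"


def read_chassis_status(bmc_host: str, user: str, password_file: str) -> str:
    """Query a BMC over the network; all three arguments are placeholders."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
         "-f", password_file, "chassis", "status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Run periodically, such a check would at least record when the machine goes down. On BMCs that report them, the "Last Power Event" field of the same output and the system event log (`ipmitool sel list`) may give hints about the cause.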

Actions #12

Updated by openqa_review 13 days ago

  • Due date set to 2025-04-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by nicksinger 9 days ago

  • Status changed from In Progress to Resolved

I checked the logs for potential hints but didn't find anything. The machine has now been stable for multiple days, so I assume it was indeed some strange power problem in the FC basement, as I really cannot find anything else. We might consider enabling automatic boot-up on power if this happens again. Usually we don't want this, but for an important machine I think we could justify it.
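
One possible way to enable automatic boot-up on power is the IPMI power restore policy. Whether that is the mechanism meant here is an assumption (it could equally be a firmware setting), and the helper name `power_policy_command` is hypothetical. The sketch only builds the `ipmitool chassis policy` command line and does not run it.

```python
from typing import Sequence

# Power restore policies accepted by `ipmitool chassis policy`.
VALID_POLICIES = ("always-on", "previous", "always-off")


def power_policy_command(policy: str, remote_args: Sequence[str] = ()) -> list[str]:
    """Build the ipmitool command that sets the power restore policy.

    remote_args can carry connection options such as
    ["-I", "lanplus", "-H", "<bmc-host>"]; these are placeholders.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown power restore policy: {policy!r}")
    return ["ipmitool", *remote_args, "chassis", "policy", policy]
```

With `always-on`, the chassis powers up whenever mains power returns, while `previous` restores the state it had before the outage. How reliably a given BMC honours the policy would need to be verified on the machine itself.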
