action #133793

closed

salt-pillars-openqa failing to apply within 2h and it is not clear which minion(s) are missing size:M

Added by okurz 9 months ago. Updated 9 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-08-04
Due date:
% Done: 0%
Estimated time:
Description

Observation

See https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1734178, which ran into the 2h GitLab CI timeout while applying a salt high state. The log contains a lot of unhelpful debug output, including all the lines with "Result: Clean - Started:", and mentions the hosts "backup.qa.suse.de" and "openqaworker1.qe.nue2.suse.org" as being down, but it is not clear which minions ultimately did not return.

Acceptance criteria

  • AC1: By default, the output contains no lines with "Result: Clean - Started:"; they are put into a separate logfile that is uploaded
  • AC2: No repeated "++ true, ++ sleep 1, ++ echo -n ." lines
  • AC3: We know which minions did not complete
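For AC3, one possible approach is to compare the list of minions that were expected to respond with the list of minions that actually returned. The following is only a sketch under assumptions: the helper name `missing_minions` and both input files (one minion name per line) are hypothetical, and how those lists would be produced in the pipeline is left open.

```shell
# Hypothetical helper, not part of the existing pipeline.
# $1: file with the expected minion names, one per line (assumed input)
# $2: file with the minion names that returned, one per line (assumed input)
# Prints every expected minion that does not appear in the returned list.
missing_minions() {
    expected_file=$1
    returned_file=$2
    # -F: fixed strings, -x: match whole lines only, -v: invert the match,
    # so only expected names that are absent from the returned list remain.
    # grep exits non-zero when nothing is missing, hence the "|| true".
    grep -Fxv -f "$returned_file" "$expected_file" || true
}

# Usage (file names are placeholders):
#   missing_minions expected_minions.txt returned_minions.txt
```

Matching whole lines avoids treating a name such as "worker1" as present just because "worker10" returned.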

Suggestions

sudo salt --no-color --state-output=changes 'backup-qam.qe.nue2.suse.org' state.apply queue=True | awk '/Result: Clean - Started/ {print > "/tmp/salt_profiling.log"; next} 1'

This provides nice terse output and writes all the profiling information to /tmp/salt_profiling.log
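The awk filter from the command above can also be wrapped in a small function so that it can be tried on any text stream without a salt master. This is a sketch, not the pipeline's actual code: the function name `split_salt_output` and the optional path argument are assumptions added for illustration.

```shell
# Hypothetical wrapper around the awk filter from the suggestion.
# Reads salt output on stdin, writes lines containing
# "Result: Clean - Started" to the profiling log and passes
# everything else through to stdout.
# $1: path of the profiling log (defaults to /tmp/salt_profiling.log)
split_salt_output() {
    profiling_log=${1:-/tmp/salt_profiling.log}
    awk -v log_file="$profiling_log" '
        /Result: Clean - Started/ { print > log_file; next }  # profiling line: divert it
        1                                                      # anything else: print as-is
    '
}

# Usage with the salt call from the suggestion:
#   sudo salt --no-color --state-output=changes 'backup-qam.qe.nue2.suse.org' \
#       state.apply queue=True | split_salt_output
```

Note that awk only creates the profiling log once the first matching line is seen, so an upload step should tolerate a missing file.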

  • Maybe don't apply "set -x" to the commands that print the progress dots
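One way to keep "set -x" for the rest of a script while silencing the trace for a dot-printing loop is to switch xtrace off just for that loop. The loop below is a stand-in, not the real one from the CI job: the function name `wait_with_dots` and its two parameters are hypothetical.

```shell
# Hypothetical stand-in for the dot-printing wait loop.
# $1: number of dots to print
# $2: seconds to sleep between dots (defaults to 1)
wait_with_dots() {
    attempts=$1
    interval=${2:-1}
    # Remember whether xtrace is currently enabled.
    case $- in
        *x*) restore_xtrace=1 ;;
        *) restore_xtrace=0 ;;
    esac
    # Disable xtrace; the redirect hides the trace line of "set +x" itself.
    { set +x; } 2>/dev/null
    i=0
    while [ "$i" -lt "$attempts" ]; do
        printf '.'
        sleep "$interval"
        i=$((i + 1))
    done
    printf '\n'
    # Re-enable xtrace only if the caller had it on.
    if [ "$restore_xtrace" -eq 1 ]; then
        set -x
    fi
    return 0
}
```

With this, the log shows a single row of dots instead of a "++ sleep 1" and "++ echo -n ." pair per iteration.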

Out of scope

  • Time out before the 2h GitLab CI timeout and record which minions are still busy executing jobs -> #133457

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #133469: [alert] Salt states don't apply sometimes on individual workers size:M - Resolved - nicksinger - 2023-07-27

Related to QA - action #133457: salt-states-openqa gitlab CI pipeline aborted with error after 2h of execution size:M - Resolved - okurz