action #133793

closed

salt-pillars-openqa failing to apply within 2h and it is not clear which minion(s) are missing size:M

Added by okurz 9 months ago. Updated 9 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-08-04
Due date:
% Done: 0%
Estimated time:

Description

Observation

See https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1734178, which ran into the 2h gitlab CI timeout while applying a salt high state. The log contains a lot of unhelpful debug output, namely all the lines with "Result: Clean - Started:". It also mentions the hosts "backup.qa.suse.de" and "openqaworker1.qe.nue2.suse.org" as being down, but it is not clear which minions ultimately did not return.

Acceptance criteria

  • AC1: By default the output contains no lines with "Result: Clean - Started:"; they are put in another logfile to be uploaded
  • AC2: No repeated "++ true, ++ sleep 1, ++ echo -n ." in the output
  • AC3: We know which minions did not complete
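
One way to approach AC3 is to compare the list of minions that were targeted with the list of minions that actually returned. The following is only a minimal sketch under that assumption: the helper name `missing_minions` and the two input files (one minion ID per line) are hypothetical and not part of the ticket or of the pipeline.

```shell
# Hypothetical helper: print minion IDs that were expected but did not return.
# $1: file with one expected minion ID per line (assumed input)
# $2: file with one returned minion ID per line (assumed input)
missing_minions() {
    expected_sorted=$(mktemp)
    returned_sorted=$(mktemp)
    # comm requires sorted input; -u also drops duplicate IDs
    sort -u "$1" > "$expected_sorted"
    sort -u "$2" > "$returned_sorted"
    # -23 keeps only lines unique to the first file (expected but not returned)
    comm -23 "$expected_sorted" "$returned_sorted"
    rm -f "$expected_sorted" "$returned_sorted"
}
```

Calling `missing_minions expected.txt returned.txt` would then print only the IDs that never showed up in the results.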

Suggestions

sudo salt --no-color --state-output=changes 'backup-qam.qe.nue2.suse.org' state.apply queue=True | awk '/Result: Clean - Started/ {print > "/tmp/salt_profiling.log"; next} 1'

which provides nice terse output and writes all the profiling information to /tmp/salt_profiling.log
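
The awk filter from the command above could be wrapped in a small helper so that the path of the profiling log is configurable, for example to place it where the CI job collects artifacts. This is a sketch only: the function name `split_salt_output` and its default file name are assumptions, not what the pipeline actually uses.

```shell
# Hypothetical helper: read salt output on stdin, pass regular lines through
# and divert the profiling lines into a separate log file.
# $1: path of the profiling log (assumed default: salt_profiling.log)
split_salt_output() {
    profiling_log="${1:-salt_profiling.log}"
    # Create or truncate the log so it exists even without profiling lines
    : > "$profiling_log"
    # Note: the awk variable must not be called "log", which is an awk built-in
    awk -v out="$profiling_log" \
        '/Result: Clean - Started/ {print > out; next} 1'
}
```

It would be used at the end of the pipe in place of the inline awk call, e.g. `sudo salt ... state.apply queue=True | split_salt_output /tmp/salt_profiling.log`.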

  • Maybe don't apply "set -x" to the commands that print the progress dots
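
The idea of keeping "set -x" away from the dot-printing commands can be sketched as follows. The wrapper name `progress_dots` is hypothetical and the actual CI script may be structured differently. The function body runs in a subshell, so switching off tracing there does not affect the caller.

```shell
# Hypothetical wrapper: run a command while printing a dot per second to
# stderr, without xtrace noise such as "++ sleep 1" or "++ echo -n .".
progress_dots() (
    # Disable xtrace in this subshell only; the redirect hides the trace
    # line of "set +x" itself
    { set +x; } 2>/dev/null
    while :; do printf '.' >&2; sleep 1; done &
    dots_pid=$!
    status=0
    # "|| status=$?" keeps the exit code without tripping "set -e"
    "$@" || status=$?
    kill "$dots_pid" 2>/dev/null
    wait "$dots_pid" 2>/dev/null || true
    exit "$status"
)
```

A call like `progress_dots sudo salt ... state.apply queue=True` would keep the command's own output and exit code while the trace stays free of the repeated loop lines.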

Out of scope

  • Timing out before the 2h gitlab CI timeout and writing down which minions are still busy executing jobs -> #133457

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #133469: [alert] Salt states don't apply sometimes on individual workers size:M | Resolved | nicksinger | 2023-07-27

Related to QA - action #133457: salt-states-openqa gitlab CI pipeline aborted with error after 2h of execution size:M | Resolved | okurz

Actions #1

Updated by okurz 9 months ago

  • Description updated (diff)
Actions #2

Updated by okurz 9 months ago

  • Related to action #133469: [alert] Salt states don't apply sometimes on individual workers size:M added
Actions #3

Updated by okurz 9 months ago

  • Related to action #133457: salt-states-openqa gitlab CI pipeline aborted with error after 2h of execution size:M added
Actions #4

Updated by okurz 9 months ago

I just used

sudo salt --no-color --state-output=changes 'backup-qam.qe.nue2.suse.org' state.apply queue=True | awk '/Result: Clean - Started/ {print > "/tmp/salt_profiling.log"; next} 1'

which provides nice terse output and writes all the profiling information to /tmp/salt_profiling.log

Actions #5

Updated by nicksinger 9 months ago

  • Assignee set to nicksinger
Actions #6

Updated by mkittler 9 months ago

  • Subject changed from salt-pillars-openqa failing to apply within 2h and it is not clear which minion(s) are missing to salt-pillars-openqa failing to apply within 2h and it is not clear which minion(s) are missing size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by okurz 9 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/937 to improve the progress indication for the gitlab CI runner issue

Actions #8

Updated by okurz 9 months ago

  • Due date set to 2023-09-01
  • Status changed from Workable to Feedback
  • Assignee changed from nicksinger to okurz
Actions #9

Updated by okurz 9 months ago

  • Due date deleted (2023-09-01)
  • Status changed from Feedback to Resolved

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/938 merged. https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1751110 shows a much cleaner output. https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1751110/artifacts/browse shows the additional artifact https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1751110/artifacts/external_file/salt_profiling.log

and it is now clear where minions return with non-zero exit code:

ERROR: Minions returned with non-zero exit code
backup-qam.qe.nue2.suse.org:
storage.oqa.suse.de:
openqaworker17.qa.suse.cz:
openqaworker18.qa.suse.cz:
openqaworker16.qa.suse.cz:
worker35.oqa.prg2.suse.org:
worker29.oqa.prg2.suse.org:
worker37.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
worker36.oqa.prg2.suse.org:
worker34.oqa.prg2.suse.org:
worker30.oqa.prg2.suse.org:
worker39.oqa.prg2.suse.org:
worker31.oqa.prg2.suse.org:
worker32.oqa.prg2.suse.org:
worker38.oqa.prg2.suse.org:
worker-arm1.oqa.prg2.suse.org:
worker3.oqa.suse.de:
worker9.oqa.suse.de:
worker8.oqa.suse.de:
openqaworker1.qe.nue2.suse.org:
sapworker1.qe.nue2.suse.org:
sapworker2.qe.nue2.suse.org:
sapworker3.qe.nue2.suse.org:
qesapworker-prg6.qa.suse.cz:
worker2.oqa.suse.de:
qesapworker-prg7.qa.suse.cz:
qesapworker-prg5.qa.suse.cz:
openqaworker14.qa.suse.cz:
worker5.oqa.suse.de:
qesapworker-prg4.qa.suse.cz:
worker13.oqa.suse.de:
worker10.oqa.suse.de:
openqaw5-xen.qa.suse.de:
qamasternue.qa.suse.de:
    2023-08-11T13:00:05Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude \"\"": hostname: Name or service not known
    2023-08-11T13:00:05Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude \"\"": hostname: Name or service not known
    2023-08-11T13:00:06Z E! [telegraf] Error running agent: input plugins recorded 2 errors
malbec.arch.suse.de:
openqaworker-arm-2.suse.de:
tumblesle.qa.suse.de:
openqaworker-arm-3.suse.de:
schort-server.qa.suse.de:
baremetal-support.qa.suse.de:
backup.qa.suse.de:
jenkins.qa.suse.de:
openqa-piworker.qa.suse.de:
QA-Power8-4-kvm.qa.suse.de:
QA-Power8-5-kvm.qa.suse.de:
powerqaworker-qam-1.qa.suse.de:
grenache-1.qa.suse.de:
openqa.suse.de:
openqa-monitor.qa.suse.de:
    2023-08-11T13:00:13Z E! [inputs.x509_cert] could not find file: [/etc/dehydrated/certs/monitor.qe.nue2.suse.org/fullchain.pem]
    2023-08-11T13:00:18Z E! [telegraf] Error running agent: input plugins recorded 1 errors
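
In a listing like the one above, most minion entries are empty and only a few carry indented detail lines. A small filter can pick out those entries. This is a sketch that assumes the format shown here (a minion ID followed by a colon on an unindented line, details indented beneath it). The function name `minions_with_details` is hypothetical.

```shell
# Hypothetical filter: read the listing on stdin and print each minion ID
# that has at least one indented detail line beneath it.
minions_with_details() {
    awk '
        # Unindented line ending in ":" starts a new minion entry
        /^[^[:space:]].*:$/ { current = substr($0, 1, length($0) - 1); next }
        # Indented, non-empty line belongs to the current entry
        /^[[:space:]]+[^[:space:]]/ && current != "" {
            if (!(current in seen)) { seen[current] = 1; print current }
        }
    '
}
```

Fed with the listing above, it would print qamasternue.qa.suse.de and openqa-monitor.qa.suse.de, the two entries that include telegraf error lines.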

Created two specific tickets about those observations:
