action #137984

closed

salt "refresh" job full of errors but CI job passes size:M

Added by okurz 11 months ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2023-10-13
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1900948 shows a lot of errors, e.g.

s390zl13.oqa.prg2.suse.org:
    ----------
    mine.update:
        True
    saltutil.refresh_grains:
        True
    saltutil.refresh_pillar:
        True
    saltutil.sync_grains:
        Traceback (most recent call last):
          File "/usr/lib/python3.6/site-packages/salt/modules/saltutil.py", line 79, in _get_top_file_envs
            return __context__["saltutil._top_file_envs"]
          File "/usr/lib/python3.6/site-packages/salt/loader/context.py", line 78, in __getitem__
            return self.value()[item]
        KeyError: 'saltutil._top_file_envs'

        During handling of the above exception, another exception occurred:

        Traceback (most recent call last):
          File "/usr/lib/python3.6/site-packages/salt/minion.py", line 2110, in _thread_multi_return
...

but in the end the CI job passes instead of failing

Steps to reproduce

I assume that, as long as https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1019 is not yet fully effective, the problem can be reproduced by rerunning the CI job. The error message itself can be reproduced on osd with

salt --no-color 's390zl12*' saltutil.sync_grains

Acceptance criteria

  • AC1: Obvious errors visible in the log of the "refresh" CI job should fail the CI job

Suggestions

  • DONE Crosscheck whether the salt command itself returns a non-zero exit code when the problem reproduces -> on osd, salt --no-color 's390zl12*' saltutil.sync_grains; echo $? prints "1". So the likely problem is that the CI instructions, which execute the command over ssh, do not properly evaluate the exit code of the inner command
  • Ensure that the CI job evaluates the exit code or error condition accordingly
  • Make sure the exit code is still evaluated regardless of shown error messages
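
The two checks suggested above (exit code and visible errors) could be combined in a small wrapper. This is only a minimal sketch, not what the CI job actually does: the function name check_output is hypothetical, and treating the string "Traceback (most recent call last)" as the error marker is an assumption based on the log excerpt in the observation.

```shell
# Hypothetical wrapper: run a command, print its output, and fail if either
# the exit code is non-zero or the output contains a Python traceback.
check_output() {
    # Capture stdout and stderr; remember the exit code even under "set -e".
    out=$("$@" 2>&1) && rc=0 || rc=$?
    printf '%s\n' "$out"
    # The exit code is always evaluated first, regardless of the output.
    if [ "$rc" -ne 0 ]; then
        return "$rc"
    fi
    # Assumption: a traceback in the log counts as an "obvious error".
    if printf '%s\n' "$out" | grep -qF 'Traceback (most recent call last)'; then
        return 1
    fi
    return 0
}
```

It would be called as check_output salt --no-color 's390zl12*' saltutil.sync_grains, so the full traceback stays visible in the log while the job still fails.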

Problem

  • The problem seems to be related to the compound statement: salt \* saltutil.sync_grains,saltutil.refresh_grains yields exit code 0, whereas salt \* saltutil.sync_grains yields 1
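
One possible workaround for the compound call masking the failure is to run each function in its own salt call and fail if any of them returns non-zero. This is a minimal sketch under the assumption, taken from the observation above, that a single-function call returns a non-zero exit code on error. The helper name run_each and the SALT_CMD override variable are hypothetical and not part of salt or of salt-states-openqa.

```shell
# Hypothetical helper: call each salt function separately so that a failing
# function is not masked by a succeeding one in a compound call.
# SALT_CMD is an assumed override, e.g. to substitute a stub for testing.
run_each() {
    rc=0
    for fn in "$@"; do
        # Intentionally unquoted so that "salt --no-color" splits into words.
        ${SALT_CMD:-salt --no-color} '*' "$fn" || rc=1
    done
    return "$rc"
}
```

For example, run_each saltutil.sync_grains saltutil.refresh_grains would still run both functions but return 1 if either call fails.
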
Actions #1

Updated by okurz 11 months ago

  • Description updated (diff)
Actions #2

Updated by okurz 11 months ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #3

Updated by okurz 11 months ago

cool, even salt --no-color --failhard --hard-crash is quite "graceful" :D

Actions #4

Updated by okurz 11 months ago

  • Due date set to 2023-10-27
  • Status changed from In Progress to Feedback

I found two upstream issues and one related pull request:

The PR was merged 5 years ago, but I am not sure we actually have it in our package. In the two issues I found no solution.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1020

Actions #5

Updated by okurz 11 months ago

  • Status changed from Feedback to In Progress

reverted https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1021, it was a bad idea to hide the actual traceback with "-q".

Actions #6

Updated by okurz 11 months ago

  • Status changed from In Progress to Feedback
Actions #7

Updated by okurz 11 months ago

  • Subject changed from salt "refresh" job full of errors but CI job passes to salt "refresh" job full of errors but CI job passes size:M
  • Description updated (diff)
Actions #10

Updated by okurz 11 months ago

  • Due date deleted (2023-10-27)
  • Status changed from Feedback to Resolved