action #161423


coordination #161414: [epic] Improved salt based infrastructure management

[timeboxed:10h] Incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master size:S

Added by okurz 7 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category: Feature requests
Start date: 2024-06-03
Due date:
% Done: 0%
Estimated time:

Description

Motivation

See #161324. Why did the salt states pipelines end with success when the salt high state was never reported as successfully applied to the openqa.suse.de salt minion (openqa.suse.de is not mentioned in the list of minions where the state was applied, yet the pipeline still ended)? We do not know yet, but answering this should help us spot errors more quickly if similar problems return. Maybe the problem is related to how we run salt over ssh on the minion openqa.suse.de, and the exit code from salt was never propagated because the command in bash ended prematurely? Research upstream best practices for applying a high state from a remotely accessible master and investigate this.
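If the hypothesis about a lost exit code is right, the failure mode could be as simple as the following (a hypothetical sketch, assuming the pipeline pipes the salt output through another command):

# Without pipefail, the exit status of a pipe is that of the last
# command, so a failing salt call is masked by a succeeding grep:
sudo salt '*' state.apply | grep -v Result
echo $?   # 0 even if salt failed, as long as grep printed something

# With pipefail, the non-zero salt exit code propagates and the
# CI job would fail as expected:
set -o pipefail
sudo salt '*' state.apply | grep -v Result
echo $?   # non-zero if salt failed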

Acceptance criteria

  • AC1: We know the best practice for applying a salt high state on a remotely accessible salt master while avoiding losing the ssh session in the process (see the sketch after the suggestions)

Suggestions

  • Do a quick web search or look around for any best practices, known problems, or instructions for running salt on a remote ssh-reachable host
  • Look into how the salt states CI pipelines originally behaved in #161309 and how results of the state application are missing for openqa.suse.de. Maybe we lost the connection to the salt master while the high state was applied, and the CI pipeline then ended with "success" even though we never received a response from openqa.suse.de?
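One commonly recommended way to avoid losing a long-running high state together with the ssh session is to run it inside a detachable terminal multiplexer. A minimal sketch, assuming screen is available on the master; tmux or a transient systemd unit would work the same way:

ssh openqa.suse.de             # the ssh-reachable salt master
screen -S highstate            # start a named, detachable session
sudo salt '*' state.apply      # run the high state inside it
# Detach with Ctrl-a d; the run survives a dropped ssh connection.
# Reattach later to inspect the result:
screen -r highstate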

Related issues: 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #161324: Conduct "lessons learned" with Five Why analysis for "osd not accessible, 502 Bad Gateway" (Resolved, okurz, 2024-05-31)

Related to openQA Infrastructure (public) - action #162641: Prevent redundant salt state.apply actions that are executed in every call - openqa-trigger-from-ibs-plugin (Resolved, jbaier_cz, 2024-06-20)

Copied to openQA Infrastructure (public) - action #162377: incomplete config files on OSD due to salt - Prevent conflicting state applications on OSD "fstab" size:S (Resolved, okurz, 2024-06-03)

Actions #1

Updated by okurz 7 months ago

  • Related to action #161324: Conduct "lessons learned" with Five Why analysis for "osd not accessible, 502 Bad Gateway" added
Actions #2

Updated by okurz 6 months ago

  • Copied to action #162377: incomplete config files on OSD due to salt - Prevent conflicting state applications on OSD "fstab" size:S added
Actions #3

Updated by okurz 6 months ago · Edited

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from future to Ready

For #162377 I looked into "deploy" jobs from https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs, in particular https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2737048, https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2737025 and https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2732901, and did not find any state output mentioned for the minion openqa.suse.de, so I wonder if that is now always missing from the output.

I looked at https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines?page=20&scope=all, bisecting to find older deploy jobs where there is state output for OSD, to then narrow down when that output stopped appearing. I found https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1821390 from 9 months ago, 2023-09-11, and looked for "Summary for openqa.suse.de", showing

Summary for openqa.suse.de
--------------
Succeeded: 350 (changed=30)
Failed:      0
--------------
Total states run:     350
Total run time:    55.980 s

Given that we found a non-contiguous series of missing output for the minion 'openqa.suse.de', I assume we do not have a clear regression but rather a sporadic issue that could have been present for a longer time. Hence we should look for improvements.

I also suspect that #162641 could actually cause problems here. At the least it is hindering us and is annoying.

My suggestions:

  1. Research upstream about known problems with applying a salt high state over an ssh-connected master
  2. Apply community best practices for triggering a multi-minion high state from an ssh-accessible master+minion
  3. Split the salt high state application into non-OSD minions and the minion on OSD itself (sketched below)
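
A rough sketch of what suggestion 3 could look like, assuming standard salt compound matchers; this is not a tested pipeline change:

# Apply the high state to every minion except OSD itself:
sudo salt -C 'not openqa.suse.de' state.apply
# Then apply locally on OSD via salt-call, so the result for this one
# minion does not depend on the master<->minion return channel:
sudo salt-call state.apply

This way a lost return from openqa.suse.de cannot silently disappear in the aggregated multi-minion output.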
Actions #4

Updated by okurz 6 months ago

  • Related to action #162641: Prevent redundant salt state.apply actions that are executed in every call - openqa-trigger-from-ibs-plugin added
Actions #5

Updated by openqa_review 6 months ago

  • Due date set to 2024-07-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by okurz 6 months ago

  • Due date deleted (2024-07-05)
  • Status changed from In Progress to Workable
Actions #7

Updated by okurz 6 months ago

  • Subject changed from incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master to [timeboxed:10h] Incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master size:S
  • Description updated (diff)
Actions #8

Updated by okurz 6 months ago

  • Status changed from Workable to In Progress

Running an experiment with

runs=400 count-fail-ratio sudo salt --no-color --state-output=changes 'openqa.suse.de' state.apply test=True | grep -v Result

to see if I can reproduce the issue of 'openqa.suse.de' results not showing up.

This yields

openqa.suse.de:
----------
          ID: auto-upgrade.service
    Function: service.dead
     Comment: Service auto-upgrade.service not present; if created in this state run, it would have been stopped
     Started: 15:56:31.939924
    Duration: 54.554 ms
     Changes:   
----------
          ID: auto-upgrade.timer
    Function: service.dead
     Comment: Service auto-upgrade.timer not present; if created in this state run, it would have been stopped
     Started: 15:56:31.995216
    Duration: 42.742 ms
     Changes:   

Summary for openqa.suse.de
--------------
Succeeded: 489 (unchanged=2)
Failed:      0
--------------
Total states run:     489
Total run time:    38.127 s

with the auto-upgrade.{service,timer} mentions. To get rid of those I am running

runs=400 count-fail-ratio sudo salt --no-color --state-output=changes 'openqa.suse.de' state.apply | grep -v Result

If the issue is not reproducible this way, I should retry without test mode or with multiple nodes.
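
As a side note, a plain bash loop like the following could replace the count-fail-ratio helper for specifically counting runs with a missing per-minion summary (hypothetical sketch; count-fail-ratio only tracks the overall exit code):

missing=0
for i in $(seq 1 400); do
    out=$(sudo salt --no-color --state-output=changes 'openqa.suse.de' state.apply test=True)
    echo "$out" | grep -q 'Summary for openqa.suse.de' || { echo "run $i: summary missing"; missing=$((missing+1)); }
done
echo "missing summaries: $missing/400"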

In the meantime I again looked backwards in time through deploy jobs to check whether the summary for openqa.suse.de was missing: the according job artifacts were already removed in GitLab, so we can't reproduce this in GitLab and have no logs for older jobs. No point in continuing down that route.

Actions #9

Updated by openqa_review 6 months ago

  • Due date set to 2024-07-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by okurz 6 months ago

  • Due date deleted (2024-07-18)
  • Status changed from In Progress to Resolved

runs=400 count-fail-ratio sudo salt --no-color --state-output=changes 'openqa.suse.de' state.apply | grep -v Result failed on some rare occasions due to other problems, e.g. xml.parsers.expat.ExpatError: syntax error: line 1, column 0 or a minion timeout, but never with a missing result. In the end I could not reproduce the problem anymore, and we agreed that our changes to no longer write /etc/fstab from multiple salt states possibly prevent the situation.

Actions #11

Updated by livdywan 4 months ago

  • Status changed from Resolved to Workable

This is failing on multiple workers, see salt-pillars-openqa:

worker40.oqa.prg2.suse.org:
----------
          ID: security-sensor.repo
    Function: pkg.latest
        Name: velociraptor-client
      Result: False
     Comment: An exception occurred in this state: Traceback (most recent call last):
                File "/usr/lib/python3.6/site-packages/salt/state.py", line 2402, in call
                  *cdata["args"], **cdata["kwargs"]
                File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
                  return self.loader.run(run_func, *args, **kwargs)
                File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1234, in run
                  return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
                File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
                  return callable(*args, **kwargs)
                File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1249, in _run_as
                  return _func_or_method(*args, **kwargs)
                File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1282, in wrapper
                  return f(*args, **kwargs)
                File "/usr/lib/python3.6/site-packages/salt/states/pkg.py", line 2659, in latest
                  *desired_pkgs, fromrepo=fromrepo, refresh=refresh, **kwargs
                File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
                  return self.loader.run(run_func, *args, **kwargs)
                File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1234, in run
                  return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
                File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
                  return callable(*args, **kwargs)
                File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1249, in _run_as
                  return _func_or_method(*args, **kwargs)
                File "/usr/lib/python3.6/site-packages/salt/modules/zypperpkg.py", line 828, in latest_version
                  package_info = info_available(*names, **kwargs)
                File "/usr/lib/python3.6/site-packages/salt/modules/zypperpkg.py", line 752, in info_available
                  "info", "-t", "package", *batch[:batch_size]
                File "/usr/lib/python3.6/site-packages/salt/modules/zypperpkg.py", line 439, in __call
                  salt.utils.stringutils.to_str(self.__call_result["stdout"])
                File "/usr/lib64/python3.6/xml/dom/minidom.py", line 1968, in parseString
                  return expatbuilder.parseString(string)
                File "/usr/lib64/python3.6/xml/dom/expatbuilder.py", line 925, in parseString
                  return builder.parseString(string)
                File "/usr/lib64/python3.6/xml/dom/expatbuilder.py", line 223, in parseString
                  parser.Parse(string, True)
              xml.parsers.expat.ExpatError: syntax error: line 1, column 0
     Started: 11:36:48.164201
    Duration: 2689.752 ms
     Changes:   
Summary for worker40.oqa.prg2.suse.org
Actions #12

Updated by okurz 4 months ago

  • Status changed from Workable to Resolved

This was meanwhile fixed by retrying and is not related to incomplete config files on OSD. It is a known issue we have seen in the past but couldn't fix.
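
For reference, the kind of retry that papers over such sporadic failures could look like the following (hypothetical sketch, not the actual pipeline code):

# Retry the state application a few times before declaring failure:
for attempt in 1 2 3; do
    sudo salt --no-color 'worker40.oqa.prg2.suse.org' state.apply && break
    echo "attempt $attempt failed, retrying" >&2
    [ "$attempt" = 3 ] && exit 1
done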
