action #161423
closed coordination #161414: [epic] Improved salt based infrastructure management
[timeboxed:10h] Incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master size:S
0%
Description
Motivation
See #161324. Why did the salt states pipelines end with success when the salt high state was never reported as successfully applied to the openqa.suse.de salt minion (openqa.suse.de is not mentioned in the list of minions where the state was applied, yet the pipeline still ended)? We do not know yet, but this ticket should help us spot such errors quicker if similar problems return. Maybe the problem is related to how we run salt over ssh from that minion openqa.suse.de, and the exit code from salt was never propagated because the command in bash ended prematurely? Research upstream best practices for applying a high state from a remotely accessible master and investigate this.
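One plausible mechanism for a silently lost exit code, assuming the deploy job pipes salt's output through another command (as the later `| grep -v Result` experiments in this ticket do): without `pipefail`, a shell pipeline only reports the status of its last command. A minimal sketch; `simulate_salt` is a hypothetical stand-in for a failing salt call:

```shell
#!/bin/bash
# Demonstrates why `salt ... | grep -v Result` can mask salt's failure:
# without pipefail, the pipeline's exit status is grep's, not salt's.
simulate_salt() { echo "some state output"; exit 7; }  # stand-in for a failing salt call

simulate_salt | grep -v Result
echo "without pipefail: $?"   # grep succeeded, so the failure is hidden (prints 0)

set -o pipefail               # bash/zsh; plain POSIX sh may lack this option
simulate_salt | grep -v Result
echo "with pipefail: $?"      # salt's non-zero status now propagates (prints 7)
```

GitLab CI runs job scripts with `sh`/`bash` semantics depending on the image, so whether `pipefail` is in effect by default is worth verifying for the deploy job.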
Acceptance criteria
- AC1: We know the best practice for applying a salt high state on a remotely accessible salt master while avoiding losing the ssh session in the process
Suggestions
- Do a quick web research or look around for best practices, known problems, or instructions for running salt on a remote ssh-reachable host
- Look into how the salt states CI pipelines originally behaved in #161309 and how results of the state application are missing for openqa.suse.de. Maybe we lost the connection to the salt master while the high state was applied, and the CI pipeline then ended with "success" even though we never received a response from openqa.suse.de?
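Regarding AC1, one upstream-documented pattern worth evaluating: instead of keeping the ssh session (and salt's foreground process) alive for the whole run, start the job with `salt --async`, which returns a job ID immediately, and fetch the result later via the `jobs` runner. A hedged sketch; the jid parsing assumes salt's usual "Executed command with job ID: <jid>" output line:

```shell
#!/bin/bash
# Assumption: `salt --async` prints "Executed command with job ID: <jid>".
extract_jid() { awk '/job ID/ {print $NF}'; }

# Start the high state without blocking; a dropped ssh session no longer kills it:
#   jid=$(sudo salt --async '*' state.apply | extract_jid)
# Fetch the results later, even from a fresh ssh session:
#   sudo salt-run jobs.lookup_jid "$jid"
#   sudo salt-run jobs.print_job "$jid"   # also shows which minions returned

# Illustrative only, since salt itself is not invoked here:
echo 'Executed command with job ID: 20240617123456789' | extract_jid
```

This would also make the "minion on the master never reports" case detectable: `jobs.lookup_jid` can be compared against the expected minion list.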
Updated by okurz 7 months ago
- Related to action #161324: Conduct "lessons learned" with Five Why analysis for "osd not accessible, 502 Bad Gateway" added
Updated by okurz 6 months ago
- Copied to action #162377: incomplete config files on OSD due to salt - Prevent conflicting state applications on OSD "fstab" size:S added
Updated by okurz 6 months ago · Edited
- Status changed from New to In Progress
- Assignee set to okurz
- Target version changed from future to Ready
For #162377 I looked into "deploy" jobs from https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs, in particular https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2737048, https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2737025 and https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2732901, and I did not find any state output mentioned for the minion openqa.suse.de, so I wonder if that is now always missing from the output.
I looked on https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines?page=20&scope=all, bisecting to find older deploy jobs where there is state output for OSD, and then to find when that output stopped appearing. I found https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1821390 from 9 months ago, 2023-09-11, and looked for "Summary for openqa.suse.de", showing
Summary for openqa.suse.de
--------------
Succeeded: 350 (changed=30)
Failed: 0
--------------
Total states run: 350
Total run time: 55.980 s
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1923459#L818 from 2023-10-23 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2111714#L877 from 2023-12-29 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2401769#L1059 from 2024-03-19 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2578478#L1059 from 2024-05-07 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2609510#L972 from 2024-05-14 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2672330 from 2024-05-31 NOT (the one that triggered #161309)
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2691374#L1075 from 2024-06-06 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2692394#L1004 from 2024-06-06 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2720405#L1233 from 2024-06-13 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2732313#L3695 from 2024-06-17 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2732602 from 2024-06-17 1749 NOT
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2732901 NOT
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2737025 NOT
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2737048 NOT
Given that we found a non-contiguous series of missing output for the minion 'openqa.suse.de', I assume that we do not have a clear regression but rather a sporadic issue that could have been present for a longer time. Hence we should look for improvements.
I also suspect that #162641 could actually cause problems here. At the least it is hindering us and is annoying.
My suggestions:
- Research upstream about known problems of applying a salt high state over an ssh connected master
- Apply community best practices for triggering a multi-minion high state from an ssh-accessible master+minion
- Split salt high state application to non-osd and the minion on osd itself
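The third suggestion could look roughly like the sketch below. `salt_apply` is a hypothetical stand-in (it only echoes its target) for something like `sudo salt -C "$1" state.apply`; `not openqa.suse.de` relies on salt's compound matcher syntax:

```shell
#!/bin/bash
# Phase the high state so the run on the master's own minion (openqa.suse.de)
# cannot mask or swallow the results of the other minions.
salt_apply() { echo "applying high state to: $1"; }  # stand-in for: sudo salt -C "$1" state.apply

rc=0
salt_apply 'not openqa.suse.de' || rc=$?  # phase 1: every minion except OSD itself
salt_apply 'openqa.suse.de'     || rc=$?  # phase 2: OSD, with its summary clearly visible
exit "$rc"                                # propagate any failure to the CI pipeline
```

Collecting the exit code from each phase (rather than relying on the last command's status) keeps the CI job red if either phase fails.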
Updated by okurz 6 months ago
- Related to action #162641: Prevent redundant salt state.apply actions that are executed in every call - openqa-trigger-from-ibs-plugin added
Updated by openqa_review 6 months ago
- Due date set to 2024-07-05
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 6 months ago
- Subject changed from incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master to [timeboxed:10h] Incomplete config files on OSD due to salt - Improve salt state application from remotely accessible salt master size:S
- Description updated (diff)
Updated by okurz 6 months ago
- Status changed from Workable to In Progress
Running an experiment with
runs=400 count-fail-ratio sudo salt --no-color --state-output=changes 'openqa.suse.de' state.apply test=True | grep -v Result
to see if I can reproduce the issue of 'openqa.suse.de' results not showing up.
This yields
openqa.suse.de:
----------
ID: auto-upgrade.service
Function: service.dead
Comment: Service auto-upgrade.service not present; if created in this state run, it would have been stopped
Started: 15:56:31.939924
Duration: 54.554 ms
Changes:
----------
ID: auto-upgrade.timer
Function: service.dead
Comment: Service auto-upgrade.timer not present; if created in this state run, it would have been stopped
Started: 15:56:31.995216
Duration: 42.742 ms
Changes:
Summary for openqa.suse.de
--------------
Succeeded: 489 (unchanged=2)
Failed: 0
--------------
Total states run: 489
Total run time: 38.127 s
with the auto-upgrade.{service,timer} mentions. To get rid of that I am running
runs=400 count-fail-ratio sudo salt --no-color --state-output=changes 'openqa.suse.de' state.apply | grep -v Result
If it is not reproducible this way, then I should try without test mode or with multiple nodes.
In the meantime I again looked into deploy jobs backwards in time to check whether the summary for openqa.suse.de was missing:
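To make a missing result fail loudly instead of having to eyeball hundreds of runs, the experiment could assert on the summary line itself. A sketch; the sample input below stands in for real `state.apply` output, and the commented `count-fail-ratio` invocation is an assumption about combining it with the existing tooling:

```shell
#!/bin/bash
# Fail whenever the per-minion summary line is absent from the state output.
check_summary() { grep -q "Summary for openqa.suse.de"; }

# Real usage could be something along these lines:
#   runs=400 count-fail-ratio sh -c \
#     "sudo salt 'openqa.suse.de' state.apply | grep -q 'Summary for openqa.suse.de'"

# Illustrative check on sample output:
printf 'Succeeded: 489\nSummary for openqa.suse.de\n' | check_summary \
  && echo "summary present" || echo "summary MISSING"
```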
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2784679 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2783655 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2764636 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2764520 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2750519 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2749156 OK
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2737048 NOT (same from #161423-3)
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2737025 NOT (same from #161423-3)
- https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2732901 NOT (same from #161423-3)
The corresponding job artifacts were already removed in GitLab, so we can't reproduce there and have no logs for older jobs. No point in continuing down that route.
Updated by openqa_review 6 months ago
- Due date set to 2024-07-18
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 6 months ago
- Due date deleted (2024-07-18)
- Status changed from In Progress to Resolved
runs=400 count-fail-ratio sudo salt --no-color --state-output=changes 'openqa.suse.de' state.apply | grep -v Result
failed on some rare occasions due to other problems, e.g. xml.parsers.expat.ExpatError: syntax error: line 1, column 0
or a minion timeout, but never with a missing result. In the end I could not reproduce the problem anymore, and we agreed that our changes to no longer write /etc/fstab from multiple salt states might prevent the situation.
Updated by livdywan 4 months ago
- Status changed from Resolved to Workable
This is failing on multiple workers, see salt-pillars-openqa:
worker40.oqa.prg2.suse.org:
----------
ID: security-sensor.repo
Function: pkg.latest
Name: velociraptor-client
Result: False
Comment: An exception occurred in this state: Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/salt/state.py", line 2402, in call
*cdata["args"], **cdata["kwargs"]
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
return self.loader.run(run_func, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1234, in run
return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
return callable(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1249, in _run_as
return _func_or_method(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1282, in wrapper
return f(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/states/pkg.py", line 2659, in latest
*desired_pkgs, fromrepo=fromrepo, refresh=refresh, **kwargs
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
return self.loader.run(run_func, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1234, in run
return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
return callable(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1249, in _run_as
return _func_or_method(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/modules/zypperpkg.py", line 828, in latest_version
package_info = info_available(*names, **kwargs)
File "/usr/lib/python3.6/site-packages/salt/modules/zypperpkg.py", line 752, in info_available
"info", "-t", "package", *batch[:batch_size]
File "/usr/lib/python3.6/site-packages/salt/modules/zypperpkg.py", line 439, in __call
salt.utils.stringutils.to_str(self.__call_result["stdout"])
File "/usr/lib64/python3.6/xml/dom/minidom.py", line 1968, in parseString
return expatbuilder.parseString(string)
File "/usr/lib64/python3.6/xml/dom/expatbuilder.py", line 925, in parseString
return builder.parseString(string)
File "/usr/lib64/python3.6/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
Started: 11:36:48.164201
Duration: 2689.752 ms
Changes:
Summary for worker40.oqa.prg2.suse.org