action #176250
Updated by okurz 2 months ago
## Observation
Multiple config files were somehow corrupted by salt or incompletely written. First #163790, then #175710, both on OSD. The same happened on monitor, see #176175. okurz first assumed that this might be related to too high load on OSD, which runs both the salt master and a salt minion. However, a similar problem appeared on monitor, which runs only a salt minion, so the salt master alone cannot be the cause. So far the problem has only happened on virtual machines (both OSD and monitor are VMs).
## Acceptance Criteria
* **AC1:** We have consistent and stable application of config files managed by salt
## Suggestions
* Try to reproduce the problem in a separate testing environment, e.g. a single VM from https://download.opensuse.org/distribution/leap/15.6/appliances/openSUSE-Leap-15.6-Minimal-VM.x86_64-kvm-and-xen.qcow2, and apply the local salt state from https://gitlab.suse.de/openqa/salt-states-openqa using the role webui and/or monitor while putting the VM under stress, e.g. with the application `stress` (e.g. from https://download.opensuse.org/repositories/server:/monitoring/15.6/x86_64/stress-1.0.4-lp156.5.3.x86_64.rpm).
* Read README from https://gitlab.suse.de/openqa/salt-states-openqa
* Run salt repeatedly with a command like `sudo nice env runs=300 count-fail-ratio salt --state-output=changes -C "*" state.apply queue=True | grep -v 'Result.*Clean' 2>&1 | tee -a salt_state.log`
* Give upstream research another try. So far okurz has not found anything related. Consider asking domain experts from the salt community.
* Maybe the problem is related to the rather outdated python+salt stack within our infrastructure (as we run Leap). So *after* you have reproduced the problem in a clean environment, consider running an updated python and/or salt as applicable, e.g. check whether you can also reproduce the problem within Tumbleweed.
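
When repeating salt runs in a test VM, it may help to detect automatically whether any managed file changed between two runs that should have been no-ops. The following is only a minimal sketch under assumptions: the helper names `snapshot_checksums` and `changed_files` are hypothetical and not part of salt or salt-states-openqa, and scanning a whole directory such as `/etc` is an assumption about where the affected config files live. Paths containing whitespace are not handled.

```shell
#!/bin/sh
# Hypothetical helpers, not part of salt or salt-states-openqa.

# Print "<sha256>  <path>" for every regular file below directory $1,
# sorted by path, so that two snapshots can be compared.
snapshot_checksums() {
    find "$1" -type f -exec sha256sum {} + 2>/dev/null | sort -k 2
}

# Print every path whose checksum differs between snapshot files $1 and $2,
# or which is present in only one of them.
# Limitation: paths containing whitespace are not supported.
changed_files() {
    awk 'NR == FNR { before[$2] = $1; next }
         { after[$2] = $1 }
         END {
             for (p in before)
                 if (!(p in after) || before[p] != after[p]) print p
             for (p in after)
                 if (!(p in before)) print p
         }' "$1" "$2" | sort
}
```

A possible usage, assuming the state has already converged: run `snapshot_checksums /etc > before.txt`, apply the salt state again under stress, run `snapshot_checksums /etc > after.txt`, then inspect the output of `changed_files before.txt after.txt`. Files that change between two runs that should both be clean are candidates for a closer look. Files that legitimately change at runtime will also show up and need to be filtered out manually.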