action #176250
open · coordination #161414: [epic] Improved salt based infrastructure management
file corruption in salt controlled config files size:M
0%
Description
Observation
Multiple config files were somehow corrupted by salt or incompletely written. First #163790, then #175710, both on OSD. Also on monitor, see #176175. okurz first assumed that this might be related to too high load on OSD, which runs both the salt master and the salt minion, but as a similar problem appeared on monitor, which is salt-minion only, the salt master alone cannot be the cause. So far the problem has only happened on virtual machines (both OSD and monitor are VMs).
Acceptance Criteria
- AC1: We have consistent and stable application of config files managed by salt
Suggestions
- Try to reproduce the problem in a separate testing environment, e.g. a single VM from https://download.opensuse.org/distribution/leap/15.6/appliances/openSUSE-Leap-15.6-Minimal-VM.x86_64-kvm-and-xen.qcow2, and apply the local salt states from https://gitlab.suse.de/openqa/salt-states-openqa using the role webui and/or monitor while putting the VM under stress, e.g. with stress-ng
- Read README from https://gitlab.suse.de/openqa/salt-states-openqa
- Run salt repeatedly with a command like
sudo nice env runs=300 count-fail-ratio salt --state-output=changes -C "*" state.apply queue=True | grep -v 'Result.*Clean' 2>&1 | tee -a salt_state.log
- Give another try at upstream research. So far okurz has not found anything related. Consider asking domain experts from the salt community.
- Maybe the problem is related to our rather outdated python+salt stack within our infrastructure (as we run Leap). So once you can reproduce the problem in a clean environment, consider running an updated python and/or salt as applicable, e.g. check whether you can also reproduce the problem on Tumbleweed.
- If the problem cannot be reproduced in a synthetic environment then consider an idea from tinita: "adding another dummy.ini besides openqa.ini, with the same content, and in the loop calling salt.apply, the file is copied to a folder, so we have a list of files and can trace the changes between each call" (see the sketch below)
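A rough shell sketch of that last idea (the archive directory and the masterless salt-call invocation are assumptions, nothing that was tried yet):
# after each state.apply, archive the dummy file with a timestamp so changes between runs can be diffed
mkdir -p /var/tmp/ini-trace
for i in $(seq 1 300); do
    salt-call --local state.apply queue=True >/dev/null
    cp -a /etc/openqa/dummy.ini "/var/tmp/ini-trace/dummy.ini.$i.$(date -Is)"
done
# afterwards compare consecutive copies, e.g. diff -u dummy.ini.1.* dummy.ini.2.*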
Updated by okurz 2 months ago
- Copied from action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Updated by okurz 2 months ago
- Copied from deleted (action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17)
Updated by okurz 2 months ago
- Blocks action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Updated by okurz 2 months ago
- Blocks action #176175: [alert] Grafana failed to start due to corrupted config file added
Updated by ybonatakis 2 months ago
- Status changed from In Progress to Workable
- Assignee deleted (ybonatakis)
Here is what I tried to bring up a testing environment.
qemu-system-x86_64 -enable-kvm -m 2048 -smp 2 -drive file=openSUSE-Leap-15.6-Minimal-VM.x86_64-kvm-and-xen.qcow2,format=qcow2
Then inside the VM
- zypper in salt-minion git-core htop vim systemd-coredump stress-ng (apparently I don't need everything, but whatever)
- reboot and then echo testing >{/etc/salt/minion_id,/etc/hostname}
- touch /etc/salt/grains and echo "roles: worker" > /etc/salt/grains
- systemctl enable --now salt-minion, but I got errors that the master hostname salt could not be found. Kind of solved by running sed -i "s/localhost/& salt/" /etc/hosts, but there are still errors. Something is missing from the setup.
I could run salt-call --local state.apply but it was failing.
On other notes, I need to run the image with a proper network setup and a serial console as I couldn't clone from gitlab.suse.de. I did learn about a public repo, https://github.com/os-autoinst/salt-states-openqa/, but cloning it didn't solve the problem with salt-call: it still couldn't find the highstate.
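For reference, a masterless invocation that should let salt-call find a highstate from a plain git checkout looks roughly like this (the checkout path is an assumption and the top.sls is assumed to sit at the repository root; some states may additionally need pillar data via --pillar-root):
git clone https://github.com/os-autoinst/salt-states-openqa.git /srv/salt-states-openqa
# masterless apply using the checkout as the state tree
salt-call --local --file-root=/srv/salt-states-openqa --log-level=info state.apply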
Updated by okurz 2 months ago · Edited
On osiris I installed a new VM using br0. That failed with "cannot execute binary /usr/lib/qemu-bridge-helper: Permission denied: Transport endpoint is not connected". Following https://www.reddit.com/r/openSUSE/comments/q9jcmy/tutorial_how_to_use_bridged_network_on_a_gnome/ I did gpasswd -a okurz kvm
cd /var/lib/libvirt/images
wget https://download.opensuse.org/distribution/leap/15.6/appliances/openSUSE-Leap-15.6-Minimal-VM.x86_64-kvm-and-xen.qcow2
cp -al openSUSE-Leap-15.6-Minimal-VM.x86_64-kvm-and-xen.qcow2 okurz-poo176250.qcow2
Created VM "okurz-poo176250" using that qcow2 and br0.
Standard root QA testing password for manual tests s…g
Currently reachable as d4-147.qe.nue2.suse.org. Configured a local user account and added my SSH key so I can now use ssh d4-147.qe.nue2.suse.org.
Configured screen and salt, and am now running
nice env runs=300 count-fail-ratio sh -c "salt --state-output=changes --no-color \* state.apply queue=True | grep -v 'Result.*Clean' 2>&1 | tee -a salt_state-$(date -Is).log && ls -l /etc/openqa/openqa.ini"
got
Data failed to compile:
----------
No matching sls found for 'openqa.server' in env 'base'
----------
No matching sls found for 'openqa.openqa-trigger-from-ibs' in env 'base'
----------
No matching sls found for 'certificates.dehydrated' in env 'base'
and corrupted btrfs?!?
Updated by openqa_review 2 months ago
- Due date set to 2025-02-13
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 2 months ago · Edited
Recreated the VM and trying again.
Now running
nice env runs=300 count-fail-ratio sh -c "salt --state-output=changes --no-color \* state.apply queue=True | grep -v 'Result.*Clean' 2>&1 | tee -a salt_state-$(date -Is).log 2>&1 && cp -a /etc/openqa/openqa.ini{,.\$(date -Is)} && ls -l /etc/openqa/openqa.ini*"
and in parallel
stress-ng --cpu 8 --iomix 4 --vm 2 --vm-bytes 128M --fork 4
EDIT: 2025-01-30 14:23Z intermediate state
-rw-r--r-- 1 geekotest root 17232 Jan 30 14:18 /etc/openqa/openqa.ini
-rw-r--r-- 1 geekotest root 17232 Jan 30 12:18 /etc/openqa/openqa.ini.2025-01-30T12:19:05+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 12:44 /etc/openqa/openqa.ini.2025-01-30T12:19:27+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 12:54 /etc/openqa/openqa.ini.2025-01-30T12:57:26+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 12:59 /etc/openqa/openqa.ini.2025-01-30T13:01:53+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:03 /etc/openqa/openqa.ini.2025-01-30T13:06:21+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:08 /etc/openqa/openqa.ini.2025-01-30T13:10:42+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:12 /etc/openqa/openqa.ini.2025-01-30T13:15:03+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:16 /etc/openqa/openqa.ini.2025-01-30T13:19:28+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:21 /etc/openqa/openqa.ini.2025-01-30T13:23:47+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:25 /etc/openqa/openqa.ini.2025-01-30T13:28:08+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:30 /etc/openqa/openqa.ini.2025-01-30T13:32:36+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:34 /etc/openqa/openqa.ini.2025-01-30T13:36:51+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:38 /etc/openqa/openqa.ini.2025-01-30T13:41:22+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:43 /etc/openqa/openqa.ini.2025-01-30T13:45:45+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:47 /etc/openqa/openqa.ini.2025-01-30T13:50:13+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:52 /etc/openqa/openqa.ini.2025-01-30T13:54:40+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 13:56 /etc/openqa/openqa.ini.2025-01-30T13:59:10+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 14:00 /etc/openqa/openqa.ini.2025-01-30T14:03:35+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 14:05 /etc/openqa/openqa.ini.2025-01-30T14:07:57+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 14:09 /etc/openqa/openqa.ini.2025-01-30T14:12:27+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 14:14 /etc/openqa/openqa.ini.2025-01-30T14:16:50+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 30 14:18 /etc/openqa/openqa.ini.2025-01-30T14:21:14+00:00
## count-fail-ratio: Run: 20. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 15.00%
## mean runtime: 264546±3954.77 ms
so all file sizes are correct while the system has a load average of 43.88 44.04 44.65
Updated by okurz 2 months ago
## count-fail-ratio: Run: 271. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 1.10%
## mean runtime: 268072±5618.01 ms
with no differences in file sizes. That means salt on Leap 15.6 with our salt states can be very stable even under high system load. The differences vs. OSD are at least that on OSD some rules fail to apply and that this test VM has no other load, e.g. no openQA jobs or deployments going on while salt runs. What happens if openQA packages are upgraded and change ini files while salt is running? Trying an experiment.
Updated by okurz 2 months ago · Edited
from /var/log/salt/minion
OSError: [Errno 28] No space left on device
salt.exceptions.CommandExecutionError: Unable to write file '/etc/openqa/openqa.ini'. Exception: [Errno 28] No space left on device
2025-01-31 21:48:31,329 [salt.state :327 ][ERROR ][5219] An exception occurred in this state: OSError: [Errno 28] No space left on device
and
-rw-r--r-- 1 geekotest root 17232 Jan 31 21:25 /etc/openqa/openqa.ini.2025-01-31T21:28:43+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 31 21:30 /etc/openqa/openqa.ini.2025-01-31T21:34:04+00:00
-rw-r--r-- 1 geekotest root 17232 Jan 31 21:36 /etc/openqa/openqa.ini.2025-01-31T21:39:34+00:00
-rw-r--r-- 1 geekotest root 4024 Jan 31 21:41 /etc/openqa/openqa.ini.2025-01-31T21:43:16+00:00
-rw-r--r-- 1 geekotest root 6801 Jan 31 21:45 /etc/openqa/openqa.ini.2025-01-31T21:49:50+00:00
So in the end the storage was depleted, which caused the incomplete file write. I assume the repeated zypper calls, with no chance for snapper to trigger a cleanup within a reasonable time, caused the problems here. Triggered systemctl start snapper-{timeline,boot} to clean up, but that did not clean up much. Now running
while sleep 20; do zypper -n in --force openQA; sleep 5; snapper cleanup number; done
and continuing the experiment, but without stress-ng as that mostly stalls the btrfs cleanup.
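To spot the filesystem filling up earlier next time, a simple watcher like the following could run next to that loop (just an idea, not part of the experiment above):
# log btrfs usage and the number of snapper snapshots once a minute to correlate ENOSPC with the salt runs
while sleep 60; do
    date -Is
    btrfs filesystem usage / | grep -E 'Used|Free'
    snapper list | wc -l
done | tee -a disk_usage.log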
Updated by okurz 2 months ago
- Copied to action #176421: Support for config files in openqa.d size:S added
Updated by okurz about 2 months ago
No real config file corruption has occurred until now
## count-fail-ratio: Run: 1515. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < .19%
## mean runtime: 125157±6567.69 ms
and nobody had a better idea what to try besides again trying to break production, which I can consider doing over a weekend (or never).
Updated by okurz about 2 months ago
- Due date deleted (2025-02-13)
- Status changed from In Progress to Workable
- Assignee deleted (okurz)
- Priority changed from High to Low
- Target version changed from Ready to future
Updated by tinita about 2 months ago
- Related to action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S added
Updated by okurz 14 days ago
nicksinger had the great idea to use inotifywait to monitor for file operations on /etc/openqa/openqa.ini. I assume there should be only one of three actors writing to /etc/openqa/openqa.ini:
- salt from https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls?ref_type=heads#L37
- RPM package upgrades with zypper. But we use config(noreplace), so that shouldn't be a thing for us, or is it?
- human operators making manual changes, usually in that environment with vim. That would manifest in also seeing .swp and .ini~ files written temporarily as well as openqa.ini being replaced in one go from the swap file.
I was running
while true; do inotifywait -e CLOSE_WRITE,MOVE,MOVE_SELF,CREATE,DELETE,DELETE_SELF --monitor --format '%f %e %T' --timefmt '%F-%T' /etc/openqa/ ; done
and observed blocks that look like this:
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:04
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:04
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:04
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:04
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:04
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
openqa.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:05
database.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:06
database.ini CLOSE_WRITE,CLOSE 2025-03-21-12:38:06
which I can also trigger if I call salt 'openqa.suse.de' state.apply openqa.server, so I assume the salt ini module does multiple individual open+close operations instead of a single one with all changes at once. This might be connected to the corruption we observe and is one more reason to consider moving to separate files in a config directory.
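A quick way to test that assumption would be to count the CLOSE_WRITE events on openqa.ini during exactly one state.apply run, roughly like this (untested sketch):
# count close-after-write events on openqa.ini while a single state.apply runs;
# a count well above 1 supports the "multiple open+close per apply" theory
timeout 300 inotifywait --monitor -e close_write --format '%f' /etc/openqa/ | grep -c '^openqa\.ini$' &
salt 'openqa.suse.de' state.apply openqa.server
wait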
Updated by okurz 14 days ago
- Related to action #177276: Make use of config files in openqa.ini.d for OSD specific settings size:S added