action #175407
closed coordination #161414: [epic] Improved salt based infrastructure management
salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S
Description
Observation
While looking into https://progress.opensuse.org/issues/174985, @nicksinger ran salt 'monitor.qe.nue2.suse.org' state.apply, which timed out. The mentioned salt-run jobs.lookup_jid 20250114115421987682 showed:
[…]
----------
ID: dehydrated.timer
Function: service.running
Result: True
Comment: The service dehydrated.timer is already running
Started: 12:56:56.183595
Duration: 126.445 ms
Changes:
----------
ID: systemctl start dehydrated
Function: cmd.run
Result: False
Comment: The following requisites were not found:
require:
id: webserver_config
Started: 12:56:56.324904
Duration: 0.008 ms
Changes:
Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed: 1
--------------
Total states run: 457
Total run time: 122.178 s
Apparently this broke with https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/6cd458a57044b50f60df936ac04b1033534c7a9d#e83a0785fb156ea339320f4bf1d083717c84a2ba_11_13 but we never realized it. Do we truncate our salt logs too much?
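For reference, a minimal sketch of how such a dangling requisite could be caught before it lands, assuming a local checkout of salt-states-openqa and access to the salt master (the minion name is just the affected example from above):

# Does any state still declare the ID that the cmd.run state requires?
grep -rn 'webserver_config' .

# Dry run on the affected minion; unresolved requisites should surface as
# "The following requisites were not found" while test=True avoids applying changes:
salt --state-output=changes 'monitor.qe.nue2.suse.org' state.apply test=True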
Acceptance criteria
- AC1: We have a stable and clean salt-states-openqa pipeline again
- AC2: A pipeline only succeeds if all currently salt-controlled hosts responded (see the sketch below)
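A rough sketch of what such a completeness check could look like, assuming the pipeline keeps the salt_highstate.log artifact mentioned later in this ticket (this is not the pipeline's current check):

# All minions the master knows about ("Accepted Keys:" header stripped):
salt-key --list accepted | tail -n +2 | sort > /tmp/expected

# All hosts that actually produced a highstate summary in the log:
grep -oP 'Summary for \K\S+' salt_highstate.log | sort > /tmp/responded

# Fail the pipeline if any expected host is missing from the responses:
diff -u /tmp/expected /tmp/responded && echo "all salt-controlled hosts responded"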
Suggestions
- Does the same happen with other hosts?
- Check the job artifacts for salt deployments in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3609342/artifacts/browse
- Check what was done in the past to hide more of the salt-output on pipeline runs, e.g. crosscheck execution line https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3439811#L42
- Not failing everything because of a small issue is good - but how to make sure "small issues" don't get out of hand?
- Can we somehow extract metrics from salt or from our pipelines? (e.g. "success percentage" of workers per pipeline, etc.) See the sketch after this list
- Think about oqa-minion-jobs: they can fail but we only care if failures exceed a certain threshold
- Research why we put "hide-timeout" into the salt state call, see commit c825939 from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/254 which unfortunately does not provide more context
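Regarding the metrics idea above, a hypothetical extraction from a deployment log (the artifact name salt_highstate.log is taken from a later comment; where to ship the numbers is left open):

# Per-host success percentage, parsed from the "Summary for <host>" blocks
# that salt prints at the end of each highstate:
awk '/Summary for / { host = $3 }
     /Succeeded:/   { ok = $2 }
     /Failed:/      { printf "%s %.1f%%\n", host, ok / (ok + $2) * 100 }' salt_highstate.log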
Updated by nicksinger 24 days ago
- Related to action #174985: [alert] salt-states-openqa | Failed pipeline for master "salt.exceptions.SaltReqTimeoutError: Message timed out" size:S added
Updated by robert.richardson 24 days ago
- Subject changed from salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us to salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size: S
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger 23 days ago
okurz wrote in #note-7:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1336
merged. Interestingly, in your deployment job monitor is missing again. The associated salt_highstate.log artifact however mentions it somehow:
2025-01-14 16:41:51,723 [salt.client :1167][DEBUG ][12951] get_iter_returns for jid 20250114164151566706 sent to {'…-prg4.qa.suse.cz', 'grenache-1.oqa.prg2.suse.org', 'baremetal-support.qe.nue2.suse.org', 'monitor.qe.nue2.suse.org', 'storage.qe.prg2.suse.org', …} will timeout at 16:43:21.723622
In a following MR deployment job, monitor shows up with the expected issue. However, for some reason OSD timed out. I looked up the job (with salt-run jobs.lookup_jid 20250114193508323824) and found:
Summary for openqa.suse.de
--------------
Succeeded: 629 (changed=117)
Failed: 0
--------------
Total states run: 629
Total run time: 172.592 s
[…]
Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed: 1
--------------
Total states run: 457
Total run time: 109.676 s
[…]
Summary for backup-qam.qe.nue2.suse.org
--------------
Succeeded: 286
Failed: 0
--------------
Total states run: 286
Total run time: 161.726 s
So either your change did not apply correctly, or we need to bump it even more because the overall run includes some more overhead beyond the reported state run time.
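As a reference for the bumping discussion, a minimal sketch of where the relevant knobs usually live (this is an assumption about the setup, not necessarily what MR 1336 changed):

# Master-side settings for how long the client waits for minion returns:
grep -RnE '^(timeout|gather_job_timeout):' /etc/salt/master /etc/salt/master.d/ 2>/dev/null

# The CLI can also override the wait per invocation:
salt -t 300 'monitor.qe.nue2.suse.org' state.apply queue=True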
For the initial issue of the missing salt state requirement, I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1338
Updated by nicksinger 22 days ago
I wanted to check what would be a good value to bump it to and looked into https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3666125
$ salt-run jobs.lookup_jid 20250116085031017432
Summary for petrol.qe.nue2.suse.org
--------------
Succeeded: 423 (changed=2)
Failed: 1
--------------
Total states run: 424
Total run time: 3219.013 s
[…]
Summary for diesel.qe.nue2.suse.org
--------------
Succeeded: 422 (changed=2)
Failed: 1
--------------
Total states run: 423
Total run time: 1795.992 s
I guess only raising timeouts won't cut it.
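To size a timeout against the worst case rather than the average, one could pull the per-host durations out of a pipeline log, for example (the artifact name is assumed again):

# Hosts sorted by highstate duration, slowest first:
awk '/Summary for /    { host = $3 }
     /Total run time:/ { print $4, host }' salt_highstate.log | sort -rn | head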
Updated by okurz 22 days ago
- Copied to action #175629: diesel+petrol (possibly all ppc64le OPAL machines) often run into salt error "Not connected" or "No response" due to wireguard services failing to start on boot size:S added
Updated by okurz 21 days ago
- Status changed from Resolved to Workable
The situation is not stable, see https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3677962. Various hosts can run into the timeout. We need to think harder about what to do, or work with even much bigger timeouts.
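For illustration, the "even much bigger timeouts" option could look like this on the CLI (900 s is an arbitrary value, not taken from any MR):

# -t / --timeout raises how long the salt CLI waits for minion returns
# before giving up on minions that have not returned yet:
salt -t 900 --state-output=changes -C '*' state.apply queue=True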
Updated by okurz 18 days ago
ybonatakis suggested looking into "gather_job_timeout" as well; however, based on https://docs.saltproject.io/en/latest/ref/configuration/master.html#gather-job-timeout this does not sound related.
I tried
sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "G@roles:worker and G@osarch:ppc64le" state.apply queue=True | grep -v 'Result.*Clean'
but that failed. nicksinger is currently looking into syntax errors for #162296. Waiting for his progress before trying again.
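For reference, the same invocation with short notes on what each part does; count-fail-ratio is, presumably, the repeat-and-count helper from os-autoinst/scripts that reads the runs environment variable:

# runs=30                  -> count-fail-ratio repeats the wrapped command 30 times
# --state-output=changes   -> full output only for changed/failed states, one-liners otherwise
# -C "G@roles:..."         -> compound matcher: only ppc64le worker minions
# queue=True               -> queue behind an already running state run instead of failing
# grep -v 'Result.*Clean'  -> drop the one-line entries for unchanged states
sudo nice env runs=30 count-fail-ratio \
  salt --state-output=changes -C "G@roles:worker and G@osarch:ppc64le" \
  state.apply queue=True | grep -v 'Result.*Clean'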
Updated by okurz 18 days ago
merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1343
I will merge this now but will still test different sets of parameters after current syntax issues are fixed by @nicksinger
Updated by okurz 18 days ago
- Related to action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" added
Updated by okurz 18 days ago
- Related to deleted (action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response")
Updated by okurz 18 days ago
- Blocks action #175740: [alert] deploy pipeline for salt-states-openqa failed, multiple host run into salt error "Not connected" or "No response" added
Updated by okurz 18 days ago
- Status changed from Feedback to In Progress
I called sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "G@roles:worker and G@osarch:ppc64le" state.apply queue=True | grep -v 'Result.*Clean'
and got
…
petrol.qe.nue2.suse.org:
----------
…
Total states run: 428
Total run time: 59.859 s
diesel.qe.nue2.suse.org:
----------
…
Total run time: 66.725 s
grenache-1.oqa.prg2.suse.org:
----------
…
Total run time: 75.593 s
mania.qe.nue2.suse.org:
Summary for mania.qe.nue2.suse.org
--------------
…
Total run time: 75.174 s
## count-fail-ratio: Run: 30. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 10.00%
## mean runtime: 91588±5128.76 ms
So each machine takes 60-75 s and overall one run of state.apply takes about 92 s with no failures. Running again with all nodes: sudo nice env runs=30 count-fail-ratio salt --state-output=changes -C "*" state.apply queue=True | grep -v 'Result.*Clean'
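As a side note on the reported bound: "No fails, computed failure probability < 10.00%" for 0 fails in 30 runs is consistent with the rule of three (a 95% upper bound of roughly 3/n), as a quick check shows:

# Rule-of-three upper bound (~3/n at 95% confidence) for zero observed failures,
# matching the "< 10.00%" line above and the "< 1.00%" line of the n=300 run below:
awk 'BEGIN { printf "n=30  -> < %.2f%%\n", 3/30*100;
             printf "n=300 -> < %.2f%%\n", 3/300*100 }'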
Updated by openqa_review 17 days ago
- Due date set to 2025-02-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 17 days ago
- Status changed from Blocked to In Progress
End of the European business day is approaching, so I am running another set of tests. First sudo nice env runs=300 count-fail-ratio salt -C \* test.true
yields
## count-fail-ratio: Run: 300. Fails: 0. Fail ratio 0±0%. No fails, computed failure probability < 1.00%
## mean runtime: 1648±1226.41 ms
So this is stable and no problem there. Running the full state again.
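A sketch of the distinction being probed here: test.true only round-trips a trivial job over the salt transport, while state.apply renders and executes the whole highstate, so its wall time is dominated by the slowest host:

# Transport/return path only; should complete within a few seconds:
time salt -C '*' test.true

# Full highstate on every minion (flags as used in this ticket's manual runs):
time salt --state-output=changes -C '*' state.apply queue=True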
Updated by okurz 16 days ago
- Status changed from In Progress to Workable
Results from the overnight run:
## count-fail-ratio: Run: 300. Fails: 16. Fail ratio 5.33±2.54%
## mean runtime: 192065±41227.61 ms
Due to limited screen scrollback I did not record the individual failures. Still, with a fail ratio of 5% I declare this as "stable" enough. AC2 is also covered. The rest is to be followed up, for example in #175629.
I want to run another big run over the next night but with a lower timeout, e.g. salt -t 180 …
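What such a lower-timeout run could look like, written out purely as an assumption (the exact command was not recorded above; -t 180 caps how long the CLI waits for minion returns):

# Assumed shape of the overnight run with a reduced timeout:
sudo nice env runs=300 count-fail-ratio \
  salt -t 180 --state-output=changes -C '*' state.apply queue=True \
  | grep -v 'Result.*Clean'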
Updated by okurz 15 days ago
- Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Updated by okurz 15 days ago
- Due date deleted (2025-02-04)
- Status changed from In Progress to Resolved
## count-fail-ratio: Run: 262. Fails: 166. Fail ratio 63.35±5.83%
## mean runtime: 183087±7143.24 ms
So, very bad. But all machines are still reachable, so I think my approach of increasing the timeout was good.
We again observed corruption, see #175710. I think that, one way or another, me running salt state.apply in a loop might help to reproduce whatever underlying sporadic issue we have more easily.
Updated by jbaier_cz 3 days ago
- Related to action #175989: Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S added