action #175407
coordination #161414 (closed): [epic] Improved salt based infrastructure management
salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S
Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
While looking into https://progress.opensuse.org/issues/174985, @nicksinger ran salt 'monitor.qe.nue2.suse.org' state.apply, which timed out. The mentioned salt-run jobs.lookup_jid 20250114115421987682 showed:
[…]
----------
ID: dehydrated.timer
Function: service.running
Result: True
Comment: The service dehydrated.timer is already running
Started: 12:56:56.183595
Duration: 126.445 ms
Changes:
----------
ID: systemctl start dehydrated
Function: cmd.run
Result: False
Comment: The following requisites were not found:
    require:
        id: webserver_config
Started: 12:56:56.324904
Duration: 0.008 ms
Changes:
Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed: 1
--------------
Total states run: 457
Total run time: 122.178 s
Apparently this broke with https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/6cd458a57044b50f60df936ac04b1033534c7a9d#e83a0785fb156ea339320f4bf1d083717c84a2ba_11_13, but we never noticed. Do we truncate our salt logs too much?
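As a rough illustration of how a failed state like the one above could be surfaced even when the log output is truncated, here is a minimal Python sketch. It assumes the return data is available as JSON in the shape salt commonly produces with --out=json (minion mapped to state entries carrying "result" and "__id__"); the function name failed_states and the "<no state data>" placeholder are hypothetical, not part of salt-states-openqa.

```python
from typing import Any


def failed_states(salt_return: dict[str, Any]) -> dict[str, list[str]]:
    """Map each minion to the IDs of its failed states.

    Assumed input shape (not verified against our pipeline):
    {minion: {state_key: {"result": bool, "__id__": str, ...}}}.
    A minion whose value is not a dict (for example an error string
    or False) is reported with the placeholder "<no state data>".
    """
    failures: dict[str, list[str]] = {}
    for minion, states in salt_return.items():
        if not isinstance(states, dict):
            failures[minion] = ["<no state data>"]
            continue
        bad = [
            state.get("__id__", key)
            for key, state in states.items()
            if isinstance(state, dict) and state.get("result") is False
        ]
        if bad:
            failures[minion] = bad
    return failures


# Reduced example modelled on the output quoted in this ticket.
example = {
    "monitor.qe.nue2.suse.org": {
        "service_|-dehydrated.timer_|-dehydrated.timer_|-running": {
            "__id__": "dehydrated.timer",
            "result": True,
        },
        "cmd_|-systemctl start dehydrated_|-systemctl start dehydrated_|-run": {
            "__id__": "systemctl start dehydrated",
            "result": False,
        },
    }
}
print(failed_states(example))
# → {'monitor.qe.nue2.suse.org': ['systemctl start dehydrated']}
```

A check like this works on the structured return data, so it would not depend on how much of the human-readable output the pipeline hides.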
Acceptance criteria
- AC1: We have a stable and clean salt-states-openqa pipeline again
- AC2: A pipeline only succeeds if all hosts currently controlled by salt responded
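One possible way to express AC2 as a check is sketched below in Python. It assumes the pipeline can obtain two sets of host names: the hosts it expects to be salt controlled and the hosts that actually returned a result. How those sets are gathered is left open; the names missing_minions and pipeline_exit_code and the host worker-a.example.org are hypothetical.

```python
def missing_minions(expected: set[str], responded: set[str]) -> list[str]:
    """Return the expected hosts that did not respond, sorted by name."""
    return sorted(expected - responded)


def pipeline_exit_code(expected: set[str], responded: set[str]) -> int:
    """Return 0 only if every expected host responded, otherwise 1."""
    missing = missing_minions(expected, responded)
    for host in missing:
        print(f"no response from {host}")
    return 1 if missing else 0


# Hypothetical inventory: one real host name from this ticket, one made up.
expected_hosts = {"monitor.qe.nue2.suse.org", "worker-a.example.org"}
responded_hosts = {"worker-a.example.org"}
print(pipeline_exit_code(expected_hosts, responded_hosts))
# prints "no response from monitor.qe.nue2.suse.org", then 1
```

Comparing against an explicit list of expected hosts means a host that silently drops out is reported instead of being ignored.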
Suggestions
- Does the same happen with other hosts?
- Check the job artifacts for salt deployments in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3609342/artifacts/browse
- Check what was done in the past to hide more of the salt output on pipeline runs, e.g. cross-check the execution line https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3439811#L42
- Not failing everything because of a small issue is good - but how to make sure "small issues" don't get out of hand?
- Can we somehow extract metrics from salt or from our pipelines (e.g. the "success percentage" of workers per pipeline)?
- Think about oqa-minion-jobs: they can fail, but we only care if the failures go above a certain threshold
- Research why we put "hide-timeout" into the salt state call, see commit c825939 from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/254, which unfortunately has no further content
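The "success percentage" and threshold ideas from the suggestions above could look roughly like the following Python sketch. It assumes a per-host boolean result is already available for a pipeline run; the function names, the host names and the 90% default threshold are hypothetical and only serve as illustration.

```python
def success_percentage(results: dict[str, bool]) -> float:
    """Percentage of hosts whose salt run succeeded (0.0 for no hosts)."""
    if not results:
        return 0.0
    return 100.0 * sum(results.values()) / len(results)


def should_alert(results: dict[str, bool], min_success_percent: float = 90.0) -> bool:
    """Alert when the success percentage drops below the chosen threshold."""
    return success_percentage(results) < min_success_percent


# Hypothetical run: three of four hosts applied their states cleanly.
run = {
    "worker-a.example.org": True,
    "worker-b.example.org": True,
    "worker-c.example.org": True,
    "monitor.qe.nue2.suse.org": False,
}
print(success_percentage(run))  # → 75.0
print(should_alert(run))        # → True
```

A threshold like this tolerates an occasional failing host but still raises an alert when failures accumulate, which is the trade-off discussed in the suggestions.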