action #175407
Updated by robert.richardson 29 days ago
## Observation

While looking into https://progress.opensuse.org/issues/174985, @nicksinger ran `salt 'monitor.qe.nue2.suse.org' state.apply`, which timed out. The mentioned `salt-run jobs.lookup_jid 20250114115421987682` showed:

```
[…]
----------
          ID: dehydrated.timer
    Function: service.running
      Result: True
     Comment: The service dehydrated.timer is already running
     Started: 12:56:56.183595
    Duration: 126.445 ms
     Changes:
----------
          ID: systemctl start dehydrated
    Function: cmd.run
      Result: False
     Comment: The following requisites were not found:
                  require:
                      id: webserver_config
     Started: 12:56:56.324904
    Duration: 0.008 ms
     Changes:

Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed:      1
--------------
Total states run:     457
Total run time:   122.178 s
```

Apparently this broke with https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/6cd458a57044b50f60df936ac04b1033534c7a9d#e83a0785fb156ea339320f4bf1d083717c84a2ba_11_13, but we never noticed. Do we truncate our salt logs *too much*?

## Acceptance criteria

* **AC1:** We have a stable and clean salt-states-openqa pipeline again
* **AC2:** A pipeline only succeeds if all currently salt-controlled hosts responded (one possible check is sketched below)

## Suggestions

* Does the same happen on other hosts?
* Check the job artifacts of salt deployments in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3609342/artifacts/browse
* Check what was done in the past to hide more of the salt output on pipeline runs, e.g. crosscheck how the execution line in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3439811#L42 runs
* Not failing everything because of a small issue is good, but how do we make sure "small issues" don't get out of hand?
* Can we somehow extract metrics from salt or from our pipelines (e.g. the "success percentage" of workers per pipeline)? A rough parsing sketch follows after the responsiveness check below.
* Think about oqa-minion-jobs: they can fail, but we only care once failures exceed a certain threshold
* Research why we put "hide-timeout" into the salt state call, see commit c825939
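For AC2 the pipeline would need to know whether every salt-controlled host actually answered. A minimal sketch of what such a check could look like, assuming it runs on the salt master and that `salt-run manage.status` is an acceptable source of truth; the script itself is hypothetical and does not exist in the repo:

```python
#!/usr/bin/env python3
"""Hypothetical helper for AC2: fail unless every accepted salt minion responds.

Not an existing script in salt-states-openqa; just an illustration.
"""
import json
import subprocess
import sys


def salt_run_json(*args):
    """Run a salt-run command on the master and parse its JSON output."""
    proc = subprocess.run(
        ["salt-run", "--out=json", *args],
        check=True, capture_output=True, text=True,
    )
    return json.loads(proc.stdout)


def main():
    # manage.status reports which minions answer as "up" and which as "down"
    status = salt_run_json("manage.status")
    up, down = status.get("up", []), status.get("down", [])
    print(f"{len(up)} minions up, {len(down)} down")
    if down:
        print("Unresponsive hosts: " + ", ".join(sorted(down)), file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Such a step could run before (or after) the highstate in the pipeline so that a non-responding host fails the job even if salt itself reports the rest as succeeded.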
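For the metrics/threshold suggestion, a rough sketch of how a per-host "success percentage" could be derived from a saved highstate result. It assumes the pipeline stored the output of `salt '*' state.apply --out=json --static` in a file; the file name, the 95% threshold and the script are made-up examples, not an agreed design:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: compute a per-host "success percentage" from salt output."""
import json
import sys

THRESHOLD = 95.0  # percent of succeeded states per host; arbitrary example value


def main(path):
    with open(path) as fh:
        result = json.load(fh)  # {minion_id: {state_id: {"result": bool, ...}}}

    exit_code = 0
    for host, states in sorted(result.items()):
        if not isinstance(states, dict):
            # A string or list here usually means the highstate did not even render
            print(f"{host}: no state results ({states})", file=sys.stderr)
            exit_code = 1
            continue
        total = len(states)
        ok = sum(1 for s in states.values() if isinstance(s, dict) and s.get("result"))
        percentage = 100.0 * ok / total if total else 0.0
        print(f"{host}: {ok}/{total} states succeeded ({percentage:.1f}%)")
        if percentage < THRESHOLD:
            exit_code = 1
    return exit_code


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```

The same numbers could also be pushed to our monitoring instead of (or in addition to) gating the pipeline, which would make it easier to spot "small issues" before they accumulate.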