action #175407
Updated by robert.richardson 29 days ago
## Observation

While looking into https://progress.opensuse.org/issues/174985, @nicksinger ran `salt 'monitor.qe.nue2.suse.org' state.apply`, which timed out. The mentioned `salt-run jobs.lookup_jid 20250114115421987682` showed:

```
[…]
----------
          ID: dehydrated.timer
    Function: service.running
      Result: True
     Comment: The service dehydrated.timer is already running
     Started: 12:56:56.183595
    Duration: 126.445 ms
     Changes:
----------
          ID: systemctl start dehydrated
    Function: cmd.run
      Result: False
     Comment: The following requisites were not found:
                  require:
                      id: webserver_config
     Started: 12:56:56.324904
    Duration: 0.008 ms
     Changes:

Summary for monitor.qe.nue2.suse.org
--------------
Succeeded: 456 (changed=4)
Failed:      1
--------------
Total states run:     457
Total run time:   122.178 s
```

Apparently this broke with https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/6cd458a57044b50f60df936ac04b1033534c7a9d#e83a0785fb156ea339320f4bf1d083717c84a2ba_11_13, but we never noticed. Do we truncate our salt logs *too much*?

## Acceptance criteria

* **AC1:** We have a stable and clean salt-states-openqa pipeline again
* **AC2:** A pipeline only succeeds if all currently salt-controlled hosts responded (one possible check is sketched below)

## Suggestions

* Does the same happen on other hosts?
* Check the job artifacts of salt deployments in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3609342/artifacts/browse
* Check what was done in the past to hide more of the salt output on pipeline runs, e.g. crosscheck how the execution line in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3439811#L42 runs
* Not failing everything because of a small issue is good, but how do we make sure "small issues" don't get out of hand?
* Can we somehow extract metrics from salt or from our pipelines (e.g. the "success percentage" of workers per pipeline)? A rough parsing sketch follows after the responsiveness check below.
* Think about oqa-minion-jobs: they can fail, but we only care once failures exceed a certain threshold
* Research why we put "hide-timeout" into the salt state call, see commit c825939
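For AC2 the pipeline would need to know whether every salt-controlled host actually answered. A minimal sketch of what such a check could look like, assuming it runs on the salt master and that `salt-run manage.status` is an acceptable source of truth; the script itself is hypothetical and does not exist in the repo:

```python
#!/usr/bin/env python3
"""Hypothetical helper for AC2: fail unless every accepted salt minion responds.

Not an existing script in salt-states-openqa; just an illustration.
"""
import json
import subprocess
import sys


def salt_run_json(*args):
    """Run a salt-run command on the master and parse its JSON output."""
    proc = subprocess.run(
        ["salt-run", "--out=json", *args],
        check=True, capture_output=True, text=True,
    )
    return json.loads(proc.stdout)


def main():
    # manage.status reports which minions answer as "up" and which as "down"
    status = salt_run_json("manage.status")
    up, down = status.get("up", []), status.get("down", [])
    print(f"{len(up)} minions up, {len(down)} down")
    if down:
        print("Unresponsive hosts: " + ", ".join(sorted(down)), file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Such a step could run before (or after) the highstate in the pipeline so that a non-responding host fails the job even if salt itself reports the rest as succeeded.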
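For the metrics/threshold suggestion, a rough sketch of how a per-host "success percentage" could be derived from a saved highstate result. It assumes the pipeline stored the output of `salt '*' state.apply --out=json --static` in a file; the file name, the 95% threshold and the script are made-up examples, not an agreed design:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: compute a per-host "success percentage" from salt output."""
import json
import sys

THRESHOLD = 95.0  # percent of succeeded states per host; arbitrary example value


def main(path):
    with open(path) as fh:
        result = json.load(fh)  # {minion_id: {state_id: {"result": bool, ...}}}

    exit_code = 0
    for host, states in sorted(result.items()):
        if not isinstance(states, dict):
            # A string or list here usually means the highstate did not even render
            print(f"{host}: no state results ({states})", file=sys.stderr)
            exit_code = 1
            continue
        total = len(states)
        ok = sum(1 for s in states.values() if isinstance(s, dict) and s.get("result"))
        percentage = 100.0 * ok / total if total else 0.0
        print(f"{host}: {ok}/{total} states succeeded ({percentage:.1f}%)")
        if percentage < THRESHOLD:
            exit_code = 1
    return exit_code


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```

The same numbers could also be pushed to our monitoring instead of (or in addition to) gating the pipeline, which would make it easier to spot "small issues" before they accumulate.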