action #133457
Updated by okurz over 1 year ago
The following recent failures:

* https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1714239
* https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/1714549

## Observation

```
Name: /etc/systemd/system/auto-update.service - Function: file.managed - Result: Clean - Started: 21:29:26.689214 - Duration: 359.255 ms
Name: service.systemctl_reload - Function: module.run - Result: Clean - Started: 21:29:27.053802 - Duration: 0.018 ms
Name: auto-upgrade.service - Function: service.dead - Result: Clean - Started: 21:29:27.054218 - Duration: 61.444 ms
Name: auto-upgrade.timer - Function: service.dead - Result: Clean - Started: 21:29:27.116368 - Duration: 82.058 ms
Name: auto-update.timer - Function: service.running - Result: Clean - Started: 21:29:27.203488 - Duration: 255.774 ms

Summary for openqa.suse.de
--------------
Succeeded: 345 (changed=30)
Failed:      0
--------------
Total states run:     345
Total run time:   383.468 s
.++ echo -n .
++ true
++ sleep 1
.++ echo -n .
```

[...]

```
WARNING: Failed to pull image with policy "always": Error response from daemon: unknown: SSL_connect error: error:1408F10B:SSL routines:ssl3_get_record:wrong version number (manager.go:237:0s)
++ true
++ sleep 1
.++ echo -n .
++ true
++ sleep 1
ERROR: Job failed: failed to pull image "registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest" with specified policies [always]: Error response from daemon: unknown: SSL_connect error: error:1408F10B:SSL routines:ssl3_get_record:wrong version number (manager.go:237:0s)
execution took longer than 2h0m0s seconds
```

## Acceptance criteria

* **AC1:** bot-ng pipelines are executed successfully repeatedly and jobs commonly don't run into the 2h gitlab CI timeout
* **AC2:** We can identify the faulty salt minion (because very likely it's one of those being stuck)

## Suggestions

* The jobs fail well before any script execution, so this is nothing we control within .gitlab-ci.yml, *or can we?*
* Look up an older ticket and read what we did there about this
* Check if there are actually artifacts uploaded or not
* Check if machines can be reached over salt (see the pre-check sketch after this list)
* Check usual runtimes of salt state apply
* Try if the initial container image download fails. Maybe it is reproducible
* Research upstream what can be done
* Research upstream if there is anything better we can do to prevent running into the seemingly hardcoded gitlab 2h timeout, e.g. specify a retry for what the executor is trying to pull (see the retry sketch below), or spawn an internal super-mini image and call the container pull nested in there
* Report an SD ticket that they should fix the infrastructure
* Run the internal salt apply command with a timeout well below 2h, e.g. in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml#L43 just prepend "timeout 1h …" (see the timeout sketch below)
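For checking whether machines can be reached over salt and for identifying the faulty minion (AC2), a minimal pre-check sketch is below. The job name `check_minions` and its placement in the pipeline are assumptions; only `salt-run manage.down` and `test.ping` are standard Salt commands run on the master.

```yaml
# Hypothetical pre-check job for the salt-states-openqa pipeline (not the
# current deploy.yml content): fail fast and name unresponsive minions
# instead of letting the state apply hang until the 2h GitLab CI timeout.
check_minions:
  stage: deploy
  script:
    # list minions the master currently considers down
    - salt-run manage.down
    # ping all minions with a short timeout; a stuck minion shows up as False or missing
    - timeout 5m salt --timeout=60 '*' test.ping
```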
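Regarding a retry for what the executor is trying to pull: GitLab CI offers a job-level `retry` keyword that can be restricted to failure types occurring outside the job script, such as the runner failing to pull the image. A sketch, with the job name being an assumption:

```yaml
# Hypothetical sketch: retry the job when the failure happens before/outside
# our script, e.g. the runner failing to pull qam-ci-leap from registry.suse.de.
bot-ng:
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
```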
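For running the salt apply with a timeout well below 2h, here is a sketch of the "prepend timeout 1h" idea. The actual command on deploy.yml#L43 may differ; the `salt … state.apply` invocation below is only a placeholder.

```yaml
# Hypothetical sketch of the deploy job: cap both the salt call and the job
# itself well below the seemingly hardcoded 2h GitLab CI limit.
deploy:
  timeout: 1h 30m   # job-level GitLab CI timeout
  script:
    # prepend "timeout 1h" so a single stuck minion aborts the apply early;
    # the real invocation in deploy.yml#L43 is likely different
    - timeout 1h salt --state-output=changes '*' state.apply
```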