action #133457
closed
salt-states-openqa gitlab CI pipeline aborted with error after 2h of execution size:M
Added by livdywan over 1 year ago.
Updated over 1 year ago.
Description
Observation
https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1714239
Name: /etc/systemd/system/auto-update.service - Function: file.managed - Result: Clean - Started: 21:29:26.689214 - Duration: 359.255 ms
Name: service.systemctl_reload - Function: module.run - Result: Clean - Started: 21:29:27.053802 - Duration: 0.018 ms
Name: auto-upgrade.service - Function: service.dead - Result: Clean - Started: 21:29:27.054218 - Duration: 61.444 ms
Name: auto-upgrade.timer - Function: service.dead - Result: Clean - Started: 21:29:27.116368 - Duration: 82.058 ms
Name: auto-update.timer - Function: service.running - Result: Clean - Started: 21:29:27.203488 - Duration: 255.774 ms
Summary for openqa.suse.de
--------------
Succeeded: 345 (changed=30)
Failed: 0
--------------
Total states run: 345
Total run time: 383.468 s.
++ echo -n .
++ true
++ sleep 1
.++ echo -n .
[...]
++ true
++ sleep 1
.++ echo -n .
++ true
++ sleep 1
ERROR: Job failed: execution took longer than 2h0m0s seconds
Acceptance criteria
- AC1: Jobs commonly do not run into the 2h GitLab CI timeout
- AC2: We can identify the faulty salt minion (very likely one of the minions is stuck)
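Regarding AC2, a hedged sketch of how a stuck minion could be identified from the salt master; these are standard salt CLI calls, the exact invocation on OSD is an assumption:
# list minions that are not responding to the master
sudo salt-run manage.down
# or ping all minions with a short timeout and compare the responders
# against the expected set of minions
sudo salt --timeout 10 --hide-timeout '*' test.ping --out json 2>/dev/null | jq 'keys[]' | sort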
Suggestions
- look up an older ticket and read what we did there about this
- check if there are actually artifacts uploaded or not
- check if machines can be reached over salt
- check usual runtimes of salt state apply
- try if it is reproducible
- research upstream whether there is anything better we can do to prevent running into the seemingly hardcoded GitLab 2h timeout
- wrap the internal salt apply command with a timeout well below 2h, e.g. in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml#L43 just prepend "timeout 1h …" (see the sketch below)
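For the last suggestion, a minimal sketch of what the prepended timeout could look like; the exact salt invocation in deploy.yml is an assumption and may differ:
# abort the salt run well before GitLab's hard 2h job limit so the step
# fails with a clear exit status instead of the runner killing the job
timeout 1h ssh "$TARGET" 'sudo salt \* state.apply' \
  || { echo "salt state.apply did not finish within 1h"; exit 1; }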
- Copied from action #123894: qem-bot+openqa-bot gitlab CI pipeline aborted with error after 1h of execution added
- Related to action #119479: openqABot pipeline failed after runner getting stuck for 1h0m0s size:M added
- Tags set to infra, alert, gitlab CI
- Target version set to Ready
- Priority changed from Normal to Urgent
- Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 1h of execution to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution
- Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M
- Description updated (diff)
- Status changed from New to Workable
- Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M to bot-ng - pipelines in GitLab fail to pull qam-ci-leap:latest size:M
- Description updated (diff)
- Subject changed from bot-ng - pipelines in GitLab fail to pull qam-ci-leap:latest size:M to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M
- Description updated (diff)
- Related to action #133793: salt-pillars-openqa failing to apply within 2h and it is not clear which minion(s) are missing size:M added
- Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M to salt-states-openqa gitlab CI pipeline aborted with error after 2h of execution size:M
Renaming to reflect what this is about
- Status changed from Workable to In Progress
- Assignee set to livdywan
- Priority changed from Urgent to High
We're still able to identify issues with most pipeline runs, hence lowering to High. And while we don't know exactly what's going on, I'm going to look into what we can do about the redundant sleep spam: whether we can reduce it, or find another way to highlight which step the pipeline failed at.
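As a hedged sketch, assuming the spam comes from a wait loop running under set -x in the deploy script, the tracing could be silenced just around that loop (deploy_finished is a hypothetical placeholder for the real condition):
# disable xtrace only for the wait loop so the job log is not flooded with
# "++ echo -n ." / "++ sleep 1" lines on every iteration
{ set +x; } 2>/dev/null
while ! deploy_finished; do
  echo -n .
  sleep 1
done
set -x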
- Status changed from In Progress to Feedback
I prepared an MR. I don't know if it's correct, though, so if it turns out not to be we can consider reverting it.
- Due date set to 2023-09-08
https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1789040#L54 shows
$ ssh $TARGET "salt --log-file=salt_syncupdate.log --log-file-level=debug --state-output=mixed --hide-timeout \* saltutil.sync_grains,saltutil.refresh_grains,saltutil.refresh_pillar,mine.update ,,,"
[ERROR ] Encountered StreamClosedException
[ERROR ]
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
but then the job still succeeds. Please look into that.
It looks like salt is printing errors on stdout, which we then end up trying to parse with jq:
++ tee salt_ping.log
Currently the following minions are down:
jq: error (at <stdin>:1): string ("Salt reque...) has no keys
jq: error (at <stdin>:1): string ("Salt reque...) has no keys
And I can't reproduce it. Just for the record, this is what the output should probably look like:
$ sudo salt --timeout 1 --hide-timeout '*' test.ping --out json 2>/dev/null | jq 'keys[]' | sort
"openqa-piworker.qa.suse.de"
"openqa.suse.de
Ultimately this is due to #132146#note-35, but the fact that the pipeline succeeded regardless is not great, so I'm adding pipefail so that ideally it will fail in this case.
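A minimal sketch of the effect of pipefail, assuming the ping check is piped into jq as above; with pipefail the step's exit status reflects a failing salt or jq call instead of only the last command in the pipe:
set -o pipefail
# if salt exits non-zero (e.g. when it prints "Salt request timed out.") or jq
# fails to extract keys, the whole pipeline now fails instead of being masked
# by sort exiting with 0
salt --timeout 1 --hide-timeout '*' test.ping --out json 2>/dev/null | jq 'keys[]' | sort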
- Status changed from Feedback to Resolved
- Status changed from Resolved to Workable
- Related to action #134810: [tools] GitlabCI deploy on salt-states-openqa took too much time added
- Related to action #134819: Errors in salt minion and master log on osd added
- Status changed from Workable to Feedback
- Assignee changed from livdywan to okurz
- Due date deleted (2023-09-08)
- Status changed from Feedback to Resolved
We have not seen the timeout since then, and we have applied multiple measures to prevent the situation. Ready to be surprised by new failures anyway :)