action #133457
salt-states-openqa gitlab CI pipeline aborted with error after 2h of execution size:M
Status: closed
Description
Observation
https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1714239
Name: /etc/systemd/system/auto-update.service - Function: file.managed - Result: Clean - Started: 21:29:26.689214 - Duration: 359.255 ms
Name: service.systemctl_reload - Function: module.run - Result: Clean - Started: 21:29:27.053802 - Duration: 0.018 ms
Name: auto-upgrade.service - Function: service.dead - Result: Clean - Started: 21:29:27.054218 - Duration: 61.444 ms
Name: auto-upgrade.timer - Function: service.dead - Result: Clean - Started: 21:29:27.116368 - Duration: 82.058 ms
Name: auto-update.timer - Function: service.running - Result: Clean - Started: 21:29:27.203488 - Duration: 255.774 ms
Summary for openqa.suse.de
--------------
Succeeded: 345 (changed=30)
Failed: 0
--------------
Total states run: 345
Total run time: 383.468 s.
++ echo -n .
++ true
++ sleep 1
.++ echo -n .
[...]
++ true
++ sleep 1
.++ echo -n .
++ true
++ sleep 1
ERROR: Job failed: execution took longer than 2h0m0s seconds
Acceptance criteria
- AC1: jobs commonly don't run into the 2h gitlab CI timeout
- AC2: We can identify the faulty salt minion (because very likely one of the minions is stuck)
Suggestions
- look up an older ticket and read what we did there about this
- check if there are actually artifacts uploaded or not
- check if machines can be reached over salt
- check usual runtimes of salt state apply
- try if it is reproducible
- research upstream if there is anything better we can do to prevent running into the seemingly hardcoded GitLab 2h timeout
- run the inner command of the salt apply with a timeout well below 2h, e.g. in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml#L43 just prepend "timeout 1h …" (see the sketch below)
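A minimal sketch of the timeout suggestion, assuming the deploy step runs something like salt '*' state.apply on the master (the exact invocation in deploy.yml may differ); the salt-run call also addresses AC2 by listing unresponsive minions:

# bound the inner salt call well below the 2h GitLab limit; exit status 124 means the timeout fired
timeout 1h salt --state-output=mixed '*' state.apply
# if it did fire, list minions that are not responding, to help identify the stuck one (AC2)
salt-run manage.down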
Updated by livdywan over 1 year ago
- Copied from action #123894: qem-bot+openqa-bot gitlab CI pipeline aborted with error after 1h of execution added
Updated by livdywan over 1 year ago
- Related to action #119479: openqABot pipeline failed after runner getting stuck for 1h0m0s size:M added
Updated by okurz over 1 year ago
- Tags set to infra, alert, gitlab CI
- Target version set to Ready
Updated by okurz over 1 year ago
- Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 1h of execution to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution
Updated by okurz over 1 year ago
- Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M to bot-ng - pipelines in GitLab fail to pull qam-ci-leap:latest size:M
- Description updated (diff)
Updated by okurz over 1 year ago
- Subject changed from bot-ng - pipelines in GitLab fail to pull qam-ci-leap:latest size:M to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M
- Description updated (diff)
Updated by okurz over 1 year ago
- Related to action #133793: salt-pillars-openqa failing to apply within 2h and it is not clear which minion(s) are missing size:M added
Updated by livdywan over 1 year ago
- Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M to salt-states-openqa gitlab CI pipeline aborted with error after 2h of execution size:M
Renaming to reflect what this is about
Updated by livdywan over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
- Priority changed from Urgent to High
We're still able to identify issues with most pipeline runs, hence lowering to High. And while we don't know exactly what's going on, I'm going to look into what we can do about the redundant sleep spam: whether we can reduce it, or find another way to highlight which step the pipeline failed at (see the sketch below).
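A minimal sketch of a quieter wait loop, assuming the current one prints a dot every second; emitting one dot per 30 seconds keeps the log readable while still showing liveness (job_is_done is a hypothetical placeholder for the actual completion check):

i=0
until job_is_done; do
    sleep 1
    i=$((i + 1))
    # emit a progress dot only every 30 iterations instead of every second
    if [ $((i % 30)) -eq 0 ]; then echo -n .; fi
done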
Updated by livdywan over 1 year ago
- Status changed from In Progress to Feedback
So I did some digging. The workaround for "Job marked as success when job terminates midway in Kubernetes" may be obsolete since that issue was resolved a year ago.
I also think we can split the job up into 3 via needs and matrix. That way we're not timing out on the whole thing, and there's less lumped together that we can't tell apart when it breaks. Bonus points for not needing to nest scripts. A rough sketch follows after the MR link below.
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/936
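A rough sketch of that split, with hypothetical job, script, and matrix names (see the MR above for the actual change); needs lets deploy start as soon as refresh has finished, and parallel:matrix fans a single job definition out into several:

# hypothetical .gitlab-ci.yml excerpt; names are illustrative
refresh:
  stage: deploy
  script: ./refresh.sh

deploy:
  stage: deploy
  needs: [refresh]
  parallel:
    matrix:
      - TARGET: [states, pillars, grains]  # illustrative three-way split
  script: ./deploy.sh "$TARGET"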
Updated by livdywan over 1 year ago
livdywan wrote:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/936
Why do I think it's safe to remove the work-around? Because we use GitLab v16.1.2 and our runners report themselves with "Running with gitlab-runner 16.1.0 (b72e108d)" in the logs.
I'm updating the job names for clarity and also splitting up refresh and deploy based on what we discussed in the unblock.
Updated by livdywan over 1 year ago
I merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/936
https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1777275 fails
ssh openqa.suse.de '( set -x; cd /srv && mkdir -p pillars && chown root:salt pillars && cd pillars && git status || git clone --depth 3 https://osd_deployment_ci:$PILLARS_CLONE_TOKEN@gitlab.suse.de/openqa/salt-pillars-openqa.git . && git fetch origin && git reset --hard origin/master && git checkout -f master && git pull )'
remote: HTTP Basic: Access denied
So somehow $PILLARS_CLONE_TOKEN is not set here? It is the same token, and you can see it in the settings at https://gitlab.suse.de/openqa/salt-states-openqa/-/settings/ci_cd under Variables.
I prepared an MR. I don't know if it's correct, though, so if we can't figure out how to fix it we can consider reverting it.
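Note that in the ssh command above the single quotes defer expansion of $PILLARS_CLONE_TOKEN to the remote host, where a CI variable would not be set; that could explain the "Access denied". A way to check the variable inside the CI job without leaking it (the variable is masked, so never echo its value):

# prints only whether the variable is populated, never its value
if [ -n "$PILLARS_CLONE_TOKEN" ]; then
    echo "PILLARS_CLONE_TOKEN is set (${#PILLARS_CLONE_TOKEN} characters)"
else
    echo "PILLARS_CLONE_TOKEN is empty or unset"
fi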
Updated by okurz over 1 year ago
- Due date set to 2023-09-08
https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1789040#L54 shows
$ ssh $TARGET "salt --log-file=salt_syncupdate.log --log-file-level=debug --state-output=mixed --hide-timeout \* saltutil.sync_grains,saltutil.refresh_grains,saltutil.refresh_pillar,mine.update ,,,"
[ERROR ] Encountered StreamClosedException
[ERROR ]
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
but then the job still succeeds. Please look into that.
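For reference, a sketch of the asynchronous variant the error message suggests, in case we want to go that route; the jid extraction is illustrative and may need adjusting to the actual output:

# submit asynchronously to bypass the congested event bus; salt prints the job id (jid) and exits
jid=$(salt --async '*' saltutil.refresh_pillar | awk '/job ID:/ {print $NF}')
# later, look up the results of that job in the job cache
salt-run jobs.lookup_jid "$jid"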
Updated by livdywan over 1 year ago
It looks like salt is outputting errors on stdout, which we then end up trying to parse with jq:
++ tee salt_ping.log
Currently the following minions are down:
jq: error (at <stdin>:1): string ("Salt reque...) has no keys
jq: error (at <stdin>:1): string ("Salt reque...) has no keys
And I can't reproduce it. Just for the record, this is what the output should probably look like:
$ sudo salt --timeout 1 --hide-timeout '*' test.ping --out json 2>/dev/null | jq 'keys[]' | sort
"openqa-piworker.qa.suse.de"
"openqa.suse.de
Ultimately this is due to #132146#note-35, but the fact that the pipeline succeeded is not great, so I'm adding pipefail so that ideally it will fail in this case (minimal sketch below).
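A minimal sketch of what the pipefail change does, assuming the pipeline looks roughly like the command above:

# without pipefail the pipeline's exit status is jq's alone, so a failing salt call goes unnoticed;
# with pipefail any failing stage makes the whole pipeline, and thus the CI job, fail
set -o pipefail
salt --timeout 1 --hide-timeout '*' test.ping --out json | jq 'keys[]' | sort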
Updated by livdywan over 1 year ago
- Status changed from Feedback to Resolved
livdywan wrote:
Ultimately this is due to #132146#note-35, but the fact that the pipeline succeeded is not great, so I'm adding pipefail so that ideally it will fail in this case.
The MR has been merged. The refresh pipeline succeeded because apparently in the meantime the underlying issue has been resolved. Let's assume we're good here, and see #132146 for more issues related to the move.
Updated by okurz over 1 year ago
- Status changed from Resolved to Workable
https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1790147 is an example of deploy running into the 2h timeout now. I suggest we try the still-open suggestions from the ticket, e.g. using the timeout command for the inner layer of command executions.
Updated by okurz over 1 year ago
- Related to action #134810: [tools] GitlabCI deploy on salt-states-openqa took too much time added
Updated by okurz over 1 year ago
- Related to action #134819: Errors in salt minion and master log on osd added
Updated by nicksinger over 1 year ago
We realized that our deploy step does not only execute the highstate but also the telegraf check, which is completely silent and makes the pipeline look like salt is still running. To avoid this I split our post-deploy step out into an extra stage here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/954 (conceptual sketch below)
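Conceptually the change looks like this, with illustrative stage and script names (see the MR for the actual definitions); a hang in the telegraf check is now attributable to its own job instead of making the deploy look stuck:

# hypothetical .gitlab-ci.yml excerpt; names are illustrative
stages: [deploy, post-deploy]

deploy:
  stage: deploy
  script: ./apply_highstate.sh

telegraf-check:
  stage: post-deploy
  script: ./check_telegraf.sh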
Updated by okurz over 1 year ago
another recent example: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1792093
Updated by okurz over 1 year ago
- Status changed from Workable to Feedback
- Assignee changed from livdywan to okurz
Updated by okurz over 1 year ago
- Due date deleted (2023-09-08)
- Status changed from Feedback to Resolved
We have not seen the timeout since then, and we applied multiple measures to prevent the situation. Ready to be surprised by new failures anyway :)