action #133457 (closed)

salt-states-openqa gitlab CI pipeline aborted with error after 2h of execution size:M

Added by livdywan 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee: okurz
Target version: Ready
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1714239

  Name: /etc/systemd/system/auto-update.service - Function: file.managed - Result: Clean - Started: 21:29:26.689214 - Duration: 359.255 ms
  Name: service.systemctl_reload - Function: module.run - Result: Clean - Started: 21:29:27.053802 - Duration: 0.018 ms
  Name: auto-upgrade.service - Function: service.dead - Result: Clean - Started: 21:29:27.054218 - Duration: 61.444 ms
  Name: auto-upgrade.timer - Function: service.dead - Result: Clean - Started: 21:29:27.116368 - Duration: 82.058 ms
  Name: auto-update.timer - Function: service.running - Result: Clean - Started: 21:29:27.203488 - Duration: 255.774 ms
Summary for openqa.suse.de
--------------
Succeeded: 345 (changed=30)
Failed:      0
--------------
Total states run:     345
Total run time:   383.468 s.++ echo -n .
++ true
++ sleep 1
.++ echo -n .

[...]
++ true
++ sleep 1
.++ echo -n .
++ true
++ sleep 1
ERROR: Job failed: execution took longer than 2h0m0s seconds

Acceptance criteria

  • AC1: jobs commonly don't run into the 2h gitlab CI timeout
  • AC2: We can identify the faulty salt minion (very likely one of the minions is stuck)

Suggestions

  • look up an older ticket and read what we did there about this
  • check if there are actually artifacts uploaded or not
  • check if machines can be reached over salt
  • check usual runtimes of salt state apply
  • try if it is reproducible
  • research upstream whether there is anything better we can do to prevent running into the seemingly hardcoded gitlab 2h timeout
  • run the inner salt apply command with a timeout well below 2h, e.g. in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml#L43 just prepend "timeout 1h …" (see the sketch after this list)
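
A minimal sketch of the last suggestion, as it could look in the GitLab CI configuration; the job name, the $TARGET variable and the exact salt arguments are assumptions for illustration, not the actual contents of deploy.yml:

  # hypothetical deploy job: the inner salt call is wrapped in `timeout`
  # so it aborts well before GitLab's hard 2h job limit
  deploy:
    stage: deploy
    script:
      - ssh "$TARGET" "timeout 1h sudo salt --state-output=changes '*' state.apply"
    timeout: 1h 30m    # GitLab job-level timeout; can only lower a runner-enforced limit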

Related issues: 5 (3 open, 2 closed)

Related to openQA Infrastructure - action #119479: openqABot pipeline failed after runner getting stuck for 1h0m0s size:M (Workable, 2022-10-21)

Related to openQA Infrastructure - action #133793: salt-pillars-openqa failing to apply within 2h and it is not clear which minion(s) are missing size:M (Resolved, okurz, 2023-08-04)

Related to openQA Project - action #134810: [tools] GitlabCI deploy on salt-states-openqa took too much time (Rejected, livdywan, 2023-08-30)

Related to openQA Infrastructure - action #134819: Errors in salt minion and master log on osd (New, 2023-08-30)

Copied from QA - action #123894: qem-bot+openqa-bot gitlab CI pipeline aborted with error after 1h of execution (New, 2023-02-02)

Actions #1

Updated by livdywan 9 months ago

  • Copied from action #123894: qem-bot+openqa-bot gitlab CI pipeline aborted with error after 1h of execution added
Actions #2

Updated by livdywan 9 months ago

  • Related to action #119479: openqABot pipeline failed after runner getting stuck for 1h0m0s size:M added
Actions #3

Updated by okurz 9 months ago

  • Tags set to infra, alert, gitlab CI
  • Target version set to Ready
Actions #4

Updated by okurz 9 months ago

  • Priority changed from Normal to Urgent
Actions #5

Updated by okurz 9 months ago

  • Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 1h of execution to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution
Actions #6

Updated by okurz 9 months ago

  • Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by okurz 9 months ago

  • Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M to bot-ng - pipelines in GitLab fail to pull qam-ci-leap:latest size:M
  • Description updated (diff)
Actions #8

Updated by okurz 9 months ago

  • Subject changed from bot-ng - pipelines in GitLab fail to pull qam-ci-leap:latest size:M to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M
  • Description updated (diff)
Actions #9

Updated by okurz 9 months ago

  • Related to action #133793: salt-pillars-openqa failing to apply within 2h and it is not clear which minion(s) are missing size:M added
Actions #10

Updated by livdywan 9 months ago

  • Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M to salt-states-openqa gitlab CI pipeline aborted with error after 2h of execution size:M

Renaming to reflect what this is about

Actions #11

Updated by livdywan 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan
  • Priority changed from Urgent to High

We are still able to identify issues with most pipeline runs, hence lowering the priority to High. While we don't know exactly what's going on, I'm going to look into what we can do about the redundant sleep spam, whether we can reduce it, or whether we can find another way to highlight which step the pipeline failed at.

Actions #12

Updated by livdywan 9 months ago

  • Status changed from In Progress to Feedback

So I did some digging. The workaround for "Job marked as success when job terminates midway in Kubernetes" may be obsolete since that upstream issue was resolved a year ago.

I also think we can split up the job into 3 via needs and matrix. That way we're not timing out on the whole thing, and less is lumped together that we can't tell apart when it breaks. Bonus points for not needing nested scripts.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/936
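
For illustration, a rough sketch of what a split via needs and parallel:matrix could look like; the stage and job names and the SALT_TARGET values are made up here and are not taken from the actual MR:

  # hypothetical split into small, separately failing jobs
  stages: [refresh, deploy]

  refresh:
    stage: refresh
    script:
      - ssh "$TARGET" "sudo salt '*' saltutil.refresh_pillar"

  deploy:
    stage: deploy
    needs: [refresh]    # start as soon as the refresh job has finished
    parallel:
      matrix:
        - SALT_TARGET: ["openqa.suse.de", "worker*"]    # made-up targets
    script:
      - ssh "$TARGET" "sudo salt '$SALT_TARGET' state.apply"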

Actions #13

Updated by livdywan 9 months ago

livdywan wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/936

Why do I think it's safe to remove the work-around? Because we use GitLab v16.1.2 and our runners report themselves as Running with gitlab-runner 16.1.0 (b72e108d) in logs.

I'm updating the job names for clarity and also splitting up refresh and deploy based on what we discussed in the unblock.

Actions #14

Updated by livdywan 8 months ago

I prepared an MR. I'm not sure it's correct, though; if it turns out not to fix the issue we can consider reverting it.

Actions #15

Updated by okurz 8 months ago

  • Due date set to 2023-09-08

https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1789040#L54 shows

$ ssh $TARGET "salt --log-file=salt_syncupdate.log --log-file-level=debug --state-output=mixed --hide-timeout \* saltutil.sync_grains,saltutil.refresh_grains,saltutil.refresh_pillar,mine.update ,,,"
[ERROR   ] Encountered StreamClosedException
[ERROR   ] 
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

but then the job still succeeds. Please look into that.
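
For reference, a hedged sketch of the --async route that the quoted error message suggests; how the job id is extracted here is an assumption about the CLI output and would need checking:

  # hypothetical job: submit the salt command asynchronously and look the
  # result up in the job cache later instead of waiting on the event bus
  refresh:
    stage: deploy
    script:
      - JID=$(ssh "$TARGET" "sudo salt --async '*' saltutil.refresh_pillar" | awk '{print $NF}')    # jid parsing is an assumption
      - sleep 300    # give the minions some time to finish
      - ssh "$TARGET" "sudo salt-run jobs.lookup_jid $JID"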

Actions #16

Updated by livdywan 8 months ago

It looks like salt is outputting errors on stdout, which we then end up trying to parse with jq:

++ tee salt_ping.log
Currently the following minions are down:
jq: error (at <stdin>:1): string ("Salt reque...) has no keys
jq: error (at <stdin>:1): string ("Salt reque...) has no keys

And I can't reproduce it. Just for the record, this is what the output should probably look like:

$ sudo salt --timeout 1 --hide-timeout '*' test.ping --out json 2>/dev/null | jq 'keys[]' | sort
"openqa-piworker.qa.suse.de"
"openqa.suse.de

Ultimately this is due to #132146#note-35, but the fact that the pipeline succeeded is not great, so I'm adding pipefail so that ideally it will fail in this case.
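
As a rough illustration of the pipefail idea, assuming the ping check pipes the salt output through tee and jq into sort like the local command above (the actual script lines in the repository may differ):

  check_minions:
    stage: deploy
    script:
      # without pipefail only the exit status of the last command (sort)
      # counts, so jq choking on a non-JSON salt error message does not
      # fail the job; with pipefail any failing stage of the pipe does
      - set -o pipefail
      - ssh "$TARGET" "sudo salt --hide-timeout '*' test.ping --out json" | tee salt_ping.log | jq 'keys[]' | sort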

Actions #17

Updated by livdywan 8 months ago

  • Status changed from Feedback to Resolved

The MR has been merged. The refresh pipeline succeeded because apparently in the meantime the underlying issue has been resolved. Let's assume we're good here, and see #132146 for more issues related to the move.

Actions #18

Updated by okurz 8 months ago

  • Status changed from Resolved to Workable

https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1790147 is an example of the deploy job running into the 2h timeout now. I suggest we try the remaining open suggestions from the ticket, e.g. using the timeout command for the inner layer of command executions.

Actions #19

Updated by okurz 8 months ago

  • Related to action #134810: [tools] GitlabCI deploy on salt-states-openqa took too much time added
Actions #20

Updated by okurz 8 months ago

  • Related to action #134819: Errors in salt minion and master log on osd added
Actions #21

Updated by nicksinger 8 months ago

We realized that our deploy step not only executes the highstate but also the telegraf check, which is completely silent and makes it look like salt is still running. To avoid this I split our post-deploy step out into an extra stage here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/954
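
A rough sketch of that kind of split; the stage and job names and the exact telegraf check are invented here, the real change is in the linked MR:

  stages: [deploy, post-deploy]

  deploy:
    stage: deploy
    script:
      - ssh "$TARGET" "sudo salt --state-output=changes '*' state.apply"

  check_telegraf:
    # runs in its own stage after deploy, so its silent execution is no
    # longer mistaken for a still-running highstate
    stage: post-deploy
    script:
      - ssh "$TARGET" "sudo salt '*' service.status telegraf"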

Actions #23

Updated by okurz 8 months ago

  • Status changed from Workable to Feedback
  • Assignee changed from livdywan to okurz
Actions #25

Updated by okurz 8 months ago

  • Due date deleted (2023-09-08)
  • Status changed from Feedback to Resolved

We have not seen the timeout since then, and we have applied multiple measures to prevent the situation. Ready to be surprised by new failures anyway :)
