action #133457 (closed)

salt-states-openqa gitlab CI pipeline aborted with error after 2h of execution size:M

Added by livdywan 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee: okurz
Target version: Ready
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1714239

  Name: /etc/systemd/system/auto-update.service - Function: file.managed - Result: Clean - Started: 21:29:26.689214 - Duration: 359.255 ms
  Name: service.systemctl_reload - Function: module.run - Result: Clean - Started: 21:29:27.053802 - Duration: 0.018 ms
  Name: auto-upgrade.service - Function: service.dead - Result: Clean - Started: 21:29:27.054218 - Duration: 61.444 ms
  Name: auto-upgrade.timer - Function: service.dead - Result: Clean - Started: 21:29:27.116368 - Duration: 82.058 ms
  Name: auto-update.timer - Function: service.running - Result: Clean - Started: 21:29:27.203488 - Duration: 255.774 ms
Summary for openqa.suse.de
--------------
Succeeded: 345 (changed=30)
Failed:      0
--------------
Total states run:     345
Total run time:   383.468 s.++ echo -n .
++ true
++ sleep 1
.++ echo -n .

[...]
++ true
++ sleep 1
.++ echo -n .
++ true
++ sleep 1
ERROR: Job failed: execution took longer than 2h0m0s seconds

Acceptance criteria

  • AC1: jobs commonly don't run into the 2h gitlab CI timeout
  • AC2: We can identify the faulty salt minion (very likely one of the minions is stuck)

Suggestions

  • look up an older ticket and read what we did there about this
  • check if there are actually artifacts uploaded or not
  • check if machines can be reached over salt
  • check usual runtimes of salt state apply
  • try if it is reproducible
  • research upstream whether there is anything better we can do to prevent running into the seemingly hardcoded gitlab 2h timeout
  • run the inner salt apply command with a timeout well below 2h, e.g. in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml#L43 just prepend "timeout 1h …" (see the sketch after this list)
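
A minimal sketch of the last suggestion, as it could look in the GitLab CI configuration; the job name, the $TARGET variable and the exact salt arguments are assumptions for illustration, not the actual contents of deploy.yml:

  # hypothetical deploy job: the inner salt call is wrapped in `timeout`
  # so it aborts well before GitLab's hard 2h job limit
  deploy:
    stage: deploy
    script:
      - ssh "$TARGET" "timeout 1h sudo salt --state-output=changes '*' state.apply"
    timeout: 1h 30m    # GitLab job-level timeout; can only lower a runner-enforced limit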

Related issues: 5 (3 open, 2 closed)

Related to openQA Infrastructure - action #119479: openqABot pipeline failed after runner getting stuck for 1h0m0s size:M (Workable, 2022-10-21)

Related to openQA Infrastructure - action #133793: salt-pillars-openqa failing to apply within 2h and it is not clear which minion(s) are missing size:M (Resolved, okurz, 2023-08-04)

Related to openQA Project - action #134810: [tools] GitlabCI deploy on salt-states-openqa took too much time (Rejected, livdywan, 2023-08-30)

Related to openQA Infrastructure - action #134819: Errors in salt minion and master log on osd (New, 2023-08-30)

Copied from QA - action #123894: qem-bot+openqa-bot gitlab CI pipeline aborted with error after 1h of execution (New, 2023-02-02)

Actions #1

Updated by livdywan 9 months ago

  • Copied from action #123894: qem-bot+openqa-bot gitlab CI pipeline aborted with error after 1h of execution added
Actions #2

Updated by livdywan 9 months ago

  • Related to action #119479: openqABot pipeline failed after runner getting stuck for 1h0m0s size:M added
Actions #3

Updated by okurz 9 months ago

  • Tags set to infra, alert, gitlab CI
  • Target version set to Ready
Actions #4

Updated by okurz 9 months ago

  • Priority changed from Normal to Urgent
Actions #5

Updated by okurz 9 months ago

  • Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 1h of execution to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution
Actions #6

Updated by okurz 9 months ago

  • Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by okurz 9 months ago

  • Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M to bot-ng - pipelines in GitLab fail to pull qam-ci-leap:latest size:M
  • Description updated (diff)
Actions #8

Updated by okurz 9 months ago

  • Subject changed from bot-ng - pipelines in GitLab fail to pull qam-ci-leap:latest size:M to qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M
  • Description updated (diff)
Actions #9

Updated by okurz 9 months ago

  • Related to action #133793: salt-pillars-openqa failing to apply within 2h and it is not clear which minion(s) are missing size:M added
Actions #10

Updated by livdywan 9 months ago

  • Subject changed from qem-bot+openqa-bot gitlab CI pipeline aborted with error after 2h of execution size:M to salt-states-openqa gitlab CI pipeline aborted with error after 2h of execution size:M

Renaming to reflect what this is about

Actions #11

Updated by livdywan 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan
  • Priority changed from Urgent to High

We are still able to identify issues with most pipeline runs, hence lowering the priority to High. While we don't know exactly what's going on, I'm going to look into what we can do about the redundant sleep spam, whether we can reduce it, or whether we can find another way to highlight which step the pipeline failed at.

Actions #12

Updated by livdywan 9 months ago

  • Status changed from In Progress to Feedback

So I did some digging. The workaround for "Job marked as success when job terminates midway in Kubernetes" may be obsolete since that upstream issue was resolved a year ago.

I also think we can split up the job into 3 via needs and matrix. That way we're not timing out on the whole thing, and less is lumped together that we can't tell apart when it breaks. Bonus points for not needing nested scripts.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/936
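
For illustration, a rough sketch of what a split via needs and parallel:matrix could look like; the stage and job names and the SALT_TARGET values are made up here and are not taken from the actual MR:

  # hypothetical split into small, separately failing jobs
  stages: [refresh, deploy]

  refresh:
    stage: refresh
    script:
      - ssh "$TARGET" "sudo salt '*' saltutil.refresh_pillar"

  deploy:
    stage: deploy
    needs: [refresh]    # start as soon as the refresh job has finished
    parallel:
      matrix:
        - SALT_TARGET: ["openqa.suse.de", "worker*"]    # made-up targets
    script:
      - ssh "$TARGET" "sudo salt '$SALT_TARGET' state.apply"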

Actions #13

Updated by livdywan 9 months ago

livdywan wrote:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/936

Why do I think it's safe to remove the work-around? Because we use GitLab v16.1.2 and our runners report themselves as Running with gitlab-runner 16.1.0 (b72e108d) in logs.

I'm updating the job names for clarity and also splitting up refresh and deploy based on what we discussed in the unblock.

Actions #14

Updated by livdywan 8 months ago

I prepared an MR. I'm not sure it's correct, though; if it turns out not to fix the issue we can consider reverting it.

Actions #15

Updated by okurz 8 months ago

  • Due date set to 2023-09-08

https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1789040#L54 shows

$ ssh $TARGET "salt --log-file=salt_syncupdate.log --log-file-level=debug --state-output=mixed --hide-timeout \* saltutil.sync_grains,saltutil.refresh_grains,saltutil.refresh_pillar,mine.update ,,,"
[ERROR   ] Encountered StreamClosedException
[ERROR   ] 
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

but then the job still succeeds. Please look into that.
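
For reference, a hedged sketch of the --async route that the quoted error message suggests; how the job id is extracted here is an assumption about the CLI output and would need checking:

  # hypothetical job: submit the salt command asynchronously and look the
  # result up in the job cache later instead of waiting on the event bus
  refresh:
    stage: deploy
    script:
      - JID=$(ssh "$TARGET" "sudo salt --async '*' saltutil.refresh_pillar" | awk '{print $NF}')    # jid parsing is an assumption
      - sleep 300    # give the minions some time to finish
      - ssh "$TARGET" "sudo salt-run jobs.lookup_jid $JID"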

Actions #16

Updated by livdywan 8 months ago

It looks like salt is outputting errors on stdout, which we then end up trying to parse with jq:

++ tee salt_ping.log
Currently the following minions are down:
jq: error (at <stdin>:1): string ("Salt reque...) has no keys
jq: error (at <stdin>:1): string ("Salt reque...) has no keys

And I can't reproduce it. Just for the record, this is what the output should probably look like:

$ sudo salt --timeout 1 --hide-timeout '*' test.ping --out json 2>/dev/null | jq 'keys[]' | sort
"openqa-piworker.qa.suse.de"
"openqa.suse.de

Ultimately this is due to #132146#note-35, but the fact that the pipeline succeeded is not great, so I'm adding pipefail so that ideally it will fail in this case.
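
As a rough illustration of the pipefail idea, assuming the ping check pipes the salt output through tee and jq into sort like the local command above (the actual script lines in the repository may differ):

  check_minions:
    stage: deploy
    script:
      # without pipefail only the exit status of the last command (sort)
      # counts, so jq choking on a non-JSON salt error message does not
      # fail the job; with pipefail any failing stage of the pipe does
      - set -o pipefail
      - ssh "$TARGET" "sudo salt --hide-timeout '*' test.ping --out json" | tee salt_ping.log | jq 'keys[]' | sort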

Actions #17

Updated by livdywan 8 months ago

  • Status changed from Feedback to Resolved

The MR has been merged. The refresh pipeline succeeded because apparently in the meantime the underlying issue has been resolved. Let's assume we're good here, and see #132146 for more issues related to the move.

Actions #18

Updated by okurz 8 months ago

  • Status changed from Resolved to Workable

https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1790147 is an example of the deploy job running into the 2h timeout now. I suggest we try the remaining open suggestions from the ticket, e.g. using the timeout command for the inner layer of command executions.

Actions #19

Updated by okurz 8 months ago

  • Related to action #134810: [tools] GitlabCI deploy on salt-states-openqa took too much time added
Actions #20

Updated by okurz 8 months ago

  • Related to action #134819: Errors in salt minion and master log on osd added
Actions #21

Updated by nicksinger 8 months ago

We realized that our deploy step not only executes the highstate but also the telegraf check, which is completely silent and makes it look like salt is still running. To avoid this I split our post-deploy step out into an extra stage here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/954
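
A rough sketch of that kind of split; the stage and job names and the exact telegraf check are invented here, the real change is in the linked MR:

  stages: [deploy, post-deploy]

  deploy:
    stage: deploy
    script:
      - ssh "$TARGET" "sudo salt --state-output=changes '*' state.apply"

  check_telegraf:
    # runs in its own stage after deploy, so its silent execution is no
    # longer mistaken for a still-running highstate
    stage: post-deploy
    script:
      - ssh "$TARGET" "sudo salt '*' service.status telegraf"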

Actions #23

Updated by okurz 8 months ago

  • Status changed from Workable to Feedback
  • Assignee changed from livdywan to okurz
Actions #25

Updated by okurz 8 months ago

  • Due date deleted (2023-09-08)
  • Status changed from Feedback to Resolved

We have not seen the timeout since then, and we have applied multiple measures to prevent the situation. Ready to be surprised by new failures anyway :)
