
action #96827

gitlab CI pipeline failed with Job failed: pod status is Failed size:S

Added by cdywan over 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
Start date:
2021-08-13
Due date:
% Done:

0%

Estimated time:

Description

Observation

I saw this at least twice in different pipelines (salt-states-openqa and openqa/openqa-review):

Cleaning up file based variables
00:00
ERROR: Job failed: pod "runner-25ldi6vv-project-5909-concurrent-0w2q6j" status is "Failed"
test-worker
Duration: 5 minutes 11 seconds
Timeout: 1h (from project)

Runner: #462 (25Ldi6Vv) gitlab-worker2:sle15.1
Commit cb53da6b in !548
Increase {flush_,}interval and jitter for web UI

 Pipeline #185243 for telegraf_webui_intervals
test-worker
test-general
test-general-test
test-monitor
test-webui

Affected jobs so far passed after retry.

This will likely happen again because we are running more jobs in GitLab CI pipelines.

Just from today: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/557134

Suggestions

These failures are outside what we can control from our side. What we should be able to do is create an EngInfra ticket and escalate the problem.
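As a stop-gap on our side, GitLab CI jobs can be configured to retry automatically on runner/system failures instead of relying on manual retries. A minimal sketch for one of the affected jobs (the job name and script are illustrative, not from our actual pipeline config):

```yaml
test-worker:
  script:
    - make test-worker
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
```

This would not fix the underlying pod failures, but it would reduce the manual retry churn while EngInfra investigates.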


Related issues

Has duplicate QA - action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404 (Resolved, 2021-09-28)

Copied to QA - action #103762: gitlab CI pipeline failed with Error cleaning up pod: Delete ... connect: connection refused Job failed (system failure): prepare environment: waiting for pod running ... i/o timeout. Check ... for more information (Resolved, 2021-08-13)

History

#2 Updated by jbaier_cz over 1 year ago

Seems to me like a bug in the GitLab Kubernetes executor or a misconfiguration; in both cases there is nothing we can do apart from filing a ticket for EngInfra.

#3 Updated by jbaier_cz over 1 year ago

Adding a similar error from another project: https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/536484

#4 Updated by jbaier_cz over 1 year ago

Probably another occurrence here: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/546059

In this case the job actually did succeed; I suspect an issue during artifact uploading caused the job to fail.

#5 Updated by okurz over 1 year ago

  • Subject changed from gitlab CI pipeline failed with Job failed: pod status is Failed to gitlab CI pipeline failed with Job failed: pod status is Failed size:S
  • Description updated (diff)
  • Status changed from New to Workable

#6 Updated by jbaier_cz about 1 year ago

  • Assignee set to jbaier_cz

#7 Updated by jbaier_cz about 1 year ago

  • Status changed from Workable to Resolved

Last week there was an update to GitLab 14.2 (which also included an update of gitlab-runner). There have been no new failures since last Thursday in our jobs (which still run pretty frequently), so we can hope for the best and consider this issue solved. We can reopen this ticket if any such error reappears.

#8 Updated by tinita about 1 year ago

  • Related to action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404 added

#9 Updated by tinita about 1 year ago

  • Related to deleted (action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404)

#10 Updated by tinita about 1 year ago

  • Has duplicate action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404 added

#11 Updated by okurz about 1 year ago

  • Status changed from Resolved to Feedback

So it looks like this is not solved after all; see #99411

#12 Updated by cdywan about 1 year ago

  • Status changed from Feedback to Blocked
  • Assignee changed from jbaier_cz to cdywan

Seems like it happened again:

Cleaning up project directory and file based variables
ERROR: Job failed: pod "runner-jzszgdx-project-4884-concurrent-04kj77" status is "Failed"

I filed #SD-62248.

#13 Updated by cdywan about 1 year ago

Another instance last night. Followed up on a response from infra saying this is due to a disk space shortage on the container host, which causes runners to be "evicted". I pointed out that manual intervention is not good enough because of alert fatigue.
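The root cause (disk pressure on the container host evicting runner pods) is exactly the kind of condition that could be alerted on automatically instead of being discovered through failed pipelines. A minimal sketch of such a check, assuming a plain filesystem-usage threshold; the path and the 90% threshold are illustrative, not taken from any actual monitoring config:

```python
import shutil


def disk_usage_percent(path="/"):
    """Return used disk space for `path` as a percentage of its total size."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def should_alert(path="/", threshold=90.0):
    """True when disk usage on `path` has reached the alerting threshold."""
    return disk_usage_percent(path) >= threshold
```

A check like this run periodically on the container host would fire before Kubernetes starts evicting pods, turning the manual-intervention problem into a proactive alert.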

#14 Updated by cdywan 12 months ago

  • Status changed from Blocked to Feedback

So SD-62248 got closed, and the situation seems under control at this point, so I think we can resolve this. There's ENGINFRA-705 in Jira for a follow-up.

#15 Updated by cdywan 12 months ago

  • Status changed from Feedback to Resolved

cdywan wrote:

So SD-62248 got closed, and the situation seems under control at this point, so I think we can resolve this. There's ENGINFRA-705 in Jira for a follow-up.

Mentioned in the daily. Closing, since from our side the problem has been resolved and has not been observed for a couple of months.

#16 Updated by cdywan 12 months ago

  • Copied to action #103762: gitlab CI pipeline failed with Error cleaning up pod: Delete ... connect: connection refused Job failed (system failure): prepare environment: waiting for pod running ... i/o timeout. Check ... for more information added
