action #96827

closed

gitlab CI pipeline failed with Job failed: pod status is Failed size:S

Added by livdywan over 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2021-08-13
Due date:
% Done: 0%
Estimated time:

Description

Observation

I saw this at least twice in different pipelines (salt-states-openqa and openqa/openqa-review):

Cleaning up file based variables
00:00
ERROR: Job failed: pod "runner-25ldi6vv-project-5909-concurrent-0w2q6j" status is "Failed"
test-worker
Duration: 5 minutes 11 seconds
Timeout: 1h (from project)

Runner: #462 (25Ldi6Vv) gitlab-worker2:sle15.1
Commit cb53da6b  in !548
Increase {flush_,}interval and jitter for web UI

 Pipeline #185243 for telegraf_webui_intervals
test-worker
test-general
test-general-test
test-monitor
test-webui

Affected jobs so far passed after retry.

This will likely happen again because we are running more and more jobs in GitLab CI pipelines.

Just from today: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/557134
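The failure signature quoted above is regular enough to match mechanically, which may help when counting occurrences across job logs. The following is a minimal sketch, not something taken from the affected projects: the helper name `find_failed_pod` is hypothetical, and the pattern is derived only from the log lines pasted in this ticket.

```python
import re

# Pattern derived from the error lines quoted in this ticket (an assumption:
# other runner versions may word the message differently).
POD_FAILED = re.compile(
    r'ERROR: Job failed: pod "(?P<pod>[^"]+)" status is "Failed"'
)


def find_failed_pod(log_text):
    """Return the pod name if the log shows the 'pod status is Failed' error, else None."""
    match = POD_FAILED.search(log_text)
    return match.group("pod") if match else None


sample = '''Cleaning up file based variables
ERROR: Job failed: pod "runner-25ldi6vv-project-5909-concurrent-0w2q6j" status is "Failed"
'''
print(find_failed_pod(sample))
```

A job whose log matches could then be treated as an infrastructure failure (retry, reference the EngInfra ticket) rather than a problem in the change under test.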

Suggestions

These problems are all outside of what we can control from our side. What we should be able to do is create an EngInfra ticket and escalate the problem.


Related issues 2 (0 open, 2 closed)

Has duplicate QA - action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404 (Resolved, tinita, 2021-09-28)

Copied to QA - action #103762: gitlab CI pipeline failed with Error cleaning up pod: Delete ... connect: connection refused Job failed (system failure): prepare environment: waiting for pod running ... i/o timeout. Check ... for more information (Resolved, jbaier_cz, 2021-08-13)

Actions #2

Updated by jbaier_cz over 2 years ago

This looks to me like a bug in the GitLab Kubernetes executor or a misconfiguration. In both cases there is nothing we can do about it apart from filing a ticket for EngInfra.

Actions #3

Updated by jbaier_cz over 2 years ago

I will add a similar error from another project: https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/536484

Actions #4

Updated by jbaier_cz over 2 years ago

And I probably got another occurrence here: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/546059

In this case, the job actually did succeed. I suspect there is an issue during artifact uploading that causes the job to fail.

Actions #5

Updated by okurz over 2 years ago

  • Subject changed from gitlab CI pipeline failed with Job failed: pod status is Failed to gitlab CI pipeline failed with Job failed: pod status is Failed size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by jbaier_cz over 2 years ago

  • Assignee set to jbaier_cz
Actions #7

Updated by jbaier_cz over 2 years ago

  • Status changed from Workable to Resolved

Last week, there was an update to GitLab 14.2 (which also included an update for the gitlab-runner). There have been no new failures since last Thursday in our jobs (which still run pretty frequently), so we can hope for the best and consider this issue solved. We can reopen this ticket if any such error reappears.

Actions #8

Updated by tinita over 2 years ago

  • Related to action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404 added
Actions #9

Updated by tinita over 2 years ago

  • Related to deleted (action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404)
Actions #10

Updated by tinita over 2 years ago

  • Has duplicate action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404 added
Actions #11

Updated by okurz over 2 years ago

  • Status changed from Resolved to Feedback

So it looks like this is not solved. See #99411.

Actions #12

Updated by livdywan over 2 years ago

  • Status changed from Feedback to Blocked
  • Assignee changed from jbaier_cz to livdywan

Seems like it happened again:

.
Cleaning up project directory and file based variables
ERROR: Job failed: pod "runner-jzszgdx-project-4884-concurrent-04kj77" status is "Failed"

I filed #SD-62248.

Actions #13

Updated by livdywan over 2 years ago

There was another instance last night. I followed up on a response from infra saying this is due to a disk space shortage on the container host, which causes runners to be "evicted". I pointed out that manual intervention is not good enough because of alert fatigue.
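To illustrate the kind of automated check that could replace manual intervention, here is a minimal sketch of a free-disk-space test. It is an assumption-laden example, not the actual monitoring on the container host: the function names are hypothetical, and the 10% threshold is an illustrative value, not a setting confirmed by infra.

```python
import shutil


def free_fraction(total_bytes, free_bytes):
    """Return the share of the filesystem that is still free (0.0 to 1.0)."""
    return free_bytes / total_bytes if total_bytes else 0.0


def is_low_on_disk(path="/", min_free_fraction=0.10):
    """Return True if the filesystem holding `path` has less free space than the threshold.

    The 10% default is an assumed value for illustration only.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return free_fraction(usage.total, usage.free) < min_free_fraction


if is_low_on_disk("/"):
    print("Low disk space: pods on this host may be evicted")
```

Running such a check periodically and alerting on it would surface the shortage before runners are evicted, instead of after a pipeline has already failed.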

Actions #14

Updated by livdywan about 2 years ago

  • Status changed from Blocked to Feedback

SD-62248 got closed, and the situation seems under control at this point, so I think we can resolve this. There's ENGINFRA-705 on Jira for a follow-up.

Actions #15

Updated by livdywan about 2 years ago

  • Status changed from Feedback to Resolved

cdywan wrote:

So SD-62248 got closed, and the situation seems under control at this point. So I'm thinking we can resolve it. There's ENGINFRA-705 on Jira for a follow-up.

Mentioned in the daily. Closing since, from our side, the problem has been resolved and has not been observed for a couple of months.

Actions #16

Updated by livdywan about 2 years ago

  • Copied to action #103762: gitlab CI pipeline failed with Error cleaning up pod: Delete ... connect: connection refused Job failed (system failure): prepare environment: waiting for pod running ... i/o timeout. Check ... for more information added