action #96827
closed
gitlab CI pipeline failed with Job failed: pod status is Failed size:S
Added by livdywan over 3 years ago.
Updated about 3 years ago.
Description
Observation
I saw this at least twice in different pipelines (salt-states-openqa and openqa/openqa-review):
Cleaning up file based variables
00:00
ERROR: Job failed: pod "runner-25ldi6vv-project-5909-concurrent-0w2q6j" status is "Failed"
test-worker
Duration: 5 minutes 11 seconds
Timeout: 1h (from project)
Runner: #462 (25Ldi6Vv) gitlab-worker2:sle15.1
Commit cb53da6b in !548
Increase {flush_,}interval and jitter for web UI
Pipeline #185243 for telegraf_webui_intervals
test-worker
test-general
test-general-test
test-monitor
test-webui
Affected jobs so far passed after retry.
This will likely happen again because we are running more jobs in GitLab CI pipelines.
Just from today: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/557134
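To tell this failure apart from ordinary test failures when going through job logs, the error line quoted above can be matched with a regular expression. This is only a minimal sketch: the helper name `find_failed_pods` is hypothetical, and it assumes the job log has already been obtained as plain text.

```python
import re
from typing import List

# Matches the runner error line quoted in this ticket; the pod name varies per job.
POD_FAILED_RE = re.compile(
    r'ERROR: Job failed: pod "(?P<pod>[^"]+)" status is "Failed"'
)


def find_failed_pods(job_log: str) -> List[str]:
    """Return the pod name from every 'pod "..." status is "Failed"' line in a log."""
    return [match.group("pod") for match in POD_FAILED_RE.finditer(job_log)]


if __name__ == "__main__":
    # Sample taken from the log excerpt in the description.
    sample = (
        "Cleaning up file based variables\n"
        'ERROR: Job failed: pod "runner-25ldi6vv-project-5909-concurrent-0w2q6j" '
        'status is "Failed"\n'
    )
    print(find_failed_pods(sample))
```

An empty result means the log does not contain this particular signature, so the job most likely failed for another reason.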
Suggestions
These problems are all outside the normal flow of what we can control from our side. But what we should be able to do is create an EngInfra ticket and escalate the problem.
Seems to me like a bug in the GitLab Kubernetes executor or a misconfiguration. In both cases there is nothing we can do apart from filing a ticket with EngInfra.
- Subject changed from gitlab CI pipeline failed with Job failed: pod status is Failed to gitlab CI pipeline failed with Job failed: pod status is Failed size:S
- Description updated (diff)
- Status changed from New to Workable
- Assignee set to jbaier_cz
- Status changed from Workable to Resolved
Last week, there was an update to GitLab 14.2 (which also included an update for the gitlab-runner). There have been no new failures in our jobs since last Thursday (and they still run pretty frequently), so we can hope for the best and consider this issue solved. We might reopen this ticket if any such error reappears.
- Related to action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404 added
- Related to deleted (action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404)
- Has duplicate action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404 added
- Status changed from Resolved to Feedback
So it looks like this is not solved; see #99411.
- Status changed from Feedback to Blocked
- Assignee changed from jbaier_cz to livdywan
Seems like it happened again:
.
Cleaning up project directory and file based variables
ERROR: Job failed: pod "runner-jzszgdx-project-4884-concurrent-04kj77" status is "Failed"
I filed #SD-62248.
Another instance last night. I followed up on a response from infra saying this is due to a disk space shortage on the container host, which causes runners to be "evicted". I pointed out that manual intervention is not good enough because of alert fatigue.
- Status changed from Blocked to Feedback
So SD-62248 got closed, and the situation seems under control at this point. So I'm thinking we can resolve it. There's ENGINFRA-705 on Jira for a follow-up.
- Status changed from Feedback to Resolved
cdywan wrote:
So SD-62248 got closed, and the situation seems under control at this point. So I'm thinking we can resolve it. There's ENGINFRA-705 on Jira for a follow-up.
Mentioned in the daily. Closing since from our side the problem has been resolved and has not been observed for a couple of months.
- Copied to action #103762: gitlab CI pipeline failed with Error cleaning up pod: Delete ... connect: connection refused Job failed (system failure): prepare environment: waiting for pod running ... i/o timeout. Check ... for more information added