action #96827
gitlab CI pipeline failed with Job failed: pod status is Failed size:S
Status: closed
Description
Observation
I saw this at least twice in different pipelines (salt-states-openqa and openqa/openqa-review):
Cleaning up file based variables
ERROR: Job failed: pod "runner-25ldi6vv-project-5909-concurrent-0w2q6j" status is "Failed"
Failed job: test-worker
Duration: 5 minutes 11 seconds
Timeout: 1h (from project)
Runner: #462 (25Ldi6Vv) gitlab-worker2:sle15.1
Commit cb53da6b in !548 ("Increase {flush_,}interval and jitter for web UI")
Pipeline #185243 for telegraf_webui_intervals
Jobs in the pipeline: test-worker, test-general, test-general-test, test-monitor, test-webui
Affected jobs so far passed after a retry.
This will likely happen again because we are running more and more in GitLab CI pipelines.
Just from today: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/557134
Suggestions¶
These problems are all outside of what we can control from our side, but what we should be able to do is create an EngInfra ticket and escalate the problem.
Updated by livdywan over 3 years ago
Another pipeline hit this just now (https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/551)
Updated by jbaier_cz over 3 years ago
Seems to me like a bug in the GitLab Kubernetes executor or a misconfiguration; in both cases there is nothing we can do apart from filing a ticket for EngInfra.
Updated by jbaier_cz over 3 years ago
I will add a similar error from another project: https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/536484
Updated by jbaier_cz over 3 years ago
And probably another occurrence here: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/546059
In this case the job actually did succeed; I suspect there is an issue during artifact uploading that causes the job to fail.
Updated by okurz over 3 years ago
- Subject changed from gitlab CI pipeline failed with Job failed: pod status is Failed to gitlab CI pipeline failed with Job failed: pod status is Failed size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by jbaier_cz over 3 years ago
- Status changed from Workable to Resolved
Last week there was an update to GitLab 14.2 (which also included an update of the gitlab-runner). There have been no new failures in our jobs since last Thursday (and they still run pretty frequently), so we can hope for the best and declare this issue solved. We can reopen this ticket if any such error reappears.
Updated by tinita about 3 years ago
- Related to action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404 added
Updated by tinita about 3 years ago
- Related to deleted (action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404)
Updated by tinita about 3 years ago
- Has duplicate action #99411: openqa-review report openqa_suse_de_status.html missing from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, page is 404 added
Updated by okurz about 3 years ago
- Status changed from Resolved to Feedback
So it looks like this is not solved; see #99411.
Updated by livdywan about 3 years ago
- Status changed from Feedback to Blocked
- Assignee changed from jbaier_cz to livdywan
Seems like it happened again:
Cleaning up project directory and file based variables
ERROR: Job failed: pod "runner-jzszgdx-project-4884-concurrent-04kj77" status is "Failed"
I filed #SD-62248.
Updated by livdywan about 3 years ago
Another instance last night. I followed up on a response from Infra saying this is due to a disk space shortage on the container host, which causes runners to be "evicted". I pointed out that manual intervention is not good enough because of alert fatigue.
Updated by livdywan about 3 years ago
- Status changed from Blocked to Feedback
So SD-62248 got closed, and the situation seems under control at this point, so I'm thinking we can resolve it. There's ENGINFRA-705 on Jira for a follow-up.
Updated by livdywan about 3 years ago
- Status changed from Feedback to Resolved
cdywan wrote:
So SD-62248 got closed, and the situation seems under control at this point, so I'm thinking we can resolve it. There's ENGINFRA-705 on Jira for a follow-up.
Mentioned in the daily. Closing since, from our side, the problem has been resolved and has not been observed for a couple of months.
Updated by livdywan about 3 years ago
- Copied to action #103762: gitlab CI pipeline failed with Error cleaning up pod: Delete ... connect: connection refused Job failed (system failure): prepare environment: waiting for pod running ... i/o timeout. Check ... for more information added