action #105145 (closed)
osd-deployment pipelines fail because ContainersNotInitialized size:M
Description
Observation
osd-deployment failed with the same error:
Waiting for pod gitlab/runner-ydlpfvpg-project-3530-concurrent-0h2jfn to be running, status is Pending
Waiting for pod gitlab/runner-ydlpfvpg-project-3530-concurrent-0h2jfn to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod gitlab/runner-ydlpfvpg-project-3530-concurrent-0h2jfn to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper]"
[...]
Waiting for pod gitlab/runner-ydlpfvpg-project-3530-concurrent-0h2jfn to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper]"
ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
Acceptance criteria
- AC1: osd-deployment pipeline doesn't alert about internal GitLab k8s busy loops
- AC2: osd-deployment pipelines automatically restart on known internal errors
Suggestions
- Implement an automatic retry on error in the GitLab pipeline, see https://docs.gitlab.com/ee/ci/yaml/#retrywhen (and the sketch after this list)
- Look at the upstream issue
- File an infra ticket
- Come up with a way to silence alerts for internal GitLab errors
- Install our own GitLab runner on k8s
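To make the retry suggestion concrete, here is a minimal sketch of what such a rule could look like in the pipeline's .gitlab-ci.yml; the job name and script are placeholders, and the selected failure classes are an assumption based on the "Job failed (system failure)" error above, not copied from the actual osd-deployment configuration:

deploy:
  script:
    - ./deploy.sh   # placeholder for the actual deployment step
  # Retry only failures caused by the runner environment, not genuine
  # script errors; GitLab allows at most 2 retries on top of the first run.
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure

With such a rule the internal k8s busy loop would be retried automatically (AC2) and an alert would only fire if the job still fails after the retries, which keeps the alerting meaningful as noted in the discussion below.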
Updated by livdywan almost 3 years ago
- Copied from action #103527: osd-deployment pipelines fail and alerts are not handled size:M added
Updated by mkittler almost 3 years ago
I don't agree with AC1 unless the job could somehow be restarted automatically as well. Otherwise someone in the team will have to restart it (as a reaction to the alert which is therefore still needed). See #96551#note-22 for an example.
Updated by livdywan almost 3 years ago
- Description updated (diff)
mkittler wrote:
I don't agree with AC1 unless the job could somehow be restarted automatically as well. Otherwise someone in the team will have to restart it (as a reaction to the alert which is therefore still needed). See #96551#note-22 for an example.
Good point. Let's have two ACs.
Updated by livdywan almost 3 years ago
- Subject changed from osd-deployment pipelines fail because ContainersNotInitialized to osd-deployment pipelines fail because ContainersNotInitialized size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by jbaier_cz almost 3 years ago
And now a different (similar) problem can be found in https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/804138
Unschedulable: "0/10 nodes are available: 3 node(s) were not ready, 3 node(s) were out of disk space, 7 node(s) didn't match node selector."
Updated by jbaier_cz almost 3 years ago
jbaier_cz wrote:
And now a different (similar) problem can be found in https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/804138
Unschedulable: "0/10 nodes are available: 3 node(s) were not ready, 3 node(s) were out of disk space, 7 node(s) didn't match node selector."
Tracked (and resolved) in SD-74280
Updated by livdywan almost 3 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
jbaier_cz wrote:
Tracked (and resolved) in SD-74280
So this has been solved for the moment. Since we'll eventually run into it again, though, I'm still proposing the trivial use of retry as suggested:
Updated by openqa_review almost 3 years ago
- Due date set to 2022-02-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan almost 3 years ago
- Status changed from In Progress to Feedback
cdywan wrote:
Correction: https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/45
Both reviewed and merged
Updated by livdywan almost 3 years ago
No complaints. I'll assume we're good here
Updated by livdywan almost 3 years ago
- Status changed from Feedback to Resolved