action #105145
closed
osd-deployment pipelines fail because ContainersNotInitialized size:M
Description
Observation
osd-deployment failed with the same error:
Waiting for pod gitlab/runner-ydlpfvpg-project-3530-concurrent-0h2jfn to be running, status is Pending
Waiting for pod gitlab/runner-ydlpfvpg-project-3530-concurrent-0h2jfn to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper]"
Waiting for pod gitlab/runner-ydlpfvpg-project-3530-concurrent-0h2jfn to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper]"
[...]
Waiting for pod gitlab/runner-ydlpfvpg-project-3530-concurrent-0h2jfn to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper]"
ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
Acceptance criteria
- AC1: osd-deployment pipeline doesn't alert about internal gitlab k8s busy loops
- AC2: osd-deployment pipelines automatically restart on known internal errors
Suggestions
- Implement an automatic retry on error in the GitLab pipeline https://docs.gitlab.com/ee/ci/yaml/#retrywhen
- Look at the upstream issue
- File an infra ticket
- Come up with a way to silence alerts for internal GitLab errors
- Install our own GitLab runner on k8s
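For illustration, the first suggestion could look roughly like the fragment below. This is only a sketch, not the change that was actually merged: the job name `deploy` and the script line are hypothetical placeholders, and the choice of failure types is an assumption based on the "Job failed (system failure)" message in the log above. `retry:max` and `retry:when` are the documented GitLab CI keywords.

```yaml
# Sketch of a .gitlab-ci.yml fragment; job name and script are hypothetical.
deploy:
  script:
    - ./deploy.sh  # placeholder, not the real osd-deployment command
  retry:
    max: 2  # GitLab accepts 0, 1 or 2 retries per job
    when:
      # Retry only on infrastructure problems, not on genuine script failures.
      - runner_system_failure     # matches "Job failed (system failure)"
      - stuck_or_timeout_failure  # job got stuck or timed out
```

Limiting `when` to runner-side failures keeps real deployment errors visible, so an alert would still fire if the retries are exhausted or the script itself fails.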
Updated by livdywan about 3 years ago
- Copied from action #103527: osd-deployment pipelines fail and alerts are not handled size:M added
Updated by mkittler about 3 years ago
I don't agree with AC1 unless the job could somehow be restarted automatically as well. Otherwise someone in the team will have to restart it (as a reaction to the alert which is therefore still needed). See #96551#note-22 for an example.
Updated by livdywan almost 3 years ago
- Description updated (diff)
mkittler wrote:
I don't agree with AC1 unless the job could somehow be restarted automatically as well. Otherwise someone in the team will have to restart it (as a reaction to the alert which is therefore still needed). See #96551#note-22 for an example.
Good point. Let's have two ACs.
Updated by livdywan almost 3 years ago
- Subject changed from osd-deployment pipelines fail because ContainersNotInitialized to osd-deployment pipelines fail because ContainersNotInitialized size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by jbaier_cz almost 3 years ago
And now a different (similar) problem can be found in https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/804138
Unschedulable: "0/10 nodes are available: 3 node(s) were not ready, 3 node(s) were out of disk space, 7 node(s) didn't match node selector."
Updated by jbaier_cz almost 3 years ago
jbaier_cz wrote:
And now a different (similar) problem can be found in https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/804138
Unschedulable: "0/10 nodes are available: 3 node(s) were not ready, 3 node(s) were out of disk space, 7 node(s) didn't match node selector."
Tracked (and resolved) in SD-74280
Updated by livdywan almost 3 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
jbaier_cz wrote:
Tracked (and resolved) in SD-74280
So this has been solved for the moment. Since we'll eventually run into it again, though, I'm still proposing the trivial use of retry as suggested:
Updated by openqa_review almost 3 years ago
- Due date set to 2022-02-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan almost 3 years ago
- Status changed from In Progress to Feedback
cdywan wrote:
Correction: https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/45
Both reviewed and merged.
Updated by livdywan almost 3 years ago
No complaints. I'll assume we're good here.
Updated by livdywan almost 3 years ago
- Status changed from Feedback to Resolved