action #156226

closed

[bot-ng] Pipeline failed / failed to pull image / no space left on device

Added by livdywan 10 months ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee: livdywan
Category: -
Start date: -
Due date: 2024-03-29
% Done: 0%
Estimated time: -

Description

Observation

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2325569

WARNING: Failed to pull image with policy "always": failed to register layer: open /var/cache/zypp/solv/@System/solv.idx: no space left on device (manager.go:237:16s)
ERROR: Job failed: failed to pull image "registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest" with specified policies [always]: failed to register layer: open /var/cache/zypp/solv/@System/solv.idx: no space left on device (manager.go:237:16s)

WARNING: Failed to pull image with policy "always": failed to register layer: mkdir /var/cache/zypp/solv/obs_repository: no space left on device (manager.go:237:13s)
ERROR: Job failed: failed to pull image "registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest" with specified policies [always]: failed to register layer: mkdir /var/cache/zypp/solv/obs_repository: no space left on device (manager.go:237:13s)
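
The /var/cache/zypp paths in the errors are files inside the image layers being unpacked; "no space left on device" presumably means the Docker storage on the runner host filled up during layer registration. A hypothetical throwaway job to confirm that from the pipeline side (check-disk is not part of bot-ng's actual .gitlab-ci.yml, and it only helps while some image can still be pulled):

# Hypothetical diagnostic job, not in the real pipeline: disk usage as
# seen from inside a container reflects how full the runner host's
# storage is.
check-disk:
  image: registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest
  script:
    - df -h /
    - df -h /var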

Suggestions

  • DONE Restart pipelines
  • DONE Report an infra SD ticket
  • DONE Add retries to the pipeline (see the sketch below)
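
For the retry suggestion: GitLab CI can retry jobs on specific failure classes via the retry keyword, and runner-side problems such as failed image pulls generally fall under runner_system_failure. A minimal sketch of what such a change could look like (the actual MRs are linked in the comments below and may differ):

# Sketch only, not necessarily what was merged: retry every job up to
# twice when the runner itself fails, e.g. when the image pull hits
# "no space left on device".
default:
  retry:
    max: 2
    when: runner_system_failure
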
Actions #1

Updated by livdywan 10 months ago · Edited

  • Status changed from New to In Progress
  • Assignee set to livdywan

I'll attempt a retry and see if that helps, and otherwise file an SD ticket.

Actions #2

Updated by tinita 10 months ago

See also https://suse.slack.com/archives/C029APBKLGK/p1709040801624089
According to that, a ticket was already opened: https://sd.suse.com/servicedesk/customer/portal/1/SD-149509 but of course we can't read it.

Actions #4

Updated by mkittler 10 months ago

A discussion about this was started in #help-it-ama yesterday (https://suse.slack.com/archives/C029APBKLGK/p1709040801624089) and a ticket was created: https://sd.suse.com/servicedesk/customer/portal/1/SD-149509

Actions #5

Updated by livdywan 10 months ago · Edited

FYI We have access to https://sd.suse.com/servicedesk/customer/portal/1/SD-149509 now.

I don't think we can mitigate this ourselves. So for now I would wait and see if someone picks up the ticket, and check in again if that doesn't happen (blocking is not an option in this case).

Edit: Maybe we can add retry: on: system-failure if we don't have that yet, depending on the failure rate? It seems like one job passed again. Let me take a look.

Actions #6

Updated by livdywan 10 months ago · Edited

Following up on the retry: on: system-failure idea from #note-5:

https://gitlab.suse.de/livdywan/bot-ng/-/merge_requests/2

https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/66 Apparently GitLab really wants me to propose into my own fork. Not helpful. The third attempt looks correct, though.

Actions #7

Updated by okurz 10 months ago

  • Tags changed from qem-bot, reactive work to qem-bot, reactive work, infra
Actions #9

Updated by openqa_review 10 months ago

  • Due date set to 2024-03-14

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by livdywan 10 months ago

  • Description updated (diff)
Actions #11

Updated by livdywan 10 months ago

  • Status changed from In Progress to Blocked
  • Priority changed from Urgent to High

It would seem the mitigation worked, and pipelines have been passing again. The failure I still saw is handled as part of #156301. So we can block on SD-149509 now.

Actions #12

Updated by livdywan 9 months ago

  • Due date changed from 2024-03-14 to 2024-03-22

livdywan wrote in #note-11:

It would seem the mitigation worked, and pipelines have been passing again. The failure I still saw is handled as part of #156301. So we can block on SD-149509 now.

Asked for feedback on Slack. For now our mitigation still works.

Actions #13

Updated by livdywan 9 months ago

  • Due date changed from 2024-03-22 to 2024-03-29

livdywan wrote in #note-12:

Asked for feedback on Slack. For now our mitigation still works.

Routine check on SLO alert. Apparently there is another SD ticket, which I asked about. No movement otherwise.

Actions #14

Updated by livdywan 9 months ago

  • Priority changed from High to Normal

livdywan wrote in #note-13:

Routine check on SLO alert. Apparently there is another SD ticket, which I asked about. No movement otherwise.

Still no response. I guess we can lower the priority here since we have a workaround.

Actions #15

Updated by livdywan 9 months ago

  • Status changed from Blocked to Resolved

I'm assuming the issue was fixed in the meantime, hence closing.
