action #162062

open

No space left on device causing GitLab pipelines to fail

Added by livdywan 6 months ago. Updated 3 months ago.

Status:
Blocked
Priority:
Low
Assignee:
Category:
Regressions/Crashes
Start date:
2024-06-10
Due date:
% Done:

0%

Estimated time:

Description

Observation

Various pipelines fail, either while starting the container or during the script, because there is no space left on the device.

https://gitlab.suse.de/openqa/openqa-review/-/jobs/2706267

WARNING: Failed to pull image with policy "always": write /var/lib/docker/tmp/GetImageBlob2814730492: no space left on device (manager.go:250:0s)
ERROR: Job failed: failed to pull image "registry.opensuse.org/home/okurz/container/ca/containers/tumbleweed:openqa-review" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob2814730492: no space left on device (manager.go:250:0s)

https://gitlab.suse.de/openqa/os-autoinst-needles-opensuse-mirror/-/jobs/2706665

 - Download (curl) error for 'http://download.opensuse.org/distribution/leap/15.5/repo/oss/repodata/d0aae74c050dca8d30fbccd949a136d8ed209eccf8fdf435ac8c1d739271d8e7-appdata.xml.gz':
   Error code: Write error
   Error message: Failure writing output to destination
[...]
Can't create metadata cache directory.

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2706694

$ rpm --query qam-metadata-openqabot
rpmxdbOpen: No space left on device
error: cannot open Name index using unknown db - Operation not permitted (1)
rpmxdbOpen: No space left on device
error: cannot open Name index using unknown db - Operation not permitted (1)
package qam-metadata-openqabot is not installed

https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2706024

ERROR: Preparation failed: adding cache volume: set volume permissions: create permission container for volume "runner-25ldi6vv-project-11950-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": Error response from daemon: mkdir /var/lib/docker/overlay2/3efbcea02b9b1d07ff7de63478a83f5fcd6dd3e36949db2b1766d2ac77c5f8ec-init: no space left on device (linux_set.go:95:0s)
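
All four jobs fail with variations of the same "no space left on device" message, in differing capitalization. A minimal sketch, assuming the job log is already available as plain text, of how such lines could be picked out when triaging further failures. The function name `disk_full_lines` is hypothetical and not part of any existing tooling:

```python
import re

# Matches the message seen in the job logs above, regardless of capitalization
# ("no space left on device" from docker, "No space left on device" from rpm).
ENOSPC_PATTERN = re.compile(r"no space left on device", re.IGNORECASE)


def disk_full_lines(log_text: str) -> list[str]:
    """Return every log line that reports a full disk."""
    return [line for line in log_text.splitlines() if ENOSPC_PATTERN.search(line)]


if __name__ == "__main__":
    sample = "\n".join(
        [
            "$ rpm --query qam-metadata-openqabot",
            "rpmxdbOpen: No space left on device",
            "package qam-metadata-openqabot is not installed",
        ]
    )
    for line in disk_full_lines(sample):
        print(line)
```

This only matches the literal message. The curl "Write error" from the needles mirror job does not contain it, although the "Can't create metadata cache directory" line that follows points to the same cause.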

Acceptance criteria

  • AC1:

Suggestions

  • DONE File an infra SD ticket to get the GitLab runners checked
Actions #1

Updated by livdywan 6 months ago

  • Description updated (diff)
Actions #2

Updated by livdywan 6 months ago

  • Description updated (diff)
  • Status changed from New to In Progress
Actions #3

Updated by livdywan 6 months ago · Edited

  • Priority changed from Urgent to High

Debugging with Steven Mallindine. Pipelines running for now.

Actions #4

Updated by livdywan 6 months ago

  • Status changed from In Progress to Blocked

livdywan wrote in #note-2:

https://sd.suse.com/servicedesk/customer/portal/1/SD-159323

Let's block on this. Nothing we can do here.

Actions #5

Updated by livdywan 6 months ago

livdywan wrote in #note-4:

livdywan wrote in #note-2:

https://sd.suse.com/servicedesk/customer/portal/1/SD-159323

Let's block on this. Nothing we can do here.

Complete fix still pending.

Actions #6

Updated by livdywan 6 months ago

  • Priority changed from High to Low

SD ticket still open. I'd say it's not High for us anymore as things work for us and we've not seen any more issues.

Actions #7

Updated by livdywan 6 months ago

  • Status changed from Blocked to Resolved

Steve confirmed that the cleanup of images is working fine now. So we're good.

Actions #8

Updated by livdywan 5 months ago

  • Status changed from Resolved to Blocked

And apparently the issue has come back:

Pulling docker image registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest ...
WARNING: Failed to pull image with policy "always": failed to register layer: open /usr/lib/python3.11/site-packages/ansible_collections/junipernetworks/junos/plugins/modules/junos_ospfv2.py: no space left on device (manager.go:250:17s)
ERROR: Job failed: failed to pull image "registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest" with specified policies [always]: failed to register layer: open /usr/lib/python3.11/site-packages/ansible_collections/junipernetworks/junos/plugins/modules/junos_ospfv2.py: no space left on device (manager.go:250:17s)

I filed SD-162694

Actions #10

Updated by livdywan 4 months ago · Edited

  • Status changed from Blocked to In Progress

Apparently pipelines are failing again:

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2998807

Will be retried in 3s ...
ERROR: Job failed (system failure): adding cache volume: set volume permissions: create permission container for volume "runner-25ldi6vv-project-6096-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": Error response from daemon: mkdir /var/lib/docker/overlay2/ae4a05093eca6240b6d28fbf9ad94e8355082166bf93a6c3c6562cc0554f89c4-init: no space left on device (linux_set.go:95:0s)

https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2998804

Will be retried in 3s ...
ERROR: Job failed (system failure): adding cache volume: set volume permissions: create permission container for volume "runner-25ldi6vv-project-11950-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": Error response from daemon: mkdir /var/lib/docker/overlay2/43ea0d4662f8f8842379af0cf5cc9390f9ae58ad76fb6c062a78fa2697dd5570-init: no space left on device (linux_set.go:95:0s)

https://gitlab.suse.de/openqa/os-autoinst-needles-sles/-/jobs/2998789

Fetching changes with git depth set to 3...
Initialized empty Git repository in /builds/openqa/os-autoinst-needles-sles/.git/
Created fresh repository.
fatal: write error: No space left on device
fatal: fetch-pack: invalid index-pack output
Actions #11

Updated by livdywan 4 months ago · Edited

  • Status changed from In Progress to Feedback

I added another comment on SD-162694. In the meantime, the latest pipelines for bot-ng and Scripts CI did pass, so I'm not applying any mitigations for now.

I don't know how openqa-pusher works. It's not a regular schedule. So I am just retrying the job.

Actions #12

Updated by tinita 4 months ago

livdywan wrote in #note-11:

I added another comment on .

?

Actions #13

Updated by livdywan 4 months ago · Edited

tinita wrote in #note-12:

livdywan wrote in #note-11:

I added another comment on .

?

That was supposed to be SD-162694.

And quoting from the Slack conversation:

There were some issues this morning with some large jobs running that killed the disk space (as we have looked at in the past)...
If you remember, the system used to clear the cache nightly (which wasn't working, then we fixed), then we switched to hourly. This still wasn't enough for the jobs that ran earlier this morning...
I have since disabled the docker caching on gitlab-worker1, and am keeping an eye on it to see if the issue resolves...
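
Since a fixed nightly or hourly schedule was not enough for the large jobs, one alternative would be to make the cleanup depend on how much space is actually free. A rough sketch of that check, not what infra runs: the 15% threshold and both function names are assumptions, and `/var/lib/docker` is simply the path that shows up in the job logs.

```python
import shutil

# Hypothetical threshold: ask for a cleanup when less than 15% of the disk is free.
MIN_FREE_FRACTION = 0.15


def needs_cleanup(total_bytes: int, free_bytes: int,
                  min_free: float = MIN_FREE_FRACTION) -> bool:
    """Return True when the free share of the filesystem is below min_free."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free


def check_path(path: str = "/var/lib/docker") -> bool:
    """Check the filesystem holding `path` (default taken from the job logs)."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free
    return needs_cleanup(usage.total, usage.free)
```

A runner-side timer could call `check_path()` every few minutes and only then trigger the existing image cleanup. Whether that fits the setup on gitlab-worker1 is for infra to judge.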

Actions #14

Updated by livdywan 4 months ago

  • Status changed from Feedback to Blocked

I don't know how openqa-pusher works. It's not a regular schedule. So I am just retrying the job.

The retry worked. Back to blocking on SD-162694.
