action #162062: No space left on device causing GitLab pipelines to fail - openQA Infrastructure (public) - openSUSE Project Management Tool

action #162062

closed

No space left on device causing GitLab pipelines to fail

Added by livdywan 12 months ago. Updated 18 days ago.

Status:

Resolved

Priority:

Low

Assignee:

livdywan

Category:

Regressions/Crashes

Target version:

QA (public) - Tools - Next

Start date:

2024-06-10

Due date:

% Done:

Estimated time:

Tags:

alert, infra, reactive work

Description

Observation¶

Various pipelines fail, either while starting the container or during the script, because there is no space left on the device.

https://gitlab.suse.de/openqa/openqa-review/-/jobs/2706267

WARNING: Failed to pull image with policy "always": write /var/lib/docker/tmp/GetImageBlob2814730492: no space left on device (manager.go:250:0s)
ERROR: Job failed: failed to pull image "registry.opensuse.org/home/okurz/container/ca/containers/tumbleweed:openqa-review" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob2814730492: no space left on device (manager.go:250:0s)

https://gitlab.suse.de/openqa/os-autoinst-needles-opensuse-mirror/-/jobs/2706665

 - Download (curl) error for 'http://download.opensuse.org/distribution/leap/15.5/repo/oss/repodata/d0aae74c050dca8d30fbccd949a136d8ed209eccf8fdf435ac8c1d739271d8e7-appdata.xml.gz':
   Error code: Write error
   Error message: Failure writing output to destination
[...]
Can't create metadata cache directory.

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2706694

$ rpm --query qam-metadata-openqabot
rpmxdbOpen: No space left on device
error: cannot open Name index using unknown db - Operation not permitted (1)
rpmxdbOpen: No space left on device
error: cannot open Name index using unknown db - Operation not permitted (1)
package qam-metadata-openqabot is not installed

https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2706024

ERROR: Preparation failed: adding cache volume: set volume permissions: create permission container for volume "runner-25ldi6vv-project-11950-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": Error response from daemon: mkdir /var/lib/docker/overlay2/3efbcea02b9b1d07ff7de63478a83f5fcd6dd3e36949db2b1766d2ac77c5f8ec-init: no space left on device (linux_set.go:95:0s)
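The failures above come from different tools (Docker, curl/zypper, rpm) but share a small set of error strings. As a rough sketch for triaging job logs, the following scans log text for those strings. The function name `find_disk_full_lines` and the pattern list are hypothetical and only cover the messages quoted in this ticket. A curl write error can also have other causes, so a match is a hint rather than proof.

```python
import re

# Signatures copied from the job logs quoted in this ticket.
# Illustrative only: other tools may report a full disk differently.
DISK_FULL_PATTERNS = [
    re.compile(r"no space left on device", re.IGNORECASE),
    re.compile(r"Failure writing output to destination"),
]


def find_disk_full_lines(log_text: str) -> list[str]:
    """Return the stripped log lines that match a known disk-full signature."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in DISK_FULL_PATTERNS)
    ]


if __name__ == "__main__":
    sample = (
        "$ rpm --query qam-metadata-openqabot\n"
        "rpmxdbOpen: No space left on device\n"
        "package qam-metadata-openqabot is not installed\n"
    )
    for hit in find_disk_full_lines(sample):
        print(hit)
```

The match is case-insensitive because rpm and git print "No space left on device" while Docker prints it in lower case.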

Acceptance criteria¶

AC1:

Suggestions¶

DONE File an infra SD ticket to get the GitLab runners checked

Updated by livdywan 12 months ago

Description updated (diff)

Updated by livdywan 12 months ago

Description updated (diff)
Status changed from New to In Progress

https://sd.suse.com/servicedesk/customer/portal/1/SD-159323

Updated by livdywan 12 months ago · Edited

Priority changed from Urgent to High

Debugging with Steven Mallindine. Pipelines running for now.

Updated by livdywan 12 months ago

Status changed from In Progress to Blocked

livdywan wrote in #note-2:

https://sd.suse.com/servicedesk/customer/portal/1/SD-159323

Let's block on this. Nothing we can do here.

Updated by livdywan 11 months ago

livdywan wrote in #note-4:

livdywan wrote in #note-2:

https://sd.suse.com/servicedesk/customer/portal/1/SD-159323

Let's block on this. Nothing we can do here.

Complete fix still pending.

Updated by livdywan 11 months ago

Priority changed from High to Low

SD ticket still open. I'd say it's not High for us anymore as things work for us and we've not seen any more issues.

Updated by livdywan 11 months ago

Status changed from Blocked to Resolved

Steve confirmed that the cleanup of images is working fine now. So we're good.

Updated by livdywan 10 months ago

Status changed from Resolved to Blocked

And apparently the issue has come back:

Pulling docker image registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest ...
WARNING: Failed to pull image with policy "always": failed to register layer: open /usr/lib/python3.11/site-packages/ansible_collections/junipernetworks/junos/plugins/modules/junos_ospfv2.py: no space left on device (manager.go:250:17s)
ERROR: Job failed: failed to pull image "registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest" with specified policies [always]: failed to register layer: open /usr/lib/python3.11/site-packages/ansible_collections/junipernetworks/junos/plugins/modules/junos_ospfv2.py: no space left on device (manager.go:250:17s)

I filed SD-162694

Updated by okurz 10 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/2891166#L20

#10

Updated by livdywan 9 months ago · Edited

Status changed from Blocked to In Progress

Apparently pipelines are failing again:

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2998807

Will be retried in 3s ...
ERROR: Job failed (system failure): adding cache volume: set volume permissions: create permission container for volume "runner-25ldi6vv-project-6096-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": Error response from daemon: mkdir /var/lib/docker/overlay2/ae4a05093eca6240b6d28fbf9ad94e8355082166bf93a6c3c6562cc0554f89c4-init: no space left on device (linux_set.go:95:0s)

https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2998804

Will be retried in 3s ...
ERROR: Job failed (system failure): adding cache volume: set volume permissions: create permission container for volume "runner-25ldi6vv-project-11950-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": Error response from daemon: mkdir /var/lib/docker/overlay2/43ea0d4662f8f8842379af0cf5cc9390f9ae58ad76fb6c062a78fa2697dd5570-init: no space left on device (linux_set.go:95:0s)

https://gitlab.suse.de/openqa/os-autoinst-needles-sles/-/jobs/2998789

Fetching changes with git depth set to 3...
Initialized empty Git repository in /builds/openqa/os-autoinst-needles-sles/.git/
Created fresh repository.
fatal: write error: No space left on device
fatal: fetch-pack: invalid index-pack output

#11

Updated by livdywan 9 months ago · Edited

Status changed from In Progress to Feedback

I added another comment on SD-162694. In the meantime, the latest pipelines for bot-ng and Scripts CI did pass, so I'm not applying any mitigations for now.

I don't know how openqa-pusher works. It's not a regular schedule. So I am just retrying the job.

#12

Updated by tinita 9 months ago

livdywan wrote in #note-11:

I added another comment on .

#13

Updated by livdywan 9 months ago · Edited

tinita wrote in #note-12:

livdywan wrote in #note-11:

I added another comment on .

?

That was supposed to be SD-162694.

And quoting from the Slack conversation:

There were some issues this morning with some large jobs running that killed the disk space (as we have looked at in the past)...
If you remember, the system used to clear the cache nightly (which wasn't working, then we fixed), then we switched to hourly. This still wasn't enough for the jobs that ran earlier this morning.....
I have since disabled the docker caching on gitlab-worker1, and am keeping an eye on it to see if the issue resolves....
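The quote describes a time-based cleanup (nightly, then hourly) that could not keep up with large jobs. One alternative is to trigger cleanup on disk usage instead of on a timer. The sketch below is only an illustration and not the script used on gitlab-worker1. The threshold of 85 percent, the function names, and the default path are assumptions. `docker system prune --all --force` is a standard Docker CLI command that removes unused containers, networks, images and build cache. It is only listed here and not executed.

```python
import shutil

# Command a runner host could execute once the threshold is hit.
# It is defined here for reference only; this sketch never runs it.
PRUNE_COMMAND = ["docker", "system", "prune", "--all", "--force"]


def used_percent(usage) -> float:
    """Percentage of the filesystem in use, given an object with .total and .used."""
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def needs_cleanup(path="/var/lib/docker", threshold_percent=85.0,
                  usage_fn=shutil.disk_usage) -> bool:
    """Return True when the filesystem holding `path` is at or above the threshold.

    `usage_fn` is injectable so the check can be exercised without a real disk.
    """
    return used_percent(usage_fn(path)) >= threshold_percent
```

A periodic job could call `needs_cleanup()` every few minutes and run `PRUNE_COMMAND` (for example via `subprocess.run`) only when it returns True, so that a burst of large jobs triggers cleanup sooner than a fixed hourly schedule would.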

#14

Updated by livdywan 9 months ago

Status changed from Feedback to Blocked

I don't know how openqa-pusher works. It's not a regular schedule. So I am just retrying the job.

The retry worked. Back to blocking on SD-162694.

#15

Updated by jbaier_cz 9 months ago · Edited

Another failure in https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3022436 and https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/3022521

#16

Updated by livdywan 8 months ago

Happening again: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3134725

Commented on SD-162694

#17

Updated by okurz 4 months ago

Target version changed from Ready to Tools - Next

#18

Updated by livdywan 18 days ago

Status changed from Blocked to Resolved

livdywan wrote in #note-16:

Happening again: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3134725

Commented on SD-162694

A minimal improvement was implemented, hence resolving:

We identified an issue with incorrect log handling on the GitLab server and have deployed a fix. At the same time, we added disk space monitoring as well.
