action #53261
closed
tests incomplete with "No space left on device" openqaworker-arm-2 ran out of space in uploading files in jobs -> cache takes 500G, should be only 50G?
0%
Description
Observation
[18/06/2019 12:17:16] <riafarov> [2019-06-18T10:09:36.107 UTC] [debug] Can't write to file "testresults/logs_from_installation_system-25.txt": No space left on device at /usr/lib/os-autoinst/basetest.pm line 436.
[18/06/2019 12:17:35] <riafarov> seems that openqaworker-arm-2 disk is full
[18/06/2019 12:25:47] <okurz> @riafarov: hm, it's not
[18/06/2019 12:26:41] <okurz> but something seems odd. Don't know where the pool is stored
[18/06/2019 12:28:52] <okurz> @riafarov ok I think I found it, the cache takes up 500G, is there a ticket for that already?
Problem
Investigating and comparing the space usage on other workers:
sudo salt '*' cmd.run 'du -sh /var/lib/openqa/cache'
openqaworker-arm-2.suse.de:
4.0K /var/lib/openqa/cache
openqa.suse.de:
du: cannot access '/var/lib/openqa/cache': No such file or directory
powerqaworker-qam-1:
62G /var/lib/openqa/cache
QA-Power8-4-kvm.qa.suse.de:
59G /var/lib/openqa/cache
openqaworker3.suse.de:
58G /var/lib/openqa/cache
openqaw1.qa.suse.de:
59G /var/lib/openqa/cache
openqaworker5.suse.de:
56G /var/lib/openqa/cache
QA-Power8-5-kvm.qa.suse.de:
58G /var/lib/openqa/cache
openqaworker2.suse.de:
59G /var/lib/openqa/cache
openqaw2.qa.suse.de:
57G /var/lib/openqa/cache
openqaworker13.suse.de:
58G /var/lib/openqa/cache
openqaworker6.suse.de:
57G /var/lib/openqa/cache
openqaworker8.suse.de:
59G /var/lib/openqa/cache
openqaworker9.suse.de:
57G /var/lib/openqa/cache
openqaworker7.suse.de:
47G /var/lib/openqa/cache
grenache-1.qa.suse.de:
62G /var/lib/openqa/cache
openqaworker-arm-1.suse.de:
Minion did not return. [Not connected]
malbec.arch.suse.de:
Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
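The 4.0K reported for openqaworker-arm-2 would be consistent with du measuring a symlink rather than the directory behind it, since du does not follow symlinks by default. This is a minimal sketch of that behaviour, not a reproduction of the worker setup: the temporary directory and the names nvme_cache, cache and asset.img are made up for illustration.

```shell
# Build a throwaway layout: a real directory holding data, plus a symlink to it.
tmp=$(mktemp -d)
mkdir "$tmp/nvme_cache"
# Random data so that compressing filesystems cannot shrink it to nothing.
head -c 2097152 /dev/urandom > "$tmp/nvme_cache/asset.img"
ln -s "$tmp/nvme_cache" "$tmp/cache"

# Without -L, du measures the symlink itself, not the directory behind it.
link_kb=$(du -sk "$tmp/cache" | cut -f1)
# With -L, du dereferences the symlink and measures the target's contents.
target_kb=$(du -skL "$tmp/cache" | cut -f1)

echo "symlink only: ${link_kb}K, dereferenced: ${target_kb}K"
rm -rf "$tmp"
```

On a worker where the cache is a symlink, running the salt command with `du -shL` instead of `du -sh` should therefore give a figure comparable to the other hosts.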
Updated by okurz over 5 years ago
- Assignee set to nicksinger
- Priority changed from High to Immediate
Something for you?
I suggest stopping the workers, deleting the cache directory and restarting the workers.
Updated by okurz over 5 years ago
- Has duplicate action #53300: [functional][y]"No space left on device" cause test incomplete for those create_hdd tests in aarch64 added
Updated by okurz over 5 years ago
- Subject changed from openqaworker-arm-2 ran out of space in uploading files in jobs -> cache takes 500G, should be only 50G? to tests incomplete with "No space left on device" openqaworker-arm-2 ran out of space in uploading files in jobs -> cache takes 500G, should be only 50G?
Updated by nicksinger over 5 years ago
- Priority changed from Immediate to High
I've cleaned the worker cache and everything is running again. However, the cache has already grown back to 150G, so it still does not seem to respect the configured limit. Lowering the urgency as it doesn't affect "customers" any longer.
Updated by okurz over 5 years ago
- Status changed from New to Feedback
- Assignee changed from nicksinger to okurz
The disk ran full again today. The current space usage is:
/dev/nvme0n1p1 734G 579G 118G 84% /var/lib/openqa/nvme
The cache service reports:
# systemctl status openqa-worker-cacheservice
● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2019-08-07 10:19:15 UTC; 4 days ago
Main PID: 46473 (openqa-workerca)
Tasks: 1 (limit: 512)
CGroup: /system.slice/openqa-worker-cacheservice.service
└─46473 /usr/bin/perl /usr/share/openqa/script/openqa-workercache daemon -m production
Aug 07 10:19:15 openqaworker-arm-2 systemd[1]: Stopped OpenQA Worker Cache Service.
Aug 07 10:19:15 openqaworker-arm-2 systemd[1]: Started OpenQA Worker Cache Service.
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 0
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [i] Listening at "http://127.0.0.1:7844"
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [i] Listening at "http://[::1]:7844"
I suspect the space calculation is wrong, maybe because it does not resolve the correct device: the service reports a real size of 0 although the partition is 84% full. We could try to configure the cache size explicitly and see if that makes a difference.
OK, that did not make a difference, so I changed the symlink to a bind mount:
rm cache && mkdir cache && echo '/var/lib/openqa/nvme/cache /var/lib/openqa/cache none bind 0 0' >> /etc/fstab && mount -a
After that, the cache service calculated the size correctly on startup and started reclaiming space:
# systemctl start openqa-worker-cacheservice
# systemctl status openqa-worker-cacheservice
● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2019-08-12 07:29:59 UTC; 7s ago
Main PID: 16881 (openqa-workerca)
Tasks: 1 (limit: 512)
CGroup: /system.slice/openqa-worker-cacheservice.service
└─16881 /usr/bin/perl /usr/share/openqa/script/openqa-workercache daemon -m production
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 2019622912 from 361925343296 to make space for 322122547200
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] CACHE: removed /var/lib/openqa/cache/openqa.suse.de/sle-15-aarch64-4.12.14-1470.1.g187af5a-Server-DVD-Incidents-Kernel@aarch64-virtio-with...efi-vars.qcow2
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Current cache size: 361925343296
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 330752 from 361925012544 to make space for 322122547200
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] CACHE: removed /var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-0229@aarch64-minimal_with_sdk0176_installed.qcow2
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Current cache size: 361925012544
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 1078525952 from 360846486592 to make space for 322122547200
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] CACHE: removed /var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-0229@aarch64-minimal_with_sdk0176_installed-uefi-vars.qcow2
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Current cache size: 360846486592
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 330240 from 360846156352 to make space for 322122547200
Hint: Some lines were ellipsized, use -l to show in full.
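The byte counts in the two logs are easier to compare in GiB. This is a small sketch using shell arithmetic; to_gib is a hypothetical helper, not part of openQA. It assumes 1 GiB = 1024^3 bytes and rounds down.

```shell
# Convert a byte count to whole GiB (1 GiB = 1024^3 bytes), rounding down.
to_gib() {
    echo $(( $1 / 1073741824 ))
}

# Values copied from the log output: the limit in the first log, then the
# target and the current cache size in the second log.
for bytes in 53687091200 322122547200 361925343296; do
    echo "$bytes bytes = $(to_gib "$bytes") GiB"
done
```

This gives 50 GiB for the configured limit in the first log, 300 GiB for the value the service is making space for in the second log (presumably the explicitly configured size), and about 337 GiB for the cache size seen once the bind mount was in place.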
Updated by okurz over 5 years ago
- Copied to action #55373: Worker::Cache thinks no space is used when cache resides in symlinked folder pointing to other partition added