action #53261
tests incomplete with "No space left on device" openqaworker-arm-2 ran out of space in uploading files in jobs -> cache takes 500G, should be only 50G?
Description
Observation
[18/06/2019 12:17:16] <riafarov> [2019-06-18T10:09:36.107 UTC] [debug] Can't write to file "testresults/logs_from_installation_system-25.txt": No space left on device at /usr/lib/os-autoinst/basetest.pm line 436.
[18/06/2019 12:17:35] <riafarov> seems that openqaworker-arm-2 disk is full
[18/06/2019 12:25:47] <okurz> @riafarov: hm, it's not
[18/06/2019 12:26:41] <okurz> but something seems odd. Don't know where the pool is stored
[18/06/2019 12:28:52] <okurz> @riafarov ok I think I found it, the cache takes up 500G, is there a ticket for that already?
Problem
Investigating and comparing the space usage on other workers:
sudo salt '*' cmd.run 'du -sh /var/lib/openqa/cache'
openqaworker-arm-2.suse.de: 4.0K /var/lib/openqa/cache
openqa.suse.de: du: cannot access '/var/lib/openqa/cache': No such file or directory
powerqaworker-qam-1: 62G /var/lib/openqa/cache
QA-Power8-4-kvm.qa.suse.de: 59G /var/lib/openqa/cache
openqaworker3.suse.de: 58G /var/lib/openqa/cache
openqaw1.qa.suse.de: 59G /var/lib/openqa/cache
openqaworker5.suse.de: 56G /var/lib/openqa/cache
QA-Power8-5-kvm.qa.suse.de: 58G /var/lib/openqa/cache
openqaworker2.suse.de: 59G /var/lib/openqa/cache
openqaw2.qa.suse.de: 57G /var/lib/openqa/cache
openqaworker13.suse.de: 58G /var/lib/openqa/cache
openqaworker6.suse.de: 57G /var/lib/openqa/cache
openqaworker8.suse.de: 59G /var/lib/openqa/cache
openqaworker9.suse.de: 57G /var/lib/openqa/cache
openqaworker7.suse.de: 47G /var/lib/openqa/cache
grenache-1.qa.suse.de: 62G /var/lib/openqa/cache
openqaworker-arm-1.suse.de: Minion did not return. [Not connected]
malbec.arch.suse.de: Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
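To make the comparison above less manual, here is a minimal sketch that flags hosts whose reported cache size is far from the median. The function names, the `factor` threshold and the dictionary input format are assumptions made for illustration; they are not part of salt or openQA. Sizes are interpreted as powers of 1024, which is how `du -h` prints them.

```python
import statistics

# du -h style suffixes, powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_du_size(text):
    """Convert a du -h style size such as '58G' or '4.0K' to bytes."""
    text = text.strip()
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)  # plain byte count without suffix


def find_outliers(report, factor=2.0):
    """Return hosts whose size differs from the median by more than `factor`.

    `report` maps host name to the size string printed by du.
    `factor` is a hypothetical threshold chosen for this sketch.
    """
    sizes = {host: parse_du_size(size) for host, size in report.items()}
    median = statistics.median(sizes.values())
    return sorted(
        host
        for host, size in sizes.items()
        if size > median * factor or size < median / factor
    )


# A subset of the values from the salt output above.
report = {
    "openqaworker-arm-2.suse.de": "4.0K",
    "openqaworker3.suse.de": "58G",
    "openqaworker7.suse.de": "47G",
    "grenache-1.qa.suse.de": "62G",
}
outliers = find_outliers(report)
```

With these inputs only openqaworker-arm-2 stands out, and it does so by being far too small rather than too large, which is a hint that `du` measured something other than the actual cache contents on that host.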
Related issues
History
#1
Updated by okurz over 3 years ago
- Assignee set to nicksinger
- Priority changed from High to Immediate
Something for you?
I suggest stopping the workers, deleting the cache directory, and restarting the workers.
#2
Updated by okurz over 3 years ago
- Has duplicate action #53300: [functional][y]"No space left on device" cause test incomplete for those create_hdd tests in aarch64 added
#3
Updated by okurz over 3 years ago
- Subject changed from openqaworker-arm-2 ran out of space in uploading files in jobs -> cache takes 500G, should be only 50G? to tests incomplete with "No space left on device" openqaworker-arm-2 ran out of space in uploading files in jobs -> cache takes 500G, should be only 50G?
#4
Updated by nicksinger over 3 years ago
- Priority changed from Immediate to High
I've cleaned the worker cache and everything is running again. However, the cache has already grown back to 150G, so it still does not seem to respect the configured limits. Lowering the urgency as it doesn't affect "customers" any longer.
#5
Updated by okurz over 3 years ago
- Status changed from New to Feedback
- Assignee changed from nicksinger to okurz
The disk ran full again today. Current space usage:
/dev/nvme0n1p1 734G 579G 118G 84% /var/lib/openqa/nvme
The cache service reports:
# systemctl status openqa-worker-cacheservice
● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-08-07 10:19:15 UTC; 4 days ago
 Main PID: 46473 (openqa-workerca)
    Tasks: 1 (limit: 512)
   CGroup: /system.slice/openqa-worker-cacheservice.service
           └─46473 /usr/bin/perl /usr/share/openqa/script/openqa-workercache daemon -m production

Aug 07 10:19:15 openqaworker-arm-2 systemd[1]: Stopped OpenQA Worker Cache Service.
Aug 07 10:19:15 openqaworker-arm-2 systemd[1]: Started OpenQA Worker Cache Service.
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 0
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [i] Listening at "http://127.0.0.1:7844"
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [i] Listening at "http://[::1]:7844"
I suspect the space calculation is wrong (the service reports a real size of 0 even though the partition is 84% full), maybe because it does not resolve the correct device. We could try configuring the cache size explicitly to see if that makes a difference.
OK, that did not make a difference, so I changed the symlink to a bind mount:
rm cache && mkdir cache && echo '/var/lib/openqa/nvme/cache/ /var/lib/openqa/nvme/cache none bind 0 0' >> /etc/fstab && mount -a
After that, the cache service calculated the size correctly on startup and handled it appropriately:
# systemctl start openqa-worker-cacheservice
# systemctl status openqa-worker-cacheservice
● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-08-12 07:29:59 UTC; 7s ago
 Main PID: 16881 (openqa-workerca)
    Tasks: 1 (limit: 512)
   CGroup: /system.slice/openqa-worker-cacheservice.service
           └─16881 /usr/bin/perl /usr/share/openqa/script/openqa-workercache daemon -m production

Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 2019622912 from 361925343296 to make space for 322122547200
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] CACHE: removed /var/lib/openqa/cache/openqa.suse.de/sle-15-aarch64-4.12.14-1470.1.g187af5a-Server-DVD-Incidents-Kernel@aarch64-virtio-with...efi-vars.qcow2
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Current cache size: 361925343296
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 330752 from 361925012544 to make space for 322122547200
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] CACHE: removed /var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-0229@aarch64-minimal_with_sdk0176_installed.qcow2
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Current cache size: 361925012544
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 1078525952 from 360846486592 to make space for 322122547200
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] CACHE: removed /var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-0229@aarch64-minimal_with_sdk0176_installed-uefi-vars.qcow2
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Current cache size: 360846486592
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 330240 from 360846156352 to make space for 322122547200
Hint: Some lines were ellipsized, use -l to show in full.
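For readers unfamiliar with what the log above shows: the service removes cached assets one by one until the total size fits under the limit (322122547200 bytes is 300 GiB, the earlier 53687091200 is 50 GiB). The following is a rough sketch of such a reclaim loop, not the actual OpenQA::Worker::Cache code. The function name, the tuple layout and the "least recently used first" order are assumptions made for illustration.

```python
GIB = 1024**3


def reclaim(assets, limit):
    """Evict assets until the total size is at or below `limit`.

    `assets` is a list of (name, size_bytes, last_used) tuples.
    Eviction order (oldest `last_used` first) is an assumption of this
    sketch; the real service may choose differently.
    Returns the remaining total size and the names that were removed.
    """
    current = sum(size for _name, size, _last_used in assets)
    removed = []
    for name, size, _last_used in sorted(assets, key=lambda a: a[2]):
        if current <= limit:
            break
        current -= size
        removed.append(name)
    return current, removed


# Hypothetical assets; names, sizes and timestamps are made up.
assets = [
    ("old.qcow2", 100, 1),
    ("middle.qcow2", 200, 2),
    ("recent.qcow2", 300, 3),
]
remaining, removed = reclaim(assets, limit=350)
```

Such a loop can only work if the starting size is measured correctly. With a measured size of 0, as in the earlier status output, nothing is ever evicted and the disk fills up.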
#6
Updated by okurz over 3 years ago
- Copied to action #55373: Worker::Cache thinks no space is used when cache resides in symlinked folder pointing to other partition added
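The failure mode named in #55373 can be illustrated with a small sketch. This is not how OpenQA::Worker::Cache computes its size; it only shows, under that assumption, how a size calculation that inspects a symlink itself instead of the directory it points to ends up with almost nothing. The helper names `tree_size` and `demo` and the file names are hypothetical. The sketch assumes a POSIX system where creating symlinks needs no special privileges.

```python
import os
import tempfile


def tree_size(path, resolve=True):
    """Sum the sizes of regular files below `path`.

    With resolve=False a symlinked `path` is measured as the link itself
    (comparable to running du on a symlink without dereferencing it).
    """
    if not resolve and os.path.islink(path):
        return os.lstat(path).st_size
    total = 0
    for dirpath, _dirnames, filenames in os.walk(os.path.realpath(path)):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def demo():
    """Build a cache directory reached through a symlink and measure it twice."""
    with tempfile.TemporaryDirectory() as tmp:
        real = os.path.join(tmp, "nvme_cache")
        os.mkdir(real)
        with open(os.path.join(real, "asset.qcow2"), "wb") as handle:
            handle.write(b"\0" * 1_000_000)
        link = os.path.join(tmp, "cache")
        os.symlink(real, link)
        return tree_size(link, resolve=False), tree_size(link, resolve=True)


unresolved, resolved = demo()
```

Here `resolved` covers the asset file, while `unresolved` is only the size of the link entry. A bind mount avoids the question entirely because the cache path is then a real directory.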
#7
Updated by okurz over 3 years ago
- Status changed from Feedback to Resolved
I checked, we are good