action #53261

tests incomplete with "No space left on device" openqaworker-arm-2 ran out of space in uploading files in jobs -> cache takes 500G, should be only 50G?

Added by okurz 10 months ago. Updated 8 months ago.

Status: Resolved
Start date: 18/06/2019
Priority: High
Due date:
Assignee: okurz
% Done: 0%
Category: -
Target version: -
Duration:

Description

Observation

[18/06/2019 12:17:16] <riafarov> [2019-06-18T10:09:36.107 UTC] [debug] Can't write to file "testresults/logs_from_installation_system-25.txt": No space left on device at /usr/lib/os-autoinst/basetest.pm line 436.
[18/06/2019 12:17:35] <riafarov> seems that openqaworker-arm-2 disk is full
[18/06/2019 12:25:47] <okurz> @riafarov: hm, it's not
[18/06/2019 12:26:41] <okurz> but something seems odd. Don't know where the pool is stored
[18/06/2019 12:28:52] <okurz> @riafarov ok I think I found it, the cache takes up 500G, is there a ticket for that already?

Problem

Investigating and comparing the space usage on other workers:

sudo salt '*' cmd.run 'du -sh /var/lib/openqa/cache'
openqaworker-arm-2.suse.de:
    4.0K        /var/lib/openqa/cache
openqa.suse.de:
    du: cannot access '/var/lib/openqa/cache': No such file or directory
powerqaworker-qam-1:
    62G /var/lib/openqa/cache
QA-Power8-4-kvm.qa.suse.de:
    59G /var/lib/openqa/cache
openqaworker3.suse.de:
    58G /var/lib/openqa/cache
openqaw1.qa.suse.de:
    59G /var/lib/openqa/cache
openqaworker5.suse.de:
    56G /var/lib/openqa/cache
QA-Power8-5-kvm.qa.suse.de:
    58G /var/lib/openqa/cache
openqaworker2.suse.de:
    59G /var/lib/openqa/cache
openqaw2.qa.suse.de:
    57G /var/lib/openqa/cache
openqaworker13.suse.de:
    58G /var/lib/openqa/cache
openqaworker6.suse.de:
    57G /var/lib/openqa/cache
openqaworker8.suse.de:
    59G /var/lib/openqa/cache
openqaworker9.suse.de:
    57G /var/lib/openqa/cache
openqaworker7.suse.de:
    47G /var/lib/openqa/cache
grenache-1.qa.suse.de:
    62G /var/lib/openqa/cache
openqaworker-arm-1.suse.de:
    Minion did not return. [Not connected]
malbec.arch.suse.de:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code
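
The 4.0K reported for openqaworker-arm-2 is consistent with du measuring a symlink rather than the directory it points to, which would match the later finding in this ticket that the cache directory there was a symlink. The following is a minimal sketch of that effect, not taken from the ticket: it uses hypothetical paths in a temporary directory and compares plain `du` with `du -L`, which dereferences symlinks.

```shell
#!/bin/sh
# Hypothetical demo layout in a temporary directory; not the real worker paths.
demo=$(mktemp -d)
mkdir "$demo/nvme-cache"
# 1 MiB of incompressible data stands in for cached assets.
head -c 1048576 /dev/urandom > "$demo/nvme-cache/asset.qcow2"
sync
ln -s "$demo/nvme-cache" "$demo/cache"

# Without -L, du counts only the symlink itself.
plain_kb=$(du -sk "$demo/cache" | cut -f1)
# With -L, du follows the symlink and counts the target directory.
deref_kb=$(du -skL "$demo/cache" | cut -f1)

echo "plain: ${plain_kb}K, dereferenced: ${deref_kb}K"
rm -rf "$demo"
```

On a worker, running the salt command with `du -shL` instead of `du -sh` should therefore give comparable numbers regardless of whether the cache path is a symlink.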

Related issues

Duplicated by openQA Tests - action #53300: [functional][y]"No space left on device" cause test incom... Rejected 19/06/2019
Copied to openQA Project - action #55373: Worker::Cache thinks no space is used when cache resides ... Resolved 18/06/2019

History

#1 Updated by okurz 10 months ago

  • Assignee set to nicksinger
  • Priority changed from High to Immediate

Something for you?

I suggest stopping the workers, deleting the cache directory and restarting the workers.

#2 Updated by okurz 10 months ago

  • Duplicated by action #53300: [functional][y]"No space left on device" cause test incomplete for those create_hdd tests in aarch64 added

#3 Updated by okurz 10 months ago

  • Subject changed from openqaworker-arm-2 ran out of space in uploading files in jobs -> cache takes 500G, should be only 50G? to tests incomplete with "No space left on device" openqaworker-arm-2 ran out of space in uploading files in jobs -> cache takes 500G, should be only 50G?

#4 Updated by nicksinger 9 months ago

  • Priority changed from Immediate to High

I've cleaned the worker cache and everything is running again. However, the cache has already grown back to 150G, so it still does not seem to respect the configured limit. Lowering the urgency as it doesn't affect "customers" any longer.

#5 Updated by okurz 8 months ago

  • Status changed from New to Feedback
  • Assignee changed from nicksinger to okurz

The disk ran full again today. Current space usage:

/dev/nvme0n1p1                        734G  579G  118G  84% /var/lib/openqa/nvme

the cache service reports:

# systemctl status openqa-worker-cacheservice
● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-08-07 10:19:15 UTC; 4 days ago
 Main PID: 46473 (openqa-workerca)
    Tasks: 1 (limit: 512)
   CGroup: /system.slice/openqa-worker-cacheservice.service
           └─46473 /usr/bin/perl /usr/share/openqa/script/openqa-workercache daemon -m production

Aug 07 10:19:15 openqaworker-arm-2 systemd[1]: Stopped OpenQA Worker Cache Service.
Aug 07 10:19:15 openqaworker-arm-2 systemd[1]: Started OpenQA Worker Cache Service.
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 0
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [i] Listening at "http://127.0.0.1:7844"
Aug 07 10:19:18 openqaworker-arm-2 openqa-workercache[46473]: [i] Listening at "http://[::1]:7844"
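
The log reports the configured limit in bytes, which makes it hard to compare with the sizes in G shown by du and df. A small helper along these lines (the function name `bytes_to_gib` is made up for illustration) shows that the limit is the expected 50 GiB, while the reported real size of 0 is what contradicts the 579G in use on the device.

```shell
#!/bin/sh
# Convert a byte count to whole GiB (integer division, rounds down).
bytes_to_gib() {
    echo $(( $1 / 1024 / 1024 / 1024 ))
}

# Configured limit from the cache service log above.
bytes_to_gib 53687091200
```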

I suspect the space calculation is wrong, maybe because it does not resolve the correct device. We could try to configure the cache size explicitly and see if that makes a difference.

OK, that did not make a difference, so I changed the symlink to a bind mount:

rm cache && mkdir cache && echo '/var/lib/openqa/nvme/cache /var/lib/openqa/cache none bind 0 0' >> /etc/fstab && mount -a

After that, the cache service calculated the size correctly on startup and started reclaiming space:

# systemctl start openqa-worker-cacheservice
# systemctl status openqa-worker-cacheservice
● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-08-12 07:29:59 UTC; 7s ago
 Main PID: 16881 (openqa-workerca)
    Tasks: 1 (limit: 512)
   CGroup: /system.slice/openqa-worker-cacheservice.service
           └─16881 /usr/bin/perl /usr/share/openqa/script/openqa-workercache daemon -m production

Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 2019622912 from 361925343296 to make space for 322122547200
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] CACHE: removed /var/lib/openqa/cache/openqa.suse.de/sle-15-aarch64-4.12.14-1470.1.g187af5a-Server-DVD-Incidents-Kernel@aarch64-virtio-with...efi-vars.qcow2
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Current cache size: 361925343296
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 330752 from 361925012544 to make space for 322122547200
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] CACHE: removed /var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-0229@aarch64-minimal_with_sdk0176_installed.qcow2
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Current cache size: 361925012544
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 1078525952 from 360846486592 to make space for 322122547200
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] CACHE: removed /var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-0229@aarch64-minimal_with_sdk0176_installed-uefi-vars.qcow2
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Current cache size: 360846486592
Aug 12 07:30:06 openqaworker-arm-2 openqa-workercache[16881]: [DEBUG] Reclaiming 330240 from 360846156352 to make space for 322122547200
Hint: Some lines were ellipsized, use -l to show in full.
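
The numbers in the reclaim messages can be checked with simple arithmetic. The sketch below assumes that the last number in each "Reclaiming" line is the configured limit (it is exactly 300 GiB) and that the middle number is the current cache size; both values are copied from the first log line above.

```shell
#!/bin/sh
# Values copied from the first "Reclaiming" log line above (bytes).
current=361925343296
limit=322122547200

# Bytes that would have to be removed for the cache to fit under the limit.
excess=$(( current - limit ))
echo "excess: $excess bytes (~$(( excess / 1024 / 1024 / 1024 )) GiB)"
```

Under these assumptions the service needed to free roughly 37 GiB at that point, which fits the long run of removals in the log.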

#6 Updated by okurz 8 months ago

  • Copied to action #55373: Worker::Cache thinks no space is used when cache resides in symlinked folder pointing to other partition added
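
To find other workers that might be affected by the condition described in the copied ticket, a check like the following could be run on each host. This is only a sketch: the function name `check_cache_path` is hypothetical, and it only distinguishes a symlink from a regular directory (a bind mount shows up as a directory).

```shell
#!/bin/sh
# Report whether a cache path is a symlink, a directory, or missing.
check_cache_path() {
    path=$1
    if [ -L "$path" ]; then
        # readlink -f prints the fully resolved target of the symlink.
        echo "symlink -> $(readlink -f "$path")"
    elif [ -d "$path" ]; then
        echo "directory"
    else
        echo "missing"
    fi
}

check_cache_path /var/lib/openqa/cache
```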

#7 Updated by okurz 8 months ago

  • Status changed from Feedback to Resolved

I checked, we are good
