The PR was merged by mkittler yesterday and seems to cause problems on o3, as reported by DimStar.
[17/12/2019 11:40:02] <DimStar> https://openqa.opensuse.org/admin/assets - either I'm just lucky and asset cleanup is 'really' ongoing, or there is more stuff broken
[17/12/2019 11:40:27] <sysrich_> did anyone deploy something new? coolo okurz?
[17/12/2019 11:40:43] <DimStar> 2019-12-17 03:02:04|install|openQA|4.6.1576531085.b1739792f-lp151.2108.1|noarch||devel_openQA|1ceac919398d7c25d705b86ff1f3abd07d7534a691eca981c28df3418cf4dbe2|
[17/12/2019 11:40:51] <okurz> sysrich_: yes, there is nightly deployment
[17/12/2019 11:41:36] <okurz> sounds like https://github.com/os-autoinst/openQA/pull/2491 is the cause
[17/12/2019 11:41:38] <|Anna|> Github project os-autoinst/openQA pull request#2491: "Trigger tasks for limiting assets and results/logs hourly", created on 2019-11-14, status: closed on 2019-12-16, https://github.com/os-autoinst/openQA/pull/2491
[…]
[17/12/2019 11:42:26] <okurz> sysrich: can you tell me what is the impact? is it about asset cleanup now or the not registered TW snapshot?
[17/12/2019 11:42:45] <sysrich> okurz, DimStar or fvogt can probably tell you far more..I've only heard of this 3 minutes ago
[17/12/2019 11:43:47] <DimStar> okurz: well, in essence I'm trying to find out why 1216 does not show up on QA (should have move before 8am) - so while going through logs, there is quite some errors and when going to /admin/assets (wanted to see if anything 1216 was registered) it is constantly in 'cleanup'
[17/12/2019 11:43:58] <DimStar> okurz: the two can be related or independent... I don't know
[17/12/2019 11:44:50] <DimStar> new staging tests seem to show up though
[17/12/2019 11:44:57] <okurz> ok. I see. Let's assume they are only remotely related and we can try to follow up each one by one.
[17/12/2019 11:45:54] <DimStar> there was also a merge on rsync scripts yesterday (https://gitlab.suse.de/openqa/scripts/commit/30854a049a3e92aaad6e6709bade662473365ae7); but from first glance it does not look like this should have broken TW
[17/12/2019 11:46:34] <fvogt> There is quite a lot of Use of uninitialized value $_ in pattern match (m//) at /opt/openqa-scripts/rsync_opensuse.pm line 400.
[17/12/2019 11:49:54] <okurz> ok so first I will look into the cleanup minions on https://openqa.opensuse.org/minion/jobs , deleting a lot of failed, checking logs, workers, etc. to prevent a fillup of /space
https://nagios-devel.suse.de/pnp4nagios/graph?host=ariel-opensuse.suse.de&srv=space_partition&view=2 shows /space growing since this morning.
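For reference, a quick local check on ariel can confirm what the graph shows; this assumes /var/lib/openqa lives on /space as usual on o3:
df -h /space
du -sh /var/lib/openqa/share/factory/* 2>/dev/null | sort -h | tail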
EDIT: I can see the timer starting the service openqa-enqueue-asset-and-result-cleanup; journalctl -u openqa-enqueue-asset-and-result-cleanup shows:
Dec 17 11:00:01 ariel systemd[1]: Starting Enqueues an asset cleanup and a result/logs cleanup task for the openQA....
Dec 17 11:00:05 ariel openqa[27477]: [
Dec 17 11:00:05 ariel openqa[27477]: {
Dec 17 11:00:05 ariel openqa[27477]: "gru_id" => 17075633,
Dec 17 11:00:05 ariel openqa[27477]: "minion_id" => 122294
Dec 17 11:00:05 ariel openqa[27477]: },
Dec 17 11:00:05 ariel openqa[27477]: {
Dec 17 11:00:05 ariel openqa[27477]: "gru_id" => 17075634,
Dec 17 11:00:05 ariel openqa[27477]: "minion_id" => 122295
Dec 17 11:00:05 ariel openqa[27477]: }
Dec 17 11:00:05 ariel openqa[27477]: ]
Dec 17 11:00:05 ariel systemd[1]: Started Enqueues an asset cleanup and a result/logs cleanup task for the openQA..
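For completeness, the hourly trigger itself can be verified like this (assuming the timer unit carries the same name as the service):
systemctl list-timers 'openqa-enqueue-*'
systemctl status openqa-enqueue-asset-and-result-cleanup.timer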
I can also see the gru service traversing our asset tree:
geekote+ 22046 41.8 1.2 366952 199536 ? SN 10:50 5:53 \_ /usr/bin/perl /usr/share/openqa/script/openqa gru -m production run --reset-locks
ariel:/home/okurz # strace -f -estat -p 22046
[…]
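Besides strace, a few /proc lookups help to confirm where the process is spending its time (PID 22046 taken from the ps output above):
readlink /proc/22046/cwd            # directory the gru process is currently working in
ls /proc/22046/fd | wc -l           # number of open file descriptors
cat /proc/22046/wchan; echo         # kernel wait channel, '0' if runnable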
I can see that the gru process is simply stuck in /var/lib/openqa/share/factory/other where we have 87644 files (ls -ltra | wc -l). This did not seem to be that severe a problem before, but now it is. I can mitigate this a bit for the time being by manually deleting old assets, but we should come up with a better approach; at the very least we can revert the PR again.
EDIT: As manual mitigation I ran find -name '*TEMP*' -delete and find -mtime +365 -delete.
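For next time, a slightly safer variant of that manual mitigation is to dry-run each filter before deleting (run inside /var/lib/openqa/share/factory/other; same filters as above, only the -print steps are added):
find . -name '*TEMP*' -print | wc -l    # how many temp leftovers would go
find . -name '*TEMP*' -delete
find . -mtime +365 -print | wc -l       # how many year-old files would go
find . -mtime +365 -delete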
As the cleanup task is still stuck in the same directory and /space is about to run full, I killed the gru process and rolled back the web UI for now with zypper -n in -f /var/cache/zypp/packages/devel_openQA/noarch/openQA*-4.6.1576340016.48aaffc06-lp151.2103.1.noarch.rpm. Immediately after the services were restarted the cleanup started unlinking files right away and was no longer stuck traversing, so the openQA web UI installation was rolled back successfully. This helps with asset management: https://openqa.opensuse.org/admin/assets is usable again and /space seems to recover.
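For the record, the rollback sequence was roughly the following; the systemctl lines are my reconstruction, assuming the usual o3 unit names openqa-webui and openqa-gru:
zypper -n in -f /var/cache/zypp/packages/devel_openQA/noarch/openQA*-4.6.1576340016.48aaffc06-lp151.2103.1.noarch.rpm
systemctl restart openqa-webui openqa-gru
systemctl status openqa-gru     # check that gru is running again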
@mkittler I propose to revert the PR and that you try to reproduce the problem with a test, e.g. with 100k local files. I suspect the problem has to do with paths, e.g. chdir or absolute vs. relative paths.
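A minimal sketch of what such a local reproducer could look like; the /tmp paths and the glob-based scan are my assumptions and not how the openQA cleanup actually enumerates assets, it is only meant to expose a chdir vs. absolute-path difference with ~100k files:
mkdir -p /tmp/asset-scan-test/other
seq -f '/tmp/asset-scan-test/other/file-%.0f.qcow2' 1 100000 | xargs touch
cd /tmp/asset-scan-test/other && time perl -e 'print scalar(() = glob("*")), "\n"'      # relative paths, after chdir
cd / && time perl -e 'print scalar(() = glob("/tmp/asset-scan-test/other/*")), "\n"'    # absolute paths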