action #55241

Prevent spamming cleanup jobs

Added by mkittler 7 months ago. Updated about 1 month ago.

Status:ResolvedStart date:08/08/2019
Priority:NormalDue date:
Assignee:mkittler% Done:

0%

Category:Organisational
Target version:Done
Difficulty:
Duration:

Description

limit_assets and limit_results_and_logs jobs are created quite frequently on an intensively used openQA instance such as OSD (see https://openqa.suse.de/minion/jobs?state=finished). Most of these jobs are cancelled because the previously started job hasn't been finished yet.

All these jobs are quite a distraction when using the Minion dashboard. Besides, creating so many of these jobs seems quite wasteful.

So far these both cleanup tasks are created when a new ISO is scheduled and that can be quite frequent (see https://openqa.suse.de/admin/productlog). Just consider the frequency of ISO posts alone from user osukup:

select p.t_created, nickname from scheduled_products as p join users on user_id=users.id where nickname = 'osukup' order by t_created desc limit 20;
      t_created      | nickname 
---------------------+----------
 2019-08-08 06:07:25 | osukup
 2019-08-08 03:08:26 | osukup
 2019-08-08 03:08:25 | osukup
 2019-08-08 03:08:25 | osukup
 2019-08-08 03:08:25 | osukup
 2019-08-08 03:08:24 | osukup
 2019-08-08 03:08:24 | osukup
 2019-08-08 03:08:23 | osukup
 2019-08-08 03:08:23 | osukup
 2019-08-08 03:08:23 | osukup
 2019-08-08 03:08:22 | osukup
 2019-08-08 03:08:22 | osukup
 2019-08-08 03:08:21 | osukup
 2019-08-08 03:08:20 | osukup
 2019-08-08 03:08:20 | osukup
 2019-08-08 03:08:20 | osukup
 2019-08-08 03:08:19 | osukup
 2019-08-08 03:08:19 | osukup
 2019-08-08 02:07:02 | osukup
 2019-08-08 02:07:02 | osukup
(20 Zeilen)

One suggestion to solve the problem would be to enqueue these tasks periodically. I would add a systemd timer in the same way as for the audit event cleanup. In this case it would make sense to activate the timers automatically when the web UI service is started.


Related issues

Related to openQA Project - action #55922: limit tasks needs to be limited (in runtime) New 24/08/2019

History

#1 Updated by cdywan 7 months ago

I like the suggestion a lot. Perhaps even make openqa-enqueue-audit-event-cleanup more generic and handle both? Or a separate one if you prefer, but using the same pattern wrt configuring the frequency for consistency.

#2 Updated by mkittler 7 months ago

I would use a separate timer because limit_assets and limit_results_and_logs should likely run more often than the audit event cleanup. It makes also sense not to trigger all of these jobs at time same time anyways (using even e.g. RandomizedDelaySec=).

#3 Updated by coolo 6 months ago

  • Category set to Organisational
  • Target version set to Ready

If we're honest with ourselves, they shouldn't be minion tasks at all - but background commands triggered on their own. But the minion dashboard is so damn convenient :)

So going with an hourly timer to schedule minion jobs sounds a good fit.

#4 Updated by mkittler 3 months ago

  • Related to action #55922: limit tasks needs to be limited (in runtime) added

#5 Updated by mkittler 3 months ago

  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint

#6 Updated by mkittler 3 months ago

I've created a draft PR for triggering the cleanup hourly but there's no consensus whether this is the best solution: https://github.com/os-autoinst/openQA/pull/2491

The alternative would be checking for ongoing or recently finished cleanup tasks to possibly delay spawning further tasks. However, that would weirdly duplicate what the look inside the Minion job is actually supposed to do.

#7 Updated by mkittler 3 months ago

Note that this is mainly a problem because the Minion dashboard is spammed with all these jobs. If a job could simply add a flag to its result which says 'delete me' we could easily prevent the problem. @kraih Does that makes sense? It sounds less weird than duplicating the lock-logic before scheduling the Minion job.

#8 Updated by kraih 3 months ago

Only enqueueing the task periodically makes sense to me. I'd only make the logic more complicated if that doesn't work out.

#9 Updated by coolo 3 months ago

+1 - I already objected to okurz's objection

#10 Updated by mkittler 3 months ago

Ok, I marked the PR "ready to review".

#11 Updated by okurz 2 months ago

  • Status changed from New to In Progress
  • Priority changed from Normal to Urgent

PR was merged by mkittler yesterday. And it seems to show problems on o3. As reported by DimStar.

[17/12/2019 11:40:02] <DimStar> https://openqa.opensuse.org/admin/assets - either I'm just lucky and asset cleanup is 'really' ongoing, or there is more stuff broken
[17/12/2019 11:40:27] <sysrich_> did anyone deploy something new? coolo okurz?
[17/12/2019 11:40:43] <DimStar> 2019-12-17 03:02:04|install|openQA|4.6.1576531085.b1739792f-lp151.2108.1|noarch||devel_openQA|1ceac919398d7c25d705b86ff1f3abd07d7534a691eca981c28df3418cf4dbe2|
[17/12/2019 11:40:51] <okurz> sysrich_: yes, there is nightly deployment
[17/12/2019 11:41:36] <okurz> sounds like https://github.com/os-autoinst/openQA/pull/2491 is the cause
[17/12/2019 11:41:38] <|Anna|> Github project os-autoinst/openQA pull request#2491: "Trigger tasks for limiting assets and results/logs hourly", created on 2019-11-14, status: closed on 2019-12-16, https://github.com/os-autoinst/openQA/pull/2491
[…]
[17/12/2019 11:42:26] <okurz> sysrich: can you tell me what is the impact? is it about asset cleanup now or the not registered TW snapshot?
[17/12/2019 11:42:45] <sysrich> okurz, DimStar or fvogt can probably tell you far more..I've only heard of this 3 minutes ago
[17/12/2019 11:43:47] <DimStar> okurz: well, in essence I'm trying to find out why 1216 does not show up on QA (should have move before 8am) - so while going through logs, there is quite some errors and when going to /admin/assets (wanted to see if anything 1216 was registered) it is constantly in 'cleanup'
[17/12/2019 11:43:58] <DimStar> okurz: the two can be related or independent... I don't know
[17/12/2019 11:44:50] <DimStar> new staging tests seem to show up though
[17/12/2019 11:44:57] <okurz> ok. I see. Let's assume they are only remotely related and we can try to follow up each one by one.
[17/12/2019 11:45:54] <DimStar> there was also a merge on rsync scripts yesterday (https://gitlab.suse.de/openqa/scripts/commit/30854a049a3e92aaad6e6709bade662473365ae7); but from first glance it does not look like this should have broken TW
[17/12/2019 11:46:34] <fvogt> There is quite a lot of Use of uninitialized value $_ in pattern match (m//) at /opt/openqa-scripts/rsync_opensuse.pm line 400.
[17/12/2019 11:49:54] <okurz> ok so first I will look into the cleanup minions on https://openqa.opensuse.org/minion/jobs , deleting a lot of failed, checking logs, workers, etc. to prevent a fillup of /space

https://nagios-devel.suse.de/pnp4nagios/graph?host=ariel-opensuse.suse.de&srv=space_partition&view=2 shows a growing /space since today morning.

EDIT: I can see the timer starting the service openqa-enqueue-asset-and-result-cleanup, journalctl -u openqa-enqueue-asset-and-result-cleanup:

Dec 17 11:00:01 ariel systemd[1]: Starting Enqueues an asset cleanup and a result/logs cleanup task for the openQA....
Dec 17 11:00:05 ariel openqa[27477]: [
Dec 17 11:00:05 ariel openqa[27477]:   {
Dec 17 11:00:05 ariel openqa[27477]:     "gru_id" => 17075633,
Dec 17 11:00:05 ariel openqa[27477]:     "minion_id" => 122294
Dec 17 11:00:05 ariel openqa[27477]:   },
Dec 17 11:00:05 ariel openqa[27477]:   {
Dec 17 11:00:05 ariel openqa[27477]:     "gru_id" => 17075634,
Dec 17 11:00:05 ariel openqa[27477]:     "minion_id" => 122295
Dec 17 11:00:05 ariel openqa[27477]:   }
Dec 17 11:00:05 ariel openqa[27477]: ]
Dec 17 11:00:05 ariel systemd[1]: Started Enqueues an asset cleanup and a result/logs cleanup task for the openQA..

and I can see the gru service to traverse our asset tree:

geekote+ 22046 41.8  1.2 366952 199536 ?       SN   10:50   5:53  \_ /usr/bin/perl /usr/share/openqa/script/openqa gru -m production run --reset-locks
ariel:/home/okurz # strace -f -estat -p 22046
[…]

I can see that the gru process is merely stuck in /var/lib/openqa/share/factory/other where we have (ls -ltra | wc -l) 87644 files. This did not seem to be that severe of a problem in before but now it is. I can try to manually mitigate this a bit for the time being by manually deleting old assets but we should come up with a better approach, at the very least we can revert the PR again.

EDIT: find -name '*TEMP*' -delete and find -mtime +365 -delete.

As the cleanup task is still stuck in the same repo and /space is about to run full I killed the gru process and rolled back the web UI for now with zypper -n in -f /var/cache/zypp/packages/devel_openQA/noarch/openQA*-4.6.1576340016.48aaffc06-lp151.2103.1.noarch.rpm . Immediately after the services were restarted the cleanup started and it immediately unlinked files and was not stuck in traversing. So this rolled back successfully the openQA webui installation. This seems to help with the asset management. https://openqa.opensuse.org/admin/assets is usable again and /space seems to recover.

@mkittler I propose to revert and you try to reproduce the problem with a test, e.g. with 100k local files. I have the suspicion the problem has to do with paths, e.g. chdir or absolute/relative or so.

#13 Updated by okurz 2 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to Normal

I closed the revert. As of today o3:/space is at 94% , near-critical but steady. Whatever seems to have caused the recent uprise in space usage then can be handled in #57494

#14 Updated by coolo 2 months ago

I have my doubts that the cleanup frequency has anything to do with the problems noted. The runtime of limit_assets was 18-31 minutes already before the deployment. Just noone looked at it.

#15 Updated by coolo 2 months ago

you can revert the revert - limit_assets is slow due to all the other files that are to be removed now.

It's doing this since at least

[2019-12-17T16:30:04.0479 UTC] [info] [pid:29726] GRU: removing /var/lib/openqa/share/factory/other/ltp-fs_readonly-opensuse-Tumbleweed-x86_64-20190225-DVD@64bit-with-ltp.qcow2.txt

which is the latest we have in logs. Instead of reverting the revert, helping the gru task by cleaning up other/ manual would have been a good fix. And I bet the uprise in FS usage is due to limit_assets not getting back to the isos before being canceled.

#16 Updated by okurz 2 months ago

coolo wrote:

you can revert the revert

I only opened the revert PR just in case we would come to the conclusion it would be necessary. We never came to that conclusion and the revert was closed but without merge.

And now for something completely different: The PR https://github.com/os-autoinst/openQA/pull/2616 for deleting assets regardless of the type has been prepared and already merged. It is correct though that I haven't considered the possibility we would have introduced a cleanup of "other" assets which never actually cleaned up "other" :D

#17 Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done

The PR seems to work in production.

Also available in: Atom PDF