action #131471
open
Leftover worker temporary directories in /tmp on OSD and O3 size:M
Added by kraih over 1 year ago.
Updated over 1 year ago.
Category: Regressions/Crashes
Description
Motivation
While investigating #131447 we noticed that there are quite a few leftover temporary directories in /tmp. These appear to have been created by the scheduler or webui, and under certain circumstances are not cleaned up even if they are no longer needed.
...
drwx------ 2 geekotest nogroup 40 Jun 24 14:42 Y7FKl4lvbt
drwx------ 2 geekotest nogroup 40 Jun 22 08:39 yblAcg26X3
drwx------ 2 geekotest nogroup 40 Jun 25 09:09 YeBlK48awn
drwx------ 2 geekotest nogroup 40 Jun 23 19:55 yENrB8ToeU
drwx------ 2 geekotest nogroup 40 Jun 27 12:04 Y_F6OzG5_3
drwx------ 2 geekotest nogroup 40 Jun 23 17:46 yfGT2ppHyr
drwx------ 2 geekotest nogroup 40 Jun 25 09:09 yfneSFCwls
drwx------ 2 geekotest nogroup 40 Jun 22 16:54 yHqfyU6xnC
drwx------ 2 geekotest nogroup 40 Jun 26 02:26 YHrukyCIn5
drwx------ 2 geekotest nogroup 40 Jun 27 11:22 yIlfzjoPpj
drwx------ 2 geekotest nogroup 40 Jun 23 21:16 Yixisovjrn
drwx------ 2 geekotest nogroup 40 Jun 23 17:48 Yj1yuoh_7D
drwx------ 2 geekotest nogroup 40 Jun 22 11:50 YlIRb1a69M
drwx------ 2 geekotest nogroup 40 Jun 27 12:19 YLqGsMNpZ9
...
What triggers the directories not to be cleaned up?
From the openqa_scheduler_log:
[2023-06-27T19:25:43.422205Z] [warn] [pid:5909] Failed sending job(s) '3387016' to worker '425': Unable to assign job to worker 425: the worker is not connected anymore
In those cases $schema->txn_do(sub { $worker->unprepare_for_work; }); is called, which simply deletes the setting in the DB but doesn't remove the directory.
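The failure mode described above can be illustrated with a small standalone sketch. This is not openQA code (which is Perl): the Worker class, the prepare_for_work counterpart, and the "tmpdir" property key are hypothetical stand-ins, and the dictionary only plays the role of the DB. The point is that the unprepare step removes the directory together with the setting that referenced it.

```python
import shutil
import tempfile
from pathlib import Path


class Worker:
    """Toy stand-in for a worker record; `properties` plays the role of the DB."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.properties = {}

    def prepare_for_work(self):
        # Hypothetical counterpart: create a per-worker temporary directory
        # and remember its path in the (hypothetical) "tmpdir" property.
        path = tempfile.mkdtemp(prefix=f"webui.worker-{self.worker_id}.")
        self.properties["tmpdir"] = path
        return Path(path)

    def unprepare_for_work(self):
        # Forget the setting AND remove the directory it pointed to.
        # Dropping only the setting would leave the directory orphaned.
        path = self.properties.pop("tmpdir", None)
        if path is not None:
            shutil.rmtree(path, ignore_errors=True)
```

Calling unprepare_for_work twice is harmless in this sketch, because the second call finds no stored path.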
Acceptance criteria
- AC1: No more leftover temporary directories on OSD and O3
Suggestions
- tempdir() is called in two places (here and here)
- Focus on fixing the cleanup code
- It might make sense to change the default directory from /tmp to some directory under /var/lib/openqa/... too, since that's usually where we have our faster, larger disks mounted
Out of scope
- Related to action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines added
- Target version changed from future to Ready
mkittler wrote:
Looks like this has been improved via https://github.com/os-autoinst/openQA/pull/5225. Now those directories should be better distinguishable.
On O3 it now looks like this:
...
drwx------ 2 geekotest nogroup 40 Jun 28 08:58 webui.worker-663.Gb7ldAWq
drwx------ 2 geekotest nogroup 40 Jun 28 03:11 webui.worker-685.u6GnjlTX
drwx------ 2 geekotest nogroup 40 Jun 28 02:06 webui.worker-686.zFIveEWJ
drwx------ 2 geekotest nogroup 40 Jun 28 01:45 webui.worker-688.GDucRVQa
drwx------ 2 geekotest nogroup 40 Jun 28 01:49 webui.worker-689.1h5mq5zP
drwx------ 2 geekotest nogroup 40 Jun 28 02:53 webui.worker-690.oZ012BFt
drwx------ 2 geekotest nogroup 40 Jun 28 01:41 webui.worker-691.e1q9JYvm
drwx------ 2 geekotest nogroup 40 Jun 28 01:53 webui.worker-692.R6shl8cQ
drwx------ 2 geekotest nogroup 40 Jun 28 02:53 webui.worker-693.hHGBIKmc
drwx------ 2 geekotest nogroup 40 Jun 28 01:54 webui.worker-694.rVP1hRIX
drwx------ 2 geekotest nogroup 40 Jun 28 01:46 webui.worker-695.xOd_ob4E
drwx------ 2 geekotest nogroup 40 Jun 28 01:41 webui.worker-696.TRHc0ogw
drwx------ 2 geekotest nogroup 40 Jun 28 02:29 webui.worker-697.Xaqejo4R
drwx------ 2 geekotest nogroup 40 Jun 28 01:54 webui.worker-698.EgqaA6dO
drwx------ 2 geekotest nogroup 40 Jun 28 01:54 webui.worker-699.DbF3smtH
drwx------ 2 geekotest nogroup 40 Jun 28 02:00 webui.worker-700.kfUDqTFf
drwx------ 2 geekotest nogroup 40 Jun 28 02:06 webui.worker-701.6ZF_sAXS
drwx------ 2 geekotest nogroup 120 Jun 28 02:58 webui.worker-702.qYnLFM1f
...
And I can see more than one directory per worker, e.g.
% ls -ld webui.worker-94._RJfQUKB webui.worker-94.hWurW5GA webui.worker-94.zAmB4g3_
drwx------ 2 geekotest nogroup 40 Jun 29 04:00 webui.worker-94._RJfQUKB
drwx------ 2 geekotest nogroup 40 Jun 28 01:30 webui.worker-94.hWurW5GA
drwx------ 2 geekotest nogroup 40 Jun 29 05:15 webui.worker-94.zAmB4g3_
So reusing the dirs doesn't seem to work.
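To check how widespread this is, the directory names can be grouped by worker ID. The following is a minimal sketch that assumes the webui.worker-<id>.<random suffix> naming seen in the listing; the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Naming scheme as seen in the listing: webui.worker-<id>.<random suffix>
WORKER_DIR = re.compile(r"^webui\.worker-(\d+)\.(.+)$")


def group_worker_dirs(names):
    """Map worker ID -> sorted list of matching directory names."""
    groups = defaultdict(list)
    for name in names:
        match = WORKER_DIR.match(name)
        if match:
            groups[int(match.group(1))].append(name)
    return {worker_id: sorted(dirs) for worker_id, dirs in groups.items()}


def workers_with_multiple_dirs(names):
    """Keep only workers that own more than one directory."""
    return {w: d for w, d in group_worker_dirs(names).items() if len(d) > 1}
```

Passing os.listdir("/tmp") as names would report every worker ID that currently has more than one directory; names that do not follow the pattern are ignored.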
- Related to action #131516: Consider creating a separate tmp dir filesystem, e.g. tmpfs? added
- Subject changed from Leftover temporary directories in /tmp on OSD and O3 to Leftover temporary directories in /tmp on OSD and O3 size:M
- Description updated (diff)
- Status changed from New to Workable
- Related to action #131465: Make temporary files and directories created by openQA services easier to identify size:M added
- Target version changed from Ready to future
Could be related to https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Scheduler/Model/Jobs.pm#L250-L258:
log_warning "Failed sending job(s) '$job_ids_str' to worker '$worker_id': $error";
try {
$schema->txn_do(sub { $worker->unprepare_for_work; });
}
and unprepare_for_work will just delete the properties from the DB without deleting the directory.
I just looked into the scheduler log on o3, and there are 86 such error messages in the last ~24h, and at first glance the worker IDs match the duplicate directories.
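The "first glance" comparison could be made repeatable with a small script. This sketch assumes the warning format quoted in the description ("Failed sending job(s) '...' to worker '...'"); the function names are hypothetical, and the set of worker IDs with duplicate directories has to be collected separately.

```python
import re
from collections import Counter

# Pattern modelled on the scheduler warning quoted in the description.
FAILED_SEND = re.compile(r"Failed sending job\(s\) '([^']*)' to worker '(\d+)'")


def failed_assignments_per_worker(log_lines):
    """Count 'Failed sending job(s)' warnings per worker ID."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_SEND.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


def workers_in_both(failure_counts, duplicate_dir_worker_ids):
    """Worker IDs that show failed assignments and also own duplicate directories."""
    return sorted(set(failure_counts) & set(duplicate_dir_worker_ids))
```

If the hypothesis holds, most worker IDs with duplicate directories should also appear in the failure counts.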
- Subject changed from Leftover temporary directories in /tmp on OSD and O3 size:M to Leftover worker temporary directories in /tmp on OSD and O3 size:M
- Description updated (diff)