action #131471

open

Leftover worker temporary directories in /tmp on OSD and O3 size:M

Added by kraih over 1 year ago. Updated over 1 year ago.

Status: Workable
Priority: Normal
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2023-06-27
Due date:
% Done: 0%
Estimated time:

Description

Motivation

While investigating #131447 we noticed that there are quite a few leftover temporary directories in /tmp. These appear to have been created by the scheduler or webui, and under certain circumstances are not cleaned up even if they are no longer needed.

...
drwx------ 2 geekotest nogroup  40 Jun 24 14:42 Y7FKl4lvbt
drwx------ 2 geekotest nogroup  40 Jun 22 08:39 yblAcg26X3
drwx------ 2 geekotest nogroup  40 Jun 25 09:09 YeBlK48awn
drwx------ 2 geekotest nogroup  40 Jun 23 19:55 yENrB8ToeU
drwx------ 2 geekotest nogroup  40 Jun 27 12:04 Y_F6OzG5_3
drwx------ 2 geekotest nogroup  40 Jun 23 17:46 yfGT2ppHyr
drwx------ 2 geekotest nogroup  40 Jun 25 09:09 yfneSFCwls
drwx------ 2 geekotest nogroup  40 Jun 22 16:54 yHqfyU6xnC
drwx------ 2 geekotest nogroup  40 Jun 26 02:26 YHrukyCIn5
drwx------ 2 geekotest nogroup  40 Jun 27 11:22 yIlfzjoPpj
drwx------ 2 geekotest nogroup  40 Jun 23 21:16 Yixisovjrn
drwx------ 2 geekotest nogroup  40 Jun 23 17:48 Yj1yuoh_7D
drwx------ 2 geekotest nogroup  40 Jun 22 11:50 YlIRb1a69M
drwx------ 2 geekotest nogroup  40 Jun 27 12:19 YLqGsMNpZ9
...

What triggers the directories not to be cleaned up?
From the openqa_scheduler_log:

[2023-06-27T19:25:43.422205Z] [warn] [pid:5909] Failed sending job(s) '3387016' to worker '425': Unable to assign job to worker 425: the worker is not connected anymore

In those cases $schema->txn_do(sub { $worker->unprepare_for_work; }); is called, which simply deletes the setting in the DB but doesn't remove the directory.

Acceptance criteria

  • AC1: No more leftover temporary directories on OSD and O3
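
One way to check AC1 is to count worker temporary directories that have not been modified for a while. The sketch below is only an illustration. The function name and the age threshold are assumptions, and so is the webui.worker-* name pattern, which is taken from the O3 listing further down in this ticket. It relies on GNU find.

```shell
# Hypothetical helper, not part of openQA: count worker temporary directories
# directly below BASE whose modification time is more than MINUTES minutes old.
# Assumes the webui.worker-<id>.<suffix> naming seen on O3 and GNU find.
count_stale_worker_dirs() {
    local base=${1:-/tmp} minutes=${2:-1440}
    find "$base" -mindepth 1 -maxdepth 1 -type d \
        -name 'webui.worker-*' -mmin "+$minutes" \
        | wc -l | tr -d ' '
}
```

For example, count_stale_worker_dirs /tmp 1440 would report the directories untouched for more than a day. A count that keeps growing over several days would suggest the cleanup is still not happening.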

Suggestions

  • tempdir() is called in two places (here and here)
  • Focus on fixing the cleanup code
  • It might make sense to change the default directory from /tmp to some directory under /var/lib/openqa/... too, since that's usually where we have our faster, larger disks mounted

Out of scope


Related issues 3 (1 open, 2 closed)

Related to openQA Project (public) - action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines (Resolved, kraih, 2023-06-27)
Related to openQA Project (public) - action #131516: Consider creating a separate tmp dir filesystem, e.g. tmpfs? (New)
Related to openQA Project (public) - action #131465: Make temporary files and directories created by openQA services easier to identify size:M (Resolved, tinita, 2023-06-27, 2023-07-13)

Actions #1

Updated by kraih over 1 year ago

  • Related to action #131447: Some jobs incomplete due to auto_review:"api failure: 400.*/tmp/.*png.*No space left on device.*Utils.pm line 285":retry but enough space visible on machines added
Actions #2

Updated by mkittler over 1 year ago

Looks like this has been improved via https://github.com/os-autoinst/openQA/pull/5225. Now those directories should be better distinguishable.

Actions #3

Updated by okurz over 1 year ago

  • Target version changed from future to Ready
Actions #4

Updated by kraih over 1 year ago

mkittler wrote:

Looks like this has been improved via https://github.com/os-autoinst/openQA/pull/5225. Now those directories should be better distinguishable.

On O3 it now looks like this:

...
drwx------ 2 geekotest nogroup  40 Jun 28 08:58 webui.worker-663.Gb7ldAWq
drwx------ 2 geekotest nogroup  40 Jun 28 03:11 webui.worker-685.u6GnjlTX
drwx------ 2 geekotest nogroup  40 Jun 28 02:06 webui.worker-686.zFIveEWJ
drwx------ 2 geekotest nogroup  40 Jun 28 01:45 webui.worker-688.GDucRVQa
drwx------ 2 geekotest nogroup  40 Jun 28 01:49 webui.worker-689.1h5mq5zP
drwx------ 2 geekotest nogroup  40 Jun 28 02:53 webui.worker-690.oZ012BFt
drwx------ 2 geekotest nogroup  40 Jun 28 01:41 webui.worker-691.e1q9JYvm
drwx------ 2 geekotest nogroup  40 Jun 28 01:53 webui.worker-692.R6shl8cQ
drwx------ 2 geekotest nogroup  40 Jun 28 02:53 webui.worker-693.hHGBIKmc
drwx------ 2 geekotest nogroup  40 Jun 28 01:54 webui.worker-694.rVP1hRIX
drwx------ 2 geekotest nogroup  40 Jun 28 01:46 webui.worker-695.xOd_ob4E
drwx------ 2 geekotest nogroup  40 Jun 28 01:41 webui.worker-696.TRHc0ogw
drwx------ 2 geekotest nogroup  40 Jun 28 02:29 webui.worker-697.Xaqejo4R
drwx------ 2 geekotest nogroup  40 Jun 28 01:54 webui.worker-698.EgqaA6dO
drwx------ 2 geekotest nogroup  40 Jun 28 01:54 webui.worker-699.DbF3smtH
drwx------ 2 geekotest nogroup  40 Jun 28 02:00 webui.worker-700.kfUDqTFf
drwx------ 2 geekotest nogroup  40 Jun 28 02:06 webui.worker-701.6ZF_sAXS
drwx------ 2 geekotest nogroup 120 Jun 28 02:58 webui.worker-702.qYnLFM1f
...
Actions #5

Updated by tinita over 1 year ago

And I can see more than one directory per worker, e.g.

% ls -ld webui.worker-94._RJfQUKB webui.worker-94.hWurW5GA webui.worker-94.zAmB4g3_
drwx------ 2 geekotest nogroup 40 Jun 29 04:00 webui.worker-94._RJfQUKB
drwx------ 2 geekotest nogroup 40 Jun 28 01:30 webui.worker-94.hWurW5GA
drwx------ 2 geekotest nogroup 40 Jun 29 05:15 webui.worker-94.zAmB4g3_

So reusing the dirs doesn't seem to work.
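
To find every worker with more than one directory instead of checking them one by one, something like the following could be used. It is a rough sketch that assumes the webui.worker-<id>.<suffix> naming shown above and GNU find; the function name is made up for illustration.

```shell
# Hypothetical helper: print "<count> <worker-id>" for every worker id that
# owns more than one webui.worker-<id>.<suffix> directory directly below BASE.
list_duplicate_worker_dirs() {
    local base=${1:-/tmp}
    # -printf '%f\n' (GNU find) prints only the directory name without the path.
    find "$base" -mindepth 1 -maxdepth 1 -type d -name 'webui.worker-*.*' -printf '%f\n' \
        | sed -nE 's/^webui\.worker-([0-9]+)\..*$/\1/p' \
        | sort -n | uniq -c \
        | awk '$1 > 1 { print $1, $2 }'
}
```

Called as list_duplicate_worker_dirs /tmp, it prints one line per affected worker. Workers with a single directory are left out.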

Actions #6

Updated by dheidler over 1 year ago

  • Related to action #131516: Consider creating a separate tmp dir filesystem, e.g. tmpfs? added
Actions #7

Updated by dheidler over 1 year ago

  • Subject changed from Leftover temporary directories in /tmp on OSD and O3 to Leftover temporary directories in /tmp on OSD and O3 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by tinita over 1 year ago

  • Related to action #131465: Make temporary files and directories created by openQA services easier to identify size:M added
Actions #9

Updated by okurz over 1 year ago

  • Target version changed from Ready to future
Actions #10

Updated by tinita over 1 year ago

Could be related to
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Scheduler/Model/Jobs.pm#L250-L258

        log_warning "Failed sending job(s) '$job_ids_str' to worker '$worker_id': $error";
        try {
            $schema->txn_do(sub { $worker->unprepare_for_work; });
        }

and unprepare_for_work will just delete the properties from the DB without deleting the directory.

I just looked into the scheduler log on o3; there are 86 such error messages in the last ~24h, and at first glance the worker ids match the duplicate directories.
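
To make that comparison less manual, the warnings can be counted per worker id. This is a minimal sketch assuming the log line format quoted in the description; the function name is hypothetical and the log file path has to be passed in, since it is not stated in this ticket.

```shell
# Hypothetical helper: print "<count> <worker-id>" for each worker id that
# appears in "Failed sending job(s) ... to worker '<id>'" warnings in LOG.
failed_assignments_per_worker() {
    local log=$1
    grep -F 'Failed sending job(s)' "$log" \
        | sed -nE "s/.*to worker '([0-9]+)'.*/\1/p" \
        | sort -n | uniq -c \
        | awk '{ print $1, $2 }'
}
```

The worker ids in its output could then be compared against the ids in the webui.worker-<id>.* directory names.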

Actions #11

Updated by tinita over 1 year ago

  • Subject changed from Leftover temporary directories in /tmp on OSD and O3 size:M to Leftover worker temporary directories in /tmp on OSD and O3 size:M
  • Description updated (diff)