action #57689
closed
asset cleanup jobs do not run on o3 (results cleanup works), workaround: unlock locks manually
Description
Observation
I am receiving monitoring alerts for o3 about the usage of /space. https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=ariel-opensuse.suse.de&service=space%20partition#pnp_th2/1568826977/1568916977/0 shows the details. This happened on 2019-10-03 for the third time. I monitored the space usage and saw that /space was nearly depleted (only 100 GB of free space left).
On https://openqa.opensuse.org/minion/jobs?state=inactive I see "limit_assets" jobs. https://openqa.opensuse.org/minion/jobs?state=finished&offset=0&task=limit_assets does not show any successfully finished "limit_assets" jobs, while https://openqa.opensuse.org/minion/jobs?state=finished&offset=0&task=limit_results_and_logs looks fine. https://openqa.opensuse.org/minion/jobs?state=failed mentions "limit_assets".
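For the record, the same state can also be queried on the host itself without the web UI via the generic Minion job command exposed by the openQA script. This is only a sketch; it assumes the usual script location /usr/share/openqa/script/openqa and that it is run as a user with access to the openqa database:
# queued cleanup jobs
$ /usr/share/openqa/script/openqa minion job -S inactive
# successfully finished asset cleanup jobs, none in this case
$ /usr/share/openqa/script/openqa minion job -S finished -t limit_assets
# results/logs cleanup for comparison, this one looks fine
$ /usr/share/openqa/script/openqa minion job -S finished -t limit_results_and_logs
# failed jobs mentioning limit_assets
$ /usr/share/openqa/script/openqa minion job -S failed -t limit_assets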
Updated by okurz about 5 years ago
- Related to action #57683: o3 /space is nearly running out again, assets are not refreshed, not cleaned up (was: too much logs&results) added
Updated by okurz about 5 years ago
- Subject changed from asset and results cleanup jobs do not run on o3, workaround: unlock locks manually to asset cleanup jobs do not run on o3 (results cleanup works), workaround: unlock locks manually
- Description updated
- Priority changed from High to Normal
Wildly clicking around on https://openqa.opensuse.org/minion/jobs does something, e.g. restarting limit_assets jobs, removing the lock, or removing failed and inactive jobs. It seems a "limit_assets" job is running now; the equivalent command line steps are sketched below. In strace I can see that the gru job iterates over some repo folders. Why it needs to list every single file in an individual repo folder I do not know; it seems inefficient. However, it finished successfully for now.
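A sketch of the command line equivalent of that clicking, under the same assumptions as above (the job ID is a placeholder that has to be taken from the failed-jobs listing first):
# list the failed cleanup jobs with their IDs
$ /usr/share/openqa/script/openqa minion job -S failed -t limit_assets
# retry one of them (1234 is a placeholder ID)
$ /usr/share/openqa/script/openqa minion job --retry 1234
# or remove it entirely
$ /usr/share/openqa/script/openqa minion job --remove 1234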
Updated by coolo about 5 years ago
- Target version set to Ready
The problem seems to be a restart of the gru service killing limit_assets and leaving locks behind. And with the large lock timeout we set, this can easily become a problem. Having a day without cleaning up assets can easily kill the space :(
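To make the failure mode concrete: the cleanup task holds a named Minion lock with a long expiration, and when gru is killed while holding it, the corresponding row stays in the minion_locks table until it expires, so every following limit_assets job bails out immediately. Leftover locks can be spotted directly in the database; a sketch assuming the standard Minion Pg schema and a local "openqa" database, run as a user with database access:
# any lock whose expiry lies far in the future while no cleanup job is running is a leftover
$ psql openqa -c 'select id, name, expires from minion_locks order by expires;'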
Updated by kraih about 5 years ago
I'll add a feature upstream to Minion that will allow us to reset all locks quickly whenever we restart the gru service. It looks like the leftover locks are the result of systemd stopping the job with SIGKILL when it takes too long to stop on its own after SIGTERM.
Updated by kraih about 5 years ago
Opened a PR with another possible solution: https://github.com/os-autoinst/openQA/pull/2492
Updated by okurz about 5 years ago
- Due date set to 2019-12-10
- Status changed from New to Feedback
- Assignee set to okurz
We closed the PR in the meantime and are waiting for an upstream patch including https://github.com/mojolicious/minion/compare/d045708eeeb7...105612d4787b .
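Once a Minion release containing that change is available, leftover locks should be removable with a single API call instead of raw SQL. A sketch, assuming the Mojolicious eval command is available through the openQA script and that the reset method accepts a locks option; the exact interface may differ in the final upstream release:
# reset only the Minion locks, leaving jobs and workers untouched
$ /usr/share/openqa/script/openqa eval 'app->minion->reset({locks => 1})'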
Here is what I did on o3 for now:
$ cat /etc/systemd/system/openqa-gru.service.d/override.conf
[Service]
ExecStop=/usr/bin/psql openqa -c 'delete from minion_locks;'
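To activate the drop-in and verify it, roughly the following should do (a sketch; note that the ExecStop statement wipes all Minion locks, which is acceptable here since the gru cleanup tasks are their only users, and the psql calls need to run as a user with access to the openqa database):
$ systemctl daemon-reload
# stopping the service should now clear the lock table
$ systemctl stop openqa-gru
$ psql openqa -c 'select count(*) from minion_locks;'
$ systemctl start openqa-gru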
Waiting for https://build.opensuse.org/request/show/749518
EDIT: 2019-11-21: And https://github.com/os-autoinst/openQA/pull/2532 updated the dependencies for perl-Minion
Updated by okurz about 5 years ago
- Status changed from Feedback to Resolved
Another PR by kraih resets the locks on startup of the gru service: https://github.com/os-autoinst/openQA/pull/2546 . It has been merged and deployed to o3. I removed the workaround. The service looks fine so far.