action #123508
closed
[tools][teregen] Handle "one instance is already running" better for template generator size:M
Description
Motivation
Sometimes a running instance, i.e. the teregen process in the same container, gets stuck for a long time (until killed), probably due to an unresponsive socket of some 3rd-party data source like IBS. The probability is of course highest during maintenance windows, as can be seen in the log.
2023/01/19 10:15:02 W [undef] TeReGen: One instance is already running under PID 19746. Exiting.
...
2023/01/23 10:00:02 W [undef] TeReGen: One instance is already running under PID 19746. Exiting.
This effectively halts generation of all templates until the hanging process is killed and a new one is started (via cron or manually). It would be nice to have a way to forcefully stop a generation run if it takes too long.
Acceptance criteria
AC1: Templates are generated in a timely manner
AC2: Only one instance of the generator is actively trying to generate new templates
AC3: The instance which is generating templates does not run indefinitely
Suggestions
- The source of teregen is at https://gitlab.suse.de/qa-maintenance/teregen/
- teregen runs on qam and is configured in https://gitlab.suse.de/qa-maintenance/qamops
- Convert from cron to systemd timers with timeout+restart (or wait for the next interval) for the "… generate" command, e.g. based on the example MR https://gitlab.suse.de/qa-maintenance/qamops/-/merge_requests/25 (a minimal sketch follows after this list)
- Mind the other teregen processes concerning the web interface
- See the current cron schedule: crontab -l -u teregen
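A minimal sketch of what such a timer-based setup could look like (unit names, the ExecStart command and the schedule are placeholders and not taken from the linked MRs; the runtime limit is what would enforce AC3):

teregen-generate.service:
[Unit]
Description=Generate maintenance test templates

[Service]
Type=oneshot
# Placeholder command; the real invocation lives in the teregen/qamops repositories
ExecStart=/usr/bin/teregen generate
# Kill a run that hangs on an unresponsive data source (AC3);
# on older systemd versions, TimeoutStartSec= has the same effect for Type=oneshot
RuntimeMaxSec=1h

teregen-generate.timer:
[Unit]
Description=Periodically run the template generator

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target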
Updated by jbaier_cz almost 2 years ago
- Related to action #90917: [teregen] Add notification about errors in template generating added
Updated by okurz almost 2 years ago
- Subject changed from [tools][teregen] Handle "one instance is already running" better for template generator to [tools][teregen] Handle "one instance is already running" better for template generator size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz almost 2 years ago
- Copied to action #123700: [tools][teregen] Alert if systemd services fail on qam.suse.de, e.g. using our salt states in parallel with ansible added
Updated by jbaier_cz almost 2 years ago
- Status changed from Workable to In Progress
- Assignee set to jbaier_cz
Updated by jbaier_cz almost 2 years ago
- Status changed from In Progress to Feedback
MR which provides a systemd timer as a drop-in replacement for cron: https://gitlab.suse.de/qa-maintenance/teregen/-/merge_requests/20
MR which deploys the provided systemd units: https://gitlab.suse.de/qa-maintenance/qamops/-/merge_requests/26
Updated by okurz almost 2 years ago
Both MRs merged. On qam.suse.de the service teregen.service is in "failed" state, but systemctl --user --machine=teregen@ status teregen looks fine. I called systemctl reset-failed to make that go away. However, I see a problem with https://gitlab.suse.de/qa-maintenance/teregen/-/merge_requests/20/diffs#f4cfb93bcaa3788ce6527b52ad493574c693aa57_0_12 stating that teregen as a system service should be pulled in by default.target; that will conflict with the user service, won't it?
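For reference, the checks from above in one place (the unit name for reset-failed is implied, not spelled out above):

systemctl status teregen.service                    # system-level unit that showed "failed"
systemctl --user --machine=teregen@ status teregen  # the generator running under the teregen user, looks fine
systemctl reset-failed teregen.service              # clear the stale failed state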
Updated by jbaier_cz almost 2 years ago
okurz wrote:
Both MRs merged. On qam.suse.de the service teregen.service is in "failed" state, but systemctl --user --machine=teregen@ status teregen looks fine. I called systemctl reset-failed to make that go away. However, I see a problem with https://gitlab.suse.de/qa-maintenance/teregen/-/merge_requests/20/diffs#f4cfb93bcaa3788ce6527b52ad493574c693aa57_0_12 stating that teregen as a system service should be pulled in by default.target; that will conflict with the user service, won't it?
The default.target is valid for user services as well. While adding the timer and the generator service, I also converted the main teregen service from a system service to a user service, but forgot to run disable on the system level (so there was still a symlink in /etc/systemd/system/multi-user.target.wants). I fixed that, so there shouldn't be any conflict anymore.
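For reference, the cleanup roughly amounts to the following (a sketch; the exact way the user unit is enabled may differ from what the qamops MR does):

# drop the stale symlink from /etc/systemd/system/multi-user.target.wants
systemctl disable teregen.service
# keep the teregen user manager (and thus the user unit) running without a login session
loginctl enable-linger teregen
# enable and start the service under the teregen user manager
systemctl --user --machine=teregen@ enable --now teregen.service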
Updated by okurz almost 2 years ago
- Due date set to 2023-02-10
Good. What else do you see as necessary to resolve this? How can we confirm that reports are generated and that there are no conflicting processes without having to wait for weeks?
Updated by jbaier_cz almost 2 years ago
okurz wrote:
Good. What else do you see as necessary to resolve this? How can we confirm that reports are generated and that there are no conflicting processes without having to wait for weeks?
I wanted to wait a few days to see whether the new reports are generated properly. That part can be considered completed (there are already some new reports). For the second part, I do not have a clear reproducer for the issue (some careful work with the firewall and a tarpit for selected connections might be needed to simulate the problem), so we may need to fall back to monitoring the situation. In the worst case, our solution will be no better than the old one and the UV squad will eventually inform us about missing templates.
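A rough sketch of how such a hang could be simulated, should we ever need a reproducer (host and port are placeholders, not taken from this ticket): drop outgoing packets towards the data source so connections hang instead of failing fast, then check whether the runtime limit kicks in.

# make connections to the (placeholder) data source hang; run as root on the host
iptables -I OUTPUT -p tcp -d <ibs-host> --dport 443 -j DROP
# ... wait for the generator run to exceed its runtime limit ...
iptables -D OUTPUT -p tcp -d <ibs-host> --dport 443 -j DROP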
As the service now does exactly the same thing I do when fixing the issue manually, and because the issue is not a critical one (basically anyone from the UV or Tools team can check and restart the service), I would consider this solved (and reopen if necessary).
Updated by okurz almost 2 years ago
- Due date deleted (2023-02-10)
- Status changed from Feedback to Resolved
Agreed