action #123508

closed

[tools][teregen] Handle "one instance is already running" better for template generator size:M

Added by jbaier_cz over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Low
Assignee:
Target version:
Start date:
2023-01-23
Due date:
% Done:

0%

Estimated time:

Description

Motivation

Sometimes a running instance, i.e. the teregen process in the same container, gets stuck for a long time (probably due to an unresponsive socket of some 3rd-party data source like IBS) until it is killed. The highest probability is of course during maintenance windows, as can be seen in the log.

2023/01/19 10:15:02 W [undef] TeReGen: One instance is already running under PID 19746. Exiting.
...
2023/01/23 10:00:02 W [undef] TeReGen: One instance is already running under PID 19746. Exiting.

This effectively halts generation for all templates until the hanging process is killed and a new one is started (via cron or manually). It would be nice to have the possibility to forcefully abort generation if it takes too long.

Acceptance criteria

AC1: Templates are generated in a timely manner
AC2: Only one instance of the generator is actively trying to generate new templates
AC3: The instance which is generating templates is not running indefinitely

Suggestions


Related issues (2 open, 0 closed)

Related to QA - action #90917: [teregen] Add notification about errors in template generating (New, 2021-04-09)

Copied to QA - action #123700: [tools][teregen] Alert if systemd services fail on qam.suse.de, e.g. using our salt states in parallel with ansible (New, 2023-01-23)

Actions #1

Updated by jbaier_cz over 1 year ago

  • Related to action #90917: [teregen] Add notification about errors in template generating added
Actions #2

Updated by okurz over 1 year ago

  • Subject changed from [tools][teregen] Handle "one instance is already running" better for template generator to [tools][teregen] Handle "one instance is already running" better for template generator size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz over 1 year ago

  • Copied to action #123700: [tools][teregen] Alert if systemd services fail on qam.suse.de, e.g. using our salt states in parallel with ansible added
Actions #4

Updated by jbaier_cz over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to jbaier_cz
Actions #5

Updated by jbaier_cz over 1 year ago

  • Status changed from In Progress to Feedback

MR which provides a systemd.timer drop-in replacement for cron: https://gitlab.suse.de/qa-maintenance/teregen/-/merge_requests/20
MR which deploys the provided systemd units: https://gitlab.suse.de/qa-maintenance/qamops/-/merge_requests/26
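For illustration, a minimal sketch of what such a timer/service pair could look like. The unit names, the path to the generator, the schedule and the timeout value are assumptions made for this sketch, not necessarily what the MRs actually implement; see the MRs above for the real units.

  # teregen.service (user unit) - one generation run, with an upper bound on its runtime
  [Unit]
  Description=Template generator run

  [Service]
  Type=oneshot
  # hypothetical path to the generator binary
  ExecStart=/usr/local/bin/teregen
  # for Type=oneshot, TimeoutStartSec limits how long ExecStart may run,
  # so a run hanging on an unresponsive data source is killed (AC3)
  TimeoutStartSec=2h

  # teregen.timer (user unit) - replaces the cron entry
  [Unit]
  Description=Periodic template generation

  [Timer]
  # assumed schedule
  OnCalendar=hourly
  Persistent=true

  [Install]
  WantedBy=timers.target

Because the timer only starts the oneshot unit and systemd never runs two instances of the same unit at once, this also gives the single-instance behaviour of AC2 at the unit level, and a killed run does not block the next timer activation.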

Actions #6

Updated by okurz over 1 year ago

both MRs merged. On qam.suse.de the service teregen.service is in "failed" state. But systemctl --user --machine=teregen@ status teregen looks fine. I called systemctl reset-failed to make that go away. But I see a problem with https://gitlab.suse.de/qa-maintenance/teregen/-/merge_requests/20/diffs#f4cfb93bcaa3788ce6527b52ad493574c693aa57_0_12 stating that teregen as a system service should be pulled in by default.target, so that will conflict with the user service, doesn't it?

Actions #7

Updated by jbaier_cz over 1 year ago

okurz wrote:

both MRs merged. On qam.suse.de the service teregen.service is in "failed" state. But systemctl --user --machine=teregen@ status teregen looks fine. I called systemctl reset-failed to make that go away. But I see a problem with https://gitlab.suse.de/qa-maintenance/teregen/-/merge_requests/20/diffs#f4cfb93bcaa3788ce6527b52ad493574c693aa57_0_12 stating that teregen as a system service should be pulled in by default.target, so that will conflict with the user service, doesn't it?

The default.target should be valid even for user services. As I did the timer and the generator service, I also converted the main teregen service from a system service to a user service, but forgot to run disable on the system level (so there was still a symlink from /etc/systemd/system/multi-user.target.wants). I fixed that, so there shouldn't be any conflict anymore.
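For reference, the cleanup described above roughly corresponds to the following commands; the system-level unit name teregen.service is an assumption based on the comments above.

  # check whether the old system-level unit is still hooked into multi-user.target
  systemctl is-enabled teregen.service
  ls -l /etc/systemd/system/multi-user.target.wants/
  # drop the leftover system-level enablement and clear the stale "failed" state
  systemctl disable teregen.service
  systemctl reset-failed teregen.service
  # the user instance keeps running the generator
  systemctl --user --machine=teregen@ status teregen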

Actions #8

Updated by okurz over 1 year ago

  • Due date set to 2023-02-10

Good. What else do you see as necessary to resolve? How can we confirm that reports are generated and there are no conflicting processes without needing to wait for weeks?

Actions #9

Updated by jbaier_cz over 1 year ago

okurz wrote:

Good. What else do you see as necessary to resolve? How can we confirm that reports are generated and there are no conflicting processes without needing to wait for weeks?

I wanted to wait for a few days to see if the new reports are generated properly. This part can be considered completed (there already are some new reports). For the second part, I do not have a clear reproducer for the issue (some careful work with the firewall and a tarpit for selected connections would be needed to simulate the problem, maybe), so we might need to fall back to monitoring the situation. In the worst case, our solution will not be better than the old one and the UV squad will eventually inform us about missing templates.

As the service now does exactly the same thing I do when fixing the issue, and because the issue is not a critical one (basically anyone from the UV or Tools team can check and restart the service), I would consider this solved (and reopen if necessary).
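For anyone checking on the generator later, the manual check/restart mentioned above could be done along these lines; the unit name teregen under the teregen user is an assumption based on the earlier comments.

  # see when the timer fired last and when it will fire next
  systemctl --user --machine=teregen@ list-timers
  # inspect the current run; restart it if it hangs or failed
  systemctl --user --machine=teregen@ status teregen
  systemctl --user --machine=teregen@ restart teregen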

Actions #10

Updated by okurz over 1 year ago

  • Due date deleted (2023-02-10)
  • Status changed from Feedback to Resolved

Agreed
