action #123508
closed
[tools][teregen] Handle "one instance is already running" better for template generator size:M
Added by jbaier_cz almost 2 years ago.
Updated almost 2 years ago.
Description
Motivation
Sometimes a running instance (i.e. a teregen process in the same container) gets stuck for a long time, until killed, probably due to an unresponsive socket towards some 3rd-party data source like IBS. The probability is of course highest during maintenance windows, as can be seen in the log:
2023/01/19 10:15:02 W [undef] TeReGen: One instance is already running under PID 19746. Exiting.
...
2023/01/23 10:00:02 W [undef] TeReGen: One instance is already running under PID 19746. Exiting.
This effectively halts generation of all templates until the hanging process is killed and a new one is started (via cron or manually). It would be nice to have the possibility to forcefully abort generation if it takes too long.
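Until such a mechanism exists, recovery is manual; roughly the following sketch, where 19746 is the PID reported in the log excerpt above (the exact steps are an assumption, not a documented procedure):

  # How long has the lock-holding instance been running already?
  ps -o etimes=,cmd= -p 19746
  # Terminate it; escalate to SIGKILL if it ignores SIGTERM
  kill 19746
  # The next cron run (or a manual start) then picks up generation again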
Acceptance criteria
AC1: Templates are generated in a timely manner
AC2: Only one instance of the generator is actively trying to generate new templates
AC3: The instance which is generating templates does not run indefinitely
Suggestions
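One possible direction, sketched below: let systemd serialize and time-limit the runs instead of relying on the PID-file check. The unit names, paths and the one-hour limit are assumptions for illustration, not a merged implementation. A timer replaces the cron entry, and Type=oneshot together with TimeoutStartSec= makes systemd kill a run that hangs, so a stuck instance can no longer block all future runs:

  # teregen.service -- hypothetical user unit, values are assumptions
  [Unit]
  Description=Template generator run

  [Service]
  Type=oneshot
  ExecStart=/usr/bin/teregen
  # Hard upper bound per run (AC3): for Type=oneshot the activation
  # timeout is the relevant limit, as RuntimeMaxSec= does not apply here
  TimeoutStartSec=1h

  # teregen.timer -- hypothetical replacement for the cron entry
  [Unit]
  Description=Periodic template generation

  [Timer]
  OnCalendar=hourly

  [Install]
  WantedBy=timers.target

If the timer elapses while a run is still in progress, the queued start job merges with the one already running, so only one instance is active at a time (AC2) and every run stays bounded (AC3).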
Related issues
2 (2 open — 0 closed)
- Related to action #90917: [teregen] Add notification about errors in template generating added
- Subject changed from [tools][teregen] Handle "one instance is already running" better for template generator to [tools][teregen] Handle "one instance is already running" better for template generator size:M
- Description updated (diff)
- Status changed from New to Workable
- Copied to action #123700: [tools][teregen] Alert if systemd services fail on qam.suse.de, e.g. using our salt states in parallel with ansible added
- Status changed from Workable to In Progress
- Assignee set to jbaier_cz
- Status changed from In Progress to Feedback
okurz wrote:
both MRs merged. On qam.suse.de the service teregen.service is in the "failed" state, but systemctl --user --machine=teregen@ status teregen looks fine. I called systemctl reset-failed to make that go away. But I see a problem with https://gitlab.suse.de/qa-maintenance/teregen/-/merge_requests/20/diffs#f4cfb93bcaa3788ce6527b52ad493574c693aa57_0_12 stating that teregen as a system service should be pulled in by default.target; that will conflict with the user service, won't it?
The default.target should be valid even for user services. When I added the timer and the generator service, I also converted the main teregen service from a system service to a user service, but forgot to run disable at the system level (so there was still a symlink in /etc/systemd/system/multi-user.target.wants). I fixed that, so there shouldn't be any conflict anymore.
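For the record, the cleanup and verification described above boils down to something like the following (the unit name teregen.service is assumed from this thread):

  # Look for a leftover system-level enablement symlink
  ls -l /etc/systemd/system/multi-user.target.wants/ | grep teregen
  # Remove the system-level enablement so only the user service remains
  systemctl disable teregen.service
  # Check the service from inside the teregen user's service manager
  systemctl --user --machine=teregen@ status teregen
  # Clear the stale "failed" state left over from the old setup
  systemctl reset-failed teregen.service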
- Due date set to 2023-02-10
Good. What else do you see as necessary to resolve? How can we confirm that reports are generated and there are no conflicting processes without needing to wait for weeks?
okurz wrote:
Good. What else do you see as necessary to resolve? How can we confirm that reports are generated and there are no conflicting processes without needing to wait for weeks?
I wanted to wait a few days to see whether the new reports are generated properly. That part can be considered completed (there already are some new reports). For the second part, I do not have a clear reproducer for the issue (some careful work with the firewall and a tarpit for selected connections would be needed to simulate the problem, maybe), so we might need to fall back to monitoring the situation. In the worst case, our solution will be no better than the old one and the UV squad will eventually inform us about missing templates.
As the service now does exactly what I do manually when fixing the issue, and because the issue is not a critical one (basically anyone from the UV or Tools team can check and restart the service), I would consider this solved (and reopen if necessary).
- Due date deleted (2023-02-10)
- Status changed from Feedback to Resolved