action #123508
closed
[tools][teregen] Handle "one instance is already running" better for template generator size:M
Description
Motivation
Sometimes a running instance, i.e. the teregen process in the same container, gets stuck for a long time (until killed), probably due to an unresponsive socket of some 3rd-party data source like IBS. The probability is of course highest during maintenance windows, as can be seen in the log.
2023/01/19 10:15:02 W [undef] TeReGen: One instance is already running under PID 19746. Exiting.
...
2023/01/23 10:00:02 W [undef] TeReGen: One instance is already running under PID 19746. Exiting.
This effectively halts generation of all templates until the hanging process is killed and a new one is started (via cron or manually). It would be nice to have a way to forcefully stop a generation run if it takes too long.
Acceptance criteria
AC1: Templates are generated in a timely manner
AC2: Only one instance of the generator is actively trying to generate new templates
AC3: The instance which is generating templates does not run indefinitely
Suggestions
- The source of teregen is at https://gitlab.suse.de/qa-maintenance/teregen/
- teregen runs on qam and is configured in https://gitlab.suse.de/qa-maintenance/qamops
- Convert from cron to systemd timers with timeout+restart (or wait for the next interval) for the "… generate" command, e.g. based on the example MR https://gitlab.suse.de/qa-maintenance/qamops/-/merge_requests/25 (a minimal sketch follows after this list)
- Mind the other teregen processes concerning the web interface
- See the current cron schedule: crontab -l -u teregen
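A minimal sketch of what such a timer-based setup could look like (unit names, the ExecStart command and the schedule are placeholders and not taken from the linked MRs; the runtime limit is what would enforce AC3):

teregen-generate.service:
[Unit]
Description=Generate maintenance test templates

[Service]
Type=oneshot
# Placeholder command; the real invocation lives in the teregen/qamops repositories
ExecStart=/usr/bin/teregen generate
# Kill a run that hangs on an unresponsive data source (AC3);
# on older systemd versions, TimeoutStartSec= has the same effect for Type=oneshot
RuntimeMaxSec=1h

teregen-generate.timer:
[Unit]
Description=Periodically run the template generator

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target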
Updated by jbaier_cz almost 2 years ago
- Related to action #90917: [teregen] Add notification about errors in template generating added
Updated by okurz almost 2 years ago
- Subject changed from [tools][teregen] Handle "one instance is already running" better for template generator to [tools][teregen] Handle "one instance is already running" better for template generator size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz almost 2 years ago
- Copied to action #123700: [tools][teregen] Alert if systemd services fail on qam.suse.de, e.g. using our salt states in parallel with ansible added
Updated by jbaier_cz almost 2 years ago
- Status changed from Workable to In Progress
- Assignee set to jbaier_cz
Updated by jbaier_cz almost 2 years ago
- Status changed from In Progress to Feedback
MR which provides a systemd timer as a drop-in replacement for cron: https://gitlab.suse.de/qa-maintenance/teregen/-/merge_requests/20
MR which deploys the provided systemd units: https://gitlab.suse.de/qa-maintenance/qamops/-/merge_requests/26
Updated by okurz almost 2 years ago
Both MRs merged. On qam.suse.de the service teregen.service is in "failed" state, but systemctl --user --machine=teregen@ status teregen looks fine. I called systemctl reset-failed to make that go away. However, I see a problem with https://gitlab.suse.de/qa-maintenance/teregen/-/merge_requests/20/diffs#f4cfb93bcaa3788ce6527b52ad493574c693aa57_0_12 stating that teregen as a system service should be pulled in by default.target; that will conflict with the user service, won't it?
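For reference, the checks from above in one place (the unit name for reset-failed is implied, not spelled out above):

systemctl status teregen.service                    # system-level unit that showed "failed"
systemctl --user --machine=teregen@ status teregen  # the generator running under the teregen user, looks fine
systemctl reset-failed teregen.service              # clear the stale failed state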
Updated by jbaier_cz almost 2 years ago
okurz wrote:
Both MRs merged. On qam.suse.de the service teregen.service is in "failed" state, but systemctl --user --machine=teregen@ status teregen looks fine. I called systemctl reset-failed to make that go away. However, I see a problem with https://gitlab.suse.de/qa-maintenance/teregen/-/merge_requests/20/diffs#f4cfb93bcaa3788ce6527b52ad493574c693aa57_0_12 stating that teregen as a system service should be pulled in by default.target; that will conflict with the user service, won't it?
The default.target is valid for user services as well. While adding the timer and the generator service, I also converted the main teregen service from a system service to a user service, but forgot to run disable on the system level (so there was still a symlink in /etc/systemd/system/multi-user.target.wants). I fixed that, so there shouldn't be any conflict anymore.
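For reference, the cleanup roughly amounts to the following (a sketch; the exact way the user unit is enabled may differ from what the qamops MR does):

# drop the stale symlink from /etc/systemd/system/multi-user.target.wants
systemctl disable teregen.service
# keep the teregen user manager (and thus the user unit) running without a login session
loginctl enable-linger teregen
# enable and start the service under the teregen user manager
systemctl --user --machine=teregen@ enable --now teregen.service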
Updated by okurz almost 2 years ago
- Due date set to 2023-02-10
Good. What else do you see as necessary to resolve this? How can we confirm that reports are generated and that there are no conflicting processes without having to wait for weeks?
Updated by jbaier_cz almost 2 years ago
okurz wrote:
Good. What else do you see as necessary to resolve this? How can we confirm that reports are generated and that there are no conflicting processes without having to wait for weeks?
I wanted to wait a few days to see whether the new reports are generated properly. That part can be considered completed (there are already some new reports). For the second part, I do not have a clear reproducer for the issue (some careful work with the firewall and a tarpit for selected connections might be needed to simulate the problem), so we may need to fall back to monitoring the situation. In the worst case, our solution will be no better than the old one and the UV squad will eventually inform us about missing templates.
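A rough sketch of how such a hang could be simulated, should we ever need a reproducer (host and port are placeholders, not taken from this ticket): drop outgoing packets towards the data source so connections hang instead of failing fast, then check whether the runtime limit kicks in.

# make connections to the (placeholder) data source hang; run as root on the host
iptables -I OUTPUT -p tcp -d <ibs-host> --dport 443 -j DROP
# ... wait for the generator run to exceed its runtime limit ...
iptables -D OUTPUT -p tcp -d <ibs-host> --dport 443 -j DROP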
As the service now does exactly the same thing I do when fixing the issue manually, and because the issue is not a critical one (basically anyone from the UV or Tools team can check and restart the service), I would consider this solved (and reopen if necessary).
Updated by okurz almost 2 years ago
- Due date deleted (2023-02-10)
- Status changed from Feedback to Resolved
Agreed