So I just got back to this (you know...stuff's been happening...). Thanks for the pointers, mkittler, I will work down that path. Looking at the startup code, it looks to me like at least GRU does load all plugins: GRU starts up just by calling the openqa script with a gru arg, and the path the openqa script goes down does load the plugins, regardless of the gru arg. In the logs on the openQA instance I can see this:
May 02 17:58:34 openqa01.iad2.fedoraproject.org systemd[1]: Started openqa-gru.service - The openQA daemon for various background tasks like cleanup and saving needles.
May 02 17:58:37 openqa01.iad2.fedoraproject.org openqa-gru[832]: [info] Loading external plugin FedoraMessaging
The fact that it's loading our external plugin means it's loading plugins normally, so we're OK from that angle. However, I do think there's an issue here with the scheduler. The scheduler starts itself up - script/openqa-scheduler calls OpenQA::Scheduler::run - and nothing in that flow that I can see would load plugins. And indeed, the journal messages for openqa-scheduler.service show no trace of plugin loading, unlike the gru logs.
There are a few places where the scheduler can poke job state without going through the web API. It has a check it runs on a timer, every two minutes, which kills and duplicates "abandoned" jobs - incomplete_and_duplicate_stale_jobs - and that just goes through the DB models to call $job->done and $job->auto_duplicate. It also has a check - _update_scheduled_jobs - which cancels jobs that have been scheduled for more than a certain length of time; it again does that directly in the DB, calling $job->cancel on the DB model. I'm pretty sure none of those emits events now, and as you say, if we just change the DB model to emit events, that won't achieve anything useful unless we make the scheduler load at least the AMQP and audit plugins (for Fedora's purposes, it would also need to load our FedoraMessaging plugin, which is like the AMQP plugin but publishes messages in a different format).
This is getting pretty complicated at this point :/ The more I look, the more "holes" I find in the job state event emission. I'm not sure whether the best idea is to try and work on it incrementally, or try to come up with a grand plan to fix everything at once.
There are definitely ways we can work on it incrementally: we can just have the job model emit events appropriately (there are a few other holes that need patching - this is why https://bodhi.fedoraproject.org/updates/FEDORA-2023-3c2931ff23 thinks a bunch of tests that got cancelled hours ago are still 'queued' or 'running'). This won't fix everything, but it'll make things better, not worse: as long as the code gets hit by the webAPI or gru, the appropriate plugins will respond to the events, and when something other than the webAPI or gru hits the code (the scheduler, say, which doesn't load plugins), nothing is any worse than it is right now.
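To make the "better, not worse" argument concrete, here is a rough sketch in Python (not openQA's actual Perl code). Every name in it (EventBus, Job, load_plugins, the "job_cancel" event name) is made up for illustration and is not the real openQA API. It assumes each process has its own event emitter, and that only processes which load plugins have any subscribers on it:

```python
from collections import defaultdict


class EventBus:
    """Hypothetical per-process event emitter (a stand-in, not openQA's API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def on(self, name, callback):
        self._subscribers[name].append(callback)

    def emit(self, name, payload):
        # With no subscribers registered, this loop body never runs:
        # emitting is a silent no-op.
        for callback in self._subscribers[name]:
            callback(payload)


class Job:
    """Hypothetical job model that emits an event when its state changes."""

    def __init__(self, job_id, bus):
        self.id = job_id
        self.state = "scheduled"
        self._bus = bus

    def cancel(self):
        self.state = "cancelled"
        self._bus.emit("job_cancel", {"id": self.id})


def load_plugins(bus, published):
    # Stand-in for a messaging plugin: record what would be published.
    bus.on("job_cancel", lambda payload: published.append(("job_cancel", payload["id"])))


published = []

gru_bus = EventBus()
load_plugins(gru_bus, published)  # a process that loads plugins (webAPI, gru)

scheduler_bus = EventBus()        # a process that loads no plugins

job_via_gru = Job(1, gru_bus)
job_via_scheduler = Job(2, scheduler_bus)
job_via_gru.cancel()
job_via_scheduler.cancel()

print(published)  # [('job_cancel', 1)]
```

Both jobs end up cancelled, but only the one cancelled in the process with plugins loaded produces a message. The other one is exactly as silent as it is today, which is the gap that would still need the scheduler to load plugins.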
So...in a way I kinda feel like just trying to do what I can to make things a bit better without making them any worse, at least at first. What does anyone else think? I'll at least send a couple of relatively simple PRs as tests (at least enough to hopefully fix the above Fedora issue).