Job life cycle not always covered by events
I'm currently working with events, again. It would be beneficial to this work if all job life cycles were fully covered by events, unless something truly weird happens - i.e. for every openqa_job_create event that happens, there should be a corresponding job-went-away event (at least one of openqa_job_done, or possibly openqa_job_delete). However, I don't believe this is currently the case.
To give a specific example: cancelling (or, I think, restarting or duplicating) a job with children that are scheduled, but not running. Any children that are running should get an openqa_job_done...I think?...but I don't believe scheduled children do. If I'm following the flow correctly, their state just gets changed in the database, but no event is emitted. So anything that's trying to follow the life cycle of a given job by events will be left hanging, wondering what happened to it.
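To illustrate what "left hanging" means for a consumer, here is a minimal sketch of an event-driven tracker. The event names come from this ticket; the `(name, job_id)` tuple shape, the `LifecycleTracker` class and the job ids are assumptions made up for the example, not openQA's actual payload format or code.

```python
from dataclasses import dataclass, field

# Event names as used in this ticket; the (name, job_id) shape is an assumption.
CREATE = "openqa_job_create"
TERMINAL = {"openqa_job_done", "openqa_job_delete"}


@dataclass
class LifecycleTracker:
    """Hypothetical consumer that follows jobs purely by events."""
    open_jobs: set = field(default_factory=set)

    def handle(self, event_name, job_id):
        if event_name == CREATE:
            self.open_jobs.add(job_id)
        elif event_name in TERMINAL:
            self.open_jobs.discard(job_id)

    def dangling(self):
        # Jobs we saw created but never saw finish or get deleted.
        return sorted(self.open_jobs)


tracker = LifecycleTracker()
events = [
    ("openqa_job_create", 1),  # parent job
    ("openqa_job_create", 2),  # child that is running when the parent is cancelled
    ("openqa_job_create", 3),  # child that is still only scheduled
    ("openqa_job_done", 1),
    ("openqa_job_done", 2),
    # nothing is emitted for job 3: its state is only changed in the database
]
for name, job_id in events:
    tracker.handle(name, job_id)

print(tracker.dangling())
```

With this event stream the tracker still considers job 3 open, which is exactly the situation described above for scheduled children.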
Also, cancelling an ISO emits 'openqa_cancel_iso' and then just calls the database cancel_by_settings (not the web API one, which emits events) on the ISO value. Again I think this will result in job_done events for running jobs (I don't totally remember how that happens - I think it's because ultimately a 'stop doing that!' signal is sent to the worker, and the worker winds up going back through the web API to say 'I stopped now!', or something like that), but no specific events for scheduled jobs. Anything trying to keep track of job life cycles would have to catch the cancel_by_settings message and do quite a lot of work to figure out which previously-scheduled jobs just got cancelled.
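As a rough sketch of the "quite a lot of work" a consumer would have to do: it needs its own record of every job's state and settings, and on an ISO cancellation it has to guess which scheduled jobs were affected. Everything here (the `known_jobs` structure, the function names, the assumption that the cancel event carries only the ISO value) is hypothetical and only meant to show the shape of the problem.

```python
# Hypothetical consumer-side bookkeeping: job id -> last known state and settings.
known_jobs = {
    10: {"state": "running", "settings": {"ISO": "example.iso"}},
    11: {"state": "scheduled", "settings": {"ISO": "example.iso"}},
    12: {"state": "scheduled", "settings": {"ISO": "other.iso"}},
}


def scheduled_jobs_for_iso(jobs, iso):
    """Return ids of jobs we believe are scheduled for the given ISO."""
    return sorted(
        job_id
        for job_id, info in jobs.items()
        if info["state"] == "scheduled" and info["settings"].get("ISO") == iso
    )


def handle_cancel_iso(jobs, iso):
    """Mark matching scheduled jobs as cancelled, since no per-job event will arrive."""
    cancelled = scheduled_jobs_for_iso(jobs, iso)
    for job_id in cancelled:
        jobs[job_id]["state"] = "cancelled"
    return cancelled


print(handle_cancel_iso(known_jobs, "example.iso"))
```

Job 10 is deliberately left alone here, on the assumption that a running job will eventually produce its own job_done event.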
I don't know if this is a goal of openQA at all, and if so how high a priority fixing it would be, but I thought it was worth bringing up, at least.
#2 Updated by AdamWill over 4 years ago
Huh, interesting - I thought it might be quite a complex one, because it involves the interactions between the components of openQA, and might involve re-architecting that a bit? Maybe there's a way I don't see, but it seems like for instance it wouldn't be right to have the database object methods start emitting events (if they even can). It requires the stuff that currently works by just kinda poking something in the database to be changed to run through some code path where an event can reasonably be emitted.
But, maybe that's less of a disruptive thing than I imagine :) Anyhow, it'd be great to have this. Like I always say, I'll work on it if I can, but...that's kind of an unknown with possible other priorities.
#3 Updated by EDiGiacinto over 4 years ago
Don't get me wrong here - it's not an 'easy' task, because it needs particular attention to detail, but on the other hand it suits developers who are new to the openQA codebase, or are getting more familiar with it, very well; to me it sounds very educational, since when developing this feature you have to follow the whole job path :)
#4 Updated by AdamWill over 4 years ago
Another note on this, just for reference: I'm pretty sure we never get job_set_running events any more. I don't care a lot about 'waiting', tbh, but 'running' is pretty important. I think with the last major scheduler rewrite, nothing ever goes through the API set_running endpoint any more; the scheduler does set_running in the database, it doesn't hit the API. AFAICS nothing anywhere does any form of set_waiting any more.
#5 Updated by AdamWill over 4 years ago
Another little note here: there are a couple of points where we basically emit events from the database schema. This seems...kinda awful, but maybe my instinct is wrong? They are in lib/OpenQA/Schema/Result/Workers.pm and lib/OpenQA/Schema/Result/Bugs.pm .
I'm currently thinking along the lines of allowing plugins for the other server apps besides webui, and having a fedmsg plugin in the scheduler, for the purpose I actually care about here (ensuring we emit fedmsgs covering the whole life cycle of each job...)
#8 Updated by okurz about 2 years ago
- Status changed from New to Resolved
Since then the situation has changed a bit: rabbitmq is now used in various ways, and I think in general we are fine with the events that are there. It might be true that not everything is covered by events, but I guess this just shows which parts of the workflow are handled completely inside openQA, so there is no need to handle them externally anyway.
#9 Updated by AdamWill about 2 years ago
I haven't checked this in detail lately, but just to note it wasn't about "handling" things externally really, but monitoring them. We have this 'CI Dashboard' thing in RH-land which wants to monitor various test systems based on standardized message bus messages. The idea is that whether it's openQA or Jenkins or whatever else, if a system is testing a given Thing, it will send out similar messages at the respective points of the test's life cycle - scheduled, running, cancelled/complete/aborted - and the dashboard can show all the states from the various systems. But this only works if we actually can publish messages at each point in the life cycle, for each job. It's going to confuse things if we don't send out 'scheduled' for a job but it suddenly shows up as 'complete', or on the other hand if we send out 'running' for a job but never send out 'complete' or 'cancelled'...
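To make the two failure modes concrete, here's a small sketch of the kind of check a dashboard could run over the messages it received for one job. The state names are the ones listed above; the transition table and the `check_sequence` helper are my own assumptions for illustration, not something the dashboard or openQA actually defines.

```python
# Assumed transition table; only the state names themselves come from the comment above.
ALLOWED = {
    None: {"scheduled"},
    "scheduled": {"running", "cancelled"},
    "running": {"complete", "cancelled", "aborted"},
}
TERMINAL = {"complete", "cancelled", "aborted"}


def check_sequence(states):
    """Return a list of problems found in one job's sequence of state messages."""
    problems = []
    current = None
    for state in states:
        if state not in ALLOWED.get(current, set()):
            problems.append(f"unexpected {state!r} after {current!r}")
        current = state
    if current not in TERMINAL:
        problems.append(f"no terminal message after {current!r}")
    return problems


# A job that suddenly shows up as 'complete' without ever being 'scheduled'.
print(check_sequence(["complete"]))
# A job that went 'running' but never reported an end state.
print(check_sequence(["scheduled", "running"]))
# A fully covered life cycle.
print(check_sequence(["scheduled", "running", "complete"]))
```

Only the third sequence comes back with no problems; the first two correspond to the missing-'scheduled' and missing-'complete' cases.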