action #31069

closed

Job life cycle not always covered by events

Added by AdamWill about 6 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Feature requests
Target version:
-
Start date:
2018-01-30
Due date:
% Done:

0%

Estimated time:

Description

I'm currently working with events, again. It would be beneficial to this work if all job life cycles were fully covered by events, unless something truly weird happens - i.e. for every openqa_job_create event that happens, there should be a corresponding job-went-away event (at least one of openqa_job_cancel, openqa_job_done, or possibly openqa_job_delete). However, I don't believe this is currently the case.

To give a specific example: cancelling (or, I think, restarting or duplicating) a job with children that are scheduled, but not running. Any children that are running should get an openqa_job_done...I think?...but I don't believe scheduled children do. If I'm following the flow correctly, their state just gets changed in the database, but no event is emitted. So anything that's trying to follow the life cycle of a given job by events will be left hanging, wondering what happened to it.
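To make the "left hanging" case concrete, here is a minimal sketch of a consumer that checks the invariant described above: every openqa_job_create should eventually be matched by a terminal event. The event names come from this description; the `(name, job_id)` tuple shape and the `find_dangling_jobs` helper are hypothetical simplifications, not openQA's actual payload format or API.

```python
# Event names are taken from the description above; the (name, job_id)
# tuple shape is a hypothetical simplification, not openQA's payload format.
CREATE_EVENT = "openqa_job_create"
TERMINAL_EVENTS = {"openqa_job_cancel", "openqa_job_done", "openqa_job_delete"}


def find_dangling_jobs(events):
    """Return the sorted ids of jobs that were created but never got a terminal event."""
    open_jobs = set()
    for name, job_id in events:
        if name == CREATE_EVENT:
            open_jobs.add(job_id)
        elif name in TERMINAL_EVENTS:
            open_jobs.discard(job_id)
    return sorted(open_jobs)


# Parent job 1 is running and has a scheduled child, job 2. If cancelling the
# parent only produces events for job 1, job 2 changes state silently.
stream = [
    ("openqa_job_create", 1),
    ("openqa_job_create", 2),
    ("openqa_job_cancel", 1),
    ("openqa_job_done", 1),
]
dangling = find_dangling_jobs(stream)
print(dangling)
```

In this made-up stream, the scheduled child is the only job the consumer can never close out.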

Also, cancelling an ISO emits 'openqa_cancel_iso' and then just calls the database cancel_by_settings (not the web API one, which emits events) on the ISO value. Again I think this will result in job_done events for running jobs (I don't totally remember how that happens - I think it's because ultimately a 'stop doing that!' signal is sent to the worker, and the worker winds up going back through the web API to say 'I stopped now!', or something like that), but no specific events for scheduled jobs. Anything trying to keep track of job life cycles would have to catch the cancel_by_settings message and do quite a lot of work to figure out which previously-scheduled jobs just got cancelled.
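As a rough illustration of the "quite a lot of work" a consumer would have to do, the sketch below keeps its own index of open jobs per ISO and, on an ISO cancel, infers which jobs went away. Everything about the payloads (the `id` and `ISO` keys) and the `IsoCancelTracker` class is an assumption made for illustration. Since the consumer cannot tell scheduled jobs from running ones here, it simply treats every still-open job for that ISO as gone.

```python
TERMINAL_EVENTS = ("openqa_job_done", "openqa_job_cancel", "openqa_job_delete")


class IsoCancelTracker:
    """Infers which jobs went away when an ISO-level cancel is seen.

    Payload keys "id" and "ISO" are hypothetical, chosen for illustration.
    """

    def __init__(self):
        # ISO name -> set of job ids with no terminal event seen yet
        self.open_by_iso = {}

    def handle(self, name, payload):
        """Process one event and return the job ids inferred to be cancelled."""
        if name == "openqa_job_create":
            self.open_by_iso.setdefault(payload["ISO"], set()).add(payload["id"])
        elif name in TERMINAL_EVENTS:
            # A per-job terminal event arrived, so nothing needs inferring.
            for jobs in self.open_by_iso.values():
                jobs.discard(payload["id"])
        elif name == "openqa_cancel_iso":
            # No per-job events are expected for scheduled jobs, so guess.
            return sorted(self.open_by_iso.pop(payload["ISO"], set()))
        return []


tracker = IsoCancelTracker()
tracker.handle("openqa_job_create", {"id": 10, "ISO": "example-a.iso"})
tracker.handle("openqa_job_create", {"id": 11, "ISO": "example-a.iso"})
tracker.handle("openqa_job_create", {"id": 12, "ISO": "example-b.iso"})
tracker.handle("openqa_job_done", {"id": 10})
cancelled = tracker.handle("openqa_cancel_iso", {"ISO": "example-a.iso"})
print(cancelled)
```

The point is that the consumer has to duplicate state that openQA already holds, and still ends up guessing.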

I don't know if this is a goal of openQA at all, and if so how high a priority fixing it would be, but I thought it was worth bringing up, at least.


Related issues 1 (0 open, 1 closed)

Related to openQA Project - coordination #32851: [tools][EPIC] Scheduling redesign (Resolved, okurz, 2018-05-05)

Actions #1

Updated by EDiGiacinto about 6 years ago

+1, I would really like to have events emitted and be able to follow the job lifecycle - it would slim down a lot of code as well.

This sounds like a good entry-level issue imho.

Actions #2

Updated by AdamWill about 6 years ago

Huh, interesting - I thought it might be quite a complex one, because it involves the interactions between the components of openQA, and might involve re-architecting that a bit? Maybe there's a way I don't see, but it seems like for instance it wouldn't be right to have the database object methods start emitting events (if they even can). It requires the stuff that currently works by just kinda poking something in the database to be changed to run through some code path where an event can reasonably be emitted.

But, maybe that's less of a disruptive thing than I imagine :) Anyhow, it'd be great to have this. Like I always say, I'll work on it if I can, but...that's kind of an unknown with possible other priorities.

Actions #3

Updated by EDiGiacinto about 6 years ago

Don't get me wrong here - it's not an 'easy' task, because it needs particular attention to detail, but on the other hand it is well suited to developers who are new to the openQA codebase or are getting more familiar with it. To me it sounds very educational, since when developing this feature you have to follow the whole job path :)

Actions #4

Updated by AdamWill about 6 years ago

Another note on this, just for reference: I'm pretty sure we never get job_set_waiting or job_set_running events any more. I don't care a lot about 'waiting', tbh, but 'running' is pretty important. I think with the last major scheduler rewrite, nothing ever goes through the API set_running endpoint any more; the scheduler does set_running in the database, it doesn't hit the API. AFAICS nothing anywhere does any form of set_waiting any more.

Actions #5

Updated by AdamWill about 6 years ago

Another little note here: there are a couple of points where we basically emit events from the database schema. This seems...kinda awful, but maybe my instinct is wrong? They are in lib/OpenQA/Schema/Result/Workers.pm and lib/OpenQA/Schema/Result/Bugs.pm.

I'm currently thinking along the lines of allowing plugins for the other server apps besides webui, and having a fedmsg plugin in the scheduler, for the purpose I actually care about here (ensuring we emit fedmsgs covering the whole life cycle of each job...)

Actions #6

Updated by EDiGiacinto about 6 years ago

Actions #7

Updated by okurz almost 5 years ago

  • Category changed from 122 to Feature requests
Actions #8

Updated by okurz about 4 years ago

  • Status changed from New to Resolved

Since then the situation has changed a bit with rabbitmq, which is used in various ways, and I think in general we are fine with the events that are there. It might be true that not everything is covered by events, but I guess this just shows which parts of the workflow are covered completely by openQA internally, so there is no need to handle them externally anyway.

Actions #9

Updated by AdamWill about 4 years ago

I haven't checked this in detail lately, but just to note it wasn't about "handling" things externally really, but monitoring them. We have this 'CI Dashboard' thing in RH-land which wants to monitor various test systems based on standardized message bus messages (so the idea is that whether it's openQA or Jenkins or whatever else, if a system is testing a given Thing, it will send out similar messages at the respective points of the test's life cycle - scheduled, running, cancelled/complete/aborted - and the dashboard can show all the states from the various systems). But this only works if we actually can publish messages at each point in the life cycle, for each job. It's going to confuse things if we don't send out 'scheduled' for a job but it suddenly shows up as 'complete', or on the other hand if we send out 'running' for a job but never send out 'complete' or 'cancelled'...
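A small sketch of the kind of consistency check such a dashboard might apply, to show why gaps are confusing. The state names are the ones listed above; the transition table and the `check_messages` helper are assumptions for illustration, not a description of how the actual dashboard works.

```python
# Generic life cycle states named in the comment above; the transition table
# is an assumption about what a monitoring dashboard would accept.
ALLOWED = {
    None: {"scheduled"},
    "scheduled": {"running", "cancelled"},
    "running": {"complete", "cancelled", "aborted"},
}
TERMINAL = {"complete", "cancelled", "aborted"}


def check_messages(messages):
    """Return a list of problems found in a sequence of (job_id, state) messages."""
    last = {}
    problems = []
    for job_id, state in messages:
        previous = last.get(job_id)
        if state not in ALLOWED.get(previous, set()):
            problems.append(f"job {job_id}: unexpected {state!r} after {previous!r}")
        last[job_id] = state
    # Any job whose last known state is not terminal is left hanging.
    for job_id, state in last.items():
        if state not in TERMINAL:
            problems.append(f"job {job_id}: stuck in {state!r}")
    return problems


# Job 1 shows up as 'complete' without ever being 'scheduled';
# job 2 is reported 'running' but never finishes.
problems = check_messages([(1, "complete"), (2, "scheduled"), (2, "running")])
for problem in problems:
    print(problem)
```

Both cases from the comment above show up as problems: a job that appears as 'complete' out of nowhere, and a job that is 'running' forever.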
