action #123625

open

No event emitted for jobs restarted via `RETRY`, jobs cancelled via `_job_stop_cluster`, and other cases

Added by AdamWill about 1 year ago. Updated 9 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2023-01-24
Due date:
% Done: 0%
Estimated time:

Description

If you use the RETRY auto-retry mechanism, AFAICS, when the job is restarted, no event is emitted. Logically speaking, I'd expect a job_restart event to be emitted, as it is when you restart a job manually.

This seems like it should be easy to fix so I was just going to send a PR, but actually the codepaths are kind of long and complex and there's a question without an obvious answer (to me).

RETRY handling starts in lib/OpenQA/Schema/Results/Jobs.pm done(), which checks if RETRY is set and the job failed and calls $self->auto_duplicate, which goes through a whole other pile of functions that wind up actually restarting the job. So I could just stick an emit_event in that pile somewhere - there's a precedent for using emit_event in that file, as OpenQA::App->singleton->emit_event(), in update_result(). However, there's another path where something calls $self->auto_duplicate, and that thing emits the event itself; in lib/OpenQA/WebAPI/Controller/API/V1/Job.pm, _restart() calls OpenQA::Resource::Jobs::job_restart(), which calls $job->auto_duplicate(), and then _restart() emits the event.

So, what's the best way to do this? Move the event emission somewhere under $job->auto_duplicate() and drop the emit_event() from _restart() in API/V1/Job.pm? Or have done() emit the event after calling $self->auto_duplicate(), kinda mirroring what _restart() does? Or is there a better idea? I'm not really sure.
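To make the trade-off concrete, here is a minimal sketch of the first option, written in Python rather than openQA's Perl so it runs standalone. Everything in it is an assumption for illustration, not openQA's actual code: the `EventBus` class, the `retry` field, the payload keys and the `+ 1000` clone id are all made up. It only shows that emitting from inside `auto_duplicate` yields exactly one event on both the `done()` path and the `_restart()`-style path.

```python
from dataclasses import dataclass, field


@dataclass
class EventBus:
    """Hypothetical stand-in for the app-level emit_event mechanism."""
    handlers: list = field(default_factory=list)

    def subscribe(self, handler):
        self.handlers.append(handler)

    def emit(self, name, payload):
        for handler in self.handlers:
            handler(name, payload)


@dataclass
class Job:
    id: int
    bus: EventBus
    retry: int = 0  # models the RETRY setting (assumption)

    def auto_duplicate(self):
        # Made-up clone id; the real duplication logic is far more involved.
        clone = Job(id=self.id + 1000, bus=self.bus, retry=max(self.retry - 1, 0))
        # The single place where the restart event is emitted.
        self.bus.emit("job_restart", {"id": self.id, "clone_id": clone.id})
        return clone

    def done(self, result):
        # Automatic path: only a failed job with retries left is duplicated.
        if result == "failed" and self.retry > 0:
            return self.auto_duplicate()
        return None


def api_restart(job):
    # Manual path: the caller no longer emits anything itself.
    return job.auto_duplicate()


bus = EventBus()
seen = []
bus.subscribe(lambda name, payload: seen.append((name, payload)))

Job(id=1, bus=bus, retry=1).done("failed")  # automatic restart
api_restart(Job(id=2, bus=bus))             # manual restart
print(seen)
```

The second option (each caller emits after calling `auto_duplicate`) would behave the same for these two callers, but every new caller would have to remember to emit.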

Actions #1

Updated by okurz about 1 year ago

  • Category set to Feature requests
  • Target version set to future

So the question is if the automatic retry should be considered the same kind of "restart" as the manual restarts. The idea was that only manually or externally triggered restarts would trigger the job_restart event. May I ask what would you need the event for? Maybe we can introduce another specific event?

Actions #2

Updated by AdamWill about 1 year ago

Sure. We want an external system to know when openQA jobs are scheduled (and, in future, when they start running). The most obvious way to know this is for openQA to communicate when it happens. So ideally I want there to be an event emitted any time a job is created (and also any time a job starts running, but we'll come to that later).

Note openQA has already had one case where there was previously a 'similar' event (job_duplicate) and it got rolled into the job_restart event for the sake of simplicity. So I kinda figured openQA would just want to use job_restart again in this case and not invent some new event. Practically speaking, though, it isn't really an issue: it wouldn't be a problem at all to handle a different event name with the approach we (Fedora) will be using to implement what we want to do.

Actions #3

Updated by AdamWill about 1 year ago

ping? any thoughts here? Thanks!

Actions #4

Updated by okurz about 1 year ago

I would give others some time to bring up their thoughts. SUSE currently has HackWeek so please expect a delay in comments this week.

Actions #5

Updated by AdamWill about 1 year ago

ah, thanks, wasn't aware that was going on.

Actions #6

Updated by mkittler about 1 year ago

It would make most sense if there's one restart event regardless of the case, with the event carrying additional information (e.g. the user that restarted the job, or that the restart was due to RETRY, or due to the reason matching the auto-clone regex).

It would likely make sense if the reason was passed to auto_duplicate and that function then emits the event (unless there was an error). The auto_duplicate function is e.g. called when setting a job "done". This in turn can happen from multiple services (main web UI, GRU, possibly more). So in order to make sending events from auto_duplicate work in all cases you need to ensure that plugins that use those events (AMQP, audit log) are loaded on startup of those services.
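A rough sketch of that shape, in Python rather than openQA's Perl so it can run on its own. The `reason` values, the payload keys, the `DuplicationError` type and the `simulate_failure` switch are all hypothetical; the point is only that the function receives the reason, and emits the event only when duplication succeeds.

```python
events = []  # stands in for whatever the loaded plugins would receive


class DuplicationError(Exception):
    """Hypothetical error type for a failed duplication."""


def emit_event(name, payload):
    events.append((name, payload))


def auto_duplicate(job_id, reason, simulate_failure=False):
    # On error, raise before reaching the emit call, so no event goes out.
    if simulate_failure:
        raise DuplicationError(f"could not duplicate job {job_id}")
    clone_id = job_id + 1  # made-up clone id
    emit_event("job_restart", {"id": job_id, "clone_id": clone_id, "reason": reason})
    return clone_id


auto_duplicate(10, reason="user:alice")
auto_duplicate(20, reason="RETRY")
try:
    auto_duplicate(30, reason="auto-clone regex", simulate_failure=True)
except DuplicationError:
    pass  # the failed duplication emitted nothing

print([payload["reason"] for _, payload in events])
```

A consumer could then filter on the reason field instead of needing a separate event name per restart cause.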

Actions #7

Updated by AdamWill 10 months ago

  • Subject changed from No job_restart event for jobs restarted via `RETRY` to No event emitted for jobs restarted via `RETRY`, jobs cancelled via `_job_stop_cluster`, and other cases

So I just got back to this (you know...stuff's been happening...). Thanks for the pointers, mkittler, I will work down that path. Looking at the startup code, it looks to me like at least GRU does load all plugins: GRU starts up just by calling the openqa script with a gru arg, and the path the openqa script goes down does load the plugins, regardless of the gru arg. In the logs on the openQA instance I can see this:

May 02 17:58:34 openqa01.iad2.fedoraproject.org systemd[1]: Started openqa-gru.service - The openQA daemon for various background tasks like cleanup and saving needles.
May 02 17:58:37 openqa01.iad2.fedoraproject.org openqa-gru[832]: [info] Loading external plugin FedoraMessaging

The fact that it's loading our external plugin means it's loading plugins normally, so we're OK from that angle. However, I do think there's an issue here with the scheduler. The scheduler starts itself up (script/openqa-scheduler calls OpenQA::Scheduler::run) and nothing in that flow that I can see would load plugins. And indeed, the journal messages for openqa-scheduler.service show no trace of plugin loading, unlike the gru logs.

There are a few places where the scheduler can poke job state without going through the web API. It has a check it runs on a timer, every two minutes, which kills and duplicates "abandoned" jobs - incomplete_and_duplicate_stale_jobs - and that just goes through the DB models to call $job->done and $job->auto_duplicate. It also has a check - _update_scheduled_jobs - which cancels jobs that have been scheduled for more than a certain length of time; it again does that directly in the DB, calling $job->cancel on the DB model. I'm pretty sure none of those emits events now, and as you say, if we just change the DB model to emit events, that won't achieve anything useful unless we make the scheduler load at least the AMQP and audit plugins (for Fedora's purposes, it would also need to load our FedoraMessaging plugin, which is like the AMQP plugin but publishes messages in a different format).
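The problem can be modelled in a few lines. This is a Python toy, not openQA code: the `Service` class, the plugin names as plain strings, and the `job_cancel` event name are assumptions for illustration. It shows the same model-level code emitting an event in two services, where only the one that loaded plugins at startup has anyone listening.

```python
class Service:
    """Toy model of a process that may or may not load event plugins."""

    def __init__(self, name, plugins=()):
        self.name = name
        self.received = []
        # One handler per plugin loaded at startup.
        self.handlers = [self._make_handler(plugin) for plugin in plugins]

    def _make_handler(self, plugin):
        def handler(event, payload):
            self.received.append((plugin, event, payload))
        return handler

    def emit_event(self, event, payload):
        for handler in self.handlers:
            handler(event, payload)
        return len(self.handlers)  # how many plugins saw the event


def cancel_job(service, job_id):
    # Shared model-level code: emits no matter which service runs it.
    return service.emit_event("job_cancel", {"id": job_id})


webui = Service("webui", plugins=("AMQP", "AuditLog"))
scheduler = Service("scheduler")  # loads no plugins, as described above

print(cancel_job(webui, 1), cancel_job(scheduler, 2))
```

In the scheduler case the emit call succeeds silently with zero listeners, which is why changing the DB model alone would not be enough.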

This is getting pretty complicated at this point :/ The more I look, the more "holes" I find in the job state event emission. I'm not sure whether the best idea is to try and work on it incrementally, or try to come up with a grand plan to fix everything at once.

There are definitely ways we can work on it incrementally: we can just have the job model emit events appropriately (there are a few other holes that need patching - this is why https://bodhi.fedoraproject.org/updates/FEDORA-2023-3c2931ff23 thinks a bunch of tests that got cancelled hours ago are still 'queued' or 'running'). This won't fix everything, but it'll make things better, not worse: as long as the code gets hit by the webAPI or gru, the appropriate plugins will respond to the events, and when something other than the webAPI or gru hits the code, it doesn't make anything any worse than it is right now.

So...in a way I kinda feel like just trying to do what I can to make things a bit better without making them any worse, at least at first. What does anyone else think? I'll at least send a couple of relatively simple PRs as tests (at least enough to hopefully fix the above Fedora issue).

Actions #8

Updated by AdamWill 9 months ago

sigh, just ran into yet another angle on this: when a job fails and its children are cancelled, there is no event published for the children (so we lose lifecycle tracking of them at that point, if we're trying to do it via events; in practice, this means our update system believes the test is 'running' forever, because that was the last event it got for the test). If anyone's wondering why I never sent the 'simple' PR I suggested above, it's because I tested it and it broke stuff unexpectedly. This seems to be just very difficult to fix.
