action #46187
closedCreate list of "worker responsibilities"
0%
Description
We are considering to rewrite the worker. Since the worker does more than executing arbitrary jobs, a list of the tasks it currently is responsible for (and would need to be re-implemented) should be created.
Updated by mkittler almost 6 years ago
Tasks the worker is responsible for¶
Those are the tasks the worker is performing right now, in its current implementation (the list is numbered for easier referencing, not to imply a certain order):
- Register at the web UI with one or more worker classes so the web UI knows that a worker with certain capabilities is now available.
- Start isotovideo.
- Stop isotovideo ensuring no subprocesses are left over.
- Query test status from isotovideo's command server while it is running.
- Report test status to web UI while isotovideo is being executed.
- Upload test results to web UI while isotovideo is being executed.
- Take commands from the web UI, e.g.
- to "grab" a new job.
- to start and stop the live view and log.
- to be aware of ongoing developer mode sessions (because such sessions require an adjustment to the upload).
- Report the upload progress to the livehandler daemon if a developer mode session is ongoing (so developer mode knows when all screenshots are available and a new needle might be created based on them).
- Optimize PNG images with optipng (as part of the upload process).
- Download and cache assets required by the tests (caching is actually out-sourced to a Minion service).
- Sync test code and needles (also with caching support which is again out-sourced to a Minion service).
- Report if worker setup is broken (so far only tests whether caching is configured but not available) to avoid lots of incomplete jobs.
- Report its own hostname or IP address to the web UI so the live handler daemon for the developer mode knows how to connect to isotovideo's command server.
Note that 7.2. could also be done via the "live handler <=> command server" connection which would likely be less confusing/cluttered. 7.3. must be done by the worker because only the worker does the upload and as such only knows the upload progress (and not isotovideo or its command server).
Further notes¶
Of course having this list does not mean we don't want an open discussion about a new worker architecture. So a re-write could change some of those responsibilities/task. For instance isotovideo (or parts of it) could be moved to the worker. Some of the logic (e.g. killing subprocesses) is currently duplicated anyways.
It is also conceivable to have only one worker (systemd) service anymore but within that service multiple jobs could be running/managed. That might have advantages regarding scheduling of multi-machine tests.
Biggest problems with the worker code right now¶
We should avoid doing the same mistakes again:
- multiple nested callbacks: Such code is just hard to follow. If there are good alternatives suitable for our environment I'm glad to hear. Maybe co-routines?
- use of timers: Hard to follow, like the callbacks.
- bad/messy error handling: That part has been improved (or at least attempted to be improved) but we still see strange behavior in production like workers being stuck after an error occurs.
- slow/blocking operations executed on thread actually dedicated for async IO: Reason why the worker is sometimes unresponsive and the web UI thinks it has disconnected. We should use things like the Minion task queue consistently and not only for some tasks.
If something is missng/unclear, just comment or directly edit this comment.
Updated by mkittler almost 6 years ago
- Status changed from Feedback to Resolved
The previous comment contains the list and we discussed some of the points in the last tools team meeting.
Updated by mkittler almost 6 years ago
- Related to coordination #47117: [epic] Fix worker->websocket->scheduler->webui connection added
Updated by mkittler over 5 years ago
Not sure how to continue with the worker. But here's an update to "biggest problems with the worker code" after playing around with Mojo::Promise
:
- multiple nested callbacks:
Mojo::Promise
makes this at least a little bit better and would at least lead to more composable code. Using it in the current state of the worker seems hard because it is just too easy to introduce subtle bugs. But new code would benefit from usingMojo::Promise
for sure. - use of timers
- valid uses: job timeout, periodic status update and round robin web UI connection
- The code for managing those timers still needs cleanup.
- bad uses: polling on the internal state (e.g. to wait until status update is concluded)
- Internal operations should emit events to avoid polling, maybe by tieing them to a
Mojo::EventEmitter
object (e.g. a singleton object responsible for the upload).
- Internal operations should emit events to avoid polling, maybe by tieing them to a
- valid uses: job timeout, periodic status update and round robin web UI connection
- slow/blocking operations executed on thread actually dedicated for async IO
- Posting the status to the web UI and liveviewhandler is already done mostly async. Since it does not take long and happens quite often is is likely better to make is completely async (instead of making it a Minion job).
- The upload should be made completely async or outsourced to a Minion job.
- overall structure: There should be a separate Perl module (e.g. a "class" inheriting from
Mojo::Base
orMojo::EventEmitter
) for each responsibility.Common.pm
contains the client to talk to the openQA-API, timer management, reading the configuration and various utilities. Each of those categories should be in its own module.Jobs.pm
manages the job lifetime, posts status updates and uploads images and other test results. Each of those categories should be in its own module.Engines/isotovideo.pm
uses the questionableMojo::IOLoop::ReadWriteProcess
but I guess that's not really the biggest issue right now. The code is ugly but at least it doesn't mix too much different things.- The
elsif
-hell inCommands.pm
could be restructured similar to howCommandHandler.pm
in os-autoinst is already structured. - In general, things should be broken down to smaller, unit-test-friendly functions.
Updated by mkittler about 5 years ago
Note that "use of timers", "overall structure" and problematic blocking operations have been fixed by now. Usage of Mojo::Promise
hasn't been introduced yet, though.