action #46187

Create list of "worker responsibilities"

Added by mkittler over 1 year ago. Updated 10 months ago.

Target version:
Start date:
Due date:
% Done:


Estimated time:


We are considering to rewrite the worker. Since the worker does more than executing arbitrary jobs, a list of the tasks it currently is responsible for (and would need to be re-implemented) should be created.

Related issues

Related to openQA Project - action #47117: [epic] Fix worker->websocket->scheduler->webui connectionResolved2019-02-04


#1 Updated by mkittler over 1 year ago

Tasks the worker is responsible for

Those are the tasks the worker is performing right now, in its current implementation (the list is numbered for easier referencing, not to imply a certain order):

  1. Register at the web UI with one or more worker classes so the web UI knows that a worker with certain capabilities is now available.
  2. Start isotovideo.
  3. Stop isotovideo ensuring no subprocesses are left over.
  4. Query test status from isotovideo's command server while it is running.
  5. Report test status to web UI while isotovideo is being executed.
  6. Upload test results to web UI while isotovideo is being executed.
  7. Take commands from the web UI, e.g.
    1. to "grab" a new job.
    2. to start and stop the live view and log.
    3. to be aware of ongoing developer mode sessions (because such sessions require an adjustment to the upload).
  8. Report the upload progress to the livehandler daemon if a developer mode session is ongoing (so developer mode knows when all screenshots are available and a new needle might be created based on them).
  9. Optimize PNG images with optipng (as part of the upload process).
  10. Download and cache assets required by the tests (caching is actually out-sourced to a Minion service).
  11. Sync test code and needles (also with caching support which is again out-sourced to a Minion service).
  12. Report if worker setup is broken (so far only tests whether caching is configured but not available) to avoid lots of incomplete jobs.
  13. Report its own hostname or IP address to the web UI so the live handler daemon for the developer mode knows how to connect to isotovideo's command server.

Note that 7.2. could also be done via the "live handler <=> command server" connection which would likely be less confusing/cluttered. 7.3. must be done by the worker because only the worker does the upload and as such only knows the upload progress (and not isotovideo or its command server).

Further notes

Of course having this list does not mean we don't want an open discussion about a new worker architecture. So a re-write could change some of those responsibilities/task. For instance isotovideo (or parts of it) could be moved to the worker. Some of the logic (e.g. killing subprocesses) is currently duplicated anyways.

It is also conceivable to have only one worker (systemd) service anymore but within that service multiple jobs could be running/managed. That might have advantages regarding scheduling of multi-machine tests.

Biggest problems with the worker code right now

We should avoid doing the same mistakes again:

  • multiple nested callbacks: Such code is just hard to follow. If there are good alternatives suitable for our environment I'm glad to hear. Maybe co-routines?
  • use of timers: Hard to follow, like the callbacks.
  • bad/messy error handling: That part has been improved (or at least attempted to be improved) but we still see strange behavior in production like workers being stuck after an error occurs.
  • slow/blocking operations executed on thread actually dedicated for async IO: Reason why the worker is sometimes unresponsive and the web UI thinks it has disconnected. We should use things like the Minion task queue consistently and not only for some tasks.

If something is missng/unclear, just comment or directly edit this comment.

#2 Updated by mkittler over 1 year ago

  • Status changed from New to Feedback

#3 Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

The previous comment contains the list and we discussed some of the points in the last tools team meeting.

#4 Updated by mkittler over 1 year ago

  • Related to action #47117: [epic] Fix worker->websocket->scheduler->webui connection added

#5 Updated by mkittler about 1 year ago

Not sure how to continue with the worker. But here's an update to "biggest problems with the worker code" after playing around with Mojo::Promise:

  • multiple nested callbacks: Mojo::Promise makes this at least a little bit better and would at least lead to more composable code. Using it in the current state of the worker seems hard because it is just too easy to introduce subtle bugs. But new code would benefit from using Mojo::Promise for sure.
  • use of timers
    • valid uses: job timeout, periodic status update and round robin web UI connection
      • The code for managing those timers still needs cleanup.
    • bad uses: polling on the internal state (e.g. to wait until status update is concluded)
      • Internal operations should emit events to avoid polling, maybe by tieing them to a Mojo::EventEmitter object (e.g. a singleton object responsible for the upload).
  • slow/blocking operations executed on thread actually dedicated for async IO
    • Posting the status to the web UI and liveviewhandler is already done mostly async. Since it does not take long and happens quite often is is likely better to make is completely async (instead of making it a Minion job).
    • The upload should be made completely async or outsourced to a Minion job.
  • overall structure: There should be a separate Perl module (e.g. a "class" inheriting from Mojo::Base or Mojo::EventEmitter) for each responsibility.
    • contains the client to talk to the openQA-API, timer management, reading the configuration and various utilities. Each of those categories should be in its own module.
    • manages the job lifetime, posts status updates and uploads images and other test results. Each of those categories should be in its own module.
    • Engines/ uses the questionable Mojo::IOLoop::ReadWriteProcess but I guess that's not really the biggest issue right now. The code is ugly but at least it doesn't mix too much different things.
    • The elsif-hell in could be restructured similar to how in os-autoinst is already structured.
    • In general, things should be broken down to smaller, unit-test-friendly functions.

#6 Updated by mkittler 10 months ago

Note that "use of timers", "overall structure" and problematic blocking operations have been fixed by now. Usage of Mojo::Promise hasn't been introduced yet, though.

Also available in: Atom PDF