action #62441
opencoordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA
coordination #102912: [epic] Simplify investigation of job failures - 2nd
openqa-worker systemd service can timeout when stopping
0%
Description
Observation¶
From openqaworker-arm-3 within OSD infrastructure journalctl -u openqa-worker@17 --since=2020-01-21 11:40
when I did systemctl mask --now openqa-worker@…
:
Jan 21 11:41:09 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:19 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:29 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:39 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:49 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:59 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:09 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:19 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:21 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Jan 21 11:42:21 openqaworker-arm-3 systemd[1]: Stopping openqa-worker@17.service...
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Quit due to signal TERM
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Stopping job 3813244 from openqa.suse.de: 03813244-sle-15-SP2-Online-aarch64-Build126.1-xfstests_xfs-generic-401-999@aarch64 - reason: quit
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Trying to stop job gracefully by announcing it to command server via http://localhost:20173/z4vitFNxgGLkY74d/broadcast
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Unable to stop the command server gracefully:
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Command server likely not reachable at all
Jan 21 11:42:22 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Isotovideo exit status: 0
Jan 21 11:42:24 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Registered process:45476
Jan 21 11:42:33 openqaworker-arm-3 worker[3832]: [info] [pid:3832] +++ worker notes +++
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [info] [pid:3832] End time: 2020-01-21 10:42:33
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Result: quit
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Job 3813244 stopped as incomplete
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] duplicating job 3813244
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/duplicate?dup_type_auto=1
Jan 21 11:42:35 openqaworker-arm-3 worker[3832]: [warn] [pid:3832] Can't open /var/lib/openqa/pool/17/testresults/result-generate_report.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file o>
Jan 21 11:42:35 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:38 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Optimizing /var/lib/openqa/pool/17/testresults/run-474.png
Jan 21 11:42:38 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Uploading artefact run-474.png as ea827bcfa47edcc606116292b513618c
…
Jan 21 11:43:51 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Uploading artefact run-453.png as 6ef5f5f8f61c837a9bb9191ed0046514
Jan 21 11:43:51 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Optimizing /var/lib/openqa/pool/17/testresults/run-855.png
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: State 'stop-sigterm' timed out. Killing.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 3832 (worker) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 62860 (worker) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 63265 (optipng) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Main process exited, code=killed, status=9/KILL
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: Stopped openqa-worker@17.service.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Unit entered failed state.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Failed with result 'timeout'.
Problem¶
so it looks like the job was in state "Uploading" and "optipng" does not receive TERM or does not react on it.
Impact¶
As not only the worker service itself but also "openqa-workercache-daemon" seems to be impacted this might lead to worker cache sqlite database corruption
Expected result¶
The complete process structure should tear down within a "reasonable" time: Either the service is given more time for termination or we ensure that optipng stops within time.
Updated by okurz over 4 years ago
- Related to coordination #61922: [epic] Incomplete jobs with no logs at all added
Updated by mkittler over 4 years ago
As mentioned in https://progress.opensuse.org/issues/61922#note-8 an ongoing upload is not aborted when the worker receives SIGTERM. I also think this should be the case to be able to tear down the worker within a "reasonable" time (and uploads can take quite a while).
Updated by mkittler over 4 years ago
I've just had a look at the code to check how difficult it would be to make the upload interruptible. It wouldn't be hard to define one "exit-path" before each step (e.g. https://github.com/Martchus/openQA/pull/new/interruptible-upload). Only the image upload which is done in a sub process using Mojo::IOLoop::Subprocess
might be a little bit tricky to interrupt.
Updated by okurz over 4 years ago
mkittler wrote:
[…] Only the image upload which is done in a sub process using
Mojo::IOLoop::Subprocess
might be a little bit tricky to interrupt.
I guess that would be
https://github.com/os-autoinst/openQA/compare/master...Martchus:interruptible-upload?expand=1#diff-2559347dfdfd46e3d678b4f8823f2d7fR838
Maybe that subprocess is completely missing a signal handler? Do we need https://metacpan.org/pod/Mojo::IOLoop::Signal ?
Updated by mkittler over 4 years ago
Maybe that subprocess is completely missing a signal handler? Do we need https://metacpan.org/pod/Mojo::IOLoop::Signal ?
I would assume the subprocess has the same handler as the parent which means it won't really react to the signal in a sensible way. Overriding the signal handler with one which simply exists would help. Since we're not really using the IOLoop within the sub process I wouldn't use that module for it. To make that work we also need to ensure that the signal is actually passed to the sub process. All of this doesn't seem to be that hard to do after all - it is just some more work than I initially expected.
Updated by okurz about 4 years ago
- Tags changed from caching, openQA, sporadic, arm, ipmi, worker to worker
Updated by okurz almost 4 years ago
- Related to action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added
Updated by okurz almost 4 years ago
- Description updated (diff)
- Status changed from New to Workable
- Target version changed from future to Ready
In #67000#note-50 favogt observed the same issue and linked that to worker cache sqlite database corruption. This means that this ticket is moving back to focus :)
Updated by nicksinger over 3 years ago
- Related to action #93138: openqa-reload-worker-auto-restart and openqa-worker-auto-restart fail if numofworkers is reduced added
Updated by livdywan over 3 years ago
[Alerting] Failed systemd services alert (except openqa.suse.de)
Check failed systemd services on hosts with `systemctl --failed`. Hint: Go to parent dashboard https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services to see a list of affected hosts.
Metric name
Value
Sum of failed systemd services
8.000
seems to specifically have failed like so:
openqa-worker-auto-restart@37.service: State 'stop-sigterm' timed out. Killing.
Updated by okurz over 3 years ago
- Status changed from Workable to New
moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size