action #62441
Updated by okurz about 4 years ago
## Observation
From openqaworker-arm-3 within OSD infrastructure `journalctl -u openqa-worker@17 --since=2020-01-21 11:40` when I did `systemctl mask --now openqa-worker@…` :
```
Jan 21 11:41:09 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:19 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:29 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:39 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:49 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:59 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:09 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:19 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:21 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Jan 21 11:42:21 openqaworker-arm-3 systemd[1]: Stopping openqa-worker@17.service...
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Quit due to signal TERM
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Stopping job 3813244 from openqa.suse.de: 03813244-sle-15-SP2-Online-aarch64-Build126.1-xfstests_xfs-generic-401-999@aarch64 - reason: quit
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Trying to stop job gracefully by announcing it to command server via http://localhost:20173/z4vitFNxgGLkY74d/broadcast
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Unable to stop the command server gracefully:
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Command server likely not reachable at all
Jan 21 11:42:22 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Isotovideo exit status: 0
Jan 21 11:42:24 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Registered process:45476
Jan 21 11:42:33 openqaworker-arm-3 worker[3832]: [info] [pid:3832] +++ worker notes +++
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [info] [pid:3832] End time: 2020-01-21 10:42:33
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Result: quit
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Job 3813244 stopped as incomplete
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] duplicating job 3813244
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/duplicate?dup_type_auto=1
Jan 21 11:42:35 openqaworker-arm-3 worker[3832]: [warn] [pid:3832] Can't open /var/lib/openqa/pool/17/testresults/result-generate_report.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file o>
Jan 21 11:42:35 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:38 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Optimizing /var/lib/openqa/pool/17/testresults/run-474.png
Jan 21 11:42:38 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Uploading artefact run-474.png as ea827bcfa47edcc606116292b513618c
…
Jan 21 11:43:51 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Uploading artefact run-453.png as 6ef5f5f8f61c837a9bb9191ed0046514
Jan 21 11:43:51 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Optimizing /var/lib/openqa/pool/17/testresults/run-855.png
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: State 'stop-sigterm' timed out. Killing.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 3832 (worker) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 62860 (worker) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 63265 (optipng) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Main process exited, code=killed, status=9/KILL
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: Stopped openqa-worker@17.service.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Unit entered failed state.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Failed with result 'timeout'.
```
## Problem
so it looks like the job was in state "Uploading" and "optipng" does not receive TERM or does not react on it.
## Impact
As not only the worker service itself but also "openqa-workercache-daemon" seems to be impacted this might lead to worker cache sqlite database corruption
## Expected result
The complete process structure should tear down within a "reasonable" time: Either the service is given more time for termination or we ensure that optipng stops within time.