Project

General

Profile

action #62441

Updated by okurz over 3 years ago

## Observation 

 From openqaworker-arm-3 within OSD infrastructure `journalctl -u openqa-worker@17 --since=2020-01-21 11:40` when I did `systemctl mask --now openqa-worker@…` : 

 ``` 
 Jan 21 11:41:09 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status 
 Jan 21 11:41:19 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status 
 Jan 21 11:41:29 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status 
 Jan 21 11:41:39 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status 
 Jan 21 11:41:49 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status 
 Jan 21 11:41:59 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status 
 Jan 21 11:42:09 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status 
 Jan 21 11:42:19 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status 
 Jan 21 11:42:21 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Current command vanished from the unit file, execution of the command list won't be resumed. 
 Jan 21 11:42:21 openqaworker-arm-3 systemd[1]: Stopping openqa-worker@17.service... 
 Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Quit due to signal TERM 
 Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Stopping job 3813244 from openqa.suse.de: 03813244-sle-15-SP2-Online-aarch64-Build126.1-xfstests_xfs-generic-401-999@aarch64 - reason: quit 
 Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status 
 Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Trying to stop job gracefully by announcing it to command server via http://localhost:20173/z4vitFNxgGLkY74d/broadcast 
 Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Unable to stop the command server gracefully: 
 Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Command server likely not reachable at all 
 Jan 21 11:42:22 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Isotovideo exit status: 0 
 Jan 21 11:42:24 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Registered process:45476 
 Jan 21 11:42:33 openqaworker-arm-3 worker[3832]: [info] [pid:3832] +++ worker notes +++ 
 Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [info] [pid:3832] End time: 2020-01-21 10:42:33 
 Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Result: quit 
 Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Job 3813244 stopped as incomplete 
 Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] duplicating job 3813244 
 Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/duplicate?dup_type_auto=1 
 Jan 21 11:42:35 openqaworker-arm-3 worker[3832]: [warn] [pid:3832] Can't open /var/lib/openqa/pool/17/testresults/result-generate_report.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file o> 
 Jan 21 11:42:35 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status 
 Jan 21 11:42:38 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Optimizing /var/lib/openqa/pool/17/testresults/run-474.png 
 Jan 21 11:42:38 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Uploading artefact run-474.png as ea827bcfa47edcc606116292b513618c 
 … 
 Jan 21 11:43:51 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Uploading artefact run-453.png as 6ef5f5f8f61c837a9bb9191ed0046514 
 Jan 21 11:43:51 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Optimizing /var/lib/openqa/pool/17/testresults/run-855.png 
 Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: State 'stop-sigterm' timed out. Killing. 
 Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 3832 (worker) with signal SIGKILL. 
 Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 62860 (worker) with signal SIGKILL. 
 Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 63265 (optipng) with signal SIGKILL. 
 Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Main process exited, code=killed, status=9/KILL 
 Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: Stopped openqa-worker@17.service. 
 Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Unit entered failed state. 
 Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Failed with result 'timeout'. 
 ``` 

 ## Problem 

 so it looks like the job was in state "Uploading" and "optipng" does not receive TERM or does not react on it. 

 ## Impact 
 As not only the worker service itself but also "openqa-workercache-daemon" seems to be impacted this might lead to worker cache sqlite database corruption 


 ## Expected result 

 The complete process structure should tear down within a "reasonable" time: Either the service is given more time for termination or we ensure that optipng stops within time.

Back