Project

General

Profile

Actions

action #62441

open

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #102912: [epic] Simplify investigation of job failures - 2nd

openqa-worker systemd service can timeout when stopping

Added by okurz almost 5 years ago. Updated about 3 years ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
Regressions/Crashes
Target version:
Start date:
2020-01-21
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

From openqaworker-arm-3 within OSD infrastructure journalctl -u openqa-worker@17 --since=2020-01-21 11:40 when I did systemctl mask --now openqa-worker@… :

Jan 21 11:41:09 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:19 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:29 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:39 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:49 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:59 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:09 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:19 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:21 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Jan 21 11:42:21 openqaworker-arm-3 systemd[1]: Stopping openqa-worker@17.service...
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Quit due to signal TERM
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Stopping job 3813244 from openqa.suse.de: 03813244-sle-15-SP2-Online-aarch64-Build126.1-xfstests_xfs-generic-401-999@aarch64 - reason: quit
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Trying to stop job gracefully by announcing it to command server via http://localhost:20173/z4vitFNxgGLkY74d/broadcast
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Unable to stop the command server gracefully:
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Command server likely not reachable at all
Jan 21 11:42:22 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Isotovideo exit status: 0
Jan 21 11:42:24 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Registered process:45476
Jan 21 11:42:33 openqaworker-arm-3 worker[3832]: [info] [pid:3832] +++ worker notes +++
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [info] [pid:3832] End time: 2020-01-21 10:42:33
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Result: quit
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Job 3813244 stopped as incomplete
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] duplicating job 3813244
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/duplicate?dup_type_auto=1
Jan 21 11:42:35 openqaworker-arm-3 worker[3832]: [warn] [pid:3832] Can't open /var/lib/openqa/pool/17/testresults/result-generate_report.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file o>
Jan 21 11:42:35 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:38 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Optimizing /var/lib/openqa/pool/17/testresults/run-474.png
Jan 21 11:42:38 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Uploading artefact run-474.png as ea827bcfa47edcc606116292b513618c
…
Jan 21 11:43:51 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Uploading artefact run-453.png as 6ef5f5f8f61c837a9bb9191ed0046514
Jan 21 11:43:51 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Optimizing /var/lib/openqa/pool/17/testresults/run-855.png
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: State 'stop-sigterm' timed out. Killing.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 3832 (worker) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 62860 (worker) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 63265 (optipng) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Main process exited, code=killed, status=9/KILL
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: Stopped openqa-worker@17.service.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Unit entered failed state.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Failed with result 'timeout'.

Problem

so it looks like the job was in state "Uploading" and "optipng" does not receive TERM or does not react on it.

Impact

As not only the worker service itself but also "openqa-workercache-daemon" seems to be impacted this might lead to worker cache sqlite database corruption

Expected result

The complete process structure should tear down within a "reasonable" time: Either the service is given more time for termination or we ensure that optipng stops within time.


Related issues 3 (1 open2 closed)

Related to openQA Project (public) - coordination #61922: [epic] Incomplete jobs with no logs at allResolvedmkittler2020-02-03

Actions
Related to openQA Project (public) - action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retryResolvedmkittler2020-05-18

Actions
Related to openQA Infrastructure (public) - action #93138: openqa-reload-worker-auto-restart and openqa-worker-auto-restart fail if numofworkers is reducedNew2021-05-26

Actions
Actions

Also available in: Atom PDF