Project

General

Profile

Actions

action #62441

open

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #102912: [epic] Simplify investigation of job failures - 2nd

openqa-worker systemd service can timeout when stopping

Added by okurz almost 5 years ago. Updated about 3 years ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
Regressions/Crashes
Target version:
Start date:
2020-01-21
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

From openqaworker-arm-3 within OSD infrastructure journalctl -u openqa-worker@17 --since=2020-01-21 11:40 when I did systemctl mask --now openqa-worker@… :

Jan 21 11:41:09 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:19 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:29 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:39 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:49 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:41:59 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:09 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:19 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:21 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Jan 21 11:42:21 openqaworker-arm-3 systemd[1]: Stopping openqa-worker@17.service...
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Quit due to signal TERM
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Stopping job 3813244 from openqa.suse.de: 03813244-sle-15-SP2-Online-aarch64-Build126.1-xfstests_xfs-generic-401-999@aarch64 - reason: quit
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Trying to stop job gracefully by announcing it to command server via http://localhost:20173/z4vitFNxgGLkY74d/broadcast
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Unable to stop the command server gracefully:
Jan 21 11:42:21 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Command server likely not reachable at all
Jan 21 11:42:22 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Isotovideo exit status: 0
Jan 21 11:42:24 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Registered process:45476
Jan 21 11:42:33 openqaworker-arm-3 worker[3832]: [info] [pid:3832] +++ worker notes +++
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [info] [pid:3832] End time: 2020-01-21 10:42:33
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [info] [pid:3832] Result: quit
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] Job 3813244 stopped as incomplete
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] duplicating job 3813244
Jan 21 11:42:34 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/duplicate?dup_type_auto=1
Jan 21 11:42:35 openqaworker-arm-3 worker[3832]: [warn] [pid:3832] Can't open /var/lib/openqa/pool/17/testresults/result-generate_report.json for result upload - likely isotovideo could not be started or failed early. Error message: No such file o>
Jan 21 11:42:35 openqaworker-arm-3 worker[3832]: [debug] [pid:3832] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3813244/status
Jan 21 11:42:38 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Optimizing /var/lib/openqa/pool/17/testresults/run-474.png
Jan 21 11:42:38 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Uploading artefact run-474.png as ea827bcfa47edcc606116292b513618c
…
Jan 21 11:43:51 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Uploading artefact run-453.png as 6ef5f5f8f61c837a9bb9191ed0046514
Jan 21 11:43:51 openqaworker-arm-3 worker[3832]: [debug] [pid:62860] Optimizing /var/lib/openqa/pool/17/testresults/run-855.png
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: State 'stop-sigterm' timed out. Killing.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 3832 (worker) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 62860 (worker) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Killing process 63265 (optipng) with signal SIGKILL.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Main process exited, code=killed, status=9/KILL
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: Stopped openqa-worker@17.service.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Unit entered failed state.
Jan 21 11:43:51 openqaworker-arm-3 systemd[1]: openqa-worker@17.service: Failed with result 'timeout'.

Problem

so it looks like the job was in state "Uploading" and "optipng" does not receive TERM or does not react on it.

Impact

As not only the worker service itself but also "openqa-workercache-daemon" seems to be impacted this might lead to worker cache sqlite database corruption

Expected result

The complete process structure should tear down within a "reasonable" time: Either the service is given more time for termination or we ensure that optipng stops within time.


Related issues 3 (1 open2 closed)

Related to openQA Project (public) - coordination #61922: [epic] Incomplete jobs with no logs at allResolvedmkittler2020-02-03

Actions
Related to openQA Project (public) - action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retryResolvedmkittler2020-05-18

Actions
Related to openQA Infrastructure (public) - action #93138: openqa-reload-worker-auto-restart and openqa-worker-auto-restart fail if numofworkers is reducedNew2021-05-26

Actions
Actions #1

Updated by okurz almost 5 years ago

Actions #2

Updated by mkittler almost 5 years ago

As mentioned in https://progress.opensuse.org/issues/61922#note-8 an ongoing upload is not aborted when the worker receives SIGTERM. I also think this should be the case to be able to tear down the worker within a "reasonable" time (and uploads can take quite a while).

Actions #3

Updated by mkittler almost 5 years ago

I've just had a look at the code to check how difficult it would be to make the upload interruptible. It wouldn't be hard to define one "exit-path" before each step (e.g. https://github.com/Martchus/openQA/pull/new/interruptible-upload). Only the image upload which is done in a sub process using Mojo::IOLoop::Subprocess might be a little bit tricky to interrupt.

Actions #4

Updated by okurz almost 5 years ago

mkittler wrote:

[…] Only the image upload which is done in a sub process using Mojo::IOLoop::Subprocess might be a little bit tricky to interrupt.

I guess that would be
https://github.com/os-autoinst/openQA/compare/master...Martchus:interruptible-upload?expand=1#diff-2559347dfdfd46e3d678b4f8823f2d7fR838

Maybe that subprocess is completely missing a signal handler? Do we need https://metacpan.org/pod/Mojo::IOLoop::Signal ?

Actions #5

Updated by mkittler almost 5 years ago

Maybe that subprocess is completely missing a signal handler? Do we need https://metacpan.org/pod/Mojo::IOLoop::Signal ?

I would assume the subprocess has the same handler as the parent which means it won't really react to the signal in a sensible way. Overriding the signal handler with one which simply exists would help. Since we're not really using the IOLoop within the sub process I wouldn't use that module for it. To make that work we also need to ensure that the signal is actually passed to the sub process. All of this doesn't seem to be that hard to do after all - it is just some more work than I initially expected.

Actions #6

Updated by okurz over 4 years ago

  • Target version set to Ready
Actions #7

Updated by okurz over 4 years ago

  • Tags changed from caching, openQA, sporadic, arm, ipmi, worker to worker
Actions #8

Updated by okurz about 4 years ago

  • Target version changed from Ready to future
Actions #9

Updated by okurz about 4 years ago

  • Related to action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added
Actions #10

Updated by okurz about 4 years ago

  • Description updated (diff)
  • Status changed from New to Workable
  • Target version changed from future to Ready

In #67000#note-50 favogt observed the same issue and linked that to worker cache sqlite database corruption. This means that this ticket is moving back to focus :)

Actions #11

Updated by okurz about 4 years ago

  • Parent task set to #39719
Actions #12

Updated by nicksinger over 3 years ago

  • Related to action #93138: openqa-reload-worker-auto-restart and openqa-worker-auto-restart fail if numofworkers is reduced added
Actions #13

Updated by livdywan over 3 years ago

[Alerting] Failed systemd services alert (except openqa.suse.de)

Check failed systemd services on hosts with `systemctl --failed`. Hint: Go to parent dashboard https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services to see a list of affected hosts.
Metric name

Value
Sum of failed systemd services

8.000

seems to specifically have failed like so:

openqa-worker-auto-restart@37.service: State 'stop-sigterm' timed out. Killing.
Actions #14

Updated by okurz over 3 years ago

  • Status changed from Workable to New

moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size

Actions #15

Updated by okurz over 3 years ago

  • Target version changed from Ready to future
Actions #16

Updated by okurz about 3 years ago

  • Parent task changed from #39719 to #102912
Actions

Also available in: Atom PDF