action #13482

closed

isotovideo process fails to die on job completion, worker becomes stuck

Added by AdamWill over 7 years ago. Updated over 7 years ago.

Status: Closed
Priority: Urgent
Assignee: -
Category: Feature requests
Target version: -
Start date: 2016-08-27
Due date:
% Done: 0%
Estimated time:
Description

I upgraded our staging openQA to current git os-autoinst and openQA today. After the upgrade, a test run was automatically kicked off for a Rawhide compose by our scheduler. I wanted to restart the run with the tests changed a bit, so I forced a restart of the tests: this basically means we just POST the same ISOs with the same settings again, and let the scheduler cancel the existing jobs and create new ones.

It looks like three worker processes became stuck when I did this. For all three, this seems to be what happened: the attempt to kill the isotovideo process as part of the job teardown failed. This leaves the worker process running, but stuck - it won't pick up any new jobs, it won't check in with the server (the server sees it as 'dead'), and it won't cleanly exit if you shut down the service, even after forcibly killing the stuck isotovideo process. systemd timed out waiting for the process to exit cleanly and had to forcibly kill it.

I'll give links to one example case, as they all seem to be identical. This is the job: https://openqa.stg.fedoraproject.org/tests/33900 . The worker service system log looks like this:

Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: got job 33900: 00033900-fedora-Rawhide-Server-dvd-iso-x86_64-BuildFedora-Rawhide-20160826.n.1-install_updates_nfs@64bit
Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: 3608: WORKING 33900
Aug 26 23:54:54 qa06.qa.fedoraproject.org worker[2508]: killing 3608

That's where they get stuck. At that point, 3608 - which is the isotovideo process - is still alive. In two of the cases I had to kill -9 it to make it go away (plain kill did not work); in the third case, plain kill worked. Note that a 'normal' job closure looks like this:

Aug 26 23:47:28 qa06.qa.fedoraproject.org worker[2508]: killing 3444
Aug 26 23:47:28 qa06.qa.fedoraproject.org worker[2508]: waitpid returned error: No child processes
Aug 26 23:47:29 qa06.qa.fedoraproject.org worker[2508]: can't open /var/lib/openqa/pool/3/testresults/result-_software_selection.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 534.
Aug 26 23:47:31 qa06.qa.fedoraproject.org worker[2508]: cleaning up 00033872-fedora-Rawhide-Atomic-boot-iso-x86_64-BuildFedora-Rawhide-20160826.n.1-install_default@64bit...
Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: setting job 33872 to done

When I try to stop the service with systemd, I see this:

Aug 27 00:43:48 qa06.qa.fedoraproject.org systemd[1]: Stopping openQA Worker #3...
Aug 27 00:43:48 qa06.qa.fedoraproject.org worker[2508]: quit due to signal TERM
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: State 'stop-sigterm' timed out. Killing.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Main process exited, code=killed, status=9/KILL
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: Stopped openQA Worker #3.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Unit entered failed state.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Failed with result 'signal'.
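For illustration only, the manual recovery described above (plain kill first, kill -9 when that does not work) can be sketched as a bounded wait with escalation. This is not the openQA worker's actual code (the worker is written in Perl); stop_child, spawn_stubborn_child and the grace period are hypothetical names and values, and the sketch assumes a POSIX system.

```python
import signal
import subprocess
import sys


def stop_child(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Hypothetical helper: it only illustrates the escalation pattern and is
    not taken from openQA or os-autoinst.
    """
    proc.terminate()  # equivalent to a plain `kill <pid>` (SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # equivalent to `kill -9 <pid>` (SIGKILL)
        return proc.wait()


def spawn_stubborn_child() -> subprocess.Popen:
    """Start a stand-in child that ignores SIGTERM (assumption for the demo)."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has installed its handler
    return proc


if __name__ == "__main__":
    child = spawn_stubborn_child()
    status = stop_child(child, grace_seconds=1.0)
    child.stdout.close()
    # On POSIX a negative return code means the child died from that signal.
    print("child exit status:", status)
```

The point of the bounded wait is that the parent never blocks indefinitely on a child that ignores SIGTERM, which is the state the stuck workers above appear to be in.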

Related issues 1 (0 open, 1 closed)

Is duplicate of openQA Project - action #12178: worker can hang when killing isotovideo (Resolved, coolo, 2016-05-31)

Actions #1

Updated by okurz over 7 years ago

  • Is duplicate of action #12838: sporadic "corrupt images" in various tests or fails uploading, e.g. with "Premature connection close" added
Actions #2

Updated by okurz over 7 years ago

  • Status changed from New to Closed
Actions #3

Updated by okurz over 7 years ago

  • Is duplicate of deleted (action #12838: sporadic "corrupt images" in various tests or fails uploading, e.g. with "Premature connection close")
Actions #4

Updated by okurz over 7 years ago

  • Is duplicate of action #12178: worker can hang when killing isotovideo added