action #65705
closedTests can hang when waiting for a cache service request
0%
Description
observation¶
[16:43] <nilxam> hmmm... there are openqa jobs in running state 13 hours already...
These jobs were stuck waiting for the cache service to download an asset. The worker log looked like this:
Apr 16 17:53:48 openqaworker1 worker[4607]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 51
Apr 16 17:53:48 openqaworker1 worker[4607]: [info] +++ setup notes +++
Apr 16 17:53:48 openqaworker1 worker[4607]: [info] Running on openqaworker1:1 (Linux 4.12.14-lp151.28.44-default #1 SMP Fri Mar 20 18:20:20 UTC 2020 (dbf1aea) x86_64)
Apr 16 17:53:48 openqaworker1 worker[4607]: [info] Downloading openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256, request #19119 sent to Cache Service
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] Download of openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256 processed:
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] [#19119] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] [#19119] Downloading "openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256" from "http://openqa1-opensuse/tests/1236927/asset/other/openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256"
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] [#19119] Size of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256" is 118 Byte, with ETag ""76-5a367ac81fafe""
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] [#19119] Download of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256" successful, new cache size is 49GiB
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] Downloading openSUSE-Staging-DVD-x86_64.qcow2, request #19120 sent to Cache Service
[no further output for hours]
The o3 workers openqaworker1
and openqaworker4
were affected. The services openqa-worker-cacheservice
and openqa-worker-cacheservice-minion
were both active and had no errors in the log. When restarting a worker instance it did not complain that the cache service is not available. So the cache service was generally responsive and yet stuck with some requests.
On openqaworker1 I restarted the workers and the cache services. Some jobs are now actually running after that.
On openqaworker4 I only restarted the workers but not the cache services. All new jobs were hanging again. So it looks like this is a problem of the cache services. Then I stopped the cache services. Apparently that didn't free the workers/jobs. Instead only the following message is printed to the worker journal every few seconds:
Apr 16 18:30:14 openqaworker4 worker[3719]: Use of uninitialized value in string eq at /usr/share/openqa/script/../lib/OpenQA/CacheService/Response/Status.pm line 20.
Only after restarting the workers the jobs/workers are incompleted/freed.
suggestions¶
- Investigate why the requests got stuck.
- Check whether there is a general timeout for those request. Apparently the worker does not even stop if the cache service is not reachable at all anymore and just prints a Perl warning. The behavior of the worker should be improved when the cache service becomes unavailable or makes no progress.
- Monitor the now restarted workers. Unfortunately it already looks like jobs would hang again.
- Disable the workers temporarily if necassary to prevent stuck jobs.
- Find out what has changed on these worker instances. The problem only occurs on openqaworker1 and openqaworker4 so there must be something special with these worker hosts. Since other worker hosts are not affected it is likely not a change in openQA itself (although openQA should neverthless prevent stuck jobs).