Project

General

Profile

action #65705

Tests can hang when waiting for a cache service request

Added by mkittler 11 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2020-04-16
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

observation

[16:43] <nilxam> hmmm... there are openqa jobs in running state 13 hours already...

These jobs were stuck waiting for the cache service to download an asset. The worker log looked like this:

Apr 16 17:53:48 openqaworker1 worker[4607]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 51
Apr 16 17:53:48 openqaworker1 worker[4607]: [info] +++ setup notes +++
Apr 16 17:53:48 openqaworker1 worker[4607]: [info] Running on openqaworker1:1 (Linux 4.12.14-lp151.28.44-default #1 SMP Fri Mar 20 18:20:20 UTC 2020 (dbf1aea) x86_64)
Apr 16 17:53:48 openqaworker1 worker[4607]: [info] Downloading openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256, request #19119 sent to Cache Service
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] Download of openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256 processed:
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] [#19119] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] [#19119] Downloading "openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256" from "http://openqa1-opensuse/tests/1236927/asset/other/openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256"
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] [#19119] Size of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256" is 118 Byte, with ETag ""76-5a367ac81fafe""
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] [#19119] Download of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging:E-Tumbleweed-DVD-x86_64-Build362.1-Media.iso.sha256" successful, new cache size is 49GiB
Apr 16 17:53:53 openqaworker1 worker[4607]: [info] Downloading openSUSE-Staging-DVD-x86_64.qcow2, request #19120 sent to Cache Service
[no further output for hours]

The o3 workers openqaworker1 and openqaworker4 were affected. The services openqa-worker-cacheservice and openqa-worker-cacheservice-minion were both active and had no errors in the log. When restarting a worker instance it did not complain that the cache service is not available. So the cache service was generally responsive and yet stuck with some requests.

On openqaworker1 I restarted the workers and the cache services. Some jobs are now actually running after that.

On openqaworker4 I only restarted the workers but not the cache services. All new jobs were hanging again. So it looks like this is a problem of the cache services. Then I stopped the cache services. Apparently that didn't free the workers/jobs. Instead only the following message is printed to the worker journal every few seconds:

Apr 16 18:30:14 openqaworker4 worker[3719]: Use of uninitialized value in string eq at /usr/share/openqa/script/../lib/OpenQA/CacheService/Response/Status.pm line 20.

Only after restarting the workers the jobs/workers are incompleted/freed.

suggestions

  • Investigate why the requests got stuck.
  • Check whether there is a general timeout for those request. Apparently the worker does not even stop if the cache service is not reachable at all anymore and just prints a Perl warning. The behavior of the worker should be improved when the cache service becomes unavailable or makes no progress.
  • Monitor the now restarted workers. Unfortunately it already looks like jobs would hang again.
  • Disable the workers temporarily if necassary to prevent stuck jobs.
  • Find out what has changed on these worker instances. The problem only occurs on openqaworker1 and openqaworker4 so there must be something special with these worker hosts. Since other worker hosts are not affected it is likely not a change in openQA itself (although openQA should neverthless prevent stuck jobs).

Related issues

Related to openQA Project - action #64776: [cache][worker] cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retryResolved2020-03-24

History

#1 Updated by okurz 11 months ago

  • Category set to Concrete Bugs

I am still assuming for now it's a regression, doesn't really matter though

#2 Updated by okurz 11 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

I can confirm that there are again some stuck jobs. I try to cleanup hard-core. systemctl stop openqa-worker.target; rm -rf /var/lib/openqa/cache/*; reboot on both w1 and w4

EDIT: After reboot all previous jobs were correctly incompleted and retriggered. Monitoring on both worker machines with journalctl -b -f -u openqa\* . Right now I could not find any tests stuck in "Running" but with no further progress. I will monitor.

#3 Updated by okurz 11 months ago

  • Related to action #64776: [cache][worker] cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retry added

#4 Updated by okurz 11 months ago

  • Status changed from In Progress to Resolved

the immediate issue seems to be gone, no tests are currently stuck. We also have #64776 still to follow up with

#5 Updated by mkittler 11 months ago

#64776 is different from what I observed yesterday. Yesterday there were not immediate incompletes due to failed downloads. The jobs were still running and would still be running right now if I hadn't restarted the workers.

This is also a different issue than the corruption of the sqlite database we sometimes saw in the past. (The sqlite error prevented the whole service to start so the workers were complaining that the cache service is not reachable. That's at least a helpful error message and it caused no workers/jobs to hang.)


In this issue case the cache service was even running and generally available and responsible. Hence there was no timeout or any kind of error. But still the download was stuck somewhere. Unfortunately I don't know the cache service well enough to be able to say why.

But since cleaning up the cache directory helped (and not restarting the cache service alone) it is clear that something in that directory caused the problem. (Or the restart made the difference but that's rather unlikely.) I suppose we will have to wait until it happens again and then save the cache directory somewhere to investigate further. Note that the cache directory should not be saved within a sub directory of the cache directory itself because it is possibly wiped out. (I've already fallen into that trap previously with the sqlite related error.)


So what do do ticket-wise? Reopen this one? Adjust #64776 so it actually matches yesterdays problem? (There were more than just one problems and I'm talking about the one I've reported here.) Create a new follow-up ticket?

#6 Updated by mkittler 11 months ago

Regarding Apr 16 18:30:14 openqaworker4 worker[3719]: Use of uninitialized value in string eq at /usr/share/openqa/script/../lib/OpenQA/CacheService/Response/Status.pm line 20.:

This could be easily "fixed" via (shift->data->{status} // '') eq 'downloading'. However, I'm wondering where this function is actually called. _progress_is_downloading in the helpers only acquires a lock and doesn't actually check whether anything is downloading.

#7 Updated by okurz 11 months ago

mkittler wrote:

So what do do ticket-wise? Reopen this one? Adjust #64776 so it actually matches yesterdays problem? (There were more than just one problems and I'm talking about the one I've reported here.) Create a new follow-up ticket?

I suggest to only open a new ticket if you have a new idea – like your suggestion further down – which we can actually do, otherwise let's just leave it, work on other cache related tickets first and then see what pops up again. You know, we have enough tickets :)

#8 Updated by kraih 10 months ago

mkittler wrote:

However, I'm wondering where this function is actually called. _progress_is_downloading in the helpers only acquires a lock and doesn't actually check whether anything is downloading.

It doesn't actually acquire a lock, it checks if a lock is currently active, which would mean there is a download in progress.

#9 Updated by kraih 10 months ago

Having looked into the problem a bit more. The exception handling around the download status API could definitely be improved (and i already have some ideas i will try). Unfortunately it looks like there is no copy of the cache service database from the time the problem occurred, so it's hard to say what went actually wrong. In the logs everything looks fine. The download wasn't even slow.

Feb 28 07:38:08 openqaworker1 openqa-workercache[2678]: [21163] [i] [#19120] Downloading: "openSUSE-Tumbleweed-NET-x86_64-Snapshot20200227-Media.iso"
Feb 28 07:38:08 openqaworker1 openqa-workercache[2678]: [21163] [i] [#19120] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
Feb 28 07:38:08 openqaworker1 openqa-workercache[2678]: [21163] [i] [#19120] Downloading "openSUSE-Tumbleweed-NET-x86_64-Snapshot20200227-Media.iso" from "http://openqa1-opensuse/tests/1189595/asset/iso/openSUSE-Tumbleweed-NET-x86_64-Snapshot20200227-Media.iso"
Feb 28 07:38:08 openqaworker1 openqa-workercache[2678]: [21164] [i] [#19121] Downloading: "openSUSE-Tumbleweed-DVD-x86_64-Snapshot20200227-Media.iso"
Feb 28 07:38:08 openqaworker1 openqa-workercache[2678]: [21164] [i] [#19121] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
Feb 28 07:38:08 openqaworker1 openqa-workercache[2678]: [21164] [i] [#19121] Downloading "openSUSE-Tumbleweed-DVD-x86_64-Snapshot20200227-Media.iso" from "http://openqa1-opensuse/tests/1189665/asset/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20200227-Media.iso"
Feb 28 07:38:08 openqaworker1 openqa-workercache[2678]: [21163] [i] [#19120] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-NET-x86_64-Snapshot20200227-Media.iso" has not changed, updating last use
Feb 28 07:38:08 openqaworker1 openqa-workercache[2678]: [21163] [i] [#19120] Finished download

Almost all cache service state is in the SQLite db, so next time you encounter a problem, keep a copy of it (before clearing the cache). That makes investigating much easier.

Also available in: Atom PDF