action #64776
closed[cache][worker] cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retry
0%
Description
Observation¶
Observed on openqaworker6.s.d as reported in #64752 (ticket lost due to db recovery) the cache service out of a sudden failed to download any new requested assets. Jobs that rely on assets still existing in the cache are still running fine, any jobs that trigger a request for download any new asset incomplete. There are no obvious mentions in the log files what went wrong.
Acceptance criteria¶
- AC1: There is more detail than just "Failed to download", accessible to users
Suggestions¶
- Investigate if more details could be provided and forwarded to the details provided and logs and also the reason that is put on the job details web page.
- If possible, prevent the situation that was observed in #64752
Workaround¶
In #64752 it helped to restart the cache service and cache service minion
Further details¶
Original subject content: just "Failed to download", no further reason.
Updated by okurz over 4 years ago
- Related to action #65705: Tests can hang when waiting for a cache service request added
Updated by okurz over 4 years ago
- Target version set to Ready
https://openqa.suse.de/tests/4468047 is incomplete with "Reason: setup failure: Cache service status error: Premature connection close". This reason is not shown in logs albeit it should. The logs just say:
[2020-07-21T02:14:14.0696 CEST] [info] [pid:57475] Downloading sle-15-SP1-ppc64le-4.12.14-904.1.g6553721-Server-DVD-Incidents-Kernel@ppc64le-virtio-with-ltp.qcow2, request #6997 sent to Cache Service
[2020-07-21T02:14:19.0750 CEST] [info] [pid:57475] Download of sle-15-SP1-ppc64le-4.12.14-904.1.g6553721-Server-DVD-Incidents-Kernel@ppc64le-virtio-with-ltp.qcow2 processed:
Asset "sle-15-SP1-ppc64le-4.12.14-904.1.g6553721-Server-DVD-Incidents-Kernel@ppc64le-virtio-with-ltp.qcow2" was downloaded by #6996, details are therefore unavailable here
[2020-07-21T02:14:19.0752 CEST] [debug] [pid:57475] Found ISO, caching SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso
[2020-07-21T02:14:19.0754 CEST] [info] [pid:57475] Downloading SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso, request #6998 sent to Cache Service
[2020-07-21T02:14:29.0821 CEST] [info] [pid:57475] Download of SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso processed:
[info] [#6998] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#6998] Downloading "SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" from "http://openqa.suse.de/tests/4468047/asset/iso/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso"
[info] [#6998] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" has not changed, updating last use
[2020-07-21T02:14:29.0823 CEST] [info] [pid:57475] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #7000 sent to Cache Service
[2020-07-21T02:14:40.0242 CEST] [info] [pid:57475] +++ worker notes +++
[2020-07-21T02:14:40.0242 CEST] [info] [pid:57475] End time: 2020-07-21 00:14:40
[2020-07-21T02:14:40.0242 CEST] [info] [pid:57475] Result: setup failure
[2020-07-21T02:14:40.0250 CEST] [info] [pid:51685] Uploading autoinst-log.txt
Updated by okurz over 4 years ago
- Related to action #66988: Mojo::Reactor::Poll: I/O watcher failed: Can't use an undefined value as a HASH reference at /usr/share/openqa/script/../lib/OpenQA/CacheService/Response/Status.pm line 20. added
Updated by okurz over 4 years ago
- Priority changed from Normal to High
mdoucha has reported a problem in https://chat.suse.de/channel/testing?msg=JBH6XFvijNSRmJPHv that he links to #66988 and https://github.com/os-autoinst/openQA/pull/3211 which uncovered the real issue an a more observable way (or maybe even introduced the issue).
Updated by okurz over 4 years ago
- Related to action #69178: workaround for #64776 using https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issues added
Updated by kraih over 4 years ago
okurz wrote:
https://openqa.suse.de/tests/4468047 is incomplete with "Reason: setup failure: Cache service status error: Premature connection close". This reason is not shown in logs albeit it should.
It should be in the worker-log.txt
, because that's where we currently forward all worker internal errors. The good news is that this means the new cache service client logging works. The bad news is that there is no clues as to what caused it. Normally my first guess would be someone restarting the cache service. But i looked at the logs on the machine and there is nothing. So we are dealing with an unknown connection error between the worker and the cache service.
It's a generic web server error condition, so debugging it is tricky. Something could have simply blocked the process resulting in a server side timeout. So i see two solutions we could try, a) the simple one, we just increase the server side connection timeout, or b) the more complicated one, we add retry logic to the cache service client (say 5 times for a connection error with a few seconds sleep in between).
Updated by okurz over 4 years ago
kraih wrote:
okurz wrote:
https://openqa.suse.de/tests/4468047 is incomplete with "Reason: setup failure: Cache service status error: Premature connection close". This reason is not shown in logs albeit it should.
It should be in the
worker-log.txt
, because that's where we currently forward all worker internal errors.
Certainly we do not want to duplicate everything but the problem I see is that in case of incompletes without module details we show the content of autoinst-log.txt in the job's details but not worker-log.txt and other output from cache is shown there as well. Something feels inconsistent here. Maybe we need to move some logs to autoinst-log.txt or maybe we need to move more again from autoinst-log.txt to worker-log.txt and show the content on the job's details as well.
It's a generic web server error condition, so debugging it is tricky. Something could have simply blocked the process resulting in a server side timeout. So i see two solutions we could try, a) the simple one, we just increase the server side connection timeout, or b) the more complicated one, we add retry logic to the cache service client (say 5 times for a connection error with a few seconds sleep in between).
as we have "retry" in other cases as well so I think "b)" is better. However an alternative "c)" can be to just ensure that the job ends up with an incomplete reason that will trigger a retry on the level of the complete test job.
Updated by kraih over 4 years ago
okurz wrote:
Certainly we do not want to duplicate everything but the problem I see is that in case of incompletes without module details we show the content of autoinst-log.txt in the job's details but not worker-log.txt and other output from cache is shown there as well. Something feels inconsistent here. Maybe we need to move some logs to autoinst-log.txt or maybe we need to move more again from autoinst-log.txt to worker-log.txt and show the content on the job's details as well.
Agreed. The rules for which log messages go into which file do seem rather sketchy. Especially for errors bubbling up from the job into the worker context.
Updated by okurz over 4 years ago
- Subject changed from cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete, just "Failed to download", no further reason to cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete, just "Failed to download", no further reason. auto_review:"setup failure: Cache service status error: Premature connection close":retry
I think I found a way to cover #69178 using https://gitlab.suse.de/openqa/auto-review/ . For this we need the special subject text auto_review:"setup failure: Cache service status error: Premature connection close":retry
Updated by okurz over 4 years ago
- Subject changed from cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete, just "Failed to download", no further reason. auto_review:"setup failure: Cache service status error: Premature connection close":retry to cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retry
- Description updated (diff)
Updated by kraih over 4 years ago
As a first step i'm working on a retry feature for the cache service client, if we're lucky that's already enough to resolve the root problem.
Updated by okurz over 4 years ago
- Subject changed from cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retry to [cache][worker] cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retry
Updated by okurz over 4 years ago
- Tags changed from caching, openQA, sporadic, arm, ipmi, worker to caching, sporadic, worker
Updated by kraih over 4 years ago
Opened a PR for the retry feature. https://github.com/os-autoinst/openQA/pull/3301
Updated by livdywan over 4 years ago
- Status changed from In Progress to Feedback
The PR was merged _^