Project

General

Profile

Actions

action #64776

closed

[cache][worker] cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retry

Added by okurz over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2020-03-24
Due date:
% Done:

0%

Estimated time:

Description

Observation

Observed on openqaworker6.s.d as reported in #64752 (ticket lost due to db recovery) the cache service out of a sudden failed to download any new requested assets. Jobs that rely on assets still existing in the cache are still running fine, any jobs that trigger a request for download any new asset incomplete. There are no obvious mentions in the log files what went wrong.

Acceptance criteria

  • AC1: There is more detail than just "Failed to download", accessible to users

Suggestions

  • Investigate if more details could be provided and forwarded to the details provided and logs and also the reason that is put on the job details web page.
  • If possible, prevent the situation that was observed in #64752

Workaround

In #64752 it helped to restart the cache service and cache service minion

Further details

Original subject content: just "Failed to download", no further reason.


Related issues 3 (0 open3 closed)

Related to openQA Project (public) - action #65705: Tests can hang when waiting for a cache service requestResolvedokurz2020-04-16

Actions
Related to openQA Project (public) - action #66988: Mojo::Reactor::Poll: I/O watcher failed: Can't use an undefined value as a HASH reference at /usr/share/openqa/script/../lib/OpenQA/CacheService/Response/Status.pm line 20.Resolvedkraih2020-05-18

Actions
Related to openQA Project (public) - action #69178: workaround for #64776 using https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issuesResolvedokurz2020-07-21

Actions
Actions #1

Updated by okurz over 4 years ago

  • Related to action #65705: Tests can hang when waiting for a cache service request added
Actions #2

Updated by okurz over 4 years ago

  • Target version set to Ready

https://openqa.suse.de/tests/4468047 is incomplete with "Reason: setup failure: Cache service status error: Premature connection close". This reason is not shown in logs albeit it should. The logs just say:

[2020-07-21T02:14:14.0696 CEST] [info] [pid:57475] Downloading sle-15-SP1-ppc64le-4.12.14-904.1.g6553721-Server-DVD-Incidents-Kernel@ppc64le-virtio-with-ltp.qcow2, request #6997 sent to Cache Service
[2020-07-21T02:14:19.0750 CEST] [info] [pid:57475] Download of sle-15-SP1-ppc64le-4.12.14-904.1.g6553721-Server-DVD-Incidents-Kernel@ppc64le-virtio-with-ltp.qcow2 processed:
Asset "sle-15-SP1-ppc64le-4.12.14-904.1.g6553721-Server-DVD-Incidents-Kernel@ppc64le-virtio-with-ltp.qcow2" was downloaded by #6996, details are therefore unavailable here
[2020-07-21T02:14:19.0752 CEST] [debug] [pid:57475] Found ISO, caching SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso
[2020-07-21T02:14:19.0754 CEST] [info] [pid:57475] Downloading SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso, request #6998 sent to Cache Service
[2020-07-21T02:14:29.0821 CEST] [info] [pid:57475] Download of SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso processed:
[info] [#6998] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#6998] Downloading "SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" from "http://openqa.suse.de/tests/4468047/asset/iso/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso"
[info] [#6998] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" has not changed, updating last use

[2020-07-21T02:14:29.0823 CEST] [info] [pid:57475] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #7000 sent to Cache Service
[2020-07-21T02:14:40.0242 CEST] [info] [pid:57475] +++ worker notes +++
[2020-07-21T02:14:40.0242 CEST] [info] [pid:57475] End time: 2020-07-21 00:14:40
[2020-07-21T02:14:40.0242 CEST] [info] [pid:57475] Result: setup failure
[2020-07-21T02:14:40.0250 CEST] [info] [pid:51685] Uploading autoinst-log.txt
Actions #3

Updated by okurz over 4 years ago

  • Related to action #66988: Mojo::Reactor::Poll: I/O watcher failed: Can't use an undefined value as a HASH reference at /usr/share/openqa/script/../lib/OpenQA/CacheService/Response/Status.pm line 20. added
Actions #4

Updated by okurz over 4 years ago

  • Priority changed from Normal to High

mdoucha has reported a problem in https://chat.suse.de/channel/testing?msg=JBH6XFvijNSRmJPHv that he links to #66988 and https://github.com/os-autoinst/openQA/pull/3211 which uncovered the real issue an a more observable way (or maybe even introduced the issue).

Actions #5

Updated by okurz over 4 years ago

  • Related to action #69178: workaround for #64776 using https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issues added
Actions #6

Updated by kraih over 4 years ago

okurz wrote:

https://openqa.suse.de/tests/4468047 is incomplete with "Reason: setup failure: Cache service status error: Premature connection close". This reason is not shown in logs albeit it should.

It should be in the worker-log.txt, because that's where we currently forward all worker internal errors. The good news is that this means the new cache service client logging works. The bad news is that there is no clues as to what caused it. Normally my first guess would be someone restarting the cache service. But i looked at the logs on the machine and there is nothing. So we are dealing with an unknown connection error between the worker and the cache service.

It's a generic web server error condition, so debugging it is tricky. Something could have simply blocked the process resulting in a server side timeout. So i see two solutions we could try, a) the simple one, we just increase the server side connection timeout, or b) the more complicated one, we add retry logic to the cache service client (say 5 times for a connection error with a few seconds sleep in between).

Actions #7

Updated by okurz over 4 years ago

kraih wrote:

okurz wrote:

https://openqa.suse.de/tests/4468047 is incomplete with "Reason: setup failure: Cache service status error: Premature connection close". This reason is not shown in logs albeit it should.

It should be in the worker-log.txt, because that's where we currently forward all worker internal errors.

Certainly we do not want to duplicate everything but the problem I see is that in case of incompletes without module details we show the content of autoinst-log.txt in the job's details but not worker-log.txt and other output from cache is shown there as well. Something feels inconsistent here. Maybe we need to move some logs to autoinst-log.txt or maybe we need to move more again from autoinst-log.txt to worker-log.txt and show the content on the job's details as well.

It's a generic web server error condition, so debugging it is tricky. Something could have simply blocked the process resulting in a server side timeout. So i see two solutions we could try, a) the simple one, we just increase the server side connection timeout, or b) the more complicated one, we add retry logic to the cache service client (say 5 times for a connection error with a few seconds sleep in between).

as we have "retry" in other cases as well so I think "b)" is better. However an alternative "c)" can be to just ensure that the job ends up with an incomplete reason that will trigger a retry on the level of the complete test job.

Actions #8

Updated by kraih over 4 years ago

okurz wrote:

Certainly we do not want to duplicate everything but the problem I see is that in case of incompletes without module details we show the content of autoinst-log.txt in the job's details but not worker-log.txt and other output from cache is shown there as well. Something feels inconsistent here. Maybe we need to move some logs to autoinst-log.txt or maybe we need to move more again from autoinst-log.txt to worker-log.txt and show the content on the job's details as well.

Agreed. The rules for which log messages go into which file do seem rather sketchy. Especially for errors bubbling up from the job into the worker context.

Actions #9

Updated by okurz over 4 years ago

  • Subject changed from cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete, just "Failed to download", no further reason to cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete, just "Failed to download", no further reason. auto_review:"setup failure: Cache service status error: Premature connection close":retry

I think I found a way to cover #69178 using https://gitlab.suse.de/openqa/auto-review/ . For this we need the special subject text auto_review:"setup failure: Cache service status error: Premature connection close":retry

Actions #10

Updated by okurz over 4 years ago

  • Subject changed from cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete, just "Failed to download", no further reason. auto_review:"setup failure: Cache service status error: Premature connection close":retry to cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retry
  • Description updated (diff)
Actions #11

Updated by kraih over 4 years ago

As a first step i'm working on a retry feature for the cache service client, if we're lucky that's already enough to resolve the root problem.

Actions #12

Updated by okurz over 4 years ago

  • Subject changed from cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retry to [cache][worker] cache service suddenly stopped to download assets, all subsequent jobs needing download incomplete auto_review:"setup failure: Cache service status error: Premature connection close":retry
Actions #13

Updated by okurz over 4 years ago

  • Tags set to caching, openQA, sporadic
Actions #14

Updated by kraih over 4 years ago

  • Status changed from Workable to In Progress
Actions #15

Updated by okurz over 4 years ago

  • Tags changed from caching, openQA, sporadic, arm, ipmi, worker to caching, sporadic, worker
Actions #16

Updated by kraih over 4 years ago

Opened a PR for the retry feature. https://github.com/os-autoinst/openQA/pull/3301

Actions #17

Updated by livdywan over 4 years ago

  • Status changed from In Progress to Feedback

The PR was merged _^

Actions #18

Updated by okurz over 4 years ago

  • Status changed from Feedback to Resolved
Actions

Also available in: Atom PDF