action #71185

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #62420: [epic] Distinguish all types of incompletes

job incompletes with auto_review:"setup failure: Cache service status error: Premature connection close":retry and does not retry, should we just automatically retry the connection?

Added by okurz 9 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2020-09-10
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

https://openqa.suse.de/tests/4663520 is incomplete with the reason "setup failure: Cache service status error: Premature connection close". The worker log
https://openqa.suse.de/tests/4663520/file/worker-log.txt gives more details:

[2020-09-09T12:48:28.0234 CEST] [debug] [pid:5715] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4663520/status
[2020-09-09T12:48:28.0344 CEST] [debug] [pid:5715] Linked asset "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP2-aarch64-Installtest.qcow2" to "/var/lib/openqa/pool/3/SLES-15-SP2-aarch64-Installtest.qcow2"
[2020-09-09T12:48:33.0422 CEST] [debug] [pid:5715] Updating status so job 4663520 is not considered dead.
[2020-09-09T12:48:33.0423 CEST] [debug] [pid:5715] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4663520/status
[2020-09-09T12:48:33.0515 CEST] [debug] [pid:5715] Linked asset "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP2-Installer-DVD-aarch64-GM-DVD1.iso" to "/var/lib/openqa/pool/3/SLE-15-SP2-Installer-DVD-aarch64-GM-DVD1.iso"
[2020-09-09T12:48:38.0603 CEST] [debug] [pid:5715] Updating status so job 4663520 is not considered dead.
[2020-09-09T12:48:38.0604 CEST] [debug] [pid:5715] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4663520/status
[2020-09-09T12:48:38.0716 CEST] [debug] [pid:5715] Linked asset "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP2-aarch64-Installtest-uefi-vars.qcow2" to "/var/lib/openqa/pool/3/SLES-15-SP2-aarch64-Installtest-uefi-vars.qcow2"
[2020-09-09T12:48:43.0759 CEST] [debug] [pid:5715] Updating status so job 4663520 is not considered dead.
[2020-09-09T12:48:43.0760 CEST] [debug] [pid:5715] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4663520/status
[2020-09-09T12:48:48.0809 CEST] [debug] [pid:5715] Updating status so job 4663520 is not considered dead.
[2020-09-09T12:48:48.0810 CEST] [debug] [pid:5715] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4663520/status
[2020-09-09T12:48:48.0844 CEST] [error] [pid:5715] Unable to setup job 4663520: Cache service status error: Premature connection close
[2020-09-09T12:48:48.0844 CEST] [debug] [pid:5715] Stopping job 4663520 from openqa.suse.de: 04663520-sle-15-SP2-Server-DVD-Incidents-Install-aarch64-Build:15836:openssl-1_1-qam-incidentinstall@aarch64-virtio - reason: setup failure
[2020-09-09T12:48:48.0845 CEST] [debug] [pid:5715] REST-API call: POST http://openqa.suse.de/api/v1/jobs/4663520/status
[2020-09-09T12:48:48.0917 CEST] [info] [pid:14619] Uploading autoinst-log.txt
[2020-09-09T12:48:48.0968 CEST] [info] [pid:14619] Uploading worker-log.txt

But then the job ends up incomplete and is not automatically retriggered. It is unclear to the user what should be done.

Acceptance criteria

  • AC1: "Cache service status error: Premature connection close" is prevented or handled with retries (either within the job or by retriggering the complete job)

Suggestions

  • Check in the cache service implementation whether we can have retries in this situation. If not, maybe mark the job as incomplete with a proper reason and ensure it is automatically retriggered.
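To illustrate the second half of the suggestion, here is a minimal sketch of how a job's reason could be checked against the `auto_review:"…":retry` marker used in this ticket's subject. The marker format is taken from the subject line; the function names and the matching logic are assumptions made for illustration and are not the actual auto-review implementation.

```python
import re

# Marker format as it appears in the ticket subject:
#   auto_review:"<regex>"        -> label only
#   auto_review:"<regex>":retry  -> label and retrigger
AUTO_REVIEW = re.compile(r'auto_review:"(?P<pattern>[^"]+)"(?P<retry>:retry)?')


def parse_auto_review(subject):
    """Return (pattern, retry) from a ticket subject, or None if no marker is present."""
    match = AUTO_REVIEW.search(subject)
    if match is None:
        return None
    return match.group("pattern"), match.group("retry") is not None


def should_retrigger(subject, reason):
    """Hypothetical helper: True if the marker asks for a retry and its pattern matches the reason."""
    parsed = parse_auto_review(subject)
    if parsed is None:
        return False
    pattern, retry = parsed
    return retry and re.search(pattern, reason) is not None
```

With the subject of this ticket and the reason reported for job 4663520, `should_retrigger` would return `True` in this sketch, which is the behaviour the ticket expects but did not observe.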

History

#1 Updated by okurz 9 months ago

  • Tags set to cache, connection, network, worker, auto_review, osd
  • Description updated (diff)
  • Status changed from New to Workable
  • Target version set to Ready

#2 Updated by kraih 8 months ago

The HTTP request in question (cache service status update) already has 3 retries with a 5 second sleep time for each retry. If this is a common issue, we could first try tweaking the defaults a bit, with more retries and/or a longer sleep time. If it's rare, we can probably just ignore the issue.
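As a rough sketch of the retry behaviour described here, the following shows a generic retry loop with a configurable number of attempts and sleep time. It is not the cache service client code: the exception class and function name are hypothetical, and treating "3 retries" as three attempts in total is an assumption.

```python
import time


class PrematureConnectionClose(Exception):
    """Hypothetical stand-in for the transient error seen in the worker log."""


def call_with_retries(request, attempts=3, sleep_time=5.0, sleep=time.sleep):
    """Call request() up to `attempts` times, sleeping between failed tries.

    The defaults mirror the values mentioned in the comment (3 and 5 seconds);
    raising either one is the "tweaking" under discussion.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except PrematureConnectionClose as error:
            last_error = error
            # Do not sleep after the final failed attempt.
            if attempt < attempts:
                sleep(sleep_time)
    raise last_error
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; a caller that wants more tolerance could use, for example, `call_with_retries(request, attempts=5, sleep_time=10.0)`.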

#3 Updated by kraih 8 months ago

  • Assignee set to kraih

#4 Updated by okurz 8 months ago

  • Parent task set to #62420

#5 Updated by okurz 7 months ago

#6 Updated by cdywan 7 months ago

I assume this is in progress. okurz, did you want to take this ticket, or is this a collaborative effort with kraih now?

#7 Updated by kraih 7 months ago

  • Assignee deleted (kraih)

This was supposed to be the next ticket in my queue.

#8 Updated by kraih 7 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

#9 Updated by okurz 7 months ago

  • Status changed from In Progress to Feedback

#67855 is impacting me and delaying my work …

#10 Updated by okurz 7 months ago

  • Status changed from Feedback to Resolved

PR merged. As the problem is not likely to happen that often, I will set the ticket to "Resolved" right away. If we still see the ticket referenced in the review of "auto-review", we should reconsider.
