action #80334

closed

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #62420: [epic] Distinguish all types of incompletes

job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger

Added by Xiaojing_liu over 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2020-11-25
Due date:
% Done: 0%
Estimated time: 80.00 h

Description

Observation

The log of job https://openqa.suse.de/tests/5068680 shows the following:

[2020-11-25T05:35:36.0825 CET] [debug] [pid:123005] +++ worker notes +++
[2020-11-25T05:35:37.183 CET] [debug] Current version is 4.6.1605530625.31c8f336 [interface v20]
parse error in vars.json:
Usage: Cpanel::JSON::XS::decode(self, jsonstr, typesv= NULL) at /usr/lib/os-autoinst/bmwqemu.pm line 97.
[2020-11-25T05:35:37.185 CET] [debug] Unable to serialize fatal error: Can't write to file "base_state.json": No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 86.

123005: EXIT 1
[2020-11-25T05:35:37.0189 CET] [info] [pid:9534] Isotovideo exit status: 1

Impact

So far this seems to be only a single incomplete job with this error.

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label . To look for this ticket, call openqa-query-for-job-label poo#80334

Problem

  • There is an error message "No space left on device". It should be checked whether this was or still is a general situation on the host powerqaworker-qam-1.

Suggestions

  • Check on powerqaworker-qam-1 what happened and how much space was available when the issue occurred, as well as how much is available now.
  • openQA jobs of this kind should be automatically retriggered, as another worker will likely succeed or jobs on the same worker might succeed later.
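
The first suggestion could be approached with a small check like the following. This is only a sketch, not part of openQA: the function name free_space_report and the 5% threshold are assumptions chosen for illustration. It reports only the current state of the file system, not the state at the time the job failed.

```python
import shutil


def free_space_report(path="/", min_free_fraction=0.05):
    """Report free space on the file system containing `path`.

    The 5% default threshold is an arbitrary illustrative value.
    """
    usage = shutil.disk_usage(path)
    # Guard against pseudo file systems that report a total size of 0.
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return {
        "path": path,
        "free_gib": round(usage.free / 2**30, 2),
        "free_fraction": free_fraction,
        "low": free_fraction < min_free_fraction,
    }


if __name__ == "__main__":
    print(free_space_report("/"))
```

Run on the worker itself, this would need to be pointed at whichever directory the worker writes its state files to; that path is not stated in this ticket.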

Workaround

Manually retrigger

Actions #1

Updated by okurz over 3 years ago

  • Project changed from openQA Project to openQA Infrastructure
  • Description updated (diff)
  • Status changed from New to Workable
  • Target version set to Ready

Hm, but "No space left on device" is something that openQA can't fix. I checked the job and that was on powerqaworker-qam-1 and only one job

In general I suggest using https://progress.opensuse.org/projects/openqav3/wiki/#Defects
as a template. It also links to a template extension for "auto_review" tickets which you can copy, paste and adjust for the issue at hand. It is also important to know how many jobs are affected, e.g. how many jobs the auto-review pipeline reports with the same error. I checked and it was only one job, so it's completely fine to have this ticket with "normal" priority only :)

Actions #2

Updated by okurz over 3 years ago

  • Project changed from openQA Infrastructure to openQA Project
  • Subject changed from job incompletes with auto_review:"terminated prematurely with corrupted state file" to job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger
  • Description updated (diff)
  • Category set to Regressions/Crashes
Actions #3

Updated by okurz over 3 years ago

  • Parent task set to #62420
Actions #4

Updated by livdywan over 3 years ago

okurz wrote:

Hm, but "No space left on device" is something that openQA can't fix. I checked the job and that was on powerqaworker-qam-1 and only one job

Unless you count #76984 of course.

Actions #5

Updated by mkittler over 3 years ago

I just wanted to check whether the disk usage is critical on that worker but I can not SSH to it and the worker shows as offline on OSD as well.

@cdywan

Unless you count #76984 of course.

But this issue is definitely focusing on the web UI host. Here a worker ran out of disk space.

Actions #6

Updated by livdywan over 3 years ago

mkittler wrote:

I just wanted to check whether the disk usage is critical on that worker but I can not SSH to it and the worker shows as offline on OSD as well.

@cdywan

Unless you count #76984 of course.

But this issue is definitely focusing on the web UI host. Here a worker ran out of disk space.

Ah, you're right! My bad.

Actions #7

Updated by mkittler over 3 years ago

With this PR, re-triggering such jobs automatically is just a matter of configuration: https://github.com/os-autoinst/openQA/pull/3624/files#diff-71734bf4d91118f92743c7d0db9218c4f24729410e020110884606b5ad6a3878R819-R829
So maybe we could just extend the default regex here.
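
To illustrate what such a regex has to handle, here is a minimal Python sketch using the pattern from this ticket's subject. The sample text is an assumption: it simply joins the job reason and one log line from the description with a newline, which may not be how openQA or the auto-review scripts actually present the data. The point is only that the (?s) prefix lets "." match across line breaks.

```python
import re

# Pattern taken from the ticket subject; (?s) makes "." also match newlines.
PATTERN = re.compile(
    r"(?s)terminated prematurely with corrupted state file.*No space left on device"
)
# Same pattern without (?s), for comparison only.
PATTERN_NO_DOTALL = re.compile(
    r"terminated prematurely with corrupted state file.*No space left on device"
)

# Hypothetical combined text: the "Reason:" prefix and layout are assumed.
SAMPLE = (
    "Reason: terminated prematurely with corrupted state file\n"
    "[debug] Unable to serialize fatal error: Can't write to file "
    '"base_state.json": No space left on device at '
    "/usr/lib/os-autoinst/bmwqemu.pm line 86.\n"
)


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    print(matches(PATTERN, SAMPLE))
    print(matches(PATTERN_NO_DOTALL, SAMPLE))
```

Without (?s) the two phrases would have to appear on the same line, so the second check does not match the sample.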

Actions #8

Updated by Xiaojing_liu over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to Xiaojing_liu
Actions #9

Updated by openqa_review over 3 years ago

  • Due date set to 2021-01-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by Xiaojing_liu over 3 years ago

@mkittler The reason is "terminated prematurely with corrupted state file", and "No space left on device" is written in autoinst-log. How about matching the reason and then checking whether autoinst-log includes "No space left on device"? If we only check the reason, we cannot confirm that it is related to a lack of disk space.
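
The two-step check proposed above could look roughly like this in Python. It is a sketch of the idea only, not the code that was merged: the function name should_retry and its arguments are hypothetical, and the caller is assumed to have already fetched the job reason and the log text.

```python
import re

# Step 1: the job reason must indicate the corrupted state file.
REASON_RE = re.compile(r"terminated prematurely with corrupted state file")
# Step 2: the log must confirm that disk space was the cause.
LOG_RE = re.compile(r"No space left on device")


def should_retry(reason, autoinst_log):
    """Return True only if both the reason and the log match.

    `reason` and `autoinst_log` are plain strings (or None) supplied by
    the caller; how they are obtained is outside this sketch.
    """
    if not reason or not REASON_RE.search(reason):
        return False
    return LOG_RE.search(autoinst_log or "") is not None
```

Checking the reason first means the log only has to be inspected for jobs that already look like candidates.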

Actions #12

Updated by okurz over 3 years ago

  • Status changed from In Progress to Feedback

merged

Actions #13

Updated by livdywan about 3 years ago

Interestingly the latest run of the job passed, which would make this a bad example of a working fix.

Do we have an instance of the new error happening in the meanwhile?

Actions #14

Updated by Xiaojing_liu about 3 years ago

cdywan wrote:

Interestingly the latest run of the job passed, which would make this a bad example of a working fix.

Do we have an instance of the new error happening in the meanwhile?

On OSD, I did not find an incomplete job with the reason "terminated prematurely: Encountered corrupted state file". I also did not find a job with the reason "terminated prematurely with corrupted state file" after the PR had been merged. It seems this issue has not happened in recent days.

Actions #15

Updated by okurz about 3 years ago

@Xiaojing_liu the due date passed a long time ago, so we should really try to get this ticket "Resolved". What do you see as missing, if anything?

Actions #16

Updated by livdywan about 3 years ago

  • Status changed from Feedback to Resolved

okurz wrote:

@Xiaojing_liu the due date passed a long time ago, so we should really try to get this ticket "Resolved". What do you see as missing, if anything?

Was my comment not clear enough? 🤓

I take it you're okay to set it Resolved w/o a trivial way to confirm it.

Actions #17

Updated by okurz about 3 years ago

  • Due date deleted (2021-01-26)
Actions #18

Updated by Xiaojing_liu almost 3 years ago

  • Estimated time set to 80.00 h