action #80334: job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger - openQA Project (public) - openSUSE Project Management Tool

action #80334

closed

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #62420: [epic] Distinguish all types of incompletes

job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger

Added by Xiaojing_liu about 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee: Xiaojing_liu
Category: Regressions/Crashes
Target version: Ready
Start date: 2020-11-25
Due date:
% Done:
Estimated time: 80.00 h

Description

Observation

The log of job https://openqa.suse.de/tests/5068680 shows:

[2020-11-25T05:35:36.0825 CET] [debug] [pid:123005] +++ worker notes +++
[2020-11-25T05:35:37.183 CET] [debug] Current version is 4.6.1605530625.31c8f336 [interface v20]
parse error in vars.json:
Usage: Cpanel::JSON::XS::decode(self, jsonstr, typesv= NULL) at /usr/lib/os-autoinst/bmwqemu.pm line 97.
[2020-11-25T05:35:37.185 CET] [debug] Unable to serialize fatal error: Can't write to file "base_state.json": No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 86.

123005: EXIT 1
[2020-11-25T05:35:37.0189 CET] [info] [pid:9534] Isotovideo exit status: 1

Impact

So far this seems to be the only incomplete job with this error.

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label . To look for this ticket, call openqa-query-for-job-label poo#80334

Problem

The log contains the error message "No space left on device". It should be checked whether this was, or still is, a general condition on the host powerqaworker-qam-1.

Suggestions

Check on powerqaworker-qam-1 what happened and how much space was available when the issue happened, as well as how much is available now.
openQA jobs of this kind should be retriggered automatically, as another worker will likely succeed or jobs on the same worker might succeed later.
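
To illustrate how the auto_review label in this ticket's subject could be interpreted, here is a minimal Python sketch. It assumes the subject follows the form auto_review:"<regex>":<action> and that the job reason and log are searched as one combined text; the helper name parse_auto_review and the sample job_text are hypothetical and not taken from the openQA scripts.

```python
import re

# Assumed subject format: auto_review:"<regex>":<action> (action is optional).
TITLE_RE = re.compile(r'auto_review:"(?P<regex>.+?)"(?::(?P<action>\w+))?')


def parse_auto_review(subject):
    """Return (compiled_regex, action) from a ticket subject, or None."""
    match = TITLE_RE.search(subject)
    if match is None:
        return None
    return re.compile(match.group("regex")), match.group("action")


subject = (
    'job incompletes with auto_review:"(?s)terminated prematurely with '
    'corrupted state file.*No space left on device":retry , '
    "should automatically retrigger"
)

# Hypothetical combined text: the job reason followed by a log line.
job_text = (
    "Reason: terminated prematurely with corrupted state file\n"
    "[debug] Unable to serialize fatal error: Can't write to file "
    '"base_state.json": No space left on device\n'
)

pattern, action = parse_auto_review(subject)
# Retry only if the label asks for it and the pattern matches the job text.
should_retry = action == "retry" and pattern.search(job_text) is not None
```

The leading (?s) matters here: it lets ".*" span line breaks, so the reason and the "No space left on device" message may sit on different lines.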

Workaround

Manually retrigger the job.

Updated by okurz about 4 years ago

Project changed from openQA Project (public) to openQA Infrastructure (public)
Description updated (diff)
Status changed from New to Workable
Target version set to Ready

Hm, but "No space left on device" is something that openQA can't fix. I checked the job: it ran on powerqaworker-qam-1 and only one job was affected.

In general I suggest using https://progress.opensuse.org/projects/openqav3/wiki/#Defects
as a template. It also links to a template extension for "auto_review" tickets which you can copy, paste and adjust for the issue at hand. It is also important to know how many jobs are affected, e.g. how many jobs the auto-review pipeline reports with the same error. I checked and it was only one job, so it's completely fine to have this ticket with "normal" priority only :)

Updated by okurz about 4 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Subject changed from job incompletes with auto_review:"terminated prematurely with corrupted state file" to job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger
Description updated (diff)
Category set to Regressions/Crashes

Updated by okurz about 4 years ago

Parent task set to #62420

Updated by livdywan about 4 years ago

okurz wrote:

Hm, but "No space left on device" is something that openQA can't fix. I checked the job and that was on powerqaworker-qam-1 and only one job

Unless you count #76984 of course.

Updated by mkittler about 4 years ago

I just wanted to check whether the disk usage is critical on that worker, but I cannot SSH to it and the worker shows as offline on OSD as well.

@cdywan

Unless you count #76984 of course.

But this issue is definitely focusing on the web UI host. Here a worker ran out of disk space.
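
The disk-usage check mentioned above could look roughly like the following Python sketch, run on the worker itself once it is reachable. The function name free_space_report and the 10% threshold are assumptions for illustration only; the path to inspect (for example the worker's pool or cache directory) has to be supplied by whoever runs it.

```python
import shutil


def free_space_report(path, min_free_fraction=0.1):
    """Summarize disk usage for the filesystem that contains `path`.

    `min_free_fraction` is an arbitrary example threshold, not an openQA default.
    """
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return {
        "path": path,
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "free_fraction": free_fraction,
        "critical": free_fraction < min_free_fraction,
    }


# Example: inspect the filesystem of the current directory.
report = free_space_report(".")
print(report)
```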

Updated by livdywan about 4 years ago

mkittler wrote:

I just wanted to check whether the disk usage is critical on that worker but I can not SSH to it and the worker shows as offline on OSD as well.

@cdywan

Unless you count #76984 of course.

But this issue is definitely focusing on the web UI host. Here a worker ran out of disk space.

Ah, you're right! My bad.

Updated by mkittler about 4 years ago

With this PR, retriggering such jobs automatically is just a matter of configuration: https://github.com/os-autoinst/openQA/pull/3624/files#diff-71734bf4d91118f92743c7d0db9218c4f24729410e020110884606b5ad6a3878R819-R829
So maybe we could just extend the default regex there.
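
As a rough illustration of what "extending the regex" means, the Python sketch below builds a pattern from a list of job reasons and checks a reason against it. Nothing here is taken from the PR: the list contents, the placeholder first entry, and the function names build_retrigger_regex and should_retrigger are all hypothetical.

```python
import re

# Reasons that should lead to an automatic retrigger. The first entry is a
# made-up placeholder; the second is the reason reported in this ticket.
RETRIGGER_REASONS = [
    "example transient failure",
    "terminated prematurely with corrupted state file",
]


def build_retrigger_regex(reasons):
    """Combine literal reason prefixes into one anchored regex."""
    alternatives = "|".join(re.escape(reason) for reason in reasons)
    return re.compile(rf"^(?:{alternatives})")


def should_retrigger(reason, pattern):
    """Return True if the job reason starts with one of the known prefixes."""
    return reason is not None and pattern.search(reason) is not None


pattern = build_retrigger_regex(RETRIGGER_REASONS)
```

With this shape, covering a new kind of incomplete is a one-line change to the list rather than a code change.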

Updated by Xiaojing_liu almost 4 years ago

Status changed from Workable to In Progress
Assignee set to Xiaojing_liu

Updated by openqa_review almost 4 years ago

Due date set to 2021-01-26

Setting due date based on mean cycle time of SUSE QE Tools

#10

Updated by Xiaojing_liu almost 4 years ago

@mkittler The reason is "terminated prematurely with corrupted state file", while "No space left on device" is only written in autoinst-log. How about matching the reason and then checking whether autoinst-log contains "No space left on device"? If we only check the reason, we cannot confirm that the failure is related to running out of disk space.
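
The two-step check proposed here can be sketched in Python as follows. This is only an illustration of the idea, not the code from any PR; the function name is_out_of_space_incomplete is hypothetical, and the log excerpt is copied from the ticket description.

```python
REASON_MARKER = "terminated prematurely with corrupted state file"
LOG_MARKER = "No space left on device"


def is_out_of_space_incomplete(reason, autoinst_log):
    """Step 1: match the job reason. Step 2: confirm the cause in the log."""
    if REASON_MARKER not in reason:
        return False
    # Compare case-insensitively so "no space left" variants also match.
    return LOG_MARKER.lower() in autoinst_log.lower()


log_excerpt = (
    "parse error in vars.json:\n"
    "[debug] Unable to serialize fatal error: Can't write to file "
    '"base_state.json": No space left on device at '
    "/usr/lib/os-autoinst/bmwqemu.pm line 86.\n"
)

matches = is_out_of_space_incomplete(
    "terminated prematurely with corrupted state file", log_excerpt
)
```

Checking the reason first avoids scanning the log for jobs that cannot match anyway.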

#11

Updated by okurz almost 4 years ago

https://github.com/os-autoinst/openQA/pull/3672

#12

Updated by okurz almost 4 years ago

Status changed from In Progress to Feedback

merged

#13

Updated by livdywan almost 4 years ago

Interestingly the latest run of the job passed, which would make this a bad example of a working fix.

Do we have an instance of the new error happening in the meantime?

#14

Updated by Xiaojing_liu almost 4 years ago

cdywan wrote:

Interestingly the latest run of the job passed, which would make this a bad example of a working fix.

Do we have an instance of the new error happening in the meanwhile?

On OSD, I did not find an incomplete job with the reason "terminated prematurely: Encountered corrupted state file". I also did not find a job with the reason "terminated prematurely with corrupted state file" after the PR was merged. It seems this issue has not happened in recent days.

#15

Updated by okurz almost 4 years ago

@Xiaojing_liu the due date passed a long time ago, so we should really try to get this "Resolved". What do you see as missing, if anything?

#16

Updated by livdywan almost 4 years ago

Status changed from Feedback to Resolved

okurz wrote:

@Xiaojing_liu the due date has passed since a long time we should really try to get it "Resolved". What do you see as missing if anything?

Was my comment not clear enough? 🤓

I take it you're okay to set it Resolved w/o a trivial way to confirm it.

#17

Updated by okurz almost 4 years ago

Due date deleted (~~2021-01-26~~)

#18

Updated by Xiaojing_liu over 3 years ago

Estimated time set to 80.00 h
