action #80334
closed
coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #62420: [epic] Distinguish all types of incompletes
job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger
Added by Xiaojing_liu about 4 years ago.
Updated over 3 years ago.
Category:
Regressions/Crashes
Description
Observation
The job https://openqa.suse.de/tests/5068680 shows that
[2020-11-25T05:35:36.0825 CET] [debug] [pid:123005] +++ worker notes +++
[2020-11-25T05:35:37.183 CET] [debug] Current version is 4.6.1605530625.31c8f336 [interface v20]
parse error in vars.json:
Usage: Cpanel::JSON::XS::decode(self, jsonstr, typesv= NULL) at /usr/lib/os-autoinst/bmwqemu.pm line 97.
[2020-11-25T05:35:37.185 CET] [debug] Unable to serialize fatal error: Can't write to file "base_state.json": No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 86.
123005: EXIT 1
[2020-11-25T05:35:37.0189 CET] [info] [pid:9534] Isotovideo exit status: 1
Impact
So far, only a single incomplete job seems to show this error.
Steps to reproduce
Find jobs referencing this ticket with the help of https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label . To look for this ticket, call openqa-query-for-job-label poo#80334
Problem
- There is an error message "No space left on device". It should be checked whether this was/is a general situation on the host powerqaworker-qam-1.
Suggestions
- Check on powerqaworker-qam-1 what happened, what was/is the available space at the time the issue happened
- openQA jobs of this kind should be automatically retriggered as likely another worker will succeed or the jobs on the same worker might succeed later.
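The first suggestion, checking the available space on the worker, could be sketched roughly as follows. This is only an illustration using Python's standard library; the function name, the path, and the threshold are assumptions, not something taken from openQA or from the worker's actual setup.

```python
import shutil


def free_space_report(path, min_free_bytes):
    """Return (free_bytes, ok) for the filesystem that contains `path`.

    `min_free_bytes` is a hypothetical threshold chosen by the caller.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return usage.free, usage.free >= min_free_bytes


if __name__ == "__main__":
    # "." and the 1 GiB threshold are placeholders for illustration only;
    # on a worker one would pass the directory the jobs write into.
    free, ok = free_space_report(".", 1024 ** 3)
    print(f"free bytes: {free}, enough space: {ok}")
```

Note that this only reports the current state. It cannot tell what the available space was at the time the job failed, which would need historical monitoring data.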
Workaround
Manually retrigger
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Description updated (diff)
- Status changed from New to Workable
- Target version set to Ready
Hm, but "No space left on device" is something that openQA can't fix. I checked the job and that was on powerqaworker-qam-1 and only one job
In general I suggest using https://progress.opensuse.org/projects/openqav3/wiki/#Defects as a template. It also links to a template extension for "auto_review" tickets, which you can copy-paste and adjust for the issue at hand. It is also important to know how many jobs are affected, e.g. how many jobs the auto-review pipeline reports with the same error. I checked and it was only one job, so it's completely fine to have this ticket with "normal" priority only :)
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Subject changed from job incompletes with auto_review:"terminated prematurely with corrupted state file" to job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger
- Description updated (diff)
- Category set to Regressions/Crashes
- Parent task set to #62420
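To illustrate what the new subject encodes, here is a minimal sketch that splits an auto_review label into its regex and its optional action, and shows why the leading (?s) matters: it lets ".*" span line breaks. The parser (LABEL_RE, parse_auto_review) and the sample text are hypothetical. The ticket does not state which text the real auto-review pipeline matches against, so the combination of a reason line and a log line below is an assumption.

```python
import re

# Hypothetical parser for labels of the form auto_review:"<regex>":<action>
LABEL_RE = re.compile(r'auto_review:"(?P<regex>.+?)"(?::(?P<action>\w+))?')


def parse_auto_review(subject):
    """Return (regex, action) from a ticket subject, or None if no label is found."""
    match = LABEL_RE.search(subject)
    if match is None:
        return None
    return match.group("regex"), match.group("action")


SUBJECT = (
    'job incompletes with auto_review:"(?s)terminated prematurely with '
    'corrupted state file.*No space left on device":retry , '
    "should automatically retrigger"
)

regex, action = parse_auto_review(SUBJECT)
pattern = re.compile(regex)

# Assumed input: the job reason followed by a log line, separated by a newline.
sample = (
    "terminated prematurely with corrupted state file\n"
    'Unable to serialize fatal error: Can\'t write to file "base_state.json": '
    "No space left on device\n"
)

with_dotall = pattern.search(sample) is not None
# Same pattern without (?s): "." no longer matches the newline in between.
without_dotall = re.search(regex.replace("(?s)", "", 1), sample) is not None

print(action, with_dotall, without_dotall)
```

Without the (?s) flag the two phrases would have to appear on the same line for the pattern to match.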
okurz wrote:
Hm, but "No space left on device" is something that openQA can't fix. I checked the job and that was on powerqaworker-qam-1 and only one job
Unless you count #76984 of course.
I just wanted to check whether the disk usage is critical on that worker but I can not SSH to it and the worker shows as offline on OSD as well.
@cdywan
Unless you count #76984 of course.
But this issue is definitely focusing on the web UI host. Here a worker ran out of disk space.
mkittler wrote:
I just wanted to check whether the disk usage is critical on that worker but I can not SSH to it and the worker shows as offline on OSD as well.
@cdywan
Unless you count #76984 of course.
But this issue is definitely focusing on the web UI host. Here a worker ran out of disk space.
Ah, you're right! My bad.
- Status changed from Workable to In Progress
- Assignee set to Xiaojing_liu
- Due date set to 2021-01-26
Setting due date based on mean cycle time of SUSE QE Tools
@mkittler The reason is "terminated prematurely with corrupted state file", and "no space left on device" is written in the autoinst-log. How about matching the reason and checking whether the autoinst-log includes "no space left"? If we only check the reason, we cannot confirm that it is related to "no space left".
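The two-step check proposed here could look roughly like the sketch below. It is not the code from the actual PR. The function name and the way the reason and log text are passed in are assumptions made for illustration.

```python
REASON_MARKER = "terminated prematurely with corrupted state file"
LOG_MARKER = "no space left on device"


def is_no_space_incomplete(reason, autoinst_log):
    """Return True only if the reason matches AND the log mentions missing disk space.

    Checking the reason alone would also catch corrupted state files that
    have nothing to do with a full disk.
    """
    reason_matches = REASON_MARKER in reason
    # Compare case-insensitively, since the log writes "No space left on device".
    log_matches = LOG_MARKER in autoinst_log.lower()
    return reason_matches and log_matches


log_excerpt = (
    "parse error in vars.json:\n"
    'Unable to serialize fatal error: Can\'t write to file "base_state.json": '
    "No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 86.\n"
)

print(is_no_space_incomplete(REASON_MARKER, log_excerpt))
print(is_no_space_incomplete(REASON_MARKER, "some unrelated log output"))
```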
- Status changed from In Progress to Feedback
Interestingly the latest run of the job passed, which would make this a bad example of a working fix.
Do we have an instance of the new error happening in the meanwhile?
cdywan wrote:
Interestingly the latest run of the job passed, which would make this a bad example of a working fix.
Do we have an instance of the new error happening in the meanwhile?
On OSD, I did not find an incomplete job with the reason "terminated prematurely: Encountered corrupted state file". I also did not find a job with the reason "terminated prematurely with corrupted state file" after the PR had been merged. It seems this issue has not happened in recent days.
@Xiaojing_liu the due date passed a long time ago; we should really try to get it "Resolved". What do you see as missing, if anything?
- Status changed from Feedback to Resolved
okurz wrote:
@Xiaojing_liu the due date passed a long time ago; we should really try to get it "Resolved". What do you see as missing, if anything?
Was my comment not clear enough? 🤓
I take it you're okay to set it Resolved w/o a trivial way to confirm it.
- Due date deleted (2021-01-26)
- Estimated time set to 80.00 h