action #80334
coordination #39719 (closed): [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #62420: [epic] Distinguish all types of incompletes
job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger
Description
Observation
The job https://openqa.suse.de/tests/5068680 shows that:
[2020-11-25T05:35:36.0825 CET] [debug] [pid:123005] +++ worker notes +++
[2020-11-25T05:35:37.183 CET] [debug] Current version is 4.6.1605530625.31c8f336 [interface v20]
parse error in vars.json:
Usage: Cpanel::JSON::XS::decode(self, jsonstr, typesv= NULL) at /usr/lib/os-autoinst/bmwqemu.pm line 97.
[2020-11-25T05:35:37.185 CET] [debug] Unable to serialize fatal error: Can't write to file "base_state.json": No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 86.
123005: EXIT 1
[2020-11-25T05:35:37.0189 CET] [info] [pid:9534] Isotovideo exit status: 1
Impact
So far there seems to be only a single incomplete job with this error.
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label . To look for this ticket, call openqa-query-for-job-label poo#80334
Problem
- The log contains the error message "No space left on device". It should be checked whether this was/is a general situation on the host powerqaworker-qam-1
Suggestions
- Check on powerqaworker-qam-1 what happened and how much disk space was/is available at the time the issue occurred
- openQA jobs of this kind should be automatically retriggered, as another worker will likely succeed or jobs on the same worker might succeed later.
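The auto_review expression in the subject starts with (?s) because the reason and the "No space left on device" message sit on different lines. A minimal Python sketch of that effect follows; the sample text is an illustrative combination of the reason and a shortened log line, and the helper name matches_known_issue is hypothetical, not part of openQA or the auto-review scripts.

```python
import re

# Pattern copied from the ticket subject. The leading (?s) makes "." match
# newlines, so ".*" can span several log lines.
PATTERN = r"(?s)terminated prematurely with corrupted state file.*No space left on device"


def matches_known_issue(text: str, pattern: str = PATTERN) -> bool:
    """Return True if the pattern is found anywhere in the text (hypothetical helper)."""
    return re.search(pattern, text) is not None


# Illustrative sample only: the reason on one line, the log message on the next.
sample = (
    "terminated prematurely with corrupted state file\n"
    "[debug] Unable to serialize fatal error: Can't write to file "
    '"base_state.json": No space left on device\n'
)

with_flag = matches_known_issue(sample)
# Same pattern without (?s): "." stops at the newline, so nothing matches.
without_flag = matches_known_issue(sample, PATTERN.replace("(?s)", "", 1))
```

With the flag the sample matches; without it the newline between the two messages prevents a match.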
Workaround
Manually retrigger
Updated by okurz about 4 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Description updated (diff)
- Status changed from New to Workable
- Target version set to Ready
Hm, but "No space left on device" is something that openQA can't fix. I checked the job and that was on powerqaworker-qam-1 and only one job
In general I suggest using https://progress.opensuse.org/projects/openqav3/wiki/#Defects
as a template, which also links to a template extension for "auto_review" tickets that you can copy-paste and adjust for the issue at hand. It is also important to know how many jobs are affected, e.g. how many jobs the auto-review pipeline reports with the same error. I checked and it was only one job, so it's completely fine to have this ticket with "normal" priority only :)
Updated by okurz about 4 years ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Subject changed from job incompletes with auto_review:"terminated prematurely with corrupted state file" to job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger
- Description updated (diff)
- Category set to Regressions/Crashes
Updated by livdywan about 4 years ago
okurz wrote:
Hm, but "No space left on device" is something that openQA can't fix. I checked the job and that was on powerqaworker-qam-1 and only one job
Unless you count #76984 of course.
Updated by mkittler about 4 years ago
I just wanted to check whether the disk usage is critical on that worker but I can not SSH to it and the worker shows as offline on OSD as well.
@cdywan
Unless you count #76984 of course.
But this issue is definitely focusing on the web UI host. Here a worker ran out of disk space.
Updated by livdywan about 4 years ago
mkittler wrote:
I just wanted to check whether the disk usage is critical on that worker but I can not SSH to it and the worker shows as offline on OSD as well.
@cdywan
Unless you count #76984 of course.
But this issue is definitely focusing on the web UI host. Here a worker ran out of disk space.
Ah, you're right! My bad.
Updated by mkittler about 4 years ago
With this PR re-triggering such jobs automatically is just a matter of configuration: https://github.com/os-autoinst/openQA/pull/3624/files#diff-71734bf4d91118f92743c7d0db9218c4f24729410e020110884606b5ad6a3878R819-R829
So maybe we could just extend the default regex here.
Updated by Xiaojing_liu almost 4 years ago
- Status changed from Workable to In Progress
- Assignee set to Xiaojing_liu
Updated by openqa_review almost 4 years ago
- Due date set to 2021-01-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by Xiaojing_liu almost 4 years ago
@mkittler The reason is "terminated prematurely with corrupted state file", and "no space left on device" is written in autoinst-log. How about matching the reason and checking if the autoinst-log includes "no space left"? If we only check the reason, we cannot confirm it's related to "no space left".
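The two-step check proposed above could be sketched as follows. This is only an illustration under assumptions: the function name should_retry and its arguments are hypothetical and do not reflect how openQA actually implements retriggering.

```python
import re

# Step 1: the job reason must indicate the corrupted state file.
REASON_RE = re.compile(r"terminated prematurely with corrupted state file")
# Step 2: the autoinst-log must mention the disk-space problem.
LOG_MARKER = "no space left"


def should_retry(reason: str, autoinst_log: str) -> bool:
    """Hypothetical helper: retry only if both the reason and the log agree."""
    reason_matches = REASON_RE.search(reason) is not None
    log_matches = LOG_MARKER in autoinst_log.lower()
    return reason_matches and log_matches
```

Checking only the reason would also retrigger jobs whose state file was corrupted for other causes, which is the concern raised in the comment.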
Updated by okurz almost 4 years ago
Updated by livdywan almost 4 years ago
Interestingly the latest run of the job passed, which would make this a bad example of a working fix.
Do we have an instance of the new error happening in the meanwhile?
Updated by Xiaojing_liu almost 4 years ago
cdywan wrote:
Interestingly the latest run of the job passed, which would make this a bad example of a working fix.
Do we have an instance of the new error happening in the meanwhile?
On OSD, I did not find an incomplete job with reason "terminated prematurely: Encountered corrupted state file". I also did not find a job with reason "terminated prematurely with corrupted state file" after the PR had been merged. It seems this issue has not happened in recent days.
Updated by okurz almost 4 years ago
@Xiaojing_liu the due date passed a long time ago; we should really try to get this "Resolved". What do you see as missing, if anything?
Updated by livdywan almost 4 years ago
- Status changed from Feedback to Resolved
okurz wrote:
@Xiaojing_liu the due date passed a long time ago; we should really try to get this "Resolved". What do you see as missing, if anything?
Was my comment not clear enough? 🤓
I take it you're okay to set it Resolved w/o a trivial way to confirm it.