coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #62420: [epic] Distinguish all types of incompletes
job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger
The job https://openqa.suse.de/tests/5068680 shows that
[2020-11-25T05:35:36.0825 CET] [debug] [pid:123005] +++ worker notes +++
[2020-11-25T05:35:37.183 CET] [debug] Current version is 4.6.1605530625.31c8f336 [interface v20]
parse error in vars.json: Usage: Cpanel::JSON::XS::decode(self, jsonstr, typesv= NULL) at /usr/lib/os-autoinst/bmwqemu.pm line 97.
[2020-11-25T05:35:37.185 CET] [debug] Unable to serialize fatal error: Can't write to file "base_state.json": No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 86.
123005: EXIT 1
[2020-11-25T05:35:37.0189 CET] [info] [pid:9534] Isotovideo exit status: 1
So far this seems to be the only incomplete with this error.
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label , to look for this ticket call
- There is an error message "No space left on device". Check whether this was/is a general situation on the host powerqaworker-qam-1
- Check on powerqaworker-qam-1 what happened, what was/is the available space at the time the issue happened
- openQA jobs of this kind should be automatically retriggered as likely another worker will succeed or the jobs on the same worker might succeed later.
- Project changed from openQA Project to openQA Infrastructure
- Description updated (diff)
- Status changed from New to Workable
- Target version set to Ready
Hm, but "No space left on device" is something that openQA can't fix. I checked the job: it ran on powerqaworker-qam-1 and it was the only job affected.
In general I suggest to use https://progress.opensuse.org/projects/openqav3/wiki/#Defects
as a template which also links to a template extension for "auto_review" tickets which you can copy-paste and adjust for the issue at hand. What is also important to know is how many jobs are affected, e.g. how many jobs the auto-review pipeline reports with the same error. I checked and it was only one job so it's completely fine to have this ticket with "normal" priority only :)
- Project changed from openQA Infrastructure to openQA Project
- Subject changed from job incompletes with auto_review:"terminated prematurely with corrupted state file" to job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger
- Description updated (diff)
- Category set to Concrete Bugs
I just wanted to check whether the disk usage is critical on that worker, but I cannot SSH to it and the worker shows as offline on OSD as well.
Unless you count #76984 of course.
But this issue is definitely focusing on the web UI host. Here a worker ran out of disk space.
Ah, you're right! My bad.
With this PR re-triggering such jobs automatically is just a matter of configuration: https://github.com/os-autoinst/openQA/pull/3624/files#diff-71734bf4d91118f92743c7d0db9218c4f24729410e020110884606b5ad6a3878R819-R829
So maybe we would just extend the default regex here.
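To illustrate what such a regex has to cover, here is a minimal Python sketch using the expression from this ticket's subject. It assumes the expression is applied to text in which the reason and the "No space left on device" message sit on different lines, which is why the `(?s)` prefix matters. The `needs_retry` helper and the sample text are hypothetical and not part of openQA or the auto-review scripts.

```python
import re

# Expression copied from the ticket subject; (?s) lets "." also match newlines.
AUTO_REVIEW = re.compile(
    r"(?s)terminated prematurely with corrupted state file.*No space left on device"
)


def needs_retry(job_text: str) -> bool:
    """Hypothetical helper: True if the text matches the auto_review expression."""
    return AUTO_REVIEW.search(job_text) is not None


# Hypothetical sample: the reason and the log message are on separate lines.
sample = (
    "Reason: terminated prematurely with corrupted state file\n"
    "[debug] Unable to serialize fatal error: Can't write to file "
    '"base_state.json": No space left on device\n'
)

if __name__ == "__main__":
    print(needs_retry(sample))
```

Without the `(?s)` prefix the same expression would not match this sample, because `.` stops at the line break between the two messages.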
#10 Updated by Xiaojing_liu 5 months ago
@mkittler The reason is "terminated prematurely with corrupted state file", and "no space left on device" is written in the autoinst-log. How about matching the reason and also checking whether the autoinst-log includes "no space left"? If we only check the reason, we cannot confirm that it is related to "no space left".
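The two-step check proposed here could look roughly like the following Python sketch. It is only an illustration under assumptions: the function name, its parameters and both patterns are hypothetical, and how openQA would actually obtain the reason and the autoinst-log content is left out.

```python
import re

# Hypothetical patterns for illustration; the wording follows this ticket.
REASON_RE = re.compile(r"terminated prematurely with corrupted state file")
LOG_RE = re.compile(r"no space left on device", re.IGNORECASE)


def should_retrigger(reason: str, autoinst_log: str) -> bool:
    """Retrigger only if the reason matches AND the log confirms a full disk."""
    reason_matches = REASON_RE.search(reason) is not None
    log_confirms = LOG_RE.search(autoinst_log) is not None
    return reason_matches and log_confirms
```

Checking both inputs avoids retriggering jobs whose state file was corrupted for some other reason than a full disk.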
#14 Updated by Xiaojing_liu 4 months ago
Interestingly the latest run of the job passed, which would make this a bad example of a working fix.
Do we have an instance of the new error happening in the meanwhile?
On OSD, I did not find an incomplete job with the reason "terminated prematurely: Encountered corrupted state file". I also did not find a job with the reason "terminated prematurely with corrupted state file" after the PR had been merged. It seems this issue has not happened in recent days.
- Status changed from Feedback to Resolved
@Xiaojing_liu the due date passed a long time ago, so we should really try to get this "Resolved". What do you see as missing, if anything?
Was my comment not clear enough? 🤓
I take it you're okay to set it Resolved w/o a trivial way to confirm it.