action #80334
coordination #39719: [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier
coordination #62420: [epic] Distinguish all types of incompletes
job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger
Description
Observation
The job https://openqa.suse.de/tests/5068680 shows that
[2020-11-25T05:35:36.0825 CET] [debug] [pid:123005] +++ worker notes +++
[2020-11-25T05:35:37.183 CET] [debug] Current version is 4.6.1605530625.31c8f336 [interface v20]
parse error in vars.json: Usage: Cpanel::JSON::XS::decode(self, jsonstr, typesv= NULL) at /usr/lib/os-autoinst/bmwqemu.pm line 97.
[2020-11-25T05:35:37.185 CET] [debug] Unable to serialize fatal error: Can't write to file "base_state.json": No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 86.
123005: EXIT 1
[2020-11-25T05:35:37.0189 CET] [info] [pid:9534] Isotovideo exit status: 1
Impact
So far, this error appears to have caused only a single incomplete job.
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label . To look for this ticket, call: openqa-query-for-job-label poo#80334
Problem
- The log contains the error message "No space left on device". It should be checked whether this was, or still is, a general condition on the host powerqaworker-qam-1.
Suggestions
- Check on powerqaworker-qam-1 what happened and how much disk space was available when the issue occurred, and how much is available now
- openQA jobs of this kind should be retriggered automatically, because another worker will likely succeed, or jobs on the same worker might succeed later.
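As an illustration of the matching behind the second suggestion, here is a minimal sketch of how the pattern from the ticket subject behaves. The regex itself is taken from the subject; the function name `wants_retry` and the sample text layout are hypothetical and only serve the example. It is not how the auto-review tooling is implemented. The point is the `(?s)` prefix: without it, `.` does not match newlines, so the pattern could not span from the reason to a later log line.

```python
import re

# Pattern copied from the ticket subject. "(?s)" makes "." match newlines,
# so the two phrases may appear on different lines of the job text.
AUTO_REVIEW_PATTERN = re.compile(
    r"(?s)terminated prematurely with corrupted state file.*No space left on device"
)


def wants_retry(job_text: str) -> bool:
    """Return True if the given text matches the auto_review pattern.

    Hypothetical helper: 'job_text' stands for whatever text the pattern
    is applied to, e.g. the job reason followed by a log excerpt.
    """
    return AUTO_REVIEW_PATTERN.search(job_text) is not None


# Hypothetical sample: a reason line followed by a log line from this ticket.
sample = (
    "Reason: terminated prematurely with corrupted state file\n"
    "Unable to serialize fatal error: Can't write to file \"base_state.json\": "
    "No space left on device at /usr/lib/os-autoinst/bmwqemu.pm line 86.\n"
)

print(wants_retry(sample))
```

A text that contains only the first phrase does not match, so unrelated "corrupted state file" incompletes would not be retried by this pattern.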
Workaround
Manually retrigger
History
#1
Updated by okurz about 2 months ago
- Project changed from openQA Project to openQA Infrastructure
- Description updated (diff)
- Status changed from New to Workable
- Target version set to Ready
Hm, but "No space left on device" is something that openQA can't fix. I checked the job: it ran on powerqaworker-qam-1 and only one job was affected.
In general I suggest using https://progress.opensuse.org/projects/openqav3/wiki/#Defects
as a template. It also links to a template extension for "auto_review" tickets, which you can copy, paste, and adjust for the issue at hand. It is also important to know how many jobs are affected, e.g. how many jobs the auto-review pipeline reports with the same error. I checked and it was only one job, so it's completely fine to have this ticket with "normal" priority only :)
#2
Updated by okurz about 2 months ago
- Project changed from openQA Infrastructure to openQA Project
- Subject changed from job incompletes with auto_review:"terminated prematurely with corrupted state file" to job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger
- Description updated (diff)
- Category set to Concrete Bugs
#3
Updated by okurz about 2 months ago
- Parent task set to #62420
#4
Updated by cdywan about 2 months ago
okurz wrote:
Hm, but "No space left on device" is something that openQA can't fix. I checked the job: it ran on powerqaworker-qam-1 and only one job was affected.
Unless you count #76984 of course.
#5
Updated by mkittler about 1 month ago
#6
Updated by cdywan about 1 month ago
mkittler wrote:
I just wanted to check whether the disk usage is critical on that worker, but I cannot SSH into it and the worker shows as offline on OSD as well.
Unless you count #76984 of course.
But this issue is definitely focusing on the web UI host. Here a worker ran out of disk space.
Ah, you're right! My bad.
#7
Updated by mkittler about 1 month ago
With this PR, re-triggering such jobs automatically is just a matter of configuration: https://github.com/os-autoinst/openQA/pull/3624/files#diff-71734bf4d91118f92743c7d0db9218c4f24729410e020110884606b5ad6a3878R819-R829
So maybe we could just extend the default regex here.
#8
Updated by Xiaojing_liu 9 days ago
- Status changed from Workable to In Progress
- Assignee set to Xiaojing_liu
#9
Updated by openqa_review 8 days ago
- Due date set to 2021-01-26
Setting due date based on mean cycle time of SUSE QE Tools
#10
Updated by Xiaojing_liu 7 days ago
mkittler The reason is "terminated prematurely with corrupted state file", and "no space left on device" is written in the autoinst-log. How about matching the reason and also checking whether the autoinst-log includes "no space left"? If we only check the reason, we cannot confirm that the failure is related to "no space left".
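The two-step check proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `should_retrigger` and its parameters are hypothetical, the reason and the autoinst-log are assumed to be available as plain strings, and the log comparison is made case-insensitive because the log line in this ticket spells the message "No space left on device". It does not reflect actual openQA code.

```python
# Marker strings taken from this ticket.
REASON_MARKER = "terminated prematurely with corrupted state file"
LOG_MARKER = "no space left"


def should_retrigger(reason: str, autoinst_log: str) -> bool:
    """Decide whether an incomplete job looks like a disk-space failure.

    Hypothetical helper: both arguments are assumed to be plain strings
    holding the job's reason and the content of its autoinst-log.
    """
    # Step 1: the reason alone only says the state file was corrupted.
    if REASON_MARKER not in reason:
        return False
    # Step 2: the log has to confirm that the disk was actually full.
    return LOG_MARKER in autoinst_log.lower()


# Example with the log line quoted in the ticket description.
log_line = (
    "Unable to serialize fatal error: Can't write to file "
    "\"base_state.json\": No space left on device"
)
print(should_retrigger(REASON_MARKER, log_line))
```

Checking the reason first keeps the log inspection limited to jobs that already show the expected incomplete reason, which matches the concern that the reason alone is not specific enough.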