Project

General

Profile

action #81828

Jobs incomplete with auto_review:"(?s)Refusing to save an empty state file to avoid overwriting a useful one.*Result: timeout":retry

Added by dzedro 7 months ago. Updated 2 days ago.

Status:
Feedback
Priority:
High
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-01-06
Due date:
2021-08-10
% Done:

0%

Estimated time:
Difficulty:

Description

https://openqa.suse.de/tests/5251637/file/autoinst-log.txt

[2021-01-06T02:02:06.500 CET] [debug] Blocking SIGTERM
[2021-01-06T02:02:06.548 CET] [debug] Unblocking SIGTERM
7670: channel_out 16, channel_in 15
[2021-01-06T02:02:06.677 CET] [debug] Blocking SIGTERM
[2021-01-06T02:02:06.718 CET] [debug] Unblocking SIGTERM
7780: cmdpipe 14, rsppipe 17
[2021-01-06T02:02:06.722 CET] [debug] started mgmt loop with pid 7780
[2021-01-06T02:02:06.784 CET] [debug] qemu version detected: 4.2.1
[2021-01-06T02:02:06.786 CET] [debug] running /usr/bin/chattr -f +C /var/lib/openqa/pool/7/raid
[2021-01-06T07:02:04.371 CET] [debug] isotovideo received signal TERM
[2021-01-06T07:02:04.371 CET] [debug] backend got TERM
[2021-01-06T07:02:04.373 CET] [info] ::: OpenQA::Qemu::Proc::save_state: Refusing to save an empty state file to avoid overwriting a useful one
[2021-01-06T07:02:04.373 CET] [debug] Passing remaining frames to the video encoder
[2021-01-06T07:02:04.374 CET] [debug] Waiting for video encoder to finalize the video
[2021-01-06T07:02:04.374 CET] [debug] The built-in video encoder (pid 7859) terminated
[2021-01-06T07:02:04.375 CET] [debug] commands process exited: 0
[2021-01-06T07:02:04.376 CET] [debug] sending magic and exit
[2021-01-06T07:02:04.376 CET] [debug] received magic close
[2021-01-06T07:02:04.378 CET] [debug] [autotest] process exited: 1
[2021-01-06T07:02:04.383 CET] [debug] backend process exited: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 126.
[2021-01-06T07:02:04.483 CET] [debug] stopping autotest process 7701
[2021-01-06T07:02:04.483 CET] [debug] done with autotest process
7670: EXIT 1

Related issues

Related to openQA Project - action #96007: OpenQA jobs randomly time out during setup phaseBlocked2021-07-26

History

#1 Updated by dzedro 7 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Category set to Concrete Bugs

#2 Updated by okurz 7 months ago

  • Subject changed from Jobs incomplete with auto_review:"Refusing to save an empty state file to avoid overwriting a useful one":retry to Jobs incomplete with auto_review:"(?s)Refusing to save an empty state file to avoid overwriting a useful one.*Result: timeout":retry

What you mentioned as auto-review expression looks like an unimportant symptom to me. The more relevant lines are:

[2021-01-06T02:02:06.786 CET] [debug] running /usr/bin/chattr -f +C /var/lib/openqa/pool/7/raid
[2021-01-06T07:02:04.371 CET] [debug] isotovideo received signal TERM

5 h (!) have passed between the two lines and the job ends with "timeout_exceeded". But we can include that in the regex

#3 Updated by mkittler 7 months ago

Yes, this job has clearly just been aborted because the timeout has been exceeded. The job's result is timeout_exceeded and the MAX_JOB_TIME is in accordance with the actual runtime.

I doubt we can find out at this point what blocked chattr. Likely something IO related.

"Refusing to save an empty state file to avoid overwriting a useful one" is likely logged because the QEMU backend has already done first initializations at the time the test was aborted but the VM hasn't been started yet.

#4 Updated by AdamWill 7 days ago

I've been seeing this same thing periodically in Fedora openQA for the last while (I should really have noted which git snapshot update made it start happening, but I didn't, sigh). https://openqa.fedoraproject.org/tests/933232 is the most recent example. There doesn't seem to be much rhyme or reason to when or why it happens, it just...sometimes happens. Exact same behaviour, nothing seems to be logged after the chattr command until two hours later when the timeout is exceeded and the job gets shut down.

#5 Updated by okurz 3 days ago

  • Related to action #96007: OpenQA jobs randomly time out during setup phase added

#6 Updated by okurz 3 days ago

  • Priority changed from Normal to High
  • Target version changed from future to Ready

also seen in #96007 , adding to backlog

#7 Updated by okurz 3 days ago

  • Assignee set to okurz

I have an idea though. We should be able to extend the openQA config to also run openqa-review for "timeout_exceeded" jobs to catch such issues and automatically retry as workaround at least

#8 Updated by okurz 3 days ago

  • Status changed from New to In Progress

Manually added a line on osd in /etc/openqa/openqa.ini:

job_done_hook_timeout_exceeded = env host=openqa.suse.de /opt/os-autoinst-scripts/openqa-label-known-issues-hook

Triggering a job with way too short timeout to trigger an artificial timeout exceeded result:

openqa-clone-job --within-instance https://openqa.suse.de/tests/6569084 _GROUP=0 BUILD=poo81828_okurz MAX_JOB_TIME=1

-> https://openqa.suse.de/tests/6570359 which triggered the hook script.

Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/531 to include that change in persistent salt config.

EDIT: With echo -e "https://openqa.suse.de/tests/6552459\nhttps://openqa.suse.de/tests/6570359" | env host=openqa.suse.de dry_run=1 ./openqa-label-known-issues I verified that https://openqa.suse.de/tests/6552459 as mentioned in #96007 is detected by auto-review but the simple job I triggered https://openqa.suse.de/tests/6570359 is not as expected.

#9 Updated by openqa_review 3 days ago

  • Due date set to 2021-08-10

Setting due date based on mean cycle time of SUSE QE Tools

#10 Updated by okurz 2 days ago

  • Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/531 merged. Waiting some days to crosscheck for jobs in "timeout_exceeded". Then I can check if there are any jobs automatically labeled with this ticket by calling openqa-query-for-job-label poo#81828, currently none.

Also available in: Atom PDF