action #98307

Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M

Added by punkioudi about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2021-09-08
Due date:
% Done: 0%
Estimated time:

Description

Observation

This issue has mainly been observed in the latest runs of Leap 15.3 Updates/Backports (qemu x86_64). In the latest build for Leap 15.3 Updates, https://openqa.opensuse.org/tests/overview?distri=opensuse&version=15.3&build=20210908-1&groupid=80, the 8 failures all happened for the same reason: timeout: setup exceeded MAX_SETUP_TIME
e.g. https://openqa.opensuse.org/tests/1906875

Similar issues have also been spotted in the latest build for Tumbleweed: https://openqa.opensuse.org/tests/overview?distri=microos&distri=opensuse&version=Tumbleweed&build=20210906&groupid=1
e.g. https://openqa.opensuse.org/tests/1906246


Related issues

Related to openQA Project - action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry (Resolved, 2021-08-04, 2021-08-19)

Related to openQA Project - action #81828: Jobs run into timeout_exceeded after the 'chattr' call, no output until timeout, auto_review:"(?s)Refusing to save an empty state file to avoid overwriting a useful one.*Result: timeout":retry (New, 2021-01-06)

Related to openQA Infrastructure - action #97658: many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M (In Progress, 2021-08-30, 2021-11-05)

History

#1 Updated by dimstar about 2 months ago

All failed/incomplete jobs I have looked at so far were attempted on openqaworker1; ow4 and ow7 are running jobs correctly.

#2 Updated by punkioudi about 2 months ago

  • Description updated (diff)

#3 Updated by VANASTASIADIS about 2 months ago

  • Target version set to Ready

#4 Updated by cdywan about 2 months ago

Is this different to #98307?

#5 Updated by mkittler about 2 months ago

  • Status changed from New to Rejected

Duplicate of #98307. I'm closing this before we spread information across multiple tickets.

#6 Updated by mkittler about 2 months ago

  • Status changed from Rejected to Workable
  • Assignee set to mkittler
  • Target version deleted (Ready)

Seems like this became #98307, very strange.

#7 Updated by mkittler about 2 months ago

  • Target version set to Ready

#8 Updated by mkittler about 2 months ago

  • Status changed from Workable to In Progress

Looks like many Minion jobs on that worker failed due to file system problems:

  Error in tempfile() using template /tmp/XXXXXXXXXX: Could not create temp file /tmp/PovS2CtH1p: Read-only file system at /usr/lib/perl5/vendor_perl/5.26.1/Capture/Tiny.pm line 360.

Judging by the output of

select id, t_finished, result, reason,
       (select host from workers where id = assigned_worker_id) as worker
  from jobs
 where reason like '%setup exceeded MAX_SETUP_TIME%'
   and t_finished >= '2021-09-07T00:00:00'
 order by t_finished;

the problem is only on openqaworker1.

#9 Updated by favogt about 2 months ago

Btrfs errors forced read-only remounts:

[111995.247498] BTRFS error (device md1): Remounting read-write after error is not allowed
[111995.248598] systemd-journald[879]: Failed to write entry (9 items, 278 bytes), ignoring: Read-only file system
[111995.248698] systemd-journald[879]: Failed to write entry (9 items, 298 bytes), ignoring: Read-only file system

#10 Updated by cdywan about 2 months ago

  • Subject changed from Many jobs in o3 fail with timeout_exceeded to Many jobs in o3 fail with timeout_exceeded on openqaworker1 size:M

#11 Updated by favogt about 2 months ago

The error wasn't visible in dmesg anymore and the journal couldn't be written, so I triggered a reboot to see the error:

[    9.666559] BTRFS error (device md1): parent transid verify failed on 1089191936 wanted 2071739 found 2100571
[    9.676867] BTRFS error (device md1): parent transid verify failed on 1089191936 wanted 2071739 found 2100571
[    9.687144] BTRFS warning (device md1): Skipping commit of aborted transaction.
[    9.687147] BTRFS: error (device md1) in cleanup_transaction:1818: errno=-5 IO failure
[    9.695421] BTRFS info (device md1): forced readonly
[    9.695524] BTRFS info (device md1): delayed_refs has NO entry
[    9.695951] BTRFS info (device md1): disk space caching is enabled
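Since dmesg had rotated and the journal could not be written, another way to spot a forced read-only remount is to look at the mount options directly. The following is only a minimal sketch, assuming the standard Linux /proc/mounts layout (device, mount point, file system type, comma-separated options); the function name `read_only_mounts` is hypothetical and not part of openQA.

```python
def read_only_mounts(mounts_text, fstypes=("btrfs",)):
    """Return (device, mount_point) pairs that are mounted read-only.

    mounts_text is the content of /proc/mounts. fstypes limits the check to
    the given file system types; pass None to check every mount.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fstype, options = fields[:4]
        if fstypes is not None and fstype not in fstypes:
            continue
        # "ro" must be a whole option, not a substring of e.g. "errors=remount-ro"
        if "ro" in options.split(","):
            result.append((device, mount_point))
    return result
```

On a worker this could be called as `read_only_mounts(open("/proc/mounts").read())`; a non-empty result for a file system that is expected to be writable would point to the same condition as the log above.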

#12 Updated by mkittler about 2 months ago

After multiple reboots it looks good again and btrfs scrub returned no errors.

#13 Updated by favogt about 2 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Marking as fixed, though I don't know why it suddenly decided to work again...

#14 Updated by okurz about 2 months ago

  • Status changed from Resolved to Feedback
  • % Done changed from 100 to 0

I think we need to do better here as we have also seen other cases of "timeout_exceeded". E.g. there is https://openqa.suse.de/tests/7067000# with

Likely error from autoinst-log.txt:

[2021-09-10T07:32:41.0555 CEST] [info] [pid:8865] +++ setup notes +++
[2021-09-10T07:32:41.0555 CEST] [info] [pid:8865] Running on openqaworker5:19 (Linux 5.3.18-lp152.87-default #1 SMP Sun Aug 8 21:53:57 UTC 2021 (44d702a) x86_64)
[2021-09-10T07:32:41.0566 CEST] [debug] [pid:8865] Found HDD_1, caching publiccloud_15sp3_EC2_BYOS_Updates.qcow2
[2021-09-10T07:32:41.0569 CEST] [info] [pid:8865] Downloading publiccloud_15sp3_EC2_BYOS_Updates.qcow2, request #29489 sent to Cache Service
[2021-09-10T09:32:41.0800 CEST] [info] [pid:8865] +++ worker notes +++
[2021-09-10T09:32:41.0800 CEST] [info] [pid:8865] End time: 2021-09-10 07:32:41
[2021-09-10T09:32:41.0800 CEST] [info] [pid:8865] Result: timeout
[2021-09-10T09:32:41.0814 CEST] [info] [pid:16583] Uploading autoinst-log.txt

So this is another similar, likely related case. Mind the 2h gap between "Downloading …" and the end of the job.
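Gaps like this can be located mechanically in a log excerpt. The sketch below assumes the timestamp format seen in the autoinst-log.txt excerpt above ("[YYYY-MM-DDTHH:MM:SS.ffff TZ]") and that all lines use the same time zone; the fractional part is ignored, and the name `largest_gap` is hypothetical.

```python
import re
from datetime import datetime

# Matches e.g. "[2021-09-10T07:32:41.0555 CEST]" at the start of a line
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+ \w+\]")


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest pause, or None.

    Lines without a leading timestamp are skipped. None is returned when
    fewer than two timestamped lines are found.
    """
    best = None
    previous = None  # (timestamp, line) of the last timestamped line
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if previous is not None:
            gap = stamp - previous[0]
            if best is None or gap > best[0]:
                best = (gap, previous[1], line)
        previous = (stamp, line)
    return best
```

Applied to the excerpt above, the longest pause is the one between the "Downloading …" line and "+++ worker notes +++".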

#15 Updated by okurz about 2 months ago

  • Subject changed from Many jobs in o3 fail with timeout_exceeded on openqaworker1 size:M to Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M

Added the auto-review expression from #96557 here. Feel free to move it to another open ticket if you prefer.
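To illustrate what such an expression in a subject line does, here is a rough sketch that pulls the quoted regex and the optional ":retry" flag out of a subject and tests it against some text. How the actual auto-review tooling extracts and applies the expression is not shown in this ticket, so the pattern and the helper names `parse_auto_review` and `matches` are assumptions for illustration only.

```python
import re

# Assumed subject syntax: auto_review:"<regex>" optionally followed by :retry
AUTO_REVIEW = re.compile(r'auto_review:"(?P<regex>[^"]+)"(?P<retry>:retry)?')


def parse_auto_review(subject):
    """Return (regex, retry_flag) from a ticket subject, or None if absent."""
    match = AUTO_REVIEW.search(subject)
    if match is None:
        return None
    return match.group("regex"), match.group("retry") is not None


def matches(subject, text):
    """Return True if the subject's auto_review regex is found in text."""
    parsed = parse_auto_review(subject)
    return parsed is not None and re.search(parsed[0], text) is not None
```

With the subject of this ticket, `matches` returns True for a job reason containing "timeout: setup exceeded MAX_SETUP_TIME", which is why the same expression on two open tickets would be ambiguous.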

#16 Updated by okurz about 2 months ago

  • Related to action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry added

#17 Updated by okurz about 2 months ago

  • Related to action #81828: Jobs run into timeout_exceeded after the 'chattr' call, no output until timeout, auto_review:"(?s)Refusing to save an empty state file to avoid overwriting a useful one.*Result: timeout":retry added

#18 Updated by okurz about 2 months ago

  • Related to action #97658: many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M added

#19 Updated by mkittler about 2 months ago

  • Status changed from Feedback to Resolved

The other job is on OSD, on a different worker, and shows a different problem (the file system is not read-only on that worker). So it definitely makes more sense to open a different ticket. I'll look into the problem on openqaworker5, check how many jobs on OSD are generally exceeding MAX_SETUP_TIME, and then create a ticket with the relevant information.
