action #98307
closed
Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M
0%
Description
Observation
This issue has been observed mainly in the latest runs of Leap 15.3 Updates/Backports (qemu x86_64). In the latest build for Leap 15.3 Updates, https://openqa.opensuse.org/tests/overview?distri=opensuse&version=15.3&build=20210908-1&groupid=80, all 8 failures happened for the same reason: timeout: setup exceeded MAX_SETUP_TIME
e.g. https://openqa.opensuse.org/tests/1906875
Similar issues have also been spotted in the latest build for Tumbleweed: https://openqa.opensuse.org/tests/overview?distri=microos&distri=opensuse&version=Tumbleweed&build=20210906&groupid=1
e.g. https://openqa.opensuse.org/tests/1906246
Updated by dimstar about 3 years ago
All failed/incomplete jobs I have looked at so far were assigned to openqaworker1; ow4 and ow7 are running the jobs correctly.
Updated by mkittler about 3 years ago
- Status changed from New to Rejected
Duplicate of #98307. I'm closing this before we spread information across multiple tickets.
Updated by mkittler about 3 years ago
- Status changed from Rejected to Workable
- Assignee set to mkittler
- Target version deleted (Ready)
Seems like this became #98307, very strange.
Updated by mkittler about 3 years ago
- Status changed from Workable to In Progress
Looks like many Minion jobs on that worker failed due to file system problems:
Error in tempfile() using template /tmp/XXXXXXXXXX: Could not create temp file /tmp/PovS2CtH1p: Read-only file system at /usr/lib/perl5/vendor_perl/5.26.1/Capture/Tiny.pm line 360.
Judging by the output of
select id, t_finished, result, reason, (select host from workers where id = assigned_worker_id) as worker from jobs where reason like '%setup exceeded MAX_SETUP_TIME%' and t_finished >= '2021-09-07T00:00:00' order by t_finished;
the problem is only on openqaworker1.
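For a quicker overview, the same filter can be aggregated per worker host. The following is only a minimal sketch: it runs against an in-memory SQLite database with a heavily simplified `jobs`/`workers` layout (only the columns used in the query above), and the rows and host names (`worker-a`, `worker-b`) are made up for illustration. The real openQA database is not SQLite and has many more columns.

```python
import sqlite3


def failures_per_worker(conn, since):
    """Count jobs that hit the setup timeout since `since`, grouped by worker host."""
    query = """
        select w.host, count(*) as failures
        from jobs j
        join workers w on w.id = j.assigned_worker_id
        where j.reason like '%setup exceeded MAX_SETUP_TIME%'
          and j.t_finished >= ?
        group by w.host
        order by failures desc, w.host
    """
    return conn.execute(query, (since,)).fetchall()


def demo_connection():
    """Build a tiny in-memory database with hypothetical rows (not real o3 data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table workers (id integer primary key, host text);
        create table jobs (
            id integer primary key,
            t_finished text,
            result text,
            reason text,
            assigned_worker_id integer
        );
        insert into workers values (1, 'worker-a'), (2, 'worker-b');
        insert into jobs values
            (1, '2021-09-06T10:00:00', 'timeout_exceeded',
             'timeout: setup exceeded MAX_SETUP_TIME', 1),
            (2, '2021-09-07T08:00:00', 'timeout_exceeded',
             'timeout: setup exceeded MAX_SETUP_TIME', 1),
            (3, '2021-09-08T09:00:00', 'timeout_exceeded',
             'timeout: setup exceeded MAX_SETUP_TIME', 1),
            (4, '2021-09-08T09:30:00', 'passed', null, 2);
    """)
    return conn


if __name__ == "__main__":
    for host, failures in failures_per_worker(demo_connection(), "2021-09-07T00:00:00"):
        print(host, failures)
```

If a single host dominates the result, that points at a worker-specific problem rather than a general one.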
Updated by favogt about 3 years ago
Btrfs errors forced read-only remounts:
[111995.247498] BTRFS error (device md1): Remounting read-write after error is not allowed
[111995.248598] systemd-journald[879]: Failed to write entry (9 items, 278 bytes), ignoring: Read-only file system
[111995.248698] systemd-journald[879]: Failed to write entry (9 items, 298 bytes), ignoring: Read-only file system
Updated by livdywan about 3 years ago
- Subject changed from Many jobs in o3 fail with timeout_exceeded to Many jobs in o3 fail with timeout_exceeded on openqaworker1 size:M
Updated by favogt about 3 years ago
The error wasn't visible in dmesg anymore and the journal couldn't be written, so I triggered a reboot to see the error:
[ 9.666559] BTRFS error (device md1): parent transid verify failed on 1089191936 wanted 2071739 found 2100571
[ 9.676867] BTRFS error (device md1): parent transid verify failed on 1089191936 wanted 2071739 found 2100571
[ 9.687144] BTRFS warning (device md1): Skipping commit of aborted transaction.
[ 9.687147] BTRFS: error (device md1) in cleanup_transaction:1818: errno=-5 IO failure
[ 9.695421] BTRFS info (device md1): forced readonly
[ 9.695524] BTRFS info (device md1): delayed_refs has NO entry
[ 9.695951] BTRFS info (device md1): disk space caching is enabled
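Since the journal could not be written while the file system was read-only, one way to spot this state without relying on logs is to look at the mount options directly. A minimal sketch, assuming the usual `/proc/mounts` line layout (device, mount point, file system type, options, ...); the function name `read_only_mounts` is hypothetical and the sample text in the test is made up:

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point, fs_type) for every mount carrying the 'ro' option."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        # Options are comma-separated; match 'ro' exactly, not e.g. 'errors=remount-ro'.
        if "ro" in options.split(","):
            result.append((device, mount_point, fs_type))
    return result


if __name__ == "__main__":
    # Linux only: /proc/mounts lists the currently mounted file systems.
    with open("/proc/mounts") as mounts:
        for device, mount_point, fs_type in read_only_mounts(mounts.read()):
            print(f"{mount_point} ({device}, {fs_type}) is mounted read-only")
```

Some mounts are read-only by design, so the output still needs a manual look; an unexpected `ro` on the root or `/tmp` file system would match the errors seen here.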
Updated by mkittler about 3 years ago
After multiple reboots it looks good again and btrfs scrub returned no errors.
Updated by favogt about 3 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Marking as fixed, though I don't know why it suddenly decided to work again...
Updated by okurz about 3 years ago
- Status changed from Resolved to Feedback
- % Done changed from 100 to 0
I think we need to do better here as we also have seen other cases of "timeout_exceeded". E.g. there is https://openqa.suse.de/tests/7067000# with
Likely error from autoinst-log.txt:
[2021-09-10T07:32:41.0555 CEST] [info] [pid:8865] +++ setup notes +++
[2021-09-10T07:32:41.0555 CEST] [info] [pid:8865] Running on openqaworker5:19 (Linux 5.3.18-lp152.87-default #1 SMP Sun Aug 8 21:53:57 UTC 2021 (44d702a) x86_64)
[2021-09-10T07:32:41.0566 CEST] [debug] [pid:8865] Found HDD_1, caching publiccloud_15sp3_EC2_BYOS_Updates.qcow2
[2021-09-10T07:32:41.0569 CEST] [info] [pid:8865] Downloading publiccloud_15sp3_EC2_BYOS_Updates.qcow2, request #29489 sent to Cache Service
[2021-09-10T09:32:41.0800 CEST] [info] [pid:8865] +++ worker notes +++
[2021-09-10T09:32:41.0800 CEST] [info] [pid:8865] End time: 2021-09-10 07:32:41
[2021-09-10T09:32:41.0800 CEST] [info] [pid:8865] Result: timeout
[2021-09-10T09:32:41.0814 CEST] [info] [pid:16583] Uploading autoinst-log.txt
So this is another similar, likely related case. Note the 2h gap between "Downloading …" and the end of the job.
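Gaps like this can be found mechanically by comparing consecutive timestamps in autoinst-log.txt. A rough sketch, assuming every relevant line starts with a `[YYYY-MM-DDTHH:MM:SS.ffff` timestamp as in the excerpt above and that all lines use the same time zone (the zone label is ignored); `parse_timestamp` and `largest_gap` are hypothetical helper names:

```python
import re
from datetime import datetime

# Leading timestamp as it appears in the log excerpt, e.g. "[2021-09-10T07:32:41.0569 CEST]".
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6})")


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if the line has none."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the biggest pause between timestamped lines."""
    best = None
    previous = None
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is None:
            continue  # lines without a timestamp do not take part in the comparison
        if previous is not None:
            gap = stamp - previous[0]
            if best is None or gap > best[0]:
                best = (gap, previous[1], line)
        previous = (stamp, line)
    return best


if __name__ == "__main__":
    with open("autoinst-log.txt") as log:
        found = largest_gap(log)
    if found is not None:
        gap, before, after = found
        print(f"Largest gap: {gap}")
        print(f"  before: {before.rstrip()}")
        print(f"  after:  {after.rstrip()}")
```

The line printed as "before" shows which step was running when the job stalled, here the cache service download request.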
Updated by okurz about 3 years ago
- Subject changed from Many jobs in o3 fail with timeout_exceeded on openqaworker1 size:M to Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M
Added the auto-review expression from #96557 here. Feel free to move it to another open ticket if you prefer.
Updated by okurz about 3 years ago
- Related to action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry added
Updated by okurz about 3 years ago
- Related to action #81828: Jobs run into timeout_exceeded after the 'chattr' call, no output until timeout, auto_review:"(?s)Refusing to save an empty state file to avoid overwriting a useful one.*Result: timeout":retry added
Updated by okurz about 3 years ago
- Related to action #97658: many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M added
Updated by mkittler about 3 years ago
- Status changed from Feedback to Resolved
The other job is on OSD, on a different worker, and shows a different problem (the file system is not read-only on that worker). So it definitely makes more sense to open a different ticket. I'll look into the problem on openqaworker5 and into how many jobs are generally exceeding MAX_SETUP_TIME on OSD, and then create a ticket with the relevant information.