action #98307: Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #98307

closed

Added by punkioudi over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: mkittler
Category:
Target version: openQA Project (public) - Ready
Start date: 2021-09-08
Due date:
% Done:
Estimated time:

Description

Observation¶

This issue has been observed mainly in the latest runs of Leap 15.3 Updates/Backports (qemu x86_64). The latest build for Leap 15.3 Updates is https://openqa.opensuse.org/tests/overview?distri=opensuse&version=15.3&build=20210908-1&groupid=80 and its 8 failures all happened for the same reason: timeout: setup exceeded MAX_SETUP_TIME
e.g. https://openqa.opensuse.org/tests/1906875

Similar issues have also been spotted in the latest build for Tumbleweed: https://openqa.opensuse.org/tests/overview?distri=microos&distri=opensuse&version=Tumbleweed&build=20210906&groupid=1
e.g. https://openqa.opensuse.org/tests/1906246

Related issues 3 (0 open — 3 closed)

Updated by dimstar over 3 years ago

All failed/incomplete jobs I have looked at so far were assigned to openqaworker1; ow4 and ow7 are running their jobs correctly.

Updated by punkioudi over 3 years ago

Description updated (diff)

Updated by VANASTASIADIS over 3 years ago

Target version set to Ready

Updated by livdywan over 3 years ago

Is this different to #98307?

Updated by mkittler over 3 years ago

Status changed from New to Rejected

Duplicate of #98307. I'm closing this before we spread information across multiple tickets.

Updated by mkittler over 3 years ago

Status changed from Rejected to Workable
Assignee set to mkittler
Target version deleted (~~Ready~~)

Seems like this became #98307, very strange.

Updated by mkittler over 3 years ago

Target version set to Ready

Updated by mkittler over 3 years ago

Status changed from Workable to In Progress

Looks like many Minion jobs on that worker failed due to file system problems:

  Error in tempfile() using template /tmp/XXXXXXXXXX: Could not create temp file /tmp/PovS2CtH1p: Read-only file system at /usr/lib/perl5/vendor_perl/5.26.1/Capture/Tiny.pm line 360.

Judging by the output of

  select id, t_finished, result, reason,
         (select host from workers where id = assigned_worker_id) as worker
  from jobs
  where reason like '%setup exceeded MAX_SETUP_TIME%'
    and t_finished >= '2021-09-07T00:00:00'
  order by t_finished;

the problem is only on openqaworker1.
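
To see at a glance which worker the affected jobs ran on, the query above can be turned into a per-host count. The following is only a sketch against an in-memory SQLite database: the two-table schema is a hypothetical, heavily reduced stand-in for the real openQA database, and the job IDs and rows are invented sample data.

```python
import sqlite3

# Hypothetical minimal schema; the real openQA tables have many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workers (id INTEGER PRIMARY KEY, host TEXT);
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        t_finished TEXT,
        result TEXT,
        reason TEXT,
        assigned_worker_id INTEGER
    );
""")

# Invented sample data, only there to make the sketch runnable.
conn.executemany(
    "INSERT INTO workers VALUES (?, ?)",
    [(1, "openqaworker1"), (2, "openqaworker4"), (3, "openqaworker7")],
)
REASON = "timeout: setup exceeded MAX_SETUP_TIME"
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
    [
        (101, "2021-09-08T06:10:00", "timeout_exceeded", REASON, 1),
        (102, "2021-09-08T06:40:00", "timeout_exceeded", REASON, 1),
        (103, "2021-09-08T07:00:00", "passed", None, 2),
        (104, "2021-09-06T12:00:00", "timeout_exceeded", REASON, 3),  # before the cutoff
    ],
)


def timeouts_per_worker(conn, since):
    """Count jobs that hit MAX_SETUP_TIME since `since`, grouped by worker host."""
    rows = conn.execute(
        """
        SELECT (SELECT host FROM workers WHERE id = assigned_worker_id) AS worker,
               COUNT(*) AS failures
        FROM jobs
        WHERE reason LIKE '%setup exceeded MAX_SETUP_TIME%'
          AND t_finished >= ?
        GROUP BY worker
        ORDER BY failures DESC
        """,
        (since,),
    ).fetchall()
    return dict(rows)


print(timeouts_per_worker(conn, "2021-09-07T00:00:00"))
```

If a single host dominates the result, as openqaworker1 did here, the cause is more likely on that machine than in the tests or the web UI.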

Updated by favogt over 3 years ago

Btrfs errors forced read-only remounts:

[111995.247498] BTRFS error (device md1): Remounting read-write after error is not allowed
[111995.248598] systemd-journald[879]: Failed to write entry (9 items, 278 bytes), ignoring: Read-only file system
[111995.248698] systemd-journald[879]: Failed to write entry (9 items, 298 bytes), ignoring: Read-only file system
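
A quick way to confirm such a remount on a worker is to look for the `ro` option in `/proc/mounts`. The sketch below parses text in that format; the sample lines are illustrative and do not reflect the actual mount layout of openqaworker1.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    Expects text in the format of /proc/mounts:
    device mount_point fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


# Illustrative sample, not taken from the affected host.
sample = """\
/dev/md1 / btrfs ro,relatime,space_cache 0 0
/dev/md1 /tmp btrfs ro,relatime,space_cache 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
"""
print(read_only_mounts(sample))
```

On a live Linux host the input would be `open("/proc/mounts").read()`. Some mounts are read-only on purpose, so the result still needs a human look.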

#10

Updated by livdywan over 3 years ago

Subject changed from Many jobs in o3 fail with timeout_exceeded to Many jobs in o3 fail with timeout_exceeded on openqaworker1 size:M

#11

Updated by favogt over 3 years ago

The error wasn't visible in dmesg anymore and the journal couldn't be written, so I triggered a reboot to see the error:

[    9.666559] BTRFS error (device md1): parent transid verify failed on 1089191936 wanted 2071739 found 2100571
[    9.676867] BTRFS error (device md1): parent transid verify failed on 1089191936 wanted 2071739 found 2100571
[    9.687144] BTRFS warning (device md1): Skipping commit of aborted transaction.
[    9.687147] BTRFS: error (device md1) in cleanup_transaction:1818: errno=-5 IO failure
[    9.695421] BTRFS info (device md1): forced readonly
[    9.695524] BTRFS info (device md1): delayed_refs has NO entry
[    9.695951] BTRFS info (device md1): disk space caching is enabled
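
For monitoring, the two interesting message types in this output can be picked out with regular expressions. This is a sketch that only knows the exact message wording shown above; other kernel versions may phrase these messages differently, and the function name is made up for illustration.

```python
import re

# Patterns derived from the kernel messages quoted in this ticket.
TRANSID_RE = re.compile(
    r"BTRFS error \(device (?P<dev>\w+)\): parent transid verify failed "
    r"on (?P<block>\d+) wanted (?P<wanted>\d+) found (?P<found>\d+)"
)
READONLY_RE = re.compile(r"BTRFS info \(device (?P<dev>\w+)\): forced readonly")


def summarize_btrfs(dmesg_text):
    """Collect distinct transid mismatches and devices forced read-only."""
    summary = {"transid_failures": set(), "forced_readonly": set()}
    for line in dmesg_text.splitlines():
        m = TRANSID_RE.search(line)
        if m:
            summary["transid_failures"].add(
                (m["dev"], int(m["block"]), int(m["wanted"]), int(m["found"]))
            )
            continue
        m = READONLY_RE.search(line)
        if m:
            summary["forced_readonly"].add(m["dev"])
    return summary


log = """\
[    9.666559] BTRFS error (device md1): parent transid verify failed on 1089191936 wanted 2071739 found 2100571
[    9.676867] BTRFS error (device md1): parent transid verify failed on 1089191936 wanted 2071739 found 2100571
[    9.695421] BTRFS info (device md1): forced readonly
"""
print(summarize_btrfs(log))
```

Using sets collapses the repeated message for the same block into one entry.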

#12

Updated by mkittler over 3 years ago

After multiple reboots it looks good again and btrfs scrub returned no errors.

#13

Updated by favogt over 3 years ago

Status changed from In Progress to Resolved
% Done changed from 0 to 100

Marking as fixed, though I don't know why it suddenly decided to work again...

#14

Updated by okurz over 3 years ago

Status changed from Resolved to Feedback
% Done changed from 100 to 0

I think we need to do better here, as we have also seen other cases of "timeout_exceeded". E.g. there is https://openqa.suse.de/tests/7067000# with

Likely error from autoinst-log.txt:

[2021-09-10T07:32:41.0555 CEST] [info] [pid:8865] +++ setup notes +++
[2021-09-10T07:32:41.0555 CEST] [info] [pid:8865] Running on openqaworker5:19 (Linux 5.3.18-lp152.87-default #1 SMP Sun Aug 8 21:53:57 UTC 2021 (44d702a) x86_64)
[2021-09-10T07:32:41.0566 CEST] [debug] [pid:8865] Found HDD_1, caching publiccloud_15sp3_EC2_BYOS_Updates.qcow2
[2021-09-10T07:32:41.0569 CEST] [info] [pid:8865] Downloading publiccloud_15sp3_EC2_BYOS_Updates.qcow2, request #29489 sent to Cache Service
[2021-09-10T09:32:41.0800 CEST] [info] [pid:8865] +++ worker notes +++
[2021-09-10T09:32:41.0800 CEST] [info] [pid:8865] End time: 2021-09-10 07:32:41
[2021-09-10T09:32:41.0800 CEST] [info] [pid:8865] Result: timeout
[2021-09-10T09:32:41.0814 CEST] [info] [pid:16583] Uploading autoinst-log.txt

So this is another similar, likely related case. Mind the 2h gap between "Downloading …" and the end of the job.
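
Such a gap can be found mechanically by comparing the timestamps of consecutive log lines. The sketch below assumes the timestamp layout shown in the excerpt above, ignores the fractional part, and assumes all lines use the same time zone; `largest_gap` is a made-up helper name.

```python
import re
from datetime import datetime

# Matches e.g. "[2021-09-10T07:32:41.0555 CEST]" and captures the part up to seconds.
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+ \w+\]")


def largest_gap(log_text):
    """Return (gap_in_seconds, line_before, line_after) for the longest pause."""
    stamped = []
    for line in log_text.splitlines():
        m = TS_RE.match(line)
        if m:
            stamped.append((datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S"), line))
    best = (0.0, None, None)
    for (t0, before), (t1, after) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, before, after)
    return best


log = """\
[2021-09-10T07:32:41.0566 CEST] [debug] [pid:8865] Found HDD_1, caching publiccloud_15sp3_EC2_BYOS_Updates.qcow2
[2021-09-10T07:32:41.0569 CEST] [info] [pid:8865] Downloading publiccloud_15sp3_EC2_BYOS_Updates.qcow2, request #29489 sent to Cache Service
[2021-09-10T09:32:41.0800 CEST] [info] [pid:8865] +++ worker notes +++
[2021-09-10T09:32:41.0800 CEST] [info] [pid:8865] Result: timeout
"""
gap, before, after = largest_gap(log)
print(gap)
print(before)
print(after)
```

The line printed as `before` is the last thing the worker logged before going silent, which is usually the best starting point for further investigation.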

#15

Updated by okurz over 3 years ago

Subject changed from Many jobs in o3 fail with timeout_exceeded on openqaworker1 size:M to Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M

Added the auto_review expression from #96557 here. Feel free to move it to another open ticket if you prefer.
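
For readers unfamiliar with the convention: the subject carries a regular expression in `auto_review:"…"`, optionally followed by `:retry`, that is matched against the output of failed jobs. The following sketch only illustrates that idea; it is not the code of the actual labeling tooling, and the function name and the simple quote handling are assumptions.

```python
import re

# Assumption: the expression itself contains no double quotes.
AUTO_REVIEW_RE = re.compile(r'auto_review:"(?P<expr>.+?)"(?P<retry>:retry)?')


def parse_auto_review(subject):
    """Return (compiled_regex, retry_flag) from a ticket subject, or None."""
    m = AUTO_REVIEW_RE.search(subject)
    if not m:
        return None
    return re.compile(m["expr"]), m["retry"] is not None


subject = (
    "Many jobs in o3 fail with timeout_exceeded on openqaworker1 "
    'auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M'
)
pattern, retry = parse_auto_review(subject)

reason = "timeout: setup exceeded MAX_SETUP_TIME"
print(bool(pattern.search(reason)), retry)
```

A job whose reason or log matches the expression would be linked to this ticket, and with `:retry` set it would also be retriggered.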

#16

Updated by okurz over 3 years ago

Related to action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry added

#17

Updated by okurz over 3 years ago

Related to action #81828: Jobs run into timeout_exceeded after the 'chattr' call, no output until timeout, auto_review:"(?s)Refusing to save an empty state file to avoid overwriting a useful one.*Result: timeout":retry added

#18

Updated by okurz over 3 years ago

Related to action #97658: many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M added

#19

Updated by mkittler over 3 years ago

Status changed from Feedback to Resolved

The other job is on OSD, on a different worker, and has a different problem (the file system is not read-only on that worker). So it definitely makes more sense to open a different ticket. I'll look into the problem on openqaworker5 and into how many jobs generally exceed MAX_SETUP_TIME on OSD, and then create a ticket with the relevant information.
