Project

General

Profile

action #97658

many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M

Added by okurz about 2 months ago. Updated 22 days ago.

Status:
In Progress
Priority:
Low
Assignee:
Target version:
Start date:
2021-08-30
Due date:
2021-11-05
% Done:

0%

Estimated time:

Description

Observation

From https://matrix.to/#/!ilXMcHXPOjTZeauZcg:libera.chat/$VfpltxmcDxTLSzTVKh7fa5gFHn6TR3hdcX1cnNnBNlg?via=libera.chat&via=matrix.org&via=opensuse.org AdaLovelace asked

Does anybody know about network issues or anything in this direction between openQA and the mainframe? There are timeouts at the start of openQA tests. https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20210829&groupid=34

Suggestions

  • Try to recover the existing machine from a rescue system
  • If the hardware needs a replacement, file an infra ticket asking to replace hardware and boot a rescue system to install the machine
  • Ensure that o3 s390x openQA jobs are processed correctly

Out of scope

  • Reconsider backup strategy or replacement using containers

Related issues

Related to openSUSE admin - tickets #97691: Corrupted file system on rebel prevents openQA tests for s390xNew2021-08-30

Related to openQA Infrastructure - action #98307: Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:MResolved2021-09-08

Blocks openQA Infrastructure - action #93381: [O3]request to add an IPMI SUT to O3In Progress2021-06-022021-10-30

Copied to openQA Infrastructure - action #97751: replacement setup for o3 s390x openQA workers size:MResolved2021-09-17

History

#1 Updated by okurz about 2 months ago

  • Assignee set to okurz

#2 Updated by okurz about 2 months ago

From rebel:

rebel:~ # systemctl status openqa-worker-cacheservice
● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2021-08-28 03:33:24 CEST; 2 days ago
 Main PID: 1138 (openqa-workerca)
    Tasks: 5
   CGroup: /system.slice/openqa-worker-cacheservice.service
           ├─1138 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
           ├─1242 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
           ├─1243 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
           ├─1244 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
           └─1245 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80

Aug 28 03:33:24 rebel systemd[1]: Started OpenQA Worker Cache Service.
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-Tumbleweed-s390x-20210826-textmode@s390x-zVM-vswitch-l2.qcow2" b>
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: Web application available at http://127.0.0.1:9530
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: Web application available at http://[::1]:9530
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Cache size of "/var/lib/openqa/cache" is 50GiB, with limit 50GiB
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Listening at "http://127.0.0.1:9530"
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Listening at "http://[::1]:9530"
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Manager 1138 started
rebel:~ # systemctl status openqa-worker-cacheservice
openqa-worker-cacheservice-minion.service  openqa-worker-cacheservice.service         
rebel:~ # systemctl status openqa-worker-cacheservice
openqa-worker-cacheservice-minion.service  openqa-worker-cacheservice.service         
rebel:~ # systemctl status openqa-worker-cacheservice-minion.service 
● openqa-worker-cacheservice-minion.service - OpenQA Worker Cache Service Minion
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice-minion.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2021-08-28 03:33:24 CEST; 2 days ago
 Main PID: 1143 (openqa-workerca)
    Tasks: 1
   CGroup: /system.slice/openqa-worker-cacheservice-minion.service
           └─1143 /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks

Aug 28 14:02:37 rebel openqa-worker-cacheservice-minion[1143]: [5373] [i] [#3389] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 14:02:47 rebel openqa-worker-cacheservice-minion[1143]: [5377] [i] [#3390] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"
Aug 28 14:18:34 rebel openqa-worker-cacheservice-minion[1143]: [6392] [i] [#3391] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 14:18:45 rebel openqa-worker-cacheservice-minion[1143]: [6397] [i] [#3392] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"
Aug 28 14:37:38 rebel openqa-worker-cacheservice-minion[1143]: [7776] [i] [#3393] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 14:37:48 rebel openqa-worker-cacheservice-minion[1143]: [7790] [i] [#3394] Downloading: "opensuse-Tumbleweed-s390x-20210827-textmode@s390x-zVM-vswitch-l2.qcow2"
Aug 28 15:05:06 rebel openqa-worker-cacheservice-minion[1143]: [8574] [i] [#3395] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 15:05:17 rebel openqa-worker-cacheservice-minion[1143]: [8575] [i] [#3396] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"
Aug 28 15:51:53 rebel openqa-worker-cacheservice-minion[1143]: [9301] [i] [#3397] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 15:51:59 rebel openqa-worker-cacheservice-minion[1143]: [9302] [i] [#3398] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"

so the last messages from both services are from two days ago, no mention of the more recent requests from today. Trying to strace the process for openqa-worker-cacheservice-minion with strace -f -yy -p 1143 and I get

stat("/var/lib/openqa/cache/cache.sqlite", {st_mode=S_IFREG|0644, st_size=278528, ...}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/var/lib/openqa/cache/cache.sqlite-journal", 0x7ffeca4a0690) = -1 ENOENT (No such file or directory)
lseek(3</var/lib/openqa/cache/cache.sqlite>, 24, SEEK_SET) = 24
read(3</var/lib/openqa/cache/cache.sqlite>, "\0\1\247\277\0\0\0D\0\0\0\23\0\0\0(", 16) = 16
stat("/var/lib/openqa/cache/cache.sqlite-wal", 0x7ffeca4a0690) = -1 ENOENT (No such file or directory)
fstat(3</var/lib/openqa/cache/cache.sqlite>, {st_mode=S_IFREG|0644, st_size=278528, ...}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=1073741825, l_len=1}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=2}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
nanosleep({tv_sec=0, tv_nsec=500000000}, NULL) = 0
getpid()
… (repeats)

so likely stuck on the lock?

but in dmesg I see

[77131.442675] BTRFS info (device sda1): qgroup scan completed (inconsistency flag cleared)
[87038.736150] BTRFS info (device sda1): qgroup scan completed (inconsistency flag cleared)
[87040.724869] BTRFS error (device sda1): parent transid verify failed on 399785984 wanted 1232944 found 1252990
[87040.724875] BTRFS error (device sda1): parent transid verify failed on 399785984 wanted 1232944 found 1252990
[87040.725039] BTRFS warning (device sda1): Skipping commit of aborted transaction.
[87040.725042] BTRFS: error (device sda1) in cleanup_transaction:1818: errno=-5 IO failure
[87040.725043] BTRFS info (device sda1): forced readonly
[87040.725366] BTRFS info (device sda1): delayed_refs has NO entry
[93674.081962] systemd-journald[696]: Failed to write entry (20 items, 585 bytes), ignoring: Read-only file system
[93674.082182] systemd-journald[696]: Failed to write entry (20 items, 691 bytes), ignoring: Read-only file system
[93674.082400] systemd-journald[696]: Failed to write entry (20 items, 584 bytes), ignoring: Read-only file system
[93674.082680] systemd-journald[696]: Failed to write entry (20 items, 691 bytes), ignoring: Read-only file system

which is for the root filesystem which is a different filesystem than what we use for the openQA cache which would be sdb3 with ext4 (/var/lib/openqa) but still this is a special situation.

#3 Updated by okurz about 2 months ago

  • Assignee deleted (okurz)

I triggered a reboot with ipmi-rebel-ipmi power reset and connected to SoL with ipmi-rebel-ipmi sol activate. During reboot after BIOS, RAID BIOS and network boot, a PXE boot menu mentioning "o3" shows up with options to install "tumbleweed" and "leap 42.2". I wonder where that comes from. If I select to boot from local disk I end up with a black screen on SoL. Maybe the install options would allow to check the btrfs filesystem from outside and see if it can be recovered. Someone else can check?

#4 Updated by nicksinger about 2 months ago

  • Assignee set to nicksinger

#5 Updated by nicksinger about 2 months ago

no bootloader shown over IPMI even if I set the disk as bootdev 1. I assume it was never configured for the serial terminal. Trying to boot an opensuse iso from my samba share now to debug the setup on disk.

#6 Updated by nicksinger about 2 months ago

no success. In the meantime I even lost access to ipmi/sol due to a misconfiguration from my side in the bios. I was able to restore access and get "graphical output" over the supermicro "IPMIView" utility. I saw that btrfs throws a lot of errors while booting. Recovering from the systemd shell was not possible. I give the rescue media another shot

#7 Updated by nicksinger about 2 months ago

I was able to mount the FS with a grml live system. However as sonn as you write something btrfs bails out and mounts the disk R/O. According to my research this indicates bigger issues with the system. The disk build in there and failing is a "WDC WD3200YS-01PGB0" with serial "WD-WCAPD5141821". The RAID controller is a "LSI MegaRAID SAS 2108". I assume one of these components might be defective.

Smart shows "no errors recorded" for the disk. Running a smart self test shows after a couple of seconds that the SATA link resets:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Interrupted (host reset)      90%     11675         -

I'm not sure how to figure out if the disk is connected over the LSI controller. Therefore we might need physical help from infra.

#8 Updated by nicksinger about 2 months ago

The machine is currently reachable over ariel with the ip 192.168.112.11. There is a tmux session running if anybody is interested

#10 Updated by cdywan about 2 months ago

  • Related to tickets #97691: Corrupted file system on rebel prevents openQA tests for s390x added

#12 Updated by okurz about 2 months ago

  • Copied to action #97751: replacement setup for o3 s390x openQA workers size:M added

#13 Updated by cdywan about 2 months ago

  • Subject changed from many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" to many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M
  • Description updated (diff)
  • Status changed from New to In Progress

#14 Updated by openqa_review about 2 months ago

  • Due date set to 2021-09-15

Setting due date based on mean cycle time of SUSE QE Tools

#15 Updated by nicksinger about 2 months ago

  • Blocks action #93381: [O3]request to add an IPMI SUT to O3 added

#16 Updated by okurz about 2 months ago

mkittler volunteers to team up with nsinger to learn how to order replacement hardware. Please order a replacement storage device, ask EngInfra to replace it and see if that fixes the problem. If it does, good, if not then likely storage controller or something else is broken. In this case consider decomissioning the complete hardware or replace other components depending on RMA periods. Lookup racktables about support period.

#17 Updated by mgriessmeier about 2 months ago

okurz wrote:

mkittler volunteers to team up with nsinger to learn how to order replacement hardware. Please order a replacement storage device, ask EngInfra to replace it and see if that fixes the problem. If it does, good, if not then likely storage controller or something else is broken. In this case consider decomissioning the complete hardware or replace other components depending on RMA periods. Lookup racktables about support period.

so in general the process for such "small" amounts, not sure about the threshold though (iirc something around 600€) - works as follows:

  • open Jira-SD ticket (Hardware Request -> Non-Standard/Other Equipment) with business justification and link to the desired Hardware
  • await approval from line manager and SUSE-IT
  • two options after that: 1) Order yourself and expense afterwards through SAP Concur with reference to Jira-SD ticket or 2) Kindly ask SUSE-IT in the Jira-SD ticket to order it for you, though I never did this and don't know how good this works

for this particular issue I agreed with Nick to first check in the office early next week, because recently we found some "old" HDDs which are still sealed, so were not used at all, to check if we can use those.

#18 Updated by okurz about 2 months ago

  • Related to action #98307: Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M added

#19 Updated by cdywan about 1 month ago

  • Due date changed from 2021-09-15 to 2021-09-24
  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

Reducing priority due to #97751, aiming for next Wednesday but due two days after.

#20 Updated by nicksinger about 1 month ago

  • Status changed from Feedback to Blocked

We where in the office yesterday but unfortunately nobody from infra was there and we couldn't go into the server room. I prepared some HDDs now and placed them in my office and kindly asked infra (https://sd.suse.com/servicedesk/customer/portal/1/SD-61294) to build them in.

#21 Updated by nicksinger about 1 month ago

  • Due date changed from 2021-09-24 to 2021-10-01

#22 Updated by nicksinger 29 days ago

Update for people who don't see the Ticket:

Dear Nick Singer,

The ticket has been assigned to Eng-Infra queue to get picked up by an engineer from Engineering-Infra team.

It was 4 days ago. No further answers until now.

#23 Updated by okurz 27 days ago

  • Status changed from Blocked to Feedback

https://sd.suse.com/servicedesk/customer/portal/1/SD-61294 was resolved with message "disk replaced"

#24 Updated by okurz 26 days ago

  • Due date changed from 2021-10-01 to 2021-10-05

To be checked on-site at next possibility if not possible remotely

#25 Updated by nicksinger 22 days ago

  • Priority changed from High to Low

Tried to install Leap15.3 over PXE but no success so far. I went with IPXE as we don't have the necessary infrastructure in the o3 network to replicate what we have in the OSD PXE setup (e.g. mount to dist). What I tried so far is building ipxe (git.ipxe.org/ipxe.git) with the following, embedded script (myscript.ipxe):

#!ipxe
dhcp

kernel http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/linux usessh=1 sshpassword=linux network=1 install=http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/ console=ttyS1,115200n8 root=/dev/ram0 initrd=initrd textmode=1
initrd http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/initrd

boot
shell

To build it, you have to compile it like this (inside the scr directory): make -j$(nproc) EMBED=../myscript.ipxe.

It loads the kernel + initrd but always gets stuck while loading the initrd at around 70%. As I used a similar script for aarch64 workers in the past (and it worked there) I can't tell if the console parameter is just wrong. Until now I tried ttyS0 and ttyS1.

I also tried to load the iso directly with

chain --timeout 1800000 http://download.opensuse.org/distribution/leap/15.3/iso/openSUSE-Leap-15.3-NET-x86_64.iso

But this results in an "unknown exec format" which is apparently caused by regular BIOS booting which does not support loading of ISOs (unlike UEFI)

#26 Updated by nicksinger 22 days ago

  • Due date changed from 2021-10-05 to 2021-11-05
  • Status changed from Feedback to In Progress

right, also changing the priority here as we have workarounds for the host in place (containers). Setting the due date far ahead since the easy solutions did not work and we need to figure out how we can have working PXE on o3 infra.

#27 Updated by nicksinger 22 days ago

ariel with dnsmasq is our dhcp server which hands out the PXE configuration according to this config:

ariel:/etc/dnsmasq.d # cat pxeboot.conf
enable-tftp
tftp-root=/srv/tftpboot
pxe-prompt="Press F8 for menu.", 10
pxe-service=x86PC, "ipxe", ipxe.bin
pxe-service=x86PC, "pxelinux", pxelinux.0
#pxe-service=x86PC, "Boot from local disk"
#pxe-service=x86PC, "Install Linux", pxelinux

#28 Updated by nicksinger 22 days ago

I checked the BIOS of rebel and the configuration for console redirection was for "COM3". I then tried "ttyS2" and I finally could boot the installer medium. So this is the currently deployed /srv/tftpboot/ipxe.bin content which works for rebel:

#!ipxe
dhcp

kernel http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/linux usessh=1 sshpassword=linux network=1 install=http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/ console=ttyS2,115200n8 root=/dev/ram0 initrd=initrd textmode=1
initrd http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/initrd

boot
shell

Conducting the instillation over ssh now.

#29 Updated by nicksinger 22 days ago

Installation is done and openSUSE boots successfully. Next step is to setup the openqa-worker packages again and do general configuration.

Also available in: Atom PDF