Project

General

Profile

action #97658

many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M

Added by okurz 9 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Target version:
Start date:
2021-08-30
Due date:
% Done:

0%

Estimated time:

Description

Observation

From https://matrix.to/#/!ilXMcHXPOjTZeauZcg:libera.chat/$VfpltxmcDxTLSzTVKh7fa5gFHn6TR3hdcX1cnNnBNlg?via=libera.chat&via=matrix.org&via=opensuse.org AdaLovelace asked

Does anybody know about network issues or anything in this direction between openQA and the mainframe? There are timeouts at the start of openQA tests. https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20210829&groupid=34

Suggestions

  • Try to recover the existing machine from a rescue system
  • If the hardware needs a replacement, file an infra ticket asking to replace hardware and boot a rescue system to install the machine
  • Ensure that o3 s390x openQA jobs are processed correctly

Out of scope

  • Reconsider backup strategy or replacement using containers

Related issues

Related to openSUSE admin - tickets #97691: Corrupted file system on rebel prevents openQA tests for s390xClosed2021-08-30

Related to openQA Infrastructure - action #98307: Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:MResolved2021-09-08

Blocks openQA Infrastructure - action #93381: [O3]request to add an IPMI SUT to O3 size:MResolved2021-06-02

Copied to openQA Infrastructure - action #97751: replacement setup for o3 s390x openQA workers size:MResolved2021-09-17

History

#1 Updated by okurz 9 months ago

  • Assignee set to okurz

#2 Updated by okurz 9 months ago

From rebel:

rebel:~ # systemctl status openqa-worker-cacheservice
● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2021-08-28 03:33:24 CEST; 2 days ago
 Main PID: 1138 (openqa-workerca)
    Tasks: 5
   CGroup: /system.slice/openqa-worker-cacheservice.service
           ├─1138 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
           ├─1242 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
           ├─1243 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
           ├─1244 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
           └─1245 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80

Aug 28 03:33:24 rebel systemd[1]: Started OpenQA Worker Cache Service.
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-Tumbleweed-s390x-20210826-textmode@s390x-zVM-vswitch-l2.qcow2" b>
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: Web application available at http://127.0.0.1:9530
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: Web application available at http://[::1]:9530
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Cache size of "/var/lib/openqa/cache" is 50GiB, with limit 50GiB
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Listening at "http://127.0.0.1:9530"
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Listening at "http://[::1]:9530"
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Manager 1138 started
rebel:~ # systemctl status openqa-worker-cacheservice
openqa-worker-cacheservice-minion.service  openqa-worker-cacheservice.service         
rebel:~ # systemctl status openqa-worker-cacheservice
openqa-worker-cacheservice-minion.service  openqa-worker-cacheservice.service         
rebel:~ # systemctl status openqa-worker-cacheservice-minion.service 
● openqa-worker-cacheservice-minion.service - OpenQA Worker Cache Service Minion
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice-minion.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2021-08-28 03:33:24 CEST; 2 days ago
 Main PID: 1143 (openqa-workerca)
    Tasks: 1
   CGroup: /system.slice/openqa-worker-cacheservice-minion.service
           └─1143 /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks

Aug 28 14:02:37 rebel openqa-worker-cacheservice-minion[1143]: [5373] [i] [#3389] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 14:02:47 rebel openqa-worker-cacheservice-minion[1143]: [5377] [i] [#3390] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"
Aug 28 14:18:34 rebel openqa-worker-cacheservice-minion[1143]: [6392] [i] [#3391] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 14:18:45 rebel openqa-worker-cacheservice-minion[1143]: [6397] [i] [#3392] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"
Aug 28 14:37:38 rebel openqa-worker-cacheservice-minion[1143]: [7776] [i] [#3393] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 14:37:48 rebel openqa-worker-cacheservice-minion[1143]: [7790] [i] [#3394] Downloading: "opensuse-Tumbleweed-s390x-20210827-textmode@s390x-zVM-vswitch-l2.qcow2"
Aug 28 15:05:06 rebel openqa-worker-cacheservice-minion[1143]: [8574] [i] [#3395] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 15:05:17 rebel openqa-worker-cacheservice-minion[1143]: [8575] [i] [#3396] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"
Aug 28 15:51:53 rebel openqa-worker-cacheservice-minion[1143]: [9301] [i] [#3397] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 15:51:59 rebel openqa-worker-cacheservice-minion[1143]: [9302] [i] [#3398] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"

so the last messages from both services are from two days ago, no mention of the more recent requests from today. Trying to strace the process for openqa-worker-cacheservice-minion with strace -f -yy -p 1143 and I get

stat("/var/lib/openqa/cache/cache.sqlite", {st_mode=S_IFREG|0644, st_size=278528, ...}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/var/lib/openqa/cache/cache.sqlite-journal", 0x7ffeca4a0690) = -1 ENOENT (No such file or directory)
lseek(3</var/lib/openqa/cache/cache.sqlite>, 24, SEEK_SET) = 24
read(3</var/lib/openqa/cache/cache.sqlite>, "\0\1\247\277\0\0\0D\0\0\0\23\0\0\0(", 16) = 16
stat("/var/lib/openqa/cache/cache.sqlite-wal", 0x7ffeca4a0690) = -1 ENOENT (No such file or directory)
fstat(3</var/lib/openqa/cache/cache.sqlite>, {st_mode=S_IFREG|0644, st_size=278528, ...}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=1073741825, l_len=1}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=2}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
nanosleep({tv_sec=0, tv_nsec=500000000}, NULL) = 0
getpid()
… (repeats)

so likely stuck on the lock?

but in dmesg I see

[77131.442675] BTRFS info (device sda1): qgroup scan completed (inconsistency flag cleared)
[87038.736150] BTRFS info (device sda1): qgroup scan completed (inconsistency flag cleared)
[87040.724869] BTRFS error (device sda1): parent transid verify failed on 399785984 wanted 1232944 found 1252990
[87040.724875] BTRFS error (device sda1): parent transid verify failed on 399785984 wanted 1232944 found 1252990
[87040.725039] BTRFS warning (device sda1): Skipping commit of aborted transaction.
[87040.725042] BTRFS: error (device sda1) in cleanup_transaction:1818: errno=-5 IO failure
[87040.725043] BTRFS info (device sda1): forced readonly
[87040.725366] BTRFS info (device sda1): delayed_refs has NO entry
[93674.081962] systemd-journald[696]: Failed to write entry (20 items, 585 bytes), ignoring: Read-only file system
[93674.082182] systemd-journald[696]: Failed to write entry (20 items, 691 bytes), ignoring: Read-only file system
[93674.082400] systemd-journald[696]: Failed to write entry (20 items, 584 bytes), ignoring: Read-only file system
[93674.082680] systemd-journald[696]: Failed to write entry (20 items, 691 bytes), ignoring: Read-only file system

which is for the root filesystem which is a different filesystem than what we use for the openQA cache which would be sdb3 with ext4 (/var/lib/openqa) but still this is a special situation.

#3 Updated by okurz 9 months ago

  • Assignee deleted (okurz)

I triggered a reboot with ipmi-rebel-ipmi power reset and connected to SoL with ipmi-rebel-ipmi sol activate. During reboot after BIOS, RAID BIOS and network boot, a PXE boot menu mentioning "o3" shows up with options to install "tumbleweed" and "leap 42.2". I wonder where that comes from. If I select to boot from local disk I end up with a black screen on SoL. Maybe the install options would allow to check the btrfs filesystem from outside and see if it can be recovered. Someone else can check?

#4 Updated by nicksinger 9 months ago

  • Assignee set to nicksinger

#5 Updated by nicksinger 9 months ago

no bootloader shown over IPMI even if I set the disk as bootdev 1. I assume it was never configured for the serial terminal. Trying to boot an opensuse iso from my samba share now to debug the setup on disk.

#6 Updated by nicksinger 9 months ago

no success. In the meantime I even lost access to ipmi/sol due to a misconfiguration from my side in the bios. I was able to restore access and get "graphical output" over the supermicro "IPMIView" utility. I saw that btrfs throws a lot of errors while booting. Recovering from the systemd shell was not possible. I give the rescue media another shot

#7 Updated by nicksinger 9 months ago

I was able to mount the FS with a grml live system. However as sonn as you write something btrfs bails out and mounts the disk R/O. According to my research this indicates bigger issues with the system. The disk build in there and failing is a "WDC WD3200YS-01PGB0" with serial "WD-WCAPD5141821". The RAID controller is a "LSI MegaRAID SAS 2108". I assume one of these components might be defective.

Smart shows "no errors recorded" for the disk. Running a smart self test shows after a couple of seconds that the SATA link resets:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Interrupted (host reset)      90%     11675         -

I'm not sure how to figure out if the disk is connected over the LSI controller. Therefore we might need physical help from infra.

#8 Updated by nicksinger 9 months ago

The machine is currently reachable over ariel with the ip 192.168.112.11. There is a tmux session running if anybody is interested

#10 Updated by cdywan 9 months ago

  • Related to tickets #97691: Corrupted file system on rebel prevents openQA tests for s390x added

#12 Updated by okurz 9 months ago

  • Copied to action #97751: replacement setup for o3 s390x openQA workers size:M added

#13 Updated by cdywan 9 months ago

  • Subject changed from many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" to many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M
  • Description updated (diff)
  • Status changed from New to In Progress

#14 Updated by openqa_review 9 months ago

  • Due date set to 2021-09-15

Setting due date based on mean cycle time of SUSE QE Tools

#15 Updated by nicksinger 9 months ago

  • Blocks action #93381: [O3]request to add an IPMI SUT to O3 size:M added

#16 Updated by okurz 9 months ago

mkittler volunteers to team up with nsinger to learn how to order replacement hardware. Please order a replacement storage device, ask EngInfra to replace it and see if that fixes the problem. If it does, good, if not then likely storage controller or something else is broken. In this case consider decomissioning the complete hardware or replace other components depending on RMA periods. Lookup racktables about support period.

#17 Updated by mgriessmeier 9 months ago

okurz wrote:

mkittler volunteers to team up with nsinger to learn how to order replacement hardware. Please order a replacement storage device, ask EngInfra to replace it and see if that fixes the problem. If it does, good, if not then likely storage controller or something else is broken. In this case consider decomissioning the complete hardware or replace other components depending on RMA periods. Lookup racktables about support period.

so in general the process for such "small" amounts, not sure about the threshold though (iirc something around 600€) - works as follows:

  • open Jira-SD ticket (Hardware Request -> Non-Standard/Other Equipment) with business justification and link to the desired Hardware
  • await approval from line manager and SUSE-IT
  • two options after that: 1) Order yourself and expense afterwards through SAP Concur with reference to Jira-SD ticket or 2) Kindly ask SUSE-IT in the Jira-SD ticket to order it for you, though I never did this and don't know how good this works

for this particular issue I agreed with Nick to first check in the office early next week, because recently we found some "old" HDDs which are still sealed, so were not used at all, to check if we can use those.

#18 Updated by okurz 9 months ago

  • Related to action #98307: Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M added

#19 Updated by cdywan 8 months ago

  • Due date changed from 2021-09-15 to 2021-09-24
  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

Reducing priority due to #97751, aiming for next Wednesday but due two days after.

#20 Updated by nicksinger 8 months ago

  • Status changed from Feedback to Blocked

We where in the office yesterday but unfortunately nobody from infra was there and we couldn't go into the server room. I prepared some HDDs now and placed them in my office and kindly asked infra (https://sd.suse.com/servicedesk/customer/portal/1/SD-61294) to build them in.

#21 Updated by nicksinger 8 months ago

  • Due date changed from 2021-09-24 to 2021-10-01

#22 Updated by nicksinger 8 months ago

Update for people who don't see the Ticket:

Dear Nick Singer,

The ticket has been assigned to Eng-Infra queue to get picked up by an engineer from Engineering-Infra team.

It was 4 days ago. No further answers until now.

#23 Updated by okurz 8 months ago

  • Status changed from Blocked to Feedback

https://sd.suse.com/servicedesk/customer/portal/1/SD-61294 was resolved with message "disk replaced"

#24 Updated by okurz 8 months ago

  • Due date changed from 2021-10-01 to 2021-10-05

To be checked on-site at next possibility if not possible remotely

#25 Updated by nicksinger 8 months ago

  • Priority changed from High to Low

Tried to install Leap15.3 over PXE but no success so far. I went with IPXE as we don't have the necessary infrastructure in the o3 network to replicate what we have in the OSD PXE setup (e.g. mount to dist). What I tried so far is building ipxe (git.ipxe.org/ipxe.git) with the following, embedded script (myscript.ipxe):

#!ipxe
dhcp

kernel http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/linux usessh=1 sshpassword=linux network=1 install=http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/ console=ttyS1,115200n8 root=/dev/ram0 initrd=initrd textmode=1
initrd http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/initrd

boot
shell

To build it, you have to compile it like this (inside the scr directory): make -j$(nproc) EMBED=../myscript.ipxe.

It loads the kernel + initrd but always gets stuck while loading the initrd at around 70%. As I used a similar script for aarch64 workers in the past (and it worked there) I can't tell if the console parameter is just wrong. Until now I tried ttyS0 and ttyS1.

I also tried to load the iso directly with

chain --timeout 1800000 http://download.opensuse.org/distribution/leap/15.3/iso/openSUSE-Leap-15.3-NET-x86_64.iso

But this results in an "unknown exec format" which is apparently caused by regular BIOS booting which does not support loading of ISOs (unlike UEFI)

#26 Updated by nicksinger 8 months ago

  • Due date changed from 2021-10-05 to 2021-11-05
  • Status changed from Feedback to In Progress

right, also changing the priority here as we have workarounds for the host in place (containers). Setting the due date far ahead since the easy solutions did not work and we need to figure out how we can have working PXE on o3 infra.

#27 Updated by nicksinger 8 months ago

ariel with dnsmasq is our dhcp server which hands out the PXE configuration according to this config:

ariel:/etc/dnsmasq.d # cat pxeboot.conf
enable-tftp
tftp-root=/srv/tftpboot
pxe-prompt="Press F8 for menu.", 10
pxe-service=x86PC, "ipxe", ipxe.bin
pxe-service=x86PC, "pxelinux", pxelinux.0
#pxe-service=x86PC, "Boot from local disk"
#pxe-service=x86PC, "Install Linux", pxelinux

#28 Updated by nicksinger 8 months ago

I checked the BIOS of rebel and the configuration for console redirection was for "COM3". I then tried "ttyS2" and I finally could boot the installer medium. So this is the currently deployed /srv/tftpboot/ipxe.bin content which works for rebel:

#!ipxe
dhcp

kernel http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/linux usessh=1 sshpassword=linux network=1 install=http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/ console=ttyS2,115200n8 root=/dev/ram0 initrd=initrd textmode=1
initrd http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/initrd

boot
shell

Conducting the instillation over ssh now.

#29 Updated by nicksinger 8 months ago

Installation is done and openSUSE boots successfully. Next step is to setup the openqa-worker packages again and do general configuration.

#30 Updated by nicksinger 7 months ago

  • Due date changed from 2021-11-05 to 2021-11-15

the base system is installed and configuration should be in place. I agreed with okurz that we try to start the s390 workers there as well to see what happens if two workers use the same resource. While checking the system I saw that the transactional-updates.service failed and added my comments to https://bugzilla.opensuse.org/show_bug.cgi?id=1192078#c1

#31 Updated by nicksinger 7 months ago

while fixing the transactional-updates.service I realized that my previously restored openQA settings where also missing on the system. Not sure if I applied them wrongly and they got lost over a reboot but have to do them again anyway

#32 Updated by nicksinger 7 months ago

I restored the settings of rebel now and started the openqa-worker services. It is back in o3 now: https://openqa.opensuse.org/admin/workers/181 - waiting for jobs to check if this conflicts with our current container setup

#33 Updated by mkittler 6 months ago

I installed some missing dependencies and added our repositories from devel:openQA and configured the testpool server. Now the test setup seems to work, e.g. https://openqa.opensuse.org/tests/2032416 is currently running. If everything works fine, we can switch the worker classes back to normal.

#34 Updated by cdywan 6 months ago

mkittler wrote:

I installed some missing dependencies and added our repositories from devel:openQA and configured the testpool server. Now the test setup seems to work, e.g. https://openqa.opensuse.org/tests/2032416 is currently running. If everything works fine, we can switch the worker classes back to normal.

That test failed. But it doesn't look like it has passed in 2 months so I'm sure this is "fine" :-D

Use of uninitialized value $args{"logs_path"} in concatenation (.) or string at opensuse/lib/opensusebasetest.pm line 237.

I don't know if this is relevant, at least I noticed that checking the logs.

#35 Updated by okurz 6 months ago

Please take https://openqa.opensuse.org/group_overview/34 as reference. It shows that at least some jobs pass and that's the reference for us

#36 Updated by cdywan 6 months ago

nicksinger Does that mean this ticket should be wrapped up and resolved? Or is there more to be done here? And please update the ticket status when answering ;-)

#37 Updated by okurz 6 months ago

It was me, okurz, not nicksinger, that answered. And, no, the ticket should only be resolved as soon as tests on rebel pass consistently. I was referencing a reference to which to compare our tests even though not all tests might pass elsewhere.

#38 Updated by cdywan 6 months ago

Suggested in the Unlock for Nick to ask Sarah for a good test reference

#39 Updated by AdaLovelace 6 months ago

It can be, that you have sent the email to the wrong email address. I did not receive anything. I do not work for IBM any more.
But I am a watcher of this ticket.

You can use the default Tumbleweed tests textmode and install_ltp_s390x as examples. Why? There are disconnections from the host during await_install at the moment.
I expect an infrastructure issue in this case.

#40 Updated by nicksinger 6 months ago

AdaLovelace wrote:

You can use the default Tumbleweed tests textmode and install_ltp_s390x as examples. Why? There are disconnections from the host during await_install at the moment.
I expect an infrastructure issue in this case.

Thanks for the update! I assume you mean the problems seen here: https://openqa.opensuse.org/tests/2040472#step/await_install/87 ?
I think the connection issues are out-of-scope for this explicit ticket here but I will use the test as a reference if the recovered worker (rebel) shows the same symptoms.

#41 Updated by nicksinger 6 months ago

  • Status changed from In Progress to Resolved

I consider this ticket ("rebel is broken") as resolved now since the validation test reaches the same step as it does on other machines. Regarding the network timeout we might need further investigation. In the screenshot I also see that yast was "killed" so maybe the broken ssh-connection just happens afterwards when no data is transferred anymore. @AdaLovelace please feel free to open another ticket if you find out more why the test fails.

#42 Updated by AdaLovelace 6 months ago

I have to reopen it.
openQA is using the containerized solution for new openSUSE Tumbleweed releases for s390x. Therefore, all tests are failing.

#43 Updated by AdaLovelace 6 months ago

okurz I do not have permission to reopen or change the status.

#44 Updated by nicksinger 6 months ago

  • Status changed from Resolved to In Progress

I assume we have an answer to okurz question then how the workers will behave. I will disable the s390 workers on rebel which are already running as container

#45 Updated by okurz 6 months ago

  • Due date deleted (2021-11-15)
  • Status changed from In Progress to Resolved

Also available in: Atom PDF