action #97658
closed
many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M
Description
Observation
From https://matrix.to/#/!ilXMcHXPOjTZeauZcg:libera.chat/$VfpltxmcDxTLSzTVKh7fa5gFHn6TR3hdcX1cnNnBNlg?via=libera.chat&via=matrix.org&via=opensuse.org AdaLovelace asked
Does anybody know about network issues or anything in this direction between openQA and the mainframe? There are timeouts at the start of openQA tests. https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20210829&groupid=34
Suggestions
- Try to recover the existing machine from a rescue system
- If the hardware needs a replacement, file an infra ticket asking to replace hardware and boot a rescue system to install the machine
- Ensure that o3 s390x openQA jobs are processed correctly
Out of scope
- Reconsider backup strategy or replacement using containers
Updated by okurz about 3 years ago
From rebel:
rebel:~ # systemctl status openqa-worker-cacheservice
● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2021-08-28 03:33:24 CEST; 2 days ago
Main PID: 1138 (openqa-workerca)
Tasks: 5
CGroup: /system.slice/openqa-worker-cacheservice.service
├─1138 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
├─1242 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
├─1243 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
├─1244 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
└─1245 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
Aug 28 03:33:24 rebel systemd[1]: Started OpenQA Worker Cache Service.
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Purging "/var/lib/openqa/cache/openqa1-opensuse/opensuse-Tumbleweed-s390x-20210826-textmode@s390x-zVM-vswitch-l2.qcow2" b>
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: Web application available at http://127.0.0.1:9530
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: Web application available at http://[::1]:9530
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Cache size of "/var/lib/openqa/cache" is 50GiB, with limit 50GiB
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Listening at "http://127.0.0.1:9530"
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Listening at "http://[::1]:9530"
Aug 28 03:33:29 rebel openqa-workercache-daemon[1138]: [1138] [i] Manager 1138 started
rebel:~ # systemctl status openqa-worker-cacheservice
openqa-worker-cacheservice-minion.service openqa-worker-cacheservice.service
rebel:~ # systemctl status openqa-worker-cacheservice-minion.service
● openqa-worker-cacheservice-minion.service - OpenQA Worker Cache Service Minion
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice-minion.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2021-08-28 03:33:24 CEST; 2 days ago
Main PID: 1143 (openqa-workerca)
Tasks: 1
CGroup: /system.slice/openqa-worker-cacheservice-minion.service
└─1143 /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
Aug 28 14:02:37 rebel openqa-worker-cacheservice-minion[1143]: [5373] [i] [#3389] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 14:02:47 rebel openqa-worker-cacheservice-minion[1143]: [5377] [i] [#3390] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"
Aug 28 14:18:34 rebel openqa-worker-cacheservice-minion[1143]: [6392] [i] [#3391] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 14:18:45 rebel openqa-worker-cacheservice-minion[1143]: [6397] [i] [#3392] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"
Aug 28 14:37:38 rebel openqa-worker-cacheservice-minion[1143]: [7776] [i] [#3393] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 14:37:48 rebel openqa-worker-cacheservice-minion[1143]: [7790] [i] [#3394] Downloading: "opensuse-Tumbleweed-s390x-20210827-textmode@s390x-zVM-vswitch-l2.qcow2"
Aug 28 15:05:06 rebel openqa-worker-cacheservice-minion[1143]: [8574] [i] [#3395] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 15:05:17 rebel openqa-worker-cacheservice-minion[1143]: [8575] [i] [#3396] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"
Aug 28 15:51:53 rebel openqa-worker-cacheservice-minion[1143]: [9301] [i] [#3397] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso.sha256"
Aug 28 15:51:59 rebel openqa-worker-cacheservice-minion[1143]: [9302] [i] [#3398] Downloading: "openSUSE-Tumbleweed-DVD-s390x-Snapshot20210827-Media.iso"
So the last messages from both services are from two days ago, with no mention of the more recent requests from today. Tracing the openqa-worker-cacheservice-minion process with strace -f -yy -p 1143 I get:
stat("/var/lib/openqa/cache/cache.sqlite", {st_mode=S_IFREG|0644, st_size=278528, ...}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/var/lib/openqa/cache/cache.sqlite-journal", 0x7ffeca4a0690) = -1 ENOENT (No such file or directory)
lseek(3</var/lib/openqa/cache/cache.sqlite>, 24, SEEK_SET) = 24
read(3</var/lib/openqa/cache/cache.sqlite>, "\0\1\247\277\0\0\0D\0\0\0\23\0\0\0(", 16) = 16
stat("/var/lib/openqa/cache/cache.sqlite-wal", 0x7ffeca4a0690) = -1 ENOENT (No such file or directory)
fstat(3</var/lib/openqa/cache/cache.sqlite>, {st_mode=S_IFREG|0644, st_size=278528, ...}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=1073741825, l_len=1}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=2}) = 0
fcntl(3</var/lib/openqa/cache/cache.sqlite>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
nanosleep({tv_sec=0, tv_nsec=500000000}, NULL) = 0
getpid()
… (repeats)
so likely stuck on the lock?
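The fcntl(F_SETLK)/nanosleep pattern in the trace above is SQLite polling for a lock every 500 ms. A minimal, self-contained sketch of the same kind of advisory-lock contention, using flock(1) on a temporary file rather than the real cache.sqlite, looks like this:

```shell
# One process holds an exclusive lock; a second, non-blocking attempt
# then fails -- analogous to the minion's 500 ms retry loop.
lockfile=$(mktemp)
flock -x "$lockfile" sleep 3 &    # background holder keeps the lock for 3 s
holder=$!
sleep 1                           # give the holder time to acquire the lock
if flock -n "$lockfile" true; then
    result="acquired"
else
    result="held elsewhere"       # expected while the holder is still alive
fi
echo "lock: $result"
kill "$holder" 2>/dev/null
wait "$holder" 2>/dev/null
rm -f "$lockfile"
```

This only illustrates the locking mechanism; it says nothing about which process actually held the lock on cache.sqlite here.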
but in dmesg I see
[77131.442675] BTRFS info (device sda1): qgroup scan completed (inconsistency flag cleared)
[87038.736150] BTRFS info (device sda1): qgroup scan completed (inconsistency flag cleared)
[87040.724869] BTRFS error (device sda1): parent transid verify failed on 399785984 wanted 1232944 found 1252990
[87040.724875] BTRFS error (device sda1): parent transid verify failed on 399785984 wanted 1232944 found 1252990
[87040.725039] BTRFS warning (device sda1): Skipping commit of aborted transaction.
[87040.725042] BTRFS: error (device sda1) in cleanup_transaction:1818: errno=-5 IO failure
[87040.725043] BTRFS info (device sda1): forced readonly
[87040.725366] BTRFS info (device sda1): delayed_refs has NO entry
[93674.081962] systemd-journald[696]: Failed to write entry (20 items, 585 bytes), ignoring: Read-only file system
[93674.082182] systemd-journald[696]: Failed to write entry (20 items, 691 bytes), ignoring: Read-only file system
[93674.082400] systemd-journald[696]: Failed to write entry (20 items, 584 bytes), ignoring: Read-only file system
[93674.082680] systemd-journald[696]: Failed to write entry (20 items, 691 bytes), ignoring: Read-only file system
This concerns the root filesystem, which is a different filesystem than the one used for the openQA cache (/var/lib/openqa on sdb3 with ext4), but it is still an unusual situation.
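Whether a filesystem has been forced read-only (as BTRFS did here for the root filesystem) can be checked quickly from /proc/mounts; a small sketch, with / used only as an example mount point:

```shell
# Print the mount state of a given mount point; / is just an example.
mnt=/
opts=$(awk -v m="$mnt" '$2 == m {print $4; exit}' /proc/mounts)
case ",$opts," in
    *,ro,*) state="read-only" ;;   # what a forced-readonly filesystem shows
    *)      state="read-write" ;;
esac
echo "$mnt is mounted $state"
```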
Updated by okurz about 3 years ago
- Assignee deleted (okurz)
I triggered a reboot with ipmi-rebel-ipmi power reset and connected to SoL with ipmi-rebel-ipmi sol activate. During reboot, after BIOS, RAID BIOS and network boot, a PXE boot menu mentioning "o3" shows up with options to install "tumbleweed" and "leap 42.2". I wonder where that comes from. If I select to boot from the local disk I end up with a black screen on SoL. Maybe the install options would allow checking the btrfs filesystem from outside to see if it can be recovered. Can someone else check?
Updated by nicksinger about 3 years ago
No bootloader shown over IPMI even if I set the disk as bootdev 1. I assume it was never configured for the serial terminal. Trying to boot an openSUSE ISO from my samba share now to debug the setup on disk.
Updated by nicksinger about 3 years ago
No success. In the meantime I even lost access to IPMI/SoL due to a misconfiguration on my side in the BIOS. I was able to restore access and get "graphical output" over the Supermicro "IPMIView" utility. I saw that btrfs throws a lot of errors while booting. Recovering from the systemd shell was not possible. I'll give the rescue media another shot.
Updated by nicksinger about 3 years ago
I was able to mount the FS with a grml live system. However, as soon as you write something, btrfs bails out and mounts the disk R/O. According to my research this indicates bigger issues with the system. The disk built in there and failing is a "WDC WD3200YS-01PGB0" with serial "WD-WCAPD5141821". The RAID controller is an "LSI MegaRAID SAS 2108". I assume one of these components might be defective.
SMART shows "no errors recorded" for the disk. Running a SMART self-test shows after a couple of seconds that the SATA link resets:
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended captive Interrupted (host reset) 90% 11675 -
I'm not sure how to figure out whether the disk is connected through the LSI controller. Therefore we might need physical help from infra.
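For disks attached through an LSI MegaRAID controller, smartmontools can usually address the physical disk via its megaraid passthrough. A hedged sketch; the device path and disk number are assumptions and have to be adapted:

```
# List devices the controller exposes (recent smartmontools required):
smartctl --scan
# Query SMART data of physical disk 0 behind the MegaRAID controller:
smartctl -a -d megaraid,0 /dev/sda
# Run the long self-test against the same passthrough target:
smartctl -t long -d megaraid,0 /dev/sda
```

If the passthrough works, the disk is reachable through the controller; if smartctl reports no such device, the disk is probably cabled differently.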
Updated by nicksinger about 3 years ago
The machine is currently reachable from ariel with the IP 192.168.112.11. There is a tmux session running if anybody is interested.
Updated by livdywan about 3 years ago
- Related to tickets #97691: Corrupted file system on rebel prevents openQA tests for s390x added
Updated by okurz about 3 years ago
- Copied to action #97751: replacement setup for o3 s390x openQA workers size:M added
Updated by livdywan about 3 years ago
- Subject changed from many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" to many (maybe all) jobs on rebel within o3 run into timeout_exceeded "setup exceeded MAX_SETUP_TIME" size:M
- Description updated (diff)
- Status changed from New to In Progress
Updated by openqa_review about 3 years ago
- Due date set to 2021-09-15
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger about 3 years ago
- Blocks action #93381: [O3]request to add an IPMI SUT to O3 size:M added
Updated by okurz about 3 years ago
mkittler volunteers to team up with nsinger to learn how to order replacement hardware. Please order a replacement storage device, ask EngInfra to replace it and see if that fixes the problem. If it does, good, if not then likely storage controller or something else is broken. In this case consider decomissioning the complete hardware or replace other components depending on RMA periods. Lookup racktables about support period.
Updated by mgriessmeier about 3 years ago
okurz wrote:
mkittler volunteers to team up with nsinger to learn how to order replacement hardware. Please order a replacement storage device, ask EngInfra to replace it and see if that fixes the problem. If it does, good, if not then likely storage controller or something else is broken. In this case consider decomissioning the complete hardware or replace other components depending on RMA periods. Lookup racktables about support period.
so in general the process for such "small" amounts (not sure about the threshold, though IIRC it is around 600€) works as follows:
- open Jira-SD ticket (Hardware Request -> Non-Standard/Other Equipment) with business justification and link to the desired Hardware
- await approval from line manager and SUSE-IT
- two options after that: 1) order yourself and expense it afterwards through SAP Concur with reference to the Jira-SD ticket, or 2) kindly ask SUSE-IT in the Jira-SD ticket to order it for you, though I never did this and don't know how well it works
for this particular issue I agreed with Nick to first check in the office early next week, because recently we found some "old" HDDs which are still sealed (so were not used at all), to check if we can use those.
Updated by okurz about 3 years ago
- Related to action #98307: Many jobs in o3 fail with timeout_exceeded on openqaworker1 auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry size:M added
Updated by livdywan about 3 years ago
- Due date changed from 2021-09-15 to 2021-09-24
- Status changed from In Progress to Feedback
- Priority changed from Urgent to High
Reducing priority due to #97751, aiming for next Wednesday but due two days after.
Updated by nicksinger about 3 years ago
- Status changed from Feedback to Blocked
We were in the office yesterday but unfortunately nobody from infra was there and we couldn't get into the server room. I prepared some HDDs, placed them in my office, and kindly asked infra (https://sd.suse.com/servicedesk/customer/portal/1/SD-61294) to build them in.
Updated by nicksinger about 3 years ago
- Due date changed from 2021-09-24 to 2021-10-01
Updated by nicksinger about 3 years ago
Update for people who don't see the Ticket:
Dear Nick Singer,
The ticket has been assigned to Eng-Infra queue to get picked up by an engineer from Engineering-Infra team.
That was 4 days ago. No further answers until now.
Updated by okurz about 3 years ago
- Status changed from Blocked to Feedback
https://sd.suse.com/servicedesk/customer/portal/1/SD-61294 was resolved with message "disk replaced"
Updated by okurz about 3 years ago
- Due date changed from 2021-10-01 to 2021-10-05
To be checked on-site at the next opportunity if not possible remotely
Updated by nicksinger about 3 years ago
- Priority changed from High to Low
Tried to install Leap 15.3 over PXE but no success so far. I went with iPXE as we don't have the necessary infrastructure in the o3 network to replicate what we have in the OSD PXE setup (e.g. a mount to dist). What I tried so far is building iPXE (git.ipxe.org/ipxe.git) with the following embedded script (myscript.ipxe):
#!ipxe
dhcp
kernel http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/linux usessh=1 sshpassword=linux network=1 install=http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/ console=ttyS1,115200n8 root=/dev/ram0 initrd=initrd textmode=1
initrd http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/initrd
boot
shell
To build it, you have to compile it like this (inside the src directory): make -j$(nproc) EMBED=../myscript.ipxe
It loads the kernel + initrd but always gets stuck while loading the initrd at around 70%. As I used a similar script for aarch64 workers in the past (and it worked there) I can't tell if the console parameter is just wrong. Until now I tried ttyS0 and ttyS1.
I also tried to load the iso directly with
chain --timeout 1800000 http://download.opensuse.org/distribution/leap/15.3/iso/openSUSE-Leap-15.3-NET-x86_64.iso
But this results in an "unknown exec format" error, which is apparently caused by regular BIOS booting not supporting the loading of ISOs (unlike UEFI).
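For legacy BIOS, iPXE's sanboot command is the usual alternative to chain for ISOs: it exposes the image as an emulated INT 13h disk instead of trying to execute it. Whether this actually boots depends on the image, so this is only a sketch using the same URL as above:

```
#!ipxe
dhcp
# sanboot presents the ISO as an emulated BIOS disk; success depends on
# the image being bootable this way (El Torito layout).
sanboot http://download.opensuse.org/distribution/leap/15.3/iso/openSUSE-Leap-15.3-NET-x86_64.iso
```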
Updated by nicksinger about 3 years ago
- Due date changed from 2021-10-05 to 2021-11-05
- Status changed from Feedback to In Progress
Right, I'm also changing the priority here as we have workarounds for the host in place (containers). Setting the due date far ahead since the easy solutions did not work and we need to figure out how we can have working PXE on the o3 infra.
Updated by nicksinger about 3 years ago
ariel with dnsmasq is our DHCP server, which hands out the PXE configuration according to this config:
ariel:/etc/dnsmasq.d # cat pxeboot.conf
enable-tftp
tftp-root=/srv/tftpboot
pxe-prompt="Press F8 for menu.", 10
pxe-service=x86PC, "ipxe", ipxe.bin
pxe-service=x86PC, "pxelinux", pxelinux.0
#pxe-service=x86PC, "Boot from local disk"
#pxe-service=x86PC, "Install Linux", pxelinux
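A common refinement of such a dnsmasq setup (not part of the config above; the URL is a made-up example) is to detect iPXE clients via DHCP option 175, so that a client that already chainloaded ipxe.bin is handed a boot script instead of ipxe.bin again, avoiding a chainload loop:

```
# Hypothetical addition to pxeboot.conf; the script URL is an example only.
dhcp-match=set:ipxe,175
dhcp-boot=tag:ipxe,http://ariel/boot.ipxe
```

With an embedded script in ipxe.bin (as done here) this is not strictly needed, since the embedded script takes over immediately.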
Updated by nicksinger about 3 years ago
I checked the BIOS of rebel and the configuration for console redirection was set to "COM3". I then tried "ttyS2" and could finally boot the installer medium. So this is the currently deployed /srv/tftpboot/ipxe.bin content which works for rebel:
#!ipxe
dhcp
kernel http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/linux usessh=1 sshpassword=linux network=1 install=http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/ console=ttyS2,115200n8 root=/dev/ram0 initrd=initrd textmode=1
initrd http://download.opensuse.org/distribution/leap/15.3/repo/oss/boot/x86_64/loader/initrd
boot
shell
Conducting the installation over ssh now.
Updated by nicksinger about 3 years ago
Installation is done and openSUSE boots successfully. Next step is to set up the openqa-worker packages again and do the general configuration.
Updated by nicksinger almost 3 years ago
- Due date changed from 2021-11-05 to 2021-11-15
The base system is installed and the configuration should be in place. I agreed with @okurz that we try to start the s390 workers there as well to see what happens if two workers use the same resource. While checking the system I saw that transactional-updates.service failed, and added my comments to https://bugzilla.opensuse.org/show_bug.cgi?id=1192078#c1
Updated by nicksinger almost 3 years ago
While fixing transactional-updates.service I realized that my previously restored openQA settings were also missing on the system. Not sure if I applied them wrongly and they got lost over a reboot, but I have to do them again anyway.
Updated by nicksinger almost 3 years ago
I restored the settings of rebel now and started the openqa-worker services. It is back in o3 now: https://openqa.opensuse.org/admin/workers/181 - waiting for jobs to check whether this conflicts with our current container setup.
Updated by mkittler almost 3 years ago
I installed some missing dependencies, added our repositories from devel:openQA and configured the testpool server. Now the test setup seems to work, e.g. https://openqa.opensuse.org/tests/2032416 is currently running. If everything works fine, we can switch the worker classes back to normal.
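The "worker classes" referred to live in the openQA worker configuration; a sketch of what such an entry looks like (the concrete values on rebel are not shown in this ticket, so these are placeholders):

```
# /etc/openqa/workers.ini -- illustrative values only
[global]
HOST = https://openqa.opensuse.org
WORKER_CLASS = s390x-zVM-vswitch-l2
```

Switching the classes back would mean restoring whatever WORKER_CLASS values rebel carried before the reinstall.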
Updated by livdywan almost 3 years ago
mkittler wrote:
I installed some missing dependencies and added our repositories from devel:openQA and configured the testpool server. Now the test setup seems to work, e.g. https://openqa.opensuse.org/tests/2032416 is currently running. If everything works fine, we can switch the worker classes back to normal.
That test failed. But it doesn't look like it has passed in 2 months so I'm sure this is "fine" :-D
Use of uninitialized value $args{"logs_path"} in concatenation (.) or string at opensuse/lib/opensusebasetest.pm line 237.
I don't know if this is relevant, at least I noticed that checking the logs.
Updated by okurz almost 3 years ago
Please take https://openqa.opensuse.org/group_overview/34 as reference. It shows that at least some jobs pass, and that's the reference for us.
Updated by livdywan almost 3 years ago
@nicksinger Does that mean this ticket should be wrapped up and resolved? Or is there more to be done here? And please update the ticket status when answering ;-)
Updated by okurz almost 3 years ago
It was me, okurz, not nicksinger, who answered. And no, the ticket should only be resolved as soon as tests on rebel pass consistently. I was providing a reference against which to compare our tests, even though not all tests might pass elsewhere.
Updated by livdywan almost 3 years ago
Suggested in the Unlock that Nick ask Sarah for a good test reference
Updated by AdaLovelace almost 3 years ago
It could be that you sent the email to the wrong email address. I did not receive anything; I do not work for IBM any more.
But I am a watcher of this ticket.
You can use the default Tumbleweed tests textmode and install_ltp_s390x as examples. Why? Because there are disconnections from the host during await_install at the moment.
I expect an infrastructure issue in this case.
Updated by nicksinger almost 3 years ago
AdaLovelace wrote:
You can use the default Tumbleweed tests textmode and install_ltp_s390x as examples. Why? There are disconnections from the host during await_install at the moment.
I expect an infrastructure issue in this case.
Thanks for the update! I assume you mean the problems seen here: https://openqa.opensuse.org/tests/2040472#step/await_install/87 ?
I think the connection issues are out-of-scope for this explicit ticket here but I will use the test as a reference if the recovered worker (rebel) shows the same symptoms.
Updated by nicksinger almost 3 years ago
- Status changed from In Progress to Resolved
I consider this ticket ("rebel is broken") resolved now since the validation test reaches the same step as it does on other machines. Regarding the network timeout we might need further investigation. In the screenshot I also see that YaST was "killed", so maybe the broken ssh connection just happens afterwards, when no data is transferred anymore. @AdaLovelace please feel free to open another ticket if you find out more about why the test fails.
Updated by AdaLovelace almost 3 years ago
I have to reopen it.
openQA is using the containerized solution for new openSUSE Tumbleweed releases for s390x. Therefore, all tests are failing.
Updated by AdaLovelace almost 3 years ago
@okurz I do not have permission to reopen or change the status.
Updated by nicksinger almost 3 years ago
- Status changed from Resolved to In Progress
I assume we have an answer to @okurz's question then on how the workers will behave. I will disable the s390 workers on rebel which are already running as containers.
Updated by okurz almost 3 years ago
- Due date deleted (2021-11-15)
- Status changed from In Progress to Resolved
I added the information to the wiki https://progress.opensuse.org/projects/openqav3/wiki/Wiki/diff?utf8=%E2%9C%93&version=133&version_from=132&commit=View+differences