action #72079
coordination #69478 (closed): [epic] Upgrade o3+osd workers+webui to openSUSE Leap 15.2
Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor), also problem auto_review:"(?s)starting: /usr/bin/qemu-system-aarch64.*backend died: Migrate to file failed"
0%
Description
Motivation
ggardet_arm suggests upgrading openqa-aarch64 to Leap 15.2 soon so that he can run some aarch64-specific tests that need qemu >= 4.0. Also, having one production machine upgraded can help us find problems early.
Acceptance criteria
- AC1: openqa-aarch64 (o3) runs a clean upgraded openSUSE Leap 15.2 (no failed systemd services, no left over .rpm-new files, etc.)
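A minimal sketch of how AC1 could be checked on the worker. The function names are hypothetical, and the file suffixes (`.rpmnew`, `.rpmsave`, `.rpmorig`) are an assumption about what "left over .rpm-new files" refers to:

```shell
#!/bin/bash
# List leftover rpm config files below a directory (default: /etc).
# Assumption: leftovers use the usual rpm suffixes .rpmnew, .rpmsave, .rpmorig.
find_rpm_leftovers() {
    local dir="${1:-/etc}"
    find "$dir" -type f \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) 2>/dev/null | sort
}

# Print failed systemd units; prints nothing if no unit is in the failed state.
list_failed_units() {
    systemctl list-units --state=failed --no-legend --plain
}

# Usage on the worker (not executed here):
#   list_failed_units
#   find_rpm_leftovers /etc
```

Both functions printing nothing would be one way to read "clean" in the sense of AC1.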
Suggestions
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the worker is only executing a few or no openQA test jobs
- Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
- Use the instructions from above but use transactional-update dup or transactional-update to do the upgrade
- After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with transactional-update rollback
- Document in parent ticket so that we keep the learnings in mind for other machines
Further details
- Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured there are btrfs snapshots to recover, the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker) and there are also other machines that can run aarch64 jobs while this host might be down for a little bit longer. And okurz can hold your hand
Updated by okurz about 4 years ago
- Due date set to 2020-10-13
- Status changed from Workable to Feedback
- Assignee set to okurz
At first I realized that openqa-aarch64 has not been upgraded for 11 days. It seems that the repo http://download.opensuse.org/repositories/Virtualization/openSUSE_Factory_ARM/ should now be
http://download.opensuse.org/repositories/Virtualization/openSUSE_Factory/ instead. I ran transactional-update dup, rebooted,
and then continued in a transactional shell following https://progress.opensuse.org/projects/openqav3/wiki/#Distribution-upgrades.
All good so far, machine is back up, no failed services, all worker instances currently working on jobs. What I have seen in SoL:
Welcome to openSUSE Leap 15.2 - Kernel 5.3.18-lp152.44-default (ttyAMA0).
openqa-aarch64 login: [ 120.648267] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [qemu-system-aar:5851]
[ 120.668315] watchdog: BUG: soft lockup - CPU#13 stuck for 22s! [qemu-system-aar:4798]
[ 120.678321] watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [qemu-system-aar:4354]
[ 120.698359] watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [qemu-system-aar:5452]
[ 120.699337] watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [qemu-system-aar:5436]
[ 120.728308] watchdog: BUG: soft lockup - CPU#30 stuck for 22s! [qemu-system-aar:5875]
[ 120.798340] watchdog: BUG: soft lockup - CPU#51 stuck for 22s! [qemu-system-aar:4128]
[ 124.678636] watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [qemu-system-aar:5349]
[ 124.698641] watchdog: BUG: soft lockup - CPU#21 stuck for 23s! [qemu-system-aar:5373]
[ 124.718629] watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [qemu-system-aar:5922]
[ 124.798631] watchdog: BUG: soft lockup - CPU#53 stuck for 23s! [migration/53:277]
[ 124.808628] watchdog: BUG: soft lockup - CPU#54 stuck for 23s! [qemu-system-aar:5908]
[ 144.688583] watchdog: BUG: soft lockup - CPU#20 stuck for 21s! [qemu-system-aar:5910]
[ 144.748542] watchdog: BUG: soft lockup - CPU#38 stuck for 22s! [migration/38:202]
[ 144.778539] watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [migration/46:242]
[ 144.788539] watchdog: BUG: soft lockup - CPU#50 stuck for 22s! [migration/50:262]
[ 144.788542] watchdog: BUG: soft lockup - CPU#48 stuck for 22s! [migration/48:252]
[ 144.808538] watchdog: BUG: soft lockup - CPU#56 stuck for 22s! [migration/56:292]
[ 148.748527] watchdog: BUG: soft lockup - CPU#37 stuck for 22s! [migration/37:197]
[ME] Fault detect start!
[ME] Fault detect start!
[ME] Ipmi data check failed!
From the kernel timestamps one can see that this started about 120s after bootup.
So far jobs seem to be fine, e.g. https://openqa.opensuse.org/tests/1427176
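The soft lockup messages in the SoL output above can be summarized per task to see whether qemu processes or kernel migration threads dominate. A small sketch, assuming the console output was saved to a file; the function name is hypothetical:

```shell
#!/bin/bash
# Count "soft lockup" messages per task name, read from stdin.
# Per-CPU suffixes such as "/53" in "migration/53" are stripped so that
# all migration threads are counted together.
summarize_soft_lockups() {
    sed -n 's/.*soft lockup - CPU#[0-9][0-9]* stuck for [0-9][0-9]*s! \[\([^]:]*\):[0-9][0-9]*\].*/\1/p' \
        | sed 's|/[0-9][0-9]*$||' \
        | sort | uniq -c | awk '{print $2, $1}'
}

# Usage (sol.log is a hypothetical capture of the serial console):
#   summarize_soft_lockups < sol.log
```

The output is one line per task name followed by its count.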
Updated by okurz about 4 years ago
- Status changed from Feedback to Resolved
Other jobs have returned results similar to before as well, so I regard the upgrade as finished successfully.
Updated by okurz about 4 years ago
- Copied to action #73189: Upgrade o3 workers to openSUSE Leap 15.2 after openqa-aarch64 already done added
Updated by ggardet_arm about 4 years ago
Thanks for the upgrade to Leap 15.2.
I noticed that some mistyping errors are occurring again, whereas they were nearly absent on this machine before the upgrade.
- https://openqa.opensuse.org/tests/1429694#step/xterm/5
- https://openqa.opensuse.org/tests/1430080#step/firstrun/45
- https://openqa.opensuse.org/tests/1429876#step/php7/2
This is probably related to the CPU stuck messages.
qemu version before: 3.1.1.1
qemu version now: 4.2.1
We should also check the SSD which could be "busy" or slower for some reason.
Updated by ggardet_arm about 4 years ago
- Status changed from Resolved to In Progress
Updated by ggardet_arm about 4 years ago
There are also a few corrupted screen cases: https://openqa.opensuse.org/tests/1429734#step/welcome/12
and qemu-img create failures: https://openqa.opensuse.org/tests/1429051
Updated by okurz about 4 years ago
- Due date changed from 2020-10-13 to 2020-10-16
- Assignee changed from okurz to ggardet_arm
I provided ssh access to the machine for ggardet_arm so that he/you can check yourself. What we could do is 1. rollback to the previous state of 15.1, 2. install a qemu version from the Virtualization repo instead, 3. debug manually on the host, 4. reduce worker instance numbers and monitor the overall load or stress on the system. I would appreciate if you can lead this effort and I just support :)
Updated by ggardet_arm about 4 years ago
There are also API failures: https://openqa.opensuse.org/tests/1429899/file/worker-log.txt
Updated by livdywan about 4 years ago
ggardet_arm wrote:
There is also API failures: https://openqa.opensuse.org/tests/1429899/file/worker-log.txt
I guess you mean these? Looks like DNS issues 🤔
[2020-10-12T15:06:13.0627 CEST] [error] Uploading artefact textinfo-29.png failed: Can't connect: Temporary failure in name resolution
[...]
[2020-10-12T15:37:01.0593 CEST] [error] Error uploading textinfo-systemctl_status.log: connection error: Can't connect: Temporary failure in name resolution
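To judge how widespread these upload errors are, the name resolution failures in a downloaded worker log can simply be counted. A sketch with a hypothetical function name; the `getent` check in the comment is one possible follow-up on the worker, not something done in this ticket:

```shell
#!/bin/bash
# Count lines reporting DNS resolution failures in the given log file(s),
# or in stdin if no file is given.
count_dns_errors() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the
    # function usable under "set -e".
    grep -c 'Temporary failure in name resolution' "$@" || true
}

# Usage:
#   count_dns_errors worker-log.txt
# Possible follow-up on the worker to test the resolver directly:
#   getent hosts openqa.opensuse.org
```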
Updated by okurz about 4 years ago
- Subject changed from Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor) to Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor), also problem auto_review:"backend died: Migrate to file failed"
I also found https://openqa.opensuse.org/tests/1430888 with reason "Reason: backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 267." pointing to performance problems.
Updated by okurz about 4 years ago
- Related to action #73297: auto_review:"(?s)Running on openqa-aarch64:.*considering VNC stalled.*THERE IS NOTHING TO READ" added
Updated by okurz about 4 years ago
ggardet_arm asked me to debug boot problems over IPMI, which is unfortunately only accessible SUSE-internally. sol activate shows no output, so I did power reset while sol activate was active in another console. Reached the boot menu, and confirming the default indeed yields problems: it fails to boot 15.1, no helpful message, nothing over serial. Did power reset again and pressed keys to abort the grub menu timeout, e.g. "esc" and "up". The snapshot menu takes some seconds to load but is there and shows many entries, so that should be ok. First selecting the last good boot, i.e. "* openSUSE Leap 15.2 (,2020-10-13T11:56,post,yast snapper)".
Messages in sol:
Loading Linux 5.3.18-lp152.44-default ...
Loading initial ramdisk ...
CoreGetMemoryMap: MapKey(3EDBA3E0) = D028, mMemoryMapKey(3EDE4B98) = D028.
CoreGetMemoryMap: MapKey(3EDBA400) = D028, mMemoryMapKey(3EDE4B98) = D028.
Loading driver at 0x0002C261000 EntryPoint=0x0002D25751C
Loading driver at 0x0002C261000 EntryPoint=0x0002D25751C
CoreGetMemoryMap: MapKey(3EDBA3A8) = D02C, mMemoryMapKey(3EDE4B98) = D02C.
CoreGetMemoryMap: MapKey(3EDBA3A8) = D02D, mMemoryMapKey(3EDE4B98) = D02D.
CoreGetMemoryMap: MapKey(3EDBA358) = D02E, mMemoryMapKey(3EDE4B98) = D02E.
CoreGetMemoryMap: MapKey(3EDBA358) = D02F, mMemoryMapKey(3EDE4B98) = D02F.
EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable, no randomness supplied
EFI stub: Using DTB from configuration table
CoreGetMemoryMap: MapKey(3EDBA348) = D148, mMemoryMapKey(3EDE4B98) = D148.
CoreGetMemoryMap: MapKey(3EDBA348) = D149, mMemoryMapKey(3EDE4B98) = D149.
EFI stub: Exiting boot services and installing virtual address map...
CoreGetMemoryMap: MapKey(3EDBA298) = D1D3, mMemoryMapKey(3EDE4B98) = D1D3.
CoreGetMemoryMap: MapKey(3EDBA298) = D1D4, mMemoryMapKey(3EDE4B98) = D1D4.
CoreGetMemoryMap: MapKey(3EDBA348) = D1D6, mMemoryMapKey(3EDE4B98) = D1D6.
CoreGetMemoryMap: MapKey(3EDBA348) = D1D7, mMemoryMapKey(3EDE4B98) = D1D7.
CoreGetMemoryMap: MapKey(3EDBA308) = D1D7, mMemoryMapKey(3EDE4B98) = D1D7.
CoreGetMemoryMap: MapKey(3EDBA308) = D1D8, mMemoryMapKey(3EDE4B98) = D1D8.
[SAL_ClearAffiliationSMP,324]it is going to hard reset dev:0x500E004AAAAAAA1F
SAS ExitBootServicesEvent
GMAC ExitBootServicesEvent
GMAC ExitBootServicesEvent
OHCI ExitBootServicesEvent
IPMI ExitBootService Event
M3 sram flag set.........done!IPMI ExitBootServicesEvent
sEe c
u[MEi ii]ic
[MdiSMMU ExitBootServicesEvent
ab[Md icc ensbl
dMsucc2s .n
e[ Eu cecs.nab bMe] iuc dis.
[lE] cocteoll
t[su] eonfrollru0 esrfuliy
t[MEc eontrolls 1 easeu ly
t[sEc eontuolyer M ]arof ioil ruc esrfu ly
successfully
[ME] dimm 0 present
[ME] dimm 0 present
SetOSWatchdog CountDown 3000. Status Success
BIOS Is Over! Bye~
[ME] dimm 1 present
[ME] dimm 1 present
[ME] dimm 2 present
[ME] dimm 2 present
[ME] dimm 3 present
[ME] dimm 3 present
[ME] dimm 4 present
[ME] dimm 4 present
[ME] dimm 5 present
[ME] dimm 5 present
[ME] dimm 6 present
[ME] dimm 6 present
[ME] dimm 7 present
[ME] dimm 7 present
and after some seconds booting kernel. Successfully booted and triggered reboot to crosscheck if I can find a bootable Leap 15.1 variant.
Selected "openSUSE Leap 15.1 (,2020-10-10T07:42,Snapshot Update of #156)" and I get:
Booting `openSUSE Leap 15.1 '
Loading Linux 4.12.14-lp151.28.67-default ...
Loading initial ramdisk ...
CoreGetMemoryMap: MapKey(3EDBA410) = 4D1E, mMemoryMapKey(3EDE4B98) = 4D1E.
CoreGetMemoryMap: MapKey(3EDBA430) = 4D1E, mMemoryMapKey(3EDE4B98) = 4D1E.
Loading driver at 0x0002D596000 EntryPoint=0x0002E30BB10
Loading driver at 0x0002D596000 EntryPoint=0x0002E30BB10
CoreGetMemoryMap: MapKey(3EDBA3E8) = 4D22, mMemoryMapKey(3EDE4B98) = 4D22.
CoreGetMemoryMap: MapKey(3EDBA3E8) = 4D23, mMemoryMapKey(3EDE4B98) = 4D23.
CoreGetMemoryMap: MapKey(3EDBA398) = 4D24, mMemoryMapKey(3EDE4B98) = 4D24.
CoreGetMemoryMap: MapKey(3EDBA398) = 4D25, mMemoryMapKey(3EDE4B98) = 4D25.
SmiGraphicsOutputQueryMode +
SmiGraphicsOutputQueryMode -
SmiGraphicsOutputQueryMode +
SmiGraphicsOutputQueryMode -
CoreGetMemoryMap: MapKey(3EDBA398) = 4D28, mMemoryMapKey(3EDE4B98) = 4D28.
CoreGetMemoryMap: MapKey(3EDBA398) = 4D29, mMemoryMapKey(3EDE4B98) = 4D29.
CoreGetMemoryMap: MapKey(3EDBA2E8) = 4D29, mMemoryMapKey(3EDE4B98) = 4D29.
CoreGetMemoryMap: MapKey(3EDBA2E8) = 4D2A, mMemoryMapKey(3EDE4B98) = 4D2A.
CoreGetMemoryMap: MapKey(3EDBA398) = 4D2C, mMemoryMapKey(3EDE4B98) = 4D2C.
CoreGetMemoryMap: MapKey(3EDBA398) = 4D2D, mMemoryMapKey(3EDE4B98) = 4D2D.
CoreGetMemoryMap: MapKey(3EDBA358) = 4D2D, mMemoryMapKey(3EDE4B98) = 4D2D.
CoreGetMemoryMap: MapKey(3EDBA358) = 4D2E, mMemoryMapKey(3EDE4B98) = 4D2E.
[SAL_ClearAffiliationSMP,324]it is going to hard reset dev:0x500E004AAAAAAA1F
SAS ExitBootServicesEvent
GMAC ExitBootServicesEvent
GMAC ExitBootServicesEvent
OHCI ExitBootServicesEvent
IPMI ExitBootService Event
M3 sram flag set.........done!IPMI ExitBootServicesEvent
p
_ME]t
c dEs b2e SMMU ExitBootServicesEvent
si2
MEa lec sncbees
c dEsablcddsscccess.uc[Me] i2c en]bied suncbss.
[cE] sont[ollec 0 raree nit suc enstuslc
sMfullontrMEl ron raref in t succesif suy
[ME] lon
roE]ec 2 talee nit succissfullyc
[ME]llont[oll r 3ttolee nitasec inst lucc
ssfully
[ME] dimm 0 present
[ME] dimm 0 present
SetOSWatchdog CountDown 3000. Status Success
BIOS Is Over! Bye~
[ME] dimm 1 present
[ME] dimm 1 present
[ME] dimm 2 present
[ME] dimm 2 present
[ME] dimm 3 present
[ME] dimm 3 present
[ME] dimm 4 present
[ME] dimm 4 present
[ME] dimm 5 present
[ME] dimm 5 present
[ME] dimm 6 present
[ME] dimm 6 present
[ME] dimm 7 present
[ME] dimm 7 present
but then nothing (for at least a minute). Trying the boot menu entry one before that, that is nr. #155. No luck either. Back to #178 on openSUSE Leap 15.2. I suggest keeping Leap 15.2 for now but trying different qemu versions. Also, the current default boot entry is still #156, so make sure to activate #178 or similar before rebooting. Maybe a new snapshot created with transactional-update would automatically do that anyway.
Updated by ggardet_arm about 4 years ago
I created https://bugzilla.opensuse.org/show_bug.cgi?id=1177629 to track the kernel traces.
Updated by ggardet_arm about 4 years ago
I installed kernel 5.8.14 from kernel:stable repo this morning to see if things improve, or not. So far, it seems a bit better, but not as good as with Leap 15.1.
Updated by ggardet_arm about 4 years ago
I reverted qemu to 3.1.1.1 a few minutes ago.
Updated by okurz about 4 years ago
- Due date changed from 2020-10-16 to 2020-10-24
ok, those would also be my ideas: kernel and qemu. Let's hope it's one of these :)
Updated by ggardet_arm about 4 years ago
The nightly update installed qemu 4.2 again, so I did it again, plus a lock on the qemu-arm package. It is rebooting atm.
Updated by okurz about 4 years ago
@ggardet_arm the due date has passed, I would like to call this ticket "Resolved" as we are still running openSUSE Leap 15.2 . There is the lock on "qemu-arm" but other packages are up-to-date. Is there anything more you want to do for this ticket why we should keep it in "In Progress"?
Updated by ggardet_arm about 4 years ago
okurz wrote:
@ggardet_arm the due date has passed, I would like to call this ticket "Resolved" as we are still running openSUSE Leap 15.2 . There is the lock on "qemu-arm" but other packages are up-to-date. Is there anything more you want to do for this ticket why we should keep it in "In Progress"?
It is updated but we still have the performance problem...
Updated by livdywan about 4 years ago
- Due date changed from 2020-10-24 to 2020-11-13
Updated by ggardet_arm about 4 years ago
I added dioread_nolock to the ext4 mount options and it seems to be a bit better, but still not as good as Leap 15.1.
UUID=4816fdde-f472-48fe-b35c-35de23396a50 /var/lib/openqa ext4 noatime,data=writeback,commit=1200,dioread_nolock 0 2
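A small sketch for verifying that a mount option such as dioread_nolock is actually present in an fstab entry. The function name and the optional third parameter are hypothetical:

```shell
#!/bin/bash
# Succeed if the fstab entry for a mount point lists the given option.
# Arguments: mount point, option, fstab path (default: /etc/fstab).
has_mount_option() {
    local mountpoint="$1" option="$2" fstab="${3:-/etc/fstab}"
    awk -v mp="$mountpoint" -v opt="$option" '
        # Skip comment lines; field 2 is the mount point, field 4 the option list.
        $1 !~ /^#/ && $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$fstab"
}

# Usage:
#   has_mount_option /var/lib/openqa dioread_nolock && echo "option set in fstab"
# The options of the currently mounted filesystem can be shown with:
#   findmnt -no OPTIONS /var/lib/openqa
```

Note that an fstab entry only takes effect after a remount or reboot, so checking the live mount as well is probably worthwhile.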
Updated by okurz about 4 years ago
- Due date deleted (2020-11-13)
- Target version changed from Ready to future
ok, I see that you did not give up. That's good :) Removing from backlog of SUSE QE Tools team. Feel free to assign back to us whenever you need our help but right now it seems you know better :)
Updated by okurz about 4 years ago
- Subject changed from Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor), also problem auto_review:"backend died: Migrate to file failed" to Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor), also problem auto_review:"(?s)starting: /usr/bin/qemu-system-aarch64.*backend died: Migrate to file failed"
Updated by okurz over 3 years ago
@ggardet_arm how do you see the current situation in 2021?
Updated by ggardet_arm over 3 years ago
- Status changed from In Progress to Resolved