action #72079

coordination #69478: [epic] Upgrade o3+osd workers+webui to openSUSE Leap 15.2

Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor), also problem auto_review:"(?s)starting: /usr/bin/qemu-system-aarch64.*backend died: Migrate to file failed"

Added by okurz 7 months ago. Updated 4 months ago.

Status:
In Progress
Priority:
Normal
Assignee:
Target version:
Start date:
2020-09-29
Due date:
% Done:

0%

Estimated time:

Description

Motivation

ggardet_arm suggests upgrading openqa-aarch64 to Leap 15.2 soon so that he can run some aarch64-specific tests that need qemu >= 4.0. Having one production machine upgraded early can also help us find problems before the other machines follow.

Acceptance criteria

  • AC1: openqa-aarch64 (o3) runs a clean upgraded openSUSE Leap 15.2 (no failed systemd services, no left over .rpm-new files, etc.)
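The two checks named in AC1 can be sketched in Python. This is a minimal sketch under assumptions, not the team's actual procedure: the helper names are hypothetical, the suffixes are the ones rpm itself uses for unmerged config files (.rpmnew, .rpmsave, .rpmorig), and `failed_units` expects text captured from `systemctl --failed --no-legend --plain`.

```python
import os

# Suffixes rpm appends to config files it could not merge automatically.
LEFTOVER_SUFFIXES = (".rpmnew", ".rpmsave", ".rpmorig")


def find_rpm_leftovers(root):
    """Return sorted paths below root whose names end in an rpm leftover suffix."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(LEFTOVER_SUFFIXES):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)


def failed_units(systemctl_output):
    """Extract unit names from `systemctl --failed --no-legend --plain` text.

    Each non-empty line starts with the unit name, followed by the
    LOAD, ACTIVE and SUB columns and a description.
    """
    units = []
    for line in systemctl_output.splitlines():
        fields = line.split()
        if fields:
            units.append(fields[0])
    return units
```

On the worker one would call `find_rpm_leftovers("/etc")` and feed the captured systemctl output to `failed_units`; both returning empty lists would correspond to AC1 as worded above.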

Suggestions

  • read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
  • Reserve some time when the worker is only executing a few or no openQA test jobs
  • Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
  • Use the instructions from above but use transactional-update dup or transactional-update to do the upgrade
  • After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with transactional-update rollback
  • Document in parent ticket so that we keep the learnings in mind for other machines

Further details

  • Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured there are btrfs snapshots to recover, the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker) and there are also other machines that can run aarch64 jobs while this host might be down for a little bit longer. And okurz can hold your hand

Related issues

Related to openQA Infrastructure - action #73297: auto_review:"(?s)Running on openqa-aarch64:.*considering VNC stalled.*THERE IS NOTHING TO READ" (In Progress, 2020-10-13)

Copied to openQA Infrastructure - action #73189: Upgrade o3 workers to openSUSE Leap 15.2 after openqa-aarch64 already done (Resolved)

History

#1 Updated by okurz 6 months ago

  • Description updated (diff)

#2 Updated by okurz 6 months ago

  • Due date set to 2020-10-13
  • Status changed from Workable to Feedback
  • Assignee set to okurz

At first I realized that openqa-aarch64 had not been upgraded for 11 days. It seems that the repo http://download.opensuse.org/repositories/Virtualization/openSUSE_Factory_ARM/ should now be
http://download.opensuse.org/repositories/Virtualization/openSUSE_Factory/ instead. I ran transactional-update dup, rebooted, and then did the upgrade in transactional-update shell, following https://progress.opensuse.org/projects/openqav3/wiki/#Distribution-upgrades.

All good so far, machine is back up, no failed services, all worker instances currently working on jobs. What I have seen in SoL:

Welcome to openSUSE Leap 15.2 - Kernel 5.3.18-lp152.44-default (ttyAMA0).


openqa-aarch64 login: [  120.648267] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [qemu-system-aar:5851]
[  120.668315] watchdog: BUG: soft lockup - CPU#13 stuck for 22s! [qemu-system-aar:4798]
[  120.678321] watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [qemu-system-aar:4354]
[  120.698359] watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [qemu-system-aar:5452]
[  120.699337] watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [qemu-system-aar:5436]
[  120.728308] watchdog: BUG: soft lockup - CPU#30 stuck for 22s! [qemu-system-aar:5875]
[  120.798340] watchdog: BUG: soft lockup - CPU#51 stuck for 22s! [qemu-system-aar:4128]
[  124.678636] watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [qemu-system-aar:5349]
[  124.698641] watchdog: BUG: soft lockup - CPU#21 stuck for 23s! [qemu-system-aar:5373]
[  124.718629] watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [qemu-system-aar:5922]
[  124.798631] watchdog: BUG: soft lockup - CPU#53 stuck for 23s! [migration/53:277]
[  124.808628] watchdog: BUG: soft lockup - CPU#54 stuck for 23s! [qemu-system-aar:5908]
[  144.688583] watchdog: BUG: soft lockup - CPU#20 stuck for 21s! [qemu-system-aar:5910]
[  144.748542] watchdog: BUG: soft lockup - CPU#38 stuck for 22s! [migration/38:202]
[  144.778539] watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [migration/46:242]
[  144.788539] watchdog: BUG: soft lockup - CPU#50 stuck for 22s! [migration/50:262]
[  144.788542] watchdog: BUG: soft lockup - CPU#48 stuck for 22s! [migration/48:252]
[  144.808538] watchdog: BUG: soft lockup - CPU#56 stuck for 22s! [migration/56:292]
[  148.748527] watchdog: BUG: soft lockup - CPU#37 stuck for 22s! [migration/37:197]
[ME] Fault detect start!
[ME] Fault detect start!
[ME] Ipmi data check failed!

The kernel timestamps show that this started about 120 s after boot.
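To see when the lockups happen and which tasks are affected, the console text can be summarized with a short script. This is a minimal sketch: the regular expression is written against the message format shown in the Serial-over-LAN output above, and `summarize_lockups` is a hypothetical helper, not part of any openQA tooling.

```python
import re
from collections import Counter

# Matches lines such as:
# [  124.798631] watchdog: BUG: soft lockup - CPU#53 stuck for 23s! [migration/53:277]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_lockups(console_text):
    """Return (first_ts, last_ts, task_counts) or None if no lockup is found."""
    events = [m.groupdict() for m in LOCKUP_RE.finditer(console_text)]
    if not events:
        return None
    stamps = [float(e["ts"]) for e in events]
    # "migration/53" and "migration/37" are per-CPU kernel threads; group them
    # under the part of the name before the slash.
    tasks = Counter(e["task"].split("/")[0] for e in events)
    return min(stamps), max(stamps), tasks


SAMPLE = """\
openqa-aarch64 login: [  120.648267] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [qemu-system-aar:5851]
[  124.798631] watchdog: BUG: soft lockup - CPU#53 stuck for 23s! [migration/53:277]
[  148.748527] watchdog: BUG: soft lockup - CPU#37 stuck for 22s! [migration/37:197]
"""

first, last, tasks = summarize_lockups(SAMPLE)
print(first, last, dict(tasks))
```

Grouping by task name separates the qemu processes from the kernel's own migration threads, which may help when deciding whether the load comes only from the worker instances.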

So far jobs seem to be fine, e.g. https://openqa.opensuse.org/tests/1427176

#3 Updated by okurz 6 months ago

  • Status changed from Feedback to Resolved

Other jobs have returned results similar to those before the upgrade, so I regard the upgrade as finished successfully.

#4 Updated by okurz 6 months ago

  • Copied to action #73189: Upgrade o3 workers to openSUSE Leap 15.2 after openqa-aarch64 already done added

#5 Updated by ggardet_arm 6 months ago

Thanks for the upgrade to Leap 15.2.
I noticed that some mistyping errors are occurring again, whereas they were nearly absent on this machine before the upgrade.

This is probably related to the CPU stuck messages.

qemu version before: 3.1.1.1
qemu version now: 4.2.1

We should also check the SSD which could be "busy" or slower for some reason.

#6 Updated by ggardet_arm 6 months ago

  • Status changed from Resolved to In Progress

#7 Updated by ggardet_arm 6 months ago

There are also a few corrupted screen cases: https://openqa.opensuse.org/tests/1429734#step/welcome/12

and qemu-img create failures: https://openqa.opensuse.org/tests/1429051

#8 Updated by okurz 6 months ago

  • Due date changed from 2020-10-13 to 2020-10-16
  • Assignee changed from okurz to ggardet_arm

I provided ssh access to the machine for ggardet_arm so that he/you can check yourself. What we could do is 1. rollback to the previous state of 15.1, 2. install a qemu version from the Virtualization repo instead, 3. debug manually on the host, 4. reduce worker instance numbers and monitor the overall load or stress on the system. I would appreciate if you can lead this effort and I just support :)

#10 Updated by cdywan 6 months ago

ggardet_arm wrote:

There is also API failures: https://openqa.opensuse.org/tests/1429899/file/worker-log.txt

I guess you mean these? Looks like DNS issues 🤔

[2020-10-12T15:06:13.0627 CEST] [error] Uploading artefact textinfo-29.png failed: Can't connect: Temporary failure in name resolution
[...]
[2020-10-12T15:37:01.0593 CEST] [error] Error uploading textinfo-systemctl_status.log: connection error: Can't connect: Temporary failure in name resolution

#11 Updated by ggardet_arm 6 months ago

WebUI claims: Reason: api failure

#12 Updated by okurz 6 months ago

  • Subject changed from Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor) to Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor), also problem auto_review:"backend died: Migrate to file failed"

I also found https://openqa.opensuse.org/tests/1430888 with reason "Reason: backend died: Migrate to file failed, it has been running for more than 240 seconds at /usr/lib/os-autoinst/backend/qemu.pm line 267." pointing to performance problems.

#13 Updated by okurz 6 months ago

  • Related to action #73297: auto_review:"(?s)Running on openqa-aarch64:.*considering VNC stalled.*THERE IS NOTHING TO READ" added

#14 Updated by okurz 6 months ago

ggardet_arm asked me to debug boot problems over IPMI, which is unfortunately only accessible from within SUSE. sol activate showed no output, so I did a power reset while sol activate was active in another console. I reached the boot menu, and confirming the default entry indeed yields problems: it fails to boot 15.1, with no helpful message and nothing over serial. I did a power reset again and pressed keys to abort the grub menu timeout, e.g. "esc" and "up". The snapshot menu takes some seconds to load but is there and shows many entries, so that should be ok. I first selected the last good boot, i.e. "* openSUSE Leap 15.2 (,2020-10-13T11:56,post,yast snapper)".

Messages in sol:

Loading Linux 5.3.18-lp152.44-default ...
Loading initial ramdisk ...              
CoreGetMemoryMap: MapKey(3EDBA3E0) = D028, mMemoryMapKey(3EDE4B98) = D028.
CoreGetMemoryMap: MapKey(3EDBA400) = D028, mMemoryMapKey(3EDE4B98) = D028.
Loading driver at 0x0002C261000 EntryPoint=0x0002D25751C
Loading driver at 0x0002C261000 EntryPoint=0x0002D25751C 
CoreGetMemoryMap: MapKey(3EDBA3A8) = D02C, mMemoryMapKey(3EDE4B98) = D02C.
CoreGetMemoryMap: MapKey(3EDBA3A8) = D02D, mMemoryMapKey(3EDE4B98) = D02D.
CoreGetMemoryMap: MapKey(3EDBA358) = D02E, mMemoryMapKey(3EDE4B98) = D02E.
CoreGetMemoryMap: MapKey(3EDBA358) = D02F, mMemoryMapKey(3EDE4B98) = D02F.
EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable, no randomness supplied
EFI stub: Using DTB from configuration table
CoreGetMemoryMap: MapKey(3EDBA348) = D148, mMemoryMapKey(3EDE4B98) = D148.
CoreGetMemoryMap: MapKey(3EDBA348) = D149, mMemoryMapKey(3EDE4B98) = D149.
EFI stub: Exiting boot services and installing virtual address map...
CoreGetMemoryMap: MapKey(3EDBA298) = D1D3, mMemoryMapKey(3EDE4B98) = D1D3.
CoreGetMemoryMap: MapKey(3EDBA298) = D1D4, mMemoryMapKey(3EDE4B98) = D1D4.
CoreGetMemoryMap: MapKey(3EDBA348) = D1D6, mMemoryMapKey(3EDE4B98) = D1D6.
CoreGetMemoryMap: MapKey(3EDBA348) = D1D7, mMemoryMapKey(3EDE4B98) = D1D7.
CoreGetMemoryMap: MapKey(3EDBA308) = D1D7, mMemoryMapKey(3EDE4B98) = D1D7.
CoreGetMemoryMap: MapKey(3EDBA308) = D1D8, mMemoryMapKey(3EDE4B98) = D1D8.
[SAL_ClearAffiliationSMP,324]it is going to hard reset dev:0x500E004AAAAAAA1F
SAS ExitBootServicesEvent
GMAC ExitBootServicesEvent
GMAC ExitBootServicesEvent
OHCI ExitBootServicesEvent
IPMI ExitBootService Event
M3 sram flag set.........done!IPMI ExitBootServicesEvent
sEe c
u[MEi ii]ic
       [MdiSMMU ExitBootServicesEvent
ab[Md  icc ensbl
                dMsucc2s .n
e[ Eu cecs.nab             bMe] iuc dis.
            [lE] cocteoll
t[su] eonfrollru0 esrfuliy
t[MEc eontrolls 1 easeu ly
t[sEc eontuolyer M ]arof ioil ruc esrfu ly
 successfully
[ME] dimm 0 present
[ME] dimm 0 present
 SetOSWatchdog  CountDown 3000. Status Success
BIOS Is Over!  Bye~ 
[ME] dimm 1 present
[ME] dimm 1 present
[ME] dimm 2 present
[ME] dimm 2 present
[ME] dimm 3 present
[ME] dimm 3 present
[ME] dimm 4 present
[ME] dimm 4 present
[ME] dimm 5 present
[ME] dimm 5 present
[ME] dimm 6 present
[ME] dimm 6 present
[ME] dimm 7 present
[ME] dimm 7 present

and after some seconds the kernel boots. The system booted successfully, so I triggered a reboot to cross-check whether I can find a bootable Leap 15.1 variant.

Selected "openSUSE Leap 15.1 (,2020-10-10T07:42,Snapshot Update of #156)" and I get:

  Booting `openSUSE Leap 15.1 '

Loading Linux 4.12.14-lp151.28.67-default ...
Loading initial ramdisk ...                  
CoreGetMemoryMap: MapKey(3EDBA410) = 4D1E, mMemoryMapKey(3EDE4B98) = 4D1E.
CoreGetMemoryMap: MapKey(3EDBA430) = 4D1E, mMemoryMapKey(3EDE4B98) = 4D1E.
Loading driver at 0x0002D596000 EntryPoint=0x0002E30BB10
Loading driver at 0x0002D596000 EntryPoint=0x0002E30BB10 
CoreGetMemoryMap: MapKey(3EDBA3E8) = 4D22, mMemoryMapKey(3EDE4B98) = 4D22.
CoreGetMemoryMap: MapKey(3EDBA3E8) = 4D23, mMemoryMapKey(3EDE4B98) = 4D23.
CoreGetMemoryMap: MapKey(3EDBA398) = 4D24, mMemoryMapKey(3EDE4B98) = 4D24.
CoreGetMemoryMap: MapKey(3EDBA398) = 4D25, mMemoryMapKey(3EDE4B98) = 4D25.
SmiGraphicsOutputQueryMode +
SmiGraphicsOutputQueryMode -
SmiGraphicsOutputQueryMode +
SmiGraphicsOutputQueryMode -
CoreGetMemoryMap: MapKey(3EDBA398) = 4D28, mMemoryMapKey(3EDE4B98) = 4D28.
CoreGetMemoryMap: MapKey(3EDBA398) = 4D29, mMemoryMapKey(3EDE4B98) = 4D29.
CoreGetMemoryMap: MapKey(3EDBA2E8) = 4D29, mMemoryMapKey(3EDE4B98) = 4D29.
CoreGetMemoryMap: MapKey(3EDBA2E8) = 4D2A, mMemoryMapKey(3EDE4B98) = 4D2A.
CoreGetMemoryMap: MapKey(3EDBA398) = 4D2C, mMemoryMapKey(3EDE4B98) = 4D2C.
CoreGetMemoryMap: MapKey(3EDBA398) = 4D2D, mMemoryMapKey(3EDE4B98) = 4D2D.
CoreGetMemoryMap: MapKey(3EDBA358) = 4D2D, mMemoryMapKey(3EDE4B98) = 4D2D.
CoreGetMemoryMap: MapKey(3EDBA358) = 4D2E, mMemoryMapKey(3EDE4B98) = 4D2E.
[SAL_ClearAffiliationSMP,324]it is going to hard reset dev:0x500E004AAAAAAA1F
SAS ExitBootServicesEvent
GMAC ExitBootServicesEvent
GMAC ExitBootServicesEvent
OHCI ExitBootServicesEvent
IPMI ExitBootService Event
M3 sram flag set.........done!IPMI ExitBootServicesEvent
p
 _ME]t
c dEs b2e SMMU ExitBootServicesEvent
si2
    MEa lec sncbees 
c dEsablcddsscccess.uc[Me] i2c en]bied suncbss. 
                                                [cE] sont[ollec 0 raree  nit suc enstuslc
sMfullontrMEl ron raref in t succesif suy
[ME] lon
        roE]ec 2 talee  nit succissfullyc
                                         [ME]llont[oll r 3ttolee  nitasec inst lucc
                                                                                   ssfully
[ME] dimm 0 present
[ME] dimm 0 present
 SetOSWatchdog  CountDown 3000. Status Success
BIOS Is Over!  Bye~ 
[ME] dimm 1 present
[ME] dimm 1 present
[ME] dimm 2 present
[ME] dimm 2 present
[ME] dimm 3 present
[ME] dimm 3 present
[ME] dimm 4 present
[ME] dimm 4 present
[ME] dimm 5 present
[ME] dimm 5 present
[ME] dimm 6 present
[ME] dimm 6 present
[ME] dimm 7 present
[ME] dimm 7 present

but then nothing (for at least a minute). I tried the boot menu entry before that, i.e. #155. No luck either. Back to #178 on openSUSE Leap 15.2. I suggest keeping Leap 15.2 for now but trying different qemu versions. Also, the current default boot entry is still #156, so make sure to activate #178 or similar before rebooting. Maybe a new snapshot created with transactional-update would do that automatically anyway.

#15 Updated by ggardet_arm 6 months ago

I created https://bugzilla.opensuse.org/show_bug.cgi?id=1177629 to track the kernel traces.

#16 Updated by ggardet_arm 6 months ago

I installed kernel 5.8.14 from kernel:stable repo this morning to see if things improve, or not. So far, it seems a bit better, but not as good as with Leap 15.1.

#17 Updated by ggardet_arm 6 months ago

I reverted qemu to 3.1.1.1 a few minutes ago.

#18 Updated by okurz 6 months ago

  • Due date changed from 2020-10-16 to 2020-10-24

ok, those would also be my ideas: kernel and qemu. Let's hope it's one of these :)

#19 Updated by ggardet_arm 6 months ago

The nightly update installed qemu 4.2 again, so I reverted it again and added a lock on the qemu-arm package. It is rebooting atm.

#20 Updated by okurz 6 months ago

ggardet_arm the due date has passed, I would like to call this ticket "Resolved" as we are still running openSUSE Leap 15.2 . There is the lock on "qemu-arm" but other packages are up-to-date. Is there anything more you want to do for this ticket why we should keep it in "In Progress"?

#21 Updated by ggardet_arm 6 months ago

okurz wrote:

ggardet_arm the due date has passed, I would like to call this ticket "Resolved" as we are still running openSUSE Leap 15.2 . There is the lock on "qemu-arm" but other packages are up-to-date. Is there anything more you want to do for this ticket why we should keep it in "In Progress"?

It is updated but we still have the performance problem...

#22 Updated by cdywan 5 months ago

  • Due date changed from 2020-10-24 to 2020-11-13

#23 Updated by ggardet_arm 5 months ago

I added dioread_nolock to the ext4 mount options and it seems to be a bit better, but still not as good as Leap 15.1.

UUID=4816fdde-f472-48fe-b35c-35de23396a50   /var/lib/openqa        ext4       noatime,data=writeback,commit=1200,dioread_nolock  0  2
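The fstab entry above can be checked programmatically, for example to confirm that the option is still present after a later upgrade. This is a minimal sketch with hypothetical helper names; it only splits the line into the six standard fstab fields and does not validate the options themselves.

```python
def parse_fstab_line(line):
    """Split one fstab entry into a dict; mount options become a list."""
    fields = line.split()
    if line.lstrip().startswith("#") or not 4 <= len(fields) <= 6:
        raise ValueError("expected four to six whitespace-separated fstab fields")
    # The dump and pass fields are optional and default to 0.
    fields += ["0"] * (6 - len(fields))
    spec, mountpoint, fstype, options, dump, passno = fields
    return {
        "spec": spec,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": options.split(","),
        "dump": int(dump),
        "passno": int(passno),
    }


def option_value(entry, key):
    """Return the value of key=value, True for a bare flag, None if absent."""
    for opt in entry["options"]:
        name, sep, value = opt.partition("=")
        if name == key:
            return value if sep else True
    return None


# The entry quoted in the comment above.
ENTRY = (
    "UUID=4816fdde-f472-48fe-b35c-35de23396a50   /var/lib/openqa        ext4"
    "       noatime,data=writeback,commit=1200,dioread_nolock  0  2"
)

entry = parse_fstab_line(ENTRY)
print(entry["mountpoint"], option_value(entry, "dioread_nolock"), option_value(entry, "commit"))
```

Note that data=writeback and commit=1200 trade crash safety for throughput, which fits the ticket's remark that the host holds no important data.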

#24 Updated by okurz 5 months ago

  • Due date deleted (2020-11-13)
  • Target version changed from Ready to future

ok, I see that you did not give up. That's good :) Removing from backlog of SUSE QE Tools team. Feel free to assign back to us whenever you need our help but right now it seems you know better :)

#25 Updated by okurz 4 months ago

  • Priority changed from High to Normal

#26 Updated by okurz 4 months ago

  • Subject changed from Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor), also problem auto_review:"backend died: Migrate to file failed" to Upgrade o3 worker openqa-aarch64 to openSUSE Leap 15.2 (to use newer packages specifically needed for aarch64 and as precursor), also problem auto_review:"(?s)starting: /usr/bin/qemu-system-aarch64.*backend died: Migrate to file failed"
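The effect of the narrower auto_review pattern can be illustrated with Python's re module. This is a sketch under assumptions: it assumes the pattern is searched in the job's log text by a Perl-compatible engine, for which (?s) behaves the same way as in Python, and the two log excerpts are hypothetical rather than copied from real jobs.

```python
import re

OLD = r"backend died: Migrate to file failed"
NEW = r"(?s)starting: /usr/bin/qemu-system-aarch64.*backend died: Migrate to file failed"

# Hypothetical log excerpts for illustration only.
AARCH64_LOG = (
    "starting: /usr/bin/qemu-system-aarch64 -device virtio-gpu-pci\n"
    "... many lines of test output ...\n"
    "backend died: Migrate to file failed, it has been running for more than 240 seconds\n"
)
X86_LOG = (
    "starting: /usr/bin/qemu-system-x86_64 -vga cirrus\n"
    "backend died: Migrate to file failed, it has been running for more than 240 seconds\n"
)


def matches(pattern, text):
    """Return True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


# The old pattern matches the failure regardless of architecture.
print(matches(OLD, AARCH64_LOG), matches(OLD, X86_LOG))
# The new pattern only matches when qemu-system-aarch64 was started earlier.
print(matches(NEW, AARCH64_LOG), matches(NEW, X86_LOG))
# Without (?s), "." does not cross line breaks, so the two parts never connect.
print(matches(NEW.replace("(?s)", "", 1), AARCH64_LOG))
```

The (?s) flag is what lets ".*" span the many log lines between the qemu start line and the failure message.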
