action #66236

aarch64.o.o root filesystem seems to be broken

Added by okurz over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
Start date:
2020-04-29
Due date:
% Done:

0%

Estimated time:

Description

Observation

On 2020-04-29 the machine aarch64.o.o did not come up when trying to boot from an older snapshot after "transactional-update rollback last && reboot". grub shows only older snapshots, e.g. from 2020-03. btrfs and sysrich tried to recover the system but failed. Most likely we need to reinstall.


Related issues

Related to openQA Infrastructure - action #66340: openqa-aarch64 :15 and :16 are not started after reboot (Resolved, 2020-05-01)

History

#1 Updated by okurz over 1 year ago

I did a simple backup from within the IPMI SOL session, for convenience onto /var/lib/openqa as we have an ext4 filesystem there:

mkdir -p /var/lib/openqa/backup/root/$(date +%F)
rsync -aHP --one-file-system / /var/lib/openqa/backup/root/2020-04-29/
rsync -aHP --one-file-system /etc/ /var/lib/openqa/backup/root/2020-04-29/etc/
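
The mkdir above uses $(date +%F) while the rsync targets hardcode the date, so the two only agree when run on the same day. A minimal sketch that computes the destination once; the function name backup_dest and the BACKUP_BASE variable are illustrative and not part of the original setup:

```shell
#!/bin/sh
# Hypothetical helper: build the dated backup directory path in one place.
BACKUP_BASE="${BACKUP_BASE:-/var/lib/openqa/backup/root}"

backup_dest() {
    # Prints the base path followed by today's date in YYYY-MM-DD form.
    printf '%s/%s\n' "$BACKUP_BASE" "$(date +%F)"
}

# Usage (not executed here; needs root and the mounted ext4 partition):
#   dest=$(backup_dest)
#   mkdir -p "$dest"
#   rsync -aHP --one-file-system / "$dest/"
#   rsync -aHP --one-file-system /etc/ "$dest/etc/"
```

Storing the path in a variable keeps a backup that runs across midnight in a single directory.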

current filesystem setup:

# lsblk 
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 447.1G  0 disk 
├─sda1   8:1    0   500M  0 part /boot/efi
├─sda2   8:2    0    30G  0 part /
├─sda3   8:3    0 414.7G  0 part /var/lib/openqa
└─sda4   8:4    0     2G  0 part [SWAP]

#2 Updated by favogt over 1 year ago

I reinstalled by booting the net install linux and initrd from the repo.
Unfortunately, I screwed up: In the partitioning dialog I thought it said "Delete btrfs on ...", but it didn't - it chose /dev/sda3 as new root filesystem, whoops...
The installation summary page also doesn't show the partitioning changes, which is probably a bug.
So I redid the backup of / and repeated the installation.

As sda3 was btrfs now, I had to mkfs. That resulted in this:

# mkfs.ext4 /dev/sda3
mke2fs 1.43.8 (1-Jan-2018)
/dev/sda3 contains a btrfs file system
Proceed anyway? (y,N) y
Discarding device blocks: [  902.595132] Internal error: Oops: 96000045 [#1] SMP
[  902.599998] Modules linked in: vfat fat usb_storage btrfs zstd_compress zlib_deflate xor raid6_pq dm_multipath dm_mod 8021q garp mrp stp llc arc4 nfs lockd grace fscache nls_iso8859_1 nls_cp437 af_packet sg st sr_mod cdrom sunrpc efivarfs joydev hid_generic usbhid ipmi_ssif marvell hibmc_drm ttm drm_kms_helper ehci_platform syscopyarea sysfillrect sysimgblt aes_ce_blk fb_sys_fops crypto_simd cryptd ehci_hcd drm aes_ce_cipher crc32_ce crct10dif_ce ghash_ce aes_arm64 hisi_sas_v2_hw ipmi_si usbcore sha2_ce sha256_arm64 hisi_sas_main ipmi_devintf sha1_ce libsas i2c_designware_platform drm_panel_orientation_quirks hns_dsaf ipmi_msghandler i2c_designware_core scsi_transport_sas hns_enet_drv hns_mdio hnae scsi_dh_rdac scsi_dh_emc scsi_dh_alua squashfs zstd_decompress xxhash loop
[  902.668422] CPU: 27 PID: 510 Comm: kworker/27:1 Not tainted 4.12.14-lp151.27-default #1 openSUSE Leap 15.1
[  902.678060] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.50 06/01/2018
[  902.685275] Workqueue: events cache_reap
[  902.689185] task: ffff801fae6f6180 task.stack: ffff801fae6f8000
[  902.695091] pstate: 40000085 (nZcv daIf -PAN -UAO)
[  902.699870] pc : free_block+0x118/0x1e8
[  902.703693] lr : drain_array_locked+0x68/0xf8
[  902.708035] sp : ffff801fae6fbc60
[  902.711336] x29: ffff801fae6fbc60 x28: ffff801fbbd7e208 
[  902.716636] x27: ffff000008f9e000 x26: ffff000008f40200 
[  902.721935] x25: ffff801fb89cd010 x24: ffff801fae6fbd30 
[  902.727234] x23: ffff801fb89cd010 x22: ffff801fbb400100 
[  902.732533] x21: 0000000000000060 x20: ffff801fae6fbd30 
[  902.737831] x19: ffff801fbb400100 x18: 0000000000000000 
[  902.743130] x17: 0000000000000001 x16: 0000000000000001 
[  902.748429] x15: 0000000000000005 x14: ffff801fbb401108 
[  902.753729] x13: ffff801fbb401128 x12: ffff7e0000000000 
[  902.759028] x11: dead000000000100 x10: dead000000000200 
[  902.764327] x9 : ffff801fb89cd310 x8 : ffff7e007de9b060 
[  902.769626] x7 : ffff801f7a6c10c0 x6 : 00000000ffffffff 
[  902.774925] x5 : 0000000000000000 x4 : ffff801f7a6c1c40 
[  902.780224] x3 : ffff801fbb401100 x2 : 0000000000000003 
[  902.785523] x1 : ffff801fb89cd2f0 x0 : ffff7e007de9b040 
[  902.790823] Process kworker/27:1 (pid: 510, stack limit = 0xffff801fae6f8000)
[  902.797943] Call trace:
[  902.800378]  free_block+0x118/0x1e8
[  902.803853]  drain_array_locked+0x68/0xf8
[  902.807849]  drain_array+0x80/0xb0
[  902.811237]  cache_reap+0x11c/0x258
[  902.814713]  process_one_work+0x1e4/0x430
[  902.818709]  worker_thread+0x50/0x478
[  902.822358]  kthread+0x134/0x138
[  902.825574]  ret_from_fork+0x10/0x20
[  902.829137] Code: 1ad02442 0b050042 1acf2442 b4000624 (38264882) 
[  902.835229] ---[ end trace 0ada5e0d2e71f676 ]---
[  946.555094] BUG: workqueue lockup - pool cpus=27 node=0 flags=0x1 nice=0 stuck for 43s!
[  946.563096] Showing busy workqueues and worker pools:
[  946.568151] workqueue events: flags=0x0
[  946.568165]   pwq 54: cpus=27 node=0 flags=0x1 nice=0 active=1/256
[  946.571984]     in-flight: 510:cache_reap
[  946.582157]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  946.582164]     pending: vmstat_shepherd
[  946.592132] workqueue events_power_efficient: flags=0x80
[  946.597437]   pwq 54: cpus=27 node=0 flags=0x1 nice=0 active=1/256
[  946.597442]     pending: fb_flashcursor
[  946.607456] workqueue mm_percpu_wq: flags=0x8
[  946.611806]   pwq 54: cpus=27 node=0 flags=0x1 nice=0 active=1/256
[  946.611811]     pending: vmstat_update
[  946.622174] pool 54: cpus=27 node=0 flags=0x1 nice=0 hung=43s workers=2 manager: 174
[  962.865086] INFO: rcu_sched detected stalls on CPUs/tasks:
[  962.870561]  27-...: (1 GPs behind) idle=ad2/140000000000000/0 softirq=1748/1748 fqs=3001 
[  962.878811]  (detected by 23, t=6002 jiffies, g=2173, c=2172, q=38)
[  962.885069] Task dump for CPU 27:
[  962.888371] kworker/27:0    R  running task        0   174      2 0x0000000a
[  962.895411] Call trace:
[  962.897847]  __switch_to+0x9c/0xe0
[  962.901236]  0xffff801fb53c7dc0
[  962.904364]  worker_thread+0x34c/0x478
[  962.908100]  kthread+0x134/0x138
[  962.911315]  ret_from_fork+0x10/0x20

and the system was completely unresponsive, needed a hard reset. A retry with -E nodiscard worked.
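
For reference, the retry was presumably of the following form. nodiscard is a documented mke2fs extended option that skips discarding device blocks at mkfs time, which is the step during which the oops above occurred. The wrapper name is hypothetical and it only prints the command, since running it destroys the filesystem on the target device:

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the mkfs.ext4 call with discard disabled.
mkfs_ext4_nodiscard_cmd() {
    # $1: target block device, e.g. /dev/sda3
    printf 'mkfs.ext4 -E nodiscard %s\n' "$1"
}

# Prints the command for review instead of executing it.
mkfs_ext4_nodiscard_cmd /dev/sda3
```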

The installation is done now and the system is up again. I'll install openQA-worker, configure hugepages and try to work around bsc#1142000.

#3 Updated by favogt over 1 year ago

I had to aa-teardown to get qemu to work with hugepages (though that somehow started working OOTB after a reboot) and install os-autoinst-distri-opensuse-deps manually. I also removed plymouth (zypper rm -u libply*; mkinitrd). Copied over lines from /etc/fstab and the files /etc/openqa/, /etc/firewalld/zones/, /etc/sysconfig/network/*, enabled some services and rebooted.

#4 Updated by okurz over 1 year ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz
  • Priority changed from Urgent to High

Thanks a lot to fvogt for the quick and hard work to get the machine back, much appreciated :)

https://openqa.opensuse.org/tests/1249821 shows another problem: "Could not open '/usr/share/qemu/aavmf-aarch64-opensuse-code.bin': No such file or directory". This is a known issue (see other ticket). Solved with

mkdir /root/qemu/
cp -a /var/lib/openqa/oldroot/usr/share/qemu/*opensuse* /root/qemu/
transactional-update shell
cp -a /root/qemu/* /usr/share/qemu/
exit
reboot

Monitoring rescheduled test: https://openqa.opensuse.org/tests/1249821

#5 Updated by okurz over 1 year ago

https://openqa.opensuse.org/tests/1249821 looks fine but the developer mode does not work yet. firewall-cmd --list-all-zones shows:

internal (active)
  target: default
  icmp-block-inversion: no
  interfaces: br1
  sources: 
  services: ssh mdns samba-client dhcpv6-client
  ports: 
  protocols: 
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules: 


public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources: 
  services: ssh dhcpv6-client
  ports: 
  protocols: 
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules: 


trusted (active)
  target: ACCEPT
  icmp-block-inversion: no
  interfaces: ovs-system tap0 tap1 tap128 tap129 tap130 tap131 tap132 tap133 tap2 tap3 tap4 tap5 tap64 tap65 tap66 tap67 tap68 tap69
  sources: 
  services: 
  ports: 
  protocols: 
  masquerade: yes
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules: 

which I think should show br1 + eth0 in trusted.

#6 Updated by okurz over 1 year ago

Using http://open.qa/docs/#_steps_to_debug_developer_mode_setup I could confirm that the live handler daemon on o3 can be reached and that the webui host tries to reach the right host and port on the machine aarch64, but the port is not reachable, probably due to the firewall as stated before. I fixed that with "for i in $bridge $dev ovs-system; do firewall-cmd --zone=trusted --change-interface=$i; done", using --change-interface instead of --add-interface. I also changed that in https://github.com/okurz/openQA/blob/feature/setup_mm/script/setup_mm

In https://openqa.opensuse.org/tests/1249821 I could confirm that the developer mode works fine now.

Will wait for more tests to complete, e.g. also "wicked-basic" in https://openqa.opensuse.org/tests/1250515#next_previous . I suspect that it will again fail in a dubious way as it did in the past :)

#7 Updated by okurz over 1 year ago

I have seen in at least one test job that stalls were detected. Comparing /etc/fstab with the previous file in /var/lib/openqa/oldroot/etc/fstab, I realized that /var/lib/openqa, the mount point that is also used to store the machine pool, is specified in /etc/fstab with "defaults" and hence mounted with the options "rw,relatime,data=ordered". Previously we had "noatime" in /etc/fstab. I think we can be even more aggressive as there is no important data on /var/lib/openqa. If in doubt we can completely recreate the worker cache and pool on reboot, which we already do for osd workers, though we do not reboot them that often.

https://ext4.wiki.kernel.org/index.php/Ext3_Data=Ordered_vs_Data=Writeback_mode suggests writeback for better performance at a higher risk of data corruption. I am conducting broader research in #64746, so let's experiment with aarch64.o.o :) Here I am also following https://wiki.archlinux.org/index.php/Ext4#Improving_performance and set noatime,data=writeback,commit=1200 but did not try barrier=0 yet.
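
To compare the configured options between the old and the new fstab, a small lookup like the following could be used. This is a sketch: the function name fstab_opts is made up, and the example entry in the comment is an assumption (the real file may reference the partition by UUID and use different dump/pass fields).

```shell
#!/bin/sh
# Hypothetical helper: print the options field for a mount point from an fstab file.
# An entry with the options from this comment might look like (assumed layout):
#   /dev/sda3  /var/lib/openqa  ext4  noatime,data=writeback,commit=1200  0 0
fstab_opts() {
    # $1: mount point, $2: fstab path (defaults to /etc/fstab)
    awk -v mp="$1" '$1 !~ /^#/ && $2 == mp { print $4 }' "${2:-/etc/fstab}"
}

# Usage:
#   fstab_opts /var/lib/openqa
#   fstab_opts /var/lib/openqa /var/lib/openqa/oldroot/etc/fstab
```

The options actually in effect can be read from /proc/mounts in the same way, since it uses the same column layout.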

wicked_basic_sut passed in https://openqa.opensuse.org/tests/1251102# , that was unexpected ;)

But some other tests failed which I can not yet explain, e.g.

I reduced the worker instances from 16 to 14 for now.

#8 Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

I think we've done well :)

#9 Updated by okurz over 1 year ago

  • Related to action #66340: openqa-aarch64 :15 and :16 are not started after reboot added

#10 Updated by okurz over 1 year ago

  • Subject changed from aarch64.o.o root filesystem seems to be broken. to aarch64.o.o root filesystem seems to be broken
  • Status changed from Resolved to Feedback

I ran systemctl enable --now openqa-worker@{15..16} after #66340 and will wait for more feedback regarding stability. Also see #66337

Found a mismatch in zypper repos, added prios to openQA repos and added virtualization repo for #51953

#11 Updated by ggardet_arm over 1 year ago

It seems that the updates from Update repo are not applied.
For example, QEMU version is 3.1.0 atm, whereas version 3.1.1.1 was used previously and is available at http://download.opensuse.org/ports/update/leap/15.1/oss/

#12 Updated by ggardet_arm over 1 year ago

It seems that the updates from Update repo are not applied.
For example, QEMU version is 3.1.0 atm, whereas version 3.1.1.1 was used previously and is available at http://download.opensuse.org/ports/update/leap/15.1/oss/

okurz wrote:

Using http://open.qa/docs/#_steps_to_debug_developer_mode_setup I could confirm that the live handler daemon on o3 can be reached and that the webui host tries to reach the right host and port on the machine aarch64, but the port is not reachable, probably due to the firewall as stated before. I fixed that with "for i in $bridge $dev ovs-system; do firewall-cmd --zone=trusted --change-interface=$i; done", using --change-interface instead of --add-interface. I also changed that in https://github.com/okurz/openQA/blob/feature/setup_mm/script/setup_mm

In https://openqa.opensuse.org/tests/1249821 I could confirm that the developer mode works fine now.

Will wait for more tests to complete, e.g. also "wicked-basic" in https://openqa.opensuse.org/tests/1250515#next_previous . I suspect that it will again fail in a dubious way as it did in the past :)

Developer mode does not work. It fails with unable to upgrade ws to command server.

#13 Updated by okurz over 1 year ago

ggardet_arm wrote:

It seems that the updates from Update repo are not applied.
For example, QEMU version is 3.1.0 atm, whereas version 3.1.1.1 was used previously and is available at http://download.opensuse.org/ports/update/leap/15.1/oss/

Wrong update repos are configured on the machine. I wonder if that had been done wrong by the installer or by fvogt. At least openSUSE Leap 15.2 aarch64 in openQA shows correct repos: https://openqa.opensuse.org/tests/1251595#step/zypper_clear_repos/4

I fixed that now on the machine and tried to streamline repos more.

Developer mode does not work. It fails with unable to upgrade ws to command server.

fixed with

bridge=br1; dev=eth0
for i in $bridge $dev ovs-system; do firewall-cmd --zone=trusted --change-interface=$i; done
firewall-cmd --runtime-to-permanent

Previously I forgot the firewall-cmd --runtime-to-permanent part.
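
A quick way to confirm the result is to ask firewalld which zone each interface is bound to. The sketch below wraps firewall-cmd --get-zone-of-interface; the function name check_zone and the ZONE_LOOKUP override (used only to exercise the check without a running firewalld) are assumptions, not part of the original setup.

```shell
#!/bin/sh
# Hypothetical check: report every interface that is not in the expected zone.
# ZONE_LOOKUP can point to a stand-in command for testing; default is firewall-cmd.
check_zone() {
    expected=$1
    shift
    rc=0
    for i in "$@"; do
        actual=$("${ZONE_LOOKUP:-firewall-cmd}" --get-zone-of-interface="$i" 2>/dev/null)
        if [ "$actual" != "$expected" ]; then
            echo "$i: zone '${actual:-none}', expected '$expected'"
            rc=1
        fi
    done
    return $rc
}

# Usage on the worker:
#   check_zone trusted br1 eth0 ovs-system
```

This only checks the runtime configuration. Since the missing piece here was --runtime-to-permanent, the permanent configuration is worth checking too, e.g. with firewall-cmd --permanent --get-zone-of-interface=br1, or simply by checking again after a reboot.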

#14 Updated by favogt over 1 year ago

Other than adding devel:openQA(:Leap:15.1), I didn't do any modifications to the repository configuration.
In which way was it broken?

#15 Updated by okurz over 1 year ago

favogt wrote:

In which way was it broken?

old:

10 | repo-update            | Main Update Repository                | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | http://download.opensuse.org/update/leap/15.1/oss                 

new, fixed:

 8 | repo-ports-update | repo-update                           | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | http://download.opensuse.org/ports/update/leap/15.1/oss/               

so instead of http://download.opensuse.org/update/leap/15.1/oss which is not providing aarch64 updates we must use the URL that ggardet mentioned: http://download.opensuse.org/ports/update/leap/15.1/oss/. The symptom: Updates are visible in src packages but no binary packages exist. Installed 299 packages from the ports update repo.
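
The difference between the two URLs is only the /ports/ path component, which is easy to miss in the repo list. A minimal sketch of a check for it; the function name is_ports_url is hypothetical and the pattern only covers download.opensuse.org:

```shell
#!/bin/sh
# Hypothetical check: does a repo URL point into the ports tree of download.opensuse.org?
is_ports_url() {
    case "$1" in
        *://download.opensuse.org/ports/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage with the two URLs from this comment:
for url in \
    http://download.opensuse.org/update/leap/15.1/oss \
    http://download.opensuse.org/ports/update/leap/15.1/oss/
do
    if is_ports_url "$url"; then
        echo "ports: $url"
    else
        echo "not ports: $url"
    fi
done
```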

Overall we seem to have some instability. I wonder if this is related to changes on the worker machine, and whether the stability of tests is in general comparable to the situation before the OS reinstall or worse. Also, aarch64 tests are still running with just a single CPU core; maybe we should use more up-to-date defaults, e.g. QEMUCPUS=4. I am trying that with "openqa-clone-job --within-instance https://openqa.opensuse.org/tests/1249887 TEST=xfce-okurz-qemucpus-4 QEMUCPUS=4", triggered as https://openqa.opensuse.org/tests/1251864 . This failed in https://openqa.opensuse.org/tests/1251864#step/x_vt/3 with mistyping "|GREP" instead of "| grep". Updates might help.

Also I have appended all boot options we had previously from /var/lib/openqa/oldroot/etc/default/grub . There was still a random reboot problem for which I brought back old shutdown debugging info. We can experiment with removing old kernel options afterwards if we like:

The change I did:

-GRUB_CMDLINE_LINUX_DEFAULT="console=ttyAMA0,115200n quiet mitigations=off default_hugepagesz=1G hugepagesz=1G hugepages=64"
+GRUB_CMDLINE_LINUX_DEFAULT="console=ttyAMA0,115200 splash=silent quiet showopts nospec spectre_v2=off pti=off kpti=off default_hugepagesz=1G hugepagesz=1G hugepages=64 crashkernel=167M systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M printk.devkmsg=on enforcing=0"
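
Because both values are long single lines, the individual option changes are hard to spot in the diff above. A sketch that lists removed and added options one per line; the function name cmdline_diff is made up, and it assumes options are separated by spaces and contain none themselves:

```shell
#!/bin/sh
# Hypothetical helper: compare two kernel command lines option by option.
# Prints removed options prefixed with "-" and added ones prefixed with "+".
cmdline_diff() {
    old=$(mktemp)
    new=$(mktemp)
    # One option per line, sorted, because comm requires sorted input.
    printf '%s\n' "$1" | tr -s ' ' '\n' | sort -u > "$old"
    printf '%s\n' "$2" | tr -s ' ' '\n' | sort -u > "$new"
    comm -23 "$old" "$new" | sed 's/^/-/'   # only in the old command line
    comm -13 "$old" "$new" | sed 's/^/+/'   # only in the new command line
    rm -f "$old" "$new"
}

# Usage: pass the two quoted values of GRUB_CMDLINE_LINUX_DEFAULT, e.g.
#   cmdline_diff "quiet mitigations=off" "quiet nospec enforcing=0"
```

Options present in both command lines, such as the hugepages settings, are left out of the output.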

#16 Updated by ggardet_arm over 1 year ago

Yes, it was a known problem and should have been fixed a while ago, including for 15.1 IIRC. Maybe the fix came post GA and is only available when the network is used at installation time?
Anyway, it is fixed for 15.2 for sure.

#17 Updated by okurz over 1 year ago

My test run with 4 CPU cores for xfce passed in https://openqa.opensuse.org/tests/1252459 with a runtime of 1:54h vs. 2:06h with 1 core in https://openqa.opensuse.org/tests/1252456 . Both passed fine with the updated mitigation command line changes. The difference is not that significant.
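
To put a number on that: 1:54h is 114 minutes and 2:06h is 126 minutes, so the 4-core run saved 12 minutes, a bit under 10%. A small sketch of the arithmetic with integer rounding; the function names are illustrative and the input is assumed to be H:MM with two-digit minutes:

```shell
#!/bin/sh
# Hypothetical helpers for comparing two H:MM runtimes.
to_minutes() {
    h=${1%%:*}
    m=${1##*:}
    # Strip one leading zero so values like "06" are not parsed as octal.
    echo $(( h * 60 + ${m#0} ))
}

reduction_percent() {
    # $1: baseline runtime, $2: new runtime; result is rounded down.
    base=$(to_minutes "$1")
    new=$(to_minutes "$2")
    echo $(( (base - new) * 100 / base ))
}

reduction_percent 2:06 1:54   # prints 9
```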

#18 Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

Machine is up, was upgraded and rebooted automatically this morning. qemu version qemu-3.1.1.1-lp151.7.12.1.aarch64 from update. developer mode works, firewall config seems correct as well. No surprising Failed|Incomplete

Reviewed many more failed tests and have found no "unusual" failures, many known bugs and some sporadic failures that I think should be fixed but I do not want to meddle with this right now for this ticket :)

@ggardet I would love to hear your "success story" of a cloud runner. I think we can include some points about it in the openQA documentation.
