Project

General

Profile

action #103581

Many jobs on openqa.opensuse.org incomplete in ' timeout: setup exceeded MAX_SETUP_TIME'

Added by Xiaojing_liu 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-12-07
Due date:
% Done:

100%

Estimated time:
Difficulty:

Description

Observation

There are many jobs on openqa.opensuse.org incomplete in 'timeout: setup exceeded MAX_SETUP_TIME'
The log show:

[2021-12-07T04:47:55.572079+01:00] [info] +++ setup notes +++
[2021-12-07T04:47:55.572463+01:00] [info] Running on openqaworker4:2 (Linux 5.3.18-59.34-default #1 SMP Thu Nov 11 12:18:45 UTC 2021 (a2a53aa) x86_64)
[2021-12-07T04:47:55.577783+01:00] [debug] Found ASSET_1, caching Tumbleweed.x86_64-1.0-libvirt-Snapshot20211205.vagrant.libvirt.box
[2021-12-07T04:47:55.580926+01:00] [info] Downloading Tumbleweed.x86_64-1.0-libvirt-Snapshot20211205.vagrant.libvirt.box, request #495088 sent to Cache Service
[2021-12-07T05:47:55.606967+01:00] [info] +++ worker notes +++
[2021-12-07T05:47:55.607179+01:00] [info] End time: 2021-12-07 04:47:55
[2021-12-07T05:47:55.607254+01:00] [info] Result: timeout
[2021-12-07T05:47:55.619503+01:00] [info] Uploading autoinst-log.txt

Here is an example: https://openqa.opensuse.org/tests/2073891#

Suggestion

Rollback steps

  • Enable default worker classes, e.g. "qemu_x86_64" and "qemu_i586" again on openqaworker4 in /etc/openqa/workers.ini

Related issues

Related to openQA Infrastructure - action #103524: OW1: performance loss size:MResolved2021-12-06

History

#1 Updated by okurz 5 months ago

  • Priority changed from Urgent to Immediate
  • Target version set to Ready

#2 Updated by okurz 5 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

I am trying to disable the machine for now and retrigger according failures.

#3 Updated by okurz 5 months ago

  • Description updated (diff)
  • Status changed from In Progress to New
  • Assignee deleted (okurz)
  • Priority changed from Immediate to Urgent

I suspect that the problem is only on openqaworker4. After logging in to openqaworker4 with ssh I could see that even tab-completion on ls does not work as there was a report about a "Read-only filesystem". systemctl --failed showed that transactional-update failed. I could also not change any files in /etc. I rebooted and after reboot I could change /etc/openqa/workers.ini to use classes like qemu_x86_64_disabled_poo103581 to not collect any production jobs. Then I restarted the openQA services with systemctl restart openqa-worker-auto-restart@{1..20}. Soon after the system turned read-only again. dmesg shows:

[  676.505576] WARNING: CPU: 20 PID: 3832 at ../fs/btrfs/qgroup.c:2510 btrfs_qgroup_account_extents+0x1a9/0x260 [btrfs]
[  676.505577] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet iscsi_ibft iscsi_boot_sysfs rfkill dmi_sysfs msr ext4 crc16 mbcache jbd2 intel_rapl_msr iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel raid0 aesni_intel crypto_simd cryptd ixgbe glue_helper intel_wmi_thunderbolt(N) pcspkr xfrm_algo i2c_i801 igb libphy ses mdio mei_me enclosure mei dca lpc_ich ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_pad button fuse configfs btrfs libcrc32c xor raid6_pq raid1 md_mod sd_mod ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm_ttm_helper ttm drm ehci_pci ehci_hcd mpt3sas ahci libahci usbcore nvme libata nvme_core crc32c_intel t10_pi raid_class scsi_transport_sas wmi overlay sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[  676.505606] Supported: No, Unsupported modules are loaded
[  676.505609] CPU: 20 PID: 3832 Comm: snapperd Tainted: G                  N 5.3.18-59.34-default #1 SLE15-SP3
[  676.505610] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A22 12/09/2015
[  676.505627] RIP: 0010:btrfs_qgroup_account_extents+0x1a9/0x260 [btrfs]
[  676.505628] Code: 7e 08 49 83 c6 18 4c 89 fa 48 89 ee e8 e0 d0 d2 d3 49 8b 0e 48 85 c9 75 e5 e9 c2 fe ff ff 48 c7 c7 40 2f 71 c0 e8 e5 87 23 d3 <0f> 0b 49 8b 57 18 4d 8d 47 38 45 31 c9 31 c9 31 ff 48 89 ee e8 1e
[  676.505629] RSP: 0018:ffffb5588fc7fce0 EFLAGS: 00010282
[  676.505630] RAX: 0000000000000024 RBX: ffff973ef96f2f50 RCX: 0000000000000000
[  676.505631] RDX: 0000000000000000 RSI: ffff973effb19558 RDI: ffff973effb19558
[  676.505631] RBP: ffff973f816a6000 R08: 0000000000000586 R09: 0000000000000001
[  676.505632] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[  676.505632] R13: 00000000000100eb R14: 0000000000000000 R15: ffff973e06292780
[  676.505633] FS:  00007fb3c5567700(0000) GS:ffff973effb00000(0000) knlGS:0000000000000000
[  676.505634] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  676.505634] CR2: 00007fb3c000c000 CR3: 0000003ff29ba004 CR4: 00000000001706e0
[  676.505635] Call Trace:
[  676.505652]  btrfs_commit_transaction+0x52e/0xa60 [btrfs]
[  676.505665]  ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
[  676.505681]  btrfs_ioctl+0x2401/0x3030 [btrfs]
[  676.505686]  ? hrtimer_start_range_ns+0x1d2/0x2a0
[  676.505687]  ? hrtimer_try_to_cancel+0x25/0x100
[  676.505690]  ? do_nanosleep+0xaf/0x180
[  676.505692]  ? hrtimer_nanosleep+0x9a/0x140
[  676.505693]  ? hrtimer_init_sleeper+0x90/0x90
[  676.505696]  ? ksys_ioctl+0x92/0xb0
[  676.505697]  ksys_ioctl+0x92/0xb0
[  676.505699]  __x64_sys_ioctl+0x16/0x20
[  676.505703]  do_syscall_64+0x5b/0x1e0
[  676.505704]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  676.505705] RIP: 0033:0x7fb3c85bb807
[  676.505707] Code: b3 66 90 48 8b 05 89 76 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 76 2d 00 f7 d8 64 89 01 48
[  676.505708] RSP: 002b:00007fb3c55667a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[  676.505709] RAX: ffffffffffffffda RBX: 0000000000003d95 RCX: 00007fb3c85bb807
[  676.505709] RDX: 00007fb3c55657a0 RSI: 0000000000009408 RDI: 0000000000000006
[  676.505710] RBP: 00007fb3c0007a38 R08: 0000000000000000 R09: 0000000000000001
[  676.505710] R10: 00007fb3c5566790 R11: 0000000000000202 R12: 00007fb3c0007a38
[  676.505711] R13: 00007fb3c0001900 R14: 00007fb3c55667d0 R15: 0000555f7ca64cd0
[  676.505712] ---[ end trace c78b2596a9f87d19 ]---
[  676.533444] BTRFS error (device md1): parent transid verify failed on 257163264 wanted 2282038 found 2309582
[  676.542223] BTRFS error (device md1): parent transid verify failed on 257163264 wanted 2282038 found 2309582
[  676.550970] BTRFS warning (device md1): Skipping commit of aborted transaction.
[  676.550974] BTRFS: error (device md1) in cleanup_transaction:1818: errno=-5 IO failure
[  676.550975] BTRFS info (device md1): forced readonly
[  676.551271] BTRFS info (device md1): delayed_refs has NO entry

I suspected it might be a recently introduced kernel regression so I checked grep kernel-default /var/log/zypp/history but have found as last entries:

2021-11-10 00:45:44|install|kernel-default-extra|5.3.18-59.10.1|x86_64||openSUSE-Leap-$releasever-1|528ab2e35a4a01e3ddbc10b881712564e25ac3d85e547fc96ab3cd8714d679d2|
2021-11-10 00:46:25|install|kernel-default-optional|5.3.18-59.10.1|x86_64||openSUSE-Leap-$releasever-1|4c48b0ef624c78302c645a76c5b95ccd3dc4da50f575cb009f93f7f0e2d966f4|
2021-11-17 00:47:54|install|kernel-default|5.3.18-59.34.1|x86_64||repo-sle-update|4fd31f17a2115c59de1518725df9a641ee6d8692d29c4044d0e8d1e01c821dbf|
2021-11-17 00:48:05|install|kernel-default-extra|5.3.18-59.34.1|x86_64||repo-sle-update|04445f3288b3a84a1ce28579bae260a3d1e38bc8f12bc4a00d4c963a20826ef9|
2021-11-17 00:48:16|install|kernel-default-optional|5.3.18-59.34.1|x86_64||repo-sle-update|a8d4e6f0f939b737f3d4d3739a68152e5e2eb45dfa58a32a4aea81f66253cdcf|
# 2021-11-17 00:48:37 Output of kernel-default-extra-5.3.18-59.34.1.x86_64.rpm %posttrans script:

so no kernel update in past days at least. Now "snapper-cleanup" failed, likely also due to the broken filesystem. However it has a non-helpful error message so I reported https://github.com/openSUSE/snapper/issues/686

#4 Updated by okurz 5 months ago

  • Assignee set to favogt

you wanted to conduct another btrfs repair on ow4. Try that and then you can decide how we should continue and who should continue.

#5 Updated by favogt 5 months ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

I performed another btrfs repair from the initrd, but also disabled quotas this time.

sh-4.4# /tmp/btrfs-image -w /dev/md1 /mnt/btrfs-image-3
sh-4.4# btrfs check --repair /dev/md1
enabling repair mode
Opening filesystem to check...
Checking filesystem on /dev/md1
UUID: 9058d83e-9f1f-49be-a4c0-8f975248ad39
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
No device size related problem found
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
ERROR: out of memory
ERROR: Loading qgroups from disk: -2
ERROR: failed to check quota groups
found 22593638400 bytes used, error(s) found
total csum bytes: 14809372
total tree bytes: 234930176
total fs tree bytes: 200769536
total extent tree bytes: 16105472
btree space waste bytes: 45485438
file data blocks allocated: 1505504395264
 referenced 28253839360
extent buffer leak: start 266600448 len 16384
extent buffer leak: start 38453248 len 16384
sh-4.4# btrfs quota disable /dev/md1
ERROR: not a btrfs filesystem: /dev/md1
sh-4.4# mount /dev/md1 /sysroot
[  374.943722] BTRFS info (device md1): disk space caching is enabled
[  374.949933] BTRFS info (device md1): has skinny extents
[  375.043471] BTRFS error (device md1): qgroup generation mismatch, marked as inconsistent
[  375.083996] BTRFS info (device md1): checking UUID tree
sh-4.4# btrfs quota disable /dev/md1
ERROR: not a btrfs filesystem: /dev/md1
sh-4.4# btrfs quota disable /sysroot
sh-4.4# btrfs scrub start /sysroot
scrub started on[  434.676526] BTRFS info (device md1): scrub: started on devid 1
 /sysroot, fsid 9058d83e-9f1f-49be-a4c0-8f975248ad39 (pid=1078)
sh-4.4# btrfs scrub status /sysroot
scrub status for 9058d83e-9f1f-49be-a4c0-8f975248ad39
        scrub started at Tue Dec  7 08:12:57 2021, running for 00:00:05
        total bytes scrubbed: 340.75MiB with 0 errors
sh-4.4# df -h
sh: df: command not found
sh-4.4# btrfs fi df /sysroot
Data, single: total=25.01GiB, used=20.82GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=223.78MiB
GlobalReserve, single: total=32.25MiB, used=0.00B
sh-4.4# btrfs scrub status /sysroot
scrub status for 9058d83e-9f1f-49be-a4c0-8f975248ad39
        scrub started at Tue Dec  7 08:12:57 2021, running for 00:01:25
        total bytes scrubbed: 11.05GiB with 0 errors
sh-4.4# [  600.404059] BTRFS info (device md1): scrub: finished on devid 1 with status: 0
sh-4.4# btrfs scrub status /sysroot
scrub status for 9058d83e-9f1f-49be-a4c0-8f975248ad39
        scrub started at Tue Dec  7 08:12:57 2021 and finished after 00:02:45
        total bytes scrubbed: 21.26GiB with 0 errors

While at it, I also remade the ext2 FS on /var/lib/openqa as ext4, as it's not recreated on each boot anymore for some reason.
I had to do the usual dance of temporarily making the snapshot read-write and edit /etc/fstab inside to work around https://bugzilla.opensuse.org/show_bug.cgi?id=1183856.

#6 Updated by cdywan 5 months ago

Also available in: Atom PDF