action #103581
closedMany jobs on openqa.opensuse.org incomplete in ' timeout: setup exceeded MAX_SETUP_TIME'
Description
Observation¶
There are many jobs on openqa.opensuse.org incomplete in 'timeout: setup exceeded MAX_SETUP_TIME'
The log show:
[2021-12-07T04:47:55.572079+01:00] [info] +++ setup notes +++
[2021-12-07T04:47:55.572463+01:00] [info] Running on openqaworker4:2 (Linux 5.3.18-59.34-default #1 SMP Thu Nov 11 12:18:45 UTC 2021 (a2a53aa) x86_64)
[2021-12-07T04:47:55.577783+01:00] [debug] Found ASSET_1, caching Tumbleweed.x86_64-1.0-libvirt-Snapshot20211205.vagrant.libvirt.box
[2021-12-07T04:47:55.580926+01:00] [info] Downloading Tumbleweed.x86_64-1.0-libvirt-Snapshot20211205.vagrant.libvirt.box, request #495088 sent to Cache Service
[2021-12-07T05:47:55.606967+01:00] [info] +++ worker notes +++
[2021-12-07T05:47:55.607179+01:00] [info] End time: 2021-12-07 04:47:55
[2021-12-07T05:47:55.607254+01:00] [info] Result: timeout
[2021-12-07T05:47:55.619503+01:00] [info] Uploading autoinst-log.txt
Here is an example: https://openqa.opensuse.org/tests/2073891#
Suggestion¶
- If the problem is present on multiple machines rollback the deployment on o3.
- Investigate if recent packages changes, e.g. in os-autoinst or https://build.opensuse.org/package/show/devel:openQA:Leap:15.3/perl-Mojo-IOLoop-ReadWriteProcess can have caused problems
- Investigate the logs on openqaworker4 further for why there were reports about I/O problems
- Ensure consistent stability on our o3 workers
Rollback steps¶
- Enable default worker classes, e.g. "qemu_x86_64" and "qemu_i586" again on openqaworker4 in /etc/openqa/workers.ini
Updated by okurz about 3 years ago
- Priority changed from Urgent to Immediate
- Target version set to Ready
Updated by okurz about 3 years ago
- Status changed from New to In Progress
- Assignee set to okurz
I am trying to disable the machine for now and retrigger according failures.
Updated by okurz about 3 years ago
- Description updated (diff)
- Status changed from In Progress to New
- Assignee deleted (
okurz) - Priority changed from Immediate to Urgent
I suspect that the problem is only on openqaworker4. After logging in to openqaworker4 with ssh I could see that even tab-completion on ls
does not work as there was a report about a "Read-only filesystem". systemctl --failed
showed that transactional-update failed. I could also not change any files in /etc. I rebooted and after reboot I could change /etc/openqa/workers.ini to use classes like qemu_x86_64_disabled_poo103581
to not collect any production jobs. Then I restarted the openQA services with systemctl restart openqa-worker-auto-restart@{1..20}
. Soon after the system turned read-only again. dmesg shows:
[ 676.505576] WARNING: CPU: 20 PID: 3832 at ../fs/btrfs/qgroup.c:2510 btrfs_qgroup_account_extents+0x1a9/0x260 [btrfs]
[ 676.505577] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet iscsi_ibft iscsi_boot_sysfs rfkill dmi_sysfs msr ext4 crc16 mbcache jbd2 intel_rapl_msr iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel raid0 aesni_intel crypto_simd cryptd ixgbe glue_helper intel_wmi_thunderbolt(N) pcspkr xfrm_algo i2c_i801 igb libphy ses mdio mei_me enclosure mei dca lpc_ich ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_pad button fuse configfs btrfs libcrc32c xor raid6_pq raid1 md_mod sd_mod ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm_ttm_helper ttm drm ehci_pci ehci_hcd mpt3sas ahci libahci usbcore nvme libata nvme_core crc32c_intel t10_pi raid_class scsi_transport_sas wmi overlay sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[ 676.505606] Supported: No, Unsupported modules are loaded
[ 676.505609] CPU: 20 PID: 3832 Comm: snapperd Tainted: G N 5.3.18-59.34-default #1 SLE15-SP3
[ 676.505610] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A22 12/09/2015
[ 676.505627] RIP: 0010:btrfs_qgroup_account_extents+0x1a9/0x260 [btrfs]
[ 676.505628] Code: 7e 08 49 83 c6 18 4c 89 fa 48 89 ee e8 e0 d0 d2 d3 49 8b 0e 48 85 c9 75 e5 e9 c2 fe ff ff 48 c7 c7 40 2f 71 c0 e8 e5 87 23 d3 <0f> 0b 49 8b 57 18 4d 8d 47 38 45 31 c9 31 c9 31 ff 48 89 ee e8 1e
[ 676.505629] RSP: 0018:ffffb5588fc7fce0 EFLAGS: 00010282
[ 676.505630] RAX: 0000000000000024 RBX: ffff973ef96f2f50 RCX: 0000000000000000
[ 676.505631] RDX: 0000000000000000 RSI: ffff973effb19558 RDI: ffff973effb19558
[ 676.505631] RBP: ffff973f816a6000 R08: 0000000000000586 R09: 0000000000000001
[ 676.505632] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[ 676.505632] R13: 00000000000100eb R14: 0000000000000000 R15: ffff973e06292780
[ 676.505633] FS: 00007fb3c5567700(0000) GS:ffff973effb00000(0000) knlGS:0000000000000000
[ 676.505634] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 676.505634] CR2: 00007fb3c000c000 CR3: 0000003ff29ba004 CR4: 00000000001706e0
[ 676.505635] Call Trace:
[ 676.505652] btrfs_commit_transaction+0x52e/0xa60 [btrfs]
[ 676.505665] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
[ 676.505681] btrfs_ioctl+0x2401/0x3030 [btrfs]
[ 676.505686] ? hrtimer_start_range_ns+0x1d2/0x2a0
[ 676.505687] ? hrtimer_try_to_cancel+0x25/0x100
[ 676.505690] ? do_nanosleep+0xaf/0x180
[ 676.505692] ? hrtimer_nanosleep+0x9a/0x140
[ 676.505693] ? hrtimer_init_sleeper+0x90/0x90
[ 676.505696] ? ksys_ioctl+0x92/0xb0
[ 676.505697] ksys_ioctl+0x92/0xb0
[ 676.505699] __x64_sys_ioctl+0x16/0x20
[ 676.505703] do_syscall_64+0x5b/0x1e0
[ 676.505704] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 676.505705] RIP: 0033:0x7fb3c85bb807
[ 676.505707] Code: b3 66 90 48 8b 05 89 76 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 76 2d 00 f7 d8 64 89 01 48
[ 676.505708] RSP: 002b:00007fb3c55667a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 676.505709] RAX: ffffffffffffffda RBX: 0000000000003d95 RCX: 00007fb3c85bb807
[ 676.505709] RDX: 00007fb3c55657a0 RSI: 0000000000009408 RDI: 0000000000000006
[ 676.505710] RBP: 00007fb3c0007a38 R08: 0000000000000000 R09: 0000000000000001
[ 676.505710] R10: 00007fb3c5566790 R11: 0000000000000202 R12: 00007fb3c0007a38
[ 676.505711] R13: 00007fb3c0001900 R14: 00007fb3c55667d0 R15: 0000555f7ca64cd0
[ 676.505712] ---[ end trace c78b2596a9f87d19 ]---
[ 676.533444] BTRFS error (device md1): parent transid verify failed on 257163264 wanted 2282038 found 2309582
[ 676.542223] BTRFS error (device md1): parent transid verify failed on 257163264 wanted 2282038 found 2309582
[ 676.550970] BTRFS warning (device md1): Skipping commit of aborted transaction.
[ 676.550974] BTRFS: error (device md1) in cleanup_transaction:1818: errno=-5 IO failure
[ 676.550975] BTRFS info (device md1): forced readonly
[ 676.551271] BTRFS info (device md1): delayed_refs has NO entry
I suspected it might be a recently introduced kernel regression so I checked grep kernel-default /var/log/zypp/history
but have found as last entries:
2021-11-10 00:45:44|install|kernel-default-extra|5.3.18-59.10.1|x86_64||openSUSE-Leap-$releasever-1|528ab2e35a4a01e3ddbc10b881712564e25ac3d85e547fc96ab3cd8714d679d2|
2021-11-10 00:46:25|install|kernel-default-optional|5.3.18-59.10.1|x86_64||openSUSE-Leap-$releasever-1|4c48b0ef624c78302c645a76c5b95ccd3dc4da50f575cb009f93f7f0e2d966f4|
2021-11-17 00:47:54|install|kernel-default|5.3.18-59.34.1|x86_64||repo-sle-update|4fd31f17a2115c59de1518725df9a641ee6d8692d29c4044d0e8d1e01c821dbf|
2021-11-17 00:48:05|install|kernel-default-extra|5.3.18-59.34.1|x86_64||repo-sle-update|04445f3288b3a84a1ce28579bae260a3d1e38bc8f12bc4a00d4c963a20826ef9|
2021-11-17 00:48:16|install|kernel-default-optional|5.3.18-59.34.1|x86_64||repo-sle-update|a8d4e6f0f939b737f3d4d3739a68152e5e2eb45dfa58a32a4aea81f66253cdcf|
# 2021-11-17 00:48:37 Output of kernel-default-extra-5.3.18-59.34.1.x86_64.rpm %posttrans script:
so no kernel update in past days at least. Now "snapper-cleanup" failed, likely also due to the broken filesystem. However it has a non-helpful error message so I reported https://github.com/openSUSE/snapper/issues/686
Updated by okurz about 3 years ago
- Assignee set to favogt
you wanted to conduct another btrfs repair on ow4. Try that and then you can decide how we should continue and who should continue.
Updated by favogt about 3 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
I performed another btrfs repair from the initrd, but also disabled quotas this time.
sh-4.4# /tmp/btrfs-image -w /dev/md1 /mnt/btrfs-image-3
sh-4.4# btrfs check --repair /dev/md1
enabling repair mode
Opening filesystem to check...
Checking filesystem on /dev/md1
UUID: 9058d83e-9f1f-49be-a4c0-8f975248ad39
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
No device size related problem found
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
ERROR: out of memory
ERROR: Loading qgroups from disk: -2
ERROR: failed to check quota groups
found 22593638400 bytes used, error(s) found
total csum bytes: 14809372
total tree bytes: 234930176
total fs tree bytes: 200769536
total extent tree bytes: 16105472
btree space waste bytes: 45485438
file data blocks allocated: 1505504395264
referenced 28253839360
extent buffer leak: start 266600448 len 16384
extent buffer leak: start 38453248 len 16384
sh-4.4# btrfs quota disable /dev/md1
ERROR: not a btrfs filesystem: /dev/md1
sh-4.4# mount /dev/md1 /sysroot
[ 374.943722] BTRFS info (device md1): disk space caching is enabled
[ 374.949933] BTRFS info (device md1): has skinny extents
[ 375.043471] BTRFS error (device md1): qgroup generation mismatch, marked as inconsistent
[ 375.083996] BTRFS info (device md1): checking UUID tree
sh-4.4# btrfs quota disable /dev/md1
ERROR: not a btrfs filesystem: /dev/md1
sh-4.4# btrfs quota disable /sysroot
sh-4.4# btrfs scrub start /sysroot
scrub started on[ 434.676526] BTRFS info (device md1): scrub: started on devid 1
/sysroot, fsid 9058d83e-9f1f-49be-a4c0-8f975248ad39 (pid=1078)
sh-4.4# btrfs scrub status /sysroot
scrub status for 9058d83e-9f1f-49be-a4c0-8f975248ad39
scrub started at Tue Dec 7 08:12:57 2021, running for 00:00:05
total bytes scrubbed: 340.75MiB with 0 errors
sh-4.4# df -h
sh: df: command not found
sh-4.4# btrfs fi df /sysroot
Data, single: total=25.01GiB, used=20.82GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=223.78MiB
GlobalReserve, single: total=32.25MiB, used=0.00B
sh-4.4# btrfs scrub status /sysroot
scrub status for 9058d83e-9f1f-49be-a4c0-8f975248ad39
scrub started at Tue Dec 7 08:12:57 2021, running for 00:01:25
total bytes scrubbed: 11.05GiB with 0 errors
sh-4.4# [ 600.404059] BTRFS info (device md1): scrub: finished on devid 1 with status: 0
sh-4.4# btrfs scrub status /sysroot
scrub status for 9058d83e-9f1f-49be-a4c0-8f975248ad39
scrub started at Tue Dec 7 08:12:57 2021 and finished after 00:02:45
total bytes scrubbed: 21.26GiB with 0 errors
While at it, I also remade the ext2
FS on /var/lib/openqa
as ext4
, as it's not recreated on each boot anymore for some reason.
I had to do the usual dance of temporarily making the snapshot read-write and edit /etc/fstab
inside to work around https://bugzilla.opensuse.org/show_bug.cgi?id=1183856.
Updated by livdywan about 3 years ago
- Related to action #103524: OW1: performance loss size:M added