action #103581
Many jobs on openqa.opensuse.org incomplete in ' timeout: setup exceeded MAX_SETUP_TIME'
100%
Description
Observation¶
There are many jobs on openqa.opensuse.org incomplete in 'timeout: setup exceeded MAX_SETUP_TIME'
The log show:
[2021-12-07T04:47:55.572079+01:00] [info] +++ setup notes +++ [2021-12-07T04:47:55.572463+01:00] [info] Running on openqaworker4:2 (Linux 5.3.18-59.34-default #1 SMP Thu Nov 11 12:18:45 UTC 2021 (a2a53aa) x86_64) [2021-12-07T04:47:55.577783+01:00] [debug] Found ASSET_1, caching Tumbleweed.x86_64-1.0-libvirt-Snapshot20211205.vagrant.libvirt.box [2021-12-07T04:47:55.580926+01:00] [info] Downloading Tumbleweed.x86_64-1.0-libvirt-Snapshot20211205.vagrant.libvirt.box, request #495088 sent to Cache Service [2021-12-07T05:47:55.606967+01:00] [info] +++ worker notes +++ [2021-12-07T05:47:55.607179+01:00] [info] End time: 2021-12-07 04:47:55 [2021-12-07T05:47:55.607254+01:00] [info] Result: timeout [2021-12-07T05:47:55.619503+01:00] [info] Uploading autoinst-log.txt
Here is an example: https://openqa.opensuse.org/tests/2073891#
Suggestion¶
- If the problem is present on multiple machines rollback the deployment on o3.
- Investigate if recent packages changes, e.g. in os-autoinst or https://build.opensuse.org/package/show/devel:openQA:Leap:15.3/perl-Mojo-IOLoop-ReadWriteProcess can have caused problems
- Investigate the logs on openqaworker4 further for why there were reports about I/O problems
- Ensure consistent stability on our o3 workers
Rollback steps¶
- Enable default worker classes, e.g. "qemu_x86_64" and "qemu_i586" again on openqaworker4 in /etc/openqa/workers.ini
Related issues
History
#1
Updated by okurz over 1 year ago
- Priority changed from Urgent to Immediate
- Target version set to Ready
#2
Updated by okurz over 1 year ago
- Status changed from New to In Progress
- Assignee set to okurz
I am trying to disable the machine for now and retrigger according failures.
#3
Updated by okurz over 1 year ago
- Description updated (diff)
- Status changed from In Progress to New
- Assignee deleted (
okurz) - Priority changed from Immediate to Urgent
I suspect that the problem is only on openqaworker4. After logging in to openqaworker4 with ssh I could see that even tab-completion on ls
does not work as there was a report about a "Read-only filesystem". systemctl --failed
showed that transactional-update failed. I could also not change any files in /etc. I rebooted and after reboot I could change /etc/openqa/workers.ini to use classes like qemu_x86_64_disabled_poo103581
to not collect any production jobs. Then I restarted the openQA services with systemctl restart openqa-worker-auto-restart@{1..20}
. Soon after the system turned read-only again. dmesg shows:
[ 676.505576] WARNING: CPU: 20 PID: 3832 at ../fs/btrfs/qgroup.c:2510 btrfs_qgroup_account_extents+0x1a9/0x260 [btrfs] [ 676.505577] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet iscsi_ibft iscsi_boot_sysfs rfkill dmi_sysfs msr ext4 crc16 mbcache jbd2 intel_rapl_msr iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel raid0 aesni_intel crypto_simd cryptd ixgbe glue_helper intel_wmi_thunderbolt(N) pcspkr xfrm_algo i2c_i801 igb libphy ses mdio mei_me enclosure mei dca lpc_ich ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_pad button fuse configfs btrfs libcrc32c xor raid6_pq raid1 md_mod sd_mod ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm_ttm_helper ttm drm ehci_pci ehci_hcd mpt3sas ahci libahci usbcore nvme libata nvme_core crc32c_intel t10_pi raid_class scsi_transport_sas wmi overlay sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod [ 676.505606] Supported: No, Unsupported modules are loaded [ 676.505609] CPU: 20 PID: 3832 Comm: snapperd Tainted: G N 5.3.18-59.34-default #1 SLE15-SP3 [ 676.505610] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A22 12/09/2015 [ 676.505627] RIP: 0010:btrfs_qgroup_account_extents+0x1a9/0x260 [btrfs] [ 676.505628] Code: 7e 08 49 83 c6 18 4c 89 fa 48 89 ee e8 e0 d0 d2 d3 49 8b 0e 48 85 c9 75 e5 e9 c2 fe ff ff 48 c7 c7 40 2f 71 c0 e8 e5 87 23 d3 <0f> 0b 49 8b 57 18 4d 8d 47 38 45 31 c9 31 c9 31 ff 48 89 ee e8 1e [ 676.505629] RSP: 0018:ffffb5588fc7fce0 EFLAGS: 00010282 [ 676.505630] RAX: 0000000000000024 RBX: ffff973ef96f2f50 RCX: 0000000000000000 [ 676.505631] RDX: 0000000000000000 RSI: ffff973effb19558 RDI: ffff973effb19558 [ 676.505631] RBP: ffff973f816a6000 R08: 0000000000000586 R09: 0000000000000001 [ 676.505632] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 [ 676.505632] R13: 00000000000100eb R14: 0000000000000000 R15: ffff973e06292780 [ 676.505633] FS: 00007fb3c5567700(0000) GS:ffff973effb00000(0000) knlGS:0000000000000000 [ 676.505634] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 676.505634] CR2: 00007fb3c000c000 CR3: 0000003ff29ba004 CR4: 00000000001706e0 [ 676.505635] Call Trace: [ 676.505652] btrfs_commit_transaction+0x52e/0xa60 [btrfs] [ 676.505665] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] [ 676.505681] btrfs_ioctl+0x2401/0x3030 [btrfs] [ 676.505686] ? hrtimer_start_range_ns+0x1d2/0x2a0 [ 676.505687] ? hrtimer_try_to_cancel+0x25/0x100 [ 676.505690] ? do_nanosleep+0xaf/0x180 [ 676.505692] ? hrtimer_nanosleep+0x9a/0x140 [ 676.505693] ? hrtimer_init_sleeper+0x90/0x90 [ 676.505696] ? ksys_ioctl+0x92/0xb0 [ 676.505697] ksys_ioctl+0x92/0xb0 [ 676.505699] __x64_sys_ioctl+0x16/0x20 [ 676.505703] do_syscall_64+0x5b/0x1e0 [ 676.505704] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 676.505705] RIP: 0033:0x7fb3c85bb807 [ 676.505707] Code: b3 66 90 48 8b 05 89 76 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 76 2d 00 f7 d8 64 89 01 48 [ 676.505708] RSP: 002b:00007fb3c55667a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 676.505709] RAX: ffffffffffffffda RBX: 0000000000003d95 RCX: 00007fb3c85bb807 [ 676.505709] RDX: 00007fb3c55657a0 RSI: 0000000000009408 RDI: 0000000000000006 [ 676.505710] RBP: 00007fb3c0007a38 R08: 0000000000000000 R09: 0000000000000001 [ 676.505710] R10: 00007fb3c5566790 R11: 0000000000000202 R12: 00007fb3c0007a38 [ 676.505711] R13: 00007fb3c0001900 R14: 00007fb3c55667d0 R15: 0000555f7ca64cd0 [ 676.505712] ---[ end trace c78b2596a9f87d19 ]--- [ 676.533444] BTRFS error (device md1): parent transid verify failed on 257163264 wanted 2282038 found 2309582 [ 676.542223] BTRFS error (device md1): parent transid verify failed on 257163264 wanted 2282038 found 2309582 [ 676.550970] BTRFS warning (device md1): Skipping commit of aborted transaction. [ 676.550974] BTRFS: error (device md1) in cleanup_transaction:1818: errno=-5 IO failure [ 676.550975] BTRFS info (device md1): forced readonly [ 676.551271] BTRFS info (device md1): delayed_refs has NO entry
I suspected it might be a recently introduced kernel regression so I checked grep kernel-default /var/log/zypp/history
but have found as last entries:
2021-11-10 00:45:44|install|kernel-default-extra|5.3.18-59.10.1|x86_64||openSUSE-Leap-$releasever-1|528ab2e35a4a01e3ddbc10b881712564e25ac3d85e547fc96ab3cd8714d679d2| 2021-11-10 00:46:25|install|kernel-default-optional|5.3.18-59.10.1|x86_64||openSUSE-Leap-$releasever-1|4c48b0ef624c78302c645a76c5b95ccd3dc4da50f575cb009f93f7f0e2d966f4| 2021-11-17 00:47:54|install|kernel-default|5.3.18-59.34.1|x86_64||repo-sle-update|4fd31f17a2115c59de1518725df9a641ee6d8692d29c4044d0e8d1e01c821dbf| 2021-11-17 00:48:05|install|kernel-default-extra|5.3.18-59.34.1|x86_64||repo-sle-update|04445f3288b3a84a1ce28579bae260a3d1e38bc8f12bc4a00d4c963a20826ef9| 2021-11-17 00:48:16|install|kernel-default-optional|5.3.18-59.34.1|x86_64||repo-sle-update|a8d4e6f0f939b737f3d4d3739a68152e5e2eb45dfa58a32a4aea81f66253cdcf| # 2021-11-17 00:48:37 Output of kernel-default-extra-5.3.18-59.34.1.x86_64.rpm %posttrans script:
so no kernel update in past days at least. Now "snapper-cleanup" failed, likely also due to the broken filesystem. However it has a non-helpful error message so I reported https://github.com/openSUSE/snapper/issues/686
#4
Updated by okurz over 1 year ago
- Assignee set to favogt
you wanted to conduct another btrfs repair on ow4. Try that and then you can decide how we should continue and who should continue.
#5
Updated by favogt over 1 year ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
I performed another btrfs repair from the initrd, but also disabled quotas this time.
sh-4.4# /tmp/btrfs-image -w /dev/md1 /mnt/btrfs-image-3 sh-4.4# btrfs check --repair /dev/md1 enabling repair mode Opening filesystem to check... Checking filesystem on /dev/md1 UUID: 9058d83e-9f1f-49be-a4c0-8f975248ad39 [1/7] checking root items Fixed 0 roots. [2/7] checking extents No device size related problem found [3/7] checking free space cache cache and super generation don't match, space cache will be invalidated [4/7] checking fs roots [5/7] checking only csums items (without verifying data) [6/7] checking root refs [7/7] checking quota groups ERROR: out of memory ERROR: Loading qgroups from disk: -2 ERROR: failed to check quota groups found 22593638400 bytes used, error(s) found total csum bytes: 14809372 total tree bytes: 234930176 total fs tree bytes: 200769536 total extent tree bytes: 16105472 btree space waste bytes: 45485438 file data blocks allocated: 1505504395264 referenced 28253839360 extent buffer leak: start 266600448 len 16384 extent buffer leak: start 38453248 len 16384 sh-4.4# btrfs quota disable /dev/md1 ERROR: not a btrfs filesystem: /dev/md1 sh-4.4# mount /dev/md1 /sysroot [ 374.943722] BTRFS info (device md1): disk space caching is enabled [ 374.949933] BTRFS info (device md1): has skinny extents [ 375.043471] BTRFS error (device md1): qgroup generation mismatch, marked as inconsistent [ 375.083996] BTRFS info (device md1): checking UUID tree sh-4.4# btrfs quota disable /dev/md1 ERROR: not a btrfs filesystem: /dev/md1 sh-4.4# btrfs quota disable /sysroot sh-4.4# btrfs scrub start /sysroot scrub started on[ 434.676526] BTRFS info (device md1): scrub: started on devid 1 /sysroot, fsid 9058d83e-9f1f-49be-a4c0-8f975248ad39 (pid=1078) sh-4.4# btrfs scrub status /sysroot scrub status for 9058d83e-9f1f-49be-a4c0-8f975248ad39 scrub started at Tue Dec 7 08:12:57 2021, running for 00:00:05 total bytes scrubbed: 340.75MiB with 0 errors sh-4.4# df -h sh: df: command not found sh-4.4# btrfs fi df /sysroot Data, single: total=25.01GiB, used=20.82GiB System, DUP: total=8.00MiB, used=16.00KiB Metadata, DUP: total=1.00GiB, used=223.78MiB GlobalReserve, single: total=32.25MiB, used=0.00B sh-4.4# btrfs scrub status /sysroot scrub status for 9058d83e-9f1f-49be-a4c0-8f975248ad39 scrub started at Tue Dec 7 08:12:57 2021, running for 00:01:25 total bytes scrubbed: 11.05GiB with 0 errors sh-4.4# [ 600.404059] BTRFS info (device md1): scrub: finished on devid 1 with status: 0 sh-4.4# btrfs scrub status /sysroot scrub status for 9058d83e-9f1f-49be-a4c0-8f975248ad39 scrub started at Tue Dec 7 08:12:57 2021 and finished after 00:02:45 total bytes scrubbed: 21.26GiB with 0 errors
While at it, I also remade the ext2
FS on /var/lib/openqa
as ext4
, as it's not recreated on each boot anymore for some reason.
I had to do the usual dance of temporarily making the snapshot read-write and edit /etc/fstab
inside to work around https://bugzilla.opensuse.org/show_bug.cgi?id=1183856.
#6
Updated by cdywan over 1 year ago
- Related to action #103524: OW1: performance loss size:M added