action #111758

o3 jobs exceeding MAX_SETUP_TIME auto_review:"(?s)openqaworker4.*timeout: setup exceeded MAX_SETUP_TIME":retry size:M

Added by okurz 2 months ago. Updated 2 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2022-05-30
Due date:
% Done: 100%
Estimated time:

Description

Observation

At least one job appears to have run into MAX_SETUP_TIME, see https://openqa.opensuse.org/tests/2393325 which shows

[2022-05-30T08:24:05.869679+02:00] [debug] Found HDD_1, caching opensuse-Tumbleweed-x86_64-20220527-Tumbleweed@64bit.qcow2
[2022-05-30T08:24:05.873320+02:00] [info] Downloading opensuse-Tumbleweed-x86_64-20220527-Tumbleweed@64bit.qcow2, request #252294 sent to Cache Service
[2022-05-30T09:24:05.896856+02:00] [info] +++ worker notes +++
[2022-05-30T09:24:05.897085+02:00] [info] End time: 2022-05-30 07:24:05
[2022-05-30T09:24:05.897159+02:00] [info] Result: timeout

with reason timeout: setup exceeded MAX_SETUP_TIME
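The two timestamps in the excerpt (download request sent, job ended) are one hour apart. That would be consistent with a setup limit of 3600 seconds, although the ticket does not state the configured value. A minimal sketch of the arithmetic, assuming GNU date is available; the helper name elapsed_seconds is made up for illustration and the fractional seconds from the log are dropped:

```shell
# Whole seconds between two timestamps (requires GNU date for -d).
elapsed_seconds() {
    start=$(date -d "$1" +%s) || return 1
    end=$(date -d "$2" +%s) || return 1
    echo $((end - start))
}

# Timestamps taken from the log excerpt above, fractional seconds dropped.
elapsed_seconds "2022-05-30 08:24:05+02:00" "2022-05-30 09:24:05+02:00"
# prints 3600
```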

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label , for example by calling
openqa-query-for-job-label poo#111758

Acceptance criteria

AC1: openqaworker4 is able to work on 20 jobs in parallel

Suggestions

  • Reinstall the operating system on openqaworker4 via ipmi
  • Order a replacement machine e.g. file SD ticket
  • Ensure the system is able to work on o3 production jobs
  • Monitor for stability

Related issues

Related to openQA Project - action #105379: Continuous deployment of o3 workers - one worker first size:M (Resolved, 2022-01-24)

History

#1 Updated by okurz 2 months ago

  • Priority changed from Urgent to Immediate

DimStar confirmed seeing similar issues elsewhere, see https://suse.slack.com/archives/C02CANHLANP/p1653897581429969 , e.g. https://openqa.opensuse.org/tests/2393391 , same reason, same worker

#2 Updated by okurz 2 months ago

  • Subject changed from o3 jobs exceeding MAX_SETUP_TIME to o3 jobs exceeding MAX_SETUP_TIME auto_review:"(?s)openqaworker4.*timeout: setup exceeded MAX_SETUP_TIME":retry
  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz

#3 Updated by okurz 2 months ago

dmesg says

[1023946.290932] WARNING: CPU: 7 PID: 8230 at ../fs/btrfs/extent-tree.c:3004 __btrfs_free_extent.isra.47+0x201/0xbc0 [btrfs]
[1023946.290933] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet iscsi_ibft iscsi_boot_sysfs rfkill dmi_sysfs msr ext4 crc16 mbcache jbd2 intel_rapl_msr iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel raid0 crypto_simd cryptd ixgbe glue_helper intel_wmi_thunderbolt(N) pcspkr xfrm_algo ses enclosure igb libphy mei_me mei mdio lpc_ich dca i2c_i801 ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_pad button fuse configfs btrfs libcrc32c xor raid6_pq raid1 md_mod ast i2c_algo_bit drm_vram_helper drm_kms_helper sd_mod syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm_ttm_helper ttm ehci_pci drm ehci_hcd mpt3sas ahci usbcore libahci nvme libata crc32c_intel nvme_core t10_pi raid_class scsi_transport_sas wmi overlay sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[1023946.290967] Supported: No, Unsupported modules are loaded
[1023946.290969] CPU: 7 PID: 8230 Comm: btrfs Tainted: G                  N 5.3.18-150300.59.68-default #1 SLE15-SP3
[1023946.290970] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A22 12/09/2015
[1023946.290982] RIP: 0010:__btrfs_free_extent.isra.47+0x201/0xbc0 [btrfs]
[1023946.290983] Code: 89 e6 e8 a2 cc ff ff 41 89 c5 58 45 85 ed 0f 84 90 00 00 00 41 83 fd fe 0f 85 11 ff ff ff 48 c7 c7 30 9d 97 c0 e8 6d 80 25 cd <0f> 0b 49 8b 3c 24 e8 b4 69 00 00 ff b4 24 b8 00 00 00 48 8b 7c 24
[1023946.290984] RSP: 0018:ffffad9468987898 EFLAGS: 00010282
[1023946.290985] RAX: 0000000000000024 RBX: 0000000000000000 RCX: 0000000000000000
[1023946.290986] RDX: 0000000000000000 RSI: ffff88b43f9d9558 RDI: ffff88b43f9d9558
[1023946.290986] RBP: 000000000c578000 R08: 0000000000003369 R09: 0000000000000001
[1023946.290987] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88b3aff9bbd0
[1023946.290987] R13: 00000000fffffffe R14: 0000000000000000 R15: 0000000000003d94
[1023946.290988] FS:  00007f165e5568c0(0000) GS:ffff88b43f9c0000(0000) knlGS:0000000000000000
[1023946.290989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1023946.290989] CR2: 0000562d5e561a30 CR3: 000000358f130005 CR4: 00000000001726e0
[1023946.290990] Call Trace:
[1023946.291013]  ? btrfs_merge_delayed_refs+0x307/0x3e0 [btrfs]
[1023946.291024]  __btrfs_run_delayed_refs+0x797/0x1180 [btrfs]
[1023946.291039]  ? btrfs_set_path_blocking+0x45/0x50 [btrfs]
[1023946.291050]  btrfs_run_delayed_refs+0x62/0x1f0 [btrfs]
[1023946.291066]  btrfs_commit_transaction+0x50/0xa60 [btrfs]
[1023946.291087]  prepare_to_merge+0x246/0x260 [btrfs]
[1023946.291105]  relocate_block_group+0x20d/0x790 [btrfs]
[1023946.291124]  btrfs_relocate_block_group+0x16f/0x2e0 [btrfs]
[1023946.291141]  btrfs_relocate_chunk+0x31/0xb0 [btrfs]
[1023946.291159]  btrfs_balance+0xa18/0x11f0 [btrfs]
[1023946.291178]  btrfs_ioctl_balance+0x2f2/0x3a0 [btrfs]
[1023946.291196]  btrfs_ioctl+0x16d4/0x3030 [btrfs]
[1023946.291201]  ? scnprintf+0x49/0x80
[1023946.291205]  ? __kmalloc_node+0x1ef/0x340
[1023946.291207]  ? __fput+0x150/0x270
[1023946.291209]  ? ksys_ioctl+0x92/0xb0
[1023946.291210]  ksys_ioctl+0x92/0xb0
[1023946.291212]  __x64_sys_ioctl+0x16/0x20
[1023946.291215]  do_syscall_64+0x5b/0x1e0
[1023946.291218]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

filesystem turned read-only
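One way to confirm the read-only state without relying on dmesg is to look at the mount options in /proc/mounts. The following is only a sketch; the function name is_mounted_ro is hypothetical, and it treats the last entry for a mount point as the effective one:

```shell
# Succeeds if mount point $1 carries the "ro" option in the mounts table $2
# (defaults to /proc/mounts). The last matching entry wins.
is_mounted_ro() {
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opts, ",")
            ro = 0
            for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
        }
        END { exit(ro ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}

if is_mounted_ro /; then
    echo "root filesystem is mounted read-only"
fi
```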

#4 Updated by okurz 2 months ago

  • Status changed from In Progress to New
  • Assignee deleted (okurz)
  • Priority changed from Immediate to Urgent

The machine is reachable over IPMI but has no network because its root filesystem is read-only and needs to be recovered. Since the machine is currently not on the network, it has no further impact on the o3 jobs.

#5 Updated by nicksinger 2 months ago

  • Assignee set to nicksinger

#6 Updated by nicksinger 2 months ago

After a reboot btrfs prints:

[   50.423028] BTRFS error (device md1): unable to find ref byte nr 207060992 parent 0 root 15764  owner 0 offset 0
[   50.433896] BTRFS: error (device md1) in __btrfs_free_extent:3010: errno=-2 No such entry
[   50.442254] BTRFS: error (device md1) in btrfs_run_delayed_refs:2129: errno=-2 No such entry

Trying to boot a recovery medium now to scrub the volume and try to repair it. A first quick look at the SMART data doesn't show anything unexpected for the 2 disks in question:

openqaworker4:~ # smartctl -a /dev/sda
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Travelstar 7K1000
Device Model:     HGST HTE721010A9E630
Serial Number:    JR10034M0HBEHK
LU WWN Device Id: 5 000cca 8a8c6fc7e
Firmware Version: JB0OA3M0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May 30 12:12:03 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (   45) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 166) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   141   141   033    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       70
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   001   001   000    Old_age   Always       -       53848
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       88
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 20/33)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     53847         -
# 2  Short offline       Completed without error       00%     53839         -
# 3  Short offline       Completed without error       00%     53815         -
# 4  Short offline       Completed without error       00%     53791         -
# 5  Short offline       Completed without error       00%     53767         -
# 6  Short offline       Completed without error       00%     53743         -
# 7  Short offline       Completed without error       00%     53719         -
# 8  Short offline       Completed without error       00%     53695         -
# 9  Short offline       Completed without error       00%     53671         -
#10  Short offline       Completed without error       00%     53647         -
#11  Short offline       Completed without error       00%     53623         -
#12  Short offline       Completed without error       00%     53599         -
#13  Short offline       Completed without error       00%     53575         -
#14  Short offline       Completed without error       00%     53551         -
#15  Short offline       Completed without error       00%     53527         -
#16  Short offline       Completed without error       00%     53503         -
#17  Short offline       Completed without error       00%     53479         -
#18  Short offline       Completed without error       00%     53455         -
#19  Short offline       Completed without error       00%     53431         -
#20  Short offline       Completed without error       00%     53407         -
#21  Short offline       Completed without error       00%     53383         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

openqaworker4:~ # smartctl -a /dev/sdb
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Travelstar 7K1000
Device Model:     HGST HTE721010A9E630
Serial Number:    JR10034M0MUNRK
LU WWN Device Id: 5 000cca 8a8c90368
Firmware Version: JB0OA3M0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May 30 12:12:07 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (   45) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 174) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   133   133   033    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       69
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   001   001   000    Old_age   Always       -       53848
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       44
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       180
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 20/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     53848         -
# 2  Short offline       Completed without error       00%     53839         -
# 3  Short offline       Completed without error       00%     53815         -
# 4  Short offline       Completed without error       00%     53791         -
# 5  Short offline       Completed without error       00%     53767         -
# 6  Short offline       Completed without error       00%     53743         -
# 7  Short offline       Completed without error       00%     53719         -
# 8  Short offline       Completed without error       00%     53695         -
# 9  Short offline       Completed without error       00%     53671         -
#10  Short offline       Completed without error       00%     53647         -
#11  Short offline       Completed without error       00%     53623         -
#12  Short offline       Completed without error       00%     53599         -
#13  Short offline       Completed without error       00%     53575         -
#14  Short offline       Completed without error       00%     53551         -
#15  Short offline       Completed without error       00%     53527         -
#16  Short offline       Completed without error       00%     53503         -
#17  Short offline       Completed without error       00%     53479         -
#18  Short offline       Completed without error       00%     53455         -
#19  Short offline       Completed without error       00%     53431         -
#20  Short offline       Completed without error       00%     53407         -
#21  Short offline       Completed without error       00%     53383         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
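For a quick comparison of both disks it can help to reduce the attribute table to the counters that usually point at failing media. A small sketch, assuming the column layout shown above (raw value in the tenth column); the function name smart_key_attrs and the choice of attributes are illustrative, not taken from the ticket:

```shell
# Reads smartctl attribute rows on stdin and prints NAME=RAW_VALUE for
# three attributes that commonly indicate bad sectors.
smart_key_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ { print $2 "=" $10 }'
}

# Intended use on the worker (not run here):
#   smartctl -A /dev/sda | smart_key_attrs
#   smartctl -A /dev/sdb | smart_key_attrs
```

With the output above, all three values are 0 for both /dev/sda and /dev/sdb.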

#7 Updated by cdywan 2 months ago

  • Subject changed from o3 jobs exceeding MAX_SETUP_TIME auto_review:"(?s)openqaworker4.*timeout: setup exceeded MAX_SETUP_TIME":retry to o3 jobs exceeding MAX_SETUP_TIME auto_review:"(?s)openqaworker4.*timeout: setup exceeded MAX_SETUP_TIME":retry size:M
  • Description updated (diff)
  • Status changed from New to Workable

#8 Updated by favogt 2 months ago

fsck doesn't show any errors:

sh-4.4# btrfs check /dev/md1
Opening filesystem to check...
Checking filesystem on /dev/md1
UUID: 9058d83e-9f1f-49be-a4c0-8f975248ad39
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 29340753920 bytes used, no error found
total csum bytes: 20352152
total tree bytes: 316784640
total fs tree bytes: 272531456
total extent tree bytes: 19922944
btree space waste bytes: 61990275
file data blocks allocated: 386117111808
 referenced 40491360256

scrub still fails as expected though:

sh-4.4# btrfs scrub start -Bf /dev/md1
[  198.654683] BTRFS error (device md1): unable to find ref byte nr 207060992 parent 0 root 15764  owner 0 offset 0
[  198.665150] BTRFS: error (device md1) in __btrfs_free_extent:3010: errno=-2 No such entry
[  198.673353] BTRFS: error (device md1) in btrfs_run_delayed_refs:2129: errno=-2 No such entry
ERROR: scrubbing /dev/md1 failed for device id 1: ret=-1, errno=5 (Input/output error)
scrub canceled for 9058d83e-9f1f-49be-a4c0-8f975248ad39
        scrub started at Thu Jun  2 09:11:50 2022 and was aborted after 00:00:21
        total bytes scrubbed: 0.00B with 0 errors

Looks like this needs a reinstall. If there aren't any objections until lunch (~1h), I'll start one if no one else wants to.

#9 Updated by cdywan 2 months ago

favogt wrote:

Looks like this needs a reinstall. If there aren't any objections until lunch (~1h), I'll start one if no one else wants to.

I'd say feel free to go ahead

#10 Updated by favogt 2 months ago

  • Assignee changed from nicksinger to favogt
  • % Done changed from 0 to 80

Done! I took the liberty of installing Leap 15.4 already, which might help with the future upgrades of other workers.

For the installation I downloaded https://download.opensuse.org/distribution/leap/15.4/repo/oss/boot/x86_64/loader/{linux,initrd} and ran kexec --initrd=initrd --reuse-cmdline linux. I kept the existing partition setup, just added the mounts again and proceeded with the default transactional server config.
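The download and kexec step can be sketched as follows. The URL and the kexec invocation are taken from the comment above; the function names and the use of curl are assumptions for illustration. The {linux,initrd} notation in the URL is bash brace expansion, written out as a loop here so the sketch also works in a plain POSIX shell:

```shell
# Base URL as given in the comment above.
BASE=https://download.opensuse.org/distribution/leap/15.4/repo/oss/boot/x86_64/loader

# Print the two installer URLs (kernel and initrd), one per line.
installer_urls() {
    for f in linux initrd; do
        printf '%s/%s\n' "$BASE" "$f"
    done
}

# Download both files into the current directory and boot the installer.
# Deliberately not called here: kexec replaces the running system.
boot_installer() {
    for url in $(installer_urls); do
        curl -fLO "$url" || return 1
    done
    kexec --initrd=initrd --reuse-cmdline linux
}
```

As noted further down in the same comment, --reuse-cmdline carries over every parameter of the running kernel, so leftovers such as rd.break may end up in the installed system's bootloader configuration.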

In the installed system I added the devel:openQA{,:Leap} repos, installed openQA-worker and rebooted (after the reboot I also installed os-autoinst-distri-opensuse-deps, which I had forgotten initially). I also had to adjust /etc/default/grub to remove a leftover rd.break, which got there through --reuse-cmdline. I copied over /etc/firewalld, the systemd override files, /etc/openqa* and /etc/chrony.conf from the backup I had made in /var/lib/openqa/root-backup/ before the installation, and enabled the openQA services.

After a reboot they showed up broken because the cache service couldn't start due to a permission issue - /var/lib/openqa/cache was from the previous system with the previous uid for _openqa-worker. Instead of changing ownership everywhere I just ran usermod -u 471 _openqa-worker and adjusted the owner of /etc/openqa/client.conf again. After a service restart the first jobs started running.
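A mismatch like this can be spotted by comparing the numeric owner of the directory with the uid of the account. A minimal sketch, assuming GNU stat; the function name owner_matches_user is hypothetical:

```shell
# Succeeds if the numeric owner of path $1 equals the uid of user $2.
owner_matches_user() {
    [ "$(stat -c %u "$1")" = "$(id -u "$2")" ]
}

# Intended use on the worker (not run here). If the check fails, the second
# command prints the uid the old system used, which is the value one would
# pass to usermod -u:
#   owner_matches_user /var/lib/openqa/cache _openqa-worker \
#       || stat -c %u /var/lib/openqa/cache
```

Changing the uid of the account rather than the ownership of the files is quick, but files owned by the account's new-system uid (here /etc/openqa/client.conf) then need their owner set again, as described above.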

I just saw that staging tests fail because the special ovmf build is not there yet. I copied it over and will trigger a reboot.

#11 Updated by favogt 2 months ago

I also saw some Reason: cache failure: Cache service queue already full (5) failures, probably because the limit is only at 50GiB. I increased it to 200GiB for now.

#12 Updated by favogt 2 months ago

  • Status changed from Workable to Resolved
  • % Done changed from 80 to 100

favogt wrote:

I also saw some Reason: cache failure: Cache service queue already full (5) failures, probably because the limit is only at 50GiB. I increased it to 200GiB for now.

That wasn't immediately solved by the change, but after waiting for a while, the cache got fuller and the errors eventually disappeared.

I saw a (spurious?) weird issue with graphics in one job: https://openqa.opensuse.org/tests/2399434#step/welcome/7

Other than that it looks like ow4 is performing properly so far.

#13 Updated by okurz 2 months ago

Thank you for doing this. I also managed to start the Java-based remote control client from https://openqaworker4-ipmi.suse.de/ with javaws.itweb jviewer.jnlp from icedtea-web, which offers virtual media redirection so one can select a local ISO file as the installation medium. But your approach sounds even better to me.

I extended https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Setup-guide-for-new-machines and did the missing steps which are:

echo "requires:openQA-worker" > /etc/zypp/systemCheck.d/openqa.check
sed -i 's@/ btrfs ro@/ btrfs rw@' /etc/fstab
mount -o rw,remount /
btrfs property set -ts / ro false
zypper -n in openQA-continuous-update
systemctl enable --now openqa-continuous-update.timer

as we use continuous updates on our o3 workers.
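The fstab substitution in the steps above can be tried on a copy before touching /etc/fstab. The sketch below applies the same sed expression to a temporary file; the function name make_root_rw and the sample fstab entry (including its UUID) are made up for illustration, and GNU sed is assumed for -i:

```shell
# Same substitution as in the steps above, applied to the file given as $1.
make_root_rw() {
    sed -i 's@/ btrfs ro@/ btrfs rw@' "$1"
}

# Hypothetical fstab entry; the UUID is invented.
sample=$(mktemp)
echo 'UUID=0000-example / btrfs ro 0 0' > "$sample"
make_root_rw "$sample"
cat "$sample"
# prints: UUID=0000-example / btrfs rw 0 0
rm -f "$sample"
```

The pattern relies on single spaces between the fields, so an fstab that separates fields with tabs or multiple spaces would be left unchanged.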

#14 Updated by favogt 2 months ago

okurz wrote:

Thank you for doing this. I also managed to start the Java-based remote control client from https://openqaworker4-ipmi.suse.de/ with javaws.itweb jviewer.jnlp from icedtea-web, which offers virtual media redirection so one can select a local ISO file as the installation medium. But your approach sounds even better to me.

Over VPN that would take literal days to boot here (2 Mbit/s upload on good days)...

I extended https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Setup-guide-for-new-machines and did the missing steps which are:

echo "requires:openQA-worker" > /etc/zypp/systemCheck.d/openqa.check
sed -i 's@/ btrfs ro@/ btrfs rw@' /etc/fstab
mount -o rw,remount /
btrfs property set -ts / ro false
zypper -n in openQA-continuous-update
systemctl enable --now openqa-continuous-update.timer

as we use continuous updates on our o3 workers.

O.o You're converting the transactional server systems back into regular ones. There's no benefit left with that setup.

#15 Updated by okurz 2 months ago

favogt wrote:

O.o You're converting the transactional server systems back into regular ones. There's no benefit left with that setup.

We only update the openQA stack that way, the rest of the system is still done as a transactional update

#16 Updated by favogt 2 months ago

okurz wrote:

favogt wrote:

O.o You're converting the transactional server systems back into regular ones. There's no benefit left with that setup.

We only update the openQA stack that way, the rest of the system is still done as a transactional update

FWICT the service calls plain zypper dup, which installs all available updates. That's only run if there are openQA package updates, but the installation is not limited to those.

#17 Updated by okurz 2 months ago

  • Related to action #105379: Continuous deployment of o3 workers - one worker first size:M added

#18 Updated by okurz 2 months ago

favogt wrote:

FWICT the service calls plain zypper dup, which installs all available updates. That's only run if there are openQA package updates, but the installation is not limited to those.

Right. The approach was changed due to the comment in #105379#note-11 from mkittler:

According to DimStar -r limits the dependency resolution to just that repository so we must not use it and better just upgrade everything.

So yes, with this the nightly transactional updates should be no-ops and have no real effect, except that a reboot should still be triggered for consistency. But even that does not seem to happen according to
for i in openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do ssh root@$i uptime; done
which shows that most machines have not rebooted for the past four days. Will create a new issue.
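The output of uptime is meant for humans and awkward to compare across hosts. A possible variation of the loop above reads /proc/uptime instead and converts it to whole days. This is only a sketch: the function names are hypothetical, the host list is copied from the comment, and root ssh access to the workers is assumed.

```shell
# Convert the first field of /proc/uptime (seconds since boot, read on
# stdin) into whole days.
uptime_days() {
    awk '{ print int($1 / 86400) }'
}

# Print days since last boot per host. Not called here because it needs
# ssh access to the workers.
report_uptime_days() {
    for i in openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do
        printf '%s: %s days\n' "$i" "$(ssh "root@$i" cat /proc/uptime | uptime_days)"
    done
}
```

A host that reports one day or more has missed at least one nightly reboot, assuming one is expected every night.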
