action #111758 (closed)
o3 jobs exceeding MAX_SETUP_TIME auto_review:"(?s)openqaworker4.*timeout: setup exceeded MAX_SETUP_TIME":retry size:M
Added by okurz over 2 years ago. Updated over 2 years ago.
100%
Description
Observation
At least one job ran into MAX_SETUP_TIME, see https://openqa.opensuse.org/tests/2393325 , which shows
[2022-05-30T08:24:05.869679+02:00] [debug] Found HDD_1, caching opensuse-Tumbleweed-x86_64-20220527-Tumbleweed@64bit.qcow2
[2022-05-30T08:24:05.873320+02:00] [info] Downloading opensuse-Tumbleweed-x86_64-20220527-Tumbleweed@64bit.qcow2, request #252294 sent to Cache Service
[2022-05-30T09:24:05.896856+02:00] [info] +++ worker notes +++
[2022-05-30T09:24:05.897085+02:00] [info] End time: 2022-05-30 07:24:05
[2022-05-30T09:24:05.897159+02:00] [info] Result: timeout
with reason timeout: setup exceeded MAX_SETUP_TIME
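The gap between the download request and the abort can be checked from the two log timestamps above. This is a minimal Python sketch that assumes only the timestamp format shown in the log; the helper name `elapsed_seconds` is made up for illustration.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two ISO 8601 timestamps as they appear in the worker log."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()


# Timestamps copied from the log excerpt above
requested = "2022-05-30T08:24:05.873320+02:00"  # download request sent to the Cache Service
aborted = "2022-05-30T09:24:05.896856+02:00"    # "+++ worker notes +++" line

print(elapsed_seconds(requested, aborted))
```

The result is just over 3600 seconds, so the job was aborted almost exactly one hour after the asset download was requested. That is consistent with a one-hour setup limit, but this is an inference from the log, not a confirmed setting. The "End time" line reads 07:24:05 because it is given in UTC, while the log prefix uses +02:00.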
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
openqa-query-for-job-label poo#111758
Acceptance criteria
AC1: openqaworker4 is able to work on 20 jobs in parallel
Suggestions
- Reinstall the operating system on openqaworker4 via IPMI
- Order a replacement machine, e.g. file an SD ticket
- Ensure the system is able to work on o3 production jobs
- Monitor for stability
Updated by okurz over 2 years ago
- Priority changed from Urgent to Immediate
DimStar confirmed seeing similar issues elsewhere, see https://suse.slack.com/archives/C02CANHLANP/p1653897581429969 , e.g. https://openqa.opensuse.org/tests/2393391 , same reason, same worker
Updated by okurz over 2 years ago
- Subject changed from o3 jobs exceeding MAX_SETUP_TIME to o3 jobs exceeding MAX_SETUP_TIME auto_review:"(?s)openqaworker4.*timeout: setup exceeded MAX_SETUP_TIME":retry
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to okurz
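How the auto_review expression in the new subject matches a job can be illustrated with a short Python sketch. The sample text is a shortened, hypothetical stand-in for a job's details, not real output. Matching with Python's `re` module is also an assumption made for illustration; the actual auto-review tooling may search differently.

```python
import re

# Expression copied from the ticket subject; (?s) lets "." also match newlines
PATTERN = re.compile(r"(?s)openqaworker4.*timeout: setup exceeded MAX_SETUP_TIME")

# Hypothetical, shortened job text: worker name and reason are on different lines
sample = (
    "Assigned worker: openqaworker4:7\n"
    "[info] Result: timeout\n"
    "Reason: timeout: setup exceeded MAX_SETUP_TIME\n"
)
other_worker = sample.replace("openqaworker4", "openqaworker7")

print(bool(PATTERN.search(sample)))        # worker name and reason both present
print(bool(PATTERN.search(other_worker)))  # same reason, different worker
```

Without the `(?s)` flag the expression would only match if the worker name and the reason were on the same line, which is why the flag is part of the subject.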
Updated by okurz over 2 years ago
dmesg says
[1023946.290932] WARNING: CPU: 7 PID: 8230 at ../fs/btrfs/extent-tree.c:3004 __btrfs_free_extent.isra.47+0x201/0xbc0 [btrfs]
[1023946.290933] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet iscsi_ibft iscsi_boot_sysfs rfkill dmi_sysfs msr ext4 crc16 mbcache jbd2 intel_rapl_msr iTCO_wdt intel_pmc_bxt iTCO_vendor_support intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel raid0 crypto_simd cryptd ixgbe glue_helper intel_wmi_thunderbolt(N) pcspkr xfrm_algo ses enclosure igb libphy mei_me mei mdio lpc_ich dca i2c_i801 ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_pad button fuse configfs btrfs libcrc32c xor raid6_pq raid1 md_mod ast i2c_algo_bit drm_vram_helper drm_kms_helper sd_mod syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm_ttm_helper ttm ehci_pci drm ehci_hcd mpt3sas ahci usbcore libahci nvme libata crc32c_intel nvme_core t10_pi raid_class scsi_transport_sas wmi overlay sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[1023946.290967] Supported: No, Unsupported modules are loaded
[1023946.290969] CPU: 7 PID: 8230 Comm: btrfs Tainted: G N 5.3.18-150300.59.68-default #1 SLE15-SP3
[1023946.290970] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A22 12/09/2015
[1023946.290982] RIP: 0010:__btrfs_free_extent.isra.47+0x201/0xbc0 [btrfs]
[1023946.290983] Code: 89 e6 e8 a2 cc ff ff 41 89 c5 58 45 85 ed 0f 84 90 00 00 00 41 83 fd fe 0f 85 11 ff ff ff 48 c7 c7 30 9d 97 c0 e8 6d 80 25 cd <0f> 0b 49 8b 3c 24 e8 b4 69 00 00 ff b4 24 b8 00 00 00 48 8b 7c 24
[1023946.290984] RSP: 0018:ffffad9468987898 EFLAGS: 00010282
[1023946.290985] RAX: 0000000000000024 RBX: 0000000000000000 RCX: 0000000000000000
[1023946.290986] RDX: 0000000000000000 RSI: ffff88b43f9d9558 RDI: ffff88b43f9d9558
[1023946.290986] RBP: 000000000c578000 R08: 0000000000003369 R09: 0000000000000001
[1023946.290987] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88b3aff9bbd0
[1023946.290987] R13: 00000000fffffffe R14: 0000000000000000 R15: 0000000000003d94
[1023946.290988] FS: 00007f165e5568c0(0000) GS:ffff88b43f9c0000(0000) knlGS:0000000000000000
[1023946.290989] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1023946.290989] CR2: 0000562d5e561a30 CR3: 000000358f130005 CR4: 00000000001726e0
[1023946.290990] Call Trace:
[1023946.291013] ? btrfs_merge_delayed_refs+0x307/0x3e0 [btrfs]
[1023946.291024] __btrfs_run_delayed_refs+0x797/0x1180 [btrfs]
[1023946.291039] ? btrfs_set_path_blocking+0x45/0x50 [btrfs]
[1023946.291050] btrfs_run_delayed_refs+0x62/0x1f0 [btrfs]
[1023946.291066] btrfs_commit_transaction+0x50/0xa60 [btrfs]
[1023946.291087] prepare_to_merge+0x246/0x260 [btrfs]
[1023946.291105] relocate_block_group+0x20d/0x790 [btrfs]
[1023946.291124] btrfs_relocate_block_group+0x16f/0x2e0 [btrfs]
[1023946.291141] btrfs_relocate_chunk+0x31/0xb0 [btrfs]
[1023946.291159] btrfs_balance+0xa18/0x11f0 [btrfs]
[1023946.291178] btrfs_ioctl_balance+0x2f2/0x3a0 [btrfs]
[1023946.291196] btrfs_ioctl+0x16d4/0x3030 [btrfs]
[1023946.291201] ? scnprintf+0x49/0x80
[1023946.291205] ? __kmalloc_node+0x1ef/0x340
[1023946.291207] ? __fput+0x150/0x270
[1023946.291209] ? ksys_ioctl+0x92/0xb0
[1023946.291210] ksys_ioctl+0x92/0xb0
[1023946.291212] __x64_sys_ioctl+0x16/0x20
[1023946.291215] do_syscall_64+0x5b/0x1e0
[1023946.291218] entry_SYSCALL_64_after_hwframe+0x44/0xa9
filesystem turned read-only
Updated by okurz over 2 years ago
- Status changed from In Progress to New
- Assignee deleted (okurz)
- Priority changed from Immediate to Urgent
The machine is reachable over IPMI but has no network due to a read-only root filesystem which needs to be recovered. As the machine is currently not on the network, it has no further impact on the o3 jobs.
Updated by nicksinger over 2 years ago
After a reboot btrfs prints:
[ 50.423028] BTRFS error (device md1): unable to find ref byte nr 207060992 parent 0 root 15764 owner 0 offset 0
[ 50.433896] BTRFS: error (device md1) in __btrfs_free_extent:3010: errno=-2 No such entry
[ 50.442254] BTRFS: error (device md1) in btrfs_run_delayed_refs:2129: errno=-2 No such entry
Trying to boot a recovery medium now to scrub the volume and try to repair it. A first quick look at the SMART data doesn't show anything unexpected for the 2 disks in question:
openqaworker4:~ # smartctl -a /dev/sda
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: HGST Travelstar 7K1000
Device Model: HGST HTE721010A9E630
Serial Number: JR10034M0HBEHK
LU WWN Device Id: 5 000cca 8a8c6fc7e
Firmware Version: JB0OA3M0
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon May 30 12:12:03 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 45) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 166) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 062 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 040 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 141 141 033 Pre-fail Always - 1
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 70
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 040 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 001 001 000 Old_age Always - 53848
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 69
191 G-Sense_Error_Rate 0x000a 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 48
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 88
194 Temperature_Celsius 0x0002 214 214 000 Old_age Always - 28 (Min/Max 20/33)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
223 Load_Retry_Count 0x000a 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 53847 -
# 2 Short offline Completed without error 00% 53839 -
# 3 Short offline Completed without error 00% 53815 -
# 4 Short offline Completed without error 00% 53791 -
# 5 Short offline Completed without error 00% 53767 -
# 6 Short offline Completed without error 00% 53743 -
# 7 Short offline Completed without error 00% 53719 -
# 8 Short offline Completed without error 00% 53695 -
# 9 Short offline Completed without error 00% 53671 -
#10 Short offline Completed without error 00% 53647 -
#11 Short offline Completed without error 00% 53623 -
#12 Short offline Completed without error 00% 53599 -
#13 Short offline Completed without error 00% 53575 -
#14 Short offline Completed without error 00% 53551 -
#15 Short offline Completed without error 00% 53527 -
#16 Short offline Completed without error 00% 53503 -
#17 Short offline Completed without error 00% 53479 -
#18 Short offline Completed without error 00% 53455 -
#19 Short offline Completed without error 00% 53431 -
#20 Short offline Completed without error 00% 53407 -
#21 Short offline Completed without error 00% 53383 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
openqaworker4:~ # smartctl -a /dev/sdb
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.3.18-150300.59.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: HGST Travelstar 7K1000
Device Model: HGST HTE721010A9E630
Serial Number: JR10034M0MUNRK
LU WWN Device Id: 5 000cca 8a8c90368
Firmware Version: JB0OA3M0
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon May 30 12:12:07 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 45) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 174) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 062 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 040 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 133 133 033 Pre-fail Always - 1
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 69
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 040 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 001 001 000 Old_age Always - 53848
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 69
191 G-Sense_Error_Rate 0x000a 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 44
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 180
194 Temperature_Celsius 0x0002 214 214 000 Old_age Always - 28 (Min/Max 20/34)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
223 Load_Retry_Count 0x000a 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 53848 -
# 2 Short offline Completed without error 00% 53839 -
# 3 Short offline Completed without error 00% 53815 -
# 4 Short offline Completed without error 00% 53791 -
# 5 Short offline Completed without error 00% 53767 -
# 6 Short offline Completed without error 00% 53743 -
# 7 Short offline Completed without error 00% 53719 -
# 8 Short offline Completed without error 00% 53695 -
# 9 Short offline Completed without error 00% 53671 -
#10 Short offline Completed without error 00% 53647 -
#11 Short offline Completed without error 00% 53623 -
#12 Short offline Completed without error 00% 53599 -
#13 Short offline Completed without error 00% 53575 -
#14 Short offline Completed without error 00% 53551 -
#15 Short offline Completed without error 00% 53527 -
#16 Short offline Completed without error 00% 53503 -
#17 Short offline Completed without error 00% 53479 -
#18 Short offline Completed without error 00% 53455 -
#19 Short offline Completed without error 00% 53431 -
#20 Short offline Completed without error 00% 53407 -
#21 Short offline Completed without error 00% 53383 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
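The attribute tables above can also be checked mechanically. According to the smartctl documentation, an attribute counts as failed when its normalized VALUE is less than or equal to THRESH. This is a minimal Python sketch that assumes the whitespace-separated column layout shown above; `failing_attributes` is a hypothetical helper name and only a few rows from /dev/sda are included.

```python
def failing_attributes(table: str) -> list[str]:
    """Return the names of attributes whose normalized VALUE is <= THRESH."""
    failing = []
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; skip the header and blank lines
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        if value <= thresh:
            failing.append(name)
    return failing


# Rows copied from the /dev/sda output above
SDA_ROWS = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 062 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
9 Power_On_Hours 0x0012 001 001 000 Old_age Always - 53848
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
"""

print(failing_attributes(SDA_ROWS))
```

For these rows the list is empty, which matches the PASSED overall assessment. Power_On_Hours has a normalized value of 001 against a threshold of 000, so it does not count as failed; its raw value of 53848 hours corresponds to a bit over six years of operation.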
Updated by livdywan over 2 years ago
- Subject changed from o3 jobs exceeding MAX_SETUP_TIME auto_review:"(?s)openqaworker4.*timeout: setup exceeded MAX_SETUP_TIME":retry to o3 jobs exceeding MAX_SETUP_TIME auto_review:"(?s)openqaworker4.*timeout: setup exceeded MAX_SETUP_TIME":retry size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by favogt over 2 years ago
fsck doesn't show any errors:
sh-4.4# btrfs check /dev/md1
Opening filesystem to check...
Checking filesystem on /dev/md1
UUID: 9058d83e-9f1f-49be-a4c0-8f975248ad39
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 29340753920 bytes used, no error found
total csum bytes: 20352152
total tree bytes: 316784640
total fs tree bytes: 272531456
total extent tree bytes: 19922944
btree space waste bytes: 61990275
file data blocks allocated: 386117111808
referenced 40491360256
scrub still fails as expected though:
sh-4.4# btrfs scrub start -Bf /dev/md1
[ 198.654683] BTRFS error (device md1): unable to find ref byte nr 207060992 parent 0 root 15764 owner 0 offset 0
[ 198.665150] BTRFS: error (device md1) in __btrfs_free_extent:3010: errno=-2 No such entry
[ 198.673353] BTRFS: error (device md1) in btrfs_run_delayed_refs:2129: errno=-2 No such entry
ERROR: scrubbing /dev/md1 failed for device id 1: ret=-1, errno=5 (Input/output error)
scrub canceled for 9058d83e-9f1f-49be-a4c0-8f975248ad39
scrub started at Thu Jun 2 09:11:50 2022 and was aborted after 00:00:21
total bytes scrubbed: 0.00B with 0 errors
Looks like this needs a reinstall. If there aren't any objections by lunch (~1h), I'll start one if no one else wants to.
Updated by livdywan over 2 years ago
favogt wrote:
Looks like this needs a reinstall. If there aren't any objections by lunch (~1h), I'll start one if no one else wants to.
I'd say feel free to go ahead
Updated by favogt over 2 years ago
- Assignee changed from nicksinger to favogt
- % Done changed from 0 to 80
Done! I took the liberty and installed Leap 15.4 already, that might help with the future upgrades of other workers.
For the installation I downloaded https://download.opensuse.org/distribution/leap/15.4/repo/oss/boot/x86_64/loader/{linux,initrd} and ran kexec --initrd=initrd --reuse-cmdline linux. I kept the existing partition setup, just added the mounts again and proceeded with the default transactional server config.
In the installed system I added the devel:openQA{,:Leap} repos and installed openQA-worker (and after a reboot also os-autoinst-distri-opensuse-deps, which I forgot initially) and did a reboot. I also had to adjust /etc/default/grub to remove a leftover rd.break which got there through --reuse-cmdline. I copied over /etc/firewalld, systemd override files, /etc/openqa* and /etc/chrony.conf from the backup in /var/lib/openqa/root-backup/ I did before the installation and also enabled the openQA services.
After a reboot they showed up broken because the cache service couldn't start due to a permission issue - /var/lib/openqa/cache was from the previous system with the previous uid for _openqa-worker. Instead of changing ownership everywhere I just ran usermod -u 471 _openqa-worker and adjusted the owner of /etc/openqa/client.conf again. After a service restart the first jobs started running.
I just saw that staging tests fail because the special ovmf build is not there yet; I copied it over and will trigger a reboot.
Updated by favogt over 2 years ago
I also saw some Reason: cache failure: Cache service queue already full (5) failures, probably because the limit is only at 50GiB. I increased it to 200GiB for now.
Updated by favogt over 2 years ago
- Status changed from Workable to Resolved
- % Done changed from 80 to 100
favogt wrote:
I also saw some Reason: cache failure: Cache service queue already full (5) failures, probably because the limit is only at 50GiB. I increased it to 200GiB for now.
That wasn't immediately solved by the change, but after waiting for a while, the cache got fuller and the errors eventually disappeared.
I saw a (spurious?) weird issue with graphics in one job: https://openqa.opensuse.org/tests/2399434#step/welcome/7
Other than that it looks like ow4 is performing properly so far.
Updated by okurz over 2 years ago
Thank you for doing this. I also managed to start the Java-based remote control client from https://openqaworker4-ipmi.suse.de/ with javaws.itweb jviewer.jnlp from icedtea-web, which offers virtual media redirection so one can select a local ISO file as installation medium. But your approach sounds even better to me.
I extended https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Setup-guide-for-new-machines and did the missing steps which are:
echo "requires:openQA-worker" > /etc/zypp/systemCheck.d/openqa.check
sed -i 's@/ btrfs ro@/ btrfs rw@' /etc/fstab
mount -o rw,remount /
btrfs property set -ts / ro false
zypper -n in openQA-continuous-update
systemctl enable --now openqa-continuous-update.timer
as we use continuous updates on our o3 updates.
Updated by favogt over 2 years ago
okurz wrote:
Thank you for doing this. I also managed to start the Java-based remote control client from https://openqaworker4-ipmi.suse.de/ with javaws.itweb jviewer.jnlp from icedtea-web, which offers virtual media redirection so one can select a local ISO file as installation medium. But your approach sounds even better to me.
Over VPN that would take literal days to boot here (2MBit/s upload on good days)...
I extended https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Setup-guide-for-new-machines and did the missing steps which are:
echo "requires:openQA-worker" > /etc/zypp/systemCheck.d/openqa.check
sed -i 's@/ btrfs ro@/ btrfs rw@' /etc/fstab
mount -o rw,remount /
btrfs property set -ts / ro false
zypper -n in openQA-continuous-update
systemctl enable --now openqa-continuous-update.timer
as we use continuous updates on our o3 updates.
O.o You're converting the transactional server systems back into regular ones. There's no benefit left with that setup.
Updated by okurz over 2 years ago
favogt wrote:
O.o You're converting the transactional server systems back into regular ones. There's no benefit left with that setup.
We only update the openQA stack that way, the rest of the system is still done as a transactional update
Updated by favogt over 2 years ago
okurz wrote:
favogt wrote:
O.o You're converting the transactional server systems back into regular ones. There's no benefit left with that setup.
We only update the openQA stack that way, the rest of the system is still done as a transactional update
FWICT the service calls plain zypper dup, which installs all available updates. That's only run if there are openQA package updates, but the installation is not limited to those.
Updated by okurz over 2 years ago
- Related to action #105379: Continuous deployment of o3 workers - one worker first size:M added
Updated by okurz over 2 years ago
favogt wrote:
FWICT the service calls plain zypper dup, which installs all available updates. That's only run if there are openQA package updates, but the installation is not limited to those.
Right. The approach was changed due to the comment in #105379#note-11 from @mkittler:
According to DimStar -r limits the dependency resolution to just that repository so we must not use it and better just upgrade everything.
So yes, it seems that with this the nightly transactional updates should be no-ops and have no real effect, except that a reboot should still be triggered for consistency. But even that does not seem to happen according to for i in openqaworker1 openqaworker4 openqaworker7 imagetester rebel power8; do ssh root@$i uptime; done, which shows that most machines have not rebooted for the past four days. Will create a new issue.