action #128999
openQA workers salt recipes should ensure that also developer mode works size:M (closed)
Description
Motivation
In https://suse.slack.com/archives/C02CANHLANP/p1683638470261569 the question came up whether a message along the lines of "can not upgrade ws server" is related to the user's VPN selection, which of course it is not. It seems that at least one Prague worker was not properly configured for developer mode. Our salt states already cover some firewall configuration but apparently not all of it, see https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/worker.sls#L245
Acceptance criteria
- AC1: It is understood why our initial setup of new Prague workers did not include this
- AC2: All OSD machines have been crosschecked for the missing configuration
- AC3: salt-states-openqa must cover the configuration
Suggestions
- See https://open.qa/docs/images/architecture.svg for required port ranges
- Look at documentation for configuring firewalld manually (https://open.qa/docs/#_configure_firewalld)
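A minimal sketch of what the manual firewalld configuration could look like on a worker, in case the default zone is not simply set to trusted; the port range below is only a placeholder, the actual values need to be taken from the architecture diagram and the linked documentation:
firewall-cmd --permanent --add-port=20003-20103/tcp   # placeholder range, adjust to the ports required by the worker slots/developer mode
firewall-cmd --reload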
Updated by mkittler over 1 year ago
- Subject changed from openQA workers salt recipes should ensure that also developer mode works to openQA workers salt recipes should ensure that also developer mode works size:M
- Status changed from New to Workable
Updated by mkittler over 1 year ago
- Status changed from Workable to In Progress
It looks like our salt states would do the right thing: they configure the "trusted" zone which allows all incoming connections.
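To cross-check a worker for that configuration (as required by AC2), something along these lines should suffice; the last command just looks at the file our firewalld_config state edits via file.replace:
firewall-cmd --get-default-zone            # expected to print "trusted"
firewall-cmd --zone=trusted --list-all     # the trusted zone accepts all incoming connections
grep DefaultZone /etc/firewalld/firewalld.conf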
It is hard to tell what went wrong on this Prague-located worker. Maybe the firewall was configured in a way that our salt states can't cope with. I tried to reproduce the problem by setting up a worker with our salt states in a fresh Leap 15.4 VM. Despite being able to resolve openqa.suse.de within that VM and the VM showing up on OSD via salt-key -L, I'm getting
salt-minion[3572]: [ERROR ] Exception during resolving address: [Errno 2] Host name lookup failure
when trying to connect the Minion. The problem persists when hardcoding OSD's IP. Any idea what the problem could be? Unfortunately the error message doesn't even state which address it cannot resolve.
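Typical checks to narrow such a resolution problem down would be, for example (standard tooling, nothing openQA-specific; the minion config paths are the usual defaults):
getent hosts openqa.suse.de                              # what the resolver actually returns inside the VM
grep -r '^master' /etc/salt/minion /etc/salt/minion.d/   # which master address the minion is configured to reach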
Updated by mkittler over 1 year ago
After suspending+resuming the VM host I get past the resolving problem in the VM guest. Applying the salt states nevertheless doesn't go smoothly at all. The salt-minion service is constantly restarted, running into different errors, e.g.:
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: [ERROR ] Command 'systemd-run' failed with return code: 104
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: [ERROR ] stdout: Loading repository data...
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: Reading installed packages...
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: [ERROR ] stderr: Running scope as unit: run-r7e1f21fc02354322ba4e5f895c08bc40.scope
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: Package 'os-autoinst-swtpm' not found.
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: [ERROR ] retcode: 104
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: [ERROR ] An error was encountered while installing package(s): Zypper command failure: Running scope as unit: run-r7e1f21fc02354322ba4e5f895c08bc40.scope
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: Package 'os-autoinst-swtpm' not found.Loading repository data...
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: Reading installed packages...
Jun 14 17:21:58 martchus-test-vm systemd[1]: salt-minion.service: Scheduled restart job, restart counter is at 16.
I guess it is now stuck at:
2023-06-14 17:22:18,177 [salt.state :319 ][ERROR ][3955] An error was encountered while installing package(s): Zypper command failure: Running scope as unit: run-r6ba622d7c97c4998a43c8cdef3e0bbc2.scope
Package 'os-autoinst-swtpm' not found.Loading repository data...
Reading installed packages...
2023-06-14 17:22:56,053 [salt.state :319 ][ERROR ][3955] Failed to configure repo 'SUSE_CA': refresh_db() got multiple values for keyword argument 'root'
2023-06-14 17:24:38,595 [salt.state :319 ][ERROR ][3955] User openvswitch is not available Group openvswitch is not available
2023-06-14 17:24:38,597 [salt.state :319 ][ERROR ][3955] /etc/logrotate.d/openvswitch: file not found
2023-06-14 17:24:38,601 [salt.state :319 ][ERROR ][3955] User openvswitch is not available Group openvswitch is not available
Updated by openqa_review over 1 year ago
- Due date set to 2023-06-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 1 year ago
I've just tried again and still see the same problem. Note that salt actually did something. The openQA-worker package is installed from the correctly configured repository and if I had configured worker slots it likely would have started worker services. It also installed telegraf. I don't see any firewall-related changes, though. So maybe that Prague worker also ended up in a half-setup state because salt behaved very weirdly.
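For context, configuring worker slots would just mean adding numbered sections to /etc/openqa/workers.ini, roughly like this (host and worker class are placeholder values):
[global]
HOST = https://openqa.suse.de

[1]
WORKER_CLASS = qemu_x86_64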
Today, after booting into the VM again I see the Host name lookup failure again despite openqa.suse.de being pingable just fine from within the VM. I'm out of ideas what to do. Of course I can start from scratch from a previous VM snapshot but I don't know what I would do differently.
Updated by mkittler over 1 year ago
After suspend+resume of the VM host I've again gotten a little bit further. Now velociraptor is running and salt also managed to add the CA repo back on its own. The firewall config is still missing, even after applying the "high state" explicitly:
----------
ID: /var/lib/openqa/share
Function: mount.mounted
Result: False
Comment: One or more requisite failed: openqa.worker.worker.packages
Started: 13:12:33.662749
Duration: 0.026 ms
Changes:
----------
ID: /etc/openqa/workers.ini
Function: file.managed
Result: False
Comment: One or more requisite failed: openqa.worker.worker.packages
Started: 13:12:33.666084
Duration: 0.026 ms
Changes:
----------
ID: /etc/openqa/client.conf
Function: ini.options_present
Result: False
Comment: One or more requisite failed: openqa.worker.worker.packages
Started: 13:12:33.668873
Duration: 0.015 ms
Changes:
----------
ID: /etc/systemd/system/openqa-worker-auto-restart@.service.d/30-openqa-max-inactive-caching-downloads.conf
Function: file.managed
Result: False
Comment: One or more requisite failed: openqa.worker.worker.packages
Started: 13:12:33.670839
Duration: 0.014 ms
Changes:
Name: /etc/systemd/system/openqa-worker@.service - Function: file.symlink - Result: Clean Started: - 13:12:33.671029 Duration: 6.315 ms
Name: openqa-worker.target - Function: service.disabled - Result: Clean Started: - 13:12:33.677620 Duration: 21.957 ms
----------
ID: openqa-worker-cacheservice
Function: service.running
Result: False
Comment: One or more requisite failed: openqa.worker.worker.packages
Started: 13:12:33.705299
Duration: 0.024 ms
Changes:
----------
ID: openqa-worker-cacheservice-minion
Function: service.running
Result: False
Comment: One or more requisite failed: openqa.worker.worker.packages
Started: 13:12:33.708426
Duration: 0.034 ms
Changes:
----------
ID: stop_and_disable_all_not_configured_workers
Function: cmd.run
Name: services=$(systemctl list-units --all 'openqa-worker-auto-restart@*.service' | sed -e '/.*openqa-worker-auto-restart@.*\.service.*/!d' -e 's|.*openqa-worker-auto-restart@\(.*\)\.service.*|\1|' | awk '{ if($0 > 0) print "openqa-worker-auto-restart@" $0 ".service openqa-reload-worker-auto-restart@" $0 ".path" }' | tr '\n' ' '); [ -z "$services" ] || systemctl disable --now $services
Result: True
Comment: Command "services=$(systemctl list-units --all 'openqa-worker-auto-restart@*.service' | sed -e '/.*openqa-worker-auto-restart@.*\.service.*/!d' -e 's|.*openqa-worker-auto-restart@\(.*\)\.service.*|\1|' | awk '{ if($0 > 0) print "openqa-worker-auto-restart@" $0 ".service openqa-reload-worker-auto-restart@" $0 ".path" }' | tr '\n' ' '); [ -z "$services" ] || systemctl disable --now $services" run
Started: 13:12:33.708767
Duration: 28.66 ms
Changes:
----------
pid:
25640
retcode:
0
stderr:
stdout:
----------
ID: firewalld_config
Function: file.replace
Name: /etc/firewalld/firewalld.conf
Result: False
Comment: One or more requisite failed: openqa.worker.worker.packages
Started: 13:12:33.756697
Duration: 0.026 ms
Changes:
----------
ID: firewalld
Function: service.running
Result: False
Comment: One or more requisite failed: openqa.worker.worker.packages, openqa.worker.firewalld_config
Started: 13:12:33.765553
Duration: 0.016 ms
Changes:
Name: apparmor - Function: pkg.purged - Result: Clean Started: - 13:12:33.765903 Duration: 13.401 ms
Name: apparmor - Function: service.dead - Result: Clean Started: - 13:12:33.779778 Duration: 57.768 ms
Name: apparmor - Function: service.masked - Result: Clean Started: - 13:12:33.838131 Duration: 2.591 ms
Name: chattr +C /var/lib/openqa/cache && touch /var/lib/openqa/cache/.nocow - Function: cmd.run - Result: Clean Started: - 13:12:33.840948 Duration: 25.716 ms
Name: /etc/default/grub - Function: file.replace - Result: Clean Started: - 13:12:33.867242 Duration: 11.727 ms
Name: /etc/default/grub - Function: file.replace - Result: Clean Started: - 13:12:33.879160 Duration: 5.878 ms
Name: /etc/default/grub - Function: file.replace - Result: Clean Started: - 13:12:33.885215 Duration: 8.984 ms
----------
ID: grub2-mkconfig > /boot/grub2/grub.cfg
Function: cmd.run
Result: True
Comment: Command "grub2-mkconfig > /boot/grub2/grub.cfg" run
Started: 13:12:33.894626
Duration: 2029.619 ms
Changes:
----------
pid:
25652
retcode:
0
stderr:
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.14.21-150400.24.63-default
Found initrd image: /boot/initrd-5.14.21-150400.24.63-default
Warning: os-prober will not be executed to detect other bootable partitions.
Systems on them will not be added to the GRUB boot configuration.
Check GRUB_DISABLE_OS_PROBER documentation entry.
done
stdout:
----------
ID: setcap cap_net_admin=ep /usr/bin/qemu-system-x86_64
Function: cmd.run
Result: False
Comment: One or more requisite failed: openqa.worker.worker.packages
Started: 13:12:35.930637
Duration: 0.012 ms
Changes:
Name: /etc/sudoers.d/_openqa-worker - Function: file.managed - Result: Clean Started: - 13:12:35.934098 Duration: 7.67 ms
Name: kernel.softlockup_panic - Function: sysctl.present - Result: Clean Started: - 13:12:35.942295 Duration: 15.648 ms
Name: /etc/sysctl.d/50-vm-bytes.conf - Function: file.managed - Result: Clean Started: - 13:12:35.958615 Duration: 69.399 ms
Name: sysctl -p /etc/sysctl.d/50-vm-bytes.conf - Function: cmd.run - Result: Clean Started: - 13:12:36.031114 Duration: 0.023 ms
Name: server.packages - Function: pkg.installed - Result: Clean Started: - 13:12:36.031378 Duration: 18.256 ms
Name: /etc/qemu-ifup-br0 - Function: file.managed - Result: Clean Started: - 13:12:36.050041 Duration: 5.884 ms
Name: /etc/qemu-ifdown-br0 - Function: file.managed - Result: Clean Started: - 13:12:36.056388 Duration: 6.534 ms
Name: /etc/qemu-ifup-br2 - Function: file.managed - Result: Clean Started: - 13:12:36.063299 Duration: 3.081 ms
Name: /etc/qemu-ifdown-br2 - Function: file.managed - Result: Clean Started: - 13:12:36.066567 Duration: 6.327 ms
Name: /etc/qemu-ifup-br3 - Function: file.managed - Result: Clean Started: - 13:12:36.073447 Duration: 6.116 ms
Name: /etc/qemu-ifdown-br3 - Function: file.managed - Result: Clean Started: - 13:12:36.080005 Duration: 6.657 ms
Name: qemu - Function: pkg.installed - Result: Clean Started: - 13:12:36.086851 Duration: 10.409 ms
Name: /usr/share/qemu/ipxe.lkrn - Function: file.managed - Result: Clean Started: - 13:12:36.097562 Duration: 77.128 ms
Name: tgt - Function: pkg.installed - Result: Clean Started: - 13:12:36.175092 Duration: 15.878 ms
Name: dd if=/dev/zero of=/opt/openqa-iscsi-disk seek=1M bs=20480 count=1 - Function: cmd.run - Result: Clean Started: - 13:12:36.195218 Duration: 19.712 ms
Name: tgtd - Function: service.running - Result: Clean Started: - 13:12:36.219333 Duration: 61.929 ms
Name: salt://openqa/iscsi-target-setup.sh - Function: cmd.script - Result: Clean Started: - 13:12:36.284702 Duration: 14.573 ms
----------
ID: /etc/sysconfig/network/ifcfg-br1
Function: file.managed
Result: False
Comment: One or more requisite failed: openqa.worker.worker.packages
Started: 13:12:36.305793
Duration: 0.017 ms
Changes:
----------
ID: openvswitch
Function: service.running
Result: False
Comment: One or more requisite failed: openqa.openvswitch./etc/sysconfig/network/ifcfg-br1
Started: 13:12:36.309592
Duration: 0.016 ms
Changes:
Name: wicked - Function: pkg.installed - Result: Clean Started: - 13:12:36.309919 Duration: 16.043 ms
----------
ID: wicked ifup br1
Function: cmd.wait
Result: False
Comment: One or more requisite failed: openqa.openvswitch./etc/sysconfig/network/ifcfg-br1
Started: 13:12:36.331002
Duration: 0.023 ms
Changes:
Name: /etc/wicked/scripts/gre_tunnel_preup.sh - Function: file.absent - Result: Clean Started: - 13:12:36.331236 Duration: 2.884 ms
----------
ID: /etc/sysconfig/os-autoinst-openvswitch
Function: file.managed
Result: False
Comment: One or more requisite failed: openqa.worker.worker.packages
Started: 13:12:36.338598
Duration: 0.016 ms
Changes:
Name: /etc/systemd/system/os-autoinst-openvswitch.service.d/30-init-timeout.conf - Function: file.managed - Result: Clean Started: - 13:12:36.338856 Duration: 76.694 ms
Name: service.systemctl_reload - Function: module.wait - Result: Clean Started: - 13:12:36.422083 Duration: 4.032 ms
----------
ID: os-autoinst-openvswitch
Function: service.running
Result: False
Comment: One or more requisite failed: openqa.openvswitch./etc/sysconfig/network/ifcfg-br1, openqa.openvswitch./etc/sysconfig/os-autoinst-openvswitch
Started: 13:12:36.439003
Duration: 0.022 ms
Changes:
----------
ID: /var/log/openvswitch
Function: file.directory
Result: False
Comment: User openvswitch is not available Group openvswitch is not available
Started: 13:12:36.439272
Duration: 6.155 ms
Changes:
----------
ID: /etc/logrotate.d/openvswitch
Function: file.line
Result: False
Comment: /etc/logrotate.d/openvswitch: file not found
Started: 13:12:36.445890
Duration: 3.474 ms
Changes:
Name: /etc/sysconfig/openvswitch - Function: file.replace - Result: Clean Started: - 13:12:36.449713 Duration: 3.727 ms
----------
ID: /etc/openvswitch
Function: file.directory
Result: False
Comment: User openvswitch is not available Group openvswitch is not available
Started: 13:12:36.453948
Duration: 6.016 ms
Changes:
----------
ID: ovsdb-server.service
Function: service.running
Result: False
Comment: One or more requisite failed: openqa.openvswitch_boo1181418./etc/openvswitch
Started: 13:12:36.467677
Duration: 0.035 ms
Changes:
----------
ID: ovs-vswitchd.service
Function: service.running
Result: False
Comment: One or more requisite failed: openqa.openvswitch_boo1181418./etc/openvswitch
Started: 13:12:36.475265
Duration: 0.026 ms
Changes:
----------
ID: os-autoinst-openvswitch.service
Function: service.running
Result: False
Comment: One or more requisite failed: openqa.openvswitch_boo1181418./etc/openvswitch
Started: 13:12:36.483816
Duration: 0.038 ms
Changes:
Name: /etc/dbus-1/system.d/system-local.conf - Function: file.managed - Result: Clean Started: - 13:12:36.484205 Duration: 8.42 ms
Name: /usr/local/bin/recover-nfs.sh - Function: file.managed - Result: Clean Started: - 13:12:36.493071 Duration: 76.437 ms
Name: /etc/systemd/system/recover-nfs.service - Function: file.managed - Result: Clean Started: - 13:12:36.569789 Duration: 64.402 ms
Name: service.systemctl_reload - Function: module.run - Result: Clean Started: - 13:12:36.637134 Duration: 0.009 ms
Name: /etc/systemd/system/recover-nfs.timer - Function: file.managed - Result: Clean Started: - 13:12:36.637235 Duration: 74.057 ms
Name: service.systemctl_reload - Function: module.run - Result: Clean Started: - 13:12:36.716745 Duration: 0.017 ms
Name: recover-nfs.timer - Function: service.running - Result: Clean Started: - 13:12:36.716987 Duration: 55.305 ms
Name: /etc/systemd/system/auto-update.service - Function: file.managed - Result: Clean Started: - 13:12:36.772773 Duration: 66.196 ms
Name: service.systemctl_reload - Function: module.run - Result: Clean Started: - 13:12:36.842837 Duration: 0.019 ms
Name: /etc/systemd/system/auto-update.timer - Function: file.managed - Result: Clean Started: - 13:12:36.843123 Duration: 74.227 ms
Name: service.systemctl_reload - Function: module.run - Result: Clean Started: - 13:12:36.919046 Duration: 0.008 ms
Name: auto-update.timer - Function: service.running - Result: Clean Started: - 13:12:36.920589 Duration: 52.616 ms
Summary for martchus-test-vm
--------------
Succeeded: 292 (changed=3)
Failed: 23
So I guess it is very easy to get a machine into a half-setup state, and the firewall config seems to be something that is quite at the end of the dependency chain. Maybe the person who set up the Prague-located worker hasn't tried to apply the high state manually and thus didn't realize that there were still errors - because on the VM itself systemctl --failed lists nothing. So the minion service is running just fine and no other services are failing (they are just deactivated when they should actually be active, like telegraf which is currently still missing).
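Applying the high state manually on the worker and checking its summary (rather than only systemctl --failed) would have shown the problem; a minimal sketch, with the log path being an arbitrary placeholder:
sudo salt-call state.apply | tee /tmp/highstate.log   # re-run until the summary at the end reports "Failed: 0"
grep -B3 'Result: False' /tmp/highstate.log           # quick way to list only the failing states, matching the output format above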
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
Maybe calling state.apply is really what one is supposed to do - possibly multiple times - to speed things up, as the whole application of the states can fail due to various reasons. Likely the most common reason is zypper running into download errors, which then also blocks lots of other things depending on it. Now only one failure is remaining:
----------
ID: btrfs-nocow
Function: cmd.run
Name: chattr +C /var/lib/openqa/cache && touch /var/lib/openqa/cache/.nocow
Result: False
Comment: Command "chattr +C /var/lib/openqa/cache && touch /var/lib/openqa/cache/.nocow" run
Started: 13:43:43.423041
Duration: 372.98 ms
Changes:
----------
pid:
4332
retcode:
127
stderr:
/bin/sh: chattr: command not found
stdout:
Name: /etc/default/grub - Function: file.replace - Result: Clean Started: - 13:43:43.796525 Duration: 8.944 ms
That means the firewall config is in place as it should be. Firewalld has also been restarted so it should also be effective. Running commands like firewall-cmd --get-default-zone also confirms that.
So maybe when setting up that worker in Prague some errors were just overlooked (maybe the number of errors was not as big as in my case so it wasn't as obvious). I can only say that in general the firewall config can be applied correctly via our salt states.
MR for fixing the problem with chattr: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/887
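Independent of what the linked MR does exactly, the failure above just means chattr was missing in the minimal VM (it is shipped with e2fsprogs, as far as I can tell); fixing and verifying that by hand would look roughly like this:
which chattr || sudo zypper --non-interactive install e2fsprogs
sudo chattr +C /var/lib/openqa/cache && sudo touch /var/lib/openqa/cache/.nocow
lsattr -d /var/lib/openqa/cache    # the 'C' attribute marks the cache directory as no-copy-on-write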
Updated by mkittler over 1 year ago
- Status changed from Feedback to Resolved
The MR has been merged.
I've also just removed the test VM again from OSD.
I think this ticket can be resolved with the explanation that errors when applying salt states on that worker were not handled correctly. In general our salt states configure the firewall so that all incoming connections are allowed.
Updated by okurz over 1 year ago
- Related to action #124562: Re-install at least one of the new OSD workers located in Prague added
Updated by okurz over 1 year ago
Ok, good. I think the rest can be covered in #124562