action #128999

openQA workers salt recipes should ensure that also developer mode works size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: -
Target version:
Start date: 2023-05-09
Due date: 2023-06-29
% Done: 0%
Estimated time:
Description

Motivation

In https://suse.slack.com/archives/C02CANHLANP/p1683638470261569 the question came up whether the message about "can not upgrade ws server" (or similar) is related to the VPN selection of users, which of course it isn't. It seems that at least one Prague worker was not properly configured for developer mode. Our salt states already cover some firewall configuration but apparently not all of it, see https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/worker.sls#L245

Acceptance criteria

  • AC1: It is understood why our initial setup of new Prague workers did not include this
  • AC2: All OSD machines have been crosschecked for the missing configuration
  • AC3: salt-states-openqa must cover the configuration

Suggestions


Related issues: 1 (1 open, 0 closed)

Related to openQA Infrastructure - action #124562: Re-install at least one of the new OSD workers located in Prague (status: New)

Actions #1

Updated by okurz over 1 year ago

  • Tags deleted (infra)
Actions #2

Updated by mkittler over 1 year ago

  • Subject changed from openQA workers salt recipes should ensure that also developer mode works to openQA workers salt recipes should ensure that also developer mode works size:M
  • Status changed from New to Workable

Actions #3

Updated by mkittler over 1 year ago

  • Description updated (diff)
Actions #4

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #5

Updated by mkittler over 1 year ago

  • Status changed from Workable to In Progress

It looks like our salt states would do the right thing. They configure the trusted zone, which allows all incoming connections.
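
For reference, the manual equivalent of the firewalld-related part of the states (a file.replace on /etc/firewalld/firewalld.conf followed by restarting firewalld, as also visible in the state output further below) would roughly be the following sketch; the actual state may set more options:

# minimal manual sketch of the same configuration
sed -i 's/^DefaultZone=.*/DefaultZone=trusted/' /etc/firewalld/firewalld.conf
systemctl restart firewalld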

It is hard to tell what went wrong on this Prague-located worker. Maybe the firewall was configured in a way that our salt states can't cope with. I tried to reproduce the problem by setting up a worker with our salt states in a fresh Leap 15.4 VM. Despite being able to resolve openqa.suse.de within that VM and it showing up on OSD via salt-key -L, I'm getting salt-minion[3572]: [ERROR ] Exception during resolving address: [Errno 2] Host name lookup failure when trying to connect the minion. The problem persists when hardcoding OSD's IP. Any idea what the problem could be? Unfortunately the error message doesn't even state which address it cannot resolve.
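
For context, the rough sequence for hooking up such a fresh VM as a minion looks like the sketch below (paths are the salt defaults, openqa.suse.de being the master as in this ticket, <minion-id> a placeholder):

# on the new worker VM: point the minion at the master and start it
echo 'master: openqa.suse.de' > /etc/salt/minion.d/master.conf
systemctl enable --now salt-minion

# on OSD: accept the new key, check connectivity and apply the states
salt-key -L             # the new minion should show up under "Unaccepted Keys"
salt-key -a <minion-id>
salt '<minion-id>' test.ping
salt '<minion-id>' state.apply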

Actions #6

Updated by mkittler over 1 year ago

After suspending+resuming the VM host I get past the name resolution problem in the VM guest. Applying the salt states nevertheless doesn't go smoothly at all. The salt-minion service is constantly restarted, running into different errors, e.g.:

Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: [ERROR   ] Command 'systemd-run' failed with return code: 104
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: [ERROR   ] stdout: Loading repository data...
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: Reading installed packages...
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: [ERROR   ] stderr: Running scope as unit: run-r7e1f21fc02354322ba4e5f895c08bc40.scope
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: Package 'os-autoinst-swtpm' not found.
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: [ERROR   ] retcode: 104
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: [ERROR   ] An error was encountered while installing package(s): Zypper command failure: Running scope as unit: run-r7e1f21fc02354322ba4e5f895c08bc40.scope
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: Package 'os-autoinst-swtpm' not found.Loading repository data...
Jun 14 17:21:45 martchus-test-vm salt-minion[3955]: Reading installed packages...
Jun 14 17:21:58 martchus-test-vm systemd[1]: salt-minion.service: Scheduled restart job, restart counter is at 16.

I guess it is now eventually stuck at:

2023-06-14 17:22:18,177 [salt.state       :319 ][ERROR   ][3955] An error was encountered while installing package(s): Zypper command failure: Running scope as unit: run-r6ba622d7c97c4998a43c8cdef3e0bbc2.scope
Package 'os-autoinst-swtpm' not found.Loading repository data...
Reading installed packages...
2023-06-14 17:22:56,053 [salt.state       :319 ][ERROR   ][3955] Failed to configure repo 'SUSE_CA': refresh_db() got multiple values for keyword argument 'root'
2023-06-14 17:24:38,595 [salt.state       :319 ][ERROR   ][3955] User openvswitch is not available Group openvswitch is not available
2023-06-14 17:24:38,597 [salt.state       :319 ][ERROR   ][3955] /etc/logrotate.d/openvswitch: file not found
2023-06-14 17:24:38,601 [salt.state       :319 ][ERROR   ][3955] User openvswitch is not available Group openvswitch is not available
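
The "Package 'os-autoinst-swtpm' not found" errors suggest the repository providing the openQA packages was not configured (yet) when zypper was invoked. A quick diagnostic on the affected machine (just a sketch, independent of the salt states):

zypper lr -u                     # list configured repositories and their URIs
zypper se -s os-autoinst-swtpm   # show which repository, if any, provides the package
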
Actions #7

Updated by openqa_review over 1 year ago

  • Due date set to 2023-06-29

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by mkittler over 1 year ago

I've just tried again and still see the same problem. Note that salt actually did something: the openQA-worker package is installed from the correctly configured repository, and if I had configured worker slots it likely would have started worker services. It also installed telegraf. I don't see any firewall-related changes, though. So maybe that Prague worker also ended up in a half-set-up state because salt behaved very weirdly.

Today, after booting the VM again, I see the host name lookup failure again despite openqa.suse.de being pingable just fine from within the VM. I'm out of ideas about what to do. Of course I could start from scratch from a previous VM snapshot, but I don't know what I would do differently.

Actions #9

Updated by mkittler over 1 year ago

After suspend+resume of the VM host I've again gotten a little bit further. Now velociraptor is running and salt also managed to add the CA repo back on its own. The firewall config is still missing, though, even after applying the "high state" explicitly:

----------
          ID: /var/lib/openqa/share
    Function: mount.mounted
      Result: False
     Comment: One or more requisite failed: openqa.worker.worker.packages
     Started: 13:12:33.662749
    Duration: 0.026 ms
     Changes:   
----------
          ID: /etc/openqa/workers.ini
    Function: file.managed
      Result: False
     Comment: One or more requisite failed: openqa.worker.worker.packages
     Started: 13:12:33.666084
    Duration: 0.026 ms
     Changes:   
----------
          ID: /etc/openqa/client.conf
    Function: ini.options_present
      Result: False
     Comment: One or more requisite failed: openqa.worker.worker.packages
     Started: 13:12:33.668873
    Duration: 0.015 ms
     Changes:   
----------
          ID: /etc/systemd/system/openqa-worker-auto-restart@.service.d/30-openqa-max-inactive-caching-downloads.conf
    Function: file.managed
      Result: False
     Comment: One or more requisite failed: openqa.worker.worker.packages
     Started: 13:12:33.670839
    Duration: 0.014 ms
     Changes:   
  Name: /etc/systemd/system/openqa-worker@.service - Function: file.symlink - Result: Clean Started: - 13:12:33.671029 Duration: 6.315 ms
  Name: openqa-worker.target - Function: service.disabled - Result: Clean Started: - 13:12:33.677620 Duration: 21.957 ms
----------
          ID: openqa-worker-cacheservice
    Function: service.running
      Result: False
     Comment: One or more requisite failed: openqa.worker.worker.packages
     Started: 13:12:33.705299
    Duration: 0.024 ms
     Changes:   
----------
          ID: openqa-worker-cacheservice-minion
    Function: service.running
      Result: False
     Comment: One or more requisite failed: openqa.worker.worker.packages
     Started: 13:12:33.708426
    Duration: 0.034 ms
     Changes:   
----------
          ID: stop_and_disable_all_not_configured_workers
    Function: cmd.run
        Name: services=$(systemctl list-units --all 'openqa-worker-auto-restart@*.service' | sed -e '/.*openqa-worker-auto-restart@.*\.service.*/!d' -e 's|.*openqa-worker-auto-restart@\(.*\)\.service.*|\1|' | awk '{ if($0 > 0) print "openqa-worker-auto-restart@" $0 ".service openqa-reload-worker-auto-restart@" $0 ".path" }' | tr '\n' ' '); [ -z "$services" ] || systemctl disable --now $services
      Result: True
     Comment: Command "services=$(systemctl list-units --all 'openqa-worker-auto-restart@*.service' | sed -e '/.*openqa-worker-auto-restart@.*\.service.*/!d' -e 's|.*openqa-worker-auto-restart@\(.*\)\.service.*|\1|' | awk '{ if($0 > 0) print "openqa-worker-auto-restart@" $0 ".service openqa-reload-worker-auto-restart@" $0 ".path" }' | tr '\n' ' '); [ -z "$services" ] || systemctl disable --now $services" run
     Started: 13:12:33.708767
    Duration: 28.66 ms
     Changes:   
              ----------
              pid:
                  25640
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: firewalld_config
    Function: file.replace
        Name: /etc/firewalld/firewalld.conf
      Result: False
     Comment: One or more requisite failed: openqa.worker.worker.packages
     Started: 13:12:33.756697
    Duration: 0.026 ms
     Changes:   
----------
          ID: firewalld
    Function: service.running
      Result: False
     Comment: One or more requisite failed: openqa.worker.worker.packages, openqa.worker.firewalld_config
     Started: 13:12:33.765553
    Duration: 0.016 ms
     Changes:   
  Name: apparmor - Function: pkg.purged - Result: Clean Started: - 13:12:33.765903 Duration: 13.401 ms
  Name: apparmor - Function: service.dead - Result: Clean Started: - 13:12:33.779778 Duration: 57.768 ms
  Name: apparmor - Function: service.masked - Result: Clean Started: - 13:12:33.838131 Duration: 2.591 ms
  Name: chattr +C /var/lib/openqa/cache && touch /var/lib/openqa/cache/.nocow - Function: cmd.run - Result: Clean Started: - 13:12:33.840948 Duration: 25.716 ms
  Name: /etc/default/grub - Function: file.replace - Result: Clean Started: - 13:12:33.867242 Duration: 11.727 ms
  Name: /etc/default/grub - Function: file.replace - Result: Clean Started: - 13:12:33.879160 Duration: 5.878 ms
  Name: /etc/default/grub - Function: file.replace - Result: Clean Started: - 13:12:33.885215 Duration: 8.984 ms
----------
          ID: grub2-mkconfig > /boot/grub2/grub.cfg
    Function: cmd.run
      Result: True
     Comment: Command "grub2-mkconfig > /boot/grub2/grub.cfg" run
     Started: 13:12:33.894626
    Duration: 2029.619 ms
     Changes:   
              ----------
              pid:
                  25652
              retcode:
                  0
              stderr:
                  Generating grub configuration file ...
                  Found linux image: /boot/vmlinuz-5.14.21-150400.24.63-default
                  Found initrd image: /boot/initrd-5.14.21-150400.24.63-default
                  Warning: os-prober will not be executed to detect other bootable partitions.
                  Systems on them will not be added to the GRUB boot configuration.
                  Check GRUB_DISABLE_OS_PROBER documentation entry.
                  done
              stdout:
----------
          ID: setcap cap_net_admin=ep /usr/bin/qemu-system-x86_64
    Function: cmd.run
      Result: False
     Comment: One or more requisite failed: openqa.worker.worker.packages
     Started: 13:12:35.930637
    Duration: 0.012 ms
     Changes:   
  Name: /etc/sudoers.d/_openqa-worker - Function: file.managed - Result: Clean Started: - 13:12:35.934098 Duration: 7.67 ms
  Name: kernel.softlockup_panic - Function: sysctl.present - Result: Clean Started: - 13:12:35.942295 Duration: 15.648 ms
  Name: /etc/sysctl.d/50-vm-bytes.conf - Function: file.managed - Result: Clean Started: - 13:12:35.958615 Duration: 69.399 ms
  Name: sysctl -p /etc/sysctl.d/50-vm-bytes.conf - Function: cmd.run - Result: Clean Started: - 13:12:36.031114 Duration: 0.023 ms
  Name: server.packages - Function: pkg.installed - Result: Clean Started: - 13:12:36.031378 Duration: 18.256 ms
  Name: /etc/qemu-ifup-br0 - Function: file.managed - Result: Clean Started: - 13:12:36.050041 Duration: 5.884 ms
  Name: /etc/qemu-ifdown-br0 - Function: file.managed - Result: Clean Started: - 13:12:36.056388 Duration: 6.534 ms
  Name: /etc/qemu-ifup-br2 - Function: file.managed - Result: Clean Started: - 13:12:36.063299 Duration: 3.081 ms
  Name: /etc/qemu-ifdown-br2 - Function: file.managed - Result: Clean Started: - 13:12:36.066567 Duration: 6.327 ms
  Name: /etc/qemu-ifup-br3 - Function: file.managed - Result: Clean Started: - 13:12:36.073447 Duration: 6.116 ms
  Name: /etc/qemu-ifdown-br3 - Function: file.managed - Result: Clean Started: - 13:12:36.080005 Duration: 6.657 ms
  Name: qemu - Function: pkg.installed - Result: Clean Started: - 13:12:36.086851 Duration: 10.409 ms
  Name: /usr/share/qemu/ipxe.lkrn - Function: file.managed - Result: Clean Started: - 13:12:36.097562 Duration: 77.128 ms
  Name: tgt - Function: pkg.installed - Result: Clean Started: - 13:12:36.175092 Duration: 15.878 ms
  Name: dd if=/dev/zero of=/opt/openqa-iscsi-disk seek=1M bs=20480 count=1 - Function: cmd.run - Result: Clean Started: - 13:12:36.195218 Duration: 19.712 ms
  Name: tgtd - Function: service.running - Result: Clean Started: - 13:12:36.219333 Duration: 61.929 ms
  Name: salt://openqa/iscsi-target-setup.sh - Function: cmd.script - Result: Clean Started: - 13:12:36.284702 Duration: 14.573 ms
----------
          ID: /etc/sysconfig/network/ifcfg-br1
    Function: file.managed
      Result: False
     Comment: One or more requisite failed: openqa.worker.worker.packages
     Started: 13:12:36.305793
    Duration: 0.017 ms
     Changes:   
----------
          ID: openvswitch
    Function: service.running
      Result: False
     Comment: One or more requisite failed: openqa.openvswitch./etc/sysconfig/network/ifcfg-br1
     Started: 13:12:36.309592
    Duration: 0.016 ms
     Changes:   
  Name: wicked - Function: pkg.installed - Result: Clean Started: - 13:12:36.309919 Duration: 16.043 ms
----------
          ID: wicked ifup br1
    Function: cmd.wait
      Result: False
     Comment: One or more requisite failed: openqa.openvswitch./etc/sysconfig/network/ifcfg-br1
     Started: 13:12:36.331002
    Duration: 0.023 ms
     Changes:   
  Name: /etc/wicked/scripts/gre_tunnel_preup.sh - Function: file.absent - Result: Clean Started: - 13:12:36.331236 Duration: 2.884 ms
----------
          ID: /etc/sysconfig/os-autoinst-openvswitch
    Function: file.managed
      Result: False
     Comment: One or more requisite failed: openqa.worker.worker.packages
     Started: 13:12:36.338598
    Duration: 0.016 ms
     Changes:   
  Name: /etc/systemd/system/os-autoinst-openvswitch.service.d/30-init-timeout.conf - Function: file.managed - Result: Clean Started: - 13:12:36.338856 Duration: 76.694 ms
  Name: service.systemctl_reload - Function: module.wait - Result: Clean Started: - 13:12:36.422083 Duration: 4.032 ms
----------
          ID: os-autoinst-openvswitch
    Function: service.running
      Result: False
     Comment: One or more requisite failed: openqa.openvswitch./etc/sysconfig/network/ifcfg-br1, openqa.openvswitch./etc/sysconfig/os-autoinst-openvswitch
     Started: 13:12:36.439003
    Duration: 0.022 ms
     Changes:   
----------
          ID: /var/log/openvswitch
    Function: file.directory
      Result: False
     Comment: User openvswitch is not available Group openvswitch is not available
     Started: 13:12:36.439272
    Duration: 6.155 ms
     Changes:   
----------
          ID: /etc/logrotate.d/openvswitch
    Function: file.line
      Result: False
     Comment: /etc/logrotate.d/openvswitch: file not found
     Started: 13:12:36.445890
    Duration: 3.474 ms
     Changes:   
  Name: /etc/sysconfig/openvswitch - Function: file.replace - Result: Clean Started: - 13:12:36.449713 Duration: 3.727 ms
----------
          ID: /etc/openvswitch
    Function: file.directory
      Result: False
     Comment: User openvswitch is not available Group openvswitch is not available
     Started: 13:12:36.453948
    Duration: 6.016 ms
     Changes:   
----------
          ID: ovsdb-server.service
    Function: service.running
      Result: False
     Comment: One or more requisite failed: openqa.openvswitch_boo1181418./etc/openvswitch
     Started: 13:12:36.467677
    Duration: 0.035 ms
     Changes:   
----------
          ID: ovs-vswitchd.service
    Function: service.running
      Result: False
     Comment: One or more requisite failed: openqa.openvswitch_boo1181418./etc/openvswitch
     Started: 13:12:36.475265
    Duration: 0.026 ms
     Changes:   
----------
          ID: os-autoinst-openvswitch.service
    Function: service.running
      Result: False
     Comment: One or more requisite failed: openqa.openvswitch_boo1181418./etc/openvswitch
     Started: 13:12:36.483816
    Duration: 0.038 ms
     Changes:   
  Name: /etc/dbus-1/system.d/system-local.conf - Function: file.managed - Result: Clean Started: - 13:12:36.484205 Duration: 8.42 ms
  Name: /usr/local/bin/recover-nfs.sh - Function: file.managed - Result: Clean Started: - 13:12:36.493071 Duration: 76.437 ms
  Name: /etc/systemd/system/recover-nfs.service - Function: file.managed - Result: Clean Started: - 13:12:36.569789 Duration: 64.402 ms
  Name: service.systemctl_reload - Function: module.run - Result: Clean Started: - 13:12:36.637134 Duration: 0.009 ms
  Name: /etc/systemd/system/recover-nfs.timer - Function: file.managed - Result: Clean Started: - 13:12:36.637235 Duration: 74.057 ms
  Name: service.systemctl_reload - Function: module.run - Result: Clean Started: - 13:12:36.716745 Duration: 0.017 ms
  Name: recover-nfs.timer - Function: service.running - Result: Clean Started: - 13:12:36.716987 Duration: 55.305 ms
  Name: /etc/systemd/system/auto-update.service - Function: file.managed - Result: Clean Started: - 13:12:36.772773 Duration: 66.196 ms
  Name: service.systemctl_reload - Function: module.run - Result: Clean Started: - 13:12:36.842837 Duration: 0.019 ms
  Name: /etc/systemd/system/auto-update.timer - Function: file.managed - Result: Clean Started: - 13:12:36.843123 Duration: 74.227 ms
  Name: service.systemctl_reload - Function: module.run - Result: Clean Started: - 13:12:36.919046 Duration: 0.008 ms
  Name: auto-update.timer - Function: service.running - Result: Clean Started: - 13:12:36.920589 Duration: 52.616 ms

Summary for martchus-test-vm
--------------
Succeeded: 292 (changed=3)
Failed:     23

So I guess it is very easy to get a machine into a half-set-up state, and the firewall config is quite at the end of the dependency chain. Maybe the person who set up the Prague-located worker hadn't tried to apply the high state manually and thus didn't realize that there were still errors - because on the VM itself systemctl --failed lists nothing. So the minion service is running just fine and no other services are failing (they are just deactivated when they should actually be active, like telegraf, which is currently still missing).
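
Since systemctl --failed stays clean in such a half-set-up state, leftover failures are only visible in the salt output itself. A quick way to spot them (a sketch; run either directly on the minion or from the master, <minion-id> being a placeholder):

# on the minion: re-apply the high state and show only the failing states
salt-call state.apply 2>&1 | grep -B1 -A4 'Result: False'

# or from the master, with condensed per-state output
salt --state-output=terse '<minion-id>' state.apply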

Actions #10

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

Maybe calling state.apply manually - possibly multiple times - is really what one is supposed to do to speed things up, as the whole state application can fail for various reasons. Likely the most common reason is zypper running into download errors, which then also blocks lots of other things depending on it. Now only one failure is remaining:

----------
          ID: btrfs-nocow
    Function: cmd.run
        Name: chattr +C /var/lib/openqa/cache && touch /var/lib/openqa/cache/.nocow
      Result: False
     Comment: Command "chattr +C /var/lib/openqa/cache && touch /var/lib/openqa/cache/.nocow" run
     Started: 13:43:43.423041
    Duration: 372.98 ms
     Changes:   
              ----------
              pid:
                  4332
              retcode:
                  127
              stderr:
                  /bin/sh: chattr: command not found
              stdout:
  Name: /etc/default/grub - Function: file.replace - Result: Clean Started: - 13:43:43.796525 Duration: 8.944 ms

That means the firewall config is in place as it should be. Firewalld has also been restarted, so it should be effective. Running commands like firewall-cmd --get-default-zone confirms that as well.
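
Spelled out, the verification boils down to standard firewall-cmd invocations like:

firewall-cmd --get-default-zone        # expected to print: trusted
firewall-cmd --get-active-zones        # the worker's interfaces should be listed under trusted
firewall-cmd --zone=trusted --list-all # the zone's target should be ACCEPT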

So maybe when setting up that worker in Prague some errors were just overlooked (maybe the number of errors was not as big as in my case, so it wasn't as obvious). I can only say that in general the firewall config can be applied correctly via our salt states.

MR for fixing the problem with chattr: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/887
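
The chattr failure itself is easy to work around manually, since on openSUSE chattr is shipped with e2fsprogs (the MR may of course solve it differently, e.g. by declaring the package dependency in the state):

command -v chattr || zypper --non-interactive install e2fsprogs
chattr +C /var/lib/openqa/cache && touch /var/lib/openqa/cache/.nocow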

Actions #11

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

The MR has been merged.
I've also just removed the test VM again from OSD.

I think this ticket can be resolved with the explanation that errors when applying salt states on that worker were not handled correctly. In general our salt states configure the firewall so that all incoming connections are allowed.

Actions #12

Updated by okurz over 1 year ago

  • Related to action #124562: Re-install at least one of the new OSD workers located in Prague added
Actions #13

Updated by okurz over 1 year ago

Ok, good. I think the rest can be covered in #124562
