action #163840 (closed)

openqaworker18 shows "unable to upgrade ws to command server" size:S

Added by okurz 15 days ago. Updated 1 day ago.

Status: Resolved
Priority: Normal
Assignee: nicksinger
Category: Regressions/Crashes
Target version:
Start date: 2024-07-12
Due date: 2024-08-08
% Done: 0%
Estimated time:

Description

Observation

Jobs on openqaworker18, e.g. https://openqa.suse.de/tests/14912093#live as well as others on that host showed "unable to upgrade ws to command server", e.g. see

Screenshot_20240712_111051_openqaworker18_unable_to_upgrade_ws_to_command_server.png

That of course is only visible while tests are running. Other jobs on other workers at the same time work fine.

Suggestions

  • Look into worker log files and the system journal on the host (see the sketch below)
  • Follow the documentation: https://open.qa/docs/#debugdevelmode
  • If you are lazy, just reboot the machine and check jobs after that
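
For the first suggestion, a minimal sketch of where to look (instance number 3 is only an example, and the exact unit names, e.g. openqa-worker@ vs. openqa-worker-auto-restart@, depend on the worker setup):

# on openqaworker18: follow the journal of the worker instance that ran the job
sudo journalctl -f -u 'openqa-worker*@3'
# or skim the whole journal around the time the live view failed
sudo journalctl --since "2024-07-12" | grep -i -e openqa -e os-autoinst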

Actions #1

Updated by okurz 15 days ago

  • Subject changed from openqaworker18 shows "unable to upgrade ws to command server" to openqaworker18 shows "unable to upgrade ws to command server" size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by nicksinger 3 days ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #3

Updated by nicksinger 3 days ago

Connection to openqa-livehandler is working, checking its log on OSD shows:

Jul 24 15:54:47 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] Asking the worker 2289 to start providing livestream for job 15000414
Jul 24 15:54:47 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] Asking worker 2289 to start providing livestream for job 15000414
Jul 24 15:54:47 openqa openqa-livehandler-daemon[10625]: [debug] attempt to open ws proxy for job sle-15-SP4-Server-DVD-Updates-x86_64-Build20240721-1-stack_tests_php@64bit (15000414) where URL to os-autoinst command server is unknown
Jul 24 15:55:00 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] Asking worker 2289 to stop providing livestream for job 15000414
Jul 24 15:55:01 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] Asking the worker 2289 to start providing livestream for job 15000414
Jul 24 15:55:01 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] Asking worker 2289 to start providing livestream for job 15000414
Jul 24 15:55:27 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] connecting to os-autoinst command server for job 15000414 at ws://openqaworker18.qa.suse.cz:20033/fIRMvu5qm_jM_N14/ws

The last message then repeats every second. I will follow up on "attempt to open ws proxy for job sle-15-SP4-Server-DVD-Updates-x86_64-Build20240721-1-stack_tests_php@64bit (15000414) where URL to os-autoinst command server is unknown" and try to understand how to fix this, keeping https://open.qa/docs/#debugdevelmode in mind.
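
To poke at this from the outside, a raw WebSocket upgrade request with curl might help, under the assumption that the command server really listens on that port (a sketch; the token in the URL is taken from the log above and is only valid while job 15000414 is running):

# probe from OSD or another host in the same network
curl -i -N \
  -H "Connection: Upgrade" -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" -H "Sec-WebSocket-Version: 13" \
  http://openqaworker18.qa.suse.cz:20033/fIRMvu5qm_jM_N14/ws
# a reachable command server should answer "HTTP/1.1 101 Switching Protocols";
# a hang or "connection refused" would point at network or firewall problems instead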

Actions #4

Updated by nicksinger 3 days ago

From my JS console

Received message via ws proxy: {"data":{"vnc_arg":"openqaworker18.qa.suse.cz:5993"},"type":"info","what":"connecting to os-autoinst command server at ws:\/\/openqaworker18.qa.suse.cz:20033\/fIRMvu5qm_jM_N14\/ws"} test_result.js:302:9
Received message via ws proxy: {"data":null,"type":"error","what":"unable to upgrade ws to command server"} test_result.js:302:9
Error from ws proxy: unable to upgrade ws to command server test_result.js:312:9
Connection to livehandler lost

Not sure yet what this "command server" is supposed to be. Maybe it is spawned by the worker alongside os-autoinst (the naming would indicate this). Services on other workers don't include it as a long-running service:

openqaworker17:~ # systemctl  | grep -i openqa | grep -vi worker
  var-lib-openqa.mount                                                                 loaded active     mounted      /var/lib/openqa
  openqa.slice                                                                         loaded active     active       Slice /openqa
worker35:~ # systemctl  | grep -i openqa | grep -vi worker
  var-lib-openqa-share.automount                                                 loaded active running   var-lib-openqa-share.automount
  var-lib-openqa-share.mount                                                     loaded active mounted   /var/lib/openqa/share
  var-lib-openqa.mount                                                           loaded active mounted   /var/lib/openqa
  openqa.slice
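
Since the command server is presumably spawned per job by os-autoinst rather than running as a systemd unit, a quick sanity check would be to look on the worker itself whether anything listens on the port from the log (a sketch, to be run while job 15000414 is still active):

# on openqaworker18
ss -tlnp | grep 20033
# or equivalently
lsof -nP -iTCP:20033 -sTCP:LISTEN
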
Actions #5

Updated by nicksinger 3 days ago

The documented port formula $worker_instance_number * 10 + 20002 + 1 matches what I see in all debug messages. To test, I ran nc -l -p 20033 on worker18 and tried to connect from worker17 (same network), and the connection instantly ended. Double-checking with nmap:

openqaworker17:~ # nmap -p 20033 openqaworker18.qa.suse.cz
Starting Nmap 7.92 ( https://nmap.org ) at 2024-07-24 17:23 CEST
Nmap scan report for openqaworker18.qa.suse.cz (10.100.96.76)
Host is up (0.00017s latency).

PORT      STATE    SERVICE
20033/tcp filtered unknown
MAC Address: 3C:EC:EF:93:AA:C8 (Super Micro Computer)

Nmap done: 1 IP address (1 host up) scanned in 0.18 seconds

To crosscheck, I ran the same nc command on worker17 and checked from worker18 (nc is reachable, the port is reported as "open" by nmap) as well as from the engineering VPN (which also works). So the culprit has to be firewalld running on openqaworker18. I will now try to find what is supposed to apply these changes (I assume our salt states) and why it did not do so on worker18.
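
For reference, the documented port formula indeed yields the port seen in the logs, assuming the job ran on worker instance 3:

# command server port = worker_instance_number * 10 + 20002 + 1
instance=3
echo $(( instance * 10 + 20002 + 1 ))   # prints 20033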

Actions #6

Updated by nicksinger 3 days ago

I guess we do this configuration somewhere in here: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/worker.sls#L221-251 - firewall-cmd --list-all-zones shows "interfaces: br1 ovs-system" but on worker18 it shows "interfaces: br1 eth0 ovs-system". /etc/firewalld/zones/trusted.xml contains:

openqaworker18:/etc/firewalld/zones # cat trusted.xml
<?xml version="1.0" encoding="utf-8"?>
<zone target="ACCEPT">
  <short>Trusted</short>
  <description>All network connections are accepted.</description>
  <interface name="br1"/>
  <interface name="ovs-system"/>
  <interface name="eth0"/>
  <masquerade/>
  <forward/>
</zone>

So something didn't apply our (correctly) deployed configuration. I am checking salt further to see if I can spot where this should happen, or if we maybe just miss it completely and never encountered it because a reboot would most likely fix it automatically (worker18 has been up for 24 days).
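
To see whether the permanent zone files and the runtime state of firewalld have diverged on worker18, comparing both views should help; a sketch:

# permanent configuration (what salt deploys to /etc/firewalld/zones/)
firewall-cmd --permanent --zone=trusted --list-interfaces
# runtime configuration (what is actually enforced right now)
firewall-cmd --zone=trusted --list-interfaces
# a reload re-applies the permanent configuration
firewall-cmd --reload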

Actions #7

Updated by nicksinger 3 days ago

While trying to come up with a proper reload command I noticed that eth0 on worker18 is in the public zone:

public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources:
  services: dhcpv6-client ssh
  ports: 5991-6120/tcp
  protocols:
  forward: no
  masquerade: yes
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:
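
Given that, the manual fix boils down to moving eth0 from the public into the trusted zone, roughly like this (the same commands show up later in the salt state output in #10):

firewall-cmd --permanent --zone=trusted --change-interface=eth0
firewall-cmd --reload
# verify the result
firewall-cmd --get-active-zones
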
Actions #8

Updated by openqa_review 3 days ago

  • Due date set to 2024-08-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by nicksinger 2 days ago

The firewall change did the trick. I tried to improve our salt states, but the solution might be considered too ugly for such an edge case: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1236
I unfortunately have no idea why this worker was misconfigured. It could have been previous experiments or a left-over from the installation. I checked all other minions we operate and cannot find the same issue on any of them:

openqa:~ # salt '*' cmd.run 'grep -r "interface" /etc/firewalld/zones/*.xml | sort | uniq -c | grep -v "\s*1\ " || true'
s390zl12.oqa.prg2.suse.org:
s390zl13.oqa.prg2.suse.org:
ada.qe.prg2.suse.org:
osiris-1.qe.nue2.suse.org:
    grep: /etc/firewalld/zones/*.xml: No such file or directory
storage.qe.prg2.suse.org:
openqaworker16.qa.suse.cz:
openqaworker17.qa.suse.cz:
openqaworker18.qa.suse.cz:
worker35.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
sapworker3.qe.nue2.suse.org:
worker40.oqa.prg2.suse.org:
worker30.oqa.prg2.suse.org:
qesapworker-prg6.qa.suse.cz:
worker32.oqa.prg2.suse.org:
worker29.oqa.prg2.suse.org:
qesapworker-prg4.qa.suse.cz:
openqa.suse.de:
    grep: /etc/firewalld/zones/*.xml: No such file or directory
sapworker2.qe.nue2.suse.org:
sapworker1.qe.nue2.suse.org:
worker-arm1.oqa.prg2.suse.org:
openqaworker1.qe.nue2.suse.org:
worker34.oqa.prg2.suse.org:
qamaster.qe.nue2.suse.org:
    grep: /etc/firewalld/zones/*.xml: No such file or directory
qesapworker-prg5.qa.suse.cz:
baremetal-support.qe.nue2.suse.org:
    grep: /etc/firewalld/zones/*.xml: No such file or directory
openqaworker14.qa.suse.cz:
schort-server.qe.nue2.suse.org:
    grep: /etc/firewalld/zones/*.xml: No such file or directory
jenkins.qe.nue2.suse.org:
    grep: /etc/firewalld/zones/*.xml: No such file or directory
unreal6.qe.nue2.suse.org:
qesapworker-prg7.qa.suse.cz:
worker-arm2.oqa.prg2.suse.org:
openqaw5-xen.qe.prg2.suse.org:
backup-vm.qe.nue2.suse.org:
    grep: /etc/firewalld/zones/*.xml: No such file or directory
monitor.qe.nue2.suse.org:
    grep: /etc/firewalld/zones/*.xml: No such file or directory
diesel.qe.nue2.suse.org:
openqa-piworker.qe.nue2.suse.org:
mania.qe.nue2.suse.org:
petrol.qe.nue2.suse.org:
grenache-1.oqa.prg2.suse.org:
tumblesle.qe.nue2.suse.org:
    grep: /etc/firewalld/zones/*.xml: No such file or directory
imagetester.qe.nue2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
backup-qam.qe.nue2.suse.org:
    grep: /etc/firewalld/zones/*.xml: No such file or directory
Actions #10

Updated by nicksinger 1 day ago

  • Status changed from Feedback to Resolved

As the machine was already fixed manually, I now validated my changes in salt again by moving the interface into the public zone, reloading firewalld and then executing a highstate on that host:

----------
          ID: firewalld_zones
    Function: file.managed
        Name: /etc/firewalld/zones/trusted.xml
      Result: True
     Comment: File /etc/firewalld/zones/trusted.xml updated
     Started: 08:49:17.637235
    Duration: 441.143 ms
     Changes:
              ----------
              diff:
                  ---
                  +++
                  @@ -4,6 +4,7 @@
                     <description>All network connections are accepted.</description>
                     <interface name="br1"/>
                     <interface name="ovs-system"/>
                  +  <interface name="eth0"/>
                     <masquerade/>
                     <forward/>
                   </zone>
----------
          ID: firewalld
    Function: service.running
      Result: True
     Comment: Service restarted
     Started: 08:49:18.164322
    Duration: 2064.715 ms
     Changes:
              ----------
              firewalld:
                  True
----------
          ID: move_interface
    Function: cmd.run
        Name: sed -i '/name="eth0"/d' /etc/firewalld/zones/*.xml; firewall-cmd --reload; firewall-cmd --zone=trusted --change-interface=eth0 --permanent
      Result: True
     Comment: Command "sed -i '/name="eth0"/d' /etc/firewalld/zones/*.xml; firewall-cmd --reload; firewall-cmd --zone=trusted --change-interface=eth0 --permanent" run
     Started: 08:49:20.229646
    Duration: 1443.066 ms
     Changes:
              ----------
              pid:
                  58225
              retcode:
                  0
              stderr:
              stdout:
                  success
                  success

which I can also see on the machine itself:

openqaworker18:/etc/firewalld/zones # grep -i inter *.xml
trusted.xml:  <interface name="br1"/>
trusted.xml:  <interface name="ovs-system"/>
trusted.xml:  <interface name="eth0"/>

All done then.
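
For the record, the validation described above corresponds to roughly the following steps (a sketch; the minion ID is taken from the salt output in #9):

# on openqaworker18: re-create the broken state
firewall-cmd --permanent --zone=public --change-interface=eth0
firewall-cmd --reload
# on OSD (the salt master): let the highstate fix it again
salt 'openqaworker18.qa.suse.cz' state.highstate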
