action #163840 (closed)
openqaworker18 shows "unable to upgrade ws to command server" size:S
Added by okurz 5 months ago. Updated 5 months ago.
Description
Observation
Jobs on openqaworker18, e.g. https://openqa.suse.de/tests/14912093#live, as well as others on that host, showed "unable to upgrade ws to command server", e.g. see the attached screenshot.
That of course is only visible while tests are running. Other jobs on other workers at the same time work fine.
Suggestions
- Look into worker log files and the system journal on the host (see the command sketch after this list)
- Follow the documentation: https://open.qa/docs/#debugdevelmode
- If you are lazy, just reboot the machine and check jobs after that
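A minimal sketch for the first two suggestions, assuming worker instance 3 (derived from the port 20033 seen in later comments) and the openqa-worker@N unit naming; depending on the setup the unit may instead be called openqa-worker-auto-restart@N:

# on openqaworker18: check worker state and recent log messages
sudo systemctl status openqa-worker@3
sudo journalctl -u openqa-worker@3 --since today | grep -iE 'command server|develop'
# system journal, warnings and above, for the current boot
sudo journalctl -b -p warning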
Files
Screenshot_20240712_111051_openqaworker18_unable_to_upgrade_ws_to_command_server.png (77.7 KB), okurz, 2024-07-12 09:13
clipboard-202407252014-hhmgr.png (64.3 KB), nicksinger, 2024-07-25 18:14
Updated by nicksinger 5 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by nicksinger 5 months ago
Connection to openqa-livehandler is working, checking its log on OSD shows:
Jul 24 15:54:47 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] Asking the worker 2289 to start providing livestream for job 15000414
Jul 24 15:54:47 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] Asking worker 2289 to start providing livestream for job 15000414
Jul 24 15:54:47 openqa openqa-livehandler-daemon[10625]: [debug] attempt to open ws proxy for job sle-15-SP4-Server-DVD-Updates-x86_64-Build20240721-1-stack_tests_php@64bit (15000414) where URL to os-autoinst command server is unknown
Jul 24 15:55:00 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] Asking worker 2289 to stop providing livestream for job 15000414
Jul 24 15:55:01 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] Asking the worker 2289 to start providing livestream for job 15000414
Jul 24 15:55:01 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] Asking worker 2289 to start providing livestream for job 15000414
Jul 24 15:55:27 openqa openqa-livehandler-daemon[10625]: [debug] [pid:10625] connecting to os-autoinst command server for job 15000414 at ws://openqaworker18.qa.suse.cz:20033/fIRMvu5qm_jM_N14/ws
The last message then repeats every second. I will follow up on the message "attempt to open ws proxy for job sle-15-SP4-Server-DVD-Updates-x86_64-Build20240721-1-stack_tests_php@64bit (15000414) where URL to os-autoinst command server is unknown" and try to understand how to fix this, keeping https://open.qa/docs/#debugdevelmode in mind.
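For reference, that log can be followed live while reloading the job's live view; a sketch, assuming the openqa-livehandler unit name matches the daemon seen in the log above:

# on openqa.suse.de: follow the livehandler while opening the /live tab
sudo journalctl -fu openqa-livehandler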
Updated by nicksinger 5 months ago
From my JS console:
Received message via ws proxy: {"data":{"vnc_arg":"openqaworker18.qa.suse.cz:5993"},"type":"info","what":"connecting to os-autoinst command server at ws:\/\/openqaworker18.qa.suse.cz:20033\/fIRMvu5qm_jM_N14\/ws"} test_result.js:302:9
Received message via ws proxy: {"data":null,"type":"error","what":"unable to upgrade ws to command server"} test_result.js:302:9
Error from ws proxy: unable to upgrade ws to command server test_result.js:312:9
Connection to livehandler lost
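The upgrade can also be probed without the browser; a sketch of a manual WebSocket handshake with curl, assuming the job is still running and the URL/token from the log above are still valid (a hanging connection would point at the network rather than the client):

curl -i -N \
  -H "Connection: Upgrade" -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Key: $(head -c16 /dev/urandom | base64)" \
  -H "Sec-WebSocket-Version: 13" \
  http://openqaworker18.qa.suse.cz:20033/fIRMvu5qm_jM_N14/ws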
Not sure yet what this "command server" is supposed to be. Maybe it is spawned by the worker alongside os-autoinst (the naming would indicate this). Services on other workers don't include it as a long-running service:
openqaworker17:~ # systemctl | grep -i openqa | grep -vi worker
var-lib-openqa.mount loaded active mounted /var/lib/openqa
openqa.slice loaded active active Slice /openqa
worker35:~ # systemctl | grep -i openqa | grep -vi worker
var-lib-openqa-share.automount loaded active running var-lib-openqa-share.automount
var-lib-openqa-share.mount loaded active mounted /var/lib/openqa/share
var-lib-openqa.mount loaded active mounted /var/lib/openqa
openqa.slice
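If it really is spawned per job alongside os-autoinst, it should only show up as a listening socket on the worker while a job runs; a quick check, with the port taken from the debug messages above:

# on openqaworker18: look for a listener on the command server port
sudo ss -tlnp | grep ':20033'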
Updated by nicksinger 5 months ago
The documented port formula $worker_instance_number * 10 + 20002 + 1 matches what I see in all debug messages (instance 3 gives 3 * 10 + 20002 + 1 = 20033). To test, I ran nc -l -p 20033 on worker18 and tried to connect from worker17 (same network); the connection instantly ends. Double-checking with nmap:
openqaworker17:~ # nmap -p 20033 openqaworker18.qa.suse.cz
Starting Nmap 7.92 ( https://nmap.org ) at 2024-07-24 17:23 CEST
Nmap scan report for openqaworker18.qa.suse.cz (10.100.96.76)
Host is up (0.00017s latency).
PORT STATE SERVICE
20033/tcp filtered unknown
MAC Address: 3C:EC:EF:93:AA:C8 (Super Micro Computer)
Nmap done: 1 IP address (1 host up) scanned in 0.18 seconds
To crosscheck, I ran the same nc command on worker17 and connected from worker18 (nc is reachable, the port is reported "open" by nmap) as well as from the engineering VPN (which also works). So the culprit has to be firewalld running on openqaworker18. I will now try to find out what is supposed to apply these changes (I assume our salt states) and why it did not do so on worker18.
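To find where this is supposed to be configured, grepping the salt states is a reasonable start; a sketch, assuming a local checkout of the repository:

# in a clone of https://gitlab.suse.de/openqa/salt-states-openqa
grep -rn 'firewall' --include='*.sls' .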
Updated by nicksinger 5 months ago
I guess somewhere in here we do this configuration: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/worker.sls#L221-251. firewall-cmd --list-all-zones shows "interfaces: br1 ovs-system" for the trusted zone, but on worker18 that should be "interfaces: br1 eth0 ovs-system": /etc/firewalld/zones/trusted.xml contains:
openqaworker18:/etc/firewalld/zones # cat trusted.xml
<?xml version="1.0" encoding="utf-8"?>
<zone target="ACCEPT">
<short>Trusted</short>
<description>All network connections are accepted.</description>
<interface name="br1"/>
<interface name="ovs-system"/>
<interface name="eth0"/>
<masquerade/>
<forward/>
</zone>
So something didn't apply our (correctly) deployed configuration. I am checking salt further to see if I can spot where this should happen, or if we maybe miss it completely and just never encountered the problem because a reboot would most likely fix it automatically (worker18 has been up for 24 days).
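Such a runtime/permanent mismatch can also be spotted without reading the zone files; a sketch using standard firewall-cmd options:

# compare runtime vs. permanent zone assignment of the interface
firewall-cmd --get-zone-of-interface=eth0
firewall-cmd --permanent --get-zone-of-interface=eth0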
Updated by nicksinger 5 months ago
While trying to come up with a proper reload command I noticed that eth0 on worker18 is in the public zone:
public (active)
target: default
icmp-block-inversion: no
interfaces: eth0
sources:
services: dhcpv6-client ssh
ports: 5991-6120/tcp
protocols:
forward: no
masquerade: yes
forward-ports:
source-ports:
icmp-blocks:
rich rules:
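The manual fix presumably boils down to moving eth0 into the trusted zone both at runtime and permanently; a sketch, not the literal commands used here:

firewall-cmd --zone=trusted --change-interface=eth0
firewall-cmd --permanent --zone=trusted --change-interface=eth0
firewall-cmd --reload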
Updated by openqa_review 5 months ago
- Due date set to 2024-08-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 5 months ago
- File clipboard-202407252014-hhmgr.png added
- Status changed from In Progress to Feedback
The firewall change did the trick. I tried to improve our salt states, but the solution might be considered too ugly for such an edge case: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1236
Unfortunately I have no idea why this worker was misconfigured. It could have been previous experiments or a leftover from the installation. I checked all other minions we operate and cannot find the same issue on them:
openqa:~ # salt '*' cmd.run 'grep -r "interface" /etc/firewalld/zones/*.xml | sort | uniq -c | grep -v "\s*1\ " || true'
s390zl12.oqa.prg2.suse.org:
s390zl13.oqa.prg2.suse.org:
ada.qe.prg2.suse.org:
osiris-1.qe.nue2.suse.org:
grep: /etc/firewalld/zones/*.xml: No such file or directory
storage.qe.prg2.suse.org:
openqaworker16.qa.suse.cz:
openqaworker17.qa.suse.cz:
openqaworker18.qa.suse.cz:
worker35.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
sapworker3.qe.nue2.suse.org:
worker40.oqa.prg2.suse.org:
worker30.oqa.prg2.suse.org:
qesapworker-prg6.qa.suse.cz:
worker32.oqa.prg2.suse.org:
worker29.oqa.prg2.suse.org:
qesapworker-prg4.qa.suse.cz:
openqa.suse.de:
grep: /etc/firewalld/zones/*.xml: No such file or directory
sapworker2.qe.nue2.suse.org:
sapworker1.qe.nue2.suse.org:
worker-arm1.oqa.prg2.suse.org:
openqaworker1.qe.nue2.suse.org:
worker34.oqa.prg2.suse.org:
qamaster.qe.nue2.suse.org:
grep: /etc/firewalld/zones/*.xml: No such file or directory
qesapworker-prg5.qa.suse.cz:
baremetal-support.qe.nue2.suse.org:
grep: /etc/firewalld/zones/*.xml: No such file or directory
openqaworker14.qa.suse.cz:
schort-server.qe.nue2.suse.org:
grep: /etc/firewalld/zones/*.xml: No such file or directory
jenkins.qe.nue2.suse.org:
grep: /etc/firewalld/zones/*.xml: No such file or directory
unreal6.qe.nue2.suse.org:
qesapworker-prg7.qa.suse.cz:
worker-arm2.oqa.prg2.suse.org:
openqaw5-xen.qe.prg2.suse.org:
backup-vm.qe.nue2.suse.org:
grep: /etc/firewalld/zones/*.xml: No such file or directory
monitor.qe.nue2.suse.org:
grep: /etc/firewalld/zones/*.xml: No such file or directory
diesel.qe.nue2.suse.org:
openqa-piworker.qe.nue2.suse.org:
mania.qe.nue2.suse.org:
petrol.qe.nue2.suse.org:
grenache-1.oqa.prg2.suse.org:
tumblesle.qe.nue2.suse.org:
grep: /etc/firewalld/zones/*.xml: No such file or directory
imagetester.qe.nue2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
backup-qam.qe.nue2.suse.org:
grep: /etc/firewalld/zones/*.xml: No such file or directory
Updated by nicksinger 5 months ago
- Status changed from Feedback to Resolved
As the machine was already manually fixed, I now validated my changes in salt again by moving the interface back into the public zone, reloading firewalld and then executing a highstate on that host:
----------
ID: firewalld_zones
Function: file.managed
Name: /etc/firewalld/zones/trusted.xml
Result: True
Comment: File /etc/firewalld/zones/trusted.xml updated
Started: 08:49:17.637235
Duration: 441.143 ms
Changes:
----------
diff:
---
+++
@@ -4,6 +4,7 @@
<description>All network connections are accepted.</description>
<interface name="br1"/>
<interface name="ovs-system"/>
+ <interface name="eth0"/>
<masquerade/>
<forward/>
</zone>
----------
ID: firewalld
Function: service.running
Result: True
Comment: Service restarted
Started: 08:49:18.164322
Duration: 2064.715 ms
Changes:
----------
firewalld:
True
----------
ID: move_interface
Function: cmd.run
Name: sed -i '/name="eth0"/d' /etc/firewalld/zones/*.xml; firewall-cmd --reload; firewall-cmd --zone=trusted --change-interface=eth0 --permanent
Result: True
Comment: Command "sed -i '/name="eth0"/d' /etc/firewalld/zones/*.xml; firewall-cmd --reload; firewall-cmd --zone=trusted --change-interface=eth0 --permanent" run
Started: 08:49:20.229646
Duration: 1443.066 ms
Changes:
----------
pid:
58225
retcode:
0
stderr:
stdout:
success
success
which I can also see on the machine itself:
openqaworker18:/etc/firewalld/zones # grep -i inter *.xml
trusted.xml: <interface name="br1"/>
trusted.xml: <interface name="ovs-system"/>
trusted.xml: <interface name="eth0"/>
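The runtime state can be cross-checked the same way as before; assuming the firewall-cmd queries used earlier:

# eth0 should now show up under the trusted zone
firewall-cmd --get-active-zones
firewall-cmd --get-zone-of-interface=eth0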
all done then.