action #45848
developer mode on o3 broken - "unable to upgrade ws to command server" - except on openqaworker4
Added by okurz about 6 years ago. Updated almost 6 years ago.
Description
Observation
https://i.imgur.com/fxeV3kT.png
The local browser web console outputs repeatedly:
Connection to livehandler lost
test_result.js:167:1
Establishing ws connection to wss://openqa.opensuse.org/liveviewhandler/tests/827215/developer/ws-proxy/status
test_result.js:176:28
Received message via ws proxy: {"data":null,"type":"info","what":"connecting to os-autoinst command server at ws:\/\/openqaworker1:20163\/xmLOPocY2ivfpeaz\/ws"}
test_result.js:172:1
Received message via ws proxy: {"data":null,"type":"error","what":"unable to upgrade ws to command server"}
test_result.js:172:1
Error from ws proxy: unable to upgrade ws to command server
test_result.js:181:1
The connection to wss://openqa.opensuse.org/liveviewhandler/tests/827215/developer/ws-proxy/status was interrupted while the page was loading.
Steps to reproduce
Go to any running job on o3 except on openqaworker4 where the developer mode seems to work fine, e.g. openqaworker1 and imagetester show this problem.
Updated by okurz about 6 years ago
Hm, I just found that WORKER_HOSTNAME is set only on openqaworker4, right?
Updated by okurz about 6 years ago
- Status changed from New to In Progress
- Assignee set to okurz
OK, I will try to handle this by setting the IP address for each worker consistently.
Updated by mkravec about 6 years ago
Just a guess, but I had this issue on my remote worker because openqaworker1:20163 should be openqaworker1.arch.suse.de:20163
Was solved in my case by setting WORKER_HOSTNAME = openqaworker1.arch.suse.de on workers.ini
Updated by mkittler about 6 years ago
I can not check the workers for incorrect settings because I don't have SSH access to the workers.
If WORKER_HOSTNAME is not set, the web UI falls back to using the hostname it knows by itself, here openqaworker1 (the console log shows that). I also used that hostname to attempt logging in via SSH and did not get "Could not resolve hostname ..." but instead was stuck at the password prompt. So I suppose the hostname is correct.
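The fallback described above can be mimicked when inspecting a worker's configuration. This is a minimal sketch, not openQA code: `effective_worker_host` is a hypothetical helper name, and it assumes WORKER_HOSTNAME appears as a plain `KEY = value` line in workers.ini. The fallback name is supplied by the caller.

```shell
# effective_worker_host INI_FILE FALLBACK
# Prints the last WORKER_HOSTNAME value found in INI_FILE, or FALLBACK if the
# key is absent, commented out, or empty. Hypothetical helper, not part of openQA.
effective_worker_host() {
    value=$(sed -n 's/^[[:space:]]*WORKER_HOSTNAME[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1)
    printf '%s\n' "${value:-$2}"
}

# Intended use on a worker (not executed here):
#   effective_worker_host /etc/openqa/workers.ini openqaworker1
```

If this prints the bare fallback name, the web UI has to be able to resolve and reach that name on its own.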
The next step would be checking the firewall (http://open.qa/docs/#_steps_to_debug_developer_mode_setup). A netstat on the worker might be helpful as well to see whether the websocket server is actually listening on that particular port.
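To check the right port, the host and port can be taken from the command server URL in the ws proxy message. A minimal sketch, assuming the URL has the `ws://host:port/...` form seen in the console log; `parse_cmd_srv` is a hypothetical helper name and only splits the string.

```shell
# parse_cmd_srv URL
# Prints "HOST PORT" for a URL of the form ws://HOST:PORT/path.
# Hypothetical helper for illustration only.
parse_cmd_srv() {
    hostport=${1#ws://}        # strip the scheme
    hostport=${hostport%%/*}   # strip everything from the first slash on
    printf '%s %s\n' "${hostport%%:*}" "${hostport##*:}"
}

# Intended use on the worker (not executed here):
#   parse_cmd_srv 'ws://openqaworker1:20163/xmLOPocY2ivfpeaz/ws'   # host and port
#   netstat -tln | grep ':20163 '                                  # is anything listening?
```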
Updated by mkittler about 6 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
Updated by okurz about 6 years ago
- Checked with
for i in power8 aarch64 imagetester openqaworker1 openqaworker4 ; do echo $i && ssh root@$i "grep -C 3 WORKER_HOSTNAME /etc/openqa/workers.ini"; done
- Set WORKER_HOSTNAME to the local IP
- Restarted all workers with
for i in power8 aarch64 imagetester openqaworker1 openqaworker4 ; do echo $i && ssh root@$i "systemctl restart openqa-worker.target"; done
- Checked on the web UI by going to "running" jobs and seeing whether the developer mode can be triggered -> Result is the same as before, it only works on openqaworker4.
So on to check the connection attempts and firewall.
I see the websockets server there on the worker.
openqaworker4 has no firewall configured, openqaworker1 has it enabled. I assume this is where the problem lies.
nsinger has recently in his words changed the default-zone to trusted. So much for consistency:
for i in power8 aarch64 imagetester openqaworker1 openqaworker4 ; do echo -n "$i: " && ssh root@$i "firewall-cmd --get-default-zone"; done
power8: bash: firewall-cmd: command not found
aarch64: public
imagetester: trusted
openqaworker1: trusted
openqaworker4: FirewallD is not running
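The listing can also be filtered for hosts that do not report the expected zone. A minimal sketch, assuming the `host: value` line format produced by the loop above; `not_trusted` is a hypothetical helper name.

```shell
# not_trusted
# Reads "host: value" lines on stdin and prints every host whose value is
# anything other than "trusted" (including error messages).
not_trusted() {
    awk -F': ' '$2 != "trusted" { print $1 }'
}

# Intended use (not executed here); 2>&1 keeps error messages on the same line
# as the host name:
#   for i in power8 aarch64 imagetester openqaworker1 openqaworker4 ; do
#       echo -n "$i: " && ssh root@$i "firewall-cmd --get-default-zone"
#   done 2>&1 | not_trusted
```

For the output above this would list power8, aarch64 and openqaworker4.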
@nsinger could you take over?
Updated by okurz about 6 years ago
- Status changed from In Progress to Feedback
- Assignee changed from okurz to nicksinger
Updated by nicksinger about 6 years ago
Sure. I've now also changed aarch64 to trusted and after reloading the firewall the interactive mode works. The same applies to imagetester and openqaworker1, but I had to reload the firewall once again. So it seems like firewalld is not properly applying the default zone setting on start-up. Unfortunately, this is all I can find in the logs:
openqaworker1:~ # journalctl -fu firewalld
-- Logs begin at Thu 2019-01-10 03:32:43 CET. --
Jan 10 03:33:03 openqaworker1 systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 10 03:33:08 openqaworker1 systemd[1]: Started firewalld - dynamic firewall daemon.
Jan 10 12:01:58 openqaworker1 systemd[1]: Reloading firewalld - dynamic firewall daemon.
Jan 10 12:01:58 openqaworker1 systemd[1]: Reloaded firewalld - dynamic firewall daemon.
Updated by nicksinger about 6 years ago
issue on power8 is the following:
os-autoinst version "0.0" is incompatible, version "1.0" is required
so for this host we can safely assume that the firewall isn't a problem.
Updated by nicksinger about 6 years ago
I've rebooted openqaworker1 to prove my theory. Indeed after reboot the websocket connection is broken again.
iptables-save output after reboot with broken connection.
ip6tables-save output after reboot with broken connection.
iptables-save output after reloading firewalld with working connection.
ip6tables-save output after reloading firewalld with working connection.
Not diffed yet and maybe not even helpful (I somehow have the impression that firewalld uses ebtables and no longer iptables).
Updated by okurz about 6 years ago
You might want to report a bug regarding "transactional-server" and firewall
Updated by nicksinger about 6 years ago
After a little more investigation I'm quite certain we face some bug here and created bsc#1121613.
Updated by okurz about 6 years ago
yep, looks good. So any idea for a workaround to apply for our workers?
Updated by nicksinger about 6 years ago
workaround in place on imagetester. I've created a service called bsc1121613-workaround.service:
[Unit]
Description=Workaround for bsc#1121613 - reload firewall after startup
After=firewalld.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/systemctl reload firewalld.service
[Install]
WantedBy=multi-user.target
I've also manually started and enabled it. After that the connection worked. Let us now wait for a reboot to see if the timing matters.
Updated by okurz almost 6 years ago
So the workaround was not enough because the service would trigger too early and actually kill the firewall for good.
What nsinger and I did:
- added a delay to the service file:
[Unit]
Description=Workaround for bsc#1121613 - reload firewall after startup
After=firewalld.service
[Service]
Type=oneshot
RemainAfterExit=yes
# Ensure the actual firewalld service is up before we reload
ExecStartPre=/bin/sleep 10
ExecStart=/usr/bin/systemctl reload firewalld.service
[Install]
WantedBy=multi-user.target
- tried it out on imagetester
- and copied to the other workers and activated:
scp root@imagetester:/etc/systemd/system/bsc1121613-workaround.service /tmp/
for i in power8 aarch64 openqaworker1 openqaworker4 ; do scp /tmp/bsc1121613-workaround.service root@$i:/etc/systemd/system/; done
for i in power8 aarch64 imagetester openqaworker1 openqaworker4; do echo $i && ssh root@$i "systemctl daemon-reload; systemctl restart firewalld; systemctl restart bsc1121613-workaround"; done
which works better. But now, today, I can see that the developer mode works on openqaworker1, power8, aarch64 but not openqaworker4. imagetester unknown.
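As an alternative to the fixed 10s delay, one could poll until firewalld reports that it is running before reloading. This is only a sketch under assumptions: `wait_until` is a hypothetical helper, and whether `firewall-cmd --state` succeeding is late enough to avoid the too-early reload described above has not been verified on these workers.

```shell
# wait_until TIMEOUT COMMAND...
# Retries COMMAND about once per second, up to TIMEOUT attempts.
# Returns 0 as soon as COMMAND succeeds, 1 if it never does.
wait_until() {
    remaining=$1
    shift
    while ! "$@" >/dev/null 2>&1; do
        remaining=$((remaining - 1))
        [ "$remaining" -le 0 ] && return 1
        sleep 1
    done
    return 0
}

# Intended use in place of the sleep (not executed here):
#   wait_until 30 firewall-cmd --state && systemctl reload firewalld.service
```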
Checking the status of the firewall and workaround service shows:
$ for i in power8 aarch64 imagetester openqaworker1 openqaworker4; do echo $i && ssh root@$i "systemctl status firewalld bsc1121613-workaround"; done
power8
● firewalld.service
Loaded: not-found (Reason: No such file or directory)
Active: inactive (dead)
● bsc1121613-workaround.service - Workaround for bsc#1121613 - reload firewall after startup
Loaded: loaded (/etc/systemd/system/bsc1121613-workaround.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2019-01-16 08:39:16 UTC; 1 weeks 1 days ago
Main PID: 150208 (code=exited, status=5)
Jan 16 08:39:06 power8 systemd[1]: Starting Workaround for bsc#1121613 - reload firewall after startup...
Jan 16 08:39:16 power8 systemctl[150208]: Failed to reload firewalld.service: Unit firewalld.service failed to load: No such file or directory.
Jan 16 08:39:16 power8 systemd[1]: bsc1121613-workaround.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jan 16 08:39:16 power8 systemd[1]: Failed to start Workaround for bsc#1121613 - reload firewall after startup.
Jan 16 08:39:16 power8 systemd[1]: bsc1121613-workaround.service: Unit entered failed state.
Jan 16 08:39:16 power8 systemd[1]: bsc1121613-workaround.service: Failed with result 'exit-code'.
aarch64
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: disabled)
Active: active (running)
Docs: man:firewalld(1)
Main PID: 1557 (firewalld)
Tasks: 2 (limit: 9830)
CGroup: /system.slice/firewalld.service
└─1557 /usr/bin/python3 -Es /usr/sbin/firewalld --nofork --nopid
Jan 24 03:34:23 openqa-aarch64 systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 24 03:34:24 openqa-aarch64 systemd[1]: Started firewalld - dynamic firewall daemon.
● bsc1121613-workaround.service - Workaround for bsc#1121613 - reload firewall after startup
Loaded: loaded (/etc/systemd/system/bsc1121613-workaround.service; disabled; vendor preset: disabled)
Active: inactive (dead)
imagetester
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2019-01-24 03:31:10 CET; 13h ago
Docs: man:firewalld(1)
Process: 2217 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
Main PID: 1228 (firewalld)
Tasks: 2 (limit: 4915)
CGroup: /system.slice/firewalld.service
└─1228 /usr/bin/python3 -Es /usr/sbin/firewalld --nofork --nopid
Jan 24 03:31:08 imagetester systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 24 03:31:10 imagetester systemd[1]: Started firewalld - dynamic firewall daemon.
Jan 24 03:31:20 imagetester systemd[1]: Reloading firewalld - dynamic firewall daemon.
Jan 24 03:31:20 imagetester systemd[1]: Reloaded firewalld - dynamic firewall daemon.
● bsc1121613-workaround.service - Workaround for bsc#1121613 - reload firewall after startup
Loaded: loaded (/etc/systemd/system/bsc1121613-workaround.service; enabled; vendor preset: disabled)
Active: active (exited) since Thu 2019-01-24 03:31:20 CET; 13h ago
Process: 2216 ExecStart=/usr/bin/systemctl reload firewalld.service (code=exited, status=0/SUCCESS)
Process: 1377 ExecStartPre=/bin/sleep 10 (code=exited, status=0/SUCCESS)
Main PID: 2216 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 4915)
CGroup: /system.slice/bsc1121613-workaround.service
Jan 24 03:31:10 imagetester systemd[1]: Starting Workaround for bsc#1121613 - reload firewall after startup...
Jan 24 03:31:20 imagetester systemd[1]: Started Workaround for bsc#1121613 - reload firewall after startup.
openqaworker1
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2019-01-24 03:33:50 CET; 13h ago
Docs: man:firewalld(1)
Main PID: 1695 (firewalld)
Tasks: 2 (limit: 4915)
CGroup: /system.slice/firewalld.service
└─1695 /usr/bin/python3 -Es /usr/sbin/firewalld --nofork --nopid
Jan 24 03:33:45 openqaworker1 systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 24 03:33:50 openqaworker1 systemd[1]: Started firewalld - dynamic firewall daemon.
● bsc1121613-workaround.service - Workaround for bsc#1121613 - reload firewall after startup
Loaded: loaded (/etc/systemd/system/bsc1121613-workaround.service; disabled; vendor preset: disabled)
Active: inactive (dead)
openqaworker4
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Docs: man:firewalld(1)
● bsc1121613-workaround.service - Workaround for bsc#1121613 - reload firewall after startup
Loaded: loaded (/etc/systemd/system/bsc1121613-workaround.service; disabled; vendor preset: disabled)
Active: inactive (dead)
so we can see … a mess. As known, power8 does not have firewalld and openqaworker4 has no firewall configured. Only imagetester seems to work as expected: The firewall started and the workaround service triggered a reload 10s afterwards. On aarch64 as well as openqaworker1 the workaround service is disabled but on aarch64 the firewall seems to run fine nevertheless. Btw, I checked and the firewall zone reports as "trusted" on all. Shouldn't the workaround service be active on aarch64 and openqaworker1 as well?
I enabled the service on both aarch64 and openqaworker1. On openqaworker1 I did:
ssh root@openqaworker1 "systemctl enable --now bsc1121613-workaround"
which triggered the reload of the firewall and, voilà, the developer mode works again.
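To spot workers where the workaround is not enabled without reading the full status output, something like the following could be used. A minimal sketch: `run_on` and `check_workaround` are hypothetical helper names, and the ssh access as root is assumed to be the same as in the loops above.

```shell
# run_on HOST COMMAND: run COMMAND on HOST as root via ssh.
run_on() {
    ssh "root@$1" "$2"
}

# check_workaround RUNNER HOST...
# Prints "host: state" for each host, where state is what
# "systemctl is-enabled" printed, or "unknown" if it printed nothing.
check_workaround() {
    runner=$1
    shift
    for host in "$@"; do
        # is-enabled exits non-zero for states like "disabled" but still
        # prints the state, so the exit code is deliberately ignored.
        state=$("$runner" "$host" "systemctl is-enabled bsc1121613-workaround.service" 2>/dev/null) || true
        printf '%s: %s\n' "$host" "${state:-unknown}"
    done
}

# Intended use (not executed here):
#   check_workaround run_on power8 aarch64 imagetester openqaworker1 openqaworker4
```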
Updated by nicksinger almost 6 years ago
- Status changed from Feedback to Resolved
Today I've quickly checked o3. Interactive mode worked on openqaworker1 as well as openqaworker4 :)