action #45848

closed

developer mode on o3 broken - "unable to upgrade ws to command server" - except on openqaworker4

Added by okurz over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
-
Start date:
2019-01-09
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://i.imgur.com/fxeV3kT.png

The local browser web console outputs repeatedly:

Connection to livehandler lost
test_result.js:167:1
Establishing ws connection to wss://openqa.opensuse.org/liveviewhandler/tests/827215/developer/ws-proxy/status
test_result.js:176:28
Received message via ws proxy: {"data":null,"type":"info","what":"connecting to os-autoinst command server at ws:\/\/openqaworker1:20163\/xmLOPocY2ivfpeaz\/ws"}
test_result.js:172:1
Received message via ws proxy: {"data":null,"type":"error","what":"unable to upgrade ws to command server"}
test_result.js:172:1
Error from ws proxy: unable to upgrade ws to command server
test_result.js:181:1
The connection to wss://openqa.opensuse.org/liveviewhandler/tests/827215/developer/ws-proxy/status was interrupted while the page was loading.

Steps to reproduce

Open any running job on o3. On openqaworker4 the developer mode seems to work fine; jobs on other workers, e.g. openqaworker1 and imagetester, show this problem.

Actions #1

Updated by okurz over 5 years ago

Hm, I just found that WORKER_HOSTNAME is set only on openqaworker4, right?

Actions #2

Updated by okurz over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

OK, I will try to handle this by setting the IP address consistently for each worker.

Actions #3

Updated by mkravec over 5 years ago

Just a guess, but I had this issue on my remote worker because openqaworker1:20163 should have been openqaworker1.arch.suse.de:20163.

In my case it was solved by setting WORKER_HOSTNAME = openqaworker1.arch.suse.de in workers.ini.

Actions #4

Updated by mkittler over 5 years ago

I can not check the workers for incorrect settings because I don't have SSH access to the workers.

If WORKER_HOSTNAME is not set, the web UI falls back to the hostname it knows by itself, here openqaworker1 (the console log shows that). I also used that hostname to attempt an SSH login and did not get "Could not resolve hostname ..." but instead was stuck at the password prompt. So I suppose the hostname is correct.

The next step would be checking the firewall (http://open.qa/docs/#_steps_to_debug_developer_mode_setup). A netstat on the worker might also help to see whether the websocket server is actually listening on that particular port.
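To check the port, the host and port first have to be taken from the command server URL in the console message. A minimal sketch of that step; `ws_endpoint` is a hypothetical helper name, and it assumes the URL always carries an explicit port, as in the log above:

```shell
#!/bin/sh
# Hypothetical helper: split a ws:// or wss:// URL into "host port".
# Assumes the URL contains an explicit port (as in the console log).
ws_endpoint() {
    url=$1
    rest=${url#*://}        # drop the scheme, e.g. "ws://"
    hostport=${rest%%/*}    # keep everything before the first "/"
    host=${hostport%%:*}    # part before the colon
    port=${hostport##*:}    # part after the colon
    printf '%s %s\n' "$host" "$port"
}
```

With the port known, something like `ss -tln | grep ':20163 '` (or `netstat -tln`) on the worker should show whether anything is listening there.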

Actions #5

Updated by mkittler over 5 years ago

  • Project changed from openQA Project to openQA Infrastructure
Actions #6

Updated by okurz over 5 years ago

  • Checked with
for i in power8 aarch64 imagetester openqaworker1 openqaworker4 ; do echo $i && ssh root@$i "grep -C 3 WORKER_HOSTNAME /etc/openqa/workers.ini"; done
  • Set WORKER_HOSTNAME with the local IP
  • Restarted all workers with
for i in power8 aarch64 imagetester openqaworker1 openqaworker4 ; do echo $i && ssh root@$i "systemctl restart openqa-worker.target"; done
  • Checked on the web UI by going to "running" jobs and seeing whether the developer mode can be triggered -> The result is the same as before: it only works on openqaworker4.

So on to check the connection attempts and firewall.

I see the websockets server there on the worker.

openqaworker4 has no firewall configured, openqaworker1 has it enabled. I assume this is where the problem lies.

nsinger recently, in his words, changed the default zone to trusted. So much for consistency:

for i in power8 aarch64 imagetester openqaworker1 openqaworker4 ; do echo -n "$i: " && ssh root@$i "firewall-cmd --get-default-zone"; done
power8: bash: firewall-cmd: command not found
aarch64: public
imagetester: trusted
openqaworker1: trusted
openqaworker4: FirewallD is not running
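The inconsistency can also be picked out mechanically. A small sketch, assuming the loop output is captured in the "host: answer" form shown above (with stderr merged in, e.g. via `2>&1`); `zone_outliers` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read "host: answer" lines on stdin and print every
# host whose answer is not exactly the expected zone name.
zone_outliers() {
    expected=$1
    while IFS= read -r line; do
        host=${line%%:*}       # text before the first colon
        answer=${line#*: }     # text after the first ": "
        [ "$answer" = "$expected" ] || printf '%s\n' "$host"
    done
}
```

Fed with the five lines above and `trusted` as the expected zone, it lists power8, aarch64 and openqaworker4. Hosts without a running firewalld are reported as well, since their answer is an error message rather than a zone name.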

@nsinger could you take over?

Actions #7

Updated by okurz over 5 years ago

  • Status changed from In Progress to Feedback
  • Assignee changed from okurz to nicksinger
Actions #8

Updated by nicksinger over 5 years ago

Sure. I've now also changed aarch64 to trusted, and after reloading the firewall the interactive mode works. The same applies to imagetester and openqaworker1, but I had to reload the firewall once again. So it seems that firewalld is not properly applying the default zone setting on start-up. Unfortunately this is all I can find in the logs:

openqaworker1:~ # journalctl -fu firewalld
-- Logs begin at Thu 2019-01-10 03:32:43 CET. --
Jan 10 03:33:03 openqaworker1 systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 10 03:33:08 openqaworker1 systemd[1]: Started firewalld - dynamic firewall daemon.
Jan 10 12:01:58 openqaworker1 systemd[1]: Reloading firewalld - dynamic firewall daemon.
Jan 10 12:01:58 openqaworker1 systemd[1]: Reloaded firewalld - dynamic firewall daemon.
Actions #9

Updated by nicksinger over 5 years ago

The issue on power8 is the following:

os-autoinst version "0.0" is incompatible, version "1.0" is required

So for this host we can safely assume that the firewall isn't the problem.

Actions #10

Updated by nicksinger over 5 years ago

I've rebooted openqaworker1 to prove my theory. Indeed after reboot the websocket connection is broken again.

iptables-save output after reboot with broken connection. ip6tables-save output after reboot with broken connection.

iptables-save output after reloading firewalld with working connection. ip6tables-save output after reloading firewalld with working connection.

Not diffed yet and maybe not even helpful (I somehow have the impression that firewalld uses ebtables and no longer iptables).
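For the diff, the dumps first need their timestamps and counters removed, otherwise nearly every line differs. A minimal sketch, assuming the usual iptables-save layout with `#` comment lines and `[packets:bytes]` counters; `normalize_rules` and the file names in the usage line are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: make two iptables-save dumps comparable by
# dropping comment lines (which carry timestamps) and zeroing the
# [packets:bytes] counters.
normalize_rules() {
    sed -e '/^#/d' \
        -e 's/\[[0-9][0-9]*:[0-9][0-9]*\]/[0:0]/g'
}
```

Usage would be along the lines of `normalize_rules < broken.rules > a; normalize_rules < working.rules > b; diff -u a b`. If the active rules are managed by a backend other than iptables, these dumps may not show the relevant difference at all.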

Actions #11

Updated by okurz over 5 years ago

You might want to report a bug regarding "transactional-server" and firewall

Actions #12

Updated by nicksinger over 5 years ago

After a little more investigation I'm quite certain we are facing a bug here, so I created bsc#1121613.

Actions #13

Updated by okurz over 5 years ago

yep, looks good. So any idea for a workaround to apply for our workers?

Actions #14

Updated by nicksinger over 5 years ago

The workaround is in place on imagetester. I've created a service called bsc1121613-workaround.service:

[Unit]
Description=Workaround for bsc#1121613 - reload firewall after startup
After=firewalld.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/systemctl reload firewalld.service

[Install]
WantedBy=multi-user.target

I've also manually started and enabled it. After that the connection worked. Let us now wait for a reboot to see if the timing matters.

Actions #15

Updated by okurz about 5 years ago

So the workaround was not enough because the service would trigger too early and actually kill the firewall for good.

What nsinger and I did was to

  • add a delay to the service file:
[Unit]
Description=Workaround for bsc#1121613 - reload firewall after startup
After=firewalld.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Ensure the actual firewalld service is up before we reload
ExecStartPre=/bin/sleep 10
ExecStart=/usr/bin/systemctl reload firewalld.service

[Install]
WantedBy=multi-user.target

tried it out on imagetester

  • and copied to the other workers and activated:
scp root@imagetester:/etc/systemd/system/bsc1121613-workaround.service /tmp/
for i in power8 aarch64 openqaworker1 openqaworker4 ; do scp /tmp/bsc1121613-workaround.service root@$i:/etc/systemd/system/; done
for i in power8 aarch64 imagetester openqaworker1 openqaworker4; do echo $i && ssh root@$i "systemctl daemon-reload; systemctl restart firewalld; systemctl restart bsc1121613-workaround"; done

which works better. But today I can see that the developer mode works on openqaworker1, power8 and aarch64 but not on openqaworker4. The state on imagetester is unknown.
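The fixed 10 s delay could in principle be replaced by polling. A minimal sketch, assuming that `firewall-cmd --state` exiting with 0 is a good enough readiness signal (the ticket does not confirm that this covers the timing problem in bsc#1121613); `wait_for` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: wait_for TRIES DELAY COMMAND...
# Runs COMMAND until it succeeds, at most TRIES times, sleeping DELAY
# seconds after each failed attempt. Returns 0 on success, 1 otherwise.
wait_for() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

A wrapper script called from the unit could then run `wait_for 30 1 firewall-cmd --state && systemctl reload firewalld.service` instead of sleeping unconditionally.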

Checking the status of the firewall and workaround service shows:

$ for i in power8 aarch64 imagetester openqaworker1 openqaworker4; do echo $i && ssh root@$i "systemctl status firewalld bsc1121613-workaround"; done
power8
● firewalld.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)

● bsc1121613-workaround.service - Workaround for bsc#1121613 - reload firewall after startup
   Loaded: loaded (/etc/systemd/system/bsc1121613-workaround.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-01-16 08:39:16 UTC; 1 weeks 1 days ago
 Main PID: 150208 (code=exited, status=5)

Jan 16 08:39:06 power8 systemd[1]: Starting Workaround for bsc#1121613 - reload firewall after startup...
Jan 16 08:39:16 power8 systemctl[150208]: Failed to reload firewalld.service: Unit firewalld.service failed to load: No such file or directory.
Jan 16 08:39:16 power8 systemd[1]: bsc1121613-workaround.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jan 16 08:39:16 power8 systemd[1]: Failed to start Workaround for bsc#1121613 - reload firewall after startup.
Jan 16 08:39:16 power8 systemd[1]: bsc1121613-workaround.service: Unit entered failed state.
Jan 16 08:39:16 power8 systemd[1]: bsc1121613-workaround.service: Failed with result 'exit-code'.
aarch64
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: disabled)
   Active: active (running)
     Docs: man:firewalld(1)
 Main PID: 1557 (firewalld)
    Tasks: 2 (limit: 9830)
   CGroup: /system.slice/firewalld.service
           └─1557 /usr/bin/python3 -Es /usr/sbin/firewalld --nofork --nopid

Jan 24 03:34:23 openqa-aarch64 systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 24 03:34:24 openqa-aarch64 systemd[1]: Started firewalld - dynamic firewall daemon.

● bsc1121613-workaround.service - Workaround for bsc#1121613 - reload firewall after startup
   Loaded: loaded (/etc/systemd/system/bsc1121613-workaround.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
imagetester
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2019-01-24 03:31:10 CET; 13h ago
     Docs: man:firewalld(1)
  Process: 2217 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
 Main PID: 1228 (firewalld)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/firewalld.service
           └─1228 /usr/bin/python3 -Es /usr/sbin/firewalld --nofork --nopid

Jan 24 03:31:08 imagetester systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 24 03:31:10 imagetester systemd[1]: Started firewalld - dynamic firewall daemon.
Jan 24 03:31:20 imagetester systemd[1]: Reloading firewalld - dynamic firewall daemon.
Jan 24 03:31:20 imagetester systemd[1]: Reloaded firewalld - dynamic firewall daemon.

● bsc1121613-workaround.service - Workaround for bsc#1121613 - reload firewall after startup
   Loaded: loaded (/etc/systemd/system/bsc1121613-workaround.service; enabled; vendor preset: disabled)
   Active: active (exited) since Thu 2019-01-24 03:31:20 CET; 13h ago
  Process: 2216 ExecStart=/usr/bin/systemctl reload firewalld.service (code=exited, status=0/SUCCESS)
  Process: 1377 ExecStartPre=/bin/sleep 10 (code=exited, status=0/SUCCESS)
 Main PID: 2216 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/bsc1121613-workaround.service

Jan 24 03:31:10 imagetester systemd[1]: Starting Workaround for bsc#1121613 - reload firewall after startup...
Jan 24 03:31:20 imagetester systemd[1]: Started Workaround for bsc#1121613 - reload firewall after startup.
openqaworker1
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2019-01-24 03:33:50 CET; 13h ago
     Docs: man:firewalld(1)
 Main PID: 1695 (firewalld)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/firewalld.service
           └─1695 /usr/bin/python3 -Es /usr/sbin/firewalld --nofork --nopid

Jan 24 03:33:45 openqaworker1 systemd[1]: Starting firewalld - dynamic firewall daemon...
Jan 24 03:33:50 openqaworker1 systemd[1]: Started firewalld - dynamic firewall daemon.

● bsc1121613-workaround.service - Workaround for bsc#1121613 - reload firewall after startup
   Loaded: loaded (/etc/systemd/system/bsc1121613-workaround.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
openqaworker4
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)

● bsc1121613-workaround.service - Workaround for bsc#1121613 - reload firewall after startup
   Loaded: loaded (/etc/systemd/system/bsc1121613-workaround.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

so we can see … a mess. As known, power8 does not have firewalld and openqaworker4 has no firewall configured. Only imagetester seems to work as expected: The firewall started and the workaround service triggered a reload 10s afterwards. On aarch64 as well as openqaworker1 the workaround service is disabled but on aarch64 the firewall seems to run fine nevertheless. Btw, I checked and the firewall zone reports as "trusted" on all. Shouldn't the workaround service be active on aarch64 and openqaworker1 as well?

I enabled the service on both aarch64 and openqaworker1. On openqaworker1 I did:

ssh root@openqaworker1 "systemctl enable --now bsc1121613-workaround"

which triggered the reload of the firewall et voilà, the developer mode works again
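The long status output above hides the one detail that mattered, namely whether the unit is enabled. A compact check could look like the following sketch; `remote` and `unit_report` are hypothetical helper names, and root SSH access to the workers is assumed as in the loops above:

```shell
#!/bin/sh
# Hypothetical wrapper for reaching a host; assumes root SSH access.
remote() { ssh "root@$1" "$2"; }

# Hypothetical helper: unit_report UNIT HOST...
# Prints one line per host with the unit's enabled and active state.
# Both systemctl calls exit non-zero for "disabled"/"inactive", so
# failures are tolerated and an empty answer is shown as "unknown".
unit_report() {
    unit=$1
    shift
    for host in "$@"; do
        enabled=$(remote "$host" "systemctl is-enabled $unit" 2>/dev/null) || true
        active=$(remote "$host" "systemctl is-active $unit" 2>/dev/null) || true
        printf '%s enabled=%s active=%s\n' \
            "$host" "${enabled:-unknown}" "${active:-unknown}"
    done
}
```

For example, `unit_report bsc1121613-workaround.service aarch64 imagetester openqaworker1` prints one line per worker.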

Actions #16

Updated by nicksinger about 5 years ago

  • Status changed from Feedback to Resolved

Today I've quickly checked o3. Interactive mode worked on openqaworker1 as well as openqaworker4 :)
