action #68092

hpc tests failed on arm machines, especially openqaworker-arm-3, likely due to problems with the firewall

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
Start date: 2020-06-15
Due date: 2020-07-16
% Done: 0%
Estimated time:

Description

Observation

sebchlad pointed me to a failing test within the scenario sle-15-SP2-Online-aarch64-Build209.2-hpc_DELTA_slurm_master_accounting@aarch64 with details in https://openqa.suse.de/tests/4347870/file/serial_terminal.txt . The VMs involved are able to ping each other, and the tests are able to upload all results to openQA, with communication in the other direction working as well. However, the slurm database cannot be accessed, maybe due to a firewall problem on the openQA worker host.

History

#1 Updated by okurz over 1 year ago

Debugged multi-machine tests with sebchlad. After a reboot of the workers some tests failed, in particular on arm3. The osd workers are still on SuSEfirewall2, while the o3 workers are on firewalld. I could not find a ticket describing that we need to migrate to firewalld. I ran the wicked advanced tests on arm with

openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de 4341663 _GROUP=0 BUILD=X WORKER_CLASS=openqaworker-arm-3 TEST=wicked_advanced_okurz

and both test jobs passed, while the hpc_slurm tests showed problems. I showed and explained to sebchlad our current monitoring and recovery actions and the custom clone command.

firewalld runs as a service, so we can check whether the service is running and alert otherwise. SuSEfirewall2, in contrast, runs on startup but then exits, so we cannot rely on the state of the systemd service to tell whether everything is still in order, e.g.:

okurz@openqaworker-arm-3:~> sudo systemctl status SuSEfirewall2
● SuSEfirewall2.service - SuSEfirewall2 phase 2
   Loaded: loaded (/usr/lib/systemd/system/SuSEfirewall2.service; enabled; vendor preset: disabled)
   Active: active (exited) since Fri 2020-06-12 16:29:07 CEST; 2 days ago
…
Jun 12 16:28:31 openqaworker-arm-3 SuSEfirewall2[3846]: Warning: /proc/sys/net/ipv4/ip_forward is not enabled, but required for FW_ROUTE, you should configure this in /etc/sysctl.conf. This option has been implicitly enabled now.
Jun 12 16:28:31 openqaworker-arm-3 SuSEfirewall2[3846]: <36>Jun 12 16:28:31 SuSEfirewall2[3846]: Warning: /proc/sys/net/ipv4/ip_forward is not enabled, but required for FW_ROUTE, you should configure this in /etc/sysctl.conf. This option has been implicitly enabled now.
Jun 12 16:29:07 openqaworker-arm-3 SuSEfirewall2[3846]: Firewall rules successfully set
Jun 12 16:29:07 openqaworker-arm-3 systemd[1]: Started SuSEfirewall2 phase 2.

So the firewall was started. However:

okurz@openqaworker-arm-3:~> sudo SuSEfirewall2 status
<35>Jun 15 09:51:43 SuSEfirewall2[7051]: SuSEfirewall2 not active

and also iptables shows a problem:

openqa:/srv/salt # salt -l error --no-color -C  'G@roles:worker' cmd.run 'iptables -L | head -n 3'
openqaworker9.suse.de:
    Chain INPUT (policy DROP)
    target     prot opt source               destination
    ACCEPT     all  --  anywhere             anywhere
…
openqaworker-arm-3.suse.de:
    Another app is currently holding the xtables lock. Perhaps you want to use the -w option?

Checking again 1h later showed success though.
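The observations above suggest that a monitoring check should look at the actual firewall state rather than at the systemd unit state. A minimal sketch, assuming that an inactive firewall always reports the string "not active" as in the output above and that iptables is version 1.6 or newer so that -w accepts a timeout in seconds. The function names susefirewall2_output_ok and check_firewall are hypothetical and not part of our existing monitoring:

```shell
# Classify the text printed by "SuSEfirewall2 status".
# Assumption: an inactive firewall reports "SuSEfirewall2 not active",
# as seen on openqaworker-arm-3 above.
susefirewall2_output_ok() {
    case "$1" in
        *"not active"*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical check for a worker host; needs to run as root.
check_firewall() {
    local status_output
    # Treat a failing command as well as a "not active" message as critical.
    if ! status_output=$(SuSEfirewall2 status 2>&1) ||
       ! susefirewall2_output_ok "$status_output"; then
        echo "CRITICAL: SuSEfirewall2 is not active"
        return 2
    fi
    # -w makes iptables wait for the xtables lock (here up to 10 seconds)
    # instead of failing immediately, as the error message itself suggests.
    if ! iptables -w 10 -L INPUT -n >/dev/null 2>&1; then
        echo "CRITICAL: could not list iptables rules"
        return 2
    fi
    echo "OK: SuSEfirewall2 active and iptables responsive"
}
```

Calling check_firewall on a worker would then print a single status line and return 0 or 2, which could be wired into whatever alerting we already have. The text classification is kept in its own function so that it can be tried against recorded output without root access.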

Created #68095 for migration of osd workers to firewalld.

#2 Updated by okurz over 1 year ago

  • Status changed from New to Feedback

I looked into this issue again together with sebchlad yesterday.

To bisect the issue, sebchlad ran an older build, which also failed, and I triggered a build based on SLE15-SP2-RC3. I did that by copying the trigger script from https://openqa.suse.de/admin/obs_rsync/SUSE:SLE-15-SP2:GA:TEST%7Cbase/runs/.run_last/download/openqa.cmd into a local file, replacing the build version with $build, and running

env build=okurz bash -ex ./trigger WORKER_CLASS=openqaworker-arm-3 BETA=1 SCC_URL= _SKIP_POST_FAIL_HOOKS=1 ISO_URL=http://dist.suse.de/install/SLE-15-SP2-Online-RC3/SLE-15-SP2-Online-aarch64-RC3-Media1.iso

Maybe unexpectedly, that passed completely :D See https://openqa.suse.de/tests/4365653#dependencies . However, I doubt that the tests ran against a clean RC3 state. E.g. https://openqa.suse.de/tests/4365658/file/serial0.txt shows kernel version 5.3.18-22, and ls /mounts/dist/install/SLP/SLE-15-SP2-Module-Basesystem-*/aarch64/DVD1/aarch64/kernel-default-5* on login.suse.de tells me that 5.3.18-22 is GMC, whereas RC3 would be 5.3.18-20. I guess newer versions are installed because the system is registered against the default public SCC. To have a better test base we should probably use the Full medium or try to point to the repos on dist.suse.de. Both options are more work though. With this result we at least know that the infrastructure is able to execute these tests, although it is unknown how reliably.
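The kernel comparison above could be scripted when checking whether a job really ran on the intended build. A minimal sketch, assuming GNU sort with -V is available and that the only flavor suffix to strip is "-default"; the function names are hypothetical:

```shell
# Strip the flavor suffix, e.g. "5.3.18-22-default" -> "5.3.18-22".
kernel_release_base() {
    printf '%s\n' "${1%-default}"
}

# Return 0 if the first kernel release is newer than the second one.
kernel_is_newer_than() {
    local actual expected newest
    actual=$(kernel_release_base "$1")
    expected=$(kernel_release_base "$2")
    # Equal versions are not "newer".
    [ "$actual" != "$expected" ] || return 1
    # Version sort puts the newest release last.
    newest=$(printf '%s\n%s\n' "$actual" "$expected" | sort -V | tail -n 1)
    [ "$newest" = "$actual" ]
}
```

For the job above, kernel_is_newer_than 5.3.18-22 5.3.18-20 succeeds, which matches the suspicion that updates were pulled in from SCC.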

sebchlad, in case you will report or have already reported a bug, I would appreciate it if you could link me in so that I can track it here.

#3 Updated by rfan1 over 1 year ago

Hello Oliver,

I don't know if my understanding is correct, but hopefully it can provide some help.

SUSE Linux Enterprise 15 introduces firewalld as the new software firewall, replacing SuSEfirewall2. SuSEfirewall2 has not been removed from SUSE Linux Enterprise 15 and is still in the Main repository. However, it is not included in the default installation, and firewalld is installed by default in new installations.

There is no automatic migration, so you must migrate to firewalld manually. firewalld includes a helper migration script, susefirewall2-to-firewalld. Depending on the complexity of your SuSEfirewall2 configuration the script may perform a perfect migration, or it may fail. Most likely it will partially succeed and you will have to review your new firewalld configuration and make adjustments.

The resulting configuration will make firewalld behave somewhat like SuSEfirewall2. To take full advantage of firewalld's features you may elect to create a new configuration, rather than trying to migrate your old configuration. It is safe to run the susefirewall2-to-firewalld script with no options, as it makes no permanent changes to your system. However, if you are administering the system remotely you could get locked out.

Install and run susefirewall2-to-firewalld:

root # zypper in susefirewall2-to-firewalld
root # susefirewall2-to-firewalld

rfan@openqaworker-arm-3:~> systemctl list-unit-files |grep firewall
SuSEfirewall2.service        enabled
SuSEfirewall2_init.service   enabled
SuSEfirewall2_setup.service  enabled

rfan@openqaworker-arm-3:~> zypper se -s firewall
Loading repository data...
Reading installed packages...

S | Name                       | Type       | Version           | Arch   | Repository
--+----------------------------+------------+-------------------+--------+--------------------------
i | SuSEfirewall2              | package    | 3.6.378-lp151.2.2 | noarch | openSUSE-Leap-15.1-Oss
  | SuSEfirewall2-fail2ban     | package    | 0.10.4-lp151.1.1  | noarch | openSUSE-Leap-15.1-Oss
  | firewall-applet            | package    | 0.5.5-lp151.5.1   | noarch | openSUSE-Leap-15.1-Oss
  | firewall-config            | package    | 0.5.5-lp151.5.1   | noarch | openSUSE-Leap-15.1-Oss
i | firewall-macros            | package    | 0.5.5-lp151.5.1   | noarch | openSUSE-Leap-15.1-Oss
  | firewalld                  | package    | 0.5.5-lp151.5.1   | noarch | openSUSE-Leap-15.1-Oss
  | firewalld-lang             | package    | 0.5.5-lp151.5.1   | noarch | openSUSE-Leap-15.1-Oss
  | firewalld-rpcbind-helper   | package    | 0.1-lp151.5.3.1   | noarch | openSUSE-Leap-15.1-Update
  | firewalld-rpcbind-helper   | package    | 0.1-lp151.4.1     | noarch | openSUSE-Leap-15.1-Oss
  | firewalld-rpcbind-helper   | srcpackage | 0.1-lp151.5.3.1   | noarch | openSUSE-Leap-15.1-Update
  | python3-firewall           | package    | 0.5.5-lp151.5.1   | noarch | openSUSE-Leap-15.1-Oss
  | susefirewall2-to-firewalld | package    | 0.0.4-lp151.2.3.1 | noarch | openSUSE-Leap-15.1-Update
  | susefirewall2-to-firewalld | package    | 0.0.4-lp151.1.1   | noarch | openSUSE-Leap-15.1-Oss
  | susefirewall2-to-firewalld | srcpackage | 0.0.4-lp151.2.3.1 | noarch | openSUSE-Leap-15.1-Update
i | yast2-firewall             | package    | 4.1.12-lp151.1.1  | noarch | openSUSE-Leap-15.1-Oss
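Since osd and o3 workers currently use different firewall backends, a check may first need to find out which one is enabled on a host. A minimal sketch that classifies the text of systemctl list-unit-files, assuming the unit names firewalld.service and SuSEfirewall2.service and the two-column "unit state" layout shown above; the function name enabled_firewall_backend is hypothetical:

```shell
# Print which firewall unit is enabled, based on the text of
# "systemctl list-unit-files" read from standard input.
# Output is one of: firewalld, SuSEfirewall2, both, none.
enabled_firewall_backend() {
    awk '
        $1 == "firewalld.service"     && $2 == "enabled" { fw = 1 }
        $1 == "SuSEfirewall2.service" && $2 == "enabled" { sf = 1 }
        END {
            if (fw && sf) print "both"
            else if (fw)  print "firewalld"
            else if (sf)  print "SuSEfirewall2"
            else          print "none"
        }'
}
```

On a worker this could be used as systemctl list-unit-files | enabled_firewall_backend. For the output shown above it would print SuSEfirewall2.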

#4 Updated by okurz over 1 year ago

  • Due date set to 2020-07-16

Well, thanks for trying to help. I had already recorded the need to migrate to firewalld in #68095. That migration requires changes to https://gitlab.suse.de/openqa/salt-states-openqa so that firewalld is configured correctly for openQA multi-machine tests.

sebchlad, I would appreciate an update from you if you have found anything related to the original problem. Did you report a product bug about this?

#5 Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

Discussed with sebchlad. He did report a product bug, which in turn was rejected, but no more tests failed in the same way, so we are good :)
