hpc tests failed on arm machines, especially openqaworker-arm-3, likely due to problems with the firewall
sebchlad pointed me to a failing test within the scenario sle-15-SP2-Online-aarch64-Build209.2-hpc_DELTA_slurm_master_accounting@aarch64 with details in https://openqa.suse.de/tests/4347870/file/serial_terminal.txt . The VMs involved are able to ping each other, and tests are able to upload all results to openQA as well as the other way around. However, the slurm database cannot be accessed, maybe due to a firewall problem on the openQA worker host.
#1 Updated by okurz over 1 year ago
Debugged multi-machine with sebchlad. After a reboot of the workers some tests failed, in particular on arm3. osd workers are still on SuSEfirewall2, o3 workers are on firewalld. I could not find a ticket describing that we need to migrate to firewalld. I ran wicked advanced tests on arm with
openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de 4341663 _GROUP=0 BUILD=X WORKER_CLASS=openqaworker-arm-3 TEST=wicked_advanced_okurz
and both test jobs passed while hpc_slurm tests showed problems. I showed and explained to sebchlad our current monitoring and recovery actions and the custom clone command. firewalld runs as a service, so we can check if the service runs and alert otherwise. SuSEfirewall2 runs on startup but then exits, so we cannot rely on the systemd service state to tell whether everything is still in order, e.g.:
okurz@openqaworker-arm-3:~> sudo systemctl status SuSEfirewall2
● SuSEfirewall2.service - SuSEfirewall2 phase 2
   Loaded: loaded (/usr/lib/systemd/system/SuSEfirewall2.service; enabled; vendor preset: disabled)
   Active: active (exited) since Fri 2020-06-12 16:29:07 CEST; 2 days ago
…
Jun 12 16:28:31 openqaworker-arm-3 SuSEfirewall2: Warning: /proc/sys/net/ipv4/ip_forward is not enabled, but required for FW_ROUTE, you should configure this in /etc/sysctl.conf. This option has been implicitly enabled now.
Jun 12 16:28:31 openqaworker-arm-3 SuSEfirewall2: <36>Jun 12 16:28:31 SuSEfirewall2: Warning: /proc/sys/net/ipv4/ip_forward is not enabled, but required for FW_ROUTE, you should configure this in /etc/sysctl.conf. This option has been implicitly enabled now.
Jun 12 16:29:07 openqaworker-arm-3 SuSEfirewall2: Firewall rules successfully set
Jun 12 16:29:07 openqaworker-arm-3 systemd: Started SuSEfirewall2 phase 2.
So according to systemd the firewall was started. However:
okurz@openqaworker-arm-3:~> sudo SuSEfirewall2 status
<35>Jun 15 09:51:43 SuSEfirewall2: SuSEfirewall2 not active
and also iptables shows a problem:
openqa:/srv/salt # salt -l error --no-color -C 'G@roles:worker' cmd.run 'iptables -L | head -n 3'
openqaworker9.suse.de:
    Chain INPUT (policy DROP)
    target     prot opt source               destination
    ACCEPT     all  --  anywhere             anywhere
…
openqaworker-arm-3.suse.de:
    Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Checking again 1h later showed success though.
Created #68095 for migration of osd workers to firewalld.
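Since "active (exited)" from systemd does not tell us whether rules are actually loaded, a monitoring check could look at the iptables output instead. A minimal sketch, assuming that a worker with loaded rules shows "policy DROP" on the INPUT chain as openqaworker9 does above. The function names are hypothetical and not part of our salt states:

```shell
# Sketch: decide from `iptables -L` style text whether firewall rules look loaded.
# Assumption: loaded rules mean "policy DROP" on the INPUT chain.

input_policy() {
    # Print the policy of the INPUT chain found in the text on stdin.
    sed -n 's/^Chain INPUT (policy \([A-Z]*\).*/\1/p' | head -n 1
}

firewall_rules_loaded() {
    # Succeed only if the INPUT chain policy read from stdin is DROP.
    [ "$(input_policy)" = "DROP" ]
}
```

On a worker this could be used as `iptables -w 10 -L INPUT -n | firewall_rules_loaded || echo "firewall rules missing"`. The `-w 10` makes iptables wait up to 10 seconds for the xtables lock instead of failing right away like on arm-3 above. An error message on stdin instead of a chain listing counts as "not loaded".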
#2 Updated by okurz over 1 year ago
- Status changed from New to Feedback
I looked into this issue again together with sebchlad yesterday.
To bisect the issue sebchlad ran an older build, which also failed, while I triggered a build based on SLE15-SP2-RC3. I did that by copying the trigger script from https://openqa.suse.de/admin/obs_rsync/SUSE:SLE-15-SP2:GA:TEST%7Cbase/runs/.run_last/download/openqa.cmd into a local file, replaced the build version with $build and ran
env build=okurz bash -ex ./trigger WORKER_CLASS=openqaworker-arm-3 BETA=1 SCC_URL= _SKIP_POST_FAIL_HOOKS=1 ISO_URL=http://dist.suse.de/install/SLE-15-SP2-Online-RC3/SLE-15-SP2-Online-aarch64-RC3-Media1.iso
Maybe unexpected, but that passed completely :D See https://openqa.suse.de/tests/4365653#dependencies . However, I doubt that the tests ran on a clean RC3 state. E.g. https://openqa.suse.de/tests/4365658/file/serial0.txt shows a kernel version 5.3.18-22 and
ls /mounts/dist/install/SLP/SLE-15-SP2-Module-Basesystem-*/aarch64/DVD1/aarch64/kernel-default-5*
on login.suse.de tells me that 5.3.18-22 is GMC, RC3 would be 5.3.18-20 . I guess newer versions are installed because the system is registered against the default public SCC. To have a better test base we should probably use the Full medium or try to point to the repos on dist.suse.de. Both options mean more work though. With this we at least know that the infrastructure is able to execute these tests, although it is unknown how reliably.
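The manual kernel comparison could be scripted for future bisect runs. A rough sketch, assuming the kernel release shows up in the log as digits.digits.digits-digits. The exact wording of serial0.txt is not reproduced in this ticket, and the function names are made up for illustration:

```shell
# Sketch: extract the first kernel release (e.g. 5.3.18-22) from log text on
# stdin and compare it with the release expected for the medium under test.
# Assumption: the release appears in the form digits.digits.digits-digits.

kernel_release() {
    # Print the first matching release string, or nothing if none is found.
    grep -Eo '[0-9]+\.[0-9]+\.[0-9]+-[0-9]+' | head -n 1
}

kernel_is_expected() {
    # $1: expected release; log text is read from stdin.
    [ "$(kernel_release)" = "$1" ]
}
```

With the values from this ticket, feeding the content of serial0.txt into `kernel_is_expected 5.3.18-20` would fail for the run above, because the job booted 5.3.18-22.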
sebchlad, in case you will report or have already reported a bug, I would appreciate it if you could link me in so that I can track it here.
#3 Updated by rfan1 over 1 year ago
I don't know if my understanding is correct, but hopefully it can provide some help.
SUSE Linux Enterprise 15 introduces firewalld as the new software firewall, replacing SuSEfirewall2. SuSEfirewall2 has not been removed from SUSE Linux Enterprise 15 and is still in the Main repository. However, it is not included in the default installation, and firewalld is installed by default in new installations.
There is no automatic migration, so you must migrate to firewalld manually. firewalld includes a helper migration script, susefirewall2-to-firewalld. Depending on the complexity of your SuSEfirewall2 configuration the script may perform a perfect migration, or it may fail. Most likely it will partially succeed and you will have to review your new firewalld configuration and make adjustments.
The resulting configuration will make firewalld behave somewhat like SuSEfirewall2. To take full advantage of firewalld's features you may elect to create a new configuration, rather than trying to migrate your old configuration. It is safe to run the susefirewall2-to-firewalld script with no options, as it makes no permanent changes to your system. However, if you are administering the system remotely you could get locked out.
Install and run susefirewall2-to-firewalld:
root # zypper in susefirewall2-to-firewalld
root # susefirewall2-to-firewalld
rfan@openqaworker-arm-3:~> systemctl list-unit-files |grep firewall
rfan@openqaworker-arm-3:~> zypper se -s firewall
Loading repository data...
Reading installed packages...
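To see at a glance which firewall units are enabled on a host before migrating, the `systemctl list-unit-files` output can be filtered. A small sketch, assuming the unit name is in the first column and the state in the second. The function name is hypothetical:

```shell
# Sketch: list enabled firewall-related units from `systemctl list-unit-files`
# style text on stdin. Matching is case-insensitive so that both
# SuSEfirewall2 and firewalld units are found.

enabled_firewall_units() {
    awk '$2 == "enabled" && tolower($1) ~ /firewall/ { print $1 }'
}
```

Usage on a worker would be `systemctl list-unit-files | enabled_firewall_units`. Keep in mind that an enabled SuSEfirewall2 unit still says nothing about whether rules are currently loaded.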
#4 Updated by okurz over 1 year ago
- Due date set to 2020-07-16
Well, thanks for trying to help. I had already recorded the need to migrate to firewalld in #68095 . This means changes to https://gitlab.suse.de/openqa/salt-states-openqa to configure firewalld correctly for openQA multi-machine tests.
sebchlad, I would appreciate an update from you if you have found anything related to the original problem. Did you report a product bug about this?