action #66907
closed
Multimachine test fails in setup for ARM workers
0%
Description
Observation
With the RC2 build for SLE15SP2 we again hit the known ARM multi-machine (MM) problems. These are recurring issues and require the usual attention from the QA tools team or people with access to OSD machines.
As this is RC2 build validation day, I'm opening this ticket to perhaps get some traction on finding long-term solutions.
openQA test in scenario sle-15-SP2-Online-aarch64-hpc_DELTA_slurm_accounting_supportserver@aarch64 fails in setup
Test suite description
Slurm accounting tests with db configured and NFS shared folder provided. 1 ctl, multiple compute nodes. Maintainer: schlad
Reproducible
Fails since (at least) Build 194.1 (current job)
Expected result
Last good: 191.1 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by sebchlad over 4 years ago
- Subject changed from test fails in setup to test fails in setup for ARM workes (known problems)
Updated by pcervinka over 4 years ago
- Subject changed from test fails in setup for ARM workes (known problems) to Multimachine test fails in setup for ARM workers
Updated by pcervinka over 4 years ago
- Subject changed from Multimachine test fails in setup for ARM workers to [ha][hpc][openqa] Multimachine test fails in setup for ARM workers
Updated by acarvajal over 4 years ago
- Subject changed from [ha][hpc][openqa] Multimachine test fails in setup for ARM workers to Multimachine test fails in setup for ARM workers
Saw several MM HA tests also failing due to network issues: failures when attempting to contact 10.0.2.2, when attempting to run yast2 firewall, in name resolution, etc.
The screenshots show that the network was not working properly for the VMs, for example in https://openqa.suse.de/tests/4238452#step/iscsi_client/5 (could not resolve openqa.suse.de) or https://openqa.suse.de/tests/4238437#step/setup/25 (network 10.0.2.1 unreachable).
After checking some of the failed jobs, a pattern emerged: the failures seemed to be limited to openqaworker-arm-1.
Checking on the system, the following could be seen in the status of the os-autoinst-openvswitch service:
openqaworker-arm-1:~ # systemctl status os-autoinst-openvswitch.service
● os-autoinst-openvswitch.service - os-autoinst openvswitch helper
Loaded: loaded (/usr/lib/systemd/system/os-autoinst-openvswitch.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/os-autoinst-openvswitch.service.d
└─override.conf
Active: active (running) since Thu 2020-05-14 14:03:00 UTC; 20h ago
Main PID: 3367 (os-autoinst-ope)
Tasks: 1
CGroup: /system.slice/os-autoinst-openvswitch.service
└─3367 /usr/bin/perl /usr/lib/os-autoinst/os-autoinst-openvswitch
May 15 10:54:12 openqaworker-arm-1 ovs-vsctl[21376]: ovs|00001|vsctl|INFO|Called as ovs-vsctl remove port tap2 tag 24
May 15 10:54:24 openqaworker-arm-1 ovs-vsctl[21445]: ovs|00001|db_ctl_base|ERR|no port named tap16
May 15 10:54:24 openqaworker-arm-1 os-autoinst-openvswitch[3367]: ovs-vsctl: no port named tap16
May 15 10:54:24 openqaworker-arm-1 os-autoinst-openvswitch[3367]: 'tap16' is not connected to bridge 'br1'
May 15 10:54:52 openqaworker-arm-1 ovs-vsctl[21859]: ovs|00001|db_ctl_base|ERR|no port named tap15
May 15 10:54:52 openqaworker-arm-1 os-autoinst-openvswitch[3367]: ovs-vsctl: no port named tap15
May 15 10:54:52 openqaworker-arm-1 os-autoinst-openvswitch[3367]: 'tap15' is not connected to bridge 'br1'
May 15 10:55:10 openqaworker-arm-1 ovs-vsctl[22196]: ovs|00001|db_ctl_base|ERR|no port named tap10
May 15 10:55:10 openqaworker-arm-1 os-autoinst-openvswitch[3367]: ovs-vsctl: no port named tap10
May 15 10:55:10 openqaworker-arm-1 os-autoinst-openvswitch[3367]: 'tap10' is not connected to bridge 'br1'
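The affected tap devices can be pulled out of log lines like the ones above with a small filter. This is only a sketch: the function name `missing_taps` is made up for illustration, and it assumes the message wording stays "no port named tapN" as shown in this log.

```shell
#!/bin/sh
# Hypothetical helper: read journal-style log lines on stdin and print each
# tap device that ovs-vsctl reported as missing, once, one name per line.
missing_taps() {
    # Keep only the "no port named tapN" fragment, then its last word.
    grep -o 'no port named tap[0-9][0-9]*' | awk '{ print $NF }' | sort -u
}
```

On the worker it could be fed from the journal, for example `journalctl -u os-autoinst-openvswitch.service | missing_taps`.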
Checking those 3 tap interfaces directly showed that they were defined in the system, but not listed in the bridge startup script (ifcfg-br1):
openqaworker-arm-1:/etc/sysconfig/network # ip a | egrep 'tap16|tap15|tap10'
15: tap10: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
40: tap15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
41: tap16: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
openqaworker-arm-1:/etc/sysconfig/network # egrep 'tap16|tap15|tap10' ifcfg-br1
Could not continue checking as the server was restarted.
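The manual egrep check against ifcfg-br1 could be wrapped in a function so that it reports exactly which tap devices are absent from the bridge configuration. This is a sketch, not an existing tool: the name `taps_not_in_config` is hypothetical, and it only assumes that the config file mentions each attached tap device by name somewhere in its text.

```shell
#!/bin/sh
# Hypothetical helper: print every tap name that is NOT mentioned in the
# given bridge config file.
# $1 = path to the config file, remaining arguments = tap names to check.
taps_not_in_config() {
    config=$1
    shift
    for tap in "$@"; do
        # -w matches whole words only, so "tap1" does not match "tap10".
        grep -qw -- "$tap" "$config" || printf '%s\n' "$tap"
    done
}
```

For the case above the call would be `taps_not_in_config /etc/sysconfig/network/ifcfg-br1 tap10 tap15 tap16`. Since the egrep returned nothing, all three names would be printed.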
Updated by pcervinka over 4 years ago
- Blocked by action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures added
Updated by okurz over 4 years ago
- Status changed from New to In Progress
- Assignee set to okurz
Updated by okurz over 4 years ago
- Blocked by deleted (action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures)
Updated by okurz over 4 years ago
- Is duplicate of action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures added
Updated by sebchlad over 4 years ago
In the meantime I got access to the OSD workers, so I will try to help by maintaining the ARM workers. When needed, I will mask unwanted worker instances that should not be there, restart the network interfaces, etc.
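Masking leftover worker instances, as described above and in the related action #63874, might be prepared with a dry-run helper like the following. It is a sketch under assumptions: the function name `print_mask_commands` is hypothetical, the unit name pattern `openqa-worker@N.service` is assumed for the worker instances, and the configured count is passed in by hand instead of being read from the salt pillar. It only prints commands and does not run them.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the commands that would mask and stop
# every worker instance numbered above the configured worker count.
# $1 = configured number of workers, remaining arguments = instance numbers.
print_mask_commands() {
    max=$1
    shift
    for n in "$@"; do
        if [ "$n" -gt "$max" ]; then
            # Assumed unit name pattern; verify on the worker before running.
            printf 'systemctl mask --now openqa-worker@%s.service\n' "$n"
        fi
    done
}
```

For example, `print_mask_commands 10 9 10 11 16` prints one command each for instances 11 and 16, which can be reviewed before being run on the worker.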