action #66907

Multimachine test fails in setup for ARM workers

Added by sebchlad over 1 year ago. Updated over 1 year ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
-
Start date:
2020-05-15
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

With the RC2 build for SLE15SP2 we hit the known ARM MM problems again. These are recurring issues and require the usual attention from the QA tools team/people with access to OSD machines.
As this is RC2 build validation day, I'm opening this ticket to perhaps get some traction on finding a long-term solution.

openQA test in scenario sle-15-SP2-Online-aarch64-hpc_DELTA_slurm_accounting_supportserver@aarch64 fails in
setup

Test suite description

Slurm accounting tests with db configured and NFS shared folder provided. 1 ctl, multiple compute nodes. Maintainer: schlad

Reproducible

Fails since (at least) Build 194.1 (current job)

Expected result

Last good: 191.1 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues

Is duplicate of openQA Infrastructure - action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures (Resolved, 2020-02-26)

History

#1 Updated by sebchlad over 1 year ago

  • Subject changed from test fails in setup to test fails in setup for ARM workes (known problems)

#2 Updated by pcervinka over 1 year ago

  • Subject changed from test fails in setup for ARM workes (known problems) to Multimachine test fails in setup for ARM workers

#3 Updated by pcervinka over 1 year ago

  • Subject changed from Multimachine test fails in setup for ARM workers to [ha][hpc][openqa] Multimachine test fails in setup for ARM workers

#4 Updated by acarvajal over 1 year ago

  • Subject changed from [ha][hpc][openqa] Multimachine test fails in setup for ARM workers to Multimachine test fails in setup for ARM workers

Saw several MM HA tests also failing due to network issues: attempts to contact 10.0.2.2, attempts to run yast2 firewall, name resolution failures, etc.

In the screenshots it could be seen that the network was not working properly for the VMs, for example in: https://openqa.suse.de/tests/4238452#step/iscsi_client/5 (could not resolve openqa.suse.de) or https://openqa.suse.de/tests/4238437#step/setup/25 (network 10.0.2.1 unreachable).

After checking some of the failed jobs, a pattern emerged: the failed jobs seemed to be limited to openqaworker-arm-1.

Checking on the system, the following could be seen in the status of the os-autoinst-openvswitch service:

openqaworker-arm-1:~ # systemctl status os-autoinst-openvswitch.service
● os-autoinst-openvswitch.service - os-autoinst openvswitch helper
   Loaded: loaded (/usr/lib/systemd/system/os-autoinst-openvswitch.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/os-autoinst-openvswitch.service.d
           └─override.conf
   Active: active (running) since Thu 2020-05-14 14:03:00 UTC; 20h ago
 Main PID: 3367 (os-autoinst-ope)
    Tasks: 1
   CGroup: /system.slice/os-autoinst-openvswitch.service
           └─3367 /usr/bin/perl /usr/lib/os-autoinst/os-autoinst-openvswitch

May 15 10:54:12 openqaworker-arm-1 ovs-vsctl[21376]: ovs|00001|vsctl|INFO|Called as ovs-vsctl remove port tap2 tag 24
May 15 10:54:24 openqaworker-arm-1 ovs-vsctl[21445]: ovs|00001|db_ctl_base|ERR|no port named tap16
May 15 10:54:24 openqaworker-arm-1 os-autoinst-openvswitch[3367]: ovs-vsctl: no port named tap16
May 15 10:54:24 openqaworker-arm-1 os-autoinst-openvswitch[3367]: 'tap16' is not connected to bridge 'br1'
May 15 10:54:52 openqaworker-arm-1 ovs-vsctl[21859]: ovs|00001|db_ctl_base|ERR|no port named tap15
May 15 10:54:52 openqaworker-arm-1 os-autoinst-openvswitch[3367]: ovs-vsctl: no port named tap15
May 15 10:54:52 openqaworker-arm-1 os-autoinst-openvswitch[3367]: 'tap15' is not connected to bridge 'br1'
May 15 10:55:10 openqaworker-arm-1 ovs-vsctl[22196]: ovs|00001|db_ctl_base|ERR|no port named tap10
May 15 10:55:10 openqaworker-arm-1 os-autoinst-openvswitch[3367]: ovs-vsctl: no port named tap10
May 15 10:55:10 openqaworker-arm-1 os-autoinst-openvswitch[3367]: 'tap10' is not connected to bridge 'br1'
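To collect all affected ports from a longer journal, the "is not connected" messages above can be filtered mechanically. A minimal sketch (the `journalctl` invocation in the comment is an assumption; the function itself only parses log text on stdin):

```shell
# Print the unique tap device names that os-autoinst-openvswitch
# reported as "not connected" to the bridge. Reads journal lines
# on stdin, e.g.:
#   journalctl -u os-autoinst-openvswitch.service | failing_taps
failing_taps() {
  grep -o "'tap[0-9]*' is not connected" \
    | grep -o 'tap[0-9]*' \
    | sort -u
}
```

On the excerpt above this would yield tap10, tap15 and tap16, i.e. the three interfaces investigated in the next step.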

Checking those 3 tap interfaces directly: the interfaces were defined in the system, but not listed in the bridge startup configuration:

openqaworker-arm-1:/etc/sysconfig/network # ip a | egrep 'tap16|tap15|tap10'
15: tap10: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
40: tap15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
41: tap16: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
openqaworker-arm-1:/etc/sysconfig/network # egrep 'tap16|tap15|tap10' ifcfg-br1 

Could not continue checking as the server was restarted.
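The check above (tap devices present in `ip a` but absent from `ifcfg-br1`) can be wrapped in a small helper for future incidents. A sketch, assuming the sysconfig path from the transcript; the function name and interface are hypothetical:

```shell
# List the tap devices (given as arguments) that do NOT appear in the
# bridge config file (first argument), i.e. candidates for the
# "defined in the system but not in the bridge" situation seen above.
missing_taps() {
  local cfg=$1
  shift
  local tap
  for tap in "$@"; do
    grep -q "$tap" "$cfg" || echo "$tap"
  done
}
# Example on the worker:
#   missing_taps /etc/sysconfig/network/ifcfg-br1 tap10 tap15 tap16
```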

#5 Updated by pcervinka over 1 year ago

  • Blocked by action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures added

#6 Updated by okurz over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to okurz

As pcervinka rightly stated, this is blocked by #63874. As the relation to that ticket might not have been obvious, I extended the subject of #63874 so that we can close this ticket as a duplicate, unless you see any additional information here.

#7 Updated by okurz over 1 year ago

  • Blocked by deleted (action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures)

#8 Updated by okurz over 1 year ago

  • Is duplicate of action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures added

#9 Updated by okurz over 1 year ago

  • Status changed from In Progress to Rejected

#10 Updated by sebchlad over 1 year ago

And in the meantime I got access to the OSD workers, so I will try to help by maintaining the ARM workers and, when needed, masking unwanted worker instances which should not be there, restarting the network interfaces, etc.
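Masking surplus worker instances (the root cause named in #63874: "numofworkers" reduced in salt pillars while old instances keep running) could be scripted along these lines. A sketch only: the `openqa-worker@.service` unit name and the instance numbering are assumptions about the worker hosts, and the function merely prints the commands so they can be reviewed before running:

```shell
# Print "systemctl mask" commands for worker instances above the
# configured count, e.g. mask_cmds 10 20 prints commands for
# instances 11..20. Pipe to "sudo sh" only after checking the list.
mask_cmds() {
  local keep=$1 max=$2 i
  for i in $(seq $((keep + 1)) "$max"); do
    echo "systemctl mask --now openqa-worker@$i.service"
  done
}
```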
