action #66907

Multimachine test fails in setup for ARM workers

Added by sebchlad over 1 year ago. Updated over 1 year ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
-
Start date:
2020-05-15
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

With the RC2 build for SLE15SP2 we hit the known ARM MM problems again. These are recurring issues and require the usual attention from the QA tools team/people with access to OSD machines.
As this is RC2 build validation day, I'm opening this ticket to perhaps get some traction on finding a long-term solution.

openQA test in scenario sle-15-SP2-Online-aarch64-hpc_DELTA_slurm_accounting_supportserver@aarch64 fails in
setup

Test suite description

Slurm accounting tests with db configured and NFS shared folder provided. 1 ctl, multiple compute nodes. Maintainer: schlad

Reproducible

Fails since (at least) Build 194.1 (current job)

Expected result

Last good: 191.1 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues

Is duplicate of openQA Infrastructure - action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures (Resolved, 2020-02-26)

History

#1 Updated by sebchlad over 1 year ago

  • Subject changed from test fails in setup to test fails in setup for ARM workes (known problems)

#2 Updated by pcervinka over 1 year ago

  • Subject changed from test fails in setup for ARM workes (known problems) to Multimachine test fails in setup for ARM workers

#3 Updated by pcervinka over 1 year ago

  • Subject changed from Multimachine test fails in setup for ARM workers to [ha][hpc][openqa] Multimachine test fails in setup for ARM workers

#4 Updated by acarvajal over 1 year ago

  • Subject changed from [ha][hpc][openqa] Multimachine test fails in setup for ARM workers to Multimachine test fails in setup for ARM workers

Saw several MM HA tests also failing due to network issues: attempts to contact 10.0.2.2, attempts to run yast2 firewall, name resolution failures, etc.

In the screenshots it could be seen that the network was not working properly for the VMs, for example in: https://openqa.suse.de/tests/4238452#step/iscsi_client/5 (could not resolve openqa.suse.de) or https://openqa.suse.de/tests/4238437#step/setup/25 (network 10.0.2.1 unreachable).

After checking some of the failed jobs, a pattern emerged: the failed jobs seemed to be limited to openqaworker-arm-1.

Checking on the system, the following could be seen in the status of the os-autoinst-openvswitch service:

openqaworker-arm-1:~ # systemctl status os-autoinst-openvswitch.service
● os-autoinst-openvswitch.service - os-autoinst openvswitch helper
   Loaded: loaded (/usr/lib/systemd/system/os-autoinst-openvswitch.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/os-autoinst-openvswitch.service.d
           └─override.conf
   Active: active (running) since Thu 2020-05-14 14:03:00 UTC; 20h ago
 Main PID: 3367 (os-autoinst-ope)
    Tasks: 1
   CGroup: /system.slice/os-autoinst-openvswitch.service
           └─3367 /usr/bin/perl /usr/lib/os-autoinst/os-autoinst-openvswitch

May 15 10:54:12 openqaworker-arm-1 ovs-vsctl[21376]: ovs|00001|vsctl|INFO|Called as ovs-vsctl remove port tap2 tag 24
May 15 10:54:24 openqaworker-arm-1 ovs-vsctl[21445]: ovs|00001|db_ctl_base|ERR|no port named tap16
May 15 10:54:24 openqaworker-arm-1 os-autoinst-openvswitch[3367]: ovs-vsctl: no port named tap16
May 15 10:54:24 openqaworker-arm-1 os-autoinst-openvswitch[3367]: 'tap16' is not connected to bridge 'br1'
May 15 10:54:52 openqaworker-arm-1 ovs-vsctl[21859]: ovs|00001|db_ctl_base|ERR|no port named tap15
May 15 10:54:52 openqaworker-arm-1 os-autoinst-openvswitch[3367]: ovs-vsctl: no port named tap15
May 15 10:54:52 openqaworker-arm-1 os-autoinst-openvswitch[3367]: 'tap15' is not connected to bridge 'br1'
May 15 10:55:10 openqaworker-arm-1 ovs-vsctl[22196]: ovs|00001|db_ctl_base|ERR|no port named tap10
May 15 10:55:10 openqaworker-arm-1 os-autoinst-openvswitch[3367]: ovs-vsctl: no port named tap10
May 15 10:55:10 openqaworker-arm-1 os-autoinst-openvswitch[3367]: 'tap10' is not connected to bridge 'br1'
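To collect all affected ports from a longer journal, the "is not connected" messages above can be filtered mechanically. A minimal sketch (the `journalctl` invocation in the comment is an assumption; the function itself only parses log text on stdin):

```shell
# Print the unique tap device names that os-autoinst-openvswitch
# reported as "not connected" to the bridge. Reads journal lines
# on stdin, e.g.:
#   journalctl -u os-autoinst-openvswitch.service | failing_taps
failing_taps() {
  grep -o "'tap[0-9]*' is not connected" \
    | grep -o 'tap[0-9]*' \
    | sort -u
}
```

On the excerpt above this would yield tap10, tap15 and tap16, i.e. the three interfaces investigated in the next step.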

Checking those 3 tap interfaces directly: the interfaces were defined in the system, but not listed in the bridge startup configuration:

openqaworker-arm-1:/etc/sysconfig/network # ip a | egrep 'tap16|tap15|tap10'
15: tap10: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
40: tap15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
41: tap16: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
openqaworker-arm-1:/etc/sysconfig/network # egrep 'tap16|tap15|tap10' ifcfg-br1 

Could not continue checking as the server was restarted.
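The check above (tap devices present in `ip a` but absent from `ifcfg-br1`) can be wrapped in a small helper for future incidents. A sketch, assuming the sysconfig path from the transcript; the function name and interface are hypothetical:

```shell
# List the tap devices (given as arguments) that do NOT appear in the
# bridge config file (first argument), i.e. candidates for the
# "defined in the system but not in the bridge" situation seen above.
missing_taps() {
  local cfg=$1
  shift
  local tap
  for tap in "$@"; do
    grep -q "$tap" "$cfg" || echo "$tap"
  done
}
# Example on the worker:
#   missing_taps /etc/sysconfig/network/ifcfg-br1 tap10 tap15 tap16
```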

#5 Updated by pcervinka over 1 year ago

  • Blocked by action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures added

#6 Updated by okurz over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to okurz

As pcervinka rightly stated, this is blocked by #63874. As the relation to that ticket might not have been obvious, I extended the subject of #63874 so that we can close this ticket as a duplicate, unless you see any additional information here.

#7 Updated by okurz over 1 year ago

  • Blocked by deleted (action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures)

#8 Updated by okurz over 1 year ago

  • Is duplicate of action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures added

#9 Updated by okurz over 1 year ago

  • Status changed from In Progress to Rejected

#10 Updated by sebchlad over 1 year ago

And in the meantime I got access to the OSD workers, so I will try to help by maintaining the ARM workers and, when needed, masking unwanted worker instances which should not be there, restarting the network interfaces, etc.
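Masking surplus worker instances (the root cause named in #63874: "numofworkers" reduced in salt pillars while old instances keep running) could be scripted along these lines. A sketch only: the `openqa-worker@.service` unit name and the instance numbering are assumptions about the worker hosts, and the function merely prints the commands so they can be reviewed before running:

```shell
# Print "systemctl mask" commands for worker instances above the
# configured count, e.g. mask_cmds 10 20 prints commands for
# instances 11..20. Pipe to "sudo sh" only after checking the list.
mask_cmds() {
  local keep=$1 max=$2 i
  for i in $(seq $((keep + 1)) "$max"); do
    echo "systemctl mask --now openqa-worker@$i.service"
  done
}
```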
