action #109494

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

openQA Project - coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

Restore network connection of arm-4/5 size:M

Added by mkittler 3 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2022-04-05
Due date:
% Done:

0%

Estimated time:

Description

Observation

After starting arm-4/5 today (to investigate #109232) both workers were unable to get an IP address.

I couldn't find any messages from them in /var/log/messages on qanet.qa.suse.de. An arping from arm-4 also got no reply:

openqaworker-arm-4:~ # arping -I eth0 -b 10.162.0.1
ARPING 10.162.0.1 from 10.0.2.2 eth0

There are no failed systemd units, including wicked. The network config looks like this:

openqaworker-arm-4:~ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 18:c0:4d:8c:82:8e brd ff:ff:ff:ff:ff:ff
    altname eno1
    altname enp11s0f0
    inet6 fe80::1ac0:4dff:fe8c:828e/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 18:c0:4d:8c:82:8f brd ff:ff:ff:ff:ff:ff
    altname eno2
    altname enp11s0f1
18: ovs-system: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether ae:1e:c1:a3:0f:39 brd ff:ff:ff:ff:ff:ff
19: br1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 4e:40:d5:d2:bf:43 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.2/15 brd 10.1.255.255 scope global br1
       valid_lft forever preferred_lft forever
    inet6 fe80::4c40:d5ff:fed2:bf43/64 scope link 
       valid_lft forever preferred_lft forever
20: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 72:c6:8a:8a:03:d2 brd ff:ff:ff:ff:ff:ff
21: tap64: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether f2:41:3b:31:bb:29 brd ff:ff:ff:ff:ff:ff
22: tap128: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether f6:fc:23:fa:80:c3 brd ff:ff:ff:ff:ff:ff
23: tap1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 62:ee:a0:ff:3e:38 brd ff:ff:ff:ff:ff:ff
24: tap65: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 4a:4e:4c:01:9c:79 brd ff:ff:ff:ff:ff:ff
25: tap129: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 2a:f6:47:e0:e8:88 brd ff:ff:ff:ff:ff:ff
26: tap2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 2e:6c:54:d2:6c:fc brd ff:ff:ff:ff:ff:ff
27: tap66: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether f6:c8:0f:c7:de:7a brd ff:ff:ff:ff:ff:ff
28: tap130: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 8e:e0:72:c5:d3:4f brd ff:ff:ff:ff:ff:ff
29: tap3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether ee:32:a5:78:2b:eb brd ff:ff:ff:ff:ff:ff
30: tap67: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 8a:d3:a0:2e:f4:f0 brd ff:ff:ff:ff:ff:ff
31: tap131: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 7e:4e:54:78:eb:14 brd ff:ff:ff:ff:ff:ff

So eth0 is up but has no IPv4 address, only an IPv6 link-local one. It looks the same on arm-5.
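The check done by eye on the `ip addr` output can be sketched as a small script: list interfaces that report a carrier (`LOWER_UP`) but hold no IPv4 address, and confirm that the arping target lies outside the br1 subnet. This is only an illustration and was not used in this ticket. The helper names (`parse_ip_addr`, `carrier_without_ipv4`, `same_subnet`) are hypothetical, and the sample is an abbreviated copy of the arm-4 output.

```python
import ipaddress
import re

# Abbreviated sample in the format of the `ip addr` output from arm-4.
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 18:c0:4d:8c:82:8e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1ac0:4dff:fe8c:828e/64 scope link
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 18:c0:4d:8c:82:8f brd ff:ff:ff:ff:ff:ff
19: br1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 4e:40:d5:d2:bf:43 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.2/15 brd 10.1.255.255 scope global br1
"""

# Matches an interface header such as "2: eth0: <BROADCAST,...>".
HEADER = re.compile(r"^\d+: ([^:@]+)(?:@\S+)?: <([^>]*)>")


def parse_ip_addr(text):
    """Map interface name -> {"flags": [...], "ipv4": [...]}."""
    interfaces = {}
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = {"flags": match.group(2).split(","), "ipv4": []}
            interfaces[match.group(1)] = current
        elif current is not None:
            fields = line.split()
            # Only "inet" lines carry IPv4 addresses; "inet6" is skipped.
            if len(fields) >= 2 and fields[0] == "inet":
                current["ipv4"].append(fields[1])
    return interfaces


def carrier_without_ipv4(text):
    """Names of interfaces that have a carrier but no IPv4 address."""
    return [name for name, info in parse_ip_addr(text).items()
            if "LOWER_UP" in info["flags"] and not info["ipv4"]]


def same_subnet(interface_cidr, target):
    """True if target lies inside the network of interface_cidr."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(target) in network


if __name__ == "__main__":
    print(carrier_without_ipv4(SAMPLE))
    print(same_subnet("10.0.2.2/15", "10.162.0.1"))
```

On the sample, only eth0 is reported. The only IPv4 address on the host apart from loopback is the br1 address, 10.0.2.2, which is why arping shows it as the source, and the target 10.162.0.1 is not part of that bridge network.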

History

#1 Updated by mkittler 3 months ago

  • Blocks action #109232: Document relevant differences of arm-4/5 vs. arm-1/2/3 and aarch64.o.o, involve domain experts in asking what parameters are important to be able to run openQA tests size:M added

#2 Updated by mkittler 3 months ago

  • Project changed from openQA Project to openQA Infrastructure

#3 Updated by okurz 3 months ago

  • Priority changed from Normal to High
  • Target version set to Ready

#4 Updated by mkittler 3 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

Since we don't know which switches those workers are connected to, I'll file an Infra ticket.

#5 Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback

#6 Updated by mkittler 3 months ago

That's the ip addr output on arm-5 (for the MAC address):

openqaworker-arm-5:~ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 18:c0:4d:06:ce:57 brd ff:ff:ff:ff:ff:ff
    altname eno1
    altname enp11s0f0
    inet6 fe80::1ac0:4dff:fe06:ce57/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 18:c0:4d:06:ce:58 brd ff:ff:ff:ff:ff:ff
    altname eno2
    altname enp11s0f1
4: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether ee:2e:6a:fe:6a:e1 brd ff:ff:ff:ff:ff:ff
5: tap64: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 86:8b:d4:d6:2c:dd brd ff:ff:ff:ff:ff:ff
6: tap128: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 36:7a:b4:24:ae:a8 brd ff:ff:ff:ff:ff:ff
7: tap1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 26:b5:b3:1b:cb:c5 brd ff:ff:ff:ff:ff:ff
8: tap65: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 96:bf:fa:d1:e4:cd brd ff:ff:ff:ff:ff:ff
9: tap129: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 22:b2:6c:31:6f:89 brd ff:ff:ff:ff:ff:ff
10: tap2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 9e:18:69:25:1b:58 brd ff:ff:ff:ff:ff:ff
11: tap66: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 36:be:02:ab:5c:4c brd ff:ff:ff:ff:ff:ff
12: tap130: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether d6:44:54:0b:76:1e brd ff:ff:ff:ff:ff:ff
13: tap3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 3a:d6:af:60:c6:84 brd ff:ff:ff:ff:ff:ff
14: tap67: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether 7e:b7:6e:fd:47:24 brd ff:ff:ff:ff:ff:ff
15: tap131: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master ovs-system state DOWN group default qlen 1000
    link/ether ae:ee:00:92:c9:cc brd ff:ff:ff:ff:ff:ff
16: ovs-system: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether ce:5c:e0:f6:e4:56 brd ff:ff:ff:ff:ff:ff
17: br1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether ea:c8:48:36:27:48 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.2/15 brd 10.1.255.255 scope global br1
       valid_lft forever preferred_lft forever
    inet6 fe80::e8c8:48ff:fe36:2748/64 scope link 
       valid_lft forever preferred_lft forever

#7 Updated by nicksinger 3 months ago

  • Assignee changed from mkittler to nicksinger

I've asked lmb@suse.de to help us reconfigure the switches. I'll take the ticket until the machines have a proper network connection.

#8 Updated by nicksinger 2 months ago

Through the office grapevine ("Flurfunk") I heard that lmb seems to be out of office currently. So I decided to move these machines into a different rack, to a switch controlled by us. The new location is Rack 4, right next to our "QA Racks": https://racktables.suse.de/index.php?page=rack&rack_id=522. I've also added cable connections to https://racktables.suse.de/index.php?page=object&tab=ports&object_id=11969 (host OS ethernet uplink) and https://racktables.suse.de/index.php?page=object&tab=ports&object_id=996 (BMC connections). qanet20nue was reconfigured to provide untagged VLAN12 to the BMCs, and both can be reached under their documented IP/hostname. openqaworker-arm-4 was able to receive an IP from qanet again and is now reachable normally via ssh. Unfortunately the fiber connection for arm-5 is broken: the switch reports "RX Loose" and no IP assignment is possible. This cable needs to be replaced again.

#9 Updated by mkittler 2 months ago

  • Blocks deleted (action #109232: Document relevant differences of arm-4/5 vs. arm-1/2/3 and aarch64.o.o, involve domain experts in asking what parameters are important to be able to run openQA tests size:M)

#10 Updated by okurz 2 months ago

  • Subject changed from Restore network connection of arm-4/5 to Restore network connection of arm-4/5 size:M

nicksinger plans to do the physical work over the next few days.

#11 Updated by nicksinger about 2 months ago

  • Status changed from Feedback to Resolved

I switched the fiber cables. Now arm-5 is also reachable again:

selenium ~ » ping openqaworker-arm-5
PING openqaworker-arm-5.qa.suse.de (10.162.6.203) 56(84) bytes of data.
64 bytes from openqaworker-arm-5.qa.suse.de (10.162.6.203): icmp_seq=1 ttl=64 time=0.219 ms
64 bytes from openqaworker-arm-5.qa.suse.de (10.162.6.203): icmp_seq=2 ttl=64 time=0.198 ms
