action #96260

closed

coordination #96185: [epic] Multimachine failure rate increased

Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-07-29
Due date:
% Done: 0%
Estimated time:

Description

Observation

From OSD:

sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'ovs-vsctl show | grep -C 3 error'                     
openqaworker2.suse.de:
openqaworker8.suse.de:
                Interface gre6
                    type: gre
                    options: {remote_ip="10.160.2.20"}
                    error: "could not add network device gre6 to ofproto (File exists)"
            Port gre7
                Interface gre7
                    type: gre
                    options: {remote_ip="10.160.2.20"}
                    error: "could not add network device gre7 to ofproto (File exists)"
            Port tap75
                Interface tap75
            Port tap14
openqaworker3.suse.de:
… same error
openqaworker9.suse.de:
… same error
openqaworker6.suse.de:
… same error
openqaworker5.suse.de:
… same error
QA-Power8-5-kvm.qa.suse.de:
… same error
QA-Power8-4-kvm.qa.suse.de:
malbec.arch.suse.de:
… same error
powerqaworker-qam-1.qa.suse.de:
grenache-1.qa.suse.de:
openqaworker13.suse.de:
openqaworker10.suse.de:
                Interface gre6
                    type: gre
                    options: {remote_ip="10.160.2.20"}
                    error: "could not add network device gre6 to ofproto (File exists)"
            Port tap8
                Interface tap8
            Port tap128
openqaworker-arm-1.suse.de:
                Interface gre6
                    type: gre
                    options: {remote_ip="10.160.2.20"}
                    error: "could not add network device gre6 to ofproto (File exists)"
            Port tap132
                Interface tap132
            Port gre9
    --
                Interface gre5
                    type: gre
                    options: {remote_ip="10.160.2.20"}
                    error: "could not add network device gre5 to ofproto (File exists)"
            Port tap130
                Interface tap130
            Port gre4
openqaworker-arm-2.suse.de:
… same error
openqaworker-arm-3.suse.de:
… same error
ERROR: Minions returned with non-zero exit code

so the same error appears on most workers, but not all. The IPv4 address is 10.160.2.20, which is openqaworker10, and the error also appears on openqaworker10 itself. So it looks like we fail to add a GRE tunnel to the same host?
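
For a narrower view than grepping the whole ovs-vsctl show output, the broken tunnels can also be queried directly (just a sketch; it assumes all tunnel interfaces are of type gre, as in the output above):

# list only GRE interfaces together with their remote_ip and error fields
sudo ovs-vsctl --columns=name,options,error find interface type=gre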

Suggestion


Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #96938: openqaworker10+13 are offline, reason unknown, let's fix other problems first size:M (Resolved, mkittler, 2021-08-16)

Actions #1

Updated by livdywan over 2 years ago

  • Subject changed from Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? to Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by livdywan over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan
Actions #3

Updated by livdywan over 2 years ago

  • Status changed from In Progress to Workable

So in openqa/openvswitch.sls there's this command:

- ovs-vsctl --may-exist add-port $bridge gre{{- loop.index }} -- set interface gre{{- loop.index }} type=gre options:remote_ip={{ pillar['workerconf'][remote]['bridge_ip'] }}

The --may-exist option should mean it's fine if the port already exists. Still, gre* is already set up and that fails? So even adding an --if-exists --with-iface del-port ... would seem redundant?
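
For reference, the delete-then-recreate variant being discussed would look roughly like this (a sketch only, with br1 and gre6 as assumed names; chaining the commands in one ovs-vsctl call keeps them in a single database transaction):

# drop a possibly stale port, then re-add it with the desired remote_ip
ovs-vsctl -- --if-exists del-port br1 gre6 \
          -- --may-exist add-port br1 gre6 \
          -- set interface gre6 type=gre options:remote_ip=10.160.2.20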

Actions #4

Updated by livdywan over 2 years ago

  • Assignee deleted (livdywan)
Actions #5

Updated by mkittler over 2 years ago

Either --may-exist is broken or the port is bound to another bridge. Or the error "File exists" doesn't refer to the port at all but to some other file involved in the setup.

(http://www.openvswitch.org/support/dist-docs/ovs-vsctl.8.txt)
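
If the suspicion is that the port got attached to another bridge, ovs-vsctl can show that directly (a sketch; gre6 is taken from the output above and br1 is assumed to be the multi-machine bridge):

# which bridge currently owns the port, and which ports br1 actually has
ovs-vsctl port-to-br gre6
ovs-vsctl list-ports br1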

Actions #6

Updated by dheidler over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #7

Updated by dheidler over 2 years ago

10.160.2.20 is unreachable:

$ sudo arping -c6 -I eth0 10.160.2.20
ARPING 10.160.2.20 from 10.160.0.207 eth0
Sent 6 probes (6 broadcast(s))
Received 0 response(s)
Actions #8

Updated by okurz over 2 years ago

  • Related to action #96938: openqaworker10+13 are offline, reason unknown, let's fix other problems first size:M added
Actions #9

Updated by dheidler over 2 years ago

  • Status changed from In Progress to Blocked

As openqaworker10 (10.160.2.20) as well as its BMC have been unreachable since the night from the 14th to the 15th, this will need to wait on infra:
https://infra.nue.suse.com/SelfService/Display.html?id=195815&

Actions #10

Updated by okurz over 2 years ago

  • Due date set to 2021-08-31
  • Status changed from Blocked to In Progress

I meant that you should block the other ticket on that one, but never mind. Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/343 to fix the duplicate bridge_ip entries. Then I suggest checking that every worker has a correct config when other machines are taken out of production.
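
A quick way to spot such duplicates would be something along these lines (a sketch, assuming the entries live in openqa/workerconf.sls as grepped further below):

# print any bridge_ip value that occurs more than once in the pillar
grep -oE 'bridge_ip: [0-9.]+' openqa/workerconf.sls | sort | uniq -d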

Actions #11

Updated by okurz over 2 years ago

MR merged and should be deployed. I checked with sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'ovs-vsctl show | grep -C 3 error' but still found the same error as originally, multiple times. I confirmed that the IPv4 entry for openqaworker10 only appears once:

$ git grep -c '10.160.2.20'
openqa/workerconf.sls:1

I explicitly applied a high state on all workers with sudo salt -C 'G@roles:worker' state.apply but it had no effect; the error is still present.

Actions #12

Updated by dheidler over 2 years ago

okurz wrote:

I explicitly applied a high state on all workers with sudo salt -C 'G@roles:worker' state.apply but no effect. Error is still present.

I didn't expect this to change anything while openqaworker10 is still in the list of workers.
Let's try to remove it from that list: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/344

Actions #13

Updated by dheidler over 2 years ago

Actions #14

Updated by dheidler over 2 years ago

  • Status changed from In Progress to Feedback
Actions #15

Updated by dheidler over 2 years ago

  • Status changed from Feedback to In Progress

According to the ovs-vsctl manpage, the --may-exist option means that nothing is done at all if the port already exists. So if any changes are made to the list of remote hosts, the update might even get lost.

So what needs to be done here is e.g. deleting a port before recreating it.
I wonder if this could mess with running MM tests by producing some packet loss until the next command is executed.

Also, ports will not automatically get removed when the config is removed from salt.
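
A preup-style cleanup as suggested could look roughly like this (only a sketch with assumed names; the actual script would be generated from the salt pillar):

# drop all existing GRE ports on the bridge, then recreate them from the current config
bridge=br1
for port in $(ovs-vsctl list-ports "$bridge" | grep '^gre'); do
    ovs-vsctl --if-exists del-port "$bridge" "$port"
done
# re-add one port per remote worker, analogous to the openvswitch.sls line above:
# ovs-vsctl --may-exist add-port "$bridge" gre1 -- set interface gre1 type=gre options:remote_ip=<remote>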

Actions #16

Updated by dheidler over 2 years ago

  • Status changed from In Progress to Feedback
Actions #17

Updated by dheidler over 2 years ago

Merged and will monitor multi-machine jobs.

Actions #18

Updated by dheidler over 2 years ago

Turns out that I had to manually run ifup br1 on the workers to actually apply the config.
This means there is actually no risk of packet loss from the preup script that removes and re-adds all the GRE ports, since it only runs when the interface is brought up. It might, however, lead to inconsistencies if the config is not applied after the script is updated.

At least IF the script is executed, there are no more duplicate ports or missing entries.
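
Applying it everywhere presumably boils down to something like the following, mirroring the salt invocations used earlier (hedged, not necessarily the exact command that was run):

sudo salt -C 'G@roles:worker' cmd.run 'ifup br1'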

Actions #19

Updated by dheidler over 2 years ago

When re-adding openqaworker10 to production, an unrelated exception occurred that should be fixed by this MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/563

Actions #20

Updated by dheidler over 2 years ago

  • Status changed from Feedback to Resolved
Actions #21

Updated by okurz over 2 years ago

I verified that sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'ovs-vsctl show | grep -C 3 error' reports no errors, which is fine. Now, looking at the parent ticket #96185, I wonder if our multi-machine failure rate decreased. So I added #96191 to the backlog now.

Actions #22

Updated by okurz over 2 years ago

  • Due date deleted (2021-08-31)