action #96260
coordination #96185 (closed): [epic] Multimachine failure rate increased
Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? size:M
Description
Observation
From OSD:
sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'ovs-vsctl show | grep -C 3 error'
openqaworker2.suse.de:
openqaworker8.suse.de:
Interface gre6
type: gre
options: {remote_ip="10.160.2.20"}
error: "could not add network device gre6 to ofproto (File exists)"
Port gre7
Interface gre7
type: gre
options: {remote_ip="10.160.2.20"}
error: "could not add network device gre7 to ofproto (File exists)"
Port tap75
Interface tap75
Port tap14
openqaworker3.suse.de:
… same error
openqaworker9.suse.de:
… same error
openqaworker6.suse.de:
… same error
openqaworker5.suse.de:
… same error
QA-Power8-5-kvm.qa.suse.de:
… same error
QA-Power8-4-kvm.qa.suse.de:
malbec.arch.suse.de:
… same error
powerqaworker-qam-1.qa.suse.de:
grenache-1.qa.suse.de:
openqaworker13.suse.de:
openqaworker10.suse.de:
Interface gre6
type: gre
options: {remote_ip="10.160.2.20"}
error: "could not add network device gre6 to ofproto (File exists)"
Port tap8
Interface tap8
Port tap128
openqaworker-arm-1.suse.de:
Interface gre6
type: gre
options: {remote_ip="10.160.2.20"}
error: "could not add network device gre6 to ofproto (File exists)"
Port tap132
Interface tap132
Port gre9
--
Interface gre5
type: gre
options: {remote_ip="10.160.2.20"}
error: "could not add network device gre5 to ofproto (File exists)"
Port tap130
Interface tap130
Port gre4
openqaworker-arm-2.suse.de:
… same error
openqaworker-arm-3.suse.de:
… same error
ERROR: Minions returned with non-zero exit code
so the same error appears on most workers, but not all. The IPv4 address is 10.160.2.20, which is openqaworker10, and the job is running on openqaworker10 itself. So it looks like we fail to add a GRE tunnel to the same host?
Suggestion
- Look into https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines - seems to be green atm?
- Check gre iface definitions in salt (see the sketch below)
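A possible way to spot duplicate entries in the pillar data (a sketch; it assumes the values live as plain "bridge_ip: <IPv4>" lines in openqa/workerconf.sls, the file referenced later in this ticket, and GNU grep with -P support):
# print bridge_ip values that occur more than once in the pillar
grep -oP 'bridge_ip:\s*\K[0-9.]+' openqa/workerconf.sls | sort | uniq -d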
Updated by livdywan over 3 years ago
- Subject changed from Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? to Failed to add GRE tunnel to openqaworker10 on most OSD workers, recent regression explaining multi-machine errors? size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by livdywan over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
Updated by livdywan over 3 years ago
- Status changed from In Progress to Workable
So in openqa/openvswitch.sls there's this command:
- ovs-vsctl --may-exist add-port $bridge gre{{- loop.index }} -- set interface gre{{- loop.index }} type=gre options:remote_ip={{ pillar['workerconf'][remote]['bridge_ip'] }}
The --may-exist option should mean it's fine if the port already exists. Still, gre* is already set up and that fails? So even adding an --if-exists --with-iface del-port ... would seem redundant?
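For reference, spelled out as plain shell (bridge name br1 and the gre6/10.160.2.20 pair are taken from the output above, so treat them as placeholders), the unconditional re-add being discussed would look roughly like:
# drop the port and its interface if already present, then recreate it
ovs-vsctl --if-exists --with-iface del-port br1 gre6
ovs-vsctl --may-exist add-port br1 gre6 -- set interface gre6 type=gre options:remote_ip=10.160.2.20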
Updated by mkittler over 3 years ago
Either --may-exist is broken or the port is bound to another bridge. Or the error "File exists" doesn't mean this is about the port but some other file involved in the setup (http://www.openvswitch.org/support/dist-docs/ovs-vsctl.8.txt).
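A few hedged checks that could help distinguish these cases on an affected worker (port and bridge names taken from the output above):
# which bridge does OVS think the port belongs to?
ovs-vsctl port-to-br gre6
# ports currently attached to the production bridge
ovs-vsctl list-ports br1
# does a kernel network device of that name already exist outside of OVS?
ip -d link show gre6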
Updated by dheidler over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to dheidler
Updated by dheidler over 3 years ago
10.160.2.20 is unreachable:
$ sudo arping -c6 -I eth0 10.160.2.20
ARPING 10.160.2.20 from 10.160.0.207 eth0
Sent 6 probes (6 broadcast(s))
Received 0 response(s)
Updated by okurz over 3 years ago
- Related to action #96938: openqaworker10+13 are offline, reason unknown, let's fix other problems first size:M added
Updated by dheidler over 3 years ago
- Status changed from In Progress to Blocked
As openqaworker10 (10.160.2.20) as well as its BNC have been unreachable since the night from the 14th to the 15th, this will need to wait on infra:
https://infra.nue.suse.com/SelfService/Display.html?id=195815&
Updated by okurz over 3 years ago
- Due date set to 2021-08-31
- Status changed from Blocked to In Progress
I meant that you should block the other ticket on that one, but never mind. Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/343 to fix the duplicate bridge_ip entries. Then I suggest checking that every worker has a correct config when other machines are taken out of production.
Updated by okurz over 3 years ago
MR merged and should be deployed. I checked with sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'ovs-vsctl show | grep -C 3 error'
but still found the same error as originally, multiple times. I confirmed that the IPv4 entry for openqaworker10 only appears once:
$ git grep -c '10.160.2.20'
openqa/workerconf.sls:1
I explicitly applied a high state on all workers with sudo salt -C 'G@roles:worker' state.apply
but no effect. Error is still present.
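One way to cross-check what is actually configured in OVS on each worker, rather than only grepping the pillar (a sketch following the ovs-vsctl manpage's find syntax):
# show every GRE interface and its remote_ip on all workers
sudo salt -C 'G@roles:worker' cmd.run 'ovs-vsctl --columns=name,options find interface type=gre'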
Updated by dheidler over 3 years ago
okurz wrote:
I explicitly applied a high state on all workers with
sudo salt -C 'G@roles:worker' state.apply
but no effect. Error is still present.
I didn't expect this to change when openqaworker10 is still in the list of workers.
Let's try to remove it from that list: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/344
Updated by dheidler over 3 years ago
Also took openqaworker10 out of production as described here: https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production
Updated by dheidler over 3 years ago
- Status changed from In Progress to Feedback
Updated by dheidler over 3 years ago
- Status changed from Feedback to In Progress
Reading the manpage of ovs-vsctl, the --may-exist option results in nothing being done at all if the port already exists. So if there are any changes to the list of remote hosts, the update might even get lost. So what needs to be done here is e.g. deleting a port before recreating it.
I wonder if this could mess with running MM tests by producing some packet loss until the next command is executed.
Also ports will not automatically get removed when the config is removed from salt.
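Since ovs-vsctl applies all commands chained with -- on a single command line as one database transaction, the remove-and-re-add can at least be issued in one invocation, which should keep the window without the port small. A sketch with the same placeholder names as above:
# delete and recreate the tunnel port in a single ovs-vsctl invocation
ovs-vsctl --if-exists del-port br1 gre6 -- --may-exist add-port br1 gre6 -- set interface gre6 type=gre options:remote_ip=10.160.2.20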
Updated by dheidler over 3 years ago
- Status changed from In Progress to Feedback
Updated by dheidler over 3 years ago
Merged and will monitor multi-machine jobs.
Updated by dheidler over 3 years ago
Turns out that I had to manually run ifup br1 on the workers to actually apply the config. This means that there is actually no risk of packet loss from running the pre-up script that removes and re-adds all the GRE ports, since it only runs on ifup. It might however lead to inconsistencies if the config is written but not applied. At least IF the script is executed, there are no more duplicate ports or missing entries.
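For the record, triggering that re-apply on all workers could also be done via salt (a sketch; it assumes br1 is the bridge name everywhere):
# re-run ifup for the bridge, which executes the pre-up script that recreates the GRE ports
sudo salt -C 'G@roles:worker' cmd.run 'ifup br1'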
Updated by dheidler over 3 years ago
When re-adding openqaworker10 to production, an unrelated exception occurred that should be fixed by this MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/563