action #160646
Status: closed
openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all size:M
Description
Observation¶
Originally reported in https://suse.slack.com/archives/C02CANHLANP/p1716169544132569
(Richard Fan) Hello experts, many multi-machine tests are failing, like the MM failed jobs on qe-core
As visible in https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1716124476616&to=1716212983933&viewPanel=24
there is an increase of parallel_failed jobs since 2024-05-19 22:30.
E.g.
openQA test in scenario sle-15-SP3-Server-DVD-Updates-x86_64-qam_kernel_multipath@64bit fails in
multipath_iscsi
This shows that it is not an MTU problem, as the error message is "connect: Network is unreachable".
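A quick way to confirm this on an affected worker is to check whether the multi-machine bridge has any GRE ports at all; a sketch (the bridge name br1 is an assumption, adjust to the local setup):
ovs-vsctl list-ifaces br1 | grep gre    # no output means no GRE tunnels on that bridge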
ssh worker29.oqa.prg2.suse.org "cat /etc/wicked/scripts/gre_tunnel_preup.sh"
shows the problem:
#!/bin/sh
action="$1"
bridge="$2"
# enable STP for the multihost bridges
ovs-vsctl set bridge $bridge stp_enable=false
ovs-vsctl set bridge $bridge rstp_enable=true
for gre_port in $(ovs-vsctl list-ifaces $bridge | grep gre) ; do ovs-vsctl --if-exists del-port $bridge $gre_port ; done
After the last line there should be a list of GRE tunnel interface setup calls towards the other machines of the cluster, but no GRE tunnels are set up at all.
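For comparison, on a healthy worker the script ends with one such call per remote machine, along the lines of the entry visible in the rendered state further down (worker35 as the example):
ovs-vsctl --may-exist add-port $bridge gre8 -- set interface gre8 type=gre options:remote_ip=10.145.10.8 # worker35
# ...plus one more add-port line per additional remote worker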
Reproducible¶
Fails since
https://openqa.suse.de/tests/overview?result=parallel_failed&distri=sle&version=15-SP4&build=20240519-1
Expected result¶
Last good: 20240519-1 (or more recent)
Suggestions¶
- DONE Mitigate -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814
- Apply workarounds
- Retrigger affected jobs
- Investigate error source
- Fix the problem or at least abort the generation with an error if the GRE section of the generated script would be completely empty? (see the sketch after this list)
- Prevent the same and similar problems in the future
- Apply rollback steps
- Monitor effect carefully
- Look into https://stats.openqa-monitor.qa.suse.de/alerting/grafana/0XohcmfVk/view?orgId=1
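For the "abort the generation" suggestion above, a minimal sketch of such a guard, written here as a post-generation sanity check rather than inside the template itself (path and pattern are assumptions, not the implemented fix):
#!/bin/sh
# hypothetical check: fail loudly if the generated preup script contains no
# GRE add-port calls at all, which is exactly the symptom observed here
preup=/etc/wicked/scripts/gre_tunnel_preup.sh
if ! grep -q 'add-port' "$preup"; then
    echo "ERROR: $preup contains no GRE tunnel setup calls" >&2
    exit 1
fi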
Rollback steps¶
Further details¶
Always latest result in the originally mentioned scenario: latest
Updated by nicksinger 6 months ago
I ran
salt 'worker29.oqa.prg2.suse.org' cp.get_template salt://openqa/openvswitch.sls /tmp/openvswitch
to check the content of our state, which renders like this:
# Worker for GRE needs to have defined entry bridge_ip: <uplink_address_of_this_worker> in pillar data
/etc/wicked/scripts/gre_tunnel_preup.sh:
  file.managed:
    - user: root
    - group: root
    - mode: "0744"
    - makedirs: true
    - contents:
      - '#!/bin/sh'
      - action="$1"
      - bridge="$2"
      - '# enable STP for the multihost bridges'
      - ovs-vsctl set bridge $bridge stp_enable=false
      - ovs-vsctl set bridge $bridge rstp_enable=true
      - for gre_port in $(ovs-vsctl list-ifaces $bridge | grep gre) ; do ovs-vsctl --if-exists del-port $bridge $gre_port ; done
      - 'ovs-vsctl --may-exist add-port $bridge gre8 -- set interface gre8 type=gre options:remote_ip=10.145.10.8 # worker35'

wicked ifup all:
  cmd.run:
    - onchanges:
      - file: /etc/wicked/scripts/gre_tunnel_preup.sh
That indicates that our issue is between these lines: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls#L98-128
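To narrow that down, the data this part of the template consumes can be inspected directly on OSD; a sketch (the exact mine function name is an assumption, see openvswitch.sls for the one actually used):
# render the state for a single worker as above ...
salt 'worker29.oqa.prg2.suse.org' cp.get_template salt://openqa/openvswitch.sls /tmp/openvswitch
# ... and compare with what the mine actually returns for the workers (function name assumed)
salt 'worker29.oqa.prg2.suse.org' mine.get 'worker*' network.interfaces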
Updated by okurz 6 months ago
- Description updated (diff)
Added suggestions with mitigations, investigation hints, workarounds, etc.
Merged https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814 and called
failed_since="2024-02-19 18:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo160646" openqa-advanced-retrigger-jobs
Updated by okurz 6 months ago · Edited
nicksinger wrote in #note-3:
[…]
that indicates that our issues is between these lines: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls#L98-128
Yes. I was debugging that script by inserting commands like
{%- do salt.log.error('testing jinja logging') -%}
on OSD, executing salt commands against a single worker and checking the minion log on that worker. That revealed remote_bridge_interface for host worker40: []
so apparently an empty list for remote_bridge_interface. Then I called the command from #160646-2 and could confirm that no IP address entries show up. But after merging my workaround https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814, which does not directly change any of those parameters, valid values now show up again. My current hypothesis is that at the time the salt high state was last applied the grains did not contain valid data, so our config was created with empty GRE setup scripts.
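For reference, that debugging loop boils down to something like the following (state path and minion log location are assumptions based on defaults):
# on OSD: render the state for a single worker so the injected salt.log.error() calls fire
salt 'worker40.oqa.prg2.suse.org' state.show_sls openqa.openvswitch
# on the worker: watch the salt-minion log for the logged values
ssh worker40.oqa.prg2.suse.org tail -f /var/log/salt/minion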
Updated by nicksinger 6 months ago
Reading our jinja template that generates gre_tunnel_preup.sh and taking https://progress.opensuse.org/issues/160646#note-5 into consideration, I think that either our grains went missing on the hosts and therefore populated a wrong mine on each minion, or the mine itself was producing wrong output.
Looking for possible places to fail, I came across a check introduced with https://progress.opensuse.org/issues/130835#note-1 - I think without it we would have seen a failing pipeline. Anyhow, just removing that check won't cut it, because it covers a valid scenario: disabling a salt-minion (blacklisting its key on OSD) while keeping its data present in workerconf.sls.
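If that hypothesis holds, refreshing the grains and re-populating the mine on an affected minion should bring the data back; a sketch:
# refresh grains and mine data on a suspicious worker, then re-render the state
salt 'worker40.oqa.prg2.suse.org' saltutil.refresh_grains
salt 'worker40.oqa.prg2.suse.org' mine.update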
Updated by openqa_review 6 months ago
- Due date set to 2024-06-05
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 6 months ago
- Status changed from In Progress to Feedback
I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1196 for future debugging and will enable all instances again by basically reverting https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814 because the mine is populated properly again (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/818).
I'm not sure how to approach this. I don't fully understand why this happened in the first place and can only explain it by the mine containing or producing wrong information which we can't really check for.
I was thinking along the lines of just resolving the cluster remote via DNS, as this is basically what we do with the current template anyway (evaluate the key from workerconf.sls into a public IP of the remote worker).
We're not exactly sure if this would affect jobs (https://suse.slack.com/archives/C02AJ1E568M/p1716313457649849) and I don't see an easy way to get the domain from the worker (as this information is stripped in our workerconf) without using the mine again.
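For illustration, the DNS-based variant would replace the templated IP with a lookup at ifup time; a purely hypothetical sketch reusing the worker35 entry from above (hostname/domain handling is exactly the open question):
# hypothetical: resolve the GRE peer at runtime instead of templating its IP
remote_ip=$(getent hosts worker35.oqa.prg2.suse.org | awk '{print $1}')
ovs-vsctl --may-exist add-port $bridge gre8 -- set interface gre8 type=gre options:remote_ip="$remote_ip" # worker35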
Updated by nicksinger 6 months ago
- Copied to action #160826: Optimize gre_tunnel_preup.sh generation jinja template size:S added
Updated by nicksinger 6 months ago
- Status changed from Feedback to In Progress
Technically waiting for the MRs to be merged, but I am keeping an eye on the situation.
Updated by pcervinka 6 months ago
Could we please merge the solution? I think that due to the limitation to one worker we have had jobs pending for 3 days, like https://openqa.suse.de/tests/14398639.
Updated by livdywan 6 months ago
pcervinka wrote in #note-14:
Could we please merge the solution? I think that due to the limitation to one worker we have had jobs pending for 3 days, like https://openqa.suse.de/tests/14398639.
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1196 was merged, and we're monitoring the results now
Updated by nicksinger 6 months ago
- Due date deleted (2024-06-05)
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
- Priority changed from Urgent to High
The revert is merged but I fail to come up with a proper long-term solution to avoid this happening again. Maybe we can brainstorm this together.
Updated by okurz 6 months ago
- Related to action #161381: multi-machine test network issues reported 2024-06-03 due to missing content in the salt mine size:S added
Updated by ybonatakis 6 months ago
I ran a statistical analysis:
for i in {01..50} ; do openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de 14514267 TEST+=-${USER}_poo160646_$i BUILD=poo160646_investigation _GROUP="Test Development: SLE 15" ; done
https://openqa.suse.de/tests/14517120
All jobs passed.
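For the record, the outcome of such an investigation build can also be checked in bulk over the API, assuming openqa-cli and jq are available (the build value matches the clone loop above):
# list id and result of every job in the investigation build
openqa-cli api --host https://openqa.suse.de jobs build=poo160646_investigation | jq '.jobs[] | {id, result}'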
I see that @nicksinger has applied some changes but I am not sure how they solved the problem, or whether it was resolved by another ticket with some other resolution (for instance https://progress.opensuse.org/issues/161381).
I think most of the ACs are covered but I can't tell if "Prevent the same and similar problems in the future" is satisfied.
Updated by okurz 6 months ago
- Status changed from Workable to Resolved
ybonatakis wrote in #note-20:
I ran a statistical analysis:
for i in {01..50} ; do openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de 14514267 TEST+=-${USER}_poo160646_$i BUILD=poo160646_investigation _GROUP="Test Development: SLE 15" ; done
https://openqa.suse.de/tests/14517120
All jobs passed.
I see that @nicksinger has applied some changes but I am not sure how they solved the problem, or whether it was resolved by another ticket with some other resolution (for instance #161381). I think most of the ACs are covered but I can't tell if "Prevent the same and similar problems in the future" is satisfied.
That's fine. We will follow up in #161735
Rollback actions are already covered with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/818 I guess. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 looks all good again, so I guess we are done here.