action #160646 (closed)

openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all size:M

Added by okurz about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-05-21
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Originally reported in https://suse.slack.com/archives/C02CANHLANP/p1716169544132569

(Richard Fan) Hello experts, many Multi-machine tests are failed like MM failed jobs on qe-core (edited)

After that, as visible in https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1716124476616&to=1716212983933&viewPanel=24, there is an increase of parallel_failed jobs starting 2024-05-19 22:30.

E.g. the openQA test in scenario sle-15-SP3-Server-DVD-Updates-x86_64-qam_kernel_multipath@64bit fails in multipath_iscsi and shows that it is not a problem with the MTU, as the error message is "connect: Network is unreachable".

ssh worker29.oqa.prg2.suse.org "cat /etc/wicked/scripts/gre_tunnel_preup.sh" shows a problem

#!/bin/sh
action="$1"
bridge="$2"
# enable STP for the multihost bridges
ovs-vsctl set bridge $bridge stp_enable=false
ovs-vsctl set bridge $bridge rstp_enable=true
for gre_port in $(ovs-vsctl list-ifaces $bridge | grep gre) ; do ovs-vsctl --if-exists del-port $bridge $gre_port ; done

There should be a list of GRE tunnel interface setup calls between those machines after the last line, but no GRE tunnels are set up at all.
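
For reference, a healthy gre_tunnel_preup.sh would end with one ovs-vsctl add-port call per remote multi-machine worker, roughly like the following sketch (only the gre8/worker35 line is taken from the rendered state shown in #note-3; the second line is an illustrative placeholder, not an actual configured tunnel):

ovs-vsctl --may-exist add-port $bridge gre8 -- set interface gre8 type=gre options:remote_ip=10.145.10.8 # worker35
ovs-vsctl --may-exist add-port $bridge gre<N> -- set interface gre<N> type=gre options:remote_ip=<uplink_ip_of_other_worker> # other worker in the cluster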

Reproducible

Fails since
https://openqa.suse.de/tests/overview?result=parallel_failed&distri=sle&version=15-SP4&build=20240519-1

Expected result

Last good: 20240519-1 (or more recent)

Suggestions

Rollback steps

Further details

Always latest result in the originally mentioned scenario: latest


Related issues 2 (1 open, 1 closed)

Related to openQA Infrastructure - action #161381: multi-machine test network issues reported 2024-06-03 due to missing content in the salt mine size:S (Resolved, mkittler, 2024-06-03 - 2024-06-18)

Copied to openQA Infrastructure - action #160826: Optimize gre_tunnel_preup.sh generation jinja template (New, 2024-05-21)

Actions #1

Updated by okurz about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #2

Updated by okurz about 2 months ago

sudo salt --no-color 'worker40*' grains.get 'ip4_interfaces' shows

worker40.oqa.prg2.suse.org:
    ----------
    br1:
    erspan0:
    eth0:
    eth1:
    gre0:

I think there should be some entries populated
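
A first step to re-check could be to refresh the grains and the mine data on the affected worker and query again, e.g. (sketch using standard salt modules, not commands recorded in this ticket):

sudo salt 'worker40*' saltutil.refresh_grains
sudo salt 'worker40*' mine.update
sudo salt --no-color 'worker40*' grains.get 'ip4_interfaces'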

Actions #3

Updated by nicksinger about 2 months ago

I ran salt 'worker29.oqa.prg2.suse.org' cp.get_template salt://openqa/openvswitch.sls /tmp/openvswitch to check the content of our state, which renders like this:

# Worker for GRE needs to have defined entry bridge_ip: <uplink_address_of_this_worker> in pillar data
/etc/wicked/scripts/gre_tunnel_preup.sh:
  file.managed:
    - user: root
    - group: root
    - mode: "0744"
    - makedirs: true
    - contents:
      - '#!/bin/sh'
      - action="$1"
      - bridge="$2"
      - '# enable STP for the multihost bridges'
      - ovs-vsctl set bridge $bridge stp_enable=false
      - ovs-vsctl set bridge $bridge rstp_enable=true
      - for gre_port in $(ovs-vsctl list-ifaces $bridge | grep gre) ; do ovs-vsctl --if-exists del-port $bridge $gre_port ; done
      - 'ovs-vsctl --may-exist add-port $bridge gre8 -- set interface gre8 type=gre options:remote_ip=10.145.10.8 # worker35'

wicked ifup all:
  cmd.run:
    - onchanges:
      - file: /etc/wicked/scripts/gre_tunnel_preup.sh

That indicates that our issue is between these lines: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls#L98-128
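
To quantify the breakage after a change, one could re-render the state on a worker and count the generated add-port calls; a healthy render should contain roughly one per other MM-capable worker (sketch building on the cp.get_template call above):

salt 'worker29.oqa.prg2.suse.org' cp.get_template salt://openqa/openvswitch.sls /tmp/openvswitch
salt 'worker29.oqa.prg2.suse.org' cmd.run 'grep -c add-port /tmp/openvswitch'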

Actions #4

Updated by okurz about 2 months ago

  • Description updated (diff)

Added suggestions with mitigations, investigation hints, workarounds, etc.

Merged https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814 and called

failed_since="2024-02-19 18:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo160646" openqa-advanced-retrigger-jobs
Actions #5

Updated by okurz about 2 months ago · Edited

nicksinger wrote in #note-3:

[…]
That indicates that our issue is between these lines: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls#L98-128

Yes. I was debugging that script by inserting commands like

{%- do salt.log.error('testing jinja logging') -%}

on OSD, executing salt commands against one worker and checking the minion log on that worker. That revealed remote_bridge_interface for host worker40: [], so apparently an empty list for remote_bridge_interface. Then I called the command from #160646-2 and could confirm that no IP address entries show up. But after merging my workaround https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814, which does not directly change any of those parameters, valid values now show up again. My current hypothesis is that the last time the salt high state was applied the grains did not have valid data and our config was created with empty GRE setup scripts.
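
For reference, such a debug rendering can be triggered against a single worker and the jinja log output read back like this (sketch; the state name openqa.openvswitch is inferred from salt://openqa/openvswitch.sls and the minion log path is the salt default):

sudo salt 'worker40.oqa.prg2.suse.org' state.apply openqa.openvswitch test=True
ssh worker40.oqa.prg2.suse.org "grep 'testing jinja logging' /var/log/salt/minion"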

Actions #6

Updated by okurz about 2 months ago

  • Description updated (diff)
Actions #7

Updated by livdywan about 2 months ago

  • Subject changed from multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all to multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all size:M
  • Description updated (diff)
Actions #8

Updated by okurz about 2 months ago

  • Description updated (diff)
  • Assignee changed from okurz to nicksinger
  • Priority changed from Immediate to Urgent

As discussed, nicksinger continues here. @nicksinger, at your own discretion continue to investigate, revert the workaround, etc.

Actions #9

Updated by nicksinger about 2 months ago

Actually reading our jinja template that generates gre_tunnel_preup.sh and taking https://progress.opensuse.org/issues/160646#note-5 into consideration, I think that either our grains went missing on the hosts and therefore populated a wrong mine on each minion, or the mine was producing wrong output.
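
Both hypotheses could be checked directly by querying the mine from the perspective of one worker, e.g. (sketch; that the mine exposes the data under the function name ip4_interfaces is an assumption here):

sudo salt 'worker29.oqa.prg2.suse.org' mine.get '*' ip4_interfaces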

After checking possible places to fail, I came across a check introduced with https://progress.opensuse.org/issues/130835#note-1 - I think without it we would have seen a failing pipeline. Anyhow, just removing it won't cut it, because it covers a valid scenario: disabling a salt-minion (blacklisting its key on OSD) while its data is still present in workerconf.sls.

Actions #10

Updated by openqa_review about 2 months ago

  • Due date set to 2024-06-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by nicksinger about 2 months ago

  • Status changed from In Progress to Feedback

I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1196 for future debugging and will enable all instances again by basically reverting https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814 because the mine is populated properly again (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/818).
I'm not sure how to approach this. I don't fully understand why this happened in the first place and can only explain it by the mine containing or producing wrong information which we can't really check for.

I was thinking along the lines of just resolving the cluster remote by DNS, as this is basically what we do with the current template (evaluate the key from workerconf.sls into a public IP of the remote worker).
We're not exactly sure if this would affect jobs (https://suse.slack.com/archives/C02AJ1E568M/p1716313457649849) and I don't see an easy way to get the domain from the worker (as this information is stripped in our workerconf) without using the mine again.
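
As a rough illustration of that DNS idea, the preup script could resolve the remote worker itself instead of relying on mine-provided IPs (sketch only, assuming the FQDN of the remote worker were available; this is not what the current template does):

remote_ip=$(getent hosts worker35.oqa.prg2.suse.org | awk '{print $1}')
ovs-vsctl --may-exist add-port $bridge gre8 -- set interface gre8 type=gre options:remote_ip=$remote_ip # worker35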

Actions #12

Updated by nicksinger about 2 months ago

  • Copied to action #160826: Optimize gre_tunnel_preup.sh generation jinja template added
Actions #13

Updated by nicksinger about 2 months ago

  • Status changed from Feedback to In Progress

Technically I am waiting for the MRs to be merged, but I keep an eye on the situation.

Actions #14

Updated by pcervinka about 2 months ago

Could we please merge the solution? I think due to the limitation to one worker we have had pending jobs for 3 days, like https://openqa.suse.de/tests/14398639.

Actions #15

Updated by livdywan about 2 months ago

pcervinka wrote in #note-14:

Could we please merge the solution? I think due to the limitation to one worker we have had pending jobs for 3 days, like https://openqa.suse.de/tests/14398639.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1196 was merged, and we're monitoring the results now

Actions #16

Updated by nicksinger about 2 months ago

  • Due date deleted (2024-06-05)
  • Status changed from In Progress to Workable
  • Assignee deleted (nicksinger)
  • Priority changed from Urgent to High

The revert is merged but I fail to come up with a proper long-term solution to avoid this happening again. Maybe we can brainstorm this together.

Actions #17

Updated by okurz about 1 month ago

  • Assignee set to ybonatakis
Actions #18

Updated by okurz about 1 month ago

  • Related to action #161381: multi-machine test network issues reported 2024-06-03 due to missing content in the salt mine size:S added
Actions #19

Updated by okurz about 1 month ago

  • Parent task set to #111929
Actions #20

Updated by ybonatakis about 1 month ago

I ran a statistical analysis:

for i in {01..50} ; do openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de 14514267 TEST+=-$USER_poo160646_$i BUILD=poo160646_investigation _GROUP="Test Development: SLE 15" ; done
https://openqa.suse.de/tests/14517120

All jobs passed.
I see that @nicksinger has applied some changes but I am not sure how this solved the problem, or whether it was resolved by another ticket with some other resolution (for instance https://progress.opensuse.org/issues/161381).

I think most of the ACs are covered but I can't tell if "Prevent the same and similar problems in the future" is satisfied.
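
One simple detection for a recurrence would be to count the GRE ports on the multi-machine bridges of the workers directly, e.g. (sketch; the worker list and the bridge name br1 are just examples):

for w in worker29 worker35 worker40; do
  echo -n "$w: "
  ssh "$w.oqa.prg2.suse.org" "ovs-vsctl list-ifaces br1 | grep -c gre"
done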

Actions #21

Updated by okurz about 1 month ago

  • Status changed from Workable to Resolved

ybonatakis wrote in #note-20:

I ran a statistical analysis:

for i in {01..50} ; do openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de 14514267 TEST+=-$USER_poo160646_$i BUILD=poo160646_investigation _GROUP="Test Development: SLE 15" ; done
https://openqa.suse.de/tests/14517120

All jobs passed.
I see that @nicksinger has applied some changes but I am not sure how this solved the problem, or whether it was resolved by another ticket with some other resolution (for instance #161381).

I think most of the ACs are covered but I can't tell if "Prevent the same and similar problems in the future" is satisfied.

That's fine. We will follow up in #161735

Rollback actions are covered with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/818 already, I guess. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 looks all good again, so I guess we are done here.
