action #160646
Status: closed
openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all size:M
Description
Observation¶
Originally reported in https://suse.slack.com/archives/C02CANHLANP/p1716169544132569
(Richard Fan) Hello experts, many multi-machine tests are failing, like the MM failed jobs on qe-core
As visible in https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1716124476616&to=1716212983933&viewPanel=24
there is an increase of parallel_failed jobs since 2024-05-19 22:30.
E.g.
openQA test in scenario sle-15-SP3-Server-DVD-Updates-x86_64-qam_kernel_multipath@64bit fails in
multipath_iscsi
This shows that it is not an MTU problem, as the error message is "connect: Network is unreachable".
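A quick way to confirm this on an affected worker is to check whether the multi-machine bridge has any GRE ports at all; a sketch (the bridge name br1 is an assumption, adjust to the local setup):
ovs-vsctl list-ifaces br1 | grep gre    # no output means no GRE tunnels on that bridge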
ssh worker29.oqa.prg2.suse.org "cat /etc/wicked/scripts/gre_tunnel_preup.sh"
shows the problem:
#!/bin/sh
action="$1"
bridge="$2"
# enable STP for the multihost bridges
ovs-vsctl set bridge $bridge stp_enable=false
ovs-vsctl set bridge $bridge rstp_enable=true
for gre_port in $(ovs-vsctl list-ifaces $bridge | grep gre) ; do ovs-vsctl --if-exists del-port $bridge $gre_port ; done
After the last line there should be a list of GRE tunnel interface setup calls towards the other machines of the cluster, but no GRE tunnels are set up at all.
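For comparison, on a healthy worker the script ends with one such call per remote machine, along the lines of the entry visible in the rendered state further down (worker35 as the example):
ovs-vsctl --may-exist add-port $bridge gre8 -- set interface gre8 type=gre options:remote_ip=10.145.10.8 # worker35
# ...plus one more add-port line per additional remote worker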
Reproducible¶
Fails since
https://openqa.suse.de/tests/overview?result=parallel_failed&distri=sle&version=15-SP4&build=20240519-1
Expected result¶
Last good: 20240519-1 (or more recent)
Suggestions¶
- DONE Mitigate -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814
- Apply workarounds
- Retrigger affected jobs
- Investigate error source
- Fix the problem or at least abort the generation with an error if the GRE section of the generated script would be completely empty? (see the sketch after this list)
- Prevent the same and similar problems in the future
- Apply rollback steps
- Monitor effect carefully
- Look into https://stats.openqa-monitor.qa.suse.de/alerting/grafana/0XohcmfVk/view?orgId=1
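For the "abort the generation" suggestion above, a minimal sketch of such a guard, written here as a post-generation sanity check rather than inside the template itself (path and pattern are assumptions, not the implemented fix):
#!/bin/sh
# hypothetical check: fail loudly if the generated preup script contains no
# GRE add-port calls at all, which is exactly the symptom observed here
preup=/etc/wicked/scripts/gre_tunnel_preup.sh
if ! grep -q 'add-port' "$preup"; then
    echo "ERROR: $preup contains no GRE tunnel setup calls" >&2
    exit 1
fi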
Rollback steps¶
Further details¶
Always latest result in the originally mentioned scenario: latest
Updated by nicksinger 6 months ago
I ran
salt 'worker29.oqa.prg2.suse.org' cp.get_template salt://openqa/openvswitch.sls /tmp/openvswitch
to check the content of our state, which renders like this:
# Worker for GRE needs to have defined entry bridge_ip: <uplink_address_of_this_worker> in pillar data
/etc/wicked/scripts/gre_tunnel_preup.sh:
  file.managed:
    - user: root
    - group: root
    - mode: "0744"
    - makedirs: true
    - contents:
      - '#!/bin/sh'
      - action="$1"
      - bridge="$2"
      - '# enable STP for the multihost bridges'
      - ovs-vsctl set bridge $bridge stp_enable=false
      - ovs-vsctl set bridge $bridge rstp_enable=true
      - for gre_port in $(ovs-vsctl list-ifaces $bridge | grep gre) ; do ovs-vsctl --if-exists del-port $bridge $gre_port ; done
      - 'ovs-vsctl --may-exist add-port $bridge gre8 -- set interface gre8 type=gre options:remote_ip=10.145.10.8 # worker35'

wicked ifup all:
  cmd.run:
    - onchanges:
      - file: /etc/wicked/scripts/gre_tunnel_preup.sh
That indicates that our issue is between these lines: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls#L98-128
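To narrow that down, the data this part of the template consumes can be inspected directly on OSD; a sketch (the exact mine function name is an assumption, see openvswitch.sls for the one actually used):
# render the state for a single worker as above ...
salt 'worker29.oqa.prg2.suse.org' cp.get_template salt://openqa/openvswitch.sls /tmp/openvswitch
# ... and compare with what the mine actually returns for the workers (function name assumed)
salt 'worker29.oqa.prg2.suse.org' mine.get 'worker*' network.interfaces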
Updated by okurz 6 months ago
- Description updated (diff)
Added suggestions with mitigations, investigation hints, workarounds, etc.
Merged https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814 and called
failed_since="2024-02-19 18:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo160646" openqa-advanced-retrigger-jobs
Updated by okurz 6 months ago · Edited
nicksinger wrote in #note-3:
[…]
that indicates that our issues is between these lines: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls#L98-128
Yes. I was debugging that script by inserting commands like
{%- do salt.log.error('testing jinja logging') -%}
on OSD, executing salt commands against a single worker and checking the minion log on that worker. That revealed remote_bridge_interface for host worker40: []
so apparently an empty list for remote_bridge_interface. Then I called the command from #160646-2 and could confirm that no IP address entries show up. But after merging my workaround https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814, which does not directly change any of those parameters, valid values now show up again. My current hypothesis is that at the time the salt high state was last applied the grains did not contain valid data, so our config was created with empty GRE setup scripts.
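For reference, that debugging loop boils down to something like the following (state path and minion log location are assumptions based on defaults):
# on OSD: render the state for a single worker so the injected salt.log.error() calls fire
salt 'worker40.oqa.prg2.suse.org' state.show_sls openqa.openvswitch
# on the worker: watch the salt-minion log for the logged values
ssh worker40.oqa.prg2.suse.org tail -f /var/log/salt/minion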
Updated by nicksinger 6 months ago
Reading our jinja template that generates gre_tunnel_preup.sh and taking https://progress.opensuse.org/issues/160646#note-5 into consideration, I think that either our grains went missing on the hosts and therefore populated a wrong mine on each minion, or the mine itself was producing wrong output.
Looking for possible places to fail, I came across a check introduced with https://progress.opensuse.org/issues/130835#note-1 - I think without it we would have seen a failing pipeline. Anyhow, just removing that check won't cut it, because it covers a valid scenario: disabling a salt-minion (blacklisting its key on OSD) while keeping its data present in workerconf.sls.
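If that hypothesis holds, refreshing the grains and re-populating the mine on an affected minion should bring the data back; a sketch:
# refresh grains and mine data on a suspicious worker, then re-render the state
salt 'worker40.oqa.prg2.suse.org' saltutil.refresh_grains
salt 'worker40.oqa.prg2.suse.org' mine.update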
Updated by openqa_review 6 months ago
- Due date set to 2024-06-05
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 6 months ago
- Status changed from In Progress to Feedback
I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1196 for future debugging and will enable all instances again by basically reverting https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/814 because the mine is populated properly again (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/818).
I'm not sure how to approach this. I don't fully understand why this happened in the first place and can only explain it by the mine containing or producing wrong information which we can't really check for.
I was thinking along the lines of just resolving the cluster remote via DNS, as this is basically what we do with the current template anyway (evaluate the key from workerconf.sls into a public IP of the remote worker).
We're not exactly sure if this would affect jobs (https://suse.slack.com/archives/C02AJ1E568M/p1716313457649849) and I don't see an easy way to get the domain from the worker (as this information is stripped in our workerconf) without using the mine again.
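For illustration, the DNS-based variant would replace the templated IP with a lookup at ifup time; a purely hypothetical sketch reusing the worker35 entry from above (hostname/domain handling is exactly the open question):
# hypothetical: resolve the GRE peer at runtime instead of templating its IP
remote_ip=$(getent hosts worker35.oqa.prg2.suse.org | awk '{print $1}')
ovs-vsctl --may-exist add-port $bridge gre8 -- set interface gre8 type=gre options:remote_ip="$remote_ip" # worker35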
Updated by nicksinger 6 months ago
- Copied to action #160826: Optimize gre_tunnel_preup.sh generation jinja template size:S added
Updated by nicksinger 6 months ago
- Status changed from Feedback to In Progress
Technically waiting for the MRs to be merged, but I am keeping an eye on the situation.
Updated by pcervinka 6 months ago
Could we please merge the solution? I think that due to the limitation to one worker we have had jobs pending for 3 days, like https://openqa.suse.de/tests/14398639.
Updated by livdywan 6 months ago
pcervinka wrote in #note-14:
Could we please merge the solution? I think that due to the limitation to one worker we have had jobs pending for 3 days, like https://openqa.suse.de/tests/14398639.
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1196 was merged, and we're monitoring the results now
Updated by nicksinger 6 months ago
- Due date deleted (2024-06-05)
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
- Priority changed from Urgent to High
The revert is merged but I fail to come up with a proper long-term solution to avoid this happening again. Maybe we can brainstorm this together.
Updated by okurz 6 months ago
- Related to action #161381: multi-machine test network issues reported 2024-06-03 due to missing content in the salt mine size:S added
Updated by ybonatakis 6 months ago
I ran a statistical analysis:
for i in {01..50} ; do openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de 14514267 TEST+=-${USER}_poo160646_$i BUILD=poo160646_investigation _GROUP="Test Development: SLE 15" ; done
https://openqa.suse.de/tests/14517120
All jobs passed.
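For the record, the outcome of such an investigation build can also be checked in bulk over the API, assuming openqa-cli and jq are available (the build value matches the clone loop above):
# list id and result of every job in the investigation build
openqa-cli api --host https://openqa.suse.de jobs build=poo160646_investigation | jq '.jobs[] | {id, result}'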
I see that @nicksinger has applied some changes but I am not sure how they solved the problem, or whether it was resolved by another ticket with some other resolution (for instance https://progress.opensuse.org/issues/161381).
I think most of the ACs are covered but I can't tell if "Prevent the same and similar problems in the future" is satisfied.
Updated by okurz 6 months ago
- Status changed from Workable to Resolved
ybonatakis wrote in #note-20:
I ran a statistical analysis:
for i in {01..50} ; do openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de 14514267 TEST+=-${USER}_poo160646_$i BUILD=poo160646_investigation _GROUP="Test Development: SLE 15" ; done
https://openqa.suse.de/tests/14517120
All jobs passed.
I see that @nicksinger has applied some changes but I am not sure how they solved the problem, or whether it was resolved by another ticket with some other resolution (for instance #161381). I think most of the ACs are covered but I can't tell if "Prevent the same and similar problems in the future" is satisfied.
That's fine. We will follow up in #161735
Rollback actions are already covered with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/818 I guess. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 looks all good again, so I guess we are done here.