action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry - openQA Infrastructure (public) - openSUSE Project Management Tool

action #134282

closed

openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry

Added by emiura over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee: nicksinger
Category:
Target version: openQA Project (public) - Ready
Start date: 2023-08-15
Due date:
% Done:
Estimated time:
Tags: infra

Description

Observations

Multi-machine jobs can't download artifacts from OBS/pip

Theory

(Fill this section with our current understanding of how the world works based on observations as written in the next section)

Problem

H1 REJECT The product has changed
- -> E1-1 Compare tests on multiple product versions -> O1-1-1 We observed the problem in multiple products in different states of maintenance updates, and the support server is an old SLE12SP3 whose maintenance updates have not changed for months. It is unlikely that the iscsi client changed recently, but that has to be verified
- -> E1-2 Find the output of "openqa-investigate" jobs comparing against "last good" -> O1-2-1 https://openqa.suse.de/tests/12080239#comment-993398 shows four tests failing reproducibly, i.e. the failure reproduces for all states of test and product, so reject H1
H2 Fails because of changes in test setup
- H2.1 Our test hardware equipment behaves differently
- H2.2 The network behaves differently
H3 Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA
- -> E3-1 TODO compare package versions installed on machines from "last good" with "first bad", e.g. from /var/log/zypp/history
- -> E3-2 It is probably not the Open vSwitch version, see comment #134282#note-98
H4 Fails because of changes in test management configuration, e.g. openQA database settings
- -> wait for E5-1
H5 Fails because of changes in the test software itself (the test plan in source code as well as needles)
- -> E5-1 TODO Compare vars.json from "last good" with "first bad" and in particular look into changes to needles and job templates
H6 REJECT Sporadic issue, i.e. the root problem is already hidden in the system for a long time but does not show symptoms every time
- -> O6-1 #134282#note-71 but there is no 100% fail ratio
- -> E6-2 Increase the timeout in the initial step of firewall configuration to check whether timeouts make the test results unreliable
- -> TODO Investigate the timeout in the initial step of firewall configuration
- -> TODO Add TIMEOUT_SCALE=3 on the support servers of non-HanasR cluster tests
H7 Multi-machine jobs don't work across workers anymore since 2023-08 -> also see #111908 and #135773
- H7.1 REJECT Multi-machine jobs generally work fine when executed on a single physical machine -> E7.1-1 Run multi-machine jobs only on a single physical machine -> O7.1-1-1 See #134282-80
- We could pin jobs to a worker but that will need to be implemented properly, see #135035
- We otherwise need to understand the infra setup better
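Experiment E5-1 (compare vars.json from "last good" with "first bad") could be approached roughly as follows. This is only a minimal sketch: the function name `diff_settings`, the `UNSET` marker and all sample values are made up for illustration, and in practice the two dictionaries would be loaded from the vars.json files of the two jobs rather than from inline strings.

```python
import json
from typing import Any

UNSET = "<unset>"  # marker for a setting present in only one of the two jobs


def diff_settings(good: dict[str, Any], bad: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (last_good_value, first_bad_value)} for every differing setting."""
    changes = {}
    for key in sorted(good.keys() | bad.keys()):
        before = good.get(key, UNSET)
        after = bad.get(key, UNSET)
        if before != after:
            changes[key] = (before, after)
    return changes


# Invented sample data standing in for the vars.json of the two jobs.
LAST_GOOD = json.loads(
    '{"WORKER_HOSTNAME": "worker-a", "TEST_GIT_HASH": "1111111", "NICTYPE": "tap"}'
)
FIRST_BAD = json.loads(
    '{"WORKER_HOSTNAME": "worker-b", "TEST_GIT_HASH": "2222222",'
    ' "NICTYPE": "tap", "TIMEOUT_SCALE": "3"}'
)

if __name__ == "__main__":
    for key, (before, after) in diff_settings(LAST_GOOD, FIRST_BAD).items():
        print(f"{key}: {before} -> {after}")
```

Settings that are identical in both jobs (here `NICTYPE`) are dropped, so the output only lists candidates worth a closer look, such as a changed test or needles revision.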

Suggestions

Test case improvements
- support_server/setup
- firewall services add zone=EXT service=service:target
- MTU check for packet size - covered in #135200
MTU size configuration
- The default MTU is 1500. However, the openQA TORs have MTU 9216 configured on each port, and the future network automation service will also apply this setting by default throughout PRG2. Lowering the MTU will then have to be requested via SD ticket. https://sd.suse.com/servicedesk/customer/portal/1/SD-130143
Come up with a better reproducer, e.g. run an openQA test scenario as a single-machine test with the support_server still on a tap-worker -> see #134282-104
Verify stability on one or multiple workers, e.g. #135773-9
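For the MTU check mentioned above, the relevant arithmetic is that a non-fragmented ICMP echo request carries the MTU minus the IP and ICMP headers as payload. The following is a hedged sketch only, not what #135200 implemented: the helper names are hypothetical, 192.0.2.1 is a documentation address used as a placeholder, and the generated command assumes the iputils `ping` options `-c`, `-M do` and `-s`.

```python
import shlex

IPV4_HEADER = 20  # bytes, IPv4 header without options
IPV6_HEADER = 40  # bytes, fixed IPv6 header without extension headers
ICMP_HEADER = 8   # bytes, ICMP/ICMPv6 echo header


def icmp_payload_for_mtu(mtu: int, ipv6: bool = False) -> int:
    """Largest ICMP echo payload that fits into one packet of the given MTU."""
    overhead = (IPV6_HEADER if ipv6 else IPV4_HEADER) + ICMP_HEADER
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} leaves no room for a payload")
    return mtu - overhead


def ping_command(host: str, mtu: int, count: int = 3) -> str:
    """Build an IPv4 ping command that forbids fragmentation (-M do)."""
    size = icmp_payload_for_mtu(mtu)
    return shlex.join(["ping", "-c", str(count), "-M", "do", "-s", str(size), host])


if __name__ == "__main__":
    for mtu in (1500, 9216):
        print(mtu, icmp_payload_for_mtu(mtu), ping_command("192.0.2.1", mtu))
```

If a ping with the payload for MTU 1500 succeeds between two workers but the one for MTU 9216 does not, some hop on the path is likely not configured for jumbo frames.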

Rollback steps

~~Re-enable OSD deployments~~ DONE
~~Switch off worker9 again~~ DONE

Out of scope

Improving openQA upstream documentation -> #135914
ovs-server+client scenario and MTU related fixes -> #135773
lessons learned -> #136007
SAP NFS server related issues qesap-nfs.qa.suse.cz -> #135938
Problems to reach machines in external network in multi-machine tests -> #135056
Ensure IP forwarding is persistent for good -> #136013

Related issues 13 (1 open — 12 closed)

- Related to openQA Project (public) - action #111908: Multimachine failures between multiple physical workers (Resolved, okurz, 2022-06-03)
- Related to openQA Tests (public) - action #133787: [qe-core] not hardcode a single worker to run autofs_server/client' and 'ovs-server/client' tests (Closed, rfan1, 2023-08-04)
- Related to openQA Project (public) - action #135035: Optionally restrict multimachine jobs to a single worker (Resolved, mkittler, 2023-09-01)
- Related to openQA Infrastructure (public) - action #135056: MM Test fails in a connection to an address outside of the worker (Resolved, mkittler, 2023-09-01)
- Related to openQA Infrastructure (public) - action #134042: auto-update on OSD does not install updates due to "Problem: nothing provides 'libwebkit2gtk3 ..." but service does not fail and we do not get an alert size:M (Resolved, livdywan, 2023-08-09, 2023-09-12)
- Related to openQA Infrastructure (public) - action #135578: Long job age and jobs not executed for long size:M (Resolved, nicksinger)
- Related to openQA Infrastructure (public) - action #135944: Implement a constantly running monitoring/debugging VM for the multi-machine network (New, 2023-09-18)
- Copied to openQA Tests (public) - action #135200: [qe-core] Implement a ping check with custom MTU packet size (Rejected, dvenkatachala, 2023-08-15)
- Copied to openQA Infrastructure (public) - action #135773: [tools] many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers size:M (Resolved, livdywan, 2023-08-15, 2023-10-07)
- Copied to openQA Tests (public) - action #135818: [kernel] minimal reproducer for many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers (Resolved, pcervinka, 2023-08-15)
- Copied to openQA Project (public) - action #135914: Extend/add initial validation steps and "best practices" for multi-machine test setup/debugging to openQA documentation size:M (Resolved, mkittler)
- Copied to openQA Infrastructure (public) - action #136007: Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S (Resolved, tinita)
- Copied to openQA Project (public) - action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M (Resolved, dheidler)
