action #134282

closed

openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry

Added by emiura about 1 year ago. Updated 11 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2023-08-15
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observations

  • Multi-machine jobs can't download artifacts from OBS/pip

Theory

(Fill this section with our current understanding of how the world works based on observations as written in the next section)

Problem

  • H1 REJECT The product has changed
    • -> E1-1 Compare tests on multiple product versions -> O1-1-1 We observed the problem in multiple products in different states of maintenance updates, and the support server is an old SLE12SP3 with no change in maintenance updates for months. It is unlikely that the iscsi client changed recently, but that has to be verified
  • H2 Fails because of changes in test setup
    • H2.1 Our test hardware equipment behaves differently
    • H2.2 The network behaves differently
  • H3 Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA
    • -> E3-1 TODO compare package versions installed on machines from "last good" with "first bad", e.g. from /var/log/zypp/history
    • -> E3-2 It is probably not the Open vSwitch version, see comment #134282#note-98
  • H4 Fails because of changes in test management configuration, e.g. openQA database settings
    • -> wait for E5-1
  • H5 Fails because of changes in the test software itself (the test plan in source code as well as needles)
    • -> E5-1 TODO Compare vars.json from "last good" with "first bad" and in particular look into changes to needles and job templates
  • H6 REJECT Sporadic issue, i.e. the root problem is already hidden in the system for a long time but does not show symptoms every time
    • -> O6-1 #134282#note-71 but there is no 100% fail ratio
    • -> E6-2 Increase timeout in the initial step of firewall configuration to check if we have non-reliable test results due to timeouts
    • -> TODO Investigate the timeout in the initial step of firewall configuration
    • -> TODO Add TIMEOUT_SCALE=3 on non HanasR cluster tests' support servers
  • H7 Multi-machine jobs don't work across workers anymore since 2023-08 -> also see #111908 and #135773
    • H7.1 REJECT Multi-machine jobs generally work fine when executed on a single physical machine -> E7.1-1 Run multi-machines only on a single physical machine -> O7.1-1-1 See #134282-80
    • We could pin jobs to a worker but that will need to be implemented properly, see #135035
    • We otherwise need to understand the infra setup better
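
The two TODO comparisons above (E3-1 package versions, E5-1 vars.json) both reduce to diffing two key/value snapshots from "last good" and "first bad". A minimal sketch, assuming the zypp history log uses pipe-separated fields in the order timestamp|action|name|version|... and that vars.json is a flat JSON object. The helper names `parse_zypp_history` and `diff_dicts` are hypothetical, not part of openQA or os-autoinst.

```python
import json


def parse_zypp_history(text):
    """Return {package: version} from the install entries of a zypp history log.

    Assumption: each entry is one line of pipe-separated fields,
    timestamp|action|name|version|..., and comment lines start with '#'.
    Later install entries for the same package overwrite earlier ones.
    """
    versions = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("|")
        if len(fields) >= 4 and fields[1] == "install":
            versions[fields[2]] = fields[3]
    return versions


def diff_dicts(good, bad):
    """Return {key: (good_value, bad_value)} for every key whose value differs.

    A key missing on one side is reported with None on that side.
    """
    changes = {}
    for key in sorted(set(good) | set(bad)):
        if good.get(key) != bad.get(key):
            changes[key] = (good.get(key), bad.get(key))
    return changes


def diff_vars_json(good_text, bad_text):
    """Diff the contents of two vars.json files given as JSON strings."""
    return diff_dicts(json.loads(good_text), json.loads(bad_text))
```

Given the history files from a "last good" and a "first bad" worker, `diff_dicts(parse_zypp_history(good), parse_zypp_history(bad))` lists only the packages whose installed version differs, which narrows down H3. `diff_vars_json` does the same for job settings under H4 and H5.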

Suggestions

  • Test case improvements
    • support_server/setup
    • firewall services add zone=EXT service=service:target
    • MTU check for packet size - covered in #135200
  • MTU size configuration
    • The default MTU is 1500. However, for the openQA TORs we have MTU 9216 configured on each port, and the future network automation service will also apply this setting by default throughout PRG2. Lowering the MTU will then have to be requested via SD ticket. https://sd.suse.com/servicedesk/customer/portal/1/SD-130143
  • Come up with a better reproducer, e.g. run an openQA test scenario as single-machine test with support_server still on a tap-worker -> see #134282-104
  • Verify stability on one or multiple workers e.g. #135773-9
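
For the MTU check mentioned above, the largest ICMP echo payload that fits into one unfragmented IPv4 packet is the MTU minus the 20-byte IPv4 header (no options) and the 8-byte ICMP header. A minimal sketch of that arithmetic, building a Linux iputils `ping` command line with fragmentation prohibited (`-M do`). The function names and the example host name are hypothetical, and the sketch is not the check implemented in #135200.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def icmp_payload_for_mtu(mtu):
    """Return the largest ICMP payload that fits in one IPv4 packet of size mtu."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} leaves no room for a payload")
    return mtu - overhead


def ping_command(host, mtu, count=1):
    """Build an iputils ping command that fails if a packet of size mtu must be fragmented."""
    return [
        "ping",
        "-M", "do",          # set the don't-fragment bit
        "-c", str(count),    # number of echo requests
        "-s", str(icmp_payload_for_mtu(mtu)),
        host,
    ]


if __name__ == "__main__":
    # Hypothetical target; replace with the peer job's address.
    for mtu in (1500, 9216):
        print(" ".join(ping_command("support-server.example", mtu)))
```

A ping with the payload for MTU 1500 succeeding while the one for MTU 9216 fails would suggest that some hop between the two workers does not pass jumbo frames.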

Rollback steps

Out of scope

  • Improving openQA upstream documentation -> #135914
  • ovs-server+client scenario and MTU related fixes -> #135773
  • lessons learned -> #136007
  • SAP NFS server related issues qesap-nfs.qa.suse.cz -> #135938
  • Problems to reach machines in external network in multi-machine tests -> #135056
  • Ensure IP forwarding is persistent for good -> #136013

Related issues 13 (1 open, 12 closed)

  • Related to openQA Project - action #111908: Multimachine failures between multiple physical workers (Resolved, okurz, 2022-06-03)
  • Related to openQA Tests - action #133787: [qe-core] not hardcode a single worker to run autofs_server/client' and 'ovs-server/client' tests (Closed, rfan1, 2023-08-04)
  • Related to openQA Project - action #135035: Optionally restrict multimachine jobs to a single worker (Resolved, mkittler, 2023-09-01)
  • Related to openQA Infrastructure - action #135056: MM Test fails in a connection to an address outside of the worker (Resolved, mkittler, 2023-09-01)
  • Related to openQA Infrastructure - action #134042: auto-update on OSD does not install updates due to "Problem: nothing provides 'libwebkit2gtk3 ..." but service does not fail and we do not get an alert size:M (Resolved, livdywan, 2023-08-09, 2023-09-12)
  • Related to openQA Infrastructure - action #135578: Long job age and jobs not executed for long size:M (Resolved, nicksinger)
  • Related to openQA Infrastructure - action #135944: Implement a constantly running monitoring/debugging VM for the multi-machine network (New, 2023-09-18)
  • Copied to openQA Tests - action #135200: [qe-core] Implement a ping check with custom MTU packet size (Rejected, dvenkatachala, 2023-08-15)
  • Copied to openQA Infrastructure - action #135773: [tools] many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers size:M (Resolved, livdywan, 2023-08-15, 2023-10-07)
  • Copied to openQA Tests - action #135818: [kernel] minimal reproducer for many multi-machine test failures in "ovs-client+ovs-server" test scenario when tests are run across different workers (Resolved, pcervinka, 2023-08-15)
  • Copied to openQA Project - action #135914: Extend/add initial validation steps and "best practices" for multi-machine test setup/debugging to openQA documentation size:M (Resolved, mkittler)
  • Copied to openQA Infrastructure - action #136007: Conduct "lessons learned" with Five Why analysis for network protocols failures on multimachine tests on HA/SAP size:S (Resolved, tinita)
  • Copied to openQA Project - action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M (Resolved, dheidler)
