action #152389


coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M

Added by okurz 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2023-12-11
Due date:
% Done: 0%
Estimated time:
Description

Observation

openQA test in scenario sle-15-SP5-Server-DVD-Updates-x86_64-qam_kernel_multipath@64bit fails in multipath_iscsi

Test suite description

Test suite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml. Maintainer: jpupava. On 15sp1 there is a problem with a missing python-xml package.

Reproducible

Fails since (at least) Build 20231210-1 (current job)

Expected result

Last good: 20231208-1 (or more recent)

Acceptance criteria

Problem

Pinging (above certain packet sizes set via the -s parameter) and certain traffic (e.g. SSH) hangs when going via GRE tunnels (the MM test setup).
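
A hedged way to reproduce the symptom manually from inside one of the involved VMs; the peer address is a placeholder for another node of the multi-machine cluster, and the payload size 1350 matches the value from the ticket title and the ping_size_check mentioned in the suggestions:

    # ICMP payload of 1350 bytes (plus 28 bytes of ICMP/IPv4 headers) with the DF bit set;
    # this hangs/fails across the GRE tunnel when the effective path MTU is lower
    # (10.0.2.2 is a placeholder for another SUT or the support server)
    ping -M do -s 1350 -c 3 10.0.2.2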

  • H1 REJECTED The product has changed -> E1-1 Take a look into openQA investigate results -> O1-1-1 openqa-investigate in job $url proves no changes in the product -> also unlikely because it happened across all products at the same time

  • H2 Fails because of changes in test setup

    • H2.1 REJECTED Recent changes to the MTU size on the bridge on worker hosts made a difference -> E2.1-1 Revert the changes from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1061 manually -> O2.1-1-1 reverting manually on two worker hosts didn't make any difference -> E2.1-2 Explicitly set the MTU back to its default value with ovs-vsctl (see the ovs-vsctl sketch after this list) -> #152389#note-26
    • H2.2 ACCEPTED The network behaves differently -> considering the testing with tracepath in #152557, the network definitely behaves in an unexpected way when it comes to routing, and it most likely was not always like that
    • H2.2.1 Issue #152557 means that there might now be additional hops that lower the physical MTU size, leading to the problem we see -> E2.2.1-1 Wait until the SD ticket has been resolved and see if it works again as before -> the routing is to be expected because between PRG1/NUE2 and PRG2 there is an IPsec tunnel involved which most likely limits the effective MTU in the observed ways; E2.2.1-2 Check whether lowering the MTU would help (as that would indicate that the physical MTU size is indeed reduced) -> lowering the MTU size within the VM does indeed help, see #152389#note-25 (a tracepath/MTU sketch follows after this list)
    • H2.3 The automatic reboot of machines in our Sunday maintenance window had an impact -> E2.3-1 First check if workers actually rebooted -> O2.3-1-1 sudo salt -C 'G@roles:worker' cmd.run 'w' shows that all workers rebooted last Sunday, so there was a reboot -> E2.3-2 Implement #152095 for better investigation
    • H2.4 REJECTED Scenarios failing now were actually never tested as part of https://progress.opensuse.org/issues/151310 -> the scenario https://openqa.suse.de/tests/13018864 was passing at the time when #151310 was resolved, and the queries for failing MM jobs done in #151310 didn't show the many failures we see now
    • H2.5 ACCEPTED There is something wrong in the combination of GRE tunnels with more than 2 physical hosts (#152389-10) -> E2.5-1 Run a multi-machine cluster between two non-production hosts with only GRE tunnels between those two enabled -> with just 2 physical hosts the problem is indeed no longer reproducible, see #152389#note-29
    • H2.5.1 ACCEPTED Only a specific worker host (or the way it is connected) is problematic -> E2.5.1-1 Add further hosts step by step -> it seems that adding qesapworker-prg4.qa.suse.cz to the cluster causes the problem (but not similar workers like qesapworker-prg5.qa.suse.cz), see #152389#note-32 and #152389#note-35
  • H3 REJECTED Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA
    -> O3-1-1 Comparing "first bad" https://openqa.suse.de/tests/13018864/logfile?filename=autoinst-log.txt os-autoinst version 4.6.1702036503.3b9f3a2 and "last good" 4.6.1701963272.58c0dd5 yielding

    $ git log --oneline --no-merges 58c0dd5..3b9f3a2
    fdf5f064 Improve `sudo`-usage in `t/20-openqa-isotovideo-utils.t`
    2f9d913a Consider code as generally uncoverable when testing relies on `sudo`
    

    i.e. only test-related os-autoinst changes, and also no relevant changes in openQA at all

  • H4 REJECTED Fails because of changes in test management configuration, e.g. openQA database settings -> O4-1-1 no relevant changes, see https://openqa.suse.de/tests/13018864#investigation

  • H5 REJECTED Fails because of changes in the test software itself (the test plan in source code as well as needles) -> no changes, see e.g. https://openqa.suse.de/tests/13018864#investigation

  • H6 REJECTED Sporadic issue, i.e. the root problem has been hidden in the system for a long time but does not show symptoms every time -> O6-1-1 https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=1701825350155&to=1702376526900 shows a clear regression in the statistics
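
Regarding E2.1-2 in H2.1 above, a minimal sketch of how the bridge MTU can be inspected and explicitly set back via Open vSwitch; the bridge name br1 is an assumption, the actual name comes from the openvswitch setup in salt-states-openqa:

    # show the currently effective MTU of the bridge (bridge name "br1" is an assumption)
    ip link show br1 | grep -o 'mtu [0-9]*'
    # explicitly request a specific MTU on the bridge interface, e.g. the Ethernet default
    ovs-vsctl set interface br1 mtu_request=1500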
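
For E2.2.1-2 in H2.2.1, a hedged sketch of the corresponding in-VM check: probe the effective path MTU with tracepath and, if it turns out lower than the NIC MTU, lower the MTU inside the VM; the peer address, interface name and MTU value are example placeholders:

    # probe the path MTU towards another cluster node (address is a placeholder)
    tracepath -n 10.0.2.2
    # if the reported pmtu is below the interface MTU, lower it inside the VM
    # (eth0 and the value 1380 are example placeholders)
    ip link set dev eth0 mtu 1380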

Suggestions

Debug in VMs (using the developer mode or by creating VMs manually) as we have already started in #152389#note-10 and subsequent comments.

The mentioned scenario is an easy reproducer but not the only affected scenario. Use e.g.

    select distinct count(jobs.id), array_agg(jobs.id),
        (select name from job_groups where id = group_id),
        (array_agg(test))[1] as example_test
    from jobs
        left join job_dependencies on (id = child_job_id or id = parent_job_id)
    where dependency = 2
        and t_finished >= '2023-12-05T18:00'
        and result in ('failed', 'incomplete')
        and test not like '%:investigate:%'
    group by group_id
    order by count(jobs.id) desc;

to find other possibly affected and relevant scenarios.
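
One hedged way to run the query is to save it to a file (the file name here is arbitrary) and feed it to psql on the openQA host; the geekotest system user is assumed to own the local "openqa" PostgreSQL database, as in a default deployment:

    # run the query above against the local openQA database
    sudo -u geekotest psql openqa -f affected_mm_scenarios.sql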

  • Lower the MTU, but keep in mind the minimum for IPv6: according to dheidler (referencing Wikipedia) the minimum would be 1280 bytes
  • Use more hosts in different locations and see if those work more reliably in production
  • Ensure tap is not used on machines that don't work, since those would still affect the GRE setup
  1. Make the use of MTU in os-autoinst-distri-opensuse configurable, i.e. the value that the support_server sets as well as the ping_size_check
  2. DONE Use the minimum MTU in both os-autoinst-distri-opensuse as well as https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls to be "on the safe side" for across-location tunneled GRE configurations -> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18351 (merged)
  3. DONE Only enable "tap" in our workerconf.sls within one location per one architecture, e.g. only ppc64le tap in NUE2, only x86_64 tap in PRG2 -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/701 (merged) and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/702 (merged)
  4. Change https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls to only enable GRE tunnels between machines within the same datacenter, e.g. "location-prg", and rename accordingly in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls (a sketch of such a GRE tunnel port follows after this list)
  5. Follow #152737 regarding scheduling limited to individual zone
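
To illustrate suggestion 4: GRE tunnel ports between tap-capable workers are created via Open vSwitch roughly as sketched below; the bridge name, port name and remote IP are assumptions, the production values are generated from openvswitch.sls based on workerconf.sls. Limiting tunnels to one datacenter would mean generating such ports only for remote hosts in the same location:

    # hedged sketch: add a GRE tunnel port on the tap bridge towards one remote worker
    # (bridge "br1", port name and remote_ip are example placeholders)
    ovs-vsctl --may-exist add-port br1 gre_example \
        -- set interface gre_example type=gre options:remote_ip=10.137.10.2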

Rollback steps

  1. Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/693 disabling all tap classes except one x86_64 worker host
  2. Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/4be80b2c720f6023b20355c9f4ac71096dc0aee4
  3. Remove the silence "alertname=Ratio of multi-machine tests by result alert" from https://monitor.qa.suse.de/alerting/silences

Further details

Always latest result in this scenario: latest


Files

screenshot_20231211_183832.png (36.5 KB), added by mkittler, 2023-12-11 17:40

Related issues 11 (0 open, 11 closed)

Related to openQA Project - action #138698: significant increase in multi-machine test failures on OSD since 2023-10-25, e.g. test fails in support_server/setup size:M (Resolved, mkittler, 2023-10-27)

Related to openQA Project - action #136154: multimachine tests restarted by RETRY test variable end up without the proper dependency size:M (Resolved, mkittler)

Related to openQA Project - action #151310: [regression] significant increase of parallel_failed+failed since 2023-11-21 size:M (Resolved, mkittler, 2023-11-23)

Related to openQA Tests - action #151612: [kernel][tools] test fails in suseconnect_scc - SUT times out trying to reach https://scc.suse.com (Resolved, mkittler, 2023-11-28)

Related to openQA Tests - action #152461: [core][tools] test fails in various s390x-kvm tests with "s390x-kvm[\S\s]*(command 'zypper -n in[^\n]*timed out|sh install_k3s.sh[^\n]*failed)" (Resolved, okurz, 2023-12-12)

Related to openQA Infrastructure - action #152095: [spike solution][timeboxed:8h] Ping over GRE tunnels and TAP devices and openvswitch outside a VM with differing packet sizes size:S (Resolved, jbaier_cz, 2023-12-05)

Related to openQA Tests - action #152755: [tools] test fails in scc_registration - SCC not reachable despite not running multi-machine tests? size:M (Resolved, mkittler, 2023-12-19)

Related to openQA Project - action #153769: Better handle changes in GRE tunnel configuration size:M (Resolved, okurz, 2024-01-17)

Related to openQA Project - action #154552: [ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de (Resolved, mkittler, 2024-01-30)

Copied to openQA Infrastructure - action #152557: unexpected routing between PRG1/NUE2+PRG2 (Resolved, okurz)

Copied to openQA Project - action #153880: https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de, likely DNS problems in PRG1 (Resolved, okurz)
