action #152389

Updated by mkittler 5 months ago

## Observation 
 openQA test in scenario sle-15-SP5-Server-DVD-Updates-x86_64-qam_kernel_multipath@64bit fails in 
 [multipath_iscsi](https://openqa.suse.de/tests/13018864/modules/multipath_iscsi/steps/23) 

 ## Test suite description 
 Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml. Maintainer: jpupava. On 15sp1 the problem is a missing python-xml package.

 ## Reproducible 
 Fails since (at least) Build [20231210-1](https://openqa.suse.de/tests/13018864) (current job) 

 ## Expected result 
 Last good: [20231208-1](https://openqa.suse.de/tests/13010854) (or more recent) 

 ## Acceptance criteria 
 * **AC1:** failed+parallel_failed on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24 is significantly below 20% again 
 * **AC2:** same as AC1 but also after the next weekend and worker host reboots 

 ## Problem 
 Pinging (above certain payload sizes set via the `-s` parameter) and certain traffic (e.g. SSH) hangs when going via GRE tunnels (the MM test setup).
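
 A minimal sketch of how this symptom can be probed from a SUT or worker host; the peer address and the payload sizes are placeholders and not taken from this ticket:

 ```
 # small payload, expected to get a reply
 ping -c 3 -s 100 10.0.2.16
 # larger payload, hangs/gets dropped in the bad state
 ping -c 3 -s 1400 10.0.2.16
 # with the DF bit set, to surface MTU/fragmentation errors instead of silent drops
 ping -c 3 -M do -s 1400 10.0.2.16
 ```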

 * **H1** *REJECTED* The product has changed -> unlikely because it happened across all products at the same time -> *E1-1* Take a look into openQA investigate results -> *O1-1-1* openqa-investigate in job $url proves no changes in the product

 * **H2** Fails because of changes in test setup 
  * **H2.1** Recent changes of the MTU size on the bridge on worker hosts made a difference 
 -> *E2.1-1* Revert changes from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1061 manually -> *O2.1-1-1* reverting manually on two worker hosts didn't make any difference 
 -> *E2.1-2* Explicitly set the MTU back to its default value with `ovs-vsctl` -> #152389#note-26 (see the command sketch after this hypothesis list)

  * **H2.2** The network behaves differently 
  * **H2.3** The automatic reboot of machines in our Sunday maintenance window had an impact 
 -> *E2.3-1* First check if workers actually rebooted -> *O2.3-1-1* `sudo salt -C 'G@roles:worker' cmd.run 'w'` shows that *all* workers rebooted on last Sunday so there *was* a reboot 
 -> *E2.3-2* Implement #152095 for better investigation 

  * **H2.4** *REJECTED* Scenarios failing now were actually never tested as part of https://progress.opensuse.org/issues/151310 -> the scenario https://openqa.suse.de/tests/13018864 was passing at the time when #151310 was resolved and the queries for failing MM jobs done in #151310 didn't show the many failures we see now
  * **H2.5** *ACCEPTED* There is something wrong in the combination of GRE tunnels with more than 2 physical hosts (#152389-10) -> *E2.5-1* Run a multi-machine cluster between two non-production hosts with only GRE tunnels between those two enabled -> with just 2 physical hosts the problem is indeed no longer reproducible, see #152389#note-29 
     * **H2.5.1** Only a specific worker host is problematic -> *E2.5.1-1* Add further hosts step by step

 * **H3** *REJECTED* Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA 
    -> *O3-1-1* Comparing "first bad" https://openqa.suse.de/tests/13018864/logfile?filename=autoinst-log.txt os-autoinst version 4.6.1702036503.3b9f3a2 and "last good" 4.6.1701963272.58c0dd5 yielding 
 ``` 
 $ git log --oneline --no-merges 58c0dd5..3b9f3a2 
 fdf5f064 Improve `sudo`-usage in `t/20-openqa-isotovideo-utils.t` 
 2f9d913a Consider code as generally uncoverable when testing relies on `sudo` 
 ``` 
 also no relevant changes in openQA at all 

 * **H4** *REJECTED* Fails because of changes in test management configuration, e.g. openQA database settings -> *O4-1-1* no relevant changes, see https://openqa.suse.de/tests/13018864#investigation 
 * **H5** *REJECTED* Fails because of changes in the test software itself (the test plan in source code as well as needles) -> no changes, see e.g. https://openqa.suse.de/tests/13018864#investigation 
 * **H6** *REJECTED* Sporadic issue, i.e. the root problem has been hidden in the system for a long time but does not show symptoms every time -> *O6-1-1* https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=1701825350155&to=1702376526900 statistically shows a clear regression
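
 To support *E2.1-2* (H2.1) and the GRE-related checks around H2.5, a sketch of commands for inspecting the Open vSwitch bridge and its GRE ports on a worker host; the bridge and port names (`br1`, `gre1`) are assumptions and may differ per host:

 ```
 # list bridges and the ports attached to the multi-machine bridge
 ovs-vsctl list-br
 ovs-vsctl list-ports br1

 # remote endpoint of a GRE port, to see which physical hosts are interconnected
 ovs-vsctl get Interface gre1 options:remote_ip

 # current MTU of the bridge and, if an explicit value was set, clearing the
 # mtu_request again so that the default applies (cf. *E2.1-2*)
 ip link show br1
 ovs-vsctl set Interface br1 mtu_request=[]
 ```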

 ## Suggestions 
 Debug in VMs (using the developer mode or by creating VMs manually) as we have already started in #152389#note-10 and subsequent comments.

 The mentioned scenario is an easy reproducer but not the only affected scenario. Use e.g. 
 ``` 
 select distinct count(jobs.id),
        array_agg(jobs.id),
        (select name from job_groups where id = group_id),
        (array_agg(test))[1] as example_test
   from jobs
   left join job_dependencies on (id = child_job_id or id = parent_job_id)
  where dependency = 2
    and t_finished >= '2023-12-05T18:00'
    and result in ('failed', 'incomplete')
    and test not like '%:investigate:%'
  group by group_id
  order by count(jobs.id) desc;
 ``` 
 to find other scenarios that are possibly also affected and relevant.
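
 Assuming the query is run directly on the openQA web UI host, it can be executed for example at a psql prompt; the database name and user below are those of a default openQA deployment and may differ:

 ```
 # open a psql session against the openQA database, then paste the query above
 sudo -u geekotest psql openqa
 ```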

 ## Rollback steps 
 1. Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/693 which disabled all tap worker classes except on one x86_64 worker host 
 2. Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/commit/4be80b2c720f6023b20355c9f4ac71096dc0aee4 
 3. Remove the silence "alertname=Ratio of multi-machine tests by result alert" from https://monitor.qa.suse.de/alerting/silences 

 ## Further details 
 Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=qam_kernel_multipath&version=15-SP5) 
