action #138698 (closed)

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

significant increase in multi-machine test failures on OSD since 2023-10-25, e.g. test fails in support_server/setup size:M

Added by acarvajal 6 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2023-10-27
Due date:
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario sle-15-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_priorityfencing_supportserver@64bit fails in setup

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Reproducible

Not easily reproducible. Failure is sporadic. See Next & Previous Results tab in linked test.

Failed on (at least) Build :29290:libfido2 (current job)

Expected result

Last good: :29978:qemu (or more recent)

Acceptance criteria

Problem

Suggestions

Further details

Always latest result in this scenario: latest

Rollback steps


Related issues 6 (4 open, 2 closed)

Related to openQA Infrastructure - action #133700: Network bandwidth graphs per switch, like https://mrtg.suse.de/qanet13nue, for all current top-of-rack switches (TORs) that we are connected to size:M (Blocked, okurz, 2023-08-02)

Related to openQA Infrastructure - action #138707: Re-enable worker32 for multi-machine tests in production (Blocked, okurz, 2023-10-28)

Related to openQA Infrastructure - action #139070: Re-enable worker34 for multi-machine tests in production (Blocked, okurz, 2023-10-28)

Related to openQA Infrastructure - action #139154: Re-enable worker33 for multi-machine tests in production (New, 2023-10-28)

Related to openQA Project - action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M (Resolved, mkittler, 2023-12-11)

Copied to openQA Project - action #139055: Comments mentioning bugrefs as part of a sentence are treated like bug refs and taken over size:S (Resolved, okurz, 2023-10-27, 2023-11-17)

Actions #1

Updated by acarvajal 6 months ago

I think this issue could be caused by a configuration problem in the worker or in the network. As discussed in #eng-testing, we have been seeing many failures since October 26th, but no clear pattern can be easily discerned.

Failures are:

  1. Failures in support_server/setup such as this one. So far these seem to be happening on worker34 (see the linked job and also https://openqa.suse.de/tests/12692647) and on worker33 (https://openqa.suse.de/tests/12692668), so I suggest starting the debugging on those 2 workers.

  2. Failures in SUTs trying to reach 10.0.2.2. We have seen those in:

worker33: https://openqa.suse.de/tests/12692960#step/barrier_init/5
worker34: https://openqa.suse.de/tests/12693210#step/ha_cluster_init/4
worker32: https://openqa.suse.de/tests/12692616#step/barrier_init/5

  3. Name resolution issues (using the DNS provided by the support server):

worker34: https://openqa.suse.de/tests/12693209#step/iscsi_client/15 (support server in worker35)
worker39: https://openqa.suse.de/tests/12690359#step/ha_cluster_join/6 (support server in worker30)
worker33: https://openqa.suse.de/tests/12691203#step/ha_cluster_join/6 (support server in worker37)
worker35: https://openqa.suse.de/tests/12692184#step/ha_cluster_init/3 (support server in worker32)
worker29: https://openqa.suse.de/tests/12692221#step/cluster_md/2 (support server in worker32)
worker32: https://openqa.suse.de/tests/12692201#step/iscsi_client/15 (support server in worker30)
worker30: https://openqa.suse.de/tests/12692313#step/ha_cluster_join/6 (support server in worker33)
worker33: https://openqa.suse.de/tests/12692311#step/ha_cluster_join/6 (support server in worker39)
worker33: https://openqa.suse.de/tests/12692339#step/iscsi_client/15 (support server in worker35)
worker34: https://openqa.suse.de/tests/12692372#step/ha_cluster_join/6 (support server in worker29)
worker34: https://openqa.suse.de/tests/12692380#step/ha_cluster_init/23 (support server in worker39)
worker32: https://openqa.suse.de/tests/12692386#step/ha_cluster_join/6 (support server in worker30)

Will edit later and add the other cases we're seeing.

Edit:

  4. Connection issues between nodes in an MM setup:

worker32: https://openqa.suse.de/tests/12692186#step/hawk_gui/29 (support server in worker35, node1 in worker39)
worker34: https://openqa.suse.de/tests/12692192#step/ha_cluster_join/15 (support server in worker33, node1 in worker33)
worker34: https://openqa.suse.de/tests/12692233#step/remove_node/23 (support server in worker29, node2 in worker30)
worker32: https://openqa.suse.de/tests/12692332#step/hawk_gui/36 (support server in worker35, node2 in worker34)

  5. Cluster resources stopped. This is an odd one, but I guess communication issues within the cluster can make the DC stop resources on the remote node:

https://openqa.suse.de/tests/12693181#step/check_cluster_integrity/6 (job in worker40, other jobs in worker33 & worker30)
https://openqa.suse.de/tests/12692558#step/check_cluster_integrity/6 (job in worker38, other jobs in worker32 & worker40)

  6. Other connection issues:

worker32: https://openqa.suse.de/tests/12692266#step/register_without_ltss/53 (support server in worker30)

  7. And finally, the most common error: these random reboots. They're happening in multiple modules. I think what we're seeing here is some connection problem between the cluster nodes, after which the HA stack fences the node. Due to the nature of the failure, jobs are not leaving logs for us to check (the failing node sits in GRUB, so post_fail_hook cannot gather logs, and the other nodes finish with parallel_failed so post_fail_hook does not run; see the sketch at the end of this comment):

https://openqa.suse.de/tests/12693179#step/check_hawk/10 (job in worker34, other jobs in worker40 & worker33)
https://openqa.suse.de/tests/12693197#step/check_cluster_integrity/2 (job in worker33, other jobs in worker32 & worker33)
https://openqa.suse.de/tests/12693199#step/console_reboot#1/17 (job in worker34, other jobs in worker29 & worker36)
https://openqa.suse.de/tests/12684453#step/cluster_md/2 (job in worker32, other jobs in worker39, worker29, worker33 & worker38)
https://openqa.suse.de/tests/12684536#step/vg/9 (job in worker33, other jobs in worker35 & worker40)
https://openqa.suse.de/tests/12684540#step/clvmd_lvmlockd/19 (job in worker33, other jobs in worker32, worker35 & worker40)
https://openqa.suse.de/tests/12684548#step/check_after_reboot/3 (job in worker40, other jobs in worker33, worker35 & worker40)
https://openqa.suse.de/tests/12692787#step/drbd_passive/30 (job in worker33, other jobs in worker37 & worker34)
https://openqa.suse.de/tests/12697964#step/clvmd_lvmlockd/26 (job in worker34, other jobs in worker29 & worker36)
https://openqa.suse.de/tests/12697960#step/cluster_state_mgmt/13 (job in worker34, other jobs in worker30 & worker34)
https://openqa.suse.de/tests/12697957#step/dlm/11 (job in worker33, other jobs in worker33 & worker30)
https://openqa.suse.de/tests/12697948#step/check_after_reboot/33 (job in worker39, other jobs in worker32, worker39 & worker38)
https://openqa.suse.de/tests/12700249#step/filesystem#1/15 (job in worker32, other jobs in worker30 & worker39)
https://openqa.suse.de/tests/12700243#step/drbd_passive/10 (job in worker32, other jobs in worker29 & worker32)
https://openqa.suse.de/tests/12686893#step/cluster_md/14 (job in worker34, other jobs in worker35, worker40, worker36 & worker37)
https://openqa.suse.de/tests/12692179#step/dlm/11 (job in worker34, other jobs in worker30, worker40 & worker38)
https://openqa.suse.de/tests/12692204#step/vg/6 (job in worker33, other jobs in worker34 & worker38)
https://openqa.suse.de/tests/12692217#step/check_after_reboot/19 (job in worker30, other jobs in worker32, worker40 & worker38)
https://openqa.suse.de/tests/12692260#step/check_after_reboot/12 (job in worker33, other jobs in worker33, worker34 & worker35)
https://openqa.suse.de/tests/12692255#step/check_after_reboot/7 (job in worker40, other jobs in worker32, worker29 & worker36)
https://openqa.suse.de/tests/12692269#step/check_logs/11 (job in worker32, other jobs in worker29 & worker38)
https://openqa.suse.de/tests/12692284#step/cluster_md/20 (job in worker33, other jobs in worker34, worker40 & worker38)
https://openqa.suse.de/tests/12692291#step/clvmd_lvmlockd/9 (job in worker34, other jobs in worker30 & worker33)
https://openqa.suse.de/tests/12692319#step/ha_cluster_join/16 (job in worker32, other jobs in worker39, worker38 & worker33)
https://openqa.suse.de/tests/12692306#step/cluster_md/6 (job in worker34, other jobs in worker30 & worker37)
https://openqa.suse.de/tests/12692357#step/dlm/10 (job in worker34, other jobs in worker36, worker40 & worker37)
https://openqa.suse.de/tests/12692375#step/cluster_md/6 (job in worker33, other jobs in worker29 & worker36)
https://openqa.suse.de/tests/12692414#step/filesystem/19 (job in worker32, other jobs in worker35, worker39 & worker30)
https://openqa.suse.de/tests/12692401#step/ha_cluster_init/54 (job in worker32, other jobs in worker37 & worker30)
https://openqa.suse.de/tests/12692397#step/ha_cluster_init/27 (job in worker32, other jobs in worker33, worker34 & worker32)

As reported, there is no clear pattern, but workers 32, 33 & 34 seem to appear a lot in these failures.
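
For readers unfamiliar with the mechanism mentioned in point 7: a post_fail_hook is per-module Perl code that runs after a test module fails to collect diagnostics. Below is a minimal, illustrative sketch of the kind of collection that is being skipped in these jobs; it uses standard testapi calls, but the console name and log commands are assumptions, not code from the affected tests:

    # Illustrative only: roughly what a post_fail_hook could collect if it ran.
    # In the fencing case above it cannot help: the fenced node sits in the GRUB
    # menu and the parallel_failed sibling jobs skip post_fail_hook entirely.
    sub post_fail_hook {
        my ($self) = @_;
        select_console 'root-console';                   # console name is an assumption
        script_run 'crm_mon -1 > /tmp/crm_mon.txt';      # cluster status snapshot
        script_run 'journalctl -b > /tmp/journal.txt';   # journal of the current boot
        upload_logs $_ for qw(/tmp/crm_mon.txt /tmp/journal.txt);
    }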

Actions #2

Updated by okurz 6 months ago

  • Related to action #133700: Network bandwidth graphs per switch, like https://mrtg.suse.de/qanet13nue, for all current top-of-rack switches (TORs) that we are connected to size:M added
Actions #3

Updated by okurz 6 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Category set to Regressions/Crashes
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Ready

ok, so let's follow the hypothesis that this is worker-host-specific. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now&viewPanel=27 indeed shows w32+33+34 with a higher failure rate, but actually also w31+w35+w36. As we currently have high redundancy at least for qemu-x86_64, we can easily take some machines out of production temporarily to check this hypothesis. So doing

sudo salt 'worker3[1-6].oqa.*' cmd.run "sudo systemctl disable --now telegraf \$(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs)" && for i in {31..36}; do sudo salt-key -y -d worker$i.oqa.prg2.suse.org; done

although I consider it unlikely that those and only those machines are affected. Either it's something not worker-specific, or maybe a problem in the network. Bandwidth graphs like the ones requested in #133700 would obviously help.

@acarvajal I will merely address one unlikely hypothesis in particular, due to the lower overall expected system load during the weekend. On top of that I will only be fully back at work on 2023-11-02, not before. IMHO the impact is severe enough to make this ticket at least "High" if not "Urgent". If you want to help address this issue with more effort, I suggest you look into a better reproducer, e.g. a minimized test scenario with the modules that don't impact the error excluded and, if possible, the modules that do have an impact repeated multiple times.

Actions #4

Updated by okurz 6 months ago

  • Description updated (diff)
Actions #5

Updated by okurz 6 months ago

@acarvajal support_server/setup is at least an example that fails rather quickly in comparison, but the test code is obviously far from race-free. The code has:

    assert_screen 'iscsi-target-overview-add-target-tab';

    # Wait for the Identifier field to change from 'test' value to the correct one
    # We could simply use a 'sleep' here but it's less good
    wait_screen_change(undef, 10);

    # Select Target field
    send_key 'alt-t';
    wait_still_screen 3;

    # Change Target value
    for (1 .. 40) { send_key 'backspace'; }
    type_string 'iqn.2016-02.de.openqa';
    wait_still_screen 3;

    # Select Identifier field
    send_key 'alt-f';
    wait_still_screen 3;

    # Change Identifier value
    for (1 .. 40) { send_key 'backspace'; }
    wait_still_screen 3;
    type_string '132';
    wait_still_screen 3;

    # Un-check Use Authentication
    send_key 'alt-u';
    wait_still_screen 3;

so lots of wasteful and racy wait_still_screen calls. I cannot guarantee that we would be able to fix that issue at all. Someone should improve that code; see the sketch below for one possible direction. Could you take that into your scope in a separate ticket, please?
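
A minimal, untested sketch of what such an improvement could look like, replacing the fixed waits with the existing testapi helpers wait_screen_change and assert_screen; this is not a verified drop-in replacement and the final needle tag is hypothetical:

    assert_screen 'iscsi-target-overview-add-target-tab';

    # Still wait for the auto-generated Identifier value to appear
    wait_screen_change(undef, 10);

    # Jump to the Target field and wait exactly until the UI reacts,
    # instead of sleeping for a fixed amount of time
    wait_screen_change { send_key 'alt-t' };
    for (1 .. 40) { send_key 'backspace'; }
    type_string 'iqn.2016-02.de.openqa';

    # Same for the Identifier field
    wait_screen_change { send_key 'alt-f' };
    for (1 .. 40) { send_key 'backspace'; }
    type_string '132';

    # Un-check "Use Authentication" and verify the resulting dialog state
    # with a needle instead of yet another wait_still_screen
    wait_screen_change { send_key 'alt-u' };
    assert_screen 'iscsi-target-authentication-disabled';    # hypothetical needle tag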

Actions #6

Updated by okurz 6 months ago

  • Subject changed from test fails in support_server/setup to significant increase in multi-machine test failures on OSD since 2023-10-25, e.g. test fails in support_server/setup
  • Description updated (diff)

For H3 https://openqa.suse.de/tests/12691358#investigation shows

diff_to_last_good   

-   "BASE_TEST_ISSUES" : "29978",
+   "BASE_TEST_ISSUES" : "29290",
-   "BUILD" : ":29978:qemu",
+   "BUILD" : ":29290:libfido2",
-   "INCIDENT_ID" : "29978",
-   "INCIDENT_REPO" : "http://download.suse.de/ibs/SUSE:/Maintenance:/29978/SUSE_Updates_SLE-Module-Basesystem_15-SP5_x86_64,http://download.suse.de/ibs/SUSE:/Maintenance:/29978/SUSE_Updates_SLE-Module-Server-Applications_15-SP5_x86_64",
+   "INCIDENT_ID" : "29290",
+   "INCIDENT_REPO" : "http://download.suse.de/ibs/SUSE:/Maintenance:/29290/SUSE_Updates_SLE-Module-Basesystem_15-SP5_x86_64",
-   "NICMAC" : "52:54:00:12:0a:b7",
+   "NICMAC" : "52:54:00:12:0d:60",

-   "NICVLAN" : "115",
+   "NICVLAN" : "137",
-   "PRJDIR" : "/var/lib/openqa/cache/openqa.suse.de",
+   "PRJDIR" : "/var/lib/openqa/share",
-   "QEMUPORT" : "20172",
+   "QEMUPORT" : "20462",
-   "REPOHASH" : "1698322664",
+   "REPOHASH" : "1698310213",
-   "RRID" : "SUSE:Maintenance:29978:311661",
+   "RRID" : "SUSE:Maintenance:29290:311633",
-   "SERVERAPP_TEST_ISSUES" : "29978",
-   "TAPDEV" : "tap16",
+   "TAPDEV" : "tap45",
-   "VNC" : "107",
+   "VNC" : "136",
-   "WORKER_CLASS" : "qemu_x86_64,qemu_x86_64_staging,qemu_x86_64-large-mem,amd,tap,prg,prg2,worker35,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3",
-   "WORKER_HOSTNAME" : "worker35.oqa.prg2.suse.org",
-   "WORKER_ID" : 2743,
-   "WORKER_INSTANCE" : 17,
+   "WORKER_CLASS" : "qemu_x86_64,qemu_x86_64_staging,qemu_x86_64-large-mem,amd,tap,prg,prg2,worker34,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3",
+   "WORKER_HOSTNAME" : "worker34.oqa.prg2.suse.org",
+   "WORKER_ID" : 3424,
+   "WORKER_INSTANCE" : 46,

last_good   12686884
needles_diff_stat   

needles_log 

No needle changes recorded, test regression due to needles unlikely

test_diff_stat  

test_log    

No test changes recorded, test regression unlikely

So it is unlikely that the problem comes from os-autoinst-distri-opensuse or needle changes. There are also no relevant test setting changes that would come from differing settings in the openQA database, job templates and such. However, the product has changed and I can't rule out changes there, e.g. SLE maintenance updates.

From the first bad job https://openqa.suse.de/tests/12691358/logfile?filename=autoinst-log.txt the os-autoinst version is 4.6.1698238759.64b339c vs. the last good https://openqa.suse.de/tests/12686884/logfile?filename=autoinst-log.txt with 4.6.1698187055.10dd7a0

$ git log1 --no-merges 10dd7a0..64b339c
d438393f Use commit message checks from os-autoinst-common
c5986103 (okurz/feature/s390, feature/s390) backend::baseclass: Fix wording of informative message
fb8c1fed Slightly simplify backend::baseclass
54ad428c Remove unused tools/absolutize

The only commit that remotely sounds like it could introduce functional changes is https://github.com/os-autoinst/os-autoinst/commit/fb8c1fed1a021354a62232f6579183c269a3d29b, but the diff looks very unsuspicious.

Also from https://openqa.suse.de/tests/12691358#comments we see that a retry passes and qam_ha_priorityfencing_supportserver:investigate:last_good_build::29978:qemu: passed, which is another indication of a sporadic issue not related to os-autoinst or openQA changes. Taking a look at worker34.oqa.prg2.suse.org:/var/log/zypp/history I see:

2023-10-25 02:12:57|command|root@worker34|'zypper' '-n' '--no-refresh' '--non-interactive-include-reboot-patches' 'patch' '--replacefiles' '--auto-agree-with-licenses' '--download-in-advance'|
2023-10-25 02:12:58|install|libruby2_5-2_5|2.5.9-150000.4.29.1|x86_64||repo-sle-update|3a958f3465e4eab4839b2863523e30d61bca64a982b4f05ca17dcfd656202b59|
2023-10-25 02:12:58|install|ruby2.5-stdlib|2.5.9-150000.4.29.1|x86_64||repo-sle-update|a76c4b98b007a31b26466734f0049d1efb467c2fdb8d8efaaaef586b7224c873|
2023-10-25 02:12:58|install|ruby2.5|2.5.9-150000.4.29.1|x86_64||repo-sle-update|6af225578dacb2ebef69532cd5de72134604821968a2761d321c527940f381ec|
2023-10-25 02:12:58|patch  |openSUSE-SLE-15.5-2023-4176|1|noarch|repo-sle-update|important|security|needed|applied|
2023-10-25 07:14:51|command|root@worker34|'zypper' '--no-refresh' '-n' 'dup' '--replacefiles'|
2023-10-25 07:14:51|install|os-autoinst|4.6.1698187055.10dd7a0-lp155.1689.1|x86_64||devel_openQA|a3847ebfcabbb86c32dc00c423eb162f685a1c9d17dde7facdf68e8d392650a3|
2023-10-25 07:14:52|install|os-autoinst-devel|4.6.1698187055.10dd7a0-lp155.1689.1|x86_64||devel_openQA|b6e93750283ae620e55bb1f97ce43c60a22c3db60539d7f764efc843f0b2b581|
2023-10-25 07:14:52|install|os-autoinst-swtpm|4.6.1698187055.10dd7a0-lp155.1689.1|x86_64||devel_openQA|8ebe851440ee8fc34098e01b183197879237cd7f92e818648c35e8c5874aaa78|
2023-10-25 07:14:54|install|os-autoinst-openvswitch|4.6.1698187055.10dd7a0-lp155.1689.1|x86_64||devel_openQA|51a7ed5e6e4dca782384a17a461ca8b2acf055352e1b27c06b20a6db64c68cd0|
2023-10-25 07:14:54|install|openQA-common|4.6.1698152470.c944acc-lp155.6147.1|x86_64||devel_openQA|b8ba7ef9c7ad73a7712ed61d0fd505733267c9b071738bf1d6f1fb970e3b55aa|
2023-10-25 07:14:54|install|os-autoinst-distri-opensuse-deps|1.1698196593.7916f33b-lp155.13125.1|noarch||devel_openQA|7d17349c1cb5c983317959bc51df86bfcd5dd37500080995a7cdb62bbdf791a2|
2023-10-25 07:14:54|install|openQA-client|4.6.1698152470.c944acc-lp155.6147.1|x86_64||devel_openQA|0d2d629f8f406eed951c369bff7cd40ff67e212c17ee48b87a85abc0e79edfc4|
2023-10-25 07:14:56|install|openQA-worker|4.6.1698152470.c944acc-lp155.6147.1|x86_64||devel_openQA|f794647d29b0daef763e518533678eb949e3d4ccc7ba9913f0e640e28be77fa3|
2023-10-26 02:13:09|command|root@worker34|'zypper' '-n' '--no-refresh' '--non-interactive-include-reboot-patches' 'patch' '--replacefiles' '--auto-agree-with-licenses' '--download-in-advance'|
2023-10-26 02:13:10|install|libnghttp2-14|1.40.0-150200.12.1|x86_64||repo-sle-update|6625e233bc93d47e048dfc9d7a6df96a473a542f95815dae87f5c0db80dd532c|
2023-10-26 02:13:10|install|libssh2-1|1.11.0-150000.4.19.1|x86_64||repo-sle-update|291590f6d5e84f8ad50960aa756501984c9fff723159d3533577fcec5735aec6|
2023-10-26 02:13:10|patch  |openSUSE-SLE-15.5-2023-4192|1|noarch|repo-sle-update|moderate|recommended|needed|applied|
2023-10-26 02:13:10|patch  |openSUSE-SLE-15.5-2023-4200|1|noarch|repo-sle-update|important|security|needed|applied|
2023-10-27 02:13:15|command|root@worker34|'zypper' '-n' '--no-refresh' '--non-interactive-include-reboot-patches' 'patch' '--replacefiles' '--auto-agree-with-licenses' '--download-in-advance'|
2023-10-27 02:13:16|install|libz1|1.2.13-150500.4.3.1|x86_64||repo-sle-update|1f273509bd76f485a289e23791a3d9c5fec7b982fe91f59000d191d40375840d|
2023-10-27 02:13:16|install|zlib-devel|1.2.13-150500.4.3.1|x86_64||repo-sle-update|16b0c66f6384d2ed18894441075a928d263897e8fb1c0c496f9ee41f3a1c2411|
2023-10-27 02:13:16|patch  |openSUSE-SLE-15.5-2023-4215|1|noarch|repo-sle-update|moderate|security|needed|applied|
2023-10-27 07:15:51|command|root@worker34|'zypper' '--no-refresh' '-n' 'dup' '--replacefiles'|
2023-10-27 07:15:51|install|os-autoinst|4.6.1698238759.64b339c-lp155.1693.1|x86_64||devel_openQA|d3acaf331ba15656171a8104292fe48edac5c155076a951d16b1bac8c4469827|
2023-10-27 07:15:51|install|os-autoinst-devel|4.6.1698238759.64b339c-lp155.1693.1|x86_64||devel_openQA|03ec7a9aef0b480cb0cf652f53dd2758b0ca52cebb17d6b2eda58ff79610845d|
2023-10-27 07:15:51|install|os-autoinst-swtpm|4.6.1698238759.64b339c-lp155.1693.1|x86_64||devel_openQA|b2716a8dcd6006b32d63f1baa8def8fae08afb8e7036a5f5e25a346eb716b729|
2023-10-27 07:15:53|install|os-autoinst-openvswitch|4.6.1698238759.64b339c-lp155.1693.1|x86_64||devel_openQA|092833173f63a5c0731f9e47e6255f89a6a36262310ad39b4628b000714e8635|
2023-10-27 07:15:53|install|openQA-common|4.6.1698238589.f8f5bc4-lp155.6149.1|x86_64||devel_openQA|9fcf9828ae242a645e6d9e6b6d2878d7b49cbba4007d962b3f80b7bfbbc94a6d|
2023-10-27 07:15:53|install|os-autoinst-distri-opensuse-deps|1.1698329766.7f036688-lp155.13144.1|noarch||devel_openQA|64d8e3e8e5257b4c86f3d04dcff5c6516ecbe8be5939b383848f91d08a862542|
2023-10-27 07:15:53|install|openQA-client|4.6.1698238589.f8f5bc4-lp155.6149.1|x86_64||devel_openQA|b82470bc7f49d28bf5017d70bb3a411627829ae46bf0e00bc431a7629da70d4c|
2023-10-27 07:15:58|install|openQA-worker|4.6.1698238589.f8f5bc4-lp155.6149.1|x86_64||devel_openQA|33428dde2cf8a24604c863dd4d2332ef1d99686807a351bd016fe7fe46f17d76|
2023-10-27 07:15:58|install|python3-cryptography|3.3.2-150400.20.3|x86_64||repo-sle-update|8d01db80914ea5adb8bfa4e7bd3ad7f8976aab6cc76859f1da5871870d8ca797|
2023-10-27 07:15:58|patch  |openSUSE-SLE-15.5-2023-4194|1|noarch|repo-sle-update|low|feature|needed|applied|

with nothing suspicious unless we suspect the ruby security patch ;) For openQA a broader diff log would be

$ git log1 --no-merges d08787a..f8f5bc4
94d1adde3 Use commit message checks from os-autoinst-common
02be1c5aa Warn when modifying files under external directly
89463ff34 (okurz/feature/ci, feature/ci) CI: Use consistent casing in commit message check
f8e89a368 CI: Fix typo in github action name
80296b8c1 Update .github/workflows/commit_message_checker.yml
ab85100ef Update commit-message-checker & add extra rule for subject lines

so again far from suspicious

Actions #7

Updated by okurz 6 months ago

  • Description updated (diff)
Actions #8

Updated by okurz 6 months ago

  • Description updated (diff)

I can't disable all worker instances on worker31-36 as they run other things than just x86_64 qemu, e.g. s390x. So I am enabling them again, but luckily we have spread the non-x86_64-qemu worker instances evenly, so I can do

for i in {31..36}; do sudo salt-key -y -a worker$i.oqa.prg2.suse.org; done
sudo salt --no-color --state-output=changes 'worker*' state.apply | grep -av 'Result.*Clean'
sudo salt 'worker3[1-4].oqa.*' cmd.run "sudo systemctl mask --now openqa-worker-auto-restart@{11..50}"

so for now keeping w35+w36 enabled but with the instances on w31-w34 masked.

sudo salt 'worker3[1-9].oqa.*' cmd.run "sudo pgrep -af 'openqa.*worker'"

looks ok.

Experiment to derive the fail ratio, if any, with the reference scenario "ovs-client+server" from #136013, parameterized by worker while keeping each cluster within one worker host:

for i in {29..40}; do name=poo138698-okurz-w$i; openqa-clone-job --repeat=10 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12691977 TEST+=-$name BUILD=$name _GROUP=0 WORKER_CLASS=worker$i,tap; done

The w31-34 jobs shouldn't actually start, as I assume there is no worker class combination like "worker31,tap" being served, even though that is unsafe as I should include qemu-x86_64 to avoid running on s390x or something.

Actions #9

Updated by openqa_review 6 months ago

  • Due date set to 2023-11-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by okurz 6 months ago

Tests for w29+30, 35+36+37+38+39+40 are 10/10 green, so no problem there. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-2d&to=now&viewPanel=24 shows a significant improvement, back to 10% failed+parallel_failed. I retriggered multiple tests, in particular HA/SAP SLE maintenance aggregate tests; now https://openqa.suse.de/tests/overview?groupid=405&flavor=SAP-DVD-Updates&flavor=Server-DVD-HA-Updates is all good.

As no other tests reproduced the error, and it also seems there are no more x86_64 qemu tests scheduled right now on OSD, I re-enabled all worker instances. Now more of my debugging jobs can start, and I can also try to reproduce the original problem.

$ name=poo138698-okurz; openqa-clone-job --skip-chained-deps --skip-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12691358 INCLUDE_MODULES=support_server/login,support_server/setup TEST+=-$name BUILD=$name _GROUP=0 WORKER_CLSS=qemu_x86_64,tap,worker31
Cloning parents of sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priorityfencing_supportserver@64bit
Cloning children of sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priorityfencing_supportserver@64bit
Cloning parents of sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priority_fencing_node01@64bit
Cloning parents of sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priority_fencing_node02@64bit
3 jobs have been created:
 - sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priorityfencing_supportserver@64bit -> https://openqa.suse.de/tests/12710499
 - sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priority_fencing_node01@64bit -> https://openqa.suse.de/tests/12710497
 - sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priority_fencing_node02@64bit -> https://openqa.suse.de/tests/12710498

worker31 cannot work, see #137756, trying w32

$ name=poo138698-okurz; openqa-clone-job --skip-chained-deps --skip-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12691358 INCLUDE_MODULES=support_server/login,support_server/setup TEST+=-$name BUILD=$name _GROUP=0 WORKER_CLASS=qemu_x86_64,tap,worker32
Cloning parents of sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priorityfencing_supportserver@64bit
Cloning children of sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priorityfencing_supportserver@64bit
Cloning parents of sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priority_fencing_node01@64bit
Cloning parents of sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priority_fencing_node02@64bit
3 jobs have been created:
 - sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priorityfencing_supportserver@64bit -> https://openqa.suse.de/tests/12710502
 - sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priority_fencing_node01@64bit -> https://openqa.suse.de/tests/12710501
 - sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29290:libfido2-qam_ha_priority_fencing_node02@64bit -> https://openqa.suse.de/tests/12710500

That failed because with INCLUDE_MODULES the parallel jobs have no test modules at all, so using EXCLUDE_MODULES instead:

name=poo138698-okurz; openqa-clone-job --skip-chained-deps --skip-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12691358 EXCLUDE_MODULES=ha/barrier_init,support_server/wait_children TEST+=-$name BUILD=$name _GROUP=0 WORKER_CLASS=qemu_x86_64,tap,worker32

-> https://openqa.suse.de/tests/12710507

And then also with --export-command to spread over workers:

name=poo138698-okurz; openqa-clone-job --export-command --skip-chained-deps --skip-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12691358 EXCLUDE_MODULES=ha/barrier_init,support_server/wait_children TEST+=-$name BUILD=$name _GROUP=0 WORKER_CLASS=qemu_x86_64,tap,worker32

and then from that

openqa-cli api --host … 'WORKER_CLASS:12691358=qemu_x86_64,tap,worker32' 'WORKER_CLASS:12691361=qemu_x86_64,tap,worker33' 'WORKER_CLASS:12691363=qemu_x86_64,tap,worker34' …

-> https://openqa.suse.de/tests/12710512

EXCLUDE_MODULES seems to have no effect so doing

openqa-cli api --host … BUILD=poo138698-okurz-w32-33-34-2 … EXCLUDE_MODULES:12691358=barrier_init,wait_children … 'WORKER_CLASS:12691358=qemu_x86_64,tap,worker32' 'WORKER_CLASS:12691361=qemu_x86_64,tap,worker33' 'WORKER_CLASS:12691363=qemu_x86_64,tap,worker34'

-> https://openqa.suse.de/tests/12710515 passed, so at least this is not an easily reproducible problem.

for i in {001..040}; do openqa-cli api --host … BUILD=poo138698-okurz-w32-33-34-2 … EXCLUDE_MODULES:12691358=barrier_init,wait_children … 'TEST:12691358=qam_ha_priorityfencing_supportserver-poo138698-okurz-$i' 'TEST:12691361=qam_ha_priority_fencing_node01-poo138698-okurz-$i' 'TEST:12691363=qam_ha_priority_fencing_node02-poo138698-okurz-$i' … 'WORKER_CLASS:12691358=qemu_x86_64,tap_poo138707,worker32' 'WORKER_CLASS:12691361=qemu_x86_64,tap,worker33' 'WORKER_CLASS:12691363=qemu_x86_64,tap,worker34'

-> https://openqa.suse.de/tests/12710520 -> https://openqa.suse.de/tests/overview?build=poo138698-okurz-w32-33-34-2

Actions #11

Updated by okurz 6 months ago

  • Related to action #138707: Re-enable worker32 for multi-machine tests in production added
Actions #12

Updated by acarvajal 6 months ago

Hello,

Just checked the Aggregate SAP & HA tests that ran over the weekend and things look much improved (meaning, failures are not related to this ticket). I still need to take a look at the Incident jobs (of which there are more) and at the ticket activity from late Friday and the weekend, but I wanted to drop a line regarding results.

Actions #13

Updated by acarvajal 6 months ago

Went through HA & SAP Incidents, and do not see the issues from last Friday.

Actions #14

Updated by acarvajal 6 months ago

Update from today. Issues seem to be present again. :(

  1. Found 3 failures in Aggregated jobs and over 10 in Incidents where the SUT had been rebooted. For example https://openqa.suse.de/tests/12725464#step/check_after_reboot/13 (ran in worker33, Support Server in worker39)
  2. Found https://openqa.suse.de/tests/12731342#step/ha_cluster_init/7 in worker33 where connection from SUT to Support Server (running in worker29) failed.
  3. Found https://openqa.suse.de/tests/12731350#step/iscsi_client/15 in worker33 where connection from SUT to Support Server (running in worker35) failed.
  4. HAWK client in worker39 https://openqa.suse.de/tests/12725522#step/hawk_gui/29 fails to connect to node 1 running in worker34
  5. Cluster node running in worker40 https://openqa.suse.de/tests/12725512#step/cluster_md/32 fails to connect to other node in the cluster (running in worker33)
  6. HAWK client in worker37 https://openqa.suse.de/tests/12725591#step/hawk_gui/36 cannot connect to node2 in worker33 but it can connect to node1 in worker39
  7. Cluster node fails to reach qnetd server at 10.0.2.17 https://openqa.suse.de/tests/12724650#step/qnetd/26. Node runs in worker29 and qnetd server in worker34.
  8. Cluster init fails because qnetd server is unreachable https://openqa.suse.de/tests/12724701#step/ha_cluster_init/15. Node 1 in worker38, qnetd server in worker33.
  9. SUT cannot resolve names using the DNS provided by the Support Server: https://openqa.suse.de/tests/12724666#step/cluster_md/3. SUT in worker29, SS in worker33.
  10. HAWK client in worker29 fails to connect to a cluster node also in worker29 https://openqa.suse.de/tests/12726431#step/hawk_gui/29, but the error is name resolution and the Support Server is in worker33.
  11. Node in worker34 cannot join cluster https://openqa.suse.de/tests/12731427#step/remove_node/11. Node 1 in worker35

As on last Friday, most of the issues seem to be either in worker33 or worker34, so I have a strong suspicion that something is broken there.

These are the types of errors I found in the HA job groups. I will check the SAP job groups next, but since those only have 1 cluster, I don't expect to find anything different. I will add it here if I do.

Actions #15

Updated by livdywan 6 months ago

  • Description updated (diff)
  • Assignee changed from okurz to livdywan
  • Priority changed from Normal to Urgent

Checking with @acarvajal to see if we can spot a commonality here. Seemingly jobs involving any of those workers can fail, and we don't have a reliable reproducer.

Quoting okurz: "there are no more x86_64 qemu tests scheduled right now on OSD I re-enabled all worker instances"

Updating the rollback steps accordingly. Also editing the description, since it's misleading to include w31 or w32 here, neither of which can currently run MM jobs.

Actions #16

Updated by livdywan 6 months ago

Let's start with taking out w33 and w34 anyway https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/666

Actions #17

Updated by livdywan 6 months ago

  • Subject changed from significant increase in multi-machine test failures on OSD since 2023-10-25, e.g. test fails in support_server/setup to significant increase in multi-machine test failures on OSD since 2023-10-25, e.g. test fails in support_server/setup size:M
  • Description updated (diff)
Actions #18

Updated by livdywan 6 months ago

livdywan wrote in #note-16:

Let's start with taking out w33 and w34 anyway https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/666

It would seem as though things are looking good again, and it might be one of the two machines. Pondering bringing back one of the workers to confirm.

Actions #19

Updated by acarvajal 6 months ago

livdywan wrote in #note-18:

livdywan wrote in #note-16:

Let's start with taking out w33 and w34 anyway https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/666

It would seem as though things are looking good again, and it might be one of the two machines.

Looking at the previous 2 days of Aggregate jobs and currently blocked MU results:

  1. No SAP jobs failed on Multi-Machine-related issues. SAP failures were either NFS-related (see https://progress.opensuse.org/issues/135980) or on IPMI backends.
  2. 2 HA jobs failed, but I don't have enough information to tie those failures to this MM issue (see https://openqa.suse.de/tests/12738521#step/check_after_reboot/16 & https://openqa.suse.de/tests/12738710#step/register_without_ltss/70). The first one could be related, as it could be that the node was restarted, but the 2nd failure is definitely not related to this, as it seems to be a repository issue.

Pondering bringing back one of the workers to confirm.

That would be fine from our end. Let us know when this is done so we can be on the lookout for failures.

Actions #20

Updated by livdywan 6 months ago

  • Description updated (diff)
Actions #21

Updated by livdywan 6 months ago

  • Description updated (diff)
Actions #22

Updated by livdywan 6 months ago

I decided to simply spawn a batch of qam_ha_priorityfencing_supportserver. Let's see what the failure rate of that will be. We need to confirm whether this is one or multiple test issues, an issue with the worker setup, or something elsewhere in the infrastructure.

Actions #23

Updated by acarvajal 6 months ago

livdywan wrote in #note-22:

I decided to simply spawn a batch of qam_ha_priorityfencing_supportserver. Let's see what the failure rate of that will be. We need to confirm wether this is one or multiple test issues, an issue with the worker setup or elsewhere in the infrastructure.

Wow, this is cool! There are 5 failures so far ... all of them with at least one job in worker34. Good idea using the Priority Fencing test, as it should run in ca. 30-40 minutes.

Could this be used as a reproducer, or is it still too big a test for such a purpose?

Actions #25

Updated by livdywan 6 months ago

  • Copied to action #139055: Comments mentioning bugrefs as part of a sentence are treated like bug refs and taken over size:S added
Actions #26

Updated by okurz 6 months ago

To have a quiet hack week I suggest you simply disable w34 again for tap use for now and then reduce the prio here.

Actions #27

Updated by okurz 6 months ago

  • Due date changed from 2023-11-11 to 2023-11-17

special hackweek due-date bump

Actions #28

Updated by livdywan 6 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/668 to take out worker34 again.

My assumption is that this ticket will be closed once the "significant increase" has been addressed. That's what this ticket is about.

Actions #29

Updated by livdywan 6 months ago

  • Description updated (diff)
Actions #30

Updated by livdywan 6 months ago

  • Related to action #139070: Re-enable worker34 for multi-machine tests in production added
Actions #31

Updated by acarvajal 6 months ago

  • Due date changed from 2023-11-17 to 2023-11-11
  • Priority changed from High to Urgent

livdywan wrote in #note-29:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/669 to confirm that worker 33 is indeed fine.

It seems worker33 has its own impact: Aggregate jobs over the weekend were fine, but there were many failures in Single Incidents.

In HA job groups:

In SAP job groups:

While not totally thorough, I'd say this looks similar to what I saw when worker34 was enabled in #note-24 and will maintain my hypothesis that there's something wrong with these 2 workers.

Actions #32

Updated by livdywan 6 months ago

  • Related to action #139154: Re-enable worker33 for multi-machine tests in production added
Actions #33

Updated by livdywan 6 months ago

While not totally thorough, I'd say this looks similar to what I saw when worker34 was enabled in #note-24 and will maintain my hypothesis that there's something wrong with these 2 workers.

Ack. Taking w33 out of production for MM again.

Actions #34

Updated by livdywan 6 months ago

  • Priority changed from Urgent to High

@acarvajal Thank you for the update even during hack week. Let's see if what remains looks stable enough so we can start to focus on narrowing down the specific problems.

Actions #35

Updated by livdywan 6 months ago

  • Description updated (diff)
Actions #36

Updated by JERiveraMoya 6 months ago

It looks like this ticket is being used to label unrelated things, right? Please see these two examples for the Beta candidate:
https://openqa.suse.de/tests/12774019#comments
https://openqa.suse.de/tests/12775212#comments

Actions #37

Updated by livdywan 6 months ago

JERiveraMoya wrote in #note-36:

Looks like this ticket is labeling unrelated things, right? Please, see these two examples for Beta candidate:
https://openqa.suse.de/tests/12774019#comments
https://openqa.suse.de/tests/12775212#comments

Those are linked to bsc#1217056 and bsc#1191684 respectively. Do you mean they were linked to this ticket?
There were some jobs that were accidentally linked because a comment used the wrong format, "Re-running to verify connection with poo#138698". If that's what you saw, please accept my apologies. That was only meant to be an informational comment.

Actions #38

Updated by okurz 6 months ago

  • Due date deleted (2023-11-11)
  • Status changed from Feedback to Resolved

Looking at https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=now-7d&to=now it seems we are good regarding the ratio of failed multi-machine tests. Also https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_priorityfencing_supportserver&version=15-SP5#next_previous is very reliable, which covers AC1, and AC2 is covered by #138707, #139070, #139154.

Normally I would check for this ticket being used as a ticket label on o3+osd, but as o3 is currently not reachable due to https://progress.opensuse.org/issues/150815, openqa-query-for-job-label wouldn't work, so I won't bother and call the ticket resolved.

Actions #39

Updated by JERiveraMoya 6 months ago

Now that this ticket is resolved, what is the ticket to use for https://openqa.suse.de/tests/12799702? Wasn't that the same issue?
The problem seems to be a zypper issue connecting to OSD in an MM setup.

Actions #40

Updated by okurz 6 months ago

  • Status changed from Resolved to Workable
  • Assignee deleted (livdywan)

ok, https://openqa.suse.de/tests/12799702 looks like the same original problem, reopening.

Actions #41

Updated by bschmidt 6 months ago

I've talked to Paolo Stivanin and we both agreed that https://openqa.suse.de/tests/12799702 does not look like a multi-machine issue.
Therefore I've unlinked this poo ticket from that job now.

Actions #42

Updated by livdywan 6 months ago

  • Status changed from Workable to Feedback
  • Assignee set to livdywan

bschmidt wrote in #note-41:

I've talked to Paolo Stivanin and we both agreed that https://openqa.suse.de/tests/12799702 does not look like a multi machine issue.
Therefore I've unlinked the poo from that job now.

Thanks!

I'm taking the ticket again and will monitor for a bit. Notably we have #136013 with regard to more general underlying issues, and #139070, #138707, #139154 and #136130 concerning particular workers. Remember that this ticket is not about one individual issue but rather about the symptoms.

Actions #43

Updated by okurz 5 months ago

  • Due date set to 2023-11-24

ok, discussed in the tools team meeting. livdywan, you mentioned another ticket about the specific issue apparent in the job failure of https://openqa.suse.de/tests/12799702#comments, but there is only a link back to this ticket here. Please provide the corresponding reference.

Actions #44

Updated by livdywan 5 months ago

  • Status changed from Feedback to Resolved

okurz wrote in #note-43:

ok, discussed in the tools team meeting. livdywan, you mentioned another ticket about the specific issue apparent in the job failure of https://openqa.suse.de/tests/12799702#comments, but there is only a link back to this ticket here. Please provide the corresponding reference.

Fixed. Issue #150932 wasn't linked on all of the jobs.

Actions #45

Updated by livdywan 5 months ago

  • Related to action #150932: [security][SP6] Failed to connect to openqa.suse.de port 80 in krb5_crypt_nfs_server added
Actions #46

Updated by livdywan 5 months ago

  • Related to deleted (action #150932: [security][SP6] Failed to connect to openqa.suse.de port 80 in krb5_crypt_nfs_server)
Actions #47

Updated by openqa_review 5 months ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: extra_tests_webserver
https://openqa.suse.de/tests/12837409#step/php_version/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #48

Updated by mkittler 5 months ago

  • Status changed from Feedback to In Progress
  • Assignee changed from livdywan to mkittler
Actions #49

Updated by mkittler 5 months ago

The problem doesn't look that bad anymore and it makes the most sense to handle current problems as part of #151310. I'll remove the still-present bugrefs.

Actions #50

Updated by mkittler 5 months ago

  • Status changed from In Progress to Resolved

Removed all comments referencing this ticket via SQL.

Actions #51

Updated by okurz 5 months ago

  • Parent task set to #111929
Actions #52

Updated by okurz 5 months ago

  • Related to action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Actions #53

Updated by okurz 3 months ago

  • Due date deleted (2023-11-24)