action #157534


coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines

Added by acarvajal about 1 month ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-03-19
Due date:
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario sle-15-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_rolling_update_node01@64bit fails in suseconnect_scc

It fails while attempting to call script_output which does a curl command to 10.0.2.2 to download the script.
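The failing step boils down to an HTTP download from the SUT back to the worker. A minimal sketch (illustrative, not os-autoinst code) of this kind of reachability check, where 10.0.2.2 is the default QEMU user-network gateway pointing back at the worker:

```python
# Illustrative sketch of the check that fails here: the SUT fetches a
# helper script over HTTP from the worker (10.0.2.2 inside the VM).
# The helper function name is an assumption, not part of os-autoinst.
import urllib.request
import urllib.error


def can_download(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET on `url` succeeds within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

When networking between SUT and worker is broken (as with the missing GRE tunnel below), such a check times out or gets connection-refused and returns False.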

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.

The rolling_update tests take a working cluster and perform a migration on each node while that node is in maintenance mode.

Reproducible

Fails sporadically since (at least) Build :32868:expat

The majority of the failures have been seen on worker40.

Expected result

Last good: :32996:sed (or more recent)

Further details

Always latest result in this scenario: latest


Related issues: 3 (2 open, 1 closed)

Related to openQA Tests - action #151382: [qe-sap] test fails in iscsi_client with unclear error message, please add "ping_size_check" from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17817 to rule out MTU problems (New, 2023-11-24)

Related to openQA Project - action #157147: Documentation for OSD worker region, location, datacenter keys in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls size:S (Resolved, mkittler, 2024-03-13)

Copied to openQA Infrastructure - action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration (New, 2024-03-19)

Actions #2

Updated by okurz about 1 month ago

  • Related to action #151382: [qe-sap] test fails in iscsi_client with unclear error message, please add "ping_size_check" from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17817 to rule out MTU problems added
Actions #3

Updated by okurz about 1 month ago

  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #6

Updated by acarvajal about 1 month ago

Another recent failure, this time in the ping_size_check added to the ha/iscsi_client test module: https://openqa.suse.de/tests/13826037#step/iscsi_client/8

Both recent failures were on worker40.

Actions #7

Updated by nicksinger about 1 month ago

From https://suse.slack.com/archives/C02CANHLANP/p1710931704905159?thread_ts=1710757297.684389&cid=C02CANHLANP:

  That's pretty easy: The MM tests cannot work like this on worker40 (cross worker MM tests at least).
/etc/wicked/scripts/gre_tunnel_preup.sh is missing on worker40.oqa.prg2.suse.org

we should check why https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/openvswitch.sls#L96-128 is not applied on that machine
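The diagnosis above hinges on a single missing file deployed by that salt state. A minimal, hypothetical sketch of checking for the hook's presence (the path is from the comment above; the helper name and `root` parameter are assumptions for testing against a fake filesystem root):

```python
# Hypothetical check for the problem described above: the wicked pre-up
# hook that openvswitch.sls should deploy is absent on worker40.
from pathlib import Path

GRE_HOOK = "etc/wicked/scripts/gre_tunnel_preup.sh"


def gre_hook_present(root: str = "/") -> bool:
    """True if the GRE tunnel pre-up script exists under `root`."""
    return (Path(root) / GRE_HOOK).is_file()
```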

Actions #8

Updated by okurz about 1 month ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #9

Updated by okurz about 1 month ago · Edited

  • Subject changed from Multi-Machine Job fails in suseconnect_scc to Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines
  1. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/748 should fix that because I actually made a mistake when setting up those new worker instances.
  2. Mitigate the impact, e.g. retrigger and monitor impacted tests -> called host=openqa.suse.de failed_since=2024-03-18 WORKER=worker40 result="result='failed'" openqa-advanced-retrigger-jobs; acarvajal also took care of affected jobs
  3. Make region,datacenter,location consistent -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/749
  4. How can we prevent such a situation in our salt states in the future? -> #157606
  5. Should we have an actual openQA worker at the location prg2e? -> That's something we can consider for the future; I wouldn't do it for now. The current setup should be good enough.
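Step 2 above selects failed jobs on one worker since a cutoff date for retriggering. A toy Python filter expressing the same selection (the job-dict shape is assumed for illustration, not the real openQA job schema or the retrigger script's implementation):

```python
# Toy selection logic mirroring the retrigger call above: failed jobs on a
# given worker that finished on or after a cutoff date (ISO date strings).
def jobs_to_retrigger(jobs, worker, failed_since, result="failed"):
    """Return ids of jobs matching worker, result and date cutoff.

    `jobs` is a list of dicts with 'id', 'result', 'worker', 'finished'
    keys -- an assumed shape, not the real openQA job schema.
    """
    return [
        j["id"]
        for j in jobs
        if j["result"] == result
        and j["worker"] == worker
        and j["finished"] >= failed_since
    ]
```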
Actions #10

Updated by nicksinger about 1 month ago

I want to add how I found this:

  1. Render our workerconf.sls to plain YAML with: salt-call --local cp.get_template /srv/pillar/openqa/workerconf.sls /tmp/test.output
  2. Create a short Python script in the "_modules" directory of salt-states, because that directory contains the code for "gre_peers.compute". I used this script:
import yaml
from gre_peers import compute  # helper module from salt-states "_modules"

# Load the workerconf pillar previously rendered to plain YAML
with open("workerconf_rendered.yaml", "r") as fi:
    y = yaml.safe_load(fi)

# Print the GRE peers the module computes for worker40
print(compute("worker40", y["workerconf"]))

I then placed "import pdb; pdb.set_trace()" at strategic places in gre_peers.py to understand how this happens.
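The idea behind gre_peers.compute (the real code lives in salt-states-openqa, with the real schema in workerconf.sls) can be sketched with a toy stand-in over an assumed data shape; it shows why a misconfigured worker class silently drops a host from the GRE peer list:

```python
# Toy stand-in for gre_peers.compute -- NOT the real implementation.
# Assumed shape: each worker maps to a list of worker classes; peers are
# all other workers sharing at least one class with it.
def toy_gre_peers(worker, workerconf):
    """Return the sorted list of hosts `worker` should build GRE tunnels to."""
    classes = set(workerconf[worker])
    return sorted(
        host
        for host, cls in workerconf.items()
        if host != worker and classes & set(cls)
    )
```

With this model, a worker whose classes were set up wrong (as in comment #9) simply shares no class with the others and ends up with an empty peer list, hence no tunnels.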

Actions #11

Updated by okurz about 1 month ago

  • Parent task set to #111929
Actions #12

Updated by okurz about 1 month ago

  • Copied to action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration added
Actions #13

Updated by openqa_review about 1 month ago

  • Due date set to 2024-04-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by okurz about 1 month ago

  • Due date deleted (2024-04-04)
  • Status changed from In Progress to Resolved

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/749 deployed, showing up with new classes on workers, no problems observed. Resolving.

Actions #16

Updated by okurz about 1 month ago

  • Related to action #157147: Documentation for OSD worker region, location, datacenter keys in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls size:S added