action #157534
closed
coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines
Added by acarvajal 9 months ago.
Updated 9 months ago.
Category: Regressions/Crashes
Description
Observation
openQA test in scenario sle-15-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_rolling_update_node01@64bit fails in suseconnect_scc. It fails while attempting to call script_output, which runs a curl command to 10.0.2.2 to download the script.
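For context, the failing step appears to boil down to a reachability problem: the SUT cannot fetch the generated script from 10.0.2.2. A minimal, purely illustrative probe for that address could look like the sketch below; the port is a placeholder and this is not part of the actual test code:

# Hypothetical reachability probe for the address the failing curl targets.
# Port 80 is a placeholder, not the port os-autoinst actually serves on.
import socket

def can_reach(host="10.0.2.2", port=80, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(can_reach())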
Test suite description
Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.
rolling_update tests take a working cluster and perform a migration on each node while the node is in maintenance.
Reproducible
Fails sporadically since (at least) Build :32868:expat
The majority of the failures have been seen on worker40.
Expected result
Last good: :32996:sed (or more recent)
Further details
Always latest result in this scenario: latest
Other failures in different tests, but always in Multi-Machine scenarios and mostly on worker40:
- Related to action #151382: [qe-sap] test fails in iscsi_client with unclear error message, please add "ping_size_check" from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17817 to rule out MTU problems added
- Category set to Regressions/Crashes
- Target version set to Ready
We also got some failures outside of worker40:
- Status changed from New to In Progress
- Assignee set to okurz
- Subject changed from Multi-Machine Job fails in suseconnect_scc to Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines
- https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/748 should fix that because I actually made a mistake when setting up those new worker instances.
- Mitigation of impact, e.g. retrigger and monitor impacted tests -> called
host=openqa.suse.de failed_since=2024-03-18 WORKER=worker40 result="result='failed'" openqa-advanced-retrigger-jobs
but acarvajal also took care of jobs
- Make region,datacenter,location consistent -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/749
- How can we prevent such a situation in the future in our salt states? -> #157606 (see the sketch after this list)
- Should we have an actual openQA worker at the location prg2e? -> That's something we can consider for the future. I wouldn't do that for now; the current setup should be good enough.
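As a sketch of the kind of prevention asked for in #157606, one could sanity-check the rendered workerconf before deploying it. The key names below (region, datacenter, location) come from the follow-up MRs and tickets mentioned in this ticket, but their exact nesting inside each workerconf entry is an assumption, so treat this purely as an illustration of failing early on missing keys:

import yaml

# Hypothetical pre-deployment check: warn about workerconf entries that lack
# the location-related keys. The nesting of the keys is assumed, not verified.
EXPECTED_KEYS = ("region", "datacenter", "location")

with open("workerconf_rendered.yaml", "r") as fi:
    workerconf = yaml.safe_load(fi)["workerconf"]

for worker, settings in workerconf.items():
    if not isinstance(settings, dict):
        continue
    missing = [key for key in EXPECTED_KEYS if key not in settings]
    if missing:
        print(f"{worker}: missing {', '.join(missing)}")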
I want to add how I found this:
- Render our workerconf.sls to plain YAML with:
salt-call --local cp.get_template /srv/pillar/openqa/workerconf.sls /tmp/test.output
- Create a short Python script in the "_modules" directory of our salt-states, because that directory contains the code for "gre_peers.compute". I used this script:
import yaml
from gre_peers import compute

# Load the workerconf pillar rendered to plain YAML (see the salt-call step above)
with open("workerconf_rendered.yaml", "r") as fi:
    y = yaml.safe_load(fi)

# Print the GRE peers that gre_peers.compute derives for worker40
print(compute("worker40", y["workerconf"]))
I then placed "import pdb; pdb.set_trace()" at strategic places in gre_peers.py to understand how this happens.
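Purely for illustration, that breakpoint placement amounts to something like the following; only the compute(hostname, workerconf) call shape is taken from the script above, while the parameter names and body are placeholders:

# Hypothetical excerpt of gre_peers.py with a temporary breakpoint added.
def compute(hostname, workerconf):
    import pdb; pdb.set_trace()  # drop into the debugger to inspect workerconf for this host
    ...  # the real peer-computation logic continues here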
- Parent task set to #111929
- Copied to action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration added
- Due date set to 2024-04-04
Setting due date based on mean cycle time of SUSE QE Tools
- Due date deleted (2024-04-04)
- Status changed from In Progress to Resolved
- Related to action #157147: Documentation for OSD worker region, location, datacenter keys in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls size:S added