action #157534
closed
coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines
Added by acarvajal 9 months ago.
Updated 9 months ago.
Category: Regressions/Crashes
Description
Observation
openQA test in scenario sle-15-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_rolling_update_node01@64bit fails in suseconnect_scc. It fails while attempting to call script_output, which runs a curl command to 10.0.2.2 to download the script.
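For context, the failing step appears to boil down to a reachability problem: the SUT cannot fetch the generated script from 10.0.2.2. A minimal, purely illustrative probe for that address could look like the sketch below; the port is a placeholder and this is not part of the actual test code:

# Hypothetical reachability probe for the address the failing curl targets.
# Port 80 is a placeholder, not the port os-autoinst actually serves on.
import socket

def can_reach(host="10.0.2.2", port=80, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(can_reach())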
Test suite description
Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.
rolling_update tests take a working cluster and perform a migration on each node while the node is in maintenance.
Reproducible
Fails sporadically since (at least) Build :32868:expat
The majority of the failures have been seen on worker40.
Expected result
Last good: :32996:sed (or more recent)
Further details
Always latest result in this scenario: latest
Other failures in different tests, but always in Multi-Machine scenarios and mostly on worker40:
- Related to action #151382: [qe-sap] test fails in iscsi_client with unclear error message, please add "ping_size_check" from https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17817 to rule out MTU problems added
- Category set to Regressions/Crashes
- Target version set to Ready
We also got some failures outside of worker40:
- Status changed from New to In Progress
- Assignee set to okurz
- Subject changed from Multi-Machine Job fails in suseconnect_scc to Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines
- https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/748 should fix that because I actually made a mistake when setting up those new worker instances.
- Mitigation of impact, e.g. retrigger and monitor impacted tests -> called
host=openqa.suse.de failed_since=2024-03-18 WORKER=worker40 result="result='failed'" openqa-advanced-retrigger-jobs
but acarvajal also took care of jobs
- Make region,datacenter,location consistent -> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/749
- How can we prevent such a situation in the future in our salt states? -> #157606 (see the sketch after this list)
- Should we have an actual openQA worker at the location prg2e? -> That's something we can consider for the future. I wouldn't do that for now; the current setup should be good enough.
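As a sketch of the kind of prevention asked for in #157606, one could sanity-check the rendered workerconf before deploying it. The key names below (region, datacenter, location) come from the follow-up MRs and tickets mentioned in this ticket, but their exact nesting inside each workerconf entry is an assumption, so treat this purely as an illustration of failing early on missing keys:

import yaml

# Hypothetical pre-deployment check: warn about workerconf entries that lack
# the location-related keys. The nesting of the keys is assumed, not verified.
EXPECTED_KEYS = ("region", "datacenter", "location")

with open("workerconf_rendered.yaml", "r") as fi:
    workerconf = yaml.safe_load(fi)["workerconf"]

for worker, settings in workerconf.items():
    if not isinstance(settings, dict):
        continue
    missing = [key for key in EXPECTED_KEYS if key not in settings]
    if missing:
        print(f"{worker}: missing {', '.join(missing)}")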
I want to add how I found this:
- Render our workerconf.sls to plain YAML with:
salt-call --local cp.get_template /srv/pillar/openqa/workerconf.sls /tmp/test.output
- Create a short Python script in the "_modules" directory of our salt-states, because that directory contains the code for "gre_peers.compute". I used this script:
import yaml
from gre_peers import compute

# Load the workerconf pillar rendered to plain YAML (see the salt-call step above)
with open("workerconf_rendered.yaml", "r") as fi:
    y = yaml.safe_load(fi)

# Print the GRE peers that gre_peers.compute derives for worker40
print(compute("worker40", y["workerconf"]))
I then placed "import pdb; pdb.set_trace()" at strategic places in gre_peers.py to understand how this happens.
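Purely for illustration, that breakpoint placement amounts to something like the following; only the compute(hostname, workerconf) call shape is taken from the script above, while the parameter names and body are placeholders:

# Hypothetical excerpt of gre_peers.py with a temporary breakpoint added.
def compute(hostname, workerconf):
    import pdb; pdb.set_trace()  # drop into the debugger to inspect workerconf for this host
    ...  # the real peer-computation logic continues here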
- Parent task set to #111929
- Copied to action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration added
- Due date set to 2024-04-04
Setting due date based on mean cycle time of SUSE QE Tools
- Due date deleted (2024-04-04)
- Status changed from In Progress to Resolved
- Related to action #157147: Documentation for OSD worker region, location, datacenter keys in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls size:S added