action #162320
closed coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry
Description
Observation
A very high ratio of multi-machine tests has been failing since 2024-06-14, possibly related to #157972-10
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label , e.g. by calling
openqa-query-for-job-label poo#162320
Rollback actions
- DONE Enable https://gitlab.suse.de/openqa/scripts-ci/-/pipeline_schedules/167/edit?id=167 again and remove the comment linking to this ticket once the MM queue has shrunk
Updated by okurz 6 months ago
- Copied to action #162323: no alert about multi-machine test failures 2024-06-14 size:S added
Updated by okurz 6 months ago
- Related to action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S added
Updated by okurz 6 months ago
- Status changed from New to In Progress
- Assignee set to okurz
I saw warnings in the systemctl status of salt-minion, e.g. on worker30:
Jun 15 23:14:18 worker30 salt-minion[3796]: [WARNING ] test.ping args: []
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker-arm1" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker-arm2" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker29" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker32" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker36" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker37" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker38" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker39" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker40" found in workerconf.sls but not in salt mine, host currently offline?
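To get a quick overview of which hosts are affected, the warnings above can be condensed into a list of unique host names. This is only a minimal sketch: the function name `missing_from_mine` is made up for illustration, and it assumes the log lines keep the exact wording shown above.

```shell
# Hypothetical helper, not part of salt or the openQA scripts.
# Reads journal-style log lines on stdin and prints each host that
# salt-minion reports as present in workerconf.sls but absent from the
# salt mine, one host per line and without duplicates.
missing_from_mine() {
    sed -n 's/.*remote: "\([^"]*\)" found in workerconf\.sls but not in salt mine.*/\1/p' \
        | LC_ALL=C sort -u
}

# Possible usage on a worker (assumes access to the systemd journal):
#   journalctl -u salt-minion --since "2024-06-15" | missing_from_mine
```

Lines that do not match the warning pattern, such as the `test.ping` entry, are ignored.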
I added worker31 back to salt and applied a salt high state. Then I retriggered the failing scripts-ci jobs as well as openQA jobs like https://openqa.suse.de/tests/14634921 and they seem to work fine. But the warnings are still there.
failed_since="2024-06-14 14:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo162320" openqa-advanced-retrigger-jobs
Updated by openqa_review 6 months ago
- Due date set to 2024-06-30
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 6 months ago
- Related to action #162332: 2024-06-15 osd not accessible size:M added
Updated by okurz 6 months ago
- Related to action #162374: Limit number of OSD PRG2 x86_64 tap multi-machine workers until stabilized added
Updated by okurz 6 months ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to Normal
Done. Many jobs are scheduled. Monitoring https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=now-7d&to=now and https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1
Updated by okurz 6 months ago
- Description updated (diff)
$ openqa-query-for-job-label poo#162320
14657729|2024-06-19 04:45:41|done|failed|cc_ipsec_client||worker40
14674275|2024-06-19 02:02:44|done|failed|cc_ipsec_client||worker40
14672589|2024-06-19 01:12:49|done|failed|cc_ipsec_client||worker40
14654972|2024-06-18 14:38:30|done|failed|cc_ipsec_client||worker40
14663615|2024-06-18 09:49:09|done|failed|cc_ipsec_client||worker40
14660134|2024-06-18 08:14:01|done|failed|cc_ipsec_client||worker40
14654683|2024-06-17 23:52:41|done|failed|cc_ipsec_client||worker40
14651378|2024-06-17 14:36:22|done|failed|cc_ipsec_client||worker-arm1
14640403|2024-06-17 14:16:41|done|failed|cc_ipsec_client||worker-arm2
14641247|2024-06-17 14:12:47|done|failed|sles4sap_nw_node02||petrol
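The output above can be summarized per worker to see whether failures cluster on one host. The following is a rough sketch under the assumption that the worker name is always the last pipe-separated field, as it is in the lines above; `jobs_per_worker` is a hypothetical helper name.

```shell
# Hypothetical helper, not part of os-autoinst/scripts.
# Reads pipe-separated job lines on stdin and prints "<count> <worker>",
# highest count first. Assumes the worker is the last field of each line.
jobs_per_worker() {
    awk -F'|' 'NF > 1 { count[$NF]++ }
               END { for (w in count) print count[w], w }' \
        | LC_ALL=C sort -k1,1nr -k2,2
}

# Possible usage:
#   openqa-query-for-job-label poo#162320 | jobs_per_worker
```

In the listing above, worker40 accounts for seven of the ten jobs.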
I found https://openqa.suse.de/tests/14657729#step/setup_multimachine/116 failing because the ping command is actually missing. Let's make the fail_messages more specific:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/19557
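Why a generic fail message is a problem can be illustrated with the auto_review expression from this ticket's subject: any failure text containing the generic wording matches, whatever the real cause was. A minimal sketch, where the function name `matches_auto_review` is hypothetical and the sample messages in the usage comments are invented for illustration and not copied from a real job:

```shell
# Regex taken from the subject of this ticket.
auto_review_regex='ping with packet size 100 failed.*can be GRE tunnel setup issue'

# Hypothetical helper: succeeds if the text on stdin matches the regex.
matches_auto_review() {
    grep -Eq "$auto_review_regex"
}

# Possible usage with invented sample messages:
#   echo 'ping with packet size 100 failed, it can be GRE tunnel setup issue' \
#       | matches_auto_review && echo "would be labeled with this ticket"
#   echo 'ping: command not found' \
#       | matches_auto_review || echo "would not be labeled"
```

A more specific fail message for the missing-command case would no longer match this expression, so such jobs would not be labeled and retried under this ticket.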
Updated by okurz 6 months ago
- Related to action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration added