action #162320

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry

Added by okurz 6 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-06-15
Due date:
% Done: 0%
Estimated time:
Description

Observation

A very high ratio of multi-machine tests has been failing since 2024-06-14, possibly related to #157972-10

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label, e.g.

openqa-query-for-job-label poo#162320
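
A minimal sketch of running that query locally, assuming curl is available and that the script honors the same host environment variable as the other os-autoinst/scripts helpers used further down:

# fetch the helper script from os-autoinst/scripts and make it executable
curl -O https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
chmod +x openqa-query-for-job-label
# list jobs on OSD that reference this ticket in a label comment
host=openqa.suse.de ./openqa-query-for-job-label poo#162320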

Rollback actions


Related issues 5 (1 open, 4 closed)

Related to openQA Project (public) - action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S (Resolved, gpathak)
Related to openQA Infrastructure (public) - action #162332: 2024-06-15 osd not accessible size:M (Resolved, okurz)
Related to openQA Infrastructure (public) - action #162374: Limit number of OSD PRG2 x86_64 tap multi-machine workers until stabilized (Resolved, okurz, 2024-06-17)
Related to openQA Infrastructure (public) - action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration (Blocked, okurz, 2024-03-19)
Copied to openQA Project (public) - action #162323: no alert about multi-machine test failures 2024-06-14 size:S (Resolved, mkittler, 2024-06-15)

Actions #1

Updated by okurz 6 months ago

  • Copied to action #162323: no alert about multi-machine test failures 2024-06-14 size:S added
Actions #2

Updated by okurz 6 months ago

  • Related to action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S added
Actions #3

Updated by okurz 6 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

I saw errors in the salt-minion systemctl status on e.g. worker30 like

Jun 15 23:14:18 worker30 salt-minion[3796]: [WARNING ] test.ping args: []
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker-arm1" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker-arm2" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker29" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker32" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker36" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker37" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker38" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker39" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker40" found in workerconf.sls but not in salt mine, host currently offline?

I added worker31 back to salt, applied a salt high state, then retriggered failing scripts-ci jobs as well as jobs like https://openqa.suse.de/tests/14634921, and they seem to work fine. But the warnings are still there.
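
For reference, a minimal sketch of those salt steps as run on the salt master; the exact minion ID for worker31 is an assumption and may need to be the full hostname:

# re-accept the worker's key in case it had been removed from salt
salt-key --accept 'worker31*'
# apply the full salt high state to the re-added worker
salt 'worker31*' state.highstate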

failed_since="2024-06-14 14:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo162320" openqa-advanced-retrigger-jobs
Actions #4

Updated by openqa_review 6 months ago

  • Due date set to 2024-06-30

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz 6 months ago

  • Related to action #162332: 2024-06-15 osd not accessible size:M added
Actions #6

Updated by okurz 6 months ago

  • Status changed from In Progress to Blocked
Actions #7

Updated by okurz 6 months ago

  • Parent task set to #111929
Actions #8

Updated by okurz 6 months ago

  • Status changed from Blocked to In Progress
Actions #9

Updated by okurz 6 months ago

After OSD is back in operation (see #162332) I now called the retriggering again as some jobs might still need handling.

Actions #10

Updated by okurz 6 months ago

  • Related to action #162374: Limit number of OSD PRG2 x86_64 tap multi-machine workers until stabilized added
Actions #11

Updated by okurz 6 months ago

  • Subject changed from multi-machine test failures 2024-06-14+ to multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry
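
For illustration, a quick way to check whether a given failure message would be picked up by that auto_review regex; the echoed message text is illustrative, not copied from a specific job:

# sanity-check the auto_review regex against an example failure message
echo "ping with packet size 100 failed, can be GRE tunnel setup issue" \
  | grep -qE "ping with packet size 100 failed.*can be GRE tunnel setup issue" \
  && echo "match: such a job would be auto-labeled and retried"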
Actions #12

Updated by okurz 6 months ago

Called

export host=openqa.suse.de; failed_since="'2024-06-16'" ./openqa-monitor-investigation-candidates | ./openqa-label-known-issues-multi 
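
Roughly, openqa-monitor-investigation-candidates lists recently failed jobs that still need investigation, and openqa-label-known-issues-multi matches each of them against known-issue tickets such as this one; a commented rendering of the call above, assuming a checkout of os-autoinst/scripts:

# list failed OSD jobs since 2024-06-16 that are investigation candidates
# and pipe them into the known-issues labeler, which applies ticket labels
# (and retries, for auto_review tickets marked with :retry)
export host=openqa.suse.de
failed_since="'2024-06-16'" ./openqa-monitor-investigation-candidates \
  | ./openqa-label-known-issues-multi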
Actions #13

Updated by okurz 6 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to Normal
Actions #14

Updated by okurz 6 months ago

  • Description updated (diff)

I disabled https://gitlab.suse.de/openqa/scripts-ci/-/pipeline_schedules/167/ as it's often running into timeouts waiting for jobs to be picked up.

Actions #15

Updated by okurz 6 months ago

  • Description updated (diff)
$ openqa-query-for-job-label poo#162320
14657729|2024-06-19 04:45:41|done|failed|cc_ipsec_client||worker40
14674275|2024-06-19 02:02:44|done|failed|cc_ipsec_client||worker40
14672589|2024-06-19 01:12:49|done|failed|cc_ipsec_client||worker40
14654972|2024-06-18 14:38:30|done|failed|cc_ipsec_client||worker40
14663615|2024-06-18 09:49:09|done|failed|cc_ipsec_client||worker40
14660134|2024-06-18 08:14:01|done|failed|cc_ipsec_client||worker40
14654683|2024-06-17 23:52:41|done|failed|cc_ipsec_client||worker40
14651378|2024-06-17 14:36:22|done|failed|cc_ipsec_client||worker-arm1
14640403|2024-06-17 14:16:41|done|failed|cc_ipsec_client||worker-arm2
14641247|2024-06-17 14:12:47|done|failed|sles4sap_nw_node02||petrol

I found https://openqa.suse.de/tests/14657729#step/setup_multimachine/116 failing because ping itself actually does not exist. Let's make the fail_messages more specific:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/19557

Actions #16

Updated by okurz 6 months ago

  • Related to action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration added
Actions #17

Updated by okurz 6 months ago

  • Description updated (diff)
  • Due date deleted (2024-06-30)
  • Status changed from Feedback to Resolved

Rollback actions conducted. Jobs with the label look okay-ish. There are other related tasks for follow-up, including #162374, which would bring back more workers.
