action #162320

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry

Added by okurz 6 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-06-15
Due date:
% Done: 0%
Estimated time:
Description

Observation

A very high ratio of multi-machine tests has been failing since 2024-06-14, possibly related to #157972-10

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label, e.g.

openqa-query-for-job-label poo#162320
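
A minimal sketch of running that query locally, assuming curl is available and that the script honors the same host environment variable as the other os-autoinst/scripts helpers used further down:

# fetch the helper script from os-autoinst/scripts and make it executable
curl -O https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
chmod +x openqa-query-for-job-label
# list jobs on OSD that reference this ticket in a label comment
host=openqa.suse.de ./openqa-query-for-job-label poo#162320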

Rollback actions


Related issues 5 (1 open, 4 closed)

Related to openQA Project (public) - action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S (Resolved, gpathak)
Related to openQA Infrastructure (public) - action #162332: 2024-06-15 osd not accessible size:M (Resolved, okurz)
Related to openQA Infrastructure (public) - action #162374: Limit number of OSD PRG2 x86_64 tap multi-machine workers until stabilized (Resolved, okurz, 2024-06-17)
Related to openQA Infrastructure (public) - action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration (Blocked, okurz, 2024-03-19)
Copied to openQA Project (public) - action #162323: no alert about multi-machine test failures 2024-06-14 size:S (Resolved, mkittler, 2024-06-15)

Actions #1

Updated by okurz 6 months ago

  • Copied to action #162323: no alert about multi-machine test failures 2024-06-14 size:S added
Actions #2

Updated by okurz 6 months ago

  • Related to action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S added
Actions #3

Updated by okurz 6 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

I saw errors in the salt-minion systemctl status on e.g. worker30 like

Jun 15 23:14:18 worker30 salt-minion[3796]: [WARNING ] test.ping args: []
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker-arm1" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker-arm2" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker29" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:36 worker30 salt-minion[3796]: [WARNING ] remote: "worker32" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker36" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker37" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker38" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker39" found in workerconf.sls but not in salt mine, host currently offline?
Jun 15 23:30:37 worker30 salt-minion[3796]: [WARNING ] remote: "worker40" found in workerconf.sls but not in salt mine, host currently offline?

I added worker31 back to salt, applied a salt high state, then retriggered failing scripts-ci jobs as well as jobs like https://openqa.suse.de/tests/14634921, and they seem to work fine. But the warnings are still there.
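
For reference, a minimal sketch of those salt steps as run on the salt master; the exact minion ID for worker31 is an assumption and may need to be the full hostname:

# re-accept the worker's key in case it had been removed from salt
salt-key --accept 'worker31*'
# apply the full salt high state to the re-added worker
salt 'worker31*' state.highstate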

failed_since="2024-06-14 14:00Z" result="result='parallel_failed'" host=openqa.suse.de comment="label:poo162320" openqa-advanced-retrigger-jobs
Actions #4

Updated by openqa_review 6 months ago

  • Due date set to 2024-06-30

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz 6 months ago

  • Related to action #162332: 2024-06-15 osd not accessible size:M added
Actions #6

Updated by okurz 6 months ago

  • Status changed from In Progress to Blocked
Actions #7

Updated by okurz 6 months ago

  • Parent task set to #111929
Actions #8

Updated by okurz 6 months ago

  • Status changed from Blocked to In Progress
Actions #9

Updated by okurz 6 months ago

After OSD is back in operation (see #162332) I now called the retriggering again as some jobs might still need handling.

Actions #10

Updated by okurz 6 months ago

  • Related to action #162374: Limit number of OSD PRG2 x86_64 tap multi-machine workers until stabilized added
Actions #11

Updated by okurz 6 months ago

  • Subject changed from multi-machine test failures 2024-06-14+ to multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry
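
For illustration, a quick way to check whether a given failure message would be picked up by that auto_review regex; the echoed message text is illustrative, not copied from a specific job:

# sanity-check the auto_review regex against an example failure message
echo "ping with packet size 100 failed, can be GRE tunnel setup issue" \
  | grep -qE "ping with packet size 100 failed.*can be GRE tunnel setup issue" \
  && echo "match: such a job would be auto-labeled and retried"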
Actions #12

Updated by okurz 6 months ago

Called

export host=openqa.suse.de; failed_since="'2024-06-16'" ./openqa-monitor-investigation-candidates | ./openqa-label-known-issues-multi 
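
Roughly, openqa-monitor-investigation-candidates lists recently failed jobs that still need investigation, and openqa-label-known-issues-multi matches each of them against known-issue tickets such as this one; a commented rendering of the call above, assuming a checkout of os-autoinst/scripts:

# list failed OSD jobs since 2024-06-16 that are investigation candidates
# and pipe them into the known-issues labeler, which applies ticket labels
# (and retries, for auto_review tickets marked with :retry)
export host=openqa.suse.de
failed_since="'2024-06-16'" ./openqa-monitor-investigation-candidates \
  | ./openqa-label-known-issues-multi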
Actions #13

Updated by okurz 6 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to Normal
Actions #14

Updated by okurz 6 months ago

  • Description updated (diff)

I disabled https://gitlab.suse.de/openqa/scripts-ci/-/pipeline_schedules/167/ as it's often running into timeouts waiting for jobs to be picked up.

Actions #15

Updated by okurz 6 months ago

  • Description updated (diff)
$ openqa-query-for-job-label poo#162320
14657729|2024-06-19 04:45:41|done|failed|cc_ipsec_client||worker40
14674275|2024-06-19 02:02:44|done|failed|cc_ipsec_client||worker40
14672589|2024-06-19 01:12:49|done|failed|cc_ipsec_client||worker40
14654972|2024-06-18 14:38:30|done|failed|cc_ipsec_client||worker40
14663615|2024-06-18 09:49:09|done|failed|cc_ipsec_client||worker40
14660134|2024-06-18 08:14:01|done|failed|cc_ipsec_client||worker40
14654683|2024-06-17 23:52:41|done|failed|cc_ipsec_client||worker40
14651378|2024-06-17 14:36:22|done|failed|cc_ipsec_client||worker-arm1
14640403|2024-06-17 14:16:41|done|failed|cc_ipsec_client||worker-arm2
14641247|2024-06-17 14:12:47|done|failed|sles4sap_nw_node02||petrol

I found https://openqa.suse.de/tests/14657729#step/setup_multimachine/116 failing because ping itself actually does not exist. Let's make the fail_messages more specific:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/19557

Actions #16

Updated by okurz 6 months ago

  • Related to action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration added
Actions #17

Updated by okurz 6 months ago

  • Description updated (diff)
  • Due date deleted (2024-06-30)
  • Status changed from Feedback to Resolved

Rollback actions conducted. Jobs with the label look okay-ish. There are other related tasks for follow-up, including #162374, which would bring back more workers.
