action #95833

closed

[qe-sap][ha] test fails in ha_cluster_init - iscsid: Kernel reported iSCSI connection 1:0 error

Added by acarvajal almost 3 years ago. Updated over 2 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
Bugs in existing tests
Target version:
-
Start date:
2021-07-22
Due date:
% Done:

100%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP1-Server-DVD-HA-Incidents-x86_64-qam_ha_priority_fencing_node01@64bit fails in
ha_cluster_init

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Reproducible

Fails since (at least) Build MR:246274:spice-vdagent

Expected result

Last good: :19994:ffmpeg (or more recent)

Further details

Always latest result in this scenario: latest

journal attached to the failing test includes:

Jul 21 12:01:29.114843 priorityfencing-node01 sbd[3248]: /dev/disk/by-path/ip-10.0.2.1:3260-iscsi-iqn.2016-02.de.openqa:132-lun-0:    error: servant_md: No slot allocated, and automatic allocation failed for disk /dev/disk/by-path/ip-10.0.2.1:3260-iscsi-iqn.2016-02.de.openqa:132-lun-0.
Jul 21 12:01:29.826675 priorityfencing-node01 sbd[3246]:    error: inquisitor_child: SBD: Not enough votes to proceed. Aborting start-up.

and

Jul 21 12:01:34.439488 priorityfencing-node01 iscsid[2737]: iscsid: Kernel reported iSCSI connection 1:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Jul 21 12:02:22.499997 priorityfencing-node01 iscsid[2737]: iscsid: connection1:0 is operational after recovery (4 attempts)

This looks like a networking issue between the cluster node and the support server job which provides iSCSI.
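
To put a number on the connectivity gap in the iscsid lines above, here is a minimal Python sketch that pairs the "Kernel reported iSCSI connection ... error" message with the following "is operational after recovery" message and reports the elapsed time. The names `outage_windows`, `parse_timestamp`, and `ASSUMED_YEAR` are hypothetical. The regular expressions are derived only from the two message shapes quoted in this ticket, and the year is an assumption taken from the ticket's start date, since the journal timestamps do not carry one. The month abbreviation is parsed with `%b`, which assumes an English/C locale.

```python
import re
from datetime import datetime

# The two iscsid lines quoted in the ticket description.
JOURNAL = """\
Jul 21 12:01:34.439488 priorityfencing-node01 iscsid[2737]: iscsid: Kernel reported iSCSI connection 1:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Jul 21 12:02:22.499997 priorityfencing-node01 iscsid[2737]: iscsid: connection1:0 is operational after recovery (4 attempts)
"""

# Assumption: the journal has no year; 2021 is inferred from the ticket start date.
ASSUMED_YEAR = 2021

# Patterns based only on the message shapes seen in the excerpt above.
ERROR_RE = re.compile(r"Kernel reported iSCSI connection (\d+):(\d+) error")
RECOVERY_RE = re.compile(r"connection(\d+):(\d+) is operational after recovery")


def parse_timestamp(line):
    """Parse the leading 'Mon DD HH:MM:SS.ffffff' fields of a journal line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S.%f")


def outage_windows(lines):
    """Return (connection, seconds) for each error followed by a recovery."""
    pending = {}  # connection id -> timestamp of the first unrecovered error
    windows = []
    for line in lines:
        err = ERROR_RE.search(line)
        rec = RECOVERY_RE.search(line)
        if err:
            # Keep the earliest error timestamp until a recovery is seen.
            pending.setdefault(err.groups(), parse_timestamp(line))
        elif rec and rec.groups() in pending:
            start = pending.pop(rec.groups())
            elapsed = (parse_timestamp(line) - start).total_seconds()
            windows.append((":".join(rec.groups()), elapsed))
    return windows


if __name__ == "__main__":
    for connection, seconds in outage_windows(JOURNAL.splitlines()):
        print(f"connection {connection}: {seconds:.3f} s from error to recovery")
```

For the two lines quoted here, this yields a single window of roughly 48 seconds for connection 1:0. Note that this is the time between the two log messages, not necessarily the full length of the network disruption.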

Actions #1

Updated by maritawerner over 2 years ago

  • Subject changed from test fails in ha_cluster_init - iscsid: Kernel reported iSCSI connection 1:0 error to [ha] test fails in ha_cluster_init - iscsid: Kernel reported iSCSI connection 1:0 error

Actions #2

Updated by okurz over 2 years ago

  • Subject changed from [ha] test fails in ha_cluster_init - iscsid: Kernel reported iSCSI connection 1:0 error to [qe-sap][ha] test fails in ha_cluster_init - iscsid: Kernel reported iSCSI connection 1:0 error

Using keyword "qe-sap" as verified by jmichel in weekly QE sync 2021-09-15

Actions #3

Updated by acarvajal over 2 years ago

  • Status changed from New to Rejected
  • % Done changed from 0 to 100

Interesting that I opened this ticket to report an underlying issue impacting this HA test, i.e., some internal/osd issue was causing network connectivity problems between the cluster nodes and the support server, but not only was the ticket ignored, it was for all intents and purposes reassigned to the reporter.

Anyways, whatever the issue was, it seems to be gone:

https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-HA-Incidents&machine=64bit&test=qam_ha_priority_fencing_node01&version=15-SP1#next_previous

Closing this.

Actions #4

Updated by okurz over 2 years ago

acarvajal wrote:

Interesting that I opened this ticket to report an underlying issue impacting this HA test, i.e., some internal/osd issue was causing network connectivity problems between the cluster nodes and the support server, but not only was the ticket ignored, it was for all intents and purposes reassigned to the reporter.

What do you mean by "reassigned to the reporter"? The assignee field was empty the whole time. maritawerner did the triaging on behalf of the test squad experts, who apparently did not manage to get the ticket triaged into a proper team for more than a month. maritawerner made a judgement call that this would be an "ha" ticket. Honestly, from the ticket description I would likely have made just the same decision. I did batch processing to add the correct team keyword, as decided in the weekly QE sync 2021-09-15. At that time I had not looked further into the ticket than the subject line. You might overestimate our competence to understand the particular issue here :) So does the team "qe-sap" consist of just you? According to https://confluence.suse.com/display/qasle/QE+squads+-+structure there are 10 people, and apparently some days ago you actually even left the team?!?

Actions #5

Updated by acarvajal over 2 years ago

okurz wrote:

What do you mean by "reassigned to the reporter"? The assignee field was empty the whole time. maritawerner did the triaging on behalf of the test squad experts, who apparently did not manage to get the ticket triaged into a proper team for more than a month. maritawerner made a judgement call that this would be an "ha" ticket. Honestly, from the ticket description I would likely have made just the same decision. I did batch processing to add the correct team keyword, as decided in the weekly QE sync 2021-09-15. At that time I had not looked further into the ticket than the subject line. You might overestimate our competence to understand the particular issue here :) So does the team "qe-sap" consist of just you? According to https://confluence.suse.com/display/qasle/QE+squads+-+structure there are 10 people, and apparently some days ago you actually even left the team?!?

  • I am still part of QE-SAP until 15.10.
  • There was an agreement this past Wednesday with jmichel that he will process and assign tickets tagged with qe-sap to someone within the squad. Or did I misunderstand https://progress.opensuse.org/issues/95833#note-2 ?
  • Even if changing squads, I was part of QE-SAP when I created the ticket.
  • For all intents and purposes, this ticket was created by QE-SAP and then assigned to QE-SAP.

I took maritawerner's tagging the ticket with ha as simple categorization. I took your tagging of it with qe-sap as assignment, even if the Assignee field was unchanged, as that seems to be the purpose of https://progress.opensuse.org/issues/95833#note-2 .

okurz also wrote:

Honestly, from the ticket description I would likely have made just the same decision.

If you would also have made the same decision after reading the ticket Description, then I guess we need to modify the Description template, because to me the Further details section of the Description makes it clear that this is not an HA issue but a networking one, as I pointed out there. Even the subject is explicit about this being a network connection issue. I used the template suggested by osd itself; I guess that was my mistake.

Anyway, the issue seems fixed, which is what's important IMO.
