action #92467

Unit `iscsid.socket` has failed on some OSD workers since today's nightly reboot

Added by mkittler 3 months ago. Updated 16 days ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2021-05-10
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Restarting the unit (which is provided by the open-iscsi package) helped on some hosts but not on all, e.g.:

martchus@openqaworker6:~> systemctl status iscsid.socket
● iscsid.socket - Open-iSCSI iscsid Socket
   Loaded: loaded (/usr/lib/systemd/system/iscsid.socket; enabled; vendor preset: enabled)
   Active: failed (Result: resources) since Mon 2021-05-10 15:33:12 CEST; 58s ago
     Docs: man:iscsid(8)
           man:iscsiadm(8)
   Listen: @ISCSIADM_ABSTRACT_NAMESPACE (Stream)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
martchus@openqaworker6:~> sudo journalctl -fu iscsid.socket
-- Logs begin at Mon 2019-02-18 15:10:35 CET. --
Mai 09 03:32:08 openqaworker6 systemd[1]: Closed Open-iSCSI iscsid Socket.
-- Reboot --
Mai 09 03:37:51 openqaworker6 systemd[1]: Listening on Open-iSCSI iscsid Socket.
Mai 10 03:00:26 openqaworker6 systemd[1]: Closed Open-iSCSI iscsid Socket.
Mai 10 03:00:26 openqaworker6 systemd[1]: Stopping Open-iSCSI iscsid Socket.
Mai 10 03:00:26 openqaworker6 systemd[1]: Listening on Open-iSCSI iscsid Socket.
Mai 10 15:33:12 openqaworker6 systemd[1]: Closed Open-iSCSI iscsid Socket.
Mai 10 15:33:12 openqaworker6 systemd[1]: Stopping Open-iSCSI iscsid Socket.
Mai 10 15:33:12 openqaworker6 systemd[1]: iscsid.socket: Failed to listen on sockets: Address already in use
Mai 10 15:33:12 openqaworker6 systemd[1]: Failed to listen on Open-iSCSI iscsid Socket.
Mai 10 15:33:12 openqaworker6 systemd[1]: iscsid.socket: Unit entered failed state.
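
"Address already in use" on an abstract-namespace socket usually means some process is still bound to that name when systemd tries to listen on it again. A minimal sketch for finding that process, assuming `ss` from iproute2 is installed; the helper name `socket_holders` is hypothetical and only filters `ss -xlp` output read from stdin:

```shell
#!/bin/bash
# Filter `ss -xlp` output (read from stdin) down to the entries for one
# abstract-namespace socket. ss prints abstract sockets with a leading "@".
# Hypothetical helper, not part of open-iscsi or systemd.
socket_holders() {
    local name="$1"
    # -F: treat the name as a fixed string; "|| true" keeps the exit status 0
    # when nothing is bound to the socket.
    grep -F -- "@${name}" || true
}

# Usage on an affected worker (root is needed to see foreign processes):
#   sudo ss -xlp | socket_holders ISCSIADM_ABSTRACT_NAMESPACE
```

If this prints a line naming a process, that process is the one holding the address; an empty result suggests the socket was free again by the time of the check.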

I've been searching in our salt states repo for iscsi and apparently this is something we install/configure explicitly.

from okurz in the chat:

History

#1 Updated by okurz 2 months ago

  • Target version set to Ready

openqa/iscsi.sls says "currently only supports openqaworker2" but I don't know why that should be the case

mkittler wrote:

Restarting the unit (which is provided by the open-iscsi package) helped on some hosts but not on all, e.g.:

Which unit did you try to restart? I see iscsid.service as inactive, but restarting iscsi.service seems to have helped because iscsid.socket is now fine again.

#2 Updated by okurz 2 months ago

  • Status changed from New to Workable

Can't update description, likely due to the status dot in the iscsi unit output.

Acceptance criteria

  • AC1: No iscsi units fail after multiple reboots

Acceptance tests

  • AT1-1: On OSD machines with iscsi, reboot multiple times and check for non-active services with `test $(sudo systemctl is-active iscsid.socket iscsid.service | grep -c active) == 2`
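
One caveat with the check in AT1-1: the pattern `active` is also a substring of `inactive`, so `grep -c active` counts inactive units as well. A minimal sketch of a stricter count, assuming bash; the helper name `count_active` is hypothetical:

```shell
#!/bin/bash
# Count lines that are exactly "active" in `systemctl is-active` output
# read from stdin. `grep -x` matches whole lines only, so "inactive",
# "failed" and "activating" are not counted.
count_active() {
    # grep -c still prints 0 when nothing matches but exits non-zero;
    # "|| true" keeps the helper usable under `set -e`.
    grep -cx 'active' || true
}

# Usage after each reboot:
#   test "$(sudo systemctl is-active iscsid.socket iscsid.service | count_active)" -eq 2
```

Using `-eq` instead of `==` keeps the comparison numeric and portable to a plain POSIX `test`.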

#3 Updated by nicksinger 2 months ago

Before we try to fix the service, I'd raise the question of whether we even use iscsid any longer in our testing. I couldn't find any obvious test, and recent "access logs" for iscsid suggest that it is not used by anything.

#4 Updated by okurz 2 months ago

Well, we could test this by disabling the service parts on a specific worker, scheduling the above-referenced tests on that machine, and cross-checking whether the tests still work.
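
A rough sketch of the first step of that experiment, as a dry run that only prints the salt command instead of executing it. The helper name `iscsi_disable_cmd`, the unit list, and the example minion are assumptions for illustration; if the salt states enable these units, a later state run may well re-enable them.

```shell
#!/bin/bash
# Print (dry run) the salt command that would stop and disable the iSCSI
# units on a single worker. Hypothetical helper; the unit list is taken
# from the units mentioned in this ticket and may be incomplete.
iscsi_disable_cmd() {
    local minion="$1"
    printf "sudo salt '%s' cmd.run 'systemctl disable --now iscsid.socket iscsid.service iscsi.service'\n" "$minion"
}

# Usage: review the printed command, then run it by hand.
#   iscsi_disable_cmd openqaworker6.suse.de
```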

#5 Updated by okurz 19 days ago

  • Status changed from Workable to New

Moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size.

#6 Updated by okurz 16 days ago

  • Status changed from New to Resolved
  • Assignee set to okurz

The situation in the past months shows that whatever changed seems to have brought us to a more stable situation.

sudo salt \* cmd.run 'sudo systemctl is-active iscsid.socket'
openqaworker3.suse.de:
    active
QA-Power8-4-kvm.qa.suse.de:
    active
openqaworker9.suse.de:
    active
openqaworker2.suse.de:
    active
openqaworker8.suse.de:
    active
openqa.suse.de:
    inactive
powerqaworker-qam-1.qa.suse.de:
    active
storage.qa.suse.de:
    active
QA-Power8-5-kvm.qa.suse.de:
    active
openqaworker6.suse.de:
    active
openqaworker5.suse.de:
    active
openqa-monitor.qa.suse.de:
    active
openqaworker10.suse.de:
    active
backup.qa.suse.de:
    active
openqaworker13.suse.de:
    active
malbec.arch.suse.de:
    active
grenache-1.qa.suse.de:
    active
openqaworker-arm-2.suse.de:
    active
openqaworker-arm-1.suse.de:
    active
openqaworker-arm-3.suse.de:
    active

So the socket is active on all hosts except OSD itself (openqa.suse.de), which should be good enough. Confirmed stable over multiple reboots, which have been triggered without further problems since the ticket was last updated.
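
For repeated checks, the salt output can be reduced to just the hosts that need attention. A minimal sketch, assuming salt's default text output as shown in this comment (a `minion:` line followed by an indented result line); the helper name `non_active_minions` is hypothetical:

```shell
#!/bin/bash
# Read `salt ... cmd.run` text output from stdin and print every minion
# whose reported state is not exactly "active", as "minion: state".
non_active_minions() {
    awk '
        # Unindented line ending in ":" names the minion; strip the colon.
        /^[^[:space:]].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # Indented non-empty line carries the state reported for that minion.
        /^[[:space:]]+[^[:space:]]/ { if ($1 != "active") print host ": " $1 }
    '
}

# Usage:
#   sudo salt \* cmd.run 'systemctl is-active iscsid.socket' | non_active_minions
```

For the output in this comment, the only host reported would be openqa.suse.de with state inactive.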
