action #92467
closed
Unit `iscsid.socket` has failed on some OSD workers since today's nightly reboot
Description
Restarting the unit (which is provided by the open-iscsi package) helped on some hosts but not on all, e.g.:
martchus@openqaworker6:~> systemctl status iscsid.socket
● iscsid.socket - Open-iSCSI iscsid Socket
Loaded: loaded (/usr/lib/systemd/system/iscsid.socket; enabled; vendor preset: enabled)
Active: failed (Result: resources) since Mon 2021-05-10 15:33:12 CEST; 58s ago
Docs: man:iscsid(8)
man:iscsiadm(8)
Listen: @ISCSIADM_ABSTRACT_NAMESPACE (Stream)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
martchus@openqaworker6:~> sudo journalctl -fu iscsid.socket
-- Logs begin at Mon 2019-02-18 15:10:35 CET. --
Mai 09 03:32:08 openqaworker6 systemd[1]: Closed Open-iSCSI iscsid Socket.
-- Reboot --
Mai 09 03:37:51 openqaworker6 systemd[1]: Listening on Open-iSCSI iscsid Socket.
Mai 10 03:00:26 openqaworker6 systemd[1]: Closed Open-iSCSI iscsid Socket.
Mai 10 03:00:26 openqaworker6 systemd[1]: Stopping Open-iSCSI iscsid Socket.
Mai 10 03:00:26 openqaworker6 systemd[1]: Listening on Open-iSCSI iscsid Socket.
Mai 10 15:33:12 openqaworker6 systemd[1]: Closed Open-iSCSI iscsid Socket.
Mai 10 15:33:12 openqaworker6 systemd[1]: Stopping Open-iSCSI iscsid Socket.
Mai 10 15:33:12 openqaworker6 systemd[1]: iscsid.socket: Failed to listen on sockets: Address already in use
Mai 10 15:33:12 openqaworker6 systemd[1]: Failed to listen on Open-iSCSI iscsid Socket.
Mai 10 15:33:12 openqaworker6 systemd[1]: iscsid.socket: Unit entered failed state.
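The "Address already in use" error suggests that some process was still holding the abstract socket when systemd tried to listen on it again. A minimal sketch of how one might look for the holder, assuming `ss` from iproute2 is available on the worker; the helper name `iscsi_socket_holders` is made up for illustration and is not part of any package:

```shell
#!/bin/sh
# Hypothetical helper: filter `ss -xlp` output (read from stdin) for the
# abstract socket name shown in the "Listen:" line of the unit status.
iscsi_socket_holders() {
    # -F treats the pattern as a fixed string, not a regular expression
    grep -F '@ISCSIADM_ABSTRACT_NAMESPACE'
}

# Possible usage on an affected worker (assumption: root is needed to see
# the owning process of sockets that belong to other users):
#   sudo ss -xlp | iscsi_socket_holders
```

If a line is printed, its `users:` column should name the process that still owns the socket. No output would mean nothing is listening on it at that moment.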
I've been searching in our salt states repo for iscsi
and apparently this is something we install/configure explicitly.
from @okurz in the chat:
- iscsi is, or at least was, used for YaST installer tests accessing repos over iscsi. It could be that it is not actually used on all workers, or not used anymore at all; to find out, one should check the history of jobs on OSD
- https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=64bit&test=iscsi_ibft&version=15-SP3 is likely the scenario to check
Updated by okurz over 3 years ago
- Target version set to Ready
openqa/iscsi.sls says "currently only supports openqaworker2" but I don't know why that should be the case
mkittler wrote:
Restarting the unit (which is provided by the open-iscsi package) helped on some hosts but not on all, e.g.:
Which unit did you try to restart? I see iscsid.service as inactive, but restarting iscsi.service seems to have helped because now iscsid.socket is fine again.
Updated by okurz over 3 years ago
- Status changed from New to Workable
Can't update description, likely due to the status dot in the iscsi unit output.
Acceptance criteria
- AC1: No iscsi units fail after multiple reboots
Acceptance tests
- AT1-1: On OSD machines with iscsi, reboot multiple times and check for non-active services with
test "$(sudo systemctl is-active iscsid.socket iscsid.service | grep -cx active)" -eq 2
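One pitfall when counting `active` lines in `systemctl is-active` output: a plain substring match also counts `inactive`, so a check for two active units could pass with both units inactive. A minimal sketch of the difference, using canned input instead of real units; the helper name `count_active` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: count lines that are exactly "active".
# Reads `systemctl is-active` style output (one state per line) on stdin.
count_active() {
    # -x matches whole lines only, so "inactive" is not counted
    grep -cx active
}

# Substring match: "inactive" contains "active" and is counted too
printf 'active\ninactive\n' | grep -c active    # prints 2
# Whole-line match: only the exact state "active" is counted
printf 'active\ninactive\n' | count_active      # prints 1
```

On a worker the helper could be used as `test "$(sudo systemctl is-active iscsid.socket iscsid.service | count_active)" -eq 2`.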
Updated by nicksinger over 3 years ago
Before we try to fix the service, I'd raise the question of whether we even use iscsid any longer in our testing. I couldn't make out any obvious test, and recent "access logs" of iscsid looked like it is not used by anything.
Updated by okurz over 3 years ago
Well, we could test that by disabling the service parts on a specific worker, scheduling the tests referenced above on that machine, and crosschecking whether the tests still work.
Updated by okurz over 3 years ago
- Status changed from Workable to New
Moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size.
Updated by okurz over 3 years ago
- Status changed from New to Resolved
- Assignee set to okurz
The situation in the past months shows that whatever changed seems to have brought us to a more stable situation.
sudo salt \* cmd.run 'sudo systemctl is-active iscsid.socket'
openqaworker3.suse.de:
active
QA-Power8-4-kvm.qa.suse.de:
active
openqaworker9.suse.de:
active
openqaworker2.suse.de:
active
openqaworker8.suse.de:
active
openqa.suse.de:
inactive
powerqaworker-qam-1.qa.suse.de:
active
storage.qa.suse.de:
active
QA-Power8-5-kvm.qa.suse.de:
active
openqaworker6.suse.de:
active
openqaworker5.suse.de:
active
openqa-monitor.qa.suse.de:
active
openqaworker10.suse.de:
active
backup.qa.suse.de:
active
openqaworker13.suse.de:
active
malbec.arch.suse.de:
active
grenache-1.qa.suse.de:
active
openqaworker-arm-2.suse.de:
active
openqaworker-arm-1.suse.de:
active
openqaworker-arm-3.suse.de:
active
So the socket is active on all hosts except OSD itself (openqa.suse.de), which should be good enough. Confirmed stable over multiple reboots, which have been triggered without further problems since the ticket was last updated.
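For a longer host list, the salt output could be reduced to just the hosts that are not active. A minimal sketch, assuming salt's default output format of a `minion:` line followed by an indented result line; the helper name `not_active_hosts` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print "host: state" for every host whose reported
# state is not exactly "active". Reads salt output on stdin.
not_active_hosts() {
    awk '
        # A line ending in ":" names the minion; remember it without the colon
        /:$/ { host = $1; sub(/:$/, "", host); next }
        # The next non-empty line is the state reported for that minion
        NF && host != "" { if ($1 != "active") print host ": " $1; host = "" }
    '
}

# Example with two entries taken from the output above
printf 'openqaworker6.suse.de:\n    active\nopenqa.suse.de:\n    inactive\n' | not_active_hosts
# prints: openqa.suse.de: inactive
```

A real run might look like `sudo salt \* cmd.run 'systemctl is-active iscsid.socket' | not_active_hosts`. If salt emits color codes in that setup, its `--no-color` option may be needed for the parsing to work.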