action #163852
closed
[alert][FIRING:1] Failed systemd services alert session-c69388.scope / session-c69388.scope on openqa.suse.de
Description
Observation
As suggested in #163825, this is a second ticket covering a different failing service on OSD.
From logs on OSD:
openqa:~ # systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● session-c69388.scope loaded failed failed Session c69388 of User postgres
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
openqa:~ # systemctl status session-c69388.scope
× session-c69388.scope - Session c69388 of User postgres
Loaded: loaded (/run/systemd/transient/session-c69388.scope; transient)
Transient: yes
Active: failed
Jul 12 01:45:40 openqa systemd[1]: session-c69388.scope: Couldn't move process 3018 to requested cgroup '/user.slice/user-26.slice/session-c69388.scope': No such process
Jul 12 01:45:40 openqa systemd[1]: session-c69388.scope: Failed to add PIDs to scope's control group: No such process
Jul 12 01:45:40 openqa systemd[1]: session-c69388.scope: Failed with result 'resources'.
Jul 12 01:45:40 openqa systemd[1]: Failed to start Session c69388 of User postgres.
Jul 12 10:36:10 openqa systemd[1]: Failed to start Session c69388 of User postgres.
Updated by nicksinger 5 months ago
- Copied from action #163825: [alert][FIRING:1] Failed systemd services alert session-c69388.scope / suse-build-key-import.service on backup-qam.qe.nue2.suse.org size:S added
Updated by nicksinger 5 months ago
- Status changed from In Progress to Resolved
I didn't find any useful logs for postgres because "journal has been rotated since unit was started, output may be incomplete.". I assume that some automated task running as this user exited so quickly that systemd could no longer move its process into the scope's cgroup, which caused this failure. I wouldn't adjust anything now. If we see this issue again, we can consider two possible solutions:
a) Adjust our data collection scripts (https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh?ref_type=heads#L35-42) to e.g. ignore failed scope units, provided these failures only show up in the alert and don't cause any harm
b) Research what changed and what is affected by this. Try to improve logging and monitor it for a longer time.
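A minimal sketch of what option a) could look like, assuming the failed units are taken from `systemctl --failed --plain --no-legend` output with the unit name in the first column. The function name `list_failed_without_scopes` is hypothetical and not taken from the linked telegraf script, which may be structured differently:

```shell
#!/bin/sh
# Hypothetical helper (not part of the linked script): read unit listings
# from stdin and print only the names of units that are NOT *.scope units.
# Assumes the unit name is the first whitespace-separated column, as in
# the output of `systemctl --failed --plain --no-legend`.
list_failed_without_scopes() {
    awk '$1 !~ /\.scope$/ { print $1 }'
}

# Intended use on a real host (left as a comment so the sketch has no side effects):
#   systemctl --failed --plain --no-legend | list_failed_without_scopes
```

With such a filter in place, a transient unit like session-c69388.scope would no longer be reported, while failed services would still be listed. The trade-off is that genuine problems that only surface as failed scopes would go unnoticed.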
Updated by nicksinger 5 months ago
Reset the failed state with systemctl reset-failed on OSD.