action #105621

closed

[Alerting] Failed systemd services alert

Added by okurz about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-01-27
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services currently shows

2022-01-27 06:23:00 openqaworker-arm-1  openqa-worker-cacheservice-minion   
2022-01-26 15:19:01 openqa  systemd-journal-flush

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #105618: [Alerting] CPU Load alert size:S (Rejected, livdywan, 2022-01-27)

Actions #1

Updated by okurz about 2 years ago

Actions #2

Updated by mkittler about 2 years ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler about 2 years ago

  • Status changed from New to Feedback

The cache service logs don't show what the problem was, just that the main process exited:

Jan 27 05:17:23 openqaworker-arm-1 openqa-worker-cacheservice-minion[25950]: [25950] [i] [#50] Downloading "SLE-15-SP3-Full-aarch64-GM-Media1.iso" from "http://openqa.suse.de/tests/8039274/asset/iso/SLE-15-SP3-Full-aarch64-GM-Media1.is>
Jan 27 05:17:29 openqaworker-arm-1 openqa-worker-cacheservice-minion[26636]: [26636] [i] [#51] Downloading: "SLES-15-SP3-aarch64-mru-install-minimal-with-addons-Build20220127-1-Server-DVD-Updates-aarch64-virtio-uefi-vars.qcow2"
Jan 27 05:17:29 openqaworker-arm-1 openqa-worker-cacheservice-minion[26636]: [26636] [i] [#51] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
Jan 27 05:17:29 openqaworker-arm-1 openqa-worker-cacheservice-minion[26636]: [26636] [i] [#51] Downloading "SLES-15-SP3-aarch64-mru-install-minimal-with-addons-Build20220127-1-Server-DVD-Updates-aarch64-virtio-uefi-vars.qcow2" from "ht>
Jan 27 05:17:39 openqaworker-arm-1 openqa-worker-cacheservice-minion[26677]: [26677] [i] [#52] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Jan 27 05:17:39 openqaworker-arm-1 openqa-worker-cacheservice-minion[26677]: [26677] [i] [#52] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Jan 27 05:22:55 openqaworker-arm-1 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Jan 27 05:22:59 openqaworker-arm-1 openqa-worker-cacheservice-minion[2378]: [2378] [i] Worker 2378 stopped
Jan 27 05:23:00 openqaworker-arm-1 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=192/n/a
Jan 27 05:23:00 openqaworker-arm-1 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Jan 27 05:23:00 openqaworker-arm-1 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Jan 27 05:23:00 openqaworker-arm-1 systemd[1]: Started OpenQA Worker Cache Service Minion.
Jan 27 05:29:45 openqaworker-arm-1 openqa-worker-cacheservice-minion[28656]: [28656] [i] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
Jan 27 05:29:45 openqaworker-arm-1 openqa-worker-cacheservice-minion[28656]: [28656] [i] Resetting all leftover locks after restart
Jan 27 05:29:45 openqaworker-arm-1 openqa-worker-cacheservice-minion[28656]: [28656] [i] Worker 28656 started
Jan 27 05:29:45 openqaworker-arm-1 openqa-worker-cacheservice-minion[37710]: [37710] [i] [#53] Downloading: "SLES-15-SP3-aarch64-mru-install-minimal-with-addons-Build20220127-1-Server-DVD-Updates-aarch64-virtio.qcow2"
Jan 27 05:29:50 openqaworker-arm-1 openqa-worker-cacheservice-minion[37749]: [37749] [i] [#54] Downloading: "SLE-15-SP3-Full-aarch64-GM-Media1.iso"
Jan 27 05:30:01 openqaworker-arm-1 openqa-worker-cacheservice-minion[37804]: [37804] [i] [#55] Downloading: "SLES-15-SP3-aarch64-mru-install-minimal-with-addons-Build20220127-1-Server-DVD-Updates-aarch64-virtio-uefi-vars.qcow2"
Jan 27 05:30:11 openqaworker-arm-1 openqa-worker-cacheservice-minion[37894]: [37894] [i] [#56] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Jan 27 05:30:11 openqaworker-arm-1 openqa-worker-cacheservice-minion[37894]: [37894] [i] [#56] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Jan 27 05:34:23 openqaworker-arm-1 openqa-worker-cacheservice-minion[44451]: [44451] [i] [#57] Downloading: "SLES-12-SP5-aarch64-mru-install-minimal-with-addons-Build20220127-1-Server-DVD-Updates-aarch64-virtio.qcow2"
Jan 27 05:34:43 openqaworker-arm-1 openqa-worker-cacheservice-minion[44451]: [44451] [i] [#57] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
Jan 27 05:34:43 openqaworker-arm-1 openqa-worker-cacheservice-minion[44451]: [44451] [i] [#57] Downloading "SLES-12-SP5-aarch64-mru-install-minimal-with-addons-Build20220127-1-Server-DVD-Updates-aarch64-virtio.qcow2" from "http://openq>
Jan 27 05:34:48 openqaworker-arm-1 openqa-worker-cacheservice-minion[44538]: [44538] [i] [#58] Downloading: "SLE-12-SP5-Server-DVD-aarch64-GM-DVD1.iso"
Jan 27 05:35:45 openqaworker-arm-1 openqa-worker-cacheservice-minion[44538]: [44538] [i] [#58] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB
Jan 27 05:35:45 openqaworker-arm-1 openqa-worker-cacheservice-minion[44538]: [44538] [i] [#58] Downloading "SLE-12-SP5-Server-DVD-aarch64-GM-DVD1.iso" from "http://openqa.suse.de/tests/8039492/asset/iso/SLE-12-SP5-Server-DVD-aarch64-GM>
Jan 27 05:35:49 openqaworker-arm-1 openqa-worker-cacheservice-minion[44821]: [44821] [i] [#59] Downloading: "SLES-12-SP5-aarch64-mru-install-minimal-with-addons-Build20220127-1-Server-DVD-Updates-aarch64-virtio-uefi-vars.qcow2"
Jan 27 05:35:50 openqaworker-arm-1 openqa-worker-cacheservice-minion[44821]: [44821] [i] [#59] Cache size of "/var/lib/openqa/cache" is 46 GiB, with limit 50 GiB
Jan 27 05:35:50 openqaworker-arm-1 openqa-worker-cacheservice-minion[44821]: [44821] [i] [#59] Downloading "SLES-12-SP5-aarch64-mru-install-minimal-with-addons-Build20220127-1-Server-DVD-Updates-aarch64-virtio-uefi-vars.qcow2" from "ht>
Jan 27 05:35:55 openqaworker-arm-1 openqa-worker-cacheservice-minion[44832]: [44832] [i] [#60] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Jan 27 05:35:55 openqaworker-arm-1 openqa-worker-cacheservice-minion[44832]: [44832] [i] [#60] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Jan 27 05:47:56 openqaworker-arm-1 openqa-worker-cacheservice-minion[48557]: [48557] [i] [#61] Downloading: "SLE-15-SP4-Online-aarch64-Build88.4-Media1.iso.sha256"
Jan 27 05:47:57 openqaworker-arm-1 openqa-worker-cacheservice-minion[48557]: [48557] [i] [#61] Cache size of "/var/lib/openqa/cache" is 46 GiB, with limit 50 GiB
Jan 27 05:47:57 openqaworker-arm-1 openqa-worker-cacheservice-minion[48557]: [48557] [i] [#61] Downloading "SLE-15-SP4-Online-aarch64-Build88.4-Media1.iso.sha256" from "http://openqa.suse.de/tests/8039541/asset/other/SLE-15-SP4-Online->
Jan 27 05:48:02 openqaworker-arm-1 openqa-worker-cacheservice-minion[48599]: [48599] [i] [#62] Downloading: "SLES-15-SP4-aarch64-Build88.4@aarch64-gnome.qcow2"
Jan 27 05:48:31 openqaworker-arm-1 openqa-worker-cacheservice-minion[48599]: [48599] [i] [#62] Cache size of "/var/lib/openqa/cache" is 46 GiB, with limit 50 GiB
Jan 27 05:48:31 openqaworker-arm-1 openqa-worker-cacheservice-minion[48599]: [48599] [i] [#62] Downloading "SLES-15-SP4-aarch64-Build88.4@aarch64-gnome.qcow2" from "http://openqa.suse.de/tests/8039541/asset/hdd/SLES-15-SP4-aarch64-Buil>
Jan 27 05:48:32 openqaworker-arm-1 openqa-worker-cacheservice-minion[48679]: [48679] [i] [#63] Downloading: "SLE-15-SP4-Online-aarch64-Build88.4-Media1.iso"
Jan 27 05:48:39 openqaworker-arm-1 openqa-worker-cacheservice-minion[48679]: [48679] [i] [#63] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
Jan 27 05:48:39 openqaworker-arm-1 openqa-worker-cacheservice-minion[48679]: [48679] [i] [#63] Downloading "SLE-15-SP4-Online-aarch64-Build88.4-Media1.iso" from "http://openqa.suse.de/tests/8039541/asset/iso/SLE-15-SP4-Online-aarch64-B>
Jan 27 05:48:43 openqaworker-arm-1 openqa-worker-cacheservice-minion[48707]: [48707] [i] [#64] Downloading: "SLES-15-SP4-aarch64-Build88.4@aarch64-gnome-uefi-vars.qcow2"
Jan 27 05:48:53 openqaworker-arm-1 openqa-worker-cacheservice-minion[48759]: [48759] [i] [#65] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Jan 27 05:48:53 openqaworker-arm-1 openqa-worker-cacheservice-minion[48759]: [48759] [i] [#65] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Jan 27 05:52:36 openqaworker-arm-1 openqa-worker-cacheservice-minion[1457]: [1457] [i] [#66] Downloading: "SLE-15-SP4-Online-aarch64-Build43.1-Media1.iso.sha256"
Jan 27 05:52:36 openqaworker-arm-1 openqa-worker-cacheservice-minion[1457]: [1457] [i] [#66] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
Jan 27 05:52:36 openqaworker-arm-1 openqa-worker-cacheservice-minion[1457]: [1457] [i] [#66] Downloading "SLE-15-SP4-Online-aarch64-Build43.1-Media1.iso.sha256" from "http://openqa.suse.de/tests/8036768/asset/other/SLE-15-SP4-Online-aa>
Jan 27 05:52:43 openqaworker-arm-1 openqa-worker-cacheservice-minion[1470]: [1470] [i] [#67] Downloading: "SLE-15-SP4-Online-aarch64-Build43.1-Media1.iso.sha256"
Jan 27 05:52:46 openqaworker-arm-1 openqa-worker-cacheservice-minion[1481]: [1481] [i] [#68] Downloading: "sle-15-SP4-aarch64-43.1-textmode@aarch64.qcow2"
Jan 27 05:52:49 openqaworker-arm-1 openqa-worker-cacheservice-minion[1484]: [1484] [i] [#69] Downloading: "sle-15-SP4-aarch64-43.1-textmode@aarch64.qcow2"
-- Reboot --
Jan 27 07:23:36 openqaworker-arm-1 systemd[1]: Started OpenQA Worker Cache Service Minion.
Jan 27 07:23:46 openqaworker-arm-1 openqa-worker-cacheservice-minion[2323]: [2323] [i] Creating cache directory tree for "/var/lib/openqa/cache"
Jan 27 07:23:46 openqaworker-arm-1 systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Jan 27 07:23:46 openqaworker-arm-1 systemd[1]: openqa-worker-cacheservice-minion.service: Succeeded.
Jan 27 07:23:46 openqaworker-arm-1 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Jan 27 07:23:46 openqaworker-arm-1 systemd[1]: Started OpenQA Worker Cache Service Minion.
Jan 27 07:26:17 openqaworker-arm-1 openqa-worker-cacheservice-minion[2500]: [2500] [i] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50 GiB
Jan 27 07:26:17 openqaworker-arm-1 openqa-worker-cacheservice-minion[2500]: [2500] [i] Resetting all leftover locks after restart
Jan 27 07:26:17 openqaworker-arm-1 openqa-worker-cacheservice-minion[2500]: [2500] [i] Worker 2500 started

There are no coredumps present. The machine hadn't been up for long before the failure and rebooted shortly afterwards (likely after a crash), so there is little relevant log output before and after the failure. I couldn't find any clues in it.
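
For longer excerpts, the systemd "Main process exited" lines can be picked out with a small filter. This is only a sketch: it assumes the journal output has been saved as plain text in the short format shown above, and the function name `find_main_process_exits` and the regular expression are made up for illustration, not part of openQA or systemd.

```python
import re

# Matches systemd's "Main process exited" message in the short journal
# format used in the excerpt above (assumption: no other output format).
EXIT_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) systemd\[1\]: "
    r"(?P<unit>\S+): Main process exited, "
    r"code=(?P<code>\w+), status=(?P<status>\S+)$"
)


def find_main_process_exits(journal_text):
    """Return one dict per 'Main process exited' line in journal_text."""
    exits = []
    for line in journal_text.splitlines():
        match = EXIT_RE.match(line.strip())
        if match:
            exits.append(match.groupdict())
    return exits


# Three lines copied from the excerpt above.
SAMPLE = """\
Jan 27 05:22:59 openqaworker-arm-1 openqa-worker-cacheservice-minion[2378]: [2378] [i] Worker 2378 stopped
Jan 27 05:23:00 openqaworker-arm-1 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=192/n/a
Jan 27 05:23:00 openqaworker-arm-1 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
"""

if __name__ == "__main__":
    for entry in find_main_process_exits(SAMPLE):
        print(entry["ts"], entry["unit"], entry["code"], entry["status"])
```

This only surfaces the exit code and status that systemd recorded; it does not explain why the process exited.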


The logs for the journal service on the web UI host aren't much more helpful:

martchus@openqa:~> sudo journalctl -fu systemd-journal-flush
-- Logs begin at Sun 2022-01-23 02:46:40 CET. --
Jan 24 12:44:22 openqa systemd[1]: Stopped Flush Journal to Persistent Storage.
Jan 24 12:44:23 openqa systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 24 12:45:05 openqa systemd[1]: Finished Flush Journal to Persistent Storage.
Jan 24 13:08:06 openqa systemd[1]: Stopping Flush Journal to Persistent Storage...
Jan 24 13:08:06 openqa systemd[1]: systemd-journal-flush.service: Succeeded.
Jan 24 13:08:06 openqa systemd[1]: Stopped Flush Journal to Persistent Storage.
Jan 24 13:08:06 openqa systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 24 13:08:07 openqa systemd[1]: Finished Flush Journal to Persistent Storage.
Jan 26 15:19:02 openqa systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 26 15:19:02 openqa systemd[1]: Finished Flush Journal to Persistent Storage.

The journal on OSD is generally working, so whatever the problem was, it hasn't had much impact.
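
One way to double-check an excerpt like the one above is to look for a "Starting ..." message that is never followed by a "Finished ..." message. The sketch below assumes the excerpt covers a single unit, is saved as plain text in the short journal format, and that a successful run logs "Finished ..." as it does in the lines above; the helper name `unfinished_starts` is hypothetical.

```python
import re

# Extracts the message part of systemd[1] lines in the short journal format.
LINE_RE = re.compile(
    r"^\w{3} +\d+ \d{2}:\d{2}:\d{2} \S+ systemd\[1\]: (?P<msg>.*)$"
)


def unfinished_starts(journal_text):
    """Return 'Starting ...' messages not followed by a 'Finished ...' message.

    Assumes journal_text contains the log of a single unit.
    """
    pending = None
    unfinished = []
    for line in journal_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # header lines such as "-- Logs begin at ... --"
        msg = match.group("msg")
        if msg.startswith("Starting "):
            if pending is not None:
                # A new start arrived before the previous one finished.
                unfinished.append(pending)
            pending = msg
        elif msg.startswith("Finished "):
            pending = None
    if pending is not None:
        unfinished.append(pending)
    return unfinished


# The last two lines of the excerpt above.
SAMPLE = """\
Jan 26 15:19:02 openqa systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 26 15:19:02 openqa systemd[1]: Finished Flush Journal to Persistent Storage.
"""

if __name__ == "__main__":
    print(unfinished_starts(SAMPLE))
```

For the excerpt above every start is followed by a finish, which matches the observation that the logs give no hint about the failure.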

Actions #4

Updated by mkittler about 2 years ago

  • Status changed from Feedback to Resolved

I'm resolving this due to lack of information and the low impact. If the failures (these are actually two distinct issues) happen more often, we can still investigate further.

Note that the journal service failure might be related to our recent manual tampering with the journal service when reacting to the file system alert.
