action #134837
closed QA (public) - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA (public) - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo
SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called size:M
0%
Description
Observation
In https://suse.slack.com/archives/C02CANHLANP/p1693394887125729 Paolo Stivanin made us aware that the test repository checkout for os-autoinst-distri-opensuse on OSD was not up to date. I found that the repository checkout had commits from 2023-08-29 but not the expected ones from 2023-08-30. journalctl -e -u cron showed these as the last entries:
Aug 29 08:05:01 openqa CRON[11204]: (root) CMD (touch /var/lib/openqa/factory/repo/cvd/*)
Aug 29 08:05:02 openqa CRON[11199]: (root) CMDEND (touch /var/lib/openqa/factory/repo/cvd/*)
Aug 29 09:05:01 openqa CRON[19801]: (root) CMD (touch /var/lib/openqa/factory/repo/cvd/*)
Aug 29 10:05:01 openqa CRON[24320]: (root) CMD (touch /var/lib/openqa/factory/repo/cvd/*)
Aug 29 10:05:02 openqa CRON[24317]: (root) CMDEND (touch /var/lib/openqa/factory/repo/cvd/*)
and the service was not running:
# systemctl status cron
. cron.service - Command Scheduler
Loaded: loaded (/usr/lib/systemd/system/cron.service; enabled; vendor preset: enabled)
Active: inactive (dead)
I called fetchneedles manually and started the cron service, and everything was fine again. However, we received neither an alert nor any other notice that the service was enabled but not running. Can we find a way to be alerted about systemd services that are enabled but not running?
Acceptance Criteria
- AC1: We are alerted about enabled services not running (or active?)
Suggestions
- Research how to find enabled systemd services that are not running and how to monitor for that
- We already have a check triggered by telegraf for failed services as well as masked services. Look into that and possibly extend those solutions or be inspired by them
- Ensure we have monitoring panels and alerts
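As a starting point for the research above, here is a minimal sketch of how enabled services that are not running could be found. It is not the existing telegraf check. It assumes the output of `systemctl list-unit-files --type=service --state=enabled --no-legend --plain` and `systemctl list-units --type=service --all --no-legend --plain`. All function names are hypothetical. Treating a unit that is missing from the `list-units` output as not running is also an assumption. Oneshot services that are legitimately inactive would still be reported, so a real check would likely need an allowlist.

```python
import subprocess

# Active states treated as "running"; any other state is reported (assumption).
RUNNING_STATES = {"active", "activating", "reloading"}


def parse_enabled(unit_files_output):
    """Return names of enabled service units, skipping templates like foo@.service."""
    names = set()
    for line in unit_files_output.splitlines():
        fields = line.split()
        # Columns: UNIT FILE, STATE, optionally the vendor preset
        if len(fields) >= 2 and fields[1] == "enabled" and "@." not in fields[0]:
            names.add(fields[0])
    return names


def parse_active_states(units_output):
    """Map unit name to its ACTIVE column from `systemctl list-units` output."""
    states = {}
    for line in units_output.splitlines():
        fields = line.split()
        # Columns: UNIT, LOAD, ACTIVE, SUB, DESCRIPTION...
        if len(fields) >= 4:
            states[fields[0]] = fields[2]
    return states


def enabled_but_not_running(unit_files_output, units_output):
    """Return a sorted list of enabled services whose active state is not a running one."""
    enabled = parse_enabled(unit_files_output)
    states = parse_active_states(units_output)
    # Units absent from the list-units output are treated as inactive (assumption).
    return sorted(
        unit for unit in enabled
        if states.get(unit, "inactive") not in RUNNING_STATES
    )


def run_systemctl(*args):
    """Run systemctl with plain, legend-free output and return its stdout."""
    result = subprocess.run(
        ["systemctl", *args, "--no-legend", "--plain"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def check_host():
    """Query the local systemd instance and return enabled but not running services."""
    unit_files = run_systemctl("list-unit-files", "--type=service", "--state=enabled")
    units = run_systemctl("list-units", "--type=service", "--all")
    return enabled_but_not_running(unit_files, units)
```

Calling `check_host()` on OSD during the incident would presumably have listed cron.service, since it was enabled but inactive (dead). The result could then be fed into monitoring in a similar way to the existing failed and masked services checks.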
Workaround
Restart the service manually
Updated by okurz over 1 year ago
- Copied from action #132146: Support migration of osd VM to PRG2 - 2023-08-29 size:M added
Updated by livdywan over 1 year ago
- Status changed from New to In Progress
- Assignee set to livdywan
Updated by okurz over 1 year ago
- Related to action #134519: We were not notified that backup.qa.suse.de did not create backups size:M added
Updated by livdywan over 1 year ago
- Subject changed from SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called to SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called size:M
- Description updated (diff)