action #134837: SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called size:M - openQA Project (public) - openSUSE Project Management Tool

action #134837

closed

QA (public) - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA (public) - coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo

SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status:

Resolved

Priority:

High

Assignee:

livdywan

Category:

Regressions/Crashes

Target version:

Ready

Start date:

Due date:

% Done:

Estimated time:

Tags:

osd, monitoring, systemd, service, dct migration, alerting, cron, fetchneedles

Description

Observation¶

In https://suse.slack.com/archives/C02CANHLANP/p1693394887125729 Paolo Stivanin made us aware that the test repository checkout for os-autoinst-distri-opensuse on OSD was not up-to-date. I found that the checkout had commits from 2023-08-29 but not the expected ones from 2023-08-30. journalctl -e -u cron showed these as the last entries:

Aug 29 08:05:01 openqa CRON[11204]: (root) CMD (touch /var/lib/openqa/factory/repo/cvd/*)
Aug 29 08:05:02 openqa CRON[11199]: (root) CMDEND (touch /var/lib/openqa/factory/repo/cvd/*)
Aug 29 09:05:01 openqa CRON[19801]: (root) CMD (touch /var/lib/openqa/factory/repo/cvd/*)
Aug 29 10:05:01 openqa CRON[24320]: (root) CMD (touch /var/lib/openqa/factory/repo/cvd/*)
Aug 29 10:05:02 openqa CRON[24317]: (root) CMDEND (touch /var/lib/openqa/factory/repo/cvd/*)

and the service was inactive:

# systemctl status cron
. cron.service - Command Scheduler
     Loaded: loaded (/usr/lib/systemd/system/cron.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

I called fetchneedles manually and started the cron service and it was fine again. But we did not receive any alert and no other notice that the service was enabled but not running. Can we find a way to be alerted about systemd services enabled but not running?
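The condition described above (a unit that is enabled but not running) can be checked for a single unit roughly as follows. This is only a sketch: the helper name `enabled_but_inactive` is made up for illustration, and it compares the state printed by `systemctl is-enabled` against the literal `enabled` because the exit code alone is also zero for states such as `static`.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the given unit is enabled
# but currently not active, e.g. cron in the situation described above.
enabled_but_inactive() {
    unit=$1
    # Compare the printed state; the exit code of is-enabled is also 0
    # for states such as "static", which we do not want to match here.
    [ "$(systemctl is-enabled "$unit" 2>/dev/null)" = "enabled" ] &&
        ! systemctl is-active --quiet "$unit"
}

# Usage:
#   enabled_but_inactive cron && echo "cron is enabled but not running"
```

On OSD at the time of the incident this would have reported cron, since the unit was `enabled` while `Active:` showed `inactive (dead)`.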

Acceptance Criteria¶

AC1: We are alerted about enabled services not running (or active?)

Suggestions¶

Research about how to find enabled systemd services not running and how to monitor for that
We already have a check triggered by telegraf for failed services as well as masked services. Look into that and possibly extend it or be inspired by those solutions
Ensure to have monitoring panels and alerts
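As a starting point for the research suggested above, the following sketch lists enabled services that are not active and turns the result into a metric line. Everything here is an assumption rather than the existing telegraf check: the function names and the measurement name `systemd_enabled_not_running` are invented, `oneshot` services are skipped because they are legitimately inactive after they have run, and template units are skipped because they cannot be queried without an instance name. Other legitimately inactive cases (for example socket-activated services) would still show up and would need an exclusion list.

```shell
#!/bin/sh
# Read lines of the form "<unit> <type> <active-state>" from stdin and
# print the units that are expected to keep running but are not active.
filter_not_running() {
    while read -r unit svc_type state; do
        if [ -z "$unit" ] || [ "$svc_type" = "oneshot" ]; then
            continue
        fi
        if [ "$state" != "active" ]; then
            printf '%s\n' "$unit"
        fi
    done
}

# Emit "<unit> <type> <active-state>" for every enabled service unit.
enabled_service_states() {
    systemctl list-unit-files --type=service --state=enabled --no-legend |
    while read -r unit rest; do
        case "$unit" in
            *@.service) continue ;;  # template unit without an instance
        esac
        svc_type=$(systemctl show -p Type --value "$unit")
        state=$(systemctl is-active "$unit")
        printf '%s %s %s\n' "$unit" "$svc_type" "$state"
    done
}

# Count unit names on stdin and print one line in InfluxDB line protocol,
# which a telegraf exec input could parse (measurement name is made up).
emit_metric() {
    n=0
    while read -r unit; do
        if [ -n "$unit" ]; then
            n=$((n + 1))
        fi
    done
    printf 'systemd_enabled_not_running count=%di\n' "$n"
}

# Usage:
#   enabled_service_states | filter_not_running | emit_metric
```

Keeping the filtering separate from the `systemctl` calls makes it possible to try the logic on captured output before wiring it into monitoring panels and alerts.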

Workaround¶

Restart the service manually
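A minimal sketch of that workaround, assuming the unit is called `cron` as in the status output above; the helper name `ensure_running` is hypothetical. It starts the unit only if it is not active and then prints the resulting state.

```shell
#!/bin/sh
# Hypothetical helper: start the unit if it is not active, then print
# its resulting state ("active" on success).
ensure_running() {
    unit=$1
    if ! systemctl is-active --quiet "$unit"; then
        systemctl start "$unit"
    fi
    systemctl is-active "$unit"
}

# Usage (as root on the affected host):
#   ensure_running cron
```

Jobs that cron missed while it was stopped, such as fetchneedles, are not rerun by this and still have to be triggered manually.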

Related issues 2 (0 open — 2 closed)

Updated by okurz over 1 year ago

Copied from action #132146: Support migration of osd VM to PRG2 - 2023-08-29 size:M added

Updated by livdywan over 1 year ago

I called fetchneedles manually and started the cron service and it was fine again. But we did not receive any alert and no other notice that the service was enabled but not running. Can we find a way to be alerted about systemd services enabled but not running?

Hrm. I was thinking of #134519 at first where the issue is that we're not seeing alerts for individual cron jobs. If cron was not running that should have triggered a systemd services alert, though?

telegraf-webui.conf mentions cron, however there is no monitoring/grafana/webui.services.json to sync with and no other instance of cron in salt-states-openqa (or pillars)? Was there some change that broke this inadvertently?

Updated by livdywan over 1 year ago

Status changed from New to In Progress
Assignee set to livdywan

I'm taking a look. Maybe we can make this an if-or by adjusting our script that checks the service state

Updated by okurz over 1 year ago

Related to action #134519: We were not notified that backup.qa.suse.de did not create backups size:M added

Updated by okurz over 1 year ago

livdywan wrote in #note-2:

I called fetchneedles manually and started the cron service and it was fine again. But we did not receive any alert and no other notice that the service was enabled but not running. Can we find a way to be alerted about systemd services enabled but not running?

Hrm. I was thinking of #134519 at first where the issue is that we're not seeing alerts for individual cron jobs

Well, it might avoid the issue by using systemd timers

If cron was not running that should have triggered a systemd services alert, though?

That would be good but so far we are only alerting on failed services. Yesterday I briefly researched with mkittler+nicksinger how to find enabled systemd services that should be running but aren't, but so far the research was inconclusive.

telegraf-webui.conf mentions cron, however there is no monitoring/grafana/webui.services.json to sync with and no other instance of cron in salt-states-openqa (or pillars)?

https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/etc/master/cron.d/SLES.CRON

Was there some change that broke this inadvertently?

I suspect it was a mistake by us during OSD migration that triggered this and is unlikely to happen again.

Updated by openqa_review over 1 year ago

Due date set to 2023-09-14

Setting due date based on mean cycle time of SUSE QE Tools

Updated by livdywan over 1 year ago

Subject changed from SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called to SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called size:M
Description updated (diff)

Updated by livdywan over 1 year ago

Due date deleted (~~2023-09-14~~)

It occurred to me that we might've missed the very first step. We've not enabled cron(nie) in salt. We should probably do that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/967

Updated by livdywan over 1 year ago

Status changed from In Progress to Resolved

I also brought this up in jitsi. One idea considered was to regularly run pipelines that apply the salt states, similar to what we're doing with package upgrades, since otherwise we might not apply states (and e.g. enable services) for several days. We probably don't want to interfere with temporary manual changes, though, so for now I'm hesitant to go this route. A bit of online research suggests it's common to simply restart services via cron or vice versa, but we may not want to increase interdependency here.

For now I'd probably say we can consider if we want the above simple MR, but otherwise leave this as-is.
