action #133907

open

Improve monitoring for http(s?) reachable on jenkins.qa.suse.de size:M

Added by tinita 9 months ago. Updated 8 months ago.

Status: Workable
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 2023-08-07
Due date:
% Done: 0%
Estimated time:

Description

Motivation

There are a few issues with Jenkins:

It's unclear what's going on. We didn't get any alerts, and we don't know whether we have proper monitoring for the service.

From the journal for service jenkins.service on the system:

Aug 06 03:25:41 jenkins jenkins[26704]: 2023-08-06 01:25:41.061+0000 [id=110]        INFO        org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.505+0000 [id=71]        INFO        org.pircbotx.InputParser#handleLine: PING :irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.508+0000 [id=122]        INFO        org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
-- Boot d29ffd414ee14afd9e930a7cddfc124b --
Aug 07 13:04:50 jenkins systemd[1]: Starting Jenkins Continuous Integration Server...
Aug 07 13:05:09 jenkins jenkins[1218]: Running from: /usr/share/java/jenkins.war

Acceptance criteria

  • AC1: There's an alert for the Jenkins web interface (HTTP response, not just ping)

Suggestions

  • Find out why we didn't get an alert about a failed systemd service
  • Maybe add a check for systemd is-running? (Likely not very useful.)
  • Add a connectivity check via telegraf and configure an alert via Grafana if there's no simpler solution
    • At least add a local, not-versioned telegraf extension to look at port 80, e.g. in /etc/telegraf/
  • Possibly add a new role in our Salt states (we don't want this kind of check for all generic hosts)
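
For the connectivity check suggested above, a minimal sketch of a local probe might look like the following. It does an HTTP GET and prints one line in Influx line protocol, so it could be run from telegraf's exec input with data_format = "influx". The measurement name jenkins_http, the function names and the URL are assumptions for illustration only, not existing configuration. Telegraf also ships an http_response input plugin, which may make a custom script unnecessary.

```python
#!/usr/bin/env python3
"""Hypothetical HTTP probe for the Jenkins web interface (illustrative sketch)."""
import urllib.error
import urllib.request

# Assumption: plain HTTP on the host named in this ticket; adjust as needed.
DEFAULT_URL = "http://jenkins.qa.suse.de/"


def check_http(url, timeout=10.0):
    """Return (ok, status, detail) for a single HTTP GET against url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, resp.status, "ok"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return False, err.code, str(err.reason)
    except (urllib.error.URLError, OSError) as err:
        # No HTTP answer at all: DNS failure, refused connection, timeout.
        return False, 0, str(err)


def escape_tag(value):
    """Escape characters that are special in Influx line protocol tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def to_influx_line(url, ok, status):
    """Format one check result as a line-protocol point (measurement name assumed)."""
    return f"jenkins_http,url={escape_tag(url)} ok={int(ok)}i,status={status}i"


if __name__ == "__main__":
    ok, status, _detail = check_http(DEFAULT_URL)
    print(to_influx_line(DEFAULT_URL, ok, status))
```

A Grafana alert could then fire when ok is 0 or when no data points arrive at all, which would also cover the case where the whole host is down. Note that this sketch treats any 4xx/5xx answer as a failure, so if Jenkins answers anonymous requests with 403 the probe would need to point at a URL that is reachable without login.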