action #133907: Improve monitoring for http(s?) reachable on jenkins.qa.suse.de size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Improve monitoring for http(s?) reachable on jenkins.qa.suse.de size:M

Added by tinita over 1 year ago. Updated over 1 year ago.

Status: Workable
Priority: Normal
Assignee:
Category:
Target version: QA (public) - future
Start date: 2023-08-07
Due date:
% Done:
Estimated time:
Tags: alert, infra, reactive work, jenkins, qamaster

Description

Motivation¶

There are a few issues with Jenkins:

- We seem to have been missing builds for at least a day at the time of this writing. See https://openqa.opensuse.org/group_overview/24 (not a permalink, so it may show a different state by the time you look).
- DONE ~~http://jenkins.qa.suse.de/view/openQA-in-openQA/ is refusing the connection.~~ okurz: Fixed the wiki reference and job group description in https://openqa.opensuse.org/admin/job_templates/24
- It's unclear whether jenkins.qa.suse.de responds to pings.

It's unclear what's going on. We didn't get any alerts, and we don't know whether we have proper monitoring for the service.

From the journal for service jenkins.service on the system:

Aug 06 03:25:41 jenkins jenkins[26704]: 2023-08-06 01:25:41.061+0000 [id=110]        INFO        org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.505+0000 [id=71]        INFO        org.pircbotx.InputParser#handleLine: PING :irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.508+0000 [id=122]        INFO        org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
-- Boot d29ffd414ee14afd9e930a7cddfc124b --
Aug 07 13:04:50 jenkins systemd[1]: Starting Jenkins Continuous Integration Server...
Aug 07 13:05:09 jenkins jenkins[1218]: Running from: /usr/share/java/jenkins.war

Acceptance criteria¶

AC1: There's an alert for the Jenkins web interface (HTTP response, not just ping)

Suggestions¶

- Find out why we didn't get an alert about a failed systemd service
- Maybe add a check for systemd is-running? (Likely not very useful.)
- Add a connectivity check via telegraf and configure an alert via Grafana if there's no simpler solution
  - At least add a local, unversioned telegraf extension to look at port 80, e.g. in /etc/telegraf/
- Possibly add a new role in our Salt states (we don't want this kind of check for all generic hosts)
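
To illustrate what AC1 asks for (an HTTP response, not just ping), here is a minimal Python sketch of such a probe. It is not the telegraf check itself, only a standalone illustration of what that check would measure. The function name `probe_http`, the 5 second default timeout and the rule "any status below 500 counts as healthy" are assumptions made for this example; Jenkins may answer with 403 when login is required, which still shows that the web interface is up.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe_http(url, timeout=5.0):
    """Report whether `url` answered an HTTP request (hypothetical helper)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        status = err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, timeout or a broken response:
        # there was no usable HTTP answer at all.
        return {
            "reachable": False,
            "status": None,
            "healthy": False,
            "seconds": time.monotonic() - start,
        }
    return {
        "reachable": True,
        "status": status,
        # Assumption: only server errors (5xx) count as unhealthy.
        "healthy": status < 500,
        "seconds": time.monotonic() - start,
    }
```

Calling `probe_http("http://jenkins.qa.suse.de/")` returns a dict whose `reachable` field is false when the connection is refused, which is the case a ping check cannot detect. An alert would fire on `reachable` or `healthy` being false.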

Related issues 1 (0 open — 1 closed)


Updated by livdywan over 1 year ago

Updated by livdywan over 1 year ago

Updated by okurz over 1 year ago

Updated by mkittler over 1 year ago

Updated by okurz over 1 year ago

Updated by osukup over 1 year ago

Updated by okurz over 1 year ago

Updated by livdywan 8 months ago