action #133907: Improve monitoring for http(s?) reachable on jenkins.qa.suse.de size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

open

Added by tinita almost 2 years ago. Updated almost 2 years ago.

Status: Workable
Priority: Normal
Assignee:
Category:
Target version: QA (public) - future
Start date: 2023-08-07
Due date:
% Done:
Estimated time:
Tags: alert, infra, reactive work, jenkins, qamaster

Description

Motivation

There are a few issues with Jenkins:

- We seem to have been missing builds for at least a day at the time of writing. See https://openqa.opensuse.org/group_overview/24 (not a permalink, so it may be outdated by the time you look).
- DONE ~~http://jenkins.qa.suse.de/view/openQA-in-openQA/ is refusing the connection.~~ okurz: Fixed the wiki reference and job group description in https://openqa.opensuse.org/admin/job_templates/24
- It's unclear whether jenkins.qa.suse.de is responsive to pings.

It's unclear what's going on. We didn't get any alerts, and we don't know whether we have proper monitoring for the service.

From the journal for service jenkins.service on the system:

Aug 06 03:25:41 jenkins jenkins[26704]: 2023-08-06 01:25:41.061+0000 [id=110]        INFO        org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.505+0000 [id=71]        INFO        org.pircbotx.InputParser#handleLine: PING :irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.508+0000 [id=122]        INFO        org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
-- Boot d29ffd414ee14afd9e930a7cddfc124b --
Aug 07 13:04:50 jenkins systemd[1]: Starting Jenkins Continuous Integration Server...
Aug 07 13:05:09 jenkins jenkins[1218]: Running from: /usr/share/java/jenkins.war

Acceptance criteria

AC1: There's an alert for the Jenkins web interface (HTTP response, not just ping)
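
AC1 asks for a check on the HTTP response rather than on ping. A minimal Python sketch of such a probe, using only the standard library, might look like the following. The names `ProbeResult`, `probe_http` and `is_healthy` are illustrative and not part of any existing tooling, and treating any status below 500 as healthy is an assumption (a Jenkins instance that requires login may answer 403 while working fine).

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import NamedTuple, Optional


class ProbeResult(NamedTuple):
    reachable: bool        # True if any HTTP response came back
    status: Optional[int]  # HTTP status code, None if there was no response
    seconds: float         # wall-clock duration of the attempt


def probe_http(url: str, timeout: float = 5.0) -> ProbeResult:
    """Send one GET request and report whether the server answered at all."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server did answer, just with a 4xx/5xx status.
        status = err.code
    except (OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, broken response, ...
        return ProbeResult(False, None, time.monotonic() - start)
    return ProbeResult(True, status, time.monotonic() - start)


def is_healthy(result: ProbeResult) -> bool:
    """Assumption: any response below 500 means the web interface is up."""
    return result.reachable and result.status is not None and result.status < 500
```

Calling `is_healthy(probe_http("http://jenkins.qa.suse.de/"))` from a monitoring host would then distinguish "host answers ping but the web interface is down" from a working service.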

Suggestions

- Find out why we didn't get an alert about a failed systemd service
- Maybe add a check for systemd is-running? (Likely not very useful.)
- Add a connectivity check via telegraf and configure an alert via Grafana if there's no simpler solution
  - At least add a local, unversioned telegraf extension to look at port 80, e.g. in /etc/telegraf/
- Possibly add a new role in our Salt states (we don't want this kind of check for all generic hosts)
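
One possible shape for the local telegraf extension suggested above is a small script whose output telegraf reads through its exec input using the influx data format. The sketch below only covers formatting a check result as InfluxDB line protocol. The measurement name `jenkins_http`, the tag and field names, and the helper `to_line_protocol` are made up for illustration, and how such a script would be wired into /etc/telegraf/ is left open.

```python
from typing import Mapping, Union

FieldValue = Union[bool, int, float]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape every character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # line protocol marks integers with an "i" suffix
    return repr(float(value))


def to_line_protocol(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, FieldValue],
) -> str:
    """Render one point as a line-protocol string without a timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += f",{_escape(key, ',= ')}={_escape(tags[key], ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(value)}"
        for key, value in fields.items()
    )
    return f"{head} {body}"


# Hypothetical result of one HTTP check against the Jenkins web interface.
print(to_line_protocol(
    "jenkins_http",
    {"url": "http://jenkins.qa.suse.de/"},
    {"reachable": True, "status": 200, "seconds": 0.25},
))
```

A Grafana alert could then be defined on the `reachable` field, or on the absence of new points, which is what AC1 asks for.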

Related issues 1 (0 open — 1 closed)

Updated by livdywan almost 2 years ago

Description updated (diff)

Updated by livdywan almost 2 years ago

jenkins.service: Job jenkins.service/start failed with result 'dependency'

For the record, Ondrej restarted the service and it now seems to listen on port 8080 instead of 80.
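
Since the service came back on a different port than expected, a plain TCP check can show which ports accept connections before any HTTP-level check is configured. A rough sketch follows; the helper names `port_is_open` and `open_ports` are hypothetical, the port pair 80 and 8080 comes from the observation above, and everything else is an assumption.

```python
import socket
from typing import Iterable, List


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unresolvable host, unreachable network, ...
        return False


def open_ports(
    host: str, ports: Iterable[int] = (80, 8080), timeout: float = 3.0
) -> List[int]:
    """Return the subset of `ports` that accept TCP connections on `host`."""
    return [port for port in ports if port_is_open(host, port, timeout)]
```

For example, `open_ports("jenkins.qa.suse.de")` returning `[8080]` would match the situation described here, where Jenkins itself listens but nothing answers on port 80.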

Updated by okurz almost 2 years ago

- Tags set to infra, jenkins, qamaster, alert, reactive work
- Description updated (diff)
- Priority changed from Normal to High
- Target version set to Ready

Updated by mkittler almost 2 years ago

- Subject changed from Improve monitoring for http(s?) reachable on jenkins.qa.suse.de to Improve monitoring for http(s?) reachable on jenkins.qa.suse.de size:M
- Description updated (diff)
- Status changed from New to Workable

Updated by okurz almost 2 years ago

We already have a "jenkins" role in salt-states-openqa, so adding a Jenkins-specific telegraf config on top of it is easier.

Updated by osukup almost 2 years ago

livdywan wrote:

> jenkins.service: Job jenkins.service/start failed with result 'dependency'
>
> For the record, Ondrej restarted the service and it now seems to listen on port 8080 instead of 80.

Both the Jenkins and nginx services were down and swap was missing. The journal also showed a large number of salt-minion failures, mostly related to swap and to package resolution/installation.

After a full restart, the system resumed work without any problems (albeit with high response times and latency).

The system got much better after @okurz updated the VM configuration (better CPU model + 4 GB RAM).

Updated by okurz almost 2 years ago

- Priority changed from High to Normal
- Target version changed from Ready to future

Updated by livdywan 11 months ago

Copied to action #163052: jenkins.qa.suse.de no longer reachable via web browser (but responsive to SSH) added
