action #133907

open

Improve monitoring for http(s?) reachable on jenkins.qa.suse.de size:M

Added by tinita 11 months ago. Updated 11 months ago.

Status: Workable
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 2023-08-07
Due date:
% Done: 0%
Estimated time:

Description

Motivation

There are a few issues with Jenkins:

It's unclear what's going on. We didn't get any alerts, and we don't know whether we have proper monitoring for the service.

From the journal for service jenkins.service on the system:

Aug 06 03:25:41 jenkins jenkins[26704]: 2023-08-06 01:25:41.061+0000 [id=110]        INFO        org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.505+0000 [id=71]        INFO        org.pircbotx.InputParser#handleLine: PING :irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.508+0000 [id=122]        INFO        org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
-- Boot d29ffd414ee14afd9e930a7cddfc124b --
Aug 07 13:04:50 jenkins systemd[1]: Starting Jenkins Continuous Integration Server...
Aug 07 13:05:09 jenkins jenkins[1218]: Running from: /usr/share/java/jenkins.war

Acceptance criteria

  • AC1: There's an alert for the Jenkins web interface (HTTP response, not just ping)

Suggestions

  • Find out why we didn't get an alert about a failed systemd service
  • Maybe add a check for systemd is-running? (Likely not very useful.)
  • Add a connectivity check via telegraf and configure an alert via Grafana if there's no simpler solution
    • At least add a local, not-versioned telegraf extension to look at port 80, e.g. in /etc/telegraf/
  • Possibly add a new role in our Salt states (we don't want this kind of check for all generic hosts)
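
A minimal sketch of the connectivity check suggested above, assuming it would be run by telegraf's exec input with the influx data format. It requests the URL and prints one line of InfluxDB line protocol that a Grafana alert could be built on. The measurement name `jenkins_http` and the field names are hypothetical, and the URL (scheme and port) has to be passed in by whoever wires this up. Telegraf's built-in `http_response` input may already cover this without a custom script.

```python
#!/usr/bin/env python3
"""Sketch: probe a URL and print the result as InfluxDB line protocol."""
import http.client
import sys
import time
import urllib.error
import urllib.request


def check_http(url, timeout=10.0):
    """Return (status_code, seconds); status_code is 0 if no HTTP response arrived."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 403 or 503).
        status = err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, timeout, broken reply: no usable response.
        status = 0
    return status, time.monotonic() - start


def escape_tag(value):
    """Escape the characters that are special in line protocol tag values."""
    for char in (",", "=", " "):
        value = value.replace(char, "\\" + char)
    return value


def to_line_protocol(url, status, elapsed):
    """Format one measurement; 'ok' is true only for 2xx/3xx responses."""
    ok = "true" if 200 <= status < 400 else "false"
    return (
        f"jenkins_http,url={escape_tag(url)} "
        f"status_code={status}i,response_time={elapsed:.3f},ok={ok}"
    )


# Only probe when a URL is passed explicitly, e.g. by the telegraf command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    code, seconds = check_http(sys.argv[1])
    print(to_line_protocol(sys.argv[1], code, seconds))
```

An alert on `ok` being false (or on missing data) would fire when the web interface stops answering even though the host still responds to ping, which is what AC1 asks for. Whether a 403 from an unauthenticated request should count as healthy is a choice left to the alert rule, since `status_code` is reported separately.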

Related issues 1 (0 open, 1 closed)

Copied to openQA Infrastructure - action #163052: jenkins.qa.suse.de no longer reachable via web browser (but responsive to SSH) (Resolved, okurz)

Actions #1

Updated by livdywan 11 months ago

  • Description updated (diff)
Actions #2

Updated by livdywan 11 months ago

jenkins.service: Job jenkins.service/start failed with result 'dependency'

For the record Ondrej restarted the service and it seems to listen on 8080 instead of 80 now.

Actions #3

Updated by okurz 11 months ago

  • Tags set to infra, jenkins, qamaster, alert, reactive work
  • Description updated (diff)
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #4

Updated by mkittler 11 months ago

  • Subject changed from Improve monitoring for http(s?) reachable on jenkins.qa.suse.de to Improve monitoring for http(s?) reachable on jenkins.qa.suse.de size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by okurz 11 months ago

We already have a role "jenkins" in salt-states-openqa, so adding a Jenkins-specific telegraf config on top is easier.

Actions #6

Updated by osukup 11 months ago

livdywan wrote:

jenkins.service: Job jenkins.service/start failed with result 'dependency'

For the record Ondrej restarted the service and it seems to listen on 8080 instead of 80 now.

Both the Jenkins and nginx services were down and the swap was missing. The journal also showed a large number of failures in salt-minion (mostly related to swap and package resolution/installation).

After a full restart the system resumed work without any problems (though with high response times and latency).

The system got much better after @okurz updated the VM configuration to a better CPU model and 4GB of RAM.

Actions #7

Updated by okurz 11 months ago

  • Priority changed from High to Normal
  • Target version changed from Ready to future
Actions #8

Updated by livdywan 14 days ago

  • Copied to action #163052: jenkins.qa.suse.de no longer reachable via web browser (but responsive to SSH) added