action #133907
open
Improve monitoring for http(s?) reachable on jenkins.qa.suse.de size:M
0%
Description
Motivation
There are a few issues with Jenkins:
- We seem to have been missing builds for at least a day at the time of this writing. See https://openqa.opensuse.org/group_overview/24 (but it may be outdated once you see it, it's not a permalink).
- DONE http://jenkins.qa.suse.de/view/openQA-in-openQA/ is refusing the connection. okurz: Fixed the wiki reference and job group description in https://openqa.opensuse.org/admin/job_templates/24
- It's unclear if jenkins.qa.suse.de is responsive to pings
It's unclear what's going on. We didn't get any alerts and we don't know if we have proper monitoring for the service.
From the journal for jenkins.service on the system:
Aug 06 03:25:41 jenkins jenkins[26704]: 2023-08-06 01:25:41.061+0000 [id=110] INFO org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.505+0000 [id=71] INFO org.pircbotx.InputParser#handleLine: PING :irc.suse.de
Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.508+0000 [id=122] INFO org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de
-- Boot d29ffd414ee14afd9e930a7cddfc124b --
Aug 07 13:04:50 jenkins systemd[1]: Starting Jenkins Continuous Integration Server...
Aug 07 13:05:09 jenkins jenkins[1218]: Running from: /usr/share/java/jenkins.war
Acceptance criteria
- AC1: There's an alert for the Jenkins web interface (HTTP response, not just ping)
Suggestions
- Find out why we didn't get an alert about a failed systemd service
- Maybe add a check for systemd is-running? (Likely not very useful.)
- Add a connectivity check via telegraf and configure an alert via Grafana if there's no simpler solution
- At least add a local, unversioned telegraf extension that checks port 80, e.g. in /etc/telegraf/
- Possibly add a new role in our Salt states (we don't want this kind of check for all generic hosts)
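A minimal sketch of the connectivity check suggested above, assuming Python 3 is available on the Jenkins host and that telegraf runs the script through its exec input with the influx data format. The measurement name jenkins_http, the field names, and the script itself are hypothetical and not part of salt-states-openqa. Telegraf's built-in http_response input may make a custom script unnecessary. The point is to report the HTTP status rather than ping reachability, as AC1 asks.

```python
#!/usr/bin/env python3
"""Hypothetical HTTP probe that prints one InfluxDB line-protocol point."""
import http.client
import sys
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Return (status_code, seconds); status_code is 0 if no HTTP response arrived."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status (4xx/5xx).
        status = error.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, timeout or broken response.
        status = 0
    return status, time.monotonic() - start


def escape_tag(value):
    """Escape the characters that are special in line-protocol tag values."""
    for char in (",", "=", " "):
        value = value.replace(char, "\\" + char)
    return value


def to_line_protocol(url, status, seconds):
    """Format a probe result; 'jenkins_http' is an assumed measurement name."""
    ok = 1 if 200 <= status < 400 else 0
    return (
        f"jenkins_http,url={escape_tag(url)} "
        f"status_code={status}i,ok={ok}i,response_time={seconds:.3f}"
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: probe.py http://jenkins.qa.suse.de/
    code, elapsed = probe(sys.argv[1])
    print(to_line_protocol(sys.argv[1], code, elapsed))
```

A Grafana alert could then fire when the ok field is 0 or when no data arrives at all, which would also cover the case where the whole host is down.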
Updated by livdywan over 1 year ago
jenkins.service: Job jenkins.service/start failed with result 'dependency'
For the record Ondrej restarted the service and it seems to listen on 8080 instead of 80 now.
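One way to check from another machine which of the two ports accepts TCP connections is sketched below. This is only an illustration using the Python standard library; the function names are hypothetical, and an open port does not yet prove that Jenkins answers HTTP requests on it.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out or name resolution failed.
        return False


def listening_ports(host, ports=(80, 8080), timeout=3.0):
    """Return the subset of ports that accept connections on host."""
    return [port for port in ports if port_is_open(host, port, timeout)]


# Example call (needs network access to the host):
# print(listening_ports("jenkins.qa.suse.de"))
```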
Updated by okurz over 1 year ago
- Tags set to infra, jenkins, qamaster, alert, reactive work
- Description updated (diff)
- Priority changed from Normal to High
- Target version set to Ready
Updated by mkittler over 1 year ago
- Subject changed from Improve monitoring for http(s?) reachable on jenkins.qa.suse.de to Improve monitoring for http(s?) reachable on jenkins.qa.suse.de size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
We already have a role "jenkins" in salt-states-openqa so adding a jenkins specific telegraf config on top is easier.
Updated by osukup over 1 year ago
livdywan wrote:
jenkins.service: Job jenkins.service/start failed with result 'dependency'
For the record Ondrej restarted the service and it seems to listen on 8080 instead of 80 now.
Both the Jenkins and nginx services were down and the swap was missing. The journal also showed a large number of salt-minion failures (mostly related to swap and package resolution/installation).
After a full restart the system resumed work without any problems (though with high response times and latency).
The system got much better after @okurz updated the VM configuration to a better CPU model + 4 GB RAM.
Updated by okurz over 1 year ago
- Priority changed from High to Normal
- Target version changed from Ready to future
Updated by livdywan 8 months ago
- Copied to action #163052: jenkins.qa.suse.de no longer reachable via web browser (but responsive to SSH) added