action #133907

Updated by mkittler 11 months ago

## Motivation 
 There are a few issues with Jenkins: 

 - We seem to have been missing builds for at least a day at the time of this writing. See https://openqa.opensuse.org/group_overview/24 (not a permalink, so it may be outdated by the time you look). 
 - *DONE* ~~http://jenkins.qa.suse.de/view/openQA-in-openQA/ is refusing the connection.~~ okurz: Fixed the wiki reference and job group description in https://openqa.opensuse.org/admin/job_templates/24 
 - It's unclear if jenkins.qa.suse.de is responsive to pings 

 It's unclear what's going on. We didn't get any alerts, and we don't know whether we have proper monitoring for the service. 

 From the journal for service `jenkins.service` on the system: 
    
 ``` 
 Aug 06 03:25:41 jenkins jenkins[26704]: 2023-08-06 01:25:41.061+0000 [id=110]          INFO          org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de 
 Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.505+0000 [id=71]          INFO          org.pircbotx.InputParser#handleLine: PING :irc.suse.de 
 Aug 06 03:27:41 jenkins jenkins[26704]: 2023-08-06 01:27:41.508+0000 [id=122]          INFO          org.pircbotx.output.OutputRaw#rawLine: PONG irc.suse.de 
 -- Boot d29ffd414ee14afd9e930a7cddfc124b -- 
 Aug 07 13:04:50 jenkins systemd[1]: Starting Jenkins Continuous Integration Server... 
 Aug 07 13:05:09 jenkins jenkins[1218]: Running from: /usr/share/java/jenkins.war 
 ``` 

 ## Acceptance criteria 
 * **AC1:** There's an alert for the Jenkins web interface (HTTP response, not just ping) 

 ## Suggestions 
 * Find out why we didn't get an alert about a failed systemd service 
 * Maybe add a check for `systemctl is-system-running`? (Likely not very useful.) 
 * Add a connectivity check via telegraf and configure an alert via Grafana if there's no simpler solution 
     * At least add a local, unversioned telegraf extension that checks port 80, e.g. in /etc/telegraf/ 
 * Possibly add a new role in our Salt states (we don't want this kind of check for all generic hosts)
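
 As a rough illustration of the connectivity check suggested above, here is a minimal Python sketch that probes the web interface over HTTP (not just ping) and prints one record in InfluxDB line protocol. It assumes the script would be run by telegraf's `exec` input with `data_format = "influx"`; the measurement name `jenkins_http`, the field names and the script itself are hypothetical and not part of our setup. Telegraf's built-in `http_response` input may well be the simpler option, so treat this only as a starting point.

 ```python
 #!/usr/bin/env python3
 """Probe a web interface over HTTP and print one line-protocol record."""
 import http.client
 import sys
 import time
 import urllib.error
 import urllib.request


 def check_http(url, timeout=10.0):
     """Return (reachable, status_code, response_seconds) for a GET on url."""
     start = time.monotonic()
     try:
         with urllib.request.urlopen(url, timeout=timeout) as response:
             status = response.status
     except urllib.error.HTTPError as err:
         # The server answered, but with an error status (e.g. 403 or 503).
         status = err.code
     except (urllib.error.URLError, http.client.HTTPException, OSError):
         # Connection refused, DNS failure, timeout: no usable HTTP response.
         return False, 0, time.monotonic() - start
     return True, status, time.monotonic() - start


 def escape_tag(value):
     """Escape the characters that are special in line-protocol tag values."""
     for char in (",", "=", " "):
         value = value.replace(char, "\\" + char)
     return value


 def to_line_protocol(url, reachable, status, seconds):
     """Format a result; 'jenkins_http' is a hypothetical measurement name."""
     ok = 1 if reachable and 200 <= status < 400 else 0
     return (
         f"jenkins_http,url={escape_tag(url)} "
         f"ok={ok}i,status_code={status}i,response_time={seconds:.3f}"
     )


 # Only probe when a URL is passed explicitly, e.g.:
 #   ./check_jenkins_http.py http://jenkins.qa.suse.de/
 if __name__ == "__main__" and len(sys.argv) > 1:
     print(to_line_protocol(sys.argv[1], *check_http(sys.argv[1])))
 ```

 A Grafana alert could then fire when `ok` is 0 or when no data arrives at all. Note that this sketch counts any 4xx/5xx answer as a failure, so the probed URL should be one that is served without login.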
