action #62048

monitor incompletes

Added by okurz about 1 month ago. Updated about 1 month ago.

Status: Resolved
Start date: 13/01/2020
Priority: Normal
Due date: 22/01/2020
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Done
Duration: 8

Description

Last Wednesday after deployment we had, as multiple times in the past, many incompletes, and we did not learn about this until our users told us. I think our "time to recovery" was good, but the "time to detection" is something I would like to improve. I am thinking of monitoring and alerting on incompletes, more specifically on the "incompletion rate", which is a little more complex.

We discussed having a mojo command that runs constantly and emits the monitoring data we are interested in. It can be deployed by salt, so it stays in sync with our telegraf config. Alternatively, it can be run in a polling fashion by exec. It remains to be seen how expensive the startup and DB connect are if done every 5s, but it possibly does not need to run every 5s to be useful.
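To make the "incompletion rate" idea concrete, here is a minimal sketch in Python (not the mojo command discussed above). It assumes the job results of some time window are already available as a list of strings; in practice they would come from a database query. The measurement name `openqa_jobs` and the field names are hypothetical. The output uses InfluxDB line protocol, which telegraf's exec input can parse when configured with `data_format = "influx"`.

```python
from collections import Counter


def incompletion_rate(results):
    """Return the fraction of jobs in `results` whose result is 'incomplete'."""
    counts = Counter(results)
    total = sum(counts.values())
    if total == 0:
        # No finished jobs in the window: report 0 instead of dividing by zero.
        return 0.0
    return counts["incomplete"] / total


def to_line_protocol(measurement, fields):
    """Format fields as one InfluxDB line protocol line (no tags, no timestamp)."""
    parts = []
    for key, value in sorted(fields.items()):
        if isinstance(value, int):
            # Integer fields carry an 'i' suffix in line protocol.
            parts.append(f"{key}={value}i")
        else:
            parts.append(f"{key}={float(value)}")
    return f"{measurement} " + ",".join(parts)


if __name__ == "__main__":
    # Hypothetical sample; real data would be queried from the openQA database.
    sample = ["passed", "incomplete", "failed", "incomplete"]
    fields = {
        "incomplete": sample.count("incomplete"),
        "rate": incompletion_rate(sample),
    }
    print(to_line_protocol("openqa_jobs", fields))
```

A script like this would be started on every poll, so the concern about startup and DB connect cost applies to it as well; a longer polling interval is the simplest mitigation.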

History

#1 Updated by okurz about 1 month ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Current Sprint

#2 Updated by okurz about 1 month ago

merged and active.

I added two panels to https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test now, "Incomplete jobs of last 24h" and "New incompletes" with an alert.

Maybe we need to save the complete dashboard in the salt repo as well.

#3 Updated by okurz about 1 month ago

  • Status changed from In Progress to Feedback

Yesterday there were many incompletes due to #62237, and initially our monitoring did not alert, apparently because I applied too much smoothing in the alert condition. I changed this, and subsequently the alerts for the number of incompletes as well as for the rate triggered successfully. With https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/256 these changes are prepared to be made persistent, and we also add dashboards that were so far not stored in git.
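The effect of too much smoothing can be illustrated with a small sketch. It assumes the alert condition compares a moving average of "new incompletes per interval" against a threshold; the sample values, window sizes and threshold are made up for illustration and are not the actual dashboard settings.

```python
def moving_average(values, window):
    """Trailing moving average; early points use as many samples as exist."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


def alert_fires(values, window, threshold):
    """Return True if any smoothed value exceeds the threshold."""
    return any(avg > threshold for avg in moving_average(values, window))


# Hypothetical series: a short burst of incompletes in two intervals.
samples = [0, 0, 0, 0, 0, 0, 0, 0, 30, 30, 0, 0]
THRESHOLD = 10

# A wide window dilutes the burst below the threshold, so no alert fires.
heavy_smoothing = alert_fires(samples, window=12, threshold=THRESHOLD)
# A narrow window keeps the burst visible, so the alert fires.
light_smoothing = alert_fires(samples, window=2, threshold=THRESHOLD)
```

With the wide window the burst of 60 incompletes is averaged over up to 12 intervals and never exceeds the threshold, while the narrow window reacts immediately. The trade-off is that less smoothing also makes the alert more sensitive to isolated incompletes.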

#4 Updated by okurz about 1 month ago

  • Due date set to 22/01/2020

#5 Updated by okurz about 1 month ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done

merged. The dashboard including the alerts for incompletes now lives in https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1
