action #87883

osd infrastructure: services like "telegraf" are not enabled to start immediately on boot

Added by okurz 6 months ago. Updated 6 months ago.


Strange: telegraf was now also disabled on openqaworker10, and hence not started on boot. I will check another host.
I think this can explain why hosts took longer than 1h to come back after the weekly reboot: in salt we never ensure that the service is enabled, we only ensure that it is "running". I am not sure when salt is even triggered, and I wonder why the hosts show up at all after 1h.
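The difference between "running" and "enabled" can be checked per host with a small script. The following is only a sketch: the helper names are made up for illustration, and treating only the exact state "enabled" as "starts on boot" is a simplifying assumption. On the salt side, the `service.running` state accepts an `enable` option that covers the same gap; whether the eventual fix used it is not stated in this ticket.

```python
import subprocess


def systemctl_state(verb: str, unit: str) -> str:
    """Return what `systemctl <verb> <unit>` prints, e.g. "active" or "disabled".

    `verb` is expected to be "is-active" or "is-enabled". systemctl exits
    non-zero for inactive/disabled units, so the exit code is ignored here.
    """
    result = subprocess.run(
        ["systemctl", verb, unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def needs_enabling(active_state: str, enabled_state: str) -> bool:
    """True for the situation described above: running now, but not started on boot.

    Assumption: only the exact state "enabled" counts as "starts on boot".
    """
    return active_state == "active" and enabled_state != "enabled"


def check_unit(unit: str) -> bool:
    """Query systemd on the local host and report whether `unit` needs enabling."""
    return needs_enabling(
        systemctl_state("is-active", unit),
        systemctl_state("is-enabled", unit),
    )
```

Calling `check_unit("telegraf")` on a worker would return `True` in the state observed on openqaworker10, where the service had been started but would not have come up by itself after a reboot.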


#1 Updated by okurz 6 months ago

  • Status changed from In Progress to Feedback

#2 Updated by okurz 6 months ago

  • Due date set to 2021-01-26

merged. This could explain the problem that hosts only report online again in grafana after about 1h. I should wait until the next automatic (or manual) reboots of machines and check the "host up" monitoring panels on grafana. Maybe we can reduce the alerting time period again to a sensibly lower value after confirming that hosts are reported as "up" sooner.

#3 Updated by okurz 6 months ago

  • Status changed from Feedback to Resolved

There is an example showing that "host up" now receives no data for only 7 minutes, which is much closer to what we would have expected. I have cross-checked the "host up" history of the past 7 days for all workers and, with the exception of the known problematic machines, all rebooted automatically within around 7 minutes, so reducing the alert period would be possible. However, I think it's ok to keep a bit of grace time in case someone wants to conduct manual actions on a machine that involve a reboot.
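The cross-check against the alert period amounts to finding the longest stretch without "host up" data and comparing it with the alert threshold. A minimal sketch of that comparison, assuming the data points are available as a list of timestamps; the function names and the sample values in the usage note are illustrative and not taken from the actual monitoring data:

```python
from datetime import datetime, timedelta


def longest_gap(timestamps: list[datetime]) -> timedelta:
    """Return the longest interval between two consecutive data points."""
    ordered = sorted(timestamps)
    gaps = (later - earlier for earlier, later in zip(ordered, ordered[1:]))
    # With fewer than two data points there is no gap to measure.
    return max(gaps, default=timedelta(0))


def would_alert(timestamps: list[datetime], alert_period: timedelta) -> bool:
    """True if the longest gap without data reaches the alert period."""
    return longest_gap(timestamps) >= alert_period
```

For a host that reports once per minute and is silent for 7 minutes during a reboot, `longest_gap` returns 7 minutes, so an alert period of, say, 15 minutes would not fire while still leaving some grace time for manual reboots.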
