action #168148
closed
coordination #161414: [epic] Improved salt based infrastructure management
hackweek idea: use loki to monitor our log files and explore alerting possibilities based on these size:S
Added by nicksinger 2 months ago.
Updated 20 days ago.
Category:
Feature requests
Description
Motivation
In #167051 we discovered that our testing of telegraf is not optimal, and @nicksinger mentioned that he wants to look into loki (https://grafana.com/oss/loki/). With it we could alert on unexpected log file entries, e.g. to spot runtime issues with telegraf plugins.
Related issues
1 (1 open — 0 closed)
- Copied from action #168145: implement telegraf health check and adjust according pipelines added
- Target version set to Ready
- Target version changed from Ready to Tools - Next
- Subject changed from hackweek idea: use loki to monitor our log files and explore alerting possibilities based on these to hackweek idea: use loki to monitor our log files and explore alerting possibilities based on these size:S
- Status changed from New to Workable
loki is already installed and running on the monitoring host. I also installed promtail on the monitoring host itself to collect logs. Data is already arriving in loki and can be queried and displayed via our grafana instance. This query shows all journal entries with severity "warning" or "error": https://stats.openqa-monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%2248i%22:%7B%22datasource%22:%22ee4ewos1kcidcf%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D%5C%22systemd-journal%5C%22,%20level%3D~%5C%22error%7Cwarning%5C%22%7D%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22ee4ewos1kcidcf%22%7D,%22editorMode%22:%22builder%22,%22direction%22:%22forward%22,%22legendFormat%22:%22%22%7D%5D,%22range%22:%7B%22from%22:%22now-24h%22,%22to%22:%22now%22%7D%7D%7D&orgId=1
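For reference, the LogQL selector encoded in the link above is `{job="systemd-journal", level=~"error|warning"}`. The following is a minimal sketch of how the same selector could be built and turned into a request URL for loki's `query_range` HTTP endpoint. The base URL `http://localhost:3100` and the helper names `build_selector` and `build_query_range_url` are assumptions for illustration only, not part of our setup.

```python
from urllib.parse import urlencode


def build_selector(job, levels):
    """Build a LogQL stream selector matching one job and several levels."""
    # Levels are combined into a regex alternation, e.g. "error|warning".
    level_regex = "|".join(levels)
    return f'{{job="{job}", level=~"{level_regex}"}}'


def build_query_range_url(base_url, selector, limit=100):
    """Return a URL for loki's /loki/api/v1/query_range endpoint.

    base_url is hypothetical here; replace it with the real loki address.
    """
    params = urlencode(
        {"query": selector, "limit": limit, "direction": "forward"}
    )
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


selector = build_selector("systemd-journal", ["error", "warning"])
# Assumed address: loki listening on its default port on the local host.
url = build_query_range_url("http://localhost:3100", selector)
print(selector)
print(url)
```

The resulting URL could be fetched with curl or any HTTP client, which might be handy for scripted checks outside of grafana.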
I also started adding it to our salt states: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1312
Still missing:
- add loki.monitor.qa.suse.de to /etc/dehydrated/domains.txt (is the dehydrated state even used on that host?)
- promtail config for the monitoring host
- promtail config for all other hosts
- Status changed from In Progress to Feedback
Updated my MRs:
Once these are merged I will stop working on this for now. I set up promtail manually on monitor and arm1. We can decide later whether we find this useful and want to continue by adding the setup to all machines.
- Status changed from Feedback to Resolved