action #168148
coordination #161414 (closed): [epic] Improved salt based infrastructure management
hackweek idea: use loki to monitor our log files and explore alerting possibilities based on these size:S
0%
Description
Motivation
In #167051 we discovered that our testing of telegraf is not optimal, and @nicksinger mentioned that he wants to look into loki (https://grafana.com/oss/loki/). With it we could alert on unexpected log file entries, e.g. to spot runtime issues with telegraf plugins.
Files
Updated by nicksinger 2 months ago
- Copied from action #168145: implement telegraf health check and adjust according pipelines added
Updated by okurz about 2 months ago
- Subject changed from hackweek idea: use loki to monitor our log files and explore alerting possibilities based on these to hackweek idea: use loki to monitor our log files and explore alerting possibilities based on these size:S
- Status changed from New to Workable
Updated by nicksinger 29 days ago
- File clipboard-202411201423-lkrzz.png added
- Status changed from Workable to In Progress
loki is already installed and running on the monitoring host. I also installed promtail for log collection on the monitoring host itself. Data is already arriving in loki and can be queried and displayed via our grafana instance. This query shows all journal entries with severity "warning" or "error": https://stats.openqa-monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%2248i%22:%7B%22datasource%22:%22ee4ewos1kcidcf%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D%5C%22systemd-journal%5C%22,%20level%3D~%5C%22error%7Cwarning%5C%22%7D%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22ee4ewos1kcidcf%22%7D,%22editorMode%22:%22builder%22,%22direction%22:%22forward%22,%22legendFormat%22:%22%22%7D%5D,%22range%22:%7B%22from%22:%22now-24h%22,%22to%22:%22now%22%7D%7D%7D&orgId=1
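For reference, the linked Explore query decodes to the LogQL selector `{job="systemd-journal", level=~"error|warning"}`. The same query could probably also be run outside of grafana against loki's HTTP API. The following is only a rough sketch under assumptions: the base URL `http://localhost:3100` is a placeholder (3100 is loki's default port; the address of the actual instance is not stated here), the helper names are made up for illustration, and it presumes the `job` and `level` labels exist as in the query above.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

# LogQL selector decoded from the grafana Explore link above.
LOGQL = '{job="systemd-journal", level=~"error|warning"}'

# Assumption: placeholder address, not the real monitoring host.
LOKI_BASE_URL = "http://localhost:3100"


def build_query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a URL for loki's /loki/api/v1/query_range endpoint."""
    params = {
        "query": logql,
        "start": str(start_ns),  # Unix epoch in nanoseconds
        "end": str(end_ns),
        "limit": str(limit),
        "direction": "forward",  # oldest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


def iter_log_lines(response):
    """Yield (labels, timestamp_ns, line) from a parsed 'streams' response."""
    for stream in response["data"]["result"]:
        for timestamp_ns, line in stream["values"]:
            yield stream["stream"], int(timestamp_ns), line


def fetch_last_24h(base_url=LOKI_BASE_URL, logql=LOGQL):
    """Query the last 24 hours; needs network access to the loki instance."""
    end_ns = time.time_ns()
    start_ns = end_ns - 24 * 3600 * 10**9
    url = build_query_range_url(base_url, logql, start_ns, end_ns)
    with urllib.request.urlopen(url, timeout=10) as reply:
        return list(iter_log_lines(json.load(reply)))
```

Calling `fetch_last_24h()` with the real base URL should return the same warning and error entries as the Explore view for the last 24 hours, which might be a starting point for experimenting with alert conditions.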
I also started adding it to our salt states: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1312
Still missing:
- add loki.monitor.qa.suse.de to /etc/dehydrated/domains.txt (is the dehydrated state even used on that host?)
- promtail config for the monitoring host
- promtail config for all other hosts
Updated by nicksinger 23 days ago
- Status changed from In Progress to Feedback
Updated my MRs:
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1312
- https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/942
- https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/939 (merged)
Once these are merged I will stop working on this for now. I set up promtail manually on monitor and arm1; we can decide later whether we find this useful and want to roll the setup out to all machines.
Updated by nicksinger 20 days ago
- Status changed from Feedback to Resolved
Everything is merged. I got great help from the team with some minor typo fixes and the like:
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1318
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1320/diffs
For now I would call this a success :)