action #168148

closed

coordination #161414: [epic] Improved salt based infrastructure management

hackweek idea: use loki to monitor our log files and explore alerting possibilities based on these size:S

Added by nicksinger 2 months ago. Updated 20 days ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Motivation

In #167051 we discovered that our testing of telegraf is not optimal, and @nicksinger mentioned that he wants to look into loki (https://grafana.com/oss/loki/). With it we could alert on unexpected log file entries, e.g. to spot runtime issues with telegraf plugins.



Related issues 1 (1 open, 0 closed)

Copied from openQA Infrastructure (public) - action #168145: implement telegraf health check and adjust according pipelines (New)

Actions #1

Updated by nicksinger 2 months ago

  • Copied from action #168145: implement telegraf health check and adjust according pipelines added
Actions #2

Updated by okurz 2 months ago

  • Target version set to Ready
Actions #3

Updated by okurz 2 months ago

  • Target version changed from Ready to Tools - Next
Actions #4

Updated by okurz about 2 months ago

  • Subject changed from hackweek idea: use loki to monitor our log files and explore alerting possibilities based on these to hackweek idea: use loki to monitor our log files and explore alerting possibilities based on these size:S
  • Status changed from New to Workable
Actions #5

Updated by nicksinger 29 days ago

loki is already installed and running on the monitoring host. I also installed promtail for log collection on the monitoring-host itself. Data is already arriving in loki and available to query/display via our grafana instance. This is a query showing all journalctl-entries with severity "warning" or "error": https://stats.openqa-monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%2248i%22:%7B%22datasource%22:%22ee4ewos1kcidcf%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D%5C%22systemd-journal%5C%22,%20level%3D~%5C%22error%7Cwarning%5C%22%7D%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22ee4ewos1kcidcf%22%7D,%22editorMode%22:%22builder%22,%22direction%22:%22forward%22,%22legendFormat%22:%22%22%7D%5D,%22range%22:%7B%22from%22:%22now-24h%22,%22to%22:%22now%22%7D%7D%7D&orgId=1
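
For reference, the selector encoded in the Explore link above decodes to `{job="systemd-journal", level=~"error|warning"}`. A minimal sketch of how the same query could be sent to loki's HTTP API (`/loki/api/v1/query_range`) without going through grafana follows. The base URL `http://localhost:3100` (loki's default port) and the helper name `build_query_range_url` are assumptions for illustration, not the actual setup on the monitoring host.

```python
from urllib.parse import urlencode

# LogQL selector decoded from the Grafana Explore link above.
LOGQL = '{job="systemd-journal", level=~"error|warning"}'


def build_query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a URL for Loki's GET /loki/api/v1/query_range endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "forward",
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    import time

    now_ns = time.time_ns()
    day_ns = 24 * 60 * 60 * 10**9
    # Assumption: Loki is reachable on its default port on localhost.
    print(build_query_range_url("http://localhost:3100", LOGQL, now_ns - day_ns, now_ns))
```

The printed URL can be fetched with any HTTP client. The sketch only builds the request and leaves out authentication and TLS, which a real deployment may require.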

I also started adding it to our salt states: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1312
Still missing:

  • add loki.monitor.qa.suse.de to /etc/dehydrated/domains.txt (is the dehydrated state even used on that host?)
  • promtail config for the monitoring host
  • promtail config for all other hosts
Actions #6

Updated by nicksinger 23 days ago

  • Status changed from In Progress to Feedback

Updated my MRs:

Once these are merged, I will pause work on this for now. I set up promtail manually on monitor and arm1; we can decide later whether we find this useful and want to roll the setup out to all machines.

Actions #8

Updated by okurz 20 days ago

Nice!
