action #168148: hackweek idea: use loki to monitor our log files and explore alerting possibilites based on these size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #168148

closed

coordination #161414: [epic] Improved salt based infrastructure management

hackweek idea: use loki to monitor our log files and explore alerting possibilites based on these size:S

Added by nicksinger 8 months ago. Updated 6 months ago.

Status: Resolved
Priority: Low
Assignee: nicksinger
Category: Feature requests
Target version: QA (public) - Tools - Next
Start date:
Due date:
% Done:
Estimated time:
Tags: gitlab, influxdb, grafana, infra, telegraf

Description

Motivation

In #167051 we discovered that our testing of telegraf is not optimal, and @nicksinger mentioned that he wants to look into loki (https://grafana.com/oss/loki/). With it we could alert on unexpected log file entries, e.g. to spot runtime issues with telegraf plugins.

Files

clipboard-202411201423-lkrzz.png (170 KB)

nicksinger, 2024-11-20 13:23

Related issues 1 (1 open — 0 closed)

Updated by nicksinger 8 months ago

Copied from action #168145: implement telegraf health check and adjust according pipelines added

Updated by okurz 8 months ago

Target version set to Ready

Updated by okurz 8 months ago

Target version changed from Ready to Tools - Next

Updated by okurz 7 months ago

Subject changed from hackweek idea: use loki to monitor our log files and explore alerting possibilites based on these to hackweek idea: use loki to monitor our log files and explore alerting possibilites based on these size:S
Status changed from New to Workable

Updated by nicksinger 6 months ago

File clipboard-202411201423-lkrzz.png added
Status changed from Workable to In Progress

loki is already installed and running on the monitoring host. I also installed promtail for log collection on the monitoring-host itself. Data is already arriving in loki and available to query/display via our grafana instance. This is a query showing all journalctl-entries with severity "warning" or "error": https://stats.openqa-monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%2248i%22:%7B%22datasource%22:%22ee4ewos1kcidcf%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D%5C%22systemd-journal%5C%22,%20level%3D~%5C%22error%7Cwarning%5C%22%7D%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22ee4ewos1kcidcf%22%7D,%22editorMode%22:%22builder%22,%22direction%22:%22forward%22,%22legendFormat%22:%22%22%7D%5D,%22range%22:%7B%22from%22:%22now-24h%22,%22to%22:%22now%22%7D%7D%7D&orgId=1
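Decoded, the query behind that link is the LogQL selector {job="systemd-journal", level=~"error|warning"}. As a minimal sketch of how the same query could be run outside of grafana, e.g. as the basis for a scripted check, the following builds a request URL for loki's HTTP query_range endpoint and counts the log lines in an already decoded response. The base URL http://localhost:3100 and the helper names build_query_range_url and count_log_lines are assumptions for illustration and not taken from our setup.

```python
import time
from urllib.parse import urlencode

# LogQL selector decoded from the grafana explore link in this comment.
LOGQL = '{job="systemd-journal", level=~"error|warning"}'


def build_query_range_url(base_url, query, start_ns, end_ns, limit=100):
    """Build a URL for loki's /loki/api/v1/query_range endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {
        "query": query,
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": str(limit),
        "direction": "forward",
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


def count_log_lines(response):
    """Count log lines in a decoded query_range response of type 'streams'."""
    return sum(len(stream["values"]) for stream in response["data"]["result"])


if __name__ == "__main__":
    # Hypothetical base URL; adjust to wherever loki is actually reachable.
    now_ns = time.time_ns()
    day_ns = 24 * 60 * 60 * 10**9
    print(build_query_range_url("http://localhost:3100", LOGQL, now_ns - day_ns, now_ns))
```

The printed URL can be fetched with any HTTP client. A check could then compare count_log_lines() of the JSON response against a threshold, though whether that is preferable to grafana alert rules is left open here.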

I also started adding it to our salt states: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1312
Still missing:

add loki.monitor.qa.suse.de to /etc/dehydrated/domains.txt (is the dehydrated state even used on that host?)
promtail config for the monitoring host
promtail config for all other hosts

Updated by nicksinger 6 months ago

Status changed from In Progress to Feedback

Updated my MRs:

Once these are merged I will stop working on this for now. I set up promtail manually on monitor and arm1; we can decide later whether we find this useful and want to roll the setup out to all machines.

Updated by nicksinger 6 months ago

Status changed from Feedback to Resolved

Everything merged. I got some great help from the team with some minor typo fixes and such:

Our grafana also contains data: https://monitor.qa.suse.de/explore?schemaVersion=1&panes=%7B%22xjj%22:%7B%22datasource%22:%22ee4ewos1kcidcf%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bsyslog_identifier%3D%5C%22telegraf%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22ee4ewos1kcidcf%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-7d%22,%22to%22:%22now%22%7D%7D%7D&orgId=1
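The query in that link is hard to read in its URL-encoded form; decoded, it is {syslog_identifier="telegraf"} |= `` (the empty line filter comes from the query builder). A minimal sketch for pulling the LogQL expressions out of such explore links is shown below. The function name extract_logql is hypothetical, and the layout of the panes parameter is inferred from the links in this ticket, so it may differ between grafana versions.

```python
import json
from urllib.parse import urlparse, parse_qs


def extract_logql(explore_url):
    """Return the LogQL expressions embedded in a grafana explore URL.

    Assumes the 'panes' query parameter holds JSON of the form
    {pane_id: {"queries": [{"expr": "..."}]}}, as seen in this ticket.
    """
    query = parse_qs(urlparse(explore_url).query)
    panes = json.loads(query["panes"][0])
    return [
        q["expr"]
        for pane in panes.values()
        for q in pane.get("queries", [])
        if "expr" in q
    ]
```

Passing one of the explore links from this ticket as a string returns a list with the single expression used in that link.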

For now I would call this a success :)

Updated by okurz 6 months ago

Nice!
