action #95293: Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

action #95293

closed

Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M

Added by okurz almost 4 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

okurz

Category:

Target version:

openQA Project (public) - Ready

Start date:

2021-07-09

Due date:

% Done:

Estimated time:

Description

Observation¶

From favogt:

I'm getting a 500 on https://openqa.opensuse.org/tests/1831535#next_previous due to [2021-07-09T09:20:25.0092 UTC] [error] [sNVmMoPb-wu6] Can't call method "id" on an undefined value at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Controller/Test.pm line 119.. Known?

The problem has been fixed with a revert in #95281 but why did we not catch this issue in our tests or monitoring?

Acceptance criteria¶

AC1: DONE: Tests within https://github.com/os-autoinst/openQA/ reproduce/cover the problem -> already covered in https://github.com/os-autoinst/openQA/pull/4035
AC2: Monitoring alerts on errors in logs on o3

Suggestions¶

Consider using again https://github.com/os-autoinst/openqa-logwarn on o3 as we had some years ago with email sending to o3-admins@suse.de by cron jobs

Files

Download all files

coverage-after.png (86.7 KB) coverage-after.png		tinita, 2021-07-14 21:03
coverage-before.png (85.6 KB) coverage-before.png		tinita, 2021-07-14 21:03

Related issues 4 (1 open — 3 closed)

Actions

Copy link

Updated by okurz almost 4 years ago

Copied from action #95281: error on "Next & previous results": ajax error message and no results showing up added

Actions

Copy link

Updated by okurz almost 4 years ago

Target version set to Ready

Actions

Copy link

Updated by tinita almost 4 years ago

why did we not catch this issue in our tests [...]?

We didn't catch it in our tests because the tests didn't have a bugref comments without a corresponding entry in the Bugs table.

Actions

Copy link

Updated by ilausuch almost 4 years ago

Subject changed from followup to: error on "Next & previous results": ajax error message and no results showing up to followup to: error on "Next & previous results": ajax error message and no results showing up size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

Updated by okurz almost 4 years ago

Description updated (diff)

Actions

Copy link

Updated by mkittler almost 4 years ago

Description updated (diff)

Judging by #95281#note-8 I would assume this is fixed. It would still make sense to add checks within our monitoring for errors like Can't call method ".*" on an undefined.

Actions

Copy link

Updated by mkittler almost 4 years ago

Description updated (diff)

I had an old version of the ticket open.

Actions

Copy link Download all files

Updated by tinita almost 4 years ago

File coverage-after.png coverage-after.png added
File coverage-before.png coverage-before.png added

Regarding testing and coverage:

I ran a coverage test with the full Devel::Cover abilities to cover not only statements, but branches/conditions.

I attached two screenshots of the WebAPI::Plugin::Helpers module with the bug*_for methods. One shows the coverage before my PR and one after it.

The statement coverage for this module is 93.9 in both cases, but the branches/condition coverage has increased.

file	                            stmt	bran	cond	sub	time	total
lib/OpenQA/WebAPI/Plugin/Helpers.pm	93.9	82.3	72.7	90.2	0.0	89.1    # before
lib/OpenQA/WebAPI/Plugin/Helpers.pm	93.9	83.6	75.0	90.2	0.0	89.6    # after

So indeed my added test increased the coverage. We just don't measure it. One reason is that Codecov doesn't offer this.
The other is that it isn't really possible to mark single conditions as uncoverable.

Actions

Copy link

Updated by okurz almost 4 years ago

tinita wrote:

[…]
The other is that it isn't really possible to mark single conditions as uncoverable.

What do you mean by that? https://metacpan.org/pod/Devel::Cover#Branches and https://metacpan.org/pod/Devel::Cover#Conditions explains how branches and conditions can be marked as uncoverable as well. Just that if we would really aim for 100% branch+condition coverage we would like go mad trying to mark all "uncoverable" conditions/branches as such.

Actions

Copy link

#10

Updated by ilausuch almost 4 years ago

Status changed from Workable to In Progress
Assignee set to ilausuch

Actions

Copy link

#11

Updated by openqa_review almost 4 years ago

Due date set to 2021-08-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#12

Updated by ilausuch almost 4 years ago

Description updated (diff)

Actions

Copy link

#13

Updated by ilausuch almost 4 years ago

Description updated (diff)

In a conversation with Marius during the daily we establish a sort of roadmap and requirements:

Check if telegraf can access to the logs or we need a script to do this job storing in the InfluxDB
Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB

Also we could add some more steps

Store this information in influxDB as logs, or as parsed log information (check what is the better option)
Create a view for these logs
Create an alert for the cases that requires

Actions

Copy link

#14

Updated by okurz almost 4 years ago

ilausuch wrote:

Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB

but that's what openqa-logwarn is doing. Why not just call openqa-logwarn from telegraf and record the exit status, create a grafana dashboard for that with instructions how to check logs when the alert triggers

Actions

Copy link

#15

Updated by ilausuch almost 4 years ago

I think could be interesting to test this plugin (processor/parser)
https://github.com/influxdata/telegraf/blob/release-1.19/plugins/processors/parser/README.md

But also next article although old, is interesting because shows how to create a parser for a log file, and also use patterns. This could be really interesting because we could use this to analyze "[error] " pattern
https://www.influxdata.com/blog/telegraf-correlate-log-metrics-data-performance-bottlenecks/

Actions

Copy link

#16

Updated by ilausuch almost 4 years ago

okurz wrote:

ilausuch wrote:

Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB

but that's what openqa-logwarn is doing. Why not just call openqa-logwarn from telegraf and record the exit status, create a grafana dashboard for that with instructions how to check logs when the alert triggers

Ok, an other interesting approach

Actions

Copy link

#17

Updated by ilausuch almost 4 years ago

Description updated (diff)

Actions

Copy link

#18

Updated by ilausuch almost 4 years ago

Following https://github.com/influxdata/telegraf/blob/master/plugins/inputs/exec/README.md the way to have metrics about logs should be creating a configuration

[[inputs.exec]]
command = "/root/openqa-logwarn/logwarn_openqa"
data_format = "influx"
interval = "120s"

Using the example in the repo logwarn_openqa give us

[Mon Sep 26 21:16:43 2016] [19382:error] File iso/bar.iso is not a registered asset
[Mon Sep 26 21:16:46 2016] [19382:error] Directory repo/baz.license is not a registered asset
[Mon Sep 26 21:17:27 2016] [30003:error] MISSING S:scheduled R:none
[Wed Mar 29 11:25:38 2017] [14008:warn] FOOBAR=RAID0:ppc64le not found - check for typos and dependency cycles
[2018-02-06T04:02:24.0435 CET] [error] The world is coming to an end

Actions

Copy link

#19

Updated by ilausuch almost 4 years ago

I created a proposal PR. It is not tested yet
~~https://github.com/os-autoinst/salt-states-openqa/pull/1~~
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/530

Actions

Copy link

#20

Updated by ilausuch almost 4 years ago

I tested the configuration creating a simple test with the lines that the MR introduces, and works in the moment we have logwarn installed and openqa_logwarn in the /opt directory

telegraf --test --debug --config telegraf-webui.conf  
2021-08-03T12:55:38Z I! Starting Telegraf 
2021-08-03T12:55:38Z D! [agent] Initializing plugins
2021-08-03T12:55:38Z D! [agent] Starting service inputs
2021-08-03T12:55:38Z D! [agent] Stopping service inputs
2021-08-03T12:55:38Z D! [agent] Input channel closed
2021-08-03T12:55:38Z D! [agent] Stopped Successfully

I tested also what happens if logwarn is not installed, openqa_logwarn is not in the /opt directory or /var/log/openqa is not present.

Actions

Copy link

#21

Updated by livdywan almost 4 years ago

ilausuch wrote:

I tested also what happens if logwarn is not installed, openqa_logwarn is not in the /opt directory or /var/log/openqa is not present.

!530 uses pkg.installed and git.cloned so I would assume salt fails if either one is missing.

In other words, why does it matter what happens if logwarn is not installed? Maybe you can clarify what you expect here?

Actions

Copy link

#22

Updated by okurz almost 4 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/530 merged. I doubt we need to care more what happens if logwarn is not installed. The important part is about logwarn doing its job of warning about important new errors and not reporting about known stuff. Does this work?

Actions

Copy link

#23

Updated by mkittler almost 4 years ago

The SR has been merged but I don't find all questions I've asked in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/530#note_335055 addressed in the comments here. It would have been nice to check whether the data gathered now can actually be used within Grafana. Accessing the data from Grafana could also help to answer the question raised in @okurz's last comment.

Actions

Copy link

#24

Updated by livdywan almost 4 years ago

Notes from our conversation:

The config should use commands = ["foobar"] like we also do here https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-common.conf#L34
https://github.com/os-autoinst/openqa-logwarn needs to be modified or forked to produce logwarn,tag1=a,tag2=b i=42i,j=43i,k=44i style output
https://github.com/martchus/openqa-helper#monitoring-with-telegrafinfluxdbgrafana has suggestions on how to test this

Actions

Copy link

#25

Updated by livdywan almost 4 years ago

Due date deleted (~~2021-08-04~~)
Status changed from In Progress to Workable
Assignee deleted (~~ilausuch~~)

I guess we can see if somebody else takes over from here. !537 has open questions.

Actions

Copy link

#26

Updated by mkittler almost 4 years ago

Note that this kind of alerting would also be useful to have for worker logs (see #96710#note-23).

Actions

Copy link

#27

Updated by okurz almost 4 years ago

Subject changed from followup to: error on "Next & previous results": ajax error message and no results showing up size:M to Monitoring alerts on errors in logs (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M

Actions

Copy link

#28

Updated by okurz almost 4 years ago

Related to action #97544: Monitoring alerts on errors in logs on osd added

Actions

Copy link

#29

Updated by okurz almost 4 years ago

Project changed from openQA Project (public) to openQA Infrastructure (public)
Subject changed from Monitoring alerts on errors in logs (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M to Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M
Description updated (diff)
Category deleted (~~Regressions/Crashes~~)

Actions

Copy link

#30

Updated by okurz over 3 years ago

Related to action #80812: Fix mail sending on o3 size:S added

Actions

Copy link

#31

Updated by okurz over 3 years ago

Status changed from Workable to Blocked
Assignee set to okurz

first #80812

Actions

Copy link

#32

Updated by okurz over 3 years ago

Status changed from Blocked to Feedback

After o3 email sending works again let's see how the team copes with alert handling :)

Actions

Copy link

#33

Updated by livdywan over 3 years ago

Copied to action #105828: 4-7 logreport emails a day cause alert fatigue size:M added

Actions

Copy link

#34

Updated by okurz over 3 years ago

Status changed from Feedback to Resolved

they did not give up yet, that's good enough :)

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #95293

Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M

Observation¶

Acceptance criteria¶

Suggestions¶

Updated by okurz almost 4 years ago

Updated by okurz almost 4 years ago

Updated by tinita almost 4 years ago

Updated by ilausuch almost 4 years ago

Updated by okurz almost 4 years ago

Updated by mkittler almost 4 years ago

Updated by mkittler almost 4 years ago

Updated by tinita almost 4 years ago

Updated by okurz almost 4 years ago

Updated by ilausuch almost 4 years ago

Updated by openqa_review almost 4 years ago

Updated by ilausuch almost 4 years ago

Updated by ilausuch almost 4 years ago

Updated by okurz almost 4 years ago

Updated by ilausuch almost 4 years ago

Updated by ilausuch almost 4 years ago

Updated by ilausuch almost 4 years ago

Updated by ilausuch almost 4 years ago

Updated by ilausuch almost 4 years ago

Updated by ilausuch almost 4 years ago

Updated by livdywan almost 4 years ago

Updated by okurz almost 4 years ago

Updated by mkittler almost 4 years ago

Updated by livdywan almost 4 years ago

Updated by livdywan almost 4 years ago

Updated by mkittler almost 4 years ago

Updated by okurz almost 4 years ago

Updated by okurz almost 4 years ago

Updated by okurz almost 4 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by okurz over 3 years ago

Updated by livdywan over 3 years ago

Updated by okurz over 3 years ago