Project

General

Profile

Actions

action #95293

closed

Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M

Added by okurz over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2021-07-09
Due date:
% Done:

0%

Estimated time:

Description

Observation

From favogt:

I'm getting a 500 on https://openqa.opensuse.org/tests/1831535#next_previous due to [2021-07-09T09:20:25.0092 UTC] [error] [sNVmMoPb-wu6] Can't call method "id" on an undefined value at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Controller/Test.pm line 119.. Known?

The problem has been fixed with a revert in #95281 but why did we not catch this issue in our tests or monitoring?

Acceptance criteria

Suggestions

Consider using again https://github.com/os-autoinst/openqa-logwarn on o3 as we had some years ago with email sending to o3-admins@suse.de by cron jobs


Files

coverage-after.png (86.7 KB) coverage-after.png tinita, 2021-07-14 21:03
coverage-before.png (85.6 KB) coverage-before.png tinita, 2021-07-14 21:03

Related issues 4 (1 open3 closed)

Related to openQA Project - action #97544: Monitoring alerts on errors in logs on osdNew2021-07-09

Actions
Related to openQA Infrastructure - action #80812: Fix mail sending on o3 size:SResolvedokurz2020-12-08

Actions
Copied from openQA Project - action #95281: error on "Next & previous results": ajax error message and no results showing upResolvedokurz2021-07-09

Actions
Copied to openQA Infrastructure - action #105828: 4-7 logreport emails a day cause alert fatigue size:MResolvedtinita2022-02-032022-02-17

Actions
Actions #1

Updated by okurz over 3 years ago

  • Copied from action #95281: error on "Next & previous results": ajax error message and no results showing up added
Actions #2

Updated by okurz over 3 years ago

  • Target version set to Ready
Actions #3

Updated by tinita over 3 years ago

why did we not catch this issue in our tests [...]?

We didn't catch it in our tests because the tests didn't have a bugref comments without a corresponding entry in the Bugs table.

Actions #4

Updated by ilausuch over 3 years ago

  • Subject changed from followup to: error on "Next & previous results": ajax error message and no results showing up to followup to: error on "Next & previous results": ajax error message and no results showing up size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by okurz over 3 years ago

  • Description updated (diff)
Actions #6

Updated by mkittler over 3 years ago

  • Description updated (diff)

Judging by #95281#note-8 I would assume this is fixed. It would still make sense to add checks within our monitoring for errors like Can't call method ".*" on an undefined.

Actions #7

Updated by mkittler over 3 years ago

  • Description updated (diff)

I had an old version of the ticket open.

Updated by tinita over 3 years ago

Regarding testing and coverage:

I ran a coverage test with the full Devel::Cover abilities to cover not only statements, but branches/conditions.

I attached two screenshots of the WebAPI::Plugin::Helpers module with the bug*_for methods. One shows the coverage before my PR and one after it.

The statement coverage for this module is 93.9 in both cases, but the branches/condition coverage has increased.

file                                stmt    bran    cond    sub time    total
lib/OpenQA/WebAPI/Plugin/Helpers.pm 93.9    82.3    72.7    90.2    0.0 89.1    # before
lib/OpenQA/WebAPI/Plugin/Helpers.pm 93.9    83.6    75.0    90.2    0.0 89.6    # after

So indeed my added test increased the coverage. We just don't measure it. One reason is that Codecov doesn't offer this.
The other is that it isn't really possible to mark single conditions as uncoverable.

Actions #9

Updated by okurz over 3 years ago

tinita wrote:

[…]
The other is that it isn't really possible to mark single conditions as uncoverable.

What do you mean by that? https://metacpan.org/pod/Devel::Cover#Branches and https://metacpan.org/pod/Devel::Cover#Conditions explains how branches and conditions can be marked as uncoverable as well. Just that if we would really aim for 100% branch+condition coverage we would like go mad trying to mark all "uncoverable" conditions/branches as such.

Actions #10

Updated by ilausuch over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to ilausuch
Actions #11

Updated by openqa_review over 3 years ago

  • Due date set to 2021-08-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by ilausuch over 3 years ago

  • Description updated (diff)
Actions #13

Updated by ilausuch over 3 years ago

  • Description updated (diff)

In a conversation with Marius during the daily we establish a sort of roadmap and requirements:

  • Check if telegraf can access to the logs or we need a script to do this job storing in the InfluxDB
  • Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB

Also we could add some more steps

  • Store this information in influxDB as logs, or as parsed log information (check what is the better option)
  • Create a view for these logs
  • Create an alert for the cases that requires
Actions #14

Updated by okurz over 3 years ago

ilausuch wrote:

  • Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB

but that's what openqa-logwarn is doing. Why not just call openqa-logwarn from telegraf and record the exit status, create a grafana dashboard for that with instructions how to check logs when the alert triggers

Actions #15

Updated by ilausuch over 3 years ago

I think could be interesting to test this plugin (processor/parser)
https://github.com/influxdata/telegraf/blob/release-1.19/plugins/processors/parser/README.md

But also next article although old, is interesting because shows how to create a parser for a log file, and also use patterns. This could be really interesting because we could use this to analyze "[error] " pattern
https://www.influxdata.com/blog/telegraf-correlate-log-metrics-data-performance-bottlenecks/

Actions #16

Updated by ilausuch over 3 years ago

okurz wrote:

ilausuch wrote:

  • Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB

but that's what openqa-logwarn is doing. Why not just call openqa-logwarn from telegraf and record the exit status, create a grafana dashboard for that with instructions how to check logs when the alert triggers

Ok, an other interesting approach

Actions #17

Updated by ilausuch over 3 years ago

  • Description updated (diff)
Actions #18

Updated by ilausuch over 3 years ago

Following https://github.com/influxdata/telegraf/blob/master/plugins/inputs/exec/README.md the way to have metrics about logs should be creating a configuration

[[inputs.exec]]
command = "/root/openqa-logwarn/logwarn_openqa"
data_format = "influx"
interval = "120s"

Using the example in the repo logwarn_openqa give us

[Mon Sep 26 21:16:43 2016] [19382:error] File iso/bar.iso is not a registered asset
[Mon Sep 26 21:16:46 2016] [19382:error] Directory repo/baz.license is not a registered asset
[Mon Sep 26 21:17:27 2016] [30003:error] MISSING S:scheduled R:none
[Wed Mar 29 11:25:38 2017] [14008:warn] FOOBAR=RAID0:ppc64le not found - check for typos and dependency cycles
[2018-02-06T04:02:24.0435 CET] [error] The world is coming to an end
Actions #20

Updated by ilausuch over 3 years ago

I tested the configuration creating a simple test with the lines that the MR introduces, and works in the moment we have logwarn installed and openqa_logwarn in the /opt directory

telegraf --test --debug --config telegraf-webui.conf  
2021-08-03T12:55:38Z I! Starting Telegraf 
2021-08-03T12:55:38Z D! [agent] Initializing plugins
2021-08-03T12:55:38Z D! [agent] Starting service inputs
2021-08-03T12:55:38Z D! [agent] Stopping service inputs
2021-08-03T12:55:38Z D! [agent] Input channel closed
2021-08-03T12:55:38Z D! [agent] Stopped Successfully

I tested also what happens if logwarn is not installed, openqa_logwarn is not in the /opt directory or /var/log/openqa is not present.

Actions #21

Updated by livdywan over 3 years ago

ilausuch wrote:

I tested also what happens if logwarn is not installed, openqa_logwarn is not in the /opt directory or /var/log/openqa is not present.

!530 uses pkg.installed and git.cloned so I would assume salt fails if either one is missing.

In other words, why does it matter what happens if logwarn is not installed? Maybe you can clarify what you expect here?

Actions #22

Updated by okurz over 3 years ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/530 merged. I doubt we need to care more what happens if logwarn is not installed. The important part is about logwarn doing its job of warning about important new errors and not reporting about known stuff. Does this work?

Actions #23

Updated by mkittler over 3 years ago

The SR has been merged but I don't find all questions I've asked in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/530#note_335055 addressed in the comments here. It would have been nice to check whether the data gathered now can actually be used within Grafana. Accessing the data from Grafana could also help to answer the question raised in @okurz's last comment.

Actions #24

Updated by livdywan over 3 years ago

Notes from our conversation:

Actions #25

Updated by livdywan over 3 years ago

  • Due date deleted (2021-08-04)
  • Status changed from In Progress to Workable
  • Assignee deleted (ilausuch)

I guess we can see if somebody else takes over from here. !537 has open questions.

Actions #26

Updated by mkittler over 3 years ago

Note that this kind of alerting would also be useful to have for worker logs (see #96710#note-23).

Actions #27

Updated by okurz about 3 years ago

  • Subject changed from followup to: error on "Next & previous results": ajax error message and no results showing up size:M to Monitoring alerts on errors in logs (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M
Actions #28

Updated by okurz about 3 years ago

  • Related to action #97544: Monitoring alerts on errors in logs on osd added
Actions #29

Updated by okurz about 3 years ago

  • Project changed from openQA Project to openQA Infrastructure
  • Subject changed from Monitoring alerts on errors in logs (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M to Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M
  • Description updated (diff)
  • Category deleted (Regressions/Crashes)
Actions #30

Updated by okurz almost 3 years ago

Actions #31

Updated by okurz almost 3 years ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

first #80812

Actions #32

Updated by okurz almost 3 years ago

  • Status changed from Blocked to Feedback

After o3 email sending works again let's see how the team copes with alert handling :)

Actions #33

Updated by livdywan almost 3 years ago

  • Copied to action #105828: 4-7 logreport emails a day cause alert fatigue size:M added
Actions #34

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Resolved

they did not give up yet, that's good enough :)

Actions

Also available in: Atom PDF