Project

General

Profile

action #95293

Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M

Added by okurz 5 months ago. Updated 3 months ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Target version:
Start date:
2021-07-09
Due date:
% Done:

0%

Estimated time:

Description

Observation

From favogt:

I'm getting a 500 on https://openqa.opensuse.org/tests/1831535#next_previous due to [2021-07-09T09:20:25.0092 UTC] [error] [sNVmMoPb-wu6] Can't call method "id" on an undefined value at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Controller/Test.pm line 119.. Known?

The problem has been fixed with a revert in #95281 but why did we not catch this issue in our tests or monitoring?

Acceptance criteria

Suggestions

Consider using again https://github.com/os-autoinst/openqa-logwarn on o3 as we had some years ago with email sending to o3-admins@suse.de by cron jobs

coverage-after.png (86.7 KB) coverage-after.png tinita, 2021-07-14 21:03
coverage-before.png (85.6 KB) coverage-before.png tinita, 2021-07-14 21:03
11704
11707

Related issues

Related to openQA Project - action #97544: Monitoring alerts on errors in logs on osdNew2021-07-09

Copied from openQA Project - action #95281: error on "Next & previous results": ajax error message and no results showing upResolved2021-07-09

History

#1 Updated by okurz 5 months ago

  • Copied from action #95281: error on "Next & previous results": ajax error message and no results showing up added

#2 Updated by okurz 5 months ago

  • Target version set to Ready

#3 Updated by tinita 5 months ago

why did we not catch this issue in our tests [...]?

We didn't catch it in our tests because the tests didn't have a bugref comments without a corresponding entry in the Bugs table.

#4 Updated by ilausuch 5 months ago

  • Subject changed from followup to: error on "Next & previous results": ajax error message and no results showing up to followup to: error on "Next & previous results": ajax error message and no results showing up size:M
  • Description updated (diff)
  • Status changed from New to Workable

#5 Updated by okurz 5 months ago

  • Description updated (diff)

#6 Updated by mkittler 5 months ago

  • Description updated (diff)

Judging by #95281#note-8 I would assume this is fixed. It would still make sense to add checks within our monitoring for errors like Can't call method ".*" on an undefined.

#7 Updated by mkittler 5 months ago

  • Description updated (diff)

I had an old version of the ticket open.

#8 Updated by tinita 5 months ago

11704
11707

Regarding testing and coverage:

I ran a coverage test with the full Devel::Cover abilities to cover not only statements, but branches/conditions.

I attached two screenshots of the WebAPI::Plugin::Helpers module with the bug*_for methods. One shows the coverage before my PR and one after it.

The statement coverage for this module is 93.9 in both cases, but the branches/condition coverage has increased.

file                                stmt    bran    cond    sub time    total
lib/OpenQA/WebAPI/Plugin/Helpers.pm 93.9    82.3    72.7    90.2    0.0 89.1    # before
lib/OpenQA/WebAPI/Plugin/Helpers.pm 93.9    83.6    75.0    90.2    0.0 89.6    # after

So indeed my added test increased the coverage. We just don't measure it. One reason is that Codecov doesn't offer this.
The other is that it isn't really possible to mark single conditions as uncoverable.

#9 Updated by okurz 5 months ago

tinita wrote:

[…]
The other is that it isn't really possible to mark single conditions as uncoverable.

What do you mean by that? https://metacpan.org/pod/Devel::Cover#Branches and https://metacpan.org/pod/Devel::Cover#Conditions explains how branches and conditions can be marked as uncoverable as well. Just that if we would really aim for 100% branch+condition coverage we would like go mad trying to mark all "uncoverable" conditions/branches as such.

#10 Updated by ilausuch 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to ilausuch

#11 Updated by openqa_review 5 months ago

  • Due date set to 2021-08-04

Setting due date based on mean cycle time of SUSE QE Tools

#12 Updated by ilausuch 5 months ago

  • Description updated (diff)

#13 Updated by ilausuch 5 months ago

  • Description updated (diff)

In a conversation with Marius during the daily we establish a sort of roadmap and requirements:

  • Check if telegraf can access to the logs or we need a script to do this job storing in the InfluxDB
  • Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB

Also we could add some more steps

  • Store this information in influxDB as logs, or as parsed log information (check what is the better option)
  • Create a view for these logs
  • Create an alert for the cases that requires

#14 Updated by okurz 5 months ago

ilausuch wrote:

  • Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB

but that's what openqa-logwarn is doing. Why not just call openqa-logwarn from telegraf and record the exit status, create a grafana dashboard for that with instructions how to check logs when the alert triggers

#15 Updated by ilausuch 5 months ago

I think could be interesting to test this plugin (processor/parser)
https://github.com/influxdata/telegraf/blob/release-1.19/plugins/processors/parser/README.md

But also next article although old, is interesting because shows how to create a parser for a log file, and also use patterns. This could be really interesting because we could use this to analyze "[error] " pattern
https://www.influxdata.com/blog/telegraf-correlate-log-metrics-data-performance-bottlenecks/

#16 Updated by ilausuch 5 months ago

okurz wrote:

ilausuch wrote:

  • Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB

but that's what openqa-logwarn is doing. Why not just call openqa-logwarn from telegraf and record the exit status, create a grafana dashboard for that with instructions how to check logs when the alert triggers

Ok, an other interesting approach

#17 Updated by ilausuch 5 months ago

  • Description updated (diff)

#18 Updated by ilausuch 5 months ago

Following https://github.com/influxdata/telegraf/blob/master/plugins/inputs/exec/README.md the way to have metrics about logs should be creating a configuration

[[inputs.exec]]
command = "/root/openqa-logwarn/logwarn_openqa"
data_format = "influx"
interval = "120s"

Using the example in the repo logwarn_openqa give us

[Mon Sep 26 21:16:43 2016] [19382:error] File iso/bar.iso is not a registered asset
[Mon Sep 26 21:16:46 2016] [19382:error] Directory repo/baz.license is not a registered asset
[Mon Sep 26 21:17:27 2016] [30003:error] MISSING S:scheduled R:none
[Wed Mar 29 11:25:38 2017] [14008:warn] FOOBAR=RAID0:ppc64le not found - check for typos and dependency cycles
[2018-02-06T04:02:24.0435 CET] [error] The world is coming to an end

#20 Updated by ilausuch 4 months ago

I tested the configuration creating a simple test with the lines that the MR introduces, and works in the moment we have logwarn installed and openqa_logwarn in the /opt directory

telegraf --test --debug --config telegraf-webui.conf  
2021-08-03T12:55:38Z I! Starting Telegraf 
2021-08-03T12:55:38Z D! [agent] Initializing plugins
2021-08-03T12:55:38Z D! [agent] Starting service inputs
2021-08-03T12:55:38Z D! [agent] Stopping service inputs
2021-08-03T12:55:38Z D! [agent] Input channel closed
2021-08-03T12:55:38Z D! [agent] Stopped Successfully

I tested also what happens if logwarn is not installed, openqa_logwarn is not in the /opt directory or /var/log/openqa is not present.

#21 Updated by cdywan 4 months ago

ilausuch wrote:

I tested also what happens if logwarn is not installed, openqa_logwarn is not in the /opt directory or /var/log/openqa is not present.

!530 uses pkg.installed and git.cloned so I would assume salt fails if either one is missing.

In other words, why does it matter what happens if logwarn is not installed? Maybe you can clarify what you expect here?

#22 Updated by okurz 4 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/530 merged. I doubt we need to care more what happens if logwarn is not installed. The important part is about logwarn doing its job of warning about important new errors and not reporting about known stuff. Does this work?

#23 Updated by mkittler 4 months ago

The SR has been merged but I don't find all questions I've asked in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/530#note_335055 addressed in the comments here. It would have been nice to check whether the data gathered now can actually be used within Grafana. Accessing the data from Grafana could also help to answer the question raised in okurz's last comment.

#24 Updated by cdywan 4 months ago

Notes from our conversation:

#25 Updated by cdywan 4 months ago

  • Due date deleted (2021-08-04)
  • Status changed from In Progress to Workable
  • Assignee deleted (ilausuch)

I guess we can see if somebody else takes over from here. !537 has open questions.

#26 Updated by mkittler 4 months ago

Note that this kind of alerting would also be useful to have for worker logs (see #96710#note-23).

#27 Updated by okurz 3 months ago

  • Subject changed from followup to: error on "Next & previous results": ajax error message and no results showing up size:M to Monitoring alerts on errors in logs (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M

#28 Updated by okurz 3 months ago

  • Related to action #97544: Monitoring alerts on errors in logs on osd added

#29 Updated by okurz 3 months ago

  • Project changed from openQA Project to openQA Infrastructure
  • Subject changed from Monitoring alerts on errors in logs (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M to Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M
  • Description updated (diff)
  • Category deleted (Concrete Bugs)

Also available in: Atom PDF