action #95293
Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M
Description
Observation
From favogt:
I'm getting a 500 on https://openqa.opensuse.org/tests/1831535#next_previous due to [2021-07-09T09:20:25.0092 UTC] [error] [sNVmMoPb-wu6] Can't call method "id" on an undefined value at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Controller/Test.pm line 119.. Known?
The problem has been fixed with a revert in #95281 but why did we not catch this issue in our tests or monitoring?
Acceptance criteria
- AC1: DONE: Tests within https://github.com/os-autoinst/openQA/ reproduce/cover the problem -> already covered in https://github.com/os-autoinst/openQA/pull/4035
- AC2: Monitoring alerts on errors in logs on o3
Suggestions
Consider using https://github.com/os-autoinst/openqa-logwarn on o3 again, as we did some years ago, with cron jobs sending email to o3-admins@suse.de
Related issues
History
#1
Updated by okurz 11 months ago
- Copied from action #95281: error on "Next & previous results": ajax error message and no results showing up added
#4
Updated by ilausuch 11 months ago
- Subject changed from followup to: error on "Next & previous results": ajax error message and no results showing up to followup to: error on "Next & previous results": ajax error message and no results showing up size:M
- Description updated (diff)
- Status changed from New to Workable
#6
Updated by mkittler 11 months ago
- Description updated (diff)
Judging by #95281#note-8 I would assume this is fixed. It would still make sense to add checks within our monitoring for errors like Can't call method ".*" on an undefined.
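A minimal sketch of such a check, assuming the monitoring side can read log lines as plain strings. The function name find_undefined_method_calls is hypothetical, and the first sample line is the error quoted in the description:

```python
import re

# Pattern for the Perl runtime error quoted in the description;
# the method name is left open so any method matches.
UNDEF_CALL = re.compile(r"""Can't call method "[^"]+" on an undefined value""")


def find_undefined_method_calls(lines):
    """Return the log lines that contain the Perl 'undefined value' error."""
    return [line for line in lines if UNDEF_CALL.search(line)]


sample = [
    '[2021-07-09T09:20:25.0092 UTC] [error] [sNVmMoPb-wu6] Can\'t call method "id" '
    "on an undefined value at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Controller/Test.pm line 119.",
    "[2018-02-06T04:02:24.0435 CET] [error] The world is coming to an end",
]
matches = find_undefined_method_calls(sample)
print(len(matches))
```

Only the first sample line matches. A real check would still need a decision on where the lines come from and how a match turns into an alert.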
#8
Updated by tinita 11 months ago
- File coverage-after.png coverage-after.png added
- File coverage-before.png coverage-before.png added
Regarding testing and coverage:
I ran a coverage test with the full Devel::Cover abilities to cover not only statements, but branches/conditions.
I attached two screenshots of the WebAPI::Plugin::Helpers module with the bug*_for methods. One shows the coverage before my PR and one after it.
The statement coverage for this module is 93.9 in both cases, but the branches/condition coverage has increased.
file                                  stmt  bran  cond  sub   time  total
lib/OpenQA/WebAPI/Plugin/Helpers.pm   93.9  82.3  72.7  90.2  0.0   89.1   # before
lib/OpenQA/WebAPI/Plugin/Helpers.pm   93.9  83.6  75.0  90.2  0.0   89.6   # after
So indeed my added test increased the coverage. We just don't measure it. One reason is that Codecov doesn't offer this.
The other is that it isn't really possible to mark single conditions as uncoverable.
#9
Updated by okurz 10 months ago
tinita wrote:
[…]
The other is that it isn't really possible to mark single conditions as uncoverable.
What do you mean by that? https://metacpan.org/pod/Devel::Cover#Branches and https://metacpan.org/pod/Devel::Cover#Conditions explain how branches and conditions can be marked as uncoverable as well. It's just that if we really aimed for 100% branch+condition coverage we would likely go mad trying to mark all "uncoverable" conditions/branches as such.
#11
Updated by openqa_review 10 months ago
- Due date set to 2021-08-04
Setting due date based on mean cycle time of SUSE QE Tools
#13
Updated by ilausuch 10 months ago
- Description updated (diff)
In a conversation with Marius during the daily we established a rough roadmap and requirements:
- Check whether telegraf can access the logs directly or whether we need a script that does this job and stores the results in InfluxDB
- Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB
We could also add some more steps:
- Store this information in InfluxDB either as raw logs or as parsed log information (check which is the better option)
- Create a view for these logs
- Create an alert for the cases that require one
#14
Updated by okurz 10 months ago
ilausuch wrote:
- Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB
but that's what openqa-logwarn is doing. Why not just call openqa-logwarn from telegraf and record the exit status, create a grafana dashboard for that with instructions how to check logs when the alert triggers
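The "record the exit status" idea could be sketched as a small wrapper that telegraf runs through its exec input. The measurement name openqa_logwarn and the field name exit_status are assumptions for illustration. Which exit statuses the real logwarn script returns is not verified here, so the example uses a stand-in command:

```python
import subprocess
import sys


def exit_status_metric(command, measurement="openqa_logwarn"):
    """Run command and return one InfluxDB line-protocol line with its exit status."""
    result = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    # The trailing "i" marks the field as an integer in line protocol.
    return f"{measurement} exit_status={result.returncode}i"


# Stand-in command that exits with status 3; a real setup would pass
# the path of the logwarn script instead (assumption, not tested here).
line = exit_status_metric([sys.executable, "-c", "import sys; sys.exit(3)"])
print(line)
```

A Grafana alert could then fire whenever the recorded field is non-zero, with the dashboard text pointing at the log file to inspect.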
#15
Updated by ilausuch 10 months ago
I think it could be interesting to test this plugin (processor/parser):
https://github.com/influxdata/telegraf/blob/release-1.19/plugins/processors/parser/README.md
The following article, although old, is also interesting because it shows how to create a parser for a log file and how to use patterns. This could be really useful because we could use it to match the "[error] " pattern:
https://www.influxdata.com/blog/telegraf-correlate-log-metrics-data-performance-bottlenecks/
#16
Updated by ilausuch 10 months ago
okurz wrote:
ilausuch wrote:
- Check if we can filter the logs to cherry pick only the interesting logs to not overload the DB
but that's what openqa-logwarn is doing. Why not just call openqa-logwarn from telegraf and record the exit status, create a grafana dashboard for that with instructions how to check logs when the alert triggers
OK, that is another interesting approach.
#18
Updated by ilausuch 10 months ago
Following https://github.com/influxdata/telegraf/blob/master/plugins/inputs/exec/README.md the way to get metrics about the logs should be to create a configuration like this:
[[inputs.exec]]
  command = "/root/openqa-logwarn/logwarn_openqa"
  data_format = "influx"
  interval = "120s"
Using the example in the repo, logwarn_openqa gives us:
[Mon Sep 26 21:16:43 2016] [19382:error] File iso/bar.iso is not a registered asset
[Mon Sep 26 21:16:46 2016] [19382:error] Directory repo/baz.license is not a registered asset
[Mon Sep 26 21:17:27 2016] [30003:error] MISSING S:scheduled R:none
[Wed Mar 29 11:25:38 2017] [14008:warn] FOOBAR=RAID0:ppc64le not found - check for typos and dependency cycles
[2018-02-06T04:02:24.0435 CET] [error] The world is coming to an end
#19
Updated by ilausuch 10 months ago
I created a proposal PR. It is not tested yet:
https://github.com/os-autoinst/salt-states-openqa/pull/1
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/530
#20
Updated by ilausuch 10 months ago
I tested the configuration with a simple test containing the lines that the MR introduces. It works as long as logwarn is installed and openqa_logwarn is in the /opt directory:
telegraf --test --debug --config telegraf-webui.conf
2021-08-03T12:55:38Z I! Starting Telegraf
2021-08-03T12:55:38Z D! [agent] Initializing plugins
2021-08-03T12:55:38Z D! [agent] Starting service inputs
2021-08-03T12:55:38Z D! [agent] Stopping service inputs
2021-08-03T12:55:38Z D! [agent] Input channel closed
2021-08-03T12:55:38Z D! [agent] Stopped Successfully
I tested also what happens if logwarn is not installed, openqa_logwarn is not in the /opt directory or /var/log/openqa is not present.
#21
Updated by cdywan 10 months ago
ilausuch wrote:
I tested also what happens if logwarn is not installed, openqa_logwarn is not in the /opt directory or /var/log/openqa is not present.
!530 uses pkg.installed and git.cloned so I would assume salt fails if either one is missing.
In other words, why does it matter what happens if logwarn is not installed? Maybe you can clarify what you expect here?
#22
Updated by okurz 10 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/530 merged. I doubt we need to care more what happens if logwarn is not installed. The important part is about logwarn doing its job of warning about important new errors and not reporting about known stuff. Does this work?
#23
Updated by mkittler 10 months ago
The SR has been merged but I don't find all questions I've asked in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/530#note_335055 addressed in the comments here. It would have been nice to check whether the data gathered now can actually be used within Grafana. Accessing the data from Grafana could also help to answer the question raised in okurz's last comment.
#24
Updated by cdywan 10 months ago
Notes from our conversation:
- The config should use commands = ["foobar"] like we also do here https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/monitoring/telegraf/telegraf-common.conf#L34
- https://github.com/os-autoinst/openqa-logwarn needs to be modified or forked to produce logwarn,tag1=a,tag2=b i=42i,j=43i,k=44i style output
- https://github.com/martchus/openqa-helper#monitoring-with-telegrafinfluxdbgrafana has suggestions on how to test this
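To illustrate the second note, here is a rough sketch of turning logwarn-style text output into a single line in InfluxDB line protocol. It is not how openqa-logwarn works today. The helper names, the host tag value "o3" and the choice to count entries per severity are all assumptions. The sample lines are the ones from the example in the openqa-logwarn repo quoted earlier in this ticket:

```python
import re
from collections import Counter

# Matches "[error]" and "[warn]" as well as the older "[19382:error]"
# style that carries a PID prefix.
LEVEL = re.compile(r"\[(?:\d+:)?(error|warn)\]")


def count_levels(lines):
    """Count log lines per severity level."""
    counts = Counter()
    for line in lines:
        match = LEVEL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def to_line_protocol(counts, measurement="logwarn", host="o3"):
    """Format the counts as one line: measurement,tag=value field=Ni,field=Ni"""
    fields = ",".join(f"{level}={counts.get(level, 0)}i" for level in ("error", "warn"))
    return f"{measurement},host={host} {fields}"


sample = [
    "[Mon Sep 26 21:16:43 2016] [19382:error] File iso/bar.iso is not a registered asset",
    "[Mon Sep 26 21:16:46 2016] [19382:error] Directory repo/baz.license is not a registered asset",
    "[Mon Sep 26 21:17:27 2016] [30003:error] MISSING S:scheduled R:none",
    "[Wed Mar 29 11:25:38 2017] [14008:warn] FOOBAR=RAID0:ppc64le not found - check for typos and dependency cycles",
    "[2018-02-06T04:02:24.0435 CET] [error] The world is coming to an end",
]
print(to_line_protocol(count_levels(sample)))
```

Emitting counts rather than the raw messages keeps the data small in InfluxDB. The messages themselves would still have to be looked up in the log file when an alert triggers.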
#26
Updated by mkittler 9 months ago
Note that this kind of alerting would also be useful to have for worker logs (see #96710#note-23).
#27
Updated by okurz 9 months ago
- Subject changed from followup to: error on "Next & previous results": ajax error message and no results showing up size:M to Monitoring alerts on errors in logs (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M
#28
Updated by okurz 9 months ago
- Related to action #97544: Monitoring alerts on errors in logs on osd added
#29
Updated by okurz 9 months ago
- Project changed from openQA Project to openQA Infrastructure
- Subject changed from Monitoring alerts on errors in logs (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M to Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M
- Description updated (diff)
- Category deleted (Concrete Bugs)
#30
Updated by okurz 5 months ago
- Related to action #80812: Fix mail sending on o3 size:S added
#33
Updated by cdywan 4 months ago
- Copied to action #105828: 4-7 logreport emails a day cause alert fatigue size:M added