action #105828
closed
4-7 logreport emails a day cause alert fatigue size:M
Added by livdywan almost 3 years ago.
Updated almost 3 years ago.
Description
Observation
Thanks to #80812, o3 can send out emails. Unfortunately, we now get 4-7 logreport emails a day from openqa-monitor@ariel.suse-dmz.opensuse.org and we're not keeping up with handling all of them.
Emails are sent by a cronjob running https://github.com/os-autoinst/openqa-logwarn
Examples:
[2022-02-02T09:44:45.023821Z] [error] [pid:6229] Cannot read symbolic link (/opt/openqa-trigger-from-obs/openSUSE:Leap:15.4:ARM:Images:ToTest/.run_last): No such file or directory
[2022-02-02T08:07:52.883567Z] [warn] [pid:22053] Ignoring invalid group {"name":"38"} when creating new job 2172324
[2022-02-02T02:30:10.097604Z] [warn] [pid:10722] Unable to wakeup scheduler: Request timeout
[2022-02-02T02:30:14.810226Z] [error] [pid:13594] Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left)
[2022-02-01T15:38:12.281868Z] [warn] [pid:28556] fatal: Invalid revision range 745485c7527687dab875e0ab0f4c96f730e26dea..8f56d6708e2211a41fe189635a3bbebd2f9d0be8
[2022-02-01T15:38:12.282093Z] [error] [pid:28556] cmd returned 32768
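To see which messages recur, the example lines above can be grouped by their stable part. The following is only a minimal sketch, assuming every line follows the `[timestamp] [level] [pid:N] message` layout shown in the examples; `parse_line` and `summarize` are hypothetical helper names and are not part of openqa-logwarn.

```python
import re
from collections import Counter

# Assumed layout, taken from the example lines: [timestamp] [level] [pid:N] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] \[pid:(?P<pid>\d+)\] (?P<msg>.*)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count messages per (level, normalized message).

    Digit runs are replaced with "N" so that job IDs, group names and
    similar numbers do not split one recurring message into many buckets.
    """
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip lines that do not follow the assumed layout
        key = re.sub(r"\d+", "N", record["msg"])
        counts[(record["level"], key)] += 1
    return counts
```

Calling `summarize(open(path))` on a log file and printing `counts.most_common()` would list the noisiest messages first, which may help decide what deserves its own ticket.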
Acceptance criteria
Suggestions
- Team up to investigate all of the current issues
- Create individual tickets for the issues and blocklist them by proposing changes to https://github.com/os-autoinst/openqa-logwarn (changes are effective ~10 minutes after a merge)
- Copied from action #95293: Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M added
- Priority changed from High to Urgent
- Subject changed from 4-7 logreport emails a day cause alert fatigue to 4-7 logreport emails a day cause alert fatigue size:M
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to tinita
- Due date set to 2022-02-17
Setting due date based on mean cycle time of SUSE QE Tools
I tried to find out where /usr/local/bin/logwarn comes from on o3, but wasn't successful.
I have also searched in Redmine and didn't find out how it was installed.
ls -l /usr/local/bin/logwarn
-rwxr-xr-x 1 okurz users 60992 Oct 31 2016 /usr/local/bin/logwarn
Looks to me like it was installed manually from a release tarball from code.google.com.
https://github.com/archiecobbs/logwarn/blob/master/CHANGES#L30
Version 1.0.16 Released November 21, 2021
...
...
...
Version 1.0.12 Released May 24, 2016
...
- Moved project hosting from Google code to GitHub
Time to update I guess?
% zypper info logwarn
Information for package logwarn:
--------------------------------
Repository : openSUSE-Leap-15.3-Oss
Name : logwarn
Version : 1.0.14-bp153.1.13
Started to add things to the blocklist and made testing a bit more user-friendly.
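The idea of blocklisting known messages can be illustrated roughly as follows. This is a sketch under assumptions, not the actual openqa-logwarn implementation: the regular expressions are hypothetical entries derived from the example lines in the description, and `compile_blocklist` and `unreported` are made-up names.

```python
import re

# Hypothetical blocklist entries, one regex per known (already ticketed) message.
BLOCKLIST = [
    r"Unable to wakeup scheduler: Request timeout",
    r"Ignoring invalid group \{.*\} when creating new job \d+",
    r"Publishing opensuse\.openqa\.job\.restart failed: "
    r"Connect timeout \(\d+ attempts left\)",
]


def compile_blocklist(patterns):
    """Compile the pattern strings once so they can be reused for every line."""
    return [re.compile(pattern) for pattern in patterns]


def unreported(lines, blocklist):
    """Return only the lines that match none of the blocklist patterns."""
    return [
        line for line in lines
        if not any(pattern.search(line) for pattern in blocklist)
    ]
```

Feeding known log lines through `unreported` is also a simple way to test a new pattern: a blocklisted line should disappear from the result, while anything unknown should still be reported.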
- Related to action #105930: o3 logreports - empty warnings/errors added
- Related to action #105924: o3 logreports - Template was modified added
- Related to action #105921: o3 logreports - Cannot read symbolic link (/opt/openqa-trigger-from-obs/.../.run_last): No such file or directory added
- Related to action #105918: o3 logreports - fatal: Invalid revision range sha1..sha2 added
- Related to action #105915: o3 logreports - Needle file <filename>.json not found within /var/.../opensuse/needles added
- Related to action #105909: o3 logreports - Ignoring invalid group {"name":"123"} when creating new job added
- Related to action #105903: o3 logreports - Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left) added
- Related to action #105900: o3 logreports - Unable to wakeup scheduler: Request timeout added
- Due date set to 2022-02-17
I moved all subtasks out. Adding them to the blocklist is part of this ticket; solving them shouldn't be. So now we can set a due date and resolve the ticket as soon as the ACs are covered.
- Related to action #106245: o3 logreports - Testsuite 'xyz' is invalid added
- Status changed from In Progress to Resolved
- Copied to action #106756: cmd returned 32768 repeatedly reported on o3 added
- Copied to action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
- Copied to action #106760: DBI Exception: DBD::Pg::st execute failed: number of parameters must be between 0 and 65535 repeatedly reported on o3 added
- Related to action #106613: o3 logreports DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks row not found added
- Related to action #106880: Job template name ... is already used in job group error logged on o3 size:M added
- Related to action #107023: cmd returned 31744 repeatedly reported on o3 added
- Copied to action #108533: o3 logreports DBI Exception: DBD::Pg::st execute failed: ERROR: invalid input syntax for type integer added
- Related to action #137765: logwarn does not work on new o3 (anymore?) size:M added