action #105828

closed

4-7 logreport emails a day cause alert fatigue size:M

Added by livdywan about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2022-02-03
Due date:
2022-02-17
% Done:

0%

Estimated time:

Description

Observation

Thanks to #80812, o3 can send out emails. Unfortunately, we are now getting 4-7 logreport emails per day from openqa-monitor@ariel.suse-dmz.opensuse.org and are not keeping up with handling all of them.
Emails are sent by a cronjob running https://github.com/os-autoinst/openqa-logwarn

Examples:

[2022-02-02T09:44:45.023821Z] [error] [pid:6229] Cannot read symbolic link (/opt/openqa-trigger-from-obs/openSUSE:Leap:15.4:ARM:Images:ToTest/.run_last): No such file or directory
[2022-02-02T08:07:52.883567Z] [warn] [pid:22053] Ignoring invalid group {"name":"38"} when creating new job 2172324
[2022-02-02T02:30:10.097604Z] [warn] [pid:10722] Unable to wakeup scheduler: Request timeout
[2022-02-02T02:30:14.810226Z] [error] [pid:13594] Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left)
[2022-02-01T15:38:12.281868Z] [warn] [pid:28556] fatal: Invalid revision range 745485c7527687dab875e0ab0f4c96f730e26dea..8f56d6708e2211a41fe189635a3bbebd2f9d0be8
[2022-02-01T15:38:12.282093Z] [error] [pid:28556] cmd returned 32768
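
To see which messages dominate the reports, the example lines can be grouped by message with the variable parts masked. The following is only a minimal sketch: the line layout is inferred from the examples above, and the names `parse_line`, `normalize` and `summarize` are hypothetical helpers, not part of openQA or openqa-logwarn.

```python
import re
from collections import Counter

# Layout inferred from the example lines: [timestamp] [level] [pid:N] message
LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] \[(?P<level>\w+)\] \[pid:(?P<pid>\d+)\] (?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def normalize(message):
    """Mask variable parts so that repeated messages collapse into one key."""
    message = re.sub(r"\b[0-9a-f]{40}\b", "<sha>", message)  # full git commit hashes
    message = re.sub(r"\d+", "<n>", message)  # job ids, group names, counters
    return message


def summarize(lines):
    """Count occurrences of each (level, normalized message) pair."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry:
            counts[(entry["level"], normalize(entry["message"]))] += 1
    return counts
```

Calling `summarize(open(path))` on a log file (the path is left to the reader) and then `most_common()` on the result would list the most frequent message shapes first, which may help decide what to ticket first.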

Acceptance criteria

Suggestions

  • Team up to investigate all of the current issues
  • Create individual tickets for the issues and blocklist them by proposing changes to https://github.com/os-autoinst/openqa-logwarn (changes are effective ~10 minutes after a merge)
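
The blocklist idea from the suggestions can be illustrated as follows. This is a sketch of the general technique only and not how openqa-logwarn is actually implemented; the patterns and the `unknown_lines` helper are hypothetical, written against the example messages in this ticket and the related tickets tracking them.

```python
import re

# Hypothetical blocklist: one regex per known message, paired with its ticket.
BLOCKLIST = [
    (re.compile(r"Unable to wakeup scheduler: Request timeout"), "#105900"),
    (re.compile(r"Ignoring invalid group \{.*\} when creating new job \d+"), "#105909"),
    (re.compile(r"Cannot read symbolic link \(/opt/openqa-trigger-from-obs/.*/\.run_last\)"), "#105921"),
]


def unknown_lines(lines):
    """Yield only the lines that no blocklist pattern matches."""
    for line in lines:
        if not any(pattern.search(line) for pattern, _ticket in BLOCKLIST):
            yield line
```

Only lines that survive the filter would need to go into an email. Keeping the ticket number next to each pattern makes it easier to drop an entry again once the underlying issue is fixed.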

Related issues 18 (10 open, 8 closed)

Related to openQA Project - action #105930: o3 logreports - empty warnings/errors [New, 2022-02-03]
Related to openQA Project - action #105924: o3 logreports - Template was modified [Rejected, mkittler, 2022-02-03]
Related to openQA Project - action #105921: o3 logreports - Cannot read symbolic link (/opt/openqa-trigger-from-obs/.../.run_last): No such file or directory [New, 2022-02-03]
Related to openQA Project - action #105918: o3 logreports - fatal: Invalid revision range sha1..sha2 [New, 2022-02-03]
Related to openQA Project - action #105915: o3 logreports - Needle file <filename>.json not found within /var/.../opensuse/needles [New, 2022-02-03]
Related to openQA Project - action #105909: o3 logreports - Ignoring invalid group {"name":"123"} when creating new job [Resolved, okurz, 2022-02-03]
Related to openQA Project - action #105903: o3 logreports - Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left) [New, 2022-02-03]
Related to openQA Project - action #105900: o3 logreports - Unable to wakeup scheduler: Request timeout [New, 2022-02-03]
Related to openQA Project - action #106245: o3 logreports - Testsuite 'xyz' is invalid [Rejected, mkittler]
Related to openQA Project - action #106613: o3 logreports DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks row not found [Workable, 2022-02-10]
Related to openQA Infrastructure - action #106880: Job template name ... is already used in job group error logged on o3 size:M [Resolved, mkittler, 2022-02-16]
Related to openQA Infrastructure - action #107023: cmd returned 31744 repeatedly reported on o3 [New, 2022-02-03]
Related to openQA Project - action #137765: logwarn does not work on new o3 (anymore?) size:M [Resolved, 2023-10-11]
Copied from openQA Infrastructure - action #95293: Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M [Resolved, okurz, 2021-07-09]
Copied to openQA Infrastructure - action #106756: cmd returned 32768 repeatedly reported on o3 [New, 2022-02-03]
Copied to openQA Project - action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M [Resolved, livdywan, 2022-02-03]
Copied to openQA Infrastructure - action #106760: DBI Exception: DBD::Pg::st execute failed: number of parameters must be between 0 and 65535 repeatedly reported on o3 [New, 2022-02-03]
Copied to openQA Infrastructure - action #108533: o3 logreports DBI Exception: DBD::Pg::st execute failed: ERROR: invalid input syntax for type integer [Resolved, tinita, 2022-03-31]

Actions
Actions #1

Updated by livdywan about 2 years ago

  • Copied from action #95293: Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M added
Actions #2

Updated by okurz about 2 years ago

  • Priority changed from High to Urgent

+1

Actions #3

Updated by livdywan about 2 years ago

  • Subject changed from 4-7 logreport emails a day cause alert fatigue to 4-7 logreport emails a day cause alert fatigue size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by tinita about 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita
Actions #5

Updated by openqa_review about 2 years ago

  • Due date set to 2022-02-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by tinita about 2 years ago

I tried to find out where /usr/local/bin/logwarn on o3 comes from, but without success.

I also searched Redmine and could not find out how it was installed.

Actions #7

Updated by tinita about 2 years ago

ls -l /usr/local/bin/logwarn
-rwxr-xr-x 1 okurz users 60992 Oct 31  2016 /usr/local/bin/logwarn

Looks to me like it was installed manually from a release tarball from code.google.com.
https://github.com/archiecobbs/logwarn/blob/master/CHANGES#L30

Version 1.0.16 Released November 21, 2021
...
...
...
Version 1.0.12 Released May 24, 2016

...
    - Moved project hosting from Google code to GitHub

Time to update I guess?

% zypper info logwarn

Information for package logwarn:
--------------------------------
Repository     : openSUSE-Leap-15.3-Oss
Name           : logwarn
Version        : 1.0.14-bp153.1.13

Actions #8

Updated by tinita about 2 years ago

Actions #9

Updated by tinita about 2 years ago

Started adding entries to the blocklist and made testing a bit more user-friendly.

Actions #10

Updated by openqa_review about 2 years ago

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by tinita about 2 years ago

Created 3 PRs:
https://github.com/os-autoinst/openqa-logwarn/pull/14 Improve unit tests
https://github.com/os-autoinst/openqa-logwarn/pull/15 Update logwarn
https://github.com/os-autoinst/openqa-logwarn/pull/16 Add new things to blocklist

I only left out #105918 because it looks like it might be an actual error that needs fixing.
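
In the examples in the description, the `fatal: Invalid revision range` warning is immediately followed by `cmd returned 32768` from the same pid. Assuming that number is a raw POSIX wait status, as Perl's `$?` reports it (an assumption, this ticket does not confirm where the value comes from), it can be decoded as sketched below. The helper `decode_wait_status` is hypothetical.

```python
def decode_wait_status(status):
    """Split a raw 16-bit POSIX wait status into (exit_code, signal)."""
    signal = status & 0x7F  # low 7 bits: terminating signal, 0 if none
    exit_code = (status >> 8) & 0xFF  # high byte: exit code of the command
    return exit_code, signal


# The two values that appear in this ticket and its related tickets
for status in (32768, 31744):
    exit_code, signal = decode_wait_status(status)
    print(f"cmd returned {status}: exit code {exit_code}, signal {signal}")
```

Under that assumption, 32768 corresponds to exit code 128, which is the code git uses for fatal errors and would fit the preceding warning. 31744 would correspond to exit code 124, which is for example what GNU timeout returns when a command times out; whether that applies here is not established in this ticket.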

Actions #12

Updated by openqa_review about 2 years ago

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by okurz about 2 years ago

  • Related to action #105930: o3 logreports - empty warnings/errors added
Actions #14

Updated by okurz about 2 years ago

  • Related to action #105924: o3 logreports - Template was modified added
Actions #15

Updated by okurz about 2 years ago

  • Related to action #105921: o3 logreports - Cannot read symbolic link (/opt/openqa-trigger-from-obs/.../.run_last): No such file or directory added
Actions #16

Updated by okurz about 2 years ago

  • Related to action #105918: o3 logreports - fatal: Invalid revision range sha1..sha2 added
Actions #17

Updated by okurz about 2 years ago

  • Related to action #105915: o3 logreports - Needle file <filename>.json not found within /var/.../opensuse/needles added
Actions #18

Updated by okurz about 2 years ago

  • Related to action #105909: o3 logreports - Ignoring invalid group {"name":"123"} when creating new job added
Actions #19

Updated by okurz about 2 years ago

  • Related to action #105903: o3 logreports - Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left) added
Actions #20

Updated by okurz about 2 years ago

  • Related to action #105900: o3 logreports - Unable to wakeup scheduler: Request timeout added
Actions #21

Updated by okurz about 2 years ago

  • Due date set to 2022-02-17

I moved all subtasks out. Adding them to the blocklist is part of this ticket; solving them should not be. So now we can set a due date and resolve the ticket as soon as the ACs are covered.

Actions #22

Updated by tinita about 2 years ago

  • Related to action #106245: o3 logreports - Testsuite 'xyz' is invalid added
Actions #23

Updated by tinita about 2 years ago

  • Status changed from In Progress to Resolved

So far there have been no more alert emails.
If new ones show up, create a ticket and add the message to the blocklist in https://github.com/os-autoinst/openqa-logwarn

Actions #24

Updated by livdywan about 2 years ago

  • Copied to action #106756: cmd returned 32768 repeatedly reported on o3 added
Actions #25

Updated by livdywan about 2 years ago

  • Copied to action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
Actions #26

Updated by livdywan about 2 years ago

  • Copied to action #106760: DBI Exception: DBD::Pg::st execute failed: number of parameters must be between 0 and 65535 repeatedly reported on o3 added
Actions #27

Updated by tinita about 2 years ago

  • Related to action #106613: o3 logreports DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks row not found added
Actions #28

Updated by tinita about 2 years ago

  • Related to action #106880: Job template name ... is already used in job group error logged on o3 size:M added
Actions #29

Updated by tinita about 2 years ago

  • Related to action #107023: cmd returned 31744 repeatedly reported on o3 added
Actions #30

Updated by tinita about 2 years ago

  • Copied to action #108533: o3 logreports DBI Exception: DBD::Pg::st execute failed: ERROR: invalid input syntax for type integer added
Actions #31

Updated by jbaier_cz 6 months ago

  • Related to action #137765: logwarn does not work on new o3 (anymore?) size:M added