action #105828

4-7 logreport emails a day cause alert fatigue size:M

Added by cdywan 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2022-02-03
Due date: 2022-02-17
% Done: 0%
Estimated time:
Description

Observation

Thanks to #80812 o3 can send out emails. Unfortunately now we're getting 4-7 logreport emails from openqa-monitor@ariel.suse-dmz.opensuse.org on a daily basis and we're not keeping up with handling all of them.
Emails are sent by a cronjob running https://github.com/os-autoinst/openqa-logwarn

Examples:

[2022-02-02T09:44:45.023821Z] [error] [pid:6229] Cannot read symbolic link (/opt/openqa-trigger-from-obs/openSUSE:Leap:15.4:ARM:Images:ToTest/.run_last): No such file or directory
[2022-02-02T08:07:52.883567Z] [warn] [pid:22053] Ignoring invalid group {"name":"38"} when creating new job 2172324
[2022-02-02T02:30:10.097604Z] [warn] [pid:10722] Unable to wakeup scheduler: Request timeout
[2022-02-02T02:30:14.810226Z] [error] [pid:13594] Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left)
[2022-02-01T15:38:12.281868Z] [warn] [pid:28556] fatal: Invalid revision range 745485c7527687dab875e0ab0f4c96f730e26dea..8f56d6708e2211a41fe189635a3bbebd2f9d0be8
[2022-02-01T15:38:12.282093Z] [error] [pid:28556] cmd returned 32768
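
Lines like these differ mostly in volatile fields (timestamp, PID, job ID, commit hash), which makes it hard to see how many distinct issues a report contains. A minimal sketch of grouping them by a normalised signature follows. The rules and the function names `signature` and `group_by_signature` are hypothetical and are not taken from openqa-logwarn.

```python
import re
from collections import Counter

# Hypothetical normalisation rules (not part of openqa-logwarn): each one
# replaces a volatile field with a fixed placeholder.
RULES = [
    (re.compile(r"^\[\d{4}-\d{2}-\d{2}T[^\]]*\] "), ""),    # leading timestamp
    (re.compile(r"\[pid:\d+\]"), "[pid:N]"),                # process ID
    (re.compile(r"\b[0-9a-f]{40}\b"), "<sha>"),             # full git commit hash
    (re.compile(r'\{"name":"[^"]*"\}'), '{"name":"..."}'),  # group name payload
    (re.compile(r"new job \d+"), "new job N"),              # job ID
]


def signature(line: str) -> str:
    """Return the log line with volatile fields replaced by placeholders."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line.strip()


def group_by_signature(lines):
    """Count how many non-empty log lines share each normalised signature."""
    return Counter(signature(line) for line in lines if line.strip())


if __name__ == "__main__":
    # Lines copied from the examples in this ticket.
    sample = [
        '[2022-02-02T08:07:52.883567Z] [warn] [pid:22053] Ignoring invalid group {"name":"38"} when creating new job 2172324',
        "[2022-02-02T02:30:10.097604Z] [warn] [pid:10722] Unable to wakeup scheduler: Request timeout",
        "[2022-02-01T15:38:12.282093Z] [error] [pid:28556] cmd returned 32768",
    ]
    for sig, count in group_by_signature(sample).most_common():
        print(count, sig)
```

The rules deliberately leave the exit code in `cmd returned 32768` untouched, because different codes may point to different problems.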

Acceptance criteria

Suggestions

  • Team up to investigate all of the current issues
  • Create individual tickets for the issues and blocklist them by proposing changes to https://github.com/os-autoinst/openqa-logwarn (changes are effective ~10 minutes after a merge)
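
The blocklisting idea can be sketched as follows. This is an illustration under assumptions, not the actual mechanism or file format of openqa-logwarn: the `BLOCKLIST` structure and the `unreported` function are hypothetical, while the messages and ticket numbers are the ones listed in this ticket.

```python
import re

# Hypothetical blocklist: a regular expression for each known message,
# paired with the ticket that tracks it.
BLOCKLIST = [
    (r"Unable to wakeup scheduler: Request timeout", "#105900"),
    (r"Publishing opensuse\.openqa\.job\.restart failed: Connect timeout", "#105903"),
    (r'Ignoring invalid group \{"name":"[^"]*"\} when creating new job', "#105909"),
]
COMPILED = [(re.compile(pattern), ticket) for pattern, ticket in BLOCKLIST]

# Only warnings and errors are candidates for a report.
LEVEL = re.compile(r"\[(?:warn|error)\]")


def unreported(lines):
    """Yield warn/error lines that match no blocklist entry."""
    for line in lines:
        if not LEVEL.search(line):
            continue
        if any(pattern.search(line) for pattern, _ticket in COMPILED):
            continue
        yield line


if __name__ == "__main__":
    # Lines copied from the examples in this ticket.
    log = [
        "[2022-02-02T02:30:10.097604Z] [warn] [pid:10722] Unable to wakeup scheduler: Request timeout",
        "[2022-02-01T15:38:12.282093Z] [error] [pid:28556] cmd returned 32768",
    ]
    for line in unreported(log):
        print(line)
```

Keeping the ticket number next to each pattern makes it possible to drop the entry again once the underlying issue is fixed.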

Related issues

Related to openQA Project - action #105930: o3 logreports - empty warnings/errors (New, 2022-02-03)

Related to openQA Project - action #105924: o3 logreports - Template was modified (Rejected, 2022-02-03)

Related to openQA Project - action #105921: o3 logreports - Cannot read symbolic link (/opt/openqa-trigger-from-obs/.../.run_last): No such file or directory (New, 2022-02-03)

Related to openQA Project - action #105918: o3 logreports - fatal: Invalid revision range sha1..sha2 (New, 2022-02-03)

Related to openQA Project - action #105915: o3 logreports - Needle file <filename>.json not found within /var/.../opensuse/needles (New, 2022-02-03)

Related to openQA Project - action #105909: o3 logreports - Ignoring invalid group {"name":"123"} when creating new job (Resolved, 2022-02-03)

Related to openQA Project - action #105903: o3 logreports - Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left) (New, 2022-02-03)

Related to openQA Project - action #105900: o3 logreports - Unable to wakeup scheduler: Request timeout (New, 2022-02-03)

Related to openQA Project - action #106245: o3 logreports - Testsuite 'xyz' is invalid (Rejected)

Related to openQA Project - action #106613: o3 logreports DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks row not found (Workable, 2022-02-10)

Related to openQA Infrastructure - action #106880: Job template name ... is already used in job group error logged on o3 size:M (Resolved, 2022-02-16)

Related to openQA Infrastructure - action #107023: cmd returned 31744 repeatedly reported on o3 (New, 2022-02-03)

Copied from openQA Infrastructure - action #95293: Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M (Resolved, 2021-07-09)

Copied to openQA Infrastructure - action #106756: cmd returned 32768 repeatedly reported on o3 (New, 2022-02-03)

Copied to openQA Project - action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M (Feedback, 2022-02-03)

Copied to openQA Infrastructure - action #106760: DBI Exception: DBD::Pg::st execute failed: number of parameters must be between 0 and 65535 repeatedly reported on o3 (New, 2022-02-03)

Copied to openQA Infrastructure - action #108533: o3 logreports DBI Exception: DBD::Pg::st execute failed: ERROR: invalid input syntax for type integer (Resolved, 2022-03-31)

History

#1 Updated by cdywan 5 months ago

  • Copied from action #95293: Monitoring alerts on errors in logs on o3 (was: followup to: error on "Next & previous results": ajax error message and no results showing up) size:M added

#2 Updated by okurz 5 months ago

  • Priority changed from High to Urgent

+1

#3 Updated by cdywan 5 months ago

  • Subject changed from 4-7 logreport emails a day cause alert fatigue to 4-7 logreport emails a day cause alert fatigue size:M
  • Description updated (diff)
  • Status changed from New to Workable

#4 Updated by tinita 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita

#5 Updated by openqa_review 5 months ago

  • Due date set to 2022-02-17

Setting due date based on mean cycle time of SUSE QE Tools

#6 Updated by tinita 5 months ago

I tried to find out where /usr/local/bin/logwarn on o3 comes from, without success.

I also searched Redmine but found nothing about how it was installed.

#7 Updated by tinita 5 months ago

ls -l /usr/local/bin/logwarn
-rwxr-xr-x 1 okurz users 60992 Oct 31  2016 /usr/local/bin/logwarn

Looks to me like it was installed manually from a release tarball from code.google.com.
https://github.com/archiecobbs/logwarn/blob/master/CHANGES#L30

Version 1.0.16 Released November 21, 2021
...
...
...
Version 1.0.12 Released May 24, 2016

...
    - Moved project hosting from Google code to GitHub

Time to update I guess?

% zypper info logwarn

Information for package logwarn:
--------------------------------
Repository     : openSUSE-Leap-15.3-Oss
Name           : logwarn
Version        : 1.0.14-bp153.1.13

#8 Updated by tinita 5 months ago

#9 Updated by tinita 5 months ago

Started adding entries to the blocklist and made testing a bit more user-friendly.

#10 Updated by openqa_review 5 months ago

Setting due date based on mean cycle time of SUSE QE Tools

#11 Updated by tinita 5 months ago

Created 3 PRs:
https://github.com/os-autoinst/openqa-logwarn/pull/14 Improve unit tests
https://github.com/os-autoinst/openqa-logwarn/pull/15 Update logwarn
https://github.com/os-autoinst/openqa-logwarn/pull/16 Add new things to blocklist

I left out only #105918, because it looks like a real error that should be fixed.

#12 Updated by openqa_review 5 months ago

Setting due date based on mean cycle time of SUSE QE Tools

#13 Updated by okurz 5 months ago

  • Related to action #105930: o3 logreports - empty warnings/errors added

#14 Updated by okurz 5 months ago

  • Related to action #105924: o3 logreports - Template was modified added

#15 Updated by okurz 5 months ago

  • Related to action #105921: o3 logreports - Cannot read symbolic link (/opt/openqa-trigger-from-obs/.../.run_last): No such file or directory added

#16 Updated by okurz 5 months ago

  • Related to action #105918: o3 logreports - fatal: Invalid revision range sha1..sha2 added

#17 Updated by okurz 5 months ago

  • Related to action #105915: o3 logreports - Needle file <filename>.json not found within /var/.../opensuse/needles added

#18 Updated by okurz 5 months ago

  • Related to action #105909: o3 logreports - Ignoring invalid group {"name":"123"} when creating new job added

#19 Updated by okurz 5 months ago

  • Related to action #105903: o3 logreports - Publishing opensuse.openqa.job.restart failed: Connect timeout (9 attempts left) added

#20 Updated by okurz 5 months ago

  • Related to action #105900: o3 logreports - Unable to wakeup scheduler: Request timeout added

#21 Updated by okurz 5 months ago

  • Due date set to 2022-02-17

I moved all subtasks out. Adding them to the blocklist is part of this ticket; solving them shouldn't be. Now we can set a due date and resolve the ticket as soon as the ACs are covered.

#22 Updated by tinita 5 months ago

  • Related to action #106245: o3 logreports - Testsuite 'xyz' is invalid added

#23 Updated by tinita 5 months ago

  • Status changed from In Progress to Resolved

So far there have been no more alert emails.
If new ones show up, create a ticket and add the message to the blocklist in https://github.com/os-autoinst/openqa-logwarn

#24 Updated by cdywan 4 months ago

  • Copied to action #106756: cmd returned 32768 repeatedly reported on o3 added

#25 Updated by cdywan 4 months ago

  • Copied to action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added

#26 Updated by cdywan 4 months ago

  • Copied to action #106760: DBI Exception: DBD::Pg::st execute failed: number of parameters must be between 0 and 65535 repeatedly reported on o3 added

#27 Updated by tinita 4 months ago

  • Related to action #106613: o3 logreports DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks row not found added

#28 Updated by tinita 4 months ago

  • Related to action #106880: Job template name ... is already used in job group error logged on o3 size:M added

#29 Updated by tinita 4 months ago

  • Related to action #107023: cmd returned 31744 repeatedly reported on o3 added

#30 Updated by tinita 3 months ago

  • Copied to action #108533: o3 logreports DBI Exception: DBD::Pg::st execute failed: ERROR: invalid input syntax for type integer added
