action #120744

closed

[alert] QA-Power8-5-kvm: Too many Minion job failures alert size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-11-18
Due date: 2022-12-09
% Done: 0%
Estimated time:

Description

Observation

Too many Minion jobs have failed on QA-Power8-5-kvm.

To review the failed jobs, first tunnel the worker's Minion dashboard:

    ssh -L 9530:localhost:9530 -N QA-Power8-5-kvm

Then open http://localhost:9530/minion/jobs?state=failed. Create a ticket if one does not already exist.

For the general log of the Minion job queue, check:

    journalctl -u openqa-worker-cacheservice.service -u openqa-worker-cacheservice-minion.service

To remove all failed jobs on the machine:

    /usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }'

Metric name    Value
Failed         101.000

http://stats.openqa-monitor.qa.suse.de/d/WDQA-Power8-5-kvm/worker-dashboard-qa-power8-5-kvm?tab=alert&viewPanel=65104&orgId=1

Acceptance criteria

  • AC1: The cause of sqlite lock errors is known

Rollback steps

  • Unpause alert "QA-Power8-5-kvm: Too many Minion job failures alert"

Suggestions

  • Consider implementing a retry with exponential backoff
  • Exit code 11 indicates a segmentation fault (SIGSEGV), which suggests the crash originates in a C dependency
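
The retry suggestion above could look roughly like the following. This is a minimal Python sketch for illustration only, not the openQA cache service code (which is Perl). The helper name `retry_with_backoff`, its parameters and defaults, and the choice to treat only errors whose message contains "locked" as transient are all assumptions.

```python
import sqlite3
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation(); on an SQLite lock error, wait and try again.

    The wait doubles after every failed attempt (base_delay, 2 * base_delay,
    4 * base_delay, ...) and is capped at max_delay. All names and defaults
    are hypothetical and chosen for this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except sqlite3.OperationalError as error:
            # Assumption: lock contention shows up as a message containing
            # "locked" (e.g. "database is locked"). Any other error, or a
            # lock error on the last attempt, is re-raised to the caller.
            if "locked" not in str(error) or attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)


def demo():
    """Simulate an operation that hits a lock twice and then succeeds."""
    calls = {"count": 0}
    delays = []  # recorded instead of actually sleeping

    def flaky_insert():
        calls["count"] += 1
        if calls["count"] < 3:
            raise sqlite3.OperationalError("database is locked")
        return "ok"

    result = retry_with_backoff(flaky_insert, sleep=delays.append)
    return result, calls["count"], delays


if __name__ == "__main__":
    print(demo())
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. A production variant would probably add random jitter to the delay so that several workers do not retry in lockstep.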

Files

cache-service-sqlite-test.t (593 Bytes) mkittler, 2022-11-25 15:58