action #120744
Updated by livdywan about 2 years ago
## Observation

> Too many Minion jobs have failed on QA-Power8-5-kvm. Review failed jobs on http://localhost:9530/minion/jobs?state=failed after tunneling the worker's Minion dashboard via `ssh -L 9530:localhost:9530 -N QA-Power8-5-kvm`. Create a ticket if there is not already one. For the general log of the Minion job queue, check out `journalctl -u openqa-worker-cacheservice.service -u openqa-worker-cacheservice-minion.service`.
>
> To remove all failed jobs on the machine:
> ```
> /usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }'
> ```

| Metric name | Value |
|-------------|---------|
| Failed | 101.000 |

http://stats.openqa-monitor.qa.suse.de/d/WDQA-Power8-5-kvm/worker-dashboard-qa-power8-5-kvm?tab=alert&viewPanel=65104&orgId=1

## Acceptance criteria

- **AC1**: The cause of sqlite lock errors is known

## Rollback steps

* Unpause alert "QA-Power8-5-kvm: Too many Minion job failures alert"

## Suggestions

- Consider implementing a retry with exponential backoff
- Exit code 11 likely indicates a segmentation fault (SIGSEGV is signal 11), suggesting the failure originates in a C dependency
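
As a rough illustration of the retry suggestion, here is a minimal sketch of exponential backoff around an operation that can hit an SQLite lock. It is written in Python only to keep it self-contained; the cache service itself is Perl, and this is not its implementation. The function name `retry_with_backoff`, its parameters, and the choice to treat any `sqlite3.OperationalError` whose message contains "locked" as transient are all assumptions made for the example.

```python
import random
import sqlite3
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on SQLite lock errors (hypothetical helper).

    The upper bound of the wait doubles after each failed attempt and is
    capped at max_delay. The actual wait is drawn uniformly from
    [0, bound] so that concurrent workers do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except sqlite3.OperationalError as err:
            # Assumption: only lock contention is transient; any other
            # error, or a lock error on the last attempt, is re-raised.
            is_lock_error = "locked" in str(err).lower()
            if not is_lock_error or attempt == max_attempts - 1:
                raise
            bound = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, bound))
```

The `sleep` parameter exists only so the waiting can be replaced in tests. Note that a retry would only mask lock contention; it would not help if the jobs are in fact crashing with a segmentation fault.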