action #120744

Updated by livdywan about 2 years ago

## Observation 
 > Too many Minion jobs have failed on QA-Power8-5-kvm.
 >
 > Review failed jobs on http://localhost:9530/minion/jobs?state=failed after tunneling the worker's Minion dashboard via `ssh -L 9530:localhost:9530 -N QA-Power8-5-kvm`. Create a ticket if there is not already one.
 >
 > For the general log of the Minion job queue, check out `journalctl -u openqa-worker-cacheservice.service -u openqa-worker-cacheservice-minion.service`.
 >
 > To remove all failed jobs on the machine:
 >
 > ```
 > /usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }'
 > ```

 | Metric name | Value |
 | --- | --- |
 | Failed | 101.000 |

 http://stats.openqa-monitor.qa.suse.de/d/WDQA-Power8-5-kvm/worker-dashboard-qa-power8-5-kvm?tab=alert&viewPanel=65104&orgId=1 

 ## Acceptance criteria 
 - **AC1**: The cause of sqlite lock errors is known 

 ## Rollback steps 
 * Unpause alert "QA-Power8-5-kvm: Too many Minion job failures alert" 

 ## Suggestions 
 - Consider implementing a retry with exponential backoff 
 - Signal 11 is SIGSEGV (a segfault); if the reported 11 is a signal number rather than a plain exit status, this suggests a crash in a C dependency
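
 The retry suggestion above could look roughly like the following. This is a minimal sketch in Python for illustration only (openQA itself is written in Perl), and it assumes that the failing operation is safe to repeat and that the lock errors surface as SQLite "database is locked" or "busy" messages. The names `retry_with_backoff` and `is_lock_error` are hypothetical and do not exist in openQA or Minion.

```python
import sqlite3
import time


def is_lock_error(exc):
    """Return True if the error message looks like a transient SQLite lock error."""
    msg = str(exc).lower()
    return "locked" in msg or "busy" in msg


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation() and retry on lock errors.

    The wait before retry number n (counting from 0) is
    base_delay * 2**n seconds, capped at max_delay.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            # Errors that are not lock-related are re-raised immediately,
            # and so is the lock error from the final attempt.
            if not is_lock_error(exc) or attempt == max_attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))


if __name__ == "__main__":
    # Usage example against an in-memory database (hypothetical table "t").
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    retry_with_backoff(lambda: conn.execute("INSERT INTO t VALUES (1)"))
    print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])
```

 Capping the delay keeps a long outage from producing unbounded waits. Adding random jitter to each delay would be a reasonable extension if several workers are expected to retry at the same moment. Note that a retry only masks the symptom; AC1 still requires finding the cause of the lock errors.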
