action #76828
closed
big job queue for ppc as powerqaworker-qam-1.qa and malbec.arch and qa-power8-5-kvm were not active
0%
Description
Observation
We have reached 10k scheduled jobs on osd in https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1603848964424&to=1603970807972 . I don't know if this is good or bad :D
https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1603963753871&to=1603970807972 shows that the main problem right now is ppc64le, also visible in https://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?orgId=1&fullscreen&panelId=4&from=1604107477948&to=1604135288558 .
Updated by okurz almost 4 years ago
- Related to action #73633: OSD partially unresponsive, triggering 500 responses, spotty response visible in monitoring panels but no alert triggered (yet) added
Updated by okurz almost 4 years ago
- Due date set to 2020-11-03
- Status changed from In Progress to Feedback
First I called power reset over IPMI for qa-power8-5-kvm.qa, then called ipmi-fsp1-malbec.arch power reset, waited for malbec.arch to come up, ensured services were properly started and monitored until openQA jobs were picked up. For powerqaworker-qam-1 I also proceeded in #68053.
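The reset-and-wait procedure could be scripted roughly like this. It is only a sketch under assumptions: it calls plain ipmitool over the lanplus interface rather than the ipmi-fsp1-malbec.arch wrapper used above, the BMC host name and the IPMI_USER / IPMI_PASSWORD variables are hypothetical, and "machine is up" is approximated by a successful SSH login.

```shell
#!/bin/sh
# Sketch only: BMC host names, credentials and timeouts are assumptions,
# not taken from the actual OSD setup.
IPMITOOL="${IPMITOOL:-ipmitool}"   # overridable, e.g. IPMITOOL=echo for a dry run

# Power-reset a machine through its BMC.
# -E makes ipmitool read the password from the exported IPMI_PASSWORD variable.
ipmi_reset() {
    bmc_host=$1
    "$IPMITOOL" -I lanplus -H "$bmc_host" -U "$IPMI_USER" -E power reset
}

# Retry a command until it succeeds or the attempts run out.
# usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (hypothetical BMC_HOST and WORKER_HOST values):
#   ipmi_reset "$BMC_HOST" &&
#       wait_until 60 10 ssh -o BatchMode=yes -o ConnectTimeout=5 "$WORKER_HOST" true
```

With IPMITOOL=echo the reset is only printed, which is handy for checking the arguments before pointing it at a real BMC.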
Updated by okurz almost 4 years ago
- Due date deleted (2020-11-03)
- Status changed from Feedback to In Progress
- Priority changed from Urgent to High
qa-power8-5-kvm.qa again showed problems; I commented in https://progress.opensuse.org/issues/76792#change-346096 on what I did. The queue of jobs on https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test is near-empty now, but again I can reach neither powerqaworker-qam-1.qa nor malbec.arch right now, which https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview confirms :(
Had a quick chat with nsinger, thanks for the quick reaction! nsinger confirmed my observation that when a machine is not reachable, sol activate may not show anything, which is likely when the machine has crashed and no longer outputs anything on the serial console.
Triggered a reset of malbec.arch and after boot did sudo systemctl restart var-lib-openqa-share.mount on the machine. Machine is back and working on jobs. Did not care about powerqaworker-qam-1.qa for now.
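The post-boot step could be wrapped in a small check so the unit is only restarted when the share is actually missing. This is a sketch, not what was run here: it assumes the share belongs at /var/lib/openqa/share (the path that the unit name var-lib-openqa-share.mount encodes) and that mountpoint from util-linux is available; the SYSTEMCTL override is a hypothetical convenience for dry runs.

```shell
#!/bin/sh
# Sketch only: unit name taken from the ticket, path derived from the unit name.
SHARE_UNIT=var-lib-openqa-share.mount
SHARE_PATH=/var/lib/openqa/share
SYSTEMCTL="${SYSTEMCTL:-sudo systemctl}"   # overridable, e.g. SYSTEMCTL=echo

# Restart the mount unit only if the share is not currently mounted.
ensure_share_mounted() {
    if mountpoint -q "$SHARE_PATH"; then
        echo "$SHARE_PATH already mounted"
    else
        # Intentionally unquoted so "sudo systemctl" splits into two words.
        $SYSTEMCTL restart "$SHARE_UNIT"
    fi
}

# Example: run on the worker after it has booted.
#   ensure_share_mounted
```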
Updated by okurz almost 4 years ago
- Status changed from In Progress to Resolved
The job queue has decreased enough that this isn't a problem anymore. Currently malbec.arch and qa-power8-5-kvm.qa are both still up, and for powerqaworker-qam-1.qa we have a separate ticket anyway.