action #76828

big job queue for ppc as powerqaworker-qam-1.qa and malbec.arch and qa-power8-5-kvm were not active

Added by okurz 9 months ago. Updated 9 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2020-10-31
Due date:
% Done: 0%
Estimated time:

Related issues

Related to openQA Infrastructure - action #73633: OSD partially unresponsive, triggering 500 responses, spotty response visible in monitoring panels but no alert triggered (yet) - Resolved - 2020-10-20 - 2020-11-17

History

#1 Updated by okurz 9 months ago

  • Related to action #73633: OSD partially unresponsive, triggering 500 responses, spotty response visible in monitoring panels but no alert triggered (yet) added

#2 Updated by okurz 9 months ago

  • Due date set to 2020-11-03
  • Status changed from In Progress to Feedback

First I triggered a power reset over IPMI for qa-power8-5-kvm.qa, then called ipmi-fsp1-malbec.arch power reset, waited for malbec.arch to come up, ensured services were properly started and monitored until openQA jobs were picked up. For powerqaworker-qam-1 I also proceeded in #68053.
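For reference, the reset-and-wait procedure could be scripted roughly as follows. This is only a sketch under assumptions: it treats ipmi-fsp1-malbec.arch as the BMC host name, uses a hypothetical ADMIN default user and a hypothetical DRY_RUN switch, reads the password from the IPMI_PASSWORD environment variable via -E, and assumes Linux ping options. The exact invocation used on our machines may differ.

```shell
#!/bin/sh
# Sketch only: BMC host names, the ADMIN default user and DRY_RUN are assumptions.
# The password is read from the IPMI_PASSWORD environment variable (-E).

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hard-reset a machine through its BMC.
ipmi_power_reset() {
    bmc_host=$1
    run ipmitool -I lanplus -H "$bmc_host" -U "${IPMI_USER:-ADMIN}" -E chassis power reset
}

# Poll until the host answers ping; give up after the given number of attempts.
wait_for_host() {
    host=$1
    attempts=${2:-60}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 5
    done
    return 1
}
```

Possible usage: ipmi_power_reset ipmi-fsp1-malbec.arch && wait_for_host malbec.arch. Setting DRY_RUN=1 beforehand only prints the ipmitool command, which is handy for checking the arguments before touching a production worker.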

#3 Updated by okurz 9 months ago

  • Due date deleted (2020-11-03)
  • Status changed from Feedback to In Progress
  • Priority changed from Urgent to High

qa-power8-5-kvm.qa showed problems again; I commented in https://progress.opensuse.org/issues/76792#change-346096 on what I did. The job queue on https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test is near-empty now, but again I can't reach powerqaworker-qam-1.qa right now, nor malbec.arch, which https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview confirms :(

Had a quick chat with nsinger, thanks for the quick reaction! nsinger confirmed my observation that when a machine is not reachable, sol activate may not show anything. This is likely the case when the machine has crashed and no longer outputs anything on the serial console.
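A minimal sketch of opening the serial-over-LAN console for such a check, assuming ipmitool with the lanplus interface. The BMC host name, the ADMIN default user and the DRY_RUN switch are assumptions for illustration. Deactivating a possibly stale session first is a common precaution, not something we verified is needed here.

```shell
#!/bin/sh
# Sketch only: the BMC host name, the ADMIN default user and DRY_RUN are assumptions.
# The password is read from the IPMI_PASSWORD environment variable (-E).

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Call ipmitool against a BMC: first argument is the host, the rest is the subcommand.
ipmi() {
    bmc_host=$1
    shift
    run ipmitool -I lanplus -H "$bmc_host" -U "${IPMI_USER:-ADMIN}" -E "$@"
}

# Open the serial console; an error from closing a non-existent session is ignored.
open_sol() {
    ipmi "$1" sol deactivate || true
    ipmi "$1" sol activate
}
```

If the console stays silent after open_sol and the host does not answer on the network either, a crash is the likely explanation and a power reset is the next step.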

Triggered a reset of malbec.arch and after boot ran sudo systemctl restart var-lib-openqa-share.mount on the machine. The machine is back and working on jobs. Did not look into powerqaworker-qam-1.qa for now.
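The mount restart can be combined with a check that the share is really mounted afterwards. A minimal sketch: the unit name is the one used above, and by systemd's naming rules it corresponds to the mount point /var/lib/openqa/share. The helper names and the DRY_RUN switch are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are assumptions; the unit name is from this ticket.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Restart the mount unit, then verify the path is an active mount point.
restart_share_mount() {
    run sudo systemctl restart var-lib-openqa-share.mount &&
        run mountpoint -q /var/lib/openqa/share
}
```

restart_share_mount returns non-zero if either the restart or the mountpoint check fails, so it can gate any follow-up step such as restarting worker services.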

#4 Updated by okurz 9 months ago

  • Status changed from In Progress to Resolved

The job queue has decreased enough that this isn't a problem anymore. Currently malbec.arch is still up, as is qa-power8-5-kvm.qa, and for powerqaworker-qam-1.qa we have a separate ticket anyway.
