action #94456
closed
no data from any arm host on https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1
Added by okurz over 3 years ago.
Updated over 3 years ago.
I checked, and at least openqaworker-arm-1 is up and running openQA jobs just fine. Maybe this is related to #93922#note-10.
- Related to action #93922: grafana dashboard for "approximate result size by job group" fails to render any data with "InfluxDB Error: unsupported mean iterator type: *query.stringInterruptIterator" added
I've also noticed the lack of "ping data" on the new generic dashboards, see #91779#note-8.
martchus@openqa:~> sudo journalctl -fu telegraf
Jun 22 12:57:00 openqa telegraf[2034]: 2021-06-22T10:57:00Z E! [inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host
But no errors about other hosts.
- Priority changed from Urgent to Immediate
By now all three ARM workers are offline and are not recovered automatically: with no data there is no alert, and with no alert nothing is triggered in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines . Bumping prio. I have now manually triggered a power reset of openqaworker-arm-1 and openqaworker-arm-2; openqaworker-arm-3 is not reachable over IPMI. I do not know how to recover openqaworker-arm-3 other than a manual reset by EngInfra. I thought we had found a way to reset the complete chassis in #89815, but I can't reach IPMI -> #94513
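The failure chain described above (no data, so no alert, so no recovery trigger) can be illustrated with a minimal sketch that treats missing data the same as a failed host. This is not the actual Grafana alert rule or the webhook code; the function name `needs_recovery`, the 15-minute threshold and the example timestamps are all hypothetical and only show the idea.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_recovery(
    last_seen: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(minutes=15),  # hypothetical threshold
) -> bool:
    """Return True if a host should be treated as down.

    A host with no data point at all (last_seen is None) counts as down,
    so that "no data" cannot silently suppress the recovery trigger.
    """
    if last_seen is None:
        return True
    return now - last_seen > max_age


# Illustrative timestamps only, not taken from the real monitoring data.
now = datetime(2021, 6, 22, 11, 0, tzinfo=timezone.utc)
last_seen = {
    "openqaworker-arm-1": now - timedelta(minutes=2),
    "openqaworker-arm-2": now - timedelta(hours=3),
    "openqaworker-arm-3": None,  # no data received at all
}
stale = sorted(host for host, ts in last_seen.items() if needs_recovery(ts, now))
print(stale)
```

With a check like this, a host that stops reporting entirely ends up in the same list as a host whose data is merely old, so a recovery action could still be triggered for it.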
- Related to action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically added
- Copied to action #94513: openqaworker-arm-3 not reachable and not recoverable over usual ways added
- Status changed from Workable to In Progress
- Assignee set to mkittler
- Status changed from In Progress to Resolved
- Status changed from Resolved to Feedback
For an urgent or immediate ticket we should do more than just a single fix. Let's talk about it.
- Status changed from Feedback to Resolved
We cross-checked and found that only the ARM workers were affected in a problematic way. We worked on the ticket within 12h of it being set to "Immediate" priority. We also extended the wiki. Reminder to everyone: please make sure that urgent or immediate tickets are picked up as fast as possible, and get other people to help. Calling it "Resolved" now.