action #97502
closed
osd deployment failed due to openqaworker-arm-3 being down, needs to be worked around size:M
0%
Description
Updated by ilausuch over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to ilausuch
Updated by ilausuch over 3 years ago
- Status changed from In Progress to Blocked
I did the first two suggested steps, and the pipeline now passes the previously failing step: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines
At the time of writing this comment, openqaworker-arm-3 seems to be down, so the last step cannot be done until the worker comes back to life. I am changing the status of this ticket to Blocked and will write a comment in the other ticket, #97244.
Updated by okurz over 3 years ago
- Related to action #97244: openqaworker-arm-3 is offline and EngInfra wants us to create JiraSD tickets instead of infra size:M added
Updated by okurz over 3 years ago
- Priority changed from Urgent to Normal
Thanks. This addressed the urgency, hence reducing the priority.
Updated by ilausuch over 3 years ago
I checked today, and openqaworker-arm-3 still seems to be down.
Updated by ilausuch over 3 years ago
- Assignee deleted (ilausuch)
I removed myself from this ticket because I won't be working in this group for a period
Updated by okurz over 3 years ago
- Status changed from Blocked to In Progress
- Assignee set to okurz
Now that #97244 is resolved, we can continue here. I triggered an explicit manual pipeline run for openqaworker-arm-3 in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/589516 which actually recovered the machine. Now I am re-adding the salt key and applying the current high state.
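For reference, re-adding the key and applying the high state could be sketched roughly as below. This is only an illustration, not the exact commands that were run: the minion ID `openqaworker-arm-3.suse.de` is an assumption, and the helper name `salt_recovery_commands` is hypothetical. The commands are meant to be run on the salt master.

```python
import shlex
from typing import List


def salt_recovery_commands(minion: str) -> List[List[str]]:
    """Return the commands to re-accept a minion key and apply its high state.

    The function only builds the argument lists; nothing is executed here.
    """
    return [
        # Accept the pending key of the minion without an interactive prompt.
        ["salt-key", "--yes", "--accept", minion],
        # Apply the full high state to that minion.
        ["salt", minion, "state.apply"],
    ]


if __name__ == "__main__":
    # Assumed minion ID, for illustration only.
    for cmd in salt_recovery_commands("openqaworker-arm-3.suse.de"):
        print(shlex.join(cmd))
```

Printing the commands first, instead of running them directly, makes it easy to double-check the target before touching a production worker.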
Updated by okurz over 3 years ago
Tests are executed correctly. https://monitor.qa.suse.de/d/WDopenqaworker-arm-3/worker-dashboard-openqaworker-arm-3?orgId=1&refresh=1m&from=now-24h&to=now shows some data now, but https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?viewPanel=4&orgId=1 says that there is still no data. I restarted the services "influxdb grafana telegraf" on monitor.qa, but nothing changed. I assume that here, as in the other ticket where I struggled, the culprit is actually the telegraf service on the webUI host, because that one does the actual ping to each host. So I restarted the service on osd; let's see. If this works, I should ensure that the telegraf service restarts on config file changes, because I assume a missing restart is the cause both for the latest postgresql-related changes and here for the host.
EDIT: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/580 created for the telegraf service. https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?editPanel=4&viewPanel=4&orgId=1 shows data again, same as https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1
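The idea of restarting telegraf whenever its config file changes can be illustrated with a small sketch. This is not the content of the merge request, which handles it in the salt states; it only shows the underlying check in plain Python. The function names and the state file holding the last seen checksum are hypothetical.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Callable


def restart_telegraf() -> None:
    """Restart the telegraf systemd unit (requires sufficient privileges)."""
    subprocess.run(["systemctl", "restart", "telegraf"], check=True)


def restart_if_changed(
    config: Path,
    state_file: Path,
    restart: Callable[[], None] = restart_telegraf,
) -> bool:
    """Restart the service if the config differs from the last recorded checksum.

    Returns True if a restart was triggered. On the very first run there is
    no recorded checksum yet, so a restart is triggered as well.
    """
    current = hashlib.sha256(config.read_bytes()).hexdigest()
    previous = state_file.read_text().strip() if state_file.exists() else None
    if current == previous:
        return False
    restart()
    # Record the checksum only after the restart succeeded.
    state_file.write_text(current + "\n")
    return True
```

A call such as `restart_if_changed(Path("/etc/telegraf/telegraf.conf"), Path("/var/tmp/telegraf.sha256"))` would then be a no-op as long as the config is unchanged; both paths here are assumptions for illustration.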
Updated by okurz over 3 years ago
- Status changed from Feedback to Resolved
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/580 merged and deployed. openqaworker-arm-3 is fully back in production and working on jobs, e.g. https://openqa.suse.de/tests/7160099