action #97364

closed

openqaworker-arm-2 and openqaworker-arm-3 seem to be offline, alerts had been triggered size:S

Added by okurz about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2021-08-23
Due date:
% Done: 0%
Estimated time:
Description

Observation

https://monitor.qa.suse.de/alerting/list?state=not_ok
shows

[openqa] openqaworker-arm-2 online (long-time) alert
ALERTING for 4 days
[openqa] openqaworker-arm-3 online (long-time) alert
ALERTING for 5 days

Related issues: 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #97244: openqaworker-arm-3 is offline and EngInfra wants us to create JiraSD tickets instead of infra size:M | Resolved | dheidler | 2021-08-19 | 2021-09-17

Related to openQA Infrastructure - action #113561: failed pipelines for openQABot and bot-ng because of an expired cert | Resolved | okurz | 2022-07-13

Actions #1

Updated by okurz about 3 years ago

  • Related to action #97244: openqaworker-arm-3 is offline and EngInfra wants us to create JiraSD tickets instead of infra size:M added
Actions #2

Updated by nicksinger about 3 years ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #3

Updated by okurz about 3 years ago

  • Subject changed from openqaworker-arm-2 and openqaworker-arm-3 seem to be offline, alerts had been triggered to openqaworker-arm-2 and openqaworker-arm-3 seem to be offline, alerts had been triggered size:S

Discussed in the daily. Out of scope: changing automatic ticket creation. In scope: please take a short look at the pipeline to find out why power cycling over GitLab did not work.

Actions #4

Updated by nicksinger about 3 years ago

  • Status changed from In Progress to Resolved

Both workers look good again after a manual reboot and show up as "online" in grafana: https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1

The pipeline was triggered for both machines. The execution for arm-2 failed due to a CI issue, which I think is something we can't really change:

ERROR: Job failed (system failure): prepare environment: error sending request: Post "https://caasp-master.suse.de:6443/api/v1/namespaces/gitlab/pods/runner-h1wecofv-project-4652-concurrent-0l8dw9/attach?container=helper&stdin=true": dial tcp: lookup caasp-master.suse.de on [2620:113:80c0:8080:10:160:2:88]:53: server misbehaving. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

(https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/536479#L13)
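The error above is a DNS lookup failure: the runner could not resolve caasp-master.suse.de. As a minimal sketch, assuming Python is available on a host in the same network, a check like the following could help tell a transient resolver problem from a persistent one. The function name `can_resolve` and the injectable `resolver` parameter are illustrative and not part of the pipeline.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves, False if the resolver reports an error.

    `resolver` defaults to socket.getaddrinfo; it is injectable (hypothetical
    parameter, for illustration) so the check can be exercised without network.
    """
    try:
        # The port only shapes the returned address tuples; 443 is arbitrary.
        resolver(hostname, 443)
    except socket.gaierror:
        # Raised by getaddrinfo for address-related errors, e.g. failed lookups.
        return False
    return True
```

Calling `can_resolve("caasp-master.suse.de")` a few times from the affected network would show whether the lookup fails consistently or only occasionally.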

For arm-3 I created https://progress.opensuse.org/issues/97382 to fix our pipeline.

Actions #5

Updated by jbaier_cz over 2 years ago

  • Related to action #113561: failed pipelines for openQABot and bot-ng because of an expired cert added