action #106771

closed

imagetester missing in action

Added by favogt almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2022-02-14
Due date:
% Done: 0%
Estimated time:

Description

Observation

imagetester is unreachable over the network.

According to ariel, the last sign of life was

Feb 08 06:44:06 ariel dnsmasq-dhcp[2018]: DHCPACK(eth1) 192.168.112.5 00:25:90:a2:ba:92 imagetester
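To repeat this kind of lookup, the last DHCPACK for a host can be pulled out of the dnsmasq log with a few lines of Python. This is only a sketch: it assumes log lines in exactly the format quoted above, and the names `DHCPACK_RE` and `last_dhcpack` are made up for illustration.

```python
import re
from typing import Dict, Iterable, Optional

# Matches syslog-style dnsmasq lines like the one quoted in this ticket.
# Assumption: the lease line ends with the client hostname.
DHCPACK_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ dnsmasq-dhcp\[\d+\]: "
    r"DHCPACK\((?P<iface>[^)]+)\) (?P<ip>\S+) (?P<mac>\S+) (?P<hostname>\S+)\s*$"
)


def last_dhcpack(lines: Iterable[str], hostname: str) -> Optional[Dict[str, str]]:
    """Return the fields of the last DHCPACK line for `hostname`, or None."""
    last = None
    for line in lines:
        match = DHCPACK_RE.match(line.strip())
        if match and match.group("hostname") == hostname:
            last = match.groupdict()  # later lines overwrite earlier ones
    return last
```

The function accepts any iterable of lines, so it can be fed an open log file or the captured output of a journal query.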

Rollback steps


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #106751: Update machines and passwords in the monitor-o3 repository | Resolved | nicksinger | 2022-02-14 | 2022-03-01
Related to openQA Infrastructure (public) - action #135137: Bring back imagetester size:M | Resolved | okurz | 2023-09-04

Actions #1

Updated by okurz almost 3 years ago

  • Priority changed from Normal to High
  • Target version set to Ready

Especially important for me is how we missed the machine being down. Well, OK, I know the answer: we don't have alerting for that :) I guess we could try very simple monitoring with a cron job on ariel pinging machines.
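A minimal sketch of what such a ping check could look like, assuming a Linux `ping` from iputils (`-c` for count, `-W` for the timeout in seconds). The function names and the host list are hypothetical and not taken from any existing setup.

```python
import subprocess
import sys
from typing import Callable, Iterable, List


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if one ICMP echo request to `host` is answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_hosts(
    hosts: Iterable[str], ping: Callable[[str], bool] = ping_once
) -> List[str]:
    """Return the hosts that did not answer, preserving input order."""
    return [host for host in hosts if not ping(host)]


def main(hosts: Iterable[str]) -> int:
    """Print unreachable hosts to stderr; return non-zero if any are down."""
    down = unreachable_hosts(hosts)
    for host in down:
        print(f"unreachable: {host}", file=sys.stderr)
    return 1 if down else 0
```

A cron entry could call `main(["imagetester"])` (the host list is an assumption) and rely on cron mailing any output, provided mail delivery is configured on the machine. The `ping` parameter is injectable so the logic can be tested without touching the network.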

Actions #2

Updated by mkittler almost 3 years ago

I couldn't revive it via ipmi:

ipmitool -I lanplus -C 3 -H 10.160.65.195 -U … -P … power cycle
Error: Unable to establish IPMI v2 / RMCP+ session
Actions #3

Updated by mkittler almost 3 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #4

Updated by mkittler almost 3 years ago

  • Status changed from In Progress to Feedback
Actions #5

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Blocked

As we can reach the EngInfra ticket, let's treat this one as blocked, ok?

Actions #6

Updated by okurz almost 3 years ago

  • Description updated (diff)

okurz wrote:

especially important for me is how we missed the machine being down. Well, ok, I know the answer, because we don't have alerting for that :) I guess we could try very simple monitoring with a cron job on ariel pinging machines.

That has meanwhile been addressed with #106751, although it means we need to pause the scheduled CI jobs and unpause them as soon as imagetester is back. I added a "Rollback steps" section to the ticket accordingly.

Actions #7

Updated by okurz almost 3 years ago

  • Related to action #106751: Update machines and passwords in the monitor-o3 repository added
Actions #8

Updated by okurz almost 3 years ago

  • Status changed from Blocked to Resolved

SD ticket resolved. The machine is up again. I enabled the pipeline again and triggered a run, and it's green. imagetester is working on openQA jobs again. Should be OK now.

Actions #9

Updated by okurz over 1 year ago
