Related to QA - action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M (Resolved, ybonatakis)
Related to openQA Infrastructure - action #159318: openqa-piworker host up alert (Resolved, nicksinger, 2023-08-09)
Related to openQA Infrastructure - action #159555: IPMI access over IPv6 doesn't work on imagetester - try to update BIOS with physical access size:S (Resolved, 2024-04-24)
Related to openQA Infrastructure - action #41882: all arm worker die after some time (Resolved, 2018-10-02)
Related to openQA Infrastructure - action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically (Resolved, mkittler, 2021-03-10, 2021-04-22)
Related to openQA Infrastructure - action #95482: openqaworker-arm-3 offline and not automatically recovered due to gitlab CI failures (Resolved, 2021-07-14)
Related to openQA Infrastructure - action #107074: error on openqaworker-arm-2 failing osd-deployment size:M (Resolved, mkittler, 2022-02-18)
Related to openQA Infrastructure - action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M (Rejected, 2023-11-28)

Updated by ybonatakis 3 months ago

And this also breaks https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2510604

Updated by tinita 3 months ago

Target version changed from Tools - Next to Ready

Updated by okurz 3 months ago

Related to action #159303: [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S added

Updated by okurz 3 months ago

Related to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added

Updated by okurz 3 months ago

Subject changed from openqaworker-arm-1 is Unreachable to openqaworker-arm-1 is Unreachable size:S
Description updated (diff)
Status changed from New to Workable
Assignee set to ybonatakis

Updated by ybonatakis 3 months ago · Edited

Status changed from Workable to Feedback

Updated by ybonatakis 3 months ago

Status changed from Feedback to Resolved

It is also possible to ssh into it.

Updated by okurz 3 months ago

Status changed from Resolved to In Progress

@ybonatakis you leaked the IPMI passwords in #159270-6. I deleted that comment. Now please update all IPMI passwords as documented in https://gitlab.suse.de/openqa/salt-pillars-openqa/#ipmi-passwords . Please use a pronounceable password. I suggest coming up with a good password based on https://github.com/okurz/scripts/blob/master/xkcdpass-two-word

Updated by okurz 3 months ago

Related to action #159318: openqa-piworker host up alert added

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/789/diffs
Also updated https://gitlab.suse.de/openqa/grafana-webhook-actions/-/settings/ci_cd
https://gitlab.suse.de/openqa/monitor-o3/-/settings/ci_cd

#10

Updated by ybonatakis 3 months ago

Status changed from In Progress to Feedback

#11

Updated by livdywan 3 months ago

Status changed from Feedback to In Progress

@ybonatakis Please remember that you need to address the urgency of the ticket or resolve it immediately.

#12

Updated by ybonatakis 3 months ago

but https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/789 is still open

#13

Updated by livdywan 3 months ago

ybonatakis wrote in #note-12:

but https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/789 is still open

Did you see the unanswered questions there?

#14

Updated by openqa_review 3 months ago

Due date set to 2024-05-08

Setting due date based on mean cycle time of SUSE QE Tools

#15

Updated by okurz 3 months ago · Edited

I found the following entries with inconsistencies:

malbec, you did not change the password, please revert
imagetester, should be changed
storage, should be changed
kerosene, should be changed
openqaworker{20..28}, should be changed
openqaworker-arm{21..22}, should be changed

#16

Updated by okurz 3 months ago

Related to action #159555: IPMI access over IPv6 doesn't work on imagetester - try to update BIOS with physical access size:S added

#17

Updated by okurz 3 months ago

Parent task set to #129280

#18

Updated by ybonatakis 3 months ago

Status changed from In Progress to Blocked

waiting for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5032

#19

Updated by ybonatakis 3 months ago

Status changed from Blocked to In Progress

merged

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/794

#20

Updated by ybonatakis 3 months ago

Due date deleted (~~2024-05-08~~)

Still waiting to get ssh access to update the remaining entries:
openqaworker{20..28}
openqaworker-arm{21..22}

#21

Updated by nicksinger 3 months ago

I addressed your question in https://suse.slack.com/archives/C02AJ1E568M/p1713963938332169?thread_ts=1713940157.475059&cid=C02AJ1E568M and deleted the stale alerts with:

sqlite3 /var/lib/grafana/grafana.db "$(for RULEID in DzAhcifVk dzA25mfVk Fk0h5iBVk Sk02ciBVk VzA2cif4zz; do echo -n "delete from alert_rule where uid = '$RULEID'; delete from alert_rule_version where rule_uid = '$RULEID'; delete from provenance_type where record_key = '$RULEID'; delete from annotation where text like '%$RULEID%';"; done)"
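
A possible variation on the one-liner above, as a sketch only: print the statements first so they can be reviewed before they touch the database. The function name `alert_cleanup_sql` is hypothetical; the table and column names are copied from the command above and not verified against the Grafana schema.

```shell
# Hypothetical helper: print the cleanup statements for each given
# alert rule UID, one statement per line, without executing anything.
alert_cleanup_sql() {
    for uid in "$@"; do
        printf "delete from alert_rule where uid = '%s';\n" "$uid"
        printf "delete from alert_rule_version where rule_uid = '%s';\n" "$uid"
        printf "delete from provenance_type where record_key = '%s';\n" "$uid"
        # %% is a literal percent sign for the SQL LIKE pattern
        printf "delete from annotation where text like '%%%s%%';\n" "$uid"
    done
}

# Review the output, then pipe it into sqlite3 (left commented out here
# because it modifies the database):
# alert_cleanup_sql DzAhcifVk dzA25mfVk | sqlite3 /var/lib/grafana/grafana.db
```

Running the function without the pipe shows exactly what would be deleted for each UID.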

#22

Updated by openqa_review 3 months ago

Due date set to 2024-05-09

Setting due date based on mean cycle time of SUSE QE Tools

#23

Updated by ybonatakis 3 months ago

Status changed from In Progress to Feedback
Priority changed from Urgent to High

As the main reported issue has been resolved, I am lowering the priority.
What remains to be done is to update the passwords on
openqaworker{20..28} and openqaworker-arm{21..22}.
My ssh keys are not yet on oqa-jumpy.dmz-prg2.suse.org, so I will have to wait.

#24

Updated by okurz 3 months ago

Status changed from Feedback to Workable

Please create an IT ticket about getting your ssh key deployed on oqa-jumpy.

https://sd.suse.com/servicedesk/customer/portal/1/SD-155399

#25

Updated by ybonatakis 3 months ago

#26

Updated by nicksinger 3 months ago

@ybonatakis could you please share the SD ticket with our "OSD Admins" group so we can see the progress? Also, can you maybe ask someone from the team to make the required changes? I don't think it is useful if you have to wait several days for a response. Especially for a "Workable" ticket with "High" Priority…

#27

Updated by ybonatakis 3 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/800
The only machine whose password has not been updated yet is kerosene.

#28

Updated by ybonatakis 3 months ago

Status changed from Workable to Resolved

kerosene was updated as well.
https://suse.slack.com/archives/C02AJ1E568M/p1714466396069729

#29

Updated by okurz 3 months ago

Due date deleted (~~2024-05-09~~)

See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589906

#30

Updated by livdywan 2 months ago

Status changed from Resolved to Feedback

openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240510052716652055

Is this machine broken again?

#31

Updated by ybonatakis 2 months ago

I just checked the CI jobs and there have been no problems since the reopening.
That was 4 days ago. The machine was restarted 3 days ago:

 12:59:02  up 3 days 13:54,  1 user,  load average: 1.39, 1.49, 1.31

most likely by https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2596329
but I can't tell whether Oliver triggered this himself or whether it was automation.
Therefore I assume that the recovery works.

livdywan wrote in #note-30:

openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240510052716652055

Is this machine broken again?

#32

Updated by okurz 2 months ago · Edited

ybonatakis wrote in #note-31:

I just checked the CI jobs and there have been no problems since the reopening.
That was 4 days ago. The machine was restarted 3 days ago:
 12:59:02  up 3 days 13:54,  1 user,  load average: 1.39, 1.49, 1.31
most likely by https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2596329
but I can't tell whether Oliver triggered this himself or whether it was automation.
Therefore I assume that the recovery works.

I have not triggered a recovery manually recently so I concur, the automatic recovery works in principle. https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589906 failed as openqaworker-arm-1 did not respond to salt commands in time. That was also what we saw in https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589864#L326 in the pre-deploy check, which isn't fatal but just informative. I am not sure how we handled the temporary outages of openqaworker-arm-[123] in the past during osd-deployment. We might not really have seen the problem before and were possibly just lucky/unlucky? I will look into some older tickets to see if something comes up.

EDIT: Nothing came up from older tickets so except for "moar retries!!!1" covering the recovery periods of openqaworker-arm-1 I don't have further good ideas.

EDIT2: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589906 was running at 0727. There are already retries by gitlab with retry: 2, but those retries follow quickly one after another. The previous jobs were https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589904 at 0725 and https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2589872 at 0723, so the retries span only 4 minutes, which is not enough to allow openqaworker-arm-1 to be up again. We could try retries with customized retry periods. I will propose a solution.
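
To illustrate what retries with a customized retry period could look like, here is a minimal sketch assuming a shell-based check. The helper name `retry_with_delay` and the numbers in the usage comment are hypothetical and not taken from the osd-deployment repository or the eventual merge request.

```shell
# Hypothetical helper: run a command up to <attempts> times and wait
# <delay> seconds between attempts.
# Usage: retry_with_delay <attempts> <delay> <command> [args...]
retry_with_delay() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i/$attempts failed: $*" >&2
        # Only wait when another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (commented out, needs a salt master; the target is an assumption):
# retry_with_delay 5 120 salt 'openqaworker-arm-1*' test.ping
```

With 5 attempts and a delay of 120 seconds, the check would wait up to 8 minutes in total, rather than the 4 minutes spanned by the quick gitlab retries above.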

#33

Updated by okurz 2 months ago

Related to action #41882: all arm worker die after some time added

#34

Updated by okurz 2 months ago

Related to action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically added

#35

Updated by okurz 2 months ago

Related to action #95482: openqaworker-arm-3 offline and not automatically recovered due to gitlab CI failures added

#36

Updated by okurz 2 months ago

Related to action #107074: error on openqaworker-arm-2 failing osd-deployment size:M added

#37

Updated by okurz 2 months ago

Related to action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M added

https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/57

#38

Updated by okurz 2 months ago

Due date set to 2024-05-28

#39

Updated by okurz 2 months ago

https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/57 merged. I suggest awaiting at least one successful normal osd-deployment, regardless of whether a machine like openqaworker-arm-1 is unavailable during the deployment.
