#108845 was resolved but does not fix the issue: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=96&tab=alert still shows broken workers. Filtering for "Broken" on https://openqa.suse.de/admin/workers I can find powerqaworker-qam-1:1 and other instances on the same host. I logged in to the system over ssh and found that openqa-worker@{1..6}
are running although they should not be, see https://gitlab.suse.de/openqa/salt-states-openqa#remarks-about-the-systemd-units-used-to-start-workers . Who did that again?
I ran
systemctl mask --now openqa-worker@{1..6} && systemctl enable --now openqa-worker-auto-restart@{1..6}
and everything looks ok again, but I expect that sooner or later someone will try the same mistaken approach again. Eventually we should find a better solution that does not rely on multiple systemd services for the same purpose, e.g. handling the differences via configuration within a single service. Reported in #109734
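Until we have that better solution, a check like the following could flag hosts where plain openqa-worker@N units are active instead of the auto-restart variants. This is only a sketch: the unit names follow the salt-states-openqa convention above, and `check_units` is a hypothetical helper that parses `systemctl list-units --plain --no-legend 'openqa-worker*'` style output.

```shell
# Hypothetical helper: read "UNIT LOAD ACTIVE SUB DESCRIPTION" lines on
# stdin and print any plain openqa-worker@N unit that is active, i.e. a
# unit that should be masked in favour of openqa-worker-auto-restart@N.
check_units() {
  awk '$1 ~ /^openqa-worker@[0-9]+\.service$/ && $3 == "active" { print $1 }'
}

# Example input as it might look on an affected host:
check_units <<'EOF'
openqa-worker@5.service loaded active running openQA Worker #5
openqa-worker-auto-restart@1.service loaded active running openQA Worker #1
EOF
# prints: openqa-worker@5.service
```

On a real host one would pipe the actual systemctl output into the helper instead of the heredoc.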
I also found openqaworker2:5 broken due to the same problem. Someone started openqa-worker@5 although they should not have; this is visible in the shell history of the root user.
Based on the login times I suspect it was jpupava:
openqaworker2:/home/okurz # history | grep 'openqa-worker@5'
613 2022-04-07 11:45:39 systemctl status openqa-worker@5
618 2022-04-07 11:46:58 systemctl restart openqa-worker@5
620 2022-04-07 11:47:48 systemctl status openqa-worker@5
624 2022-04-07 11:49:09 systemctl restart openqa-worker@5
626 2022-04-07 11:49:12 systemctl restart openqa-worker@5
628 2022-04-07 11:49:18 systemctl stop openqa-worker@5
630 2022-04-07 11:49:26 systemctl status openqa-worker@5
631 2022-04-07 11:49:31 systemctl start openqa-worker@5
…
openqaworker2:/home/okurz # last | head -n 20
jpupava pts/6 10.100.12.155 Thu Apr 7 12:08 - 14:22 (02:13)
jpupava pts/6 10.100.12.155 Thu Apr 7 11:45 - 12:03 (00:17)
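The correlation above (timestamps from root's history vs. login windows from `last`) can be sketched as a small helper. The times below are taken from the output above; the function name `in_session` is hypothetical:

```shell
# Hypothetical helper: check whether a HH:MM timestamp (as printed by a
# HISTTIMEFORMAT-stamped history) falls inside a login window from `last`.
# Works on minutes since midnight; 10# forces base 10 so "08"/"09" are
# not parsed as invalid octal numbers.
in_session() {  # usage: in_session HH:MM LOGIN_HH:MM LOGOUT_HH:MM
  local t=$((10#${1%:*} * 60 + 10#${1#*:}))
  local from=$((10#${2%:*} * 60 + 10#${2#*:}))
  local to=$((10#${3%:*} * 60 + 10#${3#*:}))
  [ "$t" -ge "$from" ] && [ "$t" -le "$to" ]
}

# The suspicious `systemctl restart openqa-worker@5` ran at 11:46;
# jpupava's session from the `last` output spans 11:45 - 12:03:
if in_session 11:46 11:45 12:03; then
  echo "command ran during this session"
fi
```

This obviously only narrows down who was logged in at the time; it does not prove who typed the command.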
I will tell him over chat. Done in https://suse.slack.com/archives/C02CANHLANP/p1649499490207389
Now there should be no more broken workers. I monitored the alert and unpaused it. Problem solved, rollback steps completed.