action #47087: [scheduling] Workers on openqaworker2 stuck frequently
Status: closed
Description
Workers of the following classes on the openqaworker2 host get stuck every couple of days: virt-mm-64bit-ipmi, svirt-hyperv, and svirt-hyperv2012r2. I have to restart them manually so that they get untangled and accept jobs again.
On the surface, from the openQA dashboard, the affected worker shows a job from SLES15 SP1 build 157.1 as "running" for 2-3 days. Cancelling the job did not work, and no new job was acquired from the pool. Restarting the worker service fixed it for the time being, but the worker got stuck again after 3 days.
This is one such worker:
mnowak@openqaworker2:~> sudo systemctl status openqa-worker@19
● openqa-worker@19.service - openQA Worker #19
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2019-02-01 13:54:32 CET; 2 days ago
Main PID: 9887 (worker)
Tasks: 1 (limit: 512)
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker@19.service
└─9887 /usr/bin/perl /usr/share/openqa/script/worker --instance 19
Feb 01 19:52:21 openqaworker2 worker[9887]: [info] uploading vars.json
Feb 01 19:52:21 openqaworker2 worker[9887]: [info] uploading serial0.txt
Feb 01 19:52:21 openqaworker2 worker[9887]: [info] uploading autoinst-log.txt
Feb 01 19:52:21 openqaworker2 worker[9887]: [info] uploading worker-log.txt
Feb 01 19:52:21 openqaworker2 worker[9887]: [info] cleaning up 02430240-sle-15-SP1-Installer-DVD-x86_64-Build158.4-skip_registration@svirt-hyperv-uefi
Feb 01 19:53:44 openqaworker2 worker[9887]: GLOB(0x8005aa8)[info] got job 2426719: 02426719-sle-15-SP1-Installer-DVD-x86_64-Build157.1-mediacheck@svirt-hyperv
Feb 01 19:53:44 openqaworker2 worker[9887]: [info] +++ setup notes +++
Feb 01 19:53:44 openqaworker2 worker[9887]: [info] start time: 2019-02-01 18:53:44
Feb 01 19:53:44 openqaworker2 worker[9887]: [info] running on openqaworker2:19 (Linux 4.7.5-2.g02c4d35-default #1 SMP PREEMPT Mon Sep 26 08:11:45 UTC 2016 (02c4d35) x86_64)
Feb 02 18:09:05 openqaworker2 systemd[1]: openqa-worker@19.service: Got notification message from PID 11599, but reception is disabled.
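For reference, the manual workaround mentioned above comes down to restarting the affected systemd template unit; a minimal sketch, assuming the standard openqa-worker@.service units and taking slot 19 from the status output above:

sudo systemctl restart openqa-worker@19   # restart the stuck slot so it re-registers and accepts jobs again
sudo systemctl status openqa-worker@19    # verify that the unit came back up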
Updated by coolo over 5 years ago
- Related to coordination #47117: [epic] Fix worker->websocket->scheduler->webui connection added
Updated by coolo over 5 years ago
- Project changed from openQA Infrastructure to openQA Project
- Category set to 122
this is a code bug - we just have to find it :(
Updated by okurz over 5 years ago
I think there was also another ticket, but this one is what I found when searching for "stuck" in the subject:
It seems we have a big backlog of especially ipmi and s390x-kvm tests on osd that have been scheduled for multiple days, and the corresponding workers report as "Working", but have also been in that state for longer than a day already, e.g.
https://openqa.suse.de/admin/workers/1246 , https://openqa.suse.de/admin/workers/1245 , https://openqa.suse.de/admin/workers/1243
Updated by okurz over 5 years ago
- Related to action #47060: [worker service][scheduling] openqaworker2:21 ~ openqaworker2:24 stops getting new jobs for over 1 day. added
Updated by mkittler over 5 years ago
I tried to restart the stuck workers on grenache.qa.suse.de, but I don't have access to that machine. Can someone with access restart workers 12 to 15 and maybe also check the other slots?
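A minimal sketch of the requested restart, assuming the standard openqa-worker@.service template units on grenache.qa.suse.de (slot numbers 12 to 15 as asked above, expanded via bash brace expansion):

sudo systemctl restart openqa-worker@{12..15}    # restart the four reportedly stuck slots in one go
systemctl list-units 'openqa-worker@*.service'   # list all worker slots to spot further stuck ones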
Updated by mkittler over 5 years ago
- Status changed from New to In Progress
- Assignee set to mkittler
- Target version set to Current Sprint
This is likely the same issue as #47060, so I'm assigning myself here as well.
Updated by okurz about 5 years ago
- Subject changed from Workers on openqaworker2 stuck frequently to scheduling] Workers on openqaworker2 stuck frequently
- Category changed from 122 to Regressions/Crashes
Updated by okurz about 5 years ago
- Subject changed from scheduling] Workers on openqaworker2 stuck frequently to [scheduling] Workers on openqaworker2 stuck frequently
Updated by okurz about 5 years ago
- Has duplicate action #52997: [sle][functional][tools]test fails in sshd - timeout_exceeded (13:20 hours) added
Updated by okurz about 5 years ago
Do you think the situation has changed since you implemented the reworked worker, which has already been deployed within the OSD infrastructure for some days now?
Updated by mkittler about 5 years ago
Yes, the restructuring should help because the worker now re-registers and the web UI then marks the stale job as incomplete. (Triggering that re-registration by restarting the worker was the workaround so far.) And the worker itself should of course also be able to work on further jobs now.
Updated by okurz about 5 years ago
- Status changed from In Progress to Resolved
I have not seen this for a long time, so I guess it's done.