action #44105

if workercache dies, we get *tons* of incompletes

Added by coolo over 1 year ago. Updated about 1 year ago.

Status: Resolved
Start date: 21/11/2018
Priority: High
Due date:
Assignee: mkittler
% Done: 0%
Category: Concrete Bugs
Target version: Current Sprint
Difficulty:
Duration:

Description

I guess if the workercache service is unavailable, the worker should stop accepting jobs - otherwise it can enqueue a lot of incompletes really quickly.

● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2018-11-21 03:33:35 CET; 4h 36min ago
  Process: 1962 ExecStart=/usr/share/openqa/script/openqa-workercache daemon -m production (code=exited, status=22)
 Main PID: 1962 (code=exited, status=22)

Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: removed /var/lib/openqa/cache/old/openSUSE-13.2-x86_64.qcow2
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [INFO] CACHE: Purging non registered /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [ERROR] CACHE: Could not remove /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: removed /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: Health: Real size: 52798166016, Configured limit: 53687091200
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 52798166016
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.

Related issues

Related to openQA Project - action #44162: Various tests stayed 'running' for ~ 4 hours or longer New 21/11/2018
Related to openQA Project - action #44693: Caching issue on new snapshots synced to o3 - no cache minion workers available Resolved 04/12/2018

History

#1 Updated by coolo over 1 year ago

OK, it didn't actually die - it didn't come up correctly on reboot. I guess some dependency in the systemd unit is missing.
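If the service failed because it started before the network was fully up (the "Address family for hostname not supported" error from Mojo::IOLoop in the log above points at socket setup failing early in boot), the usual systemd remedy is an ordering dependency on network-online.target plus automatic restarts. A hedged sketch of what such a drop-in could look like - this is an illustration, not the unit actually shipped by openQA:

```ini
# Hypothetical drop-in, e.g.
# /etc/systemd/system/openqa-worker-cacheservice.service.d/network.conf
# Wait until the network is actually online before starting, and
# restart the service automatically if it still exits with an error.
[Unit]
Wants=network-online.target
After=network-online.target

[Service]
Restart=on-failure
RestartSec=10
```

Note that network-online.target only has an effect if a network wait service (e.g. NetworkManager-wait-online.service) is enabled on the host.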

#2 Updated by okurz over 1 year ago

This doesn't fix the wrong dependencies, but it could help somewhat: https://github.com/os-autoinst/openQA/pull/1878

#3 Updated by okurz over 1 year ago

  • Related to action #44162: Various tests stayed 'running' for ~ 4 hours or longer added

#4 Updated by mgriessmeier about 1 year ago

So on o.s.d we had the same issue since the 19th of November - it took until the 25th of November before someone noticed it (I wonder why). Can there be any better monitoring for this?
We also found the mentioned "enqueued incomplete jobs" - over 100 for a single test suite, all of which popped up in the web UI after we restarted the services (https://openqa.suse.de/tests/2282363#next_previous).

We've restarted all the services and workers and cleaned up old jobs; it seems to run again.

#5 Updated by mkittler about 1 year ago

  • Status changed from New to In Progress

I've just been reading https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget and I agree that services should be more robust in general and should therefore, for example, retry connecting.

But https://github.com/os-autoinst/openQA/pull/1878 should be sufficient to fix the immediate problem. Also, the worker should now come up again automatically since https://github.com/os-autoinst/openQA/pull/1892 has been merged (in case it exited because the cache was unavailable).

#6 Updated by okurz about 1 year ago

  • Status changed from In Progress to Feedback
  • Assignee set to okurz

#7 Updated by coolo about 1 year ago

If you assign the ticket to yourself, remember it's about "if the workercache service is unavailable, the worker should stop accepting jobs". Just restarting it doesn't make it available.

#8 Updated by mkittler about 1 year ago

Yes: to prevent tons of incompletes when the cache is not available (for whatever reason), the worker code must be changed so that it doesn't accept any new jobs until the cache is up again.
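The behavior being asked for can be sketched roughly as follows. This is an illustrative sketch with hypothetical names (`Worker`, `cache_available`, `can_accept_job`), not the actual openQA worker code:

```python
# Sketch of the requested worker behavior: probe the cache service
# before accepting a job, and refuse jobs while it is unreachable
# instead of starting them and producing incompletes.

class Worker:
    def __init__(self, cache_available):
        # cache_available is a callable that probes the cache service,
        # e.g. an HTTP request against its status endpoint.
        self.cache_available = cache_available
        self.broken = False

    def can_accept_job(self):
        # Re-probe the cache for every job offer; flip into (and out
        # of) the 'broken' state depending on the result.
        self.broken = not self.cache_available()
        return not self.broken


worker = Worker(cache_available=lambda: False)
assert not worker.can_accept_job()  # cache down: refuse the job
assert worker.broken                # worker reports itself as broken
```

The key point is that the probe happens on every job offer, so a cache outage blocks at most the current offer rather than producing a queue of incompletes.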

#9 Updated by okurz about 1 year ago

coolo wrote:

Just restarting it doesn't make available.

True, but I wanted to track it to gather feedback on our systemd changes first. If one of you wants to pick it up and go further with the actual implementation, be my guest.

#10 Updated by okurz about 1 year ago

  • Related to action #44693: Caching issue on new snapshots synced to o3 - no cache minion workers available added

#11 Updated by okurz about 1 year ago

  • Status changed from Feedback to In Progress
  • Assignee changed from okurz to mkittler

https://github.com/os-autoinst/openQA/pull/1892 is merged and deployed on o3 and osd. So far no problems have been observed. @mkittler, over to you.

#12 Updated by mkittler about 1 year ago

  • Target version changed from Ready to Current Sprint

#13 Updated by mkittler about 1 year ago

  • Status changed from In Progress to Resolved

The PR has been merged. The worker now goes into a 'broken' state that is visible via the web UI. At its regular status update interval, the worker checks whether the cache is available again and recovers from the broken state on its own.
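The recovery part of that behavior can be sketched like this - again with hypothetical names (`status_update`, `probe_cache`), not the real openQA worker code:

```python
# Simplified sketch of the 'broken' state recovery described above:
# on every regular status update the worker re-probes the cache and
# transitions between 'broken' and 'idle' accordingly.

def status_update(worker, probe_cache):
    if probe_cache():
        # Cache answers again: recover and accept jobs once more.
        worker["state"] = "idle"
    else:
        # Cache unreachable: shown as 'broken' in the web UI.
        worker["state"] = "broken"


worker = {"state": "broken"}
status_update(worker, probe_cache=lambda: True)
assert worker["state"] == "idle"  # recovered without manual restart
```

Because the check runs on the existing status update interval, no extra timer or manual intervention is needed for the worker to come back.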
