action #44105

closed

if workercache dies, we get *tons* of incompletes

Added by coolo about 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2018-11-21
Due date:
% Done: 0%
Estimated time:

Description

I guess that if the workercache service is unavailable, the worker should stop accepting jobs. Otherwise it can enqueue a lot of incompletes really quickly.

● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2018-11-21 03:33:35 CET; 4h 36min ago
  Process: 1962 ExecStart=/usr/share/openqa/script/openqa-workercache daemon -m production (code=exited, status=22)
 Main PID: 1962 (code=exited, status=22)

Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: removed /var/lib/openqa/cache/old/openSUSE-13.2-x86_64.qcow2
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [INFO] CACHE: Purging non registered /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [ERROR] CACHE: Could not remove /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: removed /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: Health: Real size: 52798166016, Configured limit: 53687091200
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 52798166016
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.

Related issues 3 (0 open, 3 closed)

Related to openQA Project (public) - action #44162: Various tests stayed 'running' for ~ 4 hours or longer (Resolved, okurz, 2018-11-21)

Related to openQA Project (public) - action #44693: Caching issue on new snapshots synced to o3 - no cache minion workers available (Resolved, okurz, 2018-12-04)

Related to openQA Project (public) - action #62567: openqa services can fail when network is not up (yet) "Can't create listen socket: Address family for hostname not supported" (Resolved, okurz, 2020-01-17, 2020-03-06)

Actions #1

Updated by coolo about 6 years ago

OK, it didn't die; it just didn't come up correctly on reboot. I guess some dependency in the systemd unit is missing.

Actions #2

Updated by okurz about 6 years ago

This doesn't fix the wrong dependencies, but it could help somewhat: https://github.com/os-autoinst/openQA/pull/1878

Actions #3

Updated by okurz about 6 years ago

  • Related to action #44162: Various tests stayed 'running' for ~ 4 hours or longer added
Actions #4

Updated by mgriessmeier about 6 years ago

So on o.s.d we have had the same issue since the 19th of November, and it took until the 25th of November for someone to notice it (I wonder why). Can there be any better monitoring for this?
We also found the mentioned "enqueued incomplete jobs": over 100 for a single test suite, which all popped up in the web UI after we restarted the services (https://openqa.suse.de/tests/2282363#next_previous).

We've restarted all the services and workers and cleaned up old jobs; it seems to run again.
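
One way such monitoring could look is a check that flags a worker once several of its jobs in a row end as incomplete. The following is only a rough sketch under assumptions: the class name, the threshold, and the result strings are hypothetical and not taken from openQA.

```python
from collections import deque


class IncompleteStreakMonitor:
    """Flags a worker once its last `threshold` jobs all ended as incomplete.

    Hypothetical helper for illustration; not part of openQA.
    """

    def __init__(self, threshold=5):
        self.threshold = threshold
        self._recent = {}  # worker name -> deque of its most recent results

    def record(self, worker, result):
        """Record a job result; return True if the worker should be flagged."""
        history = self._recent.setdefault(worker, deque(maxlen=self.threshold))
        history.append(result)
        return len(history) == self.threshold and all(
            r == "incomplete" for r in history
        )
```

Feeding every finished job into `record()` and alerting on the first `True` would surface a broken worker after a handful of jobs instead of after days.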

Actions #5

Updated by mkittler about 6 years ago

  • Status changed from New to In Progress

I've just been reading https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget and I agree that services should be more robust in general and should therefore, for example, retry connecting.

But https://github.com/os-autoinst/openQA/pull/1878 should be sufficient to fix the immediate problem. Also the worker should come up again automatically now since https://github.com/os-autoinst/openQA/pull/1892 has been merged (in case it exited because the cache was unavailable).
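
The "retry instead of failing on the first error" idea can be sketched generically as follows. This is an illustration only, not the code from either pull request: `retry_with_backoff` and its parameters are hypothetical names, and the host and port in the usage comment are placeholders.

```python
import socket
import time


def retry_with_backoff(operation, attempts=5, delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds, retrying on OSError.

    socket.gaierror (name resolution failure) is a subclass of OSError,
    so errors raised while the network is not up yet are retried as well.
    The wait doubles after every failed attempt (exponential backoff).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except OSError as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(delay * 2 ** attempt)
    raise last_error


# Usage with placeholder host and port (assumptions, not openQA defaults):
# server = retry_with_backoff(lambda: socket.create_server(("localhost", 12345)))
```

The `sleep` parameter exists only so that the waiting can be replaced in tests; a service would normally leave it at its default.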

Actions #6

Updated by okurz about 6 years ago

  • Status changed from In Progress to Feedback
  • Assignee set to okurz
Actions #7

Updated by coolo about 6 years ago

If you assign the ticket to yourself, remember it's about "if the workercache service is unavailable, the worker should stop accepting jobs". Just restarting it doesn't make it available.

Actions #8

Updated by mkittler about 6 years ago

Yes, to prevent tons of incompletes in case the cache is not available (for whatever reason) the worker code must be changed so it doesn't accept any new jobs until the cache is up again.
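
A minimal sketch of that gating idea, assuming the cache service can be probed with a plain TCP connection. The function names and the probe itself are hypothetical; the actual worker may check availability differently.

```python
import socket


def cache_is_available(host, port, timeout=2.0):
    """Return True if a TCP connection to the cache service succeeds.

    Assumption for illustration: being able to connect means "available".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def accept_job(job, cache_available):
    """Hand the job back only if the probe passes; otherwise decline it.

    `cache_available` is any callable returning a bool, for example
    lambda: cache_is_available("localhost", some_port).
    """
    if not cache_available():
        return None  # declined, so the job is never started without a cache
    return job
```

Declining before the job starts is the point of the sketch: a job that is never started cannot end up as an incomplete.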

Actions #9

Updated by okurz about 6 years ago

coolo wrote:

Just restarting it doesn't make it available.

True, but I wanted to track it to gather feedback from our systemd changes first. If one of you wants to pick it up and go further with the actual implementation, be my guest.

Actions #10

Updated by okurz about 6 years ago

  • Related to action #44693: Caching issue on new snapshots synced to o3 - no cache minion workers available added
Actions #11

Updated by okurz about 6 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from okurz to mkittler

https://github.com/os-autoinst/openQA/pull/1892 is merged and deployed on o3 and osd. So far no problems observed. @mkittler over to you

Actions #12

Updated by mkittler about 6 years ago

  • Target version changed from Ready to Current Sprint
Actions #13

Updated by mkittler about 6 years ago

  • Status changed from In Progress to Resolved

PR has been merged. The worker should now go into 'broken' state visible via the web UI. In the regular status update interval the worker checks whether the cache is available again to recover from the broken state on its own.
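
The described behavior can be pictured with the following toy model. It is a simplified sketch rather than the merged worker code: the class, its method names, and the 'idle' state label are assumptions, and only the 'broken' state name comes from the comment above.

```python
class Worker:
    """Toy worker whose state follows the result of a cache probe."""

    def __init__(self, cache_available):
        self._cache_available = cache_available  # callable returning a bool
        self.state = "idle"

    def status_update(self):
        """Run once per status update interval; returns the state to report."""
        # Re-checking on every interval lets the worker recover on its own
        # as soon as the cache is reachable again.
        self.state = "idle" if self._cache_available() else "broken"
        return self.state

    def can_accept_jobs(self):
        """A broken worker must not take new jobs."""
        return self.state != "broken"
```

Because the probe runs on every status update, no manual restart is needed once the cache service is back.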

Actions #14

Updated by okurz almost 5 years ago

  • Related to action #62567: openqa services can fail when network is not up (yet) "Can't create listen socket: Address family for hostname not supported" added