action #44105

if workercache dies, we get *tons* of incompletes

Added by coolo over 1 year ago. Updated about 1 year ago.

Status: Resolved
Start date: 21/11/2018
Priority: High
Due date:
Assignee: mkittler
% Done: 0%
Category: Concrete Bugs
Target version: Current Sprint
Difficulty:
Duration:

Description

I guess if the workercache service is unavailable, the worker should stop accepting jobs - otherwise it can enqueue a lot of incompletes really quickly.

● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2018-11-21 03:33:35 CET; 4h 36min ago
  Process: 1962 ExecStart=/usr/share/openqa/script/openqa-workercache daemon -m production (code=exited, status=22)
 Main PID: 1962 (code=exited, status=22)

Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: removed /var/lib/openqa/cache/old/openSUSE-13.2-x86_64.qcow2
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [INFO] CACHE: Purging non registered /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [ERROR] CACHE: Could not remove /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: removed /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: Health: Real size: 52798166016, Configured limit: 53687091200
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 52798166016
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.

Related issues

Related to openQA Project - action #44162: Various tests stayed 'running' for ~ 4 hours or longer New 21/11/2018
Related to openQA Project - action #44693: Caching issue on new snapshots synced to o3 - no cache minion workers available Resolved 04/12/2018

History

#1 Updated by coolo over 1 year ago

OK, it didn't actually die - it didn't come up correctly on reboot. I guess some dependency in the systemd unit is missing.
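If the service failed because it started before the network was fully up (the "Address family for hostname not supported" error from Mojo::IOLoop in the log above points at socket setup failing early in boot), the usual systemd remedy is an ordering dependency on network-online.target plus automatic restarts. A hedged sketch of what such a drop-in could look like - this is an illustration, not the unit actually shipped by openQA:

```ini
# Hypothetical drop-in, e.g.
# /etc/systemd/system/openqa-worker-cacheservice.service.d/network.conf
# Wait until the network is actually online before starting, and
# restart the service automatically if it still exits with an error.
[Unit]
Wants=network-online.target
After=network-online.target

[Service]
Restart=on-failure
RestartSec=10
```

Note that network-online.target only has an effect if a network wait service (e.g. NetworkManager-wait-online.service) is enabled on the host.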

#2 Updated by okurz over 1 year ago

This doesn't fix the wrong dependencies, but it could help somewhat: https://github.com/os-autoinst/openQA/pull/1878

#3 Updated by okurz over 1 year ago

  • Related to action #44162: Various tests stayed 'running' for ~ 4 hours or longer added

#4 Updated by mgriessmeier about 1 year ago

So on o.s.d we had the same issue since the 19th of November - it took until the 25th of November before someone noticed it (I wonder why). Can there be any better monitoring for this?
We also found the mentioned "enqueued incomplete jobs" - over 100 for a single test suite, all of which popped up in the web UI after we restarted the services (https://openqa.suse.de/tests/2282363#next_previous).

We've restarted all the services and workers and cleaned up old jobs; it seems to run again.

#5 Updated by mkittler about 1 year ago

  • Status changed from New to In Progress

I've just been reading https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget and I agree that services should be more robust in general and should therefore, for example, retry connecting.

But https://github.com/os-autoinst/openQA/pull/1878 should be sufficient to fix the immediate problem. Also, the worker should now come up again automatically since https://github.com/os-autoinst/openQA/pull/1892 has been merged (in case it exited because the cache was unavailable).

#6 Updated by okurz about 1 year ago

  • Status changed from In Progress to Feedback
  • Assignee set to okurz

#7 Updated by coolo about 1 year ago

If you assign the ticket to yourself, remember it's about "if the workercache service is unavailable, the worker should stop accepting jobs". Just restarting it doesn't make it available.

#8 Updated by mkittler about 1 year ago

Yes: to prevent tons of incompletes when the cache is not available (for whatever reason), the worker code must be changed so that it doesn't accept any new jobs until the cache is up again.
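The behavior being asked for can be sketched roughly as follows. This is an illustrative sketch with hypothetical names (`Worker`, `cache_available`, `can_accept_job`), not the actual openQA worker code:

```python
# Sketch of the requested worker behavior: probe the cache service
# before accepting a job, and refuse jobs while it is unreachable
# instead of starting them and producing incompletes.

class Worker:
    def __init__(self, cache_available):
        # cache_available is a callable that probes the cache service,
        # e.g. an HTTP request against its status endpoint.
        self.cache_available = cache_available
        self.broken = False

    def can_accept_job(self):
        # Re-probe the cache for every job offer; flip into (and out
        # of) the 'broken' state depending on the result.
        self.broken = not self.cache_available()
        return not self.broken


worker = Worker(cache_available=lambda: False)
assert not worker.can_accept_job()  # cache down: refuse the job
assert worker.broken                # worker reports itself as broken
```

The key point is that the probe happens on every job offer, so a cache outage blocks at most the current offer rather than producing a queue of incompletes.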

#9 Updated by okurz about 1 year ago

coolo wrote:

Just restarting it doesn't make available.

True, but I wanted to track it to gather feedback on our systemd changes first. If one of you wants to pick it up and go further with the actual implementation, be my guest.

#10 Updated by okurz about 1 year ago

  • Related to action #44693: Caching issue on new snapshots synced to o3 - no cache minion workers available added

#11 Updated by okurz about 1 year ago

  • Status changed from Feedback to In Progress
  • Assignee changed from okurz to mkittler

https://github.com/os-autoinst/openQA/pull/1892 is merged and deployed on o3 and osd. So far no problems have been observed. @mkittler, over to you.

#12 Updated by mkittler about 1 year ago

  • Target version changed from Ready to Current Sprint

#13 Updated by mkittler about 1 year ago

  • Status changed from In Progress to Resolved

The PR has been merged. The worker now goes into a 'broken' state that is visible via the web UI. At its regular status update interval, the worker checks whether the cache is available again and recovers from the broken state on its own.
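The recovery part of that behavior can be sketched like this - again with hypothetical names (`status_update`, `probe_cache`), not the real openQA worker code:

```python
# Simplified sketch of the 'broken' state recovery described above:
# on every regular status update the worker re-probes the cache and
# transitions between 'broken' and 'idle' accordingly.

def status_update(worker, probe_cache):
    if probe_cache():
        # Cache answers again: recover and accept jobs once more.
        worker["state"] = "idle"
    else:
        # Cache unreachable: shown as 'broken' in the web UI.
        worker["state"] = "broken"


worker = {"state": "broken"}
status_update(worker, probe_cache=lambda: True)
assert worker["state"] == "idle"  # recovered without manual restart
```

Because the check runs on the existing status update interval, no extra timer or manual intervention is needed for the worker to come back.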
