action #44105
closed
if workercache dies, we get *tons* of incompletes
0%
Description
I guess if the workercache service is unavailable, the worker should stop accepting jobs - otherwise it can enqueue a lot of incompletes
really quickly.
● openqa-worker-cacheservice.service - OpenQA Worker Cache Service
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-cacheservice.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2018-11-21 03:33:35 CET; 4h 36min ago
Process: 1962 ExecStart=/usr/share/openqa/script/openqa-workercache daemon -m production (code=exited, status=22)
Main PID: 1962 (code=exited, status=22)
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: removed /var/lib/openqa/cache/old/openSUSE-13.2-x86_64.qcow2
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [INFO] CACHE: Purging non registered /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [ERROR] CACHE: Could not remove /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: removed /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [DEBUG] CACHE: Health: Real size: 52798166016, Configured limit: 53687091200
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 52798166016
Nov 21 03:33:35 openqaworker4 openqa-workercache[1962]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=22/n/a
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Unit entered failed state.
Nov 21 03:33:35 openqaworker4 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Updated by coolo about 6 years ago
OK, it didn't die; it failed to come up correctly on reboot. I guess some dependency in the systemd unit is missing.
Updated by okurz about 6 years ago
This does not fix the wrong dependencies, but it could help somewhat: https://github.com/os-autoinst/openQA/pull/1878
Updated by okurz about 6 years ago
- Related to action #44162: Various tests stayed 'running' for ~ 4 hours or longer added
Updated by mgriessmeier about 6 years ago
On o.s.d we have had the same issue since 19 November, and it took until 25 November for anyone to notice it (I wonder why). Can we get better monitoring for this?
We also found the mentioned "enqueued incomplete jobs": over 100 for a single test suite, all of which popped up in the web UI after we restarted the services (https://openqa.suse.de/tests/2282363#next_previous).
We've restarted all the services and workers and cleaned up old jobs; it seems to run again.
Updated by mkittler about 6 years ago
- Status changed from New to In Progress
I've just been reading https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget and I agree that services should be more robust in general and should therefore, for example, retry connecting.
But https://github.com/os-autoinst/openQA/pull/1878 should be sufficient to fix the immediate problem. Also the worker should come up again automatically now since https://github.com/os-autoinst/openQA/pull/1892 has been merged (in case it exited because the cache was unavailable).
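To illustrate the "retry" idea for the failure seen in the log ("Can't create listen socket"), here is a minimal Python sketch. It is not the openQA code (which is Perl/Mojolicious); `listen_with_retry` and its parameters are hypothetical names chosen for illustration. It only assumes that opening the listen socket can fail while the network is not fully up yet and may succeed a moment later.

```python
import socket
import time


def listen_with_retry(host, port, attempts=5, delay=1.0, sleep=time.sleep):
    """Try to open a listening TCP socket, retrying on failure.

    Hypothetical helper for illustration only; not part of openQA.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Name resolution or bind can fail while the network is coming up.
            return socket.create_server((host, port))
        except OSError as error:
            last_error = error
            if attempt < attempts:
                sleep(delay * attempt)  # linear backoff between attempts
    raise RuntimeError(f"could not listen on {host}:{port}") from last_error
```

A service started this way would survive a short window in which the address is not usable yet, instead of exiting immediately and staying in the failed state.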
Updated by okurz about 6 years ago
- Status changed from In Progress to Feedback
- Assignee set to okurz
Updated by coolo about 6 years ago
If you assign the ticket to yourself, remember it's about "if the workercache service is unavailable, the worker should stop accepting jobs". Just restarting it doesn't make it available.
Updated by mkittler about 6 years ago
Yes, to prevent tons of incompletes in case the cache is not available (for whatever reason) the worker code must be changed so it doesn't accept any new jobs until the cache is up again.
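A rough sketch of such a check, written in Python purely for illustration (the real worker is Perl). The function names, and the assumption that the cache service can be probed with a plain TCP connection to a given host and port, are hypothetical and not taken from the openQA code.

```python
import socket


def cache_service_available(host, port, timeout=1.0):
    """Return True if a TCP connection to the cache service succeeds.

    Assumption: reachability of the port is a good enough health signal.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def accept_job(job, cache_available):
    """Accept a job only while the cache is reachable.

    `cache_available` is a callable returning a bool. A rejected job stays
    with the scheduler instead of being started and ending up incomplete.
    """
    return "accepted" if cache_available() else "rejected"
```

Passing the probe as a callable keeps the accept decision separate from how availability is actually determined.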
Updated by okurz about 6 years ago
coolo wrote:
Just restarting it doesn't make it available.
True, but I wanted to track it to gather feedback from our systemd changes first. If one of you wants to pick it up and go further with the actual implementation, be my guest.
Updated by okurz about 6 years ago
- Related to action #44693: Caching issue on new snapshots synced to o3 - no cache minion workers available added
Updated by okurz about 6 years ago
- Status changed from Feedback to In Progress
- Assignee changed from okurz to mkittler
https://github.com/os-autoinst/openQA/pull/1892 is merged and deployed on o3 and osd. So far no problems observed. @mkittler over to you
Updated by mkittler about 6 years ago
- Target version changed from Ready to Current Sprint
Updated by mkittler about 6 years ago
- Status changed from In Progress to Resolved
The PR has been merged. The worker should now go into a 'broken' state that is visible via the web UI. At each regular status update interval, the worker checks whether the cache is available again so that it can recover from the broken state on its own.
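The behaviour described above can be modelled as a tiny state machine. The following Python sketch is a toy model under assumptions, not the merged implementation: the class, method, and state names other than 'broken' are hypothetical, and the cache check is injected as a callable.

```python
class Worker:
    """Toy model of the worker states described above (hypothetical names)."""

    def __init__(self, cache_available):
        self._cache_available = cache_available  # callable returning a bool
        self.state = "idle"

    def status_update(self):
        """Called once per status update interval; re-checks the cache."""
        if self._cache_available():
            if self.state == "broken":
                self.state = "idle"  # recover on its own
        else:
            self.state = "broken"  # would be reported to the web UI
        return self.state

    def can_accept_jobs(self):
        """A broken worker must not take new jobs."""
        return self.state != "broken"
```

Because the check runs on every status update, no manual restart is needed once the cache service is back.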
Updated by okurz almost 5 years ago
- Related to action #62567: openqa services can fail when network is not up (yet) "Can't create listen socket: Address family for hostname not supported" added