action #19564
closed[tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked?
0%
Description
observation¶
No jobs are executed on worker class "zkvm-image", i.e. openqaworker2:7.
journal output on that worker:
Jun 01 22:32:13 openqaworker2 worker[1662]: [INFO] OpenQA::Worker::Cache: Initialized with http://openqa.suse.de at /var/lib/openqa/cache, current size is 47971230641
Jun 01 22:32:13 openqaworker2 worker[1662]: [DEBUG] Found HDD_1, caching sle-12-SP1-Server-DVD-s390x-sdk+allpatterns.qcow2
Jun 01 22:32:13 openqaworker2 worker[1662]: [DEBUG] CACHE: Aquiring lock for /var/lib/openqa/cache/sle-12-SP1-Server-DVD-s390x-sdk+allpatterns.qcow2 in the database
Jun 01 22:32:19 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:32:39 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:34:32 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:35:02 openqaworker2 worker[1662]: DBD::SQLite::st execute failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 212.
Jun 01 22:35:02 openqaworker2 worker[1662]: [ERROR] toggle_asset_lock: Rolling back DBD::SQLite::st execute failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 212.
Jun 01 22:35:02 openqaworker2 worker[1662]: rollback ineffective with AutoCommit enabled at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 216.
Jun 01 22:35:07 openqaworker2 worker[1662]: [WARN] job is missing files, releasing job
Jun 01 22:35:07 openqaworker2 worker[1662]: [ERROR] 404 response: Not Found (remaining tries: 0)
Jun 01 22:35:07 openqaworker2 worker[1662]: [ERROR] ERROR autoinst-log.txt: 404 response: Not Found
Jun 01 22:35:07 openqaworker2 worker[1662]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Jun 01 22:35:07 openqaworker2 worker[1662]: [INFO] cleaning up 00975813-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp1_sdk+allpatterns_full_update_by_yast_s390x@zkvm-image
Jun 01 22:35:07 openqaworker2 worker[1662]: [INFO] got job 975814: 00975814-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp2_allpatterns_full_update_by_zypper_s390x@zkvm-image
Jun 01 22:35:37 openqaworker2 worker[1662]: DBD::SQLite::db prepare failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 345.
Jun 01 22:35:37 openqaworker2 worker[1662]: Mojo::Reactor::Poll: Timer failed: DBD::SQLite::db prepare failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 345.
lines 243531-243591/243591 (END)
reproducible¶
TBD
problem¶
Is the locked database a symptom or the problem causing the worker to be stuck?
This is the output when stopping the worker
Jun 04 08:34:09 openqaworker2 systemd[1]: Stopping openQA Worker #7...
Jun 04 08:34:09 openqaworker2 worker[1662]: [ERROR] 404 response: Not Found (remaining tries: 0)
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] registering worker with openQA http://openqa.suse.de...
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] quit due to signal TERM
Jun 04 08:34:09 openqaworker2 worker[1662]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Jun 04 08:34:09 openqaworker2 worker[1662]: [DEBUG] duplicating job 975814
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] cleaning up 00975814-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp2_allpatterns_full_update_by_zypper_s390x@zkvm-image
Jun 04 08:34:09 openqaworker2 systemd[1]: Stopped openQA Worker #7.
Why the "[ERROR] 404 response: Not Found" when trying to stop the worker?
When there is no output of "[DEBUG] Update status so job is not considered dead." in the logfile for three days shouldn't the worker be considered dead by the scheduler?
workaround¶
Restart the worker
Updated by szarate over 7 years ago
- Assignee set to szarate
- Priority changed from High to Normal
- Target version set to Milestone 8
This is a situation I was talking on this commit, i will take a look ai this after RC unless this becomes a bigger problem.
Updated by szarate over 7 years ago
- Status changed from New to Feedback
This issue is addressed here, but it is still a race condition, I just tried to reduce the probability of this happening.
Updated by szarate over 7 years ago
I believe this is solved now... I found #20396 along the way too, but it's a cousin of this problem.
Updated by okurz over 7 years ago
- Status changed from Feedback to Resolved
I see the worker as being active. I did not manually need to restart it or something in the past weeks. Also I did not find any cronjobs that would restart the worker as a workaround so: Yes, I guess it's resolved.
Updated by okurz over 7 years ago
- Status changed from Resolved to In Progress
I was wrong, the original report is not really done. I am pretty sure the original issue "worker reports as online but not doing anything" persists.
Updated by okurz over 7 years ago
- Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Updated by szarate over 7 years ago
- Assignee changed from szarate to okurz
- Target version deleted (
Milestone 8)
The part where the worker is not doing anything, related to Mojo::Reactor::Poll failing (Because something inside the loop died), can be handled through logwarn and might be useful to have it, because this might not be the only situation where this can happen (But it's abnormal enough...)
@okurz, care to take a look at it? I believe you will be faster here :)
However, this should be properly taken care of in the openQA side of things in the future.
Updated by okurz over 7 years ago
- Assignee deleted (
okurz)
I agree with your last statement that the original problem is not solved: The worker should be considered dead if it does not respond to pending jobs within three days. As there were no related changes within the job monitoring or scheduling code this is not done yet.
Updated by coolo about 7 years ago
- Status changed from In Progress to Closed
Ettore redid the whole worker status thing completely, so closing this as old story