action #19564

closed

[tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked?

Added by okurz almost 7 years ago. Updated over 6 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: Regressions/Crashes
Target version: -
Start date: 2017-06-04
Due date:
% Done: 0%
Estimated time:
Description

observation

No jobs are executed on worker class "zkvm-image", i.e. openqaworker2:7.

journal output on that worker:

Jun 01 22:32:13 openqaworker2 worker[1662]: [INFO] OpenQA::Worker::Cache: Initialized with http://openqa.suse.de at /var/lib/openqa/cache, current size is 47971230641
Jun 01 22:32:13 openqaworker2 worker[1662]: [DEBUG] Found HDD_1, caching sle-12-SP1-Server-DVD-s390x-sdk+allpatterns.qcow2
Jun 01 22:32:13 openqaworker2 worker[1662]: [DEBUG] CACHE: Aquiring lock for /var/lib/openqa/cache/sle-12-SP1-Server-DVD-s390x-sdk+allpatterns.qcow2 in the database
Jun 01 22:32:19 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:32:39 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:34:32 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:35:02 openqaworker2 worker[1662]: DBD::SQLite::st execute failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 212.
Jun 01 22:35:02 openqaworker2 worker[1662]: [ERROR] toggle_asset_lock: Rolling back DBD::SQLite::st execute failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 212.
Jun 01 22:35:02 openqaworker2 worker[1662]: rollback ineffective with AutoCommit enabled at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 216.
Jun 01 22:35:07 openqaworker2 worker[1662]: [WARN] job is missing files, releasing job
Jun 01 22:35:07 openqaworker2 worker[1662]: [ERROR] 404 response: Not Found (remaining tries: 0)
Jun 01 22:35:07 openqaworker2 worker[1662]: [ERROR] ERROR autoinst-log.txt: 404 response: Not Found
Jun 01 22:35:07 openqaworker2 worker[1662]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Jun 01 22:35:07 openqaworker2 worker[1662]: [INFO] cleaning up 00975813-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp1_sdk+allpatterns_full_update_by_yast_s390x@zkvm-image
Jun 01 22:35:07 openqaworker2 worker[1662]: [INFO] got job 975814: 00975814-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp2_allpatterns_full_update_by_zypper_s390x@zkvm-image
Jun 01 22:35:37 openqaworker2 worker[1662]: DBD::SQLite::db prepare failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 345.
Jun 01 22:35:37 openqaworker2 worker[1662]: Mojo::Reactor::Poll: Timer failed: DBD::SQLite::db prepare failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 345.
lines 243531-243591/243591 (END)
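
The "database is locked" error above can be reproduced outside of openQA with two connections writing to the same SQLite file. The following is a minimal sketch in Python rather than the worker's Perl code; the `assets` table, its columns, and the 0.1 s timeout are assumptions made for illustration and are not taken from `Cache.pm`. It also uses explicit transactions, since a rollback has nothing to undo when no transaction is open, which is what the "rollback ineffective with AutoCommit enabled" message points at.

```python
import os
import sqlite3
import tempfile


def demo_locked_database():
    """Reproduce 'database is locked' with two connections to one SQLite file."""
    path = os.path.join(tempfile.mkdtemp(), "cache.sqlite")
    # isolation_level=None leaves transaction control to explicit
    # BEGIN/COMMIT/ROLLBACK statements; timeout is the busy timeout in seconds.
    holder = sqlite3.connect(path, timeout=0.1, isolation_level=None)
    waiter = sqlite3.connect(path, timeout=0.1, isolation_level=None)
    # Hypothetical schema, only for this sketch.
    holder.execute(
        "CREATE TABLE assets (filename TEXT PRIMARY KEY, downloading INTEGER)"
    )

    holder.execute("BEGIN IMMEDIATE")  # takes the write lock and keeps it
    holder.execute("INSERT INTO assets VALUES ('a.qcow2', 1)")

    try:
        waiter.execute("BEGIN IMMEDIATE")  # waits for the timeout, then fails
        waiter.execute("ROLLBACK")
        outcome = "acquired"
    except sqlite3.OperationalError as err:
        outcome = str(err)

    holder.execute("COMMIT")  # releasing the lock lets the second writer proceed

    waiter.execute("BEGIN IMMEDIATE")
    waiter.execute("UPDATE assets SET downloading = 0 WHERE filename = 'a.qcow2'")
    waiter.execute("COMMIT")
    value = waiter.execute("SELECT downloading FROM assets").fetchone()[0]

    holder.close()
    waiter.close()
    return outcome, value
```

If the first connection never commits or rolls back, every later writer keeps failing in the same way, which would match a worker that stays stuck until it is restarted. Whether that is what happened here is not confirmed by the log.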

reproducible

TBD

problem

Is the locked database a symptom or the problem causing the worker to be stuck?

This is the output when stopping the worker:

Jun 04 08:34:09 openqaworker2 systemd[1]: Stopping openQA Worker #7...
Jun 04 08:34:09 openqaworker2 worker[1662]: [ERROR] 404 response: Not Found (remaining tries: 0)
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] registering worker with openQA http://openqa.suse.de...
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] quit due to signal TERM
Jun 04 08:34:09 openqaworker2 worker[1662]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Jun 04 08:34:09 openqaworker2 worker[1662]: [DEBUG] duplicating job 975814
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] cleaning up 00975814-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp2_allpatterns_full_update_by_zypper_s390x@zkvm-image
Jun 04 08:34:09 openqaworker2 systemd[1]: Stopped openQA Worker #7.

Why is "[ERROR] 404 response: Not Found" logged when stopping the worker?

If the log shows no "[DEBUG] Update status so job is not considered dead." message for three days, shouldn't the scheduler consider the worker dead?
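
The kind of staleness check this question asks for can be sketched as follows. This is not how the openQA scheduler works; it only illustrates the idea on the journal lines quoted above. The function names, the 10-minute threshold, and the year passed in (taken from the start date of this ticket, since the journal's short timestamps carry no year) are all assumptions.

```python
from datetime import datetime, timedelta

HEARTBEAT = "Update status so job is not considered dead."


def parse_stamp(line, year):
    """Parse the leading 'Jun 01 22:32:19' stamp of a journal line."""
    # The short journal format has no year, so the caller has to supply it.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def last_heartbeat(lines, year):
    """Return the time of the newest heartbeat line, or None if there is none."""
    stamps = [parse_stamp(line, year) for line in lines if HEARTBEAT in line]
    return max(stamps) if stamps else None


def is_stale(last_seen, now, max_silence=timedelta(minutes=10)):
    """Treat a worker as stale when it has been silent longer than max_silence."""
    # max_silence is a hypothetical threshold, not an openQA setting.
    return last_seen is None or now - last_seen > max_silence
```

Applied to the log above, the last heartbeat is at Jun 01 22:34:32, so a check of this kind run at the time the worker was stopped (Jun 04 08:34:09) would flag it as stale.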

workaround

Restart the worker.


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #18164: [devops][tools] monitoring of openqa worker instances (Resolved, nicksinger, 2018-04-25)
