Project

General

Profile

Actions

action #19564

closed

[tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked?

Added by okurz almost 7 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
Regressions/Crashes
Target version:
-
Start date:
2017-06-04
Due date:
% Done:

0%

Estimated time:

Description

observation

No jobs are executed on worker class "zkvm-image", i.e. openqaworker2:7.

journal output on that worker:

Jun 01 22:32:13 openqaworker2 worker[1662]: [INFO] OpenQA::Worker::Cache: Initialized with http://openqa.suse.de at /var/lib/openqa/cache, current size is 47971230641
Jun 01 22:32:13 openqaworker2 worker[1662]: [DEBUG] Found HDD_1, caching sle-12-SP1-Server-DVD-s390x-sdk+allpatterns.qcow2
Jun 01 22:32:13 openqaworker2 worker[1662]: [DEBUG] CACHE: Aquiring lock for /var/lib/openqa/cache/sle-12-SP1-Server-DVD-s390x-sdk+allpatterns.qcow2 in the database
Jun 01 22:32:19 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:32:39 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:34:32 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:35:02 openqaworker2 worker[1662]: DBD::SQLite::st execute failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 212.
Jun 01 22:35:02 openqaworker2 worker[1662]: [ERROR] toggle_asset_lock: Rolling back DBD::SQLite::st execute failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 212.
Jun 01 22:35:02 openqaworker2 worker[1662]: rollback ineffective with AutoCommit enabled at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 216.
Jun 01 22:35:07 openqaworker2 worker[1662]: [WARN] job is missing files, releasing job
Jun 01 22:35:07 openqaworker2 worker[1662]: [ERROR] 404 response: Not Found (remaining tries: 0)
Jun 01 22:35:07 openqaworker2 worker[1662]: [ERROR] ERROR autoinst-log.txt: 404 response: Not Found
Jun 01 22:35:07 openqaworker2 worker[1662]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Jun 01 22:35:07 openqaworker2 worker[1662]: [INFO] cleaning up 00975813-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp1_sdk+allpatterns_full_update_by_yast_s390x@zkvm-image
Jun 01 22:35:07 openqaworker2 worker[1662]: [INFO] got job 975814: 00975814-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp2_allpatterns_full_update_by_zypper_s390x@zkvm-image
Jun 01 22:35:37 openqaworker2 worker[1662]: DBD::SQLite::db prepare failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 345.
Jun 01 22:35:37 openqaworker2 worker[1662]: Mojo::Reactor::Poll: Timer failed: DBD::SQLite::db prepare failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 345.
lines 243531-243591/243591 (END)

reproducible

TBD

problem

Is the locked database a symptom or the problem causing the worker to be stuck?

This is the output when stopping the worker

Jun 04 08:34:09 openqaworker2 systemd[1]: Stopping openQA Worker #7...
Jun 04 08:34:09 openqaworker2 worker[1662]: [ERROR] 404 response: Not Found (remaining tries: 0)
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] registering worker with openQA http://openqa.suse.de...
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] quit due to signal TERM
Jun 04 08:34:09 openqaworker2 worker[1662]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Jun 04 08:34:09 openqaworker2 worker[1662]: [DEBUG] duplicating job 975814
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] cleaning up 00975814-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp2_allpatterns_full_update_by_zypper_s390x@zkvm-image
Jun 04 08:34:09 openqaworker2 systemd[1]: Stopped openQA Worker #7.

Why the "[ERROR] 404 response: Not Found" when trying to stop the worker?

When there is no output of "[DEBUG] Update status so job is not considered dead." in the logfile for three days shouldn't the worker be considered dead by the scheduler?

workaround

Restart the worker


Related issues 1 (0 open1 closed)

Related to openQA Infrastructure - action #18164: [devops][tools] monitoring of openqa worker instancesResolvednicksinger2018-04-25

Actions
Actions #1

Updated by szarate almost 7 years ago

  • Assignee set to szarate
  • Priority changed from High to Normal
  • Target version set to Milestone 8

This is a situation I was talking on this commit, i will take a look ai this after RC unless this becomes a bigger problem.

Actions #2

Updated by szarate almost 7 years ago

  • Status changed from New to Feedback

This issue is addressed here, but it is still a race condition, I just tried to reduce the probability of this happening.

Actions #3

Updated by szarate almost 7 years ago

I believe this is solved now... I found #20396 along the way too, but it's a cousin of this problem.

Actions #4

Updated by okurz almost 7 years ago

  • Status changed from Feedback to Resolved

I see the worker as being active. I did not manually need to restart it or something in the past weeks. Also I did not find any cronjobs that would restart the worker as a workaround so: Yes, I guess it's resolved.

Actions #5

Updated by okurz almost 7 years ago

  • Status changed from Resolved to In Progress

I was wrong, the original report is not really done. I am pretty sure the original issue "worker reports as online but not doing anything" persists.

Actions #6

Updated by okurz almost 7 years ago

  • Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Actions #7

Updated by szarate almost 7 years ago

  • Assignee changed from szarate to okurz
  • Target version deleted (Milestone 8)

The part where the worker is not doing anything, related to Mojo::Reactor::Poll failing (Because something inside the loop died), can be handled through logwarn and might be useful to have it, because this might not be the only situation where this can happen (But it's abnormal enough...)

@okurz, care to take a look at it? I believe you will be faster here :)

However, this should be properly taken care of in the openQA side of things in the future.

Actions #8

Updated by okurz over 6 years ago

  • Assignee deleted (okurz)

I agree with your last statement that the original problem is not solved: The worker should be considered dead if it does not respond to pending jobs within three days. As there were no related changes within the job monitoring or scheduling code this is not done yet.

Actions #9

Updated by coolo over 6 years ago

  • Status changed from In Progress to Closed

Ettore redid the whole worker status thing completely, so closing this as old story

Actions

Also available in: Atom PDF