action #19564: [tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked? - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #19564

closed

[tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked?

Added by okurz over 7 years ago. Updated over 7 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Category:

Regressions/Crashes

Target version:

Start date:

2017-06-04

Due date:

% Done:

Estimated time:

Description

observation¶

No jobs are executed on worker class "zkvm-image", i.e. openqaworker2:7.

journal output on that worker:

Jun 01 22:32:13 openqaworker2 worker[1662]: [INFO] OpenQA::Worker::Cache: Initialized with http://openqa.suse.de at /var/lib/openqa/cache, current size is 47971230641
Jun 01 22:32:13 openqaworker2 worker[1662]: [DEBUG] Found HDD_1, caching sle-12-SP1-Server-DVD-s390x-sdk+allpatterns.qcow2
Jun 01 22:32:13 openqaworker2 worker[1662]: [DEBUG] CACHE: Aquiring lock for /var/lib/openqa/cache/sle-12-SP1-Server-DVD-s390x-sdk+allpatterns.qcow2 in the database
Jun 01 22:32:19 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:32:39 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:34:32 openqaworker2 worker[1662]: [DEBUG] Update status so job is not considered dead.
Jun 01 22:35:02 openqaworker2 worker[1662]: DBD::SQLite::st execute failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 212.
Jun 01 22:35:02 openqaworker2 worker[1662]: [ERROR] toggle_asset_lock: Rolling back DBD::SQLite::st execute failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 212.
Jun 01 22:35:02 openqaworker2 worker[1662]: rollback ineffective with AutoCommit enabled at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 216.
Jun 01 22:35:07 openqaworker2 worker[1662]: [WARN] job is missing files, releasing job
Jun 01 22:35:07 openqaworker2 worker[1662]: [ERROR] 404 response: Not Found (remaining tries: 0)
Jun 01 22:35:07 openqaworker2 worker[1662]: [ERROR] ERROR autoinst-log.txt: 404 response: Not Found
Jun 01 22:35:07 openqaworker2 worker[1662]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Jun 01 22:35:07 openqaworker2 worker[1662]: [INFO] cleaning up 00975813-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp1_sdk+allpatterns_full_update_by_yast_s390x@zkvm-image
Jun 01 22:35:07 openqaworker2 worker[1662]: [INFO] got job 975814: 00975814-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp2_allpatterns_full_update_by_zypper_s390x@zkvm-image
Jun 01 22:35:37 openqaworker2 worker[1662]: DBD::SQLite::db prepare failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 345.
Jun 01 22:35:37 openqaworker2 worker[1662]: Mojo::Reactor::Poll: Timer failed: DBD::SQLite::db prepare failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache.pm line 345.
lines 243531-243591/243591 (END)

reproducible¶

TBD

problem¶

Is the locked database a symptom or the problem causing the worker to be stuck?

This is the output when stopping the worker

Jun 04 08:34:09 openqaworker2 systemd[1]: Stopping openQA Worker #7...
Jun 04 08:34:09 openqaworker2 worker[1662]: [ERROR] 404 response: Not Found (remaining tries: 0)
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] registering worker with openQA http://openqa.suse.de...
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] quit due to signal TERM
Jun 04 08:34:09 openqaworker2 worker[1662]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: api-failure)
Jun 04 08:34:09 openqaworker2 worker[1662]: [DEBUG] duplicating job 975814
Jun 04 08:34:09 openqaworker2 worker[1662]: [INFO] cleaning up 00975814-sle-12-SP3-Server-DVD-s390x-Build0409-om_proxyscc_sles12sp2_allpatterns_full_update_by_zypper_s390x@zkvm-image
Jun 04 08:34:09 openqaworker2 systemd[1]: Stopped openQA Worker #7.

Why the "[ERROR] 404 response: Not Found" when trying to stop the worker?

When there is no output of "[DEBUG] Update status so job is not considered dead." in the logfile for three days shouldn't the worker be considered dead by the scheduler?

workaround¶

Restart the worker

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by szarate over 7 years ago

Assignee set to szarate
Priority changed from High to Normal
Target version set to Milestone 8

This is a situation I was talking on this commit, i will take a look ai this after RC unless this becomes a bigger problem.

Actions

Copy link

Updated by szarate over 7 years ago

Status changed from New to Feedback

This issue is addressed here, but it is still a race condition, I just tried to reduce the probability of this happening.

Actions

Copy link

Updated by szarate over 7 years ago

I believe this is solved now... I found #20396 along the way too, but it's a cousin of this problem.

Actions

Copy link

Updated by okurz over 7 years ago

Status changed from Feedback to Resolved

I see the worker as being active. I did not manually need to restart it or something in the past weeks. Also I did not find any cronjobs that would restart the worker as a workaround so: Yes, I guess it's resolved.

Actions

Copy link

Updated by okurz over 7 years ago

Status changed from Resolved to In Progress

I was wrong, the original report is not really done. I am pretty sure the original issue "worker reports as online but not doing anything" persists.

Actions

Copy link

Updated by okurz over 7 years ago

Related to action #18164: [devops][tools] monitoring of openqa worker instances added

Actions

Copy link

Updated by szarate over 7 years ago

Assignee changed from szarate to okurz
Target version deleted (~~Milestone 8~~)

The part where the worker is not doing anything, related to Mojo::Reactor::Poll failing (Because something inside the loop died), can be handled through logwarn and might be useful to have it, because this might not be the only situation where this can happen (But it's abnormal enough...)

@okurz, care to take a look at it? I believe you will be faster here :)

However, this should be properly taken care of in the openQA side of things in the future.

Actions

Copy link

Updated by okurz over 7 years ago

Assignee deleted (~~okurz~~)

I agree with your last statement that the original problem is not solved: The worker should be considered dead if it does not respond to pending jobs within three days. As there were no related changes within the job monitoring or scheduling code this is not done yet.

Actions

Copy link

Updated by coolo over 7 years ago

Status changed from In Progress to Closed

Ettore redid the whole worker status thing completely, so closing this as old story

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #19564

[tools]worker is unresponsive for three days but reports as online to the webui because of cache database locked?

observation¶

reproducible¶

problem¶

workaround¶

Updated by szarate over 7 years ago

Updated by szarate over 7 years ago

Updated by szarate over 7 years ago

Updated by okurz over 7 years ago

Updated by okurz over 7 years ago

Updated by okurz over 7 years ago

Updated by szarate over 7 years ago

Updated by okurz over 7 years ago

Updated by coolo over 7 years ago