action #42986
parallel jobs not reliably killed/restarted (closed)
Description
https://openqa.suse.de/tests/2219200#dependencies
rsyslog-client (https://openqa.suse.de/tests/2216925) was cancelled, but restarted as https://openqa.suse.de/tests/2219198, which failed
rsyslog-server (https://openqa.suse.de/tests/2216935) failed, but was restarted as https://openqa.suse.de/tests/2219199, which was cancelled/skipped (it got skipped before 2216947 was cancelled manually)
rsyslog-controller (https://openqa.suse.de/tests/2216947) was cancelled, but restarted as https://openqa.suse.de/tests/2219200, which stayed scheduled (it is cancelled right now because @coolo cancelled it manually)
I assume this is due to the scheduler and/or cloning not working reliably within transactions, but it will be a puzzle to sort out.
Updated by coolo about 6 years ago
There were 7 scheduled cluster jobs in total, and all were 2 days old. So it's possible that some worker restarted and created this mess.
Updated by mkittler almost 6 years ago
- Assignee set to mkittler
Right now we have jobs which have been in the scheduled state for months, e.g. https://openqa.suse.de/tests/2278614. Some scheduled jobs have even been cloned, e.g. https://openqa.suse.de/tests/2307004.
I would not delete those jobs from OSD right now because they might be helpful for understanding the cause.
Updated by mkittler almost 6 years ago
- Description updated (diff)
- Status changed from New to In Progress
- Target version changed from Ready to Current Sprint
It seems like https://openqa.suse.de/tests/2216947 has been restarted manually by a user, which also caused the other restarts in the cluster:
openqa=# select event, event_data, users.nickname from audit_events join users on user_id = users.id where event_data like '%2216947%';
event | event_data | nickname
-------------+-----------------------------------------------------------------------------------+----------------
job_create | {"id":2216947} | sle-qam-openqa
job_restart | {"result":{"2216947":2219200,"2216925":2219198,"2216935":2219199},"id":"2216947"} | osukup
(2 rows)
So the fact that the restarts happened is not a bug. However, the jobs stuck in the scheduled state which we still see on OSD right now are. I am not sure why 2219199 has been cancelled.
Updated by mkittler almost 6 years ago
Some of the jobs which are currently stuck in the scheduled state are not part of the problem. For instance, https://openqa.suse.de/tests/2300326#dependencies is simply stuck because it requires the worker class 64bit-mlx_con5, but only workers with the class 64bit-mlx_con5_DISABLED are available on OSD. It seems that somebody intentionally postponed the execution of these jobs.
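For cross-checking such cases, a small ad-hoc script can list scheduled jobs whose requested worker class no registered worker offers. This is only a rough diagnostic sketch, not part of openQA: the job_settings/worker_properties layout and the plain equality match are assumptions (worker classes can also be comma-separated lists), and the connection parameters need adapting.

#!/usr/bin/env perl
# Rough diagnostic sketch, not openQA code: list scheduled jobs whose
# WORKER_CLASS setting no registered worker offers. Assumes the usual
# job_settings/worker_properties tables and ignores comma-separated
# multi-class values.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=openqa', '', '', {RaiseError => 1});
my $rows = $dbh->selectall_arrayref(q{
    SELECT j.id, s.value AS worker_class
      FROM jobs j
      JOIN job_settings s ON s.job_id = j.id AND s.key = 'WORKER_CLASS'
     WHERE j.state = 'scheduled'
       AND NOT EXISTS (SELECT 1 FROM worker_properties p
                        WHERE p.key = 'WORKER_CLASS' AND p.value = s.value)
     ORDER BY j.id
});
printf "job %d waits for worker class %s\n", @$_ for @$rows;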
Updated by mkittler almost 6 years ago
The problem that one job in the parallel cluster is stuck in the scheduled state (or gets cancelled/skipped) can also be observed without a previous restart. So the problem likely lies in the scheduling and not in the restart. For instance, slenkins-tests-git-control@64bit
is one of those tests which sometimes get stuck in the scheduled state (https://openqa.suse.de/tests/2353721, https://openqa.suse.de/tests/2351382, https://openqa.suse.de/tests/2349564). But it is a sporadic issue; the test also gets executed quite often and then passes (https://openqa.suse.de/tests/2357039). This speaks for a bug in the scheduler causing clusters to be only partially scheduled.
Updated by mkittler almost 6 years ago
Theory: The job which is stuck in the scheduled state is excluded by OpenQA::Scheduler::Scheduler::update_scheduled_jobs because a Gru task it depends on is still running. That function calls $job->cluster_jobs to compute the cluster considering all jobs. But if I read the code correctly, it will in the end nevertheless only consider those jobs without a Gru task.
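To make the theory concrete, here is a minimal standalone sketch. The data and names are made up for illustration; it is not the actual scheduler code, just the suspected pattern of filtering Gru-blocked jobs before the cluster is allocated as a whole:

#!/usr/bin/env perl
# Minimal standalone sketch of the suspected flaw; hypothetical data, not openQA code.
use strict;
use warnings;

# three jobs forming one parallel cluster; job 2 still waits for a Gru task
my @cluster = (
    {id => 1, gru_pending => 0},
    {id => 2, gru_pending => 1},   # e.g. an asset download that is still running
    {id => 3, gru_pending => 0},
);

# update_scheduled_jobs-like step: jobs with pending Gru tasks are dropped
my @schedulable = grep { !$_->{gru_pending} } @cluster;

# allocation then only sees the filtered list, although the cluster should
# only ever be started (or held back) as a whole
if (@schedulable == @cluster) {
    print "whole cluster allocated\n";
}
else {
    my @left = map { $_->{id} } grep { $_->{gru_pending} } @cluster;
    print "cluster allocated partially, job(s) @left stay scheduled\n";
}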
Updated by coolo almost 6 years ago
- Assignee changed from mkittler to coolo
I killed all old jobs and increased the debug level - and indeed we caught another one:
https://openqa.suse.de/tests/2364271 failed and set https://openqa.suse.de/tests/2364272 to parallel_failed, but https://openqa.suse.de/tests/2364326 stayed scheduled.
Updated by coolo almost 6 years ago
[2019-01-10T02:11:14.0502 CET] [debug] [pid:7423] Allocated: { job => 2364271, worker => 575 }
[2019-01-10T02:11:14.0501 CET] [debug] [pid:7423] Allocated: { job => 2364272, worker => 595 }
[2019-01-10T02:11:14.0504 CET] [debug] [pid:7423] Allocated: { job => 2364326, worker => 896 }
So that part still worked
Updated by coolo almost 6 years ago
But https://openqa.suse.de/admin/workers/896 reported the last job 29 days ago - it looks broken.
Updated by coolo almost 6 years ago
That worker host has been in trouble for a while, it seems:
Jan 10 07:25:16 openqaworker3 worker[21681]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: 500. Apache modules proxy_wstunnel and rewrite enabled?
Jan 10 07:25:17 openqaworker3 worker[21679]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: 500. Apache modules proxy_wstunnel and rewrite enabled?
Updated by coolo almost 6 years ago
So the next thing that happened was:
[2019-01-10T02:11:12.0381 CET] [debug] [pid:4246] message sent to 896 for job 2364326
[2019-01-10T02:11:54.0096 CET] [debug] [pid:4246] Found Job(2364326) in DB from worker_status update sent by Worker(896)
[2019-01-10T02:11:54.0127 CET] [debug] [pid:4246] Job '2364326' reset to 'scheduled' state
and then it was assigned to 633, which also returned the favor:
[2019-01-10T02:15:39.0730 CET] [debug] [pid:4246] message sent to 633 for job 2364326
[2019-01-10T02:16:50.0150 CET] [debug] [pid:4246] Found Job(2364326) in DB from worker_status update sent by Worker(633)
[2019-01-10T02:16:50.0176 CET] [debug] [pid:4246] Job '2364326' reset to 'scheduled' state
So the job was 'assigned' twice - and the parallel_failed should have happened here:
[2019-01-10T02:16:20.0219 CET] [debug] POST "/api/v1/jobs/2364271/set_done" (9e769989)
At that time the job was assigned to 633, and once 633 gave up, the job was 'reset to scheduled'. That's what I suspected from the beginning.
Updated by coolo almost 6 years ago
The problem is that there are far more bugs in play here:
Jan 10 02:16:20 openqaworker3 worker[21693]: [warn] Received command cancel for different job id 2364326 (our 2363137). Ignoring!
Why did worker3:10 get 326 assigned if it wasn't free? Why is worker3:1 dead for a month without us noticing, while worker3:10 is fine? Why did the scheduler not realize the cluster is incomplete?
Updated by mkravec almost 6 years ago
Jan 10 07:25:16 openqaworker3 worker[21681]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: 500. Apache modules proxy_wstunnel and rewrite enabled?
I noticed those messages on my remote worker too. The worker can't reconnect ("Unable to upgrade connection...") after "systemctl restart openqa-worker@1" until I reboot the server running the web UI.
No idea what the cause is.
Updated by coolo almost 6 years ago
The full story is even more complex:
[2019-01-10T02:11:54.0080 CET] [debug] [pid:4246] Received from worker "896" worker_status message "$VAR1 = {
[2019-01-10T02:11:54.0083 CET] [debug] [pid:4246] Updated worker seen from worker_status
[2019-01-10T02:11:54.0096 CET] [debug] [pid:4246] Found Job(2364326) in DB from worker_status update sent by Worker(896)
[2019-01-10T02:11:54.0096 CET] [debug] [pid:4246] Received request has id: 2326956
[2019-01-10T02:11:54.0110 CET] [debug] [pid:4246] Worker(896) for Job(2364326) has token u20HTTwQ1lLTrpi2
[2019-01-10T02:11:54.0110 CET] [debug] [pid:4246] Received request has token: BWDiHETB9O3uLSLD
[2019-01-10T02:11:54.0112 CET] [debug] [pid:4246] Possibly worker 896 should be freed.
[2019-01-10T02:11:54.0127 CET] [debug] [pid:4246] Job '2364326' reset to 'scheduled' state
So the worker said it's working on 2326956, the websocket server said 'WTF, you should work on 2364326' and cancelled the job, and the worker said 'hey, why? I work on 2363137'.
Very confused workers :(
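Spelled out, the check that produces these lines presumably boils down to a token/job-ID comparison. The following is a hypothetical illustration using the values from the log above; names and structure are invented, it is not the websocket server's actual code:

#!/usr/bin/env perl
# Hypothetical illustration of the token mismatch, not openQA's actual code.
use strict;
use warnings;

# what the database believes about worker 896
my %db_view = (worker_id => 896, job_id => 2364326, token => 'u20HTTwQ1lLTrpi2');

# what the worker reports in its worker_status message
my %reported = (job_id => 2326956, token => 'BWDiHETB9O3uLSLD');

if ($reported{token} ne $db_view{token} || $reported{job_id} != $db_view{job_id}) {
    # the worker is clearly not running what the web UI thinks it runs,
    # so the worker gets freed and the job goes back to the scheduler
    print "Possibly worker $db_view{worker_id} should be freed.\n";
    print "Job '$db_view{job_id}' reset to 'scheduled' state\n";
}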
Updated by coolo almost 6 years ago
- Status changed from In Progress to Feedback
This is potentially the fix, but I lack the time to verify it with test cases (as several components are involved).
Updated by coolo almost 6 years ago
I thought really hard on the cycling tour home about how to avoid this mess - our options are really limited, but I guess we should try :(
Updated by coolo almost 6 years ago
- Related to coordination #47117: [epic] Fix worker->websocket->scheduler->webui connection added