action #41048
job_networks not reliably deleted
Description
Done jobs are not supposed to have a job_networks entry after ac9f540ef945520ae209e0298f3a37487ec71eb9,
so this needs to be looked at:
openqa=> delete from job_networks where job_id in (select id from jobs where id in (select job_id from job_networks) and state='done');
DELETE 348
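Before running a cleanup like the one above, it can help to count the affected rows first. The following is a minimal sketch, not openQA code: it uses an in-memory SQLite database in place of PostgreSQL, and a cut-down schema that assumes only the columns the query above relies on (`jobs.id`, `jobs.state`, `job_networks.job_id`) plus an illustrative `name` column. The helper names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the two tables (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT NOT NULL);
        CREATE TABLE job_networks (job_id INTEGER NOT NULL, name TEXT NOT NULL);
        INSERT INTO jobs VALUES (1, 'done'), (2, 'running'), (3, 'done');
        INSERT INTO job_networks VALUES (1, 'fixed'), (2, 'fixed'), (3, 'fixed');
        """
    )
    return conn


def count_stale_networks(conn):
    """Count job_networks rows that belong to a job already in state 'done'."""
    row = conn.execute(
        "SELECT COUNT(*) FROM job_networks jn "
        "JOIN jobs j ON j.id = jn.job_id "
        "WHERE j.state = 'done'"
    ).fetchone()
    return row[0]


def delete_stale_networks(conn):
    """Run the same DELETE as in the description; return the rows removed."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "DELETE FROM job_networks WHERE job_id IN "
            "(SELECT id FROM jobs WHERE id IN "
            "(SELECT job_id FROM job_networks) AND state = 'done')"
        )
    return cur.rowcount
```

With the demo data, `count_stale_networks` reports the two rows attached to done jobs, and the count drops to zero after `delete_stale_networks`. The row of the running job is left alone.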
Updated by coolo about 6 years ago
- Target version changed from Ready to Current Sprint
- Difficulty set to medium
Updated by coolo about 6 years ago
- Assignee set to coolo
Looking into this. All the problematic jobs right now are incompletes.
Updated by coolo about 6 years ago
The last chunk happened during restart of the webui:
Sep 17 11:48:59 openqaworker3 worker[28323]: [info] 30572: WORKING 2061750
Sep 17 11:49:02 openqaworker3 qemu-system-x86_64[30591]: looking for plugins in '/usr/lib64/sasl2', failed to open directory, error: No such file or directory
Sep 17 11:58:45 openqaworker3 worker[28323]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. Apache modules proxy_wstunnel and rewrite enabled?
Sep 17 11:58:49 openqaworker3 worker[28323]: [error] Connection error: Connection refused (remaining tries: 2)
Sep 17 11:58:54 openqaworker3 worker[28323]: [error] Connection error: Connection refused (remaining tries: 1)
Sep 17 11:58:55 openqaworker3 worker[28323]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. Apache modules proxy_wstunnel and rewrite enabled?
Sep 17 11:58:59 openqaworker3 worker[28323]: [error] Connection error: Connection refused (remaining tries: 0)
Sep 17 11:58:59 openqaworker3 worker[28323]: [error] Job aborted because web UI doesn't accept updates anymore (likely considers this job dead)
Sep 17 11:59:00 openqaworker3 worker[28323]: [error] Connection error: Connection refused (remaining tries: 2)
Sep 17 11:59:05 openqaworker3 worker[28323]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: [no code]. Apache modules proxy_wstunnel and rewrite enabled?
Sep 17 11:59:05 openqaworker3 worker[28323]: [error] Connection error: Connection refused (remaining tries: 1)
Sep 17 11:59:09 openqaworker3 worker[28323]: [info] registering worker openqaworker3 version 13 with openQA openqa.suse.de using protocol version [1]
Sep 17 11:59:09 openqaworker3 worker[28323]: [error] unable to connect to host openqa.suse.de, retry in 10s
Sep 17 11:59:10 openqaworker3 worker[28323]: [error] Connection error: Connection refused (remaining tries: 0)
So the worker didn't even abort itself; this must be dead job detection then.
Updated by coolo about 6 years ago
If it were, we should see the log_warning 'Dead job...', and we don't.
Plus, everything that sets the result to incomplete uses the done() call, which deletes the networks. Puzzling.
Updated by coolo about 6 years ago
- Subject changed from job_networks not reliabled deleted to job_networks not reliably deleted
[2018-09-17T17:56:23.0342 CEST] [debug] [pid:20240] enqueuing abort for 2065142 621
[2018-09-17T17:56:30.0279 CEST] [debug] [DBIx debug] Took 0.00421119 seconds executed: UPDATE job_modules SET result = ? WHERE ( ( job_id = ? AND result = ? ) ): 'none', '2065142', 'running'.
[2018-09-17T17:56:30.0282 CEST] [debug] [DBIx debug] Took 0.00050807 seconds executed: UPDATE jobs SET result = ?, state = ?, t_finished = ?, t_updated = ? WHERE ( id = ? ): 'incomplete', 'done', '2018-09-17 15:56:30', '2018-09-17 15:56:30', '2065142'.
No mention of job_networks.
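The log excerpt shows the job row being set to done/incomplete with no statement touching the networks table. A toy reproduction of that failure mode, under loud assumptions: SQLite instead of PostgreSQL, a simplified schema, and hypothetical Python helpers `done()` and `raw_update()` that only mimic the behaviour described in this ticket. This is not openQA's Perl implementation and says nothing about what the eventual fix changed.

```python
import sqlite3


def setup():
    """Two running jobs, each holding one network entry (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY,
            state TEXT NOT NULL,
            result TEXT NOT NULL DEFAULT 'none'
        );
        CREATE TABLE job_networks (job_id INTEGER NOT NULL, name TEXT NOT NULL);
        INSERT INTO jobs (id, state) VALUES (1, 'running'), (2, 'running');
        INSERT INTO job_networks VALUES (1, 'fixed'), (2, 'fixed');
        """
    )
    return conn


def done(conn, job_id, result):
    """Hypothetical helper: finish the job and drop its networks together."""
    with conn:  # both statements commit or roll back as one transaction
        conn.execute(
            "UPDATE jobs SET state = 'done', result = ? WHERE id = ?",
            (result, job_id),
        )
        conn.execute("DELETE FROM job_networks WHERE job_id = ?", (job_id,))


def raw_update(conn, job_id, result):
    """Code path that bypasses done(): only the jobs row is updated."""
    with conn:
        conn.execute(
            "UPDATE jobs SET state = 'done', result = ? WHERE id = ?",
            (result, job_id),
        )


def networks_left(conn, job_id):
    """Number of job_networks rows still attached to the given job."""
    return conn.execute(
        "SELECT COUNT(*) FROM job_networks WHERE job_id = ?", (job_id,)
    ).fetchone()[0]
```

Calling `raw_update(conn, 1, 'incomplete')` leaves the network row of job 1 in place, while `done(conn, 2, 'incomplete')` removes the row of job 2. Any code path that writes `state = 'done'` without going through the cleanup would produce exactly the kind of leftover rows the DELETE in the description had to remove.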
Updated by coolo about 6 years ago
- Status changed from New to Resolved
Found it (and some more): https://github.com/os-autoinst/openQA/pull/1795
Updated by szarate about 6 years ago
- Target version changed from Current Sprint to Done