action #178126
Status: closed
OpenQA logreport for ariel.suse-dmz.opensuse.org: Could not find job <job_id> in database
Description
Motivation
Got this error on o3:
[2025-02-28T07:31:44.432680Z] [error] [jFIITmhFKHPf] Could not find job '4889617' in database at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm line 91.
Background
As part of #169939, several openQA jobs were cloned for worker stability and sanity testing using
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/<job_id> _GROUP=0 BUILD+=-gpathak WORKER_CLASS=qa-power8-3,qemu_ppc64,qemu_ppc64le,tap,heavyload
Possibly, while issuing these commands, _GROUP=0 was omitted in one of them.
Later, as part of cleaning up the jobs with -gpathak in their names, the following command was used to fetch the ids of all matching jobs:
job_ids=$(openqa-cli api --o3 /jobs | jq -r '.jobs[] | select(.name | test("gpathak")) | .id')
After that, for j in $job_ids; do openqa-cli api -X DELETE /jobs/$j --o3; done was used to delete all of these jobs.
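For illustration, the jq filter used above can be exercised against a small hand-made payload (the sample JSON below is made up and merely mimics the shape of the /jobs response; only the filter itself matches the ticket):

```shell
# Sample payload standing in for the /jobs API response; the filter selects
# the ids of jobs whose name contains "gpathak", as in the cleanup command.
sample='{"jobs":[{"id":100,"name":"opensuse-foo"},{"id":200,"name":"opensuse-foo-gpathak"}]}'
echo "$sample" | jq -r '.jobs[] | select(.name | test("gpathak")) | .id'
# prints: 200
```

The resulting id list can then be fed into the DELETE loop exactly as in the ticket.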
Updated by gpathak about 1 month ago
- Tags changed from bug, openQA to bug, openQA, alert
- Subject changed from O3: Could not find job <job_id> in database to OpenQA logreport for ariel.suse-dmz.opensuse.org: Could not find job <job_id> in database
Updated by mkittler about 1 month ago
- Status changed from New to In Progress
- Assignee set to mkittler
Updated by openqa_review about 1 month ago
- Due date set to 2025-03-18
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 30 days ago · Edited
We let openQA emit the event only after deleting the job - so this is generally supposed to work¹.
I tried to reproduce this by starting the web UI locally in prefork mode and running the following commands:
openqa-cli api --host http://localhost:9526 -X DELETE jobs/3493 & openqa-cli api --host http://localhost:9526 -X DELETE jobs/3493
In theory this can lead to the "Could not find job … in database" error depending on timing. In practice I wasn't able to reproduce the error. I was able to reproduce a different error, though:
cannot remove directory for /hdd/openqa-devel/openqa/testresults/00003/00003493-opensuse-Tumbleweed-kiwi-test-disk-efi-x86_64-Build20230124-kiwi_disk_image_test_efi@64bit/.thumbs: No such file or directory at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/Schema/Result/Jobs.pm line 245.
cannot remove directory for /hdd/openqa-devel/openqa/testresults/00003/00003493-opensuse-Tumbleweed-kiwi-test-disk-efi-x86_64-Build20230124-kiwi_disk_image_test_efi@64bit/ulogs: No such file or directory at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/Schema/Result/Jobs.pm line 245.
[trace] [iyC6aOR4dYq2] 200 OK (0.207646s, 4.816/s)
cannot remove directory for /hdd/openqa-devel/openqa/testresults/00003/00003493-opensuse-Tumbleweed-kiwi-test-disk-efi-x86_64-Build20230124-kiwi_disk_image_test_efi@64bit: No such file or directory at /hdd/openqa-devel/repos/openQA/script/../lib/OpenQA/Schema/Result/Jobs.pm line 245.
This is because only one of the concurrent API calls will be able to remove these directories.
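This directory-removal race can be illustrated without openQA at all; here is a minimal shell sketch (the paths are made up via mktemp, and rmdir stands in for the result-directory cleanup):

```shell
# Toy illustration: two concurrent removals of the same directory. The rmdir
# syscall is atomic, so exactly one call succeeds; the other fails with
# "No such file or directory", mirroring the errors quoted above.
dir=$(mktemp -d)
target="$dir/testresults"
mkdir "$target"
rmdir "$target" & rmdir "$target" &
wait
[ -d "$target" ] || echo "directory removed exactly once"
rmdir "$dir"
```

Which of the two calls loses the race is timing-dependent, which is also why such errors only show up sporadically.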
So if we consider concurrent API calls, then there is probably more to fix than just the error mentioned in the ticket description. Otherwise I'm not sure why this error would occur at all. Had the error occurred due to the API calls mentioned in the ticket description, those API calls would have failed, e.g.:
script/openqa-cli api --host http://localhost:9526 -X DELETE jobs/3092
500 Internal Server Error
{"error_status":500}
If that had been the case, it would have been mentioned, right?
We only ever delete jobs when:
- An ISO is explicitly deleted via the API.
- A single job is explicitly deleted via the API.
- As part of the cleanup when a job "has expired".
That means there must have been a race condition between two of those paths: e.g. one API call is invoked while the job still exists and finds it (so it does not return a 404 response), then the job is deleted by another API call or the cleanup, and finally the first API call runs into the problem when emitting the event.
Of course the job might also be deleted while some other event for it is still being emitted. Maybe that is the more likely case here.
I would say we can simply avoid the die and return early instead. (This still leaves the error messages about the result directory deletion in case two deletions really do happen in parallel.)