action #70774
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert
save_needle Minion tasks fail frequently
Description
Observation
The save_needle Minion task fails frequently on OSD and also sometimes on o3.
This can be observed using the following query parameters: https://openqa.suse.de/minion/jobs?state=failed&offset=0&task=save_needle
I'm going to remove most of these jobs to calm down the alert, but right now 24 jobs have piled up over 2 months. The problem has actually existed for longer than 2 months, but the failures have been cleaned up manually so far.
The problem here is always that the Git working tree is in a state that the task cannot handle:
1.
"result" => {
"error" => "<strong>Failed to save addon_products-module-dev-tools-pvm-20200805.</strong><br><pre>Unable to commit via Git: On branch master\nYour branch is up to date with 'origin/master'.\n\nnothing to commit, working tree clean</pre>"
},
2.
"result" => {
"error" => "<strong>Failed to save manually_add_profile-AppArmor-Chose-a-program-to-generate-a-profile-20200827.</strong><br><pre>Unable to reset repository to origin/master: error: cannot rebase: Your index contains uncommitted changes.\nerror: Please commit or stash them.</pre>"
},
Suggestions
It would be useful if the task could handle the problematic situations itself instead of requiring manual intervention. Note that the delete_needle task (which shares the same Git code) is also affected. We likely see fewer problems there because that task is not executed as often.
Problematic situations
- No diff has been produced that could be committed: This is probably just the case where there is no actual change, so the task could simply return early.
- The Git directory contains uncommitted changes: We could save these changes on a new branch before rebasing.
- We cannot push the new commit because new commits have been pushed to the remote from elsewhere in the meantime: Just repeat the procedure (fetch, rebase, push again).
- The fetch needles script is interfering.
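The first three situations could be handled roughly as in the following sketch. It is written in Python only for illustration and is not the openQA implementation (which is Perl). The function names `commit_needle_changes`, `make_git_runner` and `_check`, the `backup-<timestamp>` branch name and the `max_attempts` parameter are all hypothetical. The remote `origin` and branch `master` are taken from the error messages above. Interference from the fetch needles script is not covered; that would likely need some form of locking.

```python
import subprocess
import time
from typing import Callable, List, Tuple

# A runner takes git arguments and returns (exit_code, combined output).
GitRunner = Callable[[List[str]], Tuple[int, str]]


def make_git_runner(repo_dir: str) -> GitRunner:
    """Return a runner that executes git inside repo_dir (hypothetical helper)."""
    def run(args: List[str]) -> Tuple[int, str]:
        proc = subprocess.run(["git", "-C", repo_dir, *args],
                              capture_output=True, text=True)
        return proc.returncode, proc.stdout + proc.stderr
    return run


def _check(git: GitRunner, args: List[str]) -> str:
    """Run a git command and raise if it exits with a non-zero status."""
    code, out = git(args)
    if code != 0:
        raise RuntimeError(f"git {' '.join(args)} failed: {out}")
    return out


def commit_needle_changes(git: GitRunner,
                          write_files: Callable[[], List[str]],
                          message: str,
                          max_attempts: int = 3) -> str:
    """Return 'unchanged' or 'pushed'; raise RuntimeError on other failures."""
    # Uncommitted changes: park them on a new branch so the rebase can proceed.
    _, status = git(["status", "--porcelain"])
    if status.strip():
        backup = f"backup-{int(time.time())}"  # assumed naming scheme
        _check(git, ["checkout", "-b", backup])
        _check(git, ["add", "-A"])
        _check(git, ["commit", "-m", "Backup of uncommitted changes"])
        _check(git, ["checkout", "master"])

    # Bring the local branch up to date before touching any files.
    _check(git, ["fetch", "origin"])
    _check(git, ["rebase", "origin/master"])

    # The caller writes the needle files and reports their paths.
    paths = write_files()
    _check(git, ["add", "--", *paths])

    # No diff: "git diff --cached --quiet" exits with 0 if nothing is staged.
    code, _out = git(["diff", "--cached", "--quiet"])
    if code == 0:
        return "unchanged"
    _check(git, ["commit", "-m", message])

    # Rejected push: rebase onto the new remote state and try again.
    for _attempt in range(max_attempts):
        code, _out = git(["push", "origin", "master"])
        if code == 0:
            return "pushed"
        _check(git, ["fetch", "origin"])
        _check(git, ["rebase", "origin/master"])
    raise RuntimeError(f"push still rejected after {max_attempts} attempts")
```

Passing the Git runner in as a callable keeps the control flow testable without a real repository. Against a real checkout it could be called as `commit_needle_changes(make_git_runner(needles_dir), write_files, "Add needle")`, where `needles_dir` and `write_files` are supplied by the caller.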