action #164895
closed
o3 had corrupted needles git repo, lost uncommitted needles between 2024-07-31 and 2024-08-02
0%
Description
I wasn't able to commit new needles from the WebUI; it failed with the error "fatal: .git/index: index file smaller than expected".
Checking the repo on github showed that the last commit was >2d ago, so I checked on ariel.
The /var/lib/openqa/tests/opensuse/products/opensuse/needles/.git/index file was empty and basically no git command worked. To attempt recovery I did:
rm .git/index
git status (complained about a lot of deleted and unstaged files)
git restore --staged
git restore --staged .
git status (now shows >2 screen pages of uncommitted needles)
git diff
git fetch origin
git log
git add .
git status (shows empty!!!!)
For some reason between the last two git status calls all uncommitted needles got deleted :-(
I have no clue why, maybe some fallout of the corrupt index confusing git add or the WebUI did something strange.
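For future recoveries, a more cautious variant of the steps above could look like the sketch below. It copies the work tree aside first, then drops the broken index and lets git reset rebuild it from HEAD, which leaves the files on disk alone. The function name recover_index and the backup location are made up for illustration, and anything else that runs git in the checkout (cron jobs, the WebUI) is assumed to be stopped while it runs.

```shell
#!/bin/sh
# Hypothetical helper, not part of openQA: back up the work tree, then
# rebuild a broken .git/index from HEAD without touching files on disk.
recover_index() {
    repo=$1      # path to the git checkout
    backup=$2    # directory for the safety copy (assumed to be outside the repo)

    mkdir -p "$backup" || return 1

    # Copy everything except the .git directory, so uncommitted needles
    # survive even if a later step goes wrong.
    (cd "$repo" && tar -cf - --exclude=.git .) | (cd "$backup" && tar -xf -) || return 1

    # Remove the corrupt index; a mixed reset then recreates it from HEAD.
    # New files stay in the work tree and show up as untracked afterwards.
    rm -f "$repo/.git/index"
    git -C "$repo" reset --quiet
}
```

Called as recover_index /path/to/needles /some/backup/dir, the uncommitted needles should afterwards appear as untracked files in git status and can be added and committed as usual.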
Updated by okurz 4 months ago
- Related to action #162077: Create and maintain up to date version of test distri/needles for webui - enabled by default size:S added
Updated by favogt 4 months ago
openqaworker-arm21 did not rsync the needles dir after the deletion happened, so I could sync the lost needles to o3 (/space/needles-bkup, can be deleted once this is over) and then to the final location.
For some reason between the last two git status calls all uncommitted needles got deleted :-(
I have no clue why, maybe some fallout of the corrupt index confusing git add or the WebUI did something strange.
After I git added the recovered needles but before I was able to git commit, they were gone again.
The journal showed that cron ran, so I suspect the cause of the deletion is fetchneedles.
Given that fetchneedles runs every minute, it's actually possible that this conflicts with the Web UI's needle editor as well occasionally, with the result that needles simply get lost instead of commit + push. This needs to be checked.
I stopped cron temporarily before the next add + commit attempt and it helped: https://github.com/os-autoinst/os-autoinst-needles-opensuse/commit/03e12dc74fa5bd523a2d58dcfaacd500c1ffa14d \o/
Updated by tinita 4 months ago · Edited
I mentioned this in #162125#note-20 two days ago. I guess my comment was overlooked.
I also think it happens because it interferes with fetchneedles, but somehow it seems to repair itself again sometimes, as you can see by the mixture of passed and failed minion jobs:
https://openqa.opensuse.org/minion/jobs?task=save_needle&state=&queue=&note=&limit=100&offset=0
Anyway, we are in the process of replacing fetchneedles with a minion job (#162125) that would use the same kind of minion lock as save_needle and delete_needles, so there wouldn't be any conflicts like that anymore.
(The alternative without minion would be to use a lockfile with flock.)
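To illustrate the flock alternative, here is a minimal sketch of a wrapper that serializes commands on a lock file, using the file-descriptor form of flock(1) from util-linux. The lock path and the name with_repo_lock are assumptions for the example. It would only help if every writer to the checkout, including the WebUI's save_needle, took the same lock, which is what the minion lock is meant to provide.

```shell
#!/bin/sh
# Hypothetical lock file location; any path writable by all involved users works.
LOCK=${LOCK:-/tmp/needles-git.lock}

# Run the given command while holding an exclusive lock on $LOCK.
# Waits up to 60 seconds for the lock and fails if it cannot be taken.
with_repo_lock() {
    (
        flock -w 60 9 || exit 1   # lock file descriptor 9, opened below
        "$@"                      # the command's exit status is passed through
    ) 9>"$LOCK"
}
```

A cron entry could then call something like with_repo_lock git -C /path/to/needles pull --rebase, so that it waits instead of running in the middle of another locked git operation.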
I think it's likely that this is not a new thing happening, but as it repairs itself sometimes and we don't get notified of such failed jobs we didn't notice.
Updated by tinita 4 months ago
- Related to action #162125: [timeboxed:10h][spike] Let openQA keep test distribution checkouts up to date without needing fetchneedles size:S added
Updated by tinita 4 months ago
- Related to action #164898: Replace fetchneedles with a minion job for the regular update of git repos size:M added
Updated by Vogtinator 4 months ago · Edited
I mentioned this in #162125#note-20 two days ago. I guess my comment was overlooked.
Not just your comment, also the big fat error all users got when saving a needle...
I also think it happens because it interferes with fetchneedles, but somehow it seems to repair itself again sometimes, as you can see by the mixture of passed and failed minion jobs:
https://openqa.opensuse.org/minion/jobs?task=save_needle&state=&queue=&note=&limit=100&offset=0
...
I think it's likely that this is not a new thing happening, but as it repairs itself sometimes and we don't get notified of such failed jobs we didn't notice.
There was no commit on github for ~2 days, so the passed jobs there did not really succeed. I've never seen this particular error before.
Anyway, we are in the process of replacing fetchneedles with a minion job (#162125) that would use the same kind of minion lock as save_needle and delete_needles, so there wouldn't be any conflicts like that anymore.
Sounds good.
Updated by tinita 4 months ago · Edited
- Status changed from New to In Progress
- Assignee set to tinita
I remember that I saw this error on o3 in $MAIL before.
We actually have the problem right now in /var/lib/openqa/share/tests/obs.
But the first occurrence was in October 2021, according to cron mails.
The next occurrence was on 17 August 2023, and then again on 19 July 2024.
Not sure if somebody repaired the repo after every occurrence.
I will fix the obs checkout.
The modification time of the index file was Jul 19 15:17 .git/index
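Since these occurrences only showed up in cron mails, a small check like the following sketch could flag a broken index earlier. The function name index_ok is hypothetical. It assumes a checkout that already has an index (a repo with nothing added yet has no .git/index and would be reported as broken), and it only detects the problem, it does not repair anything.

```shell
#!/bin/sh
# Hypothetical check, not part of openQA: succeeds if the index of the
# given checkout exists, is non-empty and can be read by git.
index_ok() {
    repo=$1
    # An existing but empty index file is the state seen in this ticket.
    [ -s "$repo/.git/index" ] || return 1
    # git ls-files has to parse the index and fails if it is corrupt.
    git -C "$repo" ls-files >/dev/null 2>&1
}
```

For example, index_ok /var/lib/openqa/share/tests/obs || echo "index broken" could run from a monitoring job.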
Updated by tinita 4 months ago
/var/lib/openqa/share/tests/obs is fixed now.
The funny thing is, there haven't been any new commits since the last one on March 28: https://github.com/os-autoinst/os-autoinst-distri-obs/commits/master/
So I can't really see why there would be something running into a conflict.