action #164895
closed
o3 had corrupted needles git repo, lost uncommitted needles between 2024-07-31 and 2024-08-02
Added by favogt 5 months ago.
Updated 4 months ago.
Category: Regressions/Crashes
Description
I wasn't able to commit new needles from the WebUI with the error fatal: .git/index: index file smaller than expected.
Checking the repo on github showed that the last commit was >2d ago, so I checked on ariel.
The /var/lib/openqa/tests/opensuse/products/opensuse/needles/.git/index file was empty and basically no git command worked. To attempt recovery I did:
rm .git/index
git status (complained about a lot of deleted and unstaged files)
git restore --staged
git restore --staged .
git status (now shows >2 screen pages of uncommitted needles)
git diff
git fetch origin
git log
git add .
git status (shows empty!!!!)
For some reason between the last two git status calls all uncommitted needles got deleted :-(
I have no clue why, maybe some fallout of the corrupt index confusing git add or the WebUI did something strange.
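The error quoted in the description is what git prints when the index file is too short to hold even its fixed header and trailing checksum. A minimal sketch of a check that could be run before touching such a checkout, assuming the default SHA-1 index format (12-byte header starting with `DIRC`, 20-byte trailing checksum). The function name `index_looks_truncated` is hypothetical and not part of openQA or git:

```python
import os

# Assumption: default SHA-1 repository. The git index starts with a 12-byte
# header (signature "DIRC", version, entry count) and ends with a 20-byte
# checksum, so anything shorter than 32 bytes cannot be a valid index.
MIN_INDEX_SIZE = 12 + 20


def index_looks_truncated(index_path):
    """Return True if the index file exists but is too short or lacks the signature."""
    if not os.path.exists(index_path):
        # A missing index is a different situation from a truncated one.
        return False
    if os.path.getsize(index_path) < MIN_INDEX_SIZE:
        return True
    with open(index_path, "rb") as fh:
        return fh.read(4) != b"DIRC"
```

An empty index file like the one found on ariel would be reported as truncated. The check says nothing about why the file was cut short; it only makes the broken state visible before further git commands are run on the working tree.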
- Tags set to reactive work, needles
- Category set to Regressions/Crashes
- Priority changed from Normal to High
- Target version set to Ready
- Related to action #162077: Create and maintain up to date version of test distri/needles for webui - enabled by default size:S added
openqaworker-arm21 did not rsync the needles dir after the deletion happened, so I could sync the lost needles to o3 (/space/needles-bkup, can be deleted once this is over) and then to the final location.
For some reason between the last two git status calls all uncommitted needles got deleted :-(
I have no clue why, maybe some fallout of the corrupt index confusing git add or the WebUI did something strange.
After I git added the recovered needles but before I was able to git commit, they were gone again.
The journal showed that cron ran, so I suspect the cause of the deletion is fetchneedles.
Given that fetchneedles runs every minute, it's actually possible that this conflicts with the Web UI's needle editor as well occasionally, with the result that needles simply get lost instead of commit + push. This needs to be checked.
I stopped cron temporarily before the next add + commit attempt and it helped: https://github.com/os-autoinst/os-autoinst-needles-opensuse/commit/03e12dc74fa5bd523a2d58dcfaacd500c1ffa14d \o/
I mentioned this in #162125#note-20 two days ago. I guess my comment was overlooked.
I also think it happens because it interferes with fetchneedles, but somehow it seems to repair itself again sometimes, as you can see by the mixture of passed and failed minion jobs:
https://openqa.opensuse.org/minion/jobs?task=save_needle&state=&queue=&note=&limit=100&offset=0
Anyway, we are in the process of replacing fetchneedles with a minion job (#162125) that would use the same kind of minion lock as save_needle and delete_needles, so there wouldn't be any conflicts like that anymore.
(The alternative without minion would be to use a lockfile with flock)
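The lockfile alternative could look roughly like the following. This is a minimal sketch, not what fetchneedles or openQA actually do: it uses Python's `fcntl.flock`, which takes the same kind of advisory lock as the `flock` command, and the helper name `repo_lock` is hypothetical. It only helps if every process that touches the checkout takes the same lock.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def repo_lock(lock_path, blocking=True):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # Raises BlockingIOError when blocking=False and the lock is already held.
        fcntl.flock(fd, flags)
        yield
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

Both the periodic update and the needle save would wrap their git commands in `with repo_lock(path):` using the same path, so that a fetch can never run between a `git add` and the following `git commit`.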
I think it's likely that this is not a new thing happening, but as it repairs itself sometimes and we don't get notified of such failed jobs we didn't notice.
- Related to action #162125: [timeboxed:10h][spike] Let openQA keep test distribution checkouts up to date without needing fetchneedles size:S added
Btw, I think #162077 is not really related. The git_auto_clone feature is not enabled on o3 or osd, and it's only implemented for tests which set CASEDIR/NEEDLES_DIR explicitly to a git url.
- Related to action #164898: Replace fetchneedles with a minion job for the regular update of git repos size:M added
I mentioned this in #162125#note-20 two days ago. I guess my comment was overlooked.
Not just your comment, also the big fat error all users got when saving a needle...
I also think it happens because it interferes with fetchneedles, but somehow it seems to repair itself again sometimes, as you can see by the mixture of passed and failed minion jobs:
https://openqa.opensuse.org/minion/jobs?task=save_needle&state=&queue=&note=&limit=100&offset=0
...
I think it's likely that this is not a new thing happening, but as it repairs itself sometimes and we don't get notified of such failed jobs we didn't notice.
There was no commit on github for ~2 days, so the passed jobs there did not really succeed. I've never seen this particular error before.
Anyway, we are in the process of replacing fetchneedles with a minion job (#162125) that would use the same kind of minion lock as save_needle and delete_needles, so there wouldn't be any conflicts like that anymore.
Sounds good.
Wondering if there is any relation to #163790 - granted we don't know what the cause was, but I wanted to mention it at least
- Status changed from New to In Progress
- Assignee set to tinita
I remember that I saw this error on o3 in $MAIL before.
We actually have the problem right now in /var/lib/openqa/share/tests/obs.
But the first occurrence was in October 2021, according to cron mails.
The next occurrence was on 17 August 2023, and then on 19 July 2024.
Not sure if somebody repaired the repo after every occurrence.
I will fix the obs checkout.
The modification time of the index file was Jul 19 15:17 .git/index
- Status changed from In Progress to Feedback
- Status changed from Feedback to Resolved
We will now get emails again hopefully when this happens.
There is #164898 which will avoid two processes interfering with each other.