action #164895

closed

o3 had corrupted needles git repo, lost uncommitted needles between 2024-07-31 and 2024-08-02

Added by favogt 14 days ago. Updated 10 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-08-02
Due date:
% Done: 0%
Estimated time:

Description

I wasn't able to commit new needles from the WebUI; it failed with the error "fatal: .git/index: index file smaller than expected".
Checking the repo on github showed that the last commit was >2d ago, so I checked on ariel.
The /var/lib/openqa/tests/opensuse/products/opensuse/needles/.git/index file was empty and basically no git command
worked. To attempt recovery I did:

rm .git/index
git status (complained about a lot of deleted and unstaged files)
git restore --staged
git restore --staged .
git status (now shows >2 screen pages of uncommitted needles)
git diff
git fetch origin
git log
git add .
git status (shows empty!!!!)

For some reason between the last two git status calls all uncommitted needles got deleted :-(
I have no clue why, maybe some fallout of the corrupt index confusing git add or the WebUI did something strange.
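
A more cautious order of operations for this kind of repair might look like the sketch below: snapshot the working tree first, then rebuild the index from HEAD. The function names and the backup destination are hypothetical and not part of openQA. `git reset` without arguments recreates the index from HEAD and leaves files in the working tree alone. The sketch would not by itself stop another process from changing the checkout at the same time.

```shell
#!/bin/sh
# Hypothetical helpers, not part of openQA or fetchneedles.

# Copy the working tree (everything except .git) into a backup directory,
# so uncommitted needles survive whatever happens during the repair.
backup_worktree() {
    repo=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # The tar pipe preserves modes and timestamps; ./.git is skipped.
    (cd "$repo" && tar -cf - --exclude=./.git .) | (cd "$dest" && tar -xf -)
}

# Drop the broken index and let git recreate it from HEAD.
# Files in the working tree are not modified by this.
rebuild_index() {
    repo=$1
    rm -f "$repo/.git/index"
    git -C "$repo" reset --quiet
}
```

Possible usage, with an assumed backup path: `backup_worktree "$needles_dir" /space/needles-bkup && rebuild_index "$needles_dir"`, followed by `git status` to review what is uncommitted.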


Related issues 3 (1 open, 2 closed)

Related to openQA Project - action #162077: Create and maintain up to date version of test distri/needles for webui - enabled by default size:S (Resolved, mkittler)
Related to openQA Project - action #162125: [timeboxed:10h][spike] Let openQA keep test distribution checkouts up to date without needing fetchneedles size:S (Resolved, tinita, 2024-06-12 to 2024-08-13)
Related to openQA Project - action #164898: Replace fetchneedles with a minion job (Blocked, tinita)

Actions #1

Updated by okurz 14 days ago

  • Tags set to reactive work, needles
  • Category set to Regressions/Crashes
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by okurz 14 days ago

  • Related to action #162077: Create and maintain up to date version of test distri/needles for webui - enabled by default size:S added
Actions #3

Updated by favogt 14 days ago

openqaworker-arm21 did not rsync the needles dir after the deletion happened, so I could sync the lost needles to o3 (/space/needles-bkup, can be deleted once this is over) and then to the final location.

> For some reason between the last two git status calls all uncommitted needles got deleted :-(
> I have no clue why, maybe some fallout of the corrupt index confusing git add or the WebUI did something strange.

After I git added the recovered needles but before I was able to git commit, they were gone again.
The journal showed that cron ran, so I suspect the cause of the deletion is fetchneedles.

Given that fetchneedles runs every minute, it's actually possible that it occasionally conflicts with the Web UI's needle editor as well, with the result that needles simply get lost instead of being committed and pushed. This needs to be checked.

I stopped cron temporarily before the next add + commit attempt and it helped: https://github.com/os-autoinst/os-autoinst-needles-opensuse/commit/03e12dc74fa5bd523a2d58dcfaacd500c1ffa14d \o/

Actions #4

Updated by tinita 14 days ago · Edited

I mentioned this in #162125#note-20 two days ago. I guess my comment was overlooked.

I also think it happens because it interferes with fetchneedles, but somehow it seems to repair itself again sometimes, as you can see by the mixture of passed and failed minion jobs:
https://openqa.opensuse.org/minion/jobs?task=save_needle&state=&queue=&note=&limit=100&offset=0

Anyway, we are in the process of replacing fetchneedles with a minion job (#162125) that would use the same kind of minion lock as save_needle and delete_needles, so there wouldn't be any conflicts like that anymore.

(The alternative without minion would be to use a lockfile with flock)
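
As a rough illustration of that lockfile alternative (not what was implemented), a wrapper around `flock(1)` from util-linux could serialize every process that touches the checkout. The function name and the lock file path are assumptions; the approach only helps if all writers, including the cron job and whatever saves needles, take the same lock.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command while holding an exclusive lock.
# Usage: with_repo_lock LOCKFILE COMMAND [ARGS...]
with_repo_lock() {
    lockfile=$1
    shift
    (
        # Block until the exclusive lock on fd 9 is available.
        flock -x 9 || exit 1
        # The lock is released when the subshell exits and fd 9 is closed.
        "$@"
    ) 9>"$lockfile"
}
```

For example, a cron entry could call `with_repo_lock /path/to/needles.lock some-update-command`, where both the lock path and the command are placeholders.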

I think it's likely that this is not a new thing happening, but as it repairs itself sometimes and we don't get notified of such failed jobs we didn't notice.

Actions #5

Updated by tinita 14 days ago

  • Related to action #162125: [timeboxed:10h][spike] Let openQA keep test distribution checkouts up to date without needing fetchneedles size:S added
Actions #6

Updated by tinita 14 days ago

Btw, I think #162077 is not really related. The git_auto_clone feature is not enabled on o3 or osd, and it's only implemented for tests which set CASEDIR/NEEDLES_DIR explicitly to a git url.

Actions #7

Updated by tinita 14 days ago

  • Related to action #164898: Replace fetchneedles with a minion job added
Actions #8

Updated by Vogtinator 14 days ago · Edited

> I mentioned this in #162125#note-20 two days ago. I guess my comment was overlooked.

Not just your comment, also the big fat error all users got when saving a needle...

> I also think it happens because it interferes with fetchneedles, but somehow it seems to repair itself again sometimes, as you can see by the mixture of passed and failed minion jobs:
> https://openqa.opensuse.org/minion/jobs?task=save_needle&state=&queue=&note=&limit=100&offset=0
> ...
> I think it's likely that this is not a new thing happening, but as it repairs itself sometimes and we don't get notified of such failed jobs we didn't notice.

There was no commit on github for ~2 days, so the passed jobs there did not really succeed. I've never seen this particular error before.

> Anyway, we are in the process of replacing fetchneedles with a minion job (#162125) that would use the same kind of minion lock as save_needle and delete_needles, so there wouldn't be any conflicts like that anymore.

Sounds good.

Actions #9

Updated by tinita 11 days ago

Should we block this on #164898 ?

Actions #10

Updated by livdywan 11 days ago

Wondering if there is any relation to #163790 - granted we don't know what the cause was, but I wanted to mention it at least

Actions #11

Updated by tinita 11 days ago · Edited

  • Status changed from New to In Progress
  • Assignee set to tinita

I remember that I saw this error on o3 in $MAIL before.
We actually have the problem right now in /var/lib/openqa/share/tests/obs.
But the first occurrence was in October 2021, according to cron mails.

The next occurrence was on 17 August 2023, and then on 19 July 2024.

Not sure if somebody repaired the repo after every occurrence.

I will fix the obs checkout.
The modification time of the index file was Jul 19 15:17 .git/index
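
A check along the following lines could flag such checkouts without waiting for a failed save. It is only a sketch: the function names are made up, and it covers just the zero-length `.git/index` case described in this ticket, not other kinds of index corruption.

```shell
#!/bin/sh
# Hypothetical monitoring helpers, not part of openQA.

# Succeeds when the index file exists but has zero bytes.
index_is_empty() {
    [ -e "$1/.git/index" ] && [ ! -s "$1/.git/index" ]
}

# Print one line for every given checkout whose index is empty.
report_empty_indexes() {
    for repo in "$@"; do
        if index_is_empty "$repo"; then
            echo "empty git index: $repo"
        fi
    done
}
```

It could be called with the checkout directories as arguments, for example the needles directory and `/var/lib/openqa/share/tests/obs`.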

Actions #12

Updated by tinita 11 days ago

/var/lib/openqa/share/tests/obs is fixed now.
The funny thing is, there weren't any new commits since the last one on March 28: https://github.com/os-autoinst/os-autoinst-distri-obs/commits/master/

So I can't really see why there would be something running into a conflict.

Actions #13

Updated by tinita 11 days ago

  • Status changed from In Progress to Feedback
Actions #14

Updated by tinita 10 days ago

  • Status changed from Feedback to Resolved

Hopefully we will now get emails again when this happens.
There is #164898 which will avoid two processes interfering with each other.
