action #100712

Investigate what broke git checkouts on o3

Added by favogt about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2021-10-11
Due date:
% Done: 0%
Estimated time:

Description

Pretty much all git repos on o3 broke over night:

fatal: .git/index: index file smaller than expected

was printed for /var/lib/openqa/tests/{opensuse,openqa,obs}/ as well as /var/lib/openqa/tests/opensuse/products/opensuse/needles.

The .git/index file had 0 size for them.
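A check along the following lines could flag affected checkouts before the next cron run. This is only a minimal sketch and not part of fetchneedles: the function name find_empty_git_indexes is hypothetical, and it assumes nothing beyond a POSIX find.

```shell
# Hypothetical helper: list every .git/index below ROOT that is 0 bytes long.
find_empty_git_indexes() {
    root=$1
    # -size 0 matches only files that occupy zero 512-byte blocks,
    # which means completely empty files.
    find "$root" -type f -path '*/.git/index' -size 0
}
```

On o3 it would be called as find_empty_git_indexes /var/lib/openqa/tests; an empty output means no zero-size index file was found.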

The index file of the needles repo had a birthtime of 2021-10-09 22:41:05.923862229 +0000, which coincides with a cron run of fetchneedles:

From geekotest@ariel.suse-dmz.opensuse.org  Sat Oct  9 22:41:09 2021
Return-Path: <geekotest@ariel.suse-dmz.opensuse.org>
X-Original-To: geekotest
Delivered-To: geekotest@ariel.suse-dmz.opensuse.org
Received: by ariel.suse-dmz.opensuse.org (Postfix, from userid 493)
        id 66A5C18B5F; Sat,  9 Oct 2021 22:41:08 +0000 (UTC)
From: "(Cron Daemon)" <geekotest@ariel.suse-dmz.opensuse.org>
To: geekotest@ariel.suse-dmz.opensuse.org
Subject: Cron <geekotest@ariel>     env updateall=1 force=1 /usr/share/openqa/script/fetchneedles
Content-Type: text/plain; charset=UTF-8
Auto-Submitted: auto-generated
Precedence: bulk
X-Cron-Env: <XDG_SESSION_ID=19181>
X-Cron-Env: <XDG_RUNTIME_DIR=/run/user/493>
X-Cron-Env: <DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/493/bus>
X-Cron-Env: <XDG_SESSION_TYPE=unspecified>
X-Cron-Env: <XDG_SESSION_CLASS=background>
X-Cron-Env: <LANG=en_US.UTF-8>
X-Cron-Env: <LC_CTYPE=en_US.UTF-8>
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/var/lib/openqa>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=geekotest>
X-Cron-Env: <USER=geekotest>
Message-Id: <20211009224109.66A5C18B5F@ariel.suse-dmz.opensuse.org>
Date: Sat,  9 Oct 2021 22:41:04 +0000 (UTC)

fatal: It seems that there is already a rebase-merge directory, and
I wonder if you are in the middle of another rebase.  If that is the
case, please try
        git rebase (--continue | --abort | --skip)
If that is not the case, please
        rm -fr ".git/rebase-merge"
and run me again.  I am stopping in case you still have something
valuable there.

Use force=1 to discard uncommitted changes before rebasing

From geekotest@ariel.suse-dmz.opensuse.org  Sun Oct 10 08:31:04 2021
Return-Path: <geekotest@ariel.suse-dmz.opensuse.org>
X-Original-To: geekotest
Delivered-To: geekotest@ariel.suse-dmz.opensuse.org
Received: by ariel.suse-dmz.opensuse.org (Postfix, from userid 493)
        id D457D18B5F; Sun, 10 Oct 2021 08:31:04 +0000 (UTC)
From: "(Cron Daemon)" <geekotest@ariel.suse-dmz.opensuse.org>
To: geekotest@ariel.suse-dmz.opensuse.org
Subject: Cron <geekotest@ariel>     env updateall=1 force=1 /usr/share/openqa/script/fetchneedles
Content-Type: text/plain; charset=UTF-8
Auto-Submitted: auto-generated
Precedence: bulk
X-Cron-Env: <XDG_SESSION_ID=1>
X-Cron-Env: <XDG_RUNTIME_DIR=/run/user/493>
X-Cron-Env: <DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/493/bus>
X-Cron-Env: <XDG_SESSION_TYPE=unspecified>
X-Cron-Env: <XDG_SESSION_CLASS=background>
X-Cron-Env: <LANG=en_US.UTF-8>
X-Cron-Env: <LC_CTYPE=en_US.UTF-8>
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/var/lib/openqa>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=geekotest>
X-Cron-Env: <USER=geekotest>
Message-Id: <20211010083104.D457D18B5F@ariel.suse-dmz.opensuse.org>
Date: Sun, 10 Oct 2021 08:31:04 +0000 (UTC)

fatal: .git/index: index file smaller than expected
fatal: .git/index: index file smaller than expected
fatal: .git/index: index file smaller than expected
Use force=1 to discard uncommitted changes before rebasing
fatal: .git/index: index file smaller than expected
fatal: .git/index: index file smaller than expected
fatal: It seems that there is already a rebase-merge directory, and
I wonder if you are in the middle of another rebase.  If that is the
case, please try
        git rebase (--continue | --abort | --skip)
If that is not the case, please
        rm -fr ".git/rebase-merge"
and run me again.  I am stopping in case you still have something
valuable there.

Use force=1 to discard uncommitted changes before rebasing
fatal: .git/index: index file smaller than expected
fatal: .git/index: index file smaller than expected
fatal: .git/index: index file smaller than expected
Use force=1 to discard uncommitted changes before rebasing

I fixed those repos manually by doing git reset --quiet; git status; git pull.
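The index part of that manual fix can be sketched as a small function. Two things here are assumptions and not taken from the commands above: the name reset_empty_index is hypothetical, and the step that deletes the zero-size index before resetting is added because the error shows git refusing to read the truncated file. git pull is left out so the sketch also works without a remote.

```shell
# Hypothetical helper: rebuild the index of the checkout at REPO from HEAD.
reset_empty_index() {
    repo=$1
    index="$repo/.git/index"
    # Assumption: only remove the index if it exists and is empty (0 bytes).
    if [ -f "$index" ] && [ ! -s "$index" ]; then
        rm -f "$index"
    fi
    # Recreate the index from HEAD; files in the working tree are not touched.
    git -C "$repo" reset --quiet
}
```

After this, git status and git pull can be run as before. Local modifications in the working tree show up as unstaged changes and are not discarded.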
The openqa tests repo needed a manual rebase to deal with a conflict.

The fatal: It seems that there is already a rebase-merge directory error is probably a red herring: it is only printed once while multiple repos are broken, it is still printed after the index purge, and it has been going on since Wed, 14 Jul 2021 17:30:10 +0000 (UTC) according to root's mailbox.
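To see since when a given error has been showing up in the mailbox, something like the following could be used. It is a rough sketch: mbox_messages_matching is a hypothetical name, and it assumes the usual mbox layout in which every message starts with a line beginning with "From " followed by sender and delivery date.

```shell
# Hypothetical helper: print the "From " envelope line of every message
# in MBOX whose text contains PATTERN (fixed string, not a regex).
mbox_messages_matching() {
    pattern=$1
    mbox=$2
    awk -v pat="$pattern" '
        # A new message starts: remember its envelope line, reset the flag.
        /^From / { envelope = $0; reported = 0; next }
        # First matching line of this message: report the envelope once.
        index($0, pat) && !reported { print envelope; reported = 1 }
    ' "$mbox"
}
```

For example, mbox_messages_matching 'rebase-merge directory' /var/spool/mail/root | head -n 1 would show the first delivery that contains the rebase error.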

All of the error logs ended up in /var/spool/mail/root, which probably should be archived:

ariel:~ # ll -h /var/spool/mail/root
-rw------- 1 root root 508M Oct 11 08:32 /var/spool/mail/root
ariel:~ # wc -l /var/spool/mail/root
10018188 /var/spool/mail/root

History

#1 Updated by okurz about 2 months ago

  • Assignee set to okurz
  • Target version set to Ready

@fvogt as I understood you already fixed the individual repos. In the past we have seen similar errors and I always tried to improve fetchneedles one step at a time. There had been no recent change in fetchneedles however. The only change in past months was https://github.com/os-autoinst/openQA/commit/986bac2a9b8a42cd8cd673f061d5407ccd893717#diff-0bb3fc4e32c66e0e4e124d1288c9e57e8e32f17d020d88fc2f085693996814f6 in August, so after the incident, and it also looks like it cannot possibly cause such an error.

As you fixed the current situation should we still treat it as "Urgent"?

favogt wrote:

All of the error logs ended up in /var/spool/mail/root, which probably should be archived

What do you mean by "should be archived"?

#2 Updated by favogt about 2 months ago

iforster has the suspicion that this might be related to/caused by the recent NetApp failure which brought down some services. I asked infra about that, maybe it fits.

It's a bit weird though that it would only hit .git/index files, and that in multiple subsequent fetch runs.

okurz wrote:

@fvogt as I understood you already fixed the individual repos. In the past we have seen similar errors

Also corrupt git files? Merge conflicts and similar probably aren't related.

and I always tried to improve fetchneedles one step at a time. There had been no recent change in fetchneedles however. The only change in past months was https://github.com/os-autoinst/openQA/commit/986bac2a9b8a42cd8cd673f061d5407ccd893717#diff-0bb3fc4e32c66e0e4e124d1288c9e57e8e32f17d020d88fc2f085693996814f6 in August, so after the incident, and it also looks like it cannot possibly cause such an error.

Yeah, I couldn't find anything either.

As you fixed the current situation should we still treat it as "Urgent"?

Without knowing the cause it's not unlikely that it happens again.

favogt wrote:

All of the error logs ended up in /var/spool/mail/root, which probably should be archived

What do you mean by "should be archived"?

Stored elsewhere in compressed form to keep it, but make it easier to inspect for future events.
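As a rough sketch of what that could look like (archive_mbox and the target directory are hypothetical, nothing like this exists on o3 as far as this ticket shows):

```shell
# Hypothetical helper: store a gzip-compressed, dated copy of MBOX in DEST_DIR
# and print the path of the archive. The original mailbox is left untouched.
archive_mbox() {
    mbox=$1
    dest_dir=$2
    archive="$dest_dir/$(basename "$mbox")-$(date +%Y%m%d).mbox.gz"
    mkdir -p "$dest_dir" || return 1
    # Write the compressed copy, then verify its integrity before reporting it.
    gzip -c "$mbox" > "$archive" || return 1
    gzip -t "$archive" || return 1
    printf '%s\n' "$archive"
}
```

Emptying the original mailbox afterwards is deliberately not part of the sketch, as that would need care while cron mails are still being delivered. The archive can be inspected without unpacking it, e.g. with zgrep.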

#3 Updated by favogt about 2 months ago

  • Status changed from New to Closed

Got an answer:

yes, very likely related. From geekotest@ariel.suse-dmz.opensuse.org Sat Oct 9 22:41:09 2021 is within a few seconds of when the trouble on the host started

So let's close this for now, and hope it doesn't happen again.

#4 Updated by okurz about 2 months ago

  • Status changed from Closed to Resolved
