tickets #45614: A couple of openSUSE machines run out of disk space

closed

#1

Updated by Anonymous over 6 years ago

Dear sender

I'm out of office until Tuesday, 2019-01-02, and will not read my email regularly.
In urgent cases, please contact my manager, Roland Haidl rhaidl@suse.com.

You might also contact:

autobuild@suse.de for all questions around Autobuild and the Build Service

With kind regards
Lars Vogdt

--
Lars Vogdt Lars.Vogdt@suse.com

BuildOPS Team Lead -
SUSE Linux GmbH - GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer
Maxfeldstraße 5, 90409 Nuernberg, Germany - HRB 16746 (AG Nuernberg)

admin@opensuse.org 12/30/18 09:33 >>>

[openSUSE Tracker]
Issue #45614 has been reported by Lars.Vogdt@suse.com.

tickets #45614: A couple of openSUSE machines run out of disk space
https://progress.opensuse.org/issues/45614

Author: Lars.Vogdt@suse.com
Status: New
Priority: Normal
Assignee:
Category:
Target version:

Hi

Sorry to say, but while debugging a problem with one of the hypervisor
machines, I noticed that some openSUSE machines are running out of disk
space. Namely:

boosters.infra.opensuse.org
mirrordb3.infra.opensuse.org
mirrordb4.infra.opensuse.org
narwal3.infra.opensuse.org
osc-collab.infra.opensuse.org

Please inform the administrators of those boxes, so they can start a
cleanup round.
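
A minimal sketch of how such a low-disk-space condition could be detected before it becomes a problem. It is not the monitoring actually used on these machines; the function names and the 90% threshold are assumptions chosen for illustration.

```python
import shutil


def usage_percent(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str, threshold_percent: float = 90.0) -> bool:
    """True if usage is at or above the (assumed) warning threshold."""
    return usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    # "/" is only an example mount point; adjust to the filesystems that matter.
    for mount in ("/",):
        state = "NEARLY FULL" if is_nearly_full(mount) else "ok"
        print(f"{mount}: {usage_percent(mount):.1f}% used ({state})")
```

The percentage may differ slightly from what `df` reports, because `df` accounts for blocks reserved for root.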

Another topic:

icc.infra.opensuse.org hangs
narwal2.infra.opensuse.org hangs in maintenance mode (see screen)

Please investigate.

Regards
Lars

#2

Updated by tampakrap over 6 years ago

Category set to Servers hosted in NBG
Assignee set to tampakrap

@Lars, thanks a lot for handling the hypervisor issue while everyone was on Christmas break, and also for bringing back the failed VMs, including the very important mirrordb1! Your effort is really appreciated!

As for the rest of the VMs that are still failing, I'll get to them with a bit of a delay, as I'm about to leave on a business trip for the whole week and will have very limited availability.

A few more VMs that have been reported directly to me as broken are:

aedir[1-2].i.o.o
lnt.i.o.o
the CaaSP cluster (not all of the VMs of the cluster seem to be down though, but the endpoint fails)

#3

Updated by tampakrap over 6 years ago

Private changed from Yes to No

#4

Updated by cboltz over 6 years ago

I have good and bad news:

bad: provo-mirror is also down (no idea why, I'd guess it's unrelated to the NBG hypervisor problems)

good 1: I manually compressed the nginx logs on narwal3 some days ago, so the disk space issue is fixed for now (interestingly, the logs were rotated, but not compressed)

good 2: I'm working on replacing the old narwals with some salt (both the webservers and automated git pull) and hope to have it ready in the next few days, so maybe you won't need to spend too much time fixing narwal2 ;-)

I'll also add a checklist to the ticket (one item per server) to make sure nothing gets lost ;-)
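
For reference, the manual cleanup described under "good 1" (rotated but uncompressed logs) could be sketched roughly as follows. This is not the command that was actually run on narwal3; the file-name pattern `*.log-*` and the function name are assumptions.

```python
import gzip
import shutil
from pathlib import Path
from typing import List


def compress_rotated_logs(log_dir: Path, pattern: str = "*.log-*") -> List[Path]:
    """Gzip rotated log files that are not compressed yet and remove the originals.

    `pattern` assumes rotated files are named like "access.log-20181230";
    adjust it to the naming scheme that is really in use.
    """
    compressed = []
    for src in sorted(log_dir.glob(pattern)):
        # Skip directories and files that already carry a compression suffix.
        if not src.is_file() or src.suffix in {".gz", ".xz", ".bz2"}:
            continue
        dst = src.with_name(src.name + ".gz")
        if dst.exists():
            continue  # never overwrite an existing archive
        with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        src.unlink()
        compressed.append(dst)
    return compressed
```

If logrotate handles the rotation, enabling its `compress` directive is probably the more durable fix; the sketch only covers the one-off cleanup.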

#5

Updated by cboltz over 6 years ago

Checklist item changed to [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [ ] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: provo-mirror

#6

Updated by mcaj over 6 years ago

Checklist item changed from [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [ ] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: provo-mirror to [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [x] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: provo-mirror

FYI I checked the status of the machine lnt.infra.opensuse.org aka lnt.opensuse.org.

The machine was not responding to ping. I found only one message in the serial console output:

[16776824.048003] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]

I was not able to log in via virt-manager. The machine did not react to a soft reboot, so I had to
force a reboot.

After the forced reboot it seems to be up and running, and the website https://lnt.opensuse.org/ is working.
However, the admin of the machine should check its logs.

Martin

#7

Updated by mcaj over 6 years ago

The icc VM is broken and a reboot does not help.

There is a message from the kernel:
Probing EDD (edd=off to disable)... ok

and then this message:

PANIC early exception 0d rip 10:ffffffff810321f5 error 0 cr2 0

#8

Updated by cboltz over 6 years ago

mcaj wrote:

The icc VM is broken and a reboot does not help.

There is a message from the kernel: [...]
PANIC early exception 0d rip 10:ffffffff810321f5 error 0 cr2 0

Wild guess: try booting the previous kernel

#9

Updated by cboltz over 6 years ago

Checklist item changed from [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [x] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: provo-mirror to [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [x] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: 101.opensuse.org (CaaSP?), [ ] down: provo-mirror

101.opensuse.org shows "404 Not Found: Requested route ('101.cf.infra.opensuse.org') does not exist.", added to the checklist

#10

Updated by cboltz over 6 years ago

Checklist item changed from [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [x] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: 101.opensuse.org (CaaSP?), [ ] down: provo-mirror to [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [x] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: 101.opensuse.org (CaaSP?), [x] down: provo-mirror

provo-mirror has been back for about 17 hours - and we instantly got ticket #46031 because it's outdated ;-)

Thanks to whoever brought provo-mirror back!

#11

Updated by Anonymous over 6 years ago

On Fri, 11 Jan 2019 16:54:18 +0000,
admin@opensuse.org wrote:

provo-mirror has been back for about 17 hours - and we instantly got
ticket #46031 because it's outdated ;-)

Thanks to whoever brought provo-mirror back!

FYI: provo-mirror had "disk full". Luckily we found someone with big
pockets at SUSE who sponsored some more space (30TB).

The machine is serving outdated packages, and will continue to do so over the
weekend of 12/13 January, as we decided to stop updating it with the latest builds
and instead speed up the sync of the underlying LVM move.

provo-mirror should be back on track (and hopefully stay online and
up to date for longer) early next week. Until then, it might be
a good idea to rely on download.opensuse.org to get the latest
packages. For installation media and some (not updated) packages or
repositories, the packages on provo-mirror should be good enough
(that's the reason why we leave it online). Thankfully MirrorBrain
behind download.opensuse.org knows which packages or ISO images can be
used and which not - and will redirect you to other mirrors in case the
files on provo-mirror are outdated.
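
The redirect behaviour described above can be illustrated with a toy model. This is not MirrorBrain's actual implementation, which this ticket does not show; the `Mirror` class, the use of modification times as a freshness check, and all names below are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Mirror:
    name: str
    # Hypothetical record of what the mirror holds: path -> modification time.
    files: Dict[str, int]


def pick_mirror(path: str, master_mtime: int, mirrors: List[Mirror]) -> Optional[str]:
    """Return the first mirror whose copy of `path` matches the master copy.

    A mirror with a missing or stale copy is skipped, so a request is only
    redirected to a mirror that can serve the current file.
    """
    for mirror in mirrors:
        if mirror.files.get(path) == master_mtime:
            return mirror.name
    return None  # no mirror qualifies; the caller would serve the file itself


# Example: an unchanged ISO can still come from the outdated mirror,
# while a rebuilt package is redirected elsewhere.
provo = Mirror("provo-mirror", {"image.iso": 100, "pkg.rpm": 150})
other = Mirror("other-mirror", {"image.iso": 100, "pkg.rpm": 200})
print(pick_mirror("image.iso", 100, [provo, other]))
print(pick_mirror("pkg.rpm", 200, [provo, other]))
```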

I hope this explains the situation.

With kind regards,
Lars

#12

Updated by cboltz over 6 years ago

That's the best reason I ever heard for making a server read-only :-)

#13

Updated by tampakrap over 6 years ago

Checklist item changed to [x] down: CaaSP cluster (endpoints fail)

#14

Updated by tampakrap over 6 years ago

All CaaSP nodes are back up again. The NFS server that k8s uses as storage was also down; I brought it back up, but it still hasn't caught up, so Cloud Foundry and the websites on top of it are down at the moment.

#15

Updated by tampakrap over 6 years ago

Checklist item changed to [x] disk space: mirrordb3

#16

Updated by tampakrap over 6 years ago

Checklist item changed to [x] disk space: mirrordb4

#17

Updated by tampakrap over 6 years ago

I marked mirrordb3/4 as done because they are not actually used any more and are pending destruction. I'm waiting for darix's OK first.

#18

Updated by tampakrap over 6 years ago

Checklist item changed to [x] down: 101.opensuse.org (CaaSP?)

#19

Updated by tampakrap over 6 years ago

Checklist item changed to [x] down: aedir1

#20

Updated by tampakrap over 6 years ago

Checklist item changed to [x] down: aedir2

#21

Updated by tampakrap over 6 years ago

Checklist item changed to [x] down: icc

#22

Updated by cboltz over 6 years ago

Checklist item changed from [ ] disk space: boosters, [x] disk space: mirrordb3, [x] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [x] down: icc, [ ] down: narwal2, [x] down: aedir1, [x] down: aedir2, [x] down: lnt, [x] down: CaaSP cluster (endpoints fail), [x] down: 101.opensuse.org (CaaSP?), [x] down: provo-mirror to [ ] disk space: boosters, [x] disk space: mirrordb3, [x] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [x] down: aedir1, [x] down: aedir2, [x] down: lnt, [x] down: CaaSP cluster (endpoints fail), [x] down: 101.opensuse.org (CaaSP?), [x] down: provo-mirror

icc.o.o still shows the 503 maintenance page :-(

I can ping the VM, so maybe "only" the service is down.
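
The reasoning above (host answers ping, but the site returns 503) can be written down as a small decision helper. It is only a sketch: the function name and the status labels are made up for illustration, and the ping result and HTTP status are assumed to be collected by whatever check is already in place.

```python
from typing import Optional


def classify(ping_ok: bool, http_status: Optional[int]) -> str:
    """Combine a ping result and an HTTP status code into a rough diagnosis.

    `http_status` is None when no HTTP response was received at all.
    """
    if not ping_ok:
        return "host down"
    if http_status is None:
        return "host up, no HTTP response"
    if 500 <= http_status < 600:
        # e.g. the 503 maintenance page: the VM is alive, the service is not.
        return "host up, service failing"
    return "host up, service responding"


# The situation described for icc: pingable, but serving a 503 page.
print(classify(True, 503))
```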

#23

Updated by tampakrap about 6 years ago

Checklist item changed to [x] down: narwal2

#24

Updated by tampakrap about 6 years ago

Checklist item changed to [x] disk space: boosters

#25

Updated by tampakrap about 6 years ago

Status changed from New to Closed

Closing this one, as icc and osc-collab have dedicated maintainers who are already aware of the issues. Feel free to file separate tickets for those.
