tickets #45614

A couple of openSUSE machines run out of disk space

Added by Lars.Vogdt@suse.com 12 months ago. Updated 9 months ago.

Status: Closed
Start date:
Priority: Normal
Due date:
Assignee: tampakrap
% Done: 0%
Category: servers hosted in NBG
Target version: -
Duration:

Description

Hi

Sorry to say, but while debugging a problem with one of the hypervisor
machines, I noticed that some openSUSE machines are running out of disk
space. Namely:

  • boosters.infra.opensuse.org
  • mirrordb3.infra.opensuse.org
  • mirrordb4.infra.opensuse.org
  • narwal3.infra.opensuse.org
  • osc-collab.infra.opensuse.org

Please inform the administrators of those boxes, so they can start a
cleanup round.

Another topic:
* icc.infra.opensuse.org hangs
* narwal2.infra.opensuse.org hangs in maintenance mode (see screen)

Please investigate.

Regards
Lars
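
For the administrators starting a cleanup round, a minimal sketch of a local disk-usage check. It assumes it is run directly on each affected host. The 90% threshold and the function names are illustrative and not taken from this ticket.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def nearly_full(paths, threshold=90.0):
    """Return (path, percent) pairs for filesystems at or above `threshold`.

    The default threshold is an assumption; pick whatever fits the host.
    """
    result = []
    for path in paths:
        percent = disk_usage_percent(path)
        if percent >= threshold:
            result.append((path, round(percent, 1)))
    return result


if __name__ == "__main__":
    # Mount points to check are an example; adjust per machine.
    for path, percent in nearly_full(["/"]):
        print(f"{path} is {percent}% full")
```

Calling `nearly_full(["/", "/var"])` on a host lists only the mount points that need attention, which keeps the output short enough to paste into a ticket.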


Checklist

  • disk space: boosters
  • disk space: mirrordb3
  • disk space: mirrordb4
  • disk space: narwal3
  • disk space: osc-collab
  • down: icc
  • down: narwal2
  • down: aedir1
  • down: aedir2
  • down: lnt
  • down: CaaSP cluster (endpoints fail)
  • down: 101.opensuse.org (CaaSP?)
  • down: provo-mirror

History

#1 Updated by Lars.Vogdt@suse.com 12 months ago

Dear sender

I'm out of office until Tuesday, 2019-01-02, and will not read my email regularly.
In urgent cases, please contact my manager, Roland Haidl rhaidl@suse.com.

You might also contact:
* autobuild@suse.de for all questions around Autobuild and the Build Service

With kind regards
Lars Vogdt

Lars Vogdt Lars.Vogdt@suse.com

- BuildOPS Team Lead -
SUSE Linux GmbH - GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer
Maxfeldstraße 5, 90409 Nuernberg, Germany - HRB 16746 (AG Nuernberg)

admin@opensuse.org 12/30/18 09:33 >>>

[openSUSE Tracker]
Issue #45614 has been reported by Lars.Vogdt@suse.com.


tickets #45614: A couple of openSUSE machines run out of disk space
https://progress.opensuse.org/issues/45614


#2 Updated by tampakrap 11 months ago

  • Category set to servers hosted in NBG
  • Assignee set to tampakrap

@Lars, thanks a lot for handling the hypervisor issue while everyone was on Christmas break, and also for bringing back the failed VMs, including the very important mirrordb1! Your effort is really appreciated!

As for the VMs that are still down, I'll get to them with a bit of delay, as I'm about to leave on a business trip for the whole week and will have very limited availability.

A few more VMs that have been reported directly to me as broken are:
- aedir[1-2].i.o.o
- lnt.i.o.o
- the CaaSP cluster (not all VMs of the cluster seem to be down, but the endpoint fails)

#3 Updated by tampakrap 11 months ago

  • Private changed from Yes to No

#4 Updated by cboltz 11 months ago

I have good and bad news:

bad: provo-mirror is also down (no idea why, I'd guess it's unrelated to the NBG hypervisor problems)

good 1: I manually compressed the nginx logs on narwal3 some days ago, so the disk space issue is fixed for now (interestingly, the logs were rotated, but not compressed)

good 2: I'm working on replacing the old narwals with some salt (both the webservers and the automated git pull) and hope to have it ready within the next few days, so maybe you won't need to spend too much time fixing narwal2 ;-)

I'll also add a checklist to the ticket (one item per server) to make sure nothing gets lost ;-)
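
For reference, a minimal sketch of compressing rotated but still uncompressed logs. This is not the exact procedure used on narwal3: the log directory passed in and the assumed naming scheme for rotated files (`*.log-YYYYMMDD` or `*.log.N`) are illustrative only.

```python
import gzip
import re
import shutil
from pathlib import Path

# Assumed naming scheme for rotated logs, e.g. access.log-20190101 or access.log.1.
# The live log (access.log) and already compressed files (*.gz) do not match.
ROTATED = re.compile(r".*\.log[.-]\d+$")


def compress_rotated_logs(log_dir):
    """Gzip every rotated, uncompressed log in `log_dir`; return the new paths."""
    compressed = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file() or not ROTATED.match(path.name):
            continue
        target = path.with_name(path.name + ".gz")
        if target.exists():
            continue  # never overwrite an existing archive
        with path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # remove the original only after the copy finished
        compressed.append(target)
    return compressed
```

A longer-term fix would be to enable compression in the log rotation configuration itself, so that no manual step is needed.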

#5 Updated by cboltz 11 months ago

  • Checklist set to [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [ ] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: provo-mirror

#6 Updated by mcaj 11 months ago

  • Checklist changed from [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [ ] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: provo-mirror to [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [x] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: provo-mirror

FYI, I checked the status of the machine lnt.infra.opensuse.org aka lnt.opensuse.org.

The machine was not responding to ping. I found only one message in the serial console output:

[16776824.048003] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]

I was not able to log in via virt-manager. The machine did not react to a soft reboot, so I had to
force a reboot.

After the forced reboot it seems to be up and running. The website https://lnt.opensuse.org/ is also working.
However, the admin of the machine should check its logs.

Martin

#7 Updated by mcaj 11 months ago

The VM icc is broken and a reboot does not help.

There is a message from the kernel:
Probing EDD (edd=off to disable)... ok

and then this message:

PANIC early exception 0d rip 10:ffffffff810321f5 error 0 cr2 0

#8 Updated by cboltz 11 months ago

mcaj wrote:

The VM icc is broken and a reboot does not help.

There is a message from the kernel: [...]

PANIC early exception 0d rip 10:ffffffff810321f5 error 0 cr2 0

Wild guess: try booting the previous kernel

#9 Updated by cboltz 11 months ago

  • Checklist changed from [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [x] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: provo-mirror to [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [x] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: 101.opensuse.org (CaaSP?), [ ] down: provo-mirror

101.opensuse.org shows "404 Not Found: Requested route ('101.cf.infra.opensuse.org') does not exist.", added to the checklist

#10 Updated by cboltz 11 months ago

  • Checklist changed from [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [x] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: 101.opensuse.org (CaaSP?), [ ] down: provo-mirror to [ ] disk space: boosters, [ ] disk space: mirrordb3, [ ] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [ ] down: aedir1, [ ] down: aedir2, [x] down: lnt, [ ] down: CaaSP cluster (endpoints fail), [ ] down: 101.opensuse.org (CaaSP?), [x] down: provo-mirror

provo-mirror has been back for about 17 hours - and we instantly got ticket #46031 because it's outdated ;-)

Thanks to whoever brought provo-mirror back!

#11 Updated by lrupp@suse.de 11 months ago

On Fri, 11 Jan 2019 16:54:18 +0000,
admin@opensuse.org wrote:

provo-mirror has been back for about 17 hours - and we instantly got
ticket #46031 because it's outdated ;-)


Thanks to whoever brought provo-mirror back!

FYI: provo-mirror had a full disk. Luckily we found someone with deep
pockets at SUSE who sponsored some more space (30TB).

The machine is serving outdated packages and will continue to do so over
the weekend of 12/13 Jan, as we decided to stop updating it with the latest
builds and instead speed up the sync of the underlying LVM move.

provo-mirror should be back on track (and hopefully stay online and
up to date for a longer time) early next week. Until then, it might be
a good idea to rely on download.opensuse.org to get the latest
packages. For installation media and some (not updated) packages or
repositories, the packages on provo-mirror should be good enough
(that's the reason why we leave it online). Thankfully, MirrorBrain
behind download.opensuse.org knows which packages or ISO images can be
used and which cannot, and will redirect you to other mirrors in case the
files on provo-mirror are outdated.

I hope this explains the situation.

With kind regards,
Lars

#12 Updated by cboltz 11 months ago

That's the best reason I ever heard for making a server read-only :-)

#13 Updated by tampakrap 11 months ago

  • Checklist set to [x] down: CaaSP cluster (endpoints fail)

#14 Updated by tampakrap 11 months ago

All CaaSP nodes are back up again. The NFS server that k8s uses as storage was also down; I brought it up, but it still hasn't caught up. Thus Cloud Foundry and the websites on top of it are down atm.

#15 Updated by tampakrap 11 months ago

  • Checklist set to [x] disk space: mirrordb3

#16 Updated by tampakrap 11 months ago

  • Checklist set to [x] disk space: mirrordb4

#17 Updated by tampakrap 11 months ago

I marked mirrordb3/4 as done because they are not actually used any more, and they are pending destruction. I'm waiting for darix's ok first

#18 Updated by tampakrap 11 months ago

  • Checklist set to [x] down: 101.opensuse.org (CaaSP?)

#19 Updated by tampakrap 11 months ago

  • Checklist set to [x] down: aedir1

#20 Updated by tampakrap 11 months ago

  • Checklist set to [x] down: aedir2

#21 Updated by tampakrap 11 months ago

  • Checklist set to [x] down: icc

#22 Updated by cboltz 11 months ago

  • Checklist changed from [ ] disk space: boosters, [x] disk space: mirrordb3, [x] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [x] down: icc, [ ] down: narwal2, [x] down: aedir1, [x] down: aedir2, [x] down: lnt, [x] down: CaaSP cluster (endpoints fail), [x] down: 101.opensuse.org (CaaSP?), [x] down: provo-mirror to [ ] disk space: boosters, [x] disk space: mirrordb3, [x] disk space: mirrordb4, [x] disk space: narwal3, [ ] disk space: osc-collab, [ ] down: icc, [ ] down: narwal2, [x] down: aedir1, [x] down: aedir2, [x] down: lnt, [x] down: CaaSP cluster (endpoints fail), [x] down: 101.opensuse.org (CaaSP?), [x] down: provo-mirror

icc.o.o still shows the 503 maintenance page :-(

I can ping the VM, so maybe "only" the service is down.

#23 Updated by tampakrap 10 months ago

  • Checklist set to [x] down: narwal2

#24 Updated by tampakrap 10 months ago

  • Checklist set to [x] disk space: boosters

#25 Updated by tampakrap 9 months ago

  • Status changed from New to Closed

Closing this one, as icc and osc-collab have dedicated maintainers who are already aware of the issues. Anyone, feel free to file separate tickets for those.
