action #102143
o3 ran out of disk space (closed)
Description
rm -rf /var/cache/zypp/packages helped as a temporary measure. Log rotation for /var/log/openqa_scheduler and /var/log/openqa_gru seems broken.
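For reference, the same cache can also be cleared via zypper itself instead of removing the files manually, e.g.:

    zypper clean --all    # cleans cached packages and raw metadata for all repositories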
Acceptance criteria
- AC1: Logs are rotated so they don't become too big
- AC2: Temporary files are cleaned up automatically
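One way AC2 could be covered for the zypper package cache is a small systemd timer. The following is only a sketch with assumed unit names and an assumed weekly schedule, not a config that exists on o3:

    # /etc/systemd/system/clean-zypper-cache.service (assumed name)
    [Unit]
    Description=Clean the zypper package cache

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/zypper clean --all

    # /etc/systemd/system/clean-zypper-cache.timer (assumed name)
    [Unit]
    Description=Periodically clean the zypper package cache

    [Timer]
    OnCalendar=weekly
    Persistent=true

    [Install]
    WantedBy=timers.target

Both files would be activated with systemctl daemon-reload && systemctl enable --now clean-zypper-cache.timer.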
Updated by mkittler about 3 years ago
- Status changed from New to Feedback
- I've moved some big files from home directories to /space.
- I configured logrotate to also cover the gru and scheduler logs (same config as on OSD); a sketch of such a config is shown at the end of this comment.
- I've moved some log files to /space/logs because there was still not enough disk space left to perform the log rotation.
We should have enough free disk space again:
/dev/vda1 20G 6,7G 13G 36% /
This should do it for now. Other areas don't look problematic (checked via ncdu -x /). If necessary we can still reduce the log storage duration or store the openQA logs under /space/logs.
It is still not clear why the logrotate config was changed to remove the handling of the scheduler and gru logs. There actually were rotated scheduler/gru logs from September, and the logrotate config was modified on 14 October.
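For illustration, a logrotate snippet covering those two logs could look roughly like the following; the size and rotate values are assumptions, and the actual config taken over from OSD is not quoted here:

    # /etc/logrotate.d/openqa (illustrative; values are assumptions)
    /var/log/openqa_scheduler /var/log/openqa_gru {
        compress
        copytruncate
        missingok
        notifempty
        size 50M
        rotate 10
    }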
Updated by livdywan about 3 years ago
Let's conduct 5 WHYs tomorrow at 14:00 CE(S)T; at least Oli and I would find it useful to understand why this wasn't working before, and what changed.
Updated by tinita about 3 years ago
- File o3-disk-space.png added
Attached screenshot of munin disk space graph.
Updated by livdywan about 3 years ago
- Why did our monitoring not inform us of the issue?
- thruk.suse.de didn't send any emails
- This was brought up by Oleksandr in eng-testing (Slack) on Monday Dec 8 at 17.08 CEST: o3 was returning 503, seemingly the web UI was not running (from a user's point of view)
- Fabian tried rm -rf /var/cache/zypp/packages and restarted openqa-livehandler.service
1,9G /var/log/journal
2,7G /var/log/openqa_scheduler
6,9G /var/log/openqa_gru
- Why were we not alerted by thruk.suse.de, i.e. via email to individual persons?
- https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=ariel-opensuse.suse.de&service=root%20partition&backend=7215e#pnp_th4/1633871505/1636636305/0
- Last change to logrotate config Oct 14
- Notifications weren't enabled, and enabling them has to be done per user
- Why weren't we aware that configuration was changed?
- Without Salt, Ansible, or etckeeper we can't tell
- Why did thruk stop working?
- Ask eng infra
- Why are /home and /var/log on the root partition?
- There's #59933 but we didn't have capacity in the backlog
- We already had very old o3 ops tickets in the backlog and hence didn't add more
- More team members with an ops focus are needed
Improvements
- Everyone in the team should have access (and this should be mentioned in https://progress.opensuse.org/projects/qa/wiki#Onboarding-for-new-joiners)
- Install etckeeper (https://wiki.archlinux.org/title/etckeeper) to keep track of changes in /etc; see the sketch after this list
- Take a look at backup.qa.suse.de for relevant changes in /etc
- Clarify when we do 5 WHYs
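As a sketch of the etckeeper item above (package name and commands assume the openSUSE package with a git-backed /etc; nothing here was run as part of this ticket):

    zypper install etckeeper git
    etckeeper init                            # turns /etc into a git repository
    etckeeper commit "initial import of /etc"
    # later, review what changed and when:
    etckeeper vcs log --stat
    etckeeper vcs diff

etckeeper can also hook into the package manager to auto-commit /etc around package operations, which would make changes like the Oct 14 logrotate edit traceable.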
Updated by livdywan about 3 years ago
- Copied to coordination #102266: [epic] o3 ran out of disk space added
Updated by mkittler about 3 years ago
- Status changed from Feedback to Resolved
Looks like the changed logrotate config works. Since we've created a follow-up ticket for further improvements, I'm resolving this one.