action #102143
o3 ran out of disk space (closed)
Description
rm -rf /var/cache/zypp/packages helped as a temporary measure. Log rotation for /var/log/openqa_scheduler and /var/log/openqa_gru seems broken.
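For reference, the same cache can also be cleared via zypper itself instead of removing the files manually, e.g.:

    zypper clean --all    # cleans cached packages and raw metadata for all repositories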
Acceptance criteria
- AC1: Logs are rotated so they don't become too big
- AC2: Temporary files are cleaned up automatically
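One way AC2 could be covered for the zypper package cache is a small systemd timer. The following is only a sketch with assumed unit names and an assumed weekly schedule, not a config that exists on o3:

    # /etc/systemd/system/clean-zypper-cache.service (assumed name)
    [Unit]
    Description=Clean the zypper package cache

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/zypper clean --all

    # /etc/systemd/system/clean-zypper-cache.timer (assumed name)
    [Unit]
    Description=Periodically clean the zypper package cache

    [Timer]
    OnCalendar=weekly
    Persistent=true

    [Install]
    WantedBy=timers.target

Both files would be activated with systemctl daemon-reload && systemctl enable --now clean-zypper-cache.timer.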
Updated by mkittler about 3 years ago
- Status changed from New to Feedback
- I've moved some big files from home directories to /space.
- I configured logrotate to also cover the gru and scheduler logs (same config as on OSD); a sketch of such a config is shown at the end of this comment.
- I've moved some log files to /space/logs because there was still not enough disk space left to perform the log rotation.
We should have enough free disk space again:
/dev/vda1 20G 6,7G 13G 36% /
This should do it for now. Other areas don't look problematic (checked via ncdu -x /). If necessary we can still reduce the log storage duration or store the openQA logs under /space/logs.
It is still not clear why the logrotate config was changed to remove the handling of the scheduler and gru logs. There actually were rotated scheduler/gru logs from September, and the logrotate config was modified on 14 October.
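For illustration, a logrotate snippet covering those two logs could look roughly like the following; the size and rotate values are assumptions, and the actual config taken over from OSD is not quoted here:

    # /etc/logrotate.d/openqa (illustrative; values are assumptions)
    /var/log/openqa_scheduler /var/log/openqa_gru {
        compress
        copytruncate
        missingok
        notifempty
        size 50M
        rotate 10
    }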
Updated by livdywan about 3 years ago
Let's conduct 5 WHYs tomorrow at 14:00 CE(S)T; at least Oli and I would find it useful to understand why this wasn't working before, and what changed.
Updated by tinita about 3 years ago
- File o3-disk-space.png added
Attached screenshot of munin disk space graph.
Updated by livdywan about 3 years ago
- Why did our monitoring not inform us of the issue?
- thruk.suse.de didn't send any emails
- This was brought up by Oleksandr in eng-testing (Slack) on Monday Dec 8 at 17.08 CEST: o3 was returning 503, seemingly the web UI was not running (from a user's point of view)
- Fabian tried rm -rf /var/cache/zypp/packages and restarted openqa-livehandler.service
1,9G /var/log/journal
2,7G /var/log/openqa_scheduler
6,9G /var/log/openqa_gru
- Why were we not alerted by thruk.suse.de, i.e. via email to individual persons?
- https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=ariel-opensuse.suse.de&service=root%20partition&backend=7215e#pnp_th4/1633871505/1636636305/0
- Last change to logrotate config Oct 14
- Notifications weren't enabled, and enabling them has to be done per user
- Why weren't we aware that configuration was changed?
- Without Salt, Ansible, or etckeeper we can't tell
- Why did thruk stop working?
- Ask eng infra
- Why are /home and /var/log on the root partition?
- There's #59933 but we didn't have capacity in the backlog
- We already had very old o3 ops tickets in the backlog and hence didn't add more
- More team members with an ops focus are needed
Improvements
- Everyone in the team should have access (and this should be mentioned in https://progress.opensuse.org/projects/qa/wiki#Onboarding-for-new-joiners)
- Install etckeeper (https://wiki.archlinux.org/title/etckeeper) to keep track of changes in /etc; see the sketch after this list
- Take a look at backup.qa.suse.de for relevant changes in /etc
- Clarify when we do 5 WHYs
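As a sketch of the etckeeper item above (package name and commands assume the openSUSE package with a git-backed /etc; nothing here was run as part of this ticket):

    zypper install etckeeper git
    etckeeper init                            # turns /etc into a git repository
    etckeeper commit "initial import of /etc"
    # later, review what changed and when:
    etckeeper vcs log --stat
    etckeeper vcs diff

etckeeper can also hook into the package manager to auto-commit /etc around package operations, which would make changes like the Oct 14 logrotate edit traceable.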
Updated by livdywan about 3 years ago
- Copied to coordination #102266: [epic] o3 ran out of disk space added
Updated by mkittler about 3 years ago
- Status changed from Feedback to Resolved
Looks like the changed logrotate config works. Since we've created a follow-up ticket for further improvements, I'm resolving this one.