action #102143
closed
Added by mkittler about 3 years ago.
Updated about 3 years ago.
Description
rm -rf /var/cache/zypp/packages helped as a temporary measure. Log rotation for /var/log/openqa_scheduler and /var/log/openqa_gru seems broken.
Acceptance criteria
- AC1: Logs are rotated so they don't become too big
- AC2: Temporary files are cleaned up automatically
- Target version set to Ready
- Status changed from New to Feedback
- I've moved some big files from home directories to /space.
- I configured logrotate to also cover the gru and scheduler logs (same config as on OSD).
- I've moved some log files to /space/logs because there was still not enough free disk space to perform the log rotation.
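For illustration, a logrotate stanza covering the two log files could look roughly like the sketch below. The actual config used on OSD is not shown in this ticket, so the schedule, the retention count and the use of copytruncate are all assumptions. The script only writes the stanza to a scratch file and does not touch /etc/logrotate.d.

```shell
#!/bin/sh
# Hypothetical sketch: the directive values below are assumptions,
# not the real OSD/o3 configuration.
conf=$(mktemp)

cat > "$conf" <<'EOF'
/var/log/openqa_scheduler /var/log/openqa_gru {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}
EOF

echo "wrote example stanza to $conf"

# Dry run, if logrotate is installed (debug mode, nothing is rotated):
# logrotate -d "$conf"
```

copytruncate is chosen here only because it avoids having to tell the services to reopen their log files; whether that fits the openQA services is not confirmed by this ticket.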
We should have enough free disk space again:
/dev/vda1 20G 6,7G 13G 36% /
This should do it for now. Other areas don't look problematic (checked via ncdu -x /). If necessary we can still reduce the log storage duration or store the openQA logs under /space/logs.
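Where ncdu is not installed, a similar overview can be approximated with du. This is a minimal sketch; the function name largest_entries is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: list the biggest entries below a directory,
# staying on one filesystem (-x), sizes in KiB, largest first.
largest_entries() {
    dir="$1"
    count="${2:-10}"
    du -x -a -k "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example: the ten largest entries under /var/log
# largest_entries /var/log 10
```

The first line of the output is normally the directory itself with its total size, followed by the largest files and subdirectories.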
It is still not clear why the logrotate config was changed to remove handling of scheduler and gru logs. There actually were rotated scheduler/gru logs from September, and the logrotate config was modified on 14 October.
Let's conduct 5 WHYs tomorrow 14.00 CE(S)T, at least Oli and I would find it useful to understand why this wasn't working before, and what changed.
Attached screenshot of munin disk space graph.
- Why did our monitoring not inform us of the issue?
- thruk.suse.de didn't send any emails
- This was brought up in eng-testing (Slack) by Oleksandr on Monday Dec 8 17.08 CEST: o3 was returning 503, and seemingly the web UI was not running (from a user's point of view)
- Fabian tried rm -rf /var/cache/zypp/packages and restarted openqa-livehandler.service
1,9G /var/log/journal
2,7G /var/log/openqa_scheduler
6,9G /var/log/openqa_gru
- Why have we not been alerted by thruk.suse.de i.e. email to individual persons?
- Why weren't we aware that configuration was changed?
- Without salt, ansible or etckeeper we can't tell
- Why did thruk stop working?
- Why are /home and /var/log on the root partition?
- There's #59933 but we didn't have capacity in the backlog
- We already had very old o3 ops tickets in the backlog and hence did not add more
- More team members with an ops focus are needed
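As a stopgap for the missing alerting discussed in the list above, a simple disk usage check could be run periodically, for example from cron. This is only a sketch: the function names and the 90% default limit are assumptions, and it is not a replacement for the thruk monitoring.

```shell
#!/bin/sh
# Hypothetical sketch: warn when a mount point is fuller than a limit.

# Print the used percentage of the filesystem holding $1 (e.g. 36).
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print a status line; return 1 if usage is at or above the limit.
check_disk() {
    mount="$1"
    limit="${2:-90}"   # assumed default threshold in percent
    used=$(usage_percent "$mount")
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $mount is ${used}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: $mount is ${used}% full"
}

# Example: check the root partition
# check_disk / 90
```

The non-zero return code on a warning lets a wrapper decide how to notify, for instance by sending an email.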
Improvements
- Status changed from Feedback to Resolved
Looks like the changed logrotate config works. Since we've created a follow-up ticket for further improvements I'm resolving this one.