action #102143

o3 ran out of disk space

Added by mkittler 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2021-11-09
Due date:
% Done: 0%
Estimated time:
Description

Running rm -rf /var/cache/zypp/packages helped as a temporary measure. Log rotation for /var/log/openqa_scheduler and /var/log/openqa_gru seems to be broken.

Acceptance criteria

  • AC1: Logs are rotated so they don't become too big
  • AC2: Temporary files are cleaned up automatically

Attachment: o3-disk-space.png (110 KB), tinita, 2021-11-11 13:11

Related issues

Copied to openQA Infrastructure - coordination #102266: [epic] o3 ran out of disk space (Blocked, 2021-12-21)

History

#1 Updated by okurz 3 months ago

  • Target version set to Ready

#2 Updated by mkittler 3 months ago

  • Assignee set to mkittler

#3 Updated by mkittler 3 months ago

  • Status changed from New to Feedback
  • I've moved some big files from home directories to /space.
  • I configured logrotate to also cover the gru and scheduler logs (same config as on OSD).
  • I've moved some log files to /space/logs because there was still not enough disk space left to perform the log rotation.

We should have enough free disk space again:

/dev/vda1                20G    6,7G   13G   36% /

This should do it for now. Other areas don't look problematic (checked via ncdu -x /). If necessary we can still reduce the log storage duration or store the openQA logs under /space/logs.
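
A check like the df line above could be scripted so that a full root filesystem is noticed earlier. The following is only a minimal sketch, not what runs on o3: the function names and the 80% threshold are assumptions made for illustration.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the used percentage of the filesystem containing path.

    Computed as used / (used + available), which is close to what df
    reports in its Use% column.
    """
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free
    if denominator == 0:
        return 0.0
    return 100.0 * usage.used / denominator


def check_disk(path="/", warn_at=80.0):
    """Return a one-line status string; warn_at is a hypothetical threshold."""
    percent = disk_usage_percent(path)
    status = "WARNING" if percent >= warn_at else "OK"
    return f"{status}: {path} is {percent:.0f}% full"


if __name__ == "__main__":
    print(check_disk("/"))
```

Run from a timer or cron job, the returned line could be mailed or logged whenever it starts with WARNING.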


It is still not clear why the logrotate config was changed to remove the handling of scheduler and gru logs. There actually were rotated scheduler/gru logs from September, and the logrotate config was modified on 14 October.

#4 Updated by cdywan 3 months ago

Let's conduct 5 Whys tomorrow at 14.00 CE(S)T. At least Oli and I would find it useful to understand why this wasn't working before and what changed.

#5 Updated by tinita 3 months ago

Attached screenshot of munin disk space graph.

#6 Updated by cdywan 3 months ago

  • Why did our monitoring not inform us of the issue?
    • thruk.suse.de didn't send any emails
    • This was brought up in eng-testing (Slack) by Oleksandr on Monday Dec 8 17.08 CEST: o3 was returning 503 and, from a user's point of view, the web UI seemingly was not running
    • Fabian tried rm -rf /var/cache/zypp/packages and restarted openqa-livehandler.service
1,9G    /var/log/journal
2,7G    /var/log/openqa_scheduler
6,9G    /var/log/openqa_gru
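
A listing like the one above can also be produced programmatically to spot the largest log directories. This is a rough sketch with hypothetical helper names. It sums apparent file sizes, so the numbers can differ somewhat from what du or ncdu report.

```python
import os


def dir_size(path):
    """Sum the apparent size in bytes of all files below path.

    Symlinks are not followed; unreadable entries are skipped.
    """
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass
    return total


def largest_entries(parent, top=5):
    """Return up to `top` (size, path) pairs for entries directly in parent, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    size = dir_size(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue
            sizes.append((size, entry.path))
    return sorted(sizes, reverse=True)[:top]
```

For example, largest_entries("/var/log", top=3) would return the three biggest entries under /var/log, which can then be printed in GiB by dividing each size by 2**30.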

Improvements

#7 Updated by cdywan 3 months ago

#8 Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

Looks like the changed logrotate config works. Since we've created a follow-up ticket for further improvements I'm resolving this one.
