action #167719

closed

coordination #161414: [epic] Improved salt based infrastructure management

No new data in monitor.qe.nue2.suse.org due to influxdb failing to write with "error opening new segment file for wal (1): write /var/lib/influxdb/….wal: no space left on device"

Added by okurz about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-10-02
Due date:
% Done: 0%
Estimated time:
Description

Observation

No new data in monitor.qe.nue2.suse.org due to influxdb failing to write with "error opening new segment file for wal (1): write /var/lib/influxdb/….wal: no space left on device"

df -h says

/dev/vdb1       300G  291G  8.0G  98% /var/lib/influxdb

Related issues 1 (1 open, 0 closed)

Copied to openQA Infrastructure - action #167722: Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M (Status: Workable, Assignee: nicksinger, Start date: 2024-10-02, Due date: 2024-11-29)

#1

Updated by okurz about 2 months ago

  • Assignee deleted (okurz)
  • Priority changed from Urgent to Normal

I logged into the monitor instance and ran systemctl status, which showed nothing wrong. I then checked the individual services: grafana was fine, but influxdb showed error messages. On qamaster.qe.nue2.suse.org I shut down the monitor VM and then ran

qemu-img resize /var/lib/libvirt/images/openqa-monitoring-data.qcow2 +200G

After booting monitor.qe.nue2.suse.org again I ran

parted -s -a opt /dev/vdb "resizepart 1 100%"
btrfs fi resize max /var/lib/influxdb

Following the system journal I could see that all recovered well.
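
The resize steps above can be captured as a small script. This is only a sketch under stated assumptions: the helper names run, grow_image and grow_guest_fs and the DRY_RUN variable are hypothetical, and by default the script only prints the commands instead of executing them. The commands themselves are the ones quoted above.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them (hypothetical switch for this sketch).
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 1, on the hypervisor while the VM is shut down:
# grow the disk image by the given amount, e.g. 200G.
grow_image() {
    run qemu-img resize "$1" "+$2"
}

# Step 2, inside the VM after it has booted again:
# grow partition 1 to the end of the disk, then grow the btrfs filesystem.
grow_guest_fs() {
    run parted -s -a opt "$1" "resizepart 1 100%"
    run btrfs fi resize max "$2"
}
```

With these helpers, grow_image /var/lib/libvirt/images/openqa-monitoring-data.qcow2 200G would be called on qamaster.qe.nue2.suse.org and grow_guest_fs /dev/vdb /var/lib/influxdb on monitor.qe.nue2.suse.org.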

Next tasks:

  1. Check resource usage within influxdb to find out which measurements consume the most space
  2. Find out why we did not notice the disk usage problem earlier and did not receive any alerts
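
As a possible starting point for these two tasks, the following sketch lists the largest entries below a directory and checks a filesystem's usage against a threshold. The helper names largest_entries, mount_usage and usage_at_least and the threshold of 90 in the usage note are hypothetical and not part of the existing monitoring setup. Note that du only shows on-disk directories, so it cannot by itself attribute space to individual measurements.

```shell
#!/bin/sh
# List the N largest entries directly below a directory (sizes in KiB),
# biggest first. N defaults to 5.
largest_entries() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-5}"
}

# Print the usage percentage (e.g. "98%") of the filesystem holding a path.
# "df -P" selects the POSIX output format with one line per filesystem.
mount_usage() {
    df -P "$1" | awk 'NR==2 {print $5}'
}

# Succeed when a df-style percentage such as "98%" is at or above
# the given integer threshold.
usage_at_least() {
    pcent=$(printf '%s' "$1" | tr -d ' %')
    [ "$pcent" -ge "$2" ]
}
```

For example, usage_at_least "$(mount_usage /var/lib/influxdb)" 90 && echo "disk almost full" would have fired at the 98% shown by df in the observation, and largest_entries /var/lib/influxdb gives a first overview of where the space goes.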
#2

Updated by okurz about 2 months ago

  • Copied to action #167722: Efficient use of monitoring data within influxdb on monitor.qe.nue2.suse.org size:M added
#3

Updated by okurz about 2 months ago

  • Parent task set to #161414
#4

Updated by okurz about 2 months ago

  • Status changed from New to Resolved
  • Assignee set to okurz
  • Priority changed from Normal to Urgent

I created dedicated tickets for the two identified follow-up tasks:

  1. #167722
  2. #167728