action #96789

closed

File systems alert 90.256 assets used size:M

Added by livdywan about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2021-08-12
Due date:
% Done: 0%
Estimated time:

Description

Observation

[Alerting] File systems alert

One of the file systems is too full
Metric name               | Value
/assets: Used Percentage  | 90.256

See http://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=74&orgId=1

Suggestion

  • Find assets to delete
  • Use archive feature to move assets?
  • See if the cleanup ran properly
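
As a starting point for the first suggestion, here is a minimal sketch of checking how full a file system is and listing its largest files. The helper names `used_percentage` and `largest_files` are hypothetical and not part of openQA. The computed percentage may also differ slightly from the monitoring value, since tools account for reserved blocks differently.

```python
import os
import shutil


def used_percentage(path):
    """Return the used share of the file system containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) tuples under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full), full))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:limit]
```

Calling `used_percentage("/assets")` and `largest_files("/assets")` on the affected host would give a rough picture of what is taking up space; the path is taken from the alert above.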

Related issues 1 (1 open, 0 closed)

Related to openQA Infrastructure - action #97976: [alert] OSD file systems - assets (New, 2021-09-02, 2021-10-01)

Actions #1

Updated by livdywan about 3 years ago

  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #2

Updated by livdywan about 3 years ago

  • Subject changed from File systems alert 90.256 assets used to File systems alert 90.256 assets used size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by mkittler about 3 years ago

  • Status changed from Workable to Resolved
  • Assignee set to mkittler

I've actually been handling this, see my mails. The alert is now ok again. Nothing was really broken; the asset cleanup was just postponed for too long (but in a way which is expected).

I've been asking myself the following questions on how to improve this in the future:

  1. Maybe we could also change the locking to allow running the cleanup of assets and results concurrently? In our setup results and assets are on different disks so running both at the same time shouldn't be counterproductive and in this case it would have helped. In fact I resolved the issue by manually deleting the limit_tasks lock to let the asset cleanup run in parallel with the result cleanup.
  2. The last 3 asset cleanup jobs which could have actually run did not, because at that point the threshold had not been reached and the cleanup was therefore skipped. The same goes for the result cleanup which ran before the currently active one: it was skipped because we were under the threshold, but that likely contributes to today's cleanup taking very long. Maybe we should rethink postponing the cleanup according to the thresholds (at least in its current form)?
Actions #4

Updated by okurz about 3 years ago

  • Related to action #97976: [alert] OSD file systems - assets added