action #164979

closed

[alert][grafana] File systems alert for WebUI /results size:S

Added by nicksinger 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: Regressions/Crashes
Start date:
Due date: 2024-08-21
% Done: 0%
Estimated time:
Tags:

Description

Observation

Observed at 2024-08-06 07:17:00 +0200 CEST
One of the file systems /results is too full (> 90%)
See http://stats.openqa-monitor.qa.suse.de/d/WebuiDb?orgId=1&viewPanel=74

Current usage:

Filesystem      Size  Used Avail Use% Mounted on
/dev/vdd        7.0T  6.4T  681G  91% /results

Acceptance criteria

AC1: There is enough space and headroom on the affected file system /results, i.e. considerably more than 20% free

Suggestions

  • Check job group "logs" retention settings for "not-important" / "groupless" results and consider reducing the period (see the query sketch after this list)
  • Consider extending the silence period if fixing takes too long: https://stats.openqa-monitor.qa.suse.de/alerting/silence/9ee9b299-3d06-4234-97bf-6b84e2ad9a24/edit?alertmanager=grafana
  • Reconsider the design of scheduling openqa-investigate for unreviewed jobs and possibly plan this in a separate ticket
  • Tell the security squad that their test scenario(s) are problematic and should fail less often or be properly reviewed
  • Tell the security squad about their test scenario(s) which are significantly bigger than other jobs and consider reducing the space usage, e.g. save less or compress stuff
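
A minimal sketch of how the current per-group retention settings could be inspected on the database, assuming the job_groups table exposes the usual keep_*_in_days columns (NULL typically meaning the group inherits the instance or parent default; the retention for groupless results is configured on the instance level rather than per group):

openqa=> select name, keep_logs_in_days, keep_important_logs_in_days, keep_results_in_days, keep_important_results_in_days from job_groups order by name;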

Rollback steps

Out of scope

  • Better accounting e.g. linking of investigation jobs to their original groups -> #164988


Related issues 2 (1 open, 1 closed)

Copied from openQA Infrastructure (public) - action #129244: [alert][grafana] File systems alert for WebUI /results size:M (Resolved, mkittler, 2023-05-12 to 2023-05-30)

Copied to openQA Project (public) - action #164988: Better accounting for openqa-investigation jobs size:S (Workable, 2024-08-06)

Actions #1

Updated by nicksinger 5 months ago

  • Copied from action #129244: [alert][grafana] File systems alert for WebUI /results size:M added
Actions #2

Updated by okurz 5 months ago

  • Category set to Regressions/Crashes
Actions #3

Updated by mkittler 5 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #4

Updated by mkittler 5 months ago

We had an increase of 10 percentage points in the last 7 days. The graph also shows that the cleanup is still generally working. The Minion dashboard and gru logs also look generally good (e.g. https://openqa.suse.de/minion/jobs?task=limit_results_and_logs). I nevertheless triggered another result cleanup.

It looks like the group of groupless jobs is the biggest contributor: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=19
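
A quick way to confirm that from the database would be a query along these lines (a sketch using the same jobs columns as the notes below; the exact filters are an assumption):

openqa=> select pg_size_pretty(sum(result_size)) as groupless_result_size from jobs where group_id is null and state = 'done' and archived = false;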

Actions #5

Updated by mkittler 4 months ago · Edited

The query select id, TEST, result, pg_size_pretty(result_size) as result_size_pretty from jobs where group_id is null and t_created > '2024-07-30' and state = 'done' and archived = false order by result_size desc; shows lots of oscap_ansible_anssi_bp28_high jobs like https://openqa.suse.de/tests/15082557.

The following query shows this more clearly:

openqa=> select (substring(TEST from 0 for position(':' in TEST))) as short_name, min(TEST) as example_name, pg_size_pretty(sum(result_size)) as result_size_pretty from jobs where group_id is null and t_created > '2024-07-30' and state = 'done' and archived = false group by short_name order by sum(result_size) desc;
                   short_name                   |                                             example_name                                              | result_size_pretty 
------------------------------------------------+-------------------------------------------------------------------------------------------------------+--------------------
 oscap_bash_anssi_bp28_high                     | oscap_bash_anssi_bp28_high:investigate:bisect_without_32673                                           | 355 GB
 oscap_ansible_anssi_bp28_high                  | oscap_ansible_anssi_bp28_high:investigate:bisect_without_32673                                        | 350 GB
                                                | 390x-2                                                                                                | 18 GB
 gnome                                          | gnome:investigate:last_good_build::34061:mozilla-nss                                                  | 3098 MB
 qam-regression-firefox-SLES                    | qam-regression-firefox-SLES:investigate:bisect_without_34295                                          | 2784 MB
 oscap_ansible_cis                              | oscap_ansible_cis:investigate:bisect_without_33993                                                    | 2436 MB
 cryptlvm                                       | cryptlvm:investigate:last_good_build::34061:mozilla-nss                                               | 2421 MB
 jeos-containers-podman                         | jeos-containers-podman:investigate:bisect_without_33993                                               | 2194 MB
 jeos-containers-docker                         | jeos-containers-docker:investigate:bisect_without_33585                                               | 2173 MB
 docker_tests                                   | docker_tests:investigate:bisect_without_33585                                                         | 2113 MB
 qam_ha_hawk_haproxy_node01                     | qam_ha_hawk_haproxy_node01:investigate:last_good_build::32673:python-docutils                         | 2048 MB
 jeos-extratest                                 | jeos-extratest:investigate:bisect_without_33993                                                       | 2021 MB
 qam_ha_hawk_haproxy_node02                     | qam_ha_hawk_haproxy_node02:investigate:last_good_build::32673:python-docutils                         | 1921 MB
 qam_ha_hawk_client                             | qam_ha_hawk_client:investigate:last_good_build::32673:python-docutils                                 | 1891 MB
 fips_env_mode_tests_crypt_web                  | fips_env_mode_tests_crypt_web:investigate:bisect_without_33585                                        | 1705 MB
 jeos-filesystem                                | jeos-filesystem:investigate:bisect_without_33585                                                      | 1694 MB
 fips_ker_mode_tests_crypt_web                  | fips_ker_mode_tests_crypt_web:investigate:bisect_without_33585                                        | 1663 MB
 qam_wicked_startandstop_sut                    | qam_wicked_startandstop_sut:investigate:last_good_build:20240804-1                                    | 1440 MB
 qam_ha_rolling_upgrade_migration_node01        | qam_ha_rolling_upgrade_migration_node01:investigate:last_good_build::32673:python-docutils            | 1404 MB
 slem_image_default                             | slem_image_default:investigate:bisect_without_28151                                                   | 1350 MB
 slem_fips_docker                               | slem_fips_docker:investigate:bisect_without_35050                                                     | 1141 MB
 slem_installation_autoyast                     | slem_installation_autoyast:investigate:bisect_without_28150                                           | 1105 MB
 wsl2-main+wsl_gui                              | wsl2-main+wsl_gui:investigate:last_good_build:2.374                                                   | 1072 MB

Problematic job groups:

openqa=> select concat('https://openqa.suse.de/group_overview/', group_id) as group_url, (select name from job_group_parents where id = (select parent_id from job_groups where id = group_id)) as parent_group_name, (select name from job_groups where id = group_id) as group_name, count(id) as job_count from jobs where TEST = 'oscap_bash_anssi_bp28_high' group by group_id order by job_count desc;
                 group_url                 |        parent_group_name        |              group_name              | job_count 
-------------------------------------------+---------------------------------+--------------------------------------+-----------
 https://openqa.suse.de/group_overview/429 | Maintenance: Aggregated updates | Security Maintenance Updates         |       839
 https://openqa.suse.de/group_overview/268 | SLE 15                          | Security                             |       141
 https://openqa.suse.de/group_overview/517 | Maintenance: QR                 | Maintenance - QR - SLE15SP5-Security |        43
 https://openqa.suse.de/group_overview/462 | Maintenance: QR                 | Maintenance - QR - SLE15SP4-Security |        38
 https://openqa.suse.de/group_overview/588 | Maintenance: QR                 | Maintenance - QR - SLE15SP6-Security |         1
(5 rows)
Actions #6

Updated by mkittler 4 months ago

Those are the most offending jobs:

openqa=> select count(id), pg_size_pretty(sum(result_size)) from jobs where group_id is null and TEST like 'oscap_%_anssi_bp28_high:investigate%' and t_finished < '2024-08-06';
 count | pg_size_pretty 
-------+----------------
  2488 | 548 GB
(1 row)
openqa=> select count(id), pg_size_pretty(sum(result_size)) from jobs where group_id is null and TEST like 'oscap_%_anssi_bp28_high:investigate%' and t_finished >= '2024-08-06';
 count | pg_size_pretty 
-------+----------------
   700 | 157 GB
(1 row)

I am currently deleting the jobs from before today via:

openqa=> \copy (select id from jobs where group_id is null and TEST like 'oscap_%_anssi_bp28_high:investigate%' and t_finished < '2024-08-06') to '/tmp/jobs_to_delete' csv;
COPY 2488
martchus@openqa:~> for id in $(cat /tmp/jobs_to_delete) ; do sudo openqa-cli api --osd -X delete jobs/$id ; done
Actions #7

Updated by okurz 4 months ago

  • Copied to action #164988: Better accounting for openqa-investigation jobs size:S added
Actions #8

Updated by livdywan 4 months ago

  • Subject changed from [alert][grafana] File systems alert for WebUI /results to [alert][grafana] File systems alert for WebUI /results size:S
  • Description updated (diff)
Actions #10

Updated by mkittler 4 months ago

The deletion went through and now we're back at 83 % which should be good enough for now.

I also triggered another cleanup via sudo systemctl start openqa-enqueue-asset-cleanup.service to also get rid of screenshots which might now be dangling.
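
For reference, a rough way to gauge how many screenshots are no longer referenced by any job, assuming the usual screenshots/screenshot_links tables (the cleanup task does the actual deletion, and a query like this can be slow on an instance of this size):

openqa=> select count(*) from screenshots s where not exists (select 1 from screenshot_links l where l.screenshot_id = s.id);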

Actions #11

Updated by okurz 4 months ago

mkittler wrote in #note-10:

The deletion went through and now we're back at 83 % which should be good enough for now.

83% means that the space-aware cleanup cannot delete more even though it should be able to, so I think the job group quotas in sum are bigger than the overall available space in /results, aren't they?

Actions #12

Updated by mkittler 4 months ago

I mentioned the problematic scenarios on the Slack channel #discuss-qe-security. Let's see what the response will be. If it is not helpful and if https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1243 doesn't help well enough we'll have to resort to the sledgehammer method of disabling the investigation jobs for all involved job groups (see #164979#note-5) by adjusting the exclude_group_regex configured in the salt states.


@okurz I'm not entirely sure what you're getting at. We're above the threshold for skipping/executing the cleanup - in other words the cleanup is executed and that is expected. Whether the sum of the quotas is bigger than the overall space cannot be calculated because we use time-based quotas.
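
What can be compared is how much result space each group currently occupies, for example with a sketch like this (assuming the same jobs/job_groups schema as the queries above):

openqa=> select g.name as group_name, pg_size_pretty(sum(j.result_size)) as result_size_pretty from jobs j join job_groups g on g.id = j.group_id where j.state = 'done' group by g.name order by sum(j.result_size) desc limit 10;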

Actions #13

Updated by openqa_review 4 months ago

  • Due date set to 2024-08-21

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by okurz 4 months ago · Edited

Taking a look at the last 6 months, one can clearly see that the space-aware cleanup was able to keep enough free space only until 2024-04.

Screenshot_20240807_093724_results_usage_6_months.png

So in my understanding that means that the job group quotas in sum exceed what is available and that should be reduced.

Actions #15

Updated by livdywan 4 months ago

Action items:

  • File an SD ticket to increase storage space by 2TB
  • Follow up with security squad
  • Switch off relevant scenarios by EOD if they continue to accumulate space
Actions #17

Updated by mkittler 4 months ago

I suppose for deleting the jobs we needed to file an SR on https://gitlab.suse.de/qe-security/osd-sle15-security. This should also cover the 5xx groups not explicitly mentioned in their README. We just needed to search for _anssi_bp28_high and delete all entries in the YAML that end with it.

Actions #18

Updated by mkittler 4 months ago

I just wanted to see on what products the problematic test scenarios fail specifically. However, now it looks like they have all been fixed.

With that I hope we can refrain from removing those scenarios.

Actions #19

Updated by livdywan 4 months ago

Looks much better indeed. Are we okay to lower the urgency then?

Actions #20

Updated by mkittler 4 months ago

I now deleted all jobs selected by \copy (select id from jobs where group_id is null and TEST like 'oscap_%_anssi_bp28_high:investigate%') to '/tmp/jobs_to_delete' csv; to free 160 GiB of disk space. Now we're at 81 %. So we'd still have to reduce quotas. However, it probably makes sense to wait for a response in https://sd.suse.com/servicedesk/customer/portal/1/SD-164873 first.

Actions #21

Updated by mkittler 4 months ago

  • Priority changed from Urgent to High
Actions #22

Updated by livdywan 4 months ago

  • Assignee changed from mkittler to livdywan

I'm taking the ticket from here

Actions #23

Updated by livdywan 4 months ago

  • Description updated (diff)
  • Status changed from In Progress to Blocked

We're at 122GB and the silence is gone. Let's block on SD-164873 at this point.

Actions #24

Updated by livdywan 4 months ago

livdywan wrote in #note-23:

We're at 122GB and the silence is gone. Let's block on SD-164873 at this point.

https://suse.slack.com/archives/C029APBKLGK/p1723626574395099

Actions #25

Updated by mkittler 4 months ago

  • Assignee changed from livdywan to mkittler
Actions #26

Updated by mkittler 4 months ago

  • Status changed from Blocked to Feedback

The scenarios mentioned in #164979#note-18 fail again as of today and the investigation jobs have already occupied 80 GiB. I'm deleting all investigation jobs. All jobs in the scenarios seem to have proper bugrefs now. So let's see whether the carry over works and further investigation jobs are prevented. (Although it is already problematic that we use 80 GiB on the first occurrence. So even if those scenarios are reviewed in a timely manner that's a lot of disk space being used.)
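
To keep an eye on how fast those investigation jobs pile up again, a query along the lines of the ones in #note-6 can be repeated (a sketch; the TEST filter is the same assumption as before):

openqa=> select count(id), pg_size_pretty(sum(result_size)) from jobs where group_id is null and TEST like 'oscap_%_anssi_bp28_high:investigate%';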

Actions #27

Updated by viktors.trubovics 4 months ago

Hello!
One possible long-term solution could be to move/change the OpenQA storage to BTRFS with filesystem-level compression.
It compresses text files very effectively, more than 100x in my experience.
I solved a storage problem this way when storing a huge amount of small text files (billions of files and terabytes of text) on terabyte-scale storage.

I created a PR for the OSCAP tests: Compress and upload test artifacts https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/19967

Actions #28

Updated by mkittler 4 months ago

Thanks, the PR looks good to me.

Actions #29

Updated by mkittler 4 months ago

The PR for the compression has been merged. Meanwhile no further investigation jobs piled up.

Actions #30

Updated by mkittler 4 months ago · Edited

Looks like the carry over generally works, see https://openqa.suse.de/tests/15198055#comments. 19 GiB have nevertheless piled up again for investigation jobs. I'll check the situation next week and possibly delete jobs again manually.

EDIT: Not much more has piled up. I'm nevertheless deleting the now 20 GiB of investigation jobs.

Actions #31

Updated by mkittler 4 months ago

We got one additional TiB and I have already increased the filesystem size. Not sure what we still can/should improve here.

Actions #32

Updated by mkittler 4 months ago

  • Status changed from Feedback to Resolved

We only accumulated an additional 2431 MiB from new investigation jobs of the problematic scenarios. They are now also better maintained and smaller thanks to compression. We also have #164988 for better accounting of such jobs in general. The additional 1 TiB also gives us more headroom. With all of that I guess the alert is handled for now (and the silence has been removed for a while now), so I consider this ticket resolved.
