action #176604
open[alert] "Inode utilization Salt netboot /srv/tftpboot/mnt/dist" alert was flaky on 25-02-04 for multiple hours size:s
0%
Description
Observation
This caused multiple alert e-mails pointing to https://monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view?orgId=1. The state history can be viewed on https://monitor.qa.suse.de/alerting/list?search=namespace:%22Salt%22%20group:%22inodes%22 and shows that the alert was going on and off 4 times. It was also pending 3 times before that.
Acceptance criteria
- AC1: No more alerts about inode utilization
Suggestions
- Check whether the alert thresholds are reasonable and possibly adjust them.
- Check whether there were any actual issues with tests that ran around that time (see further details).
Further details
Incomplete jobs around that time:
openqa=> select id, result, reason from jobs where t_started >= '2025-02-04 18:00:00' and t_finished <= '2025-02-05 00:00:00' and result in ('incomplete') order by t_started;
id | result | reason
----------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------
16655323 | incomplete | died: terminated prematurely, see log output for details
16655373 | incomplete | api failure: 400 response: Error in tempfile() using template /var/lib/openqa/testresults/16655/16655373-sle-15-SP6-Server-DVD-SAP-Incidents-x86_64-Build:36438:yast2-sap-ha-sles4sap_ensa_supportserver@64bit-4gbram/XXXXXXXXXX: Could not create temp file /var/lib/openqa/testresults/16655/16655373-sle-…
16650790 | incomplete | api failure: 400 response: Error in tempfile() using template /var/lib/openqa/testresults/16650/16650790-sle-15-SP3-GCE-SAP-BYOS-Incidents-saptune-x86_64-Build:37325:rpmlint-sles4sap_gnome_saptune_solutions@gce_n1_highmem_8/XXXXXXXXXX: Could not create temp file /var/lib/openqa/testresults/16650/166…
16650795 | incomplete | api failure: 400 response: Error in tempfile() using template /var/lib/openqa/testresults/16650/16650795-sle-15-SP4-Azure-SAP-PAYG-Incidents-x86_64-Build:37325:rpmlint-SAPHanaSR-ScaleUp-PerfOpt-spn@az_Standard_E4s_v3/XXXXXXXXXX: Could not create temp file /var/lib/openqa/testresults/16650/16650795-s…
16655545 | incomplete | api failure: 400 response: Error in tempfile() using template /var/lib/openqa/testresults/16655/16655545-sle-15-SP3-BCI-Updates-x86_64-Build2.35_sles15-ltss-image-bci-base_15.3-ltss_on_SLES_15-SP5_podman@64bit/XXXXXXXXXX: Could not create temp file /var/lib/openqa/testresults/16655/16655545-sle-15-S…
16655725 | incomplete | api failure: 400 response: Error in tempfile() using template /var/lib/openqa/testresults/16655/16655725-sle-15-SP6-BCI-Updates-x86_64-Build5.3_dotnet-9.0-dotnet-sdk_9.0_on_SLES_15-SP4_docker@64bit/XXXXXXXXXX: Could not create temp file /var/lib/openqa/testresults/16655/16655725-sle-15-SP6-BCI-Updat…
16658339 | incomplete | asset failure: Failed to download SLE-15-SP4-x86_64-Build20250204-1-qam-sle_sap-gnome.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-x86_64-Build20250204-1-qam-sle_sap-gnome.qcow2
16660457 | incomplete | backend died: QEMU exited unexpectedly, see log for details
16660458 | incomplete | backend died: QEMU exited unexpectedly, see log for details
16658857 | incomplete | asset failure: Failed to download sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16659285 | incomplete | asset failure: Failed to download sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16659376 | incomplete | asset failure: Failed to download sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16659503 | incomplete | backend died: QEMU exited unexpectedly, see log for details
16660024 | incomplete | asset failure: Failed to download autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20250204-1-Server-DVD-Updates-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20250204-1-Server-DVD-Updates-64bit.qcow2
16660516 | incomplete | asset failure: Failed to download autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20250204-1-Server-DVD-Updates-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20250204-1-Server-DVD-Updates-64bit.qcow2
16659985 | incomplete | backend died: QEMU exited unexpectedly, see log for details
16660472 | incomplete | asset failure: Failed to download sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16660482 | incomplete | asset failure: Failed to download sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16660467 | incomplete | asset failure: Failed to download sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16660584 | incomplete | asset failure: Failed to download sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16660585 | incomplete | asset failure: Failed to download sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16660583 | incomplete | asset failure: Failed to download sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16660596 | incomplete | asset failure: Failed to download sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16660597 | incomplete | asset failure: Failed to download sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16660595 | incomplete | asset failure: Failed to download sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
16660181 | incomplete | asset failure: Failed to download sle-15-SP6-x86_64-min-add-qsec-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP6-x86_64-min-add-qsec-20250204-1@64bit.qcow2
16660731 | incomplete | asset failure: Failed to download SLES-Packages-16.0-x86_64-Build4.2.iso to /var/lib/openqa/cache/openqa.suse.de/SLES-Packages-16.0-x86_64-Build4.2.iso
16660905 | incomplete | asset failure: Failed to download SLES-16.0-x86_64-mru-install-minimal-with-addons-Build4.2-agama-installer-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-16.0-x86_64-mru-install-minimal-with-addons-Build4.2-agama-installer-64bit.qcow2
(28 rows)
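To make the query output easier to scan, the reasons can be grouped by the text before the first colon. Counted by hand from the 28 rows above, that gives 18 "asset failure", 5 "api failure", 4 "backend died" and 1 "died". The following is only a rough sketch of that grouping; the function names are made up for illustration and the sample reasons are shortened copies of rows from the output:

```python
from collections import Counter


def reason_category(reason: str) -> str:
    """Return the part of a job reason before the first colon."""
    return reason.split(":", 1)[0].strip()


def summarize(reasons):
    """Count how often each reason category occurs."""
    return Counter(reason_category(r) for r in reasons)


# Shortened copies of a few reasons from the query output above
# (the "..." marks text cut for brevity, not real content).
sample = [
    "died: terminated prematurely, see log output for details",
    "api failure: 400 response: Error in tempfile() using template ...",
    "asset failure: Failed to download ...",
    "backend died: QEMU exited unexpectedly, see log for details",
    "asset failure: Failed to download ...",
]

summary = summarize(sample)
for category, count in summary.most_common():
    print(f"{category}: {count}")
```

Feeding all 28 reasons from the `reason` column into `summarize` should reproduce the hand count given above.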
Files
Updated by livdywan 13 days ago
- Subject changed from [alert] "Inode utilization Salt netboot /srv/tftpboot/mnt/dist" alert was flaky on 25-02-04 for multiple hours to [alert] "Inode utilization Salt netboot /srv/tftpboot/mnt/dist" alert was flaky on 25-02-04 for multiple hours size:s
- Description updated (diff)
- Status changed from New to Workable
Updated by gpathak 5 days ago
- File clipboard-202502132116-lockp.png added
- File clipboard-202502132117-rpvft.png added
I did some investigation and found that there are some very old files, dating back more than a decade. Maybe we can delete them if they are no longer required?
These are the two folders that contain some fairly large files, and /srv/tftpboot/mnt/dist/RH.epam includes some files that are 0 bytes in size.
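A check like this could be scripted roughly as follows. This is only a sketch: `scan_tree` and the ten-year default cutoff are assumptions for illustration, and it only lists candidates without deleting anything:

```python
import os
import time


def scan_tree(root, max_age_days=3650):
    """Walk root and return (old_files, empty_files).

    old_files: regular entries whose mtime is older than max_age_days.
    empty_files: entries that are 0 bytes in size.
    """
    cutoff = time.time() - max_age_days * 86400
    old_files, empty_files = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # do not follow symlinks
            except OSError:
                continue  # entry vanished or is unreadable, skip it
            if st.st_size == 0:
                empty_files.append(path)
            if st.st_mtime < cutoff:
                old_files.append(path)
    return old_files, empty_files


# Example call (path taken from this ticket, not executed here):
# old, empty = scan_tree("/srv/tftpboot/mnt/dist")
```

Note that on a large network mount such a walk can take a long time, since every entry needs its own `lstat` call.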
Updated by okurz 5 days ago
This is just a mountpoint showing the content of dist.suse.de, which we are merely users of. I don't know what you discussed during the refinement about this, but the suggestions don't make sense to me. It seems like nobody read the subject properly. We already have other cases where we excluded mount points using certain filesystems from alerting, e.g. "nfs". We should likely do the same here.
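The idea of excluding mount points by filesystem type can be illustrated with a small sketch. This is not how the alert rule is actually defined; the function names, the excluded set and the sample mount table (including the hypothetical `dist.example.org` export and its `nfs4` type) are assumptions made up for illustration:

```python
# Filesystem types to leave out of inode alerting (assumed set).
EXCLUDED_FSTYPES = {"nfs", "nfs4"}


def parse_mounts(text):
    """Parse /proc/mounts-style lines into (mountpoint, fstype) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def alertable_mountpoints(text, excluded=EXCLUDED_FSTYPES):
    """Return mount points whose filesystem type is not excluded."""
    return [mp for mp, fstype in parse_mounts(text) if fstype not in excluded]


# Hypothetical mount table; device names and options are invented.
SAMPLE = """\
/dev/vda1 / btrfs rw,relatime 0 0
dist.example.org:/dist /srv/tftpboot/mnt/dist nfs4 ro,relatime 0 0
tmpfs /run tmpfs rw,nosuid 0 0
"""

print(alertable_mountpoints(SAMPLE))
```

With this sample only `/` and `/run` remain, so a remote mount that we do not control would no longer trigger inode alerts.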
Updated by nicksinger 5 days ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by openqa_review 4 days ago
- Due date set to 2025-03-01
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ybonatakis 1 day ago · Edited
2025-02-17 06:46:50 localhost 6741ad8f6150485aa47c6dff23eaf35dfbb7e5246e30f56468e3b8f9089827cc 1
2025-02-17 05:51:50 netboot srv-tftpboot-mnt-openqa.mount 1
http://monitor.qa.suse.de/goto/9w9845cNR?orgId=1 Is this related to the ticket?
Updated by nicksinger 1 day ago
ybonatakis wrote in #note-9:
2025-02-17 06:46:50 localhost 6741ad8f6150485aa47c6dff23eaf35dfbb7e5246e30f56468e3b8f9089827cc 1
2025-02-17 05:51:50 netboot srv-tftpboot-mnt-openqa.mount 1
http://monitor.qa.suse.de/goto/9w9845cNR?orgId=1 is this related to the ticket?
Hard to say without any logs and with just a plain copy+paste from our dashboards. But it doesn't look related; it looks more like network issues:
netboot:/etc/telegraf # journalctl --since "3 days ago" -u srv-tftpboot-mnt-openqa.mount
Feb 16 03:30:29 netboot systemd[1]: Mounting /srv/tftpboot/mnt/openqa...
Feb 16 03:31:59 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Mounting timed out. Terminating.
Feb 16 03:31:59 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Mount process exited, code=killed, status=15/TERM
Feb 16 03:31:59 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Failed with result 'timeout'.
Feb 16 03:31:59 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Unit process 1420 (mount.nfs4) remains running after unit stopped.
Feb 16 03:31:59 netboot systemd[1]: Failed to mount /srv/tftpboot/mnt/openqa.
Updated by nicksinger 1 day ago
- Status changed from In Progress to Feedback
- Priority changed from High to Normal
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1376
Also, I don't see any bigger problems with this machine, and there were no related alerts in the past days. I think we can handle this with normal prio.