action #176604

open

[alert] "Inode utilization Salt netboot /srv/tftpboot/mnt/dist" alert was flaky on 25-02-04 for multiple hours size:s

Added by mkittler 13 days ago. Updated 1 day ago.

Status: Feedback
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2025-02-05
Due date: 2025-03-01 (Due in 11 days)
% Done: 0%
Estimated time:
Description

Observation

The alert triggered multiple alert e-mails pointing to https://monitor.qa.suse.de/alerting/grafana/d74e764d-6097-4d14-b77c-76c8d1da6ff0/view?orgId=1. The state history can be viewed at https://monitor.qa.suse.de/alerting/list?search=namespace:%22Salt%22%20group:%22inodes%22 and shows that the alert fired and resolved 4 times. It was also pending 3 times before that.

Acceptance criteria

  • AC1: No more alerts about inode utilization

Suggestions

  • Check whether the alert thresholds are reasonable and possibly adjust them.
  • Check whether there were any actual issues with tests that ran around that time (see further details).
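When reviewing the thresholds it can help to check the current inode utilization of a mount point by hand. The following is a minimal sketch, assuming a Unix host with Python available; the function names and the threshold of 90 percent are illustrative assumptions and are not taken from the actual alert definition.

```python
import os


def utilization_percent(total_inodes: int, free_inodes: int) -> float:
    """Return the share of used inodes in percent."""
    if total_inodes == 0:
        # Some filesystems report no inode counts at all.
        return 0.0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def inode_utilization(path: str) -> float:
    """Return the inode utilization of the filesystem containing path."""
    st = os.statvfs(path)
    return utilization_percent(st.f_files, st.f_ffree)


def exceeds_threshold(path: str, threshold_percent: float = 90.0) -> bool:
    """Check a path against a hypothetical alert threshold (assumed: 90 %)."""
    return inode_utilization(path) >= threshold_percent


if __name__ == "__main__":
    mountpoint = "/srv/tftpboot/mnt/dist"
    if os.path.ismount(mountpoint):
        print(f"{mountpoint}: {inode_utilization(mountpoint):.1f} % inodes used")
```

Comparing this value over time against the configured threshold shows whether the filesystem is hovering close to the limit, which would explain an alert that repeatedly fires and resolves.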

Further details

Incomplete jobs around that time:

openqa=> select id, result, reason from jobs where t_started >= '2025-02-04 18:00:00' and t_finished <= '2025-02-05 00:00:00' and result in ('incomplete') order by t_started;
    id    |   result   | reason
----------+------------+--------
 16655323 | incomplete | died: terminated prematurely, see log output for details
 16655373 | incomplete | api failure: 400 response: Error in tempfile() using template /var/lib/openqa/testresults/16655/16655373-sle-15-SP6-Server-DVD-SAP-Incidents-x86_64-Build:36438:yast2-sap-ha-sles4sap_ensa_supportserver@64bit-4gbram/XXXXXXXXXX: Could not create temp file /var/lib/openqa/testresults/16655/16655373-sle-…
 16650790 | incomplete | api failure: 400 response: Error in tempfile() using template /var/lib/openqa/testresults/16650/16650790-sle-15-SP3-GCE-SAP-BYOS-Incidents-saptune-x86_64-Build:37325:rpmlint-sles4sap_gnome_saptune_solutions@gce_n1_highmem_8/XXXXXXXXXX: Could not create temp file /var/lib/openqa/testresults/16650/166…
 16650795 | incomplete | api failure: 400 response: Error in tempfile() using template /var/lib/openqa/testresults/16650/16650795-sle-15-SP4-Azure-SAP-PAYG-Incidents-x86_64-Build:37325:rpmlint-SAPHanaSR-ScaleUp-PerfOpt-spn@az_Standard_E4s_v3/XXXXXXXXXX: Could not create temp file /var/lib/openqa/testresults/16650/16650795-s…
 16655545 | incomplete | api failure: 400 response: Error in tempfile() using template /var/lib/openqa/testresults/16655/16655545-sle-15-SP3-BCI-Updates-x86_64-Build2.35_sles15-ltss-image-bci-base_15.3-ltss_on_SLES_15-SP5_podman@64bit/XXXXXXXXXX: Could not create temp file /var/lib/openqa/testresults/16655/16655545-sle-15-S…
 16655725 | incomplete | api failure: 400 response: Error in tempfile() using template /var/lib/openqa/testresults/16655/16655725-sle-15-SP6-BCI-Updates-x86_64-Build5.3_dotnet-9.0-dotnet-sdk_9.0_on_SLES_15-SP4_docker@64bit/XXXXXXXXXX: Could not create temp file /var/lib/openqa/testresults/16655/16655725-sle-15-SP6-BCI-Updat…
 16658339 | incomplete | asset failure: Failed to download SLE-15-SP4-x86_64-Build20250204-1-qam-sle_sap-gnome.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-x86_64-Build20250204-1-qam-sle_sap-gnome.qcow2
 16660457 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16660458 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16658857 | incomplete | asset failure: Failed to download sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16659285 | incomplete | asset failure: Failed to download sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16659376 | incomplete | asset failure: Failed to download sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16659503 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16660024 | incomplete | asset failure: Failed to download autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20250204-1-Server-DVD-Updates-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20250204-1-Server-DVD-Updates-64bit.qcow2
 16660516 | incomplete | asset failure: Failed to download autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20250204-1-Server-DVD-Updates-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20250204-1-Server-DVD-Updates-64bit.qcow2
 16659985 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16660472 | incomplete | asset failure: Failed to download sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16660482 | incomplete | asset failure: Failed to download sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16660467 | incomplete | asset failure: Failed to download sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16660584 | incomplete | asset failure: Failed to download sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16660585 | incomplete | asset failure: Failed to download sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16660583 | incomplete | asset failure: Failed to download sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16660596 | incomplete | asset failure: Failed to download sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16660597 | incomplete | asset failure: Failed to download sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP3-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16660595 | incomplete | asset failure: Failed to download sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP5-x86_64-min-add-qsec-cc-20250204-1@64bit.qcow2
 16660181 | incomplete | asset failure: Failed to download sle-15-SP6-x86_64-min-add-qsec-20250204-1@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP6-x86_64-min-add-qsec-20250204-1@64bit.qcow2
 16660731 | incomplete | asset failure: Failed to download SLES-Packages-16.0-x86_64-Build4.2.iso to /var/lib/openqa/cache/openqa.suse.de/SLES-Packages-16.0-x86_64-Build4.2.iso
 16660905 | incomplete | asset failure: Failed to download SLES-16.0-x86_64-mru-install-minimal-with-addons-Build4.2-agama-installer-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-16.0-x86_64-mru-install-minimal-with-addons-Build4.2-agama-installer-64bit.qcow2
(28 rows)
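
To see which kinds of failures dominate a result like the one above, the reasons can be grouped by the text before the first colon. This is only a sketch: the helper names are made up for illustration, and the sample list is a hand-picked, abbreviated subset of the rows above rather than the full query result.

```python
from collections import Counter


def reason_category(reason: str) -> str:
    """Return the text before the first ':' of a job reason, e.g. 'asset failure'."""
    return reason.split(":", 1)[0].strip()


def summarize(reasons):
    """Count how often each reason category occurs."""
    return Counter(reason_category(r) for r in reasons)


# Abbreviated sample of reasons copied from the query output (not all rows).
sample = [
    "died: terminated prematurely, see log output for details",
    "api failure: 400 response: Error in tempfile() using template ...",
    "asset failure: Failed to download ...",
    "backend died: QEMU exited unexpectedly, see log for details",
    "backend died: QEMU exited unexpectedly, see log for details",
]

if __name__ == "__main__":
    for category, count in summarize(sample).most_common():
        print(f"{category}: {count}")
```

Applied to the full result, such a grouping would make it easier to judge whether the incompletes share a common cause.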


Actions #1

Updated by okurz 13 days ago

  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #2

Updated by livdywan 13 days ago

  • Subject changed from [alert] "Inode utilization Salt netboot /srv/tftpboot/mnt/dist" alert was flaky on 25-02-04 for multiple hours to [alert] "Inode utilization Salt netboot /srv/tftpboot/mnt/dist" alert was flaky on 25-02-04 for multiple hours size:s
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by livdywan 6 days ago

We looked at it briefly but it wasn't assigned in the end. Let's make that today 🤞🏼

Updated by gpathak 5 days ago

I did some investigation and found that there are some very old files, dating back more than a decade. Maybe we can delete them if they are no longer required?



These are the two folders that contain some fairly large files, and /srv/tftpboot/mnt/dist/RH.epam includes some files that are 0 bytes in size.

Actions #6

Updated by okurz 5 days ago

This is just a mountpoint showing the content of dist.suse.de, which we are merely users of. I don't know what you discussed during the refinement about this, but the suggestions don't make sense to me. Seems like nobody read the subject properly. We already have other cases where we excluded mount points using certain filesystems from alerting, e.g. "nfs". We should likely do the same here.
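
The idea of excluding mount points by filesystem type can be sketched as follows. This is not the actual alert or telegraf configuration; the set of excluded types, the function names and the sample mount table (including its device names and filesystem types) are assumptions for illustration only.

```python
# Assumption: network filesystems we only consume should not be alerted on.
EXCLUDED_FSTYPES = {"nfs", "nfs4"}


def parse_mounts(text):
    """Parse /proc/mounts-style text into (device, mountpoint, fstype) tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[0], fields[1], fields[2]))
    return mounts


def alertable_mountpoints(text, excluded=EXCLUDED_FSTYPES):
    """Return the mount points whose filesystem type is not excluded."""
    return [mp for _dev, mp, fstype in parse_mounts(text) if fstype not in excluded]


# Hypothetical mount table; the real devices and types on the host may differ.
sample = """\
/dev/vda1 / ext4 rw,relatime 0 0
example-server:/export /srv/tftpboot/mnt/dist nfs4 ro,relatime 0 0
"""

if __name__ == "__main__":
    print(alertable_mountpoints(sample))
```

With such a filter only locally managed filesystems remain subject to the inode alert, while remote exports that we cannot clean up ourselves are ignored.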

Actions #7

Updated by nicksinger 5 days ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #8

Updated by openqa_review 4 days ago

  • Due date set to 2025-03-01

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by ybonatakis 1 day ago · Edited

2025-02-17 06:46:50  localhost  6741ad8f6150485aa47c6dff23eaf35dfbb7e5246e30f56468e3b8f9089827cc  1
2025-02-17 05:51:50  netboot  srv-tftpboot-mnt-openqa.mount  1

http://monitor.qa.suse.de/goto/9w9845cNR?orgId=1 is this related to the ticket?

Actions #10

Updated by nicksinger 1 day ago

ybonatakis wrote in #note-9:

2025-02-17 06:46:50  localhost  6741ad8f6150485aa47c6dff23eaf35dfbb7e5246e30f56468e3b8f9089827cc  1
2025-02-17 05:51:50  netboot  srv-tftpboot-mnt-openqa.mount  1

http://monitor.qa.suse.de/goto/9w9845cNR?orgId=1 is this related to the ticket?

Hard to say without any logs and with just a plain copy+paste from our dashboards. But it doesn't look related; it looks more like network issues:

netboot:/etc/telegraf # journalctl --since "3 days ago" -u srv-tftpboot-mnt-openqa.mount
Feb 16 03:30:29 netboot systemd[1]: Mounting /srv/tftpboot/mnt/openqa...
Feb 16 03:31:59 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Mounting timed out. Terminating.
Feb 16 03:31:59 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Mount process exited, code=killed, status=15/TERM
Feb 16 03:31:59 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Failed with result 'timeout'.
Feb 16 03:31:59 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Unit process 1420 (mount.nfs4) remains running after unit stopped.
Feb 16 03:31:59 netboot systemd[1]: Failed to mount /srv/tftpboot/mnt/openqa.
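
To check a longer journal for further occurrences of such timeouts, the output can be scanned with a small script. This is a rough sketch that assumes the journalctl line format shown above; the regular expression and function name are illustrative.

```python
import re

# Matches lines like:
# "Feb 16 03:31:59 netboot systemd[1]: <unit>.mount: Mounting timed out. Terminating."
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"systemd\[1\]: (?P<unit>\S+\.mount): Mounting timed out\."
)


def mount_timeouts(journal_text):
    """Return (timestamp, host, unit) for every mount timeout in the journal text."""
    hits = []
    for line in journal_text.splitlines():
        m = TIMEOUT_RE.match(line)
        if m:
            hits.append((m.group("ts"), m.group("host"), m.group("unit")))
    return hits


# Two lines copied from the journal output above.
sample = """\
Feb 16 03:30:29 netboot systemd[1]: Mounting /srv/tftpboot/mnt/openqa...
Feb 16 03:31:59 netboot systemd[1]: srv-tftpboot-mnt-openqa.mount: Mounting timed out. Terminating.
"""

if __name__ == "__main__":
    for ts, host, unit in mount_timeouts(sample):
        print(f"{ts} {host} {unit}")
```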
Actions #11

Updated by nicksinger 1 day ago

  • Status changed from In Progress to Feedback
  • Priority changed from High to Normal

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1376

Also, I don't see bigger problems with this machine and there were no related alerts in the past days. I think we can handle this with normal prio.
