action #95299 (closed)
Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry size:M
Description
Observation
I don't know why these ppc64le jobs timed out; the cache service, maybe?
Should the regex be more specific, or can it be generic for any timeout?
https://openqa.suse.de/tests/6393939
https://openqa.suse.de/tests/6393941
https://openqa.suse.de/tests/6393945
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label,
call openqa-query-for-job-label poo#95299
or query the database directly:
ssh osd "sudo -u geekotest psql --command=\"select jobs.id,result_dir,t_finished,host,instance from jobs join workers on jobs.assigned_worker_id=workers.id where reason ~ 'timeout: setup exceeded' order by t_finished;\" openqa"
Expected result
A log https://openqa.suse.de/tests/6416847/logfile?filename=autoinst-log.txt shows how it should look:
[2021-07-12T10:36:56.0008 CEST] [info] [pid:64283] Download of SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso processed:
[info] [#106833] Cache size of "/var/lib/openqa/cache" is 44GiB, with limit 50GiB
[info] [#106833] Downloading "SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" from "http://openqa.suse.de/tests/6416847/asset/iso/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso"
[info] [#106833] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" has not changed, updating last use
[2021-07-12T10:36:56.0100 CEST] [info] [pid:64283] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #106839 sent to Cache Service
[2021-07-12T10:37:01.0168 CEST] [info] [pid:64283] Output of rsync:
[info] [#106839] Calling: rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
receiving incremental file list
sent 1,992 bytes received 2,581,652 bytes 1,722,429.33 bytes/sec
total size is 12,978,495,757 speedup is 5,023.33
[2021-07-12T10:37:01.0168 CEST] [info] [pid:64283] Finished to rsync tests
[2021-07-12T10:37:01.0172 CEST] [debug] [pid:70271] +++ worker notes +++
where one can see the output from the rsync call.
Problem
The problem seems to be specific to ppc64le workers, maybe only "malbec" by now, and maybe specific to test syncing: the cacheservice-minion logs do not mention the test rsync request at all.
Suggestions
- Start with calling git grep on the openQA source code for the log messages that we see in the jobs mentioned above (or see comments below), to see which debug messages should be expected before the timeout (see the sketch after this list)
- If you identify that there could be helpful log messages, e.g. to be able to distinguish whether a request was received by the minion service or not, add them
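For example, assuming a local git checkout of https://github.com/os-autoinst/openQA, one could grep for fragments of the messages quoted from the job logs above; the patterns are taken from those logs, not from known source locations:
git grep -n "sent to Cache Service" lib/
git grep -n "Rsync from" lib/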
Updated by okurz over 3 years ago
- Project changed from openQA Tests to openQA Project
- Subject changed from Test timeout auto_review:"Result: timeout":retry to Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry
- Description updated (diff)
- Category set to Regressions/Crashes
- Priority changed from Normal to High
- Target version set to Ready
https://github.com/os-autoinst/scripts/#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger has an example "steps to reproduce" that can be added to the ticket, because if auto-review matches this ticket to jobs, we can also search for jobs that match the ticket and get good statistics.
dzedro wrote:
I don't know why these ppc64le jobs timed out; the cache service, maybe?
Should the regex be more specific, or can it be generic for any timeout?
Even if it worked, I would consider the regex a bit too generic, as that means effectively every job that reproducibly times out would be stuck in a loop, constantly being retriggered.
However, "timeout_exceeded" is a specific result and so far OSD is configured only to trigger auto-review on "incomplete" and "failed" jobs, see https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L69
You could easily extend that with a line like
job_done_hook_timeout_exceeded: env host=openqa.suse.de /opt/os-autoinst-scripts/openqa-label-known-issues-hook
or set as described in
https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini#L76
auto_clone_regex: '^(cache failure|terminated prematurely|timeout):'
but I don't know if that actually works for "timeout_exceeded".
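For illustration, here is a minimal sketch of how both options could look in /etc/openqa/openqa.ini; the [global] and [hooks] section names are assumptions based on the linked example file and the salt state (which writes such hook keys into the ini; the key: value form quoted above is the salt sls syntax):
# /etc/openqa/openqa.ini (hypothetical sketch)
[global]
# extended to also match timeouts; whether this fires for the
# "timeout_exceeded" result is exactly the open question above
auto_clone_regex = ^(cache failure|terminated prematurely|timeout):

[hooks]
# run the known-issues hook for jobs ending with result "timeout_exceeded"
job_done_hook_timeout_exceeded = env host=openqa.suse.de /opt/os-autoinst-scripts/openqa-label-known-issues-hook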
Now to the real issue:
I looked in the database and found:
openqa=> select jobs.id,result_dir,t_finished,host,instance from jobs join workers on jobs.assigned_worker_id=workers.id where reason ~ 'timeout: setup exceeded' order by t_finished;
id | result_dir | t_finished | host | instance
---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+------------+----------
6296933 | 06296933-sle-15-SP1-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:last_good_tests_and_build:70109c8d5237d35ada19cb80202f02e8dde380ae+20210615-1@s390x-kvm-sle12 | 2021-06-20 03:38:59 | grenache-1 | 14
6296810 | 06296810-sle-15-Server-DVD-Updates-s390x-engines_and_tools:investigate:last_good_tests_and_build:1f2c0d54fa6bde064609447df532c9eada86d16d+20210619-1@s390x-kvm-sle12 | 2021-06-20 03:39:56 | grenache-1 | 45
6296934 | 06296934-sle-15-SP2-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:retry@s390x-kvm-sle12 | 2021-06-20 03:39:56 | grenache-1 | 15
6296932 | 06296932-sle-15-SP1-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:last_good_build:20210615-1@s390x-kvm-sle12 | 2021-06-20 03:41:00 | grenache-1 | 49
6296938 | 06296938-sle-12-SP4-Server-DVD-Updates-s390x-engines_and_tools:investigate:retry@s390x-kvm-sle12 | 2021-06-20 03:41:00 | grenache-1 | 34
6296930 | 06296930-sle-15-SP1-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:retry@s390x-kvm-sle12 | 2021-06-20 03:41:00 | grenache-1 | 47
6296931 | 06296931-sle-15-SP1-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:last_good_tests:70109c8d5237d35ada19cb80202f02e8dde380ae@s390x-kvm-sle12 | 2021-06-20 03:41:01 | grenache-1 | 48
6296935 | 06296935-sle-15-SP2-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:last_good_tests:f88d0c029e1fe4c2ecaab5f75149bd21713a5eb5@s390x-kvm-sle12 | 2021-06-20 03:47:53 | grenache-1 | 35
6296936 | 06296936-sle-15-SP2-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:last_good_build:20210615-1@s390x-kvm-sle12 | 2021-06-20 03:49:20 | grenache-1 | 31
6296937 | 06296937-sle-15-SP2-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:last_good_tests_and_build:f88d0c029e1fe4c2ecaab5f75149bd21713a5eb5+20210615-1@s390x-kvm-sle12 | 2021-06-20 03:50:51 | grenache-1 | 36
6295749 | 06295749-sle-15-SP3-Server-DVD-Incidents-Kernel-KOTD-s390x-Build5.3.18-254.1.g0483fe0-ltp_syscalls_ipc@s390x-kvm-sle12 | 2021-06-20 03:52:24 | grenache-1 | 13
6295750 | 06295750-sle-15-SP3-Server-DVD-Incidents-Kernel-KOTD-s390x-Build5.3.18-254.1.g0483fe0-ltp_syscalls_debug_pagealloc@s390x-kvm-sle12 | 2021-06-20 03:53:29 | grenache-1 | 44
6295752 | 06295752-sle-15-SP3-Server-DVD-Incidents-Kernel-KOTD-s390x-Build5.3.18-254.1.g0483fe0-ltp_sched@s390x-kvm-sle12 | 2021-06-20 03:55:44 | grenache-1 | 32
6295751 | 06295751-sle-15-SP3-Server-DVD-Incidents-Kernel-KOTD-s390x-Build5.3.18-254.1.g0483fe0-ltp_syscalls@s390x-kvm-sle12 | 2021-06-20 03:55:45 | grenache-1 | 37
6295753 | 06295753-sle-15-SP3-Server-DVD-Incidents-Kernel-KOTD-s390x-Build5.3.18-254.1.g0483fe0-ltp_openposix@s390x-kvm-sle12 | 2021-06-20 03:57:15 | grenache-1 | 33
6295754 | 06295754-sle-15-SP3-Server-DVD-Incidents-Kernel-KOTD-s390x-Build5.3.18-254.1.g0483fe0-ltp_dio@s390x-kvm-sle12 | 2021-06-20 04:07:42 | grenache-1 | 12
6295755 | 06295755-sle-15-SP3-Server-DVD-Incidents-Kernel-KOTD-s390x-Build5.3.18-254.1.g0483fe0-kernel-live-patching@s390x-kvm-sle12 | 2021-06-20 04:14:29 | grenache-1 | 46
6269112 | 06269112-sle-15-SP3-Server-DVD-SAP-Incidents-x86_64-qam-sles4sap_online_dvd_gnome_hana_nvdimm:investigate:last_good_build::19935:habootstrap-formula@64bit-ipmi-nvdimm | 2021-06-20 04:25:53 | grenache-1 | 11
6296940 | 06296940-sle-15-SP3-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:retry@s390x-kvm-sle12 | 2021-06-20 04:39:27 | grenache-1 | 14
6296941 | 06296941-sle-15-SP3-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:last_good_tests:f88d0c029e1fe4c2ecaab5f75149bd21713a5eb5@s390x-kvm-sle12 | 2021-06-20 04:40:55 | grenache-1 | 15
6296942 | 06296942-sle-15-SP3-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:last_good_build:20210615-1@s390x-kvm-sle12 | 2021-06-20 04:40:55 | grenache-1 | 45
6296943 | 06296943-sle-15-SP3-Server-DVD-Updates-s390x-sle_image_on_sle_host:investigate:last_good_tests_and_build:f88d0c029e1fe4c2ecaab5f75149bd21713a5eb5+20210615-1@s390x-kvm-sle12 | 2021-06-20 04:42:26 | grenache-1 | 34
6295811 | 06295811-sle-15-SP3-Server-DVD-SAP-Incidents-x86_64-qam-sles4sap_online_dvd_gnome_hana_nvdimm:investigate:last_good_tests_and_build:abc2a9e0439943c86ce1a1a6bc8777fda2bf6804+:19935:habootstrap-formula@64bit-ipmi-nvdimm | 2021-06-20 05:26:57 | grenache-1 | 11
6296143 | 06296143-sle-15-SP3-Server-DVD-SAP-Incidents-x86_64-qam-sles4sap_online_dvd_gnome_hana_nvdimm:investigate:retry@64bit-ipmi-nvdimm | 2021-06-20 06:28:23 | grenache-1 | 11
6264554 | 06264554-sle-15-Server-DVD-SAP-Incidents-x86_64-qam-sles4sap_online_dvd_gnome_hana_nvdimm:investigate:retry@64bit-ipmi-nvdimm | 2021-06-20 07:29:50 | grenache-1 | 11
6289148 | 06289148-sle-15-SP1-Server-DVD-SAP-Incidents-x86_64-qam-sles4sap_online_dvd_gnome_hana_nvdimm:investigate:last_good_build::19983:xterm@64bit-ipmi-nvdimm | 2021-06-20 08:31:19 | grenache-1 | 11
6289149 | 06289149-sle-15-SP1-Server-DVD-SAP-Incidents-x86_64-qam-sles4sap_online_dvd_gnome_hana_nvdimm:investigate:last_good_tests_and_build:8fe545bc1a5752e32689ea7a594c57503b80345c+:19983:xterm@64bit-ipmi-nvdimm | 2021-06-20 09:32:49 | grenache-1 | 11
6289736 | 06289736-sle-15-SP1-Server-DVD-SAP-Incidents-x86_64-qam-sles4sap_online_dvd_gnome_hana_nvdimm:investigate:last_good_build::19983:xterm@64bit-ipmi-nvdimm | 2021-06-20 10:34:20 | grenache-1 | 11
6289737 | 06289737-sle-15-SP1-Server-DVD-SAP-Incidents-x86_64-qam-sles4sap_online_dvd_gnome_hana_nvdimm:investigate:last_good_tests_and_build:8fe545bc1a5752e32689ea7a594c57503b80345c+:19983:xterm@64bit-ipmi-nvdimm | 2021-06-20 11:35:50 | grenache-1 | 11
6297272 | 06297272-sle-12-SP4-Server-DVD-Updates-s390x-Build20210621-1-qam-minimal+base@s390x-kvm-sle12 | 2021-06-21 00:17:47 | grenache-1 | 46
6297415 | 06297415-sle-12-SP5-Server-DVD-Updates-s390x-Build20210621-1-qam-minimal+base@s390x-kvm-sle15 | 2021-06-21 00:18:05 | grenache-1 | 33
6297416 | 06297416-sle-12-SP5-Server-DVD-Updates-s390x-Build20210621-1-qam-gnome@s390x-kvm-sle15 | 2021-06-21 00:18:05 | grenache-1 | 44
6297417 | 06297417-sle-12-SP5-Server-DVD-Updates-s390x-Build20210621-1-mru-install-minimal-with-addons@s390x-kvm-sle15 | 2021-06-21 00:18:05 | grenache-1 | 37
6297564 | 06297564-sle-15-Server-DVD-Updates-s390x-Build20210621-1-mru-install-minimal-with-addons@s390x-kvm-sle12 | 2021-06-21 00:18:21 | grenache-1 | 34
6297717 | 06297717-sle-15-SP1-Server-DVD-Updates-s390x-Build20210621-1-qam-minimal+base@s390x-kvm-sle12 | 2021-06-21 00:18:39 | grenache-1 | 31
6297719 | 06297719-sle-15-SP1-Server-DVD-Updates-s390x-Build20210621-1-mru-install-minimal-with-addons@s390x-kvm-sle12 | 2021-06-21 00:18:39 | grenache-1 | 45
6297886 | 06297886-sle-15-SP2-Server-DVD-Updates-s390x-Build20210621-1-qam-gnome@s390x-kvm-sle12 | 2021-06-21 00:19:00 | grenache-1 | 47
6297885 | 06297885-sle-15-SP2-Server-DVD-Updates-s390x-Build20210621-1-qam-minimal+base@s390x-kvm-sle12 | 2021-06-21 00:19:01 | grenache-1 | 49
6298045 | 06298045-sle-15-SP3-Server-DVD-Updates-s390x-Build20210621-1-qam-gnome@s390x-kvm-sle12 | 2021-06-21 00:19:22 | grenache-1 | 32
6298858 | 06298858-sle-12-SP4-Server-DVD-Incidents-Kernel-KOTD-s390x-Build4.12.14-97.1.g759e1c1-ltp_cve_git@s390x-kvm-sle12 | 2021-06-21 01:19:04 | grenache-1 | 35
6298046 | 06298046-sle-15-SP3-Server-DVD-Updates-s390x-Build20210621-1-mru-install-minimal-with-addons@s390x-kvm-sle12 | 2021-06-21 01:19:04 | grenache-1 | 15
6298988 | 06298988-sle-15-SP1-Server-DVD-Updates-s390x-Build20210621-1-qam-gnome@s390x-kvm-sle12 | 2021-06-21 01:19:05 | grenache-1 | 45
6298678 | 06298678-sle-15-Server-DVD-Incidents-Kernel-KOTD-s390x-Build4.12.14-132.1.g1edcf88-ltp_cve_git@s390x-kvm-sle12 | 2021-06-21 01:19:05 | grenache-1 | 46
6298679 | 06298679-sle-15-Server-DVD-Incidents-Kernel-KOTD-s390x-Build4.12.14-132.1.g1edcf88-install_ltp+sle+Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-sle12 | 2021-06-21 01:19:06 | grenache-1 | 14
6298986 | 06298986-sle-15-Server-DVD-Updates-s390x-Build20210621-1-qam-gnome@s390x-kvm-sle12 | 2021-06-21 01:19:06 | grenache-1 | 36
6298987 | 06298987-sle-15-Server-DVD-Updates-s390x-Build20210621-1-qam-minimal+base@s390x-kvm-sle12 | 2021-06-21 01:19:06 | grenache-1 | 33
6298614 | 06298614-sle-15-SP1-Server-DVD-Incidents-Kernel-KOTD-s390x-Build4.12.14-51.1.g5bd5080-install_ltp+sle+Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-sle12 | 2021-06-21 01:19:07 | grenache-1 | 37
6298613 | 06298613-sle-15-SP1-Server-DVD-Incidents-Kernel-KOTD-s390x-Build4.12.14-51.1.g5bd5080-ltp_cve_git@s390x-kvm-sle12 | 2021-06-21 01:19:07 | grenache-1 | 34
6298859 | 06298859-sle-12-SP4-Server-DVD-Incidents-Kernel-KOTD-s390x-Build4.12.14-97.1.g759e1c1-install_ltp+sle+Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-sle12 | 2021-06-21 01:19:08 | grenache-1 | 12
6298974 | 06298974-sle-12-SP4-Server-DVD-Updates-s390x-Build20210621-1-mru-install-minimal-with-addons@s390x-kvm-sle12 | 2021-06-21 01:19:09 | grenache-1 | 31
6298989 | 06298989-sle-15-SP2-Server-DVD-Updates-s390x-Build20210621-1-mru-install-minimal-with-addons@s390x-kvm-sle12 | 2021-06-21 01:20:36 | grenache-1 | 49
6299002 | 06299002-sle-15-SP3-Server-DVD-Updates-s390x-Build20210621-1-qam-minimal+base@s390x-kvm-sle12 | 2021-06-21 01:20:37 | grenache-1 | 47
6299013 | 06299013-sle-15-SP3-Migration-from-SLE12-SPx-s390x-Build187.1-offline_sles12sp3_ltss_pscc_asmm-lgm_all_full_lock@s390x-kvm-sle12 | 2021-06-21 01:47:31 | grenache-1 | 48
6299014 | 06299014-sle-15-SP3-Migration-from-SLE12-SPx-s390x-Build187.1-offline_sles12sp5_media_base_full@s390x-kvm-sle12 | 2021-06-21 01:54:33 | grenache-1 | 13
6299023 | 06299023-sle-12-SP4-Server-DVD-Updates-s390x-Build20210621-1-qam-gnome@s390x-kvm-sle12 | 2021-06-21 02:19:23 | grenache-1 | 15
6390578 | 06390578-sle-15-SP3-Server-DVD-Incidents-Install-ppc64le-BuildMR:244267:nodejs14-qam-incidentinstall@ppc64le | 2021-07-07 13:12:54 | malbec | 2
6391210 | 06391210-sle-15-SP2-Server-DVD-Incidents-Install-ppc64le-Build:20300:nodejs12-qam-incidentinstall@ppc64le | 2021-07-07 13:26:14 | malbec | 4
6391605 | 06391605-sle-15-SP1-Server-DVD-SAP-Incidents-Install-ppc64le-Build:20294:dtb-aarch64-qam-incidentinstall-sap@ppc64le-2g | 2021-07-07 13:27:18 | malbec | 3
6392106 | 06392106-sle-15-SP3-Server-DVD-Incidents-Install-ppc64le-BuildMR:244278:dovecot23-qam-incidentinstall@ppc64le | 2021-07-07 14:14:44 | malbec | 2
6392642 | 06392642-sle-12-SP4-Server-DVD-SAP-Incidents-Install-ppc64le-Build:20298:linuxptp-qam-incidentinstall-sap@ppc64le-2g | 2021-07-07 14:27:00 | malbec | 1
6393089 | 06393089-sle-15-SP3-Server-DVD-Incidents-Install-ppc64le-Build:20116:python-cachetools-qam-incidentinstall@ppc64le | 2021-07-07 17:13:38 | malbec | 2
6393125 | 06393125-sle-15-SP3-Server-DVD-Incidents-Minimal-ppc64le-Build:20293:dtb-aarch64-qam-minimal-full@ppc64le | 2021-07-07 17:13:44 | malbec | 1
6393261 | 06393261-sle-15-SP3-Server-DVD-Incidents-Kernel-Base-ppc64le-Build:20293:dtb-aarch64-ltp_syscalls_base@ppc64le-virtio | 2021-07-07 17:14:06 | malbec | 4
6393207 | 06393207-sle-15-SP3-Server-DVD-Incidents-Kernel-ppc64le-Build:20293:dtb-aarch64-ltp_fs_bind@ppc64le-virtio | 2021-07-07 17:31:44 | malbec | 3
6394011 | 06394011-sle-15-SP1-Server-DVD-Incidents-Minimal-ppc64le-Build:20294:dtb-aarch64-qam-minimal@ppc64le | 2021-07-07 19:15:10 | malbec | 4
6393945 | 06393945-sle-15-SP1-Server-DVD-Incidents-Kernel-ppc64le-Build:20294:dtb-aarch64-ltp_containers@ppc64le-virtio | 2021-07-07 19:36:50 | malbec | 3
6393941 | 06393941-sle-15-SP1-Server-DVD-Incidents-Kernel-ppc64le-Build:20294:dtb-aarch64-ltp_can@ppc64le-virtio | 2021-07-07 19:36:51 | malbec | 1
6393939 | 06393939-sle-15-SP1-Server-DVD-Incidents-Kernel-ppc64le-Build:20294:dtb-aarch64-ltp_aiodio_part3@ppc64le-virtio | 2021-07-07 19:36:54 | malbec | 2
6394566 | 06394566-sle-15-SP3-Server-DVD-Incidents-Install-ppc64le-Build:19446:yast2-country-qam-incidentinstall@ppc64le | 2021-07-07 20:15:19 | malbec | 4
6397191 | 06397191-sle-15-SP2-Server-DVD-Incidents-Install-ppc64le-Build:20116:python-cachetools-qam-incidentinstall@ppc64le | 2021-07-08 08:12:14 | malbec | 1
6397381 | 06397381-sle-15-SP2-Server-DVD-Incidents-Install-ppc64le-BuildMR:244309:hwdata-qam-incidentinstall@ppc64le | 2021-07-08 09:13:12 | malbec | 1
6397702 | 06397702-sle-15-SP2-Server-DVD-Incidents-Install-ppc64le-Build:20310:dovecot23-qam-incidentinstall@ppc64le | 2021-07-08 11:12:40 | malbec | 1
6398451 | 06398451-sle-15-Server-DVD-HA-Incidents-Install-ppc64le-Build:20315:crmsh-qam-incidentinstall-ha@ppc64le | 2021-07-08 12:12:56 | malbec | 1
(73 rows)
So previously there were failures on grenache-1 around 2021-06-20/21, now on malbec in the past days, 2021-07-07 and 2021-07-08, but none today.
Following the trail of https://openqa.suse.de/tests/6393939, the log https://openqa.suse.de/tests/6393939/logfile?filename=autoinst-log.txt says
[info] [#51162] Downloading "SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" from "http://openqa.suse.de/tests/6393939/asset/iso/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso"
[info] [#51162] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" has not changed, updating last use
[2021-07-07T20:37:39.0882 CEST] [info] [pid:81625] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #51165 sent to Cache Service
[2021-07-07T21:36:54.0341 CEST] [info] [pid:81625] +++ worker notes +++
[2021-07-07T21:36:54.0341 CEST] [info] [pid:81625] End time: 2021-07-07 19:36:54
[2021-07-07T21:36:54.0341 CEST] [info] [pid:81625] Result: timeout
So the ISO is correctly downloaded in request #51162 and then a sync of the tests is triggered as request #51165.
On malbec, in journalctl -u openqa-worker-cacheservice-minion.service I find
Jul 07 20:37:34 malbec openqa-worker-cacheservice-minion[16175]: [88318] [i] [#51162] Downloading: "SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso"
Jul 07 20:37:34 malbec openqa-worker-cacheservice-minion[16175]: Can't opendir(/var/lib/openqa/cache/lost+found): Permission denied
Jul 07 20:37:34 malbec openqa-worker-cacheservice-minion[16175]: at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 74.
Jul 07 20:37:34 malbec openqa-worker-cacheservice-minion[16175]: [88318] [i] [#51162] Cache size of "/var/lib/openqa/cache" is 46GiB, with limit 50GiB
Jul 07 20:37:34 malbec openqa-worker-cacheservice-minion[16175]: [88318] [i] [#51162] Downloading "SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" from "http://openqa.suse.de/tests/6393939/asset/iso/SLE-15-SP1>
Jul 07 20:37:35 malbec openqa-worker-cacheservice-minion[16175]: [88318] [i] [#51162] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" has not changed, updating >
Jul 07 20:37:35 malbec openqa-worker-cacheservice-minion[16175]: [88318] [i] [#51162] Finished download
Jul 07 20:37:36 malbec openqa-worker-cacheservice-minion[16175]: [88320] [i] [#51163] Downloading: "SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso"
Jul 07 20:37:36 malbec openqa-worker-cacheservice-minion[16175]: Can't opendir(/var/lib/openqa/cache/lost+found): Permission denied
Jul 07 20:37:36 malbec openqa-worker-cacheservice-minion[16175]: at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 74.
Jul 07 20:37:36 malbec openqa-worker-cacheservice-minion[16175]: [88320] [i] [#51163] Cache size of "/var/lib/openqa/cache" is 46GiB, with limit 50GiB
Jul 07 20:37:36 malbec openqa-worker-cacheservice-minion[16175]: [88320] [i] [#51163] Downloading "SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" from "http://openqa.suse.de/tests/6393945/asset/iso/SLE-15-SP1>
Jul 07 20:37:36 malbec openqa-worker-cacheservice-minion[16175]: [88320] [i] [#51163] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" has not changed, updating >
Jul 07 20:37:36 malbec openqa-worker-cacheservice-minion[16175]: [88320] [i] [#51163] Finished download
Jul 07 20:37:37 malbec openqa-worker-cacheservice-minion[16175]: [88321] [i] [#51164] Downloading: "SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso"
Jul 07 20:37:37 malbec openqa-worker-cacheservice-minion[16175]: Can't opendir(/var/lib/openqa/cache/lost+found): Permission denied
Jul 07 20:37:37 malbec openqa-worker-cacheservice-minion[16175]: at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 74.
Jul 07 20:37:37 malbec openqa-worker-cacheservice-minion[16175]: [88321] [i] [#51164] Cache size of "/var/lib/openqa/cache" is 46GiB, with limit 50GiB
Jul 07 20:37:37 malbec openqa-worker-cacheservice-minion[16175]: [88321] [i] [#51164] Downloading "SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" from "http://openqa.suse.de/tests/6393941/asset/iso/SLE-15-SP1>
Jul 07 20:37:37 malbec openqa-worker-cacheservice-minion[16175]: [88321] [i] [#51164] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" has not changed, updating >
Jul 07 20:37:37 malbec openqa-worker-cacheservice-minion[16175]: [88321] [i] [#51164] Finished download
Jul 07 21:15:19 malbec openqa-worker-cacheservice-minion[16175]: [91671] [i] [#51168] Downloading: "SLES-15-SP3-ppc64le-Installtest.qcow2"
Jul 07 21:15:19 malbec openqa-worker-cacheservice-minion[16175]: Can't opendir(/var/lib/openqa/cache/lost+found): Permission denied
Jul 07 21:15:19 malbec openqa-worker-cacheservice-minion[16175]: at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 74.
Jul 07 21:15:19 malbec openqa-worker-cacheservice-minion[16175]: [91671] [i] [#51168] Cache size of "/var/lib/openqa/cache" is 46GiB, with limit 50GiB
Jul 07 21:15:19 malbec openqa-worker-cacheservice-minion[16175]: [91671] [i] [#51168] Downloading "SLES-15-SP3-ppc64le-Installtest.qcow2" from "http://openqa.suse.de/tests/6394566/asset/hdd/SLES-15-SP3-ppc64>
Jul 07 21:15:19 malbec openqa-worker-cacheservice-minion[16175]: [91671] [i] [#51168] Content of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-ppc64le-Installtest.qcow2" has not changed, updating last use
Jul 07 21:15:19 malbec openqa-worker-cacheservice-minion[16175]: [91671] [i] [#51168] Finished download
Jul 07 21:15:29 malbec openqa-worker-cacheservice-minion[16175]: [91686] [i] [#51169] Downloading: "SLE-15-SP3-Full-ppc64le-GM-Media1.iso"
Jul 07 21:15:29 malbec openqa-worker-cacheservice-minion[16175]: Can't opendir(/var/lib/openqa/cache/lost+found): Permission denied
Jul 07 21:15:29 malbec openqa-worker-cacheservice-minion[16175]: at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 74.
Jul 07 21:15:29 malbec openqa-worker-cacheservice-minion[16175]: [91686] [i] [#51169] Cache size of "/var/lib/openqa/cache" is 46GiB, with limit 50GiB
Jul 07 21:15:29 malbec openqa-worker-cacheservice-minion[16175]: [91686] [i] [#51169] Downloading "SLE-15-SP3-Full-ppc64le-GM-Media1.iso" from "http://openqa.suse.de/tests/6394566/asset/iso/SLE-15-SP3-Full-p>
Jul 07 21:15:29 malbec openqa-worker-cacheservice-minion[16175]: [91686] [i] [#51169] Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-Full-ppc64le-GM-Media1.iso" has not changed, updating last use
Jul 07 21:15:29 malbec openqa-worker-cacheservice-minion[16175]: [91686] [i] [#51169] Finished download
So the request for the ISO appears as #51162, and later requests appear as well, but #51165 does not appear at all. I suggest we look into how a request can get lost on the way from the cache service to the minion, or add corresponding logging for that case.
Updated by okurz over 3 years ago
- Subject changed from Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry to Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 3 years ago
All of these tests timed out 6 days ago waiting for tests to be synchronized ([2021-07-07T20:37:39.0882 CEST] [info] [pid:81625] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #51165 sent to Cache Service).
There's a Minion job for caching tests (http://localhost:9530/minion/jobs?id=51130 on malbec.arch.suse.de) which was started 6 days ago and was stuck for 2 days until it failed with result 'Job terminated unexpectedly (exit code: 0, signal: 15)'. There are also more of these failures from several months ago. More recent jobs have all succeeded.
So I guess I'll have to find out why these tasks can get stuck. Unfortunately the logs are not very helpful:
martchus@malbec:~> sudo journalctl --no-pager -u openqa-worker-cacheservice-minion | grep -i 51130
Jul 07 14:13:14 malbec openqa-worker-cacheservice-minion[16175]: [54325] [i] [#51130] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Jul 07 14:13:14 malbec openqa-worker-cacheservice-minion[16175]: [54325] [i] [#51130] Calling: rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Before the service is stopped we see this:
Jul 09 05:19:55 malbec openqa-worker-cacheservice-minion[16175]: [124176] [i] [#51254] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Jul 09 05:19:55 malbec openqa-worker-cacheservice-minion[16175]: [124176] [i] [#51254] Calling: rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Jul 09 05:20:05 malbec openqa-worker-cacheservice-minion[16175]: [124176] [i] [#51254] Finished sync: 0
Jul 09 07:16:12 malbec systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Jul 09 07:16:12 malbec openqa-worker-cacheservice-minion[16175]: rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(642) [generator=3.1.3]
Jul 09 07:16:12 malbec openqa-worker-cacheservice-minion[16175]: rsync error: received SIGUSR1 (code 19) at main.c(1455) [receiver=3.1.3]
Jul 09 07:16:18 malbec openqa-worker-cacheservice-minion[16175]: [16175] [i] Worker 16175 stopped
Jul 09 07:16:18 malbec systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=144/n/a
Jul 09 07:16:18 malbec systemd[1]: Stopped OpenQA Worker Cache Service Minion.
So the request for the ISO appears as #51162, and later requests appear as well, but #51165 does not appear at all. I suggest we look into how a request can get lost on the way from the cache service to the minion, or add corresponding logging for that case.
Looks like #51165 (and the Minion jobs mentioned in the other openQA jobs) has already been cleaned up from the Minion dashboard. However, I suppose these jobs are not relevant anyway because #51130 was doing the real download. So the other jobs have just been telling the worker that the real download is happening within #51130 and therefore didn't produce any logs. Not sure whether we should add logging for this.
Jul 07 20:37:34 malbec openqa-worker-cacheservice-minion[16175]: Can't opendir(/var/lib/openqa/cache/lost+found): Permission denied
These log lines are likely not causing the problem. However, I've removed the lost+found directory, which was empty (and only readable by root). We don't have such a directory on other workers, so it has likely been created by accident.
Updated by mkittler over 3 years ago
- Status changed from Workable to In Progress
I'm still not sure why this happens, but likely adding rsync's --timeout option helps here. PR: https://github.com/os-autoinst/openQA/pull/4043
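For illustration, the cache service's sync call would then look as follows; the flag placement and the 1800-second value are taken from the minion log quoted further below in this ticket, not from the PR itself:
rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/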
Updated by okurz over 3 years ago
mkittler wrote:
Jul 07 20:37:34 malbec openqa-worker-cacheservice-minion[16175]: Can't opendir(/var/lib/openqa/cache/lost+found): Permission denied
These log lines are likely not causing the problem. However, I've removed the lost+found directory, which was empty (and only readable by root). We don't have such a directory on other workers, so it has likely been created by accident.
The directory lost+found is automatically created on any ext filesystem and is only readable by root, that is correct. We should not try to read it at all. Better if we avoid reading it within our code, e.g. either by not trying to read all directories within the cache dir or, if that is not feasible, by at least explicitly excluding "lost+found" … or by catching the error message and piping it to /dev/null
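For reference, the permission problem is easy to reproduce as the unprivileged cache service user; the user name _openqa-worker below is an assumption (the default on openSUSE worker setups), and the error line is the standard ls output, not copied from malbec:
sudo -u _openqa-worker ls /var/lib/openqa/cache/lost+found
ls: cannot open directory '/var/lib/openqa/cache/lost+found': Permission denied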
Updated by openqa_review over 3 years ago
- Due date set to 2021-07-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 3 years ago
- Status changed from In Progress to Feedback
PR for the lost+found error: https://github.com/os-autoinst/openQA/pull/4048
Note that this error being logged has nothing to do with the problem from the ticket description. It is only a disturbing log message without any impact on the behavior of the cache service (apart from that directory not being taken into account).
The PR for adding an inactivity timeout for the Minion job has been merged and deployed. So far
select jobs.id,result_dir,t_finished,host,instance from jobs join workers on jobs.assigned_worker_id=workers.id where reason ~ 'timeout: setup exceeded' and t_started >= '2021-07-14T11:40' order by t_finished;
returns no results. I'll check again in a few days. (However, judging by the failures on the worker's Minion dashboard, the problem only happened about once a month.)
Updated by okurz over 3 years ago
mkittler wrote:
PR for the lost+found error: https://github.com/os-autoinst/openQA/pull/4048
Note that this error being logged has nothing to do with the problem from the ticket description. It is only a disturbing log message without any impact on the behavior of the cache service (apart from that directory not being taken into account).
agreed.
The PR for adding an inactivity timeout for the Minion job has been merged and deployed. So far
select jobs.id,result_dir,t_finished,host,instance from jobs join workers on jobs.assigned_worker_id=workers.id where reason ~ 'timeout: setup exceeded' and t_started >= '2021-07-14T11:40' order by t_finished;
returns no results. I'll check again in a few days. (However, judging by the failures on the worker's Minion dashboard, the problem only happened about once a month.)
+1
Updated by mkittler over 3 years ago
Now both PRs have been merged. The query still returns no results.
Updated by mkittler over 3 years ago
So far the problem isn't occurring anymore.
Updated by mkittler over 3 years ago
It is still not happening. The plan is to keep the ticket in Feedback for a while because the problem only occurred once a month or so. But we can also just close the ticket (and possibly re-open it if the problem occurs again).
Updated by okurz over 3 years ago
As an alternative, you can consider introducing an alert, because users likely cannot do anything useful when jobs run into the setup timeout.
Updated by mkittler over 3 years ago
Not sure whether it is worth adding an alert. Making it useful, e.g. having it contain the relevant job IDs (and not just the count), might not be that trivial.
Note that a similar problem occurred again:
openqa=> select jobs.id,result_dir,t_finished,host,instance from jobs join workers on jobs.assigned_worker_id=workers.id where reason ~ 'timeout: setup exceeded' and t_started >= '2021-07-14T11:40' order by t_finished;
id | result_dir | t_finished | host | instance
---------+----------------------------------------------------------------------------------------------------------------+---------------------+---------------+----------
6578161 | 06578161-sle-15-SP3-Server-DVD-Updates-x86_64-Build20210727-1-mru-iscsi_client_normal_auth_backstore_lvm@64bit | 2021-07-27 05:40:56 | openqaworker6 | 17
(1 row)
(https://openqa.suse.de/tests/6578161)
The history doesn't go far enough to tell whether there's a failed cache_tests Minion job (or a passed one which took extraordinarily long). That's all the log has to say:
Jul 27 07:42:38 openqaworker6 openqa-worker-cacheservice-minion[31438]: [37967] [i] [#4278] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Jul 27 07:42:38 openqaworker6 openqa-worker-cacheservice-minion[31438]: [37967] [i] [#4278] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
So the timeout parameter is passed as expected.
According to the timestamps of the job's log, the setup already took ~50 minutes before the rsync request:
[2021-07-27T06:40:55.0658 CEST] [info] [pid:30484] +++ setup notes +++
[2021-07-27T06:40:55.0658 CEST] [info] [pid:30484] Running on openqaworker6:17 (Linux 5.3.18-lp152.84-default #1 SMP Tue Jul 20 23:04:11 UTC 2021 (baaeecf) x86_64)
[2021-07-27T06:40:55.0668 CEST] [debug] [pid:30484] Found HDD_1, caching SLES-15-SP3-x86_64-mru-install-desktop-with-addons-Build20210727-1.qcow2
[2021-07-27T06:40:55.0671 CEST] [info] [pid:30484] Downloading SLES-15-SP3-x86_64-mru-install-desktop-with-addons-Build20210727-1.qcow2, request #4149 sent to Cache Service
[2021-07-27T06:51:31.0843 CEST] [info] [pid:30484] Download of SLES-15-SP3-x86_64-mru-install-desktop-with-addons-Build20210727-1.qcow2 processed:
[info] [#4149] Cache size of "/var/lib/openqa/cache" is 48GiB, with limit 50GiB
[info] [#4149] Downloading "SLES-15-SP3-x86_64-mru-install-desktop-with-addons-Build20210727-1.qcow2" from "http://openqa.suse.de/tests/6578161/asset/hdd/SLES-15-SP3-x86_64-mru-install-desktop-with-addons-Build20210727-1.qcow2"
[info] [#4149] Size of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-x86_64-mru-install-desktop-with-addons-Build20210727-1.qcow2" is 2.5GiB, with ETag ""9d050000-5c812f9f306d1""
[info] [#4149] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-x86_64-mru-install-desktop-with-addons-Build20210727-1.qcow2" successful, new cache size is 49GiB
[2021-07-27T06:51:31.0855 CEST] [debug] [pid:30484] Found ISO, caching SLE-15-SP3-Full-x86_64-GM-Media1.iso
[2021-07-27T06:51:31.0858 CEST] [info] [pid:30484] Downloading SLE-15-SP3-Full-x86_64-GM-Media1.iso, request #4187 sent to Cache Service
[2021-07-27T07:37:30.0997 CEST] [info] [pid:30484] Download of SLE-15-SP3-Full-x86_64-GM-Media1.iso processed:
[info] [#4187] Cache size of "/var/lib/openqa/cache" is 50GiB, with limit 50GiB
[info] [#4187] Downloading "SLE-15-SP3-Full-x86_64-GM-Media1.iso" from "http://openqa.suse.de/tests/6578161/asset/iso/SLE-15-SP3-Full-x86_64-GM-Media1.iso"
[info] [#4187] Cache size 49GiB + needed 11GiB exceeds limit of 50GiB, purging least used assets
[info] [#4187] Purging "/var/lib/openqa/cache/openqa.suse.de/SLED-15-SP3-x86_64-GM-gnome.qcow2" because we need space for new assets, reclaiming 2.6GiB
[info] [#4187] Purging "/var/lib/openqa/cache/openqa.suse.de/SLE-12-SP5-SAP-DVD-x86_64-GM-DVD1.iso" because we need space for new assets, reclaiming 3.9GiB
[info] [#4187] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP4-x86_64-mru-install-minimal-with-addons-Build20210726-1-Server-DVD-Updates-64bit.qcow2" because we need space for new assets, reclaiming 1.1GiB
[info] [#4187] Purging "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Full-x86_64-Build16.3-Media1.iso" because we need space for new assets, reclaiming 11GiB
[info] [#4187] Size of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-Full-x86_64-GM-Media1.iso" is 11GiB, with ETag ""2d7b00000-5c2d3a3560f9a""
[info] [#4187] Download of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-Full-x86_64-GM-Media1.iso" successful, new cache size is 42GiB
[2021-07-27T07:37:31.0002 CEST] [info] [pid:30484] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #4278 sent to Cache Service
[2021-07-27T07:40:55.0720 CEST] [info] [pid:30484] +++ worker notes +++
[2021-07-27T07:40:55.0720 CEST] [info] [pid:30484] End time: 2021-07-27 05:40:55
[2021-07-27T07:40:55.0720 CEST] [info] [pid:30484] Result: timeout
[2021-07-27T07:40:55.0726 CEST] [info] [pid:37617] Uploading autoinst-log.txt
There are no timeouts, though. So I guess this worker really just suffered from very slow downloads at this point, and it was not just rsync hanging forever. Not sure whether we can do something about it. (We already restart such jobs anyway via job_done_hook_timeout_exceeded = env host=openqa.suse.de /opt/os-autoinst-scripts/openqa-label-known-issues-hook.)
Updated by okurz over 3 years ago
- Related to coordination #96185: [epic] Multimachine failure rate increased added
Updated by okurz over 3 years ago
As discussed we should do the following:
- Add an alert if there is any non-restarted job exceeding MAX_SETUP_TIME (similar to what we already have for incomplete job monitoring), maybe with a hint in the description about which SQL query to run to find the affected jobs and their logs -> #96254
- Apply a higher MAX_SETUP_TIME to the complete OSD infrastructure, e.g. add it to every worker config entry (see the sketch after this list) -> #96257
- DONE: Add a feature request for "configuration applicable to all jobs for one openQA instance", e.g. add to #65271 with a link back here -> #65271#note-69
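A hypothetical sketch of the second point: both the assumption that MAX_SETUP_TIME can be raised for all jobs on a host via the [global] section of the worker configuration and the concrete 2-hour value are illustrative, not the actual change:
# /etc/openqa/workers.ini (hypothetical sketch)
[global]
# value in seconds; assumes the default setup budget is one hour and doubles it
MAX_SETUP_TIME = 7200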
Updated by okurz over 3 years ago
- Copied to action #96254: Tests timeout with MAX_SETUP_TIME - Add an alert if there is any non-restarted job exceeding max_setup_time added
Updated by okurz over 3 years ago
- Copied to action #96257: Tests timeout with MAX_SETUP_TIME on osd - Apply a higher MAX_SETUP_TIME applicable for the *complete* OSD infrastructure, e.g. add to every worker config entry added
Updated by okurz over 3 years ago
- Due date deleted (2021-07-28)
- Status changed from Feedback to Resolved