action #153880 (closed)
openQA Infrastructure - coordination #168895: [saga][epic][infra] Support SUSE PRG office move while ensuring business continuity
openQA Infrastructure - coordination #168898: [epic][infra] Support SUSE PRG office datacenter "PRG1" move while ensuring business continuity
https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de, likely DNS problems in PRG1
Added by okurz 10 months ago. Updated 10 months ago.
Category: Regressions/Crashes
- Copied from action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
- Status changed from New to In Progress
- Assignee set to nicksinger
- Description updated (diff)
I checked the database but openqaworker17 did not show up there
openqa=> select distinct count(jobs.id) as total,
             sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent,
             host
         from jobs
             left join job_dependencies on (id = child_job_id or id = parent_job_id)
             join workers on jobs.assigned_worker_id = workers.id
         where dependency = 2 and t_finished >= '2023-12-12'
         group by host
         having count(jobs.id) > 50
         order by fail_rate_percent desc;
total | fail_rate_percent | host
-------+---------------------+-------------
751 | 17.0439414114513981 | mania
966 | 14.1821946169772257 | worker36
1069 | 13.3769878391019645 | worker32
966 | 12.9399585921325052 | worker35
1076 | 11.1524163568773234 | worker34
807 | 10.7806691449814126 | worker30
2361 | 10.5463786531130877 | worker-arm1
789 | 10.5196451204055767 | worker-arm2
1076 | 10.1301115241635688 | worker33
1103 | 10.0634632819582956 | worker31
829 | 9.6501809408926417 | worker37
30301 | 7.5542061318108313 | worker38
789 | 7.4778200253485425 | worker40
705 | 7.3758865248226950 | worker29
806 | 6.2034739454094293 | worker39
(15 rows)
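For reference, a query like the one above can be run directly on the openQA host; the following is only a sketch, assuming the default openQA setup where the local "openqa" database is owned by the "geekotest" user (not stated in this ticket):
# open a psql session against the local openQA database (default setup assumed)
sudo -u geekotest psql openqa
# then paste the fail-rate query above at the openqa=> prompt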
sudo salt -C 'G@roles:worker' cmd.run 'host download.suse.de'
reproduces the DNS problem that nicksinger mentioned.
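To narrow down which resolver is failing, a check along these lines could help; this is only a sketch, with the resolver IP taken from the mitigation noted further below and the exact commands being illustrative:
# show the currently configured resolvers and query one resolver directly
sudo salt -C 'G@roles:worker' cmd.run 'cat /etc/resolv.conf; dig +short download.suse.de @10.100.96.2'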
- Subject changed from significant increase in MM-test failure ratio 2024-01-18: https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de to https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de, likely DNS problems in PRG1
- Description updated (diff)
It seems the DNS problem has resolved itself: host download.suse.de works fine again, as verified via salt.
- Priority changed from Immediate to High
ran salt '*.qa.suse.cz' cmd.run "sed -i 's/NETCONFIG_DNS_STATIC_SERVERS.*/NETCONFIG_DNS_STATIC_SERVERS=\"10.100.96.2\"/' /etc/sysconfig/network/config && netconfig update -f && cat /etc/resolv.conf"
to mitigate the urgency.
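A quick way to verify that the mitigation stuck could look like the following; this is only a sketch reusing the same salt target, and the exact checks are illustrative:
# confirm the static resolver entry and that name resolution works again
salt '*.qa.suse.cz' cmd.run 'grep NETCONFIG_DNS_STATIC_SERVERS /etc/sysconfig/network/config && host download.suse.de'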
- Description updated (diff)
- Assignee changed from nicksinger to okurz
I called openqa-advanced-retrigger-jobs to restart failed jobs from the last ~2 hours:
host=openqa.suse.de failed_since="2024-01-18 11:00:00" result="result='failed'" additional_filters=" test not like '%investigate:%'" ./openqa-advanced-retrigger-jobs
It restarted 89 jobs.
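As a cross-check, the remaining failed jobs could be counted over the API, for example with openqa-cli; the filter parameters here are illustrative and not taken from the ticket:
# count latest failed jobs on OSD (parameters illustrative)
openqa-cli api --host https://openqa.suse.de jobs result=failed latest=1 | jq '.jobs | length'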
- Due date set to 2024-02-02
Setting due date based on mean cycle time of SUSE QE Tools
- Tags set to infra, dns, prg1
- Due date deleted (2024-02-02)
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
- Status changed from Blocked to In Progress
- Related to action #138275: Ensure that there is proper ownership and maintainership for qanet.qa.suse.cz added
- Status changed from In Progress to Blocked
- Parent task changed from #111929 to #154042
- Status changed from Blocked to Resolved
- Target version changed from Ready to Tools - Next