action #153880
closed
openQA Infrastructure (public) - coordination #168895: [saga][epic][infra] Support SUSE PRG office move while ensuring business continuity
https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de, likely DNS problems in PRG1
Description
Observation
https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de
I guess we need to take another deep look at the situation: https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=now-2d&to=now shows again a significant increase in the failure ratio.
The job got triggered on w17, which does not have "tap", so the scheduler should not have picked that machine. The job is, however, not multi-machine.
Rollback steps
run salt '*.qa.suse.cz' cmd.run "sed -i 's/NETCONFIG_DNS_STATIC_SERVERS.*/NETCONFIG_DNS_STATIC_SERVERS=""/' /etc/sysconfig/network/config && netconfig update -f && cat /etc/resolv.conf"
and check that the output mentions nameserver 10.100.96.1 as first server again.
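For a quick check across all affected hosts that the rollback took effect, something like the following should do (a sketch, assuming the same salt master and minion naming as in the command above):
salt '*.qa.suse.cz' cmd.run "grep '^nameserver' /etc/resolv.conf | head -n 1"
Every minion should report 10.100.96.1 as its first nameserver again.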
Updated by okurz 11 months ago
- Copied from action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Updated by nicksinger 11 months ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by okurz 11 months ago
I checked the database but openqaworker17 did not show up there:
openqa=> select distinct count(jobs.id) as total, sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent, host from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) join workers on jobs.assigned_worker_id = workers.id where dependency = 2 and t_finished >= '2023-12-12' group by host having count(jobs.id) > 50 order by fail_rate_percent desc;
total | fail_rate_percent | host
-------+---------------------+-------------
751 | 17.0439414114513981 | mania
966 | 14.1821946169772257 | worker36
1069 | 13.3769878391019645 | worker32
966 | 12.9399585921325052 | worker35
1076 | 11.1524163568773234 | worker34
807 | 10.7806691449814126 | worker30
2361 | 10.5463786531130877 | worker-arm1
789 | 10.5196451204055767 | worker-arm2
1076 | 10.1301115241635688 | worker33
1103 | 10.0634632819582956 | worker31
829 | 9.6501809408926417 | worker37
30301 | 7.5542061318108313 | worker38
789 | 7.4778200253485425 | worker40
705 | 7.3758865248226950 | worker29
806 | 6.2034739454094293 | worker39
(15 rows)
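To double-check a single worker such as openqaworker17 against that data, one could filter the same tables by host (a sketch reusing the schema from the query above; the exact host string, e.g. 'worker17', is an assumption and needs to match the entry in the workers table):
openqa=> select count(jobs.id) as total, sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) as failed_or_incomplete from jobs join workers on jobs.assigned_worker_id = workers.id where workers.host = 'worker17' and jobs.t_finished >= '2023-12-12';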
Updated by okurz 11 months ago
- Subject changed from significant increase in MM-test failure ratio 2024-01-18: https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de to https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de, likely DNS problems in PRG1
- Description updated (diff)
Updated by nicksinger 11 months ago
- Priority changed from Immediate to High
To mitigate the urgency I ran
salt '*.qa.suse.cz' cmd.run "sed -i 's/NETCONFIG_DNS_STATIC_SERVERS.*/NETCONFIG_DNS_STATIC_SERVERS="10.100.96.2"/' /etc/sysconfig/network/config && netconfig update -f && cat /etc/resolv.conf"
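Note that the inner quotes around 10.100.96.2 are consumed by the calling shell before salt sees the command, so the resulting line in /etc/sysconfig/network/config should read (a sketch of the expected outcome, assuming the command is issued from a regular shell):
NETCONFIG_DNS_STATIC_SERVERS=10.100.96.2
which is still valid for sysconfig files; netconfig update -f then regenerates /etc/resolv.conf from it.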
Updated by tinita 11 months ago
I called openqa-advanced-retrigger-jobs to restart failed jobs from roughly the last 2 hours:
host=openqa.suse.de failed_since="2024-01-18 11:00:00" result="result='failed'" additional_filters=" test not like '%investigate:%'" ./openqa-advanced-retrigger-jobs
It restarted 89 jobs.
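A rough way to preview how many jobs such a call would pick up is a direct query against the same database as above (a sketch; the conditions mirror the variables passed to the script and are not part of the script itself):
openqa=> select count(*) from jobs where result = 'failed' and t_finished >= '2024-01-18 11:00:00' and test not like '%investigate:%';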
Updated by okurz 11 months ago
https://openqa.suse.de/tests/13279173 has now progressed further. More jobs have been restarted by tinita. We can monitor for a bit with lowered prio.
Updated by openqa_review 11 months ago
- Due date set to 2024-02-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 11 months ago
- Tags set to infra, dns, prg1
- Due date deleted (2024-02-02)
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
I think we applied all mitigations that are useful right now, waiting for https://sd.suse.com/servicedesk/customer/portal/1/SD-145291
Updated by okurz 11 months ago · Edited
From SD-ticket
Alright, I put “10.100.2.10,10.100.2.8” into /etc/dhcpd.conf on qanet.qa.suse.cz, as that is the DHCP server for those hosts. However, qanet.qa.suse.cz states that it is salt-controlled, so I assume corresponding changes need to go into https://gitlab.suse.de/OPS-Service/salt/, which I expect one of the owners of https://gitlab.suse.de/OPS-Service/salt/ to do. So, who will pick this up?
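For reference, in an ISC dhcpd configuration such a DNS override is typically a line like the following (a sketch only; the actual scope and surrounding syntax in qanet's /etc/dhcpd.conf may differ):
option domain-name-servers 10.100.2.10, 10.100.2.8;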
With that, this ticket is now related to #138275.
I also found https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=16051 now: smithers.qa.suse.cz as secondary DHCP/DNS server.
Updated by okurz 11 months ago
- Related to action #138275: Ensure that there is proper ownership and maintainership for qanet.qa.suse.cz added
Updated by okurz 11 months ago
- Status changed from In Progress to Blocked
- Parent task changed from #111929 to #154042
I put in the relation to #138275, also created a parent topic there, and am using that new parent here as well. I have provided an update in https://sd.suse.com/servicedesk/customer/portal/1/SD-145291 and am waiting for responses there.
Updated by okurz 11 months ago
- Status changed from Blocked to Resolved
Problem solved. Updated https://sd.suse.com/servicedesk/customer/portal/1/SD-145291 and suggested to close it. Conducted rollback steps and verified the expected output.