action #153880
closed
openQA Infrastructure (public) - coordination #168895: [saga][epic][infra] Support SUSE PRG office move while ensuring business continuity
https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de, likely DNS problems in PRG1
Description
Observation
https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de
I guess we need to take another deep look at the situation: https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=now-2d&to=now shows again a significant increase in the failure ratio.
The job got triggered on w17, which does not have "tap", so the scheduler should not have picked that machine. The job is, however, not multi-machine.
Rollback steps
run salt '*.qa.suse.cz' cmd.run "sed -i 's/NETCONFIG_DNS_STATIC_SERVERS.*/NETCONFIG_DNS_STATIC_SERVERS=""/' /etc/sysconfig/network/config && netconfig update -f && cat /etc/resolv.conf"
and check that the output mentions nameserver 10.100.96.1 as first server again.
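For a quick check across all affected hosts that the rollback took effect, something like the following should do (a sketch, assuming the same salt master and minion naming as in the command above):
salt '*.qa.suse.cz' cmd.run "grep '^nameserver' /etc/resolv.conf | head -n 1"
Every minion should report 10.100.96.1 as its first nameserver again.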
Updated by okurz 11 months ago
- Copied from action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Updated by nicksinger 11 months ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by okurz 11 months ago
I checked the database but openqaworker17 did not show up there:
openqa=> select distinct count(jobs.id) as total, sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent, host from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) join workers on jobs.assigned_worker_id = workers.id where dependency = 2 and t_finished >= '2023-12-12' group by host having count(jobs.id) > 50 order by fail_rate_percent desc;
total | fail_rate_percent | host
-------+---------------------+-------------
751 | 17.0439414114513981 | mania
966 | 14.1821946169772257 | worker36
1069 | 13.3769878391019645 | worker32
966 | 12.9399585921325052 | worker35
1076 | 11.1524163568773234 | worker34
807 | 10.7806691449814126 | worker30
2361 | 10.5463786531130877 | worker-arm1
789 | 10.5196451204055767 | worker-arm2
1076 | 10.1301115241635688 | worker33
1103 | 10.0634632819582956 | worker31
829 | 9.6501809408926417 | worker37
30301 | 7.5542061318108313 | worker38
789 | 7.4778200253485425 | worker40
705 | 7.3758865248226950 | worker29
806 | 6.2034739454094293 | worker39
(15 rows)
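To double-check a single worker such as openqaworker17 against that data, one could filter the same tables by host (a sketch reusing the schema from the query above; the exact host string, e.g. 'worker17', is an assumption and needs to match the entry in the workers table):
openqa=> select count(jobs.id) as total, sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) as failed_or_incomplete from jobs join workers on jobs.assigned_worker_id = workers.id where workers.host = 'worker17' and jobs.t_finished >= '2023-12-12';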
Updated by okurz 11 months ago
- Subject changed from significant increase in MM-test failure ratio 2024-01-18: https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de to https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de, likely DNS problems in PRG1
- Description updated (diff)
Updated by nicksinger 11 months ago
- Priority changed from Immediate to High
To mitigate the urgency I ran
salt '*.qa.suse.cz' cmd.run "sed -i 's/NETCONFIG_DNS_STATIC_SERVERS.*/NETCONFIG_DNS_STATIC_SERVERS="10.100.96.2"/' /etc/sysconfig/network/config && netconfig update -f && cat /etc/resolv.conf"
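Note that the inner quotes around 10.100.96.2 are consumed by the calling shell before salt sees the command, so the resulting line in /etc/sysconfig/network/config should read (a sketch of the expected outcome, assuming the command is issued from a regular shell):
NETCONFIG_DNS_STATIC_SERVERS=10.100.96.2
which is still valid for sysconfig files; netconfig update -f then regenerates /etc/resolv.conf from it.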
Updated by tinita 11 months ago
I called openqa-advanced-retrigger-jobs to restart failed jobs from roughly the last 2 hours:
host=openqa.suse.de failed_since="2024-01-18 11:00:00" result="result='failed'" additional_filters=" test not like '%investigate:%'" ./openqa-advanced-retrigger-jobs
It restarted 89 jobs.
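A rough way to preview how many jobs such a call would pick up is a direct query against the same database as above (a sketch; the conditions mirror the variables passed to the script and are not part of the script itself):
openqa=> select count(*) from jobs where result = 'failed' and t_finished >= '2024-01-18 11:00:00' and test not like '%investigate:%';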
Updated by okurz 11 months ago
https://openqa.suse.de/tests/13279173 has now progressed further. More jobs have been restarted by tinita. We can monitor for a bit with lowered prio.
Updated by openqa_review 11 months ago
- Due date set to 2024-02-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 11 months ago
- Tags set to infra, dns, prg1
- Due date deleted (2024-02-02)
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
I think we applied all mitigations that are useful right now, waiting for https://sd.suse.com/servicedesk/customer/portal/1/SD-145291
Updated by okurz 11 months ago · Edited
From SD-ticket
Alright, I put “10.100.2.10,10.100.2.8” into /etc/dhcpd.conf on qanet.qa.suse.cz, as that is the DHCP server for those hosts. However, qanet.qa.suse.cz states that it is salt-controlled, so I assume corresponding changes need to go into https://gitlab.suse.de/OPS-Service/salt/, which I expect one of the owners of https://gitlab.suse.de/OPS-Service/salt/ to do. So, who will pick this up?
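For reference, in an ISC dhcpd configuration such a DNS override is typically a line like the following (a sketch only; the actual scope and surrounding syntax in qanet's /etc/dhcpd.conf may differ):
option domain-name-servers 10.100.2.10, 10.100.2.8;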
With that, this ticket is now related to #138275.
I also found https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=16051 now: smithers.qa.suse.cz as secondary DHCP/DNS server.
Updated by okurz 11 months ago
- Related to action #138275: Ensure that there is proper ownership and maintainership for qanet.qa.suse.cz added
Updated by okurz 11 months ago
- Status changed from In Progress to Blocked
- Parent task changed from #111929 to #154042
I put in the relation to #138275, also created a parent topic there, and am using that new parent here as well. I have provided an update in https://sd.suse.com/servicedesk/customer/portal/1/SD-145291 and am waiting for responses there.
Updated by okurz 11 months ago
- Status changed from Blocked to Resolved
Problem solved. Updated https://sd.suse.com/servicedesk/customer/portal/1/SD-145291 and suggested to close it. Conducted rollback steps and verified the expected output.