action #153880

closed

QA - coordination #139094: [epic] Improve collaboration with Eng-Infra - take 2

https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de, likely DNS problems in PRG1

Added by okurz 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de

I guess we need to take another deep look at the situation: https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24&from=now-2d&to=now again shows a significant increase in the failure ratio.

The job got triggered on w17, which does not have "tap", so the scheduler should not have picked that machine. However, the job is not multi-machine.

Rollback steps

run salt '*.qa.suse.cz' cmd.run "sed -i 's/NETCONFIG_DNS_STATIC_SERVERS.*/NETCONFIG_DNS_STATIC_SERVERS=""/' /etc/sysconfig/network/config && netconfig update -f && cat /etc/resolv.conf" and check that the output mentions nameserver 10.100.96.1 as first server again
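
The final check of this step (first nameserver is 10.100.96.1 again) could be scripted along the following lines. This is only a minimal sketch and not part of the ticket: the helper name first_nameserver and the contents of the scratch file are made up for illustration; only the expected address comes from the step above.

```shell
#!/bin/sh
# Hypothetical helper: print the first "nameserver" entry of a
# resolv.conf-style file (defaults to /etc/resolv.conf).
first_nameserver() {
    awk '$1 == "nameserver" { print $2; exit }' "${1:-/etc/resolv.conf}"
}

# Illustration on a scratch file with made-up contents.
sample=$(mktemp)
printf '%s\n' 'search qa.suse.cz' \
              'nameserver 10.100.96.1' \
              'nameserver 10.100.96.2' > "$sample"

if [ "$(first_nameserver "$sample")" = "10.100.96.1" ]; then
    echo "first nameserver is 10.100.96.1"
else
    echo "unexpected first nameserver"
fi
rm -f "$sample"
```

On a real host the helper would be called without an argument so that it reads /etc/resolv.conf.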


Related issues 2 (1 open, 1 closed)

Related to openQA Infrastructure - action #138275: Ensure that there is proper ownership and maintainership for qanet.qa.suse.cz (Blocked, okurz, 2023-10-19)

Copied from openQA Project - action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M (Resolved, mkittler, 2023-12-11)

Actions #1

Updated by okurz 4 months ago

  • Copied from action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Actions #2

Updated by nicksinger 4 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #3

Updated by okurz 4 months ago

  • Description updated (diff)
Actions #4

Updated by okurz 4 months ago

I checked the database, but openqaworker17 did not show up there:

openqa=> select distinct count(jobs.id) as total,
           sum(case when jobs.result in ('failed', 'incomplete') then 1 else 0 end) * 100. / count(jobs.id) as fail_rate_percent,
           host
         from jobs
           left join job_dependencies on (id = child_job_id or id = parent_job_id)
           join workers on jobs.assigned_worker_id = workers.id
         where dependency = 2 and t_finished >= '2023-12-12'
         group by host
         having count(jobs.id) > 50
         order by fail_rate_percent desc;
 total |  fail_rate_percent  |    host     
-------+---------------------+-------------
   751 | 17.0439414114513981 | mania
   966 | 14.1821946169772257 | worker36
  1069 | 13.3769878391019645 | worker32
   966 | 12.9399585921325052 | worker35
  1076 | 11.1524163568773234 | worker34
   807 | 10.7806691449814126 | worker30
  2361 | 10.5463786531130877 | worker-arm1
   789 | 10.5196451204055767 | worker-arm2
  1076 | 10.1301115241635688 | worker33
  1103 | 10.0634632819582956 | worker31
   829 |  9.6501809408926417 | worker37
 30301 |  7.5542061318108313 | worker38
   789 |  7.4778200253485425 | worker40
   705 |  7.3758865248226950 | worker29
   806 |  6.2034739454094293 | worker39
(15 rows)
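
For readers without database access, the aggregation in the query above (share of 'failed' or 'incomplete' jobs per host, hosts below a job-count threshold dropped, sorted by fail rate) can be illustrated with a small shell sketch. The function name fail_rate, the "host result" input format and the sample rows are assumptions made for this illustration; it does not reproduce the joins of the real query.

```shell
#!/bin/sh
# Hypothetical helper: read "host result" pairs on stdin and print
# "host total fail_rate_percent", highest fail rate first.
# $1: only hosts with MORE than this many jobs are printed (default 0),
#     mirroring "having count(jobs.id) > 50" in the query.
fail_rate() {
    awk -v min="${1:-0}" '
        { total[$1]++ }
        $2 == "failed" || $2 == "incomplete" { bad[$1]++ }
        END {
            for (h in total)
                if (total[h] > min)
                    printf "%s %d %.2f\n", h, total[h], 100 * bad[h] / total[h]
        }' | sort -k3,3nr
}

# Made-up sample rows, not taken from the openQA database.
printf '%s\n' 'hostA passed' 'hostA failed' 'hostA passed' 'hostA incomplete' \
              'hostB passed' 'hostB passed' | fail_rate
```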
Actions #5

Updated by okurz 4 months ago

sudo salt -C 'G@roles:worker' cmd.run 'host download.suse.de' reproduces the DNS problem that nicksinger mentioned.
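
To turn such a lookup into a clear pass/fail result per name, a small wrapper like the following could be used. This is a sketch under assumptions: the function name check_dns and its interface are hypothetical, and it merely wraps a resolver command (host by default, as in the call above) and reports its exit status.

```shell
#!/bin/sh
# Hypothetical helper: check_dns NAME [RESOLVER_CMD...]
# Runs the resolver command (default: host) with NAME as last argument
# and prints "OK NAME" or "FAIL NAME" depending on its exit status.
check_dns() {
    name="$1"
    shift
    [ "$#" -gt 0 ] || set -- host
    if "$@" "$name" >/dev/null 2>&1; then
        echo "OK $name"
    else
        echo "FAIL $name"
        return 1
    fi
}

# Example call (needs network access, therefore left commented out):
# check_dns download.suse.de
```

Passing the resolver command as arguments makes it possible to use, for example, getent hosts instead of host without changing the helper.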

Actions #6

Updated by okurz 4 months ago

  • Subject changed from significant increase in MM-test failure ratio 2024-01-18: https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de to https://openqa.suse.de/tests/13277880#step/patterns/96 not being able to resolve download.suse.de, likely DNS problems in PRG1
  • Description updated (diff)
Actions #7

Updated by okurz 4 months ago

It seems the DNS problem just resolved itself. Running host download.suse.de via salt now works fine on the workers.

Actions #8

Updated by nicksinger 4 months ago

  • Priority changed from Immediate to High

ran salt '*.qa.suse.cz' cmd.run "sed -i 's/NETCONFIG_DNS_STATIC_SERVERS.*/NETCONFIG_DNS_STATIC_SERVERS="10.100.96.2"/' /etc/sysconfig/network/config && netconfig update -f && cat /etc/resolv.conf" to mitigate the urgency
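
A note on the quoting in this command: the inner double quotes around 10.100.96.2 end and reopen the outer double-quoted string, so the sed expression that is actually executed sets the value without surrounding quotes. The sketch below reproduces this on a scratch file; the helper name apply_like_ticket and the initial file content are made up, nothing under /etc is touched, and GNU sed is assumed for -i.

```shell
#!/bin/sh
# Hypothetical helper: build the sed call with the same quoting pattern
# as the salt command above and run it against the file given as $1.
apply_like_ticket() {
    cmd="sed -i 's/NETCONFIG_DNS_STATIC_SERVERS.*/NETCONFIG_DNS_STATIC_SERVERS="10.100.96.2"/' $1"
    echo "$cmd"     # the command string after the shell dropped the inner quotes
    sh -c "$cmd"
}

# Scratch file with made-up initial content.
cfg=$(mktemp)
printf '%s\n' 'NETCONFIG_DNS_STATIC_SERVERS=""' > "$cfg"
apply_like_ticket "$cfg"
cat "$cfg"
rm -f "$cfg"
```

The same effect applies to the rollback command in the description, where the inner "" collapses to an empty value.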

Actions #9

Updated by nicksinger 4 months ago

  • Description updated (diff)
Actions #10

Updated by nicksinger 4 months ago

  • Assignee changed from nicksinger to okurz
Actions #11

Updated by tinita 4 months ago

I called openqa-advanced-retrigger-jobs to restart failed jobs from roughly the last 2 hours:

host=openqa.suse.de failed_since="2024-01-18 11:00:00" result="result='failed'" additional_filters=" test not like '%investigate:%'" ./openqa-advanced-retrigger-jobs

It restarted 89 jobs.

Actions #12

Updated by okurz 4 months ago

https://openqa.suse.de/tests/13279173 has now progressed further. More jobs have been restarted by tinita. We can monitor for a bit with lowered prio.

Actions #13

Updated by openqa_review 4 months ago

  • Due date set to 2024-02-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by okurz 4 months ago

  • Tags set to infra, dns, prg1
  • Due date deleted (2024-02-02)
  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal

I think we applied all mitigations that are useful right now, waiting for https://sd.suse.com/servicedesk/customer/portal/1/SD-145291

Actions #15

Updated by okurz 4 months ago

  • Status changed from Blocked to In Progress

There is an answer in the SD ticket.

Actions #16

Updated by okurz 4 months ago · Edited

From SD-ticket

Alright I put “10.100.2.10,10.100.2.8” into /etc/dhcpd.conf on qanet.qa.suse.cz as that is the DHCP server for those hosts. However qanet.qa.suse.cz states that it is salt-controlled so I assume corresponding changes need to go into https://gitlab.suse.de/OPS-Service/salt/ which I expect one of the owners of https://gitlab.suse.de/OPS-Service/salt/ to do. So, who will pick this up?

With that this ticket is now related to #138275

I have now also found smithers.qa.suse.cz listed as secondary DHCP/DNS server: https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=16051

Actions #17

Updated by okurz 4 months ago

  • Related to action #138275: Ensure that there is proper ownership and maintainership for qanet.qa.suse.cz added
Actions #18

Updated by okurz 4 months ago

  • Status changed from In Progress to Blocked
  • Parent task changed from #111929 to #154042

I added the relation to #138275, created a parent topic there, and am using that new parent here as well. I have provided an update in https://sd.suse.com/servicedesk/customer/portal/1/SD-145291 and am waiting for responses there.

Actions #19

Updated by okurz 3 months ago

  • Status changed from Blocked to Resolved

Problem solved. I updated https://sd.suse.com/servicedesk/customer/portal/1/SD-145291 and suggested closing it. I also conducted the rollback steps and verified the expected output.

Actions #20

Updated by okurz 3 months ago

  • Target version changed from Ready to Tools - Next