action #131123
closedqa-jump.qe.nue2.suse.org is not reachable since 2023-06-19
Added by okurz over 1 year ago. Updated over 1 year ago.
Description
Observation
https://openqa.suse.de/tests/11377514/video?filename=video.ogv&t=19.8,20.5
shows that the PXE server on qa-jump.qe.nue2.suse.org is not reachable. I also cannot ping the host at all anymore. This is also visible in https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1&viewPanel=4&from=1687165791669&to=1687222145516
The host seems to have stopped responding at 2023-06-19 13:46:48Z.
Expected result
The machines should be able to boot from PXE.
Last good: https://openqa.suse.de/tests/11377512#step/boot_from_pxe/1 from 2023-06-19 13:17:59Z
Updated by okurz over 1 year ago
- Status changed from New to Blocked
Updated by okurz over 1 year ago
- Has duplicate action #131141: [alert] Packet loss between qa-jump.qe.nue2.suse.org and other hosts added
Updated by okurz over 1 year ago
- Status changed from Blocked to In Progress
The SD ticket was closed and the host should be reachable again; the corresponding job failures should be handled, e.g. by restarting them.
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Feedback
I restarted every job from the last 24h failing in boot_from_pxe with:
for i in $(ssh openqa.suse.de "sudo -u geekotest psql -t --command=\"select jobs.id from job_modules join jobs on jobs.id = job_modules.job_id where job_modules.result='failed' and t_started between NOW() - Interval '24 HOURS' and NOW() and name='boot_from_pxe';\" openqa"); do
    openqa-client --host openqa.suse.de jobs/$i/restart post
done
{
errors => ["Specified job 11392667 has already been cloned as 11395877"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11377521 has already been cloned as 11397236"],
result => [],
test_url => [],
}
{
result => [{ 11392668 => 11397616 }],
test_url => [{ 11392668 => "/tests/11397616" }],
}
{
result => [{ 11395237 => 11397617 }],
test_url => [{ 11395237 => "/tests/11397617" }],
}
{
errors => ["Specified job 11390240 has already been cloned as 11397615"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11377525 has already been cloned as 11397233"],
result => [],
test_url => [],
}
{
result => [{ 11392669 => 11397618 }],
test_url => [{ 11392669 => "/tests/11397618" }],
}
{
result => [{ 11377515 => 11397619 }],
test_url => [{ 11377515 => "/tests/11397619" }],
}
{
result => [{ 11390294 => 11397620 }],
test_url => [{ 11390294 => "/tests/11397620" }],
}
{
errors => ["Specified job 11377524 has already been cloned as 11397232"],
result => [],
test_url => [],
}
{
result => [{ 11388049 => 11397621 }],
test_url => [{ 11388049 => "/tests/11397621" }],
}
{
errors => ["Specified job 11388639 has already been cloned as 11389800"],
result => [],
test_url => [],
}
{
result => [{ 11377516 => 11397622 }],
test_url => [{ 11377516 => "/tests/11397622" }],
}
{
result => [{ 11389762 => 11397623 }],
test_url => [{ 11389762 => "/tests/11397623" }],
}
{
errors => ["Specified job 11377520 has already been cloned as 11397235"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11377514 has already been cloned as 11397512"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11377518 has already been cloned as 11397234"],
result => [],
test_url => [],
}
{
result => [{ 11377523 => 11397624 }],
test_url => [{ 11377523 => "/tests/11397624" }],
}
{
result => [{ 11377526 => 11397625 }],
test_url => [{ 11377526 => "/tests/11397625" }],
}
{
result => [{ 11388042 => 11397626 }],
test_url => [{ 11388042 => "/tests/11397626" }],
}
{
result => [{ 11387254 => 11397627 }],
test_url => [{ 11387254 => "/tests/11397627" }],
}
{
result => [{ 11387247 => 11397628 }],
test_url => [{ 11387247 => "/tests/11397628" }],
}
{
result => [{ 11388040 => 11397629 }],
test_url => [{ 11388040 => "/tests/11397629" }],
}
{
result => [{ 11390163 => 11397630 }],
test_url => [{ 11390163 => "/tests/11397630" }],
}
Updated by okurz over 1 year ago
- Status changed from Feedback to Resolved
Looks good. Next time you can also use https://github.com/os-autoinst/scripts/blob/master/openqa-advanced-retrigger-jobs
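For reference, a rough sketch of how that script could be invoked; the environment variable names (host, failed_since, dry_run) are assumptions based on the os-autoinst/scripts documentation, so please double-check them in the repository before use:
# sketch: re-trigger recently failed jobs via the helper script instead of a hand-written psql loop
# host/failed_since/dry_run are assumed variable names, see the os-autoinst/scripts README
git clone https://github.com/os-autoinst/scripts.git && cd scripts
dry_run=1 host=openqa.suse.de failed_since="2023-06-19 13:00" ./openqa-advanced-retrigger-jobs   # preview what would be restarted
host=openqa.suse.de failed_since="2023-06-19 13:00" ./openqa-advanced-retrigger-jobs             # actually restart the jobs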
Updated by okurz over 1 year ago
- Status changed from Resolved to In Progress
Happened again. @Nick Singer, please create an SD ticket and point out that the VM was already down recently and that https://sd.suse.com/servicedesk/customer/portal/1/SD-124776 still has an unanswered question.
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Feedback
Updated by crameleon over 1 year ago
Same issue as last time: the machine did not automatically apply its network configuration after a reboot:
qa-jump:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 72:1c:68:af:4d:17 brd ff:ff:ff:ff:ff:ff
altname enp0s2
altname ens2
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 52:54:00:3b:4c:8c brd ff:ff:ff:ff:ff:ff
altname enp0s3
altname ens3
qa-jump:~ # rcnetwork restart
qa-jump:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 72:1c:68:af:4d:17 brd ff:ff:ff:ff:ff:ff
altname enp0s2
altname ens2
inet 10.168.192.10/22 brd 10.168.195.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::701c:68ff:feaf:4d17/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:3b:4c:8c brd ff:ff:ff:ff:ff:ff
altname enp0s3
altname ens3
inet 192.168.44.1/24 brd 192.168.44.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe3b:4c8c/64 scope link
valid_lft forever preferred_lft forever
qa-jump:~ # systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● mnt-dist.mount loaded failed failed /mnt/dist
● mnt-openqa.mount loaded failed failed /mnt/openqa
qa-jump:~ # systemctl restart mnt-dist.mount
qa-jump:~ # systemctl restart mnt-openqa.mount
This seems ok:
qa-jump:~ # systemctl is-enabled network
alias
qa-jump:~ # systemctl is-enabled wicked
enabled
Not sure yet what's different with this machine.
Updated by crameleon over 1 year ago
There is an issue with the mounts, causing wicked to fail during boot:
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found ordering cycle on network-online.target/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on wicked.service/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on wickedd-auto4.service/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on local-fs.target/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on srv-tftpboot-mnt-openqa.mount/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Found dependency on mnt-openqa.mount/start
Jun 28 10:25:09 qa-jump systemd[1]: mnt-openqa.mount: Job local-fs.target/start deleted to break ordering cycle starting with mnt-openqa.mount/start
This seems similar to the issue described in https://www.suse.com/support/kb/doc/?id=000018796.
I edited the fstab entry from @okurz accordingly, appending _netdev to its options:
# okurz: 2023-05-17: "journalctl -e" revealed that systems (still?) try to access /mounts/dist so adding that bind mount additionally for convenience but that should be streamlined
# gpfuetzenreuter: 2023-06-28: repair broken network dependency cycle by adding _netdev
/mnt/dist /mounts/dist none defaults,bind,_netdev 0 0
After another reboot, the network came up fine.
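As a minimal sketch, assuming the mount unit name that systemd derives from the /mounts/dist path above, the fix could be double-checked like this:
# reload systemd's translation of fstab into mount units
systemctl daemon-reload
# with _netdev the bind mount should now be ordered after the network rather than local-fs
systemctl show "$(systemd-escape --path --suffix=mount /mounts/dist)" -p After | grep -i network
# validate fstab syntax and check the current boot for remaining ordering-cycle messages
findmnt --verify
journalctl -b | grep -i 'ordering cycle'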
Updated by nicksinger over 1 year ago
All affected jobs restarted:
{
result => [{ 11462148 => 11469067 }],
test_url => [{ 11462148 => "/tests/11469067" }],
}
{
result => [{ 11468047 => 11469068 }],
test_url => [{ 11468047 => "/tests/11469068" }],
}
{
errors => ["Specified job 11467991 has already been cloned as 11468908"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11462542 has already been cloned as 11467992"],
result => [],
test_url => [],
}
{
result => [{ 11463220 => 11469082 }],
test_url => [{ 11463220 => "/tests/11469082" }],
}
{
errors => ["Specified job 11463236 has already been cloned as 11468027"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11463194 has already been cloned as 11468046"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11463213 has already been cloned as 11468047"],
result => [],
test_url => [],
}
{
result => [{ 11468046 => 11469088 }],
test_url => [{ 11468046 => "/tests/11469088" }],
}
{
errors => ["Specified job 11463228 has already been cloned as 11467988"],
result => [],
test_url => [],
}
{
result => [{ 11467826 => 11469089 }],
test_url => [{ 11467826 => "/tests/11469089" }],
}
{
errors => ["Specified job 11462530 has already been cloned as 11467991"],
result => [],
test_url => [],
}
{
result => [{ 11468045 => 11469090 }],
test_url => [{ 11468045 => "/tests/11469090" }],
}
{
result => [{ 11449788 => 11469091 }],
test_url => [{ 11449788 => "/tests/11469091" }],
}
{
errors => ["Specified job 11464106 has already been cloned as 11468016"],
result => [],
test_url => [],
}
{
errors => ["Specified job 11462129 has already been cloned as 11468045"],
result => [],
test_url => [],
}
Updated by okurz over 1 year ago
- Status changed from Resolved to Feedback
Good, thx. I would still like to follow up on the comment "The machine is already monitored by our Zabbix infrastructure […] However, our team does not actively monitor alerts." from https://sd.suse.com/servicedesk/customer/portal/1/SD-125516 . That does not sound like a good approach to me. I see this as related to #121726 "Get management access to o3/osd VMs". IMHO the team that can control the VMs should act on the alerts. I am ok with us doing that if Eng-Infra can't, but having to wait for users to be confused and then create a ticket, only to find that there was already an alert, does not make sense.
Updated by okurz over 1 year ago
- Status changed from Feedback to Resolved
What I wrote in #131123-13 will be covered in #132149
Updated by okurz over 1 year ago
- Copied to action #133322: qa-jump.qe.nue2.suse.org is not reachable - take 3 added