action #107227: bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M - QA (public) - openSUSE Project Management Tool

action #107227

closed

bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M

Added by okurz about 3 years ago. Updated about 3 years ago.

Status:

Resolved

Priority:

High

Assignee:

osukup

Target version:

openQA Project (public) - Ready

Start date:

2022-02-22

Description

Observation

The first failed job seems to be from 9h ago: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/850672#L46 . It logged "ERROR: something wrong with /etc/openqabot/singlearch.yml" and then aborted with a traceback:

ERROR: something wrong with /etc/openqabot/singlearch.yml
DEBUG: Getting id for Data(incident=0, settings_id=0, flavor='Server-DVD-HA-Updates', arch='x86_64', distri='sle', version='12-SP3', build='', product='HA12SP3')
Traceback (most recent call last):
  File "./qem-bot/bot-ng.py", line 5, in <module>
    main()
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/main.py", line 40, in main
    sys.exit(cfg.func(cfg))
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/args.py", line 68, in do_sync_aggregate_results
    return syncer()
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/aggrsync.py", line 29, in __call__
    update_setting += get_aggregate_settings_data(self.token, product)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 175, in get_aggregate_settings_data
    for s in settings[:3]:
TypeError: unhashable type: 'slice'
Uploading artifacts for failed job

a6a304dc1592:/ # cat /etc/openqabot/singlearch.yml
- powerpc-utils
- s390-tools
- yast2-s390
- lspvd

Acceptance criteria

AC1: Pipeline doesn't fail due to config errors

Suggestions

Research what "slice" maps to in Python
Look up where /etc/openqabot/singlearch.yml comes from (it comes from the openSUSE package qam-metadata-openqabot) and what format is expected
Improve the error message
Ensure singlearch.yml is deployed from GitLab
Confirm if this is a regression due to a recent change
Look into registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest (https://registry.suse.de/cgi-bin/cooverview)
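
On the first suggestion: `settings[:3]` passes a `slice` object as the subscript. A list accepts that, while a dict treats it as a key and tries to hash it. The TypeError in the traceback therefore suggests that `settings` was not a list at that point. One plausible reading (an assumption, not confirmed by the job log) is that the dashboard request returned an error object instead of a list of settings. The sketch below shows the behaviour and a guard that fails with a clearer message; `take_first_settings` and `slicing_error` are hypothetical names, not part of qem-bot.

```python
def take_first_settings(settings, limit=3):
    """Return the first `limit` entries, failing loudly on non-list payloads."""
    if not isinstance(settings, list):
        raise ValueError(
            f"expected a list of settings, got {type(settings).__name__}: {settings!r}"
        )
    return settings[:limit]


def slicing_error(payload):
    """Return the name of the exception raised by payload[:3], or None."""
    try:
        payload[:3]
    except (TypeError, KeyError) as exc:
        # Python < 3.12: TypeError "unhashable type: 'slice'" for a dict.
        # Python >= 3.12: slice objects are hashable, so a dict raises KeyError.
        return type(exc).__name__
    return None


if __name__ == "__main__":
    print(slicing_error([1, 2, 3, 4]))       # a list can be sliced
    print(slicing_error({"error": "oops"}))  # a dict cannot
```

A guard like this would point at the unexpected payload instead of at the slicing operation.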

Related issues 2 (0 open — 2 closed)

Updated by livdywan about 3 years ago

Subject changed from bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" to bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M
Description updated (diff)
Status changed from New to Workable

Updated by osukup about 3 years ago

Assignee set to osukup

Updated by livdywan about 3 years ago

Description updated (diff)

Updated by tinita about 3 years ago

Description updated (diff)

Updated by osukup about 3 years ago

Not a regression. The ERROR here is a harmless warning (I'll add a proper exclude for the singlearch.yml file).

The real cause of the problem is a broken dashboard.qam.suse.de, the same as in https://progress.opensuse.org/issues/106179
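
The exclude mentioned above could look roughly like the following. This is only a sketch of the idea, not the change that went into qem-bot: `NON_PRODUCT_FILES` and `iter_product_configs` are hypothetical names, and the assumption is that singlearch.yml is valid YAML (a plain list of package names) that the per-product metadata loader should skip silently instead of reporting as broken.

```python
from pathlib import Path

# Assumption: files listed here are valid YAML but are not per-product
# metadata, so the metadata loader should skip them without logging an error.
NON_PRODUCT_FILES = frozenset({"singlearch.yml"})


def iter_product_configs(directory, exclude=NON_PRODUCT_FILES):
    """Yield the *.yml files in `directory`, skipping known non-product files."""
    for path in sorted(Path(directory).glob("*.yml")):
        if path.name in exclude:
            continue
        yield path
```

Called on /etc/openqabot, this would yield every metadata file except singlearch.yml, so the misleading ERROR line would no longer be logged.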

Updated by livdywan about 3 years ago

Status changed from Workable to In Progress

I assume you're working on it since you hinted at having a fix in Slack ;-)

Updated by jbaier_cz about 3 years ago

I suspected that the error message is misleading in this case (as there was no change in metadata for some time).

So it was the IP problem again. The interesting thing is that I have pinpointed the exact moment that caused it: a nightly update of systemd-networkd.

Logs from the database:

Feb 21 23:06:26 postgresql [RPM][18920]: erase systemd-network-246.16-150300.7.36.1.x86_64: success
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql [RPM][18920]: install systemd-network-246.16-150300.7.39.1.x86_64: success
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Stopping Network Service...
Feb 21 23:06:27 postgresql systemd-networkd[1142]: host0: DHCPv6 lease lost
Feb 21 23:06:27 postgresql systemd[1]: systemd-networkd.service: Succeeded.
Feb 21 23:06:27 postgresql systemd[1]: Stopped Network Service.
Feb 21 23:06:27 postgresql systemd[1]: Starting Network Service...
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Failed to increase receive buffer size for general netlink socket, ignoring: Operation not permitted
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: IPv6 successfully enabled
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: Gained IPv6LL
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Enumeration completed
Feb 21 23:06:28 postgresql systemd[1]: Started Network Service.
Feb 21 23:06:28 postgresql systemd[1]: Reloading.
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: DHCPv4 address 192.168.0.48/24 via 192.168.0.1

And a corresponding snippet from the DHCP server. For some unknown reason, the machine requests a particular address and ignores the DHCP offer.

Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.168 aa:5d:de:99:79:13 postgresql
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql

Updated by okurz about 3 years ago

Related to action #106179: No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S added

Updated by osukup about 3 years ago

Status changed from In Progress to Feedback

https://github.com/openSUSE/qem-bot/pull/3

The real problem was solved by restarting the systemd-networkd service in the container; see the analysis by @jbaier in https://progress.opensuse.org/issues/107227#note-7

#10

Updated by osukup about 3 years ago

jbaier_cz wrote:

I suspected that the error message is misleading in this case (as there was no change in metadata for some time).

So it was the IP problem again. The interesting thing is that I have pinpointed the exact moment that caused it: a nightly update of systemd-networkd.

Logs from the database:

Feb 21 23:06:26 postgresql [RPM][18920]: erase systemd-network-246.16-150300.7.36.1.x86_64: success
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql [RPM][18920]: install systemd-network-246.16-150300.7.39.1.x86_64: success
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Stopping Network Service...
Feb 21 23:06:27 postgresql systemd-networkd[1142]: host0: DHCPv6 lease lost
Feb 21 23:06:27 postgresql systemd[1]: systemd-networkd.service: Succeeded.
Feb 21 23:06:27 postgresql systemd[1]: Stopped Network Service.
Feb 21 23:06:27 postgresql systemd[1]: Starting Network Service...
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Failed to increase receive buffer size for general netlink socket, ignoring: Operation not permitted
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: IPv6 successfully enabled
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: Gained IPv6LL
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Enumeration completed
Feb 21 23:06:28 postgresql systemd[1]: Started Network Service.
Feb 21 23:06:28 postgresql systemd[1]: Reloading.
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: DHCPv4 address 192.168.0.48/24 via 192.168.0.1

And a corresponding snippet from the DHCP server. For some unknown reason, the machine requests a particular address and ignores the DHCP offer.

Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.168 aa:5d:de:99:79:13 postgresql
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql

Looks like a problem/bug in systemd-networkd?

#11

Updated by okurz about 3 years ago

The above DHCP flow looks normal to me. AFAIK it's always:

-> DISCOVER
<- OFFER
-> REQUEST
<- ACK

Right?

#12

Updated by jbaier_cz about 3 years ago

okurz wrote:

The above DHCP flow looks normal to me. AFAIK it's always:

-> DISCOVER

<- OFFER

-> REQUEST

<- ACK

Right?

Yes, the question here is why the offer for 192.168.0.168 was ignored by the client and 192.168.0.48 was requested instead.
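
To spot this kind of mismatch in a longer dnsmasq log, something like the following could help. It is a rough sketch, not an existing tool: `find_ignored_offers` is a hypothetical helper, and the regular expression assumes the dnsmasq-dhcp line layout seen in the snippet above. A REQUEST without a preceding OFFER (a plain renewal, like the one at 22:43:31) is not reported.

```python
import re

# Matches e.g. "DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13";
# the address is optional because DHCPDISCOVER lines carry only the MAC.
LINE_RE = re.compile(
    r"(?P<kind>DHCP[A-Z]+)\((?P<iface>[^)]+)\)\s+"
    r"(?:(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+)?"
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
)


def find_ignored_offers(lines):
    """Return (mac, offered_ip, requested_ip) for each REQUEST that
    asks for a different address than the preceding OFFER."""
    pending_offer = {}
    mismatches = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        kind, ip, mac = match["kind"], match["ip"], match["mac"]
        if kind == "DHCPOFFER":
            pending_offer[mac] = ip
        elif kind == "DHCPREQUEST":
            offered = pending_offer.pop(mac, None)
            if offered is not None and ip != offered:
                mismatches.append((mac, offered, ip))
    return mismatches


LOG = """\
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.168 aa:5d:de:99:79:13 postgresql
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql
""".splitlines()

if __name__ == "__main__":
    for mac, offered, requested in find_ignored_offers(LOG):
        print(f"{mac}: offered {offered}, requested {requested}")
```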

#13

Updated by osukup about 3 years ago

Status changed from Feedback to Resolved

GitLab CI is working again.

AC1 was already met before; the failure had a different cause.

#14

Updated by okurz about 3 years ago

Status changed from Resolved to Feedback
Priority changed from Immediate to High

I agree that the original problem was resolved. Let's follow https://progress.opensuse.org/projects/qa/wiki/Tools#How-we-work-on-our-backlog in particular "For every regression or bigger issue that we encounter try to come up with at least two improvements, e.g. the actual issue is fixed and similar cases are prevented in the future with better tests and optionally also monitoring is improved" and try to identify and conduct at least one improvement.

EDIT: Discussed with jbaier and osukup in SUSE QE Tools and we found that we currently rely on hardcoded IPv4 addresses within qem-dashboard, mainly because qam2 is not able to resolve the systemd-nspawn container by name. To fix that, /etc/resolv.conf should likely be changed to use the local dnsmasq daemon over 127.0.0.1, with additional nameserver entries so that both internal and external addresses can be resolved. Then, once the database container can be resolved over DNS, we should ensure that the machine can also be reached from the dashboard itself and use a DNS entry instead of an IPv4 address.
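
A small check along these lines could confirm that the database container resolves by name before the hardcoded address is dropped. This is a sketch under assumptions: `resolve_ipv4` is a hypothetical helper, the host name `postgresql` is taken from the logs above and may not be the name that ends up in DNS, and 5432 is simply the default PostgreSQL port.

```python
import socket


def resolve_ipv4(hostname, port=5432, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses for `hostname`.

    `resolver` is injectable so the lookup can be exercised without DNS;
    by default it raises socket.gaierror when the name cannot be resolved.
    """
    infos = resolver(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET the sockaddr is an (address, port) tuple.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    try:
        print(resolve_ipv4("postgresql"))  # assumed container host name
    except socket.gaierror as exc:
        print(f"cannot resolve database host: {exc}")
```

Running this on both qam2 and the dashboard host would show whether each side can reach the container by name after the resolv.conf change.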

#15