action #107227

bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M

Added by okurz 4 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2022-02-22
Due date:
% Done: 0%
Estimated time:

Description

Observation

The first bad job seems to be from 9h ago: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/850672#L46 . "ERROR: something wrong with /etc/openqabot/singlearch.yml"

ERROR: something wrong with /etc/openqabot/singlearch.yml
DEBUG: Getting id for Data(incident=0, settings_id=0, flavor='Server-DVD-HA-Updates', arch='x86_64', distri='sle', version='12-SP3', build='', product='HA12SP3')
Traceback (most recent call last):
  File "./qem-bot/bot-ng.py", line 5, in <module>
    main()
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/main.py", line 40, in main
    sys.exit(cfg.func(cfg))
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/args.py", line 68, in do_sync_aggregate_results
    return syncer()
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/aggrsync.py", line 29, in __call__
    update_setting += get_aggregate_settings_data(self.token, product)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 175, in get_aggregate_settings_data
    for s in settings[:3]:
TypeError: unhashable type: 'slice'
Uploading artifacts for failed job
a6a304dc1592:/ # cat /etc/openqabot/singlearch.yml
- powerpc-utils
- s390-tools
- yast2-s390
- lspvd

Acceptance criteria

  • AC1: Pipeline doesn't fail due to config errors

Suggestions

  • Research what "slice" maps to in Python
  • Look up where /etc/openqabot/singlearch.yml comes from (it comes from the openSUSE package qam-metadata-openqabot) and the expected format
  • Improve the error message
  • Ensure singlearch.yml is deployed from GitLab
  • Confirm if this is a regression due to a recent change
  • Look into registry.suse.de/qa/maintenance/containers/qam-ci-leap:latest (https://registry.suse.de/cgi-bin/cooverview)
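
Regarding the "slice" suggestion: `settings[:3]` builds a `slice` object, and a dict treats it as a key instead of an index range. The traceback would therefore fit `settings` being a dict rather than a list, for example a JSON error object returned by the dashboard. That is an assumption, since the ticket does not show the actual response. A minimal sketch, with hypothetical helper names that are not part of qem-bot:

```python
def describe_slice_failure(payload):
    """Slice payload the way the traceback does and name the exception raised, if any."""
    try:
        payload[:3]
    except (TypeError, KeyError) as exc:
        # Older Pythons raise TypeError (slice is unhashable);
        # Python 3.12+ raises KeyError because slice objects became hashable.
        return type(exc).__name__
    return None


def first_settings(settings, limit=3):
    """Return the first `limit` entries, failing with a readable message otherwise."""
    if not isinstance(settings, list):
        raise ValueError(
            f"expected a list of settings, got {type(settings).__name__}: {settings!r}"
        )
    return settings[:limit]


# Hypothetical error payload; the real dashboard response is not shown in this ticket.
error_payload = {"error": "database unavailable"}
print(describe_slice_failure(error_payload))
print(first_settings(["a", "b", "c", "d"]))
```

A type check like the one in `first_settings` is one way to address the "Improve the error message" point: the failure would then name the unexpected payload instead of a slice.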

Related issues

Related to QA - action #106179: No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S (Resolved, 2022-02-08)

Copied to QA - action #109878: bot-ng schedule/approve aborted (Resolved, 2022-02-22)

History

#1 Updated by cdywan 4 months ago

  • Subject changed from bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" to bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M
  • Description updated (diff)
  • Status changed from New to Workable

#2 Updated by osukup 4 months ago

  • Assignee set to osukup

#3 Updated by cdywan 4 months ago

  • Description updated (diff)

#4 Updated by tinita 4 months ago

  • Description updated (diff)

#5 Updated by osukup 4 months ago

Not a regression. The ERROR here is a harmless warning (I'll add a proper exclude for the singlearch.yml file).

The real cause of the problem is a broken dashboard.qam.suse.de --> same as https://progress.opensuse.org/issues/106179

#6 Updated by cdywan 4 months ago

  • Status changed from Workable to In Progress

I assume you're working on it since you hinted at having a fix in Slack ;-)

#7 Updated by jbaier_cz 4 months ago

I suspected that the error message is misleading in this case (as there was no change in metadata for some time).

So it was the IP problem again. The interesting thing is, I have pinpointed the exact moment which caused it: a nightly update of systemd-networkd.

Logs from the database:

Feb 21 23:06:26 postgresql [RPM][18920]: erase systemd-network-246.16-150300.7.36.1.x86_64: success
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql [RPM][18920]: install systemd-network-246.16-150300.7.39.1.x86_64: success
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Stopping Network Service...
Feb 21 23:06:27 postgresql systemd-networkd[1142]: host0: DHCPv6 lease lost
Feb 21 23:06:27 postgresql systemd[1]: systemd-networkd.service: Succeeded.
Feb 21 23:06:27 postgresql systemd[1]: Stopped Network Service.
Feb 21 23:06:27 postgresql systemd[1]: Starting Network Service...
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Failed to increase receive buffer size for general netlink socket, ignoring: Operation not permitted
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: IPv6 successfully enabled
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: Gained IPv6LL
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Enumeration completed
Feb 21 23:06:28 postgresql systemd[1]: Started Network Service.
Feb 21 23:06:28 postgresql systemd[1]: Reloading.
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: DHCPv4 address 192.168.0.48/24 via 192.168.0.1

And a corresponding snippet from the DHCP server. For some unknown reason, the machine is asking for a particular address and is ignoring the DHCP offer.

Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.168 aa:5d:de:99:79:13 postgresql
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql

#8 Updated by okurz 4 months ago

  • Related to action #106179: No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S added

#9 Updated by osukup 4 months ago

  • Status changed from In Progress to Feedback

https://github.com/openSUSE/qem-bot/pull/3

The real problem was solved by restarting the systemd-networkd service in the container; see the analysis by @jbaier in https://progress.opensuse.org/issues/107227#note-7

#10 Updated by osukup 4 months ago

jbaier_cz wrote:

I suspected that the error message is misleading in this case (as there was no change in metadata for some time).

So it was the IP problem again. The interesting thing is, I have pinpointed the exact moment which caused it: a nightly update of systemd-networkd.

Logs from the database:

Feb 21 23:06:26 postgresql [RPM][18920]: erase systemd-network-246.16-150300.7.36.1.x86_64: success
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql dbus-daemon[25]: [system] Reloaded configuration
Feb 21 23:06:26 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql [RPM][18920]: install systemd-network-246.16-150300.7.39.1.x86_64: success
Feb 21 23:06:27 postgresql systemd[1]: Reloading.
Feb 21 23:06:27 postgresql systemd[1]: Stopping Network Service...
Feb 21 23:06:27 postgresql systemd-networkd[1142]: host0: DHCPv6 lease lost
Feb 21 23:06:27 postgresql systemd[1]: systemd-networkd.service: Succeeded.
Feb 21 23:06:27 postgresql systemd[1]: Stopped Network Service.
Feb 21 23:06:27 postgresql systemd[1]: Starting Network Service...
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Failed to increase receive buffer size for general netlink socket, ignoring: Operation not permitted
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: IPv6 successfully enabled
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: Gained IPv6LL
Feb 21 23:06:28 postgresql systemd-networkd[18991]: Enumeration completed
Feb 21 23:06:28 postgresql systemd[1]: Started Network Service.
Feb 21 23:06:28 postgresql systemd[1]: Reloading.
Feb 21 23:06:28 postgresql systemd-networkd[18991]: host0: DHCPv4 address 192.168.0.48/24 via 192.168.0.1

And a corresponding snippet from the DHCP server. For some unknown reason, the machine is asking for a particular address and is ignoring the DHCP offer.

Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.168 aa:5d:de:99:79:13 postgresql
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql

Looks like a problem/bug in systemd-networkd?

#11 Updated by okurz 4 months ago

The above DHCP flow looks normal to me. AFAIK it's always:

  1. -> DISCOVER
  2. <- OFFER
  3. -> REQUEST
  4. <- ACK

Right?

#12 Updated by jbaier_cz 4 months ago

okurz wrote:

The above DHCP flow looks normal to me. AFAIK it's always:

  1. -> DISCOVER
  2. <- OFFER
  3. -> REQUEST
  4. <- ACK

Right?

Yes, the question here is why the offer for 192.168.0.168 was ignored by the client and 192.168.0.48 was requested instead.
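
To spot such cases without reading the log by hand, the dnsmasq lines can be scanned for a REQUEST whose address differs from the preceding OFFER for the same MAC. A rough sketch, assuming the log format matches the lines quoted above; the function name and regex are illustrative only:

```python
import re

# Matches dnsmasq lines such as "DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13"
DHCP_LINE = re.compile(
    r"DHCP(?P<kind>OFFER|REQUEST)\(\S+\) "
    r"(?P<ip>\d+\.\d+\.\d+\.\d+) (?P<mac>[0-9a-f:]{17})"
)


def find_ignored_offers(lines):
    """Yield (mac, offered_ip, requested_ip) when a REQUEST differs from the last OFFER."""
    last_offer = {}
    for line in lines:
        match = DHCP_LINE.search(line)
        if not match:
            continue  # DISCOVER, ACK and unrelated lines are skipped
        mac, ip = match["mac"], match["ip"]
        if match["kind"] == "OFFER":
            last_offer[mac] = ip
        elif mac in last_offer:
            offered = last_offer.pop(mac)
            if offered != ip:
                yield mac, offered, ip


# Log lines copied from the DHCP server snippet in this ticket.
LOG = """\
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 22:43:31 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.168 aa:5d:de:99:79:13 postgresql
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPDISCOVER(br0) aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPOFFER(br0) 192.168.0.168 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPREQUEST(br0) 192.168.0.48 aa:5d:de:99:79:13
Feb 21 23:06:28 qam2 dnsmasq-dhcp[6139]: DHCPACK(br0) 192.168.0.48 aa:5d:de:99:79:13 postgresql
""".splitlines()

mismatches = list(find_ignored_offers(LOG))
print(mismatches)
```

The first REQUEST at 22:43:31 has no preceding OFFER (it looks like a renewal), so the sketch deliberately ignores it and only reports the 23:06:28 exchange.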

#13 Updated by osukup 4 months ago

  • Status changed from Feedback to Resolved

GitLab CI is working.

AC1: was met even before; the failure had a different cause.

#14 Updated by okurz 4 months ago

  • Status changed from Resolved to Feedback
  • Priority changed from Immediate to High

I agree that the original problem was resolved. Let's follow https://progress.opensuse.org/projects/qa/wiki/Tools#How-we-work-on-our-backlog in particular "For every regression or bigger issue that we encounter try to come up with at least two improvements, e.g. the actual issue is fixed and similar cases are prevented in the future with better tests and optionally also monitoring is improved" and try to identify and conduct at least one improvement.

EDIT: Discussed with jbaier and osukup in SUSE QE Tools. We found that we currently rely on hardcoded IPv4 addresses within qem-dashboard, mainly because qam2 is not able to resolve the systemd-nspawn container by name. To fix that, /etc/resolv.conf should likely be changed to use the local dnsmasq daemon on 127.0.0.1, with additional nameserver entries so that both internal and external addresses can be resolved. Once the database container can be resolved over DNS, we should ensure that the machine can also be reached from the dashboard itself and use a DNS entry instead of an IPv4 address.
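
As an illustration of the "DNS entry instead of IPv4" idea, a client could resolve the database host by name at startup and fail with a clear message if that does not work. This is only a sketch: the helper name is made up, "postgresql" is assumed as the container's hostname based on the logs above, and how qem-dashboard actually reads its database address is not covered here.

```python
import socket


def resolve_database_host(hostname, resolver=socket.gethostbyname):
    """Return the IPv4 address for hostname, or raise a descriptive error."""
    try:
        return resolver(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        raise RuntimeError(
            f"cannot resolve database host {hostname!r}: {exc}"
        ) from exc


if __name__ == "__main__":
    # "postgresql" is an assumed hostname taken from the log lines in this ticket.
    try:
        print(resolve_database_host("postgresql"))
    except RuntimeError as err:
        print(err)
```

The `resolver` parameter exists only so the lookup can be replaced in tests; with a working resolv.conf the default `socket.gethostbyname` would follow the address wherever DHCP puts it.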

#15 Updated by okurz 4 months ago

Created https://gitlab.suse.de/qa-maintenance/qamops/-/merge_requests/9 with a better resolv.conf including forwarding to internal dnsmasq

EDIT: And also https://gitlab.suse.de/qa-maintenance/qamops/-/merge_requests/10 to save the config properly in Ansible.

#16 Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved

With both MRs merged we have a proper DNS resolution and dashboard config stored in git. That should certainly count as good improvements :)

#17 Updated by jbaier_cz 4 months ago

Both requests were merged and also deployed.

#18 Updated by osukup 3 months ago
