action #166313
closed: Failed systemd services on multiple workers due to logrotate errors size:S
Added by livdywan 3 months ago. Updated about 2 months ago.
0%
Description
Observation
Failed services:
2024-09-04 12:14:00 openqaworker17 logrotate 1
2024-09-04 12:13:00 openqaworker16 logrotate 1
2024-09-04 12:13:00 openqaworker-arm-1 logrotate
On openqaworker-arm-1.qe.nue2.suse.org I checked the error:
error: destination /var/log/zypper.log-20240904.xz already exists
Deleting the file and restarting the service did the trick.
Suggestions
- Login and check failed services
- Restart services and/or the worker
- Investigate why logrotate would try to create the same file again
- Check the logrotate config and see if this can be improved - or maybe this is an upstream issue as we've seen this before
- Most likely because we only include the day into the timestamp and run logrotate multiple times a day
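The collision suspected in the last suggestion follows directly from logrotate's default dateext suffix, which only has day granularity. A minimal sketch (the file name is taken from the error above, purely for illustration):

```shell
# logrotate's default "dateext" suffix is -%Y%m%d (day granularity),
# so a second rotation on the same day targets the exact same name
# and fails with "destination ... already exists".
suffix="$(date +%Y%m%d)"
echo "zypper.log-${suffix}.xz"
```

Adding %H%M (via dateformat) or rotating at most once per day would avoid the clash.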
Files
clipboard-202409162250-cgqir.png (149 KB) | livdywan, 2024-09-16 20:50
Updated by livdywan 2 months ago
- Status changed from Workable to In Progress
- Investigate why logrotate would try to create the same file again
- Check the logrotate config and see if this can be improved - or maybe this is an upstream issue as we've seen this before
- Most likely because we only include the day into the timestamp and run logrotate multiple times a day
I don't see anything in salt that configures the timestamp 🤔 Taking a look at the logs of the two other machines anyway
Updated by livdywan 2 months ago
- Assignee set to livdywan
- Investigate why logrotate would try to create the same file again
- Check the logrotate config and see if this can be improved - or maybe this is an upstream issue as we've seen this before
- Most likely because we only include the day into the timestamp and run logrotate multiple times a day
I don't see anything in salt that configures the timestamp 🤔 Taking a look at the logs of the two other machines anyway
Apparently openqaworker1{6,7}.qa.suse.cz are just fine. Not sure if anyone cleaned those up in the meantime.
/etc/logrotate.conf contains this:
weekly
rotate 4
dateext
Which would suggest we only rotate every week and there's no way timestamps collide.
Updated by livdywan 2 months ago · Edited
- Status changed from In Progress to Feedback
- Most likely because we only include the day into the timestamp and run logrotate multiple times a day
I don't see anything in salt that configures the timestamp 🤔 Taking a look at the logs of the two other machines anyway
Which would suggest we only rotate every week and there's no way timestamps collide.
cat /etc/logrotate.d/zypper.lr
/var/log/zypper.log {
    compress
    dateext
    notifempty
    missingok
    nocreate
    maxage 60
    rotate 99
    size 10M
}
I guess we don't override anything on workers? Not sure how this ends up rotating more often than daily then. The timer also says OnCalendar=daily, which is the upstream default.
Updated by tinita 2 months ago · Edited
hm: https://stackoverflow.com/a/23728801

size will rotate when the log > size, regardless of whether hourly/daily/weekly/monthly is specified.

https://linux.die.net/man/8/logrotate doesn't explicitly say that.

We probably want minsize and one of daily/weekly instead.
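A sketch of what that suggestion could look like in /etc/logrotate.d/zypper.lr, keeping the existing directives and swapping size for minsize plus an interval (illustrative only; the actual change went through salt):

```
/var/log/zypper.log {
    compress
    dateext
    notifempty
    missingok
    nocreate
    maxage 60
    rotate 99
    # rotate at most weekly, and only if the log has grown past 10M,
    # so the day-granular dateext suffix can never collide
    weekly
    minsize 10M
}
```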
Updated by tinita 2 months ago
btw, I see repeated zypper messages like this:
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1369 Encountered problems! Here are the solutions:
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1369
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1373 Problem 1:
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1374 ====================================
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1378 the to be installed patch:openSUSE-SLE-15.5-2024-3092-1.noarch conflicts with 'dpdk-kmp-default.aarch64 < 19.11.10_k5.14.21_150500.55.68-150500.5.14.3' provided by the to be installed dpdk-kmp-default-19.11.10_k5.14.21_150500.55.44-150500.5.3.1.aarch64
It's always about the same package. That could maybe explain the large size of the logs.
Updated by livdywan 2 months ago
tinita wrote:
It's always about the same package. That could maybe explain the large size of the logs.
I got a suggestion to re-run with --debug-solver, which might help explain what is happening and whether it's a bug or a genuine patch conflict.
Updated by livdywan 2 months ago
tinita wrote:
We probably want minsize and one of daily/weekly instead.
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1265/diffs
Updated by okurz 2 months ago · Edited
The actual problem is not logrotate itself: zypper produces so much log content that logrotate ends up being triggered too often. As visible in the zypper logs, the root cause is the auto-update service, which shows
Problem: 1: the to be installed patch:openSUSE-SLE-15.5-2024-3092-1.noarch conflicts with 'dpdk-kmp-default.x86_64 < 19.11.10_k5.14.21_150500.55.68-150500.5.14.3' provided by the to be installed dpdk-kmp-default-19.11.10_k5.14.21_150500.53-150500.3.37.x86_64
Solution 1: Following actions will be done:
  replacement of dpdk-kmp-default-19.11.10_k5.14.21_150500.55.68-150500.5.9.1.x86_64 with dpdk-kmp-default-19.11.10_k5.14.21_150500.55.68-150500.5.14.3.x86_64
  deinstallation of dpdk-kmp-default-19.11.10_k5.14.21_150500.55.68-150500.5.9.1.x86_64
Solution 2: do not install patch:openSUSE-SLE-15.5-2024-3092-1.noarch
Choose from above solutions by number or cancel [1/2/c/d/?] (c): c
The proper solution here is "Solution 1", but so far auto-update does not apply it automatically, nor does the auto-update.service report as failed, which would trigger a direct alert. The status of the service is:
● auto-update.service - Automatically patch system packages.
Loaded: loaded (/etc/systemd/system/auto-update.service; static)
Active: activating (auto-restart) (Result: exit-code) since Mon 2024-09-16 12:53:56 UTC; 1min 18s ago
TriggeredBy: ● auto-update.timer
Process: 78747 ExecStart=/usr/local/bin/auto-update (code=exited, status=4)
Main PID: 78747 (code=exited, status=4)
The service is configured to automatically restart on failure, but I don't know why it's not reflecting the failed status in between, or what "activating (auto-restart)" means.
In openqa/salt-states-openqa commit f6549bc I originally introduced Restart=on-failure, however that was for #62201, for problems accessing non-official repositories. For maintenance patches we should assume a much higher level of repository availability, so I suggest not retrying there, which might already fix the problem that the service failure is not properly propagated.
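A sketch of the direction described above, assuming a plain service unit for /usr/local/bin/auto-update (the real unit content is not shown in the ticket, and the actual change is in the linked MRs):

```
# /etc/systemd/system/auto-update.service (sketch)
[Unit]
Description=Automatically patch system packages.

[Service]
ExecStart=/usr/local/bin/auto-update
# Restart=on-failure removed: without the automatic restart, a
# non-zero exit leaves the unit in a "failed" state instead of the
# intermediate "activating (auto-restart)", so the alert can fire.
```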
Updated by okurz 2 months ago
and https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1267 to fix the actual problem.
Updated by livdywan 2 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1266
Let's see if the fix also works.
Updated by livdywan 2 months ago · Edited
sudo rpm -e libsolv-tools
sudo rpm -e libsolv-tools-base
sudo rpm -i --force /var/cache/zypp/packages/repo-oss/ppc64le/libsolv-tools-0.7.19-6.1.ppc64le.rpm
sudo rpm -i /var/cache/zypp/packages/repo-oss/ppc64le/libzypp-17.27.0-12.1.ppc64le.rpm
warning: /etc/zypp/zypp.conf created as /etc/zypp/zypp.conf.rpmnew
sudo rpm -i /var/cache/zypp/packages/repo-oss/ppc64le/zypper-1.14.46-13.1.ppc64le.rpm
Unfortunately re-installing zypper is not enough as the package versions will conflict again on the next attempt:
Problem: the to be installed libzypp-17.35.8-150500.6.13.1.ppc64le conflicts with 'python2-zypp-plugin < 0.6.4' provided by the installed python2-zypp-plugin-0.6.3-2.18.noarch
Solution 1: deinstallation of libzypp-17.27.0-12.1.ppc64le
Solution 2: deinstallation of python2-zypp-plugin-0.6.3-2.18.noarch
Solution 3: do not install patch:openSUSE-SLE-15.5-2024-3178-1.noarch
[...]
The following 2 packages are going to be REMOVED:
libzypp zypper
The following package is going to be upgraded:
libsolv-tools
Updated by livdywan 2 months ago · Edited
- Status changed from Resolved to In Progress
logrotate failing on worker36:
Sep 18 08:11:22 worker36 logrotate[19188]: error: destination /var/log/zypper.log-20240918.xz already exists, skipping rotation
auto-update failing on worker36:
Sep 18 08:11:23 worker36 auto-update[19156]: Repository 'Server Monitoring Software' is invalid.
Sep 18 08:11:23 worker36 auto-update[19156]: error]
Sep 18 08:11:23 worker36 auto-update[19156]: Do you want to reject the key, trust temporarily, or trust always? [r/t/a/?] (r): r
[...]
Problem: 1: the installed gio-branding-openSUSE-42.1-lp156.8.3.noarch requires 'libgio-2_0-0 = 2.78.3', but this requirement cannot be provided
Solution 1: deinstallation of gio-branding-openSUSE-42.1-lp156.8.3.noarch
Solution 2: do not install patch:openSUSE-SLE-15.6-2024-1950-1.noarch
Solution 3: break gio-branding-openSUSE-42.1-lp156.8.3.noarch by ignoring some of its dependencies
Updated by livdywan about 2 months ago
- Status changed from Feedback to Resolved
livdywan wrote in #note-23:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1269
Discussed and merged. No other issues seen.