action #166313 (closed)

Failed systemd services on multiple workers due to logrotate errors size:S

Added by livdywan 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee: livdywan
Category: Regressions/Crashes
% Done: 0%

Description

Observation

Failed services:

2024-09-04 12:14:00        openqaworker17        logrotate        1
2024-09-04 12:13:00        openqaworker16        logrotate        1
2024-09-04 12:13:00        openqaworker-arm-1        logrotate

On openqaworker-arm-1.qe.nue2.suse.org I checked the error:

error: destination /var/log/zypper.log-20240904.xz already exists

Deleting the file and restarting the service did the trick.
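For reference, the manual fix amounted to something like the following (a sketch; the archive name is from the error above, and the unit name assumes the stock logrotate.service):

# remove the conflicting rotated archive, then retry the rotation
sudo rm /var/log/zypper.log-20240904.xz
sudo systemctl restart logrotate.service
systemctl status logrotate.service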

Suggestions

  • Login and check failed services
  • Restart services and/or the worker
  • Investigate why logrotate would try to create the same file again
    • Check the logrotate config and see if this can be improved - or maybe this is an upstream issue as we've seen this before
    • Most likely because we only include the day in the timestamp and run logrotate multiple times a day (see the dateformat sketch below)
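
If the day-only timestamp is indeed the culprit, a minimal sketch of a config change that includes the hour in the rotated file name (hypothetical; assumes logrotate >= 3.9.0, which added %H support for dateformat):

# hypothetical addition to the zypper logrotate config
dateext
dateformat -%Y%m%d%H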

Actions #1

Updated by livdywan 3 months ago

  • Subject changed from Failed systemd services on multiple workers due to logwarn errors to Failed systemd services on multiple workers due to logrotate errors
Actions #2

Updated by livdywan 3 months ago

  • Description updated (diff)
Actions #3

Updated by livdywan 2 months ago

  • Subject changed from Failed systemd services on multiple workers due to logrotate errors to Failed systemd services on multiple workers due to logrotate errors size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by livdywan 2 months ago

  • Status changed from Workable to In Progress
  • Investigate why logrotate would try to create the same file again
    • Check the logrotate config and see if this can be improved - or maybe this is an upstream issue as we've seen this before
    • Most likely because we only include the day into the timestamp and run logrotate multiple times a day

I don't see anything in salt that configures the timestamp 🤔 Taking a look at the logs of the two other machines anyway

Actions #5

Updated by livdywan 2 months ago

  • Assignee set to livdywan
  • Investigate why logrotate would try to create the same file again
    • Check the logrotate config and see if this can be improved - or maybe this is an upstream issue as we've seen this before
    • Most likely because we only include the day into the timestamp and run logrotate multiple times a day

I don't see anything in salt that configures the timestamp 🤔 Taking a look at the logs of the two other machines anyway

Apparently openqaworker1{6,7}.qa.suse.cz are just fine. Not sure if anyone cleaned those up in the meantime.

/etc/logrotate.conf contains this:

weekly
rotate 4
dateext

Which would suggest we only rotate every week and there's no way timestamps collide.

Actions #6

Updated by livdywan 2 months ago · Edited

  • Status changed from In Progress to Feedback
  • Most likely because we only include the day into the timestamp and run logrotate multiple times a day

I don't see anything in salt that configures the timestamp 🤔 Taking a look at the logs of the two other machines anyway
Which would suggest we only rotate every week and there's no way timestamps collide.

cat /etc/logrotate.d/zypper.lr
/var/log/zypper.log {
    compress
    dateext
    notifempty
    missingok
    nocreate

    maxage 60
    rotate 99
    size 10M
}

I guess we don't override anything on workers? Not sure how this ends up rotating more often than daily then. The timer also says OnCalendar=daily, which is the upstream default.
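
For reference, the effective schedule can be checked on a worker with standard systemd tooling (a sketch):

systemctl cat logrotate.timer | grep OnCalendar
systemctl list-timers logrotate.timer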

Actions #7

Updated by tinita 2 months ago · Edited

hm: https://stackoverflow.com/a/23728801

size will rotate when the log > size. Regardless of whether hourly/daily/weekly/monthly is specified.

https://linux.die.net/man/8/logrotate doesn't explicitly say that.

We probably want minsize and one of daily/weekly instead.
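
A sketch of what /etc/logrotate.d/zypper.lr could look like with that change (illustrative only; minsize rotates only once the interval has elapsed and the log exceeds the given size, so timestamps can no longer collide within a day):

/var/log/zypper.log {
    compress
    dateext
    notifempty
    missingok
    nocreate

    maxage 60
    rotate 99
    daily
    minsize 10M
}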

Actions #8

Updated by tinita 2 months ago

btw, I see repeated zypper messages like this:

2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1369 Encountered problems! Here are the solutions:
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1369 
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1373 Problem 1:
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1374 ====================================
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1378 the to be installed patch:openSUSE-SLE-15.5-2024-3092-1.noarch conflicts with 'dpdk-kmp-default.aarch64 < 19.11.10_k5.14.21_150500.55.68-150500.5.14.3' provided by the to be installed dpdk-kmp-default-19.11.10_k5.14.21_150500.55.44-150500.5.3.1.aarch64

It's always about the same package. That could maybe explain the large size of the logs.

Actions #9

Updated by livdywan 2 months ago

2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1369 Encountered problems! Here are the solutions:
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1369 
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1373 Problem 1:
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1374 ====================================
2024-09-06 12:21:12 <1> openqaworker-arm-1(44841) [zypp::solver] SATResolver.cc(problems):1378 the to be installed patch:openSUSE-SLE-15.5-2024-3092-1.noarch conflicts with 'dpdk-kmp-default.aarch64 < 19.11.10_k5.14.21_150500.55.68-150500.5.14.3' provided by the to be installed dpdk-kmp-default-19.11.10_k5.14.21_150500.55.44-150500.5.3.1.aarch64

It's always about the same package. That could maybe explain the large size of the logs.

I got a suggestion to re-run with --debug-solver which might help explain what is happening and if it's a bug or a conflict of a patch.
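
For reference, generating the solver test case would look something like this (a sketch; patch name taken from the log above, invocation per zypper's documented --debug-solver option):

sudo zypper install --debug-solver patch:openSUSE-SLE-15.5-2024-3092-1
# zypper then writes the test case to /var/log/zypper.solverTestCase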

Actions #10

Updated by livdywan 2 months ago

We probably want minsize and one of daily/weekly instead.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1265/diffs

Actions #11

Updated by okurz 2 months ago · Edited

The actual problem is not logrotate at all. It is only that zypper produces so much content in its log that logrotate eventually is triggered too often. The root cause, as visible in the zypper logs, is the auto-update service, which shows

Problem: 1: the to be installed patch:openSUSE-SLE-15.5-2024-3092-1.noarch conflicts with 'dpdk-kmp-default.x86_64 < 19.11.10_k5.14.21_150500.55.68-150500.5.14.3' provided by the to be installed dpdk-kmp-default-19.11.10_k5.14.21_150500.53-150500.3.37.x86_64
 Solution 1: Following actions will be done:
  replacement of dpdk-kmp-default-19.11.10_k5.14.21_150500.55.68-150500.5.9.1.x86_64 with dpdk-kmp-default-19.11.10_k5.14.21_150500.55.68-150500.5.14.3.x86_64
  deinstallation of dpdk-kmp-default-19.11.10_k5.14.21_150500.55.68-150500.5.9.1.x86_64
 Solution 2: do not install patch:openSUSE-SLE-15.5-2024-3092-1.noarch

Choose from above solutions by number or cancel [1/2/c/d/?] (c): c

The proper solution here is "Solution 1", but so far auto-update does not apply that automatically, nor does auto-update.service report as failed, which would trigger a direct alert. The status of the service is

● auto-update.service - Automatically patch system packages.
     Loaded: loaded (/etc/systemd/system/auto-update.service; static)
     Active: activating (auto-restart) (Result: exit-code) since Mon 2024-09-16 12:53:56 UTC; 1min 18s ago
TriggeredBy: ● auto-update.timer
    Process: 78747 ExecStart=/usr/local/bin/auto-update (code=exited, status=4)
   Main PID: 78747 (code=exited, status=4)

The service is configured to automatically restart on failure, but I don't know why the failed status is not reflected in between restarts or what "activating (auto-restart)" means.

In openqa/salt-states-openqa commit f6549bc I originally introduced Restart=on-failure; however, that was for #62201, to handle problems accessing non-official repositories. For maintenance patches we should assume a much higher level of repository availability, so I suggest not retrying there, which might already fix the problem that the service failure is not properly propagated.
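
A minimal sketch of that suggestion as a systemd drop-in (the actual change would go through salt-states-openqa):

# sudo systemctl edit auto-update.service
[Service]
# override the Restart=on-failure from commit f6549bc; failures then
# leave the unit in a failed state, which alerting can pick up
Restart=no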

Actions #15

Updated by livdywan 2 months ago

  • Status changed from Feedback to Resolved

Let's see if the fix also works.

Sep 17 02:47:20 worker33 [RPM][26815]: erase dpdk-kmp-default-19.11.10_k5.14.21_150500.55.68-150500.5.9.1.x86_64: success

Looks like it worked.

Actions #16

Updated by livdywan 2 months ago · Edited

  • Status changed from Resolved to In Progress

Actually there is still something wrong with diesel. Investigating:

Sep 17 02:04:08 diesel auto-update[49031]: /usr/local/bin/auto-update: line 6: /usr/bin/zypper: No such file or directory
Actions #17

Updated by livdywan 2 months ago · Edited

sudo rpm -e libsolv-tools
sudo rpm -e libsolv-tools-base
sudo rpm -i --force /var/cache/zypp/packages/repo-oss/ppc64le/libsolv-tools-0.7.19-6.1.ppc64le.rpm
sudo rpm -i /var/cache/zypp/packages/repo-oss/ppc64le/libzypp-17.27.0-12.1.ppc64le.rpm
warning: /etc/zypp/zypp.conf created as /etc/zypp/zypp.conf.rpmnew
sudo rpm -i /var/cache/zypp/packages/repo-oss/ppc64le/zypper-1.14.46-13.1.ppc64le.rpm

Unfortunately re-installing zypper is not enough as the package versions will conflict again on the next attempt:

Problem: the to be installed libzypp-17.35.8-150500.6.13.1.ppc64le conflicts with 'python2-zypp-plugin < 0.6.4' provided by the installed python2-zypp-plugin-0.6.3-2.18.noarch
 Solution 1: deinstallation of libzypp-17.27.0-12.1.ppc64le
 Solution 2: deinstallation of python2-zypp-plugin-0.6.3-2.18.noarch
 Solution 3: do not install patch:openSUSE-SLE-15.5-2024-3178-1.noarch
[...]
The following 2 packages are going to be REMOVED:
  libzypp zypper

The following package is going to be upgraded:
  libsolv-tools
Actions #18

Updated by livdywan 2 months ago

sudo zypper rm python2-zypp-plugin

Apparently this was a left-over. Other systems already use python3-zypp-plugin instead.

Actions #19

Updated by livdywan 2 months ago

  • Status changed from In Progress to Resolved

With this we should be good here again.

Actions #20

Updated by livdywan 2 months ago

livdywan wrote in #note-19:

With this we should be good here again.

Correction. I should have restarted the service. I only ran the script directly to verify it.
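
Roughly what was missing (a sketch; unit name as shown in #note-11):

sudo systemctl restart auto-update.service
systemctl is-failed auto-update.service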

Actions #21

Updated by livdywan 2 months ago · Edited

  • Status changed from Resolved to In Progress

logrotate failing on worker36:

Sep 18 08:11:22 worker36 logrotate[19188]: error: destination /var/log/zypper.log-20240918.xz already exists, skipping rotation

auto-update failing on worker36:

Sep 18 08:11:23 worker36 auto-update[19156]: Repository 'Server Monitoring Software' is invalid.
Sep 18 08:11:23 worker36 auto-update[19156]: error]
Sep 18 08:11:23 worker36 auto-update[19156]: Do you want to reject the key, trust temporarily, or trust always? [r/t/a/?] (r): r
[...]
Problem: 1: the installed gio-branding-openSUSE-42.1-lp156.8.3.noarch requires 'libgio-2_0-0 = 2.78.3', but this requirement cannot be provided
 Solution 1: deinstallation of gio-branding-openSUSE-42.1-lp156.8.3.noarch
 Solution 2: do not install patch:openSUSE-SLE-15.6-2024-1950-1.noarch
 Solution 3: break gio-branding-openSUSE-42.1-lp156.8.3.noarch by ignoring some of its dependencies
Actions #22

Updated by okurz 2 months ago

That's the machine I am working on, see #139103. I see that zypper is locked. I assume it's you.
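
For reference, checking who holds the zypper lock (a sketch; libzypp keeps its lock in a pid file, /run/zypp.pid on current systems):

cat /run/zypp.pid
ps -fp "$(cat /run/zypp.pid)"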

Actions #23

Updated by livdywan 2 months ago

  • Status changed from In Progress to Feedback
Actions #24

Updated by livdywan about 2 months ago

  • Status changed from Feedback to Resolved

livdywan wrote in #note-23:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1269

Discussed and merged. No other issues seen.
