Project

General

Profile

Actions

action #96551

closed

coordination #96447: [epic] Failed systemd services and job age alerts

Persistent records of systemd journal size:S

Added by livdywan over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Target version:
Start date:
2021-10-22
Due date:
% Done:

0%

Estimated time:

Description

Observation

When alerts are triggered there's a brief window where the journal can be checked but we don't seem to retain this long/ at all.

Acceptance criteria

  • AC1: systemd journal is kept for 1 week

Suggestion


Related issues 4 (1 open3 closed)

Related to openQA Project - action #101379: Reduce amount of unhelpful log messages at debug levelResolvedkraih2021-10-22

Actions
Related to openQA Infrastructure - action #104967: File systems alert repeatedly triggering on and offResolvedmkittler2022-01-172022-02-02

Actions
Copied to openQA Infrastructure - action #96552: Persistent records of I/O usage by process size:MWorkable2021-08-04

Actions
Copied to openQA Project - action #105064: Reduce verbosity of openQA logging to improve performance and reduce storage requirements size:MResolvedokurz2021-10-22

Actions
Actions #1

Updated by okurz over 3 years ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz
Actions #2

Updated by livdywan over 3 years ago

  • Parent task set to #96447
Actions #3

Updated by okurz over 3 years ago

  • Due date set to 2021-08-18
Actions #4

Updated by livdywan over 3 years ago

  • Copied to action #96552: Persistent records of I/O usage by process size:M added
Actions #5

Updated by okurz over 3 years ago

sudo journalctl --disk-usage says logs take up 2.9G but journalctl itself says

-- Logs begin at Sun 2021-07-18 03:30:00 CEST, end at Sun 2021-08-15 22:23:10 CEST. --
Jul 18 03:30:00 openqa sudo[25770]: pam_unix(sudo:session): session closed for user root
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Default.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Basic System.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Timers.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Sockets.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Paths.
Jul 18 03:30:02 openqa systemd[25237]: Closed D-Bus User Message Bus Socket.
Jul 18 03:30:02 openqa systemd[25237]: Reached target Shutdown.
Jul 18 03:30:02 openqa systemd[25237]: Starting Exit the Session...
Jul 18 03:30:02 openqa systemd[25237]: Received SIGRTMIN+24 from PID 2783 (kill).
Jul 18 03:30:02 openqa systemd[25238]: pam_unix(systemd-user:session): session closed for user asmorodskyi
-- Reboot --
Aug 15 03:30:00 openqa systemd[29348]: Stopped target Default.
Aug 15 03:30:00 openqa systemd[29348]: Stopped target Basic System.
Aug 15 03:30:00 openqa systemd[29348]: Stopped target Sockets.
Aug 15 03:30:00 openqa systemd[29348]: Closed D-Bus User Message Bus Socket.
Aug 15 03:30:00 openqa systemd[29348]: Stopped target Timers.
Aug 15 03:30:00 openqa systemd[29348]: Stopped target Paths.
Aug 15 03:30:00 openqa systemd[29348]: Reached target Shutdown.
Aug 15 03:30:00 openqa systemd[29348]: Starting Exit the Session...
Aug 15 03:30:01 openqa systemd[29348]: Received SIGRTMIN+24 from PID 6482 (kill).
Aug 15 03:30:01 openqa systemd[29354]: pam_unix(systemd-user:session): session closed for user cdywan
Aug 15 03:30:01 openqa systemd[3598]: Stopped target Default.
Aug 15 03:30:01 openqa systemd[3598]: Stopped target Basic System.
Aug 15 03:30:01 openqa systemd[3598]: Stopped target Paths.
Aug 15 03:30:01 openqa systemd[3598]: Stopped target Sockets.
Aug 15 03:30:01 openqa systemd[3598]: Closed D-Bus User Message Bus Socket.
Aug 15 03:30:01 openqa systemd[3598]: Reached target Shutdown.
Aug 15 03:30:01 openqa systemd[3598]: Starting Exit the Session...
Aug 15 03:30:01 openqa systemd[3598]: Stopped target Timers.
Aug 15 03:30:01 openqa systemd[3598]: Received SIGRTMIN+24 from PID 6513 (kill).
Aug 15 03:30:01 openqa systemd[17836]: Stopped target Default.
Aug 15 03:30:01 openqa systemd[17836]: Stopped target Basic System.
Aug 15 03:30:01 openqa systemd[17836]: Stopped target Timers.
Aug 15 03:30:01 openqa systemd[17836]: Stopped target Sockets.
Aug 15 03:30:01 openqa systemd[17836]: Closed D-Bus User Message Bus Socket.
Aug 15 03:30:01 openqa systemd[17836]: Stopped target Paths.
Aug 15 03:30:01 openqa systemd[17836]: Reached target Shutdown.
Aug 15 03:30:01 openqa systemd[17836]: Starting Exit the Session...
Aug 15 03:30:01 openqa systemd[17836]: Received SIGRTMIN+24 from PID 6522 (kill).
-- Reboot --
Aug 15 18:00:01 openqa systemd[17066]: Reached target Timers.
Aug 15 18:00:01 openqa systemd[17066]: Starting D-Bus User Message Bus Socket.
Aug 15 18:00:01 openqa systemd[17066]: Reached target Paths.
Aug 15 18:00:01 openqa systemd[17066]: Listening on D-Bus User Message Bus Socket.
Aug 15 18:00:01 openqa systemd[17066]: Reached target Sockets.
Aug 15 18:00:01 openqa systemd[17066]: Reached target Basic System.
Aug 15 18:00:01 openqa systemd[17066]: Reached target Default.
Aug 15 18:00:01 openqa systemd[17066]: Startup finished in 137ms.
Aug 15 18:00:13 openqa systemd[17066]: Stopped target Default.
Aug 15 18:00:13 openqa systemd[17066]: Stopped target Basic System.
Aug 15 18:00:13 openqa systemd[17066]: Stopped target Timers.
Aug 15 18:00:13 openqa systemd[17066]: Stopped target Sockets.
Aug 15 18:00:13 openqa systemd[17066]: Closed D-Bus User Message Bus Socket.
Aug 15 18:00:13 openqa systemd[17066]: Stopped target Paths.
Aug 15 18:00:13 openqa systemd[17066]: Reached target Shutdown.
Aug 15 18:00:13 openqa systemd[17066]: Starting Exit the Session...
Aug 15 18:00:13 openqa systemd[17066]: Received SIGRTMIN+24 from PID 17847 (kill).
Aug 15 18:00:13 openqa systemd[17080]: pam_unix(systemd-user:session): session closed for user cvdupdate
Aug 15 18:03:27 openqa openqa-websockets-daemon[2219]:       "HDDSIZEGB" => 40,
Aug 15 18:03:27 openqa openqa-websockets-daemon[2219]:       "INCIDENT_ID" => 18273,
…

so there are some logs back to 2021-07-18 but then only since the last reboot. Also the entries look weird with the reboot at 03:30 as planned but then another line of "-- Reboot --" before "Aug 15 18:00:01" but uptime says that the machine is up since 18h:53m, that is 03:30 localtime.

Maybe the problem is openqa-webui which spams the logs. On o3 the logs go back way further and /var/log/openqa is configured to take the openQA logs, on osd it's not the case. Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/549

Actions #6

Updated by okurz over 3 years ago

merged the MR. Let's see if it helps to keep a longer system journal.

Actions #7

Updated by okurz over 3 years ago

  • Due date changed from 2021-08-18 to 2021-08-20

/var/log/openqa has logs. The system journal still has a lot of information, e.g. from openqa-websockets-daemon, etc. Maybe I would need to restart these services as well? I will need to check on the next day if the journal persists now a bit longer. I don't think there is a better way than to simply wait hence bumping due-date.

Actions #8

Updated by okurz over 3 years ago

  • Due date changed from 2021-08-20 to 2021-09-03
  • Priority changed from Normal to Low

Again there is not much left in the journal:

-- Logs begin at Sun 2021-07-18 03:30:00 CEST, end at Thu 2021-08-19 07:38:01 CEST. --
Jul 18 03:30:00 openqa sudo[25770]: pam_unix(sudo:session): session closed for user root
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Default.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Basic System.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Timers.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Sockets.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Paths.
Jul 18 03:30:02 openqa systemd[25237]: Closed D-Bus User Message Bus Socket.
Jul 18 03:30:02 openqa systemd[25237]: Reached target Shutdown.
Jul 18 03:30:02 openqa systemd[25237]: Starting Exit the Session...
Jul 18 03:30:02 openqa systemd[25237]: Received SIGRTMIN+24 from PID 2783 (kill).
Jul 18 03:30:02 openqa systemd[25238]: pam_unix(systemd-user:session): session closed for user asmorodskyi
-- Reboot --
Aug 19 00:34:01 openqa systemd[1]: Started Session 11796 of user geekotest.
Aug 19 00:34:01 openqa cron[8641]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Aug 19 00:34:04 openqa systemd-journald[23984]: Suppressed 5660 messages from /system.slice/openqa-websockets.service
Aug 19 00:34:11 openqa CRON[8641]: pam_unix(crond:session): session closed for user geekotest

/var/log/openqa is right now 137M but now the problem seems to be mainly openqa-websockets. systemctl status systemd-journald shows

Aug 19 07:35:30 openqa systemd-journald[14984]: Suppressed 2269 messages from /system.slice/openqa-websockets.service
Aug 19 07:36:00 openqa systemd-journald[14984]: Suppressed 2717 messages from /system.slice/openqa-websockets.service

so I restarted openqa-websockets as well. Looks like now the messages end up in /var/log/openqa as well.

I am not sure if the settings controlled by salt are effective now so to be sure I additionally set the settings for now with vim /etc/systemd/journald.conf and did systemctl restart systemd-journald. Now systemctl status systemd-journald shows System journal (/var/log/journal/84f4f4f356b525388b60f0ae547597e0) is 2.4G, max 23.7G, 21.2G free. But I removed the settings again, only keeping in /etc/systemd/journald.conf.d/journal_size.conf and after systemctl restart systemd-journald the command systemctl status systemd-journald shows System journal (/var/log/journal/84f4f4f356b525388b60f0ae547597e0) is 2.4G, max 25.1G, 22.7G free. so that should be fine.
Let's see how this changes over the next days. journalctl --disk-usage says Archived and active journals take up 2.4G in the file system.

As the original problem about I/O alerts seem to be less severe by now I will follow up with this with low prio.

Actions #9

Updated by okurz about 3 years ago

  • Due date changed from 2021-09-03 to 2021-10-08

As of now the latest in systemctl status systemd-journald is still

Aug 20 11:59:21 openqa systemd-journald[8460]: System journal (/var/log/journal/84f4f4f356b525388b60f0ae547597e0) is 2.0G, max 22.5G, 20.4G free.

so looks good. journalctl --disk-usage reports 1.3G. journalctl goes back to Aug 19. I want to wait over the next automatic reboots if this still works, then revert the openQA log target and check again.

Actions #10

Updated by okurz about 3 years ago

osd was rebooted 4 days ago. journal size is reported to take 1.2G, systemctl status systemd-journald says "Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.", not that helpful. journalctl says

-- Logs begin at Sun 2021-07-18 03:30:00 CEST, end at Thu 2021-09-02 22:24:00 CEST. --
Jul 18 03:30:00 openqa sudo[25770]: pam_unix(sudo:session): session closed for user root
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Default.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Basic System.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Timers.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Sockets.
Jul 18 03:30:02 openqa systemd[25237]: Stopped target Paths.
Jul 18 03:30:02 openqa systemd[25237]: Closed D-Bus User Message Bus Socket.
Jul 18 03:30:02 openqa systemd[25237]: Reached target Shutdown.
Jul 18 03:30:02 openqa systemd[25237]: Starting Exit the Session...
Jul 18 03:30:02 openqa systemd[25237]: Received SIGRTMIN+24 from PID 2783 (kill).
Jul 18 03:30:02 openqa systemd[25238]: pam_unix(systemd-user:session): session closed for user asmorodskyi
-- Reboot --
Aug 31 09:39:57 openqa systemd[6377]: Reached target Timers.
…

so there are still some entries from July 18 but then only after August 31 though the last reboot was August 29? Restarted "systemd-journald" again and status says "System journal (/var/log/journal/84f4f4f356b525388b60f0ae547597e0) is 1.2G, max 21.1G, 19.8G free"

So it does not look like the 20G would really be used.

Regardless of this I see that vsftpd plays a big part in filling the logs, e.g. entry blocks like

Sep 02 00:18:46 openqa vsftpd[18289]: [ftp] OK DOWNLOAD: Client "::ffff:10.160.0.245", "/SLE-15-SP4-Module-Basesystem-POOL-aarch64-Build31.2-Media1/aarch64/dhcp-client-4.3.6.P1-6.11.1.aarch64.rpm", 747476 bytes, 34388.14Kbyte/sec
Sep 02 00:18:46 openqa vsftpd[18289]: [ftp] OK DOWNLOAD: Client "::ffff:10.160.0.245", "/SLE-15-SP4-Module-Basesystem-POOL-aarch64-Build31.2-Media1/aarch64/dracut-049.1+suse.203.g8ee14a90-3.35.1.aarch64.rpm", 442592 bytes, 54669.71Kbyte/sec
Sep 02 00:18:46 openqa vsftpd[18289]: [ftp] OK DOWNLOAD: Client "::ffff:10.160.0.245", "/SLE-15-SP4-Module-Basesystem-POOL-aarch64-Build31.2-Media1/aarch64/util-linux-systemd-2.36.2-2.1.aarch64.rpm", 166636 bytes, 74991.00Kbyte/sec
Sep 02 00:18:46 openqa vsftpd[18289]: [ftp] OK DOWNLOAD: Client "::ffff:10.160.0.245", "/SLE-15-SP4-Module-Basesystem-POOL-aarch64-Build31.2-Media1/aarch64/postfix-3.5.9-5.3.3.aarch64.rpm", 1164512 bytes, 76068.14Kbyte/sec
Sep 02 00:18:46 openqa vsftpd[18289]: [ftp] OK DOWNLOAD: Client "::ffff:10.160.0.245", "/SLE-15-SP4-Module-Basesystem-POOL-aarch64-Build31.2-Media1/aarch64/device-mapper-1.02.163-8.33.1.aarch64.rpm", 142104 bytes, 30634.31Kbyte/sec
Sep 02 00:18:46 openqa vsftpd[18289]: [ftp] OK DOWNLOAD: Client "::ffff:10.160.0.245", "/SLE-15-SP4-Module-Basesystem-POOL-aarch64-Build31.2-Media1/aarch64/mdadm-4.1-24.3.1.aarch64.rpm", 473004 bytes, 24733.24Kbyte/sec
Sep 02 00:18:46 openqa vsftpd[18289]: [ftp] OK DOWNLOAD: Client "::ffff:10.160.0.245", "/SLE-15-SP4-Module-Basesystem-POOL-aarch64-Build31.2-Media1/aarch64/liblvm2cmd2_03-2.03.05-8.33.1.aarch64.rpm", 740500 bytes, 47000.16Kbyte/sec
Sep 02 00:18:46 openqa vsftpd[18289]: [ftp] OK DOWNLOAD: Client "::ffff:10.160.0.245", "/SLE-15-SP4-Module-Basesystem-POOL-aarch64-Build31.2-Media1/aarch64/kpartx-0.8.6+10+suse.4771137-1.1.aarch64.rpm", 98888 bytes, 49778.51Kbyte/sec
Sep 02 00:18:47 openqa vsftpd[18289]: [ftp] OK DOWNLOAD: Client "::ffff:10.160.0.245", "/SLE-15-SP4-Module-Basesystem-POOL-aarch64-Build31.2-Media1/aarch64/lvm2-2.03.05-8.33.1.aarch64.rpm", 1227116 bytes, 32950.82Kbyte/sec
Sep 02 00:18:47 openqa kernel: vsftpd: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
Sep 02 00:18:47 openqa kernel: CPU: 4 PID: 19545 Comm: vsftpd Not tainted 5.3.18-lp152.87-default #1 openSUSE Leap 15.2
Sep 02 00:18:47 openqa kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Sep 02 00:18:47 openqa kernel: Call Trace:
Sep 02 00:18:47 openqa kernel:  dump_stack+0x66/0x8b
Sep 02 00:18:47 openqa kernel:  warn_alloc+0x10b/0x190
Sep 02 00:18:47 openqa kernel:  ? __alloc_pages_direct_compact+0x155/0x160
Sep 02 00:18:47 openqa kernel:  __alloc_pages_slowpath+0xcc3/0xd00
Sep 02 00:18:47 openqa kernel:  __alloc_pages_nodemask+0x2c4/0x300
Sep 02 00:18:47 openqa kernel:  kmalloc_order+0x18/0x70
Sep 02 00:18:47 openqa kernel:  kmalloc_order_trace+0x1d/0xa0
Sep 02 00:18:47 openqa kernel:  __kmalloc+0x25f/0x2c0
Sep 02 00:18:47 openqa kernel:  ? _cond_resched+0x15/0x40
Sep 02 00:18:47 openqa kernel:  nfsd_reply_cache_init+0xb2/0x140 [nfsd]
Sep 02 00:18:47 openqa kernel:  nfsd_init_net+0x5d/0x100 [nfsd]
Sep 02 00:18:47 openqa kernel:  ops_init+0x40/0x110
Sep 02 00:18:47 openqa kernel:  setup_net+0xd2/0x210
Sep 02 00:18:47 openqa kernel:  copy_net_ns+0xf5/0x230
Sep 02 00:18:47 openqa kernel:  create_new_namespaces+0x124/0x2a0
Sep 02 00:18:47 openqa kernel:  copy_namespaces+0x7e/0xd0
Sep 02 00:18:47 openqa kernel:  copy_process+0xccb/0x1b10
Sep 02 00:18:47 openqa kernel:  ? __alloc_file+0x8d/0xf0
Sep 02 00:18:47 openqa kernel:  _do_fork+0x74/0x3b0
Sep 02 00:18:47 openqa kernel:  ? do_sigaction+0x1c9/0x250
Sep 02 00:18:47 openqa kernel:  __x64_sys_clone+0x8f/0xc0
Sep 02 00:18:47 openqa kernel:  do_syscall_64+0x65/0x1f0
Sep 02 00:18:47 openqa kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 02 00:18:47 openqa kernel: RIP: 0033:0x7f01a3f55759

showing both a lot of download status messages as well as kernel reports about "vsftpd: page allocation failure" and kernel stack traces. Couldn't find anything useful for the kernel stack trace in a web search.

Maybe making the vsftpd logs less noisy can help?: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/571

Actions #11

Updated by okurz about 3 years ago

Taking look right now I see that the journal goes back only 3 days so again and still a problem. journalctl -u systemd-journald says

Sep 23 20:14:26 openqa systemd-journald[23151]: System journal (/var/log/journal/84f4f4f356b525388b60f0ae547597e0) is 1.2G, max 19.4G, 18.1G free.

so we do not use the available space at all. What's going on? I don't know. I suspected that maybe some other process using the same filesystem, e.g. postgreSQL, would use more space temporarily so that journalctl erroneously deletes too much but according to the man page of systemd-journald it does not behave like that and also looking at https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=74&orgId=1&from=now-7d&to=now I don't see that we would have peaks in bigger space usage.

There is rotation, the retention is configured to store max 80G. Hm, maybe the many user sessions cause the max files number in combination with size per rotated archive file to exceed?

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/589

Actions #12

Updated by okurz about 3 years ago

no review was conducted :,( merged now.

Actions #13

Updated by okurz about 3 years ago

All of the above still does not seem to work. Now the journal got truncated again to just 2 days worth of history. Trying to increase number of rotated files now: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/598

Also temporarily I disabled logging of vsftpd to the system journal with syslog_enable=NO in /etc/vsftpd.conf

Actions #14

Updated by livdywan about 3 years ago

okurz wrote:

All of the above still does not seem to work. Now the journal got truncated again to just 2 days worth of history. Trying to increase number of rotated files now: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/598

Also temporarily I disabled logging of vsftpd to the system journal with syslog_enable=NO in /etc/vsftpd.conf

> sudo journalctl --disk-usage
Archived and active journals take up 1.8G in the file system.

Also

> sudo journalctl
-- Logs begin at Tue 2021-09-28 01:30:09 CEST, end at Thu 2021-10-07 15:18:04 CEST. --

So it seems like you achieved something here?

Actions #15

Updated by okurz about 3 years ago

  • Due date changed from 2021-10-08 to 2021-11-05

yes, that part seems to work. I would like to wait further until we see at least 2GiB of used space exceeded. Then we can slowly move back services from extra log files to systemd-journal

Actions #16

Updated by kraih about 3 years ago

  • Related to action #101379: Reduce amount of unhelpful log messages at debug level added
Actions #17

Updated by okurz about 3 years ago

I reported https://bugzilla.opensuse.org/show_bug.cgi?id=1192748 about the kernel stack trace from vsftpd.

Next to that I see many lines like

Nov 16 12:57:01 openqa cron[29088]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Nov 16 12:57:01 openqa cron[29089]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Nov 16 12:57:02 openqa CRON[29088]: pam_unix(crond:session): session closed for user geekotest
Nov 16 12:57:09 openqa CRON[29089]: pam_unix(crond:session): session closed for user geekotest

a solution might be in
https://languor.us/cron-pam-unix-cron-session-session-opened-closed-user-root-uid0
but I have not followed up with this so far.

So currently journalctl goes back until 2021-10-11 so more than a month. That's good.

systemctl restart systemd-journald ; systemctl status systemd-journald says "System journal (/var/log/journal/84f4f4f356b525388b60f0ae547597e0) is 1.5G, max 46.7G, 45.1G free." so it's not using that much space. But I also wonder if it shouldn't use more space by now.

We previously had a daily logrotation for openQA logs, now we are currently using hourly log rotation with up to 20 archives stored meaning we are only storing openQA logs up to 21h, not longer.

Proposing to re-enable openQA logs in system journal with
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/613

Actions #18

Updated by okurz almost 3 years ago

MR merged and removed file based config manually. Monitoring for some time to ensure keep logs long enough.

Actions #19

Updated by okurz almost 3 years ago

  • Related to action #104967: File systems alert repeatedly triggering on and off added
Actions #20

Updated by okurz almost 3 years ago

Created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/642 for #104967

sudo journalctl -f right now still shows a lot of verbose messages, e.g. about artifact storing:

Jan 19 11:20:28 openqa openqa[4236]: [debug] [pid:4236] Accounting size of screenshot /var/lib/openqa/images/34b/bcf/02553a852f8f5cc7b5805c8d7d.png for job 8000019: 6259
Jan 19 11:20:28 openqa openqa[4236]: [debug] [pid:4236] Stored image: /var/lib/openqa/images/34b/bcf/02553a852f8f5cc7b5805c8d7d.png
Jan 19 11:20:28 openqa openqa[21928]: [debug] [pid:21928] Accounting size of screenshot /var/lib/openqa/images/a28/9d1/cd77fb71a6ee38c644a7fa5ada.png for job 7999899: 61354
Jan 19 11:20:28 openqa openqa[21928]: [debug] [pid:21928] Stored image: /var/lib/openqa/images/a28/9d1/cd77fb71a6ee38c644a7fa5ada.png
Jan 19 11:20:28 openqa openqa[21928]: [debug] [pid:21928] Accounting size of screenshot /var/lib/openqa/images/34b/bcf/.thumbs/02553a852f8f5cc7b5805c8d7d.png for job 8000019: 525
Jan 19 11:20:28 openqa openqa[22376]: [debug] [pid:22376] Unrecognized referer 'openqa.suse.de'
Jan 19 11:20:28 openqa openqa[22698]: [debug] [pid:22698] Accounting size of screenshot /var/lib/openqa/images/a28/9d1/.thumbs/cd77fb71a6ee38c644a7fa5ada.png for job 7999899: 1015
Jan 19 11:20:28 openqa openqa[12980]: [debug] Received JobToken: vLSXXFH0K68_stkU
Jan 19 11:20:28 openqa openqa[12980]: [debug] Found associated job 7999509
Jan 19 11:20:28 openqa openqa-websockets-daemon[27231]: [debug] [pid:27231] Updating seen of worker 529 from worker_status (working)
Jan 19 11:20:28 openqa openqa[22698]: [debug] [pid:22698] Accounting size of screenshot /var/lib/openqa/images/161/32d/ee61a2006d004c9f22f72b9dca.png for job 8000019: 8010
Jan 19 11:20:28 openqa openqa[22698]: [debug] [pid:22698] Stored image: /var/lib/openqa/images/161/32d/ee61a2006d004c9f22f72b9dca.png
Jan 19 11:20:28 openqa openqa[22698]: [debug] [pid:22698] Accounting size of screenshot /var/lib/openqa/images/161/32d/.thumbs/ee61a2006d004c9f22f72b9dca.png for job 8000019: 580
Jan 19 11:20:28 openqa openqa[12980]: [debug] [pid:12980] Accounting size of screenshot /var/lib/openqa/images/b51/d58/d6bbc56c66ae429982af5817a8.png for job 8000019: 6662
Jan 19 11:20:28 openqa openqa[12980]: [debug] [pid:12980] Stored image: /var/lib/openqa/images/b51/d58/d6bbc56c66ae429982af5817a8.png
Jan 19 11:20:28 openqa openqa[22698]: [debug] [pid:22698] Accounting size of screenshot /var/lib/openqa/images/b51/d58/.thumbs/d6bbc56c66ae429982af5817a8.png for job 8000019: 549
Jan 19 11:20:28 openqa openqa[22698]: [debug] [pid:22698] Accounting size of artefact /var/lib/openqa/testresults/08000/08000019-sle-15-SP4-JeOS-for-VMware-x86_64-jeos-main:investigate:retry@svirt-vmware65/diskusage-4.txt for job 8000019: 77
Jan 19 11:20:28 openqa openqa[22698]: [debug] [pid:22698] Created artefact: /var/lib/openqa/testresults/08000/08000019-sle-15-SP4-JeOS-for-VMware-x86_64-jeos-main:investigate:retry@svirt-vmware65/diskusage-4.txt
Jan 19 11:20:28 openqa openqa[21928]: [debug] [pid:21928] Accounting size of artefact /var/lib/openqa/testresults/08000/08000019-sle-15-SP4-JeOS-for-VMware-x86_64-jeos-main:investigate:retry@svirt-vmware65/diskusage-11.txt for job 8000019: 104
Jan 19 11:20:28 openqa openqa[21928]: [debug] [pid:21928] Created artefact: /var/lib/openqa/testresults/08000/08000019-sle-15-SP4-JeOS-for-VMware-x86_64-jeos-main:investigate:retry@svirt-vmware65/diskusage-11.txt
Jan 19 11:20:28 openqa openqa[23731]: [debug] [pid:23731] Accounting size of artefact /var/lib/openqa/testresults/08000/08000019-sle-15-SP4-JeOS-for-VMware-x86_64-jeos-main:investigate:retry@svirt-vmware65/diskusage-12.txt for job 8000019: 11
Jan 19 11:20:28 openqa openqa[23731]: [debug] [pid:23731] Created artefact: /var/lib/openqa/testresults/08000/08000019-sle-15-SP4-JeOS-for-VMware-x86_64-jeos-main:investigate:retry@svirt-vmware65/diskusage-12.txt
Jan 19 11:20:28 openqa openqa[23731]: [debug] [pid:23731] Accounting size of artefact /var/lib/openqa/testresults/08000/08000019-sle-15-SP4-JeOS-for-VMware-x86_64-jeos-main:investigate:retry@svirt-vmware65/diskusage-8.txt for job 8000019: 60
Jan 19 11:20:28 openqa openqa[23731]: [debug] [pid:23731] Created artefact: /var/lib/openqa/testresults/08000/08000019-sle-15-SP4-JeOS-for-VMware-x86_64-jeos-main:investigate:retry@svirt-vmware65/diskusage-8.txt
Jan 19 11:20:29 openqa openqa[23731]: [debug] [pid:23731] Accounting size of artefact /var/lib/openqa/testresults/08000/08000019-sle-15-SP4-JeOS-for-VMware-x86_64-jeos-main:investigate:retry@svirt-vmware65/diskusage-5.txt for job 8000019: 10
Jan 19 11:20:29 openqa openqa[23731]: [debug] [pid:23731] Created artefact: /var/lib/openqa/testresults/08000/08000019-sle-15-SP4-JeOS-for-VMware-x86_64-jeos-main:investigate:retry@svirt-vmware65/diskusage-5.txt
Jan 19 11:20:29 openqa openqa[25937]: [debug] Received JobToken: guptAy10C3aorn0v
Jan 19 11:20:29 openqa openqa[25937]: [debug] Found associated job 8000149
  • Find more too verbose messages and move them to trace -> #105064
  • Revisit compression settings for systemd journal. File based logging seemed to be more efficient
  • Distributed logging to save where we have more space. Maybe just the archiving or logging for historical data to different storage partitions or different computers
  • Move logs or journal to different filesystem
  • Ask for more space for that partition
Actions #21

Updated by okurz almost 3 years ago

  • Copied to action #105064: Reduce verbosity of openQA logging to improve performance and reduce storage requirements size:M added
Actions #22

Updated by mkittler almost 3 years ago

Your SR isn't effective because a GitLab issue prevented the deployment pipeline to run. I retried the pipeline now: https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines/301693

(There's already #105145 about the GitLab issue.)

Actions #23

Updated by mkittler almost 3 years ago

The change is deployed but still has no effect. I tried systemctl restart systemd-journald, maybe that helps.

Actions #24

Updated by mkittler almost 3 years ago

The change is now actually effective, see #104967#note-11. Since we have #105064 we can likely resolve this ticket.

Actions #25

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Resolved

Agreed. Logging period is not forever but certainly longer than a week hence AC1 fulfilled

Actions #26

Updated by mkittler almost 3 years ago

  • Status changed from Resolved to Feedback

There was another filesystem alert. I've been vaccuuming the journal again on OSD. Likely I should have restarted the journal service last time I did that so the free disk space is actually recognized and SystemKeepFree=20% has an effect. When we see this alert again restarting didn't help and we need to take care of it, possibly as part of https://progress.opensuse.org/issues/96551.

Actions #27

Updated by mkittler almost 3 years ago

Now we're somehow only at 46 %, see https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=74. Maybe reducing the logging verbosity helped that much?

Actions #28

Updated by livdywan almost 3 years ago

mkittler wrote:

Now we're somehow only at 46 %, see https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=74. Maybe reducing the logging verbosity helped that much?

Clarified in the Weekly. Space was doubled as well 😉️

Actions #29

Updated by okurz almost 3 years ago

Correct. Currently on OSD sudo journalctl -u openqa-webui goes back to 2022-01-22 so only 7 days as of today which I think is not that much. I would like to wait a bit longer to see how we make use of the more available space and if it's enough to store the system journal for a little bit more than just a straight week.

Actions #30

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Resolved

It's effective on OSD. We still have some rather verbose messages but the journal still goes back to 2022-01-23 so I guess we are ok for now.

Actions

Also available in: Atom PDF