action #135482

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

Move to systemd journal only on o3+osd (was: Missing openqa_websockets log file on OSD for websocket server) size:M

Added by kraih over 1 year ago. Updated about 1 year ago.

Status: Rejected
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2023-09-11
Due date:
% Done: 0%
Estimated time:

Description

Motivation

While investigating #135122 we noticed that there is currently no log file for the websocket server, even though one exists for each of the other openQA services on the OSD webui (/var/log/openqa, /var/log/openqa_gru, /var/log/openqa_scheduler). This gets in the way because we can't simply add new log messages to the websocket server to help with debugging: searching the journal for specific log messages is almost impossible since it is too slow.

Acceptance criteria

  • AC1: We use the systemd journal only for o3+osd, or we know why we don't

Suggestions

  • Optional: Research in old tickets why we chose explicit log files over just trusting the systemd journal
  • Since openQA and all related tooling already just use the systemd service by default, we shouldn't need to implement anything in upstream openQA itself; just change the config accordingly
  • Just disable log files for all openQA-related services on o3 and see what happens
  • After a positive result, do the same for OSD
  • Ensure that all openQA-related services still run as expected
  • Ensure that our system journal shows results from all relevant openQA services for a sufficient amount of time, at least 7 days or so
  • Look into how logwarn can access the journal: either just configure journald to write to a log file and point logwarn to that, or, if that is too much effort, create a dedicated ticket
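
The "at least 7 days" check from the suggestions could be sketched roughly as follows. This is only an illustration, not something from the ticket: the function names are hypothetical, and it assumes the input is the first line printed by `journalctl -u <unit> -o json --no-pager`, which lists the oldest entry first by default. It relies on the `__REALTIME_TIMESTAMP` field of journal JSON output (microseconds since the Unix epoch, as a decimal string).

```python
import json
from datetime import datetime, timedelta, timezone


def oldest_entry_age(json_line, now=None):
    """Return how far back the given (oldest) journal entry lies.

    json_line is assumed to be one line of `journalctl -o json` output.
    """
    entry = json.loads(json_line)
    # __REALTIME_TIMESTAMP: microseconds since the Unix epoch, as a string
    micros = int(entry["__REALTIME_TIMESTAMP"])
    oldest = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - oldest


def covers_retention(json_line, min_days=7, now=None):
    """True if the oldest entry is at least min_days old."""
    return oldest_entry_age(json_line, now) >= timedelta(days=min_days)
```

A unit that only started logging recently would fail this check even when journald retention is fine, so a negative result needs a second look before concluding that the journal is too short.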

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #137813: [alert] Failed systemd services - qamaster - logrotate fails on /var/log/messages with "/usr/bin/xz: (stdin): Read error: Input/output error" size:S (Resolved, jbaier_cz, 2023-10-07)

Actions #1

Updated by kraih over 1 year ago

  • Description updated (diff)
Actions #2

Updated by kraih over 1 year ago

  • Description updated (diff)
Actions #3

Updated by kraih over 1 year ago

  • Description updated (diff)
Actions #4

Updated by okurz over 1 year ago

  • Category set to Feature requests
  • Assignee set to okurz

There is journalctl -u openqa-websockets. That should be enough, shouldn't it?

Actions #5

Updated by tinita over 1 year ago

okurz wrote in #note-4:

There is journalctl -u openqa-websockets. That should be enough, shouldn't it?

I'm sure I had checked this but didn't see the log messages. Now I can see them, so everything is OK. I saved the current journal, which starts at Sep 4, so we have the historical data to compare the number of worker status updates to the occurrence of the problem.

Actions #6

Updated by kraih over 1 year ago

Actually, I wasn't aware that the journalctl -g ... grep option was fast enough for us to use with OSD, but even for openqa-webui it seems to work fine. So this ticket can probably be rejected.
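
For reference, a minimal sketch of how such a search could be assembled programmatically. The helper name and the example pattern are hypothetical; the flags used (`-u`, `-g`, `--since`, `--no-pager`) are standard journalctl options, though `-g`/`--grep` is only available when journalctl was built with PCRE2 support.

```python
import shlex


def journal_grep_command(unit, pattern, since=None):
    """Build a journalctl argv that greps one unit's messages."""
    argv = ["journalctl", "--no-pager", "-u", unit, "-g", pattern]
    if since is not None:
        # Limiting the time range keeps the search fast on large journals
        argv += ["--since", since]
    return argv


if __name__ == "__main__":
    # "worker status" is only a placeholder pattern for illustration
    cmd = journal_grep_command("openqa-websockets", "worker status",
                               since="2023-09-04")
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run` on a host where journalctl is available.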

Actions #7

Updated by okurz over 1 year ago

kraih wrote in #note-6:

Actually, I wasn't aware that the journalctl -g ... grep option was fast enough for us to use with OSD, but even for openqa-webui it seems to work fine. So this ticket can probably be rejected.

oh, nice! This is why I think it's a good idea to migrate more and more to systemd journal.

Actions #8

Updated by okurz over 1 year ago

  • Assignee deleted (okurz)

So I assume you are ok to accept the journal solution for now. I guess we can re-consider moving webui and gru also to systemd journal. Or at least research why we use separate log files.

Actions #9

Updated by kraih over 1 year ago

okurz wrote in #note-8:

So I assume you are ok to accept the journal solution for now. I guess we can re-consider moving webui and gru also to systemd journal. Or at least research why we use separate log files.

Yes, works for me.

Actions #10

Updated by okurz over 1 year ago

  • Subject changed from Missing openqa_websockets log file on OSD for websocket server to Move to systemd journal only (was: Missing openqa_websockets log file on OSD for websocket server)
  • Target version changed from Ready to future
Actions #11

Updated by okurz about 1 year ago

  • Target version changed from future to Tools - Next
Actions #12

Updated by okurz about 1 year ago

  • Related to action #137813: [alert] Failed systemd services - qamaster - logrotate fails on /var/log/messages with "/usr/bin/xz: (stdin): Read error: Input/output error" size:S added
Actions #13

Updated by okurz about 1 year ago

  • Target version changed from Tools - Next to Ready
Actions #14

Updated by okurz about 1 year ago

  • Tags changed from reactive work to reactive work, infra
Actions #15

Updated by okurz about 1 year ago

  • Subject changed from Move to systemd journal only (was: Missing openqa_websockets log file on OSD for websocket server) to Move to systemd journal only on o3+osd (was: Missing openqa_websockets log file on OSD for websocket server) size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #16

Updated by okurz about 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #17

Updated by okurz about 1 year ago

  • Status changed from In Progress to Rejected

I did a bit of research and found no best practices or suggestions for how we could easily integrate logwarn. I thought maybe we could just configure journald to write to a file, but then we would also need to ensure log rotation on that file. We could forward to a syslogger, but then we would have double the data, and as we already have quite big log files I don't think we should duplicate all the log messages. With that, I don't think it's worth doing and we should live with the mixture that we have. Please speak up if you think otherwise.
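
One alternative that avoids duplicating log data would be to scan the journal's JSON output directly instead of a file. The following is only a rough sketch under stated assumptions, not an evaluated replacement for logwarn: the function name is hypothetical, input lines are assumed to come from `journalctl -o json`, and entries without a `PRIORITY` field are treated as informational. Unlike logwarn, it does not remember where the previous run stopped; journalctl's `--cursor-file` option might serve that purpose.

```python
import json


def warnings_from_journal(lines, max_priority=4):
    """Collect (unit, message) pairs for entries at warning level or worse.

    Syslog priorities: 0 = emerg ... 4 = warning ... 7 = debug.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        try:
            # PRIORITY is a string in journal JSON; a missing field is
            # assumed to mean "info" (6) here
            priority = int(entry.get("PRIORITY", 6))
        except (TypeError, ValueError):
            continue
        if priority <= max_priority:
            hits.append((entry.get("_SYSTEMD_UNIT", "?"),
                         entry.get("MESSAGE", "")))
    return hits
```

A similar filter is available without any script via `journalctl -p warning`, so a helper like this would only pay off if the matches need further processing, such as sending alerts.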

Actions #18

Updated by tinita about 1 year ago

We could try out https://opensource.com/article/20/7/systemd-journals-email as a replacement for logwarn.
