action #135482
coordination #110833 (closed): [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert
Move to systemd journal only on o3+osd (was: Missing openqa_websockets log file on OSD for websocket server) size:M
Description
Motivation
While investigating #135122 we noticed that there is currently no log file for the websocket server, even though one exists for each of the other openQA services on the OSD webui (/var/log/openqa, /var/log/openqa_gru, /var/log/openqa_scheduler). This is getting in the way because adding new log messages to the websocket server to help with debugging is of little use if they only end up in the journal: searching the journal for specific log messages is almost impossible since it is too slow.
Acceptance criteria
- AC1: We use the systemd journal only for o3+osd, or we know why we don't
Suggestions
- Optional: Research in old tickets why we chose explicit log files over just trusting the systemd journal
- As openQA and all related tooling already log to the systemd journal by default, we shouldn't need to implement anything in upstream openQA itself; we only need to change the config accordingly
- Just disable log files for all openQA related services on o3 and see what happens
- After positive result do the same for OSD
- Ensure that all openQA related services still run as expected
- Ensure that our systemd journal retains entries from all relevant openQA services for a sufficient amount of time, at least about 7 days
- Look into how logwarn can access the journal: either configure journald to write to a log file and point logwarn to that, or, if that is too much effort, create a dedicated ticket
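The retention suggestion above can be checked per unit by reading the oldest journal entry and comparing its timestamp with "now". A minimal Python sketch, assuming journalctl is available and readable by the caller; the unit list and helper names are assumptions for illustration, not taken from the ticket.

```python
import json
import subprocess
import time
from typing import Optional

# Assumed unit names; adjust to the services that actually run on o3/OSD.
OPENQA_UNITS = ["openqa-webui", "openqa-websockets", "openqa-scheduler", "openqa-gru"]

MICROSECONDS_PER_DAY = 24 * 60 * 60 * 1_000_000


def realtime_us(json_line: str) -> int:
    """Return __REALTIME_TIMESTAMP (microseconds since the epoch) of one `journalctl -o json` line."""
    return int(json.loads(json_line)["__REALTIME_TIMESTAMP"])


def covers_days(oldest_us: int, now_us: int, days: int = 7) -> bool:
    """True if the oldest retained entry is at least `days` days old."""
    return now_us - oldest_us >= days * MICROSECONDS_PER_DAY


def oldest_entry(unit: str) -> Optional[str]:
    """Read only the first (oldest) JSON entry of a unit, then stop journalctl."""
    proc = subprocess.Popen(
        ["journalctl", "--unit", unit, "--output", "json", "--no-pager", "--quiet"],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        line = proc.stdout.readline()
    finally:
        proc.kill()
        proc.stdout.close()
        proc.wait()
    # Anything that is not a JSON object (e.g. empty output) means "no entries".
    return line if line.startswith("{") else None


def report(units=OPENQA_UNITS, days: int = 7) -> None:
    """Print per unit whether the journal still reaches back `days` days."""
    now_us = time.time_ns() // 1000
    for unit in units:
        line = oldest_entry(unit)
        if line is None:
            print(f"{unit}: no journal entries found")
        elif covers_days(realtime_us(line), now_us, days):
            print(f"{unit}: OK, at least {days} days retained")
        else:
            print(f"{unit}: less than {days} days retained")
```

On a host with journald, calling report() only reads the journal and changes nothing; reading system units usually requires root or membership in the systemd-journal group. The retention itself is governed by journald.conf settings such as SystemMaxUse= and MaxRetentionSec=; the sketch only verifies the outcome.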
Updated by tinita 3 months ago
okurz wrote in #note-4:
There is journalctl -u openqa-websockets. That should be enough, shouldn't it?
I'm sure I had checked this but didn't see the log messages. Now I can see them, so everything is OK. I saved the current journal, which starts on Sep 4, so we have the historical data to compare the number of worker status updates with the occurrence of the problem.
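A saved journal like this can be reduced to a per-day count of matching messages, which is easier to line up against the days the problem occurred. A minimal Python sketch, assuming the journal was exported with something like journalctl -u openqa-websockets -o json > websockets.json; STATUS_UPDATE_RE is a hypothetical placeholder, since the ticket does not name the log message that marks a worker status update.

```python
import json
import re
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable

# Hypothetical pattern: replace with the real worker status update message.
STATUS_UPDATE_RE = re.compile(r"worker status", re.IGNORECASE)


def count_per_day(lines: Iterable[str], pattern: re.Pattern) -> Counter:
    """Map 'YYYY-MM-DD' (UTC) to the number of entries whose MESSAGE matches `pattern`."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and non-JSON notices
        entry = json.loads(line)
        message = entry.get("MESSAGE")
        # MESSAGE is a list of byte values (or null) for binary payloads; skip those.
        if not isinstance(message, str) or not pattern.search(message):
            continue
        seconds = int(entry["__REALTIME_TIMESTAMP"]) / 1_000_000
        day = datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")
        counts[day] += 1
    return counts
```

Usage, with the hypothetical export file: open websockets.json and pass the file object to count_per_day(f, STATUS_UPDATE_RE).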
Updated by okurz 3 months ago
kraih wrote in #note-6:
Actually, I wasn't aware that the journalctl -g ... grep option was fast enough for us to use with OSD, but even for openqa-webui it seems to work fine. So this ticket can probably be rejected.
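The journalctl -g (--grep) option filters on the MESSAGE field with a PCRE2 regular expression, so it can be combined with -u to search a single service. A minimal Python wrapper, assuming journalctl was built with PCRE2 support (required for --grep) and the caller may read the system journal; the helper names are hypothetical.

```python
import json
import subprocess
from typing import List, Optional


def build_grep_cmd(unit: str, pattern: str, since: Optional[str] = None) -> List[str]:
    """Build the journalctl command line; --grep matches against the MESSAGE field."""
    cmd = ["journalctl", "--unit", unit, "--grep", pattern,
           "--output", "json", "--no-pager", "--quiet"]
    if since is not None:
        # Any systemd time specification works here, e.g. "2023-09-04" or "-1h".
        cmd += ["--since", since]
    return cmd


def messages_from_json(output: str) -> List[str]:
    """Extract MESSAGE strings from `journalctl -o json` output (one object per line)."""
    messages = []
    for line in output.splitlines():
        if not line.startswith("{"):
            continue  # ignore notices such as "-- No entries --"
        message = json.loads(line).get("MESSAGE")
        if isinstance(message, str):
            messages.append(message)
    return messages


def grep_unit(unit: str, pattern: str, since: Optional[str] = None) -> List[str]:
    """Run the search and return the matching messages."""
    # journalctl may exit non-zero when --grep matches nothing (like grep),
    # so the exit code is not treated as an error here.
    result = subprocess.run(build_grep_cmd(unit, pattern, since),
                            capture_output=True, text=True)
    return messages_from_json(result.stdout)
```

For example, grep_unit("openqa-websockets", "worker", since="-1h") returns the matching messages of the last hour as plain strings.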
Oh, nice! This is why I think it's a good idea to migrate more and more to the systemd journal.
Updated by okurz about 2 months ago
- Related to action #137813: [alert] Failed systemd services - qamaster - logrotate fails on /var/log/messages with "/usr/bin/xz: (stdin): Read error: Input/output error" size:S added
Updated by okurz about 2 months ago
- Target version changed from Tools - Next to Ready
Updated by okurz about 2 months ago
- Tags changed from reactive work to reactive work, infra
Updated by okurz about 2 months ago
- Subject changed from Move to systemd journal only (was: Missing openqa_websockets log file on OSD for websocket server) to Move to systemd journal only on o3+osd (was: Missing openqa_websockets log file on OSD for websocket server) size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 2 months ago
- Status changed from Workable to In Progress
- Assignee set to okurz
Updated by okurz about 2 months ago
- Status changed from In Progress to Rejected
I did a bit of research and found no good best practices or suggestions on how we could easily integrate logwarn. I thought maybe we could just configure journald to write to a file, but then we would also need to ensure log rotation for that file. We could forward to a syslog daemon, but then we would have double the data, and as we already have quite big log files I don't think we should duplicate all log messages. So I don't think it's worth doing, and we should live with the mixture that we have. Please speak up if you think otherwise.
Updated by tinita about 2 months ago
We could try out https://opensource.com/article/20/7/systemd-journals-email as a replacement for logwarn.
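Along the lines of that suggestion, the matching part of a logwarn-like check can run directly on the journal: instead of remembering a position in a log file, keep the journal cursor (__CURSOR) of the last entry seen and pass it to journalctl --after-cursor on the next run. A minimal Python sketch under that assumption; the include/exclude split is modelled loosely on logwarn's positive and negative patterns, the state file is hypothetical, and sending e-mail (what the linked article covers) is left out.

```python
import json
import re
import subprocess
from pathlib import Path
from typing import Iterable, List, Optional, Sequence, Tuple


def scan(lines: Iterable[str], include: Sequence[str],
         exclude: Sequence[str]) -> Tuple[List[str], Optional[str]]:
    """Return (messages matching include but not exclude, cursor of the last entry seen)."""
    inc = [re.compile(p) for p in include]
    exc = [re.compile(p) for p in exclude]
    hits: List[str] = []
    last_cursor: Optional[str] = None
    for line in lines:
        if not line.startswith("{"):
            continue  # skip blank lines and non-JSON notices
        entry = json.loads(line)
        last_cursor = entry.get("__CURSOR", last_cursor)
        message = entry.get("MESSAGE")
        if not isinstance(message, str):
            continue
        if any(r.search(message) for r in inc) and not any(r.search(message) for r in exc):
            hits.append(message)
    return hits, last_cursor


def check_unit(unit: str, state_file: Path, include: Sequence[str],
               exclude: Sequence[str]) -> List[str]:
    """Scan entries of `unit` newer than the cursor stored in the (hypothetical) state file."""
    cmd = ["journalctl", "--unit", unit, "--output", "json", "--no-pager", "--quiet"]
    previous = state_file.read_text().strip() if state_file.exists() else ""
    if previous:
        cmd.append("--after-cursor=" + previous)
    output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    hits, cursor = scan(output.splitlines(), include, exclude)
    if cursor is not None:
        state_file.write_text(cursor)  # remember where to continue next time
    return hits
```

Without a state file, the first run scans the whole retained journal of the unit. Newer systemd versions also offer journalctl --cursor-file= for the same bookkeeping.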