action #95221
closed
Provide more people with administrative access to services on qam2.suse.de, at least qa-maintenance/openQABot, i.e. increase bus factor size:M
0%
Description
Motivation¶
There were multiple people trying to understand why some jobs were scheduled or not scheduled by qa-maintenance/openQABot, e.g. asmorodskyi on 2021-07-08 "I see that there was trigger for Public Cloud job groups 28 minutes whom I can ask for the reasons ? and was it in our "prod" instance or it was someone "personal initiative" ?"
https://chat.suse.de/channel/testing?msg=NPuXTpKknF99dhY22
and then there was a discussion about what could be done, e.g. it was suggested to use botmaster to run the "bot", which is basically a schedule-based script, not a continuous service
https://chat.suse.de/channel/testing?msg=K4LixBDAMGp248Zzr
Acceptance Criteria¶
- AC1: Users can see the logs of qa-maintenance/openQABot without special accounts
- DONE AC2: Provide tools team members access to these logs -> #9576
Suggestions¶
- Run services in a way that users can see the logs, e.g. gitlab CI pipeline, botmaster.suse.de, etc.
- Maybe https://github.com/openSUSE/openSUSE-release-tools/blob/master/gocd/daily-cleanup.gocd.yaml#L20 can be used as example
- We could just deploy user accounts and ssh keys same as for the OSD infrastructure using https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/sshd/users.sls . Likely salt+ansible don't conflict here
Updated by ph03nix about 3 years ago
What would be helpful as a first step is to have access to the logs of the scheduling job runs. In some cases (perhaps even a lot), looking at the log already helps you to identify the culprit in the metadata yaml and fix it without the need to log in to the system and work directly on the bot.
Perhaps those logs can be made internally accessible without administrative access as a simple directory via a webserver or something? A simple webserver and an accessible directory with the logs of the last X days might be easy to set up and would already be a big help. X remains to be defined; perhaps one full week makes sense, to be able to compare erroneous runs to previous successful ones?
Updated by okurz about 3 years ago
ph03nix wrote:
Perhaps those logs can be made internally accessible without administrative access as a simple directory via a webserver or something?
This would likely help a bit and would be relatively easy to accomplish. But there is a benefit in being able to monitor live as well as being able to interact. We have good results with about 60 people having ssh access to OSD and I don't see a reason why we need to treat qam2.suse.de as "more secure".
Updated by ilausuch about 3 years ago
- Subject changed from Provide more people with administrative access to services on qam2.suse.de, at least qa-maintenance/openQABot, i.e. increase bus factor to Provide more people with administrative access to services on qam2.suse.de, at least qa-maintenance/openQABot, i.e. increase bus factor size: M
- Description updated (diff)
Updated by okurz about 3 years ago
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 3 years ago
- Subject changed from Provide more people with administrative access to services on qam2.suse.de, at least qa-maintenance/openQABot, i.e. increase bus factor size: M to Provide more people with administrative access to services on qam2.suse.de, at least qa-maintenance/openQABot, i.e. increase bus factor size:M
Updated by okurz about 3 years ago
- Copied to action #95765: Provide more people with administrative access to services on qam2.suse.de, adding ssh keys for existing tools team members added
Updated by mgrifalconi about 3 years ago
I have a couple of points:
- Allow roles, I want only access to logs, not root of the machine. Let's give access on a need basis
- Use some automation to fetch public keys. Public keys can be stored on Gitlab. In some yaml file you only write the usernames that should have access; a script will get the keys and update the ssh configuration file. This way, if users rotate their keys (which should happen more frequently than never :) ), they only need to be updated on Gitlab. All other machines will fetch them from there.
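The key-fetching idea from the second point could be sketched roughly as follows. This is only an illustration under assumptions: it relies on GitLab publishing a user's public keys at `<instance>/<username>.keys`, the instance URL is a placeholder, and the helper names are made up for this example. Reading the usernames from a yaml file and writing the result to the target account are left out.

```python
from urllib.request import urlopen

# Assumption: placeholder instance URL, replace with the real GitLab host.
GITLAB_URL = "https://gitlab.example.com"


def keys_url(username, base_url=GITLAB_URL):
    """Return the URL where GitLab publishes a user's public SSH keys."""
    return f"{base_url.rstrip('/')}/{username}.keys"


def fetch_keys(username, base_url=GITLAB_URL):
    """Download the public keys of one user, one list entry per key."""
    with urlopen(keys_url(username, base_url), timeout=10) as response:
        text = response.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def render_authorized_keys(keys_by_user):
    """Build authorized_keys content from a {username: [keys]} mapping."""
    lines = []
    for user in sorted(keys_by_user):
        lines.append(f"# {user}")  # comment line to show who owns the keys
        lines.extend(keys_by_user[user])
    return "\n".join(lines) + "\n"
```

A periodic job could call `fetch_keys` for every listed username and write the output of `render_authorized_keys` to the account's `authorized_keys` file, so a rotated key only has to be updated in one place.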
Updated by jbaier_cz about 3 years ago
- Status changed from Workable to In Progress
- Assignee set to jbaier_cz
As this seems to be pretty important lately, I will jump into this and investigate our possibilities.
Just for the record,
ph03nix wrote:
What would be helpful as a first step is to have access to the logs of the scheduling job runs. In some cases (perhaps even a lot), looking at the log already helps you to identify the culprit in the metadata yaml and fix it without the need to log in to the system and work directly on the bot.
Perhaps those logs can be made internally accessible without administrative access as a simple directory via a webserver or something? A simple webserver and an accessible directory with the logs of the last X days might be easy to set up and would already be a big help. X remains to be defined; perhaps one full week makes sense, to be able to compare erroneous runs to previous successful ones?
There is currently no log file to export; all logs are in the systemd journal. Maybe we can make use of this.
mgrifalconi wrote:
- Allow roles, I want only access to logs, not root of the machine. Let's give access on a need basis
This is doable, but in this situation there is no easy solution that avoids creating more user accounts. As there is no way to log in directly to the container, one still needs to have elevated rights on the main machine to access the container.
- Use some automation to fetch public keys. Public keys can be stored on Gitlab. In some yaml file you only write the usernames that should have access; a script will get the keys and update the ssh configuration file. This way, if users rotate their keys (which should happen more frequently than never :) ), they only need to be updated on Gitlab. All other machines will fetch them from there.
We already have this, the keys for root access are stored inside ansible repository.
Updated by okurz about 3 years ago
jbaier_cz wrote:
We already have this, the keys for root access are stored inside ansible repository.
There had been manual changes lately. Can you synchronize the changes please?
Updated by jbaier_cz about 3 years ago
Options¶
User access to the container¶
Enhance the Ansible role to deploy an additional user account and allow more ssh keys to log in under this account. The target account should have permission to inspect the journal (it has to be a member of the systemd-journal group).
Cons¶
- also has to be done on qam2, as there is no way to log in directly to the container (without additional [re-]configuration and infra-team involvement)
- inspecting container journal from qam2 cannot be done under regular user (Using the --machine= switch requires root privileges.) so root or sudo needed (adds complexity)
- maintaining allowed ssh-key list inside gitlab repository
botmaster.suse.de¶
Cons¶
- dependency on external service (from the QAM point of view)
- deployment from git tree -- QAM services consist of two parts, the program and metadata (two different repositories), and use rpm packages for distribution to ensure the correct location of the metadata; in case of deployment from git, extra setup would be needed to copy the necessary files to the proper destination; also there would be no stable/testing branch anymore (not necessarily a problem)
- not well-defined deployment process for new jobs
GitLab CI¶
We could use gitlab pipeline scheduling to run the bot; logs from the bot would be accessible within the job itself (and/or exported as an artifact).
Cons¶
- dependency on external service (from the QAM point of view)
- long running jobs could be problematic
- shared gitlab runners are sometimes unreliable
GitLab CI with custom runner¶
We can deploy a custom gitlab-runner inside the container where the bot currently runs. That would improve the reliability, reduce the external service dependency and introduce only a little more complexity. The gitlab-runner deployment can be covered by Ansible.
Cons¶
- long running jobs could still be problematic
- gitlab-runner needs regular maintenance / version updates to be able to work reliably with gitlab.suse.de
Expose journal logs via web interface¶
As was mentioned, we can simply export the logs via a web interface (there is already a web server on qam2 anyway). Unfortunately, there is no log file to serve, so some configuration is still needed: either reconfigure the journal to also log to a file (and set up logrotate to preserve space!) or find out if there is a convenient interface between journald and the outer world. Maybe just running journalctl --output=json-pretty on demand from the webserver might already be enough.
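A minimal sketch of the on-demand variant, under assumptions: the unit and container names passed in are placeholders, the helper names are made up for this example, and the sketch uses `--output=json` (one JSON object per line) instead of `json-pretty` because that is simpler to parse line by line. Wiring this into the existing web server is not shown.

```python
import json
import subprocess


def journalctl_command(unit, since="7 days ago", machine=None):
    """Build a journalctl call that prints one JSON object per line."""
    cmd = [
        "journalctl",
        "--no-pager",
        "--output=json",
        f"--unit={unit}",
        f"--since={since}",
    ]
    if machine:
        # Reading a container's journal from the host requires root privileges.
        cmd.append(f"--machine={machine}")
    return cmd


def parse_entries(output):
    """Turn JSON-lines journal output into (timestamp_us, message) tuples."""
    entries = []
    for line in output.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        timestamp_us = int(record["__REALTIME_TIMESTAMP"])
        entries.append((timestamp_us, record.get("MESSAGE", "")))
    return entries


def read_unit_log(unit, since="7 days ago", machine=None):
    """Run journalctl and return the parsed entries for one unit."""
    result = subprocess.run(
        journalctl_command(unit, since, machine),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_entries(result.stdout)
```

A web handler could call `read_unit_log` per request and render the tuples as plain text. The default `since` of one week mirrors the retention suggested earlier in the ticket and is not a decided value.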
Cons¶
- without history (if not saved as timestamped logs)
- does not provide the ability to trigger a job manually (although this can be provided via a simple API which triggers the corresponding systemctl start command)
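The "simple API" mentioned in the last point might look roughly like this. It is a sketch under assumptions: the unit name in the allowlist is hypothetical, the URL layout (`POST /<unit>`) is invented for this example, and authentication is omitted entirely. The allowlist is there so the endpoint can only start known units.

```python
import subprocess
from http.server import BaseHTTPRequestHandler

# Hypothetical unit name; replace with the real service units of the bot.
ALLOWED_UNITS = frozenset({"example-bot.service"})


def start_command(unit, allowed=ALLOWED_UNITS):
    """Return the systemctl call for an allowed unit, else raise ValueError."""
    if unit not in allowed:
        raise ValueError(f"unit not allowed: {unit}")
    return ["systemctl", "start", "--no-block", unit]


class TriggerHandler(BaseHTTPRequestHandler):
    """Start an allowlisted unit on POST /<unit>."""

    def do_POST(self):
        unit = self.path.rsplit("/", 1)[-1]
        try:
            cmd = start_command(unit)
        except ValueError:
            self.send_error(404, "unknown unit")
            return
        result = subprocess.run(cmd, capture_output=True, text=True)
        # 202: the start job was queued; 500: systemctl reported a failure.
        self.send_response(202 if result.returncode == 0 else 500)
        self.end_headers()
```

To try it locally, one could serve the handler with `http.server.HTTPServer(("127.0.0.1", 8080), TriggerHandler).serve_forever()`. The process needs sufficient privileges to start the unit.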
Updated by jbaier_cz about 3 years ago
okurz wrote:
jbaier_cz wrote:
We already have this, the keys for root access are stored inside ansible repository.
There had been manual changes lately. Can you synchronize the changes please?
Sure, done in commit a914f263.
Updated by openqa_review about 3 years ago
- Due date set to 2021-08-10
Setting due date based on mean cycle time of SUSE QE Tools
Updated by jbaier_cz about 3 years ago
I have a promising PoC GitlabCI at https://gitlab.suse.de/jbaier_cz/openQABot/-/jobs/516115
Updated by jbaier_cz about 3 years ago
- Due date changed from 2021-08-10 to 2021-08-16
A merge request has been created as one possible solution for this problem. Postponing due date, as the general unavailability of GitLab last week rendered testing the pipeline very difficult.
Updated by jbaier_cz about 3 years ago
- Due date changed from 2021-08-16 to 2021-08-20
- Status changed from In Progress to Feedback
Branch is merged and schedules are prepared, now we need to wait to see the results in action.
Updated by okurz about 3 years ago
In https://gitlab.suse.de/qa-maintenance/openQABot/-/pipeline_schedules I see MR+L3 schedules. The schedule for "incidents" I could find on qam2. Where are the current schedules for "full"?
Updated by jbaier_cz about 3 years ago
As decided during last week (probably, I was not a participant of that call unfortunately), the scheduling of other jobs is now done via qem-dashboard and is no longer in scope of openQABot.
Updated by jbaier_cz about 3 years ago
- Copied to action #96998: Increase bus factor for bot-ng size:M added
Updated by jbaier_cz about 3 years ago
okurz wrote:
In https://gitlab.suse.de/qa-maintenance/openQABot/-/pipeline_schedules I see MR+L3 schedules. The schedule for "incidents" I could find on qam2. Where are the current schedules for "full"?
#96998 explains
Updated by jbaier_cz about 3 years ago
- Due date deleted (2021-08-20)
- Status changed from Feedback to Resolved
qa-maintenance/openQABot logs are visible on pipeline_schedules for a couple of days, and logs are available as artifacts for a week; we might create a gitlab-page in the future for better presentation, and we should update the (not really big) documentation in the not very distant future
Updated by okurz about 3 years ago
- Status changed from Resolved to Feedback
@jbaier_cz The work you have done is certainly nice. But I am confused about which components are relevant now. If you are moving openQA bot to gitlab CI pipelines but the "dashboard" receives uncontrolled changes, or if "bot-ng" is a relevant component now, we would need to do more. I don't understand why 2 months ago we had "MR/L3/full/incidents" in openQABot systemd timers but now we only have MR+L3. But you created #96998. We can do specific steps in a separate ticket of course, but I would like to stay in the context of this ticket to at least identify the relevant components.
Updated by jbaier_cz about 3 years ago
Well, from my point of view, AC1 is completed as qa-maintenance/openQABot is visible in Gitlab. The other thing is the schedule, which changed during the 2 months. At the time this ticket was written, there were 4 schedules in place (full, incidents, MR, L3). At the time my merge request was merged (a week ago), there were only 2 relevant schedules (MR, L3) for qa-maintenance/openQABot. Both of them are migrated to Gitlab. At this moment, there is no other bot (besides the template generator, which I do not recognize as a proper bot) running on qam2.suse.de (hence this ticket is done).
My notes about this problem (please note, I was not part of any meetings or decisions so my knowledge is limited to what I was able to put together from other tickets and comments):
Some time ago (6 months actually), after the deployment of qem-dashboard, there was an idea to remake qa-maintenance/openQABot into a new bot which understands the API used by the dashboard: https://progress.opensuse.org/issues/88371, https://progress.opensuse.org/issues/88375. As suggested in https://progress.opensuse.org/issues/88371#note-4, the new bot is capable of scheduling the jobs known as full and incidents in the openqabot terminology. From what I know, there was a meeting (2.5 weeks ago?) as indicated by https://progress.opensuse.org/issues/97121#note-3, where it was decided that the new bot should be used instead of openqabot for full and incidents jobs, and this was promptly implemented (ticket unknown). From this point on, there are only two services left for qa-maintenance/openQABot (the MR and L3 schedules).
Project qa-maintenance/bot-ng has never been on qam2.suse.de and was not used until the aforementioned meeting; at this time, it runs from one of the shared qam servers. For those reasons https://progress.opensuse.org/issues/96998 was promptly created to cover this new tool and to make sure it will be visible (as it took over part of the scheduling) and migrated from "some server" as soon as possible. Just a side note: both schedulers consume the same metadata package to get the scheduling settings; both just need to be executed with common parameters.
I consider this ticket closed as it is not about scheduling openQA jobs specifically nor about all QA services (it is size M after all) but about one specific service, which was unfortunately partly deprecated during my work on this ticket. Do you agree @okurz, or do you have any suggestions how to proceed with this ticket?
Updated by jbaier_cz about 3 years ago
- Status changed from Feedback to Resolved
"Incidents" and "full" schedules are now done via https://progress.opensuse.org/issues/96998, I believe the user story is covered completely.