QE tools - Team description

"The easiest way to provide complete quality for your software"

We provide the most complete free-software system-level testing solution to ensure high quality of operating systems, complete software stacks and multi-machine services for software distribution builders, system integration engineers and release teams. We continuously develop, maintain and release our software to be readily used by anyone while we offer a friendly community to support you in your needs. We maintain the main public and SUSE internal openQA server as well as supporting tools in the surrounding ecosystem.

Team responsibilities

Out of scope

  • Maintenance and recurring review of individual tests (besides openQA-in-openQA tests)
  • Maintenance of special worker addendums needed for tests, e.g. external hypervisor hosts for s390x, powerVM, xen, hyperv, IPMI, VMWare (Clarification: We maintain the code for all backends but we are no experts in specific domains. So we always try to help but it's a case by case decision based on what we realistically can provide based on our competence. We can't be expected to be experts in everything and also we are limited in what we can actually test.)
  • Maintenance of most openSUSE related triggering solutions, e.g. for Tumbleweed or Leap maintenance that use https://github.com/openSUSE/opensuse-release-tools on https://botmaster.suse.de. Contact "SUSE Security Solutions", e.g. Marcus Meissner, for this.
  • Ticket triaging of http://progress.opensuse.org/projects/openqatests/
  • Setup of configuration for individual products to test, e.g. new job groups in openQA
  • Feature development within the backend for single teams (commonly provided by teams themselves)

Our common userbase

Known users of our products: Most SUSE QA engineers, SUSE SLE release managers and release engineers, every SLE developer submitting "submit requests" in OBS/IBS where product changes are tested as part of the "staging" process before changes are accepted in either SLE or openSUSE (staging tests must be green before packages are accepted), same for all openSUSE contributors submitting to either openSUSE:Factory (for Tumbleweed, SLE, future Leap versions) or Leap, other GNU/Linux distributions like Fedora https://openqa.fedoraproject.org/ , AlmaLinux http://openqa.almalinux.org/, Debian https://openqa.debian.net/ , https://openqa.qubes-os.org/ , https://openqa.endlessm.com/ , the GNOME project https://openqa.gnome.org, https://www.codethink.co.uk/articles/2021/automated-linux-kernel-testing/, https://en.euro-linux.com/blog/openqa-or-how-we-test-eurolinux/, openSUSE KDE contributors (with their own workflows, https://openqa.opensuse.org/group_overview/23 ), openSUSE GNOME contributors (https://openqa.opensuse.org/group_overview/35 ), OBS developers (https://openqa.opensuse.org/parent_group_overview/7#grouped_by_build) , wicked developers (https://gitlab.suse.de/wicked-maintainers/wicked-ci#openqa), and of course our team itself for "openQA-in-openQA Tests" :) https://openqa.opensuse.org/group_overview/24 . Also see https://en.opensuse.org/openSUSE:OpenQA/Partners .
Keep in mind: "Users of openQA" and talking about "openSUSE release managers and engineers" means SUSE employees but also employees of other companies, also development partners of SUSE.
In summary, our products, for example openQA, are a critical part of many development processes, so outages and regressions are disruptive and costly. We therefore need to ensure high quality in production, which is why we practice DevOps with a slight tendency towards a conservative approach for introducing changes while still ensuring a high development velocity.

This might be reworked via: https://github.com/os-autoinst/linux-qa/issues/1 to make it more discoverable

How we work

The QE Tools team follows the DevOps approach and works using a lightweight Agile approach also inspired by Extreme Programming and Kanban and of course the original http://agilemanifesto.org/. We structure our team and roles following Agile Product Ownership in a Nutshell. We plan and track our work using tickets on https://progress.opensuse.org . We pick tickets based on priority and planning decisions. We use weekly meetings as checkpoints for progress and also track cycle and lead times to crosscheck progress against expectations.

Be aware: Custom queries in the right-hand sidebar of individual projects, e.g. https://progress.opensuse.org/projects/openqav3/issues , show queries with the same name but are limited to the scope of the specific projects so can show only a subset of all relevant tickets.

What we expect from team members

  • Actively show visible contributions to our products every workday (pull requests, code review, ticket updates in descending priority, i.e. if you are very active in pull requests + code review ticket updates are much less important)
  • Be responsive over usual communication platforms and channels (user questions, team discussions)
  • Stick to our rules (this wiki, SLOs, alert handling)

Common tasks for team members

This is a list of common tasks that we follow, e.g. reviewing daily based on individual steps in the DevOps Process

Best practices for major changes

When proposing non-trivial changes with the potential of breaking existing tests consider the following best practice patterns:

  • Make the problematic change opt-in via a test variable like MY_NEW_FEATURE_ENABLED to enable the new behavior, and otherwise log a warning only
  • Include a reference to a relevant GitHub PR and progress ticket
  • If a BARK test is to be conducted to assess the full impact of the change, prepare an autoreview regex matching the most relevant error message so that affected jobs can be restarted trivially without disrupting daily operation too much. It is called a BARK test after how the bark of a tree is scratched to check whether it is green and alive or brown and no longer healthy.
  • Inform all stakeholders in relevant Slack channels, Matrix and mailing lists
  • Include an explicit mention in the release notes
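
The opt-in pattern from the first item above can be sketched roughly as follows. This is only a minimal illustration: the function name run_check and the two messages are hypothetical, and MY_NEW_FEATURE_ENABLED is the example variable name used above, not an existing openQA setting.

```shell
# Hypothetical helper: run_check and its output strings are illustrative only.
run_check() {
    if [ "${MY_NEW_FEATURE_ENABLED:-0}" = "1" ]; then
        # Opt-in path: the potentially breaking behavior is active
        echo "new strict check"
    else
        # Default path: keep the old behavior and only log a warning on stderr
        echo "WARNING: legacy check in use, set MY_NEW_FEATURE_ENABLED=1 to opt in" >&2
        echo "legacy check"
    fi
}

run_check
```

Keeping the old behavior as the default means existing tests keep passing while the warning makes affected jobs easy to find in logs.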

Guideline for communication in tickets

  • Clarify action items and steps (to be) taken, for example
    • I will implement ... from the suggestions
    • I will monitor ... and evaluate results
    • Confirm if other experts will provide reproducers
    • Document mitigations with references to MRs, PRs or manual file changes and keep the description updated
    • Confirm if adjustments made by others are still in place
  • Explicitly include examples of what won't be done
    • I won't look into the test code itself here
  • Make use of the scientific method template with hypotheses, experiments and observations

How we work on our backlog

  • "due dates" are only used as exceptions or reminders. Commonly the due date is set automatically to 14 days in the future as soon as a non-low ticket is picked up. That period is roughly the median cycle time, which we want to stay well below. In addition, to prevent Redmine from sending a reminder and the backlog status from flagging issues, the ticket should be resolved at least a day before the due date. A reminder is possibly sent out even on the last day before, so it is better to resolve the ticket by the second-to-last day. Of course it is even better to always try to finish as soon as possible, well before the due date.
  • every team member can pick up tickets themselves
  • everybody can set priority, PO can help to resolve conflicts
  • consider the ready, not assigned/blocked/low query as preferred. It is suggested to pick up tickets based on priority. "Workable" tickets are often convenient and hence preferred.
  • ask questions in tickets, even potentially "stupid" questions, oftentimes descriptions are unclear and should be improved
  • There are "low-level infrastructure tasks" only conducted by some team members, the "DevOps" aspect does not include that but focusses on the joint development and operation of our main products
  • Consider tickets with the subject keyword or tag "learning" as good learning opportunities for people new to a certain area. Experts in the specific area should prefer helping others but not work on the ticket
  • For tickets which are out of the scope of the team remove from backlog, delegate to corresponding teams or persons but be nice and supportive, e.g. SUSE-IT, EngInfra especially see our IT ticket handling process and SLA, test maintainer, QE-LSG PrjMgr/mgmt
  • Whenever we apply changes to the infrastructure we should have a ticket
  • Refactoring and general improvements are conducted while we work on features or regression fixes
  • For every regression or bigger issue that we encounter try to come up with at least two improvements, e.g. the actual issue is fixed and similar cases are prevented in the future with better tests and optionally also monitoring is improved
  • For critical issues and very big problems especially when we were informed by users about outages collect "lessons learned", e.g. in notes in the ticket or a meeting with minutes in the ticket, consider https://en.wikipedia.org/wiki/Five_whys and answer at least the following questions: "User impact, outwards-facing communication and mitigation, upstream improvement ideas, Why did the issue appear, can we reduce our detection time, can we prevent similar issues in the future, what can we improve technically, what can we improve in our processes". Also see https://youtu.be/_Dv4M39Arec
  • okurz proposes to use "#NoEstimates", though that topic is controversial and often misunderstood. https://ronjeffries.com/xprog/articles/the-noestimates-movement/ describes it nicely :) Hence tickets should be evenly sized and no estimation numbers should be provided on tickets
  • If you really want you can look at the burndown chart (some people wish to have this) but we consider it unnecessary due to the continuous development, not a project with defined end. Also an agile board is available but likely due to problems within the redmine installation ordering cards is not reliable.
  • For critical changes, write to qa-team@suse.de as well as to the chat channels
  • Everyone should propose reverts of features if we find problems that can not be immediately fixed or worked around in production
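
The 14-day due-date rule from the first item of this list can be reproduced with a small calculation. This is a sketch assuming GNU date; the helper name due_date is hypothetical and not part of any team tooling.

```shell
# Sketch assuming GNU date; due_date is a hypothetical helper name.
due_date() {
    # usage: due_date YYYY-MM-DD  (the day the ticket was picked up)
    # prints the date 14 days later, computed in UTC to avoid DST effects
    date -u -d "$1 +14 days" +%F
}

due_date 2024-03-01
```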

Definition of DONE

Also see https://web.archive.org/web/20110308065330/http://www.allaboutagile.com/definition-of-done-10-point-checklist/ and https://web.archive.org/web/20170214020537/https://www.scrumalliance.org/community/articles/2008/september/what-is-definition-of-done-(dod)

  • Code changes are made available via a pull request on a version control repository, e.g. github for openQA
  • Guidelines for git commits have been followed
  • Code has been reviewed (e.g. in the github PR)
  • Depending on criticality/complexity/size/feature: A local verification test has been run, e.g. post link to a local openQA machine or screenshot or logfile (especially also for hardware-related changes, e.g. in os-autoinst backend)
  • For regressions: A regression fix is provided, flaws in the design, monitoring, process have been considered
  • Potentially impacted package builds have been considered, e.g. openSUSE Tumbleweed and Leap, Fedora, etc.
  • Code has been merged (either by reviewer or "mergify" bot or reviewee after 'LGTM' from others)
  • Code has been deployed to osd and o3 (monitor automatic deployment, apply necessary config or infrastructure changes)

Definition of READY for new features

The following points should be considered before a new feature ticket is READY to be implemented:

  • Follow the ticket template from https://progress.opensuse.org/projects/openqav3/wiki/#Feature-requests
  • A clear motivation or user expressing a wish is available
  • Acceptance criteria are stated (see ticket template) or use [timeboxed:<nr>h] with <nr> hours for tasks that should be limited in time, e.g. a research task with [timeboxed:20h] research …
  • add tasks as a hint where to start

WIP-limits (reference "Kanban development")

Target numbers or "guideline", "should be", in priorities

  1. New, untriaged QA (openQA, etc.): 0 (daily) . Every ticket should have a target version, e.g. "Ready" for QE tools team, "future" if unplanned, others for other teams
  2. Untriaged "tools" tagged: 0 (daily) . Every ticket should have a target version, e.g. "Ready" for QE tools team, "future" if unplanned, others for other teams
  3. Workable (properly defined): 10-40 . Enough tickets to reflect a proper plan but not too many to limit unfinished data (see "waste")
  4. Overall backlog length: ideally less than 100 . Similar as for "Workable". Enough tickets to reflect a proper roadmap as well as give enough flexibility for all unfinished work but limited to a feasible number that the team can still oversee without losing overview. One more reason for a maximum of 100 is that pagination in the Redmine UI allows showing only up to 100 issues on one page at a time, same for Redmine API access.
  5. Within due-date: 0 (daily/weekly) . We should take due-dates seriously, finish tickets fast and at the very least update tickets with an explanation why the due-date could not be held and update it to a reasonable time in the future based on usual cycle time expectations
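
As a rough illustration of how such target numbers could be checked, the following sketch compares sample counts against the ranges listed above. The function check_limit and the sample counts are made up for this example; the actual status is shown on the dashboard referenced under "Status overview".

```shell
# Hypothetical sketch: check_limit and the sample counts are illustrative only;
# the ranges are the target numbers from the list above.
check_limit() {
    # usage: check_limit NAME MIN MAX VALUE
    if [ "$4" -ge "$2" ] && [ "$4" -le "$3" ]; then
        echo "$1: $4 (ok, target $2-$3)"
    else
        echo "$1: $4 (outside target $2-$3)"
        return 1
    fi
}

check_limit "untriaged" 0 0 0
check_limit "workable" 10 40 25
# "ideally less than 100" is expressed here as a maximum of 99
check_limit "backlog" 0 99 120 || echo "backlog needs refinement"
```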

SLAs (service level agreements)

  • for at least picking up tickets, better providing reasonable updates based on priority, first goal is "urgency removal":

  • "reasonable updates": Provide fixes, workarounds or at least state of progress or when the task is blocked

  • to ensure timely updates immediate/urgent tickets must never be in status "Blocked" or "Feedback"

  • aim for cycle time of individual tickets (not epics or sagas): 1h-2w

SLOs (service level objectives, internal)

  • For providing reasonable updates on tickets in our backlog based on priority, first goal is "urgency removal":

  • Frequent updates do not necessarily need to happen in tickets but must be visible in written form, e.g. just in internal chat. Especially in ticket updates every comment should give a clear answer: Who plans to do what until when, in particular the ticket assignee.

  • Reference for SLOs and related topics: https://sre.google/sre-book/table-of-contents/

Status overview

Dynamic dashboard showing target numbers and SLOs: https://os-autoinst.github.io/qa-tools-backlog-assistant/

Backlog prioritization

When we prioritize tickets we assess:

  1. What the main use cases of openQA are among all users, be it SUSE QA engineers, other SUSE employees, openSUSE contributors as well as any other outside user of openQA
  2. We try to understand how many persons and products are affected by feature requests as well as regressions and prioritize issues affecting more persons and products and use cases over limited issues. See #120540 for details in particular about the various os-autoinst backends
  3. We prioritize regressions higher than work on (new) feature requests
  4. If a workaround or alternative exists then this lowers priority. We prioritize tasks that need deep understanding of the architecture and an efficient low-level implementation over convenience additions that other contributors are more likely to be able to implement themselves.

Periodic backlog refinement

These queries can be used to help organize our work efficiently

  1. QE tools team - backlog - sorted by update time: ensure all tickets are reasonably up-to-date and don't keep hanging around
  2. QE tools team - due date forecast: proactively prevent running into due-dates
  3. QE tools team - next - sorted by update time: ensure all next tickets are reasonably up-to-date and considered for the backlog
  4. QE tools team - backlog, non-reactive, needs parent: ensure all our (non-reactive) work is linked to higher-level planning as motivation

It's good practice to keep an eye on the queries to anticipate blockers. All team members are encouraged to utilize them and they are useful as part of the Scrum Master's daily routine as well as moderation duty.

Note that due dates should provide a hint as to when a ticket will be resolved but they need to be realistic. Availability, reviews and deployment need to be factored in as well since typically a ticket will be in Feedback before it can be resolved. If in doubt the Due date should be extended with an accompanying message like "Outstanding branches still need to be reviewed" or simply "Bumping the due date because of availability".

Team meetings

Note: We are using the virtual office on workadventu.re for regular meetings unless otherwise mentioned. We meet at the glass table in the north-east (walk towards the right) linked to https://meet.opensuse.org/6blugd-meetopensuseorg . There are other tables for ad-hoc conversations. You just need to be next to each other to chat.

Good habits:

  • Close meetings on time and ensure everyone knows if the call is closed, and if there are follow-up conversations

Regular calls:

  • Dev Daily: Use (internal) chat actively, e.g. formulate your findings or achievements and plans for the day, "think out loud" while working on individual problems. Join our regular meeting location every weekday 1035-1050 CET/CEST. At the latest at 1100 CET/CEST everyone working on that day must have checked in, at least with a text message in chat.
    • Goal: Emergency responses, clarify next steps or blockers on current work items, asking and answering questions on tickets that would be ignored otherwise, ticket estimations (after the regular daily) (compare to Daily Scrum)
    • Conduction: Answer the following questions concerning non-infra tasks:
      • Is the backlog status green?
      • Are there any time-critical issues to be handled?
      • What was achieved since the last time?
      • Who needs help?
      • Plans until next time?
  • Infra Daily: Every weekday 1020-1035 CET/CEST
    • Goal: State your on-going tasks and plans for the day regarding "infra" work. Estimate and unblock infra tickets as needed.
    • Consider making time for a short break for those who also join the dev daily
  • Ticket Estimations: Infra Every Tuesday 1400 CET/CEST, Dev Every Thursday 1100-1150 CET/CEST including a 5 minute break
    • Goal: Estimate t-shirt sizes for infra: our non-estimated tickets and dev: our non-estimated tickets.
    • Goal: Ensure tickets are workable. Refine and split tickets for larger estimates.
    • Conduction:
      • Consider using https://www.scrumpoker-online.org/en/room/52534457/ or Jitsi surveys to make explicit decision points for ticket estimation calls to prevent awkward silences
      • Check who reads out tickets, prepares the etherpad and updates the ticket respectively at the start of the call
      • Try and aim for S size tickets (e.g. <20h of effort), and split up the ticket if needed. An M size ticket is more complex, e.g. when multiple code repositories need to be touched.
      • If a ticket can't be estimated in 10 minutes, schedule a follow-up conversation or skip the ticket e.g. with a short comment on open questions
  • Midweekly Unblock: Every Wednesday 1100-1150 CET/CEST including a 5 minute break
  • Collaborative Session: Thursdays between 1330-1630 CET/CEST in our regular meeting location if a topic was picked at the latest in the Estimations and announced accordingly. Pick from previous suggestions or bring up your own topic
    • Goal: Follow-up on tasks too difficult to solve alone, or where someone looks to be stuck using pair programming and other means
  • Fortnightly Coordination: Friday 1100-1150 CET/CEST every even week including a 5 minute break. Community members and guests are particularly welcome to join this meeting.
    • Goal: Demo of features, Evaluation of metrics(#152957), Team backlog coordination and design decisions of bigger topics (compare to Sprint Planning).
    • Conduction: Demo recently finished feature work depending on last closed, crosscheck status of team, discuss blocked tasks and upcoming work
  • Fortnightly Retrospective: Friday 1100-1150 CET/CEST every odd week including a 5 minute break - a link to our retro board can be found in the Slack bookmarks, or the reminder to join the call.
  • Virtual coffee: Weekly every Monday 1330-1345 CET/CEST in our regular meeting location.
  • Workshop: Friday 0900-0950 CET/CEST every even week in meet.opensuse.org/suse_qa_tools especially for community members and users!

Weekly moderation duty

We do not CURRENTLY assign this task to team members in rotation, see #132446. Instead the moderator is whoever is available first, in this order: SM (Liv), Marius, Oliver. The moderator can also just start the meeting and ask somebody else to conduct it.

We see mandatory daily video calls as an effective measure but we don't want to force the team to do this unless we have to. To ensure that we have daily updates, next to the Alert duty we have the rotating role of "moderation duty". The person doing alert duty in the next week has "moderation duty". The duty consists of ensuring How we work on tickets, in particular:

  • On a daily basis ensure that we have an update from every team member who is expected to be present this day. If a person actively contributes to the daily meeting in the video call or provided an update related to backlog tasks in chat then this is already ensured.
  • Hand over to the next person during the weekly, going by the order of team members in the wiki
  • Ask for a stand-in when unavailable

We expect that this of course is an additional task with the corresponding time investment. The expected time invested per day is in the range of 3-15m, not more, so accounting for 15m-1h15m during duty week. Even in the worst case of a 30h part time worker investing said 15m every day that accounts for only 5% of weekly work time so no significant impact on contributions is expected.
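
The numbers in the paragraph above can be checked with a few lines of arithmetic. The variable names are illustrative, and a duty week of five workdays is an assumption made for this sketch.

```shell
# Sketch of the arithmetic above; variable names are illustrative.
minutes_per_day=15   # upper bound of the expected daily effort
workdays=5           # assumption: one duty week has five workdays
weekly_hours=30      # the part-time example from the text

minutes_per_week=$((minutes_per_day * workdays))
# awk is used because POSIX shell arithmetic is integer-only
share=$(awk -v m="$minutes_per_week" -v h="$weekly_hours" \
    'BEGIN { printf "%.1f", m * 100 / (h * 60) }')
echo "${minutes_per_week} minutes per week, ${share}% of a ${weekly_hours}h week"
```

The worst case therefore comes out slightly above 4%, which is consistent with the "only 5%" upper bound stated above.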

Best practices for meetings

  • Meetings concerning the whole team are moderated by the scrum master by default, who should join the call early and verify that the meeting itself and any tools used are working or e.g. advise the use of the fallback option.
  • We would prefer UTC for meeting times to be globally fair but as many other SUSE meetings are bound to European time we need to stick to that as well.
  • It is recommended to use the Jitsi Audio-feedback feature, blue/green circles depending on microphone volume. Everybody should ensure that at least "two green bubbles" show up. Consider hints from https://en.opensuse.org/SDB:Audio_troubleshooting#Configuring_the_microphone
  • Ask and give feedback regarding the audio quality. Use terms from https://en.wikipedia.org/wiki/Plain_language_radio_checks "Loud and clear" if everything is good or "weak but readable" if of low volume but one can understand what the person is saying if there is no one else overlaying with higher volume. Or "loud but distorted" if there are interferences, e.g. broken sound due to overloaded system or too low connection bandwidth.
  • Hand signals over video can be used, e.g. "waving/circling hands": "I am lost, please bring me into discussion again"; "T-Sign": "I need a break"; "Raised hand": "I would like to speak"
  • Make the end of each meeting explicit. For example clearly mention when a meeting is done plus use visual cues like a chat message "daily is over" or the Jitsi "clapping hands" reaction. This ensures that people are engaged in the meeting, only stay as long as they want and can be engaged, and do not miss it when a new spontaneous meeting starts directly afterwards (#153937)
  • Discuss topics relevant for all within the common meetings, continue discussions pro-actively over asynchronous communication, e.g. tickets, as well as conduct topic centered follow-up meetings with only relevant attendees
  • Reminders in Slack correct for summer/winter time automatically but if you make changes on them the time might be shifted by one hour e.g. if you scheduled a reminder on 10:30 am CEST, it will become 9:30 CET after the switch
  • Use https://etherpad.opensuse.org/p/suse_qe_tools for collaborative editing and put the content back into tickets or wikis. For a SUSE internal and hence more protected environment use https://etherpad.prg2.suse.org/
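
The one-hour shift of Slack reminders mentioned above can be pictured as the same UTC instant being shown in local time before and after the switch. The following is only an illustration of that effect, not a description of how Slack stores reminders. It assumes GNU date; the helper name local_time is hypothetical, and the POSIX TZ string encodes the Central European rules so that no timezone database is needed.

```shell
# Sketch assuming GNU date. CET is UTC+1, CEST is UTC+2; the TZ string describes
# the switch on the last Sunday of March and of October.
cet='CET-1CEST,M3.5.0,M10.5.0/3'

local_time() {
    # usage: local_time 'YYYY-MM-DD HH:MM'  (input in UTC) -> prints local HH:MM
    TZ="$cet" date -d "$1 UTC" +%H:%M
}

local_time '2024-07-01 08:30'   # summer time (CEST)
local_time '2024-12-02 08:30'   # winter time (CET)
```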

Workshop Topics

  • The SUSE QE Tools roadmap: Recent achievements, mid-term plan and future outlook. Every first Friday every even month
  • Find older workshop topics and recordings on our SUSE QE Tools Workshop Archive

For the call details see Team Meetings


  • periodic proposal by okurz: How to report tickets, investigate issues, etc. (#104805)
  • general proposal: if there are no further topics make it an "open conversation", at least from time to time :)
  • proposal by okurz: Generic agile project management trainings and tutorials
  • feedback from yearly workshop review: run it every second week but maybe longer, more interactive, more technical sessions, about backends and more openQA internals, from jlausuch: maybe understanding how svirt backend boots VMs in s390x, VMWare, etc? Highlight the differences between how qemu backend spawns VMs and how others do

Note: Everybody should feel welcome to add topic proposals here or approach us with ideas or requests.
Remove appointments from https://calendar.opensuse.org/ when events are skipped.

Announcements

  • For every meeting, regular or one-off, desired attendants should be invited to make sure a slot is blocked in their calendar and reminders with the correct local time will show up when it's time to join the meeting
    • Create a new event, for example in Thunderbird via the Calendar tab or New > Event via the menu.
    • Select individual attendants via their respective email addresses, e.g. Invite attendees in Thunderbird
    • Specify the time of the meeting
    • Set a schedule to repeat the event if applicable.
    • Add a location, e.g. https://meet.opensuse.org/suse_qa_tools
    • Don't worry if any of the details might change - you can update the invitation later and participants will be notified.
    • Prefer new events if the time and date change
  • See the respective meeting for regular actions such as communication via chat

Team

The team is comprised of engineers from different teams, some only partially available:

  1. Liv Dywan (Scrum Master - Ensure that we build it fast) @livdywan / @kalikiana
  2. Oliver Kurz (Product Owner - Ensure that we build the right thing) @okurz / @okurz
  3. Nick Singer (only OPS) @nicksinger / @nicksinger
  4. Tina Müller (Part time (35h)) @tinita / @perlpunk
  5. Jan Baier (part time, QEM-dedicated work areas) @jbaier_cz / @baierjan
  6. Dominik Heidler @dheidler / @dheidler
  7. Marius Kittler @mkittler / @Martchus
  8. Yannis Bonatakis @ybonatakis / @b10n1k
  9. Robert Richardson @robert.richardson / @r-richardson
  10. Gaurav Pathak @gpathak / @gauravpathak
  11. Sebastian Riedel (mostly working on other projects currently, only bug fixing and feature development) @kraih / @kraih

Onboarding for new joiners

Communication

OBS

IBS

Github

Gitlab

Other

Offboarding

When someone leaves the team the following steps should be taken

  • Conduct a team-internal exit-interview (Learn about what was good, what can be improved, what to learn)
  • Remove from https://github.com/orgs/os-autoinst/teams/tools-team . Optionally still add the people as contributors with additional privileges to individual projects
  • Remove from team calendars

Alert handling

Best practices

Process

  • React to any alert or report of an outage
  • If users report outages of components of our infrastructure
    • Ensure there is a ticket on the backlog tracking the issue
  • For any user-facing outages
    • Consider teaming up and assigning individual tasks to focus on
    • Inform affected users about the impact and ETA via chat channels, ticket updates and mailing list posts
    • Look into mitigations and short-term workarounds such as a hotpatch in production or a revert to an older release
    • Investigate a proper solution with a conservative estimate on the effort involved
    • Set a time limit to ensure either a workaround or a solution is available within a reasonable amount of time (for example 4 hours or end of working day of the person communicating the changes)
    • Join an ad-hoc video call to discuss further steps
    • Keep a record of what was discussed and investigated to allow for a later analysis
    • Look into symptoms such as restarting incomplete jobs
  • For each failing alert, e.g. Grafana, failing CI pipelines, etc.
    • Create a ticket for the issue (with a tag "alert"; create ticket unless the alert is trivial to resolve and needs no improvement; if an alert is unhandled for at least 4h then a ticket must be created; even create a ticket if alerts turn to "ok" to prevent these issues in the future and to improve the alert)
    • Link the corresponding ... in the ticket
      • Grafana panel as reference in the alert email
      • Details of the failing job in case of an Unreviewed issue alert
      • Pipeline name and link in case of GitLab
    • Copy relevant metadata from the email, especially date and time, mentioned hostname(s) and the subject of the email
    • Respond to the notification email with a link to the ticket or forward the email to a corresponding mailing list, e.g. o3-admins@suse.de or osd-admins@suse.de (Caveat: gitlab@suse.de as sender seems to be able to receive emails and swallow them without any useful response or error message)
    • Optional: Inform in chat
    • Optional: Add "annotation" in corresponding Grafana panel with a link to the corresponding ticket
    • Silence/pause the alert to mitigate urgency and reduce the priority of the ticket
    • For Grafana just follow the "silence" button in alert emails or use https://monitor.qa.suse.de/alerting/silences. Consider a default of 2 months, reference the ticket and mention removing the silence in the ticket under "Rollback actions". Alternatively, if you as ticket assignee want to be notified about alerts without distracting others: on https://monitor.qa.suse.de/alerting/routes click "New nested policy" next to the policy for __contacts__ =~ .*"osd-admins".* and add direct messages to yourself instead of the mailing list. Also mention that in "Rollback actions"
    • GitLab pipelines can be paused after taking ownership (think of it as who touched it last, not who maintains it)
    • In Zabbix a problem can be suppressed
    • When observing an Unknown issue, file a ticket and add it in a comment on the job and consider an autoreview regex in case it affects multiple test modules
    • To address openqa logwarn issues, add the message to the list of known messages (and potentially look into changing the message or log level later)
    • See Munin
    • See gitlab pipeline notifications
  • If you consider an alert non-actionable then change it accordingly
  • If you do not know how to handle an alert ask the team for help
  • We must always strive for an accepted hypothesis when we want to change alerts or call an issue resolved
  • After resolving the issue add explanation in ticket, unpause alert and verify it going to "ok" again, resolve ticket

References

Grafana

Pausing alerts
  • Silence the alert in Grafana
  • It is most useful to match by the rule_uid label or by the alertname label, e.g. alertname=~openqa-piworker:.* or rule_uid=~host_up_alert_openqaworker-arm-\d+. Note that the regex matching requires you to use .* at the start or end as ^ and $ are implied.
  • Fill in the comment field, e.g. with a ticket URL.
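
The implied anchoring can be tried out locally before creating a silence. The sketch below emulates it with grep -x, which only accepts matches covering the whole input line. The helper name matches and the sample alert names are made up, and [0-9]+ is used instead of \d because grep -E does not support \d.

```shell
# Sketch: grep -x emulates the implied ^ and $ of the label matchers.
matches() {
    # usage: matches REGEX VALUE  -> exit status 0 if the whole value matches
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

matches 'openqa-piworker:.*' 'openqa-piworker: host up' \
    && echo "prefix pattern with trailing .* matches"
matches 'piworker' 'openqa-piworker: host up' \
    || echo "bare substring does not match"
matches 'host_up_alert_openqaworker-arm-[0-9]+' 'host_up_alert_openqaworker-arm-21' \
    && echo "full rule_uid pattern matches"
```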

Gitlab Pipeline Notifications

Currently, the following projects are configured to write an email to osd-admins@suse.de if a pipeline fails:

Note:
  • The configuration can be found by going to Settings > Integrations > Pipeline Status Emails (for any new projects the plugin will need to be enabled first)
  • There's no way to subscribe as a user - instead an email address must be added
API usage for handling email notifications
  • To disable all CI failure notifications, run:
export GITLAB_TOKEN=OAUTH2_USER_TOKEN_FROM_GITLAB
for i in 6096 4877 5544 3731 743 746 4652 3530 4884;do
    curl -X DELETE --header "Authorization: Bearer ${GITLAB_TOKEN}" "https://gitlab.suse.de/api/v4/projects/${i}/integrations/pipelines-email"
done
  • For enabling all notifications:
export GITLAB_TOKEN=OAUTH2_USER_TOKEN_FROM_GITLAB
for i in 6096 4877 5544 3731 743 746 4652 3530 4884;do
    curl -X PUT --data 'recipients=osd-admins@suse.de&notify_only_broken_pipelines=true' --header "Authorization: Bearer ${GITLAB_TOKEN}" "https://gitlab.suse.de/api/v4/projects/${i}/integrations/pipelines-email"
done
  • OAUTH2_USER_TOKEN_FROM_GITLAB must be a valid user-generated token with read/write API privileges, and the user must have corresponding privileges in these repositories
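
The two loops above differ only in the HTTP method and the data sent, so they could be folded into one parameterized helper. This is a minimal sketch, not an existing team script: the function names (integration_url, request, set_notifications) and the variables DRY_RUN, PROJECT_IDS and GITLAB_API are assumptions introduced here, and --fail was added so that HTTP errors are not silently ignored. By default it only prints what it would do.

```shell
#!/bin/sh
# Assumed defaults, taken from the commands above
GITLAB_API="${GITLAB_API:-https://gitlab.suse.de/api/v4}"
PROJECT_IDS="${PROJECT_IDS:-6096 4877 5544 3731 743 746 4652 3530 4884}"

# Build the pipeline status email integration URL for one project ID
integration_url() {
    printf '%s/projects/%s/integrations/pipelines-email\n' "$GITLAB_API" "$1"
}

# request METHOD URL [extra curl arguments...]
# With DRY_RUN=1 (the default) only print method and URL, never the token.
request() {
    method="$1"; url="$2"; shift 2
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$method $url"
    else
        curl --fail --silent --show-error -X "$method" \
            --header "Authorization: Bearer ${GITLAB_TOKEN:?set GITLAB_TOKEN first}" \
            "$@" "$url"
    fi
}

# set_notifications enable|disable
set_notifications() {
    case "$1" in
        disable)
            for id in $PROJECT_IDS; do
                request DELETE "$(integration_url "$id")"
            done ;;
        enable)
            for id in $PROJECT_IDS; do
                request PUT "$(integration_url "$id")" \
                    --data 'recipients=osd-admins@suse.de&notify_only_broken_pipelines=true'
            done ;;
        *)
            echo "usage: set_notifications enable|disable" >&2
            return 2 ;;
    esac
}
```

After sourcing the file, set_notifications disable lists the requests; setting DRY_RUN=0 and exporting GITLAB_TOKEN beforehand actually sends them.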

Munin

  • To completely disable alert emails from munin: in /etc/munin/munin.conf, comment out the line contact.o3admins.command.
  • For individual plugins it is necessary to read the plugin docs, e.g. in /etc/munin/plugins/df you can see how to adjust the values for warning and critical. Put the settings in /etc/munin/plugin-conf.d/munin-node and then run systemctl restart munin-node, e.g.
[df]
env.exclude none unknown rootfs iso9660 squashfs udf romfs ramfs debugfs cgroup_root devtmpfs
env.warning 92
env.critical 98
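
To make the effect of these two values concrete, here is a small illustrative sketch. It is not part of Munin; classify_usage is a hypothetical helper, and it assumes that a single threshold number is treated as an upper limit which triggers only when exceeded. Check the plugin docs for the exact boundary behaviour.

```shell
#!/bin/sh
# Hypothetical helper: classify an integer disk usage percentage against
# the thresholds from the [df] example above (defaults 92 and 98).
classify_usage() {
    usage="$1"; warning="${2:-92}"; critical="${3:-98}"
    if [ "$usage" -gt "$critical" ]; then
        echo critical
    elif [ "$usage" -gt "$warning" ]; then
        echo warning
    else
        echo ok
    fi
}

classify_usage 95   # above 92 but not above 98
```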

Weekly alert duty

We should all react to alerts, but additionally we can have one person on "alert duty" for one week each to ensure quicker reaction times when other team members are focussed on development work. The person on duty should do the following:

  • React quickly (e.g. within two hours) to any unhandled alerts
  • Hand over to the next person after the weekly, going by the order of team members in the wiki
  • Ask for a stand-in when unavailable

Collaboration best practices

Sometimes pull requests are based on other pull requests. If person X reviews PR 1 and person Y reviews PR 2 while both PRs share the same commits, everybody ends up with more work. As a best practice it is recommended to

  • Include keywords in the PR subject line, e.g. "Part 2: … - based on #". Example: https://github.com/os-autoinst/openQA/pull/4473
  • Include the list of base pull request(s) in the PR description. Keep in mind that pull request links in github only seem to be properly rendered as preview links when included in a Markdown list, e.g.
Based on
* #1234
  • Mark the dependent pull request as draft until the base pull request is approved or merged

See #105244 for the motivation for these best practices

SUSE-IT ticket handling

As we are relying on Eng-Infra a lot and need to coordinate our work we should follow a consistent process with best practices.

  1. By default use Incident as it includes fields for "Impact" and "Urgency", avoid "Service Request". In some cases Service Request with Approval should be used, e.g. when trying to give access to some systems for new team members
  2. Ensure there's a corresponding ticket for it in openQA Infrastructure
  3. Use Eng-Infra under Select a system
  4. Use [openqa] … in the subject if applicable
  5. Use the below template for the Description
  6. Select a sensible Impact and Urgency and make sure the severity and impact of the issue is explicitly mentioned in the EngInfra ticket, e.g. what business related workflows are impacted
  7. Share the ticket with "OSD Admins" (or, after the ticket was created, use Share with OSD Admins; the icon with two figures, not the single gray avatar)
  8. Use the tracker ticket for internal notes
  9. React quickly to questions and ticket updates but also keep in mind limited capacity of EngInfra (as of 2022-02)

We have a ticket template for SUSE SD Eng-Infra tickets to improve our communication, in particular about impact, steps to reproduce and acceptance criteria. Use the following template and replace all instances of <…>:

h2. Observation

<To be replaced: Observation of the problem>

h2. Steps to reproduce

<To be replaced: Steps to reproduce>

h2. Expected result

<To be replaced: Expected result details>

h2. Impact

<To be replaced: What and who is impacted>

h2. Further details

Internal tracking issue: <To be replaced: ticket link on progress.opensuse.org>

Feel welcome to comment in the progress ticket, which can be shared with more people by default, helps us communicate, lets us edit texts and shows who is assigned.

Things to try

  • Everybody can be "Product Owner" or "Scrum Master" or "Admin" or "Developer" for some time to get the different perspective
  • From time to time ask stakeholders for their list of priorities regarding our tasks
  • Select mob-programming tasks in unblock meetings to deep-dive into in a dedicated meeting

Literature references

Historical

Previously the former QA tools team used target versions "Ready" (to be planned into individual milestone periods or sprints), "Current Sprint" and "Done". However the team never really did use proper time-limited sprints so the distinction was rather vague. After having tickets "Resolved" after some time the PO or someone else would also update the target version to "Done" to signal that the result has been reviewed. This was causing a lot of ticket update noise for not much value considering that the Definition-of-Done when properly followed already has rather strict requirements on when something can be considered really "Resolved" hence the team eventually decided to not use the "Done" target version anymore. Since about 2019-05 (and since okurz is doing more backlog management) the team uses priorities more as well as the status "Workable" together with an explicit team member list for "What the team is working on" to better visualize what is making team members busy regardless of what was "officially" planned to be part of the team's work. So we closed the target version. On 2020-07-03 okurz subsequently closed "Current Sprint" as also this one was in most cases equivalent to just picking an assignee for a ticket or setting to "In Progress". We can just distinguish between "(no version)" meaning untriaged, "Ready" meaning tools team should consider picking up these issues and "future" meaning that there is no plan for this to be picked up. Everything else is defined by status and priority.
On 2020-10-27 we discussed together to find out the history of the team. We clarified that the team started out as a not well defined "Dev+Ops" team. The "team responsibilities" have been mainly unchanged since at least the beginning of 2019. We agreed that learning from users and production about our "Dev" contributions is good, so this part of "Ops" is the responsibility of everyone.

Also see #73060 for more details about how the responsibilities were set up.

Team-internal Hack Week (or Hackweek)

Rules of the game

  • Regular meetings with the exception of the Weekly are cancelled
  • Look into future tickets or other projects that relate to our usual work
  • Backlog priorities are not enforced, short of emergency responses
  • The challenge has to be solved the previous week, weekly to weekly

Extra-ordinary "hack-week" 2020-W51

SUSE QE Tools plans to have an internal "hack-week": Condition: We close 30 tickets from our backlog within the time frame 2020-12-03 until 2020-12-11 start of weekly meeting. No cheating! :) See this query. During week 2020-W51 everyone is allowed to work on any hack-week project, it should just have a reasonable, "explainable" connection to our normal work. okurz volunteers to take over ops-duty for the week.

Result during meeting 2020-12-11: We missed the goal (by a slight amount) but we are motivated to try again in the next year :) Everybody, put some easy tickets aside for the next time!

Extra-ordinary "hack-week" 2021-W8

Similar to our attempt for 2020-W51 with the same rules, except for the condition: We close 30 tickets from our backlog within the time frame 2021-02-05 until 2021-02-19 start of weekly meeting. No cheating! See this query.

Result during meeting 2021-02-19: We missed the goal (25/30 tickets resolved) but again we are open to try again, maybe after next SUSE hack week.

Extra-ordinary "hack-week" 2022-W9

Same as before, with a similar condition: We close 30 tickets from our backlog within the time frame 2022-02-18 until 2022-02-25 start of weekly meeting. No cheating! See this query.

Change announcements

For new, cool features or disruptive changes consider notifying our common userbase as well as potential future users, for example create a post on opensuse-factory@opensuse.org , link to the post on openqa@suse.de , invite to a workshop, #opensuse-factory (IRC) (irc://irc.libera.chat/opensuse-factory), #testing (Slack)

Updated by okurz 6 days ago · 474 revisions