communication #99804

closed

2021-11-02 19:00 UTC: openSUSE Heroes meeting November 2021

Added by cboltz about 3 years ago. Updated about 3 years ago.

Status:
Closed
Priority:
Normal
Category:
Event
Target version:
-
Start date:
2021-10-05
Due date:
% Done:

0%

Estimated time:

Description

Where: https://meet.opensuse.org/heroes
When: 2021-11-02 19:00 UTC / 20:00 CET
Who: The openSUSE Heroes team and everybody else!

Topics
see/use checklist

Note: European summer time ended, therefore the UTC time changed to 19:00.


Checklist

  • Questions and answers from the community
  • status reports about everything
  • review old tickets
  • Contributor Agreement
Actions #1

Updated by cboltz about 3 years ago

  • Private changed from Yes to No
Actions #2

Updated by lrupp about 3 years ago

Status updates

New machines

  • provo-galera1.infra.opensuse.org
  • provo-galera2.infra.opensuse.org
  • provo-galera3.infra.opensuse.org
  • nala2.infra.opensuse.org (aka mirrordb4.infra.opensuse.org)

Connected status machines to private network

status3 and status2 are now connected to the internal network. This should make accessing them easier (incl. Salt).
TODO: re-setup status1.infra.opensuse.org on the q.beyond machine.

Updated matomo

Newest version: 4.5.0

Re-enabled DNS CAA

Re-enabled DNS CAA for opensuse.org. We had it enabled in the past, but it got lost during the migration from FreeIPA to PowerDNS.

> dig +short -t caa opensuse.org
0 iodef "mailto:admin@opensuse.org"
128 issue "letsencrypt.org"

This means that only Let's Encrypt is allowed to issue certificates for the opensuse.org domain. We would need to change this if we want to accept certificates from other certification authorities for any opensuse.org DNS entry.
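To illustrate how the CAA answer above is read, here is a minimal sketch in Python. It parses the two lines of dig output quoted above and applies a simplified reading of the CAA rules (RFC 8659): if there is no "issue" record, any CA may issue; otherwise the CA must be listed. The helper names (parse_caa_line, may_issue) and the second CA name are made up for this example, and wildcard ("issuewild") handling and lookups on parent domains are left out.

```python
from __future__ import annotations

from typing import NamedTuple


class CaaRecord(NamedTuple):
    flags: int
    tag: str
    value: str


def parse_caa_line(line: str) -> CaaRecord:
    """Parse one line of `dig +short -t caa` output, e.g. '128 issue "letsencrypt.org"'."""
    flags, tag, value = line.strip().split(maxsplit=2)
    return CaaRecord(int(flags), tag.lower(), value.strip('"'))


def may_issue(records: list[CaaRecord], ca_domain: str) -> bool:
    """Return True if ca_domain may issue non-wildcard certificates.

    Simplified: with no "issue" records at all, issuance is unrestricted;
    otherwise the CA's domain must appear in one of them. Parameters after
    ';' in the record value are ignored.
    """
    issuers = [
        r.value.split(";")[0].strip().lower() for r in records if r.tag == "issue"
    ]
    if not issuers:
        return True
    return ca_domain.lower() in issuers


# The dig output quoted in this ticket.
DIG_OUTPUT = '''0 iodef "mailto:admin@opensuse.org"
128 issue "letsencrypt.org"'''

records = [parse_caa_line(line) for line in DIG_OUTPUT.splitlines()]
print(may_issue(records, "letsencrypt.org"))   # True
print(may_issue(records, "example-ca.test"))   # False (hypothetical CA name)
```

The flag value 128 on the "issue" record marks it as critical, so a CA that does not understand the property must refuse to issue.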

Good news: this earns www.opensuse.org an "A+" rating at SSL Labs from Qualys. (Note: more work in this area is on the way.)

Fixed and Upgraded Galera cluster

The outage on 2021-10-20 was the result of two problems:

Nodes not in sync

  • Changes in the configuration blocked the flushing of logs -> this significantly impaired the synchronization between the nodes
  • Once this setting was reverted, the nodes began to synchronize again, but with a huge backlog of ~2 months. As we had a DB problem around that time, the config settings might be a leftover from then.
  • Sadly, the nodes were not able to complete their re-syncs. Problem no. 2 might, or might not, be the reason for this. This left us with essentially a single node holding current data: galera2
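A minimal sketch of how such out-of-sync nodes could be spotted, assuming the wsrep status variables of each node have already been collected (for example via SHOW GLOBAL STATUS LIKE 'wsrep_%'). wsrep_local_state_comment and wsrep_last_committed are standard Galera status variables; the function name, the node names galera1 and galera3, and all sample values are invented for illustration and are not data from the incident.

```python
from __future__ import annotations


def lagging_nodes(status: dict[str, dict[str, str]], max_lag: int = 0) -> list[str]:
    """Return nodes that are not "Synced" or are behind the most advanced node.

    `status` maps a node name to its wsrep status variables.
    `max_lag` is the tolerated difference in committed transactions.
    """
    newest = max(int(s["wsrep_last_committed"]) for s in status.values())
    lagging = []
    for node, s in sorted(status.items()):
        behind = newest - int(s["wsrep_last_committed"])
        if s["wsrep_local_state_comment"] != "Synced" or behind > max_lag:
            lagging.append(node)
    return lagging


# Invented sample values (assumption), only to show the shape of the input.
sample = {
    "galera1": {"wsrep_local_state_comment": "Donor/Desynced", "wsrep_last_committed": "1000"},
    "galera2": {"wsrep_local_state_comment": "Synced", "wsrep_last_committed": "5000"},
    "galera3": {"wsrep_local_state_comment": "Synced", "wsrep_last_committed": "1200"},
}
print(lagging_nodes(sample))  # ['galera1', 'galera3']
```

On a busy cluster the commit counters differ slightly between nodes at any moment, so a real check would probably use a max_lag larger than 0.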

Broken filesystem

  • It turned out that the (xfs) filesystem on galera2 was broken (maybe also a result of the problem 2 months earlier?)

As a result, the whole cluster was broken. While we initially tried an open-heart operation, we realized after some time that this would just take ages...

In the end, we decided to restore a database dump that we had extracted (that is: the dump from the node with the broken filesystem took a while to verify against a dump from another, out-of-sync node). But wait: if we need to start from scratch anyway, why not use this to do the needed version upgrade of MariaDB?

...and so we ended up:

  • installing new RPMs (which were updated and published that very minute, thanks to Darix)
  • bootstrapping one node
  • restoring the DB dump
  • adding the other two nodes (one after the other)
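The order of the steps above matters: one node is bootstrapped as a new cluster, the dump is restored there, and only then do the other nodes join, one at a time. A rough sketch of that ordering follows. The exact commands used during the incident are not recorded in this ticket, so the commands below are assumptions: galera_new_cluster is the bootstrap helper shipped with MariaDB on systemd systems, the dump path is hypothetical, and the RPM installation step is left out because the package names are not known here.

```python
from __future__ import annotations


def rebuild_plan(nodes: list[str], dump_path: str) -> list[tuple[str, str]]:
    """Return ordered (host, command) steps for rebuilding a cluster from a dump.

    The first node is bootstrapped and receives the dump; every further node
    is started afterwards, one after the other, so it can sync from the
    already running cluster.
    """
    if not nodes:
        raise ValueError("need at least one node")
    first, rest = nodes[0], nodes[1:]
    steps = [
        (first, "galera_new_cluster"),        # bootstrap a new cluster on one node
        (first, f"mysql < {dump_path}"),      # restore the verified dump
    ]
    for node in rest:
        steps.append((node, "systemctl start mariadb"))  # join, one node at a time
    return steps


# Hypothetical node names and dump path, for illustration only.
for host, command in rebuild_plan(["node1", "node2", "node3"], "/tmp/dump.sql"):
    print(f"{host}: {command}")
```

The function only builds the plan and does not run anything, which keeps it safe to try out.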

The whole incident (from the first report that the DB was down to a working and running cluster) took 8 hours and 30 minutes.

Thanks to everyone involved in this!

Note: there are new machines set up in Provo: provo-galera1, provo-galera2 and provo-galera3. These machines should form another Galera cluster in the future. An initial test (dump current DBs in Nuremberg -> compress dump -> send via VPN from Nuremberg to Provo -> insert the dump -> start the cluster in Provo) took 3 hours. Most of the time is spent on the dump, transfer and insert. But at least we now know how to migrate our MariaDB data ahead of a known outage.

Upgraded Limesurvey

Newest version: 5.1.17, deployed at https://survey.opensuse.org/

Upgraded Gitlab

Updated Frontend and all Runners to version: 14.4.0

Actions #3

Updated by lrupp about 3 years ago

  • Checklist item changed from [ ] Questions and answers from the community, [ ] status reports about everything, [ ] review old tickets to [ ] Questions and answers from the community, [ ] status reports about everything, [ ] review old tickets, [ ] Contributor Agreement

openSUSE Infrastructure Contributor Agreement

The following text should be an entry point for discussions: inside the openSUSE heroes, inside SUSE, and between both.
URLs that might be helpful in the discussions:

At the moment, all work on the openSUSE heroes infrastructure is voluntary. While SUSE is providing some resources (especially: hardware, storage and network resources), the majority of the openSUSE infrastructure is by now driven by the openSUSE heroes: a group of volunteers.

Even if some of these volunteers are currently SUSE employees, the work they are doing inside the openSUSE heroes is seen as voluntary work, not mandated by SUSE as a company.

On the other side, SUSE still has some legal responsibility. Even more: SUSE has an environmental, cultural and historical responsibility to take care of the openSUSE community.

Included in this responsibility is the right to decide about the usage and non-usage of the openSUSE infrastructure. But with fewer and fewer SUSE employees involved in the openSUSE infrastructure, there is a high potential risk that SUSE will somehow lose control over it. At the moment, there is no special agreement/contract between SUSE and the openSUSE heroes to be always friendly and do nothing problematic with the given permissions. For SUSE employees, there is a signed contract that does not allow the employees to do something evil/harmful against SUSE/openSUSE. But today, with community members having access to the MX and DNS servers of openSUSE, there is a rising risk for the company. It is a risk that can only be addressed by some Contributor Agreement.

We - as openSUSE heroes - should make clear what we expect from SUSE as a company. We should also agree on something like: "don't be evil" - for our own safety.

Topics that should be provided by SUSE:

  • handling of critical account data
  • handling of GDPR requests
  • sponsoring of needed infrastructure
  • 24/7 remote-hands, if needed
  • fixed contact persons for the openSUSE heroes
  • access to the systems providing services driven by openSUSE heroes
  • a dedicated account for Amazon Web Services
  • hosting of servers in multiple data centers - for redundancy reasons
  • a reasonable amount of IPv4 and IPv6 address ranges in the data centers
  • access to the DNS Registrar
  • backup capabilities

Topics that can be covered by openSUSE heroes:

  • general setup of the services
  • basic operating system maintenance
  • private network setups between the data centers
  • contact persons for questions from SUSE

Topics that should be provided by volunteers of the services:

  • general documentation about the services
  • general maintenance of the services
  • security response for the provided services
  • AppArmor, SELinux or similar security hardening of the services
  • contact data, in case of a service emergency
Actions #4

Updated by cboltz about 3 years ago

  • Subject changed from 2021-11-02 18:00 UTC: openSUSE Heroes meeting November 2021 to 2021-11-02 19:00 UTC: openSUSE Heroes meeting November 2021
  • Description updated (diff)
Actions #5

Updated by cboltz about 3 years ago

  • Status changed from New to Closed

2021-11-02 heroes meeting

Jitsi:

  • packaging is broken
  • that also makes debugging etc. harder because we'd need debug builds
  • JavaScript+Java (npm+mvn) make handling hard (> 2000 dependencies, packaging automation needed)
  • Salt help needed - rewrite pending by Lars?
  • Move Grafana Dashboard to Public Instance metrics.o.o

Status report - Pagure:

  • Pagure (code.o.o) updated to Leap 15.3
  • hit permissions problems
  • upstream update incoming
  • ssh port access needs fixing again -> AI Bernhard .144:22
  • need to add HTTPS to code.o.o with a Let's Encrypt cert

matrix-IRC-bridge has trouble

There are lots of status reports in the meeting ticket (see above) and
https://lists.opensuse.org/archives/list/heroes@lists.opensuse.org/message/CDFGBU47JBXGKVNCDJL43BJBTNIRVJZO/
