communication #99804

closed

2021-11-02 19:00 UTC: openSUSE Heroes meeting November 2021

Added by cboltz over 3 years ago. Updated over 3 years ago.

Status:

Closed

Priority:

Normal

Assignee:

opensuse-admin

Category:

Event

Target version:

Start date:

2021-10-05

Due date:

% Done:

Estimated time:

Description

Where: https://meet.opensuse.org/heroes
When: 2021-11-02 19:00 UTC / 20:00 CET
Who: The openSUSE Heroes team and everybody else!

Topics
see/use checklist

Note: European summer time ended, therefore the UTC time changed to 19:00.
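The shift can be checked with a small calculation. This is only a sketch: it assumes the meeting stays at 20:00 local time, hard-codes the CEST and CET offsets instead of using a time zone database, and uses the ticket's start date (2021-10-05) merely as an example of a date that still falls in summer time.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, hard-coded for illustration (no time zone database needed).
CEST = timezone(timedelta(hours=2), "CEST")  # Central European Summer Time
CET = timezone(timedelta(hours=1), "CET")    # Central European Time


def meeting_start_utc(year, month, day, local_zone):
    """Return the UTC start of a meeting held at 20:00 local time."""
    local_start = datetime(year, month, day, 20, 0, tzinfo=local_zone)
    return local_start.astimezone(timezone.utc)


# Example date in summer time (assumption: taken from the ticket's start date).
october = meeting_start_utc(2021, 10, 5, CEST)
# The meeting announced in this ticket, after the switch to CET.
november = meeting_start_utc(2021, 11, 2, CET)

print(october.strftime("%H:%M"), november.strftime("%H:%M"))  # 18:00 19:00
```

The local time stays the same; only the UTC equivalent moves by one hour.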

Checklist

Questions and answers from the community
status reports about everything
review old tickets
Contributor Agreement

Updated by cboltz over 3 years ago

Private changed from Yes to No

Updated by lrupp over 3 years ago

Status updates

New machines

provo-galera1.infra.opensuse.org
provo-galera2.infra.opensuse.org
provo-galera3.infra.opensuse.org
nala2.infra.opensuse.org (aka mirrordb4.infra.opensuse.org)

Connected status machines to private network

status3 and status2 are now connected to the internal network. This should make accessing them easier (incl. Salt).
TODO: re-setup status1.infra.opensuse.org on the q.beyond machine.

Updated Matomo

Newest version: 4.5.0

Re-enabled DNS CAA

Re-enabled DNS CAA for opensuse.org. We had it enabled in the past, but it got lost during the migration from FreeIPA to PowerDNS.

> dig +short -t caa opensuse.org
0 iodef "mailto:admin@opensuse.org"
128 issue "letsencrypt.org"

This means that only Let's Encrypt is allowed to issue certificates for the opensuse.org domain. We would need to change this if we want to accept certificates from other certification authorities for any opensuse.org DNS entry.
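To illustrate how such records are read, here is a minimal sketch that parses the two lines of `dig` output shown above and checks whether a given CA would be allowed to issue. It is a simplification: it ignores `issuewild` records and the lookup of parent domains, and the second CA name used in the example (`example-ca.test`) is made up.

```python
import shlex
from typing import List, NamedTuple


class CaaRecord(NamedTuple):
    flags: int
    tag: str
    value: str

    @property
    def critical(self) -> bool:
        # Bit 128 is the "issuer critical" flag.
        return bool(self.flags & 128)


def parse_caa(lines: List[str]) -> List[CaaRecord]:
    """Parse lines in the `dig +short -t caa` format: flags, tag, quoted value."""
    records = []
    for line in lines:
        parts = shlex.split(line)  # shlex removes the quotes around the value
        if len(parts) != 3:
            continue  # skip anything that is not a three-field record
        flags, tag, value = parts
        records.append(CaaRecord(int(flags), tag.lower(), value))
    return records


def may_issue(records: List[CaaRecord], ca_domain: str) -> bool:
    """Return True if `ca_domain` is named in an "issue" record."""
    # Drop optional parameters after ';' (e.g. "ca.example; key=value").
    issuers = [r.value.split(";")[0].strip() for r in records if r.tag == "issue"]
    if not issuers:
        return True  # without any "issue" record, CAA does not restrict issuance
    return ca_domain in issuers


DIG_OUTPUT = [
    '0 iodef "mailto:admin@opensuse.org"',
    '128 issue "letsencrypt.org"',
]
records = parse_caa(DIG_OUTPUT)
print(may_issue(records, "letsencrypt.org"))  # True
print(may_issue(records, "example-ca.test"))  # False
```

The `iodef` record does not restrict anything; it only names a contact address for reports.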

Good news: this brings www.opensuse.org a "Level A+" at SSL Labs from Qualys. (Note: more work in this area is on the way.)

Fixed and upgraded Galera cluster

The outage on 2021-10-20 was the result of two problems:

Nodes not in sync

Changes in the configuration blocked the flushing of logs, which significantly impaired the synchronization between the nodes.
Once this setting was reverted, the nodes began to synchronize themselves again - but with a huge delay of ~2 months. As we had a DB problem around that time, the config settings might be a leftover from then.
Sadly, the nodes were not able to complete their re-syncs. Problem no. 2 (the broken filesystem) might - or might not - be the reason for this. This left us with more or less a single node with current data: galera2.
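A drift like this can be spotted by comparing the Galera status variables of all nodes. The following is a sketch only: it assumes the values of `wsrep_local_state_comment` and `wsrep_last_committed` have already been collected per node (for example via `SHOW GLOBAL STATUS LIKE 'wsrep_%'`), and the node names and numbers in the sample are invented for illustration, not taken from the incident.

```python
from typing import Dict, List


def out_of_sync_nodes(status_by_node: Dict[str, Dict[str, str]],
                      max_lag: int = 0) -> List[str]:
    """Return nodes that are not 'Synced' or lag behind the newest commit."""
    newest = max(int(s["wsrep_last_committed"]) for s in status_by_node.values())
    lagging = []
    for node, status in sorted(status_by_node.items()):
        not_synced = status["wsrep_local_state_comment"] != "Synced"
        behind = newest - int(status["wsrep_last_committed"]) > max_lag
        if not_synced or behind:
            lagging.append(node)
    return lagging


# Invented sample values: one current node, two nodes far behind.
sample = {
    "node-a": {"wsrep_local_state_comment": "Joiner", "wsrep_last_committed": "1200"},
    "node-b": {"wsrep_local_state_comment": "Synced", "wsrep_last_committed": "98000"},
    "node-c": {"wsrep_local_state_comment": "Synced", "wsrep_last_committed": "1500"},
}
print(out_of_sync_nodes(sample))  # ['node-a', 'node-c']
```

Run regularly from a monitoring check, a comparison of this kind would flag lagging nodes long before the delay grows to months.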

Broken filesystem

It turned out that the (xfs) filesystem on galera2 was broken (maybe also a result of the problem 2 months before?).

As a result, the whole cluster was effectively broken. While we initially tried an open-heart operation, we realized after some time that this would in the end just take ages...

In the end, we decided to restore a database dump that we had extracted. (Verifying the dump from the node with the broken filesystem against a dump from another, out-of-sync node took a while.) But wait: if we need to start from scratch anyway, why not use the opportunity to do the needed MariaDB version upgrade?

...and so we ended up:

installing new RPMs (that were updated and published just in the same minute, thanks to Darix)
bootstrapping one node
restoring the DB dump
adding the other two nodes (one after the other)
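The last step, adding nodes one after the other, can be sketched as a small loop that starts the next node only once the previous one reports `Synced`. This is an illustration, not the procedure that was actually run: `start_node` and `node_state` are hypothetical callbacks that a real script would back with a service start and a query of `wsrep_local_state_comment`, and the node names are placeholders.

```python
from typing import Callable, List, Sequence


def join_one_by_one(joiners: Sequence[str],
                    start_node: Callable[[str], None],
                    node_state: Callable[[str], str],
                    max_polls: int = 5) -> List[str]:
    """Start each joiner in turn and wait until it is Synced before continuing."""
    joined = []
    for node in joiners:
        start_node(node)
        for _ in range(max_polls):
            if node_state(node) == "Synced":
                joined.append(node)
                break
        else:
            # Stop here so a second state transfer never runs in parallel.
            raise RuntimeError(f"{node} did not reach Synced; not starting further nodes")
    return joined


# Simulation with fake callbacks: each node needs two polls to become Synced.
started: List[str] = []
polls = {}


def fake_start(node: str) -> None:
    started.append(node)


def fake_state(node: str) -> str:
    polls[node] = polls.get(node, 0) + 1
    return "Synced" if polls[node] >= 2 else "Joiner"


order = join_one_by_one(["node-b", "node-c"], fake_start, fake_state)
print(order)  # ['node-b', 'node-c']
```

Joining sequentially keeps the bootstrapped node from having to serve two full state transfers at once.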

The whole incident (from the first information that the DB is down to a working and running cluster) took 8:30h.

Thanks to everyone involved in this!

Note: there are new machines, provo-galera1, provo-galera2 and provo-galera3, set up in Provo. These machines should form another Galera cluster in the future. An initial test (dump current DBs in Nuremberg -> compress dump -> send via VPN from Nuremberg to Provo -> insert the dump -> start the cluster in Provo) took 3 hours. Most of the time is spent on the dump, transfer and insert. But at least we now know how to migrate our MariaDB data ahead of a known outage.

Upgraded LimeSurvey

Newest version: ~~5.1.16~~ 5.1.17, deployed at https://survey.opensuse.org/

Upgraded GitLab

Updated frontend and all runners to version 14.4.0

Updated by lrupp over 3 years ago

Checklist item changed from [ ] Questions and answers from the community, [ ] status reports about everything, [ ] review old tickets to [ ] Questions and answers from the community, [ ] status reports about everything, [ ] review old tickets, [ ] Contributor Agreement

openSUSE Infrastructure Contributor Agreement

The following text should be an entry point for discussions: inside the openSUSE heroes, inside SUSE - and between both.
URLs that might be helpful in the discussions:

https://opensource.suse.com/suse-open-source-policy.html

https://en.opensuse.org/openSUSE:Guiding_principles

At the moment, all work on the openSUSE heroes infrastructure is voluntary. While SUSE is providing some resources (especially: hardware, storage and network resources), the majority of the openSUSE infrastructure is by now driven by the openSUSE heroes: a group of volunteers.

Even if some of these volunteers are currently SUSE employees, the work they are doing inside the openSUSE heroes is seen as voluntary work, not mandated by SUSE as a company.

On the other side, SUSE still has some legal responsibility. Even more: SUSE has an environmental, cultural and historical responsibility to take care of the openSUSE community.

Included in this responsibility is the right to decide about the usage and non-usage of the openSUSE infrastructure. But with fewer and fewer SUSE employees involved in the openSUSE infrastructure, there is a high potential risk that SUSE will lose control over it. At the moment, there is no special agreement/contract between SUSE and the openSUSE heroes to always be friendly and do nothing problematic with the given permissions. For SUSE employees, there is a signed contract that does not allow the employees to do something evil/harmful against SUSE/openSUSE. But today, with community members having access to the MX and DNS servers of openSUSE, there is a rising risk for the company: a risk that can only be addressed by some kind of Contributor Agreement.

We - as openSUSE heroes - should make clear what we expect from SUSE as a company. We should also agree on something like: "don't be evil" - for our own safety.

Topics that should be provided by SUSE:

handling of critical account data
handling of GDPR requests
sponsoring of needed infrastructure
24/7 remote-hands, if needed
fixed contact persons for the openSUSE heroes
access to the systems providing services driven by openSUSE heroes
a dedicated account for Amazon Web Services
hosting of servers in multiple data centers - for redundancy reasons
a reasonable amount of IPv4 and IPv6 address ranges in the data centers
access to the DNS Registrar
backup capabilities

Topics that can be covered by openSUSE heroes:

general setup of the services
basic operating system maintenance
private network setups between the data centers
contact persons for questions from SUSE

Topics that should be provided by volunteers of the services:

general documentation about the services
general maintenance of the services
security response for the provided services
AppArmor, SELinux or similar security hardening of the services
contact data, in case of a service emergency

Updated by cboltz over 3 years ago

Subject changed from 2021-11-02 18:00 UTC: openSUSE Heroes meeting November 2021 to 2021-11-02 19:00 UTC: openSUSE Heroes meeting November 2021
Description updated (diff)

Updated by cboltz over 3 years ago

Status changed from New to Closed

2021-11-02 heroes meeting

Jitsi:

packaging is broken
that also makes debugging etc. harder because we'd need debug builds
JavaScript + Java (npm + mvn) make handling hard (> 2000 dependencies, packaging automation needed)
Salt help needed - rewrite pending by Lars?
move Grafana dashboard to the public instance metrics.o.o

Status report - Pagure:

Pagure (code.o.o) updated to Leap 15.3
hit permissions problems
upstream update incoming
ssh port access needs fixing again -> AI Bernhard .144:22
need to add HTTPS to code.o.o with a Let's Encrypt cert

matrix-IRC-bridge has trouble

There are lots of status reports in the meeting ticket (see above) and
https://lists.opensuse.org/archives/list/heroes@lists.opensuse.org/message/CDFGBU47JBXGKVNCDJL43BJBTNIRVJZO/
