action #92701
closed
backup of etc/ from both o3 was not working since some days due to OOM on backup.qa.suse.de (was: … and osd not updated anymore since 2019)
Description
Observation¶
$ ls -ltra /home/backup/o*/root-complete/etc/openqa/
/home/backup/o3/root-complete/etc/openqa/:
total 28
-rw-r--r-- 1 root root 174 Jan 19 2015 database.ini
-rw-r----- 1 chrony root 125 Mar 19 2015 client.conf
-rw-r--r-- 1 root root 452 Jan 27 2017 workers.ini
-rw-r--r-- 1 root root 2445 Jun 7 2019 openqa.ini
drwxr-xr-x 98 root root 8192 Jul 1 2019 ..
drwxr-xr-x 2 root root 82 Jul 5 2019 .
/home/backup/osd/root-complete/etc/openqa/:
total 36
-rw-r--r-- 1 root root 174 Jan 16 2015 database.ini.rpmsave
-rw-r----- 1 openslp root 229 Nov 19 2015 client.conf
-rw-r----- 1 1001 root 82 Jul 9 2018 database.ini.rpmnew
-rw-r--r-- 1 1001 root 4058 Jul 31 2019 openqa.ini.rpmnew
drwxr-xr-x 3 root root 18 Aug 15 2019 templates
drwxr-xr-x 3 root root 160 Aug 16 2019 .
drwxr-xr-x 113 root root 8192 Aug 18 2019 ..
-rw-r--r-- 1 1001 root 3434 Aug 18 2019 openqa.ini
-rw-r----- 1 1001 root 194 Aug 18 2019 database.ini
Acceptance criteria¶
- AC1: Automatic update from o3 to backup.qa.suse.de works again
- AC2: Same as AC1 for osd -> #94015
- AC3: Alert in place
Suggestions¶
- Learn about backup.qa.suse.de
- crosscheck if rsnapshot on backup.qa.suse.de can still log in to both o3 and osd and copy data from there
- fix where it breaks
- Look into alerting
- the host has been up for more than 176h; consider using a similar approach as in https://github.com/os-autoinst/openQA/blob/master/script/openqa-auto-update#L27 to ensure automatic upgrades when necessary (see the sketch below)
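It is not stated here what the referenced openqa-auto-update line does exactly, so the following is only a minimal sketch of the general idea (upgrade and reboot the host once it has been up longer than some threshold), e.g. run from a daily cron job; the one-week threshold and the zypper call are assumptions, not the actual script logic:
#!/bin/sh
# Hypothetical sketch, not the actual openqa-auto-update logic: reboot the host
# (after updating packages) once it has been up longer than an assumed threshold.
set -e
max_uptime_seconds=$((7 * 24 * 3600))                 # assumed threshold: one week
uptime_seconds=$(awk '{print int($1)}' /proc/uptime)  # current uptime in seconds
if [ "$uptime_seconds" -gt "$max_uptime_seconds" ]; then
    zypper --non-interactive dup || true              # assumed: update packages first
    systemctl reboot
fi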
Further details¶
- Q: What is backup.qa.suse.de?
- A: A VM running on qamaster with a big volume for backup data. See https://gitlab.suse.de/qa-sle/qanet-configs/-/blob/master/etc/dhcpd.conf#L64 for the dhcp entry with a description as well. https://gitlab.suse.de/qa-sle/backup-server-salt is the project with salt config for the backup host
Updated by mkittler over 3 years ago
I've just tried whether I can connect to the backup VM via SSH. I wanted to see what services are running and noticed that every time I use systemctl status … the SSH connection is terminated, e.g.
backup-vm:~ # systemctl status dbus.service
Connection to backup.qa.suse.de closed by remote host.
Connection to backup.qa.suse.de closed.
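Such forced disconnects are often a symptom of the OOM killer terminating the session's processes; assuming the kernel log on the VM is still reachable, something like this would confirm it:
# check the kernel log for OOM killer activity (run on the backup VM)
journalctl -k | grep -iE 'out of memory|oom-?killer'
# fallback if the journal is not persistent
dmesg | grep -i oom
# see which processes currently consume the most memory
ps -eo pid,rss,comm --sort=-rss | head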
Updated by mkittler over 3 years ago
- Related to action #88546: Make use of the new "Storage Server", e.g. complete OSD backup added
Updated by okurz over 3 years ago
- Subject changed from backup of etc/ from both o3 and osd not updated anymore since 2019 to backup of etc/ from both o3 was not working since some days due to OOM on backup.qa.suse.de (was: … and osd not updated anymore since 2019)
- Status changed from Workable to In Progress
- Assignee set to mkittler
The backup in /home/backup actually only contains manually created backups. The automatic backups conducted by rsnapshot go into /home/rsnapshot and they work just fine, with the exception of backup.qa.suse.de having been stuck in an OOM condition for some days.
Updated by mkittler over 3 years ago
The host was stuck because it ran out of memory. It isn't clear what caused this condition. The automatic backups are actually stored under /home/rsnapshot/ and the files under /home/backup/ which are mentioned in the ticket description have been created manually (so it is no surprise that they're not updated). The actual automatic backups seem to work, e.g. triggering rsnapshot alpha manually worked and the cron configuration is actually in place.
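For crosschecking the setup, rsnapshot itself offers a config check and a dry-run mode; the commands below assume the default config location and the interval names visible under /home/rsnapshot/:
# validate /etc/rsnapshot.conf (path assumed, it is the default)
rsnapshot configtest
# dry run: print the commands "rsnapshot alpha" would execute without copying anything
rsnapshot -t alpha
# confirm the cron entries that trigger the alpha/beta/gamma/delta intervals
grep -r rsnapshot /etc/cron* /var/spool/cron 2>/dev/null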
the host is up since more than 176h …
I've now enabled our usual salt setup for the host so it should now reboot weekly.
The host has almost 4 GiB of memory which should actually be more than enough considering only a small number of services run on the host. Maybe it makes sense to add some basic graphs/alerts in Grafana for the backup host so we can keep an eye on it.
Updated by openqa_review over 3 years ago
- Due date set to 2021-06-30
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 3 years ago
SR for adding the generic monitoring for the host: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/507
With https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/da6e34e0121f1f4f4042ef3e4687873311e6e228 the systemd services monitoring/alert should now cover the backup host as well.
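Independent of Grafana, the condition that alert is meant to catch can also be checked directly on the host, e.g.:
# list units currently in the "failed" state on the backup host
systemctl --failed --no-legend
# quick ad-hoc check usable in a script
[ -z "$(systemctl --failed --no-legend)" ] && echo OK || echo 'failed units present'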
Updated by mkittler over 3 years ago
- Status changed from In Progress to Feedback
Judging by the timestamps in /home/rsnapshot/ it looks like the backup is still performed automatically (at least the alpha one).
Updated by okurz over 3 years ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/507 merged. If you can ensure that our generic alerts like "failed systemd services" are covered for the backup host as well I think that should be enough. Having a check for free space on the backup host would be nice on top.
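Until there is a proper panel for it, a free-space check could be as simple as the sketch below; the mount point and threshold are assumptions:
#!/bin/sh
# Hypothetical ad-hoc check: warn when the backup volume gets too full.
mount_point=/home    # assumed location of the backup data
threshold=90         # assumed maximum fill level in percent
usage=$(df --output=pcent "$mount_point" | tail -n 1 | tr -dc '0-9')
if [ "$usage" -ge "$threshold" ]; then
    echo "WARNING: $mount_point is ${usage}% full" >&2
    exit 1
fi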
Updated by mkittler over 3 years ago
SR for fixing the memory alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/510
Updated by okurz over 3 years ago
The MR was merged but deployment failed in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/468624. I don't want to check manually if the relevant part was deployed as we should fix the problems in the deployment anyway.
Updated by mkittler over 3 years ago
With https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/510 the memory alert works and I enabled it again.
And yes, the pipeline still fails. The remaining errors are about user creation, e.g.:
ID: ldevulder
Function: user.present
Result: False
Comment: These values could not be changed: {'home': '/home/ldevulder'}
Started: 11:14:22.376449
Duration: 27.562 ms
I still don't know why that's the case.
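The salt message only says that the home value could not be changed; to see what salt is comparing against, one can inspect the current entry and directory on the host (the user name is taken from the error above):
# show the passwd entry salt is trying to reconcile
getent passwd ldevulder
# check whether the home directory exists and who owns it
ls -ld /home/ldevulder
# if only the passwd field is off, it could be fixed manually, e.g.:
# usermod -d /home/ldevulder ldevulder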
Updated by mkittler over 3 years ago
The memory alert is fixed and with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/513#note_327858 also the ping-based alert. What remains are the strange user creation errors.
Updated by mkittler over 3 years ago
The user problem is fixed; /etc/passwd was broken again. I hope it won't break again.
I'm keeping this ticket open until https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/514 has been merged.
Updated by okurz over 3 years ago
mkittler wrote:
The user problem is fixed; /etc/passwd was broken again. I hope it won't break again.
What do you mean by "broken again"? When has that happened before?
Updated by mkittler over 3 years ago
The first time I ran salt on the machine the last entry in the file was broken, causing many errors. Removing the line helped. At some point only these user-related errors remained (but I hadn't checked /etc/passwd again immediately because I didn't expect it to break again).
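A broken trailing entry like that can be spotted (and the file edited safely) with the standard shadow tools, e.g.:
# read-only consistency check of the passwd/group files
pwck -r
grpck -r
# edit /etc/passwd with proper locking
vipw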
Updated by okurz over 3 years ago
hm, ok. We create users properly with salt so I don't think we should try to do something on top. Well, good enough :) Can you verify AC1+AC3?
Updated by mkittler over 3 years ago
I've read the comment but currently the VPN is down so no, I cannot verify it at the moment.
Updated by mkittler over 3 years ago
- AC1: It looks good:
martchus@backup-vm:~> ls -ltra /home/rsnapshot
total 8
drwxr-xr-x 4 root root 50 25. Jun 2020 _delete.14889
drwxr-xr-x 4 root root 50 26. Jun 2020 _delete.15189
drwxr-xr-x 6 root root 109 29. Jan 04:04 delta.2
drwxr-xr-x 6 root root 109 26. Feb 04:03 delta.1
drwxr-xr-x 6 root root 109 26. Mär 04:04 delta.0
drwxr-xr-x 6 root root 109 14. Mai 04:05 gamma.3
drwxr-xr-x 6 root root 109 20. Mai 08:05 _delete.651
drwxr-xr-x 6 root root 109 21. Mai 04:12 gamma.2
drwxr-xr-x 6 root root 109 22. Mai 00:05 _delete.15401
drwxr-xr-x 6 root root 109 22. Mai 12:04 _delete.20003
drwxr-xr-x 4 root root 65 26. Mai 12:04 _delete.14362
drwxr-xr-x 6 root root 109 26. Mai 20:04 _delete.26284
drwxr-xr-x 6 root root 109 29. Mai 04:04 gamma.1
drwxr-xr-x 4 root root 65 29. Mai 12:04 _delete.8886
drwxr-xr-x 6 root root 109 29. Mai 16:04 _delete.16796
drwxr-xr-x 6 root root 109 29. Mai 20:04 _delete.22377
drwxr-xr-x 4 root root 65 29. Mai 20:04 _delete.12833
drwxr-xr-x 6 root root 109 30. Mai 00:04 _delete.2735
drwxr-xr-x 4 root root 50 30. Mai 08:00 _delete.14087
drwxr-xr-x 4 root root 50 31. Mai 16:00 _delete.23694
drwxr-xr-x 4 root root 50 2. Jun 00:00 _delete.13239
drwxr-xr-x 4 root root 50 2. Jun 20:00 _delete.16276
drwxr-xr-x 4 root root 50 3. Jun 00:00 _delete.22317
drwxr-xr-x 4 root root 50 3. Jun 08:00 _delete.27611
drwxr-xr-x 4 root root 50 3. Jun 12:00 _delete.1851
drwxr-xr-x 4 root root 50 4. Jun 04:00 _delete.8595
drwxr-xr-x 4 root root 50 4. Jun 08:00 _delete.12420
drwxr-xr-x 4 root root 50 4. Jun 16:00 _delete.16685
drwxr-xr-x 4 root root 50 4. Jun 20:00 _delete.19139
drwxr-xr-x 56 root root 4096 15. Jun 14:56 ..
drwxr-xr-x 6 root root 109 18. Jun 04:04 gamma.0
drwxr-xr-x 6 root root 109 23. Jun 04:03 beta.6
drwxr-xr-x 6 root root 109 24. Jun 04:03 beta.5
drwxr-xr-x 6 root root 109 25. Jun 04:03 beta.4
drwxr-xr-x 6 root root 109 26. Jun 04:03 beta.3
drwxr-xr-x 6 root root 109 27. Jun 04:03 beta.2
drwxr-xr-x 6 root root 109 28. Jun 04:03 beta.1
drwxr-xr-x 6 root root 109 29. Jun 04:03 beta.0
drwxr-xr-x 6 root root 109 29. Jun 12:04 alpha.5
drwxr-xr-x 6 root root 109 29. Jun 16:03 alpha.4
drwxr-xr-x 6 root root 109 29. Jun 20:04 alpha.3
drwxr-xr-x 6 root root 109 30. Jun 00:04 alpha.2
drwxr-xr-x 6 root root 109 30. Jun 04:04 alpha.1
drwxr-xr-x 6 root root 109 30. Jun 08:03 alpha.0
drwxr-xr-x 45 root root 4096 30. Jun 08:04 .
- AC3: The alert is configured in exactly the same way as for the worker dashboards so I expect it to work (see https://stats.openqa-monitor.qa.suse.de/d/GDbackup-vm/dashboard-for-backup-vm?tab=alert&editPanel=65090&viewPanel=65090&orgId=1&refresh=1m vs. https://stats.openqa-monitor.qa.suse.de/d/WDgrenache-1/worker-dashboard-grenache-1?tab=alert&editPanel=65090&viewPanel=65090&orgId=1&refresh=1m)