action #92701
closed
backup of etc/ from both o3 was not working for some days due to OOM on backup.qa.suse.de (was: … and osd not updated anymore since 2019)
Added by okurz almost 4 years ago.
Updated over 3 years ago.
Description
Observation
$ ls -ltra /home/backup/o*/root-complete/etc/openqa/
/home/backup/o3/root-complete/etc/openqa/:
total 28
-rw-r--r-- 1 root root 174 Jan 19 2015 database.ini
-rw-r----- 1 chrony root 125 Mar 19 2015 client.conf
-rw-r--r-- 1 root root 452 Jan 27 2017 workers.ini
-rw-r--r-- 1 root root 2445 Jun 7 2019 openqa.ini
drwxr-xr-x 98 root root 8192 Jul 1 2019 ..
drwxr-xr-x 2 root root 82 Jul 5 2019 .
/home/backup/osd/root-complete/etc/openqa/:
total 36
-rw-r--r-- 1 root root 174 Jan 16 2015 database.ini.rpmsave
-rw-r----- 1 openslp root 229 Nov 19 2015 client.conf
-rw-r----- 1 1001 root 82 Jul 9 2018 database.ini.rpmnew
-rw-r--r-- 1 1001 root 4058 Jul 31 2019 openqa.ini.rpmnew
drwxr-xr-x 3 root root 18 Aug 15 2019 templates
drwxr-xr-x 3 root root 160 Aug 16 2019 .
drwxr-xr-x 113 root root 8192 Aug 18 2019 ..
-rw-r--r-- 1 1001 root 3434 Aug 18 2019 openqa.ini
-rw-r----- 1 1001 root 194 Aug 18 2019 database.ini
Acceptance criteria
- AC1: Automatic update from o3 to backup.qa.suse.de works again
- AC2: Same as AC1 for osd -> #94015
- AC3: Alert in place
Suggestions
Further details
I've just tried to connect to the backup VM via SSH to see what services are running and noticed that every time I use systemctl status …
the SSH connection is terminated, e.g.
backup-vm:~ # systemctl status dbus.service
Connection to backup.qa.suse.de closed by remote host.
Connection to backup.qa.suse.de closed.
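This behaviour is consistent with the out-of-memory condition mentioned in the subject. As a hedged aside, a quick way to check for OOM kills in such a situation, using standard commands that are not specific to this host:

# check the kernel log for OOM killer activity
dmesg -T | grep -i 'killed process'
# or the same via the journal for the current boot
journalctl -k -b | grep -i 'out of memory'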
- Related to action #88546: Make use of the new "Storage Server", e.g. complete OSD backup added
- Subject changed from backup of etc/ from both o3 and osd not updated anymore since 2019 to backup of etc/ from both o3 was not working for some days due to OOM on backup.qa.suse.de (was: … and osd not updated anymore since 2019)
- Status changed from Workable to In Progress
- Assignee set to mkittler
The backup in /home/backup actually only contains manually created backups. The automatic backups conducted by rsnapshot go into /home/rsnapshot and they work just fine, with the exception of backup.qa.suse.de having been stuck in an OOM condition for some days.
The automatic backup for o3 was initially introduced with #44078 and never covered osd, so that part goes into a separate ticket. Consider OSD out of scope here and covered in the new ticket #94015.
The host was stuck because it ran out of memory. It isn't clear what caused this condition. The automatic backups are actually stored under /home/rsnapshot/ and the files under /home/backup/ mentioned in the ticket description have been created manually (so it is no surprise that they're not updated). The actual automatic backups seem to work, e.g. triggering rsnapshot alpha manually worked and the cron configuration is actually in place.
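For reference, a minimal sketch of what such an rsnapshot cron setup typically looks like; the intervals, times and file path below are assumptions for illustration, not necessarily the exact configuration on backup.qa.suse.de:

# hypothetical /etc/cron.d/rsnapshot excerpt: frequent "alpha" snapshots every
# four hours, less frequent beta/gamma/delta rotations (daily/weekly/monthly)
0 */4 * * *   root  /usr/bin/rsnapshot alpha
30 3 * * *    root  /usr/bin/rsnapshot beta
0 3 * * 1     root  /usr/bin/rsnapshot gamma
30 2 1 * *    root  /usr/bin/rsnapshot delta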
The host has been up for more than 176h …
I've now enabled our usual salt setup for the host, so it should reboot weekly.
The host has almost 4 GiB of memory, which should actually be more than enough considering that only a small number of services run on the host. Maybe it makes sense to add some basic graphs/alerts in Grafana for the backup host so we can keep an eye on it.
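For what it's worth, both points (uptime and available memory) can be checked quickly on the host with standard commands, nothing specific to our setup:

# how long the host has been running since the last reboot
uptime -p
# total, used and available memory in human-readable units
free -h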
- Due date set to 2021-06-30
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Feedback
Judging by the timestamps in /home/rsnapshot/ it looks like the backup is still performed automatically (at least the alpha one).
- Description updated (diff)
With https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/510 the memory alert works and I enabled it again.
And yes, the pipeline still fails. The remaining errors are about user creation, e.g.:
ID: ldevulder
Function: user.present
Result: False
Comment: These values could not be changed: {'home': '/home/ldevulder'}
Started: 11:14:22.376449
Duration: 27.562 ms
I still don't know why that's the case.
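For context, a rough sketch of the kind of salt state behind this result; the actual state in our salt setup may differ, this is only an illustration of user.present with a managed home directory:

# hypothetical users.sls excerpt: ensure the user exists and has a home directory
ldevulder:
  user.present:
    - home: /home/ldevulder
    - createhome: True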
mkittler wrote:
The user problem is fixed, /etc/passwd was broken again. I hope it won't break again.
What do you mean by "broken again"? When did that happen before?
The first time I ran salt on the machine the last entry in the file was broken, causing many errors. Removing the line helped. At some point only these user-related errors remained (but I didn't check /etc/passwd again immediately because I didn't expect it to break again).
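As an aside, malformed /etc/passwd entries like the one described here can be spotted with pwck from the shadow tools, nothing specific to this host:

# read-only integrity check of the passwd file; reports broken or incomplete entries
pwck -r /etc/passwd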
hm, ok. We create users properly with salt so I don't think we should try to do something on top. Well, good enough :) Can you verify AC1+AC3?
I've read the comment but currently the VPN is down so no, I cannot verify it at the moment.
martchus@backup-vm:~> ls -ltra /home/rsnapshot
total 8
drwxr-xr-x 4 root root 50 25. Jun 2020 _delete.14889
drwxr-xr-x 4 root root 50 26. Jun 2020 _delete.15189
drwxr-xr-x 6 root root 109 29. Jan 04:04 delta.2
drwxr-xr-x 6 root root 109 26. Feb 04:03 delta.1
drwxr-xr-x 6 root root 109 26. Mar 04:04 delta.0
drwxr-xr-x 6 root root 109 14. May 04:05 gamma.3
drwxr-xr-x 6 root root 109 20. May 08:05 _delete.651
drwxr-xr-x 6 root root 109 21. May 04:12 gamma.2
drwxr-xr-x 6 root root 109 22. May 00:05 _delete.15401
drwxr-xr-x 6 root root 109 22. May 12:04 _delete.20003
drwxr-xr-x 4 root root 65 26. May 12:04 _delete.14362
drwxr-xr-x 6 root root 109 26. May 20:04 _delete.26284
drwxr-xr-x 6 root root 109 29. May 04:04 gamma.1
drwxr-xr-x 4 root root 65 29. May 12:04 _delete.8886
drwxr-xr-x 6 root root 109 29. May 16:04 _delete.16796
drwxr-xr-x 6 root root 109 29. May 20:04 _delete.22377
drwxr-xr-x 4 root root 65 29. May 20:04 _delete.12833
drwxr-xr-x 6 root root 109 30. May 00:04 _delete.2735
drwxr-xr-x 4 root root 50 30. May 08:00 _delete.14087
drwxr-xr-x 4 root root 50 31. May 16:00 _delete.23694
drwxr-xr-x 4 root root 50 2. Jun 00:00 _delete.13239
drwxr-xr-x 4 root root 50 2. Jun 20:00 _delete.16276
drwxr-xr-x 4 root root 50 3. Jun 00:00 _delete.22317
drwxr-xr-x 4 root root 50 3. Jun 08:00 _delete.27611
drwxr-xr-x 4 root root 50 3. Jun 12:00 _delete.1851
drwxr-xr-x 4 root root 50 4. Jun 04:00 _delete.8595
drwxr-xr-x 4 root root 50 4. Jun 08:00 _delete.12420
drwxr-xr-x 4 root root 50 4. Jun 16:00 _delete.16685
drwxr-xr-x 4 root root 50 4. Jun 20:00 _delete.19139
drwxr-xr-x 56 root root 4096 15. Jun 14:56 ..
drwxr-xr-x 6 root root 109 18. Jun 04:04 gamma.0
drwxr-xr-x 6 root root 109 23. Jun 04:03 beta.6
drwxr-xr-x 6 root root 109 24. Jun 04:03 beta.5
drwxr-xr-x 6 root root 109 25. Jun 04:03 beta.4
drwxr-xr-x 6 root root 109 26. Jun 04:03 beta.3
drwxr-xr-x 6 root root 109 27. Jun 04:03 beta.2
drwxr-xr-x 6 root root 109 28. Jun 04:03 beta.1
drwxr-xr-x 6 root root 109 29. Jun 04:03 beta.0
drwxr-xr-x 6 root root 109 29. Jun 12:04 alpha.5
drwxr-xr-x 6 root root 109 29. Jun 16:03 alpha.4
drwxr-xr-x 6 root root 109 29. Jun 20:04 alpha.3
drwxr-xr-x 6 root root 109 30. Jun 00:04 alpha.2
drwxr-xr-x 6 root root 109 30. Jun 04:04 alpha.1
drwxr-xr-x 6 root root 109 30. Jun 08:03 alpha.0
drwxr-xr-x 45 root root 4096 30. Jun 08:04 .
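For readers unfamiliar with the naming: the alpha.N/beta.N/gamma.N/delta.N directories are rsnapshot's rotated snapshot levels, and the _delete.NNNN directories appear to be leftovers of its lazy_deletes option. A minimal sketch of the corresponding retain configuration, with counts merely inferred from the listing above and not verified against the host:

# hypothetical /etc/rsnapshot.conf excerpt (fields must be separated by tabs)
retain	alpha	6
retain	beta	7
retain	gamma	4
retain	delta	3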
- Status changed from Feedback to Resolved