action #60923 (closed)
[alert] /srv about to run full, postgres logs very big due to repeated error "duplicate key value violates unique constraint "screenshots_filename", Key (filename)=(8ca/3c9/98a00d8bb2ccba5a2de1d403b5.png) already exists. INSERT INTO screenshots …"
Description
Observation
On Wednesday, 11 December 2019 20.55.01 CET Grafana wrote:
[Alerting] File systems alert
One of the file systems is too full.
Metric name: /srv: Used Percentage
Value: 90.151
It seems /srv will be full soon. The usage appears to have been growing since the start of October:
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=74&orgId=1&refresh=30s&from=now-6M&to=now
On osd I can see that /srv/PSQL10/data/log has 39G of log files.
There is an error message repeated multiple times per second, first occurrence in postgresql-2019-10-23_000000.log:
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]ERROR: duplicate key value violates unique constraint "screenshots_filename"
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]DETAIL: Key (filename)=(68c/0ee/3ceb7071b7b9364aa721b3421e.png) already exists.
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]STATEMENT: INSERT INTO screenshots ( filename, t_created) VALUES ( $1, $2 ) RETURNING id
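The application-side fix is not part of this ticket (it is tracked separately in #60926, see below), but for reference: since version 9.5 PostgreSQL can skip such an insert when the row already exists via ON CONFLICT. A minimal sketch, assuming the screenshots table and columns from the statement above; this is not the actual openQA change:
-- hypothetical sketch, not the actual openQA fix: skip the insert when the
-- filename already exists instead of raising a unique-constraint error;
-- note that RETURNING id then yields no row for an existing filename, so
-- the caller would have to look the id up separately
INSERT INTO screenshots (filename, t_created)
VALUES ($1, $2)
ON CONFLICT (filename) DO NOTHING
RETURNING id;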
Problem
We had quite a few postgres errors before which apparently we did not care about, but now we have a severe case causing the log files to grow so much that multiple files per day are written. Either postgres never cleans up old rotated files or it has not reached its limit yet. The same happens on o3, but there we have a bit more space on the partition holding the database.
Updated by okurz about 5 years ago
- Copied to action #60926: postgres error "duplicate key value violates unique constraint "screenshots_filename", Key (filename)=(8ca/3c9/98a00d8bb2ccba5a2de1d403b5.png) already exists. INSERT INTO screenshots …" added
Updated by okurz about 5 years ago
I have deleted postgresql-2019-11-*, which gives us a bit of time to solve the issue. This ticket shall focus on the admin impact. I created #60926 for the openQA bug itself.
Updated by okurz about 5 years ago
- Priority changed from Urgent to High
I could already resolve #60926 and deploy the change to o3, so even though there are many log files there as well I do not think we need to delete any. For osd, next week's deployment should suffice, or we deploy in between. What I would still like to do is learn a bit about how postgres manages its log rotation and whether we can monitor its log files for errors.
EDIT: 2019-12-14: What I have learned by now is that postgres has built-in log rotation and we seem to use the defaults: daily rotation plus a 10 MB file size limit. On OSD we have an additional /etc/logrotate.d/postgresql that ensures deletion of old files with
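# with rotate 0 no rotated copies are kept, so files older than minage (15 days) are simply deleted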
/var/lib/pgsql/data/pg_log/postgresql-*.log {
daily
minage 15
rotate 0
missingok
}
On o3 we do not have that.
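For reference, the rotation behaviour described above corresponds to the built-in logging collector settings in postgresql.conf. A sketch of the relevant parameters; the rotation values are the stock PostgreSQL 10 defaults (1d / 10MB) and are assumptions about the osd setup, not read from its config:
logging_collector = on                           # write the server log to files and rotate them
log_directory = 'log'                            # relative to the data directory, i.e. /srv/PSQL10/data/log on osd
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'  # matches the file names seen above
log_rotation_age = 1d                            # start a new file every day
log_rotation_size = 10MB                         # or as soon as the current file exceeds 10 MB
The collector only rotates, it never deletes old files on its own, which is why an external cleanup such as the logrotate snippet above is still needed.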
https://www.endpoint.com/blog/2010/01/01/monitoring-postgres-log-files-with mentions a program "tail_n_mail" that monitors postgres logs and sends email depending on configuration, e.g. alerts and aggregated error reports. Sounds useful.
After the original problem, this seems to be the second-most frequent error in the logs:
2019-12-15 12:42:25.201 UTC openqa geekotest [23773]STATEMENT: INSERT INTO job_modules ( always_rollback, category, fatal, important, job_id, milestone, name, script, t_created, t_updated) VALUES ( $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 ) RETURNING id
2019-12-15 12:42:25.204 UTC openqa geekotest [23773]ERROR: duplicate key value violates unique constraint "job_modules_job_id_name_category_script"
2019-12-15 12:42:25.204 UTC openqa geekotest [23773]DETAIL: Key (job_id, name, category, script)=(1114160, cleanup_before_shutdown, shutdown, tests/shutdown/cleanup_before_shutdown.pm) already exists.
Updated by coolo about 5 years ago
please fix https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/43 - postgresql logs aren't supposed to be stored for months
https://www.postgresql.org/docs/current/logfile-maintenance.html
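The content of the merge request is not quoted here, but following the linked documentation, a hypothetical logrotate rule along the lines of the OSD snippet above, just pointed at the log directory observed on osd, would cap retention like this:
# hypothetical sketch only, not the merged salt change; 14 days is an assumed value
/srv/PSQL10/data/log/postgresql-*.log {
daily
minage 14
rotate 0
missingok
}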
Updated by okurz about 5 years ago
- Status changed from In Progress to Resolved
- Target version changed from Current Sprint to Done
merged, deployed and active on osd. /srv is at 47% space usage.
Updated by okurz almost 5 years ago
- Related to action #64096: partition /srv was nearly depleted but now fixed (itself?) added