Project

General

Profile

Actions

action #60923

closed

[alert] /srv about to run full, postgres logs very big due to repeated error "duplicate key value violates unique constraint "screenshots_filename", Key (filename)=(8ca/3c9/98a00d8bb2ccba5a2de1d403b5.png) already exists. INSERT INTO screenshots …"

Added by okurz over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2019-12-11
Due date:
% Done:

0%

Estimated time:

Description

Observation

On Wednesday, 11 December 2019 20.55.01 CET Grafana wrote:

[Alerting] File systems alert
One of the file systems is too full

Metric name
Value
/srv: Used Percentage
90.151

It seems /srv is about to be full soon. This seems to be an evolution since start of October:
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?fullscreen&edit&tab=alert&panelId=74&orgId=1&refresh=30s&from=now-6M&to=now

On osd I can see that /srv/PSQL10/data/log has 39G of log files.

There is an error message repeated multiple times per second repeatedly, first occurence in postgresql-2019-10-23_000000.log:

2019-10-23 08:29:38.235 CEST openqa geekotest [30552]ERROR:  duplicate key value violates unique constraint "screenshots_filename"
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]DETAIL:  Key (filename)=(68c/0ee/3ceb7071b7b9364aa721b3421e.png) already exists.
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]STATEMENT:  INSERT INTO screenshots ( filename, t_created) VALUES ( $1, $2 ) RETURNING id

Problem

Formerly we also had quite some postgres errors which seems we did not care about but now we have a severe case causing log files to grow that much that multiple files per day are recorded. Either postgres never cleans up old rotations or did not reach the limit yet. The same happens on o3 but we have a little bit more space for the partition holding the database.


Related issues 2 (0 open2 closed)

Related to openQA Infrastructure - action #64096: partition /srv was nearly depleted but now fixed (itself?)Resolvedokurz2020-03-032020-03-20

Actions
Copied to openQA Project - action #60926: postgres error "duplicate key value violates unique constraint "screenshots_filename", Key (filename)=(8ca/3c9/98a00d8bb2ccba5a2de1d403b5.png) already exists. INSERT INTO screenshots …"Resolvedokurz2019-12-11

Actions
Actions #1

Updated by okurz over 4 years ago

  • Copied to action #60926: postgres error "duplicate key value violates unique constraint "screenshots_filename", Key (filename)=(8ca/3c9/98a00d8bb2ccba5a2de1d403b5.png) already exists. INSERT INTO screenshots …" added
Actions #2

Updated by okurz over 4 years ago

I have deleted postgresql-2019-11-* which gives us a bit of time to solve the issue. This ticket shall focus on the admin impact. I created #60926 for the openQA bug itself.

Actions #3

Updated by okurz over 4 years ago

  • Priority changed from Urgent to High

I could resolve #60926 already and deploy the change to o3 so while there are also many logs there I do not even think we need to delete any. For osd next week's deployment should suffice or we do it in the middle. What I would still like to do is learn a bit about how postgres manages the log rotation and also if we have a possibility to monitor for errors in it's according logfiles.

EDIT: 2019-12-14: So what I learned by now is that postgres has logrotation and we seem to use the default, do daily rotations plus limit file size by 10M. On OSD we have an additional /etc/logrotate.d/postgresql that ensures deletion of old files with

/var/lib/pgsql/data/pg_log/postgresql-*.log {
    daily
    minage 15
    rotate 0
    missingok
}

on o3 we do not have that.

https://www.endpoint.com/blog/2010/01/01/monitoring-postgres-log-files-with mentions a program "tail_and_mail" monitoring postgres logs and sending email alerts depending on configuration, e.g. alerts and aggregated error reports. Sounds useful.

After the original problem what seems to be the second-most likely error in logs:

2019-12-15 12:42:25.201 UTC openqa geekotest [23773]STATEMENT:  INSERT INTO job_modules ( always_rollback, category, fatal, important, job_id, milestone, name, script, t_created, t_updated) VALUES ( $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 ) RETURNING id
2019-12-15 12:42:25.204 UTC openqa geekotest [23773]ERROR:  duplicate key value violates unique constraint "job_modules_job_id_name_category_script"
2019-12-15 12:42:25.204 UTC openqa geekotest [23773]DETAIL:  Key (job_id, name, category, script)=(1114160, cleanup_before_shutdown, shutdown, tests/shutdown/cleanup_before_shutdown.pm) already exists.
Actions #6

Updated by okurz over 4 years ago

  • Status changed from In Progress to Resolved
  • Target version changed from Current Sprint to Done

merged, deployed and active on osd. /srv is at 47% space usage.

Actions #7

Updated by okurz about 4 years ago

  • Related to action #64096: partition /srv was nearly depleted but now fixed (itself?) added
Actions

Also available in: Atom PDF