action #60926

postgres error "duplicate key value violates unique constraint "screenshots_filename", Key (filename)=(8ca/3c9/98a00d8bb2ccba5a2de1d403b5.png) already exists. INSERT INTO screenshots …"

Added by okurz 4 months ago. Updated 4 months ago.

Status:ResolvedStart date:11/12/2019
Priority:HighDue date:
Assignee:okurz% Done:

0%

Category:Concrete Bugs
Target version:Current Sprint
Difficulty:
Duration:

Description

Observation

There is an error message repeated multiple times per second repeatedly in postgres error logs, first occurence of the big repetition on osd in postgresql-2019-10-23_000000.log:

2019-10-23 08:29:38.235 CEST openqa geekotest [30552]ERROR:  duplicate key value violates unique constraint "screenshots_filename"
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]DETAIL:  Key (filename)=(68c/0ee/3ceb7071b7b9364aa721b3421e.png) already exists.
2019-10-23 08:29:38.235 CEST openqa geekotest [30552]STATEMENT:  INSERT INTO screenshots ( filename, t_created) VALUES ( $1, $2 ) RETURNING id

on o3 grep 'duplicate key value violates unique constraint "screenshots_filename"' * | less shows that the error was in already in the oldest still preserved log files, that is postgresql-2019-09-08_071605.log currently but also since 2019-10-17 03:00 it appears way more often. As the issue repeats since 03:00 I assume an openQA change to cause this, not os-autoinst or openQA-worker which we both install 30m later on workers (I think).

The diff of openQA that was installed from /var/log/zypp/history:

2019-10-16 03:00:38|install|openQA|4.6.1571122761.67cc75da9-lp151.1919.1|noarch||devel_openQA|f8e591a5a2176268263689cc99a5e95c3421187c30c3ed368b8e9a02349f1713|
2019-10-17 03:02:01|install|openQA|4.6.1571253176.1a322744e-lp151.1926.1|noarch||devel_openQA|84a559884b250b25372631df6fc542c8f5318b2ed69dcb99a8d501f71befcf4d|

so 67cc75da9..1a322744e with the git log:

$ git log1 --no-merges 67cc75da9..1a322744e
916c45f5c (kraih/local_postgres) PostgreSQL errors can be localized, so just use the name of the unique constraint
ce83ab943 worker: Do not treat reconnect attempts as errors but with warning only
8811ad46c Remove wrong error handling code when sending ws messages
7506e0ae4 Prevent potential deadlocks in scheduler and ws server
81d318dd5 Hide old job templates editor for new groups
3172996fd (kraih/screenshots_resultset) Handle unique constraint correctly
51967db7f Add missing resultset for screenshots and make a few small optimizations
94afcda64 Drop -v flag on test runs and avoid noisy job "name"
e7c3f3cff clone job: Support specifying a port in host URL

A commit named "Handle unique constraint correctly" looks very suspicious in this regard :)
I blame https://github.com/os-autoinst/openQA/pull/2406 therefore.


Related issues

Copied from openQA Infrastructure - action #60923: [alert] /srv about to run full, postgres logs very big du... Resolved 11/12/2019

History

#1 Updated by okurz 4 months ago

  • Copied from action #60923: [alert] /srv about to run full, postgres logs very big due to repeated error "duplicate key value violates unique constraint "screenshots_filename", Key (filename)=(8ca/3c9/98a00d8bb2ccba5a2de1d403b5.png) already exists. INSERT INTO screenshots …" added

#2 Updated by coolo 4 months ago

  • Priority changed from Urgent to Normal
  • Target version changed from Current Sprint to Ready

Not urgent at all - it's just postgresql logs. we catch the error

#3 Updated by coolo 4 months ago

it's not the only constraint filling the logs btw

#4 Updated by okurz 4 months ago

coolo wrote:

Not urgent at all - it's just postgresql logs. we catch the error

I was expecting this from you and this is one more reason why I created this ticket as a copy of the infrastructure one. It's a valid opinion to consider it less important. I compare it to global warming: Just because you did not do anything about it yet does not make it less urgent though ;) For tickets with "Normal" priority I am working with the expectation of "months" until it is done. This means just more work for us on the admin side to handle the huge logs. A proper should help with the problem at the source and maybe even helps with performance: Multiple log messages per second also have some I/O impact.

#5 Updated by kraih 4 months ago

  • Priority changed from Normal to Low

The code in question should be here. And i'm not even sure there is much we can do with DBIx::Class, that error is expected and handled correctly on the openQA side.

#6 Updated by okurz 4 months ago

Do you have a suggestion how we handle the error filling up GBs of logs on servers then?

I am not sure you both understood the implications when setting the priority to Normal/Low. If we do not do anything about this error then either postgresql logs will grow to a large steady state (I estimate 200GB) or partitions will run full. My time expectations for "Low" tickets is "years". I don't think we have that long, rather weeks.

#7 Updated by mkittler 4 months ago

  • Priority changed from Low to High

That there's not obvious solution doesn't mean the prio is low. I think we should handle this case correctly and without flooding the logs. It is not a good idea to wait with that until the production instance runs out of disk space.

Maybe we can use manual SQL using UPSERT here? We don't use the DBIx object here for anything else but the ID.

it's not the only constraint filling the logs btw

That's likely true. There are also still places in the code which are racy. If we handle more of them in the way @kraih pointed out this problem will be even more notable. That actually calls for a more general solution. It would be nice if we could turn off logging for these errors specifically.

#8 Updated by kraih 4 months ago

I'm not sure, will check with DBIx::Class folks if they have a better way to handle this. If we used raw SQL we could just append a ON CONFLICT screenshots_filename DO NOTHING to the insert query.

#9 Updated by kraih 4 months ago

Maybe we can use manual SQL using UPSERT here? We don't use the DBIx object here for anything else but the ID.

Maybe not on this one, but i remember working on another of our unique constraints and it triggered a lot of Perl code on insert in the related Result classes.

#10 Updated by mkittler 4 months ago

I know, in some cases we do too much DBIx magic which would be annoying to do manually. Here we are likely fine of course we should check before just replacing the code with raw SQL. But maybe DBIx even allows one to mix in custom SQL like ON CONFLICT screenshots_filename DO NOTHING.

#11 Updated by okurz 4 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from Ready to Current Sprint

I will try out something.

#12 Updated by kraih 4 months ago

Maybe we can use manual SQL using UPSERT here? We don't use the DBIx object here for anything else but the ID.


Maybe not on this one, but i remember working on another of our unique constraints and it triggered a lot of Perl code on insert in the related Result classes.

I think this was the one i remembered.

#13 Updated by kraih 4 months ago

I know, in some cases we do too much DBIx magic which would be annoying to do manually. Here we are likely fine of course we should check before just replacing the code with raw SQL. But maybe DBIx even allows one to mix in custom SQL like ON CONFLICT screenshots_filename DO NOTHING.

I looked through the code a bit. On the SQL::Abstract layer it seems possible by exploiting the returning feature, but in DBIx::Class you have to go down all the way to the storage layer to reach the SQL generating part. So i don't think it is possible without monkey patching DBIx::Class (but will double check).

#15 Updated by okurz 4 months ago

  • Status changed from In Progress to Resolved

merged and deployed to o3 immediately. No more errors in postgres log about the screenshot insert conflict.

Also available in: Atom PDF