action #109836
closedAll Jobs on OSD are incomplete since 2022-04-12
Description
Observation¶
We found almost all jobs on OSD are incomplete with the reason 'Reason: api failure: 400 response: OpenQA::Schema::Result::Jobs::insert_module(): DBI Exception: DBD::Pg::st execute failed: ERROR: integer out of range [for Statement " INSERT INTO job_modules ( job_id, name, category, script, milestone, important, fatal, always_rollback, t_created, t_u…' on 2022-04-12 morning.
Examples:
https://openqa.suse.de/tests/8519011#
https://openqa.suse.de/tests/8519030
BTW o3 doesn't have this problem.
Please help to check that.
Expected¶
Jobs on OSD could run as normal.
Rollback steps¶
- DONE:
sudo systemctl start openqa-scheduler - DONE:*
OSD deployment schedule enabled again in https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit
Updated by okurz over 2 years ago
- Category changed from Support to Regressions/Crashes
- Target version set to Ready
Updated by okurz over 2 years ago
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to okurz
on it. https://openqa.suse.de/tests/8516601 shows that the issue already happened before today's deployment.
openqa=> select * from job_modules order by id desc limit 10;
id | job_id | name | script | category | t_created | t_updated | result | milestone | important | fatal | always_rollback
------------+---------+--------------------+-------------------------+----------+----------------------------+----------------------------+--------+-----------+-----------+-------+-----------------
2147482702 | 8508180 | fcntl09 | tests/kernel/run_ltp.pm | kernel | 2022-04-11 23:03:39.362748 | 2022-04-11 23:03:39.362748 | none | 0 | 1 | 0 | 0
shows that the most recent job module insert was 2147482702 from 022-04-11 23:03:39.362748. The number is suspiciously near 2 ** 31 = 2147483648
We currently have
openqa=> select count(*) from job_modules;
count
----------
27364042
(1 row)
so I guess an in-place migration of 27M modules would take some minutes to hours but not endless. For now only new jobs since about 10h are affected, so we should be careful to not destroy everything.
I stopped the scheduler for now to not add more tests for now which can not complete anyway.
EDIT: Executing ALTER TABLE job_modules ALTER COLUMN id TYPE bigint;
manually. Took some minutes (only).
Also triggered
host=openqa.suse.de failed_since=2022-04-11 ./openqa-advanced-retrigger-jobs
to restart the 3.5k incompletes, see https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=14&orgId=1 .
openqa=> \d job_modules;
Table "public.job_modules"
Column | Type | Collation | Nullable | Default
-----------------+-----------------------------+-----------+----------+-----------------------------------------
id | bigint | | not null | nextval('job_modules_id_seq'::regclass)
job_id | integer | | not null |
name | text | | not null |
script | text | | not null |
category | text | | not null |
t_created | timestamp without time zone | | not null |
t_updated | timestamp without time zone | | not null |
result | character varying | | not null | 'none'::character varying
milestone | integer | | not null | 0
important | integer | | not null | 0
fatal | integer | | not null | 0
always_rollback | integer | | not null | 0
Indexes:
"job_modules_pkey" PRIMARY KEY, btree (id)
"idx_job_modules_result" btree (result)
"job_modules_idx_job_id" btree (job_id)
"job_modules_job_id_name_category_script" UNIQUE CONSTRAINT, btree (job_id, name, category, script)
Foreign-key constraints:
"job_modules_fk_job_id" FOREIGN KEY (job_id) REFERENCES jobs(id) ON UPDATE CASCADE DEFERRABLE
Referenced by:
TABLE "needles" CONSTRAINT "needles_fk_last_matched_module_id" FOREIGN KEY (last_matched_module_id) REFERENCES job_modules(id) ON DELETE SET NULL DEFERRABLE
TABLE "needles" CONSTRAINT "needles_fk_last_seen_module_id" FOREIGN KEY (last_seen_module_id) REFERENCES job_modules(id) ON DELETE SET NULL DEFERRABLE
looks better now.
Updated by okurz over 2 years ago
I restarted services but then hit DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: value "2147490421" is out of range for type integer CONTEXT: unnamed portal parameter $1 = '...' [for Statement "UPDATE needles SET last_seen_module_id = ?, last_seen_time = ?, t…
looks like we need to find all occurences of job_module_id and change tables.
So I did
ALTER TABLE needles ALTER COLUMN last_seen_module_id TYPE bigint; ALTER TABLE needles ALTER COLUMN last_matched_module_id TYPE bigint;
which was quick.
Then I got
400 response: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: cached plan must not change result type [for Statement "SELECT me.id, me.dir_id, me.filename, me.last_seen_time, me.last_seen_module_id, me.last_matched_time, me.last_matched_modu…
Clone of 8515034 (restarted already 2 times)
oh, those carwos jobs can be quick, already some passed, e.g. https://openqa.suse.de/tests/8519408
Updated by okurz over 2 years ago
- Copied to coordination #109846: [epic] Ensure all our database tables accomodate enough data, e.g. bigint for ids added
Updated by okurz over 2 years ago
Current situation: The issue was mitigated on OSD. Incomplete jobs are in the process of being restarted. The situation will be monitored. A followup plan for migration of other existing instances as well as new installations will be conducted within #109846
Updated by openqa_review over 2 years ago
- Due date set to 2022-04-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 2 years ago
All incomplete jobs were retriggered accordingly. Alerts are handled again. A proper upstream database migration was already prepared in https://github.com/os-autoinst/openQA/pull/4604 . The migration was done silently and successfully on o3 so I have a good feeling we can also enable OSD deployment. I enabled the deployment again and triggered a manual run now and I am monitoring that.
Updated by okurz over 2 years ago
- Description updated (diff)
- Due date deleted (
2022-04-27) - Status changed from In Progress to Resolved