action #57683
closed
o3 /space is nearly running out again, assets are not refreshed, not cleaned up (was: too much logs&results)
Description
Observation
openqa=> select id,name,keep_logs_in_days,keep_results_in_days from job_groups order by keep_results_in_days desc;
id | name | keep_logs_in_days | keep_results_in_days
----+------------------------------------+-------------------+----------------------
61 | openSUSE Leap 15.1 Updates | |
21 | openSUSE Leap 42.2 PowerPC | 30 |
20 | openSUSE Leap 42.2 AArch64 | 30 |
15 | openSUSE Leap 42.1 JeOS | 30 |
55 | openSUSE Leap 15.0 Updates | |
34 | openSUSE Tumbleweed s390x | 30 | 365
12 | openSUSE Leap 42.1 AArch64 | 30 | 365
53 | openSUSE Leap 15.0 Images | 30 | 365
7 | openSUSE Leap 42.1 | 30 | 365
58 | openSUSE Leap 15.1 Images | 30 | 365
13 | openSUSE Leap 42.1 PowerPC | 30 | 365
4 | openSUSE Tumbleweed PowerPC | 30 | 365
48 | openSUSE Leap 42.3 Incidents | 30 | 365
54 | openSUSE Leap 15.0 Incidents | 30 | 365
65 | openSUSE Leap 15.2 Images | 20 | 120
39 | Development Leap | 20 | 120
41 | Development Kubic | 30 | 120
59 | openSUSE Leap 15.1 AArch64 Images | 20 | 120
We simply cannot afford to keep that much at the moment.
Updated by okurz about 5 years ago
Going through the mentioned job groups and reducing the result retention period. Set many to "90" days for results now.
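To pick out which job groups still keep results longer than a chosen limit, the psql output from the description can be filtered with a small script. This is only a sketch: the function names and the 90-day default are illustrative, and rows with an empty keep_results_in_days are skipped on the assumption that an empty value means some default or inherited setting applies.

```python
from typing import List, Optional, Tuple


def parse_days(field: str) -> Optional[int]:
    """Return the integer in a psql column, or None if the column is empty."""
    field = field.strip()
    return int(field) if field else None


def groups_over_limit(psql_rows: List[str], limit_days: int = 90) -> List[Tuple[int, str, int]]:
    """Return (id, name, keep_results_in_days) for rows above limit_days.

    Rows with an empty keep_results_in_days are skipped; what an empty
    value means for retention is an assumption, not taken from the ticket.
    """
    result = []
    for row in psql_rows:
        parts = [p.strip() for p in row.split("|")]
        if len(parts) != 4 or not parts[0].isdigit():
            continue  # header, separator or malformed line
        keep_results = parse_days(parts[3])
        if keep_results is not None and keep_results > limit_days:
            result.append((int(parts[0]), parts[1], keep_results))
    return result


rows = [
    "id | name | keep_logs_in_days | keep_results_in_days",
    "61 | openSUSE Leap 15.1 Updates | |",
    "34 | openSUSE Tumbleweed s390x | 30 | 365",
    "65 | openSUSE Leap 15.2 Images | 20 | 120",
]
for group in groups_over_limit(rows):
    print(group)
```

The output of this script is only a list of candidates; the actual change was made per job group in the settings.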
found some old, unhandled testresults:
ariel:/space/openqa/testresults # ls
00071 00302 00328 00368 00408 00448 00488 00528 00568 00608 00648 00688 00728 00768 00808 00848 00888 00928 00968 01008
00097 00303 00329 00369 00409 00449 00489 00529 00569 00609 00649 00689 00729 00769 00809 00849 00889 00929 00969 01009
00098 00304 00330 00370 00410 00450 00490 00530 00570 00610 00650 00690 00730 00770 00810 00850 00890 00930 00970 01010
00121 00305 00331 00371 00411 00451 00491 00531 00571 00611 00651 00691 00731 00771 00811 00851 00891 00931 00971 01011
00122 00306 00332 00372 00412 00452 00492 00532 00572 00612 00652 00692 00732 00772 00812 00852 00892 00932 00972 01012
00125 00307 00333 00373 00413 00453 00493 00533 00573 00613 00653 00693 00733 00773 00813 00853 00893 00933 00973 01013
00215 00308 00334 00374 00414 00454 00494 00534 00574 00614 00654 00694 00734 00774 00814 00854 00894 00934 00974 01014
00225 00308234-opensuse-42.2-Updates-x86_64-Build20161121-2-kde@64bit-2G 00335 00375 00415 00455 00495 00535 00575 00615 00655 00695 00735 00775 00815 00855 00895 00935 00975 01015
00226 00308238-opensuse-42.2-Updates-x86_64-Build20161121-2-gnome@64bit-2G 00336 00376 00416 00456 00496 00536 00576 00616 00656 00696 00736 00776 00816 00856 00896 00936 00976 01016
00227 00308239-opensuse-42.1-UpdateTest-x86_64-Build20161121-1-gnome@uefi-2G 00337 00377 00417 00457 00497 00537 00577 00617 00657 00697 00737 00777 00817 00857 00897 00937 00977 01017
00228 00308240-opensuse-42.2-Updates-x86_64-Build20161121-2-install_with_updates_gnome@64bit-2G 00338 00378 00418 00458 00498 00538 00578 00618 00658 00698 00738 00778 00818 00858 00898 00938 00978 01018
00266 00308241-opensuse-42.2-Updates-x86_64-Build20161121-2-install_with_updates_kde@uefi-2G 00339 00379 00419 00459 00499 00539 00579 00619 00659 00699 00739 00779 00819 00859 00899 00939 00979 01019
00267 00308242-opensuse-42.2-Updates-x86_64-Build20161121-2-kde@64bit-2G 00340 00380 00420 00460 00500 00540 00580 00620 00660 00700 00740 00780 00820 00860 00900 00940 00980 01020
00268 00308243-opensuse-42.2-UpdateTest-x86_64-Build20161121-2-kde@64bit-2G 00341 00381 00421 00461 00501 00541 00581 00621 00661 00701 00741 00781 00821 00861 00901 00941 00981 01021
00276 00308244-opensuse-42.2-UpdateTest-x86_64-Build20161121-2-gnome@uefi-2G 00342 00382 00422 00462 00502 00542 00582 00622 00662 00702 00742 00782 00822 00862 00902 00942 00982 01022
00277 00308245-opensuse-42.2-Updates-x86_64-Build20161121-2-gnome@uefi 00343 00383 00423 00463 00503 00543 00583 00623 00663 00703 00743 00783 00823 00863 00903 00943 00983 01023
00278 00308246-opensuse-5.8.90-Krypton-Live-x86_64-Build5.54-krypton-live@64bit-2G 00344 00384 00424 00464 00504 00544 00584 00624 00664 00704 00744 00784 00824 00864 00904 00944 00984 01024
00279 00308247-opensuse-5.7.90-Argon-Live-x86_64-Build11.4-krypton-live@64bit-2G 00345 00385 00425 00465 00505 00545 00585 00625 00665 00705 00745 00785 00825 00865 00905 00945 00985 01025
00280 00308249-opensuse-42.2-Updates-x86_64-Build20161121-2-gnome@uefi 00346 00386 00426 00466 00506 00546 00586 00626 00666 00706 00746 00786 00826 00866 00906 00946 00986 01026
00281 00308255-opensuse-42.2-Updates-x86_64-Build20161121-2-install_with_updates_kde@uefi-2G 00347 00387 00427 00467 00507 00547 00587 00627 00667 00707 00747 00787 00827 00867 00907 00947 00987 01027
00282 00308257-opensuse-42.1-UpdateTest-x86_64-Build20161121-1-gnome@uefi-2G 00348 00388 00428 00468 00508 00548 00588 00628 00668 00708 00748 00788 00828 00868 00908 00948 00988 01028
00283 00309 00349 00389 00429 00469 00509 00549 00589 00629 00669 00709 00749 00789 00829 00869 00909 00949 00989 01029
00284 00310 00350 00390 00430 00470 00510 00550 00590 00630 00670 00710 00750 00790 00830 00870 00910 00950 00990 01030
00285 00311 00351 00391 00431 00471 00511 00551 00591 00631 00671 00711 00751 00791 00831 00871 00911 00951 00991 01031
00286 00312 00352 00392 00432 00472 00512 00552 00592 00632 00672 00712 00752 00792 00832 00872 00912 00952 00992 01032
00287 00313 00353 00393 00433 00473 00513 00553 00593 00633 00673 00713 00753 00793 00833 00873 00913 00953 00993 01033
00288 00314 00354 00394 00434 00474 00514 00554 00594 00634 00674 00714 00754 00794 00834 00874 00914 00954 00994 01034
00289 00315 00355 00395 00435 00475 00515 00555 00595 00635 00675 00715 00755 00795 00835 00875 00915 00955 00995 01035
00290 00316 00356 00396 00436 00476 00516 00556 00596 00636 00676 00716 00756 00796 00836 00876 00916 00956 00996 01036
00291 00317 00357 00397 00437 00477 00517 00557 00597 00637 00677 00717 00757 00797 00837 00877 00917 00957 00997 01037
00292 00318 00358 00398 00438 00478 00518 00558 00598 00638 00678 00718 00758 00798 00838 00878 00918 00958 00998 01038
00293 00319 00359 00399 00439 00479 00519 00559 00599 00639 00679 00719 00759 00799 00839 00879 00919 00959 00999 01039
00294 00320 00360 00400 00440 00480 00520 00560 00600 00640 00680 00720 00760 00800 00840 00880 00920 00960 01000 01040
00295 00321 00361 00401 00441 00481 00521 00561 00601 00641 00681 00721 00761 00801 00841 00881 00921 00961 01001 01041
00296 00322 00362 00402 00442 00482 00522 00562 00602 00642 00682 00722 00762 00802 00842 00882 00922 00962 01002 01042
00297 00323 00363 00403 00443 00483 00523 00563 00603 00643 00683 00723 00763 00803 00843 00883 00923 00963 01003 01043
00298 00324 00364 00404 00444 00484 00524 00564 00604 00644 00684 00724 00764 00804 00844 00884 00924 00964 01004 01044
00299 00325 00365 00405 00445 00485 00525 00565 00605 00645 00685 00725 00765 00805 00845 00885 00925 00965 01005 01045
00300 00326 00366 00406 00446 00486 00526 00566 00606 00646 00686 00726 00766 00806 00846 00886 00926 00966 01006 01046
00301 00327 00367 00407 00447 00487 00527 00567 00607 00647 00687 00727 00767 00807 00847 00887 00927 00967 01007 01047
cleaned up some dirs manually.
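The unhandled entries in the listing above are the ones that are not plain five-digit directories. A minimal sketch for spotting them, assuming the top level of testresults is supposed to contain only such five-digit directories (as the listing and the lsof path further down suggest); the function name is illustrative:

```python
import re
from typing import Iterable, List

# Assumption: regular top-level entries are five-digit directories such as "00308".
BUCKET_RE = re.compile(r"\d{5}")


def unexpected_entries(names: Iterable[str]) -> List[str]:
    """Return the sorted entries that do not look like a five-digit directory."""
    return sorted(name for name in names if not BUCKET_RE.fullmatch(name))


listing = [
    "00308",
    "00308234-opensuse-42.2-Updates-x86_64-Build20161121-2-kde@64bit-2G",
    "00309",
]
print(unexpected_entries(listing))
```

On the host the input could come from os.listdir("/space/openqa/testresults"). The sketch only reports candidates and deletes nothing.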
Also deleted old files in /var/lib/openqa/share/factory/hdd/ . It now seems that asset cleanup has not worked for some days. https://openqa.opensuse.org/admin/assets shows only assets older than around 2019-09-28, so potentially the upgrade on 2019-09-29 was the one that broke asset tracking (and cleanup). In /var/log/zypp/history I cannot find any changes between 2019-09-28 and the next day, but there are changes between 2019-09-27 and 2019-10-01, with their corresponding git commits:
git log1 --no-merges b5a1dadd6..683ca6661
c1046c5f8 (okurz/enhance/cleanup_circle_ci, enhance/cleanup_circle_ci) circleci: Remove whitespace at EOL
889f14b04 Fix publishing documentation via Travis
1abc10b24 Don't silently exit doc generation if asciidoctor not available
62241751c Load build results on dashboard via AJAX
763ded82c Update perl-DBIx-Class-DeploymentHandler dependency to 0.002233 (#2359)
43892c077 Move stale job detection from ws server to scheduler
152d5ed19 Remove obsolete comment regarding offline workers
5b3447b30 Rely on t_updated for the worker's online status in the web UI
6617fa456 Prevent failures in feature tour test
313790d8b Set default check interval for wait_util to 1 second
b9ac321af (Martchus/uniform-dependency-boxes) Enforce same width for nodes in dependency graph
92402582a Move test helper embed_server_for_testing to test utilities
15e48ae07 Add unit test for test schedule change processing
e96794dbc Avoid race condition if test_order.json changes too often
dbae31f30 Reload test_order.json if it changes at test runtime
519f61fe3 Refactor job result file path concatenation
2fefb9e8f (Martchus/staging) Move incompletion logic when worker shows up again to scheduler
9b04a7daa (okurz/feature/devel_test) Add package-test for openQA-devel allowing to check all dependencies in all repos
Nothing obvious stands out. I am tempted to simply reboot the whole system.
# ps auxf | grep '\<gru\>'
root 13447 0.0 0.0 7432 968 pts/14 S+ 21:14 0:00 \_ grep --color=auto \<gru\>
geekote+ 26763 0.0 0.9 321996 155656 ? SNs 18:45 0:08 /usr/bin/perl /usr/share/openqa/script/openqa gru -m production run
geekote+ 28826 4.9 1.3 386188 217024 ? DN 19:01 6:36 \_ /usr/bin/perl /usr/share/openqa/script/openqa gru -m production run
ariel:/space/openqa/share/factory # cat /proc/28826/stack
[<ffffffffa040fbfe>] xfs_buf_submit_wait+0x7e/0x200 [xfs]
[<ffffffffa040feb6>] xfs_buf_read_map+0x106/0x170 [xfs]
[<ffffffffa0442dec>] xfs_trans_read_buf_map+0xac/0x2e0 [xfs]
[<ffffffffa03fab77>] xfs_imap_to_bp+0x57/0xd0 [xfs]
[<ffffffffa03fb40e>] xfs_iread+0x6e/0x1f0 [xfs]
[<ffffffffa0419dcb>] xfs_iget+0x2eb/0x980 [xfs]
[<ffffffffa04232f8>] xfs_lookup+0xb8/0xf0 [xfs]
[<ffffffffa041fb7c>] xfs_vn_lookup+0x4c/0x80 [xfs]
[<ffffffff812644c9>] lookup_slow+0x99/0x150
[<ffffffff81264a1d>] walk_component+0x19d/0x440
[<ffffffff812652a5>] path_lookupat+0x75/0x1d0
[<ffffffff81268b87>] filename_lookup+0xa7/0x160
[<ffffffff8125cff3>] vfs_statx+0x63/0xb0
[<ffffffff8125d496>] SYSC_newlstat+0x26/0x40
[<ffffffff81003aeb>] do_syscall_64+0x7b/0x160
[<ffffffff8180009a>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<ffffffffffffffff>] 0xffffffffffffffff
# lsof -p 28826
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
openqa 28826 geekotest cwd DIR 253,17 8192 12898734141 /var/lib/openqa/testresults/00858/00858784-opensuse-15.0-DVD-Incidents-x86_64-Build:9409:gcc7.1550619705-cryptlvm@uefi-2G/.thumbs
strace shows me that it is reading test result directories, so everything looks to be in order, albeit a bit slow, possibly because of the big backlog of cleanup needed for results and logs. However, no job is running for asset cleanup. It looks like asset cleanup is never given a chance to run. Let's monitor overnight.
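In the ps output above the gru child process has STAT "DN": "D" is uninterruptible sleep (typically waiting on I/O) and "N" is low priority. That fits the kernel stack sitting in an XFS lookup. A small sketch for picking such processes out of `ps aux` style lines, assuming the usual column order USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND; the function name is illustrative:

```python
from typing import Iterable, List


def blocked_pids(ps_lines: Iterable[str]) -> List[int]:
    """Return PIDs whose STAT column starts with "D" (uninterruptible sleep)."""
    blocked = []
    for line in ps_lines:
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header or malformed line
        if fields[7].startswith("D"):
            blocked.append(int(fields[1]))
    return blocked


ps_output = [
    r"root 13447 0.0 0.0 7432 968 pts/14 S+ 21:14 0:00 \_ grep --color=auto \<gru\>",
    "geekote+ 26763 0.0 0.9 321996 155656 ? SNs 18:45 0:08 /usr/bin/perl /usr/share/openqa/script/openqa gru -m production run",
    r"geekote+ 28826 4.9 1.3 386188 217024 ? DN 19:01 6:36 \_ /usr/bin/perl /usr/share/openqa/script/openqa gru -m production run",
]
print(blocked_pids(ps_output))  # prints [28826]
```

A process that is in state "D" at one moment is not necessarily stuck; here it was still making slow progress through the result directories.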
Updated by okurz about 5 years ago
- Subject changed from o3 /space is nearly running out again, too much logs&results to o3 /space is nearly running out again, assets are not refreshed, not cleaned up (was: too much logs&results)
Updated by okurz about 5 years ago
- Related to action #57689: asset cleanup jobs do not run on o3 (results cleanup works), workaround: unlock locks manually added
Updated by okurz about 5 years ago
- Status changed from Feedback to Resolved
Manually restarting minion jobs, cleaning locks, etc. worked. With strace -f -eopen,unlink -p 28805 I could follow the gru minion process and see that it eventually unlinked files, and we are back to only 79% usage.
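To keep an eye on the usage figure without logging in each time, something like the following could be run periodically. It is a rough sketch: the 85% threshold and the function names are hypothetical, and the percentage may differ slightly from what df reports, since df accounts for reserved blocks differently.

```python
import shutil


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return the used space as a percentage of the total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def filesystem_percent_used(path: str) -> float:
    """Return the usage percentage of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used)


ALERT_THRESHOLD = 85.0  # hypothetical value, not taken from the ticket


def needs_attention(path: str, threshold: float = ALERT_THRESHOLD) -> bool:
    """Return True if the filesystem usage is at or above the threshold."""
    return filesystem_percent_used(path) >= threshold


print(percent_used(100, 79))  # prints 79.0
```

On o3 the path to check would be /space, for example needs_attention("/space").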