action #19398
closed[functional][y] Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system)
Added by okurz over 7 years ago. Updated about 4 years ago.
100%
Description
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/7d5a52788e50f8b9ba8800cb0a2dc63b3ff979c4/lib/bootloader_setup.pm#L169
should be the most important line to remove to actually go forward with https://bugzilla.suse.com/show_bug.cgi?id=1011815
When working on this one should also try to remove all other copy-pasted references to console=tty
Acceptance criteria
- AC1: bsc#1011815 is VERIFIED
tasks
- Find out how kernel console messages actually show up on terminals
- Ensure openQA tests simulate something a common user would do
- Mark existing workarounds accordingly, e.g. see https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3344#issuecomment-319284634
Updated by riafarov over 7 years ago
- Status changed from New to In Progress
- Assignee set to riafarov
Updated by riafarov over 7 years ago
The original issue was reproduced by raising the log level for kernel messages to debug.
After removing console=tty from the boot parameters, showconsole segfaults and the installation fails to start. Investigating the issue.
The core dump contains the following strings:
setconsole@/dev/console@can not open console: %m@Usage: %s [-r | ]@can not open %s: %m@%s is not a tty@Usage: %s [-n]@%u %u
@real console unknown@can not set console device: %m@@can not connect on UNIX socket@@can not get terminal flags of %s@@@@@@@@can not set terminal flags of %s@@@@@@@@can not wait on password asking process@system console stolen at line %d!@@@Repeated error on reading from fd %d@@@@can not read request magic from UNIX socket@@@@@can not get message len from UNIX socket@@@@@@@@can not allocate memory for message from socket@can not get credentials from UNIX socket part1@@can not get credentials from UNIX socket part2@@Connection from %s of user %lu@@Connection from pid %lu user %lu@@@@@@@@can not allocate string for password@@@@can not allocate integer for password length@@@@can not open epoll file descriptor@@@@@@can not get file status of /var/log@@@@@can not get file system status of /var/log@@@@@@no message logging because /var file system is not accessible@@@can not determine real path of %s@@@@@@@can not determine device numbers for %s@epoll_pwait()@missing console pointer@memory allocation@can not open %s@can not write to fd %d@failed to fork process@
M%s: @M%s: @can not set password prompt@can not make invisible@can not read password@E@ @%s no connection jet@Can not read from fd %d@U@F@/dev/blog@can not open named fifo %s@/var/log/boot.log@/var/log/boot.old@/var/log@Can not rename %s@Can not write to %s@Can not open %s@error: console pointer empty@re@can not open /proc/consoles@/dev/char/%s@can not allocate string@%u:%u@%*s %*s (%[)]) %[0-9:]@no device provided@can not scan %s@can not open /dev@/dev/%s@can not handle %s@/proc/mounts@tmpfs@%s/blogd-XXXXXX@/dev/shm
Updated by riafarov over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
As console=tty cannot be removed, only worked around, I just improved the comment explaining why it is needed: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3009
If kernel messages make the tests unstable, we will need to implement a workaround. Resolving for now.
Updated by riafarov over 7 years ago
- Subject changed from Remove "console=tty" from tests to Remove "console=tty" boot option for installed system
- Status changed from Resolved to In Progress
As our research has shown, we cannot simply remove the console=tty option, which causes kernel messages to be printed on the active tty and to interfere with test execution (especially YaST ncurses), from the installation boot options. After a short discussion we decided to introduce an additional step that removes console=tty from the boot options at the end of the installation, as a reboot is required for the change to take effect.
To be checked: if new boot options can be applied at runtime, we can introduce this in the consoletest_setup test suite, where it logically belongs.
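To compare which console= arguments the installer and the installed system were actually booted with, a small helper along these lines might be useful. This is only a sketch: the function name list_console_args is hypothetical, and it assumes the kernel command line is passed in as a string (for example the contents of /proc/cmdline).

```shell
# Print the value of every console= argument in a kernel command line,
# one per line, in the order given.
# $1: the command line as a single string, e.g. "$(cat /proc/cmdline)"
# The body runs in a subshell so that "set -f" (no globbing while
# word-splitting $1) does not leak into the caller.
list_console_args() (
    set -f
    for arg in $1; do
        case $arg in
            console=*) printf '%s\n' "${arg#console=}" ;;
        esac
    done
)
```

It could be called as list_console_args "$(cat /proc/cmdline)" on both systems and the output compared.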
Updated by riafarov over 7 years ago
Unfortunately, modifying the boot options at the end of the installation is not possible, as /etc/default/grub has not been created yet.
So we either have to modify the boot options as a step during the installation or do this on the installed system. Either way a reboot is required.
Modifying the syslog configuration didn't help, as /dev/console is not really used there.
The question of how to do this is still open. With syslog we are able to redirect all messages to the serial device by adding *.* -/dev/ttyS0 or kern.* /dev/ttyS0
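The kern.* rule mentioned above could be dropped into a syslog configuration directory roughly as follows. This is a sketch only: it assumes an rsyslog-style setup that reads rule files from a drop-in directory, and both the function name and the file name 90-kernel-to-serial.conf are made up for illustration. As noted above, modifying syslog did not help with the original problem.

```shell
# Write a rule that sends all kernel-facility messages to the first serial port.
# $1: directory to write the rule file into (assumed to be a syslog drop-in
#     directory; the file name is hypothetical)
write_kernel_to_serial_rule() {
    printf 'kern.*\t/dev/ttyS0\n' > "$1/90-kernel-to-serial.conf"
}
```

Taking the target directory as a parameter makes it possible to try the helper against a scratch directory before touching the real configuration.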
Another major point: https://openqa.suse.de/tests/989670#step/textinfo/1
Here we got some kernel messages, but they cannot be found in the serial log attached to the test!
Updated by riafarov over 7 years ago
- Status changed from In Progress to Feedback
Updated by riafarov over 7 years ago
- Assignee changed from riafarov to okurz
After discussion it was decided to leave the current behavior as it is, on the assumption that every kernel message which disrupts a test has to be filed as a bug and processed individually.
Another thing we may improve is to provide more information about the failure, possibly by collecting more logs. The workarounds mentioned in the bugzilla ticket are valid and may be used by customers to resolve a flood of kernel messages when systemd is running.
Another remaining issue is the inconsistent behavior of the "console" boot parameter between the installation and the installed system, as well as the user not being notified of kernel messages while in an X11 session.
TBD: what we want to do here.
Good example here: https://openqa.opensuse.org/tests/419586#step/seahorse/8 where btrfs spat out info messages:
[ 202.773864] BTRFS info (device vda2): qgroup scan completed (inconsistency flag cleared)
Updated by riafarov over 7 years ago
- Related to action #17956: [sles][functional] test fails in command_not_found added
Updated by riafarov over 7 years ago
https://bugzilla.suse.com/show_bug.cgi?id=1007813 this bug affects many tests. TBD if we want to make a workaround for this particular known issue.
Updated by okurz over 7 years ago
We can confirm the original issue also when testing locally and with default kernel cmdline parameters so it is not an openQA specific issue.
Discussed with riafarov:
- try to lower the default kernel log messages ourselves by identifying the package which owns the file with
rpm -qf <filename>
and fix it, e.g. by an SR on the IBS/OBS package or an upstream patch (e.g. GitHub PR). Elaborate which level for which group -> @riafarov
- add an openQA test to show the error explicitly, see https://bugzilla.suse.com/show_bug.cgi?id=1011815#c34 . The test flow can be as follows:
select_console('root-console');
script_run('dev=$(ls /sys/class/net | head -n1); sleep 3 && ip link set $dev down && ip link set $dev up &', 0);
select_console('root-console');
script_sudo('yast2', 0);
assert_screen('yast2-ncurses-complete-menu');
send_key 'alt-q';
and have a "workaround" needle covering the console message popping up with "NIC Link is Down".
But first I should crosscheck with settings "splash=silent quiet" on grub.
EDIT: Ok, I was wrong. The above does not trigger any messages so "quiet" is enough. Let's check btrfs info messages.
Updated by okurz over 7 years ago
- Subject changed from Remove "console=tty" boot option for installed system to Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system)
- Description updated (diff)
Updated by okurz about 7 years ago
- Subject changed from Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system) to [functional]Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system)
Updated by okurz over 6 years ago
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
So … we do not remember seeing that issue impacting us in the past 8 months but checking linked bugs we find https://bugzilla.suse.com/show_bug.cgi?id=1011815 which is full of automatic reminder comments about "the issue" still appearing in tests. But apparently this is coming from an openSUSE needle "rescue-mode-emergency-shell-bsc1011815-20170825" created by riafarov with
commit ff461296
Author: Rodion Iafarov <riafarov@suse.com>
Date: Fri Aug 25 12:53:52 2017 +0000
rescue-mode-emergency-shell-bsc1011815-20170825 for opensuse-Tumbleweed-DVD-x86_64-Build20170823-extra_test_filesystem@64bit
but mgriessmeier, riafarov and I agree that it must be wrong.
next steps:
- Remove needle (or replace by proper not-workaround one)
- Add soft-failure to the modules using "dmesg -n 4" pointing to https://bugzilla.suse.com/show_bug.cgi?id=1011815
- Find out why nobody cares. Well, we know: nobody understands what we need to do, which is why we have this progress ticket. At least the two steps above are a start.
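For context on the "dmesg -n 4" workaround: the kernel prints a message on the console only if its numeric priority is lower than the console log level, which is the first field of /proc/sys/kernel/printk. The helper below is a sketch of that rule. The function name is hypothetical, and it takes the printk string as an argument rather than reading the file itself.

```shell
# Succeed (exit status 0) if a kernel message of the given priority would be
# printed on the console, fail otherwise.
# $1: contents of /proc/sys/kernel/printk, e.g. "4 4 1 7"
# $2: numeric priority of the message (0 = emerg ... 7 = debug)
would_print_on_console() {
    # Keep only the leading digits, i.e. the first field (console log level).
    console_level=${1%%[!0-9]*}
    [ "$2" -lt "$console_level" ]
}
```

With a console log level of 4, an info-level message (priority 6) such as the BTRFS one quoted earlier would be suppressed, while an error (priority 3) would still be shown.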
Updated by okurz over 6 years ago
- Due date deleted (2018-04-24)
- Target version changed from Milestone 15 to Milestone 17
see above next steps but we do not have capacity for this in S15-S17, delaying.
Updated by okurz over 6 years ago
- Has duplicate action #36802: [functional][u][sporadic] test fails in consoletest_setup added
Updated by okurz over 6 years ago
- Subject changed from [functional]Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system) to [functional][y] Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system)
Updated by okurz over 6 years ago
- Target version changed from Milestone 17 to Milestone 17
Updated by riafarov over 6 years ago
Needle removed here: https://github.com/os-autoinst/os-autoinst-needles-opensuse/pull/386
Soft-failures added here: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5248
Updated by riafarov over 6 years ago
- Status changed from Feedback to Resolved
I believe we can resolve this ticket now and create a new ticket if we want to try some other solution.
Updated by okurz over 6 years ago
- Status changed from Resolved to Feedback
Can you state why you think this is resolved? I doubt the mentioned bugs will ever be resolved without our involvement, and IMHO that is what this ticket is about. Or do we already have the other ticket you mentioned? I am worried that by closing this ticket without a follow-up we would just rely on our memory when this issue comes up again. And I am sure it will come up again, because the soft-failure keeps posting reminders in the bug and the job cannot turn green as long as the underlying issue(s) are not fixed.
Updated by riafarov over 6 years ago
- Status changed from Feedback to Resolved
I don't see any follow-up actions here, to be honest. We know how to work around the issue, we record a soft-failure, and we have removed the needle which caused false positives. If you see anything else we can do here, we can continue with this ticket. So feel free to reopen, add follow-up actions or create a new ticket.
Updated by okurz over 6 years ago
- Description updated (diff)
- Due date deleted (2018-07-03)
- Status changed from Resolved to Workable
- Assignee deleted (riafarov)
- Target version changed from Milestone 17 to future
OK, so let me make clearer what I mean by the "acceptance criteria", which I have now put into the description. I doubt https://bugzilla.suse.com/show_bug.cgi?id=1011815 would be resolved without our help, so this needs to be ensured first before we can close this ticket here.
Updated by oorlov over 6 years ago
- Blocks action #36241: [qe-core][functional][medium] test fails in NM_wpa2_enterprise - shows certificate selection screen where only pull down menu is expected added
Updated by riafarov almost 6 years ago
- Status changed from Workable to Resolved
- Assignee set to riafarov
We haven't seen issues recently, and there is a workaround with setting higher log level for kernel messages, so I would resolve this one.
Updated by okurz almost 6 years ago
- Status changed from Resolved to Workable
@riafarov, again, sorry, but as in #19398#note-28 I do not think this is closed. You only repeated what was said before.
Updated by riafarov almost 6 years ago
- Assignee deleted (riafarov)
okurz wrote:
@riafarov, again, sorry, but as in #19398#note-28 I do not think this is closed. You only repeated what was said before.
Not really, we have performed many mitigation steps, including dmesg -n 4 calls. So I don't see what else we should do, as having a separate tty for serial output is not a silver bullet. But sure, we can keep it in the backlog.
I personally see no related issues in recent runs and therefore no action points.
Updated by okurz almost 6 years ago
Yes, sure. But please see the AC1 (which I did not change recently)
Updated by riafarov about 4 years ago
- Status changed from Workable to Rejected
- Assignee set to riafarov
We seem to solve this problem by filing bugs against services which have an unexpected logging level.