action #19398

[functional][y] Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system)

Added by okurz over 2 years ago. Updated about 1 year ago.

Status: Workable
Start date: 26/05/2017
Priority: Normal
Due date:
Assignee: -
% Done: 100%
Category: Enhancement to existing tests
Estimated time: 5.00 hours
Target version: QA - future
Difficulty:
Duration:

Description

https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/7d5a52788e50f8b9ba8800cb0a2dc63b3ff979c4/lib/bootloader_setup.pm#L169
should be the most important line to remove to actually go forward with https://bugzilla.suse.com/show_bug.cgi?id=1011815
When working on this, one should also try to remove all other copy-pasted references to console=tty.

Acceptance criteria

tasks


Related issues

Related to openQA Tests - action #17956: [sles][functional] test fails in command_not_found Rejected 24/03/2017
Duplicated by openQA Tests - action #36802: [functional][u][sporadic] test fails in consoletest_setup Rejected 05/06/2018
Blocks openQA Tests - action #36241: [functional][u][medium] test fails in NM_wpa2_enterprise ... Blocked 15/05/2018

History

#1 Updated by riafarov over 2 years ago

  • Status changed from New to In Progress
  • Assignee set to riafarov

#2 Updated by riafarov over 2 years ago

The original issue was reproduced by increasing the loglevel for kernel messages to debug.
After removing console=tty from the boot parameters, we get a segfault in showconsole and the installation fails to start. Investigating the issue.

The core dump contains the following:

setconsole@/dev/console@can not open console: %m@Usage: %s [-r | ]@can not open %s: %m@%s is not a tty@Usage: %s [-n]@%u %u
@real console unknown@can not set console device: %m@@can not connect on UNIX socket@@can not get terminal flags of %s@@@@@@@@can not set terminal flags of %s@@@@@@@@can not wait on password asking process@system console stolen at line %d!@@@Repeated error on reading from fd %d@@@@can not read request magic from UNIX socket@@@@@can not get message len from UNIX socket@@@@@@@@can not allocate memory for message from socket@can not get credentials from UNIX socket part1@@can not get credentials from UNIX socket part2@@Connection from %s of user %lu@@Connection from pid %lu user %lu@@@@@@@@can not allocate string for password@@@@can not allocate integer for password length@@@@can not open epoll file descriptor@@@@@@can not get file status of /var/log@@@@@can not get file system status of /var/log@@@@@@no message logging because /var file system is not accessible@@@can not determine real path of %s@@@@@@@can not determine device numbers for %s@epoll_pwait()@missing console pointer@memory allocation@can not open %s@can not write to fd %d@failed to fork process@
M%s: @M%s: @can not set password prompt@can not make invisible@can not read password@E@ @%s no connection jet@Can not read from fd %d@U@F@/dev/blog@can not open named fifo %s@/var/log/boot.log@/var/log/boot.old@/var/log@Can not rename %s@Can not write to %s@Can not open %s@error: console pointer empty@re@can not open /proc/consoles@/dev/char/%s@can not allocate string@%u:%u@%*s %*s (%[)]) %[0-9:]@no device provided@can not scan %s@can not open /dev@/dev/%s@can not handle %s@/proc/mounts@tmpfs@%s/blogd-XXXXXX@/dev/shm

#3 Updated by riafarov over 2 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

As it is not possible to remove the option, only to work around it, I just improved the comment explaining why console=tty is needed: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3009
If kernel messages make the tests unstable, we will need to implement a workaround; for now, resolving.

#4 Updated by riafarov over 2 years ago

  • Subject changed from Remove "console=tty" from tests to Remove "console=tty" boot option for installed system
  • Status changed from Resolved to In Progress

As our research has shown, we cannot simply remove the console=tty option from the installation boot options, which leads to kernel messages being printed on the active tty and interfering with test execution (especially yast ncurses). After a short discussion we decided to introduce an additional step that removes console=tty from the boot options at the end of the installation, as a reboot is required to make it work.
To be checked: whether new boot options can be applied at runtime. If so, we can introduce this in the consoletest_setup test suite, where it logically belongs.

#5 Updated by riafarov over 2 years ago

Unfortunately, modifying the boot options at the end of the installation is not possible, as /etc/default/grub has not been created yet.
So we either have to modify the boot options as a step during the installation or do this on the installed system. In both cases a reboot is required.

Modifying the syslog configuration didn't help, as /dev/console is not really used there.

The question of how to do this is still open. With syslog we are able to redirect all messages to the serial device by adding *.* -/dev/ttyS0 or kern.* /dev/ttyS0

Another major point: https://openqa.suse.de/tests/989670#step/textinfo/1
Here we got some kernel messages, but they cannot be found in the serial log attached to the test!

#6 Updated by riafarov over 2 years ago

  • Status changed from In Progress to Feedback

#7 Updated by riafarov over 2 years ago

  • Assignee changed from riafarov to okurz

After discussion it was decided to leave the current behavior as it is, on the assumption that every kernel message which disrupts a test has to be filed as a bug and processed individually.
Another thing we may improve is to provide more information about the failure, possibly by collecting more logs. The workarounds mentioned in the bugzilla ticket are valid and may be used by customers to resolve the issue of a kernel message flood in case systemd is running.
Another issue which remains is the inconsistency in the behavior of the "console" boot parameter between installation and installed system, as well as the user not getting notified about kernel messages while in an x11 session.
TBD: what we want to do here.

A good example is https://openqa.opensuse.org/tests/419586#step/seahorse/8 where btrfs emitted info messages:
[ 202.773864] BTRFS info (device vda2): qgroup scan completed (inconsistency flag cleared)
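
To check whether a message like the one above also reached a job's serial log, one could filter the log for lines that start with a kernel timestamp. The following is only a minimal sketch: the helper name `kernel_lines` is made up for illustration, and the assumption that such lines begin with a bracketed `seconds.microseconds` timestamp is taken from the single example quoted above.

```shell
# Print only lines that start with a kernel timestamp such as "[ 202.773864] ".
# Reads from standard input; the pattern is an assumption based on the
# btrfs message quoted in this ticket.
kernel_lines() {
    grep -E '^\[ *[0-9]+\.[0-9]+\] '
}

# Sample input: one ordinary console line and the btrfs message quoted above.
printf '%s\n' \
    'linux login: root' \
    '[ 202.773864] BTRFS info (device vda2): qgroup scan completed (inconsistency flag cleared)' \
    | kernel_lines
```

Against a downloaded log this would be used as `kernel_lines < serial0.txt`, where the file name is an assumption about how the job's serial log was saved.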

#8 Updated by riafarov over 2 years ago

  • Related to action #17956: [sles][functional] test fails in command_not_found added

#9 Updated by riafarov over 2 years ago

https://bugzilla.suse.com/show_bug.cgi?id=1007813 this bug affects many tests. TBD if we want to make a workaround for this particular known issue.

#10 Updated by okurz over 2 years ago

We can confirm the original issue also when testing locally and with default kernel cmdline parameters so it is not an openQA specific issue.

Discussed with riafarov:

  • try to lower default kernel log messages ourselves by identifying the package which owns the file with rpm -qf <filename> and fix it e.g. by SR on the IBS/OBS package or upstream patch (e.g. github PR)

    • elaborate which level for which group -> @riafarov
  • add an openQA test to show the error explicitly, see https://bugzilla.suse.com/show_bug.cgi?id=1011815#c34 . The test flow can be as follows:

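# Switch to the root console
select_console('root-console');
# In the background: wait 3 seconds, then take the first network interface down and up again
script_run('dev=$(ls /sys/class/net | head -n1); sleep 3 && ip link set $dev down && ip link set $dev up &', 0);
select_console('root-console');
# Start yast2 in ncurses mode without waiting for the command to return
script_sudo('yast2', 0);
assert_screen('yast2-ncurses-complete-menu');
# Close the yast2 menu again
send_key 'alt-q';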

and have a "workaround" needle covering the console message popping up with "NIC Link is Down".

But first I should crosscheck with settings "splash=silent quiet" on grub.

EDIT: Ok, I was wrong. The above does not trigger any messages so "quiet" is enough. Let's check btrfs info messages.

#11 Updated by okurz over 2 years ago

  • Subject changed from Remove "console=tty" boot option for installed system to Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system)
  • Description updated (diff)

#12 Updated by okurz about 2 years ago

  • Subject changed from Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system) to [functional]Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system)

#13 Updated by okurz about 2 years ago

  • Target version set to Milestone 15

#14 Updated by okurz almost 2 years ago

  • Due date set to 24/04/2018

#15 Updated by okurz almost 2 years ago

  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

So … we do not remember seeing that issue impacting us in the past 8 months but checking linked bugs we find https://bugzilla.suse.com/show_bug.cgi?id=1011815 which is full of automatic reminder comments about "the issue" still appearing in tests. But apparently this is coming from an openSUSE needle "rescue-mode-emergency-shell-bsc1011815-20170825" created by riafarov with
commit ff461296
Author: Rodion Iafarov <riafarov@suse.com>
Date: Fri Aug 25 12:53:52 2017 +0000

rescue-mode-emergency-shell-bsc1011815-20170825 for opensuse-Tumbleweed-DVD-x86_64-Build20170823-extra_test_filesystem@64bit

but mgriessmeier, riafarov and I agree that it must be wrong.

next steps:

  • Remove needle (or replace by proper not-workaround one)
  • Add soft-failure to the modules using "dmesg -n 4" pointing to https://bugzilla.suse.com/show_bug.cgi?id=1011815
  • Find out why nobody cares. Well, we know: because nobody understands what we need to do. This is why we have this progress ticket, but at least the two steps above are a start
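
The `dmesg -n 4` call mentioned in the steps above lowers the console loglevel, which is the first of the four numbers in /proc/sys/kernel/printk. Kernel messages are printed on the console only if their level is numerically lower than that value. A rough sketch of this rule follows; the function names and the sample values "7 4 1 7" and "4 4 1 7" are assumptions for illustration and are not taken from a real job.

```shell
# Print the console loglevel, i.e. the first field of a printk setting
# as found in /proc/sys/kernel/printk.
console_loglevel() {
    printf '%s\n' "$1" | awk '{ print $1 }'
}

# Succeed if a message of numeric level $1 (0-7) would still be printed
# on the console under the printk setting $2.
reaches_console() {
    [ "$1" -lt "$(console_loglevel "$2")" ]
}

# With an assumed console loglevel of 7, info messages (level 6) get through.
if reaches_console 6 "7 4 1 7"; then
    echo "info messages (level 6) would reach the console"
fi

# After `dmesg -n 4` (needs root) the first field is 4, so they are suppressed.
if ! reaches_console 6 "4 4 1 7"; then
    echo "info messages (level 6) are suppressed"
fi
```

On a live system the current setting could be checked with `reaches_console 6 "$(cat /proc/sys/kernel/printk)"`. Errors (level 3) and more severe messages would still appear after `dmesg -n 4`.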

#16 Updated by okurz almost 2 years ago

  • Due date deleted (24/04/2018)
  • Target version changed from Milestone 15 to Milestone 17

see above next steps but we do not have capacity for this in S15-S17, delaying.

#17 Updated by okurz over 1 year ago

  • Duplicated by action #36802: [functional][u][sporadic] test fails in consoletest_setup added

#18 Updated by okurz over 1 year ago

  • Subject changed from [functional]Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system) to [functional][y] Handle console messages covering ncurses dialogs (was: Remove "console=tty" boot option for installed system)

#19 Updated by okurz over 1 year ago

  • Due date set to 03/07/2018

#20 Updated by okurz over 1 year ago

  • Target version changed from Milestone 17 to Milestone 17

#21 Updated by riafarov over 1 year ago

  • Assignee set to riafarov

#23 Updated by riafarov over 1 year ago

  • Status changed from Workable to Feedback

#24 Updated by JERiveraMoya over 1 year ago

  • Estimated time set to 5.00

#25 Updated by riafarov over 1 year ago

  • Status changed from Feedback to Resolved

I believe we can resolve this ticket now and create a new ticket if we want to try some other solution.

#26 Updated by okurz over 1 year ago

  • Status changed from Resolved to Feedback

Can you state why you think this is resolved? I doubt the mentioned bugs will ever be resolved without our doing, and IMHO this is what this ticket is about. Or do we already have the other ticket you mentioned? I am worried that by closing this ticket without having a follow-up we would just rely on our memory when this issue comes up again. And it will come up again, I am sure, because the soft-failure keeps posting reminders in the bug and the job cannot turn green as long as the underlying issue(s) are not fixed.

#27 Updated by riafarov over 1 year ago

  • Status changed from Feedback to Resolved

I don't see any follow-up actions here, to be honest. We know how to work around the issue, we record a soft-failure, and we have removed the needle which caused false positives. If you see anything else we can do here, we can continue with this ticket. So feel free to reopen, add follow-up actions or create a new ticket.

#28 Updated by okurz over 1 year ago

  • Description updated (diff)
  • Due date deleted (03/07/2018)
  • Status changed from Resolved to Workable
  • Assignee deleted (riafarov)
  • Target version changed from Milestone 17 to future

ok, so let's make clearer what I mean with the "acceptance criteria" which I have now put into the description. I doubt https://bugzilla.suse.com/show_bug.cgi?id=1011815 would be resolved without our help, so this needs to be ensured first before we can close this ticket here.

#29 Updated by oorlov over 1 year ago

  • Blocks action #36241: [functional][u][medium] test fails in NM_wpa2_enterprise - shows certificate selection screen where only pull down menu is expected added

#30 Updated by riafarov about 1 year ago

  • Status changed from Workable to Resolved
  • Assignee set to riafarov

We haven't seen issues recently, and there is a workaround by setting a higher log level for kernel messages, so I would resolve this one.

#31 Updated by okurz about 1 year ago

  • Status changed from Resolved to Workable

@riafarov, again, sorry, but as in #19398#note-28 I do not think this is closed. You only mentioned the same as before.

#32 Updated by riafarov about 1 year ago

  • Assignee deleted (riafarov)

okurz wrote:

@riafarov, again, sorry, but as in #19398#note-28 I do not think this is closed. You only mentioned the same as before.

Not really, we have performed many mitigation steps, including dmesg -n 4 calls. So I don't see what else we should do, as having a separate tty for serial output is not a silver bullet. But sure, we can keep it in the backlog.
I personally see no related issues in recent runs and therefore no action points.

#33 Updated by okurz about 1 year ago

Yes, sure. But please see the AC1 (which I did not change recently)
