On Thu, Jan 24, 2019 at 09:45:50AM +0000, Neil Williams wrote:
On Tue, 22 Jan 2019 at 21:38, Dan Rue <dan.rue@linaro.org> wrote:
Hi Neil! Happy to see you back online.
On Mon, Jan 21, 2019 at 11:08:07AM +0000, Neil Williams wrote:
On Fri, 11 Jan 2019 at 20:28, Dan Rue <dan.rue@linaro.org> wrote:
I'm sorry, as surely this is an FAQ, but I've spent quite a bit of time troubleshooting and reading. This is very similar to Kevin's thread from May[1], subject 'u-boot devices broken after 2018.4 upgrade, strange u-boot interaction'. In that thread's case, the issue was that interrupt_char was being set to "\n". My symptoms are the same, but interrupt_char is set to " " or "d".
Space could well be problematic. This is down to how patterns get matched in the stream coming from the serial port and it's not LAVA as such which matters here, it's pexpect.
Yea, but "d" has the same symptom (it appears an extra \n is getting sent).
I'm running LAVA from the latest released containers (2018.11), and trying to use a beaglebone-black with a more recent u-boot than exists in validation.l.o. qemu works fine.
This will need investigation with that specific build of U-Boot on a suitable device and it's probably better to take this out of the container based instance to reduce the possible permutations.
This is the hill I will die on :)
My equivalent: test one thing at a time.
The whole point of containers is to reduce permutations. You know exactly what I'm running, bit for bit, without ambiguity, cruft, or any other artifacts from past versions or ancillary packages that may be lying around the filesystem. Anyway, who's to say I'm even running debian.
Right now, we have other issues which are being tested on the beaglebone-black devices in staging.validation.linaro.org and I do not want to complicate those by adding this testing to those boards.
I do have beaglebone-black devices available via lkft-staging.validation.linaro.org, so this is probably best handled as an issue in GitLab where the U-Boot files can be attached (e.g. as a tarball I can unpack onto my own microSD card for those devices).
It's OK - thank you for the offer, but I'm not asking for someone else to investigate and solve the problem. I'm actually trying to learn, and happy to do some of the legwork myself. It does seem curious to me that this is both a seemingly trivial issue and, I imagine, quite a common one.
Compared to the number of times that the current setup "just works", it's not that common but it still needs investigation. I'd advise at least two beaglebone-black devices, one using a U-Boot build showing the behaviour on staging.validation.linaro.org and one showing this secondary behaviour. Pin down where the difference lies (is it configuration or a timing issue within the hardware?) by swapping the SD cards around and get some real data.
Isn't there just an option to eat the first =>, or, to not send the extra \n? I'm missing something in my understanding.
No, because we don't have the data to know, for certain, whether this is something that LAVA can fix. "Just consuming the first prompt" produces a high risk of timing errors where the first prompt is missed and the test job hangs until the timeout.
It's possible that it's also related to the annoying U-Boot behaviour of repeating the last shell command every time the shell receives an empty newline instead of merely issuing a prompt.
In other situations, setting the boot_character_delay fixes the problem, indicating that at least in some cases, this is related to how the board receives keyboard input and how the buffer is handled within U-Boot.
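To make the pacing idea concrete, here is a minimal sketch of what a per-character delay amounts to. It is not LAVA's implementation: `send_paced`, `write` and `sleep` are hypothetical names chosen for illustration, and the only thing taken from the thread is that the delay is applied between characters and is given in milliseconds.

```python
import time


def send_paced(write, line, delay_ms, sleep=time.sleep):
    """Write `line` one character at a time, pausing delay_ms between characters.

    `write` is any callable that accepts a single character (hypothetical
    stand-in for the serial connection). `sleep` is injectable so the
    pacing can be checked without actually waiting.
    """
    for index, char in enumerate(line):
        if index:
            # Pause before every character except the first.
            sleep(delay_ms / 1000.0)
        write(char)


if __name__ == "__main__":
    sent = []
    send_paced(sent.append, "dhcp\n", delay_ms=30)
    print("".join(sent), end="")
```

The cost grows linearly with the number of characters typed, which is why a large delay slows every boot command and not only the one that needed the extra time.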
There have been a lot of guesses in this area. A range of people have tried a range of solutions on a wide range of different U-Boot configurations without actually establishing any useful data on which config options on which versions show which behaviour and which combinations are fixed by which method.
A methodical approach is what is required to fix this intermittent problem. We need reliable, deterministic data on exactly what happens with each permutation.
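One way to keep such an investigation methodical is to write the permutations down before running anything. The sketch below only enumerates a test matrix; the factor names and values are assumptions drawn from this thread (the "2017.07" build appears in the console log, while "lab-default" is a placeholder label for whatever build the lab boards carry), so adjust them to the real setup.

```python
from itertools import product

# Hypothetical factors; each key is one thing to vary, one at a time.
FACTORS = {
    "uboot_build": ["2017.07", "lab-default"],  # second label is a placeholder
    "interrupt_char": [" ", "d"],
    "interrupt_newline": [True, False],
    "boot_character_delay_ms": [0, 1000],
}


def permutations(factors):
    """Return one dict per combination of factor values."""
    names = list(factors)
    return [dict(zip(names, values)) for values in product(*factors.values())]


if __name__ == "__main__":
    for number, case in enumerate(permutations(FACTORS), start=1):
        print(number, case)
```

Recording the outcome of each row (boots cleanly, extra prompt, lost command) gives the per-permutation data asked for above.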
Good news - there is an option to change the default behavior of sending a new line with the interrupt character during uboot. Matt pointed me to the code and we found that I just had to set interrupt-newline to False. I did have to make the following change to the base template to allow it to be used for uboot (I guess it was written for grub).
https://git.lavasoftware.org/danrue/lava/commit/ab2ea1f817fe264e5671cce21a65...
I'm not sure if that's the right place for the change but if it is I'm happy to submit it.
So, with the above base template change, I added the following line to my device template:
{% set uboot_interrupt_newline = False %}
This option also does not seem to be documented, but again, I'm not sure where the correct place is to document it.
And with that, I've been running qemu and bbb health checks in a loop successfully.
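For anyone following along, here is a toy model of why dropping the newline helps. It assumes, as the symptoms in this thread suggest, that the first character aborts autoboot and produces one prompt, and that every newline received afterwards produces another. Both function names are hypothetical and neither is LAVA or U-Boot code.

```python
def interrupt_sequence(interrupt_char, interrupt_newline=True):
    """Characters sent to stop autoboot.

    The default of True mirrors the default behaviour described in this
    thread (a newline is sent along with the interrupt character).
    """
    return interrupt_char + ("\n" if interrupt_newline else "")


def prompts_printed(sent):
    """Toy shell model: count the prompts printed in response to `sent`.

    The first character aborts autoboot and yields one prompt; each
    newline after it yields one more.
    """
    if not sent:
        return 0
    return 1 + sent[1:].count("\n")


if __name__ == "__main__":
    for newline in (True, False):
        sequence = interrupt_sequence(" ", newline)
        print(repr(sequence), prompts_printed(sequence))
```

Under this model the default sequence leaves two prompts on the console before any command has been sent, and the sequence without the newline leaves one.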
If anyone's read this far, I wrote up some of this experience on Twitter @ https://twitter.com/mndrue/status/1088627889426350080. Please share or reply with any feedback!
Dan
The problem seems to be that LAVA thinks there's a prompt when there isn't, and so it sends commands too quickly. Here's example output from the serial console (job link[2]):
U-Boot 2017.07 (Aug 31 2017 - 15:35:58 +0000)
CPU : AM335X-GP rev 2.1
I2C: ready
DRAM: 512 MiB
No match for driver 'omap_hsmmc'
No match for driver 'omap_hsmmc'
Some drivers were not found
MMC: OMAP SD/MMC: 0, OMAP SD/MMC: 1
Net: cpsw, usb_ether
Press SPACE to abort autoboot in 10 seconds
=>
=> setenv autoload no
=> setenv initrd_high 0xffffffff
=> setenv fdt_high 0xffffffff
=> dhcp
link up on port 0, speed 100, full duplex
BOOTP broadcast 1
BOOTP broadcast 2
BOOTP broadcast 3
DHCP client bound to address 10.100.0.55 (1006 ms)
=> 172.28.0.4
Unknown command '172.28.0.4' - try 'help'
=> tftp 0x82000000 57/tftp-deploy-t7xus3ey/kernel/vmlinuz
link up on port 0, speed 100, full duplex
*** ERROR: `serverip' not set
...
When I run u-boot manually, after I hit SPACE (or 'd', both work), u-boot *deletes* the character and then prints '=> ' (is that delete the root cause?). When LAVA runs, it shows an extra => and starts typing as seen above. dhcp takes a second or two, and so the subsequent command starts to get lost (in the above log we see an IP, because 'setenv serverip' got lost).
The extra => is clearly a problem because pexpect is watching for every instance of that string and exiting the wait each time. That is what causes LAVA to proceed to sending more characters.
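The effect of one spurious prompt can be reproduced with a small simulation. This is a simplified model and not pexpect or LAVA code: the driver sends the next command each time it has seen one more prompt, and the toy shell drops whole commands that arrive while a previous command is still running (the real log shows a partially lost command instead). The name `simulate`, the tick counts, and the full `setenv serverip 172.28.0.4` command are assumptions inferred from the log.

```python
COMMANDS = [
    "setenv autoload no",
    "setenv initrd_high 0xffffffff",
    "setenv fdt_high 0xffffffff",
    "dhcp",
    "setenv serverip 172.28.0.4",  # inferred from the log, not shown in full there
    "tftp 0x82000000 57/tftp-deploy-t7xus3ey/kernel/vmlinuz",
]


def simulate(commands, busy_ticks, spurious_prompts=0):
    """Return the commands a toy shell actually executes.

    busy_ticks maps a command to how many ticks it runs before the
    prompt comes back; commands not listed finish instantly.
    """
    unmatched_prompts = 1 + spurious_prompts  # prompts on the console at start
    remaining = list(commands)
    executed = []
    busy = 0  # ticks until the running command finishes

    while remaining or busy:
        if unmatched_prompts and remaining:
            # The driver matched a prompt, so it sends the next command.
            unmatched_prompts -= 1
            command = remaining.pop(0)
            if busy == 0:
                executed.append(command)
                busy = busy_ticks.get(command, 0)
                if busy == 0:
                    unmatched_prompts += 1  # instant command, prompt returns at once
            # Otherwise the shell is busy and the input is lost.
        elif busy:
            busy -= 1
            if busy == 0:
                unmatched_prompts += 1  # command finished, prompt printed
        else:
            break  # no prompt outstanding: a real driver would time out here
    return executed


if __name__ == "__main__":
    for extra in (0, 1):
        done = simulate(COMMANDS, {"dhcp": 2}, spurious_prompts=extra)
        lost = [c for c in COMMANDS if c not in done]
        print(f"spurious prompts: {extra}, lost: {lost}")
```

With no spurious prompt every command runs. With one, the driver stays a prompt ahead, so the command after dhcp is sent while dhcp is still running and is lost, which matches the missing serverip in the log.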
If I set boot_character_delay to something like 1000, it works because it gives enough time for dhcp to finish before typing the next character, but that obviously makes the job very slow, and it is still not reliable.
I'm out of ideas.. help?
P.S. Two interesting things I've learned recently:
1) boot_character_delay must be specified in the device_types file. It's ignored when specified in the device file (surprising, as I see it listed in some people's device files[3]).
2) If you install ser2net from sid, you can set max-connections and do some _very handy_ voyeurism on the serial console while lava does its thing (hat tip Kevin Hilman for that one).
Thanks, Dan
[1] https://lists.lavasoftware.org/pipermail/lava-users/2018-May/001064.html
[2] https://lava.therub.org/scheduler/job/57
[3] https://git.linaro.org/lava/lava-lab.git/tree/lkft.validation.linaro.org/mas...
-- Linaro - Kernel Validation
--
Neil Williams
neil.williams@linaro.org http://www.linux.codehelp.co.uk/