I'm sorry, as this is surely an FAQ, but I've spent quite a bit of time troubleshooting and reading. This is very similar to Kevin's thread from May[1], subject 'u-boot devices broken after 2018.4 upgrade, strange u-boot interaction'. In that thread's case, the issue was that interrupt_char was being set to "\n". My symptoms are the same, but interrupt_char is set to " " or "d".
I'm running LAVA from the latest released containers (2018.11), and trying to use a beaglebone-black with a more recent u-boot than exists in validation.l.o. qemu works fine.
The problem seems to be that LAVA thinks there's a prompt when there isn't, and so it sends commands too quickly. Here's example output from the serial console (job link[2]):
U-Boot 2017.07 (Aug 31 2017 - 15:35:58 +0000)
CPU : AM335X-GP rev 2.1
I2C: ready
DRAM: 512 MiB
No match for driver 'omap_hsmmc'
No match for driver 'omap_hsmmc'
Some drivers were not found
MMC: OMAP SD/MMC: 0, OMAP SD/MMC: 1
Net: cpsw, usb_ether
Press SPACE to abort autoboot in 10 seconds
=>
=> setenv autoload no
=> setenv initrd_high 0xffffffff
=> setenv fdt_high 0xffffffff
=> dhcp
link up on port 0, speed 100, full duplex
BOOTP broadcast 1
BOOTP broadcast 2
BOOTP broadcast 3
DHCP client bound to address 10.100.0.55 (1006 ms)
=> 172.28.0.4
Unknown command '172.28.0.4' - try 'help'
=> tftp 0x82000000 57/tftp-deploy-t7xus3ey/kernel/vmlinuz
link up on port 0, speed 100, full duplex
*** ERROR: `serverip' not set
...
When I interrupt u-boot manually, after I hit SPACE (or 'd', both work), u-boot *deletes* the character and then prints '=> ' (is that delete the root cause?). When LAVA runs, it shows an extra => and starts typing, as seen above. dhcp takes a second or two, so the subsequent command gets lost (in the above log we see a bare IP address, because 'setenv serverip' got lost).
If I set boot_character_delay to like 1000, it works because it gives enough time for dhcp to finish before typing the next character, but obviously makes the job very slow, and still not reliable.
I'm out of ideas.. help?
P.S. Two interesting things I've learned recently:
1) boot_character_delay must be specified in the device_types file. It's ignored when specified in the device file (surprising, as I see it listed in some people's device files[3]).
2) If you install ser2net from sid, you can set max-connections and do some _very handy_ voyeurism on the serial console while lava does its thing (hat tip Kevin Hilman for that one).
Thanks, Dan
[1] https://lists.lavasoftware.org/pipermail/lava-users/2018-May/001064.html
[2] https://lava.therub.org/scheduler/job/57
[3] https://git.linaro.org/lava/lava-lab.git/tree/lkft.validation.linaro.org/mas...
On Fri, 11 Jan 2019 at 20:28, Dan Rue dan.rue@linaro.org wrote:
I'm sorry, as surely this is an FAQ but I've spent quite a bit of time troubleshooting and reading. This is very similar to Kevin's thread from May subject 'u-boot devices broken after 2018.4 upgrade, strange u-boot interaction'. In that thread's case, the issue was that interrupt_char was being set to "\n". My symptoms are the same, but interrupt_char is set to " " or "d".
Space could well be problematic. This is down to how patterns get matched in the stream coming from the serial port and it's not LAVA as such which matters here, it's pexpect.
I'm running LAVA from the latest released containers (2018.11), and trying to use a beaglebone-black with a more recent u-boot than exists in validation.l.o. qemu works fine.
This will need investigation with that specific build of U-Boot on a suitable device and it's probably better to take this out of the container based instance to reduce the possible permutations.
Right now, we have other issues which are being tested on the beaglebone-black devices in staging.validation.linaro.org and I do not want to complicate those by adding this testing to those boards.
I do have beaglebone-black devices available via lkft-staging.validation.linaro.org, so this is probably best handled as an issue in GitLab where the U-Boot files can be attached (e.g. as a tarball I can unpack onto my own microSD card for those devices).
When I u-boot manually, after I hit SPACE (or 'd', both work), u-boot *deletes* the character and then prints '=> ' (is that delete the root cause?). When LAVA runs, it shows an extra => and starts typing as seen
The extra => is clearly a problem because pexpect is watching for every instance of that string and exiting the wait each time. That is what causes LAVA to proceed to sending more characters.
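A minimal sketch of the effect described above, assuming a simplified model in which each occurrence of the prompt string already sitting in the console buffer releases the next queued command. The function name `commands_released` and the sample strings are hypothetical, and this is plain string matching rather than pexpect or LAVA code.

```python
def commands_released(buffered_output, commands, prompt="=> "):
    """Return the commands that would be sent for the prompts already buffered.

    Simplified model: every prompt occurrence found in the buffer ends one
    wait and releases the next command, whether or not the device is ready.
    """
    released = []
    pos = 0
    for command in commands:
        idx = buffered_output.find(prompt, pos)
        if idx == -1:
            break  # a real expect-style wait would block here until timeout
        released.append(command)
        pos = idx + len(prompt)
    return released


queue = ["setenv autoload no", "dhcp"]
single_prompt = "Press SPACE to abort autoboot in 10 seconds\n=> "
double_prompt = "Press SPACE to abort autoboot in 10 seconds\n=> \n=> "

print(commands_released(single_prompt, queue))
print(commands_released(double_prompt, queue))
```

With one prompt in the buffer only the first command is released; with the doubled prompt both are released before the device has answered the first one, which matches the "sends commands too quickly" symptom.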
above. dhcp takes a second or two, and so the subsequent command starts to get lost (in the above log we see an IP, because 'setenv serverip' got lost).
-- Linaro - Kernel Validation
Lava-users mailing list Lava-users@lists.lavasoftware.org https://lists.lavasoftware.org/mailman/listinfo/lava-users
Hi Neil! Happy to see you back online.
On Mon, Jan 21, 2019 at 11:08:07AM +0000, Neil Williams wrote:
Space could well be problematic. This is down to how patterns get matched in the stream coming from the serial port and it's not LAVA as such which matters here, it's pexpect.
Yeah, but "d" has the same symptom (it appears an extra \n is getting sent).
I'm running LAVA from the latest released containers (2018.11), and trying to use a beaglebone-black with a more recent u-boot than exists in validation.l.o. qemu works fine.
This will need investigation with that specific build of U-Boot on a suitable device and it's probably better to take this out of the container based instance to reduce the possible permutations.
This is the hill I will die on :) The whole point of containers is to reduce permutations. You know exactly what I'm running, bit for bit, without ambiguity, cruft, or any other artifacts from past versions or ancillary packages that may be lying around the filesystem. Anyway, who's to say I'm even running Debian.
Right now, we have other issues which are being tested on the beaglebone-black devices in staging.validation.linaro.org and I do not want to complicate those by adding this testing to those boards.
I do have beaglebone-black devices available via lkft-staging.validation.linaro.org, so this is probably best handled as an issue in GitLab where the U-Boot files can be attached (e.g. as a tarball I can unpack onto my own microSD card for those devices).
It's OK - thank you for the offer, but I'm not asking for someone else to investigate and solve the problem. I'm actually trying to learn, and happy to do some of the legwork myself. It does seem curious to me that this is both what seems like a trivial issue and, I imagine, quite a common one. Isn't there just an option to eat the first =>, or to not send the extra \n? I'm missing something in my understanding.
--
Neil Williams
neil.williams@linaro.org http://www.linux.codehelp.co.uk/
On Tue, 22 Jan 2019 at 21:38, Dan Rue dan.rue@linaro.org wrote:
Hi Neil! Happy to see you back online.
I'm running LAVA from the latest released containers (2018.11), and trying to use a beaglebone-black with a more recent u-boot than exists in validation.l.o. qemu works fine.
This will need investigation with that specific build of U-Boot on a suitable device and it's probably better to take this out of the container based instance to reduce the possible permutations.
This is the hill I will die on :)
My equivalent: test one thing at a time.
The whole point of containers is to reduce permutations. You know exactly what I'm running, bit for bit, without ambiguity, cruft, or any other artifacts from past versions or ancillary packages that may be lying around the filesystem. Anyway, who's to say I'm even running Debian.
Right now, we have other issues which are being tested on the beaglebone-black devices in staging.validation.linaro.org and I do not want to complicate those by adding this testing to those boards.
I do have beaglebone-black devices available via lkft-staging.validation.linaro.org, so this is probably best handled as an issue in GitLab where the U-Boot files can be attached (e.g. as a tarball I can unpack onto my own microSD card for those devices).
It's OK - thank you for the offer, but I'm not asking for someone else to investigate and solve the problem. I'm actually trying to learn, and happy to do some of the legwork myself. It does seem curious to me that this is both what seems like a trivial issue and, I imagine, quite a common one.
Compared to the number of times that the current setup "just works", it's not that common but it still needs investigation. I'd advise at least two beaglebone-black devices, one using a U-Boot build showing the behaviour on staging.validation.linaro.org and one showing this secondary behaviour. Pin down where the difference lies (is it configuration or a timing issue within the hardware?) by swapping the SD cards around and get some real data.
Isn't there just an option to eat the first =>, or, to not send the extra \n? I'm missing something in my understanding.
No, because we don't have the data to know, for certain, whether this is something that LAVA can fix. "Just consuming the first prompt" produces a high risk of timing errors where the first prompt is missed and the test job hangs until the timeout.
It's possible that it's also related to the annoying U-Boot behaviour of repeating the last shell command every time the shell receives an empty newline instead of merely issuing a prompt.
In other situations, setting the boot_character_delay fixes the problem, indicating that at least in some cases, this is related to how the board receives keyboard input and how the buffer is handled within U-Boot.
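As a rough illustration of what a per-character delay does, here is a minimal sketch that assumes the delay value is in milliseconds and is applied after every character written to the serial connection. The function `send_slowly` and its parameters are hypothetical names for illustration, not LAVA's implementation.

```python
import time


def send_slowly(write, text, delay_ms=0, sleep=time.sleep):
    """Write text one character at a time, pausing delay_ms after each one.

    `write` is any callable accepting a single character, for example the
    write method of a serial connection. `sleep` is injectable for testing.
    """
    for char in text:
        write(char)
        if delay_ms:
            sleep(delay_ms / 1000.0)


sent = []
pauses = []
send_slowly(sent.append, "dhcp\n", delay_ms=10, sleep=pauses.append)
print("".join(sent), len(pauses))
```

A delay of 1000 on a command like `setenv serverip 172.28.0.4` would add roughly one second per character under this model, which is why the job becomes very slow while the doubled prompt itself is only masked rather than removed.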
There have been a lot of guesses in this area. A range of people have tried a range of solutions on a wide range of different U-Boot configurations without actually establishing any useful data on which config options on which versions show which behaviour and which combinations are fixed by which method.
A methodical approach is what is required to fix this intermittent problem. We need reliable, deterministic data on exactly what happens with each permutation.
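One way to make "each permutation" concrete is to enumerate the combinations up front and record a result against each. The sketch below is only an assumption about which variables matter: the interrupt characters and the delay of 1000 come from this thread, while the label "staging-build" and the function name `build_matrix` are hypothetical placeholders.

```python
from itertools import product

# Values taken from the thread where available; "staging-build" is a
# hypothetical label for the U-Boot build that behaves correctly.
UBOOT_BUILDS = ["2017.07", "staging-build"]
INTERRUPT_CHARS = [" ", "d"]
INTERRUPT_NEWLINE = [True, False]
CHARACTER_DELAYS = [0, 1000]


def build_matrix():
    """Return one dict per combination of the variables under test."""
    keys = ("uboot", "interrupt_char", "interrupt_newline", "boot_character_delay")
    combos = product(UBOOT_BUILDS, INTERRUPT_CHARS, INTERRUPT_NEWLINE, CHARACTER_DELAYS)
    return [dict(zip(keys, combo)) for combo in combos]


for case in build_matrix():
    print(case)
```

Running each case several times on the same board, and then again with the SD cards swapped, would separate configuration effects from timing effects in the hardware.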
On Thu, Jan 24, 2019 at 09:45:50AM +0000, Neil Williams wrote:
Isn't there just an option to eat the first =>, or, to not send the extra \n? I'm missing something in my understanding.
No, because we don't have the data to know, for certain, whether this is something that LAVA can fix. "Just consuming the first prompt" produces a high risk of timing errors where the first prompt is missed and the test job hangs until the timeout.
It's possible that it's also related to the annoying U-Boot behaviour of repeating the last shell command every time the shell receives an empty newline instead of merely issuing a prompt.
In other situations, setting the boot_character_delay fixes the problem, indicating that at least in some cases, this is related to how the board receives keyboard input and how the buffer is handled within U-Boot.
There have been a lot of guesses in this area. A range of people have tried a range of solutions on a wide range of different U-Boot configurations without actually establishing any useful data on which config options on which versions show which behaviour and which combinations are fixed by which method.
A methodical approach is what is required to fix this intermittent problem. We need reliable, deterministic data on exactly what happens with each permutation.
Good news - there is an option to change the default behavior of sending a new line with the interrupt character during uboot. Matt pointed me to the code and we found that I just had to set interrupt-newline to False. I did have to make the following change to the base template to allow it to be used for uboot (I guess it was written for grub).
https://git.lavasoftware.org/danrue/lava/commit/ab2ea1f817fe264e5671cce21a65...
I'm not sure if that's the right place for the change but if it is I'm happy to submit it.
So, with the above base template change, I added the following line to my device template:
{% set uboot_interrupt_newline = False %}
This option also does not seem to be documented, but again, I'm not sure where the correct place is to document it.
And with that, I've been running qemu and bbb health checks in a loop successfully.
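To illustrate why the option helps, here is a minimal sketch assuming that the interrupt sequence is simply the interrupt character optionally followed by a newline, and using a toy model of U-Boot in which the first character stops autoboot and every following newline prints another prompt. Both functions are hypothetical illustrations, not code from LAVA or U-Boot.

```python
def interrupt_sequence(interrupt_char, interrupt_newline=True):
    """Build the string sent to stop autoboot.

    The default of True mirrors the default behaviour described in this
    thread, where a newline is sent together with the interrupt character.
    """
    return interrupt_char + ("\n" if interrupt_newline else "")


def prompts_printed(sequence):
    """Toy model: one prompt for stopping autoboot, one more per extra newline."""
    return 1 + sequence[1:].count("\n")


for newline in (True, False):
    seq = interrupt_sequence(" ", interrupt_newline=newline)
    print(repr(seq), prompts_printed(seq))
```

Under this model the default sequence yields two prompts and the sequence without the newline yields one, which is consistent with the extra => disappearing once uboot_interrupt_newline is set to False.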
If anyone's read this far, I wrote up some of this experience on Twitter @ https://twitter.com/mndrue/status/1088627889426350080. Please share or reply with any feedback!
Dan
On Fri, 25 Jan 2019 at 15:39, Dan Rue dan.rue@linaro.org wrote:
Good news - there is an option to change the default behavior of sending a new line with the interrupt character during uboot. Matt pointed me to the code and we found that I just had to set interrupt-newline to False.
OK, so that needs documentation in https://master.lavasoftware.org/static/docs/v2/integrate-uboot.html#configur...
I did have to make the following change to the base template to allow it to be used for uboot (I guess it was written for grub).
https://git.lavasoftware.org/danrue/lava/commit/ab2ea1f817fe264e5671cce21a65...
This change looks ok. It would be good to have the documentation change in the same merge request.
The docs content would be particularly useful as a new section at the same level as Configuration and Prompts. Crucially, the section needs to describe the symptoms and how to detect the need for this option. I don't know the source of U-Boot at all but if there is a chance that this is somehow related to a configuration option within the U-Boot build or something which can be detected on the device after U-Boot has been deployed, it would be very useful to include such content. Feel free to include a link to this thread in the list archive for this list. https://lists.lavasoftware.org/pipermail/lava-users/2019-January/001533.html
With an addition to the docs, your branch can be turned into a merge request and I'd expect that to be accepted, subject to the normal CI.
I'm not sure if that's the right place for the change but if it is I'm happy to submit it.
So, with the above base template change, I added the following line to my device template:
{% set uboot_interrupt_newline = False %}
This option also does not seem to be documented, but again, I'm not sure where the correct place is to document it.
https://master.lavasoftware.org/static/docs/v2/integrate-uboot.html#configur...
The source in git for this page is doc/v2/integrate-uboot.rst
And with that, I've been running qemu and bbb health checks in a loop successfully.
If anyone's read this far, I wrote up some of this experience on Twitter @ https://twitter.com/mndrue/status/1088627889426350080. Please share or reply with any feedback!
Dan
The problem seems to be that LAVA thinks there's a prompt when there isn't, and so it sends commands too quickly. Here's example output from the serial console (job link[2]):
U-Boot 2017.07 (Aug 31 2017 - 15:35:58 +0000)
CPU : AM335X-GP rev 2.1
I2C: ready
DRAM: 512 MiB
No match for driver 'omap_hsmmc'
No match for driver 'omap_hsmmc'
Some drivers were not found
MMC: OMAP SD/MMC: 0, OMAP SD/MMC: 1
Net: cpsw, usb_ether
Press SPACE to abort autoboot in 10 seconds
=> => setenv autoload no
=> setenv initrd_high 0xffffffff
=> setenv fdt_high 0xffffffff
=> dhcp
link up on port 0, speed 100, full duplex
BOOTP broadcast 1
BOOTP broadcast 2
BOOTP broadcast 3
DHCP client bound to address 10.100.0.55 (1006 ms)
=> 172.28.0.4
Unknown command '172.28.0.4' - try 'help'
=> tftp 0x82000000 57/tftp-deploy-t7xus3ey/kernel/vmlinuz
link up on port 0, speed 100, full duplex
*** ERROR: `serverip' not set
...
When I u-boot manually, after I hit SPACE (or 'd', both work), u-boot *deletes* the character and then prints '=> ' (is that delete the root cause?). When LAVA runs, it shows an extra => and starts typing as seen above. dhcp takes a second or two, and so the subsequent command starts to get lost (in the above log we see an IP, because 'setenv serverip' got lost).
The extra => is clearly a problem because pexpect is watching for every instance of that string and exiting the wait each time. That is what causes LAVA to proceed to sending more characters.
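The effect of the extra => on a prompt-matching loop can be sketched roughly as follows. This is only an illustration, not LAVA's code: ExpectBuffer and prompts_available are hypothetical names, and a plain regular expression stands in for pexpect. Each successful match releases one command, so two prompts in the buffer mean the second command can be sent before U-Boot has finished the first.

```python
import re

# The U-Boot prompt string seen in the log above.
PROMPT = re.compile(r"=> ")


class ExpectBuffer:
    """Hypothetical stand-in for an expect-style matcher (not pexpect itself)."""

    def __init__(self):
        self.data = ""

    def feed(self, text):
        # Append console output received from the device.
        self.data += text

    def expect_prompt(self):
        # Return True and consume up to the end of the first prompt found,
        # or return False if no prompt is in the buffer.
        match = PROMPT.search(self.data)
        if match is None:
            return False
        self.data = self.data[match.end():]
        return True


def prompts_available(console_text):
    """Count how many commands would be released by this console output."""
    buf = ExpectBuffer()
    buf.feed(console_text)
    count = 0
    while buf.expect_prompt():
        count += 1
    return count


banner = "Press SPACE to abort autoboot in 10 seconds\n"
one_prompt = prompts_available(banner + "=> ")
two_prompts = prompts_available(banner + "=> => ")
print(one_prompt, two_prompts)
```

With a single prompt the sender stays in step with U-Boot; with the duplicated prompt it runs one command ahead, which is consistent with 'setenv serverip' being typed while dhcp was still running.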
If I set boot_character_delay to like 1000, it works because it gives enough time for dhcp to finish before typing the next character, but obviously makes the job very slow, and still not reliable.
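To put a rough number on "very slow", here is a back-of-the-envelope sketch. It assumes boot_character_delay is a per-character delay in milliseconds and that the newline ending each command is delayed like any other character; both are assumptions, and typing_time_seconds is a hypothetical helper. The commands are the ones visible in the log above.

```python
# Commands taken from the serial console log in this thread.
COMMANDS = [
    "setenv autoload no",
    "setenv initrd_high 0xffffffff",
    "setenv fdt_high 0xffffffff",
    "dhcp",
]


def typing_time_seconds(commands, delay_ms):
    """Estimate the time spent typing, assuming one delay per character.

    Assumption: each command is followed by one newline, which is
    delayed in the same way as the other characters.
    """
    chars = sum(len(command) + 1 for command in commands)
    return chars * delay_ms / 1000.0


slow = typing_time_seconds(COMMANDS, 1000)
print(f"{slow:.0f} seconds just to type the first four commands")
```

Under those assumptions the delay is paid on every character of every command, not only where the board actually needs time to catch up.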
I'm out of ideas.. help?
P.S. Two interesting things I've learned recently:
1) boot_character_delay must be specified in the device_types file. It's ignored when specified in the device file (surprising, as I see it listed in some people's device files[3]).
2) If you install ser2net from sid, you can set max-connections and do some _very handy_ voyeurism on the serial console while lava does its thing (hat tip Kevin Hilman for that one).
Thanks, Dan
[1] https://lists.lavasoftware.org/pipermail/lava-users/2018-May/001064.html
[2] https://lava.therub.org/scheduler/job/57
[3] https://git.linaro.org/lava/lava-lab.git/tree/lkft.validation.linaro.org/mas...
-- Linaro - Kernel Validation
Lava-users mailing list Lava-users@lists.lavasoftware.org https://lists.lavasoftware.org/mailman/listinfo/lava-users
--
Neil Williams
neil.williams@linaro.org http://www.linux.codehelp.co.uk/
-- Linaro - Kernel Validation
On Fri, Jan 25, 2019 at 04:54:27PM +0000, Neil Williams wrote:
On Fri, 25 Jan 2019 at 15:39, Dan Rue dan.rue@linaro.org wrote:
On Thu, Jan 24, 2019 at 09:45:50AM +0000, Neil Williams wrote:
On Tue, 22 Jan 2019 at 21:38, Dan Rue dan.rue@linaro.org wrote:
On Mon, Jan 21, 2019 at 11:08:07AM +0000, Neil Williams wrote:
On Fri, 11 Jan 2019 at 20:28, Dan Rue dan.rue@linaro.org wrote:
Isn't there just an option to eat the first =>, or, to not send the extra \n? I'm missing something in my understanding.
No, because we don't have the data to know, for certain, whether this is something that LAVA can fix. "Just consuming the first prompt" produces a high risk of timing errors where the first prompt is missed and the test job hangs until the timeout.
It's possible that it's also related to the annoying U-Boot behaviour of repeating the last shell command every time the shell receives an empty newline instead of merely issuing a prompt.
In other situations, setting the boot_character_delay fixes the problem, indicating that at least in some cases, this is related to how the board receives keyboard input and how the buffer is handled within U-Boot.
There have been a lot of guesses in this area. A range of people have tried a range of solutions on a wide range of different U-Boot configurations without actually establishing any useful data on which config options on which versions show which behaviour and which combinations are fixed by which method.
A methodical approach is what is required to fix this intermittent problem. We need reliable, deterministic data on exactly what happens with each permutation.
I updated the commit to include documentation for this option as suggested, as well as interrupt_char at https://git.lavasoftware.org/lava/lava/merge_requests/350
I was a bit confused between interrupt-character (in base.jinja2), interrupt_char (in base-uboot.jinja2), and uboot_interrupt_character (in base.jinja2). One defaults to '', one to ' ', and the source comment in base-uboot.jinja2 disagreed with the line of code following it... I documented the behavior based on what I observed empirically, but I'm not certain of its correctness.
base.jinja2: interrupt-character: '{{ uboot_interrupt_character | default(" ") }}'
base-uboot.jinja2: interrupt_char: "{{ interrupt_char }}"
base-uboot.jinja2: interrupt_char: "{{ interrupt_char|default('') }}"
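One way to see why the three lines behave differently is to emulate the default filter in plain Python. This is a sketch under assumptions, not the real template rendering: jinja_default and render_interrupt_settings are hypothetical helpers, and it assumes an undefined variable without a default filter renders as an empty string (Jinja2's behaviour with its standard Undefined type, which may not be what LAVA configures).

```python
# Sentinel meaning "variable not set in the device template".
_UNSET = object()


def jinja_default(value, fallback):
    """Mimic Jinja2's default filter: use fallback only when undefined."""
    return fallback if value is _UNSET else value


def render_interrupt_settings(context):
    """Resolve the three template lines quoted above for a given context."""
    uboot_char = context.get("uboot_interrupt_character", _UNSET)
    char = context.get("interrupt_char", _UNSET)
    return {
        # base.jinja2: uboot_interrupt_character | default(" ")
        "base interrupt-character": jinja_default(uboot_char, " "),
        # base-uboot.jinja2: interrupt_char with no default filter.
        # Assumption: an undefined variable renders as an empty string.
        "uboot interrupt_char (no default)": "" if char is _UNSET else char,
        # base-uboot.jinja2: interrupt_char | default('')
        "uboot interrupt_char (default '')": jinja_default(char, ""),
    }


nothing_set = render_interrupt_settings({})
char_set = render_interrupt_settings({"interrupt_char": "d"})
print(nothing_set)
print(char_set)
```

In this model, setting interrupt_char changes only the two base-uboot.jinja2 lines, while the base.jinja2 line keeps its " " default unless uboot_interrupt_character is set.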
Dan
On Mon, 28 Jan 2019 at 01:21, Dan Rue dan.rue@linaro.org wrote:
base.jinja2: interrupt-character: '{{ uboot_interrupt_character | default(" ") }}'
This is the base constant for u-boot.
base-uboot.jinja2: interrupt_char: "{{ interrupt_char }}"
base-uboot.jinja2: interrupt_char: "{{ interrupt_char|default('') }}"
These are the method specific variables. The first only affects devices which require loading fastboot from U-Boot.
There are a couple of different use cases and there's been a couple of different changes, some of which have not affected all use cases.
The relevant code is the BootloaderInterruptAction class in https://git.lavasoftware.org/lava/lava/blob/master/lava_dispatcher/actions/b...
This needs some reworking but it's hard to do that when so many different devices have varying behaviour in this area. (It's one of the reasons for developing lavafed.)
So it's something we need to revisit and simplify.