On 31 May 2018 at 16:01, Kevin Hilman <khilman@baylibre.com> wrote:
Hi Neil,

On Thu, May 31, 2018 at 12:30 AM, Neil Williams
<neil.williams@linaro.org> wrote:
> On 30 May 2018 at 19:28, Kevin Hilman <khilman@baylibre.com> wrote:
>>
>> Hello,
>>
>> After upgrading to 2018.4 (also tried .5) many of our device-types
>> using base-uboot.jinja2 are broken.  While I really like the major
>> improvement to run commands individually, there seems to be some
>> problems and the LAVA output logs are very confusing, showing
>> concatenated strings, etc.
>>
>> Here is an example for an upstream device-type (meson-gxbb-p200), and
>> here is where it starts interacting with u-boot:
>> http://khilman.ddns.net/scheduler/job/15#L336
>
> This is a classic presentation of the receiving end not coping with the
> speed of the transfer of the characters - the device-type template can
> define:
>
> {% set boot_character_delay = 10 %}

Yes, I'm aware of this feature and actually tried, but it doesn't
help.  I use that feature on several platforms that have buggy UART
drivers in uboot, or that don't have proper FIFO control so commands
need to be slowed down, so I'm aware of the classic symptoms and that
solution.  Sorry, I should've mentioned that I tried that in my
original post.

That board is doing bad stuff on the U-Boot command line - commands are being repeated in the output despite only being sent once. That then causes the prompt to appear and therefore match.

gxb_p200_v1#setenv autoload no

appears twice in that log but was only sent once.

setenv initrd_high 0xffffffff

appears twice, although the second one is without the prompt (which is what should be happening because BBB does the same too).

The serverip *is* sent but the prompt is already out of sync, so the dhcp output happens afterwards.

At least in job 20, the TFTP does start but the board mangled the command to load the ramdisk.

The purpose of this change is to send the U-Boot commands and do the operations of those commands at the same time.

https://review.linaro.org/#/c/23195/

Change the flow of bootloader commands so they are executed individually

For uboot, instead of building and sending a singular large boot command,
execute each step individually to allow better monitoring of errors.
Also, make bootloader error messages per-bootloader.

That double prompt output is the problem. The old jinja2 template just happened to work with this because all the work was bundled up in one lump at the end. This was actually much more prone to errors on a wide range of boards.
 

However, for me, there are several reason why this is not a classic symptom

- it worked without any delays before the 2018.4 upgrade
- it works fine in 2018.4 if we replace base-uboot.jinja2 with one from 2017.11

That's more of a change than it seems - it radically changes how the interaction occurs with U-Boot - pushing back to the bundle of commands which caused errors elsewhere.
 
- the same board works with other pexpect tooling (pyboot) driving
u-boot without any delays
- it fails in the same way on several boards: it's always the command
right after dhcp that's missing

We're not seeing failures on boards available on staging.validation.linaro.org - which boards are failing?
 

Also, the classic symptoms that require that fix (at least in my
experience) are randomly lost characters, not entire commands that
just go missing.

We've seen more than just missing characters. Besides, if the missing character was a newline, then concatenation of commands might be seen but the execution of the command would be missing.
 

To me what this actually looks like is LAVA is not fully waiting for a
prompt before sending the next command. 

LAVA saw the prompt - the board issued it twice. The same prompt is being expected for the next line, so LAVA issued that line. There is no way to know that the board has not processed the previous line, we have to rely on the appearance of the prompt.
 
IOW, it sends a command
before the prompt shows up, and so that command is totally lost.  I
haven't dug enough in the code to know how/why it's happening, but
that's what it "feels" like so far.

The device is duplicating the output (at a very early stage) and that output includes a prompt, so the issuing of commands gets ahead of the processing.

I think we need a staging instance which has these boards available and which runs functional tests (where the test job never changes) on a regular basis, reporting back by email on incomplete test jobs. We do exactly this with staging.validation.linaro.org but that only covers those devices. The instance should use the nightly builds from the staging-repo (although it doesn't have to upgrade or run functional tests every night) so that we get a report before we make the release.

We're developing the docker support so that this can be done without needing a complete set of new devices, as this can be problematic. The idea will be that default runs use the production docker and functional tests use a docker containing the code from the nightly build.

Whilst this can work for the dispatcher now, it does not account for changes in the Jinja2 templates and we're working on lava-server docker support. It isn't ready yet.

This is one of those situations where there isn't a good alternative for simply having the relevant devices available for functional testing based on the staging-repo builds with reports coming in ahead of a release. That will provide some assurance ahead of upgrading the production instance.

It's just possible that the interrupt_char could be causing this problem, you've got a newline set - that could end up being carried over into the shell and U-Boot has a nasty habit of repeating the last command if a newline is entered with no other content.

IOW this change may have revealed an inherent problem in the device config which was hidden in the previous "bundled" set of commands.
 

Kevin



--