On 30 January 2018 at 15:45, Anibal Limon <anibal.limon@linaro.org> wrote:


On 30 January 2018 at 02:15, Neil Williams <neil.williams@linaro.org> wrote:
On 29 January 2018 at 04:23, Anibal Limon <anibal.limon@linaro.org> wrote:
Hi,

I created a Piglit test definition [1]. Before trying it out on the V.L.O. I used test-runner to make a run locally (db410c) without lava-dispatcher; the run took ~2 hours 5 minutes [2] (without download/flash). After that I tried to execute it on the V.L.O.: it consumed 8 hours of execution and then finished with a timeout, so the test didn't reach the end and was only able to execute 14486 of 39249 tests.

By V.L.O you mean LKFT. Please be careful with your terminology. V.L.O is an entirely separate LAVA installation.

Right,
 
 

So I downloaded the definition and executed the lava-run command on my desktop, and the full test execution took ~2 hours 38 minutes, which makes sense because it needs to download and flash the boot and rootfs images. [4]

I introduced a time call to measure the Piglit execution time in test-runner and in lava-dispatcher, and the exec times are the same, ~96 minutes in both executions; search for 'Thank you for running Piglit' in the logs [2][5].

I noticed that when I ran lava-run on my desktop the CPU usage increased to 100% when it reached the step of actually running the test [6]. I did some debugging and discovered that the way results and notifications are obtained from the serial console is with pexpect and certain regexes; I attached gdb to the python process and found that the regexes execute over the whole serial buffer at very short intervals [7].

It's not the entire buffer (that would include the bootloader operations), it is the buffer since the last pattern match (the start of the last test run). This is standard pexpect behaviour. In most cases it doesn't cause any issues, but one mitigation is to match more often.

Yes, it isn't the entire buffer, it's the LAVA test shell Piglit buffer.


I've created a review to provide advice on this area: https://review.linaro.org/#/c/23977/

Summary: When a test produces pages and pages of output, it isn't a problem on an interactive terminal but it becomes a significant problem for anything logging that output. Try to adapt and/or wrap your test operations so that the output is limited to the useful content. Progress bars are a particular problem: although on an interactive terminal the bar appears to go from 0% to 100% on a single line, when that output is received over serial it can arrive as thousands of separate lines, and that is just noise. Not only does it bloat the log file (making it hard to scroll through to the interesting content) but it also makes the pattern matching much slower.

Apart from that, very large amounts of test shell output without any matching patterns (so inside a single lava-test-case or before any test case is created) are known to cause issues. I've experimented with an updated pexpect package which is supposed to have a fix in this area, but that showed almost no benefit. Simply reducing the amount of logging being generated, on the other hand, had a massive effect, cutting execution time to 25% of the previous figure.

Therefore, the advice is to wrap all such test operations in scripts which can filter out the noise and only print/echo interesting lines. Combined with functionality like tee to preserve the full / nearly full content in case it's needed (using the Publishing API), this matches how we've been running the LAVA unit tests as test jobs for some time. See https://git.linaro.org/lava-team/refactoring.git/tree/functional/unittests.py
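As a very rough sketch of such a wrapper (Python here, assuming it is available in the test shell; the piglit invocation, the file path and the filter pattern are only illustrative, not a tested example):

    #!/usr/bin/env python3
    # Filtering "tee": keep the full output in a file (e.g. for the
    # Publishing API) and send only the interesting lines to the console.
    import re
    import subprocess
    import sys

    FULL_LOG = "/tmp/piglit-full.log"          # illustrative path
    INTERESTING = re.compile(r"(pass|fail|crash|skip|Thank you for running Piglit)")

    proc = subprocess.Popen(
        ["piglit", "run", "sanity", "/tmp/piglit-results"],   # illustrative invocation
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    with open(FULL_LOG, "w") as full:
        for line in proc.stdout:
            full.write(line)                   # everything goes to the file
            if INTERESTING.search(line):
                sys.stdout.write(line)         # only useful lines reach the console
                sys.stdout.flush()
    sys.exit(proc.wait())

The file with the full output can then be published if the detail is ever needed.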

Very, very large log files are handled differently - if you have many megabytes of useful content, LAVA admins can set a viewing limit to prevent timeouts in the browser. The test job works and all the results are available, it's just that to view the log you have to download it instead of viewing it in the browser.
 
 

One solution is to "chunk" the operations so that results are reported in batches by redirecting to local storage (not emitting anything to the console). Then process the storage and report the results directly, one run per batch.
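Only a sketch of the chunking idea - the batch names and the per-batch pass/fail decision are made up, but lava-test-case itself is the standard test shell helper:

    #!/usr/bin/env python3
    # Run Piglit in batches, keep each batch's output in local storage and
    # report one LAVA test case per batch instead of streaming everything.
    import subprocess

    BATCHES = ["shaders", "glslparsertest", "spec"]    # hypothetical groupings

    for batch in BATCHES:
        log_path = "/tmp/piglit-%s.log" % batch
        with open(log_path, "w") as log:
            # All of the batch output goes to local storage, not the console.
            ret = subprocess.call(
                ["piglit", "run", "-t", batch, "quick", "/tmp/results-%s" % batch],
                stdout=log, stderr=subprocess.STDOUT,
            )
        # One line on the console per batch.
        subprocess.call([
            "lava-test-case", "piglit-%s" % batch,
            "--result", "pass" if ret == 0 else "fail",
        ])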

OK, but in long test runs it needs to keep showing results on the console to avoid the feeling that nothing is being executed. I'm not familiar with pexpect, but maybe it has an option to use two buffers: one to receive and process the incoming data from serial, and another to hold the full test buffer.

pexpect does not support this, AFAICT. However, by using a wrapper script, that becomes easy. The second buffer (in RAM on the DUT or in an LXC) can be written to a file while only limited output is sent to the console, yet that limited output is sent "live" so that there is always *some* output, just without the noise. It is only the output sent to the console which matters.
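For example (illustrative only - the command, the path and the interval are whatever suits the test):

    #!/usr/bin/env python3
    # "Second buffer" done in the wrapper: full output to a file, and a short
    # live status line on the console every EVERY_N lines of real output.
    import subprocess
    import sys

    FULL_LOG = "/tmp/piglit-full.log"
    EVERY_N = 500

    proc = subprocess.Popen(
        ["piglit", "run", "sanity", "/tmp/piglit-results"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    count = 0
    with open(FULL_LOG, "w") as full:
        for line in proc.stdout:
            full.write(line)
            count += 1
            if count % EVERY_N == 0:
                # Live, but low-volume: one console line per EVERY_N lines.
                print("piglit: %d lines of output so far" % count, flush=True)
    sys.exit(proc.wait())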

Wrapper scripts like this can also report interim test case results. If your test operation runs iterations of the same test, you can report a test case for each N iterations. Not only does that break up the buffer but it also gets you some results even during long running operations which may ultimately fail. (A replacement for the progress bar, if you like.)
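Something along these lines, where the per-iteration command is just a placeholder for the real operation:

    #!/usr/bin/env python3
    # Report an interim LAVA test case every REPORT_EVERY iterations so the
    # buffer is broken up and partial results survive a later failure.
    import subprocess

    ITERATIONS = 1000
    REPORT_EVERY = 100

    failed = 0
    for i in range(1, ITERATIONS + 1):
        ret = subprocess.call(["true"])        # placeholder for one iteration
        if ret != 0:
            failed += 1
        if i % REPORT_EVERY == 0:
            subprocess.call([
                "lava-test-case", "iteration-%04d" % i,
                "--result", "pass" if failed == 0 else "fail",
            ])
            failed = 0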

Wrapper scripts also have the ability to dump out the last N lines of the log file should an exception be caught during operation of the wrapper script; that can help triage as well.
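A sketch of that last point (file name and line count are arbitrary):

    #!/usr/bin/env python3
    # If the wrapper itself fails, print only the tail of the full log to
    # help triage without flooding the serial console.
    from collections import deque

    FULL_LOG = "/tmp/piglit-full.log"
    TAIL_LINES = 50

    def run_tests():
        raise RuntimeError("placeholder for the real test operation")

    try:
        run_tests()
    except Exception as exc:
        print("wrapper caught: %s" % exc)
        try:
            with open(FULL_LOG) as full:
                for line in deque(full, maxlen=TAIL_LINES):
                    print(line, end="")
        except OSError:
            pass
        raise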
 
 

It would also help not to use the verbose flag or to redirect all the Piglit output to file(s) to avoid flooding the test log.

(We do something similar with the lava unit tests - redirect to a script which filters out the noise and just reports the useful content.)
 

That could be the reason for the delay in the test execution, since the Piglit command takes the same time to run with test-runner and lava-dispatcher.

pexpect happens inside lava-run, so I'm not sure you are attributing the delay correctly. You describe the lava-run output as taking not much longer than the local run.

Yes, maybe there is another issue to explain the delay in the execution. This is the last run on the V.L.O. [1]; I have the job set to time out after 2 hours and it was only able to execute 08234 of 39249 tests, or so it seems, since the lava-test-shell doesn't affect the device execution time.


 
Test job 101274 on LKFT didn't get to the point of reporting any results from the test shell.

Start: 17:22
LXC ready and fastboot downloaded and unpacked: 17:25
Overlay created and power reset the device: 17:27

First fastboot operation (rootfs): 17:27 to 17:55
  OKAY [846.845s]
  finished. total time: 1676.151s

Start of the LAVA test shell: 17:56
First output from test: 17:58

First output which actually matched any patterns expected by LAVA: 01:22 the following day.

So, yes, that was a large buffer which needed to be scanned. Breaking up the output into test sets and test cases would make things a lot easier and let the job run more smoothly.

 

_______________________________________________
Lava-users mailing list
Lava-users@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lava-users



