On Thu, 3 Jan 2019 at 23:13, Milosz Wasilewski <milosz.wasilewski@linaro.org> wrote:
On Thu, 3 Jan 2019 at 22:19, Andrei Narkevitch <Andrei.Narkevitch@cypress.com> wrote:
Hello,
What is the rationale for ignoring individual lava-test-case results when running a health check job?
For example, the following job failed one case: https://validation.linaro.org/scheduler/job/1902316
A failing test case can be an indication of device malfunction (e.g. out of disk space, hardware issues). Is it possible to force LAVA to fail a health check, and thus put the device into a "bad" state, if one of the test cases is not successful?
IMHO the idea is that the health check is directed more towards deployment/boot than the actual tests. If you really would like a health check in which every test counts, you should probably rewrite it using lava-test-raise: https://master.lavasoftware.org/static/docs/v2/writing-tests.html#index-8 This would terminate the health check at the first failure.
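A minimal sketch of the pattern (the command itself is illustrative; lava-test-raise ends the job as soon as it is called):

    run:
      steps:
        # illustrative check - any failed command raises and ends the job
        - ping -c 4 10.0.0.1 || lava-test-raise "network unreachable"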
milosz
I'll update the docs at https://master.lavasoftware.org/static/docs/v2/healthchecks.html#using-lava-... - tracked as https://git.lavasoftware.org/lava/lava/issues/196
Early on in LAVA, health checks were commonly only boot tests - if the device deployed and booted, the infrastructure was deemed to be working correctly.
Things have developed since then, and there are now many infrastructure elements which benefit from being tested inside a test action. Treat these as "setup" checks, e.g. for external hardware or peripheral support. You will tend to find that such checks belong in health checks but also in a setup phase of most (or all) test jobs on the DUT. Therefore, create a dedicated test definition which expressly tests the essential peripherals and other infrastructure. Any check which does not cause the DUT to fail to boot upon error needs one of these test definitions. Milosz is correct: as a "setup" test definition, each one should use lava-test-raise: https://master.lavasoftware.org/static/docs/v2/writing-tests.html#call-lava-...
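A sketch of such a setup test definition - the individual checks here are illustrative, so adapt the steps to whatever peripherals your DUT actually needs:

    metadata:
      format: Lava-Test Test Definition 1.0
      name: setup-checks
      description: "Fail fast if essential peripherals are missing."
    run:
      steps:
        # each check raises, ending the job, if the peripheral is absent
        - test -b /dev/sda || lava-test-raise "external USB drive not found"
        - ip link show eth0 || lava-test-raise "ethernet interface missing"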
Add these setup test definitions to every test job as the first definitions in the test action block. It's not practical to run a health check every time, and every deployment has the potential to affect some kind of external support, so check it and fail early, before spending time configuring and running the other test actions.
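In the job definition, that ordering looks something like this (the repository URL and paths are hypothetical):

    - test:
        timeout:
          minutes: 15
        definitions:
          # the setup checks run first and raise on any failure
          - repository: https://git.example.com/my-tests.git
            from: git
            path: testdefs/setup-checks.yaml
            name: setup-checks
          # the real tests only run if the setup checks passed
          - repository: https://git.example.com/my-tests.git
            from: git
            path: testdefs/main-suite.yaml
            name: main-suite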
I would advise against making these checks too intrusive or time-consuming. Also avoid testing things like available disk space unless you *know* how much space the subsequent test actions are going to need. (You might be able to use test action parameters for this, depending on your setup - there's a sketch below.) If space is constrained, look at running the test over NFS with some kind of on-device storage as scratch space, maybe a USB external drive.

Disentangle the test requirements from the device requirements, so that you know what you are testing. Test only one thing at a time, and break up your test structures so that you are always within the limits of the DUT, except for those few times when you are explicitly testing the limits of the DUT (and then test one limit at a time). At all costs, avoid testing everything in one go: a hundred different test jobs are better than a single test job which tries to run 100 tests and fails at test 45, because the failed job gives you no data at all for the majority of the tests. Careful use of setup actions, lava-test-raise and portable test scripts is what gets you to the point where intermittent and cascading errors can be identified and fixed. That's how labs get from a 40% failure rate to a 0.4% failure rate.
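If you do need a space check, here is a hedged sketch using test definition parameters - the defaults are illustrative and can be overridden per-job with 'parameters:' under the definition entry:

    metadata:
      format: Lava-Test Test Definition 1.0
      name: disk-space-check
    params:
      SCRATCH_DIR: /scratch
      MIN_FREE_MB: "512"
    run:
      steps:
        # column 4 of df output is the available space in MB
        - free_mb=$(df -m "$SCRATCH_DIR" | awk 'NR==2 {print $4}')
        - test "$free_mb" -ge "$MIN_FREE_MB" || lava-test-raise "only ${free_mb}MB free in $SCRATCH_DIR, need ${MIN_FREE_MB}MB"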
Feel free to use lava-common to help make your test action scripts portable: https://git.lavasoftware.org/lava/functional-tests/blob/master/testdefs/lava...
The simplest way to use lava-common is something like this: https://git.lavasoftware.org/lava/functional-tests/blob/master/testdefs/disp... - in a custom setup script, if an individual command MUST operate 100% successfully, make it a command; if not, make it a testcase. If a command fails for any reason, lava-test-raise is called and the test job ends with the device in Bad health.
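Expressed directly with the standard helpers, rather than the lava-common wrappers, the distinction looks roughly like this (commands and paths are illustrative):

    # Must succeed: on failure, lava-test-raise ends the job and, in a
    # health check, puts the device into Bad health.
    mount /dev/sda1 /scratch || lava-test-raise "scratch storage unavailable"

    # Allowed to fail: record a pass/fail result and carry on.
    if modprobe optional_driver; then
        lava-test-case optional-driver --result pass
    else
        lava-test-case optional-driver --result fail
    fi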
LAVA also makes other checks automatically and raises infrastructure exceptions if those fail - e.g. if static_info is defined but the specified hardware cannot be found: https://staging.validation.linaro.org/scheduler/device/staging-hi960-hikey-0... - which can result in a health check failing: https://staging.validation.linaro.org/scheduler/job/247199
Here is another example of using lava-test-raise, this time from a python custom script. It's not a health check because, in this case, there is one DUT with external hardware and one without - see the links below, and the wrapper sketch after them.
https://staging.validation.linaro.org/scheduler/job/246700/definition
https://git.lavasoftware.org/lava/functional-tests/blob/master/testdefs/arm-...
https://git.lavasoftware.org/lava/functional-tests/blob/master/testdefs/aep-...
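The linked scripts call lava-test-raise internally; the same effect can be sketched from a test definition step that wraps a custom script (the script name here is hypothetical):

    run:
      steps:
        # raise if the custom script signals failure via its exit code
        - ./automated/probe-setup.py || lava-test-raise "external probe not available"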
Thanks,
Andrei Narkevitch
Cypress Semiconductors