On Thu, 12 Jul 2018 at 13:03, <magnus@minimum.se> wrote:
Hey,

Where I work, we've been using LAVA since 2014 to test our in-house Linux distribution. Our devices are typically "low-end" devices with 128 to 256 MB RAM and up to 512 MB NAND flash. We use LAVA to test our BSPs, e.g. lots of connectivity/interface/IO tests. We also use it for stability testing and performance testing (like measuring the application context switch time or Ethernet TX rates).

For a while, there has been a growing concern within our team that LAVA might not be ideal for our testing needs. Right now, the team is discussing whether we should drop LAVA and use something else. There is even talk of developing our own test framework. I personally like the idea behind LAVA but also agree that it has been a bumpy road these past 4 years. Due to various bugs and missing features, we've several times been forced to upgrade to an unstable version of LAVA just to get normal operations working.

Please clarify. There are nightly builds of LAVA, developer builds and there are releases. Versions uploaded to the images.validation.linaro.org production-repo and to Debian backports are fully supported, stable releases.

We release new versions of the stable releases from time to time and there are particular security reasons to upgrade to 2018.5.post1, but unless you have local changes to LAVA and are using developer builds, you do not have an "unstable" version of LAVA. Developer builds and nightly builds are easily identifiable in the test job logs:

lava-dispatcher, installed at version: 2018.5.post1+12390.07889e548-1+stretch - nightly build from staging-repo on images.validation.linaro.org

lava-dispatcher, installed at version: 2018.5.post1-1+stretch - production release on production-repo on images.validation.linaro.org

lava-dispatcher, installed at version: 2018.5.post1-2~bpo9+1 - stretch-backports from Debian.

So unless your version string includes the git hash details (revision number + latest git hash), you are running a production release of LAVA.

 
Two times we've lost the entire test database because we were unable to recover from a LAVA upgrade.

Those should have been reported to the upstream team. We test extensively to prevent those problems and, unless you've got local changes which have not been submitted upstream, we have not had database upgrade failures in those tests or on active instances since about 2014.


We also run many instances which upgrade automatically, again, without any problems with database migrations.

We provide documentation on making backups and, provided you talk to us, we can often advise on how to fix database migration issues without dropping the database itself.

 
In those cases, it was easier for us to just "start over".  Today we use LAVA 2018.02.

That is an old production release. There are known security issues in that release, already described on lava-announce. Users need to upgrade.


 
I've compiled a list that summarizes the most pressing issues we've experienced with LAVA:

Overall, your list is familiar - we have all had similar problems whilst on the learning curve. (Yes, we recognise that the learning curve is steep and we try to provide documentation to help. We can only help if asked - we have no knowledge of your instance or problems until you ask.)

The common feature is this: LAVA can be complex because LAVA copes with complex problems arising from operating automation at scale. It is precisely this ability which will give you enough data to know whether a test failure is due to component A or component B or a combination of components A and B when used together with kernel module C. It is simply impossible to get all of this information from a single test-it-all test job; it would run for days. Worse, a single test job will frequently fail before even *attempting* to execute component B, just due to random or intermittent bugs which would be *caught* if separate test jobs were submitted to test component A fully before testing component B.

The larger matrix is easier to test than single test jobs because it can now be parallelised. 6 devices can test all permutations of component A in isolation. The same 6 devices can then get new test jobs which test component A in conjunction with component B. Each of the 6 test jobs has known, pre-determined differences from the others - one change (anywhere), one test job. When the matrix expands, add more devices. Let the automation do the heavy work, ideally overnight if your developer team is based in only a small number of timezones. Provide enough hardware to run the entire matrix in a few hours. One test job per device, if you have to. Idle devices are not a problem; provided you have remote power control, the devices will be without power whilst idle.

What does it matter if your submissions are bursty like that? Whilst the developers are active, there will be lots of idle devices to support hacking sessions and experimentation and blue-sky ideas and other wacky and unusual one-off tests. With an active development team, it is a mistake to have a LAVA instance which constantly has jobs in the queue. Add more devices, more workers or a new instance. Invest in the infrastructure and value your admins - you are going to need good people to administer CI properly and keep all those LEDs flashing.

One commit to your source code should generate dozens of test jobs, building a mountain of data which will then tell you exactly which combinations work and which failed. This way, when you come to investigate a business-critical bug in a few months' time, you can refer to the data and *know* with certainty that the bug originates at a particular point within the development process. That gives you a known point from which you can develop a way to bisect the actual bug, or simply roll back to that point.
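
As a rough sketch of what "one change, one test job" can look like when generating the matrix - the component names, versions and metadata keys below are purely illustrative, not anything LAVA mandates:

import itertools

# Illustrative component lists - substitute your real bootloader/kernel/rootfs builds.
bootloaders = ["u-boot-2018.03", "u-boot-2018.03+fpga-fw-7"]
kernels = ["4.9.110", "4.14.52"]
rootfs_variants = ["minimal", "full"]

jobs = []
for bootloader, kernel, rootfs in itertools.product(bootloaders, kernels, rootfs_variants):
    jobs.append({
        "job_name": "bsp-%s-%s-%s" % (bootloader, kernel, rootfs),
        # Metadata is what later ties the related jobs back together in the results.
        "metadata": {
            "bootloader.version": bootloader,
            "kernel.version": kernel,
            "rootfs.variant": rootfs,
            "source.commit": "GIT_HASH_OF_THE_TRIGGERING_COMMIT",
        },
    })

print("%d single-change test jobs to generate and submit" % len(jobs))

Each generated entry differs from its neighbours by exactly one known change, which is what makes the later triage possible.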

There are no shortcuts here. Inventing yet another automated testing framework will simply miss out on all of the scalability features which LAVA has already (mostly) solved. It is a mistake to think that a single-developer model can do any useful CI - the data set and the matrix of test jobs arising from the possible permutations of all the different components make this impossible.

It is also a mistake to think that any one test operation can test more than one component. From what you describe, your codebase is sufficiently large that it is also a mistake to think that any one test operation can have even the remotest chance of testing even a single commit.

We've done all this work; we've run millions of test jobs across hundreds of devices for over 7 years now. What we put into the documentation and the best practice that we promote comes as a direct result of all of that data.

 

1. Test development is slow. Several members of my team avoid test development because the process of developing a test with LAVA is tedious and very time-consuming. I think it mostly boils down to making a change, pushing it to a Git repo, submitting a job, running the job and then watching the LAVA job output for results. In our environment this takes several minutes, just to verify a change in a test.
I'm aware of the guidelines for making portable tests and I personally think we can be a lot better at it for single-node tests, which could enable us to run test scripts on local devices, but we also have quite a number of MultiNode jobs that we find difficult to run in any environment other than LAVA.

MultiNode is complex, agreed. Separating the synchronisation calls from the test actions will help because then the test can be mocked up using two separate log files.
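
As a sketch of that separation - lava-send, lava-wait and lava-sync are the standard LAVA MultiNode helpers, but the wrapper and the iperf3 example are only an illustration of the pattern, not part of LAVA:

import shutil
import subprocess

def _run_helper(helper, *args):
    # When the LAVA MultiNode helpers are not on PATH (i.e. running locally),
    # just log what would have happened so the test actions can still be exercised.
    if shutil.which(helper) is None:
        print("[mock] %s %s" % (helper, " ".join(args)))
        return
    subprocess.run([helper] + list(args), check=True)

def send(message_id, **data):
    _run_helper("lava-send", message_id, *["%s=%s" % (k, v) for k, v in data.items()])

def wait(message_id):
    _run_helper("lava-wait", message_id)

def sync(message_id):
    _run_helper("lava-sync", message_id)

# Test actions live in plain functions with no knowledge of MultiNode,
# so they can be run and debugged on any device.
def run_iperf_client(server_ip):
    subprocess.run(["iperf3", "-c", server_ip, "-t", "30"], check=True)

if __name__ == "__main__":
    sync("server-ready")          # synchronisation, mocked outside LAVA
    run_iperf_client("10.0.0.1")  # test action, runs anywhere
    send("client-done", result="pass")

Keeping the synchronisation behind that thin layer is what lets you replay the test actions locally against two separate log files.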

The best solution here is to only use MultiNode when essential - it is very useful and an important part of LAVA but it is and always will be complex. Do as much testing as you can using singlenode, especially for things like testing whether the kernel which has just been built actually boots - that is NOT a MultiNode test job.

The matrix of test jobs needs to be expanded and this may well involve adding more devices. You need to start testing one element at a time, one change at a time, and submit test jobs based on the results of less complex test jobs. This can all be automated using the API and templating; it just happens overnight. You will need to use the Results API to collate the data and present a summary to the developers each morning.

LAVA is not a complete CI solution, we try to cover that in the documentation. You cannot look at LAVA as the entirety of your CI and you should not look at any single test job as a complete solution either. This is what we have learnt over the years of running automation at scale - start small, build slowly and test every permutation. That means lots of different test jobs, lots of metadata and lots of results. One component, one test job - changing only that one component and not trying to go on and use that (untested) component in further tests until it has been tested. Keep it simple, keep your data clean and avoid polluting your results with contamination from unrelated changes.

Test development can be slow, but you should use the power of templating (https://staging.validation.linaro.org/static/docs/v2/templating.html#template-best-practice) to automate the generation of enough test jobs to cover all of the work. Do not attempt to foist it all onto one test job, because that job can and will fail eventually, leaving you with large holes in your dataset.
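
A minimal sketch of that kind of templating, assuming Jinja2 on the submission side - the device-type name, URLs and metadata values are placeholders and the deploy fragment is abbreviated (a real job also needs boot and test actions):

from jinja2 import Template

JOB_TEMPLATE = Template("""\
device_type: {{ device_type }}
job_name: {{ job_name }}
priority: medium
visibility: public
timeouts:
  job:
    minutes: 30
metadata:
{% for key, value in metadata.items() %}  {{ key }}: {{ value }}
{% endfor %}
actions:
- deploy:
    to: tftp
    kernel:
      url: {{ kernel_url }}
    ramdisk:
      url: {{ ramdisk_url }}
""")

definition = JOB_TEMPLATE.render(
    device_type="my-soc-board",   # hypothetical device-type name
    job_name="kernel-boot-4.14.52-minimal",
    metadata={"kernel.version": "4.14.52", "rootfs.variant": "minimal"},
    kernel_url="https://builds.example.com/zImage",
    ramdisk_url="https://builds.example.com/ramdisk.cpio.gz",
)
print(definition)

One template plus a loop over the permutation matrix gives you all of the job definitions without hand-editing YAML.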

Do everything you can in singlenode before adding specific test definitions which truly need MultiNode for client:server type testing. Test the bootloader/firmware combination in singlenode. Test the kernel boot in singlenode. Test the infrastructure of the rootfs and the mechanisms for deploying the tools to support your client:server operations as separate singlenode test jobs. Build all those test definitions *separately* and then add MultiNode on top. Track all of this with metadata and modify your submission template support to only submit the complex test jobs if the simpler test jobs have passed.
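
A sketch of gating the MultiNode submission on a simpler job, assuming the scheduler.submit_job and scheduler.job_status XML-RPC calls available in 2018-era LAVA; the server URL, token and filenames are placeholders:

import time
import xmlrpc.client

# Token-authenticated XML-RPC endpoint - substitute your own instance and token.
server = xmlrpc.client.ServerProxy("https://USER:TOKEN@lava.example.com/RPC2")

def wait_for_job(job_id, poll_seconds=60):
    # Poll until the job reaches a final state.
    while True:
        status = server.scheduler.job_status(job_id)["job_status"]
        if status in ("Complete", "Incomplete", "Canceled"):
            return status
        time.sleep(poll_seconds)

with open("kernel-boot-singlenode.yaml") as f:
    singlenode_id = server.scheduler.submit_job(f.read())

if wait_for_job(singlenode_id) == "Complete":
    with open("client-server-multinode.yaml") as f:
        multinode_ids = server.scheduler.submit_job(f.read())
    print("MultiNode submitted:", multinode_ids)
else:
    print("Singlenode job %s failed; skipping the MultiNode permutations" % singlenode_id)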
 
We've also tried using hacking sessions to speed up development (e.g. you edit the tests on the DUT and then synchronize them back to your host once you're happy). This works quite well, but feels a bit hacky, and if the hacking session timeout hits, you lose all your work ;-)

Hacking sessions have their place. If you use ssh -A you could carry authentication with you into the hacking session to push changes to sites outside the hacking session, e.g. to a personal git repo for later tidying up before being put into a code review against your internal production code.
 

2. Can't test bootloaders.

That depends. The phrase is overloaded.

LAVA can run tests using bootloaders. LAVA can deploy bootloaders to *some* devices. Many devices used in LAVA lack the ability to recover automatically from a failed bootloader binary, but this is a limitation of the hardware, not LAVA.

 
Several of our hardware platforms contain FPGAs and the FPGA images / "firmware" are tightly bundled with the bootloader. In addition to configuring the FPGA, the bootloader also contains in-house developed code for firmware update that we would like to autotest. We have a _lot_ of bootloader variants and we need a way of testing them along with the Linux system.

As mentioned on IRC, this simply means a large matrix of test jobs where only one element is changed at any one time.

 
Our current setup is that we manually flash bootloaders in our LAVA lab and then cross our fingers that the Linux system we test on the device is compatible with the bootloader. The ideal situation for us would be to always test the Linux system and the matching bootloader together.

This depends entirely on whether the device can support automated bootloader deployment. I cannot answer that for you, but unless you have some way of attaching relays or some kind of BMC, the answer is probably no.

 
Granted, the better solution would be to move the FPGA loading away from the bootloader, but this is a design chosen by our SoC provider and we prefer to keep it.
We also manage an "LTS" branch of our Linux distro. We support it for several years and we need to ensure our test setup can test both our "master" branch and our LTS branch. With our current setup, this is not possible because all devices in our lab run a bootloader that was manually flashed at some arbitrary time.

As mentioned on IRC, the solution is twofold.

Sub-divide the devices: keep one pool in which the admins control the bootloader version, using a known working version which is upgraded infrequently, and create a second pool where the bootloader (and firmware, if required) is updated in every test job.

Again, this isn't a limitation imposed by LAVA; it is a limitation of doing any automated testing at scale. Every test job needs to always start in a predetermined state.

 
We've considered setting up several devices of the same hardware type, but with different bootloaders, and then letting LAVA treat them as different device types. This would work but our lab would fill up fast and the usage of each device would be low.

It is a necessary cost to be able to run any triage on the results.

 
We also tried making jobs that boot a golden Linux system, write the software under test (including bootloader), reboot and run tests. This did work, but required customization per device since the test has to know where to write the bootloader. We would rather put this information into the LAVA device type definition somehow.

That can be done. Integrating a new device-type is the hardest job in using LAVA but it is solvable. We will try to help as much as we can.

 

3. Can't write to NAND. Our devices are NAND-based and we use UBIFS on top of UBI. We have not found a way for LAVA to write to NAND because the LAVA mechanism that embeds stuff into the rootfs before deployment doesn't support UBIFS.

Please clarify. Nothing in LAVA knows or needs to know anything about UBIFS. If you can boot the device into RAM, then there is scope for ensuring that sufficient drivers are available to allow writing to the NAND.

 
At the moment, we ramboot our devices but we are now at a point where our devices OOM because they don't have enough RAM to hold both the rootfs and the running tests.

So only put enough test operations into the ramdisk to be able to write to other storage and then use that to do more work. See lava-target-storage.


 
Our current solution is to split the job into several jobs that each run a smaller number of tests, but this is less than ideal because it makes our test runs slower (we need to reboot) and it is a bit annoying that test results are spread across several jobs.

As mentioned on IRC, you are likely to need lots of separate test jobs to be able to test one thing at a time across all supported permutations. What that means is that your test job metadata needs to have enough information to tie those jobs together; then use the Results API to pull all the data into one place.
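
A sketch of that collation step, assuming the results.get_testjob_results_yaml XML-RPC call (LAVA V2) and assuming the job IDs and their metadata were recorded at submission time; the server URL, token, job IDs and metadata keys are placeholders:

import xmlrpc.client
import yaml

server = xmlrpc.client.ServerProxy("https://USER:TOKEN@lava.example.com/RPC2")

# job_id -> the metadata used when the job was generated and submitted.
submitted_jobs = {
    1001: {"kernel.version": "4.14.52", "rootfs.variant": "minimal"},
    1002: {"kernel.version": "4.14.52", "rootfs.variant": "full"},
}

for job_id, metadata in submitted_jobs.items():
    results = yaml.safe_load(server.results.get_testjob_results_yaml(job_id))
    failures = [r["name"] for r in results if r["result"] == "fail"]
    print(job_id, metadata, "failures:", failures or "none")

This is the kind of glue that usually ends up in a small frontend specific to your team, as mentioned below.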

LAVA is generic and only has minimal reporting infrastructure - as soon as anyone starts doing important triage and CI, a frontend is going to be needed to collate results from many test jobs into a useful format which is specific to your development team.


 
We have our own deployment tool which it would be nice to integrate into LAVA as a deployment method.

This is a common mistake. Avoid duplicating the support LAVA already has for TFTP and DFU or USB. There are good reasons why this has been split out into dedicated actions which can report specific errors and results. LAVA does scale and the reasons why it works relate, in part, to how the LAVA support is separated into discrete components which can be rigorously tested across multiple devices, instances and labs.

 
It accepts a kernel, rootfs, DT and bootloader and will write them using TFTP or DFU (USB) depending on the target. To avoid forking all of LAVA in order to implement such a deploy method, is there any plugin architecture that allows us to install additional boot methods alongside the LAVA packages?

See the help on integrating a new device-type into LAVA.


 

I'd love to get your views on these issues and whether there is a solution when using LAVA.

There are a lot of separate threads in this. Before you reply, consider starting some new threads on this list. Again, separate the task into discrete components - starting with the discussions on how to fix your issues.

 

Best regards, Magnus.




_______________________________________________
Lava-users mailing list
Lava-users@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lava-users


--

Neil Williams
=============
neil.williams@linaro.org
http://www.linux.codehelp.co.uk/