Hey,

Where I work, we've been using LAVA since 2014 to test our in-house Linux distribution. Our devices are typically "low-end" devices with 128 to 256 MB RAM and up to 512 MB NAND flash. We use LAVA to test our BSPs, e.g lots of connectivity/interface/IO tests. We also use it for stability-testing and performance testing (like measuring the application context switch time or Ethernet TX rates). 

For a while, there has been a growing concern within our team that LAVA might not be ideal for our testing needs. Right now, the team is discussing if we should drop LAVA and use something else. There is even talks about developing our own test framework. I personally like the idea behind LAVA but also agree that it has been a bumpy road these past 4 years. Due to various bugs and missing features, we've several times been forced to upgrade to an unstable version of LAVA just to get normal operations working. Two times we've lost the entire test database because we were unable to recover from a LAVA upgrade. In those cases, it was easier for us to just "start over".  Today we use LAVA 2018.02. I've compiled a list that summarize the most pressing issues we've experience with LAVA:

1. Test development is slow. Several members of my team avoids test development because the process of developing a test with LAVA is tedious / very timeconsuming. I think it mostly boils down to making a change, pushing it to a Git repo, submitting a job, running the job and then watching the LAVA job output for result. This will in our environment take several minutes, just to verify a change in a test. 
I'm aware of the guidelines for making portable tests and I personally think we can be a lot better at it for single-node tests which could enable us to run testscripts on local devices, but we have also quite a number of multinode jobs that we find are difficult to run in any other environment than LAVA. We've also tried using hacksessions to speed up development (e.g you edit the tests on the DUT and then synchronize it back to your host once you're happy). This works quite well, but feels a bit hacky and if the hacksession timeout hits, you lose all your work ;-)

2. Can't test bootloaders. Several of our hardware contain FPGAs and the FPGA images / "firmware" is tightly bundled with the bootloader. In addition to configuring the FPGA, the bootloader also contains in-house developed code for firmware update that we would like to autotest. We have a _lot_ of bootloader variants and we need a way of testing it along with the Linux system. Our current setup is that we manually flash bootloaders in our LAVA lab and then cross our fingers that the Linux system we test on the device is compatible with the bootloader. The ideal situation for us would be to always test the Linux system and the matching bootloader together. Granted, the better solution would be to move away the FPGA loading from the bootloader, but this is a design chosen by our SoC provider and we prefer to keep it.
We also manage a "LTS" branch of our Linux distro. We support it for several years and we need to ensure our test setup can test both our "master" branch and our LTS branch. With our current setup, this is not possible because all devices in our lab runs a bootloader that was manually flashed at some arbitrary time.
We've considered setting up several hardware of the same type, but with different bootloaders and then let LAVA treat them as different device types. This would work but our lab would fill up fast and the usage of each device would be low.
We also tried making jobs that boot a golden Linux system, write the software under test (including bootloader), reboot and run tests. This did work, but required customization per device since the test has to know where to write the bootloader. We would rather put this information into the LAVA device type definition somehow.

3. Can't write to NAND. Our devices are NAND-based and we use UBIFS ontop of UBI. We have not found a way for LAVA to write NAND because the LAVA mechanism that embeds stuff into the rootfs before deployment doesn't support UBIFS. At the moment, we ramboot our devices but we are now at a point where our devices OOM because they don't have enough RAM to fit both the rootfs and running the tests. Our current solution is to split the job into several jobs that each run a smaller amount of tests, but this is less than ideal because it makes our test run slower (we need to reboot) and it is a bit annoying that test results are spread across several jobs.
We have our own deployment tool that would be nice to integrate into LAVA as a deployment method. It accepts a kernel, rootfs, DT and bootloader and will write it using TFTP or DFU (USB) depending on target. To avoid forking all of LAVA in order to implement such deploy method, is there any plugin architecture that allows us to install additional boot methods alongside the LAVA packages?

I'd love to get your views on these issues and if there is a solution when using LAVA. 

Best regards, Magnus.