Hi Neil,
Thanks for your prompt answer.
On Fri, Apr 20, 2018 at 07:56:29AM +0100, Neil Williams wrote:
On 19 April 2018 at 20:11, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi all,
I've encountered a deadlock in my LAVA server, in the following scenario.
I have an at91rm9200ek in my lab that got submitted a lot of multi-node jobs requesting another "board" (a laptop of type dummy-ssh).
All of my other boards in the lab have received the same kind of multi-node jobs, requesting that same, single laptop.
That is the source of the resource starvation - multiple requirements of a single device. The scheduler needs to be greedy and grab whatever suitable devices it can as soon as it can to be able to run MultiNode. The primary ordering of scheduling is the Test Job ID which is determined at submission.
Why would you order test jobs without knowing whether the boards they depend on will be available when they are scheduled? What am I missing?
If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the starved resource, there is not much LAVA can do currently. We are looking at a way to reschedule MultiNode test jobs but it is very complex and low priority.
What version of lava-server and lava-dispatcher are you running?
lava-dispatcher 2018.2.post3-1+jessie
lava-server 2018.2-1+jessie
lava-coordinator 0.1.7-1
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization level but also at the lab structure / administrative level.
I have two machines. One acting as LAVA server and one as LAVA slave. The LAVA slave is handling all boards in our lab.
I have one laptop (an actual x86 laptop whose NIC driver we know works reliably at high (~1 Gbps) speeds) that we use for MultiNode jobs (each requesting only the laptop and one board at a time) to test networking. This laptop is seen as a board by LAVA; there is nothing LAVA-related installed on it (it should simply be treated as a device).
I had to take the at91rm9200ek out of the lab because it was misbehaving.
However, LAVA is still scheduling multi-node jobs on the laptop that require the at91rm9200ek as the other part of the job, even though its status is clearly Maintenance.
A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.
Why is that? The device is explicitly in Maintenance, which IMHO means the board shouldn't be used.
Once a test job has been submitted, it will be either scheduled or cancelled.
Yes, that's understood and it makes sense to me. However, for "normal" jobs, if no board of device type X is available, the job does not get scheduled, right? Why can't we do the same for MultiNode jobs?
Now, until I put the at91rm9200ek back in the lab, all my boards are reserved for scheduled multi-node jobs and thus my lab is basically dead.
The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.
e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
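If I understand the suggestion correctly, each MultiNode job would then contain something along these lines (a rough sketch only, assuming the lava-multinode protocol still accepts a per-role tags list; the tag name is illustrative):

protocols:
  lava-multinode:
    roles:
      board:
        device_type: at91rm9200ek
        count: 1
      laptop:
        device_type: dummy-ssh
        count: 1
        # assumption: only the dummy-ssh device carrying this device-tag
        # can be assigned to this role
        tags:
        - at91rm9200ek
    timeout:
      minutes: 10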
I strongly disagree with your statement. A software problem can often be dealt with by adding more resources but I'm not willing to spend thousands on something that can be fixed on the software side.
Aside from the non-negligible financial and time effort (to buy, set up and maintain a board with a stable and reliable NIC for each and every board in my lab), it just isn't our use case.
If I did such a thing, my own network would become the bottleneck of my network tests, and I'd have to spend a lot (in money and in setup/maintenance time) on a top-notch network infrastructure for tests I don't mind running one after the other. I also can't have a separate network for each and every board, simply because my boards often have a single Ethernet port, so I can't separate the test network from the lab network used, e.g., for downloading images during the boot process. Hence I can't do reliable network testing even by multiplying "laptop" devices.
I can understand it's not your typical use case at Linaro, where you have dozens and dozens of the same board, a huge infrastructure to handle the whole LAVA lab, and maybe people working full-time on LAVA, the lab, the boards and the infrastructure. But that's the complete opposite of our use case.
Maybe you can understand ours: we have only one board of each device type, we are part of KernelCI to test and report kernel boot status, and we run occasional custom tests (like network) on upstream or custom branches/repositories. We work on the lab sporadically, fixing the issues we see with the boards, but that's not what we do for a living.
We also work actively on the kernel, so we take boards (of which we own only one each) out of the lab to work on them, then put them back once we've finished. That is when we put them in Maintenance mode since, IMHO, Retired does not cover this use case. Such "maintenance" can last seconds, days or months.
To me, you're ignoring an issue that is almost nonexistent in your case because you've dealt with it by adding as many resources as you could, making the probability of it happening close to zero. That does not mean it does not exist. I'm not criticizing that way of dealing with it; I'm just saying it isn't a path we can take ourselves.
Let me know if I can be of any help debugging this or testing a possible fix. I'd have a look at the scheduler myself, but you obviously know the code base far better than I do and might have a quick patch at hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem with the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
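If I understand correctly, that would mean dropping the "laptop" role entirely and asking for a container inside the same (single-node) job, roughly like this (a sketch assuming the usual lava-lxc protocol keys; the container name and distribution are made up), with the deploy/boot/test actions then targeting either the container or the DUT through namespaces:

protocols:
  lava-lxc:
    name: lxc-network-test   # arbitrary container name
    template: debian
    distribution: debian
    release: stretch
    arch: amd64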
We're testing that Gigabit NICs can actually handle Gbps transfers. We need a fully available Gbps NIC for each and every test we run to make the results reliable and consistent.
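Concretely, the DUT side of our MultiNode test boils down to something like the following (sketched from memory using the MultiNode test shell helpers; the message IDs, the cache file path and the iperf3 options are illustrative). The "laptop" role runs the matching server-side definition (start iperf3 -s, then lava-send server-ready ipaddr=...):

metadata:
  format: Lava-Test Test Definition 1.0
  name: iperf3-client
  description: "Measure TCP throughput against the laptop running an iperf3 server"
run:
  steps:
  # wait until the laptop role announces that its iperf3 server is up and sends its IP
  - lava-wait server-ready
  # received key=value pairs end up in the MultiNode cache file (path quoted from memory)
  - export SERVER_IP=$(grep '^ipaddr=' /tmp/lava_multi_node_cache.txt | cut -d= -f2)
  - iperf3 -c "$SERVER_IP" -t 30
  # let the laptop role know it can stop its server
  - lava-sync finished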
What you are describing sounds like a misuse of MultiNode resulting in resource starvation and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags or relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Neither of those options is possible in our use case.
I understand that the MultiNode scheduler is complex and that this is low priority. We've modestly contributed to LAVA before; we're not telling you to fix it ASAP, but rather asking you to help or guide us in fixing this issue in a way that could be accepted into the upstream version of LAVA.
If you still stand firmly against a patch, or if it would require a lengthy, complete rework of the scheduler, could we at least have a way to tell how long a test has been scheduled (or how long a board has been reserved for a scheduled test)? That way we could use an external tool to monitor this and manually cancel jobs when needed. Currently, I don't think there is a way to tell since when a job has been in the Scheduled state.
If I have misunderstood, misstated or said something wrong, I'm happy to be told,
Best regards, Quentin