Hi all,
I've encountered a deadlock in my LAVA server with the following scheme.
I have an at91rm9200ek in my lab that got submitted a lot of multi-node jobs requesting another "board" (a laptop of type dummy-ssh). All of my other boards in the lab have received the same multi-node jobs requesting the same and only laptop.
I had to take the at91rm9200ek out of the lab because it was misbehaving.
However, LAVA is still scheduling multi-node jobs on the laptop which require the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.
Now, until I put the at91rm9200ek back in the lab, all my boards are reserved and scheduling for a multi-node job and thus, my lab is basically dead.
Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
Best regards,
Quentin
On 19 April 2018 at 20:11, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi all,
I've encountered a deadlock in my LAVA server with the following scheme.
I have an at91rm9200ek in my lab that got submitted a lot of multi-node
jobs requesting another "board" (a laptop of type dummy-ssh). All of my other boards in the lab have received the same multi-node jobs requesting the same and only laptop.
That is the source of the resource starvation - multiple requirements of a single device. The scheduler needs to be greedy and grab whatever suitable devices it can as soon as it can to be able to run MultiNode. The primary ordering of scheduling is the Test Job ID which is determined at submission.
If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the starved resource, there is not much LAVA can do currently. We are looking at a way to reschedule MultiNode test jobs but it is very complex and low priority.
What version of lava-server and lava-dispatcher are you running?
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization level but also at the lab structure / administrative level.
I had to take the at91rm9200ek out of the lab because it was misbehaving.
However, LAVA is still scheduling multi-node jobs on the laptop which require the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.
A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.
Once a test job has been submitted, it will be either scheduled or cancelled.
Now, until I put the at91rm9200ek back in the lab, all my boards are
reserved and scheduling for a multi-node job and thus, my lab is basically dead.
The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.
e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.
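The sizing rule in this example can be sketched with a little illustrative Python (the device-type names and per-type counts follow the example above; this is just the arithmetic, not LAVA code):

```python
# One tagged QEMU is needed per device of every other device-type that can
# appear alongside a QEMU in a MultiNode job. "panda": 2 models the case
# where a second panda has been added to the lab.
other_devices = {"phone": 1, "hikey": 1, "panda": 2}

# Total QEMU devices required: the sum over all other device-types.
qemu_needed = sum(other_devices.values())

# Tag each QEMU after the device-type (and index, if several) it serves.
qemu_tags = [
    f"{dtype}-{i}" if count > 1 else dtype
    for dtype, count in other_devices.items()
    for i in range(count)
]

print(qemu_needed)  # 4
print(qemu_tags)    # ['phone', 'hikey', 'panda-0', 'panda-1']
```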
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem with the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
What you are describing sounds like a misuse of MultiNode resulting in resource starvation and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags or relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Best regards,
Quentin
Lava-users mailing list Lava-users@lists.linaro.org https://lists.linaro.org/mailman/listinfo/lava-users
On 20 April 2018 at 07:56, Neil Williams neil.williams@linaro.org wrote:
On 19 April 2018 at 20:11, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi all,
I've encountered a deadlock in my LAVA server with the following scheme.
I have an at91rm9200ek in my lab that got submitted a lot of multi-node
jobs requesting another "board" (a laptop of type dummy-ssh). All of my other boards in the lab have received the same multi-node jobs requesting the same and only laptop.
That is the source of the resource starvation - multiple requirements of a single device. The scheduler needs to be greedy and grab whatever suitable devices it can as soon as it can to be able to run MultiNode. The primary ordering of scheduling is the Test Job ID which is determined at submission.
If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the starved resource, there is not much LAVA can do currently. We are looking at a way to reschedule MultiNode test jobs but it is very complex and low priority.
What version of lava-server and lava-dispatcher are you running?
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization level but also at the lab structure / administrative level.
I had to take the at91rm9200ek out of the lab because it was misbehaving.
However, LAVA is still scheduling multi-node jobs on the laptop which require the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.
A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.
Once a test job has been submitted, it will be either scheduled or cancelled.
There is something we can improve here though - the current UI describes Health Bad and Health Maintenance as "no submissions possible" when what it should say is "no test jobs can be scheduled". The difference is important...
https://projects.linaro.org/browse/LAVA-1299
Now, until I put the at91rm9200ek back in the lab, all my boards are
reserved and scheduling for a multi-node job and thus, my lab is basically dead.
The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.
e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem with the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
What you are describing sounds like a misuse of MultiNode resulting in resource starvation and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags or relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Best regards,
Quentin
--
Neil Williams
neil.williams@linaro.org http://www.linux.codehelp.co.uk/
Hi Neil,
I'll take a bit of time to answer your first mail later.
On Fri, Apr 20, 2018 at 08:22:01AM +0100, Neil Williams wrote:
On 20 April 2018 at 07:56, Neil Williams neil.williams@linaro.org wrote:
On 19 April 2018 at 20:11, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi all,
I've encountered a deadlock in my LAVA server with the following scheme.
I have an at91rm9200ek in my lab that got submitted a lot of multi-node
jobs requesting another "board" (a laptop of type dummy-ssh). All of my other boards in the lab have received the same multi-node jobs requesting the same and only laptop.
That is the source of the resource starvation - multiple requirements of a single device. The scheduler needs to be greedy and grab whatever suitable devices it can as soon as it can to be able to run MultiNode. The primary ordering of scheduling is the Test Job ID which is determined at submission.
If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the starved resource, there is not much LAVA can do currently. We are looking at a way to reschedule MultiNode test jobs but it is very complex and low priority.
What version of lava-server and lava-dispatcher are you running?
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization level but also at the lab structure / administrative level.
I had to take the at91rm9200ek out of the lab because it was misbehaving.
However, LAVA is still scheduling multi-node jobs on the laptop which require the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.
A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.
Once a test job has been submitted, it will be either scheduled or cancelled.
There is something we can improve here though - the current UI describes Health Bad and Health Maintenance as "no submissions possible" when what it should say is "no test jobs can be scheduled". The difference is important...
This isn't the behaviour I've experienced with LAVA though. LAVA indeed schedules jobs even though a device's health is Maintenance, but it's impossible to submit jobs to a device type that does not have a single device available, so what's described in the UI is correct, isn't it?
Quentin
Now, until I put the at91rm9200ek back in the lab, all my boards are
reserved and scheduling for a multi-node job and thus, my lab is basically dead.
The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.
e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem with the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
What you are describing sounds like a misuse of MultiNode resulting in resource starvation and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags or relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Best regards,
Quentin
Hi Neil,
Thanks for your prompt answer.
On Fri, Apr 20, 2018 at 07:56:29AM +0100, Neil Williams wrote:
On 19 April 2018 at 20:11, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi all,
I've encountered a deadlock in my LAVA server with the following scheme.
I have an at91rm9200ek in my lab that got submitted a lot of multi-node
jobs requesting another "board" (a laptop of type dummy-ssh). All of my other boards in the lab have received the same multi-node jobs requesting the same and only laptop.
That is the source of the resource starvation - multiple requirements of a single device. The scheduler needs to be greedy and grab whatever suitable devices it can as soon as it can to be able to run MultiNode. The primary ordering of scheduling is the Test Job ID which is determined at submission.
Why would you order test jobs without knowing if the boards they depend on are available when they're going to be scheduled? What am I missing?
If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the starved resource, there is not much LAVA can do currently. We are looking at a way to reschedule MultiNode test jobs but it is very complex and low priority.
What version of lava-server and lava-dispatcher are you running?
lava-dispatcher 2018.2.post3-1+jessie
lava-server 2018.2-1+jessie
lava-coordinator 0.1.7-1
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization level but also at the lab structure / administrative level.
I have two machines. One acting as LAVA server and one as LAVA slave. The LAVA slave is handling all boards in our lab.
I have one laptop (an actual x86 laptop for which we know the NIC driver works reliably at high (~1Gbps) speeds) that we use for MultiNode jobs (actually requesting the laptop and one board at the same time only) to test networking. This laptop is seen as a board by LAVA; there is nothing LAVA-related on it (it should be seen as a device).
I had to take the at91rm9200ek out of the lab because it was misbehaving.
However, LAVA is still scheduling multi-node jobs on the laptop which require the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.
A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.
Why is that? The device is explicitly in Maintenance, which IMHO indicates that the board shouldn't be used.
Once a test job has been submitted, it will be either scheduled or cancelled.
Yes, that's understood and that makes sense to me. However, for "normal" jobs, if you can't find a board of device type X that is available, it does not get scheduled, right? Why can't we do the same for MultiNode jobs?
Now, until I put the at91rm9200ek back in the lab, all my boards are
reserved and scheduling for a multi-node job and thus, my lab is basically dead.
The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.
e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
I strongly disagree with your statement. A software problem can often be dealt with by adding more resources but I'm not willing to spend thousands on something that can be fixed on the software side.
Aside from the non-negligible financial and time (to set up and maintain) effort of buying a board with a stable and reliable NIC for each and every board in my lab, it just isn't our use case.
If I did such a thing, my network would become the bottleneck for my network tests, and I'd have to spend a lot (financially and on time or maintenance) to have a top-notch network infrastructure for tests that I don't mind running one after the other. I can't have a separate network for each and every board either, simply because my boards often have a single Ethernet port; thus I can't separate the test network from the lab network used, e.g., for downloading images as part of the booting process, and hence I can't do reliable network testing even by multiplying "laptop" devices.
I can understand it's not your typical use case at Linaro, where you have dozens and dozens of the same board, a huge infrastructure to handle the whole LAVA lab, and maybe people working full-time on LAVA, the lab, the boards and the infrastructure. But that's the complete opposite of our use case.
Maybe you can understand ours: we have only one board of each device type, we are part of KernelCI to test and report kernel boot status, and we run occasional custom tests (like network) on upstream or custom branches/repositories. We sporadically work on the lab, fixing the issues we're seeing with the boards, but that's not what we do for a living.
We also work actively on the kernel and thus we take boards (of which we own only one each) out of the lab to work on them, and then put them back into the lab once we've finished. This is where we put the board in Maintenance mode as, IMHO, Retired does not cover this use case. This "Maintenance" can take seconds, days or months.
For me, you're ignoring an issue that is almost nonexistent in your case because you've dealt with it by adding as many resources as you could to make the probability of it happening close to zero. That does not mean it does not exist. I'm not criticizing this way of dealing with it, I'm just saying it isn't a path we can take ourselves.
Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem with the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
We're testing that Gigabit NICs can actually handle Gbps transfers. We need a fully available Gbps NIC for each and every test we do to make the results reliable and consistent.
What you are describing sounds like a misuse of MultiNode resulting in resource starvation and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags or relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Neither of those options is possible in our use case.
I understand the MultiNode scheduler is complex and low priority. We've modestly contributed to LAVA before; we're not telling you to fix it ASAP but rather asking you to help or guide us to fix this issue in a way that could be accepted in the upstream version of LAVA.
If you still stand strongly against a patch, or if it requires a lengthy complete rework of the scheduler, could we at least have a way to tell for how long a test has been scheduled (or for how long a board has been reserved for a test that is scheduled)? That way we can use an external tool to monitor this and manually cancel jobs when needed. Currently, I don't think there is a way to tell since when a job has been scheduled.
If I have misunderstood, misstated or said something wrong, I'm happy to be told,
Best regards,
Quentin
On 23 April 2018 at 11:21, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
Thanks for your prompt answer.
On Fri, Apr 20, 2018 at 07:56:29AM +0100, Neil Williams wrote:
On 19 April 2018 at 20:11, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi all,
I've encountered a deadlock in my LAVA server with the following scheme.
I have an at91rm9200ek in my lab that got submitted a lot of multi-node jobs requesting another "board" (a laptop of type dummy-ssh). All of my other boards in the lab have received the same multi-node jobs requesting the same and only laptop.
That is the source of the resource starvation - multiple requirements of a single device. The scheduler needs to be greedy and grab whatever suitable devices it can as soon as it can to be able to run MultiNode. The primary ordering of scheduling is the Test Job ID which is determined at submission.
Why would you order test jobs without knowing if the boards they depend on are available when they're going to be scheduled? What am I missing?
To avoid the situation where a MultiNode job is constantly waiting for all devices to be available at exactly the same time. Instances frequently have long queues of submitted test jobs, a mix of single node and MultiNode. The MultiNode jobs must be able to grab whatever device is available, in order of submit time, and then wait for the other part to be available. Otherwise, all devices would run all single node test jobs in the entire queue before any MultiNode test jobs could start. Many instances constantly have a queue of single node test jobs.
If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the starved resource, there is not much LAVA can do currently. We are looking at a way to reschedule MultiNode test jobs but it is very complex and low priority.
What version of lava-server and lava-dispatcher are you running?
lava-dispatcher 2018.2.post3-1+jessie
lava-server 2018.2-1+jessie
lava-coordinator 0.1.7-1
(You need to upgrade to Stretch - there will be no fixes or upgrades available for Jessie. All development work must only happen on Stretch. See the lava-announce mailing list archive.)
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization level but also at the lab structure / administrative level.
I have two machines. One acting as LAVA server and one as LAVA slave. The LAVA slave is handling all boards in our lab.
I have one laptop (an actual x86 laptop for which we know the NIC driver works reliably at high (~1Gbps) speeds) that we use for MultiNode jobs (actually requesting the laptop and one board at the same time only) to test networking. This laptop is seen as a board by LAVA; there is nothing LAVA-related on it (it should be seen as a device).
Then you need more LAVA devices to replicate the role played by the laptop. Exactly one device for each MultiNode test job which can be submitted at any one time. Then use device tags to allocate one of the "laptop" devices to each of the other boards involved in the MultiNode test jobs.
Alternatively, you need to manage both the submissions and the device availability.
Think of just the roles played by the devices.
There are N client role devices (not in Retired state) and there are X server role devices where the server role is what the laptop is currently doing.
You need to have N == X to solve the imbalance in the queue.
If N > 1 (and there is more than one device-type in the count 'N') then you also need to use device tags so that each device-type has a dedicated pool of server role devices, where the number of devices in the server role pool exactly matches the number of devices of the device-type using the specified device tag.
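As a rough sketch of this structure in a job definition (field names follow the LAVA V2 MultiNode protocol as I understand it and may differ between versions; the role names and the tag are invented for illustration), a test job pairing the at91rm9200ek with a tagged "laptop" could declare:

```yaml
protocols:
  lava-multinode:
    roles:
      board:
        device_type: at91rm9200ek
        count: 1
      laptop:
        device_type: dummy-ssh
        count: 1
        # only a dummy-ssh device carrying this tag can be chosen,
        # so other device-types cannot starve this pool
        tags:
        - at91rm9200ek
```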
I had to take the at91rm9200ek out of the lab because it was misbehaving.
However, LAVA is still scheduling multi-node jobs on the laptop which require the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.
A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.
Why is that? The device is explicitly in Maintenance, which IMHO indicates that the board shouldn't be used.
Not for the scheduler - the scheduler can still accept submissions and queue them up until the device comes out of Maintenance.
This prevents test jobs being rejected during certain kinds of maintenance. (Wider scale maintenance would involve taking down the UI on the master at which point submissions would get a 404 but that is up to admins to schedule and announce etc.)
This is about uptime for busy instances which frequently get batches of submissions out of operations like cron. The available devices quickly get swamped, but the queue needs to continue accepting jobs until admins decide that devices which need work are going to be unavailable for long enough that the resulting queue would be impractical, i.e. when the length of time to wait for a job exceeds the useful window of its results, or when the number of test jobs in the queue exceeds the ability of the available devices to keep on top of it and avoid ever-increasing queues.
Once a test job has been submitted, it will be either scheduled or cancelled.
Yes, that's understood and that makes sense to me. However, for "normal" jobs, if you can't find a board of device type X that is available, it does not get scheduled, right? Why can't we do the same for MultiNode jobs?
Because the MultiNode job will never find all devices in the right state at the same time once there is a mixed queue of single node and MultiNode jobs.
All devices defined in the MultiNode test job must be available at exactly the same time. Once there are single node jobs in the queue, that never happens.
A is running
B is Idle
MultiNode submitted for A & B
single node submitted for A
single node submitted for B
scheduler considers the queue - MultiNode cannot start (A is busy), so move on and start the single node job on B (because the single node test job on B may actually complete before the job on A finishes, so it is inefficient to keep B idle when it could be doing useful stuff for another user).
A is running
B is running
no actions
A completes and goes to Idle
B is still running
and so the pattern continues for as long as there are any single node test jobs for either A or B in the queue.
The MultiNode test job never starts because A and B are never Idle at the same time until the queue is completely empty (which *never* happens in many instances).
So the scheduler must grab B while it is Idle to prevent the single node test job starting. Then when A completes, the scheduler must also grab A before that single node test job starts running.
A is running
B is Idle
MultiNode submitted for A & B
single node submitted for A
single node submitted for B
B is transitioned to Scheduling and is unavailable for the single node test job.
A is running
B is scheduling
no actions
A completes and goes to Idle
B is scheduling
Scheduler transitions A into scheduling - that test job can now start.
(Now consider MultiNode test jobs covering a dozen devices in an instance with a hundred mixed devices and permanent queues of single node test jobs.)
The scheduler also needs to be very fast, so the actual decisions need to be made on quite simple criteria - specifically, without going back to the database to find out about what else might be in the queue or trying to second-guess when test jobs might end.
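The greedy reservation described above can be modelled in a few lines. This is a toy sketch, not the real LAVA scheduler code: each pass walks the queue in submission order and moves any Idle device wanted by a MultiNode job into "Scheduling", even when the rest of its group is still busy, so single node jobs cannot steal it.

```python
# Toy model of the greedy MultiNode reservation described above.
# States and job shapes are illustrative only.

IDLE, RUNNING, SCHEDULING = "Idle", "Running", "Scheduling"

def schedule_pass(devices, queue):
    """devices: {name: state}. queue: jobs in submission order, each a dict
    with 'devices' (names it needs) and a 'multinode' flag.
    Mutates both and returns the jobs started this pass."""
    started = []
    for job in list(queue):
        wanted = job["devices"]
        if job["multinode"]:
            # Greedy grab: reserve every wanted device that is Idle.
            for d in wanted:
                if devices[d] == IDLE:
                    devices[d] = SCHEDULING
            # Start only once the whole group is reserved.
            if all(devices[d] == SCHEDULING for d in wanted):
                for d in wanted:
                    devices[d] = RUNNING
                started.append(job)
                queue.remove(job)
        else:
            # Single node jobs may not take a reserved (Scheduling) device.
            if all(devices[d] == IDLE for d in wanted):
                for d in wanted:
                    devices[d] = RUNNING
                started.append(job)
                queue.remove(job)
    return started
```

Running the A/B walkthrough through this model: on the first pass the MultiNode job reserves B (A is busy) and both single node jobs are blocked; once A completes, the next pass reserves A and the MultiNode group starts.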
Now, until I put the at91rm9200ek back in the lab, all my boards are
reserved and scheduling for a multi-node job and thus, my lab is basically dead.
The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.
e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
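As a sketch of what that looks like in a job definition, a hikey-qemu MultiNode job could pin its server role to the QEMU device carrying the "hikey" tag via the lava-multinode protocol block. The role names and tag name below are hypothetical; check the exact syntax against your LAVA version's MultiNode documentation.

```yaml
# Hypothetical MultiNode job fragment: the "host" role must be scheduled
# on a qemu device that carries the "hikey" device tag.
protocols:
  lava-multinode:
    roles:
      target:
        device_type: hikey
        count: 1
      host:
        device_type: qemu
        count: 1
        tags:
        - hikey
```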
I strongly disagree with your statement. A software problem can often be dealt with by adding more resources but I'm not willing to spend thousands on something that can be fixed on the software side.
We've been through these loops within the team for many years and have millions of test jobs which demonstrate the problems and the fix. I'm afraid you are misunderstanding the problem if you think that there is a software solution for a queue containing both MultiNode and single node test jobs - other than the solution we now use in the LAVA scheduler. The process has been tried and tested over 8 years and millions of test jobs across dozens of mixed use case instances and has proven to be the most efficient use of resources across all those models.
Each test job in a MultiNode test is considered separately - if one or more devices are Idle, then those are immediately put into Scheduling. Only when all are in Scheduling can any of those jobs start. The status of other test jobs in the MultiNode group can only be handled at the point when at least one test job in that MultiNode group is in Scheduling.
Aside from a non-negligible financial and time effort (to set up and maintain) to buy a board with a stable and reliable NIC for each and every board in my lab, it just isn't our use case.
If I did such a thing, then my network would be the bottleneck of my network tests and I'd have to spend a lot (financially and on time or maintenance) to have a top-notch network infrastructure for tests I don't mind running one after the other. I can't have a separate network for each and every board either, simply because my boards often have a single Ethernet port, so I can't separate the test network from the lab network used for, e.g., image downloads that are part of the booting process. Hence I can't do reliable network testing even by multiplying "laptop" devices.
I can understand it's not your typical use case at Linaro and you've dozens and dozens of the same board and a huge infrastructure to handle the whole LAVA lab and maybe people working full-time on LAVA, the lab, the boards, the infrastructure. But that's the complete opposite of our use case.
Maybe you can understand ours where we have only one board of each device type, being part of KernelCI to test and report kernel booting status and having occasional custom tests (like network) on upstream or custom branches/repositories. We sporadically work on the lab, fixing the issues we're seeing with the boards but that's not what we do for a living.
I do understand and I personally run a lab in much the same way. However, the code needs to work the same way in that lab as it does in the larger labs. It is the local configuration and resource availability which must change to suit.
For now, the best thing is to put devices into Retired so that submissions are rejected and then you will also have to manage your submissions and your queue.
We're looking at what the Maintenance state means for MultiNode in https://projects.linaro.org/browse/LAVA-1299 but it is not acceptable to refuse submissions when devices are not Retired. Users have an expectation that devices which are being fixed will come back online at some point - or will go into retired. There is also https://projects.linaro.org/browse/LAVA-595 but that work has not yet been scoped. It could be a long time before that work starts and will take months of work once it does start.
The problem is a structural one in the physical resources available in your local lab. It is a problem we have faced more than once in our own instances and we have gone down all the various routes until we've come to the current implementation.
We also work actively on the kernel and thus, we take boards (which we own only once) out of the lab to work on it and then put it into the lab once we've finished working. This is where we put it in Maintenance mode as, IMHO, Retired does not cover this use case. This "Maintenance" can take seconds, days or months.
For me, you're ignoring an issue that is almost nonexistent in your case
It is an issue which has had months of investigation, discussion and intervention in our use cases. We have spent a very long time going through all of the permutations.
because you've dealt with it by adding as much resource as you could to make the probability to happen to be close to zero. That does not mean it does not exist. I'm not criticizing the way to deal with it, I'm just saying this way isn't a path we can take personally.
Then you need to manage the queue on your instance in ways that allow for your situation.
Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem with the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
We're testing Gigabit NICs can actually handle Gbps transfers. We need a fully available Gbps NIC for each and every test we do to make the results reliable and consistent.
Then as that resource is limited, you must create a way that only one test job of this kind can ever actually run at a time. That can be done by working at the stage prior to submission or it can be done by changing the device availability such that the submission is rejected. Critically, there must also be a way to prevent jobs entering the queue if one of the device-types is not available. That can be easily determined using the XML-RPC API prior to submission. Once submitted, LAVA must attempt to run the test job as quickly as possible, under the expectation that devices which have not been Retired will become available again within a reasonable amount of time. If that is not the case then those devices should be Retired. (Devices can be brought out of Retired as easily as going in, it doesn't have to be a permanent state, nothing is actually deleted from the device configuration.)
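A pre-submission gate of that kind might look like the sketch below. The URL and token are placeholders, and the exact fields returned by `scheduler.all_devices()` should be checked against your LAVA version; the filtering helpers are my own illustrative names.

```python
# Sketch of a pre-submission availability gate against the LAVA XML-RPC
# API. Assumes all_devices() rows start with hostname, device-type, state.

import xmlrpc.client

def usable_devices(rows, device_type):
    """Filter all_devices() rows down to devices of the wanted type that
    are neither Retired nor in Maintenance."""
    return [r[0] for r in rows
            if r[1] == device_type and r[2] not in ("Retired", "Maintenance")]

def can_submit(rows, device_types):
    # Only submit the MultiNode job if every role has a usable device.
    return all(usable_devices(rows, dt) for dt in device_types)

if __name__ == "__main__":
    # Placeholder endpoint and credentials.
    server = xmlrpc.client.ServerProxy(
        "https://user:token@lava.example.com/RPC2")
    rows = server.scheduler.all_devices()
    if can_submit(rows, ["at91rm9200ek", "dummy-ssh"]):
        print("safe to submit the MultiNode job")
```

With this in front of the submitter, a MultiNode job for the at91rm9200ek would simply not enter the queue while that board sits in Maintenance.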
What you are describing sounds like a misuse of MultiNode resulting in resource starvation and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags or relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Neither of those options are possible in our use case.
I understand the MultiNode scheduler is complex and low priority. We've modestly contributed to LAVA before, we're not telling you to fix it ASAP but rather to help or guide us to fix this issue in a way it could be accepted in the upstream version of LAVA.
If you still stand strong against a patch, or if it's a lengthy complete rework of the scheduler, could we at least have a way to tell for how long a test has been scheduled (or for how long a board has been reserved for a test that is scheduled)?
That data is already available in the current UI and over the XML-RPC API and REST API.
Check for Queued Jobs and the scheduling state, also the job_details call in XML-RPC. There are a variety of ways of getting the information you require using the existing APIs - which one you use will depend on your preference and current scripting.
Rather than polling on XML-RPC, it would be better for a monitoring process to use ZMQ and the publisher events to get push notifications of change of state. That lowers the load on the master, depending on how busy the instance actually becomes.
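A minimal subscriber for those publisher events might look like the sketch below. It assumes pyzmq and a publisher at a placeholder endpoint; the five-part message layout (topic, uuid, datetime, username, JSON payload) matches recent LAVA releases but should be verified against your version.

```python
# Sketch of a ZMQ listener for lava-publisher state-change events.

import json

def parse_event(parts):
    """Decode one multipart publisher message into a (topic, data) pair.
    Assumes the payload JSON is the fifth frame."""
    topic = parts[0].decode("utf-8")
    data = json.loads(parts[4].decode("utf-8"))
    return topic, data

if __name__ == "__main__":
    import zmq  # pyzmq, only needed for the live listener
    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.setsockopt_string(zmq.SUBSCRIBE, "")    # all topics
    sub.connect("tcp://lava.example.com:5500")  # placeholder endpoint
    while True:
        topic, data = parse_event(sub.recv_multipart())
        # React to test jobs entering the Scheduling state.
        if topic.endswith(".testjob") and data.get("state") == "Scheduling":
            print(data.get("job"), "entered Scheduling")
```

A monitor like this can record the timestamp at which each job entered Scheduling and cancel jobs that have been stuck for too long.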
That way we can use an external tool to monitor this and manually cancel them when needed. Currently, I don't think there is a way to tell since when the job was scheduled.
Every test job has a database object of submit_time created at the point where the job is created upon submission.
If I have misunderstood, misstated or said something wrong, I'm happy to be told,
Best regards, Quentin
Hi Neil,
I think there was a global misunderstanding from a poor choice of words. When I was saying "available device" I meant a device that isn't in Maintenance or Retired. If the device is idle, running a job, scheduled to run a job, etc., I consider it "available". Sorry for the misunderstanding.
On Mon, Apr 23, 2018 at 12:54:03PM +0100, Neil Williams wrote:
On 23 April 2018 at 11:21, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
Thanks for your prompt answer.
On Fri, Apr 20, 2018 at 07:56:29AM +0100, Neil Williams wrote:
On 19 April 2018 at 20:11, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi all,
I've encountered a deadlock in my LAVA server with the following scheme.
I have an at91rm9200ek in my lab that got submitted a lot of multi-node jobs requesting an other "board" (a laptop of type dummy-ssh). All of my other boards in the lab have received the same multi-node jobs requesting the same and only laptop.
That is the source of the resource starvation - multiple requirements of a single device. The scheduler needs to be greedy and grab whatever suitable devices it can as soon as it can to be able to run MultiNode. The primary ordering of scheduling is the Test Job ID which is determined at submission.
Why would you order test jobs without knowing if the boards it depends on are available when it's going to be scheduled? What am I missing?
To avoid the situation where a MultiNode job is constantly waiting for all devices to be available at exactly the same time. Instances frequently have long queues of submitted test jobs, a mix of single node and MultiNode. The MultiNode jobs must be able to grab whatever device is available, in order of submit time, and then wait for the other part to be available. Otherwise, all devices would run all single node test jobs in the entire queue before any MultiNode test jobs could start. Many instances constantly have a queue of single node test jobs.
That's understood and expected.
If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the starved resource, there is not much LAVA can do currently. We are looking at a way to reschedule MultiNode test jobs but it is very complex and low priority.
What version of lava-server and lava-dispatcher are you running?
lava-dispatcher 2018.2.post3-1+jessie lava-server 2018.2-1+jessie lava-coordinator 0.1.7-1
(You need to upgrade to Stretch - there will be no fixes or upgrades available for Jessie. All development work must only happen on Stretch. See the lava-announce mailing list archive.)
Thanks, we'll have a look into this.
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization level but also at the lab structure / administrative level.
I have two machines. One acting as LAVA server and one as LAVA slave. The LAVA slave is handling all boards in our lab.
I have one laptop (an actual x86 laptop for which we know the NIC driver works reliably at high (~1Gbps) speeds) that we use for MultiNode jobs (actually requesting the laptop and one board at the same time only) to test network. This laptop is seen as a board by LAVA, there is nothing LAVA-related on this board (it should be seen as a device).
Then you need more LAVA devices to replicate the role played by the laptop. Exactly one device for each MultiNode test job which can be submitted at any one time. Then use device tags to allocate one of the "laptop" devices to each of the other boards involved in the MultiNode test jobs.
Alternatively, you need to manage both the submissions and the device availability.
Think of just the roles played by the devices.
There are N client role devices (not in Retired state) and there are X server role devices where the server role is what the laptop is currently doing.
You need to have N == X to solve the imbalance in the queue.
If N > 1 (and there are more than one device-type in the count 'N') then you also need to use device tags so that each device-type has a dedicated pool of server role devices where the number of devices in the server role pool exactly matches the number of devices of the device-type using the specified device tag.
I had to take the at91rm9200ek out of the lab because it was misbehaving.
However, LAVA is still scheduling multi-node jobs on the laptop which requires the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.
A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.
Why is that? The device is explicitly in Maintenance, which IMHO means the board shouldn't be used.
Not for the scheduler - the scheduler can still accept submissions and queue them up until the device comes out of Maintenance.
This prevents test jobs being rejected during certain kinds of maintenance. (Wider scale maintenance would involve taking down the UI on the master at which point submissions would get a 404 but that is up to admins to schedule and announce etc.)
This is about uptime for busy instances which frequently get batches of submissions out of operations like cron. The available devices quickly get swamped but the queue needs to continue accepting jobs until admins decide that devices which need work are going to be unavailable for long enough that the resulting queue would be impractical. i.e. when the length of time to wait for the job exceeds the useful window of the results from the job or when the number of test jobs in the queue exceeds the ability of the available devices to keep on top of the queue and avoid ever increasing queues.
I guess that's an implementation choice but I'd have guessed the scheduler was first looping over idle devices to then schedule the oldest job in the queue for this device type.
But my understanding is that the scheduler instead fixes an order when jobs are submitted, an order that cannot be tampered with. Is that correct?
Once a test job has been submitted, it will be either scheduled or cancelled.
Yes, that's understood and that makes sense to me. However, for "normal" jobs, if you can't find a board of device type X that is available, it does not get scheduled, right? Why can't we do the same for MultiNode jobs?
Because the MultiNode job will never find all devices in the right state at the same time once there is a mixed queue of single node and MultiNode jobs.
All devices defined in the MultiNode test job must be available at exactly the same time. Once there are single node jobs in the queue, that never happens.
A is running
B is Idle
MultiNode submitted for A & B
single node submitted for A
single node submitted for B
scheduler considers the queue - MultiNode cannot start (A is busy), so move on and start the single node job on B (because the single node test job on B may actually complete before the job on A finishes, so it is inefficient to keep B idle when it could be doing useful stuff for another user).
A is running
B is running
no actions
A completes and goes to Idle
B is still running
and so the pattern continues for as long as there are any single node test jobs for either A or B in the queue.
The MultiNode test job never starts because A and B are never Idle at the same time until the queue is completely empty (which *never* happens in many instances).
So the scheduler must grab B while it is Idle to prevent the single node test job starting. Then when A completes, the scheduler must also grab A before that single node test job starts running.
A is running
B is Idle
MultiNode submitted for A & B
single node submitted for A
single node submitted for B
B is transitioned to Scheduling and is unavailable for the single node test job.
A is running
B is scheduling
no actions
A completes and goes to Idle
B is scheduling
Scheduler transitions A into scheduling - that test job can now start.
(Now consider MultiNode test jobs covering a dozen devices in an instance with a hundred mixed devices and permanent queues of single node test jobs.)
The scheduler also needs to be very fast, so the actual decisions need to be made on quite simple criteria - specifically, without going back to the database to find out about what else might be in the queue or trying to second-guess when test jobs might end.
That is understood as well for devices that are idle, running or scheduled to run. The point I was trying to make was why schedule a job for a device that is in Maintenance (what I meant by the poorly chosen "available" word).
Is that because once the job is submitted it's ordered by the scheduler and then run in the given order, and the jobs are not filtered on the device's Maintenance status?
Now, until I put the at91rm9200ek back in the lab, all my boards are
reserved and scheduling for a multi-node job and thus, my lab is basically dead.
The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.
e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
I strongly disagree with your statement. A software problem can often be dealt with by adding more resources but I'm not willing to spend thousands on something that can be fixed on the software side.
We've been through these loops within the team for many years and have millions of test jobs which demonstrate the problems and the fix. I'm afraid you are misunderstanding the problem if you think that there is a software solution for a queue containing both MultiNode and single node test jobs - other than the solution we now use in the LAVA scheduler. The process has been tried and tested over 8 years and millions of test jobs across dozens of mixed use case instances and has proven to be the most efficient use of resources across all those models.
Each test job in a MultiNode test is considered separately - if one or more devices are Idle, then those are immediately put into Scheduling. Only when all are in Scheduling can any of those jobs start. The status of other test jobs in the MultiNode group can only be handled at the point when at least one test job in that MultiNode group is in Scheduling.
I think there is a global misunderstanding due to my bad choice of words. I understand and I'm convinced there are no other ways to deal with what you explained above.
Aside from a non-negligible financial and time effort (to set up and maintain) to buy a board with a stable and reliable NIC for each and every board in my lab, it just isn't our use case.
If I did such a thing, then my network would be the bottleneck of my network tests and I'd have to spend a lot (financially and on time or maintenance) to have a top-notch network infrastructure for tests I don't mind running one after the other. I can't have a separate network for each and every board either, simply because my boards often have a single Ethernet port, so I can't separate the test network from the lab network used for, e.g., image downloads that are part of the booting process. Hence I can't do reliable network testing even by multiplying "laptop" devices.
I can understand it's not your typical use case at Linaro and you've dozens and dozens of the same board and a huge infrastructure to handle the whole LAVA lab and maybe people working full-time on LAVA, the lab, the boards, the infrastructure. But that's the complete opposite of our use case.
Maybe you can understand ours where we have only one board of each device type, being part of KernelCI to test and report kernel booting status and having occasional custom tests (like network) on upstream or custom branches/repositories. We sporadically work on the lab, fixing the issues we're seeing with the boards but that's not what we do for a living.
I do understand and I personally run a lab in much the same way. However, the code needs to work the same way in that lab as it does in the larger labs. It is the local configuration and resource availability which must change to suit.
Of course.
For now, the best thing is to put devices into Retired so that submissions are rejected and then you will also have to manage your submissions and your queue.
Can't we have a "Maintenance but I don't know when it's coming back so please still submit jobs but do not schedule them" option :D ?
We're looking at what the Maintenance state means for MultiNode in https://projects.linaro.org/browse/LAVA-1299 but it is not acceptable to refuse submissions when devices are not Retired. Users have an expectation that devices which are being fixed will come back online at some point - or will go into retired. There is also
I agree.
https://projects.linaro.org/browse/LAVA-595 but that work has not yet been scoped. It could be a long time before that work starts and will take months of work once it does start.
The problem is a structural one in the physical resources available in your local lab. It is a problem we have faced more than once in our own instances and we have gone down all the various routes until we've come to the current implementation.
We also work actively on the kernel and thus, we take boards (which we own only once) out of the lab to work on it and then put it into the lab once we've finished working. This is where we put it in Maintenance mode as, IMHO, Retired does not cover this use case. This "Maintenance" can take seconds, days or months.
For me, you're ignoring an issue that is almost nonexistent in your case
It is an issue which has had months of investigation, discussion and intervention in our use cases. We have spent a very long time going through all of the permutations.
I understand the scheduler is a critical part of the software that had your attention for a long time and appropriate testing, no doubt.
because you've dealt with it by adding as much resource as you could to make the probability to happen to be close to zero. That does not mean it does not exist. I'm not criticizing the way to deal with it, I'm just saying this way isn't a path we can take personally.
Then you need to manage the queue on your instance in ways that allow for your situation.
Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem with the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
We're testing Gigabit NICs can actually handle Gbps transfers. We need a fully available Gbps NIC for each and every test we do to make the results reliable and consistent.
Then as that resource is limited, you must create a way that only one test job of this kind can ever actually run at a time. That can be done by working at the stage prior to submission or it can be done by changing the device availability such that the submission is rejected. Critically, there must also be a way to prevent jobs entering the queue if one of the device-types is not available. That can be easily determined using the XML-RPC API prior to submission. Once submitted, LAVA must attempt to run the test job as quickly as possible, under the expectation that devices which have not been Retired will become available again within a reasonable amount of time. If that is not the case then those devices should be Retired. (Devices can be brought out of Retired as easily as going in, it doesn't have to be a permanent state, nothing is actually deleted from the device configuration.)
That's a "lot" of complexity to deal with on our side but that's indeed a way to do it. I'd have to make sure only one device has a MultiNode job in queue and monitor it to send the next one.
Hum... I'm just wondering. What about a device that was submitted a MultiNode job but got Retired since then?
Now I'm wondering what's the difference between Retired and Maintenance, except that Retired does not accept job submissions?
What you are describing sounds like a misuse of MultiNode resulting in resource starvation and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags or relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Neither of those options are possible in our use case.
I understand the MultiNode scheduler is complex and low priority. We've modestly contributed to LAVA before, we're not telling you to fix it ASAP but rather to help or guide us to fix this issue in a way it could be accepted in the upstream version of LAVA.
If you still stand strong against a patch, or if it's a lengthy complete rework of the scheduler, could we at least have a way to tell for how long a test has been scheduled (or for how long a board has been reserved for a test that is scheduled)?
That data is already available in the current UI and over the XML-RPC API and REST API.
Check for Queued Jobs and the scheduling state, also the job_details call in XML-RPC. There are a variety of ways of getting the information you require using the existing APIs - which one you use will depend on your preference and current scripting.
Rather than polling on XML-RPC, it would be better for a monitoring process to use ZMQ and the publisher events to get push notifications of change of state. That lowers the load on the master, depending on how busy the instance actually becomes.
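For the push-based route, a monitor subscribes to the instance's event publisher. The endpoint below is a placeholder, and the exact frame layout of LAVA events should be checked against your version; only the payload parsing is shown as runnable code, with the pyzmq receive loop left as a commented sketch:

```python
# Parse a LAVA publisher event. Assumption: frames arrive as
# [topic, uuid, datetime, username, json-payload].
import json


def parse_job_event(frames):
    """Return (job id, new state) for testjob events, else None."""
    topic, _uuid, _timestamp, _username, payload = frames
    if not topic.endswith(".testjob"):
        return None
    data = json.loads(payload)
    return data.get("job"), data.get("state")


# Receive loop (needs the third-party pyzmq package and the real
# event socket address from your master's configuration):
# import zmq
# sub = zmq.Context().socket(zmq.SUB)
# sub.connect("tcp://lava.example.com:5500")  # placeholder endpoint
# sub.setsockopt_string(zmq.SUBSCRIBE, "")
# while True:
#     frames = [f.decode() for f in sub.recv_multipart()]
#     event = parse_job_event(frames)
#     if event and event[1] == "Scheduled":
#         print("job %s just entered Scheduled" % event[0])
```

Recording the timestamp of each "Scheduled" transition gives exactly the "how long has this board been reserved" figure asked for above, without polling the master.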
That way we can use an external tool to monitor this and manually cancel them when needed. Currently, I don't think there is a way to tell since when the job was scheduled.
Every test job has a database object of submit_time created at the point where the job is created upon submission.
Submit_time isn't really an option if it does what its name suggests, because I can have jobs in the queue for days (multiple network tests for each and every board, and also time-consuming tests (e.g. crypto) that have the same priority).
I'll have a look at what you've offered above, thanks.
Thanks for having taken the time to answer my question,
Quentin
On 24 April 2018 at 09:08, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
I think there was a global misunderstanding from a poor choice of words. When I said "available device" I meant a device that isn't in Maintenance or Retired. If the device is idle, running a job, scheduled to run a job, etc., I consider it "available". Sorry for the misunderstanding.
Currently, "Maintenance" is considered as available for submission but not for scheduling. This is to support maximum uptime and minimal disruption to CI loops for temporary work happening on devices.
We are looking at clarifying this soon.
I know this has been a long and complex thread. Thank you for sticking with the discussion, despite the complexity and terminology.
On Mon, Apr 23, 2018 at 12:54:03PM +0100, Neil Williams wrote:
On 23 April 2018 at 11:21, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
Thanks for your prompt answer.
On Fri, Apr 20, 2018 at 07:56:29AM +0100, Neil Williams wrote:
On 19 April 2018 at 20:11, Quentin Schulz <quentin.schulz@bootlin.com> wrote:
Hi all,
I've encountered a deadlock in my LAVA server with the following scheme.
I have an at91rm9200ek in my lab that got submitted a lot of multi-node jobs requesting another "board" (a laptop of type dummy-ssh). All of my other boards in the lab have received the same multi-node jobs requesting the same and only laptop.
That is the source of the resource starvation: multiple requirements of a single device. The scheduler needs to be greedy and grab whatever suitable devices it can, as soon as it can, to be able to run MultiNode. The primary ordering of scheduling is the Test Job ID, which is determined at submission.
Why would you order test jobs without knowing whether the boards they depend on will be available when they are scheduled? What am I missing?
To avoid the situation where a MultiNode job is constantly waiting for all devices to be available at exactly the same time. Instances frequently have long queues of submitted test jobs, a mix of single node and MultiNode. The MultiNode jobs must be able to grab whatever device is available, in order of submit time, and then wait for the other part to be available. Otherwise, all devices would run all single node test jobs in the entire queue before any MultiNode test jobs could start. Many instances constantly have a queue of single node test jobs.
That's understood and expected.
If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the starved resource, there is not much LAVA can do currently. We are looking at a way to reschedule MultiNode test jobs but it is very complex and low priority.
What version of lava-server and lava-dispatcher are you running?
lava-dispatcher 2018.2.post3-1+jessie
lava-server 2018.2-1+jessie
lava-coordinator 0.1.7-1
(You need to upgrade to Stretch - there will be no fixes or upgrades available for Jessie. All development work must only happen on Stretch. See the lava-announce mailing list archive.)
Thanks, we'll have a look into this.
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization level but also at the lab structure / administrative level.
I have two machines. One acting as LAVA server and one as LAVA slave. The LAVA slave is handling all boards in our lab.
I have one laptop (an actual x86 laptop for which we know the NIC driver works reliably at high (~1Gbps) speeds) that we use for MultiNode jobs (actually requesting only the laptop and one board at the same time) to test the network. This laptop is seen as a board by LAVA; there is nothing LAVA-related on it (it should be seen as a device).
Then you need more LAVA devices to replicate the role played by the laptop. Exactly one device for each MultiNode test job which can be submitted at any one time. Then use device tags to allocate one of the "laptop" devices to each of the other boards involved in the MultiNode test jobs.
Alternatively, you need to manage both the submissions and the device availability.
Think of just the roles played by the devices.
There are N client role devices (not in Retired state) and there are X server role devices, where the server role is what the laptop is currently doing. You need to have N == X to solve the imbalance in the queue.
If N > 1 (and there is more than one device-type in the count 'N') then you also need to use device tags so that each device-type has a dedicated pool of server role devices, where the number of devices in the server role pool exactly matches the number of devices of the device-type using the specified device tag.
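As a concrete illustration of that pairing, a MultiNode job definition can pin its server role to a tagged "laptop". This is a hypothetical sketch of the LAVA v2 multinode role syntax; the tag name and counts are invented for the example and should be checked against your instance's documentation:

```yaml
# Hypothetical MultiNode roles: one client board plus a "laptop"
# carrying a device tag dedicated to that client's device-type.
protocols:
  lava-multinode:
    roles:
      client:
        device_type: at91rm9200ek
        count: 1
      server:
        device_type: laptop
        tags:
        - at91rm9200ek-server   # invented tag name
        count: 1
```

With one tagged laptop per client device-type, two jobs for different boards can never contend for the same laptop.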
I had to take the at91rm9200ek out of the lab because it was misbehaving. However, LAVA is still scheduling multi-node jobs on the laptop which require the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.
A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.
Why is that? The device is explicitly in Maintenance, which IMHO indicates that the board shouldn't be used.
Not for the scheduler - the scheduler can still accept submissions and queue them up until the device comes out of Maintenance.
This prevents test jobs being rejected during certain kinds of maintenance. (Wider scale maintenance would involve taking down the UI on the master, at which point submissions would get a 404, but that is up to admins to schedule and announce etc.)
This is about uptime for busy instances which frequently get batches of submissions out of operations like cron. The available devices quickly get swamped, but the queue needs to continue accepting jobs until admins decide that devices which need work are going to be unavailable for long enough that the resulting queue would be impractical - i.e. when the length of time to wait for the job exceeds the useful window of the results from the job, or when the number of test jobs in the queue exceeds the ability of the available devices to keep on top of the queue and avoid ever increasing queues.
I guess that's an implementation choice, but I'd have guessed the scheduler first loops over idle devices and then schedules the oldest job in the queue for each device type.
But my understanding is that the scheduler instead sets an order when jobs are submitted that cannot be tampered with. Is that correct?
The order is priority, submit_time and then target_group.
Changing ordering on-the-fly and backing out from certain states is the subject of https://projects.linaro.org/browse/LAVA-595 - that is the work I've already described as low priority, large scale and complex.
You do have the ability to set the Priority of new test jobs for submission which want to use the laptop in a MultiNode test job along with a device which is NOT the at91rm9200ek. You will need to cancel the test job involving the at91rm9200ek which is currently scheduled. (Other test jobs for the at91rm9200ek in the Queue can be left alone provided that these test jobs have a lower Priority than the jobs you want to run on other devices.)
When the scheduler comes back around, it will find a new test job with higher Priority which wants to use the laptop with a hikey device or whatever and the at91rm9200ek will be ignored. It's not perfect because you would then need to either keep that Priority pipe full or cancel the submitted test jobs for the at91rm9200ek.
Once a test job has been submitted, it will be either scheduled or cancelled.
Yes, that's understood and that makes sense to me. However, for "normal" jobs, if you can't find a board of device type X that is available, the job does not get scheduled, right? Why can't we do the same for MultiNode jobs?
Because the MultiNode job will never find all devices in the right state at the same time once there is a mixed queue of single node and MultiNode jobs. All devices defined in the MultiNode test job must be available at exactly the same time. Once there are single node jobs in the queue, that never happens.
A is running, B is Idle.
MultiNode submitted for A & B.
Single node submitted for A.
Single node submitted for B.
The scheduler considers the queue - the MultiNode job cannot start (A is busy), so it moves on and starts the single node job on B (because the single node test job on B may actually complete before the job on A finishes, so it is inefficient to keep B idle when it could be doing useful stuff for another user).
A is running, B is running - no actions.
A completes and goes to Idle; B is still running.
And so the pattern continues for as long as there are any single node test jobs for either A or B in the queue.
The MultiNode test job never starts because A and B are never Idle at the same time until the queue is completely empty (which *never* happens in many instances).
So the scheduler must grab B while it is Idle to prevent the single node test job starting. Then when A completes, the scheduler must also grab A before that single node test job starts running.
A is running, B is Idle.
MultiNode submitted for A & B.
Single node submitted for A.
Single node submitted for B.
B is transitioned to Scheduling and is unavailable for the single node test job.
A is running, B is scheduling - no actions.
A completes and goes to Idle; B is scheduling. The scheduler transitions A into scheduling - that test job can now start.
(Now consider MultiNode test jobs covering a dozen devices in an instance with a hundred mixed devices and permanent queues of single node test jobs.)
The scheduler also needs to be very fast, so the actual decisions need to be made on quite simple criteria - specifically, without going back to the database to find out about what else might be in the queue or trying to second-guess when test jobs might end.
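The two walkthroughs above can be condensed into a toy model. This is purely illustrative Python, not LAVA's scheduler code; the device and job structures are invented for the sketch:

```python
# Toy model of the greedy rule: each device is considered on its own,
# and an Idle device wanted by the oldest job in the queue is reserved
# ("Scheduling") immediately, even while its MultiNode partner is busy.

def schedule_step(devices, queue):
    """One scheduler pass over a queue ordered by submit time."""
    for job in queue:
        for wanted in job["devices"]:
            if devices.get(wanted) == "Idle":
                devices[wanted] = "Scheduling"
                job["held"].add(wanted)
    # a job starts only once it holds every device it asked for
    started = [j for j in queue if j["held"] == set(j["devices"])]
    for job in started:
        for dev in job["devices"]:
            devices[dev] = "Running"
        queue.remove(job)
    return started


devices = {"A": "Running", "B": "Idle"}
queue = [
    {"id": 1, "devices": ["A", "B"], "held": set()},  # MultiNode, oldest
    {"id": 2, "devices": ["A"], "held": set()},       # single node
    {"id": 3, "devices": ["B"], "held": set()},       # single node
]

# First pass: B is grabbed for the MultiNode job, so the single node
# job for B (id 3) cannot jump the queue.
schedule_step(devices, queue)
assert devices["B"] == "Scheduling"

# A's current job completes; the next pass grabs A and the MultiNode
# job starts before either single node job.
devices["A"] = "Idle"
started = schedule_step(devices, queue)
assert [job["id"] for job in started] == [1]
```

Remove the reservation step (leave B Idle until both are free) and job 3 would win B on the first pass, reproducing the starvation described above.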
That is understood as well for devices that are idle, running or scheduled to run. The point I was trying to make was: why schedule a job for a device that is in Maintenance (what I meant by the poorly chosen word "available")?
Is that because, once a job is submitted, it's ordered by the scheduler and then run in that order, with the jobs not being discriminated against based on the device's Maintenance status?
The problem you have is not that the device is in Maintenance but that the other device(s) in the MultiNode test job are Idle. Therefore, those jobs get scheduled because there is no reason not to do so.
If the submission was to be rejected when all devices of the requested device-type are in Maintenance, that is a large change which would negatively impact a lot of busy instances.
We do need to clarify these states but essentially, Maintenance is a manual state change which has the same effect as the automatic state change to Bad. That is as far as it goes currently.
Now, until I put the at91rm9200ek back in the lab, all my boards are reserved and scheduling for a multi-node job and thus, my lab is basically dead.
The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously, and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.
e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
I strongly disagree with your statement. A software problem can often be dealt with by adding more resources, but I'm not willing to spend thousands on something that can be fixed on the software side.
We've been through these loops within the team for many years and have millions of test jobs which demonstrate the problems and the fix. I'm afraid you are misunderstanding the problem if you think that there is a software solution for a queue containing both MultiNode and single node test jobs - other than the solution we now use in the LAVA scheduler. The process has been tried and tested over 8 years and millions of test jobs across dozens of mixed use case instances and has proven to be the most efficient use of resources across all those models.
Each test job in a MultiNode test is considered separately - if one or more devices are Idle, then those are immediately put into Scheduling. Only when all are in Scheduling can any of those jobs start. The status of other test jobs in the MultiNode group can only be handled at the point when at least one test job in that MultiNode group is in Scheduling.
I think there is a global misunderstanding due to my bad choice of words. I understand and I'm convinced there are no other ways to deal with what you explained above.
Aside from a non-negligible financial and time (to set up and maintain) effort to buy a board with a stable and reliable NIC for each and every board in my lab, it just isn't our use case.
If I were to do such a thing, then my network would be the bottleneck for my network tests and I'd have to spend a lot (financially and on maintenance time) to have a top notch network infrastructure for tests I don't mind running one after the other. I can't have a separate network for each and every board either, simply because my boards often have a single Ethernet port; thus I can't separate the test network from the lab network used for, e.g., downloading images as part of the booting process. Hence I can't do reliable network testing even by multiplying "laptop" devices.
I can understand it's not your typical use case at Linaro, where you have dozens and dozens of the same board, a huge infrastructure to handle the whole LAVA lab and maybe people working full-time on LAVA, the lab, the boards and the infrastructure. But that's the complete opposite of our use case.
Maybe you can understand ours where we have only one board of each device type, being part of KernelCI to test and report kernel booting status and having occasional custom tests (like network) on upstream or custom branches/repositories. We sporadically work on the lab, fixing the issues we're seeing with the boards but that's not what we do for a living.
I do understand and I personally run a lab in much the same way. However, the code needs to work the same way in that lab as it does in the larger labs. It is the local configuration and resource availability which must change to suit.
Of course.
For now, the best thing is to put devices into Retired so that submissions are rejected, and then you will also have to manage your submissions and your queue.
Can't we have a "Maintenance but I don't know when it's coming back so please still submit jobs but do not schedule them" option :D ?
This is exactly what you already have - but it's not actually what you mean.
The problem is that you're thinking of the state of the at91rm9200ek when what matters to the scheduler is the state of the laptop device *in isolation*. The state of the at91rm9200ek only matters AFTER the laptop has been assigned.
What you mean is:
"I don't know when ANY of the devices of device-type A are going to be ready to start a test job, so do not allow ANY OTHER device of ANY OTHER device-type to be scheduled either IF that device-type is listed in a MultiNode test job which ALSO requires a device of this device-type".
(The reason it's ANY is that if 10 test jobs are submitted all wanting at91rm9200ek and laptop, then if you had 10 laptops and 1 at91rm9200ek, those 10 laptops would also go into Scheduled - that is the imbalance we talked about previously.)
It is a cross-relational issue.
Correlating across the MultiNode test job at the scheduling state is likely to have a massive impact on the speed of the scheduler because:
0: a device in Maintenance or Bad is NOT currently available to be scheduled - that's the whole point.
1: the other device(s) in the MultiNode group DO get scheduled because those were in Idle.
2: asking the scheduler to check the state of all devices of all device-types mentioned in a MultiNode job when considering whether to schedule any other device in that same MultiNode job is going to make the scheduler SLOW.
So what we do is let the Idle (laptop) device go into a waiting state and let the scheduler move the at91rm9200ek device into scheduling *only* when a at91rm9200ek device becomes available for scheduling - as two completely separate decisions. Then, the relational work is done by the database when the lava-master makes the query "which test jobs are available to start NOW". This is much more efficient because we are looking at jobs where all devices in the target_group are in state SCHEDULED. The database can easily exclude test jobs which are in state SUBMITTED (the at91rm9200ek jobs) and a simple check on target_group shows that the MultiNode test job is not ready to start. That can all be done with a single database query using select_related and other ORM niceties.
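The division of labour described above (per-device decisions first, then one relational query) can be sketched as follows. The dict layout stands in for the TestJob table and is an assumption; in LAVA itself this is a Django ORM query, not plain Python:

```python
# Sketch: find MultiNode target_groups where every member job is
# SCHEDULED. A group with any SUBMITTED member (e.g. the at91rm9200ek
# half) is excluded without inspecting job definitions at all.
from collections import defaultdict


def startable_groups(jobs):
    """Return the target_groups whose jobs are all in state SCHEDULED."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job["target_group"]].append(job["state"])
    return [group for group, states in groups.items()
            if all(state == "SCHEDULED" for state in states)]


jobs = [
    {"target_group": "g1", "state": "SCHEDULED"},  # laptop half, waiting
    {"target_group": "g1", "state": "SUBMITTED"},  # at91rm9200ek half
    {"target_group": "g2", "state": "SCHEDULED"},
    {"target_group": "g2", "state": "SCHEDULED"},
]
assert startable_groups(jobs) == ["g2"]
```

The point of the design is that this grouping is a single aggregate pass (one database query), so the scheduler never has to correlate device states while making its per-device decisions.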
Let's describe this with roles:
role: client
device-type: at91rm9200ek

role: server
device-type: laptop
If the at91rm9200ek is in Maintenance and there are no other devices of device-type at91rm9200ek in state Idle, then nothing will get scheduled for at91rm9200ek.
However, when a MultiNode test job is submitted for 1 at91rm9200ek and 1 or more "laptop" device(s), then there is no reason to stop scheduling the laptop device in state Idle without scrabbling through the test job definition and working out (again and again, every time the scheduler loops through the queue) which device-types are requested, which devices of those types are available and what to do next.
The problem is NOT the state of the at91rm9200ek - Maintenance or Bad, it makes no difference. The problem for the SCHEDULER is that the laptop device is Idle and requested by a test job with a sufficiently low submit_time (and high enough Priority) that it is first in the queue.
The problem at the SUBMISSION stage is that the only decision available is whether to allow the test job asking for at91rm9200ek & laptop onto the queue or whether to refuse it outright. Currently, a refusal is only implemented if all devices of at least one device-type specified in the test job are in Retired state.
After many, many rounds of testing, test jobs, discussions going on over several years we came to the decision that in your situation - where there is a dire shortage of a resource used by multiple MultiNode test jobs, that the only thing that was safe for the SCHEDULER to do was to allow the Idle device to be scheduled and let the test job wait for resources to become available, either by moving the other device out of Maintenance or providing extra hardware for the Idle device.
We're looking at what the Maintenance state means for MultiNode in https://projects.linaro.org/browse/LAVA-1299 but it is not acceptable to refuse submissions when devices are not Retired. Users have an expectation that devices which are being fixed will come back online at some point - or will go into Retired.
I agree.
There is also https://projects.linaro.org/browse/LAVA-595 but that work has not yet been scoped. It could be a long time before that work starts and it will take months of work once it does start.
The problem is a structural one in the physical resources available in your local lab. It is a problem we have faced more than once in our own instances and we have gone down all the various routes until we've come to the current implementation.
We also work actively on the kernel and thus we take boards (of which we own only one each) out of the lab to work on them, and then put them back into the lab once we've finished working. This is when we put them in Maintenance mode as, IMHO, Retired does not cover this use case. This "Maintenance" can take seconds, days or months.
For me, you're ignoring an issue that is almost nonexistent in your case because you've dealt with it by adding as much resource as you could to make the probability of it happening close to zero. That does not mean it does not exist. I'm not criticizing this way of dealing with it, I'm just saying it isn't a path we can take personally.
It is an issue which has had months of investigation, discussion and intervention in our use cases. We have spent a very long time going through all of the permutations.
I understand the scheduler is a critical part of the software that has had your attention for a long time and appropriate testing, no doubt.
Then you need to manage the queue on your instance in ways that allow for your situation.
Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC, which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? An LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem for the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
We're testing that Gigabit NICs can actually handle Gbps transfers. We need a fully available Gbps NIC for each and every test we do to make the results reliable and consistent.
Then as that resource is limited, you must create a way that only one test job of this kind can ever actually run at a time. That can be done by working at the stage prior to submission, or it can be done by changing the device availability such that the submission is rejected. Critically, there must also be a way to prevent jobs entering the queue if one of the device-types is not available. That can easily be determined using the XML-RPC API prior to submission.
That's a "lot" of complexity to deal with on our side, but that's indeed a way to do it. I'd have to make sure only one device has a MultiNode job in the queue and monitor it to send the next one.
Once submitted, LAVA must attempt to run the test job as quickly as possible, under the expectation that devices which have not been Retired will become available again within a reasonable amount of time. If that is not the case then those devices should be Retired. (Devices can be brought out of Retired as easily as going in; it doesn't have to be a permanent state, and nothing is actually deleted from the device configuration.)
Hmm... I'm just wondering: what about a device that was submitted a MultiNode job but got Retired since then?
Well spotted - that is left to the admins to manage. There is an outstanding story to cancel all test jobs submitted, scheduled or running for devices which are transitioned into Retired.
Now I'm wondering what the difference is between Retired and Maintenance, other than that Retired does not accept job submissions?
The difference only shows if ALL devices of the device-type are in the specified state.
Retired - no submissions allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are currently unchanged if the state changes to Retired.

Maintenance - submissions are allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are unchanged if the state changes to Maintenance.
So the only distinction between Retired and Maintenance at the moment is submissions.
On 24 April 2018 at 10:10, Neil Williams neil.williams@linaro.org wrote:
On 24 April 2018 at 09:08, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
I think there was a global misunderstanding from a poor choice of words. When I said "available device" I meant a device that isn't in Maintenance or Retired. If the device is idle, running a job, scheduled to run a job, etc., I consider it "available". Sorry for the misunderstanding.
Currently, "Maintenance" is considered as available for submission & for scheduling. This is to support maximum uptime and minimal disruption to CI loops for temporary work happening on devices.
Correction: "Maintenance" is considered as available for submission but not for scheduling. This is to support maximum uptime and minimal disruption to CI loops for temporary work happening on devices.
We are looking at clarifying this soon.
I know this has been a long and complex thread. Thank you for sticking with the discussion, despite the complexity and terminology.
So complex, I sometimes typo it myself.
On Mon, Apr 23, 2018 at 12:54:03PM +0100, Neil Williams wrote:
On 23 April 2018 at 11:21, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
Thanks for your prompt answer.
On Fri, Apr 20, 2018 at 07:56:29AM +0100, Neil Williams wrote:
On 19 April 2018 at 20:11, Quentin Schulz <
quentin.schulz@bootlin.com>
wrote:
Hi all,
I've encountered a deadlock in my LAVA server with the following
scheme.
I have an at91rm9200ek in my lab that got submitted a lot of
multi-node
jobs requesting an other "board" (a laptop of type dummy-ssh). All of my other boards in the lab have received the same
multi-node
jobs
requesting the same and only laptop.
That is the source of the resource starvation - multiple
requirements of
a
single device. The scheduler needs to be greedy and grab whatever
suitable
devices it can as soon as it can to be able to run MultiNode. The
primary
ordering of scheduling is the Test Job ID which is determined at
submission.
Why would you order test jobs without knowing if the boards it depends on are available when it's going to be scheduled? What am I missing?
To avoid the situation where a MultiNode job is constantly waiting for
all
devices to be available at exactly the same time. Instances frequently
have
long queues of submitted test jobs, a mix of single node and MultiNode.
The
MultiNode jobs must be able to grab whatever device is available, in
order
of submit time, and then wait for the other part to be available. Otherwise, all devices would run all single node test jobs in the entire queue before any MultiNode test jobs could start. Many instances
constantly
have a queue of single node test jobs.
That's understood and expected.
If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the
starved
resource, there is not much LAVA can do currently. We are looking
at a
way
to reschedule MultiNode test jobs but it is very complex and low
priority.
What version of lava-server and lava-dispatcher are you running?
lava-dispatcher 2018.2.post3-1+jessie lava-server 2018.2-1+jessie lava-coordinator 0.1.7-1
(You need to upgrade to Stretch - there will be no fixes or upgrades available for Jessie. All development work must only happen on Stretch.
See
the lava-announce mailing list archive.)
Thanks, we'll have a look into this.
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization
level but
also at the lab structure / administrative level.
I have two machines. One acting as LAVA server and one as LAVA slave. The LAVA slave is handling all boards in our lab.
I have one laptop (an actual x86 laptop for which we know the NIC driver works reliably at high (~1Gbps) speeds) that we use for MultiNode jobs (actually requesting the laptop and one board at the same time only) to test network. This laptop is seen as a board by LAVA, there is nothing LAVA-related on this board (it should be seen as a device).
Then you need more LAVA devices to replicate the role played by the laptop. Exactly one device for each MultiNode test job which can be submitted at any one time. Then use device tags to allocate one of the "laptop" devices to each of the other boards involved in the MultiNode test jobs.
Alternatively, you need to manage both the submissions and the device availability.
Think of just the roles played by the devices.
There are N client role devices (not in Retired state) and there are X server role devices where the server role is what the laptop is currently doing.
You need to have N == X to solve the imbalance in the queue.
If N > 1 (and there is more than one device-type in the count 'N') then you also need to use device tags so that each device-type has a dedicated pool of server role devices where the number of devices in the server role pool exactly matches the number of devices of the device-type using the specified device tag.
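As a sketch, a MultiNode job definition that pins the server role to a dedicated "laptop" device via a device tag might look like the fragment below. The tag name "at91rm9200ek-link" is a made-up example; the admin would set that tag on exactly one dummy-ssh device:

```yaml
protocols:
  lava-multinode:
    roles:
      client:
        device_type: at91rm9200ek
        count: 1
      server:
        device_type: dummy-ssh
        count: 1
        # hypothetical tag, configured on one dummy-ssh device by the admin
        tags:
        - at91rm9200ek-link
```

Each other device-type would then use its own tag, so its jobs only queue against its own server-role device.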
I had to take the at91rm9200ek out of the lab because it was misbehaving.
However, LAVA is still scheduling multi-node jobs on the laptop which requires the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.
A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.
Why is that? The device is explicitly in Maintenance, which IMHO tells that the board shouldn't be used.
Not for the scheduler - the scheduler can still accept submissions and queue them up until the device comes out of Maintenance.
This prevents test jobs being rejected during certain kinds of maintenance. (Wider scale maintenance would involve taking down the UI on the master at which point submissions would get a 404 but that is up to admins to schedule and announce etc.)
This is about uptime for busy instances which frequently get batches of submissions out of operations like cron. The available devices quickly get swamped but the queue needs to continue accepting jobs until admins decide that devices which need work are going to be unavailable for long enough that the resulting queue would be impractical, i.e. when the length of time to wait for the job exceeds the useful window of the results from the job, or when the number of test jobs in the queue exceeds the ability of the available devices to keep on top of the queue and avoid ever increasing queues.
I guess that's an implementation choice but I'd have guessed the scheduler was first looping over idle devices to then schedule the oldest job in the queue for this device type.
But my understanding is that the scheduler rather sets an order when jobs are submitted that can't be tampered with. Is that correct?
The order is priority, submit_time and then target_group.
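As a rough illustration of that ordering (the field names here are assumptions for the sketch, not the actual LAVA database schema), the queue can be thought of as sorted like this:

```python
# Sketch: a queue ordered by priority (highest first), then submit_time,
# then target_group. Field names are illustrative, not the real schema.
def queue_order(jobs):
    return sorted(jobs, key=lambda j: (-j["priority"],
                                       j["submit_time"],
                                       j["target_group"]))

jobs = [
    {"id": 3, "priority": 50, "submit_time": 3, "target_group": "g2"},
    {"id": 1, "priority": 50, "submit_time": 1, "target_group": "g1"},
    {"id": 2, "priority": 90, "submit_time": 2, "target_group": "g1"},
]

# The high-priority job jumps ahead of earlier, lower-priority submissions.
print([j["id"] for j in queue_order(jobs)])  # [2, 1, 3]
```

This is why raising the Priority of new submissions (as described below) lets them overtake queued at91rm9200ek jobs.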
Changing ordering on-the-fly and backing out from certain states is the subject of https://projects.linaro.org/browse/LAVA-595 - that is the work I've already described as low priority, large scale and complex.
You do have the ability to set the Priority of new test jobs for submission which want to use the laptop in a MultiNode test job along with a device which is NOT the at91rm9200ek. You will need to cancel the test job involving the at91rm9200ek which is currently scheduled. (Other test jobs for the at91rm9200ek in the Queue can be left alone provided that these test jobs have a lower Priority than the jobs you want to run on other devices.)
When the scheduler comes back around, it will find a new test job with higher Priority which wants to use the laptop with a hikey device or whatever and the at91rm9200ek will be ignored. It's not perfect because you would then need to either keep that Priority pipe full or cancel the submitted test jobs for the at91rm9200ek.
Once a test job has been submitted, it will be either scheduled or cancelled.
Yes, that's understood and that makes sense to me. However, for "normal" jobs, if you can't find a board of device type X that is available, it does not get scheduled, right? Why can't we do the same for MultiNode jobs?
Because the MultiNode job will never find all devices in the right state at the same time once there is a mixed queue of single node and MultiNode jobs.
All devices defined in the MultiNode test job must be available at exactly the same time. Once there are single node jobs in the queue, that never happens.
A is running
B is Idle
MultiNode submitted for A & B
single node submitted for A
single node submitted for B
scheduler considers the queue - MultiNode cannot start (A is busy), so move on and start the single node job on B (because the single node test job on B may actually complete before the job on A finishes, so it is inefficient to keep B idle when it could be doing useful stuff for another user).
A is running
B is running

no actions

A completes and goes to Idle
B is still running
and so the pattern continues for as long as there are any single node test jobs for either A or B in the queue.
The MultiNode test job never starts because A and B are never Idle at the same time until the queue is completely empty (which *never* happens in many instances).
So the scheduler must grab B while it is Idle to prevent the single node test job starting. Then when A completes, the scheduler must also grab A before that single node test job starts running.
A is running
B is Idle
MultiNode submitted for A & B
single node submitted for A
single node submitted for B
B is transitioned to Scheduling and is unavailable for the single node test job.
A is running
B is scheduling

no actions

A completes and goes to Idle
B is scheduling
Scheduler transitions A into scheduling - that test job can now start.
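The greedy reservation described in this walkthrough can be sketched as a toy model (this is an illustration of the strategy, not LAVA's actual scheduler code):

```python
# Toy model of greedy MultiNode scheduling: a MultiNode job reserves any
# Idle device it needs immediately, so single node jobs cannot starve it.
def schedule_tick(devices, queue):
    """One scheduler pass over the queue, in submission order."""
    for job in queue:
        if job["state"] != "submitted":
            continue
        for dev in job["needs"]:
            if devices[dev] == "idle":
                devices[dev] = job   # reserved, even though other devices are busy
        if all(devices[d] is job for d in job["needs"]):
            job["state"] = "started"           # all devices reserved: job starts
            for d in job["needs"]:
                devices[d] = "running"

devices = {"A": "running", "B": "idle"}
multinode = {"needs": ["A", "B"], "state": "submitted"}
single_b = {"needs": ["B"], "state": "submitted"}
queue = [multinode, single_b]        # MultiNode was submitted first

schedule_tick(devices, queue)
# B is reserved for the MultiNode job; the later single node job cannot take it.
assert devices["B"] is multinode and single_b["state"] == "submitted"

devices["A"] = "idle"                # the job running on A completes
schedule_tick(devices, queue)
assert multinode["state"] == "started"
```

Without the reservation step, the single node job on B would start on every pass and the MultiNode job would wait forever.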
(Now consider MultiNode test jobs covering a dozen devices in an instance with a hundred mixed devices and permanent queues of single node test jobs.)
The scheduler also needs to be very fast, so the actual decisions need to be made on quite simple criteria - specifically, without going back to the database to find out about what else might be in the queue or trying to second-guess when test jobs might end.
That is understood as well for devices that are idle, running or scheduled to run. The point I was trying to make was why schedule a job for a device that is in Maintenance (which is what I meant by my poorly chosen word "available").
Is that because once the job is submitted it's ordered by the scheduler and then run in the given order, and the jobs are not checked against the device's Maintenance status?
The problem you have is not that the device is in Maintenance but that the other device(s) in the MultiNode test job are Idle. Therefore, those jobs get scheduled because there is no reason not to do so.
If the submission was to be rejected when all devices of the requested device-type are in Maintenance, that is a large change which would negatively impact a lot of busy instances.
We do need to clarify these states but essentially, Maintenance is a manual state change which has the same effect as the automatic state change to Bad. That is as far as it goes currently.
Now, until I put the at91rm9200ek back in the lab, all my boards are reserved and scheduling for a multi-node job and thus, my lab is basically dead.
The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.
e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
I strongly disagree with your statement. A software problem can often be dealt with by adding more resources but I'm not willing to spend thousands on something that can be fixed on the software side.
We've been through these loops within the team for many years and have millions of test jobs which demonstrate the problems and the fix. I'm afraid you are misunderstanding the problem if you think that there is a software solution for a queue containing both MultiNode and single node test jobs - other than the solution we now use in the LAVA scheduler. The process has been tried and tested over 8 years and millions of test jobs across dozens of mixed use case instances and has proven to be the most efficient use of resources across all those models.
Each test job in a MultiNode test is considered separately - if one or more devices are Idle, then those are immediately put into Scheduling. Only when all are in Scheduling can any of those jobs start. The status of other test jobs in the MultiNode group can only be handled at the point when at least one test job in that MultiNode group is in Scheduling.
I think there is a global misunderstanding due to my bad choice of words. I understand and I'm convinced there are no other ways to deal with what you explained above.
Aside from a non-negligible financial and time (to set up and maintain) effort to buy a board with a stable and reliable NIC for each and every board in my lab, it just isn't our use case.
If I did such a thing, then my network would be the bottleneck to my network tests and I'd have to spend a lot (financially and on time or maintenance) to have a top notch network infrastructure for tests that I don't mind running one after the other. I can't have a separate network for each and every board either, simply because my boards often have a single Ethernet port, thus I can't separate the test network from the lab network used for, e.g., image downloading as part of the booting process. Hence I can't do reliable network testing even by multiplying "laptop" devices.
I can understand it's not your typical use case at Linaro, where you have dozens and dozens of the same board and a huge infrastructure to handle the whole LAVA lab and maybe people working full-time on LAVA, the lab, the boards, the infrastructure. But that's the complete opposite of our use case.
Maybe you can understand ours where we have only one board of each device type, being part of KernelCI to test and report kernel booting status and having occasional custom tests (like network) on upstream or custom branches/repositories. We sporadically work on the lab, fixing the issues we're seeing with the boards but that's not what we do for a living.
I do understand and I personally run a lab in much the same way. However, the code needs to work the same way in that lab as it does in the larger labs. It is the local configuration and resource availability which must change to suit.

Of course.
For now, the best thing is to put devices into Retired so that submissions are rejected and then you will also have to manage your submissions and your queue.
Can't we have a "Maintenance but I don't know when it's coming back so please still submit jobs but do not schedule them" option :D ?
This is exactly what you already have - but it's not actually what you mean.
The problem is that you're thinking of the state of the at91rm9200ek when what matters to the scheduler is the state of the laptop device *in isolation*. The state of the at91rm9200ek only matters AFTER the laptop has been assigned.
What you mean is:
"I don't know when ANY of the devices of device-type A are going to be ready to start a test job, so do not allow ANY OTHER device of ANY OTHER device-type to be scheduled either IF that device-type is listed in a MultiNode test job which ALSO requires a device of this device-type".
(The reason it's ANY is that if 10 test jobs are submitted all wanting at91rm9200ek and laptop, then if you had 10 laptops and 1 at91rm9200ek, those 10 laptops would also go into Scheduled - that is the imbalance we talked about previously.)
It is a cross-relational issue.
Correlating across the MultiNode test job at the scheduling state is likely to have a massive impact on the speed of the scheduler because:
0: a device in Maintenance or Bad is NOT currently available to be scheduled - that's the whole point.
1: the other device(s) in the MultiNode group DO get scheduled because those were in Idle.
2: asking the scheduler to check the state of all devices of all device-types mentioned in a MultiNode job when considering whether to schedule any other device in that same MultiNode job is going to make the scheduler SLOW.
So what we do is let the Idle (laptop) device go into a waiting state and let the scheduler move the at91rm9200ek device into scheduling *only* when a at91rm9200ek device becomes available for scheduling - as two completely separate decisions. Then, the relational work is done by the database when the lava-master makes the query "which test jobs are available to start NOW". This is much more efficient because we are looking at jobs where all devices in the target_group are in state SCHEDULED. The database can easily exclude test jobs which are in state SUBMITTED (the at91rm9200ek jobs) and a simple check on target_group shows that the MultiNode test job is not ready to start. That can all be done with a single database query using select_related and other ORM niceties.
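That "which test jobs are available to start NOW" decision can be sketched as a pure function over job rows - a simplification of the database query described above, with illustrative column names:

```python
from collections import defaultdict

# Sketch of the lava-master decision: a MultiNode group is ready to start
# only when every job sharing its target_group is in state SCHEDULED.
def ready_groups(jobs):
    by_group = defaultdict(list)
    for job in jobs:
        by_group[job["target_group"]].append(job["state"])
    return {group for group, states in by_group.items()
            if all(state == "SCHEDULED" for state in states)}

jobs = [
    {"id": 10, "target_group": "g1", "state": "SCHEDULED"},  # laptop half
    {"id": 11, "target_group": "g1", "state": "SUBMITTED"},  # at91rm9200ek half, still queued
    {"id": 20, "target_group": "g2", "state": "SCHEDULED"},
    {"id": 21, "target_group": "g2", "state": "SCHEDULED"},
]

print(ready_groups(jobs))  # only g2 is ready to start: {'g2'}
```

In the real implementation this check is a single database query rather than a Python loop, which is what keeps it fast.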
Let's describe this with roles:
role: client
device-type: at91rm9200ek

role: server
device-type: laptop
If the at91rm9200ek is in Maintenance and there are no other devices of device-type at91rm9200ek in state Idle, then nothing will get scheduled for at91rm9200ek.
However, when a MultiNode test job is submitted for 1 at91rm9200ek and 1 or more "laptop" device(s), then there is no reason to stop scheduling the laptop device in state Idle without scrabbling through the test job definition and working out (again and again, every time the scheduler loops through the queue) which device-types are requested, which devices of those types are available and what to do next.
The problem is NOT the state of the at91rm9200ek - Maintenance or Bad, it makes no difference. The problem for the SCHEDULER is that the laptop device is Idle and requested by a test job with a sufficiently low submit_time (and high enough Priority) that it is first in the queue.
The problem at the SUBMISSION stage is that the only decision available is whether to allow the test job asking for at91rm9200ek & laptop onto the queue or whether to refuse it outright. Currently, a refusal is only implemented if all devices of at least one device-type specified in the test job are in Retired state.
After many, many rounds of testing, test jobs, discussions going on over several years we came to the decision that in your situation - where there is a dire shortage of a resource used by multiple MultiNode test jobs, that the only thing that was safe for the SCHEDULER to do was to allow the Idle device to be scheduled and let the test job wait for resources to become available, either by moving the other device out of Maintenance or providing extra hardware for the Idle device.
We're looking at what the Maintenance state means for MultiNode in https://projects.linaro.org/browse/LAVA-1299 but it is not acceptable to refuse submissions when devices are not Retired. Users have an expectation that devices which are being fixed will come back online at some point - or will go into Retired.

I agree.

There is also https://projects.linaro.org/browse/LAVA-595 but that work has not yet been scoped. It could be a long time before that work starts and will take months of work once it does start.
The problem is a structural one in the physical resources available in your local lab. It is a problem we have faced more than once in our own instances and we have gone down all the various routes until we've come to the current implementation.
We also work actively on the kernel and thus, we take boards (which we own only once) out of the lab to work on them and then put them back into the lab once we've finished working. This is where we put them in Maintenance mode as, IMHO, Retired does not cover this use case. This "Maintenance" can take seconds, days or months.
For me, you're ignoring an issue that is almost nonexistent in your case
It is an issue which has had months of investigation, discussion and intervention in our use cases. We have spent a very long time going through all of the permutations.
I understand the scheduler is a critical part of the software that had your attention for a long time and appropriate testing, no doubt.
because you've dealt with it by adding as much resource as you could to make the probability of it happening close to zero. That does not mean it does not exist. I'm not criticizing the way to deal with it, I'm just saying this way isn't a path we can take personally.
Then you need to manage the queue on your instance in ways that allow for your situation.
Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem with the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
We're testing that Gigabit NICs can actually handle Gbps transfers. We need a fully available Gbps NIC for each and every test we do to make the results reliable and consistent.
Then as that resource is limited, you must create a way that only one test job of this kind can ever actually run at a time. That can be done by working at the stage prior to submission or it can be done by changing the device availability such that the submission is rejected. Critically, there must also be a way to prevent jobs entering the queue if one of the device-types is not available. That can be easily determined using the XML-RPC API prior to submission.

That's a "lot" of complexity to deal with on our side but that's indeed a way to do it. I'd have to make sure only one device has a MultiNode job in queue and monitor it to send the next one.

Once submitted, LAVA must attempt to run the test job as quickly as possible, under the expectation that devices which have not been Retired will become available again within a reasonable amount of time. If that is not the case then those devices should be Retired. (Devices can be brought out of Retired as easily as going in, it doesn't have to be a permanent state, nothing is actually deleted from the device configuration.)
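A pre-submission availability check could look something like this sketch. The `scheduler.all_devices` call and the shape of its return rows are assumptions based on the LAVA XML-RPC API of this era; verify both against your own instance before relying on it:

```python
import xmlrpc.client

def usable(status):
    # Treat anything except retired/maintenance as potentially available.
    # Assumed status strings; check scheduler.all_devices on your instance.
    return status.lower() not in ("retired", "maintenance")

def device_type_available(devices, device_type):
    """devices: rows of (hostname, device_type, status, ...)."""
    return any(row[1] == device_type and usable(row[2]) for row in devices)

# Hypothetical live usage against a real instance:
# server = xmlrpc.client.ServerProxy("https://lava.example.com/RPC2")
# if not device_type_available(server.scheduler.all_devices(), "at91rm9200ek"):
#     raise SystemExit("at91rm9200ek unavailable - holding back submission")

sample = [
    ("at91rm9200ek-01", "at91rm9200ek", "Maintenance", None),
    ("laptop-01", "dummy-ssh", "Idle", None),
]
print(device_type_available(sample, "at91rm9200ek"))  # False
print(device_type_available(sample, "dummy-ssh"))     # True
```

A wrapper like this around job submission would keep MultiNode jobs out of the queue while the starved device-type is down.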
Hum... I'm just wondering. What about a device that was submitted a MultiNode job but got Retired since then?
Well spotted - that is left to the admins to manage. There is a story outstanding to cancel all test jobs submitted, scheduled or running for devices which are transitioned into Retired.
Now I'm wondering what's the difference between Retired and Maintenance, except that Retired does not accept job submissions?
The difference only shows if ALL devices of the device-type are in the specified state.
Retired - no submissions allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are currently unchanged if the state changes to Retired.
Maintenance - submissions are allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are unchanged if the state changes to Maintenance.
So the only distinction between Retired and Maintenance at the moment is submissions.
What you are describing sounds like a misuse of MultiNode resulting in resource starvation and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags or relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Neither of those options are possible in our use case.
I understand the MultiNode scheduler is complex and low priority. We've modestly contributed to LAVA before; we're not telling you to fix it ASAP but rather asking you to help or guide us to fix this issue in a way that could be accepted in the upstream version of LAVA.
If you still stand strong against a patch or if it's a lengthy complete rework of the scheduler, could we at least have a way to tell for how long a test has been scheduled (or for how long a board has been reserved for a test that is scheduled)?
That data is already available in the current UI and over the XML-RPC API and REST API.
Check for Queued Jobs and the scheduling state, also the job_details call in XML-RPC. There are a variety of ways of getting the information you require using the existing APIs - which one you use will depend on your preference and current scripting.
Rather than polling on XML-RPC, it would be better for a monitoring process to use ZMQ and the publisher events to get push notifications of change of state. That lowers the load on the master, depending on how busy the instance actually becomes.
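A minimal event-listener sketch follows. It assumes the pyzmq package, the publisher's default port 5500, and the five-part message layout (topic, uuid, datetime, username, JSON data) used by the LAVA event publisher of this era - all of which should be verified against your instance:

```python
import json

def parse_event(parts):
    """Decode one multipart publisher message: topic, uuid, dt, user, JSON data."""
    topic, uuid, dt, username, data = (p.decode() for p in parts)
    return topic, json.loads(data)

# Hypothetical live usage (requires pyzmq and a reachable lava-publisher):
# import zmq
# sub = zmq.Context().socket(zmq.SUB)
# sub.connect("tcp://lava.example.com:5500")
# sub.setsockopt_string(zmq.SUBSCRIBE, "")   # all topics
# while True:
#     topic, data = parse_event(sub.recv_multipart())
#     if data.get("state") == "Scheduled":
#         print(topic, data.get("job"), "entered Scheduled")

sample = [b"org.example.testjob", b"uuid", b"2018-04-24", b"lava",
          b'{"job": 1234, "state": "Scheduled"}']
topic, data = parse_event(sample)
print(topic, data["state"])  # org.example.testjob Scheduled
```

Recording the timestamp of each state-change event gives exactly the "how long has this job been scheduled" figure asked about below, without polling.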
That way we can use an external tool to monitor this and manually cancel jobs when needed. Currently, I don't think there is a way to tell how long ago a job was scheduled.
Every test job has a submit_time field in the database, created at the point where the job is created upon submission.
Submit_time isn't really an option if it means what its name suggests, because I can have jobs in the queue for days. (Multiple network tests for each and every board, and also time-consuming tests (e.g. crypto) that have the same priority.)
I'll have a look at what you've offered above, thanks.
Thanks for having taken the time to answer my question,
Quentin
--
Neil Williams
neil.williams@linaro.org http://www.linux.codehelp.co.uk/
Hello,
if I understand correctly the issue that you have, this is more or less the same as https://projects.linaro.org/browse/LAVA-1285
We have seen this issue on staging with a multinode job that needs 2 bbb and 1 cubie. When the cubie goes BAD, the 2 bbb will stay in SCHEDULING until the cubie is back to GOOD. That's a bug in the current scheduler.
The card explains a bit what should be done to fix this issue: adding some code to undo the scheduling when the scheduler notices that some devices are missing. I don't have any time to look at that card and I won't in the near future. So if that's important to you, please take some time to write down a patch and send it for review.
This card is part of https://projects.linaro.org/browse/LAVA-595 which is a long task to improve the scheduler.
Cheers.
2018-04-24 11:15 GMT+02:00 Neil Williams neil.williams@linaro.org:
On 24 April 2018 at 10:10, Neil Williams neil.williams@linaro.org wrote:
On 24 April 2018 at 09:08, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
I think there was a global misunderstanding from a poorly choice of words. When I was saying "available device" I meant a device that isn't in Maintenance or Retired. If the device is idle, running a job, scheduled to run a job, etc... I consider it "available". Sorry for the misunderstanding.
Currently, "Maintenance" is considered as available for submission & for scheduling. This is to support maximum uptime and minimal disruption to CI loops for temporary work happening on devices.
Currently, "Maintenance" is considered as available for submission but not for scheduling. This is to support maximum uptime and minimal disruption to CI loops for temporary work happening on devices.
We are looking at clarifying this soon.
I know this has been a long and complex thread. Thank you for sticking with the discussion, despite the complexity and terminology.
So complex, I sometimes typo it myself.
On Mon, Apr 23, 2018 at 12:54:03PM +0100, Neil Williams wrote:
On 23 April 2018 at 11:21, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
Thanks for your prompt answer.
On Fri, Apr 20, 2018 at 07:56:29AM +0100, Neil Williams wrote:
On 19 April 2018 at 20:11, Quentin Schulz <quentin.schulz@bootlin.com> wrote:
> Hi all, > > I've encountered a deadlock in my LAVA server with the following
scheme.
> I have an at91rm9200ek in my lab that got submitted a lot of
multi-node
> jobs requesting an other "board" (a laptop of type dummy-ssh). > All of my other boards in the lab have received the same
multi-node
jobs
> requesting the same and only laptop. >
That is the source of the resource starvation - multiple
requirements of
a
single device. The scheduler needs to be greedy and grab whatever
suitable
devices it can as soon as it can to be able to run MultiNode. The
primary
ordering of scheduling is the Test Job ID which is determined at
submission.
Why would you order test jobs without knowing if the boards it
depends
on are available when it's going to be scheduled? What am I missing?
To avoid the situation where a MultiNode job is constantly waiting for
all
devices to be available at exactly the same time. Instances frequently
have
long queues of submitted test jobs, a mix of single node and
MultiNode. The
MultiNode jobs must be able to grab whatever device is available, in
order
of submit time, and then wait for the other part to be available. Otherwise, all devices would run all single node test jobs in the
entire
queue before any MultiNode test jobs could start. Many instances
constantly
have a queue of single node test jobs.
That's understood and expected.
If you have an imbalance between the number of machines which can
be
available and then submit MultiNode jobs which all rely on the
starved
resource, there is not much LAVA can do currently. We are looking
at a
way
to reschedule MultiNode test jobs but it is very complex and low
priority.
What version of lava-server and lava-dispatcher are you running?
lava-dispatcher 2018.2.post3-1+jessie lava-server 2018.2-1+jessie lava-coordinator 0.1.7-1
(You need to upgrade to Stretch - there will be no fixes or upgrades available for Jessie. All development work must only happen on
Stretch. See
the lava-announce mailing list archive.)
Thanks, we'll have a look into this.
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization
level but
also at the lab structure / administrative level.
I have two machines. One acting as LAVA server and one as LAVA slave. The LAVA slave is handling all boards in our lab.
I have one laptop (an actual x86 laptop for which we know the NIC
driver
works reliably at high (~1Gbps) speeds) that we use for MultiNode
jobs
(actually requesting the laptop and one board at the same time only)
to
test network. This laptop is seen as a board by LAVA, there is
nothing
LAVA-related on this board (it should be seen as a device).
Then you need more LAVA devices to replicate the role played by the
laptop.
Exactly one device for each MultiNode test job which can be submitted
at
any one time. Then use device tags to allocate one of the "laptop"
devices
to each of the other boards involved in the MultiNode test jobs.
Alternatively, you need to manage both the submissions and the device availability.
Think of just the roles played by the devices.
There are N client role devices (not in Retired state) and there are X server role devices where the server role is what the laptop is
currently
doing.
You need to have N == X to solve the imbalance in the queue.
If N > 1 (and there are more than one device-type in the count 'N')
then
you also need to use device tags so that each device-type has a
dedicated
pool of server role devices where the number of devices in the server
role
pool exactly matches the number of devices of the device-type using the specified device tag.
> > I had to take the at91rm9200ek out of the lab because it was
behaving.
>
> However, LAVA is still scheduling multi-node jobs on the laptop
which
> requires the at91rm9200ek as the other part of the job, while its
status
> is clearly Maintenance. > > A device in Maintenance is still available for scheduling - only
Retired
is
excluded - test jobs submitted to a Retired device are rejected.
Why is that? The device is explicitely in Maintenance, which IMHO
tells
that the board shouldn't be used.
Not for the scheduler - the scheduler can still accept submissions and queue them up until the device comes out of Maintenance.
This prevents test jobs being rejected during certain kinds of
maintenance.
(Wider scale maintenance would involve taking down the UI on the
master at
which point submissions would get a 404 but that is up to admins to schedule and announce etc.)
This is about uptime for busy instances which frequently get batches of submissions out of operations like cron. The available devices quickly
get
swamped but the queue needs to continue accepting jobs until admins
decide
that devices which need work are going to be unavailable for long
enough
that the resulting queue would be impractical. i.e. when the length of
time
to wait for the job exceeds the useful window of the results from the
job
or when the number of test jobs in the queue exceeds the ability of the available devices to keep on top of the queue and avoid ever increasing queues.
I guess that's an implementation choice but I'd have guessed the scheduler was first looping over idle devices to then schedule the oldest job in the queue for this device type.
But my understanding is that the scheduler rather sets an order when jobs are submitted that isn't temperable with. Is that correct?
The order is priority, submit_time and then target_group.
Changing ordering on-the-fly and backing out from certain states is the subject of https://projects.linaro.org/browse/LAVA-595 - that is the work I've already described as low priority, large scale and complex.
You do have the ability to set the Priority of new test jobs for submission which want to use the laptop in a MultiNode test job along with a device which is NOT the at91rm9200ek. You will need to cancel the test job involving the at91rm9200ek which is currently scheduled. (Other test jobs for the at91rm9200ek in the Queue can be left alone provided that these test jobs have a lower Priority than the jobs you want to run on other devices.)
When the scheduler comes back around, it will find a new test job with higher Priority which wants to use the laptop with a hikey device or whatever and the at91rm9200ek will be ignored. It's not perfect because you would then need to either keep that Priority pipe full or cancel the submitted test jobs for the at91rm9200ek.
Once a test job has been submitted, it will be either scheduled or cancelled.
Yes, that's understood and that makes sense to me. However, for "normal" jobs, if you can't find a board of device type X that is available, it does not get scheduled, right? Why can't we do the same for MultiNode jobs?

Because the MultiNode job will never find all devices in the right state at the same time once there is a mixed queue of single node and MultiNode jobs. All devices defined in the MultiNode test job must be available at exactly the same time. Once there are single node jobs in the queue, that never happens.
A is running; B is Idle.
MultiNode job submitted for A & B.
Single node job submitted for A.
Single node job submitted for B.
The scheduler considers the queue - the MultiNode job cannot start (A is busy), so it moves on and starts the single node job on B (because the single node test job on B may actually complete before the job on A finishes, so it is inefficient to keep B idle when it could be doing useful stuff for another user).
A is running; B is running.
No actions.
A completes and goes to Idle; B is still running.
And so the pattern continues for as long as there are any single node test jobs for either A or B in the queue. The MultiNode test job never starts because A and B are never Idle at the same time until the queue is completely empty (which *never* happens in many instances).
So the scheduler must grab B while it is Idle to prevent the single node test job starting. Then when A completes, the scheduler must also grab A before that single node test job starts running.
A is running; B is Idle.
MultiNode job submitted for A & B.
Single node job submitted for A.
Single node job submitted for B.
B is transitioned to Scheduling and is unavailable for the single node test job.

A is running; B is Scheduling.
No actions.
A completes and goes to Idle; B is Scheduling.

The scheduler transitions A into Scheduling - that MultiNode test job can now start.
(Now consider MultiNode test jobs covering a dozen devices in an instance with a hundred mixed devices and permanent queues of single node test jobs.)
The scheduler also needs to be very fast, so the actual decisions need to be made on quite simple criteria - specifically, without going back to the database to find out about what else might be in the queue or trying to second-guess when test jobs might end.
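The grab-and-wait behaviour in the walkthrough above can be sketched as a toy simulation. This is illustrative Python only, not LAVA code; the device states and job structure are invented for the example:

```python
# Toy simulation of greedy MultiNode scheduling: grab every Idle device a
# MultiNode job wants, and only start the job once ALL of them are grabbed.

def schedule(devices, queue):
    """devices: dict name -> state ("Idle", "Running", "Scheduling").
    queue: list of jobs, each {"wants": [device names]}.
    Returns the jobs that were able to start on this pass."""
    started = []
    for job in queue:
        wants = job["wants"]
        for dev in wants:
            if devices[dev] == "Idle":
                devices[dev] = "Scheduling"  # grab it before a single node job can
        if all(devices[d] == "Scheduling" for d in wants):
            for d in wants:
                devices[d] = "Running"       # all parts grabbed: the job starts
            started.append(job)
    return started

devices = {"A": "Running", "B": "Idle"}
queue = [{"wants": ["A", "B"]}]           # MultiNode job for A & B
first_pass = schedule(devices, queue)     # A busy: B is grabbed, job waits
devices["A"] = "Idle"                     # A completes
started = schedule(devices, queue)        # both grabbed: job can start
```

On the first pass the job cannot start, but B is moved to Scheduling so no single node job can steal it; once A completes, the second pass starts the MultiNode job.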
That is understood as well for devices that are Idle, Running or scheduled to run. The point I was trying to make was: why schedule a job for a device that is in Maintenance (which is what I meant by the poorly chosen word "available")?
Is that because, once the job is submitted, it's ordered by the scheduler and then run in that order, without the jobs being discriminated against the device's Maintenance status?
The problem you have is not that the device is in Maintenance but that the other device(s) in the MultiNode test job are Idle. Therefore, those jobs get scheduled because there is no reason not to do so.
If the submission was to be rejected when all devices of the requested device-type are in Maintenance, that is a large change which would negatively impact a lot of busy instances.
We do need to clarify these states but essentially, Maintenance is a manual state change which has the same effect as the automatic state change to Bad. That is as far as it goes currently.
> Now, until I put the at91rm9200ek back in the lab, all my boards are reserved and scheduling for a multi-node job and thus, my lab is basically dead.

The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously, and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.
e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.
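That sizing rule ("the sum of the devices of each other device-type") can be written down as a tiny helper. This is an illustrative sketch only; the function name and data shapes are invented for the example:

```python
# Sketch of the device-tag pool sizing rule described above: one tagged
# "server" device (QEMU / laptop role) per client device of each other type.

def required_server_devices(client_counts):
    """client_counts: dict of device-type -> number of devices of that type.
    Returns (total server devices needed, device-tag for each of them)."""
    tags = []
    for dev_type, count in sorted(client_counts.items()):
        tags.extend([dev_type] * count)  # one tagged server per client device
    return len(tags), tags

# The lab from the example: one phone, one hikey, two pandas.
total, tags = required_server_devices({"phone": 1, "hikey": 1, "panda": 2})
```

So adding a second panda to the example lab pushes the required QEMU count from three to four, each carrying the matching device tag.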
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
I strongly disagree with your statement. A software problem can often be dealt with by adding more resources, but I'm not willing to spend thousands on something that can be fixed on the software side.
We've been through these loops within the team for many years and have millions of test jobs which demonstrate the problems and the fix. I'm afraid you are misunderstanding the problem if you think that there is a software solution for a queue containing both MultiNode and single node test jobs - other than the solution we now use in the LAVA scheduler. The process has been tried and tested over 8 years and millions of test jobs across dozens of mixed use case instances and has proven to be the most efficient use of resources across all those models.
Each test job in a MultiNode test is considered separately - if one or more devices are Idle, then those are immediately put into Scheduling. Only when all are in Scheduling can any of those jobs start. The status of other test jobs in the MultiNode group can only be handled at the point when at least one test job in that MultiNode group is in Scheduling.
I think there is a global misunderstanding due to my bad choice of words. I understand and I'm convinced there are no other ways to deal with what you explained above.
Aside from a non-negligible financial and time (to set up and maintain) effort to buy a board with a stable and reliable NIC for each and every board in my lab, it just isn't our use case.
If I did such a thing, then my network would be the bottleneck for my network tests and I'd have to spend a lot (financially and on time or maintenance) to have a top notch network infrastructure for tests I don't mind running one after the other. I can't have a separate network for each and every board either, simply because my boards often have a single Ethernet port, thus I can't separate the test network from the lab network used for, e.g., image downloading as part of the booting process, hence I can't do reliable network testing even by multiplying "laptop" devices.
I can understand it's not your typical use case at Linaro, where you've dozens and dozens of the same board and a huge infrastructure to handle the whole LAVA lab and maybe people working full-time on LAVA, the lab, the boards, the infrastructure. But that's the complete opposite of our use case.
Maybe you can understand ours, where we have only one board of each device type, being part of KernelCI to test and report kernel booting status, and having occasional custom tests (like network) on upstream or custom branches/repositories. We sporadically work on the lab, fixing the issues we're seeing with the boards, but that's not what we do for a living.
I do understand and I personally run a lab in much the same way. However, the code needs to work the same way in that lab as it does in the larger labs. It is the local configuration and resource availability which must change to suit.

Of course.
For now, the best thing is to put devices into Retired so that submissions are rejected, and then you will also have to manage your submissions and your queue.
Can't we have a "Maintenance but I don't know when it's coming back so please still submit jobs but do not schedule them" option :D ?
This is exactly what you already have - but it's not actually what you mean.
The problem is that you're thinking of the state of the at91rm9200ek when what matters to the scheduler is the state of the laptop device *in isolation*. The state of the at91rm9200ek only matters AFTER the laptop has been assigned.
What you mean is:
"I don't know when ANY of the devices of device-type A are going to be ready to start a test job, so do not allow ANY OTHER device of ANY OTHER device-type to be scheduled either IF that device-type is listed in a MultiNode test job which ALSO requires a device of this device-type".
(The reason it's ANY is that if 10 test jobs are submitted all wanting at91rm9200ek and laptop, then if you had 10 laptops and 1 at91rm9200ek, those 10 laptops would also go into Scheduled - that is the imbalance we talked about previously.)
It is a cross-relational issue.
Correlating across the MultiNode test job at the scheduling stage is likely to have a massive impact on the speed of the scheduler because:
0: a device in Maintenance or Bad is NOT currently available to be scheduled - that's the whole point.
1: the other device(s) in the MultiNode group DO get scheduled because those were in Idle.
2: asking the scheduler to check the state of all devices of all device-types mentioned in a MultiNode job when considering whether to schedule any other device in that same MultiNode job is going to make the scheduler SLOW.
So what we do is let the Idle (laptop) device go into a waiting state and let the scheduler move the at91rm9200ek device into scheduling *only* when an at91rm9200ek device becomes available for scheduling - as two completely separate decisions. Then, the relational work is done by the database when the lava-master makes the query "which test jobs are available to start NOW". This is much more efficient because we are looking at jobs where all devices in the target_group are in state SCHEDULED. The database can easily exclude test jobs which are in state SUBMITTED (the at91rm9200ek jobs) and a simple check on target_group shows that the MultiNode test job is not ready to start. That can all be done with a single database query using select_related and other ORM niceties.
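The readiness check being described ("all devices in the target_group are in state SCHEDULED") can be sketched in plain Python. The real implementation is a single Django ORM query; this set-based version is only an illustration with invented data shapes:

```python
# Plain-Python sketch of the lava-master query described above:
# a target_group is ready to start only when EVERY test job in the
# group is in state SCHEDULED. (Illustrative only, not LAVA code.)

from collections import defaultdict

def ready_target_groups(jobs):
    """jobs: list of dicts with "target_group" and "state" keys."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job["target_group"]].append(job["state"])
    return [group for group, states in groups.items()
            if all(state == "SCHEDULED" for state in states)]

jobs = [
    {"target_group": "g1", "state": "SCHEDULED"},  # laptop half, already grabbed
    {"target_group": "g1", "state": "SUBMITTED"},  # at91rm9200ek half, still waiting
    {"target_group": "g2", "state": "SCHEDULED"},
    {"target_group": "g2", "state": "SCHEDULED"},
]
ready = ready_target_groups(jobs)
```

Here only "g2" is ready; "g1" is excluded because its at91rm9200ek half is still SUBMITTED, which is exactly why the laptop sits in Scheduling indefinitely.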
Let's describe this with roles:
role: client
device-type: at91rm9200ek

role: server
device-type: laptop
If the at91rm9200ek is in Maintenance and there are no other devices of device-type at91rm9200ek in state Idle, then nothing will get scheduled for at91rm9200ek.
However, when a MultiNode test job is submitted for 1 at91rm9200ek and 1 or more "laptop" device(s), then there is no reason to stop scheduling the laptop device in state Idle without scrabbling through the test job definition and working out (again and again, every time the scheduler loops through the queue) which device-types are requested, which devices of those types are available and what to do next.
The problem is NOT the state of the at91rm9200ek - Maintenance or Bad, it makes no difference. The problem for the SCHEDULER is that the laptop device is Idle and requested by a test job with a sufficiently low submit_time (and high enough Priority) that it is first in the queue.
The problem at the SUBMISSION stage is that the only decision available is whether to allow the test job asking for at91rm9200ek & laptop onto the queue or whether to refuse it outright. Currently, a refusal is only implemented if all devices of at least one device-type specified in the test job are in Retired state.
After many, many rounds of testing, test jobs, discussions going on over several years we came to the decision that in your situation - where there is a dire shortage of a resource used by multiple MultiNode test jobs, that the only thing that was safe for the SCHEDULER to do was to allow the Idle device to be scheduled and let the test job wait for resources to become available, either by moving the other device out of Maintenance or providing extra hardware for the Idle device.
We're looking at what the Maintenance state means for MultiNode in https://projects.linaro.org/browse/LAVA-1299 but it is not acceptable to refuse submissions when devices are not Retired. Users have an expectation that devices which are being fixed will come back online at some point - or will go into Retired.

I agree.

There is also https://projects.linaro.org/browse/LAVA-595 but that work has not yet been scoped. It could be a long time before that work starts and it will take months of work once it does start.
The problem is a structural one in the physical resources available in your local lab. It is a problem we have faced more than once in our own instances and we have gone down all the various routes until we've come to the current implementation.
We also work actively on the kernel and thus we take boards (of which we own only one each) out of the lab to work on them, and then put them back into the lab once we've finished working. This is when we put a board into Maintenance mode as, IMHO, Retired does not cover this use case. This "Maintenance" can take seconds, days or months.
For me, you're ignoring an issue that is almost nonexistent in your case because you've dealt with it by adding as much resource as you could, to make the probability of it happening close to zero. That does not mean it does not exist. I'm not criticizing this way of dealing with it, I'm just saying it isn't a path we can take personally.

It is an issue which has had months of investigation, discussion and intervention in our use cases. We have spent a very long time going through all of the permutations.

I understand the scheduler is a critical part of the software that has had your attention for a long time and appropriate testing, no doubt.

Then you need to manage the queue on your instance in ways that allow for your situation.
> Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC, which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? An LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem for the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
We're testing that Gigabit NICs can actually handle Gbps transfers. We need a fully available Gbps NIC for each and every test we do to make the results reliable and consistent.
Then, as that resource is limited, you must create a way that only one test job of this kind can ever actually run at a time. That can be done by working at the stage prior to submission, or it can be done by changing the device availability such that the submission is rejected. Critically, there must also be a way to prevent jobs entering the queue if one of the device-types is not available. That can be easily determined using the XML-RPC API prior to submission. Once submitted, LAVA must attempt to run the test job as quickly as possible, under the expectation that devices which have not been Retired will become available again within a reasonable amount of time. If that is not the case then those devices should be Retired. (Devices can be brought out of Retired as easily as going in; it doesn't have to be a permanent state, nothing is actually deleted from the device configuration.)

That's a "lot" of complexity to deal with on our side but that's indeed a way to do it. I'd have to make sure only one device has a MultiNode job in the queue and monitor it to send the next one.
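A pre-submission gate over XML-RPC might look like the sketch below. Treat the details as assumptions to verify against your lava-server version's API documentation: the sketch assumes a `scheduler.all_devices()` call whose entries begin with (hostname, device_type, state), and the set of states treated as "usable" is my choice for the example:

```python
# Sketch of a pre-submission availability gate, as suggested above.
# Assumption: scheduler.all_devices() returns entries starting with
# (hostname, device_type, state, ...) - check your version's API docs.

import xmlrpc.client

def has_usable_device(devices, device_type):
    """True if any device of device_type is in a state from which it can
    eventually be scheduled (i.e. not Maintenance/Bad/Retired)."""
    usable = {"idle", "running", "reserved"}  # assumed "will come free" states
    return any(entry[1] == device_type and str(entry[2]).lower() in usable
               for entry in devices)

def safe_to_submit(server_url, required_types):
    """Only submit the MultiNode job if every required device-type has at
    least one usable device right now."""
    server = xmlrpc.client.ServerProxy(server_url)
    devices = server.scheduler.all_devices()
    return all(has_usable_device(devices, t) for t in required_types)

# e.g. (hypothetical URL):
# safe_to_submit("https://lava.example.com/RPC2", ["at91rm9200ek", "dummy-ssh"])
```

With the at91rm9200ek in Maintenance, `safe_to_submit` would return False and the job asking for the laptop never enters the queue, so the laptop is never grabbed.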
Hum... I'm just wondering. What about a device that was submitted a MultiNode job but got Retired since then?
Well spotted - that is left to the admins to manage. There is a story outstanding to cancel all test jobs submitted, scheduled or running for devices which are transitioned into Retired.
Now I'm wondering: what's the difference between Retired and Maintenance, except that Retired does not accept job submissions?
The difference only shows if ALL devices of the device-type are in the specified state.
Retired - no submissions allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are currently unchanged if the state changes to Retired.

Maintenance - submissions are allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are unchanged if the state changes to Maintenance.
So the only distinction between Retired and Maintenance at the moment is submissions.
What you are describing sounds like a misuse of MultiNode resulting in resource starvation, and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags, or by relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Neither of those options are possible in our use case.
I understand the MultiNode scheduler is complex and low priority. We've modestly contributed to LAVA before; we're not telling you to fix it ASAP but rather asking you to help or guide us to fix this issue in a way that could be accepted in the upstream version of LAVA.
If you still stand strong against a patch, or if it's a lengthy complete rework of the scheduler, could we at least have a way to tell for how long a test has been scheduled (or for how long a board has been reserved for a test that is scheduled)?
That data is already available in the current UI and over the XML-RPC API and REST API.
Check for Queued Jobs and the scheduling state, also the job_details call in XML-RPC. There are a variety of ways of getting the information you require using the existing APIs - which one you use will depend on your preference and current scripting.
Rather than polling on XML-RPC, it would be better for a monitoring process to use ZMQ and the publisher events to get push notifications of changes of state. That lowers the load on the master, depending on how busy the instance actually becomes.
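A push-based monitor along those lines might look like the sketch below. The multipart frame layout used here (topic, uuid, datetime, username, json data) and the endpoint are assumptions to verify against your lava-server version's event-notification documentation:

```python
# Sketch of a ZMQ-based monitor for LAVA publisher events, as an
# alternative to XML-RPC polling. The frame layout (topic, uuid, datetime,
# username, json data) is an ASSUMPTION - verify against your version.

import json

def decode_event(frames):
    """Decode one multipart event into (topic, data-dict).
    Assumes the first frame is the topic and the last is JSON data."""
    topic = frames[0].decode()
    data = json.loads(frames[-1].decode())
    return topic, data

def watch(endpoint="tcp://lava.example.com:5500"):  # hypothetical endpoint
    import zmq  # pyzmq; imported here so decode_event stays dependency-free
    sub = zmq.Context().socket(zmq.SUB)
    sub.setsockopt_string(zmq.SUBSCRIBE, "")  # subscribe to all topics
    sub.connect(endpoint)
    while True:
        topic, data = decode_event(sub.recv_multipart())
        if data.get("state") == "Scheduled":
            print(topic, data)  # e.g. feed an external watchdog/cancel tool

# Decoding a synthetic event, to show the expected shape:
frames = [b"org.example.testjob", b"uuid", b"2018-04-24", b"lava-publisher",
          json.dumps({"job": 1234, "state": "Scheduled"}).encode()]
topic, data = decode_event(frames)
```

An external tool built on this could track how long each job has sat in Scheduled and cancel stuck MultiNode jobs, which addresses the monitoring need described below.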
That way we can use an external tool to monitor this and manually cancel them when needed.
Currently, I don't think there is a way to tell since when the job was scheduled.
Every test job has a submit_time database field created at the point where the job is created upon submission.
Submit_time isn't really an option if it means what its name says, because I can have jobs in the queue for days (multiple network tests for each and every board, and also time-consuming tests (e.g. crypto) that have the same priority).
I'll have a look at what you've offered above, thanks.
Thanks for having taken the time to answer my question,
Quentin
--
Neil Williams
neil.williams@linaro.org http://www.linux.codehelp.co.uk/
Lava-users mailing list Lava-users@lists.linaro.org https://lists.linaro.org/mailman/listinfo/lava-users
Hi Remi,
On Tue, Apr 24, 2018 at 06:22:19PM +0200, Remi Duraffort wrote:
Hello,
if I understand correctly the issue that you have, this is more or less the same as https://projects.linaro.org/browse/LAVA-1285
We have seen this issue on staging with a multinode job that needs 2 bbb and 1 cubie. When the cubie goes BAD, the 2 bbb will stay in SCHEDULING until the cubie is back to GOOD. That's a bug in the current scheduler.
That seems to be the same issue indeed.
With all the ongoing discussion with Neil, it does look hard to fix with all the current constraints.
The card explains a bit what should be done to fix this issue: adding some code to undo the scheduling when the scheduler notices that some devices are missing. I don't have any time to look at that card and I won't in the near future. So if that's important to you, please take some time to write down a patch and send it for review.
This card is part of https://projects.linaro.org/browse/LAVA-595 which is a long task to improve the scheduler.
This card is a bit obscure to me as there is not much information.
We'll let you know if we start working on those cards.
Thanks,
Quentin
Cheers.
2018-04-24 11:15 GMT+02:00 Neil Williams neil.williams@linaro.org:
On 24 April 2018 at 10:10, Neil Williams neil.williams@linaro.org wrote:
On 24 April 2018 at 09:08, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
I think there was a global misunderstanding from a poor choice of words. When I was saying "available device" I meant a device that isn't in Maintenance or Retired. If the device is idle, running a job, scheduled to run a job, etc., I consider it "available". Sorry for the misunderstanding.
Currently, "Maintenance" is considered as available for submission & for scheduling. This is to support maximum uptime and minimal disruption to CI loops for temporary work happening on devices.
Currently, "Maintenance" is considered as available for submission but not for scheduling. This is to support maximum uptime and minimal disruption to CI loops for temporary work happening on devices.
We are looking at clarifying this soon.
I know this has been a long and complex thread. Thank you for sticking with the discussion, despite the complexity and terminology.
So complex, I sometimes typo it myself.
On Mon, Apr 23, 2018 at 12:54:03PM +0100, Neil Williams wrote:
On 23 April 2018 at 11:21, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
Thanks for your prompt answer.
On Fri, Apr 20, 2018 at 07:56:29AM +0100, Neil Williams wrote:
> On 19 April 2018 at 20:11, Quentin Schulz <quentin.schulz@bootlin.com> wrote:
> > Hi all,
> >
> > I've encountered a deadlock in my LAVA server with the following scheme. I have an at91rm9200ek in my lab that got submitted a lot of multi-node jobs requesting an other "board" (a laptop of type dummy-ssh). All of my other boards in the lab have received the same multi-node jobs requesting the same and only laptop.
>
> That is the source of the resource starvation - multiple requirements of a single device. The scheduler needs to be greedy and grab whatever suitable devices it can as soon as it can to be able to run MultiNode. The primary ordering of scheduling is the Test Job ID which is determined at submission.

Why would you order test jobs without knowing if the boards they depend on are available when they are going to be scheduled? What am I missing?
To avoid the situation where a MultiNode job is constantly waiting for all devices to be available at exactly the same time. Instances frequently have long queues of submitted test jobs, a mix of single node and MultiNode. The MultiNode jobs must be able to grab whatever device is available, in order of submit time, and then wait for the other part to be available. Otherwise, all devices would run all single node test jobs in the entire queue before any MultiNode test jobs could start. Many instances constantly have a queue of single node test jobs.
That's understood and expected.
> If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the starved resource, there is not much LAVA can do currently. We are looking at a way to reschedule MultiNode test jobs but it is very complex and low priority.
>
> What version of lava-server and lava-dispatcher are you running?

lava-dispatcher 2018.2.post3-1+jessie
lava-server 2018.2-1+jessie
lava-coordinator 0.1.7-1
(You need to upgrade to Stretch - there will be no fixes or upgrades available for Jessie. All development work must only happen on Stretch. See the lava-announce mailing list archive.)

Thanks, we'll have a look into this.
> What is the structure of your current lab?
>
> MultiNode is complex - not just at the test job synchronization level but also at the lab structure / administrative level.
I have two machines. One acting as LAVA server and one as LAVA slave. The LAVA slave is handling all boards in our lab.
I have one laptop (an actual x86 laptop for which we know the NIC driver works reliably at high (~1Gbps) speeds) that we use for MultiNode jobs (actually requesting the laptop and one board at the same time only) to test network. This laptop is seen as a board by LAVA; there is nothing LAVA-related on this board (it should be seen as a device).
Then you need more LAVA devices to replicate the role played by the laptop. Exactly one device for each MultiNode test job which can be submitted at any one time. Then use device tags to allocate one of the "laptop" devices to each of the other boards involved in the MultiNode test jobs.
Alternatively, you need to manage both the submissions and the device availability.
Think of just the roles played by the devices.
There are N client role devices (not in Retired state) and there are X server role devices, where the server role is what the laptop is currently doing.
You need to have N == X to solve the imbalance in the queue.
If N > 1 (and there is more than one device-type in the count 'N') then you also need to use device tags so that each device-type has a dedicated pool of server role devices, where the number of devices in the server role pool exactly matches the number of devices of the device-type using the specified device tag.
> I had to take the at91rm9200ek out of the lab because it was misbehaving.
>
> However, LAVA is still scheduling multi-node jobs on the laptop which requires the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.

A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.

Why is that? The device is explicitly in Maintenance, which IMHO tells
that the board shouldn't be used.
Not for the scheduler - the scheduler can still accept submissions and queue them up until the device comes out of Maintenance.
This prevents test jobs being rejected during certain kinds of
maintenance.
(Wider scale maintenance would involve taking down the UI on the
master at
which point submissions would get a 404 but that is up to admins to schedule and announce etc.)
This is about uptime for busy instances which frequently get batches of submissions out of operations like cron. The available devices quickly
get
swamped but the queue needs to continue accepting jobs until admins
decide
that devices which need work are going to be unavailable for long
enough
that the resulting queue would be impractical. i.e. when the length of
time
to wait for the job exceeds the useful window of the results from the
job
or when the number of test jobs in the queue exceeds the ability of the available devices to keep on top of the queue and avoid ever increasing queues.
I guess that's an implementation choice but I'd have guessed the scheduler was first looping over idle devices to then schedule the oldest job in the queue for this device type.
But my understanding is that the scheduler rather sets an order when jobs are submitted that isn't temperable with. Is that correct?
The order is priority, submit_time and then target_group.
Changing ordering on-the-fly and backing out from certain states is the subject of https://projects.linaro.org/browse/LAVA-595 - that is the work I've already described as low priority, large scale and complex.
You do have the ability to set the Priority of new test jobs for submission which want to use the laptop in a MultiNode test job along with a device which is NOT the at91rm9200ek. You will need to cancel the test job involving the at91rm9200ek which is currently scheduled. (Other test jobs for the at91rm9200ek in the Queue can be left alone provided that these test jobs have a lower Priority than the jobs you want to run on other devices.)
When the scheduler comes back around, it will find a new test job with higher Priority which wants to use the laptop with a hikey device or whatever and the at91rm9200ek will be ignored. It's not perfect because you would then need to either keep that Priority pipe full or cancel the submitted test jobs for the at91rm9200ek.
> Once a test job has been submitted, it will be either scheduled or > cancelled. >
Yes, that's understood and that makes sense to me. However, for
"normal"
jobs, if you can't find a board of device type X that is available,
it
does not get scheduled, right? Why can't we do the same for MultiNode jobs?
Because the MultiNode job will never find all devices in the right
state at
the same time once there is a mixed queue of single node and MultiNode
jobs.
All devices defined in the MultiNode test job must be available at
exactly
the same time. Once there are single node jobs in the queue, that never happens.
A is running B is Idle MultiNode submitted for A & B single node submitted for A single node submitted for B
scheduler considers the queue - MultiNode cannot start (A is busy), so
move
on and start the single node job on B (because the single node test
job on
B may actually complete before the job on A finishes, so it is
inefficient
to keep B idle when it could be doing useful stuff for another user).
A is running B is running
no actions
A completes and goes to Idle B is still running
and so the pattern continues for as long as there are any single node
test
jobs for either A or B in the queue.
The MultiNode test job never starts because A and B are never Idle at
the
same time until the queue is completely empty (which *never* happens in many instances).
So the scheduler must grab B while it is Idle to prevent the single
node
test job starting. Then when A completes, the scheduler must also grab
A
before that single node test job starts running.
A is running B is Idle MultiNode submitted for A & B single node submitted for A single node submitted for B
B is transitioned to Scheduling and is unavailable for the single node
test
job.
A is running B is scheduling
no actions
A completes and goes to Idle B is scheduling
Scheduler transitions A into scheduling - that test job can now start.
(Now consider MultiNode test jobs covering a dozen devices in an
instance
with a hundred mixed devices and permanent queues of single node test
jobs.)
The scheduler also needs to be very fast, so the actual decisions need
to
be made on quite simple criteria - specifically, without going back to
the
database to find out about what else might be in the queue or trying to second-guess when test jobs might end.
That is understood as well for devices that are idle, running or scheduled to run. The point I was trying to make was why schedule a job for a device that is in Maintenance (what I meant by the poorly chosen "available" word).
Is that because one the job is submitted it's ordered by the scheduler and then run by the scheduler in the given order and the jobs are not discriminated against the device Maintenance status?
The problem you have is not that the device is in Maintenance but that the other device(s) in the MultiNode test job are Idle. Therefore, those jobs get scheduled because there is no reason not to do so.
If the submission was to be rejected when all devices of the requested device-type are in Maintenance, that is a large change which would negatively impact a lot of busy instances.
We do need to clarify these states but essentially, Maintenance is a manual state change which has the same effect as the automatic state change to Bad. That is as far as it goes currently.
> > Now, until I put the at91rm9200ek back in the lab, all my boards are reserved and scheduling for a multi-node job and thus, my lab is basically dead.
>
> The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously, and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.
>
> e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.
>
> This is a structural problem within your lab.
>
> You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
I strongly disagree with your statement. A software problem can often be dealt with by adding more resources but I'm not willing to spend thousands on something that can be fixed on the software side.
We've been through these loops within the team for many years and have millions of test jobs which demonstrate the problems and the fix. I'm afraid you are misunderstanding the problem if you think that there is a software solution for a queue containing both MultiNode and single node test jobs - other than the solution we now use in the LAVA scheduler. The process has been tried and tested over 8 years and millions of test jobs across dozens of mixed use case instances and has proven to be the most efficient use of resources across all those models.
Each test job in a MultiNode test is considered separately - if one or more devices are Idle, then those are immediately put into Scheduling. Only when all are in Scheduling can any of those jobs start. The status of other test jobs in the MultiNode group can only be handled at the point when at least one test job in that MultiNode group is in Scheduling.
I think there is a global misunderstanding due to my bad choice of words. I understand and I'm convinced there are no other ways to deal with what you explained above.
Aside from a non-negligible financial and time (to set up and maintain) effort to buy a board with a stable and reliable NIC for each and every board in my lab, it just isn't our use case.

If I did such a thing, then my network would be the bottleneck for my network tests and I'd have to spend a lot (financially and on time or maintenance) to have a top-notch network infrastructure for tests I don't care if they run one after the other. I also can't have a separate network for each and every board, simply because my boards often have a single Ethernet port, thus I can't separate the test network from the lab network used for, e.g., image downloads that are part of the booting process; hence I can't do reliable network testing even by multiplying "laptop" devices.
I can understand it's not your typical use case at Linaro: you've dozens and dozens of the same board, a huge infrastructure to handle the whole LAVA lab and maybe people working full-time on LAVA, the lab, the boards and the infrastructure. But that's the complete opposite of our use case.
Maybe you can understand ours, where we have only one board of each device type, being part of KernelCI to test and report kernel booting status and having occasional custom tests (like network) on upstream or custom branches/repositories. We sporadically work on the lab, fixing the issues we're seeing with the boards, but that's not what we do for a living.
I do understand and I personally run a lab in much the same way. However, the code needs to work the same way in that lab as it does in the larger labs. It is the local configuration and resource availability which must change to suit.

Of course.
For now, the best thing is to put devices into Retired so that submissions are rejected, and then you will also have to manage your submissions and your queue.
Can't we have a "Maintenance but I don't know when it's coming back so please still submit jobs but do not schedule them" option :D ?
This is exactly what you already have - but it's not actually what you mean.
The problem is that you're thinking of the state of the at91rm9200ek when what matters to the scheduler is the state of the laptop device *in isolation*. The state of the at91rm9200ek only matters AFTER the laptop has been assigned.
What you mean is:
"I don't know when ANY of the devices of device-type A are going to be ready to start a test job, so do not allow ANY OTHER device of ANY OTHER device-type to be scheduled either IF that device-type is listed in a MultiNode test job which ALSO requires a device of this device-type".
(The reason it's ANY is that if 10 test jobs are submitted all wanting at91rm9200ek and laptop, then if you had 10 laptops and 1 at91rm9200ek, those 10 laptops would also go into Scheduled - that is the imbalance we talked about previously.)
It is a cross-relational issue.
Correlating across the MultiNode test job at the scheduling state is likely to have a massive impact on the speed of the scheduler because:
0: a device in Maintenance or Bad is NOT currently available to be scheduled - that's the whole point.
1: the other device(s) in the MultiNode group DO get scheduled because those were in Idle.
2: asking the scheduler to check the state of all devices of all device-types mentioned in a MultiNode job when considering whether to schedule any other device in that same MultiNode job is going to make the scheduler SLOW.
So what we do is let the Idle (laptop) device go into a waiting state and let the scheduler move the at91rm9200ek device into Scheduling *only* when an at91rm9200ek device becomes available for scheduling - as two completely separate decisions. Then, the relational work is done by the database when the lava-master makes the query "which test jobs are available to start NOW". This is much more efficient because we are looking at jobs where all devices in the target_group are in state SCHEDULED. The database can easily exclude test jobs which are in state SUBMITTED (the at91rm9200ek jobs) and a simple check on target_group shows that the MultiNode test job is not ready to start. That can all be done with a single database query using select_related and other ORM niceties.
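As a plain-Python illustration of that final query (in LAVA this is a single Django ORM query; this sketch only mimics the logic, and the dict keys are illustrative):

```python
from collections import defaultdict

SUBMITTED, SCHEDULED = "Submitted", "Scheduled"

def ready_target_groups(jobs):
    """Return the target_groups in which every member job is SCHEDULED -
    the moment a MultiNode group can actually start. A group with any
    job still SUBMITTED (e.g. waiting on a device in Maintenance) is
    simply excluded, with no per-device cross-checking needed."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job["target_group"]].append(job["state"])
    return sorted(group for group, states in groups.items()
                  if all(state == SCHEDULED for state in states))
```

With one half of group "g1" still Submitted (the at91rm9200ek job) and both halves of "g2" Scheduled, only "g2" is reported ready.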
Let's describe this with roles:
role: client device-type: at91rm9200ek
role: server device-type: laptop
If the at91rm9200ek is in Maintenance and there are no other devices of device-type at91rm9200ek in state Idle, then nothing will get scheduled for at91rm9200ek.
However, when a MultiNode test job is submitted for 1 at91rm9200ek and 1 or more "laptop" device(s), then there is no reason to stop scheduling the laptop device in state Idle without scrabbling through the test job definition and working out (again and again, every time the scheduler loops through the queue) which device-types are requested, which devices of those types are available and what to do next.
The problem is NOT the state of the at91rm9200ek - Maintenance or Bad, it makes no difference. The problem for the SCHEDULER is that the laptop device is Idle and requested by a test job with a sufficiently low submit_time (and high enough Priority) that it is first in the queue.
The problem at the SUBMISSION stage is that the only decision available is whether to allow the test job asking for at91rm9200ek & laptop onto the queue or whether to refuse it outright. Currently, a refusal is only implemented if all devices of at least one device-type specified in the test job are in Retired state.
After many, many rounds of testing, test jobs, discussions going on over several years we came to the decision that in your situation - where there is a dire shortage of a resource used by multiple MultiNode test jobs, that the only thing that was safe for the SCHEDULER to do was to allow the Idle device to be scheduled and let the test job wait for resources to become available, either by moving the other device out of Maintenance or providing extra hardware for the Idle device.
We're looking at what the Maintenance state means for MultiNode in https://projects.linaro.org/browse/LAVA-1299 but it is not acceptable to refuse submissions when devices are not Retired. Users have an expectation that devices which are being fixed will come back online at some point - or will go into Retired. There is also https://projects.linaro.org/browse/LAVA-595 but that work has not yet been scoped. It could be a long time before that work starts and it will take months of work once it does start.

I agree.
The problem is a structural one in the physical resources available in your local lab. It is a problem we have faced more than once in our own instances and we have gone down all the various routes until we've come to the current implementation.
We also work actively on the kernel and thus we take boards (which we own only one of) out of the lab to work on them, and then put them back into the lab once we've finished. This is where we use Maintenance mode as, IMHO, Retired does not cover this use case. This "Maintenance" can take seconds, days or months.
For me, you're ignoring an issue that is almost non-existent in your case because you've dealt with it by adding as much resource as you could, to make the probability of it happening close to zero. That does not mean it does not exist. I'm not criticizing the way to deal with it, I'm just saying this way isn't a path we can take personally.

It is an issue which has had months of investigation, discussion and intervention in our use cases. We have spent a very long time going through all of the permutations.

I understand the scheduler is a critical part of the software that has had your attention for a long time and appropriate testing, no doubt.
Then you need to manage the queue on your instance in ways that allow for your situation.
> > Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
>
> Patches would be a bad solution for a structural problem.
>
> As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC, which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? An LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem with the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
We're testing that Gigabit NICs can actually handle Gbps transfers. We need a fully available Gbps NIC for each and every test we do to make the results reliable and consistent.
Then as that resource is limited, you must create a way that only one test job of this kind can ever actually run at a time. That can be done by working at the stage prior to submission or it can be done by changing the device availability such that the submission is rejected. Critically, there must also be a way to prevent jobs entering the queue if one of the device-types is not available. That can be easily determined using the XML-RPC API prior to submission. Once submitted, LAVA must attempt to run the test job as quickly as possible, under the expectation that devices which have not been Retired will become available again within a reasonable amount of time. If that is not the case then those devices should be Retired. (Devices can be brought out of Retired as easily as going in; it doesn't have to be a permanent state, nothing is actually deleted from the device configuration.)

That's a "lot" of complexity to deal with on our side but that's indeed a way to do it. I'd have to make sure only one device has a MultiNode job in queue and monitor it to send the next one.
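A pre-submission gate along those lines could be scripted against the XML-RPC API. This is a sketch: `scheduler.all_devices` is the call I would expect to use, but the field order of each entry, the state names and the example URL are assumptions to verify against your LAVA version:

```python
import xmlrpc.client

def submittable(device_list, required_types):
    """True only if every required device-type has at least one device
    that is not in Maintenance or Retired. `device_list` is assumed to
    be in the shape returned by LAVA's scheduler.all_devices() call:
    [hostname, device_type, state, ...] per entry (check the exact
    field order on your instance)."""
    usable = set()
    for entry in device_list:
        _hostname, device_type, state = entry[0], entry[1], entry[2]
        if state not in ("Maintenance", "Retired"):
            usable.add(device_type)
    return all(dtype in usable for dtype in required_types)

def check_instance(url, required_types):
    # url is hypothetical, e.g. "https://user:token@lava.example.com/RPC2"
    server = xmlrpc.client.ServerProxy(url)
    return submittable(server.scheduler.all_devices(), required_types)
```

The submit script would only push the MultiNode job when `check_instance(...)` returns True, otherwise hold it locally.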
Hum... I'm just wondering. What about a device that was submitted a MultiNode job but got Retired since then?
Well spotted - that is left to the admins to manage. There is a story outstanding to cancel all test jobs submitted, scheduled or running for devices which are transitioned into Retired.
Now I'm wondering what's the difference between Retired and Maintenance except that it does not accept job submission?
The difference only shows if ALL devices of the device-type are in the specified state.
Retired - no submissions allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are currently unchanged if the state changes to Retired.

Maintenance - submissions are allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are unchanged if the state changes to Maintenance.
So the only distinction between Retired and Maintenance at the moment is submissions.
> What you are describing sounds like a misuse of MultiNode resulting in resource starvation, and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags, or relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Neither of those options are possible in our use case.
I understand the MultiNode scheduler is complex and this is low priority. We've modestly contributed to LAVA before; we're not telling you to fix it ASAP but rather asking you to help or guide us to fix this issue in a way that could be accepted in the upstream version of LAVA.
If you still stand strong against a patch, or if it's a lengthy complete rework of the scheduler, could we at least have a way to tell for how long a test has been scheduled (or for how long a board has been reserved for a test that is scheduled)?
That data is already available in the current UI and over the XML-RPC API and REST API.
Check for Queued Jobs and the scheduling state, also the job_details call in XML-RPC. There are a variety of ways of getting the information you require using the existing APIs - which one you use will depend on your preference and current scripting.
Rather than polling on XML-RPC, it would be better for a monitoring process to use ZMQ and the publisher events to get push notifications of changes of state. That lowers the load on the master, depending on how busy the instance actually becomes.
That way we can use an external tool to monitor this and manually cancel them when needed.
Currently, I don't think there is a way to tell since when the job was scheduled.
Every test job has a submit_time field in the database, created at the point where the job is created upon submission.
submit_time isn't really an option if it does what its name suggests, because I can have jobs in the queue for days (multiple network tests for each and every board, and also time-consuming tests (e.g. crypto) that have the same priority).
I'll have a look at what you've offered above, thanks.
Thanks for having taken the time to answer my question,
Quentin
--
Neil Williams
neil.williams@linaro.org http://www.linux.codehelp.co.uk/
Lava-users mailing list Lava-users@lists.linaro.org https://lists.linaro.org/mailman/listinfo/lava-users
-- Rémi Duraffort LAVA Team
Hi Neil,
On Tue, Apr 24, 2018 at 10:10:32AM +0100, Neil Williams wrote:
On 24 April 2018 at 09:08, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
I think there was a global misunderstanding from a poor choice of words. When I was saying "available device" I meant a device that isn't in Maintenance or Retired. If the device is idle, running a job, scheduled to run a job, etc., I consider it "available". Sorry for the misunderstanding.
Currently, "Maintenance" is considered as available for submission & for scheduling. This is to support maximum uptime and minimal disruption to CI loops for temporary work happening on devices.
We are looking at clarifying this soon.
I know this has been a long and complex thread. Thank you for sticking with the discussion, despite the complexity and terminology.
Thanks for taking the time to answer those questions, much appreciated.
On Mon, Apr 23, 2018 at 12:54:03PM +0100, Neil Williams wrote:
On 23 April 2018 at 11:21, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
Thanks for your prompt answer.
On Fri, Apr 20, 2018 at 07:56:29AM +0100, Neil Williams wrote:
On 19 April 2018 at 20:11, Quentin Schulz <quentin.schulz@bootlin.com> wrote:
Hi all,
I've encountered a deadlock in my LAVA server with the following scheme.

I have an at91rm9200ek in my lab that got submitted a lot of multi-node jobs requesting another "board" (a laptop of type dummy-ssh). All of my other boards in the lab have received the same multi-node jobs requesting the same and only laptop.
That is the source of the resource starvation - multiple requirements of a single device. The scheduler needs to be greedy and grab whatever suitable devices it can, as soon as it can, to be able to run MultiNode. The primary ordering of scheduling is the Test Job ID, which is determined at submission.
Why would you order test jobs without knowing if the boards it depends on are available when it's going to be scheduled? What am I missing?
To avoid the situation where a MultiNode job is constantly waiting for all devices to be available at exactly the same time. Instances frequently have long queues of submitted test jobs, a mix of single node and MultiNode. The MultiNode jobs must be able to grab whatever device is available, in order of submit time, and then wait for the other part to be available. Otherwise, all devices would run all single node test jobs in the entire queue before any MultiNode test jobs could start. Many instances constantly have a queue of single node test jobs.
That's understood and expected.
If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the starved resource, there is not much LAVA can do currently. We are looking at a way to reschedule MultiNode test jobs but it is very complex and low priority.

What version of lava-server and lava-dispatcher are you running?
lava-dispatcher 2018.2.post3-1+jessie lava-server 2018.2-1+jessie lava-coordinator 0.1.7-1
(You need to upgrade to Stretch - there will be no fixes or upgrades available for Jessie. All development work must only happen on Stretch. See the lava-announce mailing list archive.)
Thanks, we'll have a look into this.
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization level but also at the lab structure / administrative level.
I have two machines: one acting as LAVA server and one as LAVA slave. The LAVA slave is handling all boards in our lab.

I have one laptop (an actual x86 laptop for which we know the NIC driver works reliably at high (~1Gbps) speeds) that we use for MultiNode jobs (actually requesting the laptop and one board at the same time only) to test the network. This laptop is seen as a board by LAVA; there is nothing LAVA-related on it (it should be seen as a device).
Then you need more LAVA devices to replicate the role played by the laptop. Exactly one device for each MultiNode test job which can be submitted at any one time. Then use device tags to allocate one of the "laptop" devices to each of the other boards involved in the MultiNode test jobs.
Alternatively, you need to manage both the submissions and the device availability.
Think of just the roles played by the devices.
There are N client role devices (not in Retired state) and there are X server role devices where the server role is what the laptop is currently doing.
You need to have N == X to solve the imbalance in the queue.
If N > 1 (and there is more than one device-type in the count 'N') then you also need to use device tags so that each device-type has a dedicated pool of server role devices, where the number of devices in the server role pool exactly matches the number of devices of the device-type using the specified device tag.
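The arithmetic is simple enough to state as code (an illustrative sketch, not anything LAVA provides):

```python
def server_devices_needed(clients_per_type):
    """Total server-role ('laptop') devices needed so that every client
    device can run its MultiNode job simultaneously: one server per
    client device, partitioned into per-device-type pools via device
    tags (pool size == number of client devices of that type)."""
    return sum(clients_per_type.values())
```

For the phone/hikey/panda example later in the thread, one QEMU per client device-type instance is needed, each carrying that type's device tag.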
I had to take the at91rm9200ek out of the lab because it was misbehaving. However, LAVA is still scheduling multi-node jobs on the laptop which requires the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.

A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.
Why is that? The device is explicitly in Maintenance, which IMHO says that the board shouldn't be used.
Not for the scheduler - the scheduler can still accept submissions and queue them up until the device comes out of Maintenance.
This prevents test jobs being rejected during certain kinds of maintenance. (Wider scale maintenance would involve taking down the UI on the master, at which point submissions would get a 404, but that is up to admins to schedule and announce etc.)
This is about uptime for busy instances which frequently get batches of submissions out of operations like cron. The available devices quickly get swamped but the queue needs to continue accepting jobs until admins decide that devices which need work are going to be unavailable for long enough that the resulting queue would be impractical, i.e. when the length of time to wait for the job exceeds the useful window of the results from the job, or when the number of test jobs in the queue exceeds the ability of the available devices to keep on top of the queue and avoid ever increasing queues.
I guess that's an implementation choice but I'd have guessed the scheduler was first looping over idle devices to then schedule the oldest job in the queue for this device type.
But my understanding is that the scheduler instead sets an order when jobs are submitted, an order that can't be tampered with. Is that correct?
The order is priority, submit_time and then target_group.
Changing ordering on-the-fly and backing out from certain states is the subject of https://projects.linaro.org/browse/LAVA-595 - that is the work I've already described as low priority, large scale and complex.
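That ordering can be expressed as a simple sort key (the attribute names here are illustrative, not LAVA's actual model fields):

```python
def queue_order(jobs):
    """Sort a job queue the way described above: highest Priority
    first, then earliest submit_time, then target_group as the final
    tie-breaker."""
    return sorted(jobs, key=lambda j: (-j.priority, j.submit_time, j.target_group))
```

Raising the Priority of new submissions, as suggested below, works precisely because priority sorts ahead of submit_time in this key.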
You do have the ability to set the Priority of new test jobs for submission which want to use the laptop in a MultiNode test job along with a device which is NOT the at91rm9200ek. You will need to cancel the test job involving the at91rm9200ek which is currently scheduled. (Other test jobs for the at91rm9200ek in the Queue can be left alone provided that these test jobs have a lower Priority than the jobs you want to run on other devices.)
When the scheduler comes back around, it will find a new test job with higher Priority which wants to use the laptop with a hikey device or whatever and the at91rm9200ek will be ignored. It's not perfect because you would then need to either keep that Priority pipe full or cancel the submitted test jobs for the at91rm9200ek.
Once a test job has been submitted, it will be either scheduled or cancelled.
Yes, that's understood and that makes sense to me. However, for "normal" jobs, if you can't find a board of device type X that is available, it does not get scheduled, right? Why can't we do the same for MultiNode jobs?

Because the MultiNode job will never find all devices in the right state at the same time once there is a mixed queue of single node and MultiNode jobs.
All devices defined in the MultiNode test job must be available at exactly the same time. Once there are single node jobs in the queue, that never happens.
A is running, B is Idle. MultiNode submitted for A & B, single node submitted for A, single node submitted for B.

scheduler considers the queue - MultiNode cannot start (A is busy), so move on and start the single node job on B (because the single node test job on B may actually complete before the job on A finishes, so it is inefficient to keep B idle when it could be doing useful stuff for another user).

A is running, B is running - no actions.

A completes and goes to Idle, B is still running - and so the pattern continues for as long as there are any single node test jobs for either A or B in the queue.

The MultiNode test job never starts because A and B are never Idle at the same time until the queue is completely empty (which *never* happens in many instances).

So the scheduler must grab B while it is Idle to prevent the single node test job starting. Then when A completes, the scheduler must also grab A before that single node test job starts running.

A is running, B is Idle. MultiNode submitted for A & B, single node submitted for A, single node submitted for B.

B is transitioned to Scheduling and is unavailable for the single node test job.

A is running, B is scheduling - no actions.

A completes and goes to Idle, B is scheduling.

Scheduler transitions A into scheduling - that test job can now start.

(Now consider MultiNode test jobs covering a dozen devices in an instance with a hundred mixed devices and permanent queues of single node test jobs.)
The scheduler also needs to be very fast, so the actual decisions need to be made on quite simple criteria - specifically, without going back to the database to find out about what else might be in the queue or trying to second-guess when test jobs might end.
That is understood as well for devices that are idle, running or scheduled to run. The point I was trying to make was why schedule a job for a device that is in Maintenance (what I meant by the poorly chosen "available" word).
Is that because once the job is submitted it's ordered by the scheduler and then run by the scheduler in the given order, and the jobs are not discriminated against based on the device's Maintenance status?
The problem you have is not that the device is in Maintenance but that the other device(s) in the MultiNode test job are Idle. Therefore, those jobs get scheduled because there is no reason not to do so.
If the submission was to be rejected when all devices of the requested device-type are in Maintenance, that is a large change which would negatively impact a lot of busy instances.
We do need to clarify these states but essentially, Maintenance is a manual state change which has the same effect as the automatic state change to Bad. That is as far as it goes currently.
That's where I was missing the last piece of the puzzle. The scheduler actually only looks at the status of the device it's trying to schedule a job for and not for all the devices part of this job.
In my mind, scheduling a job for an Idle device that requires an other board which is in Maintenance was actually "scheduling a job with a device in Maintenance" which is not what Maintenance was stated to do. There is a slight but important nuance here for MultiNode jobs that isn't obvious.
Now, until I put the at91rm9200ek back in the lab, all my boards are reserved and scheduling for a multi-node job and thus, my lab is basically dead.
The correct fix here is to have enough devices of the device-type of the starved resource such that one of each other device-type can use that resource simultaneously, and then use device-tags to match up groups of devices so that submitting lots of jobs for one type all at the same time does not simply consume all of the available resources.

e.g. four device-types - phone, hikey, qemu and panda. Each multinode job wants a single QEMU with each of the others, so the QEMU type becomes starved, depending on how jobs are submitted. If two hikey-qemu jobs are submitted together, then 1 QEMU gets scheduled, waiting for the hikey to become free after running the first job. If each QEMU has device-tags, then the second hikey-qemu job will wait not only for the hikey but will also wait for the one QEMU which has the hikey device tag. This way, only those jobs would then wait for a QEMU device. There would be three QEMU devices, one with a device tag like "phone", one with "hikey" and one with "panda". If another panda device is added, another QEMU with the "panda" device tag would be required. The number of QEMU devices required is the sum of the number of devices of each other device-type which may be required in a MultiNode test job.

This is a structural problem within your lab.

You would need one "laptop" for each other device-type which can use that device-type in your lab. Then each "laptop" gets a unique device-tag. Each test job for at91rm9200ek must specify that the "laptop" device must have the matching device tag. Each test job for each other device-type uses the matching device-tag for that device-type. We had this problem in the Harston lab for a long time when using V1 and had to implement just such a structure of matched devices and device tags. However, the need for this disappeared when the Harston lab transitioned all devices and test jobs to LAVA V2.
I strongly disagree with your statement. A software problem can often be dealt with by adding more resources but I'm not willing to spend thousands on something that can be fixed on the software side.
We've been through these loops within the team for many years and have millions of test jobs which demonstrate the problems and the fix. I'm afraid you are misunderstanding the problem if you think that there is a software solution for a queue containing both MultiNode and single node test jobs - other than the solution we now use in the LAVA scheduler. The process has been tried and tested over 8 years and millions of test jobs across dozens of mixed use case instances and has proven to be the most efficient use of resources across all those models.
Each test job in a MultiNode test is considered separately - if one or more devices are Idle, then those are immediately put into Scheduling. Only when all are in Scheduling can any of those jobs start. The status of other test jobs in the MultiNode group can only be handled at the point when at least one test job in that MultiNode group is in Scheduling.
I think there is a global misunderstanding due to my bad choice of words. I understand and I'm convinced there are no other ways to deal with what you explained above.
Aside from a non-negligible financial and time (to set up and maintain) effort to buy a board with a stable and reliable NIC for each and every board in my lab, it just isn't our use case.
If I did such a thing, then my network would be the bottleneck to my network tests and I'd have to spend a lot (financially and on time for maintenance) to have a top-notch network infrastructure for tests I don't mind running one after the other. I can't have a separate network for each and every board either, simply because my boards often have a single Ethernet port, so I can't separate the test network from the lab network used for, e.g., image downloads that are part of the booting process. Hence I can't do reliable network testing even by multiplying "laptop" devices.
I can understand it's not your typical use case at Linaro and you've dozens and dozens of the same board and a huge infrastructure to handle the whole LAVA lab and maybe people working full-time on LAVA, the lab, the boards, the infrastructure. But that's the complete opposite of our use case.
Maybe you can understand ours where we have only one board of each device type, being part of KernelCI to test and report kernel booting status and having occasional custom tests (like network) on upstream or custom branches/repositories. We sporadically work on the lab, fixing the issues we're seeing with the boards but that's not what we do for a living.
I do understand and I personally run a lab in much the same way. However, the code needs to work the same way in that lab as it does in the larger labs.
Of course.
It is the local configuration and resource availability which must change to suit.
For now, the best thing is to put devices into Retired so that submissions are rejected and then you will also have to manage your submissions and your queue.
Can't we have a "Maintenance but I don't know when it's coming back so please still submit jobs but do not schedule them" option :D ?
This is exactly what you already have - but it's not actually what you mean.
The problem is that you're thinking of the state of the at91rm9200ek when what matters to the scheduler is the state of the laptop device *in isolation*. The state of the at91rm9200ek only matters AFTER the laptop has been assigned.
What you mean is:
"I don't know when ANY of the devices of device-type A are going to be ready to start a test job, so do not allow ANY OTHER device of ANY OTHER device-type to be scheduled either IF that device-type is listed in a MultiNode test job which ALSO requires a device of this device-type".
(The reason it's ANY is that if 10 test jobs are submitted all wanting at91rm9200ek and laptop, then if you had 10 laptops and 1 at91rm9200ek, those 10 laptops would also go into Scheduled - that is the imbalance we talked about previously.)
That's exactly it.
It is a cross-relational issue.
Correlating across the MultiNode test job at the scheduling state is likely to have a massive impact on the speed of the scheduler because:
0: a device in Maintenance or Bad is NOT currently available to be scheduled - that's the whole point.
1: the other device(s) in the MultiNode group DO get scheduled because those were in Idle.
2: asking the scheduler to check the state of all devices of all device-types mentioned in a MultiNode job when considering whether to schedule any other device in that same MultiNode job is going to make the scheduler SLOW.
So what we do is let the Idle (laptop) device go into a waiting state and let the scheduler move the at91rm9200ek device into scheduling *only* when an at91rm9200ek device becomes available for scheduling - as two completely separate decisions. Then, the relational work is done by the database when the lava-master makes the query "which test jobs are available to start NOW". This is much more efficient because we are looking at jobs where all devices in the target_group are in state SCHEDULED. The database can easily exclude test jobs which are in state SUBMITTED (the at91rm9200ek jobs) and a simple check on target_group shows that the MultiNode test job is not ready to start. That can all be done with a single database query using select_related and other ORM niceties.
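The two-step decision described here (schedule each half independently, then let the database answer "which groups are fully Scheduled") can be sketched with plain Python dicts standing in for the real Django models. The field names (`state`, `target_group`) mirror the thread, but the function itself is a hypothetical illustration, not LAVA's code:

```python
# Sketch of the "which MultiNode groups can start NOW" check, using
# plain dicts instead of the real Django models. The real scheduler
# does this in a single ORM query; names here are illustrative only.

SUBMITTED, SCHEDULED = "Submitted", "Scheduled"

def startable_groups(jobs):
    """Return target_group ids where every job in the group is SCHEDULED.

    `jobs` is a list of dicts with 'id', 'state' and 'target_group' keys.
    A group with any job still SUBMITTED (e.g. waiting for a device in
    Maintenance) is excluded - which is why the Idle 'laptop' half can
    wait without the scheduler having to correlate device states.
    """
    groups = {}
    for job in jobs:
        groups.setdefault(job["target_group"], []).append(job["state"])
    return [g for g, states in groups.items()
            if all(s == SCHEDULED for s in states)]

jobs = [
    {"id": 1, "state": SCHEDULED, "target_group": "g1"},  # laptop half
    {"id": 2, "state": SUBMITTED, "target_group": "g1"},  # at91rm9200ek half
    {"id": 3, "state": SCHEDULED, "target_group": "g2"},
    {"id": 4, "state": SCHEDULED, "target_group": "g2"},
]
print(startable_groups(jobs))  # → ['g2']
```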
Let's describe this with roles:
role: client, device-type: at91rm9200ek
role: server, device-type: laptop
If the at91rm9200ek is in Maintenance and there are no other devices of device-type at91rm9200ek in state Idle, then nothing will get scheduled for at91rm9200ek.
However, when a MultiNode test job is submitted for 1 at91rm9200ek and 1 or more "laptop" device(s), then there is no reason to stop scheduling the laptop device in state Idle without scrabbling through the test job definition and working out (again and again, every time the scheduler loops through the queue) which device-types are requested, which devices of those types are available and what to do next.
The problem is NOT the state of the at91rm9200ek - Maintenance or Bad, it makes no difference. The problem for the SCHEDULER is that the laptop device is Idle and requested by a test job with a sufficiently low submit_time (and high enough Priority) that it is first in the queue.
The problem at the SUBMISSION stage is that the only decision available is whether to allow the test job asking for at91rm9200ek & laptop onto the queue or whether to refuse it outright. Currently, a refusal is only implemented if all devices of at least one device-type specified in the test job are in Retired state.
After many, many rounds of testing, test jobs, discussions going on over several years we came to the decision that in your situation - where there is a dire shortage of a resource used by multiple MultiNode test jobs, that the only thing that was safe for the SCHEDULER to do was to allow the Idle device to be scheduled and let the test job wait for resources to become available, either by moving the other device out of Maintenance or providing extra hardware for the Idle device.
Understood, thanks for the full explanation.
We're looking at what the Maintenance state means for MultiNode in https://projects.linaro.org/browse/LAVA-1299 but it is not acceptable to refuse submissions when devices are not Retired. Users have an expectation that devices which are being fixed will come back online at some point - or will go into Retired.
I agree.
There is also https://projects.linaro.org/browse/LAVA-595 but that work has not yet been scoped. It could be a long time before that work starts and it will take months of work once it does start.
The problem is a structural one in the physical resources available in your local lab. It is a problem we have faced more than once in our own instances and we have gone down all the various routes until we've come to the current implementation.
We also work actively on the kernel and thus, we take boards (which we own only once) out of the lab to work on it and then put it into the lab once we've finished working. This is where we put it in Maintenance mode as, IMHO, Retired does not cover this use case. This "Maintenance" can take seconds, days or months.
For me, you're ignoring an issue that is almost nonexistent in your case because you've dealt with it by adding as much resource as you could, making the probability of it happening close to zero. That does not mean it does not exist. I'm not criticizing the way you deal with it, I'm just saying this way isn't a path we can take personally.
It is an issue which has had months of investigation, discussion and intervention in our use cases. We have spent a very long time going through all of the permutations.
I understand the scheduler is a critical part of the software that has had your attention for a long time and appropriate testing, no doubt.
Then you need to manage the queue on your instance in ways that allow for your situation.
Let me know if I can be of any help debugging this thing or testing a possible fix. I'd have a look at the scheduler but you, obviously knowing the code base way better than I do, might have a quick patch on hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC, which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? An LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem for the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
We're testing that Gigabit NICs can actually handle Gbps transfers. We need a fully available Gbps NIC for each and every test we do to make the results reliable and consistent.
Then as that resource is limited, you must create a way that only one test job of this kind can ever actually run at a time. That can be done by working at the stage prior to submission or it can be done by changing the device availability such that the submission is rejected. Critically, there must also be a way to prevent jobs entering the queue if one of the device-types is not available. That can be easily determined using the XML-RPC API prior to submission. Once submitted, LAVA must attempt to run the test job as quickly as possible, under the expectation that devices which have not been Retired will become available again within a reasonable amount of time. If that is not the case then those devices should be Retired. (Devices can be brought out of Retired as easily as going in; it doesn't have to be a permanent state, nothing is actually deleted from the device configuration.)
That's a "lot" of complexity to deal with on our side but that's indeed a way to do it. I'd have to make sure only one device has a MultiNode job in the queue and monitor it to send the next one.
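The pre-submission gate suggested above (check device availability over XML-RPC before submitting) might look roughly like this. The call name `scheduler.all_devices()` exists in the LAVA XML-RPC API, but the exact tuple layout varies between versions, and the URL and credentials below are placeholders - verify against your instance's API documentation:

```python
# Hold a MultiNode job back unless every device-type it needs has at
# least one device that is not in Maintenance or Retired.

def types_available(devices, wanted_types):
    """devices: iterable of (hostname, device_type, status) tuples."""
    usable = {dtype for _host, dtype, status in devices
              if status not in ("Maintenance", "Retired")}
    return set(wanted_types) <= usable

# Wiring to a live instance would look roughly like (not executed here;
# endpoint and credentials are hypothetical):
#   import xmlrpc.client
#   server = xmlrpc.client.ServerProxy("https://user:token@lava.example.com/RPC2")
#   devices = [tuple(d[:3]) for d in server.scheduler.all_devices()]
#   if not types_available(devices, {"at91rm9200ek", "dummy-ssh"}):
#       print("hold the job back until devices return")

devices = [("laptop01", "dummy-ssh", "Idle"),
           ("at91-01", "at91rm9200ek", "Maintenance")]
print(types_available(devices, {"at91rm9200ek", "dummy-ssh"}))  # → False
```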
Hum... I'm just wondering. What about a device that was submitted a MultiNode job but got Retired since then?
Well spotted - that is left to the admins to manage. There is a story outstanding to cancel all test jobs submitted, scheduled or running for devices which are transitioned into Retired.
That's a way to deal with it.
Now I'm wondering what's the difference between Retired and Maintenance, except that Retired does not accept job submissions?
The difference only shows if ALL devices of the device-type are in the specified state.
Retired - no submissions allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are currently unchanged if the state changes to Retired.
Maintenance - submissions are allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are unchanged if the state changes to Maintenance.
So the only distinction between Retired and Maintenance at the moment is submissions.
Understood.
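The Retired/Maintenance distinction boils down to two flags, which can be captured in a small lookup (semantics as stated in this thread; check the documentation for your LAVA release):

```python
# Effect of each device state, per the explanation above:
# (new submissions accepted?, queued jobs scheduled onto the device?)
EFFECT = {
    "Retired":     (False, False),  # submissions rejected, nothing scheduled
    "Maintenance": (True,  False),  # submissions queue up, nothing scheduled
    "Idle":        (True,  True),   # normal operation
}

# The only difference between Retired and Maintenance is the first flag:
assert EFFECT["Retired"][1] == EFFECT["Maintenance"][1]
assert EFFECT["Retired"][0] != EFFECT["Maintenance"][0]
```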
I understand in the current implementation of the scheduler, it's too costly (or at least assumed to be) to check for the status of all devices of a MultiNode job.
However, could we have something like a timeout for "scheduling" jobs?
e.g. after X minutes/hours after a job has been scheduled for a device, if that job hasn't started running, move it back to the queue, "unschedule" it and retry later. That way, the device isn't stuck forever (well, until a board is put back into Idle mode).
Quentin
What you are describing sounds like a misuse of MultiNode resulting in resource starvation and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags or relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Neither of those options are possible in our use case.
I understand the MultiNode scheduler is complex and low priority. We've modestly contributed to LAVA before, we're not telling you to fix it ASAP but rather to help or guide us to fix this issue in a way it could be accepted in the upstream version of LAVA.
If you still stand strong against a patch or if it's a lengthy complete rework of the scheduler, could we at least have a way to tell for how long a test has been scheduled (or for how long a board has been reserved for a test that is scheduled)?
That data is already available in the current UI and over the XML-RPC API and REST API.
Check for Queued Jobs and the scheduling state, also the job_details call in XML-RPC. There are a variety of ways of getting the information you require using the existing APIs - which one you use will depend on your preference and current scripting.
Rather than polling on XML-RPC, it would be better for a monitoring process to use ZMQ and the publisher events to get push notifications of change of state. That lowers the load on the master, depending on how busy the instance actually becomes.
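A monitoring process along the lines suggested above could look like the sketch below: feed it state-change events (e.g. parsed from lava-publisher's ZMQ stream) and it flags jobs stuck in Scheduled. The event fields (`job`, `state`) and the ZMQ wiring in the trailing comment are assumptions - check the event format documented for your LAVA release:

```python
# Watchdog fed by state-change events: remembers when each job entered
# the Scheduled state and flags jobs that have sat there too long.
class ScheduledWatchdog:
    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.scheduled_since = {}  # job id -> timestamp it entered Scheduled

    def on_event(self, payload, now):
        job, state = payload.get("job"), payload.get("state")
        if state == "Scheduled":
            # keep the FIRST time we saw the job in Scheduled
            self.scheduled_since.setdefault(job, now)
        else:
            # job moved on (Running, Canceled, ...) - stop tracking it
            self.scheduled_since.pop(job, None)

    def stuck_jobs(self, now):
        return sorted(j for j, t in self.scheduled_since.items()
                      if now - t > self.max_age)

# ZMQ wiring would look roughly like (untested; endpoint hypothetical):
#   import json, time, zmq
#   sock = zmq.Context().socket(zmq.SUB)
#   sock.connect("tcp://lava.example.com:5500")
#   sock.setsockopt_string(zmq.SUBSCRIBE, "")
#   watchdog = ScheduledWatchdog(max_age_seconds=3600)
#   while True:
#       topic, uuid, dt, user, data = sock.recv_multipart()
#       watchdog.on_event(json.loads(data), time.time())
```

The `stuck_jobs()` output could then drive an external canceller, which is exactly the "external tool" use discussed next.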
That way we can use an external tool to monitor this and manually cancel them when needed. Currently, I don't think there is a way to tell since when the job was scheduled.
Every test job has a database object of submit_time created at the point where the job is created upon submission.
Submit_time isn't really an option if it's telling what its name suggests, because I can have jobs in the queue for days. (Multiple network tests for each and every board and also time-consuming tests (e.g. crypto) that have the same priority.)
I'll have a look at what you've offered above, thanks.
Thanks for having taken the time to answer my question,
Quentin
--
Neil Williams
neil.williams@linaro.org
http://www.linux.codehelp.co.uk/
On 25 April 2018 at 11:30, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
On Tue, Apr 24, 2018 at 10:10:32AM +0100, Neil Williams wrote:
On 24 April 2018 at 09:08, Quentin Schulz quentin.schulz@bootlin.com wrote:
Hi Neil,
I think there was a global misunderstanding from a poor choice of words. When I was saying "available device" I meant a device that isn't in Maintenance or Retired. If the device is idle, running a job, scheduled to run a job, etc., I consider it "available". Sorry for the misunderstanding.
Currently, "Maintenance" is considered as available for submission & for scheduling. This is to support maximum uptime and minimal disruption to CI loops for temporary work happening on devices.
We are looking at clarifying this soon.
I know this has been a long and complex thread. Thank you for sticking with the discussion, despite the complexity and terminology.
Thanks for taking the time to answer those questions, much appreciated.
On Mon, Apr 23, 2018 at 12:54:03PM +0100, Neil Williams wrote:
On 23 April 2018 at 11:21, Quentin Schulz <quentin.schulz@bootlin.com> wrote:
Hi Neil,
Thanks for your prompt answer.
On Fri, Apr 20, 2018 at 07:56:29AM +0100, Neil Williams wrote:
On 19 April 2018 at 20:11, Quentin Schulz <quentin.schulz@bootlin.com> wrote:
> Hi all,
>
> I've encountered a deadlock in my LAVA server with the following scheme.
>
> I have an at91rm9200ek in my lab that got submitted a lot of multi-node jobs requesting another "board" (a laptop of type dummy-ssh).
> All of my other boards in the lab have received the same multi-node jobs requesting the same and only laptop.
That is the source of the resource starvation - multiple requirements of a single device. The scheduler needs to be greedy and grab whatever suitable devices it can as soon as it can to be able to run MultiNode. The primary ordering of scheduling is the Test Job ID which is determined at submission.
Why would you order test jobs without knowing if the boards it depends on are available when it's going to be scheduled? What am I missing?
To avoid the situation where a MultiNode job is constantly waiting for all devices to be available at exactly the same time. Instances frequently have long queues of submitted test jobs, a mix of single node and MultiNode. The MultiNode jobs must be able to grab whatever device is available, in order of submit time, and then wait for the other part to be available. Otherwise, all devices would run all single node test jobs in the entire queue before any MultiNode test jobs could start. Many instances constantly have a queue of single node test jobs.
That's understood and expected.
If you have an imbalance between the number of machines which can be available and then submit MultiNode jobs which all rely on the starved resource, there is not much LAVA can do currently. We are looking at a way to reschedule MultiNode test jobs but it is very complex and low priority.
What version of lava-server and lava-dispatcher are you running?
lava-dispatcher 2018.2.post3-1+jessie
lava-server 2018.2-1+jessie
lava-coordinator 0.1.7-1
(You need to upgrade to Stretch - there will be no fixes or upgrades available for Jessie. All development work must only happen on Stretch. See the lava-announce mailing list archive.)
Thanks, we'll have a look into this.
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization level but also at the lab structure / administrative level.
I have two machines. One acting as LAVA server and one as LAVA slave. The LAVA slave is handling all boards in our lab.
I have one laptop (an actual x86 laptop for which we know the NIC driver works reliably at high (~1Gbps) speeds) that we use for MultiNode jobs (actually requesting the laptop and one board at the same time only) to test network. This laptop is seen as a board by LAVA; there is nothing LAVA-related on this laptop (it should be seen as a device).
Then you need more LAVA devices to replicate the role played by the laptop. Exactly one device for each MultiNode test job which can be submitted at any one time. Then use device tags to allocate one of the "laptop" devices to each of the other boards involved in the MultiNode test jobs.
Alternatively, you need to manage both the submissions and the device availability.
Think of just the roles played by the devices.
There are N client role devices (not in Retired state) and there are X server role devices, where the server role is what the laptop is currently doing.
You need to have N == X to solve the imbalance in the queue.
If N > 1 (and there are more than one device-type in the count 'N') then you also need to use device tags so that each device-type has a dedicated pool of server role devices, where the number of devices in the server role pool exactly matches the number of devices of the device-type using the specified device tag.
> I had to take the at91rm9200ek out of the lab because it was misbehaving.
>
> However, LAVA is still scheduling multi-node jobs on the laptop which requires the at91rm9200ek as the other part of the job, while its status is clearly Maintenance.
A device in Maintenance is still available for scheduling - only Retired is excluded - test jobs submitted to a Retired device are rejected.
Why is that? The device is explicitly in Maintenance, which IMHO tells that the board shouldn't be used.
Not for the scheduler - the scheduler can still accept submissions and queue them up until the device comes out of Maintenance.
This prevents test jobs being rejected during certain kinds of maintenance. (Wider scale maintenance would involve taking down the UI on the master, at which point submissions would get a 404, but that is up to admins to schedule and announce etc.)
This is about uptime for busy instances which frequently get batches of submissions out of operations like cron. The available devices quickly get swamped but the queue needs to continue accepting jobs until admins decide that devices which need work are going to be unavailable for long enough that the resulting queue would be impractical. i.e. when the length of time to wait for the job exceeds the useful window of the results from the job, or when the number of test jobs in the queue exceeds the ability of the available devices to keep on top of the queue and avoid ever increasing queues.
I guess that's an implementation choice but I'd have guessed the scheduler was first looping over idle devices to then schedule the oldest job in the queue for this device type.
But my understanding is that the scheduler rather sets an order when jobs are submitted that can't be tampered with. Is that correct?
The order is priority, submit_time and then target_group.
Changing ordering on-the-fly and backing out from certain states is the subject of https://projects.linaro.org/browse/LAVA-595 - that is the work I've already described as low priority, large scale and complex.
You do have the ability to set the Priority of new test jobs for submission which want to use the laptop in a MultiNode test job along with a device which is NOT the at91rm9200ek. You will need to cancel the test job involving the at91rm9200ek which is currently scheduled. (Other test jobs for the at91rm9200ek in the Queue can be left alone provided that these test jobs have a lower Priority than the jobs you want to run on other devices.)
When the scheduler comes back around, it will find a new test job with higher Priority which wants to use the laptop with a hikey device or whatever, and the at91rm9200ek will be ignored. It's not perfect because you would then need to either keep that Priority pipe full or cancel the submitted test jobs for the at91rm9200ek.
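The queue ordering mentioned earlier in the thread (Priority first, then submit_time, then target_group) can be expressed as a simple sort key. This is a sketch with invented field names, not the scheduler's actual code:

```python
# Queue ordering per the thread: higher Priority first, then earlier
# submit_time, then target_group (None for single-node jobs).
def queue_order(jobs):
    return sorted(jobs, key=lambda j: (-j["priority"],
                                       j["submit_time"],
                                       j["target_group"] or ""))

queue = [
    {"id": 1, "priority": 50, "submit_time": 10, "target_group": None},
    {"id": 2, "priority": 90, "submit_time": 20, "target_group": "g1"},
    {"id": 3, "priority": 50, "submit_time": 5,  "target_group": None},
]
print([j["id"] for j in queue_order(queue)])  # → [2, 3, 1]
```

This also shows why bumping Priority lets a later submission jump ahead of the stuck at91rm9200ek jobs: job 2 was submitted last but sorts first.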
Once a test job has been submitted, it will be either scheduled or cancelled.
Yes, that's understood and that makes sense to me. However, for "normal" jobs, if you can't find a board of device type X that is available, it does not get scheduled, right? Why can't we do the same for MultiNode jobs?
Because the MultiNode job will never find all devices in the right state at the same time once there is a mixed queue of single node and MultiNode jobs.
All devices defined in the MultiNode test job must be available at exactly the same time. Once there are single node jobs in the queue, that never happens.
A is running
B is Idle
MultiNode submitted for A & B
single node submitted for A
single node submitted for B
scheduler considers the queue - MultiNode cannot start (A is busy), so move on and start the single node job on B (because the single node test job on B may actually complete before the job on A finishes, so it is inefficient to keep B idle when it could be doing useful stuff for another user).
A is running
B is running
no actions
A completes and goes to Idle
B is still running
and so the pattern continues for as long as there are any single node test jobs for either A or B in the queue.
The MultiNode test job never starts because A and B are never Idle at the same time until the queue is completely empty (which *never* happens in many instances).
So the scheduler must grab B while it is Idle to prevent the single node test job starting. Then when A completes, the scheduler must also grab A before that single node test job starts running.
A is running
B is Idle
MultiNode submitted for A & B
single node submitted for A
single node submitted for B
B is transitioned to Scheduling and is unavailable for the single node test job.
A is running
B is scheduling
no actions
A completes and goes to Idle
B is scheduling
Scheduler transitions A into scheduling - that test job can now start.
(Now consider MultiNode test jobs covering a dozen devices in an instance with a hundred mixed devices and permanent queues of single node test jobs.)
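The A/B walkthrough above can be reproduced in a toy simulation (purely illustrative; the job lengths and tick counts are invented). With an endless supply of single-node jobs, the non-greedy policy never finds A and B Idle at the same time, while the greedy policy reserves B and starts the MultiNode job as soon as A frees up:

```python
def simulate(greedy, ticks=30, job_len=3):
    """Return the tick at which the MultiNode job (needing A and B)
    starts, or None if it starves. A starts busy until tick 2, B Idle;
    behind the MultiNode job is an endless queue of single-node jobs."""
    busy_until = {"A": 2, "B": 0}
    reserved = set()  # devices held in Scheduling for the MultiNode job
    for t in range(ticks):
        free = [d for d in ("A", "B")
                if t >= busy_until[d] and d not in reserved]
        if greedy:
            reserved.update(free)            # grab whatever is Idle now
            if reserved == {"A", "B"}:
                return t                     # all halves Scheduled: start
        else:
            if set(free) == {"A", "B"}:
                return t                     # both Idle at once: start
            for d in free:                   # MultiNode can't start, so a
                busy_until[d] = t + job_len  # single-node job takes the device
    return None

print(simulate(greedy=False), simulate(greedy=True))  # → None 2
```

The non-greedy run starves because A and B free up on interleaved ticks and a single-node job always wins the Idle device; the greedy run holds B from tick 0 and starts the group at tick 2.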
The scheduler also needs to be very fast, so the actual decisions need to be made on quite simple criteria - specifically, without going back to the database to find out about what else might be in the queue or trying to second-guess when test jobs might end.
That is understood as well for devices that are idle, running or scheduled to run. The point I was trying to make was why schedule a job for a device that is in Maintenance (what I meant by the poorly chosen "available" word).
Is that because once the job is submitted it's ordered by the scheduler and then run by the scheduler in the given order, and the jobs are not discriminated against the device's Maintenance status?
The problem you have is not that the device is in Maintenance but that the other device(s) in the MultiNode test job are Idle. Therefore, those jobs get scheduled because there is no reason not to do so.
If the submission was to be rejected when all devices of the requested device-type are in Maintenance, that is a large change which would negatively impact a lot of busy instances.
We do need to clarify these states but essentially, Maintenance is a manual state change which has the same effect as the automatic state change to Bad. That is as far as it goes currently.
That's where I was missing the last piece of the puzzle. The scheduler actually only looks at the status of the device it's trying to schedule a job for and not for all the devices part of this job.
In my mind, scheduling a job for an Idle device that requires an other board which is in Maintenance was actually "scheduling a job with a device in Maintenance" which is not what Maintenance was stated to do. There is a slight but important nuance here for MultiNode jobs that isn't obvious.
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can
use
that
device-type in your lab. Then each "laptop" gets unique a
device-tag
.
Each
test job for at91rm9200ek must specify that the "laptop" device
must
have
the matching device tag. Each test job for each other device-type
uses
the
matching device-tag for that device-type. We had this problem in
the
Harston lab for a long time when using V1 and had to implement
just
such
a
structure of matched devices and device tags. However, the need
for
this
disappeared when the Harston lab transitioned all devices and
test
jobs
to
LAVA V2.
I strongly disagree with your statement. A software problem can
often
be
dealt with by adding more resources but I'm not willing to spend thousands on something that can be fixed on the software side.
We've been through these loops within the team for many years and
have
millions of test jobs which demonstrate the problems and the fix. I'm afraid you are misunderstanding the problem if you think that there
is a
software solution for a queue containing both MultiNode and single
node
test jobs - other than the solution we now use in the LAVA
scheduler. The
process has been tried and tested over 8 years and millions of test
jobs
across dozens of mixed use case instances and has proven to be the
most
efficient use of resources across all those models.
Each test job in a MultiNode test is considered separately - if one
or
more
devices are Idle, then those are immediately put into Scheduling.
Only
when
all are in Scheduling can any of those jobs start. The status of
other
test
jobs in the MultiNode group can only be handled at the point when at
least
one test job in that MultiNode group is in Scheduling.
I think there is a global misunderstanding due to my bad choice of words. I understand and I'm convinced there are no other ways to deal with what you explained above.
Aside from a non-negligeable financial and time (to setup and
maintain)
effort to buy a board with a stable and reliable NIC for each and
every
board in my lab, it just isn't our use case.
If I would do such a thing, then my network would be the
bottleneck to
my network tests and I'd have to spend a lot (financially and on
time
or
maintenance) to have a top notch network infrastructure for tests I don't care if they run one after the other. I can't have a separate network for each and every board as well, simply because my boards
often
have a single Ethernet port, thus I can't separate the test network
from
the lab network for, e.g. images downloading that are part of the booting process, hence I can't do reliable network testing even by multiplying "laptop" devices.
I can understand it's not your typical use case at Linaro and
you've
dozens and dozens of the same board and a huge infrastructure to
handle
the whole LAVA lab and maybe people working full-time on LAVA, the
lab,
the boards, the infrastructure. But that's the complete opposite
of our
use case.
Maybe you can understand ours where we have only one board of each device type, being part of KernelCI to test and report kernel
booting
status and having occasional custom tests (like network) on
upstream or
custom branches/repositories. We sporadically work on the lab,
fixing
the issues we're seeing with the boards but that's not what we do
for a
living.
I do understand and I personally run a lab in much the same way.
However,
the code needs to work the same way in that lab as it does in the
larger
Of course.
labs. It is the local configuration and resource availability which
must
change to suit.
For now, the best thing is to put devices into Retired so that
submissions
are rejected and then you will also have to manage your submissions
and
your queue.
Can't we have a "Maintenance but I don't know when it's coming back so please still submit jobs but do not schedule them" option :D ?
This is exactly what you already have - but it's not actually what you
mean.
The problem is that you're thinking of the state of the at91rm9200ek when what matters to the scheduler is the state of the laptop device *in isolation*. The state of the at91rm9200ek only matters AFTER the laptop
has
been assigned.
What you mean is:
"I don't know when ANY of the devices of device-type A are going to be ready to start a test job, so do not allow ANY OTHER device of ANY OTHER device-type to be scheduled either IF that device-type is listed in a MultiNode test job which ALSO requires a device of this device-type".
(The reason it's ANY is that if 10 test jobs are submitted all wanting at91rm9200ek and laptop, then if you had 10 laptops and 1 at91rm9200ek, those 10
laptops
would also go into Scheduled - that is the imbalance we talked about previously.)
That's exactly it.
It is a cross-relational issue.
Correlating across the MultiNode test job at the scheduling state is
likely
to have a massive impact on the speed of the scheduler because:
0: a device in Maintenance or Bad is NOT currently available to be scheduled - that's the whole point. 1: the other device(s) in the MultiNode group DO get scheduled because those were in Idle 2: asking the scheduler to check the state of all devices of all device-types mentioned in a MultiNode job when considering whether to schedule any other device in that same MultiNode job is going to make the scheduler SLOW.
So what we do is let the Idle (laptop) device go into a waiting state and let the scheduler move the at91rm9200ek device into scheduling *only* when an at91rm9200ek device becomes available for scheduling - as two completely separate decisions. Then, the relational work is done by the database when the lava-master makes the query "which test jobs are available to start NOW". This is much more efficient because we are looking at jobs where all devices in the target_group are in state SCHEDULED. The database can easily exclude test jobs which are in state SUBMITTED (the at91rm9200ek jobs), and a simple check on target_group shows that the MultiNode test job is not ready to start. That can all be done with a single database query using select_related and other ORM niceties.
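As an illustration of that single-query decision, here is a hedged sketch in plain Python (not the actual LAVA ORM code; field and state names are simplified) of the "all jobs in the target_group are SCHEDULED" check:

```python
# Hypothetical sketch of the decision the database makes for the
# lava-master: a MultiNode group is ready to start only when every job
# in its target_group is SCHEDULED. Names are simplified, not the real
# LAVA schema.
SCHEDULED, SUBMITTED = "SCHEDULED", "SUBMITTED"

def ready_target_groups(jobs):
    """jobs: dicts with 'target_group' and 'state' keys."""
    groups = {}
    for job in jobs:
        groups.setdefault(job["target_group"], []).append(job["state"])
    return [group for group, states in groups.items()
            if all(state == SCHEDULED for state in states)]

jobs = [
    {"target_group": "g1", "state": SCHEDULED},  # laptop half, assigned
    {"target_group": "g1", "state": SUBMITTED},  # at91rm9200ek half, waiting
    {"target_group": "g2", "state": SCHEDULED},
    {"target_group": "g2", "state": SCHEDULED},
]
print(ready_target_groups(jobs))  # only g2 is ready to start
```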
Let's describe this with roles:
role: client - device-type: at91rm9200ek
role: server - device-type: laptop
If the at91rm9200ek is in Maintenance and there are no other devices of device-type at91rm9200ek in state Idle, then nothing will get scheduled for at91rm9200ek.
However, when a MultiNode test job is submitted for 1 at91rm9200ek and 1 or more "laptop" device(s), then there is no reason to stop scheduling the laptop device in state Idle without scrabbling through the test job definition and working out (again and again, every time the scheduler loops through the queue) which device-types are requested, which devices of those types are available and what to do next.
The problem is NOT the state of the at91rm9200ek - Maintenance or Bad, it makes no difference. The problem for the SCHEDULER is that the laptop device is Idle and requested by a test job with a sufficiently low submit_time (and high enough Priority) that it is first in the queue.
The problem at the SUBMISSION stage is that the only decision available is whether to allow the test job asking for at91rm9200ek & laptop onto the queue or whether to refuse it outright. Currently, a refusal is only implemented if all devices of at least one device-type specified in the test job are in Retired state.
After many, many rounds of testing, test jobs and discussions going on over several years, we came to the decision that in your situation - where there is a dire shortage of a resource used by multiple MultiNode test jobs - the only thing that was safe for the SCHEDULER to do was to allow the Idle device to be scheduled and let the test job wait for resources to become available, either by moving the other device out of Maintenance or by providing extra hardware for the Idle device.
Understood, thanks for the full explanation.
We're looking at what the Maintenance state means for MultiNode in https://projects.linaro.org/browse/LAVA-1299 but it is not acceptable to refuse submissions when devices are not Retired. Users have an expectation that devices which are being fixed will come back online at some point - or will go into Retired.
I agree.
There is also https://projects.linaro.org/browse/LAVA-595 but that work has not yet been scoped. It could be a long time before that work starts and it will take months of work once it does start.
The problem is a structural one in the physical resources available in your local lab. It is a problem we have faced more than once in our own instances and we have gone down all the various routes until we've come to the current implementation.
We also work actively on the kernel and thus we take boards (of which we own only one) out of the lab to work on them, then put them back into the lab once we've finished. This is where we put them in Maintenance mode as, IMHO, Retired does not cover this use case. This "Maintenance" can take seconds, days or months.
For me, you're ignoring an issue that is almost nonexistent in your case
It is an issue which has had months of investigation, discussion and intervention in our use cases. We have spent a very long time going through all of the permutations.
I understand the scheduler is a critical part of the software that had your attention for a long time and appropriate testing, no doubt.
because you've dealt with it by adding as much resource as you could to make the probability of it happening close to zero. That does not mean it does not exist. I'm not criticizing that way of dealing with it, I'm just saying it isn't a path we can take personally.
Then you need to manage the queue on your instance in ways that allow for your situation.
> Let me know if I can be of any help debugging this thing or testing a
> possible fix. I'd have a look at the scheduler but you, obviously
> knowing the code base way better than I do, might have a quick patch
> on hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type device in the first place? Can the test jobs be reconfigured to use LXC, which does not use MultiNode? What is the "laptop" device-type doing that cannot be done in an LXC? An LXC is created on-the-fly, one for each device, when the test job requests one. This solved the resource starvation problem with the majority of MultiNode issues because the work previously done in the generic QEMU / "laptop" role can just as easily be done in an LXC.
We're testing that Gigabit NICs can actually handle Gbps transfers. We need a fully available Gbps NIC for each and every test we do to make the results reliable and consistent.
Then as that resource is limited, you must create a way that only one test job of this kind can ever actually run at a time. That can be done by working at the stage prior to submission, or it can be done by changing the device availability such that the submission is rejected. Critically, there must also be a way to prevent jobs entering the queue if one of the device-types is not available. That can be easily determined using the XML-RPC API prior to submission.
That's a "lot" of complexity to deal with on our side but that's indeed a way to do it. I'd have to make sure only one device has a MultiNode job in the queue and monitor it to send the next one.
Once submitted, LAVA must attempt to run the test job as quickly as possible, under the expectation that devices which have not been Retired will become available again within a reasonable amount of time. If that is not the case then those devices should be Retired. (Devices can be brought out of Retired as easily as going in; it doesn't have to be a permanent state, nothing is actually deleted from the device configuration.)
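Such a pre-submission gate could be sketched as follows. This is a hedged example: scheduler.all_devices() exists in LAVA's XML-RPC API, but the exact row layout and the state strings are assumptions to verify against your instance, and the URL is an example.

```python
# Hedged sketch of a pre-submission gate over LAVA's XML-RPC API.
# The (hostname, device_type, state, ...) row shape and the state
# strings are assumptions to check against your instance.
import xmlrpc.client

def has_available_device(rows, device_type,
                         available=("Idle", "Running", "Reserved")):
    """rows: iterables shaped like (hostname, device_type, state, ...)."""
    return any(row[1] == device_type and row[2] in available for row in rows)

def safe_to_submit(server_url, required_types):
    """Only submit a MultiNode job when every required device-type has
    at least one device that is not Retired/Maintenance."""
    server = xmlrpc.client.ServerProxy(server_url)
    rows = server.scheduler.all_devices()
    return all(has_available_device(rows, t) for t in required_types)

# e.g. safe_to_submit("http://lava.example.com/RPC2",
#                     ["at91rm9200ek", "dummy-ssh"])
```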
Hum... I'm just wondering. What about a device that had a MultiNode job submitted for it but has been Retired since then?
Well spotted - that is left to the admins to manage. There is a story outstanding to cancel all test jobs submitted, scheduled or running for devices which are transitioned into Retired.
That's a way to deal with it.
Now I'm wondering: what's the difference between Retired and Maintenance, other than that Retired does not accept job submissions?
The difference only shows if ALL devices of the device-type are in the specified state.
Retired - no submissions allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are currently unchanged if the state changes to Retired.
Maintenance - submissions are allowed. No test jobs in the queue will be scheduled. Running test jobs will be left to complete. Submitted jobs are unchanged if the state changes to Maintenance.
So the only distinction between Retired and Maintenance at the moment is submissions.
Understood.
I understand in the current implementation of the scheduler, it's too costly (or at least assumed to be) to check for the status of all devices of a MultiNode job.
However, could we have something like a timeout for "scheduling" jobs?
e.g. after X minutes/hours after a job has been scheduled for a device, if that job hasn't started running, move it back to the queue, "unschedule" it and retry later. That way, the device isn't stuck forever (well, until a board is put back into Idle).
We have considered that. It would have to be part of a wider change because only the test writer of the *running* test job or the admin of the device in Maintenance is going to be able to determine how long is "enough". (Depending on whether the latency is due to the other device(s) being in state Running or state Maintenance respectively.) We have many jobs in our labs which (usefully) run for more than a day, which is unusual compared to a lot of labs feeding data to KernelCI.
It also means adding data to the objects in the Queue that the data should be ignored for a configurable time, because once it's back on the queue, it is at or very close to the start of the queue and the scheduler would likely just put it straight back into the state it just left. The scheduling timeout may even need to take into account the total size of the Queue so that the scheduling process itself does not get bogged down. It's a very tight loop currently and needs to stay fast.
It all gets very complex, very quickly and that's on top of the current complexity. (Don't forget, we have to check the VLANd criteria as well at this one point of assigning devices. MultiNode test jobs are not always limited to 2 devices either, ten or more is possible and labs can easily have 50 device types and over 100 devices.)
All of this only applies when there is no practical / affordable way to solve the resource problem by adding more devices.
So far, that has been the working solution for other instances of this issue.
Overall, there are so many factors here that it may just turn out to be something which is best managed by humans. The current workarounds aren't pretty and could be made smoother. (At each point, the settings need to be applied to all devices of the relevant device-type which are currently "available" at the start of the process - so that information needs to be retained and reinstated at the end.)
0: Set Retired to exclude the device(s) from having more jobs added to the Queue and exclude existing MultiNode jobs from being scheduled on the other devices in the group.
1: Submit test jobs to use the Idle device(s) with higher Priority than the jobs waiting for the Retired device(s).
2: Cancel any scheduled jobs for the Retired device(s) to free up the other device in the MultiNode group.
3: Return the Retired device(s) to Unknown when it is ready to start running test jobs again.
As an option:
2a: Set the Retired device to restricted submissions (like the admin group) temporarily to ensure that when it does come back to Idle it doesn't mangle the test jobs in the Queue through simple admin errors. Set the owner of the device(s) and clear the Public checkbox in the Django admin interface. The lava-server manage CLI can be used for this too.
2b: Once admin test jobs have run successfully, restore the Public setting to run the jobs in the Queue and ensure the health is set to Good (as presumably the admin will at least have run a health check).
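Step 2 of the workaround above can be scripted. This is a hedged sketch: scheduler.cancel_job exists in LAVA's XML-RPC API, but the (job_id, device_type) listing shape used here is an assumption, and the URL is an example.

```python
# Hedged sketch of step 2: cancel queued/scheduled jobs that wait on a
# Retired device-type so the other half of each MultiNode group is
# freed. The (job_id, requested_device_type) pairs are an assumed
# shape; build them from your instance's queue listing.
import xmlrpc.client

def jobs_to_cancel(queued, retired_type):
    """queued: (job_id, requested_device_type) pairs."""
    return [job_id for job_id, dtype in queued if dtype == retired_type]

def cancel_for_retired(server_url, queued, retired_type):
    server = xmlrpc.client.ServerProxy(server_url)
    for job_id in jobs_to_cancel(queued, retired_type):
        server.scheduler.cancel_job(job_id)
```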
Quentin
What you are describing sounds like a misuse of MultiNode resulting in resource starvation, and the fix is to have enough of the limited resource to prevent starvation - either by adding hardware and changing the current test jobs to use device-tags, or by relocating the work done on the starved resource into an LXC so that every device can have a dedicated "container" to do things which cannot be easily done on the device.
Neither of those options is possible in our use case.
I understand the MultiNode scheduler is complex and low priority. We've modestly contributed to LAVA before; we're not telling you to fix it ASAP but rather asking you to help or guide us to fix this issue in a way that could be accepted in the upstream version of LAVA.
If you still stand strong against a patch, or if it's a lengthy complete rework of the scheduler, could we at least have a way to tell for how long a test has been scheduled (or for how long a board has been reserved for a test that is scheduled)?
That data is already available in the current UI and over the XML-RPC API and REST API. Check for Queued Jobs and the scheduling state, and also the job_details call in XML-RPC. There are a variety of ways of getting the information you require using the existing APIs - which one you use will depend on your preference and current scripting.
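One way this could look over XML-RPC, as a hedged sketch: scheduler.job_details exists in the API, but the field names in the returned mapping used here ('id', 'state') are assumptions to check against your instance.

```python
# Hedged sketch: gather scheduling-state information over XML-RPC.
# The 'id'/'state' field names in the job details are assumptions.
import xmlrpc.client

def filter_scheduled(details_list):
    """Keep only the jobs sitting in the Scheduled state."""
    return [d["id"] for d in details_list if d.get("state") == "Scheduled"]

def scheduled_jobs(server_url, job_ids):
    server = xmlrpc.client.ServerProxy(server_url)
    return filter_scheduled(server.scheduler.job_details(j) for j in job_ids)
```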
Rather than polling on XML-RPC, it would be better for a monitoring process to use ZMQ and the publisher events to get push notifications of changes of state. That lowers the load on the master, depending on how busy the instance actually becomes.
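A push-based monitor of that kind could be sketched as below. This is hedged: the endpoint and the event payload fields ('state', 'scheduled_at') are assumptions; check the event socket and payload format configured on your master.

```python
# Hedged sketch of a push-based monitor using the ZMQ event publisher.
# Endpoint and payload field names are assumptions, not LAVA's
# documented format.
import json

def is_stuck_scheduled(event, now, max_scheduled_seconds=3600):
    """True when a job event shows the job sitting in Scheduled for
    longer than we tolerate. 'scheduled_at' is epoch seconds."""
    return (event.get("state") == "Scheduled"
            and now - event.get("scheduled_at", now) > max_scheduled_seconds)

def monitor(endpoint="tcp://lava.example.com:5500"):
    import zmq  # pyzmq, only needed when actually subscribing
    sub = zmq.Context().socket(zmq.SUB)
    sub.connect(endpoint)
    sub.setsockopt_string(zmq.SUBSCRIBE, "")  # all topics
    while True:
        parts = sub.recv_multipart()
        payload = json.loads(parts[-1].decode())
        # act on the payload, e.g. cancel via XML-RPC when
        # is_stuck_scheduled(payload, time.time()) is True
```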
That way we can use an external tool to monitor this and manually cancel them when needed.
Currently, I don't think there is a way to tell since when the job was scheduled.
Every test job has a submit_time field in the database, created at the point where the job is created upon submission.
Submit_time isn't really an option if it means what its name suggests, because I can have jobs in the queue for days (multiple network tests for each and every board, and also time-consuming tests (e.g. crypto) that have the same priority).
I'll have a look at what you've offered above, thanks.
Thanks for having taken the time to answer my question,
Quentin
--
Neil Williams
neil.williams@linaro.org http://www.linux.codehelp.co.uk/
lava-users@lists.lavasoftware.org