Re: [Lava-users] Multinode pipeline synchronisation / timeouts

11 Feb 2019


      On Mon, 11 Feb 2019 at 11:46, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
...
-----Original Message-----
From: Neil Williams neil.williams@linaro.org
Sent: Monday, 11 February 2019 11:38
To: Tim Jaacks tim.jaacks@garz-fricke.com
Cc: lava-users@lists.lavasoftware.org
Subject: Re: [Lava-users] Multinode pipeline synchronisation / timeouts
...
On Mon, 11 Feb 2019 at 10:29, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
...
Hello everyone,
I am having problems with timeouts when using the LAVA multinode protocol. Assume the following minimal pipeline with two nodes (device = DUT, remote = some kind of external hardware interfacing with the DUT):

deploy:
  role: device

boot:
  role: device

deploy:
  role: remote

boot:
  role: remote

test:
  role: remote

test:
  role: device


What I would expect: The device is booted first, then the remote is booted.
No. That is a fundamental misunderstanding of MultiNode. One MultiNode test job submission creates multiple LAVA test jobs. Those test jobs are scheduled only when all nodes defined in the target_group are ready to start. When scheduled, all test jobs start immediately.
Once any of the test jobs has got to a Lava Test Definition 1.0 shell, then that shell can use the MultiNode API to send a message declaring that it is ready and then use the same API to wait for whatever other nodes are required.
...
Afterwards, the tests run on both nodes, being able to interact with each other.
The pipeline model seems to be implemented in a way that each node has its own pipeline. This kind of makes sense, because the tests of course have to run simultaneously.
However, in my case booting the device takes a lot more time than booting the remote. This makes the 'test' stage on the remote run a lot earlier than the 'test' stage on the device.
My problem: How do I define a useful timeout value for the 'test' stage on the remote?
This has nothing to do with timeouts. This is about the MultiNode API.
https://master.lavasoftware.org/static/docs/v2/multinodeapi.html#multinode-a...
I am already using the MultiNode API and my test cases are working in general. I also understood that a MultiNode job submission creates sub-jobs for each node.
Are you using lava-send and lava-wait? Have you ensured that the
message sent by one node is different to the message sent by any other
node?
Messages are cached by the coordinator, so both nodes can lava-send
and then lava-wait. The timeout for the test action containing the
lava-wait on the fastest node must allow for all the time required for
the slower node to get to the point of sending its message, as well
as however long it will take to execute its own test action.
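
A minimal sketch of the pattern described above, assuming the two roles from the original question (device and remote) and hypothetical message IDs device-ready and remote-ready. Each role sends its own, differently named message and then waits for the other one. Because the coordinator caches messages, it does not matter which node arrives first. Definition names and timeout values are illustrative only.

```yaml
# Sketch only: role names follow the thread; message IDs, definition
# names and timeout values are assumptions chosen for illustration.
- test:
    role:
    - remote
    timeout:
      minutes: 5          # must cover the wait for the slower device role
    definitions:
    - repository:
        metadata:
          format: Lava-Test Test Definition 1.0
          name: sync-remote
          description: "announce readiness, then wait for the device"
        run:
          steps:
          - lava-send remote-ready
          - lava-wait device-ready
      from: inline
      name: sync-remote
      path: inline/sync-remote.yaml

- test:
    role:
    - device
    timeout:
      minutes: 2          # the device arrives last, so its wait is short
    definitions:
    - repository:
        metadata:
          format: Lava-Test Test Definition 1.0
          name: sync-device
          description: "announce readiness, then wait for the remote"
        run:
          steps:
          - lava-send device-ready
          - lava-wait remote-ready
      from: inline
      name: sync-device
      path: inline/sync-device.yaml
```

Keeping these synchronisation steps in their own test action means the timeout of the following, real test action only has to cover the test itself.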
...
Maybe I have not made clear enough what my problem is. A 'test' stage has to have a timeout value, right? And this timeout value is defined in the test job description, so it has to be known before the job is submitted.
The test action timeout needs to allow for all the operations within
that test action to occur; this includes any time spent waiting.
...
Let's assume a useful timeout value for my test is 1 minute. That means: if both test stages (on the device and on the remote) start at the same time, the test should be completed in less than 1 minute, otherwise some kind of failure can be assumed.
BUT: in reality, the two test stages do NOT start at the same time. There is no way to predict or influence the point in time at which a stage starts.
Separate the synchronisation test shell from the other actions - as
recommended in the docs. The test shell which waits for the other node
to be ready uses the lava-multinode protocol timeout to determine how
long to wait.
The MultiNode API needs to be used to enforce that the two test stages
DO start at the same time.
Whichever node is fastest sends a signal that it's ready with
lava-send and then uses lava-wait to halt further execution until a
similar message is sent by the other node. When the second node sends
its signal, the first node continues operation and all subsequent
test actions now run as if both nodes had started at the same time.
Then the total timeout for all test definitions within the test action
needs to cover the time required for the waiting and for the action
itself. This can be calculated from existing single-node test jobs in
some cases or by trial and error.
...
If my deploy and boot stages on the device take 3 minutes, while they take only 1 minute on my remote, the test stage on the remote times out before the test stage on the device has even started.
The test stage timeout needs to include the expected time required for
the synchronisation calls.
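
As a rough worked example of that budgeting, using the figures from the question (about 3 minutes of deploy and boot on the device, about 1 minute on the remote, about 1 minute for the test itself): the remote can expect to wait roughly 2 minutes. The device_type values below are placeholders and every timeout is an assumed value with some margin added, not a recommendation.

```yaml
# Sketch only: device_type values are hypothetical placeholders and all
# timeout values are assumptions derived from the figures in the thread.
protocols:
  lava-multinode:
    roles:
      device:
        device_type: my-dut-type        # hypothetical
        count: 1
      remote:
        device_type: my-remote-type     # hypothetical
        count: 1
    # How long each individual wait may last.
    # Expected wait on the remote: 3 min (device) - 1 min (remote) = 2 min.
    timeout:
      minutes: 4

actions:
# deploy and boot actions for both roles omitted
- test:
    role:
    - remote
    # 2 min expected wait + 1 min for the test itself = 3 min, plus margin.
    timeout:
      minutes: 5
    # definitions omitted
```

If the device later takes longer to reach its lava-send, both values have to grow by the same amount.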
...
I created a simple test job as an example: https://pastebin.com/Gtk1xZ6N
There are two nodes. One of them sends a signal to the other one via the MultiNode API. Before that, however, the first node runs an additional test stage containing a sleep. If this stage is removed, the job terminates successfully. If it is left in there, the test stage on node2 times out, even though I did not change any of the timeout values.
Of course it will time out: the action has been extended beyond its
timeout. 180 seconds > 2 minutes.
- test:
    role: node1
    timeout:
      minutes: 2
    definitions:
    - repository:
        metadata:
          format: Lava-Test Test Definition 1.0
          name: sleep
You must change the timeout values if you materially change the amount
of time that the test will normally take to execute. By adding 3
minutes to the execution time, you must increase the timeout for that
same action by at least 3 minutes.
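
A hedged sketch of the corresponding change to the action quoted above. Only the timeout differs; the new value is an assumption (180 seconds of sleep plus some margin), and the rest of the action stays as in the original job.

```yaml
# Sketch only: the new timeout value is an assumption.
- test:
    role: node1
    timeout:
      minutes: 5   # was 2; the added sleep alone runs for 180 seconds
    # definitions unchanged from the original job
```

The test action on node2 that waits for node1's message needs the same treatment: its timeout, and the lava-multinode protocol timeout that bounds the wait, must each allow for at least the 3 extra minutes.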
...
Am I missing something here? What is the intended way to handle this?
MultiNode is complex. You must consider the synchronisation to be part
of the test action - it must be included in the total timeout. LAVA
cannot tell if the test action is just sleeping or working.
...
...
...
Obviously I have to take the boot time difference between the two nodes into account. This seems counter-intuitive to me, since the timeout value should affect the actual test only.
...
...
...
What happens if I use an image on the device which takes even longer to boot?
You extend the lava-multinode protocol timeout and you extend the
timeout of the test action of the faster role to allow for the time
that will be consumed waiting for the other role.
...
...
...
Or if I first insert more test cases on the device which do not need the remote? In both cases I would have to adjust the timeout value for the remote 'test' stage.
Yes. That is how the synchronisation works. One side waits and one
side sends the message. The time required for the wait must be
included in the timeout of the action doing the waiting. The timeout
set in the protocol is how long each wait can be. Test actions may
wait several times for different services and the test action timeout
must take this into account. The timeout applies to everything done by
the test action, not just any specific test definition.
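
To illustrate the last point, here is a sketch of one test action on the remote that contains two test definitions and waits twice. The message IDs (device-booted, service-started), the definition names and all numbers are assumptions made up for this example. The single action timeout has to cover both waits plus the actual work, while each individual wait is bounded by the protocol timeout.

```yaml
# Sketch only: message IDs, definition names and all numbers are assumptions.
- test:
    role:
    - remote
    # One timeout for the whole action:
    #   wait for device-booted      up to about 2 min
    #   wait for service-started    up to about 1 min
    #   actual test work            about 1 min
    #   total about 4 min, plus margin
    timeout:
      minutes: 6
    definitions:
    - repository:
        metadata:
          format: Lava-Test Test Definition 1.0
          name: sync-boot
          description: "block until the device has booted"
        run:
          steps:
          - lava-wait device-booted
      from: inline
      name: sync-boot
      path: inline/sync-boot.yaml
    - repository:
        metadata:
          format: Lava-Test Test Definition 1.0
          name: interface-test
          description: "wait for a service on the device, then test"
        run:
          steps:
          - lava-wait service-started
          # placeholder for the real test steps
          - lava-test-case interface-check --result pass
      from: inline
      name: interface-test
      path: inline/interface-test.yaml
```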
...
...
...
Is this a design compromise? Or are there any possibilities of synchronizing pipeline stages on different nodes? I am thinking of some mechanism like "do not start 'test' stage on remote before 'boot' stage on device has finished".
You are thinking of the LAVA MultiNode API.
...
Best regards
Tim Jaacks
DEVELOPMENT ENGINEER
Garz & Fricke GmbH
Tempowerkring 2
21079 Hamburg
Direct: +49 40 791 899 - 55
Fax: +49 40 791899 - 39
tim.jaacks@garz-fricke.com
www.garz-fricke.com
WE MAKE IT YOURS!
Registered office: D-21079 Hamburg
Commercial register: Amtsgericht Hamburg, HRB 60514
Managing directors: Matthias Fricke, Manfred Garz, Marc-Michael Braun

Lava-users mailing list
Lava-users@lists.lavasoftware.org
https://lists.lavasoftware.org/mailman/listinfo/lava-users
--
Neil Williams
neil.williams@linaro.org
http://www.linux.codehelp.co.uk/
