Hi,
In most cases, we don't need a multinode job, as we can control an AOSP DUT from the LXC via adb over USB. However, here is the use case.
The CTS/VTS tradefed-shell --shards option supports splitting tests and running them on multiple devices in parallel. To leverage the feature in LAVA, we need a multinode job, right? And in a multinode job, the master-node LXC needs access to the DUTs on the slave nodes via adb over tcpip, right? Karsten shared a job example here[1]. This is probably the most advanced usage of LAVA, and probably also not encouraged? To make it clearer, the connectivity should look like this.
master.lxc <---- adb over usb ----> master.dut
master.lxc <---- adb over tcpip ---> slave1.dut
master.lxc <---- adb over tcpip ---> slave2.dut
....
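To make the use case concrete: once all three DUTs are visible to adb, the master LXC would run something like the following test-shell steps (a sketch only; the addresses, the cts package path and the shard count are illustrative assumptions):

run:
  steps:
  - adb connect 192.168.1.21:5555   # slave1.dut
  - adb connect 192.168.1.22:5555   # slave2.dut (master.dut is already attached over USB)
  - android-cts/tools/cts-tradefed run cts --shards 3   # one shard per device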
I see two options for adb over tcpip.
Option #1: WiFi. adb over wifi can be enabled easily by issuing adb commands from the LXC (see the sketch after this list). I am not using it for two reasons.
* WiFi isn't reliable for long cts/vts test runs.
* In the Cambridge lab, the WiFi sub-network isn't accessible from the lxc network. Because of security concerns, there is no plan to change that.
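For reference, "issuing adb cmds from lxc" amounts to roughly these test-shell steps, for WiFi and wired alike (a sketch; the serial, address and port are placeholders):

run:
  steps:
  - adb -s "$ANDROID_SERIAL" tcpip 5555   # restart adbd on the DUT in tcpip mode
  - adb connect "$DUT_IP:5555"            # reattach over the network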
Option #2: Wired Ethernet. On devices like hikey, we need to run a 'pre-os-command' in the boot action to power off the OTG port so that the USB Ethernet dongle works. Once the OTG port is off, the LXC has no access to the DUT, so the test definition should be executed on the DUT, right? I am also having the following problems with this.
* Without context overriding, the overlay tarball will be applied to the '/system' directory, and the test job reported "/system/bin/sh: /lava-247856/bin/lava-test-runner: not found"[2].
* With the following job context, LAVA still runs '/lava-24/bin/lava-test-runner /lava-24/0' and it hangs there. It was tested on my local LAVA instance; test job definition and test log are attached. Maybe my understanding of context overriding is wrong; I thought LAVA would execute '/system/lava-24/bin/lava-test-runner /system/lava-24/0' instead. Any suggestions would be appreciated.
context:
  lava_test_sh_cmd: '/system/bin/sh'
  lava_test_results_dir: '/system/lava-%s'
I checked on the DUT directly: '/system/lava-%s' exists, but I cannot actually run lava-test-runner. The shebang line seems problematic.
--- hacking ---
hikey:/system/lava-24/bin # ./lava-test-runner
/system/bin/sh: ./lava-test-runner: No such file or directory
hikey:/system/lava-24/bin # cat lava-test-runner
#!/bin/bash
#!/bin/sh
....
# /system/bin/sh lava-test-runner
lava-test-runner[18]: .: /lava/../bin/lava-common-functions: No such file or directory
--- ends ---
I had a discussion with Milosz. He proposed a third option, which will probably be the most reliable one, but it is not supported in LAVA yet. Here is the idea. Milosz, feel free to explain more.
**Option #3**: Add support for accessing multiple DUTs in a single-node job.
* Physically, we need the DUTs connected via USB cable to the same dispatcher.
* In a single-node job, LAVA needs to add the DUTs, either specified (somehow) or assigned randomly (let's say both device type and number are defined), to the same lxc container. Test definitions can take over from here.
Can this be done in LAVA? Can I request the feature? Any suggestions on possible implementations?
Thanks, Chase
[1] https://review.linaro.org/#/c/qa/test-definitions/+/29417/4/automated/androi... [2] https://staging.validation.linaro.org/scheduler/job/247856#L1888
On Thu, 24 Jan 2019 at 11:41, Chase Qi chase.qi@linaro.org wrote:
CTS/VTS tradefed-shell --shards option supports splitting tests and running them on multiple devices in parallel. To leverage the feature in LAVA, we need a multinode job, right?
If more than one device needs to have images deployed and booted specifically for this test job, then yes. MultiNode is required. To be sure that each device is at the same stage (as deploy and boot timings can vary), the test job will need to wait for all test jobs to be synchronised to the same point in each test job - synchronisation is currently restricted to POSIX shells.
And in a multinode job, the master-node LXC needs access to DUTs on slave nodes via adb over tcpip, right?
Not necessarily. From the LXC, the device can be controlled using USB. There is no need for devices to have a direct connection to each other just to use MultiNode. The shards implementation may require that though.
Karsten shared a job example here[1]. This probably is the most advanced usage of LAVA
All MultiNode is a complex usage of LAVA but VLANd used by the networking teams is more complex than your use case.
, and probably also not encouraged? To make it clearer, the connectivity should look like this.
There is a problem in this model: every DUT will have its own LXC, and that device will be connected to the LXC using USB.
master.lxc <---- adb over usb ----> master.dut
master.lxc <---- adb over tcpip ---> slave1.dut
master.lxc <---- adb over tcpip ---> slave2.dut
Do not separate the LXC from the DUT - the LXC and its DUT are a single node.
The Master DUT has a master LXC. The Slave1 DUT has a Slave1 LXC. The Slave2 DUT has a Slave2 LXC.
Depending on the boards in use, you may be able to configure each DUT, including the master DUT, to have TCP/IP networking. That then allows the processes running in the Master node to access the slave nodes.
(The following model is based on a theoretical device which doesn't have the crippling USB OTG problem of the hikey - but the hikey can work in this model if the IP addresses are determined statically and therefore are available to each slave LXC.)
0: A program executing in the Master LXC uses USB to send commands to the master DUT, allowing the Master LXC to retrieve the IP address of the master DUT.
1: That program in the Master LXC then uses the MultiNode API (lava-send) to declare that IP address to all the slave nodes. This is equivalent to how existing jobs declare the IP address of the device when using secondary connections.
2: Each slave node waits for the master-ip-addr message and sets that value in a program executing in the slave LXC. The slave LXC is connected to the slave DUT directly using USB so can use this to set the master IP address, if that is required.
3: Each slave node now runs a program in each slave LXC to connect to the slave DUT over USB and extract the slave DUT IP address
4: Each slave node then broadcasts that slave-<ID>-ip-addr message, so the first slave sends slave-1-ip-addr containing an IP address, slave 2 sends slave-2-ip-addr containing a different IP address.
5: The master node is waiting for all of these messages to be sent and retrieves the values in turn. This information is now available to a program executing inside the master LXC. This program could use USB to set these values in the master DUT, if that is required.
6: During this time, all the slave nodes are waiting for the master node to broadcast another message saying that the work on the master is complete.
7: Once the master sends the complete message, each slave node picks up this message from the MultiNode API and the script executing in the slave LXC then ends the Lava Test Definition and the slave test job completes.
8: The master can then do some other stuff and then complete.
https://staging.validation.linaro.org/scheduler/job/246447/multinode_definit...
https://staging.validation.linaro.org/scheduler/job/246230/multinode_definit...
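A minimal sketch of that exchange, written as test-shell steps for each role (the message IDs and addresses are assumptions for illustration, not a verified job):

# slave role (steps 2, 4 and 6 above)
run:
  steps:
  - lava-wait master-ip-addr                        # receive the master's address
  - lava-send slave-1-ip-addr ipaddr=192.168.1.21   # declare our own address
  - lava-wait master-complete                       # block until the master is done

# master role (steps 1, 5 and 6 above)
run:
  steps:
  - lava-send master-ip-addr ipaddr=192.168.1.10
  - lava-wait slave-1-ip-addr                       # values arrive via the MultiNode cache file
  - lava-wait slave-2-ip-addr
  - lava-send master-complete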
Don't obsess about the LXC either. With upcoming changes for docker support, we could remove the presence of the LXC entirely. The LXC with android devices only exists as a unit of isolation for the benefit of the dispatcher. It has useful side effects but the reason for the LXC to exist is to separate the fastboot operations from the dispatcher operations.
For hikey and its broken USB OTG support:
0: Each slave test job turns off the USB OTG support once the slave LXC has deployed all the test image files and determined that the slave DUT has booted correctly. If not, use lava-test-raise.
1: Next, each slave LXC uses the IP address of its own slave DUT to check connectivity. If this fails, use lava-test-raise.
2: Each slave LXC uses the MultiNode API to declare the IP address of the slave DUT (because the slave node has determined that this IP is working).
3: The master node is waiting for these messages and these are picked up by the master LXC test definition.
4: The master LXC test definition issues commands to the master DUT - now depending on how the sharding works, this could be over USB (turn the USB OTG off later) or over TCP/IP (turn off the master USB OTG at the start of this test definition).
5: The master DUT has enough information to drive the sharding across the slave DUTs. The slave LXCs are waiting for the master to finish the sharding. (lava-wait)
6: When the master LXC determines that the master DUT has finished the sharding, then the master LXC sends a message to all the slave nodes that the test is complete.
7: Each slave node picks up the completion message in the slave LXC and the test definition finishes.
8: The master node can continue to do other tasks or can also complete it's test definition.
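Sketched as slave-LXC test-shell steps (the address and message ID are assumptions; lava-test-raise takes a message describing the failure):

run:
  steps:
  - adb connect 192.168.1.21:5555                                             # step 1
  - adb -s 192.168.1.21:5555 shell true || lava-test-raise "DUT unreachable"  # verify the connection works
  - lava-send slave-1-ip-addr ipaddr=192.168.1.21                             # step 2
  - lava-wait master-complete                                                 # steps 5-7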
....
I see two options for adb over tcpip.
Option #1: WiFi. adb over wifi can be enabled easily by issuing adb cmds from lxc. I am not using it for two reasons.
Agreed, this doesn't need to rely on WiFi.
- WiFi isn't reliable for long cts/vts test runs.
- In the Cambridge lab, the WiFi sub-network isn't accessible from the lxc network. Because of security concerns, there is no plan to change that.
Option #2: Wired Ethernet. On devices like hikey, we need to run a 'pre-os-command' in the boot action to power off the OTG port so that the USB Ethernet dongle works. Once the OTG port is off, the LXC has no access to the DUT, so the test definition should be executed on the DUT, right? I am also having the following problems with this.
Before the OTG is switched, all data from the DUT needs to be retrieved (and set) using the USB connection.
What information you need to set depends on how the sharding works.
The problem, as I see it, is that the slave DUTs have no way to declare their IP address to the slave LXC once the OTG port is switched. Therefore, you will need to put in a request for the boards to have static IP addresses declared in the device dictionary. Then the OTG can be switched and things become easier because the LXC knows the IP address and can simply declare that to the MultiNode API so that the master LXC can know which IP matches which node. There are already a number of hikey devices with the static_ip device tag and you can specify this device tag in your MultiNode test definition.
- Without context overriding, the overlay tarball will be applied to the '/system' directory and the test job reported "/system/bin/sh:
Why are you talking about /system ??? MultiNode only operates in a POSIX shell - the POSIX shell is in the LXC and each DUT has a dedicated LXC. In this use case, MultiNode API calls are only going to be made from each LXC. The master LXC sends some information and then receives information from test definitions running in each of the slave LXCs.
The overlay is to be deployed to the LXC, not the DUT because this is an Android system. What the android system does is determined either by commands run inside the slave LXC to deploy files (before the OTG switch) or commands run inside the master LXC (with knowledge of the IP address from the MultiNode API) to execute commands on the DUT over TCP/IP.
Use the LXC to deploy the files and boot the device, then to declare information about each particular node. Once that is done, whatever thing is controlling the test needs to just use TCP/IP to communicate and use the MultiNode API to send messages and allow some nodes to wait for other nodes whilst the test proceeds.
/lava-247856/bin/lava-test-runner: not found"[2].
- With the following job context, LAVA still runs '/lava-24/bin/lava-test-runner /lava-24/0' and it hangs there. It was tested on my local LAVA instance; test job definition and test log are attached. Maybe my understanding of context overriding is wrong; I thought LAVA would execute '/system/lava-24/bin/lava-test-runner /system/lava-24/0' instead. Any suggestions would be appreciated.
context:
  lava_test_sh_cmd: '/system/bin/sh'
  lava_test_results_dir: '/system/lava-%s'
I checked on the DUT directly: '/system/lava-%s' exists, but I cannot actually run lava-test-runner. The shebang line seems problematic.
--- hacking ---
hikey:/system/lava-24/bin # ./lava-test-runner
/system/bin/sh: ./lava-test-runner: No such file or directory
hikey:/system/lava-24/bin # cat lava-test-runner
#!/bin/bash
#!/bin/sh
....
# /system/bin/sh lava-test-runner
lava-test-runner[18]: .: /lava/../bin/lava-common-functions: No such file or directory
--- ends ---
I had a discussion with Milosz. He proposed the third option which probably will be the most reliable one, but it is not supported in LAVA yet. Here is the idea. Milosz, feel free to explain more.
**Option #3**: Add support for accessing multiple DUTs in a single-node job.
- Physically, we need the DUTs connected via USB cable to the same dispatcher.
I don't see that this solves anything and it adds a lot of unnecessary lab configuration - entirely duplicating the point of having ethernet connections to the boards. Assign static IP addresses to each board and when the test job starts, each dedicated LXC can declare the static information according to whichever board was assigned to whichever node.
The DUTs only need to be visible to programs running on the master node and that can be done by declaring static IP addresses using the MultiNode API.
- In a single-node job, LAVA needs to add the DUTs, either specified (somehow) or assigned randomly (let's say both device type and number are defined), to the same lxc container. Test definitions can take over from here.
No - the LXC is used to issue commands to deploy test images to the DUT. The LXC is a transparent part of the dispatcher, it is not just for test definitions. The LXC cannot be used for multiple test jobs, it is part of the one dispatcher.
Can this be done in LAVA? Can I request the feature? Any suggestions on possible implementations?
Hi Neil,
Thanks a lot for your guidance. It is really good to see you back :)
On Thu, Jan 24, 2019 at 11:07 PM Neil Williams neil.williams@linaro.org wrote:
On Thu, 24 Jan 2019 at 11:41, Chase Qi chase.qi@linaro.org wrote:
Not necessarily. From the LXC, the device can be controlled using USB. There is no need for devices to have a direct connection to each other just to use MultiNode. The shards implementation may require that though.
CTS/VTS sharding splits a run into a given number of independent chunks, to run on multiple devices that are connected to the same host. The host will be the master LXC in our case.
Depending on the boards in use, you may be able to configure each DUT, including the master DUT, to have TCP/IP networking. That then allows the processes running in the Master node to access the slave nodes.
Yes, that is what I am trying to do. The connectivity topology I wrote above is the goal, not the initial state of the LAVA design. The master LXC needs access to all the DUT nodes, either via USB or tcpip.
The problem, as I see it, is that the slave DUTs have no way to declare their IP address to the slave LXC once the OTG port is switched. Therefore, you will need to put in a request for the boards
That is the problem I had, and that is why I was trying to run the test definition on the Android DUT directly, to enable adb over tcpip and declare the IP address. As you mentioned below, it is the wrong direction.
to have static IP addresses declared in the device dictionary. Then the OTG can be switched and things become easier because the LXC knows the IP address and can simply declare that to the MultiNode API so that the master LXC can know which IP matches which node. There are already a number of hikey devices with the static_ip device tag and you can specify this device tag in your MultiNode test definition.
Brilliant, and a brand new idea to me. I didn't realize the static-ip tag is the solution. I have managed to enable and test adb over tcpip this way (in my local instance). I have attached my test job definition here in case it is of any help to other LAVA users. The following definitions are essential.
tags:
- static-ip
reboot_to_fastboot: false
- test:
    namespace: tlxc
    timeout:
      minutes: 10
    protocols:
      lava-lxc:
      - action: lava-test-shell
        request: pre-os-command
        timeout:
          minutes: 2
Thanks, Chase
On Fri, 25 Jan 2019 at 07:46, Chase Qi chase.qi@linaro.org wrote:
CTS/VTS sharding splits a run into a given number of independent chunks, to run on multiple devices that are connected to the same host. The host will be the master LXC in our case.
OK. So once the process running in the master LXC has received data from the MultiNode API about the IP addresses of each slave DUT involved in this MultiNode test job, that process should be able to use TCP/IP as the connection between the multiple devices.
Brilliant, and a brand new idea to me. I didn't realize the static-ip tag is the solution. I have managed to enable and test adb over tcpip this way (in my local instance). I have attached my test job definition here in case it is of any help to other LAVA users. The following definitions are essential.
tags:
- static-ip
reboot_to_fastboot: false
- test:
    namespace: tlxc
    timeout:
      minutes: 10
    protocols:
      lava-lxc:
      - action: lava-test-shell
        request: pre-os-command
        timeout:
          minutes: 2
Yes, that looks good.
Remaining questions:
A: Do the slaves need to know anything about the master? e.g. does the slave need to know the IP address of the master? Is it enough for the master to know the IP address of each slave?
B: How many slaves do you expect to use in the final test job?
The next steps are:
0: Start with the simplest MultiNode test job arrangement - one master and one slave.
1: Write a Lava Test Definition which runs lava-target-ip and uses lava-send to broadcast that address using a unique message ID on the slave. That message ID could be a parameter but it will have to be determined in advance (at job submission) so that the master node knows exactly which message it needs to receive and it will be a different message from each slave node.
2: Assign that test definition to the slave role in the MultiNode test job.
3: Write a Lava Test Definition which runs lava-wait to retrieve the unique message ID used by the slave.
4: Assign that test definition to the master role in the MultiNode test job.
5: Without doing anything else on the master or slave, prove that the data can be retrieved into the master LXC.
Follow the example https://git.lavasoftware.org/lava/lava/blob/master/lava_scheduler_app/tests/... to add device tag support to the MultiNode test job definition. Each role would need:

tags:
- static-ip
You shouldn't need any context overrides for this test job.
You will need to get a unique message from each slave node, so as you increase the number of slaves, you will need to increase the number of MultiNode roles and keep the count for each role at one. i.e. roles would be a list of [master, slave] and then a list of [master, slave_one, slave_two]. This way, you can use the role itself as the unique message ID from each slave - just call lava-role to get the assigned role for the current node.
protocols:
  lava-multinode:
    roles:
      master:
        device_type: hikey
        count: 1
        tags:
        - static-ip
      slave_one:
        device_type: hikey
        count: 1
        tags:
        - static-ip
      slave_two:
        device_type: hikey
        count: 1
        tags:
        - static-ip
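With the role as the message ID, the per-node steps stay identical across slaves; a sketch (assuming lava-target-ip prints the static address from the device dictionary):

# every slave role runs the same definition
run:
  steps:
  - lava-send "$(lava-role)-ip-addr" ipaddr="$(lava-target-ip)"

# master role
run:
  steps:
  - lava-wait slave_one-ip-addr
  - lava-wait slave_two-ip-addr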
On Fri, Jan 25, 2019 at 5:40 PM Neil Williams neil.williams@linaro.org wrote:
Remaining questions:
A: Do the slaves need to know anything about the master? e.g. does the slave need to know the IP address of the master? Is it enough for the
No. The slave just needs to run adb in tcpip mode and send its IP address.
master to know the IP address of each slave?
Yes. The master just needs to know the slaves' IPs.
B: How many slaves do you expect to use in the final test job?
I have a working example now: https://validation.linaro.org/scheduler/job/1905140/multinode_definition . Instead of mixing adb usb and tcpip, I decided to use tcpip only, to simplify and unify the connectivity between the host and the DUTs. The example job uses a lxc device as master/host and two hikey devices as workers.
You will need to get a unique message from each slave node, so as you increase the number of slaves, you will need to increase the number of MultiNode roles and keep the count for each role at one. i.e. roles would be a list of [master, slave] and then a list of [master, slave_one, slave_two]. This way, you can use the role itself as the unique message ID from each slave - just call lava-role to get the assigned role for the current node.
Yeah, makes sense. In Karsten's patch, he used lava-wait-all and sent the same message ID from all roles. Refer to https://review.linaro.org/#/c/qa/test-definitions/+/29416/3/automated/lib/an... . The advantage of this approach is that we can simply increase the worker role count for more workers.
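For comparison, the lava-wait-all pattern looks roughly like this (a sketch; every node, whatever its role, runs the same steps):

run:
  steps:
  - lava-send ipaddr ip="$(lava-target-ip)"
  - lava-wait-all ipaddr   # returns once every node in the job has sent the message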
Thanks a lot, Chase
Hi Neil,
I'm replying to this older thread regarding questions that came up for me in a recent mail: https://lists.lavasoftware.org/pipermail/lava-users/2019-February/001693.htm... and the issue you linked: https://git.lavasoftware.org/lava/lava/issues/114
On 24.01.19 16:06, Neil Williams wrote:
Don't obsess about the LXC either. With upcoming changes for docker support, we could remove the presence of the LXC entirely. The LXC with android devices only exists as a unit of isolation for the benefit of the dispatcher. It has useful side effects but the reason for the LXC to exist is to separate the fastboot operations from the dispatcher operations.
What exactly is the envisioned approach to AOSP testing in LAVA? Is it that the LXC gets replaced by (user) Docker containers? In earlier messages I got the impression that the containers are being removed entirely. If not, will there still be one container per test job (and thus per DUT)? How are isolation and scalability achieved for fastboot and adb, especially regarding the issues listed here: https://master.lavasoftware.org/static/docs/v2/integrate-fastboot.html ? Do you expect that changes to the test jobs/shells/scripts will be required to be compatible with the new approach, especially for MultiNode testing?
Thanks, Karsten
On Thu, 21 Feb 2019 at 11:13, Karsten Tausche karsten@fairphone.com wrote:
What exactly is the envisioned approach for AOSP testing in LAVA? Is it that LXC gets replaced by (user) docker containers?
Either or both, depending on how each instance is configured.
At the moment, only LXC support is actually working. The docker support is in development.
The principal problem with LXC is that the tooling inside the LXC needs to be installed and configured afresh for each test job. Docker would solve that, but that is about the only benefit - speed of operation and more direct control over the tooling environment.
In earlier messages I got the impression that the containers are removed entirely.
If you use LXC on bare metal, you just get LXC support. This is what is being used for all production testing of AOSP - in fact, it's what is being used for all production testing of any fastboot device. This issue isn't about AOSP as such. It is fastboot as a bootloader which is the restriction. Test jobs which boot OpenEmbedded or Debian on an X15 or hikey must still use LXC - unless the bootloader is changed from fastboot to U-Boot.
If you use LXC in an official LAVA Software Community Project docker image, then the LXC functionality is replaced by lava-lxc-mocker and the job works as normal, using the docker container instead of an LXC. This has persistence issues which are to be addressed in https://git.lavasoftware.org/lava/lava/issues/114 - the only container would be docker.
If you only ever use Docker, then the lava-lxc protocol blocks can be removed from your test job submissions. This has only been tested in developer situations and needs more work to be used more widely.
Thus, when it becomes fully supported to do testing of fastboot devices using Docker, then there will be no LXC in use. Only the docker image(s) will exist.
However, labs will retain the ability to support both options. It depends how each lab configures their own workers.
So don't obsess about LXC specifically but do start your work with fastboot devices using LXC. In time, Docker support will become available to replace the LXC but you will retain the choice of which to use.
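(For reference, the lava-lxc protocol blocks mentioned above are job-level declarations of this shape - a sketch based on typical hikey examples; the values vary per device type and lab:

protocols:
  lava-lxc:
    name: lxc-hikey-test
    template: debian
    distribution: debian
    release: stretch

With lava-lxc-mocker, a job carrying such a block runs unchanged inside Docker.)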
If not, will there still be one container per test job (thus per DUT)? How is isolation and scalability achieved for fastboot and adb, especially regarding the issues listed here: https://master.lavasoftware.org/static/docs/v2/integrate-fastboot.html ?
That is the task of https://git.lavasoftware.org/lava/lava/issues/114
Do you expect that changes on the test jobs/shells/scripts will be required to be compatible with the new approach, especially for MultiNode testing?
Not mandatory, no.
However, in the course of time, if the usage of LXC drops as the use of Docker increases for these test jobs, then changes can be made. Essentially, the initial test jobs will carry unused blocks (the lava-lxc protocol definitions) which will be mocked up by lava-lxc-mocker. The only changes expected will be in the test job submissions and through lava-lxc-mocker, we aim to have the ability to submit the same LXC-based test job to a worker using LXC as to a worker using Docker.
MultiNode testing of AOSP is not commonplace, principally because AOSP does not provide a POSIX shell on serial by default.
For any further discussion of topics like this, please subscribe to and post to the lava-devel mailing list.