Hi guys,
We found a blocking issue with our Android tests. The story is as follows:
1. Job #1 with device #1 runs for about 12 hours. During the run it restarts the board many times, so the USB path changes each time, e.g. from /dev/bus/usb/003/001 to /dev/bus/usb/003/002, then /dev/bus/usb/003/003, and so on up to /dev/bus/usb/003/127. The maximum device number is 127, so if the device resets again, the number goes back to 001.
adb devices in container 1:
$ adb devices
List of devices attached
040c41d4d72d7393 device
2. Job #2 with device #2 starts while job #1 is still running. It will then, e.g., mknod /dev/bus/usb/003/016 in another docker-test-shell container, with the cgroup privilege added as well. But /dev/bus/usb/003/016 was once used by job #1, and that node is never deleted from job #1's docker-test-shell container. So, with high probability, device #2 shows up in job #1's docker-test-shell container (checked with adb devices).
Now, adb devices in container 2:
$ adb devices
List of devices attached
After the above, adb devices in container 1:
$ adb devices
List of devices attached
040c41d4d72d7393 device
23305a0a5c85d936 device
This has become a big issue for our parallel Android testing.
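The collision described above can be modelled in a few lines. This is only a simplified sketch: the helper names (next_devnum, node_path) are made up for illustration, and it assumes device #1 simply takes the next number on bus 003 at every reboot. It relies on the fact that a character device node is identified by its major/minor numbers, so a leftover node points at whichever device currently holds that number.

```python
MAX_DEVNUM = 127  # highest USB device number on a bus, as described above


def next_devnum(current: int) -> int:
    """Return the device number after `current`, wrapping from 127 back to 1."""
    return current % MAX_DEVNUM + 1


def node_path(bus: int, devnum: int) -> str:
    """Format a USB device node path such as /dev/bus/usb/003/016."""
    return f"/dev/bus/usb/{bus:03d}/{devnum:03d}"


# Nodes created inside job #1's container; nothing ever removes them.
container1_nodes = set()

# Device #1 re-enumerates on every reboot, so job #1's container collects
# a fresh node each time (simplified: 130 reboots, one number per reboot).
devnum = 1
for _ in range(130):
    container1_nodes.add(node_path(3, devnum))
    devnum = next_devnum(devnum)

# Device #2 is later enumerated at a number that device #1 used earlier.
device2_node = node_path(3, 16)
stale_hit = device2_node in container1_nodes
print(stale_hit)
```

In this model, after enough reboots job #1's container holds a node for every number on the bus, so any number handed to device #2 matches a stale node there.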
In fact, back in the old LXC days we also saw similar issues, so we made a local workaround: https://github.com/atline/lava-docker-slave/blob/66f15d9da88912fc929fef52136... In that patch we also match on "remove" together with ENV{ID_SERIAL_SHORT}, that is, "if a USB device leaves, delete its node".
But for reasons I don't know, in the current version (2020.08) I can only match on "remove" in udev; matching "remove" + ENV{ID_SERIAL_SHORT} no longer works correctly.
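One possible way around the failed match, sketched here purely as an assumption and not as anything LAVA or the linked workaround does: record which node belongs to which serial when the "add" event arrives, then key the later "remove" on the node path alone. Events are modelled as plain dicts using the udev property names ACTION, DEVNAME and ID_SERIAL_SHORT; the NodeTracker class is hypothetical.

```python
from typing import Dict, Optional


class NodeTracker:
    """Remember which serial owns which device node, keyed by node path."""

    def __init__(self) -> None:
        self._serial_by_node: Dict[str, str] = {}

    def handle(self, event: Dict[str, str]) -> Optional[str]:
        """Process one event; return the node to delete on a tracked "remove"."""
        action = event.get("ACTION")
        node = event.get("DEVNAME")
        if node is None:
            return None
        if action == "add":
            serial = event.get("ID_SERIAL_SHORT")
            if serial is not None:
                # Remember the node while the serial is still known.
                self._serial_by_node[node] = serial
            return None
        if action == "remove" and node in self._serial_by_node:
            # The serial is not needed here: the node path is enough.
            del self._serial_by_node[node]
            return node
        return None


tracker = NodeTracker()
tracker.handle({
    "ACTION": "add",
    "DEVNAME": "/dev/bus/usb/003/016",
    "ID_SERIAL_SHORT": "040c41d4d72d7393",
})
to_delete = tracker.handle({"ACTION": "remove", "DEVNAME": "/dev/bus/usb/003/016"})
print(to_delete)
```

The caller would then delete the returned node inside the matching container; how to hook this into udev or the dispatcher is left open here.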
So, to get our local Android runs working quickly, we applied the following patch:
# diff __init__.py.bak __init__.py
157,158c157,158
< "mkdir -p %s && mknod %s c %d %d || true"
< % (os.path.dirname(node), node, major, minor),
---
> "mkdir -p %s && rm -f %s/* && mknod %s c %d %d || true"
> % (os.path.dirname(node), os.path.dirname(node), node, major, minor),
With this, the issue no longer happens on our side, but the fix does not look universal. So I'm asking here: have you ever seen this issue? And what are your thoughts on it?
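To make the effect of the patched line concrete, here is a standalone sketch of the command string it produces. The build_mknod_command wrapper and the major/minor values are illustrative assumptions; only the format string and its arguments come from the diff.

```python
import os


def build_mknod_command(node: str, major: int, minor: int) -> str:
    """Build the shell command from the patched line for one device node."""
    directory = os.path.dirname(node)
    # Same format string as the patch: create the bus directory, remove
    # every existing node in it, then create the new node.
    return "mkdir -p %s && rm -f %s/* && mknod %s c %d %d || true" % (
        directory, directory, node, major, minor)


# Illustrative values only; real major/minor numbers come from the device.
command = build_mknod_command("/dev/bus/usb/003/016", 189, 271)
print(command)
```

Because rm -f runs on the whole bus directory, it also removes nodes for any other device on the same bus that the container is meant to keep, which may be one reason the patch feels less than universal.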
On Thu, 10 Sep 2020 at 06:12, Larry Shen larry.shen@nxp.com wrote:
With this, the issue no longer happens on our side, but the fix does not look universal.
So I'm asking here: have you ever seen this issue? And what are your thoughts on it?
This looks like a real bug. We sometimes observe boards disappearing from LXC after a reboot (in Android CTS). This might hint at why it's happening, as there may be more than one board running an Android job on the same dispatcher. @Antonio, could you take a look at whether it's possible to remove the nodes from the container?
milosz
Lava-users mailing list Lava-users@lists.lavasoftware.org https://lists.lavasoftware.org/mailman/listinfo/lava-users
On Thu, Sep 10, 2020 at 09:15:04AM +0100, Milosz Wasilewski wrote:
This looks like a real bug. We sometimes observe boards disappearing from LXC after a reboot (in Android CTS). This might hint at why it's happening, as there may be more than one board running an Android job on the same dispatcher. @Antonio, could you take a look at whether it's possible to remove the nodes from the container?
Yes, unsharing the device from the container is the right thing to do here. I created an issue for it, so it's now on my TODO list:
lava-users@lists.lavasoftware.org