-----Original Message-----
From: Stephen Lawrence
Sent: 11 August 2020 12:02
To: 'Antonio Terceiro' antonio.terceiro@linaro.org
Cc: lava-users@lists.lavasoftware.org
Subject: RE: fastboot reboot-bootloader from docker test shell (was "fastboot in docker support")
[snip]
In my simplified case I have a 'fastboot --set-active' after the reboot that will cause the host to naturally wait, but I did wonder whether the discovery code was getting the chance to find and share the restarted device. So I tried a 10s and a 20s sleep after the reboot. Unfortunately there is still no device discovery after reboot:
https://lava.genivi.org/scheduler/job/1176#L240
https://lava.genivi.org/scheduler/job/1177#L240
I think I need to deep dive into the new discovery code to better understand what methods need to be supported. It feels like it might be something minor, such as a missing udev rule or something extra required in the device dict (USB IDs, for example) to trigger the discovery the second time. It's curious that the serial number is sufficient for discovery to succeed the first time though.
OK, so yesterday I spent some hours going back through the lava-dispatcher-host code and debugging my instance. I could not see from browsing the source why the discovery code would not find the device on reboot; matching on the serial number is there, for example. So I turned to fundamentals and the udev rules it installs, specifically the 'add' udev event for a device.
Here I made some debug progress. Running 'udevadm monitor' in both the host and the worker container I can see kernel and udev events on the host, but no udev events in the worker container. That doesn't explain why the device is found the first time, but it can explain why it is not found after the reboot (working from Antonio's summary of how the new discovery code works).
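For anyone who wants to repeat the host versus container comparison without udevadm, here is a minimal Python sketch that listens on the kernel uevent netlink socket and prints what arrives. It is only an illustration, not LAVA code: the function names are made up, and the group numbers (1 for raw kernel events, 2 for events rebroadcast by udevd) follow the libudev convention as I understand it. Run the same script in both places, replug or reboot the device, and compare the output.

```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # protocol number from <linux/netlink.h>
KERNEL_GROUP = 1             # multicast group carrying raw kernel uevents
UDEV_GROUP = 2               # assumed: group udevd rebroadcasts on (different
                             # message format, not parsed by this sketch)


def parse_kernel_uevent(data: bytes) -> dict:
    """Parse a raw kernel uevent of the form 'action@devpath\\0KEY=VALUE\\0...'."""
    fields = [f for f in data.split(b"\0") if f]
    event = {"HEADER": fields[0].decode(errors="replace")} if fields else {}
    for field in fields[1:]:
        key, sep, value = field.partition(b"=")
        if sep:  # skip anything that is not KEY=VALUE
            event[key.decode(errors="replace")] = value.decode(errors="replace")
    return event


def watch_kernel_uevents(limit: int = 5) -> None:
    """Print the next `limit` kernel uevents received by this process (Linux only)."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                       NETLINK_KOBJECT_UEVENT) as sock:
        # Address is (pid, group bitmask); pid 0 lets the kernel pick one.
        sock.bind((0, KERNEL_GROUP))
        for _ in range(limit):
            event = parse_kernel_uevent(sock.recv(65536))
            print(event.get("ACTION"), event.get("SUBSYSTEM"), event.get("DEVPATH"))


# Usage (blocks until events arrive): watch_kernel_uevents()
```

The sketch only decodes the kernel group. Events on the udev group carry a libudev header in front of the properties, so 'udevadm monitor' remains the better tool for those.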
This surprises me as it takes me back to around May when we were first working through problems of device discovery. My recollection from then is I was seeing the udev events in the worker container once I shared the udev database (as /run/udev) into the container.
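As a quick sanity check on the sharing itself, something like the following can be run inside the worker container to confirm what is actually visible. This is a sketch under assumptions: the list of paths is my guess at what might matter (the udev runtime directory, its database and control socket, and the USB device nodes), not a confirmed set of requirements, and 'visible_paths' is a hypothetical helper name.

```python
from pathlib import Path

# Assumed candidates only; nothing here is confirmed as the required set.
CANDIDATE_PATHS = ("run/udev", "run/udev/data", "run/udev/control", "dev/bus/usb")


def visible_paths(root: str = "/") -> dict:
    """Report which candidate paths exist under `root` (use "/" in the container)."""
    base = Path(root)
    return {"/" + rel: (base / rel).exists() for rel in CANDIDATE_PATHS}


# Usage inside the container:
# for path, present in visible_paths().items():
#     print(path, "present" if present else "MISSING")
```

A path being present only shows that the bind mount worked. It does not by itself show that udev events are reaching the container.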
With all my rebasing onto the latest code it's possible I have missed something. I have not found anything like that and I'll keep looking, but taking a step back I think there are larger, more fundamental questions about LAVA here. How do the developers see udev discovery working in a docker environment? Sharing 'just enough' into the worker container for udev events to work? What is 'just enough'? Or perhaps you see a duplicate 'full' udev system running in the worker container instead? I would expect the sharing approach but I don't know for certain.
More immediately how is the community overcoming this in their own LAVA instances? What are you sharing into your worker containers? Are you just running in privileged mode with everything shared? It seems unlikely we are the only ones doing this.
Of course it's all 'just' configurable code, but as Paul Sokolovsky pointed out in the issue he raised about the docker socket [1], there ought to be patterns we can collectively figure out: paths that are known to work. That should make people productive with LAVA more quickly, whilst reducing the support needed on long threads like this.
[1] https://git.lavasoftware.org/lava/pkg/docker-compose/-/issues/7
Regards
Steve