On 8 November 2016 at 15:46, Tomeu Vizoso tomeu.vizoso@collabora.com wrote:
On 11/08/2016 02:25 PM, Neil Williams wrote:
On 8 November 2016 at 12:54, Tomeu Vizoso tomeu.vizoso@collabora.com wrote:
[Moving to lava-users as suggested by Neil]
Trimming the CC list - the lava-team can reply using lava-users directly.
On 11/07/2016 03:20 PM, Neil Williams (Code Review) wrote:
Neil Williams has posted comments on this change.
https://review.linaro.org/#/c/15203/3/lava_dispatcher/pipeline/actions/deplo...
File lava_dispatcher/pipeline/actions/deploy/tftp.py:
Line 127: def _ensure_device_dir(self, device_dir):
Cannot say that I have fully understood it yet. Would it be correct if the
The Strategy classes must not set or modify anything. The accepts method does some very fast checks and returns True or False. Anything which the pipeline actions need to know must be specified in the job submission or the device configuration. So either this is restricted to specific device-types (so a setting goes into the template) or it has to be set for every job using this method (for situations where the support can be used or not used on the same hardware for different jobs).
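To put that contract into code, here is a minimal sketch of the shape of an accepts() (the class name and the 'depthcharge_dir' parameter are invented for illustration, not taken from the patch; the real class would subclass the relevant pipeline Strategy):

    # Sketch only - illustrative names, not the code under review.
    class ExampleDepthchargeStrategy(object):
        """Strategy classes stay read-only: fast checks, then True or False."""

        @classmethod
        def accepts(cls, device, parameters):
            # Reject quickly if the job is not asking for this method.
            if parameters.get('method') != 'depthcharge':
                return False
            # Only accept devices whose template declares the setting.
            if 'depthcharge_dir' not in device.get('parameters', {}):
                return False
            return True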
What is this per-device directory anyway, and how is it meant to work with tftpd-hpa, which does not support configuration modification without restarting itself? Jobs cannot require that daemons restart - other jobs could easily be using that daemon at the same time.
So each firmware image containing Depthcharge will also contain hardcoded values for the IP of the TFTP server and for the paths of a cmdline.txt file and a FIT image. The FIT image contains a kernel and a DTB file, and optionally a ramdisk.
That's not a system that lends itself easily to automation. Fixed IP addresses are the exception rather than the rule; the dispatcher itself doesn't care if its IP address changes, as long as the restarted slave can still contact the master over ZMQ.
It sounds a lot like a system that can work for single jobs on a developer desk but it has clear problems when it comes to automating the process at any level of scale.
You are right, the administrator of the lab would need to reflash if the dispatcher had to change IP.
OK. As long as that restriction is documented, that's fine.
(There'll need to be some docs added to lava-server with a file like doc/v2/actions-foo.rst and maybe something about how to deploy a device that needs to use this method.)
Because the paths are set when the FW image is flashed, we cannot use the per-job directory. Thus we add a parameter to the device that is to be set in the device-specific template of Chrome devices.
So that parameter needs to appear in the device-type template and therefore in the device.yaml delivered to the dispatcher. This way, there will be a readily identifiable value in the device['parameters'] block for which the accepts() can check. The validate() function in the Action initialised by the Strategy (once the accepts classmethod has finished) can then retrieve the value and do checks where required, setting self.errors if there are problems. (Do not raise exceptions in validate - the purpose is to collect all possible errors in a single run of all validate functions in the populated pipeline.) That is all standard V2.
If I understand correctly, that matches what the patch does, other than that there's no validation yet (but I'm not sure what would need to be validated in this specific case).
By validation I mean the lava-dispatch --validate command which uses the validate() function declared for each Action used by the job.
depthcharge.py doesn't have a validate at all yet, so these are some guidelines on what needs to go into validate.
0: every time run() reads values from self.parameters or self.job.device (e.g. self.job.device['actions']['boot']['methods']), the keys used in that call need to be checked for existence in validate() (either this validate or one in a base class). The parser checks some (like actions: boot) and the Strategy checks others (e.g. that the method is acceptable to the Action), but beyond that, all subsequent keys need to be checked in validate and the values also need to be checked in case the code expects an int, a str, a list or whatever. Each check sets self.errors on failure instead of letting an exception be raised. If later checks rely on a key which does not exist, a return is fine after setting self.errors. (See the sketch after this list.)
1: Whenever run() assumes or checks that something exists on the filesystem of the dispatcher, including whether a necessary tool is installed, the check itself moves to validate() and run() can then use the tool without needing to check.
2: All preparation of strings and objects which will later be used in run() can be done in validate - this has the advantage that it makes it much easier to test these strings and objects in the unit tests, because you don't have to run the action, just validate it. To support this, a sample device configuration and a sample test job are supplied alongside the code, plus a pipeline_ref file (see the readme in pipeline_refs). This ensures that if anything about the Strategy or the populate functions changes, the unit tests will pick up how it affects your method without the team needing to run an actual test using real hardware (which they don't have).
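Pulling those three points together, the shape of such a validate() could be sketched like this (the key names, the 'depthcharge_dir' parameter and the mkimage check are examples only, not requirements of the patch, and the import path is as in the V2 source tree):

    # Sketch only - illustrates the guidelines above, not the real action.
    import os
    import shutil

    from lava_dispatcher.pipeline.action import Action  # V2 pipeline base class

    class ExampleDepthchargeAction(Action):

        def validate(self):
            super(ExampleDepthchargeAction, self).validate()
            # 0: check every key and type that run() will rely on.
            boot = self.job.device.get('actions', {}).get('boot', {})
            if 'depthcharge' not in boot.get('methods', {}):
                self.errors = "depthcharge not configured for this device"
                return  # later checks depend on this key existing
            device_dir = self.job.device.get('parameters', {}).get('depthcharge_dir')
            if not isinstance(device_dir, str):
                self.errors = "depthcharge_dir must be set to a string"
                return
            # 1: check dispatcher tools here so run() can just use them.
            if shutil.which('mkimage') is None:  # stand-in for whatever helper the codebase provides
                self.errors = "mkimage not installed on the dispatcher"
            # 2: prepare values for run() here so that the unit tests can
            #    inspect them after validate() without running the action.
            self.cmdline_path = os.path.join(device_dir, 'cmdline.txt')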
This is all to try and help the team not inadvertently break your support in a future release.
If that parameter is present, then a directory named with the value of that parameter will be created in the root of the TFTP files tree.
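For illustration, the kind of helper being discussed (along the lines of the _ensure_device_dir() in the review) can be very small, assuming the dispatcher has local write access to the TFTP root; the paths and device name below are examples only:

    # Sketch only - assumes a locally writable TFTP root.
    import os

    def ensure_device_dir(tftp_root, device_dir):
        """Create <tftp_root>/<device_dir> if it does not already exist."""
        path = os.path.join(tftp_root, device_dir)
        if not os.path.isdir(path):
            os.makedirs(path)
        return path

    # e.g. ensure_device_dir('/srv/tftp', 'peach-pi-01') returns
    # '/srv/tftp/peach-pi-01', which is where Depthcharge would then
    # expect cmdline.txt and the FIT image to appear.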
So the dispatcher needs read:write access on the TFTP server as well?
Not sure I get this. The patch just introduces the possibility for some devices to have their TFTP directories named differently.
I thought it was using a remote TFTP server, so don't worry about that. I'm not sure why but I thought it was a completely separate TFTP server.
So what I'm trying to achieve here is that we cover the following constraints:
0: a new deployment method which is only usable on certain hardware that is not currently available to the lava software team can be difficult to maintain. The software team need some assurance that their later changes will not break your contribution. To deliver that, the team need as much of the guts of your method as possible exposed through the unit tests, so that all that is left to test is whether the validate() function gives the run() function exactly the same data as last time.
1: the code needs to follow the (sometimes undocumented) conventions elsewhere in the code to not break expectations when the software team are doing large scale changes to optimise across the codebase. This will continue to happen as more deployment methods are added and more common ground is identified that did not exist before.
2: the documentation itself needs to be improved and you will be in a key position to do that because you will have come across most of those undocumented conventions. I hope you'll have some feedback on how we can improve the Developer Guide in the docs to give some clarity to others who want to contribute. As with all docs written by the developers, it needs input from new people to make the content useful outside the team of developers who wrote it.
3: the code needs to scale. The unit test coverage is key to this as well.
4: lava is in the middle of a migration and it makes things harder for everyone that the V1 code is still around to confuse matters. (This is less of an issue for your specific review, but it is still a factor that could easily put some people off contributing.)
Hopefully, that helps cover most of the reasons why the review has had the comments it has so far. With some changes to reflect these factors, I'd be happy to merge the support for depthcharge but it does need some of these things to be fixed first.
Regards,
Tomeu
How is that to be achieved? Where will this TFTP server be located? Is it shared by multiple instances or multiple dispatchers? Who manages any filesystem collisions? What process cleans up the per-job directories when the job completes or fails? Hint: a mountpoint on the dispatcher is not going to be workable either; we already have quite enough issues with SSHFS in V1 without going there again. We need to avoid solutions which make assumptions about local network topology. NFS is for DUTs only.
This is beginning to sound like it needs a full Protocol, similar to LXC or VLANd. A daemon would exist on the remote TFTP server which responds to (authenticated? encrypted?) requests from the dispatchers over a declared protocol (e.g. a tcp socket). The daemon makes the changes requested, manages conflicts, reports errors and generally keeps the remote TFTP server running and tidy. The dispatcher uses support just like VLANd or MultiNode to communicate with the daemon remotely, getting the IP address (of the daemon) from the device parameters.
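Purely to illustrate the shape of such a protocol (nothing like this exists yet - the port, the request format and the actions are all invented for the example), the daemon side could start out as small as a socket server handling create and cleanup requests:

    # Invented example of the daemon side - no such protocol exists yet.
    import json
    import os
    import shutil
    import socketserver

    TFTP_ROOT = '/srv/tftp'  # example path

    class TFTPDirHandler(socketserver.StreamRequestHandler):
        def handle(self):
            try:
                request = json.loads(self.rfile.readline().decode())
                # basename() keeps requests confined to the TFTP root.
                path = os.path.join(TFTP_ROOT, os.path.basename(request['job_dir']))
                if request['action'] == 'create':
                    os.makedirs(path, exist_ok=True)
                elif request['action'] == 'cleanup':
                    shutil.rmtree(path, ignore_errors=True)
                reply = {'result': 'ok', 'path': path}
            except (OSError, KeyError, ValueError) as exc:
                reply = {'result': 'error', 'message': str(exc)}
            self.wfile.write(json.dumps(reply).encode() + b'\n')

    if __name__ == '__main__':
        # The dispatcher would read this address from the device parameters,
        # much as it does for VLANd.
        socketserver.TCPServer(('0.0.0.0', 3079), TFTPDirHandler).serve_forever()

The real thing would, of course, also need the authentication, conflict handling and error reporting described above.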
The TFTP server doesn't need to be restarted because its configuration is left unchanged, we just create a directory where depthcharge will look for the files.
That's the basic problem: "we just create" 'foo on a remote server' is not going to work. The 'we' there needs to be fully automated, it needs to be managed for multiple simultaneous jobs, it needs to tidy up only when instructed and it will have to report errors and keep itself running. Otherwise, none of this will actually scale.