On 8 November 2016 at 15:46, Tomeu Vizoso tomeu.vizoso@collabora.com wrote:
On 11/08/2016 02:25 PM, Neil Williams wrote:
On 8 November 2016 at 12:54, Tomeu Vizoso tomeu.vizoso@collabora.com wrote:
[Moving to lava-users as suggested by Neil]
Trimming the CC list - the lava-team can reply using lava-users directly.
On 11/07/2016 03:20 PM, Neil Williams (Code Review) wrote:
Neil Williams has posted comments on this change.
https://review.linaro.org/#/c/15203/3/lava_dispatcher/pipeline/actions/deplo...
File lava_dispatcher/pipeline/actions/deploy/tftp.py:
Line 127: def _ensure_device_dir(self, device_dir):
Cannot say that I have fully understood it yet. Would it be correct if the
The Strategy classes must not set or modify anything. The accepts method does some very fast checks and returns True or False. Anything which the pipeline actions need to know must be specified in the job submission or the device configuration. So either this is restricted to specific device-types (so a setting goes into the template) or it has to be set for every job using this method (for situations where the support can be used or not used on the same hardware for different jobs).
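To put that contract into code, here is a minimal sketch of the shape of an accepts() (the class name and the 'depthcharge_dir' parameter are invented for illustration, not taken from the patch; the real class would subclass the relevant pipeline Strategy):

    # Sketch only - illustrative names, not the code under review.
    class ExampleDepthchargeStrategy(object):
        """Strategy classes stay read-only: fast checks, then True or False."""

        @classmethod
        def accepts(cls, device, parameters):
            # Reject quickly if the job is not asking for this method.
            if parameters.get('method') != 'depthcharge':
                return False
            # Only accept devices whose template declares the setting.
            if 'depthcharge_dir' not in device.get('parameters', {}):
                return False
            return True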
What is this per-device directory anyway, and how is it meant to work with tftpd-hpa, which does not support configuration modification without restarting itself? Jobs cannot require that daemons restart - other jobs could easily be using that daemon at the same time.
So each firmware image containing Depthcharge will also contain hardcoded values for the IP of the TFTP server and for the paths of a cmdline.txt file and a FIT image. The FIT image contains a kernel and a DTB file, and optionally a ramdisk.
That's not a system that lends itself easily to automation. Fixed IP addresses are the exception rather than the rule; the dispatcher itself doesn't care if its IP address changes, as long as the restarted slave can still contact the master over ZMQ.
It sounds a lot like a system that can work for single jobs on a developer desk but it has clear problems when it comes to automating the process at any level of scale.
You are right, the administrator of the lab would need to reflash if the dispatcher had to change IP.
OK. As long as that restriction is documented, that's fine.
(There'll need to be some docs added to lava-server with a file like doc/v2/actions-foo.rst and maybe something about how to deploy a device that needs to use this method.)
Because the paths are set when the FW image is flashed, we cannot use the per-job directory. Thus we add a parameter to the device that is to be set in the device-specific template of Chrome devices.
So that parameter needs to appear in the device-type template and therefore in the device.yaml delivered to the dispatcher. This way, there will be a readily identifiable value in the device['parameters'] block for which the accepts() can check. The validate() function in the Action initialised by the Strategy (once the accepts classmethod has finished) can then retrieve the value and do checks where required, setting self.errors if there are problems. (Do not raise exceptions in validate - the purpose is to collect all possible errors in a single run of all validate functions in the populated pipeline.) That is all standard V2.
If I understand correctly, that matches what the patch does, other than that there's no validation yet (but I'm not sure what would need to be validated in this specific case).
By validation I mean the lava-dispatch --validate command which uses the validate() function declared for each Action used by the job.
depthcharge.py doesn't have a validate at all yet, so these are some guidelines on what needs to go into validate.
0: every time run() reads values from self.parameters or self.job.device (e.g. self.job.device['actions']['boot']['methods']), the keys used in that call need to be checked for existence in validate() (either this validate or one in a base class). The parser checks some (like actions: boot) and the Strategy checks others (e.g. that the method is acceptable to the Action), but beyond that, all subsequent keys need to be checked in validate and the values also need to be checked in case the code expects an int, a str, a list or whatever. Each check sets self.errors on failure instead of letting an exception be raised. If later checks rely on a key which does not exist, a return is fine after setting self.errors. (See the sketch after this list.)
1: Whenever run() assumes or checks that something exists on the filesystem of the dispatcher, including whether a necessary tool is installed, the check itself moves to validate() and run() can then use the tool without needing to check.
2: All preparation of strings and objects which will later be used in run() can be done in validate - this has the advantage that it makes it much easier to test these strings and objects in the unit tests, because you don't have to run the action, just validate it. To support this, a sample device configuration and a sample test job are supplied alongside the code, plus a pipeline_ref file (see the readme in pipeline_refs). This ensures that if anything about the Strategy or the populate functions changes, the unit tests will pick up how it affects your method without the team needing to run an actual test using real hardware (which they don't have).
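Pulling those three points together, the shape of such a validate() could be sketched like this (the key names, the 'depthcharge_dir' parameter and the mkimage check are examples only, not requirements of the patch, and the import path is as in the V2 source tree):

    # Sketch only - illustrates the guidelines above, not the real action.
    import os
    import shutil

    from lava_dispatcher.pipeline.action import Action  # V2 pipeline base class

    class ExampleDepthchargeAction(Action):

        def validate(self):
            super(ExampleDepthchargeAction, self).validate()
            # 0: check every key and type that run() will rely on.
            boot = self.job.device.get('actions', {}).get('boot', {})
            if 'depthcharge' not in boot.get('methods', {}):
                self.errors = "depthcharge not configured for this device"
                return  # later checks depend on this key existing
            device_dir = self.job.device.get('parameters', {}).get('depthcharge_dir')
            if not isinstance(device_dir, str):
                self.errors = "depthcharge_dir must be set to a string"
                return
            # 1: check dispatcher tools here so run() can just use them.
            if shutil.which('mkimage') is None:  # stand-in for whatever helper the codebase provides
                self.errors = "mkimage not installed on the dispatcher"
            # 2: prepare values for run() here so that the unit tests can
            #    inspect them after validate() without running the action.
            self.cmdline_path = os.path.join(device_dir, 'cmdline.txt')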
This is all to try and help the team not inadvertently break your support in a future release.
If that parameter is present, then a directory named with the value of that parameter will be created in the root of the TFTP files tree.
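For illustration, the kind of helper being discussed (along the lines of the _ensure_device_dir() in the review) can be very small, assuming the dispatcher has local write access to the TFTP root; the paths and device name below are examples only:

    # Sketch only - assumes a locally writable TFTP root.
    import os

    def ensure_device_dir(tftp_root, device_dir):
        """Create <tftp_root>/<device_dir> if it does not already exist."""
        path = os.path.join(tftp_root, device_dir)
        if not os.path.isdir(path):
            os.makedirs(path)
        return path

    # e.g. ensure_device_dir('/srv/tftp', 'peach-pi-01') returns
    # '/srv/tftp/peach-pi-01', which is where Depthcharge would then
    # expect cmdline.txt and the FIT image to appear.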
So the dispatcher needs read:write access on the TFTP server as well?
Not sure I get this. The patch just introduces the possibility for some devices to have their TFTP directories named differently.
I thought it was using a remote TFTP server, so don't worry about that. I'm not sure why but I thought it was a completely separate TFTP server.
So what I'm trying to achieve here is that we cover the following constraints:
0: a new deployment method which is only usable on certain hardware that is not currently available to the lava software team can be difficult to maintain. The software team need some assurance that their later changes will not break your contribution. To deliver that, the team need as much of the guts of your method as possible exposed through the unit tests, so that all that is left to test is whether the validate() function gives the run() function exactly the same data as last time.
1: the code needs to follow the (sometimes undocumented) conventions elsewhere in the code to not break expectations when the software team are doing large scale changes to optimise across the codebase. This will continue to happen as more deployment methods are added and more common ground is identified that did not exist before.
2: the documentation itself needs to be improved and you will be in a key position to do that because you will have come across most of those undocumented conventions. I hope you'll have some feedback on how we can improve the Developer Guide in the docs to give some clarity to others who want to contribute. As with all docs written by the developers, it needs input from new people to make the content useful outside the team of developers who wrote it.
3: the code needs to scale. The unit test coverage is key to this as well.
4: lava is in the middle of a migration and it makes things harder for everyone that the V1 code is still around to confuse matters. (This is less of an issue for your specific review, but it is still a factor that could easily put some people off contributing.)
Hopefully, that helps cover most of the reasons why the review has had the comments it has so far. With some changes to reflect these factors, I'd be happy to merge the support for depthcharge but it does need some of these things to be fixed first.
Regards,
Tomeu
How is that to be achieved? Where will this TFTP server be located? Is it shared by multiple instances or multiple dispatchers? Who manages any filesystem collisions? What process cleans up the per-job directories when the job completes or fails? Hint: a mountpoint on the dispatcher is not going to be workable either; we already have quite enough issues with SSHFS in V1 without going there again. We need to avoid solutions which make assumptions about local network topology. NFS is for DUTs only.
This is beginning to sound like it needs a full Protocol, similar to LXC or VLANd. A daemon would exist on the remote TFTP server which responds to (authenticated? encrypted?) requests from the dispatchers over a declared protocol (e.g. a tcp socket). The daemon makes the changes requested, manages conflicts, reports errors and generally keeps the remote TFTP server running and tidy. The dispatcher uses support just like VLANd or MultiNode to communicate with the daemon remotely, getting the IP address (of the daemon) from the device parameters.
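Purely to illustrate the shape of such a protocol (nothing like this exists yet - the port, the request format and the actions are all invented for the example), the daemon side could start out as small as a socket server handling create and cleanup requests:

    # Invented example of the daemon side - no such protocol exists yet.
    import json
    import os
    import shutil
    import socketserver

    TFTP_ROOT = '/srv/tftp'  # example path

    class TFTPDirHandler(socketserver.StreamRequestHandler):
        def handle(self):
            try:
                request = json.loads(self.rfile.readline().decode())
                # basename() keeps requests confined to the TFTP root.
                path = os.path.join(TFTP_ROOT, os.path.basename(request['job_dir']))
                if request['action'] == 'create':
                    os.makedirs(path, exist_ok=True)
                elif request['action'] == 'cleanup':
                    shutil.rmtree(path, ignore_errors=True)
                reply = {'result': 'ok', 'path': path}
            except (OSError, KeyError, ValueError) as exc:
                reply = {'result': 'error', 'message': str(exc)}
            self.wfile.write(json.dumps(reply).encode() + b'\n')

    if __name__ == '__main__':
        # The dispatcher would read this address from the device parameters,
        # much as it does for VLANd.
        socketserver.TCPServer(('0.0.0.0', 3079), TFTPDirHandler).serve_forever()

The real thing would, of course, also need the authentication, conflict handling and error reporting described above.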
The TFTP server doesn't need to be restarted because its configuration is left unchanged, we just create a directory where depthcharge will look for the files.
That's the basic problem: "we just create" 'foo on a remote server' is not going to work. The 'we' there needs to be fully automated, it needs to be managed for multiple simultaneous jobs, it needs to tidy up only when instructed and it will have to report errors and keep itself running. Otherwise, none of this will actually scale.