I have a MultiNode test involving 13 roles that is no longer syncing properly after upgrading to 2016.11 this morning. In every run, 2 or 3 nodes end up waiting forever on a specific message while the others receive it and move on to the next step. The dispatcher log shows no errors, but it only records the message being sent to some of the nodes. For example, for the nodes that do work in a given run I see messages like this:
2016-11-10 13:10:37,295 Sending wait messageID 'qa-network-info' to /var/lib/lava/dispatcher/slave/tmp/7615/device.yaml in group 2651c0a0-811f-4b77-bc07-22af31744fe5: {"/var/lib/lava/dispatcher/slave/tmp/7619/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7613/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7623/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7620/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7611/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7621/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7622/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7617/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7618/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7614/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7615/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7616/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7612/device.yaml": {}}
2016-11-10 13:10:37,295 Sending wait response to /var/lib/lava/dispatcher/slave/tmp/7615/device.yaml in group 2651c0a0-811f-4b77-bc07-22af31744fe5: {"message": {"/var/lib/lava/dispatcher/slave/tmp/7619/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7613/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7623/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7620/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7611/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7621/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7622/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7617/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7618/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7614/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7615/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7616/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7612/device.yaml": {}}, "response": "ack"}
For the nodes that get stuck, there is no message like the above.
All of the nodes are of the qemu device type, all on the same host. Which nodes fail is not consistent, but there always seem to be 2 or 3 stuck in every run I have tried.
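For context, the synchronisation in question is done with the MultiNode API in the test definitions, along these lines (just a sketch; 'qa-network-info' is the message ID from the log, and the actual commands and keys depend on our test definitions):

    # on the role that publishes its network details
    lava-send qa-network-info ip=$(hostname -I | awk '{print $1}')

    # on the roles that consume them; this is where the stuck nodes appear to block
    lava-wait qa-network-info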
Is there anything I can look at here to figure out what is happening?
--
James Oakley
james.oakley(a)multapplied.net
[Moving to lava-users as suggested by Neil]
On 11/07/2016 03:20 PM, Neil Williams (Code Review) wrote:
> Neil Williams has posted comments on this change.
>
> Change subject: Add support for the depthcharge bootloader
> ......................................................................
>
>
>
> Patch Set 3:
>
> (1 comment)
>
> https://review.linaro.org/#/c/15203/3/lava_dispatcher/pipeline/actions/depl…
>
> File lava_dispatcher/pipeline/actions/deploy/tftp.py:
>
> Line 127: def _ensure_device_dir(self, device_dir):
>> Cannot say that I have fully understood it yet. Would it be correct
>> if the
>
> The Strategy classes must not set or modify anything. The accepts
> method does some very fast checks and returns True or False. Anything
> which the pipeline actions need to know must be specified in the job
> submission or the device configuration. So either this is restricted
> to specific device-types (so a setting goes into the template) or it
> has to be set for every job using this method (for situations where
> the support can be used or not used on the same hardware for
> different jobs).
>
> What is this per-device directory anyway and how is it meant to work
> with tftpd-hpa which does not support configuration modification
> without restarting itself? Jobs cannot require that daemons restart -
> other jobs could easily be using that daemon at the same time.
So each firmware image containing Depthcharge will also contain
hardcoded values for the IP address of the TFTP server and for the
paths of a cmdline.txt file and a FIT image. The FIT image contains a
kernel and a DTB, and optionally a ramdisk.
Because those paths are fixed when the firmware image is flashed, we
cannot use the per-job directory. Instead we add a parameter to the
device, to be set in the device-specific templates of Chrome devices.
If that parameter is present, a directory named after its value is
created in the root of the TFTP file tree.
The TFTP server does not need to be restarted because its configuration
is left unchanged; we just create a directory, inside the root it
already serves, where Depthcharge will look for the files.
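To make that concrete, the helper discussed in tftp.py would do little
more than this (a rough sketch, not the actual patch; the base directory
and the parameter name used here are only illustrative):

    import os

    # Sketch only: assumes the TFTP root (e.g. /srv/tftp for tftpd-hpa) and
    # the per-device directory name come from the device configuration, via
    # a hypothetical "tftp_device_dir" parameter set in the device template.
    def ensure_device_dir(tftp_base_dir, device_dir):
        # Create e.g. /srv/tftp/<device_dir> once; tftpd-hpa keeps serving
        # from the same root, so no daemon restart or config change is needed.
        path = os.path.join(tftp_base_dir, device_dir)
        if not os.path.isdir(path):
            os.makedirs(path)
        return path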
Thanks,
Tomeu
> I think this needs to move from IRC and gerrit to a thread on the
> lava-users mailing list where the principles can be checked through
> more easily.
>
>