I have a multi-node test involving 13 roles that is no longer syncing properly after upgrading to 2016.11 this morning. It seems that 2 or 3 nodes end up waiting for a specific message while the others get past that message and move on to the next one. Looking at the dispatcher log, I don't see any errors, but it only logs sends to some of the nodes. For example, I see messages like these for each node that works in a run:
2016-11-10 13:10:37,295 Sending wait messageID 'qa-network-info' to /var/lib/lava/dispatcher/slave/tmp/7615/device.yaml in group 2651c0a0-811f-4b77-bc07-22af31744fe5: {"/var/lib/lava/dispatcher/slave/tmp/7619/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7613/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7623/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7620/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7611/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7621/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7622/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7617/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7618/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7614/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7615/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7616/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7612/device.yaml": {}}
2016-11-10 13:10:37,295 Sending wait response to /var/lib/lava/dispatcher/slave/tmp/7615/device.yaml in group 2651c0a0-811f-4b77-bc07-22af31744fe5: {"message": {"/var/lib/lava/dispatcher/slave/tmp/7619/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7613/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7623/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7620/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7611/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7621/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7622/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7617/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7618/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7614/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7615/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7616/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7612/device.yaml": {}}, "response": "ack"}
For the nodes that get stuck, there are no messages like the ones above.
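To pin down which nodes never get the send in a given run, something like the following rough sketch could be run over the dispatcher log. It assumes the "Sending wait messageID" line format shown in the excerpt, and it treats the keys of the JSON payload as the full set of nodes taking part in the sync; both are guesses from this one log sample, not documented behaviour. The function name `missing_recipients` is made up for illustration.

```python
import json
import re

# Header format copied from the log excerpt; other releases may log differently.
HEADER = re.compile(
    r"Sending wait messageID '([^']+)' to (\S+) in group ([0-9a-f-]+): "
)


def missing_recipients(lines, message_id):
    """Map each group to the nodes named in a payload that were never sent message_id.

    Assumption: the payload keys are the nodes participating in the sync.
    """
    decoder = json.JSONDecoder()
    sent, listed = {}, {}
    for line in lines:
        # finditer copes with several log entries fused onto one line.
        for m in HEADER.finditer(line):
            msg, node, group = m.groups()
            if msg != message_id:
                continue
            sent.setdefault(group, set()).add(node)
            try:
                # Decode only the JSON object that starts right after the header.
                payload, _ = decoder.raw_decode(line, m.end())
            except ValueError:
                continue  # truncated or non-JSON payload; skip it
            listed.setdefault(group, set()).update(payload)
    return {g: sorted(listed.get(g, set()) - sent[g]) for g in sent}

# Example use (the log path is whatever your dispatcher writes to):
#   with open(log_path) as fh:
#       print(missing_recipients(fh, "qa-network-info"))
```

Any path it reports for a group is a node that appears in the payload but has no matching send line in the log.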
All of the nodes are of the qemu type, all on the same host. Which nodes fail is not consistent from run to run, but 2 or 3 have failed in every run I have tried.
Is there anything I can look at here to figure out what is happening?
-- James Oakley james.oakley@multapplied.net