Hello,
We have an installation where we use LAVA 2017.12. We are regularly seeing jobs that remain stuck for several days.
For example, I have a job right now on the Armada XP GP that has been stuck for 1 day and 11 hours. The log visible in the LAVA Web interface looks like this:
http://code.bulix.org/7pvru8-255308?raw
This is job #855671 in our setup.
The logs on the lava-slave look like this:
http://code.bulix.org/c5tejy-255312?raw
So, from the lava-slave point of view, the job is finished.
However, the "Job END" message had to be resent several times to the master. Interestingly, this sequence led to a very nice:
ERROR [855671] lava-run crashed
On the lava-master side (which runs on another machine), the logs look like this:
http://code.bulix.org/b61keb-255316?raw
And this happens for lots of jobs. Pretty much every day or two, we have ten boards stuck in this situation.
I have the lava-master logs with DEBUG messages, if that would be helpful. However, most DEBUG messages don't include the job number, which makes it difficult to associate them with the problematic job (since numerous other jobs are running at the same time).
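As a stopgap, the DEBUG lines could be correlated with a job by proximity: keep every line that mentions the job number, plus a few lines before and after each mention. Below is a minimal sketch of that idea. The function name `lines_near_job` and the sample log lines are hypothetical and do not reflect the actual lava-master log format; with many jobs running in parallel, the surrounding lines may well belong to other jobs, so this only narrows the search.

```python
import re


def lines_near_job(lines, job_id, context=3):
    """Return lines mentioning job_id, plus `context` lines around each mention.

    Hypothetical helper: it assumes nothing about the log format beyond the
    job number appearing as a standalone number on some lines.
    """
    # Match the job number only when it is not part of a longer number.
    pattern = re.compile(r"(?<!\d)" + re.escape(str(job_id)) + r"(?!\d)")
    keep = set()
    for i, line in enumerate(lines):
        if pattern.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            keep.update(range(start, end))
    # Preserve the original ordering of the log.
    return [lines[i] for i in sorted(keep)]


# Made-up sample lines, for illustration only.
sample = [
    "DEBUG a",
    "DEBUG b",
    "INFO [855671] Job END",
    "DEBUG c",
    "DEBUG d",
    "ERROR [1855671] other job",
]

for line in lines_near_job(sample, 855671, context=1):
    print(line)
```

On a real log file, the lines would be read with something like `open(path).read().splitlines()`, and `context` would be tuned to how busy the master is.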
Does anyone have an idea what could be causing this, or how to debug it further?
Best regards,
Thomas Petazzoni