Hello,

The second known reason for a job to get stuck is that lava-logs itself is stuck.

I found a bug yesterday in the callbacks that can block lava-logs forever: the callback HTTP request has no timeout.
If the remote server never answers (I was testing with netcat as the remote server, and netcat was not sending anything back), then lava-logs (the process that sends the notifications) will wait forever.
A patch is available here: https://git.lavasoftware.org/lava/lava/merge_requests/113
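
To illustrate the missing timeout described above, here is a minimal sketch using only the Python standard library. It is not the code from the merge request: the helper name send_callback and the 5-second default are assumptions made for the example.

```python
import urllib.request


def send_callback(url, timeout=5.0):
    """Hypothetical helper: GET the callback URL and return the HTTP status.

    Returns None if the request fails or the server does not answer within
    `timeout` seconds. Without the timeout argument, urlopen() could block
    indefinitely on a server that accepts the connection but never replies.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except OSError:
        # URLError, HTTPError (4xx/5xx) and socket timeouts are all
        # subclasses of OSError, so they all end up here.
        return None
```

With a silent remote server (like the netcat test), the call returns None after roughly `timeout` seconds instead of hanging.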

As you are using a callback in the given job, that might be the cause.


Rgds


On Tue, 9 Oct 2018 at 10:57, Remi Duraffort <remi.duraffort@linaro.org> wrote:
Hello Corentin,

From what I can see in the job logs, the lava-run process was not killed cleanly, as the last lines of the logs are missing (like https://validation.linaro.org/scheduler/job/1894511#results_33286343). Even if lava-run crashes, the last line should still be sent.
So was the server running lava-run restarted? Do you know what happened to the lava-run process?

To understand what happened there, a job cycle is:
lava-master => lava-slave: START
lava-slave => lava-master: START_OK when lava-run is started
lava-run => lava-logs: send the logs
When the job is about to finish, lava-run logs the last result (the lava.job result, with pass or fail).
When lava-logs receives that log line, it can mark the TestJob as finished and record the job health (canceled, success or failure).
At the same time, lava-slave notices that lava-run has finished and sends an END message to lava-master.
But lava-master won't do anything until lava-logs has marked the TestJob as finished, because until then the logs have not all been received.

In your case, the last line of the log is missing, so lava-logs can't mark the job as finished. Meanwhile, lava-master and lava-slave are both waiting for lava-logs.
That's why I added a "fail" button that can force this transition when, for some reason such as a server crash, the last line of the log is never going to be sent.
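
The cycle and the stuck state described above can be sketched as a toy model. This is only an illustration, not LAVA's actual code: the class name JobModel, its method names and the state strings are all assumptions made for the example.

```python
class JobModel:
    """Toy model of one job as seen by lava-master and lava-logs (hypothetical)."""

    def __init__(self):
        self.state = "submitted"
        self.health = None
        self.end_received = False  # END message from lava-slave seen by lava-master

    def start_ok(self):
        # lava-slave => lava-master: START_OK once lava-run is started
        self.state = "running"

    def final_log_line(self, result):
        # lava-logs received the last "lava.job" result line from lava-run
        self.state = "finished"
        self.health = "success" if result == "pass" else "failure"

    def end_from_slave(self):
        # lava-slave noticed that lava-run exited and sent END to lava-master
        self.end_received = True

    def force_fail(self):
        # the "fail" button: force the transition when the last line never arrives
        self.state = "finished"
        self.health = "failure"

    def master_can_proceed(self):
        # lava-master acts on END only once lava-logs has marked the job as finished
        return self.end_received and self.state == "finished"


job = JobModel()
job.start_ok()
job.end_from_slave()             # END arrives, but the last log line was lost
stuck = not job.master_can_proceed()
job.force_fail()                 # manual intervention unblocks the job
unblocked = job.master_can_proceed()
```

In this model, a job whose last log line is lost stays in "running" even after END is received, which matches the symptom reported; only force_fail() (or the missing log line) moves it on.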


Rgds.

On Wed, 26 Sep 2018 at 09:43, LABBE Corentin <clabbe@baylibre.com> wrote:
On Tue, Sep 25, 2018 at 09:03:03AM +0100, Neil Williams wrote:
> On Tue, 25 Sep 2018 at 08:56, Corentin Labbe <clabbe@baylibre.com> wrote:
>
> > Hello
> >
> > We got a job (number 332) stuck in running state.
> > After 23h of inaction, the only way to stop it was to cancel+fail it.
> > According to the logs, a brief disconnection happened between the slave and
> > master.
> > The slave seems to try to update the final status of the job, but the
> > master "ignores" it.
> >
>
> Which Debian package version(s) of lava-dispatcher are installed on the slave,
> and of lava-server on the master?

On slave:
ii  lava-common                    2018.7-1+stretch                        all          Linaro Automated Validation Architecture common
ii  lava-dispatcher                2018.7-1+stretch                        amd64        Linaro Automated Validation Architecture dispatcher

On master:
ii  lava                                 2018.7-1+stretch                        all          Linaro Automated Validation Architecture metapackage
ii  lava-common                          2018.7-1+stretch                        all          Linaro Automated Validation Architecture common
ii  lava-coordinator                     0.1.7-1                                 all          LAVA Coordinator daemon
ii  lava-dev                             2018.7-1+stretch                        all          Linaro Automated Validation Architecture developer support
ii  lava-dispatcher                      2018.7-1+stretch                        amd64        Linaro Automated Validation Architecture dispatcher
ii  lava-server                          2018.7-1+stretch                        all          Linaro Automated Validation Architecture server
ii  lava-server-doc                      2018.7-1+stretch                        all          Linaro Automated Validation Architecture documentation

>
> Can you attach (rather than inline) the test job log file (output.yaml)?
> The job ended very quickly, looks like a validate error.

This is the job output
https://lava.automotivelinux.org/scheduler/job/332

>
> It also looks like both master and slave went offline at the same time. Can
> you confirm that master and slave are both running in timezone UTC & if
> both have ntp installed?

Yes I confirm

_______________________________________________
Lava-users mailing list
Lava-users@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lava-users


--
Rémi Duraffort
LAVA Team

