On 29 May 2018 at 15:12, Robert Marshall <robert.marshall@codethink.co.uk> wrote:
Neil Williams <neil.williams@linaro.org> writes:

> On 23 May 2018 at 15:36, Robert Marshall <robert.marshall@codethink.co.uk> wrote:
>
>  At some point last week, I think because of network connectivity issues,
>  a job got stuck and I cancelled it; when run again, it again appeared to hang. I again
>  cancelled it and am now seeing the health check not start (at least no
>  output appears on the job's web page).
>
> What is the status of the relevant device(s) and any associated test jobs?

The status of the device was Bad. As the problems with the device have now
resolved, maybe it is hard to diagnose further? But I have added below what I
can see.

So a health check failed. You will need to resolve the problem and re-run the health check by setting the health to Unknown.
 

>
> Check the /var/log/lava-server/lava-master.log for the reasons why the device is not being assigned.
>
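The lava-master.log entries quoted further down start with a timestamp, a level and the job id in brackets (for example "ERROR [32]"), with traceback lines continuing underneath. A minimal sketch for pulling out the entries for a single job, assuming that layout holds for every line of the file; entries_for_job is a hypothetical helper, not part of LAVA:

```python
import re
from typing import Iterable, List

# Header of a lava-master.log entry, modelled on the excerpt in this thread:
#   "2018-05-23 14:26:18,620   ERROR [32] Error: ..."
# Entries without a bracketed job id are assumed to simply omit it.
# This layout is an assumption, not taken from LAVA's source.
HEADER_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s+(?P<level>[A-Z]+)"
    r"(?:\s+\[(?P<job>\d+)\])?"
)


def entries_for_job(lines: Iterable[str], job_id: int, level: str = "ERROR") -> List[str]:
    """Hypothetical helper: collect the log entries for one job at one level.

    Lines without a timestamp header (such as traceback frames) are treated
    as continuations of the entry above them.
    """
    entries: List[str] = []
    keep = False  # does the entry currently being read match job_id and level?
    for raw in lines:
        line = raw.rstrip("\n")
        match = HEADER_RE.match(line)
        if match:
            keep = match.group("level") == level and match.group("job") == str(job_id)
            if keep:
                entries.append(line)
        elif keep:
            # Continuation line (e.g. a traceback frame) of a matching entry.
            entries[-1] += "\n" + line
    return entries
```

On the server, iterating over the open file works directly, e.g. with open("/var/log/lava-server/lava-master.log") as f, print each item of entries_for_job(f, 32).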

I think this was when it was failing rather than when I was cancelling it.

2018-05-23 14:26:18,620   ERROR [32] Error: b'Traceback (most recent call last):
  File "/usr/bin/lava-run", line 246, in <module>
    sys.exit(main())
  File "/usr/bin/lava-run", line 233, in main
    logger.close()  # pylint: disable=no-member
  File "/usr/lib/python3/dist-packages/lava_dispatcher/log.py", line 87, in close
    self.handler.close(linger)
  File "/usr/lib/python3/dist-packages/lava_dispatcher/log.py", line 71, in close
    self.context.destroy(linger=linger)
  File "zmq/backend/cython/context.pyx", line 244, in zmq.backend.cython.context.Context.destroy (zmq/backend/cython/context.c:3067)
  File "zmq/backend/cython/context.pyx", line 136, in zmq.backend.cython.context.Context.term (zmq/backend/cython/context.c:2348)
  File "zmq/backend/cython/checkrc.pxd", line 12, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/context.c:3216)
  File "/usr/bin/lava-run", line 151, in cancelling_handler
    raise JobCanceled("The job was canceled")
lava_dispatcher.action.JobCanceled: The job was canceled
'

Though this is perhaps more interesting:

2018-05-23 14:26:18,655   ERROR [32] Unable to dump 'description.yaml'
2018-05-23 14:26:18,655   ERROR [32] Compressed data ended before the end-of-stream marker was reached
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/lava_server/management/commands/lava-master.py", line 333, in _handle_end
    description = lzma.decompress(compressed_description)
  File "/usr/lib/python3.5/lzma.py", line 340, in decompress
    raise LZMAError("Compressed data ended before the "
_lzma.LZMAError: Compressed data ended before the end-of-stream marker was reached

It is likely that the test job specifies the wrong compression or that the file is invalid.
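The traceback shows lava-master calling lzma.decompress() on compressed_description. Python raises exactly this LZMAError when the data it is given ends before a complete stream has been read (for example, when it was truncated or is empty). A minimal sketch that reproduces the error on deliberately truncated data and checks a blob for completeness; is_complete_xz and the sample YAML are hypothetical, for illustration only:

```python
import lzma


def is_complete_xz(blob: bytes) -> bool:
    """Hypothetical helper: True if blob holds a complete LZMA/XZ stream.

    LZMADecompressor.eof only becomes True once the end-of-stream marker
    has been read; lzma.decompress() raises "Compressed data ended before
    the end-of-stream marker was reached" when it is still False.
    """
    decomp = lzma.LZMADecompressor()  # FORMAT_AUTO: accepts .xz and legacy .lzma
    try:
        decomp.decompress(blob)
    except lzma.LZMAError:
        return False  # the data is not valid LZMA/XZ at all
    return decomp.eof


# Hypothetical stand-in for a job description, compressed and then cut short.
description = b"job:\n  id: 32\n"
compressed = lzma.compress(description)
truncated = compressed[: len(compressed) // 2]

print(is_complete_xz(compressed))  # True
print(is_complete_xz(truncated))   # False

try:
    lzma.decompress(truncated)
except lzma.LZMAError as exc:
    print(exc)  # Compressed data ended before the end-of-stream marker was reached
```

Running is_complete_xz over the bytes that reach lava-master would separate a cut-off payload from one that is not LZMA data at all.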

 


> Check the status of all daemons, including lava-logs
>

WARNING lava-logs is offline: can't schedule jobs

Check the rest of that log file and the systemd status of the lava-logs service, and make sure that service can run normally.
 



> sudo service lava-master status
> sudo service lava-logs status
> sudo service lava-slave status
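On a systemd host (the reply above mentions the systemd status of lava-logs), "systemctl is-active <unit>" prints a single state word such as active or failed. A minimal sketch that checks the same three daemons that way rather than reading the service status output by eye; the helper names are hypothetical, and it assumes the systemd unit names match the service names above:

```python
import subprocess
from typing import Callable, Dict, List, Sequence

# Unit names assumed to match the services checked above.
SERVICES = ("lava-master", "lava-logs", "lava-slave")


def systemctl_state(name: str) -> str:
    """Return the state systemd reports for one unit, e.g. 'active' or 'failed'."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", name],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return "unknown"  # systemctl is not installed on this machine
    # is-active exits non-zero for inactive units; the state word is still printed.
    return result.stdout.strip() or "unknown"


def service_states(
    names: Sequence[str] = SERVICES,
    query: Callable[[str], str] = systemctl_state,
) -> Dict[str, str]:
    """Map each service to its state; query can be swapped out for testing."""
    return {name: query(name) for name in names}


def not_active(states: Dict[str, str]) -> List[str]:
    """Services whose state is anything other than 'active'."""
    return [name for name, state in states.items() if state != "active"]
```

On the server, print(not_active(service_states())) lists any daemon that needs attention; an empty list means all three report active.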

>  Looking at the output.yaml (in /var/lib/lava-server/default/media/job-output/2018/05/23/32) I see
>  ... progress output for downloading https://images.validation.linaro.org/kvm/standard/stretch-2.img.gz
>
>  - {"dt": "2018-05-23T07:39:54.728015", "lvl": "debug", "msg": "[common] Preparing overlay tarball in /var/lib/lava/dispatcher/tmp/32/lava-overlay-aye3n2ke"}
>  - {"dt": "2018-05-23T07:39:54.728root@stretch:/var/lib/lava-server/default/media/job-output/2018/05/23/32
>
>  But none of this appears in http://localhost:8080/scheduler/job/32
>
>  and at the head of that page I see the message:
>
>  Unable to parse invalid logs: This is maybe a bug in LAVA that should be reported.
>
>  Which other logs are best for checking whether this is an error that
>  should be fed back?
>
>  (LAVA 2018.4)
>
>  Robert
>  _______________________________________________
>  Lava-users mailing list
>  Lava-users@lists.linaro.org
>  https://lists.linaro.org/mailman/listinfo/lava-users
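In the output.yaml excerpt above, the final entry stops partway through its timestamp and the shell prompt follows directly, which suggests the file ends with an incomplete entry. A truncated last entry is one plausible reason the job page reports "Unable to parse invalid logs", though the thread does not confirm it. A minimal sketch for locating the first entry that does not parse; first_bad_entry is a hypothetical helper, and it assumes one JSON-compatible entry per line, as in the excerpt (the file itself is YAML, so a complete check would use a YAML parser):

```python
import json
from typing import Iterable, Optional, Tuple


def first_bad_entry(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Hypothetical helper: return (line_number, text) of the first bad entry.

    Each entry is assumed to be a single line of the form
        - {"dt": "...", "lvl": "...", "msg": "..."}
    i.e. a YAML list item whose value is also valid JSON. Returns None when
    every non-blank line parses.
    """
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines carry no entry
        if not line.startswith("- "):
            return number, line  # not a list item at all
        try:
            entry = json.loads(line[2:])
        except json.JSONDecodeError:
            return number, line  # e.g. an entry cut off mid-string
        if not isinstance(entry, dict):
            return number, line
    return None
```

Pointed at /var/lib/lava-server/default/media/job-output/2018/05/23/32/output.yaml, this would report the line number and text of the cut-off entry, if that is indeed the problem.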

--
Robert Marshall, Software Developer                       Codethink Ltd
Telephone: +44 7762 840 414       3rd Floor, Dale House, 35 Dale Street
https://www.codethink.co.uk/         MANCHESTER, M1 2HF. United Kingdom
We respect your privacy.   See https://www.codethink.co.uk/privacy.html


