At some point last week - I think because of network connectivity issues - a job got stuck and I cancelled it; when run again it again appeared to hang. I cancelled it again and am now seeing the health check not start (at least no output appears on the job's web page).
Looking at the output.yaml (in /var/lib/lava-server/default/media/job-output/2018/05/23/32) I see ... progress output for downloading https://images.validation.linaro.org/kvm/standard/stretch-2.img.gz, and then:
- {"dt": "2018-05-23T07:39:54.728015", "lvl": "debug", "msg": "[common] Preparing overlay tarball in /var/lib/lava/dispatcher/tmp/32/lava-overlay-aye3n2ke"}
- {"dt": "2018-05-23T07:39:54.728
at which point the file is cut off mid-entry and my shell prompt follows directly (root@stretch:/var/lib/lava-server/default/media/job-output/2018/05/23/32).
But none of this appears in http://localhost:8080/scheduler/job/32
and at the head of that page I see the message:
Unable to parse invalid logs: This is maybe a bug in LAVA that should be reported.
Which other logs are best for checking whether this is an error that should be fed back?
(LAVA 2018.4)
Robert
On 23 May 2018 at 15:36, Robert Marshall robert.marshall@codethink.co.uk wrote:
At some point last week - I think because of network connectivity issues - a job got stuck and I cancelled it; when run again it again appeared to hang. I cancelled it again and am now seeing the health check not start (at least no output appears on the job's web page).
What is the status of the relevant device(s) and any associated test jobs?
Check the /var/log/lava-server/lava-master.log for the reasons why the device is not being assigned.
Check the status of all daemons, including lava-logs
sudo service lava-master status
sudo service lava-logs status
sudo service lava-slave status
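If any of those looks unhealthy, the systemd tooling gives more detail - assuming the standard Debian packaging where each daemon is a systemd unit of the same name:

sudo systemctl status lava-master lava-logs lava-slave
sudo tail -n 100 /var/log/lava-server/lava-master.log

The status output shows whether a daemon has stopped or failed; the tail of lava-master.log shows the recent scheduling decisions for the device.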
Looking at the output.yaml (in /var/lib/lava-server/default/media/job-output/2018/05/23/32) I see ... progress output for downloading https://images.validation.linaro.org/kvm/standard/stretch-2.img.gz, and then:
- {"dt": "2018-05-23T07:39:54.728015", "lvl": "debug", "msg": "[common] Preparing overlay tarball in /var/lib/lava/dispatcher/tmp/32/lava-overlay-aye3n2ke"}
- {"dt": "2018-05-23T07:39:54.728root@stretch:/var/lib/lava-server/default/media/job-output/2018/05/23/32
But none of this appears in http://localhost:8080/scheduler/job/32
and at the head of that page I see the message:
Unable to parse invalid logs: This is maybe a bug in LAVA that should be reported.
Which other logs are best for checking whether this is an error that should be fed back?
(LAVA 2018.4)
Robert
Neil Williams neil.williams@linaro.org writes:
On 23 May 2018 at 15:36, Robert Marshall robert.marshall@codethink.co.uk wrote:
At some point last week - I think because of network connectivity issues - a job got stuck and I cancelled it; when run again it again appeared to hang. I cancelled it again and am now seeing the health check not start (at least no output appears on the job's web page).
What is the status of the relevant device(s) and any associated test jobs?
The status of the device was Bad - as the problems with the device have now been resolved it may be hard to diagnose much further, but I'm adding below what I can see.
Check the /var/log/lava-server/lava-master.log for the reasons why the device is not being assigned.
I think this was when it was failing rather than when I was cancelling it.
2018-05-23 14:26:18,620 ERROR [32] Error: b'Traceback (most recent call last):
  File "/usr/bin/lava-run", line 246, in <module>
    sys.exit(main())
  File "/usr/bin/lava-run", line 233, in main
    logger.close()  # pylint: disable=no-member
  File "/usr/lib/python3/dist-packages/lava_dispatcher/log.py", line 87, in close
    self.handler.close(linger)
  File "/usr/lib/python3/dist-packages/lava_dispatcher/log.py", line 71, in close
    self.context.destroy(linger=linger)
  File "zmq/backend/cython/context.pyx", line 244, in zmq.backend.cython.context.Context.destroy (zmq/backend/cython/context.c:3067)
  File "zmq/backend/cython/context.pyx", line 136, in zmq.backend.cython.context.Context.term (zmq/backend/cython/context.c:2348)
  File "zmq/backend/cython/checkrc.pxd", line 12, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/context.c:3216)
  File "/usr/bin/lava-run", line 151, in cancelling_handler
    raise JobCanceled("The job was canceled")
lava_dispatcher.action.JobCanceled: The job was canceled
'
Though this is maybe more interesting?:
2018-05-23 14:26:18,655 ERROR [32] Unable to dump 'description.yaml'
2018-05-23 14:26:18,655 ERROR [32] Compressed data ended before the end-of-stream marker was reached
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/lava_server/management/commands/lava-master.py", line 333, in _handle_end
    description = lzma.decompress(compressed_description)
  File "/usr/lib/python3.5/lzma.py", line 340, in decompress
    raise LZMAError("Compressed data ended before the "
_lzma.LZMAError: Compressed data ended before the end-of-stream marker was reached
Check the status of all daemons, including lava-logs
WARNING lava-logs is offline: can't schedule jobs
sudo service lava-master status
sudo service lava-logs status
sudo service lava-slave status

Looking at the output.yaml (in /var/lib/lava-server/default/media/job-output/2018/05/23/32) I see ... progress output for downloading https://images.validation.linaro.org/kvm/standard/stretch-2.img.gz, and then:
- {"dt": "2018-05-23T07:39:54.728015", "lvl": "debug", "msg": "[common] Preparing overlay tarball in /var/lib/lava/dispatcher/tmp/32/lava-overlay-aye3n2ke"}
- {"dt": "2018-05-23T07:39:54.728root@stretch:/var/lib/lava-server/default/media/job-output/2018/05/23/32
But none of this appears in http://localhost:8080/scheduler/job/32
and at the head of that page I see the message:
Unable to parse invalid logs: This is maybe a bug in LAVA that should be reported.
Which other logs are best for checking whether this is an error that should be fed back?
(LAVA 2018.4)
Robert
On 29 May 2018 at 15:12, Robert Marshall robert.marshall@codethink.co.uk wrote:
Neil Williams neil.williams@linaro.org writes:
On 23 May 2018 at 15:36, Robert Marshall <robert.marshall@codethink.co.uk> wrote:
At some point last week - I think because of network connectivity issues - a job got stuck and I cancelled it; when run again it again appeared to hang. I cancelled it again and am now seeing the health check not start (at least no output appears on the job's web page).
What is the status of the relevant device(s) and any associated test jobs?
The status of the device was Bad - as the problems with the device have now been resolved it may be hard to diagnose much further, but I'm adding below what I can see.
So a health check failed. You will need to resolve the problem and re-run the health check by setting the health to Unknown.
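If it is easier to do that from the command line on the server, something like the following may work - a rough, untested sketch which assumes this release exposes a health field with a HEALTH_UNKNOWN value on the Device model, and uses 'qemu01' purely as a placeholder hostname:

sudo lava-server manage shell <<'PYEOF'
# Placeholder hostname - substitute the real device name.
from lava_scheduler_app.models import Device
device = Device.objects.get(hostname="qemu01")
# HEALTH_UNKNOWN is assumed to exist in this release; saving the device
# should let the scheduler queue a fresh health check.
device.health = Device.HEALTH_UNKNOWN
device.save()
PYEOF

Piping into lava-server manage shell just runs the snippet in the same Django environment that LAVA itself uses.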
Check the /var/log/lava-server/lava-master.log for the reasons why the device is not being assigned.
I think this was when it was failing rather than when I was cancelling it.
2018-05-23 14:26:18,620 ERROR [32] Error: b'Traceback (most recent call last):
  File "/usr/bin/lava-run", line 246, in <module>
    sys.exit(main())
  File "/usr/bin/lava-run", line 233, in main
    logger.close()  # pylint: disable=no-member
  File "/usr/lib/python3/dist-packages/lava_dispatcher/log.py", line 87, in close
    self.handler.close(linger)
  File "/usr/lib/python3/dist-packages/lava_dispatcher/log.py", line 71, in close
    self.context.destroy(linger=linger)
  File "zmq/backend/cython/context.pyx", line 244, in zmq.backend.cython.context.Context.destroy (zmq/backend/cython/context.c:3067)
  File "zmq/backend/cython/context.pyx", line 136, in zmq.backend.cython.context.Context.term (zmq/backend/cython/context.c:2348)
  File "zmq/backend/cython/checkrc.pxd", line 12, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/context.c:3216)
  File "/usr/bin/lava-run", line 151, in cancelling_handler
    raise JobCanceled("The job was canceled")
lava_dispatcher.action.JobCanceled: The job was canceled
'
Though this is maybe more interesting?:
2018-05-23 14:26:18,655 ERROR [32] Unable to dump 'description.yaml'
2018-05-23 14:26:18,655 ERROR [32] Compressed data ended before the end-of-stream marker was reached
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/lava_server/management/commands/lava-master.py", line 333, in _handle_end
    description = lzma.decompress(compressed_description)
  File "/usr/lib/python3.5/lzma.py", line 340, in decompress
    raise LZMAError("Compressed data ended before the "
_lzma.LZMAError: Compressed data ended before the end-of-stream marker was reached
Likely that the test job specifies the wrong compression or that the file is invalid.
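A quick way to rule out a corrupt artefact is to test the compressed image directly - just a sketch, and gzip -t only validates the gzip stream, it says nothing about the image contents:

curl -sSL https://images.validation.linaro.org/kvm/standard/stretch-2.img.gz | gzip -t && echo "gzip stream OK"

It is also worth checking that, if the deploy section of the job definition declares a compression for that image, it matches the actual file (gz for a .img.gz).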
Check the status of all daemons, including lava-logs
WARNING lava-logs is offline: can't schedule jobs
Check the rest of that log file and the systemd status of the lava-logs service, make sure that service can run normally.
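For example, again assuming the default systemd unit names:

sudo systemctl status lava-logs
sudo journalctl -u lava-logs -n 200 --no-pager
sudo systemctl restart lava-logs

The journalctl output normally shows why lava-logs stopped; once it is running again, lava-master should be able to resume scheduling.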
sudo service lava-master status
sudo service lava-logs status
sudo service lava-slave status
Looking at the output.yaml (in /var/lib/lava-server/default/media/job-output/2018/05/23/32) I see ... progress output for downloading https://images.validation.linaro.org/kvm/standard/stretch-2.img.gz, and then:
- {"dt": "2018-05-23T07:39:54.728015", "lvl": "debug", "msg": "[common] Preparing overlay tarball in /var/lib/lava/dispatcher/tmp/32/lava-overlay-aye3n2ke"}
- {"dt": "2018-05-23T07:39:54.728root@stretch:/var/lib/lava-server/default/media/job-output/2018/05/23/32
But none of this appears in http://localhost:8080/scheduler/job/32
and at the head of that page I see the message:
Unable to parse invalid logs: This is maybe a bug in LAVA that should be reported.
Which other logs are best for checking whether this is an error that should be fed back?
(LAVA 2018.4)
Robert
--
Robert Marshall, Software Developer
Codethink Ltd
Telephone: +44 7762 840 414
3rd Floor, Dale House, 35 Dale Street, MANCHESTER, M1 2HF, United Kingdom
https://www.codethink.co.uk/
We respect your privacy. See https://www.codethink.co.uk/privacy.html
Neil Williams neil.williams@linaro.org writes:
On 29 May 2018 at 15:12, Robert Marshall robert.marshall@codethink.co.uk wrote:
Neil Williams neil.williams@linaro.org writes:
On 23 May 2018 at 15:36, Robert Marshall robert.marshall@codethink.co.uk wrote:
At some point last week - I think because of network connectivity issues - a job got stuck and I cancelled it; when run again it again appeared to hang. I cancelled it again and am now seeing the health check not start (at least no output appears on the job's web page).
What is the status of the relevant device(s) and any associated test jobs?
The status of the device was Bad - as the problems with the device have now been resolved it may be hard to diagnose much further, but I'm adding below what I can see.
So a health check failed. You will need to resolve the problem and re-run the health check by setting the health to Unknown.
Yes, that's what I did (setting the health to Unknown) - multiple times - and I was previously getting that 'Unable to parse invalid logs: This is maybe a bug in LAVA that should be reported.' message on a re-run. I'm no longer getting it, so in that sense the issue is resolved.
Check the /var/log/lava-server/lava-master.log for the reasons why the device is not being assigned.
I think this was when it was failing rather than when I was cancelling it.
2018-05-23 14:26:18,620 ERROR [32] Error: b'Traceback (most recent call last):
  File "/usr/bin/lava-run", line 246, in <module>
    sys.exit(main())
  File "/usr/bin/lava-run", line 233, in main
    logger.close()  # pylint: disable=no-member
  File "/usr/lib/python3/dist-packages/lava_dispatcher/log.py", line 87, in close
    self.handler.close(linger)
  File "/usr/lib/python3/dist-packages/lava_dispatcher/log.py", line 71, in close
    self.context.destroy(linger=linger)
  File "zmq/backend/cython/context.pyx", line 244, in zmq.backend.cython.context.Context.destroy (zmq/backend/cython/context.c:3067)
  File "zmq/backend/cython/context.pyx", line 136, in zmq.backend.cython.context.Context.term (zmq/backend/cython/context.c:2348)
  File "zmq/backend/cython/checkrc.pxd", line 12, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/context.c:3216)
  File "/usr/bin/lava-run", line 151, in cancelling_handler
    raise JobCanceled("The job was canceled")
lava_dispatcher.action.JobCanceled: The job was canceled
'
Though this is maybe more interesting?:
2018-05-23 14:26:18,655 ERROR [32] Unable to dump 'description.yaml'
2018-05-23 14:26:18,655 ERROR [32] Compressed data ended before the end-of-stream marker was reached
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/lava_server/management/commands/lava-master.py", line 333, in _handle_end
    description = lzma.decompress(compressed_description)
  File "/usr/lib/python3.5/lzma.py", line 340, in decompress
    raise LZMAError("Compressed data ended before the "
_lzma.LZMAError: Compressed data ended before the end-of-stream marker was reached
Likely that the test job specifies the wrong compression or that the file is invalid.
Though the test file was unchanged between the version that works and the one that doesn't. I'm guessing there was a networking glitch which set this all off.
Check the status of all daemons, including lava-logs
WARNING lava-logs is offline: can't schedule jobs
Check the rest of that log file and the systemd status of the lava-logs service, make sure that service can run normally.
sudo service lava-master status
sudo service lava-logs status
sudo service lava-slave status

Looking at the output.yaml (in /var/lib/lava-server/default/media/job-output/2018/05/23/32) I see ... progress output for downloading https://images.validation.linaro.org/kvm/standard/stretch-2.img.gz, and then:
- {"dt": "2018-05-23T07:39:54.728015", "lvl": "debug", "msg": "[common] Preparing overlay tarball in /var/lib/lava/dispatcher/tmp/32/lava-overlay-aye3n2ke"}
- {"dt": "2018-05-23T07:39:54.728root@stretch:/var/lib/lava-server/default/media/job-output/2018/05/23/32
But none of this appears in http://localhost:8080/scheduler/job/32
and at the head of that page I see the message:
Unable to parse invalid logs: This is maybe a bug in LAVA that should be reported.
Which other logs are best for checking whether this is an error that should be fed back?
(LAVA 2018.4)
Robert