Jobs stuck in canceled mode and blocking queue


Robert Marshall

15 Aug 2017

2:04 p.m.

Hi,

I've got 2 jobs stuck in canceled mode which are preventing any other job from running.

I'm running LAVA (2017-7) in a VM and have tried rebooting the VM to clear the issue, but without success (i.e. the jobs still block the queue).

An extract from /var/log/lava-server/django.log is attached.

I get this 500 error when viewing the results for the job.

Is there a manual way of clearing this? The health check has a notification associated with it (set to verbose), and every time I reboot I get an email and an IRC message saying that it's finished!

Robert

Attachments:

lava-error.log (application/octet-stream — 4.4 KB)


Neil Williams

15 Aug

2:16 p.m.

New subject: [Lava-users] Jobs stuck in canceled mode and blocking queue

On 15 August 2017 at 10:04, Robert Marshall <robert.marshall@codethink.co.uk> wrote:

> Hi,
>
> I've got 2 jobs stuck in canceled mode which are preventing any other job from running.
>
> I'm running lava (2017-7) in a VM and have tried rebooting the VM to clear the issue but without success (ie the jobs still block the queue).
>
> an extract from /var/log/lava-server/django.log is attached

That's reporting that there was no start_time or end_time associated with the test job. It sounds like a bug but I'm not clear on how to reproduce it. For now, you can modify the test job to have a start and end time. Try:

    $ sudo lava-server manage shell

    from django.utils import timezone
    from lava_scheduler_app.models import TestJob
    job = TestJob.objects.filter(id=123456)
    job.start_time                    # inspect the current value
    job.start_time = timezone.now()
    job.end_time                      # inspect the current value
    job.end_time = timezone.now()
    job.save()

i.e. if there is no start time, modify both start and end time. If there's no end_time, just add an end_time.
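The rule in that last paragraph can be sketched as a small helper. This is only an illustration: fill_missing_times is a hypothetical name, not part of LAVA, and all it assumes is an object whose start_time and end_time attributes are None when unset. In the LAVA shell one would pass it a single TestJob together with django.utils.timezone.now(), and call job.save() if it returns True.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def fill_missing_times(job, now):
    """Apply the rule above to any object with start_time/end_time attributes.

    Returns True if the job was changed and therefore needs saving.
    """
    if job.start_time is None:
        # No start time: set both start and end time.
        job.start_time = now
        job.end_time = now
        return True
    if job.end_time is None:
        # Started but never finished: only the end time is missing.
        job.end_time = now
        return True
    return False


# Stand-in for a TestJob so the sketch runs outside the LAVA shell.
stuck = SimpleNamespace(start_time=None, end_time=None)
changed = fill_missing_times(stuck, datetime.now(timezone.utc))
print(changed, stuck.start_time == stuck.end_time)
```

Passing the current time in as an argument keeps the helper independent of Django, so it can be tried on plain objects before touching real jobs.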


Lava-users mailing list Lava-users@lists.linaro.org https://lists.linaro.org/mailman/listinfo/lava-users

-- Neil Williams ============= neil.williams@linaro.org http://www.linux.codehelp.co.uk/

Robert Marshall

3:27 p.m.

Neil

Thanks for this response, comments below:

Something appears deeply garbled here in the LAVA structures. I assume the id for the filter is the one in the recent jobs list rather than some internal value.

    >>> job = TestJob.objects.filter(id=24)
    >>> job.start_time
    Traceback (most recent call last):
    AttributeError: 'RestrictedTestJobQuerySet' object has no attribute 'start_time'
    >>> type(job)
    <class 'lava_scheduler_app.managers.RestrictedTestJobQuerySet'>

and there's no save function either.

To attempt to remember the sequence of operations:

- I updated the device dictionary: the health check had been logging that it wasn't using soft reboot, which 2016 was doing, so I made a change (probably wrong) to try to add that.
- That job started but showed no output; I left it for 10 mins or so.
- I cancelled it.
- I then put back the device dictionary as it was and submitted another health check.
- That queued, so I decided to reboot.

Robert


Neil Williams

3:42 p.m.

On 15 August 2017 at 11:27, Robert Marshall <robert.marshall@codethink.co.uk> wrote:

> Thanks for this response, comments below:

Typo in my example.

> from lava_scheduler_app.models import TestJob
> job = TestJob.objects.filter(id=123456)

    job = TestJob.objects.get(id=123456)

or

    job = TestJob.objects.filter(id=123456)[0]

filter() returns a QuerySet (a list-like collection of jobs), not a single TestJob.


-- Neil Williams ============= neil.williams@linaro.org http://www.linux.codehelp.co.uk/

Robert Marshall

16 Aug

8:06 a.m.

Neil Williams <neil.williams@linaro.org> writes:

> i.e. if there is no start time, modify both start and end time. If there's no end_time, just add an end_time.

Thanks. Both jobs had no start time but did have an end time (because I cancelled before the job started?). I set start_time = end_time, which has got rid of the '500' error.
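Copying the existing end time into the missing start time, as described here, might look like the sketch below. backfill_start_time is a hypothetical helper, not a LAVA function, and the stand-in object only mimics the two attributes involved. On a real TestJob the change would still need job.save() afterwards.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def backfill_start_time(job):
    """Copy end_time into start_time when only the start time is missing."""
    if job.start_time is None and job.end_time is not None:
        job.start_time = job.end_time
        return True
    return False


# Stand-in for a job that was cancelled before it ever started.
cancelled_at = datetime(2017, 8, 15, 9, 30, tzinfo=timezone.utc)
job = SimpleNamespace(start_time=None, end_time=cancelled_at)
if backfill_start_time(job):
    # In the LAVA shell this is where job.save() would be called.
    print("start_time set to", job.start_time.isoformat())
```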

But the queue still appears jammed: there's a health check with status 'Submitted' that doesn't get any further. The jobs before it in the list are either Incomplete or Canceled.

There's nothing in django.log.

Robert


Neil Williams

8:13 a.m.

On 16 August 2017 at 09:06, Robert Marshall <robert.marshall@codethink.co.uk> wrote:

> But the queue still appears jammed - there's a health check with status 'Submitted' and it doesn't get any further - jobs before it in the list are either Incomplete or Canceled.
>
> There's nothing in django.log

OK, this is now a standard scheduling issue: check the scheduling logs. It sounds like one or more devices still have a current job set, which can be fixed up in the Django admin interface. Also check the status of each of the daemons. The Python traceback in django.log is now fixed, but that event probably interrupted one of the cleanup actions within the scheduler.

This is in the admin docs:
https://validation.linaro.org/static/docs/v2/simple-admin.html#log-files
and
https://validation.linaro.org/static/docs/v2/pipeline-debug.html
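As a rough illustration of what "a device still has a current job set" means, the sketch below picks out devices whose current job has already ended. Everything in it is an assumption made for illustration: devices_with_stale_job is a hypothetical helper, and the current_job and status attribute names and the status strings should be checked against the Device and TestJob models of the installed LAVA version before trying anything like this in the shell.

```python
from types import SimpleNamespace


def devices_with_stale_job(devices, finished_statuses):
    """Return devices whose current_job points at a job that has already ended."""
    stale = []
    for device in devices:
        job = device.current_job
        if job is not None and job.status in finished_statuses:
            stale.append(device)
    return stale


# Stand-ins for Device and TestJob rows; status names are illustrative only.
cancelled = SimpleNamespace(id=24, status="Canceled")
devices = [
    SimpleNamespace(hostname="bbb01", current_job=cancelled),
    SimpleNamespace(hostname="bbb02", current_job=None),
]
finished = {"Canceled", "Incomplete", "Complete"}
for device in devices_with_stale_job(devices, finished):
    print(device.hostname, "still references job", device.current_job.id)
```

A device reported by such a check is one whose current job would need clearing, for example through the admin interface mentioned above.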


-- Neil Williams ============= neil.williams@linaro.org http://www.linux.codehelp.co.uk/

Robert Marshall

11:47 a.m.

Neil Williams <neil.williams@linaro.org> writes:

> OK, this is now a standard scheduling issue - check the scheduling logs. It sounds like one or more devices still have a current job set. This can be fixed up in the django admin interface.

I suspect this was the initial problem:

    $ sudo lava-server manage check --deploy
    ....
    bbb01: Invalid configuration

I attempted to comment out some device dictionary entries, but it doesn't appear to have liked the syntax. I've now put back something better that doesn't give that error.

I've been through the Django admin and can't see where the current job is set (if indeed it is).

http://vmhost:port/admin/lava_scheduler_app/device/ shows no current job for one of the devices derived from that device type. I've created another device, and that shows as submitted but isn't running.

The logs don't flag up anything and I've rebooted the VM. I can run health checks on another device type I've just created.

Robert
