Hi All,
We've previously had an issue on our LAVA instance where it stopped responding to workers and stopped dispatching jobs when it finished running a large job definition (around 25000 lines in the definition, around 1000 deploy/boot/test actions). I've been looking into reproducing this safely in a development environment, and I've got a few observations and questions about how the situation could be improved.
The lava-master process appears to be stuck processing the job results, and takes a painfully long time to finish this and send an ACK for END_OK. During this processing, the master doesn't respond to worker pings and doesn't schedule other jobs. Digging a bit deeper, it seems that the vast majority of the time is spent (I've never seen it finish, as I have always restarted the lava services before it got there) in the walk_actions and build_action functions of the lava_results_app/dbutils.py file:
https://git.lavasoftware.org/lava/lava/blob/2019.05.post1/lava_results_app/dbutils.py#L401
https://git.lavasoftware.org/lava/lava/blob/2019.05.post1/lava_results_app/dbutils.py#L354
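To confirm where the time is going, counting the queries issued while one job's results are stored gives a quick signal. This is only a rough sketch to run from a lava-server manage shell; "process_job_results" is a placeholder for whatever entry point in dbutils.py ends up calling walk_actions/build_action, not a real function name.

    # Rough sketch: count the DB queries issued while one job's results are
    # mapped into the database. "process_job_results" is a placeholder for
    # the real dbutils.py entry point, which may differ between releases.
    from django.db import connection
    from django.test.utils import CaptureQueriesContext

    with CaptureQueriesContext(connection) as ctx:
        process_job_results(job)  # placeholder call with a TestJob instance
    print(len(ctx.captured_queries), "queries issued")

If the query count grows roughly linearly with the number of actions/levels in the job, that points straight at the per-action lookups.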
What options are there to mitigate this issue? Some ideas below:
- Could we optimize the build_action function? There are a few Django model/db queries in build_action; could some results be queried once and cached? With an obscenely large job, would this even give us enough savings to make the time invested in safely optimizing it worthwhile? (A rough sketch of what I mean is below this list.)
- What are the implications of not having created ActionData objects for a job? Does this mean that no options will be available in the "Pipeline ↓" drop-down on the job page for quick navigation? Could we optionally abort after a certain number of these (and make it configurable per LAVA instance)?
- Should/could the handling of the results be forked off, so lava-master can continue to schedule more jobs and respond to worker pings, while the ActionData objects are slowly populated? I'm unsure whether you have to be on a special thread to write to Django models. Even if this could be done, would any weird behaviours occur on the slave side, as it will still be waiting for the ACK for END_OK from the master?
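To make that first idea concrete, here is a minimal sketch of the "query once, cache, then bulk insert" approach I have in mind. The model and field names below are only illustrative guesses, not copied from dbutils.py, so the real build_action would need the same treatment applied to whatever it actually looks up per action.

    # Sketch only: assumed model/field names, not the real dbutils.py code.
    from lava_results_app.models import ActionData, MetaType

    def flatten(actions):
        # Depth-first walk over the nested pipeline description,
        # mirroring what walk_actions does recursively.
        for action in actions:
            yield action
            yield from flatten(action.get("pipeline", []))

    def store_actions(pipeline, testdata):
        # 1. Resolve shared lookups once, instead of once per action.
        metatype_cache = {}

        def get_metatype(name, kind):
            key = (name, kind)
            if key not in metatype_cache:
                metatype_cache[key], _ = MetaType.objects.get_or_create(
                    name=name, metatype=kind)
            return metatype_cache[key]

        # 2. Build the ActionData rows in memory...
        rows = [
            ActionData(
                testdata=testdata,
                action_name=action["name"],
                action_level=action["level"],
                meta_type=get_metatype(action["name"], action.get("class", "")),
            )
            for action in flatten(pipeline)
        ]

        # 3. ...and write them in a handful of INSERTs rather than thousands.
        ActionData.objects.bulk_create(rows, batch_size=500)

Even with the caching, bulk_create only helps if nothing needs the individual primary keys back while walking the pipeline, so that would be worth checking before touching the real code.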
Any guidance on how to proceed with this would be appreciated! I'm happy to put this and some more details into a LAVA issue on git.lavasoftware.org if that is easier to track and discuss.
Thanks, Dean
On Tue, Jul 09, 2019 at 05:04:00PM +0100, Dean Birch wrote:
Hi All,
We've previously had an issue on our LAVA instance where it stopped responding to workers and stopped dispatching jobs when it finished running a large job definition (around 25000 lines in the definition, around 1000 deploy/boot/test actions). I've been looking into reproducing this safely in a development environment, and I've got a few observations and questions about how the situation could be improved.
Ah, you're seeing this when the large job *finishes*, not when it's starting up? That's useful data - I think we were all looking at startup!
The lava-master process appears to be stuck processing the job results, and takes a painfully long time to finish this and send an ACK for END_OK. During this processing, the master doesn't respond to worker pings and doesn't schedule other jobs. Digging a bit deeper, it seems that the vast majority of the time is spent (I've never seen it finish, as I have always restarted the lava services before it got there) in the walk_actions and build_action functions of the lava_results_app/dbutils.py file:
https://git.lavasoftware.org/lava/lava/blob/2019.05.post1/lava_results_app/dbutils.py#L401
https://git.lavasoftware.org/lava/lava/blob/2019.05.post1/lava_results_app/dbutils.py#L354
What options are there to mitigate this issue? Some ideas below:
- Could we optimize the build_action function? There are a few Django model/db queries in build_action; could some results be queried once and cached? With an obscenely large job, would this even give us enough savings to make the time invested in safely optimizing it worthwhile?
Oh, wow. build_action looks like it could be really slow. Do you know how many testdata items your job is iterating through?
- What are the implications of not having created ActionData objects for a job? Does this mean that no options will be available in the "Pipeline ↓" drop-down on the job page for quick navigation? Could we optionally abort after a certain number of these (and make it configurable per LAVA instance)?
- Should/could the handling of the results be forked off, so lava-master can continue to schedule more jobs and respond to worker pings, while the ActionData objects are slowly populated? I'm unsure whether you have to be on a special thread to write to Django models. Even if this could be done, would any weird behaviours occur on the slave side, as it will still be waiting for the ACK for END_OK from the master?
Any guidance on how to proceed with this would be appreciated! I'm happy to put this and some more details into a LAVA issue on git.lavasoftware.org if that is easier to track and discuss.
I think that would be useful, yes! :-)
Cheers,
On Wed, 10 Jul 2019 at 12:35, Steve McIntyre steve.mcintyre@linaro.org wrote:
On Tue, Jul 09, 2019 at 05:04:00PM +0100, Dean Birch wrote:
Hi All,
We've previously had an issue on our LAVA instance where it stopped responding to workers and stopped dispatching jobs when it finished running a large job definition (around 25000 lines in the definition, around 1000 deploy/boot/test actions). I've been looking into reproducing this safely in a development environment, and I've got a few observations and questions about how the situation could be improved.
Ah, you're seeing this when the large job *finishes*, not when it's starting up? That's useful data - I think we were all looking at startup!
The lava-master process appears to be stuck processing the job results, and takes a painfully long time to finish this and send an ACK for END_OK. During this processing, the master doesn't respond to worker pings and doesn't schedule other jobs. Digging a bit deeper, it seems that the vast majority of the time is spent (I've never seen it finish, as I have always restarted the lava services before it got there) in the walk_actions and build_action functions of the lava_results_app/dbutils.py file:
https://git.lavasoftware.org/lava/lava/blob/2019.05.post1/lava_results_app/dbutils.py#L401
https://git.lavasoftware.org/lava/lava/blob/2019.05.post1/lava_results_app/dbutils.py#L354
What options are there to mitigate this issue? Some ideas below:
- Could we optimize the build_action function? There are a few Django model/db queries in build_action; could some results be queried once and cached? With an obscenely large job, would this even give us enough savings to make the time invested in safely optimizing it worthwhile?
Oh, wow. build_action looks like it could be really slow. Do you know how many testdata items your job is iterating through?
If we're talking about all the different levels mentioned in the job (e.g. "1", "1.1", "1.2", ... "1.8.1" etc.), there appear to be about 8000.
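For reference, a count like that can be reproduced by walking the job's nested pipeline description. This is only a rough sketch: the file name and layout (a YAML document with nested "pipeline" lists) are assumptions on my part rather than anything confirmed here.

    # Count how many pipeline levels ("1", "1.1", "1.8.1", ...) a job contains,
    # assuming a YAML pipeline description with nested "pipeline" lists.
    import yaml

    def count_levels(actions):
        total = 0
        for action in actions:
            total += 1
            total += count_levels(action.get("pipeline", []))
        return total

    with open("description.yaml") as f:  # assumed location of the description
        description = yaml.safe_load(f)

    print(count_levels(description.get("pipeline", [])))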
- What are the implications of not having created ActionData objects for a job? Does this mean that no options will be available in the "Pipeline ↓" drop-down on the job page for quick navigation? Could we optionally abort after a certain number of these (and make it configurable per LAVA instance)?
- Should/could the handling of the results be forked off, so lava-master can continue to schedule more jobs and respond to worker pings, while the ActionData objects are slowly populated? I'm unsure whether you have to be on a special thread to write to Django models. Even if this could be done, would any weird behaviours occur on the slave side, as it will still be waiting for the ACK for END_OK from the master?
Any guidance on how to proceed with this would be appreciated! I'm happy to put this and some more details into a LAVA issue on git.lavasoftware.org if that is easier to track and discuss.
I think that would be useful, yes! :-)
I've now created an issue to continue the discussion: https://git.lavasoftware.org/lava/lava/issues/299
Cheers,
Steve McIntyre                                steve.mcintyre@linaro.org
http://www.linaro.org/ | Open source software for ARM SoCs