New subject: [Lava-devel] Large job definitions in LAVA

9 Jul 2019


      Hi All,
We've previously had an issue on our LAVA instance where it stopped
responding to workers and stopped dispatching jobs when it finished running
large job definition (around 25000 lines in the definition, around 1000
deploy/boot/test actions). I've been looking into reproducing this safely
in a development environment, and I've got a few observations and questions
about how the situation could be improved.
The lava-master process appears be stuck processing the job results, and
takes a painstakingly long time to finish this and send an ACK for END_OK.
During this processing, the master doesn't respond to worker pings, and
doesn't schedule other jobs. Tracking a bit deeper, it seems that the vast
majority of time (I've never seen it finish as I have always restarted the
lava services before it finishes) in the walk_actions and build_action
functions of the lava_results_app/dbutils.py file:
https://git.lavasoftware.org/lava/lava/blob/2019.05.post1/lava_results_app/d...
https://git.lavasoftware.org/lava/lava/blob/2019.05.post1/lava_results_app/d...
What options is there to mitigate this issue? Some ideas below:
 - Could we optimize the build_action function? There are a few Django
model/db queries in build_action, could some results be queried once and
cached? With an obscenely large job, would this even give us enough savings
to make the time invested in safely optimizing this worth it?
 - What are the implications of not having created ActionData objects for a
job? Does this mean that no options will be available in the "Pipeline ↓"
drop-down on the job page for quick navigation? Could we optionally abort
after a certain amount of these (and make it configurable per LAVA
instance)?
 - Should/could the handling of the results be forked off, so lava-master
can continue to schedule more jobs and respond to worker pings, but slowly
the ActionData objects can be populated? I'm unsure if you have to be on a
special thread to write to Django models. Even if this could be done, would
any weird behaviours occur on the slave side as it will still be waiting
for the ACK for END_OK from the master?
Any guidance on how to proceed with this would be appreciated! I'm happy to
place this and some more details in as a LAVA issue on git.lavasoftware.org
if this is easier to track and discuss.
Thanks,
Dean

Large job definitions in LAVA

Cheers,